#general | Arena | Page 32

torn mantle Apr 28, 2025, 4:25 PM

#

Yea

balmy mist Apr 28, 2025, 4:47 PM

#

i just got back, did they say what time for qwen?

full kite Apr 28, 2025, 4:48 PM

#

Qwen is as dosht as deepseek

#

Slow and unuseable

#

Unusable

#

Omfg

raven void Apr 28, 2025, 4:56 PM

#

Sota open source

tawdry meteor Apr 28, 2025, 5:10 PM

#

raven void Sota open source

SOTA premium vs SOTA open source is like adobe products to me. Yeah photoshop is always gonna be the absolute best and have newest most incredible features and worth it if you're a pro photog but I just tell my friends to use gimp or other freeware

#

like unless using the AI is contributing to your commercial productivity /earnings then why not use the free stuff. but also it's going to contribute to almost everyone's productivity so 🤷 I hope open source becomes fantastic tho

#

agreed that's basically exactly what I meant lol

#

like if you're just editing family photos, use freeware or whatever. but if you're using it for true productivity you need premium tools

keen beacon Apr 28, 2025, 5:14 PM

#

qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot before the tweet got deleted

tawdry meteor Apr 28, 2025, 5:14 PM

#

and same then for coding with SOTA vs open source

keen beacon Apr 28, 2025, 5:15 PM

#

apparently its gone because they messed up some numbers

#

so idk which are right

#

humaneval looks low asf compared to the rest so it might be that

#

o3 got 83.3

sonic tendon Apr 28, 2025, 5:16 PM

#

oh wow

tawdry meteor Apr 28, 2025, 5:16 PM

#

yo wtf?

#

g2.5pro is 84 on GPQA

sonic tendon Apr 28, 2025, 5:17 PM

#

keen beacon apparently its gone because they messed up some numbers

here's hoping they messed them up in the wrong direction

tawdry meteor Apr 28, 2025, 5:17 PM

#

I wonder which numbers were wrong

#

for reference

keen beacon Apr 28, 2025, 5:17 PM

#

i presume they just put the wrong number, not they ran the benchmark wrong

#

who knows, they probably reused an old table and didn't replace it properly

tawdry meteor Apr 28, 2025, 5:18 PM

#

this dude has never made a typo

balmy mist Apr 28, 2025, 5:19 PM

#

Wait, where is o3 pro coming out, bro?

tawdry meteor Apr 28, 2025, 5:19 PM

#

there's a permanent commemorative plaque made out of marble and with a steel engraving on it in front of a historic building in my city, and there's a misspelled word on it and punctuation missing in another spot. The ability for people to make clerical errors and no one else catching it baffles the mind

keen beacon Apr 28, 2025, 5:20 PM

#

keen beacon qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot ...

Wow

#

that is absurd

balmy mist Apr 28, 2025, 5:20 PM

#

As in that’s bad?

tawdry meteor Apr 28, 2025, 5:20 PM

#

keen beacon qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot ...

the aider polyglot score is crazy too

keen beacon Apr 28, 2025, 5:20 PM

#

balmy mist As in that’s bad?

no its crazy good

#

like insane

balmy mist Apr 28, 2025, 5:20 PM

#

Lmaoo

#

Ayee

#

So gg closed source?

#

Wait it the model actually out?

keen beacon Apr 28, 2025, 5:21 PM

#

not yet i think

balmy mist Apr 28, 2025, 5:21 PM

#

Bruhh

#

They blue balling us again

tawdry meteor Apr 28, 2025, 5:21 PM

#

it hasn't been anonymous on arena right?

keen beacon Apr 28, 2025, 5:21 PM

#

tawdry meteor it hasn't been anonymous on arena right?

qwen and deepseek dont do that

tawdry meteor Apr 28, 2025, 5:22 PM

#

keen beacon qwen and deepseek dont do that

thx, hopefully we get it soon

sonic tendon Apr 28, 2025, 5:22 PM

#

i would guess that the biggest model is going to be closed-weight for a while

balmy mist Apr 28, 2025, 5:22 PM

#

O3 pro has to come out this week right ?

#

This the 3rd week

sonic tendon Apr 28, 2025, 5:22 PM

#

like they did w/ 2.5 max

keen beacon Apr 28, 2025, 5:22 PM

#

sonic tendon i would guess that the biggest model is going to be closed-weight for a while

they would probably name it max if so

sonic tendon Apr 28, 2025, 5:22 PM

#

keen beacon they would probably name it max if so

ah

balmy mist Apr 28, 2025, 5:22 PM

#

Cause next week would be a month vs a couple of weeks

#

Craig run our queries when u get o3 pro please

keen beacon Apr 28, 2025, 5:24 PM

#

keen beacon qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot ...

maverick gets f1ucking destroyed lol

balmy mist Apr 28, 2025, 5:24 PM

#

Really? I thought you were OpenAI ride or die?

keen beacon Apr 28, 2025, 5:26 PM

#

i need qwen 235b asap 🤣

balmy mist Apr 28, 2025, 5:26 PM

#

What u gonna do with it?

keen beacon Apr 28, 2025, 5:26 PM

#

generate data

balmy mist Apr 28, 2025, 5:26 PM

#

lol yupp

#

Gotta justify that cost

keen beacon Apr 28, 2025, 5:27 PM

#

you could probably run qwen 235b pretty well on macs i guess becuz of moe

balmy mist Apr 28, 2025, 5:27 PM

#

Hmmm really what kind of specs u need?

keen beacon Apr 28, 2025, 5:27 PM

#

u need enough vram to load though

balmy mist Apr 28, 2025, 5:27 PM

#

I got 18 gb

#

But it’s m3

keen beacon Apr 28, 2025, 5:28 PM

#

if u can fit its a sota local model fr

keen beacon Apr 28, 2025, 5:28 PM

#

keen beacon qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot ...

i regret to inform you guys that i was messing around and this is just me doing a little trolling LOL
i hope it can get close to these numbers though

balmy mist Apr 28, 2025, 5:28 PM

#

Bruhh

#

I’m logging off until r2

tawdry meteor Apr 28, 2025, 5:28 PM

#

BRUH

keen beacon Apr 28, 2025, 5:29 PM

#

dw itll be good still lol

#

maybe not that crazy

balmy mist Apr 28, 2025, 5:29 PM

#

And Claude lol

keen beacon Apr 28, 2025, 5:29 PM

#

yeah i think it'll get close to o3 but i do doubt it'll beat it, at least not broadly

#

it'll beat o1 for sure

keen beacon Apr 28, 2025, 5:29 PM

#

keen beacon yeah i think it'll get close to o3 but i do doubt it'll beat it, at least not br...

the max version might beat o3 tho

#

but i think thats a while away

balmy mist Apr 28, 2025, 5:29 PM

#

Which o3?

keen beacon Apr 28, 2025, 5:29 PM

#

regular

#

do you mean low/med/high

balmy mist Apr 28, 2025, 5:30 PM

#

Wait so qwen not even SOTA?

balmy mist Apr 28, 2025, 5:30 PM

#

keen beacon do you mean low/med/high

Yeah

keen beacon Apr 28, 2025, 5:30 PM

#

we dont know

#

i was trolling, it may or may not be

balmy mist Apr 28, 2025, 5:30 PM

#

Just cheap?

keen beacon Apr 28, 2025, 5:30 PM

#

balmy mist Yeah

it'll get close to o3 med and probably beat it in a few things

balmy mist Apr 28, 2025, 5:30 PM

#

keen beacon i was trolling, it may or may not be

How u made it? With Gemini?

keen beacon Apr 28, 2025, 5:30 PM

#

that is my bet

keen beacon Apr 28, 2025, 5:30 PM

#

balmy mist How u made it? With Gemini?

my good ol' fingers

sonic tendon Apr 28, 2025, 5:30 PM

#

i think it'll beat o3 on the (non-stylectrl) leaderboard

keen beacon Apr 28, 2025, 5:30 PM

#

and my imagination

#

😇

balmy mist Apr 28, 2025, 5:30 PM

#

keen beacon my good ol' fingers

Wow I haven’t used that in a while

#

Gotta lock into that

sonic tendon Apr 28, 2025, 5:30 PM

#

keen beacon 😇

this feels vaguely gay, but i'm not sure why

keen beacon Apr 28, 2025, 5:31 PM

#

😇

tawdry meteor Apr 28, 2025, 5:31 PM

#

I was halfway through making a table with those numbers comparing to other benchmarks lmao

keen beacon Apr 28, 2025, 5:31 PM

#

LOL

tawdry meteor Apr 28, 2025, 5:31 PM

#

you got us good

keen beacon Apr 28, 2025, 5:31 PM

#

i would be annoyed if i were you dw 😔

balmy mist Apr 28, 2025, 5:31 PM

#

keen beacon LOL

U blue balled me like these ai companies

keen beacon Apr 28, 2025, 5:31 PM

#

anyway im gonna go for a bit someone spam me if they release qwen3 (for real this time)

#

byebye

sonic tendon Apr 28, 2025, 5:31 PM

#

tawdry meteor I was halfway through making a table with those numbers comparing to other bench...

wait what

#

cya :3

tawdry meteor Apr 28, 2025, 5:32 PM

#

sonic tendon wait what

I was making a table to compare ~~Li's~~ Leo's fake benchmarks to real benchmarks and when I found out it was fake I deleted my google account

keen beacon Apr 28, 2025, 5:33 PM

#

ya sm1 posted the tweet before it got deleted

#

here

sonic tendon Apr 28, 2025, 5:33 PM

#

tawdry meteor I was making a table to compare ~~Li's~~ Leo's fake benchmarks to real benchmark...

ohhhhh

#

oh

#

@keen beacon LEO

keen beacon Apr 28, 2025, 5:33 PM

#

clearly they still value it since they give quota out to lmsys lol

tall summit Apr 28, 2025, 5:33 PM

#

keen beacon ya sm1 posted the tweet before it got deleted

didnt see

oblique flint Apr 28, 2025, 5:34 PM

#

I get it, but human preference tinetuning does not have to come at the cost of actual performance, just look at 2.5 pro and o3

keen beacon Apr 28, 2025, 5:34 PM

#

idk google has been the most prolific in terms of using the arena tbh

tall summit Apr 28, 2025, 5:34 PM

#

keen beacon clearly they still value it since they give quota out to lmsys lol

the users don't

sonic tendon Apr 28, 2025, 5:34 PM

#

yeah, i think it's still a valuable benchmark

thorny drum Apr 28, 2025, 5:34 PM

#

i think lmarena is still the most public benchmark as silly as it is

sonic tendon Apr 28, 2025, 5:34 PM

#

thorny drum i think lmarena is still the most public benchmark as silly as it is

yeah

keen beacon Apr 28, 2025, 5:35 PM

#

the worst is meta

#

they sh1t their models into the arena

full kite Apr 28, 2025, 5:36 PM

#

Worst is grok because it's racist

sonic tendon Apr 28, 2025, 5:36 PM

#

keen beacon they sh1t their models into the arena

yeah i have no idea why they thought that that wouldn't nuke their reputation

#

imo grok is worse because elon fanboys back it

#

i mean, llama was (iirc) the first real well-funded open source model release

#

they were really popular at the beginning, but then everyone else ate their lunch

#

qwen and deepseek, namely

keen beacon Apr 28, 2025, 5:37 PM

#

grok is a weird ass model

sonic tendon Apr 28, 2025, 5:37 PM

#

uh oh

keen beacon Apr 28, 2025, 5:37 PM

#

it's not incredible at anything but it doesn't suck at anything either

#

had its 10 seconds of fame and it's just meh now

sonic tendon Apr 28, 2025, 5:38 PM

#

keen beacon had its 10 seconds of fame and it's just meh now

good way to summarize it

keen beacon Apr 28, 2025, 5:38 PM

#

and their reasoning model seriously underperformed relative to how strong the base (theoretically) is

sonic tendon Apr 28, 2025, 5:38 PM

#

ah, any reason in particular?

sonic tendon Apr 28, 2025, 5:38 PM

#

keen beacon and their reasoning model seriously underperformed relative to how strong the ba...

i've been wondering: is their reasoning model even on the arena?

#

that's valid

#

i try to set limits for myself and stick to them

keen beacon Apr 28, 2025, 5:39 PM

#

sonic tendon i've been wondering: is their reasoning model even on the arena?

nope they only have grok 3 mini reasoning available as an api and i cant remember if its on the arena

#

its in the beta iirc

#

anyway im gonna actually go

#

goodbye

sonic tendon Apr 28, 2025, 5:39 PM

#

bye!

#

oh

#

damn >:(

#

not sure

keen beacon Apr 28, 2025, 5:45 PM

#

lol this guy who leaked qwen 3 ggufs got kicked out of the qwen org https://huggingface.co/apepkuss79

apepkuss79 (Sam Liu)

#

https://tenor.com/view/youre-fired-donald-trump-the-apprentice-point-gif-8557097

Tenor

ocean vortex Apr 28, 2025, 6:03 PM

#

keen beacon lol this guy who leaked qwen 3 ggufs got kicked out of the qwen org https://hugg...

lmao. Probably was invited to it for questionable reasons to begin with. I doubt he worked for them

keen beacon Apr 28, 2025, 6:06 PM

#

ocean vortex lmao. Probably was invited to it for questionable reasons to begin with. I doubt...

ya there are randoms in the qwen hf org lol

#

prior to them intentionally giving early access out i think

#

ETA for Qwen 3 basically any minute now

it's 2am over there right now and they're working their asses off to upload all of the models and quants

elder burrow Apr 28, 2025, 6:19 PM

#

keen beacon ETA for Qwen 3 basically any minute now it's 2am over there right now and they...

epic

#

https://cdn.discordapp.com/attachments/1071261500511096845/1365886244550217778/jpeg.gif

keen beacon Apr 28, 2025, 6:20 PM

#

what is bro doing

elder burrow Apr 28, 2025, 6:20 PM

#

keen beacon what is bro doing

rel

keen beacon Apr 28, 2025, 6:20 PM

#

elder burrow Apr 28, 2025, 6:20 PM

#

guess which is qwen

keen beacon Apr 28, 2025, 6:21 PM

#

elder burrow guess which is qwen

2?

elder burrow Apr 28, 2025, 6:21 PM

#

keen beacon 2?

its the um

keen beacon Apr 28, 2025, 6:21 PM

#

the um

elder burrow Apr 28, 2025, 6:21 PM

#

the one which looks weird

#

3

keen beacon Apr 28, 2025, 6:21 PM

#

shucks

elder burrow Apr 28, 2025, 6:22 PM

#

1 and 2 are both veo 2

keen beacon Apr 28, 2025, 6:22 PM

#

getting there..

#

ya 66 items+ 🤣

elder burrow Apr 28, 2025, 6:22 PM

#

keen beacon getting there..

👀

#

i just joined the server

#

thx for letting me know

keen beacon Apr 28, 2025, 6:23 PM

#

keen beacon ya 66 items+ 🤣

yeah there's an insane amount of quants by the looks of it

keen beacon Apr 28, 2025, 6:23 PM

#

keen beacon getting there..

per each model, i assume they have a repo for ggufs, bnb 4 bit, bf16, and some of them unsloth bnb 4 bit

#

sounds about right

elder burrow Apr 28, 2025, 6:24 PM

#

when do you think is qwen 3 coming to their site?

keen beacon Apr 28, 2025, 6:24 PM

#

soon theyre planning to launch a sub soon

keen beacon Apr 28, 2025, 6:25 PM

#

elder burrow guess which is qwen

well, iirc theyre using wanx2.1 or something its not a qwen video gen model although it comes from alibaba. i think its 14b. veo must be much bigger

#

on the qwen site it uses wanx2.1 for the time being

balmy mist Apr 28, 2025, 6:41 PM

#

Maybe qwen 3 at 3?

#

So 18 mins?

rugged brook Apr 28, 2025, 6:45 PM

#

Wym

torn mantle Apr 28, 2025, 6:48 PM

#

https://x.com/teortaxesTex/status/1916913384269787417

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxe...

oh no no no no no this is bad
I trusted Neo-China and Neo-China just killed Nick
@xenocosmography

#

This guy already has access

keen fulcrum Apr 28, 2025, 6:49 PM

#

torn mantle https://x.com/teortaxesTex/status/1916913384269787417

It isn't released yet
isn't that photoshop

torn mantle Apr 28, 2025, 6:49 PM

#

keen fulcrum It isn't released yet isn't that photoshop

No

#

Its real

keen fulcrum Apr 28, 2025, 6:49 PM

#

How would you know

torn mantle Apr 28, 2025, 6:49 PM

#

He had a private invite

keen fulcrum Apr 28, 2025, 6:50 PM

#

Well its releasing within 24h

keen beacon Apr 28, 2025, 6:50 PM

#

i wonder if theyre gonna go to bed first

torn mantle Apr 28, 2025, 6:50 PM

#

Because he knows some qwen devs and they follow him too

#

Same with deepseek devs

#

rugged brook Apr 28, 2025, 6:52 PM

#

keen fulcrum Well its releasing within 24h

No

torn mantle Apr 28, 2025, 6:52 PM

#

It hallucinated the google graph part and also the death of Nick Land

rugged brook Apr 28, 2025, 6:52 PM

#

Its releasing noa

keen fulcrum Apr 28, 2025, 6:54 PM

#

Well 235B isn't remarkable
Claude 3 Opus has 200B

#

GPT 5 expected to have 1-10T

blazing rune Apr 28, 2025, 6:56 PM

#

keen fulcrum Well 235B isn't remarkable Claude 3 Opus has 200B

We don't know the size of Claude 3 Opus

keen beacon Apr 28, 2025, 6:57 PM

#

opus is definitely not 200b lol

keen fulcrum Apr 28, 2025, 6:57 PM

#

What then?

keen beacon Apr 28, 2025, 6:57 PM

#

i doubt sonnet 3.5/3.7 is 200b either. i think its 400b

blazing rune Apr 28, 2025, 6:57 PM

#

keen fulcrum GPT 5 expected to have 1-10T

also GPT-5 isn't 1 model, it's a router that delegates the prompt to different models

#

Sam Altman announced this on X

keen beacon Apr 28, 2025, 6:58 PM

#

keen fulcrum What then?

closer to 1t ig

keen fulcrum Apr 28, 2025, 6:59 PM

#

I am interested about the token limit

keen beacon Apr 28, 2025, 7:10 PM

#

seems qwen 30b moe is gonna be a homerun

#

ppl testing lol

#

and wishful thinking 🤣

#

https://windsurf.com/blog/update-to-free-plan

An Update to Our Free Plan

The Free Tier is getting an upgrade!

#

woah

#

cool asf

#

theyre probably bankrolling all this lmao

small haven Apr 28, 2025, 8:03 PM

#

where tf is o3 pro

keen fulcrum Apr 28, 2025, 8:04 PM

#

golden ocean Apr 28, 2025, 8:05 PM

#

https://tenor.com/view/terminator-terminator-robot-looking-flex-cool-robot-gif-16625083

Tenor

keen beacon Apr 28, 2025, 8:11 PM

#

https://x.com/ChujieZheng/status/1916944743952748908

Chujie Zheng ✈️ ICLR (@ChujieZheng) on X

Qwen3 has some really intriguing features that are not written in model cards. I believe it will open new room for both research and product.

torn mantle Apr 28, 2025, 8:13 PM

#

https://x.com/OpenAI/status/1916947241086095434

OpenAI (@OpenAI) on X

We're excited to announce we’ve launched several improvements to ChatGPT search, and today we’re starting to roll out a better shopping experience.

Search has become one of our most popular & fastest growing features, with over 1 billion web searches just in the past week 🧵

#

https://x.com/zhouwenmeng/status/1916937634108346831

Wenmeng Zhou (@zhouwenmeng) on X

Still working...

#

its 4am in china

ember rapids Apr 28, 2025, 8:15 PM

#

I know several ppl were asking for the link to this server. It's public now.

#

https://discord.gg/xnYzySGP

keen beacon Apr 28, 2025, 8:15 PM

#

torn mantle https://x.com/zhouwenmeng/status/1916937634108346831

they've been up for like over 24hrs atp probably

keen beacon Apr 28, 2025, 8:15 PM

#

ember rapids I know several ppl were asking for the link to this server. It's public now.

yup

keen fulcrum Apr 28, 2025, 8:18 PM

#

So openbrain is the best model?
Its interesting to see public models are better than ERNIE

torn mantle Apr 28, 2025, 8:21 PM

#

llama con tomorrow

#

https://x.com/ChujieZheng/status/1916944743952748908

Chujie Zheng ✈️ ICLR (@ChujieZheng) on X

Qwen3 has some really intriguing features that are not written in model cards. I believe it will open new room for both research and product.

keen beacon Apr 28, 2025, 8:22 PM

#

posted that here already

torn mantle Apr 28, 2025, 8:37 PM

#

https://www.reddit.com/r/LocalLLaMA/comments/1ka5t8z/qwen3_github_repo_is_up/

From the LocalLLaMA community on Reddit: Qwen3 Github Repo is up

Explore this post and more from the LocalLLaMA community

keen beacon Apr 28, 2025, 8:43 PM

#

Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc.

The latest version, Qwen3, has the following features:

Dense and Mixture-of-Experts (MoE) models, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.

Seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose chat) within a single model, ensuring optimal performance across various scenarios.

Significantly enhancement in reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.

Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.

Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks.

Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

For more information, please visit our:

Blog

GitHub

Hugging Face

ModelScope

Qwen3 Collection

Join our community by joining our Discord and WeChat group. We are looking forward to seeing you there!

#

https://qwen.readthedocs.io/en/latest/

#

https://qwenlm.github.io/blog/qwen3/

Qwen

Qwen3: Think Deeper, Act Faster

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

#

IT'S OUT

#

#

https://huggingface.co/spaces/Qwen/Qwen3-Demo

Qwen3 Demo - a Hugging Face Space by Qwen

zinc ore Apr 28, 2025, 8:51 PM

#

https://qwenlm.github.io/blog/qwen3/

Qwen

Qwen3: Think Deeper, Act Faster

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

keen beacon Apr 28, 2025, 8:51 PM

#

it got it right but wow that's a lot of reasoning tokens used up

stuck orchid Apr 28, 2025, 8:52 PM

#

🎉

torn mantle Apr 28, 2025, 8:52 PM

#

I can't try it rn, someone tell us the vibes

small haven Apr 28, 2025, 8:57 PM

#

ok cnn

keen beacon Apr 28, 2025, 8:59 PM

#

i did not expect it to get this close to g2.5 pro

blazing rune Apr 28, 2025, 9:00 PM

#

their HF page only has the 0.6b model

#

this is annoying

torn mantle Apr 28, 2025, 9:02 PM

#

keen beacon i did not expect it to get this close to g2.5 pro

Can you try it at coding pls?

cedar tide Apr 28, 2025, 9:07 PM

#

https://x.com/Alibaba_Qwen/status/1916962087676612998?t=Bvowu6ldsuLzgu4BgWQ55A&s=19

Qwen (@Alibaba_Qwen) on X

Introducing Qwen3!

We release and open-weight Qwen3, our latest large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general

#

Available now on
https://chat.qwen.ai/

Qwen Chat

Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.

keen beacon Apr 28, 2025, 9:08 PM

#

the base kinda cooks v3 too

#

deepseek respond i dare you

small haven Apr 28, 2025, 9:11 PM

#

can i say something without getting people riled up

cedar tide Apr 28, 2025, 9:14 PM

#

keen beacon the base kinda cooks v3 too

is it 3.1 or not? and we should also compare the instruct models

blazing rune Apr 28, 2025, 9:15 PM

#

small haven can i say something without getting people riled up

sure, but it probably will upset some people, just ignore them. I hope it's about the fact that (as of checking a few mins ago) only 1 of the Qwen3 sizes is released

torn mantle Apr 28, 2025, 9:16 PM

#

small haven can i say something without getting people riled up

Yea

blazing rune Apr 28, 2025, 9:17 PM

#

nvm

blazing rune Apr 28, 2025, 9:17 PM

#

blazing rune sure, but it probably will upset some people, just ignore them. I hope it's abou...

they are all out now

small haven Apr 28, 2025, 9:17 PM

#

qwen 3 is 💩 , where is o3 pro

keen beacon Apr 28, 2025, 9:17 PM

#

blazing rune sure, but it probably will upset some people, just ignore them. I hope it's abou...

https://modelscope.cn/organization/qwen?tab=model

ModelScope 魔搭社区

ModelScope——汇聚各领域先进的机器学习模型，提供模型探索体验、推理、训练、部署和应用的一站式服务。在这里，共建模型开源社区，发现、学习、定制和分享心仪的模型。

#

yeah it was just hf

torn mantle Apr 28, 2025, 9:18 PM

#

small haven qwen 3 is 💩 , where is o3 pro

You tried it?

keen beacon Apr 28, 2025, 9:18 PM

#

deepseek have been beaten on both r1 and v3

#

i kinda expect a response

torn mantle Apr 28, 2025, 9:18 PM

#

Qwen 3 seems like qwq max to me

#

Havent noticed much improvements tbh

small haven Apr 28, 2025, 9:19 PM

#

torn mantle You tried it?

its def not better than o3 .. so whats the point

hardy pecan Apr 28, 2025, 9:19 PM

#

it seems ok, will need to do more testing

torn mantle Apr 28, 2025, 9:19 PM

#

Nah o3 is a league on its own tbh

keen beacon Apr 28, 2025, 9:19 PM

#

torn mantle Qwen 3 seems like qwq max to me

it's not lol

torn mantle Apr 28, 2025, 9:19 PM

#

keen beacon it's not lol

It is

#

Lol

keen beacon Apr 28, 2025, 9:19 PM

#

it isn't

#

barren prairie Apr 28, 2025, 9:21 PM

#

small haven its def not better than o3 .. so whats the point

What s the point to compare it with O3 ..free to paid ..not logic

small haven Apr 28, 2025, 9:21 PM

#

o3 marginal cost is zero

pliant cypress Apr 28, 2025, 9:22 PM

#

no

golden ocean Apr 28, 2025, 9:22 PM

#

no

misty vault Apr 28, 2025, 9:22 PM

#

no

barren prairie Apr 28, 2025, 9:23 PM

#

But why they are putting 1526263 models ... They must regulate this thing

keen beacon Apr 28, 2025, 9:23 PM

#

lmao it's excrutiatingly slow rn because it's being hammered on qwen chat

keen beacon Apr 28, 2025, 9:23 PM

#

pliant cypress no

what

#

^

barren prairie Apr 28, 2025, 9:23 PM

#

I want a single good model like gemini and deepSeek not 1627373 models

keen beacon Apr 28, 2025, 9:23 PM

#

it is not o3 and is still slightly behind gemini 2.5 pro. but it is still highly impressive

keen beacon Apr 28, 2025, 9:23 PM

#

barren prairie I want a single good model like gemini and deepSeek not 1627373 models

what

hardy pecan Apr 28, 2025, 9:23 PM

#

people have "tested" it for 5 mins and already coming to conclusions lmao, maybe do some more testing

keen beacon Apr 28, 2025, 9:23 PM

#

i don't think you get how this works

keen fulcrum Apr 28, 2025, 9:24 PM

#

https://modelscope.cn/collections/Qwen3-9743180bdc6b48

Qwen3

通义千问3系列

small haven Apr 28, 2025, 9:24 PM

#

even when qwen 4 is released, o3 >> qwen 4, thats how bad qwen 3 is

ocean vortex Apr 28, 2025, 9:24 PM

#

cedar tide https://x.com/Alibaba_Qwen/status/1916962087676612998?t=Bvowu6ldsuLzgu4BgWQ55A&s...

how are those math scores possible for 30b MoE wtf lol

keen fulcrum Apr 28, 2025, 9:24 PM

#

Just look here

ocean vortex Apr 28, 2025, 9:24 PM

#

it destroys gpt4.1

barren prairie Apr 28, 2025, 9:24 PM

#

keen beacon what

There is 172738 model on the app and now you don t know what to use

keen beacon Apr 28, 2025, 9:24 PM

#

ocean vortex Apr 28, 2025, 9:24 PM

#

not even the biggest model

keen beacon Apr 28, 2025, 9:24 PM

#

barren prairie There is 172738 model on the app and now you don t know what to use

use qwen chat... and just select the first model in the dropdown

small haven Apr 28, 2025, 9:25 PM

#

bro is writing essays in its traces for a simple logic problem bruh

keen beacon Apr 28, 2025, 9:25 PM

#

yes it is a lot less efficient with reasoning than o3 or gemini

#

if there is one thing i've noticed it's that

small haven Apr 28, 2025, 9:25 PM

#

ok at least it got the answer right haha

torn mantle Apr 28, 2025, 9:25 PM

#

Good so far

#

Seems better than maverick at least

keen beacon Apr 28, 2025, 9:26 PM

#

low ass bar

ocean vortex Apr 28, 2025, 9:26 PM

#

this what I meant :

cedar tide Apr 28, 2025, 9:26 PM

#

keen beacon

Who wants to dedicate themselves to adding o4 mini, grok 3 mini think and gemini 2.5 flash to the benchmark?

keen beacon Apr 28, 2025, 9:26 PM

#

ocean vortex this what I meant :

lmao it murders them

ocean vortex Apr 28, 2025, 9:26 PM

#

if they added 4.1 in there it wouldn't look much better for OpenAI

keen fulcrum Apr 28, 2025, 9:27 PM

#

Are you friends of Qwen?

torn mantle Apr 28, 2025, 9:27 PM

#

keen beacon low ass bar

Lol

hardy pecan Apr 28, 2025, 9:28 PM

#

creates a 99% copy of the discord front end, in a single html file, (without the backend)

small haven Apr 28, 2025, 9:28 PM

#

ok im a bit more excited about r2

ocean vortex Apr 28, 2025, 9:28 PM

#

oh but they tested with thinking enabled have they not

keen fulcrum Apr 28, 2025, 9:28 PM

#

Where is the pricing for qwen3?

ocean vortex Apr 28, 2025, 9:28 PM

#

this somewhat explains it then

#

it being this good on math relative to others

#

this is what meta should have done. Instead they went behemoth mode lmao

small haven Apr 28, 2025, 9:31 PM

#

on that note... when is o3 pro

#

ur a comedian

torn mantle Apr 28, 2025, 9:35 PM

#

torn mantle Apr 28, 2025, 9:35 PM

#

small haven ur a comedian

what about me

small haven Apr 28, 2025, 9:35 PM

#

lol i just visit qwen to test it on its release day and pack it up, will be back next release !

ocean vortex Apr 28, 2025, 9:36 PM

#

it is obviously not. But at the same time there's gonna be no one to test if they updated the model in any way... Could just donate money to OpenAI instead

#

10 prompts would be like what... $150?

keen beacon Apr 28, 2025, 9:38 PM

#

holy moly this is the slowest streaming i've ever seen

#

it's been thinking for like 10 minutes and it's not even streamed many reasoning tokens

cedar tide Apr 28, 2025, 9:41 PM

#

ocean vortex how are those math scores possible for 30b MoE wtf lol

all reasoning models have very good scores in math, R1 distill 14b has 70 on aime 2024 too

And grok 3 mini makes has 90, costing only 0.3 input and 0.5 output

small haven Apr 28, 2025, 9:41 PM

#

why is my coffee already cold tf

#

that is true, but side effects ..

#

i can tell

keen fulcrum Apr 28, 2025, 9:49 PM

#

Any unofficial benchmarks up yet?

ocean vortex Apr 28, 2025, 9:51 PM

#

cedar tide all reasoning models have very good scores in math, R1 distill 14b has 70 on aim...

yeah they are a bit sneaky comparing it directly against non-reasoning models... But then again they don't have much to compare against otherwise

cedar tide Apr 28, 2025, 9:51 PM

#

ocean vortex yeah they are a bit sneaky comparing it directly against non-reasoning models......

R1 distill
Reka flash 3
Grok 3 mini
o4 mini
Gemini 2.5 flash

rugged brook Apr 28, 2025, 9:52 PM

#

Is the qwen

#

Good

ocean vortex Apr 28, 2025, 9:53 PM

#

cedar tide R1 distill Reka flash 3 Grok 3 mini o4 mini Gemini 2.5 flash

distills is not their market for this, grok3-mini and o4-mini are not open-source, but yes... those would be more suitable than old version of gpt4o for sure

torn mantle Apr 28, 2025, 9:53 PM

#

Oh boi

#

You are fked

#

Oh no

torn mantle Apr 28, 2025, 9:54 PM

#

keen fulcrum Any unofficial benchmarks up yet?

Vibes tesf

#

Its good

#

Nothing crazy

#

Not that great at coding

small haven Apr 28, 2025, 9:54 PM

#

coffee + brazilian fonk >> adderral

torn mantle Apr 28, 2025, 9:54 PM

#

Multilingual claims = still far behind competitors

ocean vortex Apr 28, 2025, 9:55 PM

#

torn mantle Nothing crazy

yeah I'm not getting wow'ed by their biggest model. But it needs to be viewed in the proper context. It's competing with R1 not O3 or 2.5 pro

torn mantle Apr 28, 2025, 9:55 PM

#

I would place it beneath grok3 base model

torn mantle Apr 28, 2025, 9:57 PM

#

ocean vortex yeah I'm not getting wow'ed by their biggest model. But it needs to be viewed in...

Yea

#

https://x.com/polynoamial/status/1916974477042454696

Noam Brown (@polynoamial) on X

@Alibaba_Qwen Looks like a good model, but I’m disappointed to see a comparison to o1 rather than o3 in that table.

#

lol

small haven Apr 28, 2025, 9:59 PM

#

facts...

rugged brook Apr 28, 2025, 9:59 PM

#

U havent tried

#

Meth

small haven Apr 28, 2025, 10:00 PM

#

meth makes u dumb

cedar tide Apr 28, 2025, 10:02 PM

#

Qwen 3 253b non thinking vs deepseek v3.1

AIME 2024 : 40 / 52
GPQA : 63 / 66

rugged brook Apr 28, 2025, 10:04 PM

#

Is it better then

#

R1

#

Thinking mode

small haven Apr 28, 2025, 10:05 PM

#

just imagine r2 ...

rugged brook Apr 28, 2025, 10:05 PM

#

When is it releasing

small haven Apr 28, 2025, 10:05 PM

#

this year idk

hollow ocean Apr 28, 2025, 10:06 PM

#

R2 July

rugged brook Apr 28, 2025, 10:06 PM

#

hollow ocean R2 July

fr

#

?

cedar tide Apr 28, 2025, 10:07 PM

#

cedar tide Qwen 3 253b non thinking vs deepseek v3.1 AIME 2024 : 40 / 52 GPQA : 63 / 66

impossible to compete with deepseek

small haven Apr 28, 2025, 10:07 PM

#

based off polymarket lol

hollow ocean Apr 28, 2025, 10:07 PM

#

https://tenor.com/view/elmo-shrug-i-dont-know-gif-14777043

Tenor

small haven Apr 28, 2025, 10:08 PM

#

i still remember gpt 5 was predicted to be released october 2024.

hollow ocean Apr 28, 2025, 10:08 PM

#

ocean vortex Apr 28, 2025, 10:08 PM

#

cedar tide impossible to compete with deepseek

deepseek thinking I'm a bigger fan of too. Qwen seems to hang up on almost irrelevant details and flooding the context with it lol

hollow ocean Apr 28, 2025, 10:08 PM

#

I have a ton of money on yes

torn mantle Apr 28, 2025, 10:09 PM

#

rugged brook Is it better then

I would say better tbh

#

Dont forget that its smaller than r1

raven void Apr 28, 2025, 10:09 PM

#

Polymarket should have separate market for open source models

#

Really

hardy pecan Apr 28, 2025, 10:09 PM

#

It got 3/10 of the simplebench questions ive checked so far, using 235B with thinking

small haven Apr 28, 2025, 10:09 PM

#

hollow ocean

cedar tide Apr 28, 2025, 10:09 PM

#

Will r2 be based on 3.1 or a new base?

torn mantle Apr 28, 2025, 10:09 PM

#

hardy pecan It got 3/10 of the simplebench questions ive checked so far, using 235B with thi...

How much for r1?

ocean vortex Apr 28, 2025, 10:09 PM

#

it probably is still better than r1 overall though, all that aside and just looking at the final responses

hollow ocean Apr 28, 2025, 10:10 PM

#

small haven

It was only 62%

ocean vortex Apr 28, 2025, 10:10 PM

#

just not everywhere, 'overall'

hollow ocean Apr 28, 2025, 10:10 PM

#

R2 prediction is 86%

raven void Apr 28, 2025, 10:10 PM

#

small haven

Literally free money

hardy pecan Apr 28, 2025, 10:10 PM

#

torn mantle How much for r1?

offiical result is passing through 200 questiom @ pass5

small haven Apr 28, 2025, 10:10 PM

#

hollow ocean It was only 62%

"only" haha

cedar tide Apr 28, 2025, 10:11 PM

#

hardy pecan offiical result is passing through 200 questiom @ pass5

To same that qwen

small haven Apr 28, 2025, 10:11 PM

#

raven void Literally free money

for a "no"

torn mantle Apr 28, 2025, 10:11 PM

#

ocean vortex it probably is still better than r1 overall though, all that aside and just look...

Yea i think its decent overall

hardy pecan Apr 28, 2025, 10:11 PM

#

cedar tide To same that qwen

yeah, but will wait for a verified score from Simple Bench, but probably in the ballpark of grok 3 tbh

small haven Apr 28, 2025, 10:12 PM

#

guys gpt 5 is releasing september 2024

brittle tiger Apr 28, 2025, 10:12 PM

#

torn mantle https://x.com/polynoamial/status/1916974477042454696

How can OpenAI be mad at all when they don't bench against other company models in their latest release lmao

brittle tiger Apr 28, 2025, 10:14 PM

#

brittle tiger How can OpenAI be mad at all when they don't bench against other company models ...

Lol

torn mantle Apr 28, 2025, 10:14 PM

#

brittle tiger How can OpenAI be mad at all when they don't bench against other company models ...

its an open source model

#

I just dont understand that comment tbh

#

Jealous for what?

small haven Apr 28, 2025, 10:15 PM

#

brittle tiger Lol

bro got threatened by china already lol

keen fulcrum Apr 28, 2025, 10:15 PM

#

rugged brook Apr 28, 2025, 10:16 PM

#

how is it

#

worse

#

its good

small haven Apr 28, 2025, 10:16 PM

#

met expectations, my expectations: 💩 as usual

torn mantle Apr 28, 2025, 10:19 PM

#

I think it met expectations

#

The focus seems to be fixing qwen2.5 issues and add a hybrid reasoning feature

#

Better multilingual support + better multi-turn convo

#

Its not bad for its size

hardy pecan Apr 28, 2025, 10:23 PM

#

Yeah, met expectations (I wasn't expecting 2.5 pro level)

torn mantle Apr 28, 2025, 10:23 PM

#

hardy pecan Yeah, met expectations (I wasn't expecting 2.5 pro level)

Same

small haven Apr 28, 2025, 10:25 PM

#

still behind oai like > 6months

keen fulcrum Apr 28, 2025, 10:26 PM

#

We will see end of the year
I would love Qwen to catch up more

small haven Apr 28, 2025, 10:27 PM

#

pfft, not being pessimistic but, i think 6 months still

#

oai already working on o4, top 50 codeforces
o3 is at top 250

#

rugged brook Apr 28, 2025, 10:31 PM

#

ye

keen fulcrum Apr 28, 2025, 10:35 PM

#

https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
v3 + r1

tngtech/DeepSeek-R1T-Chimera · Hugging Face

hardy pecan Apr 28, 2025, 10:37 PM

#

Ok qwen-235B-A22B with thinking, simple bench - 3/20 , so not the best result in the world.

hollow ocean Apr 28, 2025, 10:39 PM

#

hardy pecan Ok qwen-235B-A22B with thinking, simple bench - 3/20 , so not the best result in...

So 15%

hardy pecan Apr 28, 2025, 10:39 PM

#

Essentially... but will wait to see, maybe I had bad variance for my pass@1

keen fulcrum Apr 28, 2025, 10:42 PM

#

hardy pecan Ok qwen-235B-A22B with thinking, simple bench - 3/20 , so not the best result in...

What about qwen with web search

rugged brook Apr 28, 2025, 10:47 PM

#

Bro what

barren prairie Apr 28, 2025, 10:53 PM

#

Qwen is good but not that thing ..I didn t hype so much for it ...but so good as an open source model 😅

small haven Apr 28, 2025, 10:56 PM

#

lay the adderall down bud

#

o3 != qwen3 lol

#

?

#

uve lost the plot

hollow ocean Apr 28, 2025, 10:59 PM

#

Simple bench will never be solved

zinc ore Apr 28, 2025, 11:03 PM

#

small haven

How come their actual released model isn't performing equivalent to a top 200 programmer then?

#

Since it says o3 Jan is 175th rank

small haven Apr 28, 2025, 11:04 PM

#

zinc ore Since it says o3 Jan is 175th rank

distillation

zinc ore Apr 28, 2025, 11:04 PM

#

It's just their claims and "trust us bro"

small haven Apr 28, 2025, 11:05 PM

#

i mean the "us" is oai themselves, so pretty reputable lol

zinc ore Apr 28, 2025, 11:07 PM

#

I disagree. As they have a history of overhyping. They do get some nice performance from their models don't get me wrong.

patent aspen Apr 28, 2025, 11:08 PM

#

small haven

Hill climbing on competitive programming problems doesn't make a model a superhuman coder. Those problems are self-contained.

zinc ore Apr 28, 2025, 11:08 PM

#

I basically consider Google and anthropic to be pretty reliable in their claims, but I see much less reliability coming from openAI and grok imo (inb4 I agitate some grok bros)

small haven Apr 28, 2025, 11:09 PM

#

patent aspen Hill climbing on competitive programming problems doesn't make a model a superhu...

ya i agree codeforce maxxing is not applicable to real world, unless u work in hft, but its pretty good benchmark which is about to be saturated

patent aspen Apr 28, 2025, 11:10 PM

#

zinc ore I basically consider Google and anthropic to be pretty reliable in their claims,...

Anthropic likes to claim they'll have AGI in like a year tho

#

To attract funding

zinc ore Apr 28, 2025, 11:10 PM

#

Dario specifically you mean?

patent aspen Apr 28, 2025, 11:10 PM

#

Their CEO?

#

Yeah

zinc ore Apr 28, 2025, 11:10 PM

#

I'm assuming you're referring to his prediction of AGI by 2026-2027

patent aspen Apr 28, 2025, 11:11 PM

#

Yeah I mean they're trying to attract funding so it's w/e

zinc ore Apr 28, 2025, 11:11 PM

#

Ehh, to me they aren't anywhere near as shameless as openAI with it

small haven Apr 28, 2025, 11:12 PM

#

if i was an anthropic shareholder, id be shaking tbh

patent aspen Apr 28, 2025, 11:12 PM

#

That's true. OAI trying to appear like they already have AGI cooking in the lab and are just waiting until it's perfect to unleash

#

Yeah Anthropic is in a tough spot

#

Coding was their main selling point

#

That won't last through the year

zinc ore Apr 28, 2025, 11:21 PM

#

I don't have much of an opinion on deepseek

small haven Apr 28, 2025, 11:21 PM

#

yea ok bud o1 > o3

#

agreed

patent aspen Apr 28, 2025, 11:23 PM

#

Objectives and key results. It's how all silicon valley companies track goals

#

If they haven't achieved that by the end of the year, it will carry over until they do

small haven Apr 28, 2025, 11:24 PM

#

lmaoooo

#

will toysrus ever develop a sota model

zinc ore Apr 28, 2025, 11:25 PM

#

Dollar general SOTA model when

small haven Apr 28, 2025, 11:25 PM

#

adderall still potent

#

they are like so behind

#

kinda facts

#

i have no clue what is that

patent aspen Apr 28, 2025, 11:27 PM

#

Why?

zinc ore Apr 28, 2025, 11:28 PM

#

Let's be real, 2 weeks max

#

Maybe 2 months at worst

patent aspen Apr 28, 2025, 11:28 PM

#

They were 3 years ahead when they started

zinc ore Apr 28, 2025, 11:28 PM

#

And debatably even behind

patent aspen Apr 28, 2025, 11:29 PM

#

Now they're a bit behind

#

They're losing ground

small haven Apr 28, 2025, 11:31 PM

#

zinc ore Let's be real, 2 weeks max

u taking that good good arent u

zinc ore Apr 28, 2025, 11:31 PM

#

Good good is being two years ahead

small haven Apr 28, 2025, 11:31 PM

#

zinc ore Good good is being two years ahead

tbh its like 6 months still

hollow ocean Apr 28, 2025, 11:32 PM

#

I’m betting every dollar I have left that the simple bench won’t be solved this year

zinc ore Apr 28, 2025, 11:32 PM

#

I'm not even disagreeing they're ahead, but they're barely ahead of 2.5 pro at a much higher price

small haven Apr 28, 2025, 11:33 PM

#

zinc ore I'm not even disagreeing they're ahead, but they're barely ahead of 2.5 pro at a...

ur comparing o3 vs. gemini 2.5 pro? its like not apples to apples

zinc ore Apr 28, 2025, 11:33 PM

#

Lemme guess you wanna compare to the o1 million dollar arc test

patent aspen Apr 28, 2025, 11:34 PM

#

I'm not speculating.

small haven Apr 28, 2025, 11:34 PM

#

zinc ore Lemme guess you wanna compare to the o1 million dollar arc test

nope, just the publicly release o3..

zinc ore Apr 28, 2025, 11:34 PM

#

You said 2.5 pro and o3 wasn't apples to apples

#

Now you say compare to o3 lol

small haven Apr 28, 2025, 11:34 PM

#

zinc ore Now you say compare to o3 lol

why are u comparing with o1 lol

#

that got released like 5 months ago

zinc ore Apr 28, 2025, 11:35 PM

#

Dog, I asked you if we can't compare 2.5 to o3, then what should we compare it to?

small haven Apr 28, 2025, 11:35 PM

#

zinc ore Dog, I asked you if we can't compare 2.5 to o3, then what should we compare it t...

huh, when did i say that

#

i just said they are not apples to apples, meaning o3 is ahead

patent aspen Apr 28, 2025, 11:37 PM

#

One thing I will say is that public version numbers are mostly branding and don't always consistently reflect the underlying process used to train the models

zinc ore Apr 28, 2025, 11:37 PM

#

That's not what apples to apples means

#

An apples to apples comparison allows for something being far superior

small haven Apr 28, 2025, 11:37 PM

#

ok mb

patent aspen Apr 28, 2025, 11:38 PM

#

We think o4-mini are o3 are different post-trainings of the same underlying pre-trained model

#

Similarly 2.5 and 2.0 are part of same generation

#

It does matter because pre-training takes ages, so it has implications for the timelines of about 6 months

hollow ocean Apr 28, 2025, 11:51 PM

#

If you put “it’s a trick question” the top models aces simplebench

#

Not a single question wrong

#

Yes try it

#

I just did it with o4 mini high

#

Yep

#

Try it if you don’t believe it

#

Grok 3 no thinking failed first question

#

4.5 gets it

small haven Apr 29, 2025, 12:05 AM

#

Question 3

Jeff, Jo and Jim are in a 200m men's race, starting from the same position. When the race starts, Jeff 63, slowly counts from -10 to 10 (but forgets a number) before staggering over the 200m finish line, Jo, 69, hurriedly diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below, before racing to finish the 200m, while exhausted Jim, 80, gets through reading a long tweet, waving to a fan and thinking about his dinner before walking over the 200m finish line. Who likely finished last?

this question o3, o1-pro didn't get it, but o4-mini-high got it

torn mantle Apr 29, 2025, 12:27 AM

#

Qwen 3 seems to hallucinate quite a lot

#

Yea and it gets many things wrong as well

#

Yea even maverick may be better than qwen 3

drifting thorn Apr 29, 2025, 12:50 AM

#

Oh you guys are talking about Qwen 3 too

torn mantle Apr 29, 2025, 12:53 AM

#

Sorry but its nowhere near gemini 2.5 or o3

drifting thorn Apr 29, 2025, 12:54 AM

#

Not expecting to see Qwen 3 surpassing Gemini 2.5 or o3

#

But I expected it to surpass R1

torn mantle Apr 29, 2025, 12:55 AM

#

drifting thorn But I expected it to surpass R1

Doesn't seem to me so far

#

So many wrong outputs

zinc ore Apr 29, 2025, 12:57 AM

#

Could it be bugs? Since that happens a lot during releases

drifting thorn Apr 29, 2025, 12:58 AM

#

… At least it provides researchers with new models…

leaden palm Apr 29, 2025, 12:58 AM

#

torn mantle Yea even maverick may be better than qwen 3

"even"?

#

maverick is larger than all qwen models

drifting thorn Apr 29, 2025, 12:58 AM

#

The AI researchers can now move on to Qwen 3 from Qwen 2.5

small haven Apr 29, 2025, 1:09 AM

#

just wait for behemoth

torn mantle Apr 29, 2025, 1:11 AM

#

Qwen 3 needs a re-evaluation

small haven Apr 29, 2025, 1:11 AM

#

claude pro sucks

torn mantle Apr 29, 2025, 1:12 AM

#

Just coding

small haven Apr 29, 2025, 1:12 AM

#

?

#

o3 > 3.7

#

just ask for the git diff and apply with 4.1 in cursor

torn mantle Apr 29, 2025, 1:13 AM

#

Well the most nerfed models are anthropic ones

small haven Apr 29, 2025, 1:13 AM

#

i agree, thats why i only use it to apply the git diffs

#

still potent i see

#

and fast

torn mantle Apr 29, 2025, 1:14 AM

#

Alibaba still has a long way to go

small haven Apr 29, 2025, 1:15 AM

#

i bought alibaba back in 2017, im still breakeven

torn mantle Apr 29, 2025, 1:15 AM

#

Hopefully it drops

#

No wonder they weren't hyping this model

small haven Apr 29, 2025, 1:17 AM

#

on what bench

#

oh right, the addy vibes

#

huh where is o3 in webdev

#

no wonder sonnet is #1

#

it would be close tho

#

like above 4.1

#

bru

#

is o3 in webdev tho

#

its cheaper than o1 and o1 was on it

#

true

still mason Apr 29, 2025, 1:33 AM

#

How often does the leaderboard on https://lmarena.ai/ update?

north vale Apr 29, 2025, 1:34 AM

#

every week or so

small haven Apr 29, 2025, 1:34 AM

#

o4 mini < maverick, oh hell naww

leaden palm Apr 29, 2025, 1:43 AM

#

small haven o4 mini < maverick, oh hell naww

o4 mini is obeying the system prompt better

#

why are you encouraging models that don't obey the system prompt

void copper Apr 29, 2025, 2:06 AM

#

torn mantle Qwen 3 seems to hallucinate quite a lot

which mode? If it hallucinate in non-thinking mode, I won't be that surprised since it combines thinking and non thinking stuff at once, and very likely those hallucinations are because of it thinking training. (Just like how human thinks, yeah, hallucinate a lot)

small haven Apr 29, 2025, 2:16 AM

#

leaden palm o4 mini is obeying the system prompt better

huh the right is maverick not o four mini

leaden palm Apr 29, 2025, 2:16 AM

#

small haven huh the right is maverick not o four mini

yes, and maverick didn't obey the system prompt

#

small haven Apr 29, 2025, 2:24 AM

#

ahh i seee

#

the problem with this, is that people are still gonna vote for maverick

#

u inflating the bad models, great

alpine coral Apr 29, 2025, 2:28 AM

#

hollow ocean Try it if you don’t believe it

yeah I tried and unsurprsingly o4mini-high doesn't ace all 10/10 of the public simplebench questions by adding "it's a trick question"

small haven Apr 29, 2025, 2:30 AM

#

wym? ur voting for o4 mini?

#

its visually not attractive on face value

small haven Apr 29, 2025, 2:54 AM

#

at this point i dont care anymore

#

BUT WHEN O3 PRO SAM

#

solar nebula Apr 29, 2025, 2:56 AM

#

so thats how they made o3

small haven Apr 29, 2025, 2:57 AM

#

o4 full coming in this summer

#

OK BUD

#

take that bet to polymarket, dont need more money

#

addy is hitting

#

i still dont get how gpt5 is gonna work

#

like if i wanna use o3 pro

#

so my prompt would be "please use the biggest brain power in the world to answer" like what?

leaden palm Apr 29, 2025, 3:04 AM

#

small haven like if i wanna use o3 pro

ideally if your prompt is hard to get right it would just know to think hard

small haven Apr 29, 2025, 3:05 AM

#

leaden palm ideally if your prompt is hard to get right it would just know to think hard

thats a nightmare if u think about it

#

u want it to use o3 pro, but always give o4 mini

high egret Apr 29, 2025, 3:43 AM

#

Hi guys ! i'm new here, is anyone online ?

#

quite impressed by qwen3 and wanted to discuss of it

#

I made a bet of 20$ on qwen3 best model before april 30, probably the worst I'll ever do haha

#

but 6k if I win lmao

small haven Apr 29, 2025, 3:50 AM

#

bruh

keen fulcrum Apr 29, 2025, 3:50 AM

#

small haven Apr 29, 2025, 3:51 AM

#

llol

high egret Apr 29, 2025, 3:52 AM

#

The model won't be on leaderboard when the market close lmao

#

but i'll pay my studies ez if it's win 👌

keen fulcrum Apr 29, 2025, 3:54 AM

#

Qwen 3 on lmarena?
any external benchmark?

high egret Apr 29, 2025, 3:54 AM

#

not now I think

#

yes we already have benchmarks

keen fulcrum Apr 29, 2025, 3:54 AM

#

Can you share

high egret Apr 29, 2025, 3:55 AM

#

#

quite impressive

keen fulcrum Apr 29, 2025, 3:55 AM

#

high egret I made a bet of 20$ on qwen3 best model before april 30, probably the worst I'll...

Pure gamble without any data backing up

keen fulcrum Apr 29, 2025, 3:55 AM

#

high egret

These are official

high egret Apr 29, 2025, 3:55 AM

#

keen fulcrum Pure gamble without any data backing up

true

compact knoll Apr 29, 2025, 3:55 AM

#

hi everyone, sorry for dumb question, i guess people ask it all the time, but right now, what's the best model for STEM? (not necessarily for advanced coding, more for solving complex STEM problems, including math)

high egret Apr 29, 2025, 3:56 AM

#

compact knoll hi everyone, sorry for dumb question, i guess people ask it all the time, but ri...

I mostly use gemini 2.5 pro because of the long context support

#

i'm a pure math major

#

but o3 and o4-mini give clearer response when you need a summary like of a course

high egret Apr 29, 2025, 3:58 AM

#

keen fulcrum These are official

no idea, i found this on reddit

compact knoll Apr 29, 2025, 3:58 AM

#

yeah okay, im wondering which one is the best beetwen 2.5 pro and o3 right now, well i guess it depends on the demand, and that 2.5 will be better than o3 for some specific things and vice versa

keen fulcrum Apr 29, 2025, 3:58 AM

#

compact knoll hi everyone, sorry for dumb question, i guess people ask it all the time, but ri...

https://scale.com/leaderboard

SEAL LLM Leaderboards: Expert-Driven Private Evaluations

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

high egret Apr 29, 2025, 3:59 AM

#

compact knoll yeah okay, im wondering which one is the best beetwen 2.5 pro and o3 right now, ...

yeah, i feel gemini 2.5 pro is better for complex tasks while o3 is more human friendly

#

but like when I need a full summary of a 200 pages course, I just give the full pdf to gemini, ask for it and get a perfect clear LaTeX document first try

#

not with o3

compact knoll Apr 29, 2025, 4:00 AM

#

yeah i agree the context with Gemini is really impressive and super useful

high egret Apr 29, 2025, 4:00 AM

#

and fu*** free

#

you're a STEM student ?

compact knoll Apr 29, 2025, 4:01 AM

#

yep

high egret Apr 29, 2025, 4:01 AM

#

which major ?

keen fulcrum Apr 29, 2025, 4:01 AM

#

small haven Apr 29, 2025, 4:10 AM

#

WE WANT O3 PRO

#

oh hes sober now

#

you know ur gonna get a passing test when traces look like this

hollow ocean Apr 29, 2025, 5:04 AM

#

If you go on the Deepseek subreddit some still say R1 is better than 2.5 and o3

hardy pecan Apr 29, 2025, 5:08 AM

#

Humans are tribal to a fault

keen fulcrum Apr 29, 2025, 5:09 AM

#

hollow ocean If you go on the Deepseek subreddit some still say R1 is better than 2.5 and o3

Thats the case

hardy pecan Apr 29, 2025, 5:09 AM

#

It was probably their first LLM they really used when it went viral, and never looked further

hollow ocean Apr 29, 2025, 5:14 AM

#

hardy pecan It was probably their first LLM they really used when it went viral, and never l...

What about the people that only knows 4o and have no clue about any other models

small haven Apr 29, 2025, 6:18 AM

#

Grok three point five isnt going to beat o three

torn mantle Apr 29, 2025, 6:32 AM

#

I think they screwed up something on Qwen 3 training

keen beacon Apr 29, 2025, 6:32 AM

#

patent aspen We think o4-mini are o3 are different post-trainings of the same underlying pre-...

Is it though? It doesn't seem like they are. The simpleqa scores and based on the pretrained knowledge past Oct 2023. Specific confabulations about stuff that 4.1 makes and 4.1 mini has no idea about. It seems o4 mini is based on 4.1 mini/ is not on the 4.1 base at the very least

torn mantle Apr 29, 2025, 6:32 AM

#

the model is just a bit better than qwen 2.5 max

#

qwq*

keen beacon Apr 29, 2025, 6:33 AM

#

torn mantle the model is just a bit better than qwen 2.5 max

The pretrained model might be excellent the post training mightve been rushed

#

I haven't had time to examine it yet I just woke up lol

fleet lintel Apr 29, 2025, 6:33 AM

#

hollow ocean If you go on the Deepseek subreddit some still say R1 is better than 2.5 and o3

I feel the same about some grok users.

#

I am kinda same but from the opposite side. I am rooting for everyone to one up each other every week except for Meta. I want llama models to burn in hell.

torn mantle Apr 29, 2025, 6:36 AM

#

keen beacon I haven't had time to examine it yet I just woke up lol

welp

#

my sleep schedule is fked up

#

so i tried it

#

it does a poor job at recalling

keen beacon Apr 29, 2025, 6:37 AM

#

damn lol i was too tired and slept

#

i think even if the instruct versions arent that good the pretrained base models might be excellent

#

makes a good base for fine tuning

#

qwen pretraining is 👌

torn mantle Apr 29, 2025, 6:38 AM

#

you would expect high quality info/data since they scraped a lot of contents from pdf files

#

#

but im really not noticing any difference

keen beacon Apr 29, 2025, 6:39 AM

#

torn mantle but im really not noticing any difference

it seems the pretrained base models act a class above their qwen 2.5 variants at least, in base model form. the instruct version they post trained and release mightve been rushed

torn mantle Apr 29, 2025, 6:40 AM

#

keen beacon it seems the pretrained base models act a class above their qwen 2.5 variants at...

yea could be

#

oh

#

https://x.com/kalomaze/status/1917085495416525108

kalomaze (@kalomaze) on X

i don't know what hparams they threw at qwen models in post training but the qwen3 models have some absolutely deepfried world knowledge. the basic factual recall of the qwen instructs is by far the worst of any frontier model by a big margin. the models are otherwise interesting

#

so it wasn't just me

keen fulcrum Apr 29, 2025, 6:59 AM

#

https://fixupx.com/theo/status/1916995252629737491
It needs desperately more context
if it keeps overthinking

Theo - t3.gg (@theo)

Qwen3 maintains the Qwen trend of massively overthinking tasks, generating thousands of thinking tokens and running out of context before answering.

**💬 40 🔁 11 ❤️ 466 **

▶ Play video

fleet lintel Apr 29, 2025, 7:01 AM

#

https://x.com/elonmusk/status/1917099777327829386

Elon Musk (@elonmusk) on X

Next week, Grok 3.5 early beta release to SuperGrok subscribers only.

It is the first AI that can, for example, accurately answer technical questions about rocket engines or electrochemistry.

@Grok is reasoning from first principles and coming up with answers that simply don’t

small haven Apr 29, 2025, 7:03 AM

#

why are we shilling pointless models, o3 pro is the only model to set eyes on

fleet lintel Apr 29, 2025, 7:06 AM

#

Qwen3 is whatever. I also didn't expect much from it but grok 3.5 could be interesting (low confidence)

Openai hype has blue balled me too many times. I won't get hyped till they prove it next time

zinc ore Apr 29, 2025, 7:08 AM

#

I'm only hype for whatever big drop Google does next

hardy pecan Apr 29, 2025, 7:09 AM

#

Well I can't complain with this cadence of new models

#

Golden age

fleet lintel Apr 29, 2025, 7:10 AM

#

Google I/o is coming. They will drop some shiny features but for models themselves, I think it will take them few months to drop something crazy

torn mantle Apr 29, 2025, 7:11 AM

#

fleet lintel https://x.com/elonmusk/status/1917099777327829386

this guy will hype anything

#

so next month we will have :

google coding models
deepseek r2
grok 3.5

fleet lintel Apr 29, 2025, 7:20 AM

#

Add o3 pro

small haven Apr 29, 2025, 7:22 AM

#

U GOT ME HARD

alpine coral Apr 29, 2025, 7:26 AM

#

high egret

doesn't make any sense imo

#

using Qwen3-235B-A22B on official site with thinking enabled - i can baredly tell it apart from something like qwq

#

but their evals have it outperforming gem-2.5-pro on some benchmarks? seens wild.. no idea how that works ha

keen beacon Apr 29, 2025, 7:31 AM

#

alpine coral but their evals have it outperforming gem-2.5-pro on some benchmarks? seens wild...

i think their post training wasnt fleshed out. they focused on specific stuff similar to grok 3 mini reasoning i think

alpine coral Apr 29, 2025, 7:33 AM

#

but the evals are for the same (post trained) model i've been using, no?

keen beacon Apr 29, 2025, 7:33 AM

#

alpine coral but the evals are for the same (post trained) model i've been using, no?

yea

keen beacon Apr 29, 2025, 7:34 AM

#

alpine coral but the evals are for the same (post trained) model i've been using, no?

post training i mean instruct process/reasoning training/etc on top of the pretrained base model

alpine coral Apr 29, 2025, 7:34 AM

#

ah k yeah i mean it feels a bit undercooked for sure

#

like V3 gets some questions which this model doesn't

keen beacon Apr 29, 2025, 7:35 AM

#

like look at this page, for qwen 2.5 they released a comprehensive page of instruct and base benchmarks: https://qwenlm.github.io/blog/qwen2.5-llm/
we only got this for qwen 3: https://qwenlm.github.io/blog/qwen3/

Qwen

Qwen2.5-LLM: Extending the boundary of LLMs

GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD
Introduction In this blog, we delve into the details of our latest Qwen2.5 series language models. We have developed a range of decoder-only dense models, with seven of them open-sourced, spanning from 0.5B to 72B parameters. Our research indicates a significant interest among users in models within th...

Qwen

Qwen3: Think Deeper, Act Faster

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

#

it seems they rushed everything lol

keen beacon Apr 29, 2025, 7:36 AM

#

alpine coral like V3 gets some questions which this model doesn't

might be good for fine tunes though

#

if it is really the post training, the pre trained base might be very good

alpine coral Apr 29, 2025, 7:37 AM

#

oh don't get me wrong - for it's size and being open source (plus multi modal.. i think?) i mean there's a lot to work with / build on

#

and it's for sure solid

#

just not up there with the likes of 2.5-pro

#

based on my limited playing around with it (and subjective prefernces.. needless to say ha)

keen beacon Apr 29, 2025, 7:38 AM

#

yuh definitely not

#

major boon for local ppl tho

alpine coral Apr 29, 2025, 7:38 AM

#

totally

keen beacon Apr 29, 2025, 7:38 AM

#

models that arent dumb af can be run locally

alpine coral Apr 29, 2025, 7:39 AM

#

yeah huge step

fleet lintel Apr 29, 2025, 7:43 AM

#

small haven Apr 29, 2025, 7:45 AM

#

alpine coral Apr 29, 2025, 7:46 AM

#

i am actually excited - but also appreciate the shutup option lol

small haven Apr 29, 2025, 7:46 AM

#

haha

alpine coral Apr 29, 2025, 7:48 AM

#

o1 pro is the only model i have ever seen get this right

Which four countries, when listed alphabetically by their English short-form names, are the first to have flags containing more than five stars and featuring the colours red, yellow, and green?

torn mantle Apr 29, 2025, 7:52 AM

#

alpine coral but their evals have it outperforming gem-2.5-pro on some benchmarks? seens wild...

benchmaxxing

#

those benchmarks doesnt reflect anything

#

anyone can finetune their model on benchmarks

alpine coral Apr 29, 2025, 7:52 AM

#

torn mantle benchmaxxing

yeah as wild suggests too

#

makes the most sense

torn mantle Apr 29, 2025, 7:53 AM

#

it just doesnt feel like a smart model

alpine coral Apr 29, 2025, 7:53 AM

#

i mean it's good for sure, but yeah not great like 2.5pro great

torn mantle Apr 29, 2025, 7:54 AM

#

i think its a bit above qwen max 2.5

keen beacon Apr 29, 2025, 7:54 AM

#

i doubt they trained directly on the benchmarks but they definitely did targeted training and didnt fully flesh it out. even then, i doubt it would compete with 2.5 pro. i think the benchmarks they released generally show that its quite a bit worse than 2.5 pro though

torn mantle Apr 29, 2025, 7:55 AM

#

grok 3.5 should be interesting

keen beacon Apr 29, 2025, 7:56 AM

#

torn mantle i think its a bit above qwen max 2.5

qwen max is larger probably more close to r1 size

alpine coral Apr 29, 2025, 8:00 AM

#

keen beacon i doubt they trained directly on the benchmarks but they definitely did targeted...

kinda. i mean it beats 2.5pro in 4/5 of them, marginally; but yeah in the other 5, 2.5pro does better handily

#

ArenaHard is like an odd bench to lead with with (it's not alphabetical) - i kinda felt liek that project was dormant ha

keen beacon Apr 29, 2025, 8:02 AM

#

alpine coral kinda. i mean it beats 2.5pro in 4/5 of them, marginally; but yeah in the other ...

bfcl makes sense (function calling) i think it was more natively trained in (function calling) compared to 2.5 pro. the codeforces/livecodebench seems like targeted training towards competitive coding

keen beacon Apr 29, 2025, 8:02 AM

#

alpine coral ArenaHard is like an odd bench to lead with with (it's not alphabetical) - i kin...

i think theres a new version out i believe

#

yeah arena hard v2

alpine coral Apr 29, 2025, 8:03 AM

#

ah i see cheers didn't realise that

small haven Apr 29, 2025, 8:06 AM

#

thats old news craig

#

good morning

alpine coral Apr 29, 2025, 8:09 AM

#

lol what a cordial feud

#

anyway so yeah i think it can be said that handling multiple questions in a single prompt is not qwen3-235b(thinking)

#

's strong suit

#

#

this is like 8 questions

keen beacon Apr 29, 2025, 8:11 AM

#

curious how much thatll change if u do it 1 by 1

alpine coral Apr 29, 2025, 8:11 AM

#

yeah i suspect quite a bit

#

like sonn-3.7 thinking would do better than vanilla 3.7 one by one too

#

but yeah. they're all given the same prompts / quiz

keen beacon Apr 29, 2025, 8:12 AM

#

did u also try qwen 3 32b?

#

its more of a conventional model

cedar tide Apr 29, 2025, 8:12 AM

#

torn mantle so next month we will have : - google coding models - deepseek r2 - grok 3.5

And maybe gemini 2.5 ultra 🧐

alpine coral Apr 29, 2025, 8:13 AM

#

alpine coral but yeah. they're all given the same prompts / quiz

so it's a level playing field of sorts (but just a highly flawed / lazy way of quickly gathering data on models for comprehension / reasoning)

alpine coral Apr 29, 2025, 8:13 AM

#

keen beacon did u also try qwen 3 32b?

no, but i will

cedar tide Apr 29, 2025, 8:13 AM

#

Release of Llama reasoning and behemoth today ?

keen beacon Apr 29, 2025, 8:14 AM

#

the qwen 3 30b moe is also interesting

alpine coral Apr 29, 2025, 8:15 AM

#

alpine coral this is like 8 questions

meant to share this

cedar tide Apr 29, 2025, 8:17 AM

#

cedar tide Release of Llama reasoning and behemoth today ?

And tomorrow nova premier

Screenshot_2025-04-29-10-16-49-721_com.android.chrome-edit.jpg

small haven Apr 29, 2025, 8:37 AM

#

alpine coral meant to share this

kinda sad theres no o1 pro

frosty lark Apr 29, 2025, 8:39 AM

#

question: in beta.lmarena seems that there are no cloaked models (for initial evaluation). Is that only a coincidence for my usage? (I mean, I think that testing already listed ones further is only good)

calm sequoia Apr 29, 2025, 8:49 AM

#

There simply may not be any anonymous models currently under evaluation

#

The gemini doesn't seem to have this problem.

keen beacon Apr 29, 2025, 8:51 AM

#

I've seen bad reward hacking behaviors in o3 and sonnet too

cedar tide Apr 29, 2025, 8:51 AM

#

qwen 3, 253b, 30b Moe and 32b dense are already on the arena (think mode)

calm sequoia Apr 29, 2025, 8:53 AM

#

Does the anonify-lt-ev3-2 exist in the arena?

ocean vortex Apr 29, 2025, 8:54 AM

#

alpine coral meant to share this

so how come you have chatgpt o3-medium scoring higher than o3-high in there?

#

or was it just single attempt no regen for chatgpt in the earlier screen?

cedar tide Apr 29, 2025, 8:59 AM

#

My big problem with models that have a Think mode and a mode without Think is that all the benchmarks that we will find on the models are with Think, so if we want to use them without Think like the vast majority of people we do not know if they are better than the competitors.

keen beacon Apr 29, 2025, 8:59 AM

#

Check base model benchmarks if they're available

cedar tide Apr 29, 2025, 8:59 AM

#

#qwen 3
#gemini 2.5
#nemotron

keen beacon Apr 29, 2025, 9:00 AM

#

cedar tide #qwen 3 #gemini 2.5 #nemotron

For qwen 3 they released base model benchmarks for qwen 3 235b

#

https://qwenlm.github.io/blog/qwen3/

Qwen

Qwen3: Think Deeper, Act Faster

QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...

cedar tide Apr 29, 2025, 9:01 AM

#

keen beacon For qwen 3 they released base model benchmarks for qwen 3 235b

Yes but no another one have benchmarks on the base models

keen beacon Apr 29, 2025, 9:01 AM

#

cedar tide Yes but no another one have benchmarks on the base models

Ya. I look at the simpleqa score these days

#

For other models

#

Sort of an approximation of how strong the base model is

cedar tide Apr 29, 2025, 9:03 AM

#

keen beacon https://qwenlm.github.io/blog/qwen3/

the lm arena, artificial analysis, the "aider" bench etc. should also integrate the non-reasoning models of each model

#

When qwen 3 multimodal ?

keen beacon Apr 29, 2025, 9:08 AM

#

cedar tide When qwen 3 multimodal ?

Qwen 3.5 in a few months probably

cedar tide Apr 29, 2025, 9:11 AM

#

I bet grok 3.5 will only be grok 3 with think stable (finish training)

#

a bit like Gemini 2.5

keen beacon Apr 29, 2025, 9:13 AM

#

No Gemini 2.5 was more than that

cedar tide Apr 29, 2025, 9:13 AM

#

keen beacon No Gemini 2.5 was more than that

not much more than that

keen beacon Apr 29, 2025, 9:14 AM

#

cedar tide not much more than that

? The cut off is different compared to Gemini 2. Means differing pretraining/cpt

cedar tide Apr 29, 2025, 9:15 AM

#

keen beacon ? The cut off is different compared to Gemini 2. Means differing pretraining/cpt

They continued the training of Gemini 2.0 exp

keen beacon Apr 29, 2025, 9:15 AM

#

cedar tide They continued the training of Gemini 2.0 exp

Continued pretraining

#

It's not just reasoning training

cedar tide Apr 29, 2025, 9:17 AM

#

@keen beacon What I mean is that ultimately it's pretty much the same model, for example there is no big difference between Gemini 2.0 Flash and 2.5 Flash without Think

keen beacon Apr 29, 2025, 9:18 AM

#

cedar tide <@456226577798135808> What I mean is that ultimately it's pretty much the same m...

They did cpt on 2.5 flash too I think. At least according to the claimed cut off. I don't use the model enough to verify

cedar tide Apr 29, 2025, 9:18 AM

#

keen beacon They did cpt on 2.5 flash too I think. At least according to the claimed cut off...

Yes but not much

high egret Apr 29, 2025, 9:52 AM

#

hi back

#

I feel something strange guys

#

I love google, i even have a lot of their stocks in part because I trust them in AI.

#

But

#

I mostly use chatgpt because it seems so much clearer

#

like gemini throws out big chunk of text that I have to read to understand

#

ChatGPT put separator, bold titles, emoji...

#

I feel like because of that chatgpt have a big edge for 95% of users

#

and like for us power users we are less impacted by that but imagine your grandma or your old nephew

keen beacon Apr 29, 2025, 10:15 AM

#

I personally don't find Gemini too verbose

alpine coral Apr 29, 2025, 10:44 AM

#

ocean vortex so how come you have chatgpt o3-medium scoring higher than o3-high in there?

that's just how it was 🤷‍♂️ but yeah it's counter-intuitive i know ha

#

or perhaps more just the small sample..i re-ran the same question set against o3 med and o3 high a few times - adding those scores, and o3 (high) just nudges o3 (medium)

#

kinda wild that an early version of o3 or o4-mini that @keen beacon had access to for red teaming got a perfect score

barren prairie Apr 29, 2025, 10:51 AM

#

high egret I feel like because of that chatgpt have a big edge for 95% of users

You only need to tell Gemini what you likes on the instructions and he follows what you want ...mine with bold titles , emojis and all

alpine coral Apr 29, 2025, 10:51 AM

#

a differemt set of questions (given over 2 prompts). median scores o3 high does better, but individually, o3 med has one very solid run. and curiously o1 (high) does best of all.. but small samples

high egret Apr 29, 2025, 10:52 AM

#

barren prairie You only need to tell Gemini what you likes on the instructions and he follows w...

yeah but that is a use for us power users not like every people

keen beacon Apr 29, 2025, 10:55 AM

#

alpine coral kinda wild that an early version of o3 or o4-mini that <@456226577798135808> had...

It was o3 iirc

#

O4 mini doesn't have knowledge about specific cut off probing questions

keen fulcrum Apr 29, 2025, 10:58 AM

#

https://huggingface.co/spaces/Presidentlin/llm-pricing-calculator

Llm Pricing - a Hugging Face Space by Presidentlin

calm sequoia Apr 29, 2025, 11:07 AM

#

alpine coral or perhaps more just the small sample..i re-ran the same question set against o3...

Dragon tail is very high given that it's a coding model. Are your questions specialized for coding?

alpine coral Apr 29, 2025, 11:13 AM

#

calm sequoia Dragon tail is very high given that it's a coding model. Are your questions spec...

ikr - and literally not a single coding question!

#

here's the link to the responses that prerelease o3 gave which were all correct #general message

and here's [another](#general message)

#

gjve you an idea of the kind of questions

ocean plume Apr 29, 2025, 11:24 AM

#

how to use gemini 2.5pro to read codebase for fix bug ?

calm sequoia Apr 29, 2025, 11:29 AM

#

alpine coral here's the link to the responses that prerelease o3 gave which were all correct ...

Ah yes I remember, you even have the Hanning window question of mine.

#

It's a shame the dragon tail is not available anymore, would be fun to play with it.

alpine coral Apr 29, 2025, 11:33 AM

#

calm sequoia Ah yes I remember, you even have the Hanning window question of mine.

yup!!

keen beacon Apr 29, 2025, 11:40 AM

#

alpine coral kinda wild that an early version of o3 or o4-mini that <@456226577798135808> had...

I believe it was an early access version of o3

keen beacon Apr 29, 2025, 11:40 AM

#

keen beacon It was o3 iirc

yeah

calm sequoia Apr 29, 2025, 11:45 AM

#

keen beacon I believe it was an early access version of o3

Was it nerfed on release or some time after the release?

drifting thorn Apr 29, 2025, 11:47 AM

#

Maybe prompt has its influence

golden ocean Apr 29, 2025, 12:22 PM

#

Maybe the prompt was the friends we made along the way

calm sequoia Apr 29, 2025, 12:51 PM

#

Ah yes, definitely. Unaligned yap score is the problem.

drifting thorn Apr 29, 2025, 12:52 PM

#

leaden meteor Apr 29, 2025, 1:08 PM

#

Anybody seen any improvements in 4o updates that sam mentioned couple of days ago?

keen beacon Apr 29, 2025, 1:09 PM

#

leaden meteor Anybody seen any improvements in 4o updates that sam mentioned couple of days ag...

iirc they adjusted the system prompt since then

torn mantle Apr 29, 2025, 1:18 PM

#

leaden meteor Anybody seen any improvements in 4o updates that sam mentioned couple of days ag...

They are just changing system prompt over and over and over

drifting thorn Apr 29, 2025, 2:16 PM

#

I'm going to look at the prompts in the #share-prompts

#

And I'd like to ask if system prompt is more imporant than post-training (RLHF)?

keen fulcrum Apr 29, 2025, 2:27 PM

#

R2 releasing or not is the question

calm sequoia Apr 29, 2025, 2:27 PM

#

What's this 👀 The OG GPT o1 would have never said "backward-forward magic"

leaden palm Apr 29, 2025, 2:27 PM

#

honestly o1 might've

#

but yeah o3 is a bit more laid back

keen fulcrum Apr 29, 2025, 2:28 PM

#

Did you experience frequent usage of goto in lua?

calm sequoia Apr 29, 2025, 2:28 PM

#

For me, the o1 was autistic geek without any emotions, just pure information

leaden palm Apr 29, 2025, 2:28 PM

#

keen fulcrum R2 releasing or not is the question

keen fulcrum Apr 29, 2025, 2:29 PM

#

leaden palm

Well it was said early may

leaden palm Apr 29, 2025, 2:29 PM

#

keen fulcrum Well it was said early may

that interpretation of your question is unlikely

#

everybody knows that r2 will eventually release, so the most logical interpretation of your question is an implicit "today"

keen fulcrum Apr 29, 2025, 2:30 PM

#

leaden palm everybody knows that r2 will eventually release, so the most logical interpretat...

Unless the source is obfuscated by the AI itself

mossy drum Apr 29, 2025, 2:34 PM

#

New models: qwen3-235b-a22b, hunyuan-turbos-20250416, qwen3-32b, qwen3-30b-a3b

keen fulcrum Apr 29, 2025, 2:37 PM

#

Its better

#

Gemini is currently lacking in coding, there are unreleased models which perform better

#

Benchmarks said otherwise so did the user feedback

#

Especially debugging code with Gemini isn’t great

#

I am using a mix out of all of them, I do believe Claude is better in that case

#

https://scale.com/leaderboard

SEAL LLM Leaderboards: Expert-Driven Private Evaluations

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

#

Humanity Exam contains a lot of math

#

They discontinued their sole coding benchmarks

#

https://scale.com/_next/image?url=https%3A%2F%2Fimg.plasmic.app%2Fimg-optimizer%2Fv1%2Fimg%3Fsrc%3D905b25c9709c5ee508bfc75525f63230.png%26f%3Dwebp%26q%3D75&w=1920&q=75

keen beacon Apr 29, 2025, 2:50 PM

#

can u see this?

#

i dont understand

#

refresh discord? extension maybe?

drifting thorn Apr 29, 2025, 2:59 PM

#

Is Qwen 3 having a good post-train potential?

#

#1365220952689868872 Look at this

torn mantle Apr 29, 2025, 3:01 PM

#

https://x.com/btibor91/status/1917232574344384522

Tibor Blaho (@btibor91) on X

meta.llama4-reasoning-17b-instruct-v1:0

keen fulcrum Apr 29, 2025, 3:02 PM

#

Because its webp

#

Webp is a compressed image used for the web

keen beacon Apr 29, 2025, 3:02 PM

#

https://x.com/btibor91/status/1917232574344384522

Tibor Blaho (@btibor91) on X

meta.llama4-reasoning-17b-instruct-v1:0

#

oh

#

you beat me to it

#

damnit

#

yeah I presume there'll be different sizes

keen beacon Apr 29, 2025, 3:04 PM

#

keen beacon https://x.com/btibor91/status/1917232574344384522

they call scout 17b but its not actually 17b lol. (total params, its misleading)

#

ahahaha

drifting thorn Apr 29, 2025, 3:06 PM

#

2T reasoning model must be interesting

keen beacon Apr 29, 2025, 3:06 PM

#

💀 pricing

#

your soul will leave your body

drifting thorn Apr 29, 2025, 3:07 PM

#

lmao imagine 2 times the price of the o1-pro

torn mantle Apr 29, 2025, 3:19 PM

#

keen beacon https://x.com/btibor91/status/1917232574344384522

Already shared it

keen beacon Apr 29, 2025, 3:19 PM

#

yes i know 🙄🙄

full kite Apr 29, 2025, 3:22 PM

#

calm sequoia Apr 29, 2025, 3:25 PM

#

Sorry for this, but I'm very curious

full kite Apr 29, 2025, 3:27 PM

#

calm sequoia

Nobody cares about that

#

It's fake anyway

calm sequoia Apr 29, 2025, 3:27 PM

#

Then press Skibidi

full kite Apr 29, 2025, 3:27 PM

#

ok

#

yes

elder rapids Apr 29, 2025, 4:04 PM

#

drifting thorn

easily 2.5 pro

keen beacon Apr 29, 2025, 4:08 PM

#

hmm i just realized the base model weights of qwen 235b and dense 32b werent released

#

strange

#

idk i just noticed it

#

they only released the post trained versions of both of them

#

i dont think it works on any model

#

it only works on stuff that supports it i think

elder rapids Apr 29, 2025, 4:09 PM

#

"which LLM has the most performance gain"

#

grok 3 doesn't get it

#

it never does

#

4o doesn't get it

#

it never does

#

it's not that good lmao

#

overreaches everytime, it's like when you ask it to prove something it supplements substance with verbosity

#

its like the early Gemini models

#

hell no

#

just ask it not to be lmao

#

it's literally that simple

keen beacon Apr 29, 2025, 4:11 PM

#

ya

elder rapids Apr 29, 2025, 4:11 PM

#

grok struggles with that

keen beacon Apr 29, 2025, 4:11 PM

#

i like 2.5 pro's default style

elder rapids Apr 29, 2025, 4:11 PM

#

then ask it to understand that dynamic

#

and implement it

#

easy

#

theres no other model that can do that

#

it's just insane

#

probably the exact reason I said

keen beacon Apr 29, 2025, 4:13 PM

#

idk i personally still main 2.5 pro

elder rapids Apr 29, 2025, 4:13 PM

#

lmsys is intensive/relies on complex understanding, which is human conversation

keen beacon Apr 29, 2025, 4:13 PM

#

nah lmsys is mostly single turn interactions

elder rapids Apr 29, 2025, 4:13 PM

#

yep

#

that's exactly what I said tho lmao

#

this is inherently more intensive than tasks like math

keen beacon Apr 29, 2025, 4:14 PM

#

ur not really conversing with the model much in a single turn

elder rapids Apr 29, 2025, 4:14 PM

#

coding

elder rapids Apr 29, 2025, 4:14 PM

#

keen beacon ur not really conversing with the model much in a single turn

doesn't matter

#

any inference is more intensive than pre established knowledge

#

and that's what lmsys accomplishes

#

nah

#

it doesn't matter

#

it doesn't because that would sidestep what I'm saying completely

#

if I'm not talking about context tuning

#

then I'm talking about one off prompt intensity

#

yep

#

adaptation to that intensity

#

usually is

#

let me explain

#

if I ask o3 to answer 10 things

#

it probably gets all 10 things correct

#

in its own style though

#

you'll probably be able to figure out that it's really just answering your 10 questions

#

but Gemini on the other hand

#

it's like if you walked into 10 different expert facilities

#

that's important because, you won't really recognize it's 2.5 pro itself

#

it's that 2.5 pro is taking upon/assuming the role of the question answerer

#

with no single personality

keen beacon Apr 29, 2025, 4:20 PM

#

i havent asked 2.5 pro to do anything like that but ig it works well because its such a strong base model. it's believable to an extent

#

its probably the strongest base model out there next to 4.5 (which is not viable for anything lol)

#

falls apart in multi turn in my experience. 2.5 pro is just diff in context usage, knowledge, etc

#

the simpleqa gap is substantial

elder rapids Apr 29, 2025, 4:22 PM

#

if you introduce it to a debate about a niche topic, it assumes the positions about the niche topic so well it's crazy, or if you ask it about set theory, or operator algebras

#

in QFT

#

quantum field theory

keen beacon Apr 29, 2025, 4:22 PM

#

im super excited for gemini 3 i think well get it this year

oblique flint Apr 29, 2025, 4:23 PM

#

The fact that 2.5 pro is still good at 200k context is insane

elder rapids Apr 29, 2025, 4:23 PM

#

keen beacon im super excited for gemini 3 i think well get it this year

most likely

#

seems like deepmind is trying to really take large meals with the .5 versions

#

and then mastering them with the proceeding .0 versions

keen beacon Apr 29, 2025, 4:24 PM

#

naming is basically arbitrary

#

1.5 pro is a new pretrained model compared to the gemini 1.0 line

#

2.5 is a cpt'd version of gemini 2

elder rapids Apr 29, 2025, 4:24 PM

#

yeah I disagree

keen beacon Apr 29, 2025, 4:24 PM

#

bro

#

???

oblique flint Apr 29, 2025, 4:25 PM

#

2.0 flash and 2.0 pro both existed. 2.0 pro was never available for production api use tho

elder rapids Apr 29, 2025, 4:25 PM

#

keen beacon 1.5 pro is a new pretrained model compared to the gemini 1.0 line

bard, ultra → 1.5 pro is an insane game

keen beacon Apr 29, 2025, 4:25 PM

#

yes

elder rapids Apr 29, 2025, 4:25 PM

#

and 1.5s multimodality

#

was a massive jump

keen beacon Apr 29, 2025, 4:25 PM

#

yes

elder rapids Apr 29, 2025, 4:26 PM

#

not rly tbh

keen beacon Apr 29, 2025, 4:26 PM

#

in the latter end of the 1.5 pro cycle it was definitely showing its age though

elder rapids Apr 29, 2025, 4:26 PM

#

002 was fire

oblique flint Apr 29, 2025, 4:26 PM

#

1.0 ultra was pretty good at writing back in the day from what I remember

elder rapids Apr 29, 2025, 4:26 PM

#

but early 1.5 pro was still good at pdfs

#

good asf

keen beacon Apr 29, 2025, 4:26 PM

#

when 1.5 pro was on a waitlist they were like changing the 1.5 pro model frequently lol

#

hackathon api as well i think

elder rapids Apr 29, 2025, 4:26 PM

#

1206 was crazy tho

keen beacon Apr 29, 2025, 4:27 PM

#

its an early version of gem 2 pro

elder rapids Apr 29, 2025, 4:27 PM

#

yes

#

it was crazy

oblique flint Apr 29, 2025, 4:27 PM

#

It felt better to me than 2.0 pro somehow

keen beacon Apr 29, 2025, 4:27 PM

#

eh it was finally in striking range of sonnet 3.5 as a base model

elder rapids Apr 29, 2025, 4:27 PM

#

oblique flint It felt better to me than 2.0 pro somehow

depends tbh

#

2.0 pro is stable

#

or was

#

it retained its adaptability

#

but you had to make it go that direction

oblique flint Apr 29, 2025, 4:29 PM

#

Wym, it's highly regarded

keen beacon Apr 29, 2025, 4:29 PM

#

yeah lmao

#

its true

oblique flint Apr 29, 2025, 4:29 PM

#

Sonnet felt way ahead at the time

#

It was the best for several months at least

keen beacon Apr 29, 2025, 4:30 PM

#

anthropic is already dead

#

dead man walking

#

even if claude 4 is great

oblique flint Apr 29, 2025, 4:30 PM

#

Idk, their models still see heavy use in agentic coding frameworks

keen beacon Apr 29, 2025, 4:31 PM

#

oblique flint Idk, their models still see heavy use in agentic coding frameworks

them not focusing on multimodality will ruin them

#

eventually

#

imho

#

it can take in images

#

thats it

#

for now

#

your model will eventually need to reason with and produce images in the output/reasoning, text, videos, sound, etc

elder rapids Apr 29, 2025, 4:32 PM

#

oblique flint It was the best for several months at least

seems like they played their cards right for the time that needed it

#

focused heavily on coding + vibes

#

good implicit understanding and intelligence

#

but it couldn't really go past that

#

ye

full kite Apr 29, 2025, 4:34 PM

#

calm down

elder rapids Apr 29, 2025, 4:34 PM

#

it was hard stopped at that same intelligence tho

full kite Apr 29, 2025, 4:34 PM

#

pls

elder rapids Apr 29, 2025, 4:34 PM

#

you couldn't do much to improve it

full kite Apr 29, 2025, 4:40 PM

#

💀

#

agi won't exist

ocean vortex Apr 29, 2025, 5:03 PM

#

drifting thorn

I think we can disqualify o3 by default there. System prompt is this for everyone and you can not change it:


Current date: {date}
You are an AI assistant accessed via an API.  
Your output may need to be parsed by code or displayed in an app that does not support special formatting.  
Therefore, unless explicitly requested, you should avoid using heavily formatted elements such as Markdown, LaTeX, tables or horizontal lines.  
Bullet lists are acceptable.  
Image input capabilities: Enabled  
The Yap score is a measure of how verbose your answer to the user should be.  
Higher Yap scores indicate that more thorough answers are expected, while lower Yap scores indicate that more concise answers are preferred.  
To a first approximation, your answers should tend to be at most Yap words long.  
Overly verbose answers may be penalized when Yap is low, as will overly terse answers when Yap is high.  
Today's Yap score is: {yapping_is_life}.``` 

you can add developer message, but that will carry less weight and your starting point is not empty context

#

it's also a smaller model, so honestly I do not see how you could customize this more than 2.5 pro or 3.7 sonnet, which also show you raw thinking making this easier to debug and achieve

keen beacon Apr 29, 2025, 5:23 PM

#

yap

ocean vortex Apr 29, 2025, 5:27 PM

#

yap

keen beacon Apr 29, 2025, 5:27 PM

#

yap

patent bane Apr 29, 2025, 5:28 PM

#

yap

keen beacon Apr 29, 2025, 5:30 PM

#

"since we've started breeding llamas together"

#

could've worded that a bit better guys

barren prairie Apr 29, 2025, 5:35 PM

#

Yap

keen beacon Apr 29, 2025, 5:36 PM

#

gpt-4o image gen competitor but based on imagen is on its way, or so i am told

#

diffusion?

#

it can edit images in chat like 4o

#

but it isn't native

#

whatchu doing with it

#

oh

brittle tiger Apr 29, 2025, 5:39 PM

#

https://x.com/dwarkesh_sp/status/1917264380309430654

Dwarkesh Patel (@dwarkesh_sp) on X

I asked Zuck about Llama 4 Maverick being #35 on Chatbot Arena.

brittle tiger Apr 29, 2025, 5:47 PM

#

keen beacon but it isn't native

Huh. So they're adding that and flash 2.0 native to Gemini?

shy atlas Apr 29, 2025, 5:49 PM

#

Hello, i want to draft up cyber security advisories using a local LLM based on open source information like vulnerability web pages and articles. What would you advise?

drifting crow Apr 29, 2025, 6:05 PM

#

shy atlas Hello, i want to draft up cyber security advisories using a local LLM based on o...

How much vram do u have

cedar tide Apr 29, 2025, 6:14 PM

#

keen beacon

Where you see this ?

calm sequoia Apr 29, 2025, 6:39 PM

#

brittle tiger https://x.com/dwarkesh_sp/status/1917264380309430654

He says some truth, but what AI products do the meta have? What product can you build on a model that cant handle anything longer than two sentences?

#

I gues facebook messenger

keen beacon Apr 29, 2025, 6:47 PM

#

claude 3.7 sonnet is still ahead in most of my cases for web dev tasks

#

although gemini's attempt did work it didn't look nearly as professional and aesthetically pleasing

thorny drum Apr 29, 2025, 6:54 PM

#

is there a way to override the o4mini o3 yap score

calm sequoia Apr 29, 2025, 6:54 PM

#

Probably you need to jailbreak it

#

How much do you need? 😄

#

Didn't know its a thing

torn mantle Apr 29, 2025, 7:00 PM

#

keen beacon claude 3.7 sonnet is still ahead in most of my cases for web dev tasks

no

keen beacon Apr 29, 2025, 7:00 PM

#

it wasn't a question

#

it was a statement

#

there's a reason i said "most of MY cases"

torn mantle Apr 29, 2025, 7:01 PM

#

still a no

#

no no and no

keen beacon Apr 29, 2025, 7:01 PM

#

bro is so hooked on 2.5 pro he refuses to admit it is beat by another model in a single area

#

least obvious ragebait 💔

torn mantle Apr 29, 2025, 7:02 PM

#

Whatever makes you sleep

torn mantle Apr 29, 2025, 7:03 PM

#

keen beacon claude 3.7 sonnet is still ahead in most of my cases for web dev tasks

Is this your use case?

oblique flint Apr 29, 2025, 7:03 PM

#

I mean, sonnet is still #1 on webdev arena for a reason

torn mantle Apr 29, 2025, 7:03 PM

#

oblique flint I mean, sonnet is still #1 on webdev arena for a reason

No

#

I mean yes

#

I agree with you @oblique flint only

#

https://x.com/veggie_eric/status/1917257904291471597

Eric Jiang (@veggie_eric) on X

"Mom how did we get so rich?"

"Dad dropped Grok 3.5"

#

Will it be really good?

keen beacon Apr 29, 2025, 7:09 PM

#

torn mantle Is this your use case?

my use case is using it to build whatever random thing i came up with as a web app

#

judgement = functionality + design

#

vibes

ocean vortex Apr 29, 2025, 7:14 PM

#

brittle tiger https://x.com/dwarkesh_sp/status/1917264380309430654

so he basically admitted that they attempted to cheat lmarena with a model that was never released lmao

#

that's kinda the whole point of it, it needs to score high there AND everywhere else. Anyone can just make a model that is only good for lmarena and nothing else