#general
i just got back, did they say what time for qwen?
Sota open source
SOTA premium vs SOTA open source is like adobe products to me. Yeah photoshop is always gonna be the absolute best and have newest most incredible features and worth it if you're a pro photog but I just tell my friends to use gimp or other freeware
like unless using the AI is contributing to your commercial productivity /earnings then why not use the free stuff. but also it's going to contribute to almost everyone's productivity so 🤷 I hope open source becomes fantastic tho
agreed that's basically exactly what I meant lol
like if you're just editing family photos, use freeware or whatever. but if you're using it for true productivity you need premium tools
qwen 3 235b benchmark results apparently leaked, managed to copy the screenshot before the tweet got deleted
and same then for coding with SOTA vs open source
apparently its gone because they messed up some numbers
so idk which are right
humaneval looks low asf compared to the rest so it might be that
o3 got 83.3
oh wow
here's hoping they messed them up in the wrong direction
i presume they just put the wrong number, not they ran the benchmark wrong
who knows, they probably reused an old table and didn't replace it properly
this dude has never made a typo
Wait, where is o3 pro coming out, bro?
there's a permanent commemorative plaque made out of marble and with a steel engraving on it in front of a historic building in my city, and there's a misspelled word on it and punctuation missing in another spot. The ability for people to make clerical errors and no one else catching it baffles the mind
Wow
that is absurd
As in that’s bad?
the aider polyglot score is crazy too
not yet i think
it hasn't been anonymous on arena right?
qwen and deepseek dont do that
thx, hopefully we get it soon
i would guess that the biggest model is going to be closed-weight for a while
like they did w/ 2.5 max
they would probably name it max if so
ah
Cause next week would be a month vs a couple of weeks
Craig run our queries when u get o3 pro please
maverick gets f1ucking destroyed lol
Really? I thought you were OpenAI ride or die?
i need qwen 235b asap 🤣
What u gonna do with it?
generate data
you could probably run qwen 235b pretty well on macs i guess becuz of moe
Hmmm really what kind of specs u need?
u need enough vram to load though
if u can fit its a sota local model fr
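rough numbers for the "enough vram to load" point: with an MoE like Qwen3-235B-A22B only about 22B params are active per token, but all 235B still have to sit in memory. A minimal sketch, assuming 2 / 1 / 0.5 bytes per parameter for bf16 / 8-bit / 4-bit weights and ignoring KV cache, activations and runtime overhead (function and constant names are just illustrative):

```python
# Back-of-the-envelope weight memory for an MoE model.
# Assumption: weights dominate; KV cache and runtime overhead are ignored.

TOTAL_PARAMS = 235e9   # Qwen3-235B-A22B: total parameters
ACTIVE_PARAMS = 22e9   # parameters active per token (the "A22B" part)

# Assumed storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return weight size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def footprint_table() -> dict[str, float]:
    """Memory needed to hold ALL experts, per precision."""
    return {name: weight_gb(TOTAL_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}


if __name__ == "__main__":
    for name, gb in footprint_table().items():
        print(f"{name}: ~{gb:.1f} GB of weights to load")
    # Only the active experts are read per token, which is why MoE is fast once loaded.
    print(f"weights touched per token at bf16: ~{weight_gb(ACTIVE_PARAMS, 2.0):.1f} GB")
```

so under these assumptions even 4-bit weights are over 100 GB before any cache or overhead, which is why "can it fit" is the real question and speed is the smaller one.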
i regret to inform you guys that i was messing around and this is just me doing a little trolling LOL
i hope it can get close to these numbers though
BRUH
And Claude lol
yeah i think it'll get close to o3 but i do doubt it'll beat it, at least not broadly
it'll beat o1 for sure
the max version might beat o3 tho
but i think thats a while away
Which o3?
Wait so qwen not even SOTA?
Yeah
Just cheap?
it'll get close to o3 med and probably beat it in a few things
How u made it? With Gemini?
that is my bet
my good ol' fingers
i think it'll beat o3 on the (non-stylectrl) leaderboard
Wow I haven’t used that in a while
Gotta lock into that
this feels vaguely gay, but i'm not sure why
😇
I was halfway through making a table with those numbers comparing to other benchmarks lmao
LOL
you got us good
i would be annoyed if i were you dw 😔
U blue balled me like these ai companies
anyway im gonna go for a bit someone spam me if they release qwen3 (for real this time)
byebye
wait what
cya :3
I was making a table to compare Leo's fake benchmarks to real benchmarks and when I found out it was fake I deleted my google account
ohhhhh
oh
@keen beacon LEO
clearly they still value it since they give quota out to lmsys lol
didnt see
I get it, but human preference finetuning does not have to come at the cost of actual performance, just look at 2.5 pro and o3
idk google has been the most prolific in terms of using the arena tbh
the users don't
yeah, i think it's still a valuable benchmark
i think lmarena is still the most public benchmark as silly as it is
yeah
Worst is grok because it's racist
yeah i have no idea why they thought that that wouldn't nuke their reputation
imo grok is worse because elon fanboys back it
i mean, llama was (iirc) the first real well-funded open source model release
they were really popular at the beginning, but then everyone else ate their lunch
qwen and deepseek, namely
grok is a weird ass model
uh oh
it's not incredible at anything but it doesn't suck at anything either
had its 10 seconds of fame and it's just meh now
good way to summarize it
and their reasoning model seriously underperformed relative to how strong the base (theoretically) is
ah, any reason in particular?
i've been wondering: is their reasoning model even on the arena?
that's valid
i try to set limits for myself and stick to them
nope they only have grok 3 mini reasoning available as an api and i cant remember if its on the arena
its in the beta iirc
anyway im gonna actually go
goodbye
lol this guy who leaked qwen 3 ggufs got kicked out of the qwen org https://huggingface.co/apepkuss79
lmao. Probably was invited to it for questionable reasons to begin with. I doubt he worked for them
ya there are randoms in the qwen hf org lol
prior to them intentionally giving early access out i think
ETA for Qwen 3 basically any minute now
it's 2am over there right now and they're working their asses off to upload all of the models and quants
epic
what is bro doing
rel
guess which is qwen
2?
its the um
the um
shucks
1 and 2 are both veo 2
yeah there's an insane amount of quants by the looks of it
per each model, i assume they have a repo for ggufs, bnb 4 bit, bf16, and some of them unsloth bnb 4 bit
sounds about right
when do you think is qwen 3 coming to their site?
soon theyre planning to launch a sub soon
well, iirc theyre using wanx2.1 or something its not a qwen video gen model although it comes from alibaba. i think its 14b. veo must be much bigger
on the qwen site it uses wanx2.1 for the time being
Wym
This guy already has access
It isn't released yet
isn't that photoshop
How would you know
He had a private invite
Well its releasing within 24h
i wonder if theyre gonna go to bed first
No
It hallucinated the google graph part and also the death of Nick Land
Its releasing now
We don't know the size of Claude 3 Opus
opus is definitely not 200b lol
What then?
i doubt sonnet 3.5/3.7 is 200b either. i think its 400b
also GPT-5 isn't 1 model, it's a router that delegates the prompt to different models
Sam Altman announced this on X
closer to 1t ig
I am interested about the token limit
seems qwen 30b moe is gonna be a homerun
ppl testing lol
and wishful thinking 🤣
woah
cool asf
theyre probably bankrolling all this lmao
where tf is o3 pro
Qwen3 has some really intriguing features that are not written in model cards. I believe it will open new room for both research and product.
We're excited to announce we’ve launched several improvements to ChatGPT search, and today we’re starting to roll out a better shopping experience.
Search has become one of our most popular & fastest growing features, with over 1 billion web searches just in the past week 🧵
its 4am in china
they've been up for like over 24hrs atp probably
yup
So openbrain is the best model?
Its interesting to see public models are better than ERNIE
llama con tomorrow
Qwen3 has some really intriguing features that are not written in model cards. I believe it will open new room for both research and product.
posted that here already
Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences. Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc.
The latest version, Qwen3, has the following features:
- Dense and Mixture-of-Experts (MoE) models: dense in 0.6B, 1.7B, 4B, 8B, 14B and 32B; MoE in 30B-A3B and 235B-A22B.
- Seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose chat) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement in reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
For more information, please visit our:
Blog
GitHub
Hugging Face
ModelScope
Qwen3 Collection
Join our community by joining our Discord and WeChat group. We are looking forward to seeing you there!
QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...
IT'S OUT
it got it right but wow that's a lot of reasoning tokens used up
🎉
I can't try it rn, someone tell us the vibes
ok cnn
i did not expect it to get this close to g2.5 pro
Can you try it at coding pls?
Available now on
https://chat.qwen.ai/
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
can i say something without getting people riled up
is it 3.1 or not? and we should also compare the instruct models
sure, but it probably will upset some people, just ignore them. I hope it's about the fact that (as of checking a few mins ago) only 1 of the Qwen3 sizes is released
Yea
nvm
they are all out now
qwen 3 is 💩 , where is o3 pro
yeah it was just hf
You tried it?
its def not better than o3 .. so whats the point
it seems ok, will need to do more testing
Nah o3 is a league on its own tbh
it's not lol
What's the point to compare it with O3 ..free to paid ..not logic
o3 marginal cost is zero
no
no
no
But why they are putting 1526263 models ... They must regulate this thing
lmao it's excruciatingly slow rn because it's being hammered on qwen chat
I want a single good model like gemini and deepSeek not 1627373 models
it is not o3 and is still slightly behind gemini 2.5 pro. but it is still highly impressive
what
people have "tested" it for 5 mins and already coming to conclusions lmao, maybe do some more testing
i don't think you get how this works
even when qwen 4 is released, o3 >> qwen 4, thats how bad qwen 3 is
how are those math scores possible for 30b MoE wtf lol
Just look here
it destroys gpt4.1
There are 172738 models on the app and now you don't know what to use

not even the biggest model
use qwen chat... and just select the first model in the dropdown
bro is writing essays in its traces for a simple logic problem bruh
yes it is a lot less efficient with reasoning than o3 or gemini
if there is one thing i've noticed it's that
ok at least it got the answer right haha
low ass bar
this what I meant :
Who wants to dedicate themselves to adding o4 mini, grok 3 mini think and gemini 2.5 flash to the benchmark?
lmao it murders them
if they added 4.1 in there it wouldn't look much better for OpenAI
Are you friends of Qwen?
Lol
creates a 99% copy of the discord front end, in a single html file, (without the backend)
ok im a bit more excited about r2
oh but they tested with thinking enabled have they not
Where is the pricing for qwen3?
this somewhat explains it then
it being this good on math relative to others
this is what meta should have done. Instead they went behemoth mode lmao
what about me
lol i just visit qwen to test it on its release day and pack it up, will be back next release !
it is obviously not. But at the same time there's gonna be no one to test if they updated the model in any way... Could just donate money to OpenAI instead
10 prompts would be like what... $150?
holy moly this is the slowest streaming i've ever seen
it's been thinking for like 10 minutes and it's not even streamed many reasoning tokens
all reasoning models have very good scores in math, R1 distill 14b has 70 on aime 2024 too
And grok 3 mini has 90, costing only 0.3 input and 0.5 output
Any unofficial benchmarks up yet?
yeah they are a bit sneaky comparing it directly against non-reasoning models... But then again they don't have much to compare against otherwise
R1 distill
Reka flash 3
Grok 3 mini
o4 mini
Gemini 2.5 flash
distills is not their market for this, grok3-mini and o4-mini are not open-source, but yes... those would be more suitable than old version of gpt4o for sure
Vibes test
Its good
Nothing crazy
Not that great at coding
coffee + brazilian fonk >> adderall
Multilingual claims = still far behind competitors
yeah I'm not getting wow'ed by their biggest model. But it needs to be viewed in the proper context. It's competing with R1 not O3 or 2.5 pro
I would place it beneath grok3 base model
Yea
@Alibaba_Qwen Looks like a good model, but I’m disappointed to see a comparison to o1 rather than o3 in that table.
lol
facts...
meth makes u dumb
Qwen 3 235b non thinking vs deepseek v3.1
AIME 2024 : 40 / 52
GPQA : 63 / 66
just imagine r2 ...
When is it releasing
this year idk
R2 July
impossible to compete with deepseek
based off polymarket lol
i still remember gpt 5 was predicted to be released october 2024.
deepseek thinking I'm a bigger fan of too. Qwen seems to hang up on almost irrelevant details and flooding the context with it lol
I have a ton of money on yes
I would say better tbh
Dont forget that its smaller than r1
It got 3/10 of the simplebench questions ive checked so far, using 235B with thinking
Will r2 be based on 3.1 or a new base?
How much for r1?
it probably is still better than r1 overall though, all that aside and just looking at the final responses
It was only 62%
just not everywhere, 'overall'
R2 prediction is 86%
Literally free money
official result is passing through 200 questions @ pass5
"only" haha
To same that qwen
for a "no"
Yea i think its decent overall
yeah, but will wait for a verified score from Simple Bench, but probably in the ballpark of grok 3 tbh
guys gpt 5 is releasing september 2024
How can OpenAI be mad at all when they don't bench against other company models in their latest release lmao
Lol
- its an open source model
I just dont understand that comment tbh
Jealous for what?
bro got threatened by china already lol
met expectations, my expectations: 💩 as usual
I think it met expectations
The focus seems to be fixing qwen2.5 issues and add a hybrid reasoning feature
Better multilingual support + better multi-turn convo
Its not bad for its size
Yeah, met expectations (I wasn't expecting 2.5 pro level)
Same
still behind oai like > 6months
We will see end of the year
I would love Qwen to catch up more
pfft, not being pessimistic but, i think 6 months still
oai already working on o4, top 50 codeforces
o3 is at top 250
ye
Ok qwen-235B-A22B with thinking, simple bench - 3/20 , so not the best result in the world.
So 15%
Essentially... but will wait to see, maybe I had bad variance for my pass@1
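on the variance point: 3 out of 20 is a small sample, so the 15% has a wide error bar. A minimal sketch using the standard Wilson score interval, assuming each question is an independent pass/fail trial and a 95% level (z = 1.96); the function name is just illustrative:

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passed` out of `total` questions."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


if __name__ == "__main__":
    low, high = wilson_interval(3, 20)
    print(f"3/20 = {3 / 20:.0%}, 95% interval roughly {low:.0%} to {high:.0%}")
```

for 3/20 that comes out to roughly 5% to 36%, so a single pass@1 run over 20 questions can't really separate this from other models in the same range.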
What about qwen with web search
Bro what
Qwen is good but not that thing ..I didn't hype so much for it ...but so good as an open source model 😅
Simple bench will never be solved
How come their actual released model isn't performing equivalent to a top 200 programmer then?
Since it says o3 Jan is 175th rank
distillation
It's just their claims and "trust us bro"
i mean the "us" is oai themselves, so pretty reputable lol
I disagree. As they have a history of overhyping. They do get some nice performance from their models don't get me wrong.
Hill climbing on competitive programming problems doesn't make a model a superhuman coder. Those problems are self-contained.
I basically consider Google and anthropic to be pretty reliable in their claims, but I see much less reliability coming from openAI and grok imo (inb4 I agitate some grok bros)
ya i agree codeforce maxxing is not applicable to real world, unless u work in hft, but its pretty good benchmark which is about to be saturated
Anthropic likes to claim they'll have AGI in like a year tho
To attract funding
Dario specifically you mean?
I'm assuming you're referring to his prediction of AGI by 2026-2027
Yeah I mean they're trying to attract funding so it's w/e
Ehh, to me they aren't anywhere near as shameless as openAI with it
if i was an anthropic shareholder, id be shaking tbh
That's true. OAI trying to appear like they already have AGI cooking in the lab and are just waiting until it's perfect to unleash
Yeah Anthropic is in a tough spot
Coding was their main selling point
That won't last through the year
I don't have much of an opinion on deepseek
Objectives and key results. It's how all silicon valley companies track goals
If they haven't achieved that by the end of the year, it will carry over until they do
Dollar general SOTA model when
adderall still potent
they are like so behind
kinda facts
i have no clue what is that
Why?
They were 3 years ahead when they started
And debatably even behind
u taking that good good arent u
Good good is being two years ahead
tbh its like 6 months still
I’m betting every dollar I have left that the simple bench won’t be solved this year
I'm not even disagreeing they're ahead, but they're barely ahead of 2.5 pro at a much higher price
ur comparing o3 vs. gemini 2.5 pro? its like not apples to apples
Lemme guess you wanna compare to the o1 million dollar arc test
I'm not speculating.
nope, just the publicly release o3..
why are u comparing with o1 lol
that got released like 5 months ago
Dog, I asked you if we can't compare 2.5 to o3, then what should we compare it to?
huh, when did i say that
i just said they are not apples to apples, meaning o3 is ahead
One thing I will say is that public version numbers are mostly branding and don't always consistently reflect the underlying process used to train the models
That's not what apples to apples means
An apples to apples comparison allows for something being far superior
ok mb
We think o4-mini and o3 are different post-trainings of the same underlying pre-trained model
Similarly 2.5 and 2.0 are part of same generation
It does matter because pre-training takes ages, so it has implications for the timelines of about 6 months
If you put “it’s a trick question” the top models ace simplebench
Not a single question wrong
Yes try it
I just did it with o4 mini high
Yep
Try it if you don’t believe it
Grok 3 no thinking failed first question
4.5 gets it
Question 3
Jeff, Jo and Jim are in a 200m men's race, starting from the same position. When the race starts, Jeff 63, slowly counts from -10 to 10 (but forgets a number) before staggering over the 200m finish line, Jo, 69, hurriedly diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below, before racing to finish the 200m, while exhausted Jim, 80, gets through reading a long tweet, waving to a fan and thinking about his dinner before walking over the 200m finish line. Who likely finished last?
this question o3, o1-pro didn't get it, but o4-mini-high got it
Qwen 3 seems to hallucinate quite a lot
Yea and it gets many things wrong as well
Yea even maverick may be better than qwen 3
Oh you guys are talking about Qwen 3 too
Sorry but its nowhere near gemini 2.5 or o3
Not expecting to see Qwen 3 surpassing Gemini 2.5 or o3
But I expected it to surpass R1
Doesn't seem to me so far
So many wrong outputs
Could it be bugs? Since that happens a lot during releases
… At least it provides researchers with new models…
"even"?
maverick is larger than all qwen models
The AI researchers can now move on to Qwen 3 from Qwen 2.5
just wait for behemoth
Qwen 3 needs a re-evaluation
claude pro sucks
Just coding
Well the most nerfed models are anthropic ones
i agree, thats why i only use it to apply the git diffs
still potent i see
and fast
Alibaba still has a long way to go
i bought alibaba back in 2017, im still breakeven
on what bench
oh right, the addy vibes
huh where is o3 in webdev
no wonder sonnet is #1
it would be close tho
like above 4.1
bru
is o3 in webdev tho
its cheaper than o1 and o1 was on it
true
How often does the leaderboard on https://lmarena.ai/ update?
every week or so
o4 mini < maverick, oh hell naww
o4 mini is obeying the system prompt better
why are you encouraging models that don't obey the system prompt
which mode? If it hallucinates in non-thinking mode, I won't be that surprised since it combines thinking and non-thinking stuff at once, and very likely those hallucinations are because of its thinking training. (Just like how humans think, yeah, hallucinate a lot)
huh the right is maverick not o four mini
yes, and maverick didn't obey the system prompt
ahh i seee
the problem with this, is that people are still gonna vote for maverick
u inflating the bad models, great
yeah I tried and unsurprisingly o4mini-high doesn't ace all 10/10 of the public simplebench questions by adding "it's a trick question"
so thats how they made o3
o4 full coming in this summer
OK BUD
take that bet to polymarket, dont need more money
addy is hitting
i still dont get how gpt5 is gonna work
like if i wanna use o3 pro
so my prompt would be "please use the biggest brain power in the world to answer" like what?
ideally if your prompt is hard to get right it would just know to think hard
thats a nightmare if u think about it
u want it to use o3 pro, but always give o4 mini
Hi guys ! i'm new here, is anyone online ?
quite impressed by qwen3 and wanted to discuss of it
I made a bet of 20$ on qwen3 best model before april 30, probably the worst I'll ever do haha
but 6k if I win lmao
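for scale, a $20 stake that pays $6k implies the market priced this at well under 1%. A minimal sketch of that arithmetic, assuming the 6k is the total payout (stake included) and ignoring fees; the function names are just illustrative:

```python
def implied_probability(stake: float, payout: float) -> float:
    """Break-even win probability for a bet that returns `payout` in total."""
    if stake <= 0 or payout <= stake:
        raise ValueError("need 0 < stake < payout")
    return stake / payout


def expected_profit(stake: float, payout: float, win_prob: float) -> float:
    """Average profit per bet if the true win probability is `win_prob`."""
    return win_prob * (payout - stake) - (1 - win_prob) * stake


if __name__ == "__main__":
    p = implied_probability(20, 6000)
    print(f"implied probability: {p:.2%}")
    # Assumed 'true' probability of 1%, purely for illustration.
    print(f"expected profit at a 1% true chance: ${expected_profit(20, 6000, 0.01):.2f}")
```

so the bet only has positive expected value if the real chance is above about 0.33%, which is the "pure gamble" part.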
bruh
llol
The model won't be on leaderboard when the market close lmao
but i'll pay my studies ez if it's win 👌
Qwen 3 on lmarena?
any external benchmark?
Can you share
Pure gamble without any data backing up
These are official
true
hi everyone, sorry for dumb question, i guess people ask it all the time, but right now, what's the best model for STEM? (not necessarily for advanced coding, more for solving complex STEM problems, including math)
I mostly use gemini 2.5 pro because of the long context support
i'm a pure math major
but o3 and o4-mini give clearer response when you need a summary like of a course
no idea, i found this on reddit
yeah okay, im wondering which one is the best between 2.5 pro and o3 right now, well i guess it depends on the demand, and that 2.5 will be better than o3 for some specific things and vice versa
yeah, i feel gemini 2.5 pro is better for complex tasks while o3 is more human friendly
but like when I need a full summary of a 200 pages course, I just give the full pdf to gemini, ask for it and get a perfect clear LaTeX document first try
not with o3
yeah i agree the context with Gemini is really impressive and super useful
yep
which major ?
WE WANT O3 PRO
oh hes sober now
you know ur gonna get a passing test when traces look like this
If you go on the Deepseek subreddit some still say R1 is better than 2.5 and o3
Humans are tribal to a fault
Thats the case
It was probably their first LLM they really used when it went viral, and never looked further
What about the people that only knows 4o and have no clue about any other models
Grok three point five isnt going to beat o three
I think they screwed up something on Qwen 3 training
Is it though? It doesn't seem like they are. The simpleqa scores and based on the pretrained knowledge past Oct 2023. Specific confabulations about stuff that 4.1 makes and 4.1 mini has no idea about. It seems o4 mini is based on 4.1 mini/ is not on the 4.1 base at the very least
The pretrained model might be excellent the post training mightve been rushed
I haven't had time to examine it yet I just woke up lol
I feel the same about some grok users.
I am kinda same but from the opposite side. I am rooting for everyone to one up each other every week except for Meta. I want llama models to burn in hell.
welp
my sleep schedule is fked up
so i tried it
it does a poor job at recalling
damn lol i was too tired and slept
i think even if the instruct versions arent that good the pretrained base models might be excellent
makes a good base for fine tuning
qwen pretraining is 👌
you would expect high quality info/data since they scraped a lot of contents from pdf files
but im really not noticing any difference
it seems the pretrained base models act a class above their qwen 2.5 variants at least, in base model form. the instruct version they post trained and release mightve been rushed
yea could be
oh
i don't know what hparams they threw at qwen models in post training but the qwen3 models have some absolutely deepfried world knowledge. the basic factual recall of the qwen instructs is by far the worst of any frontier model by a big margin. the models are otherwise interesting
so it wasn't just me
https://fixupx.com/theo/status/1916995252629737491
It needs desperately more context
if it keeps overthinking
Qwen3 maintains the Qwen trend of massively overthinking tasks, generating thousands of thinking tokens and running out of context before answering.
Next week, Grok 3.5 early beta release to SuperGrok subscribers only.
It is the first AI that can, for example, accurately answer technical questions about rocket engines or electrochemistry.
@Grok is reasoning from first principles and coming up with answers that simply don’t
why are we shilling pointless models, o3 pro is the only model to set eyes on
Qwen3 is whatever. I also didn't expect much from it but grok 3.5 could be interesting (low confidence)
Openai hype has blue balled me too many times. I won't get hyped till they prove it next time
I'm only hype for whatever big drop Google does next
Google I/o is coming. They will drop some shiny features but for models themselves, I think it will take them few months to drop something crazy
this guy will hype anything
so next month we will have :
- google coding models
- deepseek r2
- grok 3.5
Add o3 pro
U GOT ME HARD
doesn't make any sense imo
using Qwen3-235B-A22B on official site with thinking enabled - i can barely tell it apart from something like qwq
but their evals have it outperforming gem-2.5-pro on some benchmarks? seems wild.. no idea how that works ha
i think their post training wasnt fleshed out. they focused on specific stuff similar to grok 3 mini reasoning i think
but the evals are for the same (post trained) model i've been using, no?
yea
post training i mean instruct process/reasoning training/etc on top of the pretrained base model
ah k yeah i mean it feels a bit undercooked for sure
like V3 gets some questions which this model doesn't
like look at this page, for qwen 2.5 they released a comprehensive page of instruct and base benchmarks: https://qwenlm.github.io/blog/qwen2.5-llm/
we only got this for qwen 3: https://qwenlm.github.io/blog/qwen3/
GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD
Introduction In this blog, we delve into the details of our latest Qwen2.5 series language models. We have developed a range of decoder-only dense models, with seven of them open-sourced, spanning from 0.5B to 72B parameters. Our research indicates a significant interest among users in models within th...
it seems they rushed everything lol
might be good for fine tunes though
if it is really the post training, the pre trained base might be very good
oh don't get me wrong - for its size and being open source (plus multi modal.. i think?) i mean there's a lot to work with / build on
and it's for sure solid
just not up there with the likes of 2.5-pro
based on my limited playing around with it (and subjective preferences.. needless to say ha)
totally
models that arent dumb af can be run locally
yeah huge step
i am actually excited - but also appreciate the shutup option lol
haha
o1 pro is the only model i have ever seen get this right
Which four countries, when listed alphabetically by their English short-form names, are the first to have flags containing more than five stars and featuring the colours red, yellow, and green?
benchmaxxing
those benchmarks dont reflect anything
anyone can finetune their model on benchmarks
it just doesnt feel like a smart model
i mean it's good for sure, but yeah not great like 2.5pro great
i think its a bit above qwen max 2.5
i doubt they trained directly on the benchmarks but they definitely did targeted training and didnt fully flesh it out. even then, i doubt it would compete with 2.5 pro. i think the benchmarks they released generally show that its quite a bit worse than 2.5 pro though
grok 3.5 should be interesting
qwen max is larger probably more close to r1 size
kinda. i mean it beats 2.5pro in 4/5 of them, marginally; but yeah in the other 5, 2.5pro does better handily
ArenaHard is like an odd bench to lead with (it's not alphabetical) - i kinda felt like that project was dormant ha
bfcl makes sense (function calling) i think it was more natively trained in (function calling) compared to 2.5 pro. the codeforces/livecodebench seems like targeted training towards competitive coding
i think theres a new version out i believe
yeah arena hard v2
ah i see cheers didn't realise that
lol what a cordial feud
anyway so yeah i think it can be said that handling multiple questions in a single prompt is not qwen3-235b(thinking)'s strong suit
this is like 8 questions
curious how much thatll change if u do it 1 by 1
yeah i suspect quite a bit
like sonn-3.7 thinking would do better than vanilla 3.7 one by one too
but yeah. they're all given the same prompts / quiz
And maybe gemini 2.5 ultra 🧐
so it's a level playing field of sorts (but just a highly flawed / lazy way of quickly gathering data on models for comprehension / reasoning)
no, but i will
Release of Llama reasoning and behemoth today ?
the qwen 3 30b moe is also interesting
meant to share this
And tomorrow nova premier
kinda sad theres no o1 pro
question: in beta.lmarena seems that there are no cloaked models (for initial evaluation). Is that only a coincidence for my usage? (I mean, I think that testing already listed ones further is only good)
There simply may not be any anonymous models currently under evaluation
The gemini doesn't seem to have this problem.
I've seen bad reward hacking behaviors in o3 and sonnet too
qwen 3, 235b, 30b Moe and 32b dense are already on the arena (think mode)
Does the anonify-lt-ev3-2 exist in the arena?
so how come you have chatgpt o3-medium scoring higher than o3-high in there?
or was it just single attempt no regen for chatgpt in the earlier screen?
My big problem with models that have a Think mode and a mode without Think is that all the benchmarks that we will find on the models are with Think, so if we want to use them without Think like the vast majority of people we do not know if they are better than the competitors.
Check base model benchmarks if they're available
#qwen 3
#gemini 2.5
#nemotron
For qwen 3 they released base model benchmarks for qwen 3 235b
Introduction Today, we are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to...
Yes, but no other model has benchmarks on the base models
Ya. I look at the simpleqa score these days
For other models
Sort of an approximation of how strong the base model is
lm arena, artificial analysis, the "aider" bench etc. should also include the non-reasoning mode of each model
When qwen 3 multimodal ?
Qwen 3.5 in a few months probably
I bet grok 3.5 will only be grok 3 with a stable think mode (finished training)
a bit like Gemini 2.5
No Gemini 2.5 was more than that
not much more than that
? The cut off is different compared to Gemini 2. Means differing pretraining/cpt
They continued the training of Gemini 2.0 exp
Continued pretraining
It's not just reasoning training
@keen beacon What I mean is that ultimately it's pretty much the same model, for example there is no big difference between Gemini 2.0 Flash and 2.5 Flash without Think
They did cpt on 2.5 flash too I think. At least according to the claimed cut off. I don't use the model enough to verify
Yes but not much
hi back
I feel something strange guys
I love google, i even have a lot of their stocks in part because I trust them in AI.
But
I mostly use chatgpt because it seems so much clearer
like gemini throws out big chunk of text that I have to read to understand
ChatGPT puts separators, bold titles, emoji...
I feel like because of that chatgpt has a big edge for 95% of users
and like for us power users we are less impacted by that but imagine your grandma or your old nephew
I personally don't find Gemini too verbose
that's just how it was 🤷♂️ but yeah it's counter-intuitive i know ha
or perhaps more just the small sample..i re-ran the same question set against o3 med and o3 high a few times - adding those scores, and o3 (high) just nudges o3 (medium)
kinda wild that an early version of o3 or o4-mini that @keen beacon had access to for red teaming got a perfect score
You only need to tell Gemini what you like in the instructions and it follows what you want... mine has bold titles, emojis and all
a different set of questions (given over 2 prompts). on median scores o3 high does better, but individually, o3 med has one very solid run. and curiously o1 (high) does best of all.. but small samples
yeah but that is something for us power users, not for everyone
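A quick sketch of the median vs single-run point above. The scores here are invented placeholders, not the actual quiz results, and the run counts are an assumption too. It only shows how one strong run can coexist with a lower median.

```python
from statistics import median

# Placeholder scores, invented for illustration only; NOT the real quiz results.
runs = {
    "o3 (medium)": [14, 11, 12],
    "o3 (high)": [13, 13, 12],
}

# Median over repeated runs of the same question set.
medians = {model: median(scores) for model, scores in runs.items()}

# Best individual run per model.
best_single = {model: max(scores) for model, scores in runs.items()}

for model in runs:
    print(model, "median:", medians[model], "best run:", best_single[model])
```

With only a handful of runs per model, either ordering can flip on a rerun, which is the "small samples" caveat.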
It was o3 iirc
O4 mini doesn't have knowledge about specific cut off probing questions
Dragon tail is very high given that it's a coding model. Are your questions specialized for coding?
ikr - and literally not a single coding question!
here's the link to the responses that prerelease o3 gave which were all correct #general message
and here's [another](#general message)
gives you an idea of the kind of questions
how do i use gemini 2.5 pro to read a codebase to fix a bug?
Ah yes I remember, you even have the Hanning window question of mine.
It's a shame the dragon tail is not available anymore, would be fun to play with it.
yup!!
I believe it was an early access version of o3
yeah
Was it nerfed on release or some time after the release?
Maybe prompt has its influence
Maybe the prompt was the friends we made along the way
Ah yes, definitely. Unaligned yap score is the problem.
Anybody seen any improvements in 4o updates that sam mentioned couple of days ago?
iirc they adjusted the system prompt since then
They are just changing system prompt over and over and over
I'm going to look at the prompts in the #share-prompts
And I'd like to ask if the system prompt is more important than post-training (RLHF)?
R2 releasing or not is the question
What's this 👀 The OG GPT o1 would have never said "backward-forward magic"
Did you experience frequent usage of goto in lua?
For me, the o1 was autistic geek without any emotions, just pure information
Well it was said early may
that interpretation of your question is unlikely
everybody knows that r2 will eventually release, so the most logical interpretation of your question is an implicit "today"
Unless the source is obfuscated by the AI itself
New models: qwen3-235b-a22b, hunyuan-turbos-20250416, qwen3-32b, qwen3-30b-a3b
Its better
Gemini is currently lacking in coding, there are unreleased models which perform better
Benchmarks said otherwise so did the user feedback
Especially debugging code with Gemini isn’t great
I am using a mix of all of them, I do believe Claude is better in that case
Humanity's Last Exam contains a lot of math
They discontinued their sole coding benchmarks
they call scout 17b but its not actually 17b lol. (that's active params, not total, its misleading)
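For the active vs total params thing, a small sketch using the qwen3 names listed above. It assumes the `-<N>b` part of the name is total params in billions and an `-a<N>b` suffix is active params per token; `parse_param_counts` is a made-up helper name, not part of any library.

```python
import re

def parse_param_counts(model_name: str):
    """Return (total_b, active_b) in billions, or None if no size is in the name."""
    # Total size: a "-<number>b" segment, e.g. "-235b".
    total = re.search(r"-(\d+(?:\.\d+)?)b(?:-|$)", model_name)
    # Active size: an "-a<number>b" segment, e.g. "-a22b" (assumed MoE convention).
    active = re.search(r"-a(\d+(?:\.\d+)?)b(?:-|$)", model_name)
    if total is None:
        return None
    total_b = float(total.group(1))
    # Dense models have no "-a" suffix, so every parameter is active.
    active_b = float(active.group(1)) if active else total_b
    return total_b, active_b

for name in ["qwen3-235b-a22b", "qwen3-30b-a3b", "qwen3-32b"]:
    print(name, parse_param_counts(name))
```

So a headline number can mean either figure depending on who is naming the model, which is why quoting only one of them reads as misleading.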
ahahaha
2T reasoning model must be interesting
lmao imagine 2 times the price of the o1-pro
Already shared it
yes i know 🙄🙄
Then press Skibidi
easily 2.5 pro
hmm i just realized the base model weights of qwen 235b and dense 32b werent released
strange
idk i just noticed it
they only released the post trained versions of both of them
i dont think it works on any model
it only works on stuff that supports it i think
"which LLM has the most performance gain"
grok 3 doesn't get it
it never does
4o doesn't get it
it never does
it's not that good lmao
overreaches every time, it's like when you ask it to prove something it substitutes verbosity for substance
its like the early Gemini models
hell no
just ask it not to be lmao
it's literally that simple
ya
grok struggles with that
i like 2.5 pro's default style
then ask it to understand that dynamic
and implement it
easy
theres no other model that can do that
it's just insane
probably the exact reason I said
idk i personally still main 2.5 pro
lmsys is intensive/relies on complex understanding, which is human conversation
nah lmsys is mostly single turn interactions
yep
that's exactly what I said tho lmao
this is inherently more intensive than tasks like math
ur not really conversing with the model much in a single turn
coding
doesn't matter
any inference is more intensive than pre established knowledge
and that's what lmsys accomplishes
nah
it doesn't matter
it doesn't because that would sidestep what I'm saying completely
if I'm not talking about context tuning
then I'm talking about one off prompt intensity
yep
adaptation to that intensity
usually is
let me explain
if I ask o3 to answer 10 things
it probably gets all 10 things correct
in its own style though
you'll probably be able to figure out that it's really just answering your 10 questions
but Gemini on the other hand
it's like if you walked into 10 different expert facilities
that's important because, you won't really recognize it's 2.5 pro itself
it's that 2.5 pro is taking upon/assuming the role of the question answerer
with no single personality
i havent asked 2.5 pro to do anything like that but ig it works well because its such a strong base model. it's believable to an extent
its probably the strongest base model out there next to 4.5 (which is not viable for anything lol)
falls apart in multi turn in my experience. 2.5 pro is just diff in context usage, knowledge, etc
the simpleqa gap is substantial
if you introduce it to a debate about a niche topic, it assumes the positions about the niche topic so well it's crazy, or if you ask it about set theory, or operator algebras
in QFT
quantum field theory
im super excited for gemini 3 i think we'll get it this year
The fact that 2.5 pro is still good at 200k context is insane
most likely
seems like deepmind is trying to really take large meals with the .5 versions
and then mastering them with the subsequent .0 versions
naming is basically arbitrary
1.5 pro is a new pretrained model compared to the gemini 1.0 line
2.5 is a cpt'd version of gemini 2
yeah I disagree
2.0 flash and 2.0 pro both existed. 2.0 pro was never available for production api use tho
bard, ultra → 1.5 pro is an insane game
yes
yes
not rly tbh
in the latter end of the 1.5 pro cycle it was definitely showing its age though
002 was fire
1.0 ultra was pretty good at writing back in the day from what I remember
when 1.5 pro was on a waitlist they were like changing the 1.5 pro model frequently lol
hackathon api as well i think
1206 was crazy tho
its an early version of gem 2 pro
It felt better to me than 2.0 pro somehow
eh it was finally in striking range of sonnet 3.5 as a base model
depends tbh
2.0 pro is stable
or was
it retained its adaptability
but you had to make it go that direction
Wym, it's highly regarded
Idk, their models still see heavy use in agentic coding frameworks
them not focusing on multimodality will ruin them
eventually
imho
it can take in images
thats it
for now
your model will eventually need to reason with and produce images in the output/reasoning, text, videos, sound, etc
seems like they played their cards right for the time that needed it
focused heavily on coding + vibes
good implicit understanding and intelligence
but it couldn't really go past that
ye
calm down
it was hard stopped at that same intelligence tho
pls
you couldn't do much to improve it
I think we can disqualify o3 by default there. System prompt is this for everyone and you can not change it:
```
Current date: {date}
You are an AI assistant accessed via an API.
Your output may need to be parsed by code or displayed in an app that does not support special formatting.
Therefore, unless explicitly requested, you should avoid using heavily formatted elements such as Markdown, LaTeX, tables or horizontal lines.
Bullet lists are acceptable.
Image input capabilities: Enabled
The Yap score is a measure of how verbose your answer to the user should be.
Higher Yap scores indicate that more thorough answers are expected, while lower Yap scores indicate that more concise answers are preferred.
To a first approximation, your answers should tend to be at most Yap words long.
Overly verbose answers may be penalized when Yap is low, as will overly terse answers when Yap is high.
Today's Yap score is: {yapping_is_life}.
```
you can add a developer message, but that will carry less weight and your starting point is not an empty context
it's also a smaller model, so honestly I do not see how you could customize this more than 2.5 pro or 3.7 sonnet, which also show you raw thinking making this easier to debug and achieve
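A rough sketch of how the two placeholders in that quoted prompt could be filled in, plus the "at most Yap words long" rule as a word count. The date format, the example score, and both function names are assumptions for illustration; the chat does not show how the real values are set.

```python
from datetime import date

# Only the two templated lines of the quoted prompt are reproduced here.
TEMPLATE = (
    "Current date: {date}\n"
    "Today's Yap score is: {yapping_is_life}."
)

def render_prompt(today: date, yap_score: int) -> str:
    # ISO date format is an assumption; the real format is not shown in the chat.
    return TEMPLATE.format(date=today.isoformat(), yapping_is_life=yap_score)

def within_yap_budget(answer: str, yap_score: int) -> bool:
    # "at most Yap words long", counting whitespace-separated words.
    return len(answer.split()) <= yap_score

prompt = render_prompt(date(2025, 4, 29), 512)  # hypothetical date and score
print(prompt)
print(within_yap_budget("short answer", 512))
```

Since the score sits in the system prompt, a developer message asking for longer output would be competing with it rather than replacing it.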
yap
yap
yap
yap
"since we've started breeding llamas together"
could've worded that a bit better guys
Yap
gpt-4o image gen competitor but based on imagen is on its way, or so i am told
diffusion?
it can edit images in chat like 4o
but it isn't native
whatchu doing with it
oh
Huh. So they're adding that and flash 2.0 native to Gemini?
Hello, i want to draft up cyber security advisories using a local LLM based on open source information like vulnerability web pages and articles. What would you advise?
How much vram do u have
Where did you see this?
He says some truth, but what AI products does Meta have? What product can you build on a model that can't handle anything longer than two sentences?
I guess facebook messenger
claude 3.7 sonnet is still ahead in most of my cases for web dev tasks
although gemini's attempt did work it didn't look nearly as professional and aesthetically pleasing
is there a way to override the o4-mini / o3 yap score
Probably you need to jailbreak it
How much do you need? 😄
Didn't know its a thing
it wasn't a question
it was a statement
there's a reason i said "most of MY cases"
bro is so hooked on 2.5 pro he refuses to admit it is beat by another model in a single area
least obvious ragebait 💔
Whatever makes you sleep
Is this your use case?
I mean, sonnet is still #1 on webdev arena for a reason
No
I mean yes
I agree with you @oblique flint only
Will it be really good?
my use case is using it to build whatever random thing i came up with as a web app
judgement = functionality + design
- vibes
so he basically admitted that they attempted to cheat lmarena with a model that was never released lmao
that's kinda the whole point of it, it needs to score high there AND everywhere else. Anyone can just make a model that is only good for lmarena and nothing else