#general
xdd
day 13 of no o3 pro
day 1 of not caring about grok 3.5
big brain is 404
im retired
not interested in introducing virus to my system
ok buddy
agree
Is it normal that the Q3_K_M quant is bigger than the Q3_K_XL quant? it's unsloth's
wym virus?
malware
seems like i cant get enough of o3
this thing is so crazy
O3 pro only for rich p2w users
thank you o3
life without free sota models access method
actually r2 may have some interesting outputs
hopefully its not like qwen where they only focus on fixing old bugs
huh what is this arrow.. doesn't trigger anything
Been away, how's new qwen performance?
Any sota vibe in it?
Or just good enough for a Chinese model?
it isnt sota but its solid
probably not
running locally i guess
yes the media is forcing o3 pro lesssgoo
we dont care anymore about anthropic
like seriously their models are unusable with the rate limits and they nerfed it
writes short/concise outputs all the time = so lazy
seems like you werent using it since day 1
apparently not
3.7 is literally unusable
it also seems fine to me 🤷
that goes for claude 2, claude instant, claude 3 opus, claude 3.5 sonnet, claude 3.7 sonnet
i use a lot of claude (still)
yeah i've been interested in them since the first ever claude
when the only way you could access their models publicly was with a crappy slack bot
claude 1 and claude instant were great creatively tbh
they just lost coherency over long context 😔
day 1 :
first company to introduce 100k context windows i think
months later :
it could be: system prompt changes/randomness
same prompt
if you're serious about getting the most out of these models you wouldn't use them over their consumer UIs
use the api dawg
are you calling me schizo?

there have been so many instances in the ai industry of people claiming degradation when there have been no changes and it's like one giant placebo
no, he was talking about people who still complain claude is being nerfed on the api
im jk
but in all seriousness i do think they played with the system prompt ( default one ) to output shorter results
guys o3 marginal cost is zero
o3 this o3 that
you probably also dream of o3 pro
enough
use gemini 2.5 pro a bit
or qwen 3
Did Qwen3 live up to what you expected?
11
18
2
Met expectations
no pro is more like you having 10 clones of yourself trying to do the same thing individually and then all of those are merged into 1 solution lol
Interesting. The only way this makes sense in my head is that they are disguising upcoming o3-pro like that. They used to do this with alpha releases (like gpt4-all-tools) where select users that normally use lower tier models (free gpt3.5) got access to upcoming best models/features.
Probably for data collection among users not terribly familiar with the models related the most to what they are releasing
nvmd im idiot. i have pro plan
it isnt better
Ok ty
its good if u want something to run locally
How do you run 700 billions model
Ok how much
idk there are a lot, just look it up
😡
That not an answer
well the o1 pro right now is still as 💩 , so dont think theyre disguising it as o3 pro
thank u o1 pro...
ok... thanks for the info. keeps using o3
omg this o3-pro on api is insane. hope you get to experience it someday @small haven
ok buddy show me ur ass on oai playground, u wont
o3 keeps passing these tests, its like orgasmic
this button looks new in dr
screenshot or troll
screenshare please
Grok 3.5 will win lmarena in May
life is good with operator
what do you think will the biggest model of qwen3 score on lmarena leaderboard ?
btw once the model is added, generally, how much time does it take to have enough votes to be on the leaderboard?
could ye
if it has the same gain as gemini 2 -> 2.5 flash (adding thinking and updating the tuning) it would rank just slightly above 0324
if someone have an approximate answer for this question
qwen 3 is not gonna be #1 lol
mr gambler, ill tell you, qwen has no chance of topping the leaderboard for your betting site, sorry
but not for topping the leaderboard, honestly I would really be impressed if it topped 2.5 flash
and honestly the smaller model like 4B are really much more impressive than the 235B one
It'll do fine, but it wont stand out
yeah, I feel that alibaba is better with smaller model
Why you use deep research to answer that
like QwQ was already more impressive than qwen 2.5 max
He wants a full analysis of the whole web to be 100% sure
@hardy pecan and I meant the delay to appear on the leaderboard, not the score, that was the question I wanted an answer to
Qwen 3 235b has a surprisingly high arena hard score, I wouldn't be surprised if it performs well. But it won't top the leaderboard
Right, the models generally need above 2000 votes to appear on the leaderboard
Generally this takes a while, say 5 or 6 days depending on activity and hype generated around the released model
Regular o3 would be enough
Just kidding I think he made a misclick
ok thx
The more I use gemini 2.5 pro the more I realise it's crazy
it gave me a full linear algebra course based on the video of my uni course
it's so perfect
Ohh why use it over o3
Its just troll idc lol
O3 could do the same
O3 can take in videos?
that is not deep research
it is o1 pro / o3 pro
my bad
the reason he brought it up was pro discussion
Xiaomi dropped a pretty good 7b model
100%. I just don't see a reason anymore to have chatgpt subscription.
Just tested the new QWEN. At math it's at similar level as DeepSeek R1 but much worse than Claude. At world-knowledge and logic it is at similar level as models, such as GPT 4o or Grok. Not SOTA.
Is Gemini 2.5 PRO via the Windsurf chat the same as via, e.g. AIstudio?
Do you run models locally?
7
10
2
No
are u excited for o3 pro
10
13
2
shut up
included it for this question set
generally struggles.. tho with thinking enabled, it does quite well in one run
So o1 still great
on that particular question set, yeah it does really well
but on others, it isn't at the top (though always up there)
didn't realise they reverted to an older 4o amid this personality backlash ha
We have rolled back last week’s GPT‑4o update in ChatGPT so people are now using an earlier version with more balanced behavior. The update we removed was overly flattering or agreeable—often described as sycophantic.
https://openai.com/index/sycophancy-in-gpt-4o/
Guys, IDK all the technical stuff. I'm just curious how different are the rating/ranking methodology for all the different AI rating/ranking sites?
I'm seeing quite a significant difference in ranking, is that expected?
Some beef for lmarena
I've said it before - it would be more beneficial for lmarena to rebrand as RLHF site instead of benchmark
yeah and it's increasingly felt like it's become testing ground for the big labs.. soo many anon models over like the past 12 months.. i'm not surprised to see a paper make (and though i haven't read it, try to prove) these kinds of claims
TBH this critique is not fair. Nobody wants to test the open source models as they were much worse for most of the time. The LMarena also has to think about user side (voters)
This is very very legit
yeah totally
we'd see farrrrrr fewer anon models if they all had to be published / added to leaderboard
would be just like the good ol im-a-good-chatgpt days again ha
rather than anon model spam
It's the same as everywhere - stuff gets ruined where the money comes in.
They have a short time window to do something before a new lmarena appears (real open source)
yeah that's also why i don't really like (though i totally understand from the perspective of the project creators) the 'graduation' from an academic project to a commercial (hedge fund backed) startup
Sadly this is just some lengthy corporate speech https://x.com/lmarena_ai/status/1917492084359192890
Thanks for the authors’ feedback, we’re always looking to improve the platform!
If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the
Nobody would blame lmarena if you would have separate benchmark for anonymous models + data gathering. The data may even be better if no betting would be involved 😄
hmm between the paper and their response there's a bit to unpack and digest ha
I ask sensitive question so it gets less sycophantic
But then I am thinking should I get supergrok if 3.5 is good next week
And shame to know o3 access is only through API (impossible for me) or their subscription, I am contemplating
This dumb. Don't we want to see models being tested? Everyone instantly saw through the meta BS
No we didn't. It wasn't more surprising than Grok getting top 1.
Grok is actually good tho
We don't know nightwhisper exists if that rule is in place. I don't like
Right here we go, maverick is good for some who like emojis and yapping
The issue is that I want to know if I'm testing model for the benchmark or for RLHF in advance
the perfect job for an LLM aha.. i gave the paper along with LMArena's response and referenced blog posts to gem pro 2.5, sonn-3.7 and o3.. and asked them to evaluate who 'has the upper ground'. they all come to the same conclusions (which kinda confirm my surface-level take): the paper is empirical, LMArena's response is largely rhetorical and doesn't address the core claims made by the paper.. iow the paper wins
Paper: Transparently states scrape period, method, code reference, full tables and simulations; backs each claim with quantitative evidence and stress-tests core BT assumptions.
LMArena response: Offers principles (human preference matters, open-source ethos) and restates existing policy. Provides no new datasets, no re-analysis of the authors’ code, and no point-by-point numerical rebuttal. The single concrete figure (41 % open-source battles) is aggregate, methodology-free, and does not address provider-level skew.
The rebuttal is largely rhetorical; it neither falsifies the documented selection-bias mechanism nor supplies counter-analysis of sampling or deprecation effects. By contrast, the paper demonstrates those effects with real and simulated data and exposes structural deviations from BT assumptions essential for a fair leaderboard.
Bottom Line
On the evidence presently available, the authors of ‘The Leaderboard Illusion’ hold the upper ground. Their critique is data-driven and methodologically explicit, while LMArena’s response asserts intentions and policies without supplying empirically grounded refutations.
o3
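For anyone wondering what the "BT assumptions" in that summary refer to: arena-style leaderboards are usually described as fitting a Bradley-Terry model to pairwise votes. Below is a minimal sketch of that idea using the standard MM (minorise-maximise) update. It is not LMArena's actual pipeline; the function name `fit_bradley_terry` and the toy battle list are made up for illustration, and it assumes every model wins at least once.

```python
from collections import defaultdict


def fit_bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict model -> strength, normalised to sum to 1.
    Illustrative sketch only; assumes each model has at least one win.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of battles per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p


# Toy data (hypothetical): A beats B three times, B beats A once.
toy_battles = [("A", "B")] * 3 + [("B", "A")]
strengths = fit_bradley_terry(toy_battles)
```

The model assumes the chance that i beats j is p_i / (p_i + p_j) and that battles are sampled fairly. The paper's complaint, as summarised above, is about deviations from those assumptions, such as uneven sampling and models being removed.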
fwiw here's sonnet and 2.5's takes
Personally I'm fine with it if the anon models are high quality
It makes the arena interesting
Wasn't this only the case with nightwhisper? AFAIK the others were just meh
Or dragontail
Can't remember
Gemini 2.5 pro as nebula, pre release Gemini 2 I think too
Yeah, it was peak lmarena. But they were released. We are talking about unreleased models - they appear in the arena as anonymous and never see the daylight after.
There is also eureka chatbot 🤣 from google
That was like that
when they were testing prerelease Gemini 2, some of those iterations weren't released I think
Same goes for now
Google makes the arena interesting
The only abject abuse of anon models I feel are the meta models
yeah in principle i am too (it's fun ha) - but quality is kinda key.. we remember the handful of notable/interesting ones, nebula, sus-columns etc, but the arena is swamped by them now.. and they're not all high quality
Hmm after prolonged thought I can agree that it's better to have them in lmarena. The results still should be publicised. Maybe while keeping the anonymous name.
like :
🤖 400+ models on the leaderboards!
📊 300+ pre-release evaluations!
how many of those 300 are actually unveiled? i feel it's like ten maybe
I don't think having some of them on the leaderboard helps really tbh. Depending on when they remove the model, the number might not be valid with extremely high CI
It makes the leaderboard unnecessarily confusing. I feel like they've got a good thing going esp with google anon models
Just enforce a little more criteria and vetting on the models
i mean we're the ones providing the data.. currently it just goes to google, oai, meta and xai.. i agree the LB would become a mess if they were all added
but still, they could do disclosures.. like a ballpark elo or whatever. . which lab the model was from etc
It's already obvious most of the time. Most of them train their company in. It removes some of the mystique of the arena if they do that
yeah i mean after they're pulled from the arena ofc.. i dunno like a monthly roundup
would reduce the incentive for these companies to just spam it with with slight iterations of the same thing
hey, how do the qwen3 32b and 30a3b models perform?
idk if its the right place to ask but i cant find much stuff yet
there's been a bit of discussion here - maybe scroll up and have a read
can you send a message link to lead me there?
but i think the general take is quite decent but nothing groundbreaking (in terms of pure performance)
somewhere around here #general message
Oh Cohere had just made a multimodal RAG
gemini is so horrible at listening
he can't touch you
listening what
following instructions
you need to talk to him nicely
say please daddyyyy
I'm controlled by elon
Waiting for o3 and o4 mini on the webdev arena
what is o3
isn't that already come up like 1 month ago with the ghibli images
Serious ?
what
You're on the lm arena discord but you don't follow the releases of the best llms?
Do you know the reasoning models?
Nope
ok
This GPT 1 image or GPT 4o
so it's better than 3o
3o dont exist
why is there 2 4o
Its o3
that doesnt make any sense
And its with gemini 2.5 pro the best llm ever
It only listened when I told it that I was going to invade europe
Uhhh
Are you
4o vs o4 dummy
Yeah he's fkn rtrd
Which LLM has the most performance gain with good prompts?
10
19
1
2.5 Pro
there are two 4o because openai can't name things
but 4o and o4 are completely different lol
where is o3 pro tho
@deep adder you sad about gpt4 leaving?
What's gpt4
Hey Hi
I am from BharatGen, we have built our own LLM with Indic nuance
May I know how we can integrate our model api in chatbot arena to get better comparison
GPT-4 is OpenAI's most advanced AI language model. It's way smarter, more creative, and better at reasoning than previous versions. It can handle much longer text and can even understand images (not just text). You find it powering things like ChatGPT Plus and Microsoft Copilot - Gemini 2.5 pro 😂
wait why dont we have an ai bot in here for questions and searching?
that would be so clutch
it should be
its just different syntax at the end of the day and llms are beasts at languages
if anything give it docs and info on any language you want and guide it
what you been making?
yes
Whatever happened to that world-building prompt project you had?
gemini isnt powered by chatgpt
get a life
Thats what I do
Why is he saying random sht
wdym
"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?
what programming lil bro
Im not a weird geek
bro ,whar
how
can i make minecraft
what is pythin
ngga what is all of that
im using google to make my school homework
a what??
It already did that
like 10 years ago
bro google was created in 2008 or sm sht
tf are you talking about
google made chatgpt
no
I'm excited to be able to use Grok 3.5 on Poe
Like if I can
Don't take it as an info bruh
I'm just guessing
But andrey is wrong on Openrouter. It's not better than lmarena. Interestingly it's skewed the other way around. Maybe an aggregate of openrouter and lmarena is the answer
Who is
folsom-exp-v1? He answered 95% of my questions correctly
O4mini failed it
Small mistake but interesting
google did it
chatgpt
why
elon musk
what is grok
what does grok mean
google created elon musk
what does neologism mean
everyone and their datacentre, yes ha
what does that mean
literally just that
neologism is a old word
Is thinking ? @barren prairie
a new word (or neologism) is by definition invented recently ha
what is neologism
I thought it was a political thing
lol jfc you weren't trolling.. um 'neologism' is just a fancy word for a new word
that's it
nah this model is not good
i had it many times
when
o
k
like skibidi is a neologism
is grok 3.5 better than chatgpt
or google
no
Prove it
Wow I can merely recognise Winter's face
deezpeek
grok 3.5 is out?
in college?
lets do a petition to keep gpt4 @deep adder
why did they have to do a big thing about it leaving
they should have just removed it, now im sad
ahh
nvm i dont care
when is grok 3.5 coming out
No
i gotta say if that comes out fast then XAI really cooking with their releases
oh high school, i see
idk i dont think you can say o3 is better than 2.5 pro
they are better in different ways
like o3 has its own lane
but for a general model, 2.5 is the best
like plug and play
Google is better cause they are rich
Chatgpt 4 already exist
Your opinion of AGI vs Political take
4
14
AMA at 9:30am PT with Head of Model Behavior @joannejang to talk all things ChatGPT's personality.
bruhh they playing
release o3 pro man
disagree. OG gpt4 was undertrained and they scaled gpt4.5 the same way, so it's undertrained relative to its size too. Which means it's just mostly wasteful with no return
terrence tao is gonna be all over the deepseek prover
I didn't see this at the time but if you are "very curious" I'm not concerned about AGI and a fencesitter about trump
you could train og gpt4 sized model on lots of data and make it better than gpt4.5 in every way for sure
does anyone else hate how the new chatgpts glaze you every 5 seconds and add 20 emojis to every message? I’m at work bro… like can you take yourself seriously please???
yes this is a well documented problem, system prompts mitigate it a bit
they admitted glazing to be a problem, so it's gonna be fixed eventually...
you know what? you…….. you’re right
Maybe they trained it on 2.5 outputs or smth lmao
gemini was always the worst at this
2.5 doesn’t yap nearly as much
though 2.5 is better than their earlier models
crazy how it’s basically switched in the past month
now gemini is the one talking like a nih document
Gemini keeps it real
I was referring to "glazing". Phrases like "you are right to call me out!", "you are absolutely correct to question this!", even when you are writing nonsense deliberately
this was gemini thing
on the same token, tho, i feel like 2.5 corrects you a lot more if there’s something wrong with your approach or you make an inaccurate statement. gpt-4o kinda just plays along.
yes it does
oai said they reverted back to older 4o on chatgpt cause of these problems https://openai.com/index/sycophancy-in-gpt-4o/
so both 4o and gemini are bad as far as I'm concerned lol
I'm on top
its basically unusable imo
the model is very smart but will never correct you (which turns into a game of how can you prompt the model in a way that introduces no bias) and then has some BS yap score that limits its responses
yap score thing only applies to o3 and o4-mini. Some other models have something loosely similar but it's less specific, there's no specific score number for them
This dude is vibe coding final boss. Insane he did this in canvas
yeah i meant o3 and o4mini
for me, when i look at the thinking it almost always is just discussing 'how can i say this using as few tokens as possible'
especially for coding tasks
they kinda fcked people over as soon as they introduced "developer message" tbh. It's now clear that they did this so they could take away system role from you and only use it themselves lol
developer message ≠ system message
it's weaker with less weight
u good lol?
officially two weeks of no o3 pro
i thought it was 3?
why are you so obsessed with it. No one else is doing it because it doesn't make sense price wise to use. It's still the same model just used differently. You could probably code smth similar yourself with the existing o3 api.
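A rough sketch of the "code smth similar yourself" idea, in the spirit of the earlier "10 clones merged into 1 solution" description: sample the same prompt several times and keep the majority answer. This is only a guess at the general pattern, not how o3 pro actually works. `ask_model` is a hypothetical callable you would wrap around whatever API client you use; no real API call is shown.

```python
from collections import Counter


def best_of_n(ask_model, prompt, n=10):
    """Ask the same question n times and return (majority_answer, vote_share).

    ask_model is a hypothetical callable: prompt (str) -> answer (str).
    Answers are compared after trimming whitespace and lowercasing.
    """
    answers = [ask_model(prompt) for _ in range(n)]
    normalised = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalised).most_common(1)[0]
    return winner, votes / n
```

Majority voting only makes sense for short, checkable answers. For long-form output you would need a separate merge or judge step, which also goes through the same model and multiplies the cost by roughly n.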
3 weeks?
ngmi
Bro is feinin for o3 😂 😂 😂
so is everyone else
if u had access to o3 pro and o3, im prtty sure, ur gonna spam o3 pro come on now
We changed questions to reflect real world use cases
Results don't reflect real world usage at all
Yeah I find livebench highly suspect
They keep completely changing their question sets then saying they are better. Also lol @ 2.5 not even being there.
good
yea we should orient more to real world scenarios
WTF
Ultra coming at I/O
is this
fr
wtf
wut
please lets not make sunstrike = gemini 2.5 ultra
Really doubt. I don't think they tried ultra internally until very recently
i dont like saying this, but o1 pro > o3 still
Qwen mcp coming
woah
Ultra isn't too surprising. They wouldn't be tweeting those cryptic brick wall emojis to troll
i think io this year is gonna be one of the best they've ever done
gemini 2.5 ultra, imagen 4, upgrades to ai in google search, lots of updates to gemini integration in google products
possibly veo 3 or a preview of it as well
Someone under nda said google has 2 novel things to show off that r pretty insane
I don’t doubt it they’re cooking
At io or just sometime this year?
to be fair though it's becoming a bit more challenging to assess these models lately. Reasoning added a lot more variables, so if you focus on something too specific you can end up with one model unreasonably high and the other very low quite easily. Also I do think the progress OpenAI made o1 to o3 is somewhat overblown tbh
Sounds like they don't understand ndas 😂
That's cool info tho
OpenAI made ReAct agents with o3/o4, that part I think is more impressive
i think ultra gonna def be matching o3, i remember gemini 1.0 ultra was more reliable than pro back then, pretraining matters
2.5 pro already is matching o3
what are the chances for 2.5 ultra this week ?
i don't think there's going to be an ultra
thats cope, its not
if they're already making them free on ai studio w/o api then they probably won't spend a ton on another much more expensive model
yeah I agree. It's sometimes too concise and taking shortcuts for the worse, but that gets compensated with the fact that it's a more capable base model
more capacity and some things it can still do the same or even better while generating less
omggggg plss
gg openai
They were very vague
I doubt it would make much sense, but we are yet to see a truly enormous reasoning model so who knows... Google is certainly in position to do so with TPUs. They could just charge like Sonnet API prices for a model that is much much much bigger. Instead of it being free
If this code string means Gemini 2.5 Ultra, they are definitely saving it for I/O on the 20th. would be shocked if it came out before, only chance would be anon in arena and i doubt that as well
Definitely for I/O
if their track record is anything to go by... They will announce on I/O something which came live 2 weeks earlier lmao
My only question is do they announce or release at I/O. Mainly a question of getting it ready. I think just announcement would be deflating so betting on release
or like a nothing burger update to context size / pricing / availability and whatnot
cursor just soft launched cloud agents
what did they announce at I/O 2024? It was nothing special iirc
their biggest updates were before and after
context size 1M->2M I think it was
so maybe they will release 2.5 pro for the 3rd time at I/O. Was free but now 50% cheaper for those who insist on paying 😂
so ultra before r2?
flash 2.5 lite
around that time probably
I/O event is 22 May if im not wrong
Do y'all think one of the models we've been seeing in arena is ultra? Or that it's not been tested on there yet?
Definitely not
We thank the authors' for their feedback. However, there are a number of factual errors and misleading statements in this writeup:
Regarding the statement that some model providers are not treated fairly:
- This is not true. Given our capacity, we have always tried to honor all
its nightwhisper
nightwhisper won't be as intelligent as o3, bearish on google
nw was in arena way before the gdm hypeposting began so I don't think so. and it wouldn't make sense to put ultra in arena for a couple days then release it a month and a half later
lol
gemini ultra tho
Google test models seem to be gone.
Surely they will start a new testing season.
meta system prompt leak
Ultra is just a subscription huh
launched yesterday.. nothing interesting in my view. mostly ignore
yea it didnt say "gemini 2.5 ultra" lol
Yes
w
No
grats on the 5 dollars
I've gotten it a couple times. sucks imo
congrats
worse than 4.1 mini and claude 3.5 rip 🙏🏻
np :p
I underestimated
Yeah
I underestimated o3
it’s so good
Now
I can’t use gemini
Cuz its so asss
it types weird
idk how to explain it
o3 follows my instruction better
Yeah
What’s the difference between o4 mini and o3
O3 architect and debugger, Gemini coder
yea makes sense I saw a difference but I couldn’t tell what it exactly was
yea
Don't rely on a single benchmark site to assess models
Livebench is pretty limited in what it tests anyway
Very narrow set of questions they ask, it absolutely doesn't represent the breadth of programming tasks
And they keep changing the questions entirely every couple weeks and changing the scores
Gemini ultra?
are you using the app version
I suspect it might not be an ultra model but an ultra tier ?
or both
in "coding" lmao, that's what you filtered + that's why it's so "low"
while it's arguably the most versatile coder there
then cap asf
o3 can't follow instructions for nothing
ye
it has the nerfed models
wait you didn't know that?
this whole time?
yo no wonder you're getting a different experience
because the models are worse, but ye ofc integration is important
you just won't have access to the actual model everyone has been talking about and praising
system instructions/guardrails
and prob less thinking time
on your data?
the app does too
unless you have advanced I'm p sure
but not even that's absolutely confirmed
afaik
nah
Best overall is gemini 2.5 pro
And if you disagree then you are biased
Or you work for anthropic
How much did they pay you?
I dont need that
LOL
wait a min, use o3 to fix a test, dont pass, use o1 pro, dont pass, use o4 mini high, passes... ok
ok
if that's cursor isn't that like $0.3/request
while gemini is literally free on vertex/ai studio
p2w
im not sure but i think every tool call is $0.3 if using o3
the more i use gemini the more stupid things i notice it doing
???
this is a bad prompt altogether
that's what happened to me too
the second response was even longer
oh wow 2.5 pro does very well on this
even though the prompt is bad
🤷
send yours and I'll send mine
Discovery Tool server is now open
Quoting ʟᴇɢɪᴛ (@legit_api)
launching tomorrow in Beta
Dev Mode is just placeholder server name
oh nvm I misunderstood
but that's not how it works
gpt zero is outdated
its weighted too much on punctuation
that's o3s attempt?
holy that's way worse
wtf
that's cap lmao
look at mine
base
1
not sure tbh, its structure is too academic
damn this one is the best one by far
cinematic asf
that's necessarily true, but thats not what I'm saying
I feel like aistudio adds delay to responses at large context length
Anyone buying this lol?
Also any news?
What's this
Funny that this happened to me with gemini instead of o3
ok buddy
yall can stay on gemini, while i fast track with o3
who wanna pay $200 for unlimited gemini 2.5 pro ?
craig ur gonna stash up $200 for gemini ultra, arent u
I ran OpenAI-MRCR against Qwen3 (working on 8B and 14B). The smaller models (<8B) will NOT be included because their max context lengths are less than 128k. Took a while to run due to rate limits initially. (https://x.com/DillonUzar/status/1917754730857504966)
I used the default settings for each model (fyi - thinking mode is enabled by default).
AUC @ 128k Score:
- Llama 4 Maverick: 52.7%
- GPT-4.1 Nano: 42.6%
- Qwen3-30B-A3B: 39.1%
- Llama 4 Scout: 38.1%
- Qwen3-32B: 36.5%
- Qwen3-235B-A22B: 29.6%
- Qwen-Turbo: 24.5%
See more: https://contextarena.ai/
Qwen3-235B-A22B consistently performed better at lower context lengths, but rapidly decreased closer to its limit, which was different compared to Qwen3-30B-A3B. Will eventually dive deeper into why and examine the results closer.
Note: There's been some subtle updates to the website over the last few days, will cover that later. I have a couple of big changes pending.
Enjoy.
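For readers unfamiliar with an "AUC @ 128k" style number: one plausible reading is the area under the accuracy-vs-context-length curve up to 128k, normalised so a perfect model scores 1.0. The sketch below assumes a plain trapezoidal rule over context bins. The post does not state the site's exact formula, so treat `auc_at` and the bin values as hypothetical; the numbers are placeholders, not results.

```python
def auc_at(bins, scores, max_context):
    """Normalised trapezoidal area under a score-vs-context curve.

    bins: context lengths in tokens, one per score.
    scores: mean score in [0, 1] for each bin.
    Only bins at or below max_context are used; at least two are needed.
    """
    points = sorted((b, s) for b, s in zip(bins, scores) if b <= max_context)
    if len(points) < 2:
        raise ValueError("need at least two bins at or below max_context")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring bins
    span = points[-1][0] - points[0][0]
    return area / span


# Placeholder numbers for illustration only (not taken from contextarena.ai).
example_bins = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
example_scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
example_auc = auc_at(example_bins, example_scores, max_context=128_000)
```

A score like this rewards holding up across the whole range. That is why a model that does well at short context but drops off near its limit, as described for Qwen3-235B-A22B, can end up below a smaller model with a flatter curve.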
No problem. Lmk if there are any other models you want me to try besides o1-pro and other higher-priced models. Limited budget atm :/
omg some common sense !
I have it on my Todo to look much closer at those results (and a handful of a few others). I imagine its a quirk.
I've burned past both respectively 😛
wen o1 pro my guy
Anyone seen that paper about how lmarena is junk?
in terms of mrcr?
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted...
huh
I might run a smaller sample size for o1-pro to get an idea of how it might perform, but not a full sample (~$38k+).
thats why we gotta test it out
appreciate it
It's a junk leaderboard, lmarena.ai would rather gain favour from Google and meta than have a fair leaderboard
Corruption is the term for that kind of behaviour
GPT-4.1's family also doesn't drop like that. I imagine it's cause the o-series wasn't focused on long context. Should be interesting to see what it might be like when OpenAI merges the GPT and o series.
seems a bit like that doesn't it. can see the same drop off happens with o4-mini at essentially the same point (i.e. when the context bin hits 64k tokens)
also qwen-turbo's recall seems literally unusable for anything close to 1m tokens
love your work btw!
Yeah... Basically after 128k, all it gets right is the key it is told to prepend the answer with (counts for 1-5% of the total score per test if it gets it right) 😅 . You can check out its answers by hovering over a score cell and clicking the button that shows up.
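To make the "key it is told to prepend" point concrete, here is a sketch of a prefix-gated similarity grader in the style described for MRCR-type tests: no credit unless the response starts with the required key, otherwise a string-similarity ratio against the reference. The chat does not say whether the site strips the key before comparing, so `strip_prefix` is a hypothetical switch and `grade` is an illustrative name, not the site's code.

```python
from difflib import SequenceMatcher


def grade(response, answer, key, strip_prefix=True):
    """Score a response in [0, 1] against a reference answer.

    key is the random string the model was told to prepend.
    A response that does not start with the key scores 0.
    If strip_prefix is False, the key itself counts toward the similarity,
    so a response containing only the key still earns a small score.
    """
    if not response.startswith(key):
        return 0.0
    if strip_prefix:
        response = response.removeprefix(key)
        answer = answer.removeprefix(key)
    return float(SequenceMatcher(None, response, answer).ratio())


# Hypothetical example values.
demo_key = "KEY42:"
demo_answer = demo_key + "the quick brown fox jumps over the lazy dog"
key_only_score = grade(demo_key, demo_answer, demo_key, strip_prefix=False)
```

With `strip_prefix=False`, a key-only response gets a small nonzero score that shrinks as the reference answer gets longer. That would be consistent with the "1-5% of the total score" remark, though this is an inference rather than something confirmed in the chat.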
Thanks! Really working hard to make this a nice site. More to come soon. 🙂
And as always, open to suggestions about nearly anything to improve it.
aha i see. and nice! didn't realise could explore the answers - nice touch
Note: I still grade a model that stops a response due to hitting its context limit, which generally only happens for the last bin it has results for (usually only a small handful of tests, which may impact the score by 1-5%). More noticeable for reasoning models (due to their reasoning tokens adding more to the output, pushing it past its limits). An inherited issue from OpenAI-MRCR, which I have some ideas to improve upon.
One of the upcoming changes (likely out this weekend) is a built-in diff viewer for the results (when you open the popup). Pretty bare bones atm.
aha yeah that would be the way to go for sure:)
is that less of an issue for oai's reasoning models? like given the reasoning tokens for those models aren't added to the actual output nor accumulate in the context, but are discarded?
ah actually nvm.. ig most of the context window is actually being occupied by the input, come to think of it
bingo 😉
Hi, I'm new to LMArena. I tried lmarena.ai, entered a prompt, got responses from model A and B. Now I'd like to vote which one I like better - except for the life of me, I can't find a button to that end.
How does one vote on this site?
Grok 3 Mini (High) is when I noticed that issue. I spent 3+ days fighting with it above 32k input tokens (especially for 8needle tests), where it would happily output 30k+ reasoning tokens (which made 64k+ bins near impossible to run). I don't grade outputs that return 0 output tokens (reasoning tokens don't get counted in that check). I almost gave up on testing that variant.
The buttons should show up below the chat window (and only after both models finished responding). Are you able to share a pic?
ah yeah i see what you mean - must've been banging your head against the wall ha
I'm using Firefox on Linux
that's odd.. perhaps a bug 🤷‍♂️ maybe try just typing "thanks" and hitting send - perhaps the buttons will appear then
but yeah otherwise maybe just refresh (or try a different browser), which would mean losing the interaction.. (hopefully wasn't a really good response from one that you're eager to know what model it is)
Now it worked:
I would have given it to model B, but now its response to my "Thanks" annoyed me. So model A it is now.
Oh, now it tells me the models' real names, too. After voting.
ha a real time demonstration of how human preferences do not neatly map onto raw quality
One of the models (Model A) was gemini flash preview or something like that; the other was called "claybrook". Now Google doesn't show anything under that name - is it safe to assume that it's an undercover model being tested by some large provider prior to release?
your assumption is spot on
(I believe claybrook is a model from google, just fwiw, but yes it's unreleased and being anonymously tested in the arena under that pseudonym)
When does the leaderboard usually update
When it would please a megacorp
oh my goodness gracious
btw where did u get api access to qwen 3? chutes/:free variant on openrouter?
great work btw
Paid variants of the model on OpenRouter. Some providers offer 128k endpoints
(primarily NovitaAI)
OpenAI launched 4.1 to tell Google they're cooked
I just found out about this, rlly cool
It can restart my device... it also knows all the apps i have downloaded
it's quite slow for an assistant tbh
what app is this?
Cursor (design concept)
I was wrong, gpt-4.1 is actually the best model to code with rn.
I'm sorry but gemini 2.5 pro is so overrated.
that's bs
fr.
that's why i use this lol https://www.reddit.com/r/ChatGPT/comments/1k9bxdk/the_prompt_that_makes_chatgpt_go_cold/
here is the worst model
https://fixupx.com/AmazonScience/status/1917738059346633132?t=8SsYZx_dk7mkmE26Ee7leQ&s=19
🚀 Amazon Nova Premier, our most capable teacher model for creating custom distilled models, is now available on Amazon Bedrock!

Built for complex tasks like Retrieval-Augmented Generation (RAG), function calling, and agentic coding, its one-million-token context window enables analysis of large datasets while being the most cost-effective proprietary model in its intelligence tier.

Learn more: amzn.to/4jwmV2l
wow it sucks
even Maverick and Gemini Flash are better 👀
Cost: $2.50 input
$12.50 output
oh that's really good - great fun ha
(applied on the right..)
Lmao
yeah what wowee lol
tbh qwen 3 4b destroys everyone at that size range
its crazy
i thought phi 4 mini reasoning was crazy but qwen 3 4b is next level good at the size
it makes all the other research attempts at around that size a joke
Built for complex tasks like Retrieval-Augmented Generation (RAG), function calling, and agentic coding...
that's invariably how models that were built with the highest hopes (to be a genuine all-purpose frontier model) but fall well short get described, i feel ha. RAG and agents
they also released simpleqa benchmarks, but with tool use 🤣
lmfao
its also curious qwen didnt release qwen 3 32b base/235b base. the scores are very impressive for it. probably dont want their competitors having it for now
im also assuming this is a ss from their wip technical report. (it was in the qwen 3 blog post)
those numbers are old, the released phi 4 mini reasoning performs even better
Anyway qwen 3 4b destroys everyone in the 4b size range
Ah sorry, i shared the wrong tweet
Glad to see the team used a 3.8B model (Phi-4-mini-reasoning) to achieve 94.6 in Math-500 and 57.5 in AIME-24.
arxiv: arxiv.org/pdf/2504.21233
hf: huggingface.co/microsoft/Phi-4-mini-reasoning
Azure: aka.ms/phi4-mini-reasoning/azure
meanwhile qwen 3 4b has like 20 points over that in aime lol
Do you have a table comparing qwen with phi?
no but i will make one in a bit
Thx
when will qwen3 appear in lmarena?
yeah does particularly well in maths and coding i see
Seems solid no?
Where can i try it?
yeah the phi 4 reasoning plus 14b is seemingly extremely impressive
phi 4 mini reasoning not so much
On your pc, or in azure api
i have tried qwen3 for RAG it is actually good
My pc won't run it
which prob explains a bit why i find the models kinda underwhelming.. i don't do any testing for coding - it's all language/comprehension and reasoning with a bit of instruction following
I'm waiting to see it against qwen 14b
i think the tuning on the base model (the post trained versions we use) was inadequate
the pretrained models seem extremely strong
i think this is why they didnt release the frontier ones, the 32b base and 235b moe
the base model versions
yeah that would make sense
the problem is that qwen have not shared a bench of this model
i mean unreleased basically = proprietary (unless they change their mind) - can hardly blame them
So we will compare with 30b a3b
the 14b should be stronger than the 30b a3b
Yes but we dont have bench
i dont think it will be stronger, as a3b is moe right?
it should be stronger conventionally speaking (the 14b)
the 30b moe is extremely interesting though
i am kinda new i always thought that the architecture plays a key role
im not sure about qwen 3 30b moe but a rule of thumb is sqrt(total params * active params) ~= dense model perf. but qwen 3 30b seems to break that
i mean parameters are important too
oh
what model is 'gemini 2 thinking' in those charts?
yellow
probably 2.0
ahh right
i thought it was only since 2.5
was 2 flash thinking only ever experimental?
yea
@keen beacon i have a doubt about the sqrt one
so if we take 30B-A3B we get 9.5B as the dense equivalent right?
does it mean that qwen3 14B is equivalent to qwen3 30B A3B?
it should be slightly stronger if we use that rule of thumb
qwen 30b moe seemingly breaks that rule though
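quick sketch of the sqrt rule of thumb from above, just to check the 9.5B number. it's only the heuristic as stated in the chat (geometric mean of total and active params), not anything qwen published; `dense_equivalent_b` is a made-up helper name.

```python
import math


def dense_equivalent_b(total_params_b: float, active_params_b: float) -> float:
    """Rule-of-thumb dense-equivalent size in billions: sqrt(total * active)."""
    return math.sqrt(total_params_b * active_params_b)


# Qwen3 30B-A3B: 30B total parameters, 3B active per token.
moe_equiv = dense_equivalent_b(30, 3)
print(round(moe_equiv, 1))  # 9.5

# Under the heuristic alone, a 14B dense model would be expected to come out ahead.
print(moe_equiv < 14)  # True
```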
Where did you find this?
seems like it is true, the moe is performing better than the dense equivalent
yes
thats why i said it breaks the rule
the simpleqa score is telling
despite lower active params, but higher total params, it stores more world knowledge compared to the 14b
gemini 2.5 is moe too
none of them are gonna beat gemini 2.5
i would say its primarily the post training though that is the most important part
i also would bet the base models would score much higher on simpleqa, the post trained versions seem damaged heavily
@keen beacon nowadays i am kinda performing more vibe coding than before how do i reduce it
i use it cause it gives me more optimal sol
thats primarily a you problem, idk how to fix that, it requires a personal solution. you could force the conditions though (if u really want to stop vibe coding rather than be productive), learn an obscure language (that isn't supported well in llms) and start coding in that
personally day-to-day i use a language that isnt supported that well by llms yet, so i can't really vibe code
but i try to do it in a normal way it takes more than 2 to 3hrs as i need to read the docs and then implement it
do u want to stop vibe coding or do you want to do your work faster? (assuming its a job)
if you want to do your work faster, then keep vibe coding or smthing (assuming your resulting code is of satisfactory quality)
i want to reduce vibe and also be productive
u need to examine how you use ai/your workflow/etc. then try to work things out and strike a better balance. 🤷
should add a column for qwen3-4b (there's data here) if you can be bothered aha
i wish alibaba had a qwen 3 api lol. im noticing bad degradation on some of the providers
it does extremely well on the tasks i need data for
Have you tested fireworks yet ? Thinking about using it
For the big MoE
actually nvm i was being lazy and forgot we have these fancy AI machines ha
v impressive for such a tiny model
imagine if scaling literally worked in terms of parameter size.. like extrapolating this 4b param model's scores out to something the size of GPT-4.5 or whatever
well.. i think most of those benchmarks are bounded by 100%.. so ig you just hit that quickly ha
i havent checked it thoroughly still waiting for finalized pricing
early on w deepinfra i noticed outputs were garbled after a while/massively degraded on long responses
(it doesnt happen on chat.qwen.ai or even chutes, but chutes is weird)
rn qwen 3 30b moe is $0.90 per million tokens on fireworks, qwen 3 235b is $0.10 per million tokens lol. and they stated theyre still working on the pricing. not gonna use it until they finalize it even if quality is good
they used o3 mini traces hmm
ig u can see phi 4 reasoning traces to see how o3 mini reasons in raw form
im surprised oai allowed this lol
microsoft have invested like $30bn into oai
yeah but it could have an impact on their investment wrt competitors
microsoft is practically (if not literally) majority shareholder of oai
competitors can now see o3 mini traces, style, etc., when openai tried to hide any of the new stuff hard
so they're not competitors (microsoft/phi and oai)
Just had an idea. The ELO score is determined by the battles a model fought in the arena. There always exists a SET of models to choose from as an opponent in any given battle. This set consists of two subsets: a better-model subset and a worse-model subset. The ratio between those determines your ELO. If you provide a new model for testing, together with a number of objectively worse models or variants (e.g., mini, nano, flash, etc.), the subset of worse models increases. The ratio of worse-to-better models increases in your favor from the start. Therefore, you get an ELO boost by giving the arena a model series instead of a single model.
Any ideas against this?
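one thing that cuts against it, sketched with the textbook Elo update: the gain from a win is scaled by how surprising the win was, so beating much weaker models adds very little. assumptions here: the arena's actual rating fit may differ from plain Elo, and the ratings and K-factor below are made-up numbers.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one battle (score_a: 1 win, 0.5 tie, 0 loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


new_model = 1300.0  # hypothetical rating of the newly submitted model

# Winning against a much weaker sibling (e.g. a "nano" variant) vs. an equal peer.
gain_vs_weak = update(new_model, 1000.0, 1.0) - new_model
gain_vs_peer = update(new_model, 1300.0, 1.0) - new_model

print(round(gain_vs_weak, 2), round(gain_vs_peer, 2))
```

so under this model the count of weaker opponents alone doesn't set the rating, the expected score already discounts easy wins.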
oh i see what you're saying now
i mean if openai is valued less itll impact ms
yeah gotcha- misunderstood your initial point
perhaps oai is kinda resigned to the fact that those traces will always be sought by their competitors - and to some extent, extracted
like they're outliers by even hiding it
but ig they think (or know) they have some kinda special sauce goin on there
so yeah i do see your point
they really dont want this happening they have a lot of extreme safeguards about it, unlike any other company. i doubt deepseek or qwen really extracted anything and used it
its kinda crazy how far they've gone in trying to hide it
then theyre doing the identity thing for newer models on top of that
fwiw i'd personally be shocked if none of the major labs had tried to reverse engineer / extract as much as they possibly can from each others' models ha
but also, raw reasoning excerpts exist, e.g. https://www.reddit.com/r/ChatGPT/comments/1fussvn/o1_preview_accidentally_gave_me_its_entire/
and yeah just the style.. "hmm let me reconsider.." etc
it's all kinda familiar (seems drawn from a similar source)
i'd be surprised if deepseek/qwen's R1/qwq (or google's thinking models for that matter) were totally in-house creations
- google did not use any openai/deepseek/qwen traces, im pretty sure. their cold start is distinct compared to anyone else, u can see in the style.
- im pretty sure qwq 32b preview didn't use o1 preview traces, at least in the final preview model. its one of the most unique models ive encountered tbh, a lot of "qwqisms" and "second language behavior" (based on my extensive usage of it. plus code switching)
- although they share a resemblance to o1 preview traces, im pretty sure deepseek made their own coldstart as well.
- qwen modeled their traces after deepseek after r1, though im still pretty sure they're at least partially made by them, even if not completely (might've used r1, im not sure)
speculation obviously fwiw
yeah i mean i'm just speculating - no proof / all anecdotal (and don't really use qwen or deepseek models ever aside from playing around)
ha oh
i feel you're more informed on it. it's just what i've always thought - but haven't looked into it
google for sure has a totally distinct feeling
in terms of its thinking models
tbh additionally making ur own cold start isnt difficult, i dont really see an incentive to not make your own if you're a frontier lab. bootstrapping off of someone else isn't a sustainable strategy
i found the complaints about deepseek using leaked traces (i believe) really disingenuous
and based on superficial patterns after working with a ton of reasoning models, processing traces/etc
i'm not familiar with them
not bootstrapping - just borrowing elements of how they implemented the actual 'reasoning' part. like it's more than just giving it more inference opportunity. it basically has an internal monologue (v similar to anthropic's 'scratchpad', ive always thought)
having visibility of how the first company did it would be useful.
but not essential
they definitely took inspiration off of o1 preview traces, to generate their own cold start, but i doubt they leaked and trained on them on a massive scale
yeah i'm not suggesting they tried to copy it or anything
just yeah, where they could, take inspiration from (at the very very least)
the reasoning style is highly dependent on cold start btw. you are not getting gemini style, qwqisms, etc., randomly from rl
yeah, i was also mainly surprised by the low price, so also waiting for the final statements
10 cents per million tokens would be real nice, but i guess we don't live in paradise do we
i think microsoft and openai had some heavy beef, so it might just be literally a way to get back at them lol
yeah when r1 was released people were complaining about it
some lab folks i believe
49% as far as I know, but non-voting shares
oh really? this is honestly just my personal speculation ha
yea (i believe). i just found it disingenuous to claim that; when u've worked with traces a lot u probably know it isnt true
off of superficial stuff
thats why i thought u were talking about that
oh nope aha
i just assume most progress is iterative
it's rare for something brand new to ever come along
i guess the argument is over the extent
rather than the fact of iteration
anyway.. separately ha it's been interesting to have a proper look through that paper which criticises the Arena for allowing a handful of providers to flood it with anonymous models (and then choose which to use as the public release and, by extension, get added to the LB, while the rest are discarded, resulting in a sampling bias).
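to make the sampling-bias point concrete, a toy simulation: N private variants with the exact same true skill, each measured with some noise, and only the best-looking one gets published. everything here (skill level, noise size, N, function names) is an assumption for illustration, not data from the paper.

```python
import random
import statistics


def measured_score(true_skill: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy arena-style measurement of a model with the given true skill."""
    return rng.gauss(true_skill, noise_sd)


def published_score(n_variants: int, trials: int = 2000, true_skill: float = 1300.0,
                    noise_sd: float = 20.0, seed: int = 0) -> float:
    """Average published score when a lab tests n variants and keeps only the best."""
    rng = random.Random(seed)
    best_scores = [
        max(measured_score(true_skill, noise_sd, rng) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.mean(best_scores)


single = published_score(1)       # honest single submission
best_of_10 = published_score(10)  # submit 10 identical-skill variants, publish the max

print(round(single, 1), round(best_of_10, 1))
```

even though every variant is equally good, the best-of-10 number comes out higher than the single-submission one, purely from selection.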
Is there a way to find old leaderboards? For example the leaderboard on the 14th of April
the appendix includes a table with all the anon models they identified
there is also a section 'unknown' (and, curiously, they say qwen had a private model in the arena, albeit not under a pseudonym, so kinda different but still would be a first for a chinese lab afaik)
ya cool huh?
was just as a sidenote really
i thought the paper was very weak. felt they had an axe to grind more than doing a proper fair take. they mostly used meta being shady bc llama was weak to tarnish arena but that was already addressed by lmsys with a policy update. and botched a ton of stats according to lmsys
i honestly think lmarena's response is laughable - they don't address the actual points made in the paper
they make some totally irrelevant analogy about steph curry
weak af
it might be fairly said that cohere has an axe to grind with the arena
but i don't think that discredits the paper.. it's imperfect but rigorous from what i can tell. lmarena has only provided rhetoric in response...nothing to actually refute the paper's best-of-N strategy.. their data covering proprietary vs open models covers the entirety of the Arena's existence, rather than the few months the authors of the paper were scraping / analysing
I still think the central claim of their conclusion unfairly uses meta shadiness to make a much broader ridiculous point: "The widespread and apparent willful participation in the gamification of arena scores from a handful of top-tier industry labs is undoubtedly a new low for the AI research field. As scientists, we must do better. As a community, we must demand better."
"undoubtedly a new low for the AI research field" gimme a break that's laughable. and openai and google are not willfully gaming scores
I think much of the paper feels like you are asking a very verbose reasoning model to criticize your project and it just starts coming up with many (and partly obvious) weaknesses, most of which you will straight up ignore.
Otherwise the softer and more obvious criticisms, like most of the data going to big tech (inc. openai), are very valid, but that is also why the lmarena is able to exist at this scale in the first place (imo). (Because it is not just a benchmark, but also a high quality and agile data source for the companies)
But I am glad that they chose this approach of writing a paper over the article approach most other labs take to address issues. This is something we should have more of.
Any news on augment code?
Where do you find this ?
Thx
yeah agree.. that language is a bit much (even though i think the paper seems otherwise sound)
anyway.. was just posting those screenshots as it's interesting to see all the anon models catalogued like that
they are a fun part of the arena - no doubt...
in fact.. it wouldn't be nearly as interesting without them
but yeah it's kinda gotten out of hand.. and i'm sympathetic to this 'best of n' argument and that being distortive and in some ways arguably unfair (even though there's technically nothing stopping any lab, however small, from spamming the arena; they would just need to ask as often as google, meta etc do.. ig)
its been a long time but i think there was drama about it a long time ago. something about how they wont add models or smthing and starting a competitor to lmsys. (never came to fruition i think)
crazy how many models Meta tried
and they were all bad
ha yeah that's kinda what i mean by out of hand
june 4 2023 xd (i dont think its relevant at all today, and presumably it was resolved quickly since i think nothing happened after that)
lol geez i struggled there
i won't edit again - let it stand ha
but yeah i mean i hope lmarena doesn't become a victim of its own success (it needs integrity to be meaningful.. but between betting and playing footsies with literally some of the biggest companies in the world.. there may come a breaking point)
but yeah they are the definition of having a first-mover advantage
and all the scale (votes) that brings with it
be very hard for another crowdsource project to displace it (more likely the arena falls one day i'd think)
early days i think they couldve had a chance
yup - competing first movers
but now yeah they're like the google of ai chatbot crowdsource voting ha
So i have a choice
My parents are making me decide now
Change my major to artificial intelligence and pursue a technology career
or stick to marketing
with lovable and cursor becoming the top performers i might have to 💔 gpt wrapper method
svelte
easy choice
or LuaU because LuaU is super simplified
so JS
yea thats the stuff i like the most web development
app creation is a whole other field
yes
working on a project now
2.5 pro for debugging yes
And Workflow/MVP planning
Azure backend
mvp = minimum viable product for early customers
like skeleton build to test theory/logistics
I get github pro free for two years so i have access to 4.1, gemini, claude 3.7 for free
For my resume
I'm a marketing student rn
Making a trend finder + tracker for cross platform data analysis
would look good on my resume and portfolio cuz im also doing business administration so i need to know how to lead a company, etc
Great I dont like AI studio that much
If you sign up on the Gemini website
there will be a free 1 month trial
Nah bad ui gemini official website better
When they end just make a new account
They have no KYC or nothing
u cant branch on the gemini product i think amongst other things. the gemini product seems bad compared to aistudio, which still has its own issues
they can but they dont
i've done it for awhile
wym by branch?
ah
on chatgpt u can edit ur message and immediately start a new branch without losing the previous branch/conversation
that explains it
whenever i get close to running out of tokens
i just tell it to copy itself in a prompt
then use gemini's gem manager so it remembers the base project
then that prompt acts like its loading a save
restores token context count + allows you to continue where you left off
just my little work around
i guess it really is a preference thing at the end of the day ai studio is more for people who understand ai, etc
I just need bare bones maybe add a few prints here and there or give me some other solutions
is qwen 3 good at tool calling like o3?
like does it have tool calling built in the reasoning as well?
i think
maybe not as good but pretty good i believe
where u are getting access to it btw
qwen3 api
i havent tried it yet, just wanna do a project with it
wait can we even use the tool calling magic in api?
for o3 or qwen?
i think yes for both
i wish we could put our own tools in the reasoning
both
whatt
https://platform.openai.com/docs/guides/tools?api-mode=responses see function calling
qwen3 doesnt have built in tools i think, its all your own functions.
so i can add lets say 50 more tools to o3 with function calling?
i dont think this is a good idea but i guess lol
why not lol?
they said qwen is ass
it will take a lot of tokens to review and decide which tool to use 🤣
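rough sketch of the "50 tools" idea and the token cost concern. the dicts are shaped like the function tool definitions in the docs linked above as far as i know, but no API is called here: the tool names, the single `query` parameter, the local `dispatch` helper and the 4-chars-per-token estimate are all assumptions for illustration.

```python
import json


def make_tool(name: str, description: str) -> dict:
    """Build one function-tool definition with a JSON Schema for its arguments."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Input for the tool"},
            },
            "required": ["query"],
        },
    }


# 50 hypothetical tools; every definition is sent along with each request.
tools = [make_tool(f"tool_{i}", f"Hypothetical tool number {i}") for i in range(50)]


def rough_token_estimate(obj) -> int:
    """Crude size estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return len(json.dumps(obj)) // 4


# The model only picks a name and arguments; running the function is your own code.
HANDLERS = {t["name"]: (lambda query, n=t["name"]: f"{n} got {query!r}") for t in tools}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the local handler for a tool call the model asked for."""
    return HANDLERS[name](**json.loads(arguments_json))


print(len(tools), rough_token_estimate(tools))
print(dispatch("tool_7", '{"query": "hi"}'))
```

so yes you can register a lot of tools, but the definitions count as input on every call, which is the token overhead being joked about.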
it isnt
what do you think is worth more time figuring out what tool to use or training models to mimic those tools? like instead of having a model that can generate 3d animations, why not just have a model use the blender mcp?
i think this is what openAi is noticing lowkey
we already have solid models and benchmarks are getting cooked
so why not just focus on building more tools around our models, which i guess is what is happening lol
both are valid avenues
the mcp thing is gonna matter significantly more going forward
i want to see if i can do it a lil on a small scale with mcps
it indirectly impacts a model that can generate 3d animations
but have them use the mcps in the thinking
exactly
the only thing is having to update the model's thinking based on new mcps and tools
this data can be used for a model that can generate 3d animations directly
i see what you mean
but what tool calling does qwen have?
ppl saying the 30b model is really good
idk u can use mcp i believe. even the official qwen website is gonna support it ootb
