#leaderboards
it's updated now 👍
I can't confirm when leaderboards will be updated sorry to say
works 😄
@soft crater
I guess they want a nice day for Claude and the site because in the general lmarena it won't look so nice and people will write again how it is a bad benchmark :P.
As it is the non-reasoning version, it loses in logic, math, drawing, listing content, and formatting things a certain way...
It does answer normal questions very well. Also puts out short answers when a long isn't needed. It is better in creating some unique ideas, the writing sounds nice.
It is a good model but it really depends on what you test.
I don't want to look anxious but this is becoming weird... Is there any reason I missed?
unfortunately we don't provide details on when leaderboards will be updated
So weird
Maybe they're implementing sentiment control?
Maybe someone on the team is on poly market 💀💀
I don't judge, everyone needs to do their best to make a profit.
I'm not sure I'm following so let me know what I'm missing, but typically we wouldn't give details on what day/time leaderboards will update. Normally it's about a ~week
No, this isn't the case
too many gamblers here😅
hey im not gonna pretend i don't have skin in the game, but it does sting when the leaderboard that the results are based on just doesn't update before the market resolves, despite the other leaderboards updating
gamblers gonna gamble
lol
then make your own leaderboard with your own benchmark and the problem is solved. Complaining that a FREE tool that doesn't owe you anything is not behaving like you wish, so that you can earn money, is not a good sign.
Also the Claude models won't be #1 in the Text benchmark, they are not general enough. So it doesn't matter for the bet on 1st place.
Of course it would be nice to have an update soon anyway
it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website), it's just odd to not update the boards for so long after a major release, despite the webdev board being updated so quickly.
Ppl keep saying Claude is for coding only so it wouldn't be good at text, but it's possible that the additional tooling and agentic nature would allow it to provide more useful results, even when just chatting with it. If ppl are so confident that it'll be worse than google's model, there should be no harm in updating the leaderboard to reflect that. Waiting abnormally long right as a market is about to resolve just is sus from an optics perspective. Devs could easily selectively update the board to win bets
"it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website)"
did you pay part of those 100m? If not, for you and me it is free, the rest is fiction.
"Waiting abnormally long"
That sounds really like /r/choosingbeggars . The update cycle is always around a week and the more votes you get (note that you need to filter them) the better to assess the score. Who cares about gamblers, one wants proper assessments.
Again make your own leaderboard if you need it so badly. Behaving with such entitlement is never a good sign.
lmao
eh indeed, maybe the best answer to your post would be lmao
you'd think with a $600m valuation you'd have a public schedule for updating, you love making excuses huh
no need to refute entitlement
nah fam, you are just too entitled
one needs votes for accurate scoring, it is statistics
what a reddit tier response
otherwise one does scoring against static questions, not human driven
calling having basic company practices for a 9 figure business entitlement
whatever.
you are willingly missing the point.
champion for mediocrity
"lmao"
"you love making excuses huh"
"what a reddit tier response"
"champion for mediocrity"
champion of proper arguments.
ask an LLM to help
writing another non-argument
I am still waiting for a rebuttal against "not having collected enough votes"
your responses sound like one, just give up bro. you're defending a startup that got a 2021 level seed round from a crypto hype VC and can't handle being a proper oracle for data. The whole value of these types of leaderboards are devalued anyways due to the models being optimized for it
i bet you when it does get updated, it has way more votes than were needed for good data
"just give up bro" to a choosing beggar? lmao
you'll see other results on there with less than 1/3 the votes
actually they publish models with already way too few votes
3000 votes aren't many. I mean for some stats process yes, but for the variables at play they aren't
they just coincidentally decided to increase their threshold for this week all of a sudden
I don't get why you are mad about it
I mean even if they want to wait a year, it is in their rights
again make an alternative leaderboard
What irks me is not many of your arguments, that I can simply ignore because they aren't such, rather the fact that you demand something from someone that owes you nothing.
in general I would greatly prefer a proper leaderboard that tries to assess the best
even if it is updated randomly once a quarter
again when they publish some models with 3k votes that aren't enough. There are categories with barely 600 votes. Considering the variables at play (the knowledge of the person judging, the difficulty of the question, the pairings, etc...) 600 votes for some categories are nothing.
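A rough way to put numbers on the 600 vs 3k votes point, as a minimal sketch: treat every vote as an independent coin flip and look at the 95% margin on a win rate. That is a simplifying assumption (it ignores exactly the variables listed above: the judge, the question difficulty, the pairings), so the real uncertainty is likely larger. The function name is made up for illustration.

```python
import math

def win_rate_margin(votes: int, win_rate: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval on a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / votes)

# Margin in percentage points for a few vote counts
for n in (600, 3000, 10000):
    print(n, round(100 * win_rate_margin(n), 1), "percentage points")
```

Under these assumptions 600 votes gives roughly a 4 point margin on a 50% win rate, and 3000 votes a bit under 2 points.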
further if they publish things in a way that let people bet even more, there will be even more possible "rigging" in action. So actually making (entitled) gamblers mad is a good thing.
come on update the leaderboards 😳
Google will be #1 anyway, so what difference does it make whether the leaderboard is updated or not?
The gamblers here begging for leaderboard updates are not only gambling addicts but also have some unrealistic delusions. Perhaps they are one of those who bought Anthropic at 20c lol
sorry for the wait! we're collecting votes but the results should come soon
poly will go crazy with the last minute leaderboard updates
You shouldn't feel beholden to gamblers, take your time
i mean they have the web dev rating up
conditioned on the fact they are waiting for more votes to update the main tells me it’s at least slightly closer than the market is pricing it at
to be clear i still think gemini is gonna be up top
but i don’t think it’s gonna be a blowout
for what it's worth, from my own testing on the leaderboard, I would be surprised to see claude higher than #5 overall. But since claude is very recognizable (dry answers af as long as it is not a technical question), people could also pump its score.
I think this great leaderboard should be updated on a regular schedule. The influence of this page is growing and attracting more and more attention, and a regular schedule is also fairer to AI models. It would also be beneficial to the development of this website.
I agree
They still need to collect enough votes to get a credible score. Without enough votes, even if the leaderboard is updated regularly, models with insufficient votes or those whose scores are withheld by the model vendor for reasons such as product schedules will still be hidden on it. This has nothing to do with "fairer to AI models," nor will it give your bets any advantage.
When a model gets deprecated, which of the following happens:
- Its sampling is reduced to 0, but previous battles it was a part of are still used in the next leaderboard calculation
- Its sampling is reduced to 0, and all battles it was part of are removed from the dataset used to construct the leaderboard
or something else
i mean the simple answer is update every friday, and new models make the update if and only if they have x votes by then
if ppl wanted regular updates
idrc when it updates
i’m indifferent
i like when there’s more uncertainty in the market
good question. I was hoping that the model still gets used to compute leaderboard scores, even if deprecated
otherwise things get messy as many pairings do not happen
then one has "cliques" in the rating model
I think they probably continue to use the already collected data but would like a confirmation. One of the criticisms from "The Leaderboard Illusion" is that deprecating models makes the comparison graph disconnected and the ratings unreliable.
But that would only be the case if when they deprecate, they remove those rows from the dataset
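The disconnected-graph worry can be sketched in a few lines. This is a toy illustration, not how LMArena actually stores or filters battles: models are nodes, each battle is an edge, and dropping a deprecated model's rows can split the graph into "cliques" that were never compared with each other. The battle list and model names are invented.

```python
from collections import defaultdict, deque

def is_connected(battles, excluded=frozenset()):
    """True if all remaining models are linked by some chain of battles."""
    graph = defaultdict(set)
    for a, b in battles:
        if a in excluded or b in excluded:
            continue  # drop every row that involves an excluded model
        graph[a].add(b)
        graph[b].add(a)
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary model
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(graph)

# Hypothetical battles: "old" is the only bridge between A/C and B/D
battles = [("old", "A"), ("old", "B"), ("A", "C"), ("B", "D")]
print(is_connected(battles))                    # rows kept
print(is_connected(battles, excluded={"old"}))  # rows removed
```

If the old rows are kept, the deprecated model still acts as a bridge even though it is no longer sampled.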
I like this, but instead of every friday, every other friday.
i mean it seems they could at some point just set up a pipeline that does it automatically
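For what it's worth, the "every friday, only if a model has x votes" rule is easy to sketch. Everything here is hypothetical: the threshold value, the function names, and the vote counts are placeholders, not anything the team has described.

```python
from datetime import date, timedelta

MIN_VOTES = 3000  # hypothetical value for the "x votes" threshold

def next_friday(today: date) -> date:
    """First Friday on or after `today` (Monday is weekday 0, Friday is 4)."""
    return today + timedelta(days=(4 - today.weekday()) % 7)

def models_for_update(vote_counts: dict[str, int], min_votes: int = MIN_VOTES) -> list[str]:
    """Models that have collected enough votes to be included in the update."""
    return sorted(name for name, votes in vote_counts.items() if votes >= min_votes)

# Invented vote counts for illustration
counts = {"model-a": 5200, "model-b": 900, "model-c": 3000}
print(next_friday(date(2025, 6, 4)), models_for_update(counts))
```

Switching to "every other friday" would only change the date helper; the vote threshold stays the same.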
Excuse me for saying this, but I’ve been following the messages in this chat for quite a while, and it’s very clear that you’re constantly defending the lmarena team, almost like you’re their lawyer. It’s honestly hard to believe. Come on, we’re just users, and it’s perfectly fair to ask for an update on the most prominent AI leaderboard, especially after major releases like Claude 4 and the new DeepSeek R1 update. I’m sure many people feel the same way.
because I think it is a good project (in determining some LLM abilities) and some critique is undeserved. I rather agree with technical critique, like the "leaderboard illusion", rather than critique from gamblers that want to know early what is going on only for themselves.
As some complain about leaderboard updates I can complain about their complaints.
I don't see why the complaints of the gamblers should be left alone.
First I felt weird now I laugh because it is too weird
Hi everyone, please allow us a little bit more time to update the leaderboard result. We've been going through a big UI/backend transition last week to the new website and the team is working super hard on finishing the new leaderboard pipeline. We want to make sure we get everything right. The result will be ready very soon! Thanks for your patience. 🙏
That guy who lost multiple thousands on Anthropic shares confidently bragging about how “20 cents is underpriced” had me dying, PolyMarket “traders” are like the final boss of degen gambling
the output speed certainly didn't change ~130 token. Stealth update on the API which is called 04-16 would be surprising but possible of course
it took Claude 10 days to finally get added to the leaderboard, R1 should take another week i think
a little surprised claude wasn't higher tbh
like did not expect higher than gemini ofc
Sentiment control might boost it a little higher, its very bland textual style hurts it a bit in the eyes of voters
@sinful falcon not surprised, it is the non-thinking version. It fails in so much logic, complex prompt following, and math against the reasoning models
probably they arent using reasoning
If not using thinking, possibly is the #1
I'd like to know the difference between different AIs in terms of web search functionality, and there doesn't seem to be an option for that
can we do that? Its really useful for users not only in text
as expected claude didn't get too high despite the style control (without it, it is barely within the top 10)
all versions of claude are dry af. If you use claude.ai it is not as dry. Likely it is intended.
for my personal tally I can recognize claude answers very quickly (though I cannot tell which claude model is) and it often loses. This when the questions aren't hard. On hard questions it performs a bit better.
They also have the lazy mindset.
"List all official German cities with less than 300k citizens and less than 6 letters."
and opus always puts out a couple and puts a note under it:
"... , so this represents just a selection of the more notable ones."
when you reprompt it to put out as many as it knows it is not bad but always needs two prompts.
other llms just do what you ask them to do
Claude 3.7 Sonnet Thinking only had an extra 7 elo points over regular Claude 3.7 Sonnet so
unless they’ve had a strong redesign of the thinking feature expecting such a large jump is a tad unrealistic
I was going to object but checking here you are right AND with the SC they have zero difference (unless I am blind)
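To put the 7 point gap in perspective, here is a minimal sketch assuming the scores follow the standard Elo logistic curve (base 10, scale 400). Whether the leaderboard scores map to win rates exactly this way is an assumption.

```python
def expected_score(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model on the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# A few gaps for comparison; 7 is the one mentioned above
for gap in (7, 50, 100):
    print(gap, round(expected_score(gap), 3))
```

On that curve a 7 point gap is about a 51% expected win rate, so close to a coin flip head to head.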
In Text-to-Image Arena, gpt-image-1 just surpassed imagen3
image-1 is the gpt4o image model?
Yes, this is the name in the API for it.
AI Supremacy
Improved in every single category and now in first place in every one. Also huge jumps of 50+ elo in languages (Chinese, French...). They didn't leave anything out. Also crushed it on aider. So even agentic coding seems to be #1 again.
oh right time to test translate
I feel opus thinking will atleast beat this in webdev, it is insanely expensive though
Why do they update the leaderboard just after the release of the new Google model, but it took them two weeks to update it when Claude launched their claude 4 series ?
cuz claude is not an anonymous model
good point. I can spot Claude 90 times out of 100 (though not its version. Either 3.5, 3.6, 3.7 or 4)
Though I'd prefer longer runs for every model just in case.
Shhhh... it's a mystery
Can we have a Svelte 5 mode in Web Arena? Most models can't use Runes properly yet. Adding it to the arena would incentivize companies to make their models better at Svelte, instead of just focusing on React.
basic arena when?
because google put their model on the site before the official unveiling
anyone tried out o3 pro yet?
5r/jane trading these markets now huh?
huh.
brookfield place is the name of the building those firms are in
ahh okay
shhhh
Were there any new models placed on the leaderboard with the update yesterday? as far as I see the only changes are just Google models dropping a few spots, o3 moved over 2.5 Pro 05-06, Opus 4 moved over 2.5 Flash 05-20, plus both GPT 4.1 and Grok 3 moved over 2.5 Flash 04-17
If they only do updates every 1-2 weeks it would be nice to have an up/down arrow for placement shifts and a mark for new models.
You can compare to previous Twitter posts of the leaderboards, but that appears to be the only snapshot feature
last one:
gemini-2.5-pro-preview-06-05: 1476.34 (18.59)
gemini-2.5-pro-preview-05-06: 1446.0 (9.0)
gemini-2.5-flash-preview-05-20: 1420.3 (12.46)
o3-2025-04-16: 1420.24 (5.41)
chatgpt-4o-latest-20250326: 1416.86 (4.79)
grok-3-preview-02-24: 1412.27 (4.93)
early-grok-3: 1410.43 (5.53)
gpt-4.5-preview-2025-02-27: 1406.12 (5.75)
llama-4-maverick-03-26-experimental: 1404.25 (6.74)
gemini-2.5-flash-preview-04-17: 1392.36 (5.85)
What about the Wayback Machine?
@drowsy needle you can delete messages from this @frosty osprey guy right
*also spammed in #prompt-to-leaderboard
Yes

The reason Google models rank so high on these lists is that they use RAW outputs, just like in AI Studio with safety filters off
Less censored, more detailed, just what people wanted
But if you use the web/mobile app, outputs are trash because of dumb safety filters and a dumb system prompt
Gemini being better in LMArena than in-app for the ppl who pay for it is kinda funny lowk
the lmb leaderboard has that hint: https://ktibow.github.io/lmb/
Why is the new R1 not on leaderboards yet?
Yea. This is the reason why so many people use AI Studio instead of the app
If their membership includes AI Studio in the future I'll gladly pay for it
best ai for image generation so far in lmarena?
depends on what I'm trying to do but overall I'm a fan of photon. check out the leaderboards here - https://lmarena.ai/leaderboard/text-to-image
Any plan to add (deep) research leaderboard? I'm sure it would be expensive but I know I'd love to give you guys my opinion on models 😂
Could the blank response rates (per provider, if applicable) be reported for each model on Web Arena? Blank responses are pretty annoying and I'm curious whether it's a function call or API issue. Either way, it might incentivize model makers/providers to fix it.
i would use it just to get free access to deep research models ngl
I do think many use lmarena already for that for some models
I do for some large testing - that is, to see how different models reply to the same query. Otherwise I should go the openrouter way
haven’t paid for an AI subscription since I found LMArena 4 months ago, lol
I apologize for this, but it’s not normal at all. Why do you always release the update of the leaderboard in the same hour when Google launches a new model, but not with the other releases?
Thanks for the question, totally fair to ask. We do sometimes coordinate leaderboard updates with model providers if they request it. This gives them a chance to celebrate how you, the community, responded to the model as part of their timed announcement to the public.
That option is open to everyone, but not all providers choose to use it. Many updates happen independently, based on when we’ve collected and validated enough new voting data, a process that usually takes about a week. We’re also exploring ways to make updates more frequent and automatic, so the process feels more consistent no matter who’s involved.
We appreciate your attention to fairness, as it’s something we also care deeply about.
Fair enough. Thank you so much for the transparency; I really appreciate it.
Hi! I was wondering if there is a way to access the historical leaderboards of LM Arena? I'm looking for previous rankings for academic citation purposes. Thanks in advance!
Sorry to say there is not within LMArena. You may find our blog helpful as there are some screenshots, here is an example: https://blog.lmarena.ai/blog/2025/two-year-celebration/
thanks for your reply
there's historical LB/elo ratings here https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main
(i played around with it a few months ago #leaderboards message , seems it's still being updated)
it's nice they maintain that as a public resource.. tho i'm doubtful we'll ever get raw Arena chat data again, at least in large volume.. (it's valuable.. and they've got investors behind them now aha)
they do refer to their past release of some of the chat data in that blog post.. so who knows.. i'd be happy to have my cynicism proven wrong aha
About the "i'm doubtful we'll ever get raw Arena chat data again": I wouldn't like to get the raw prompts and answers, as those could be used for benchmaxxing. Maybe those that are very old (2+ years) could be released.
What I would really like to see, for confirming the leaderboard and making alternative ones, would be at least the results of the votes. Those would be very interesting. See ... oh it is gone. So I created a request - dunno where it is now - to ask about the results of the battles, not the text, only:
model A
model B
result
so that one could independently verify that the leaderboard is correctly computed and also apply other rating systems and/or do other analyses.
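As a minimal sketch of what one could do with just those three columns: a plain online Elo pass over (model A, model B, result) rows. This is picked for brevity and is not necessarily the method the leaderboard uses; the `k` factor, the base rating, and the result labels ("a", "b", "tie") are assumptions for illustration.

```python
from collections import defaultdict

def elo_from_battles(battles, k=4.0, base=1000.0):
    """battles: iterable of (model_a, model_b, result), result in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: base)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, result in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model A on the standard Elo curve
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        delta = k * (score_for_a[result] - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)

# Invented rows in the requested format
battles = [("m1", "m2", "a"), ("m2", "m3", "tie"), ("m1", "m3", "a")]
ratings = elo_from_battles(battles)
print(sorted(ratings.items(), key=lambda item: -item[1]))
```

With the same rows anyone could also swap in another rating system and compare the resulting order against the published one.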
ah it is not gone, it is here: https://discord.com/channels/1340554757349179412/1372537524551159913
yeah that's prob a fair point re benchmaxxing (but i still think something should be released; like it's the public who's casting the votes.. but atm the data only goes to LMArena and the providers who serve models, at least partially.. doesn't feel ideal but yeah i do hear your point)
yeah for that I say: release with a delay (2+ years). Sooner or later it will all be released, and models cannot benchmaxx too quickly.
On another side, the more they collect, the more their data become actually a valuable dataset and they can fund themselves selling it. That would be ok too for me. Only I really wish they could release the result of the votes, that is already one way to verify the system.
thank you for sharing, I wasn't aware this was available, good to know. 
Not sure if tables are a part of "style control", but if not, I’d definitely recommend including them. Feel like o3 is so aggressive with them, so it could be a big factor
Hi, I wonder how to participate in the maintenance of lmarena. is there any official manner?
participate in the maintenance of lmarena
Sorry to say I'm not following, would you mind elaborating a bit further?
A lot of it is open source, a while back I just made a PR and they reviewed it.
Hello. I'm doing some research based on LMArena and I wonder which part of data is included in the leaderboard/Arena Overview table (In the screenshot).
- Does it include only text or also multimodal chats and text2image?
- Does it include data from other arenas such as Copilot Arena?
only text based, as soon as you enter an image it will only be part of the vision leaderboard and the same goes for the other categories
*as far as i know
Thanks very much!
Hi, @upbeat swift I hope to contribute to the open-source proj of lmarena, is there any?
hello, I'm here to check out this arena
Imagine if there was a code interpreter built right into LMArena... coding rankings would be far more reliable. There are already open source libraries for this.
See LiveCodes (MIT licensed), which runs entirely client-side and supports 90+ languages & frameworks: https://livecodes.io/docs
running the interpreter for every question would cost a bit I'd imagine
The one I linked is embeddable and runs entirely in the client's browser, no additional server-side costs. 🙂
Should we be expecting Claude 4 models with thinking on the leaderboards at some point?
they are stubborn about not using emojis, not listing, not using any graphs, not giving detailed answers, soo
even though claude's answers are really good, people still care about those things
How does that affect whether or not they appear on the leaderboards. Non-thinking Claude 4 does.
“hey this model has been in the arena and named for a while now, will we see it on the leaderboard soon?”
“it doesn’t use emojis”
thank u for that brilliant insight 😭
To answer your question Brian, Claude 4 thinking models will likely be included in the next leaderboard update! a bunch of new models have been added in the last week, I believe we’ll see an update in the next few days, or when Gemini 2.5 Pro Deep Think releases 👍
I think it’s useful to get more clarity on how often/when leaderboards get updated
It would be hugely useful to get something akin to a weekly update and additional updates when there are new models releasing that are coordinated with the lmarena team
Well, I just misunderstood the question. Is there a reason to be rude? You guys are more polite to AI than to humans, which is kinda dystopic, not gonna lie. Anyway, claude wrote a whole paper about how human feedback turns models lame, and they talked about exactly my points like style, using graphs or emojis. So yes
thanks!
Currently, we do update our leaderboards about once a week. That being the case we are looking into making these updates more frequent.
Love to hear it, look forward to this week’s update 🙂
the fact AI is becoming smarter than humans like you is even more dystopic, if you ask me 😭
Claude wrote a whole paper about human feedback turning models “lame” and they talked exactly about your points? That’s so cool! Was the original question asking about leaderboards, or Claude emoji usage?
I’ll give you a hint, we are currently in the #leaderboards channel 😄
@drowsy needle Was 0605 name changed to 2.5pro? How does 2.5pro have 10,000+ votes otherwise?!
Ai currently is smarter than a small % of humans. Yet no one cares. Why will anyone care when that % is 10, 20, 90 or 99?
i think AI is smarter than any person not an expert in their field tbh
already
i'd rather have AI do my homework for an upper level college class than a random person w that major
It will be in stages
Smarter than avg person, bachelor student, entry level, master student, senior etc at some point or another everyone will be dumber than ai. Then there will be no more homework or work
i think already better than all but senior level
if it was trained on case law i think it would be better than most lawyers
but it is 100% better than any junior associate
Needs more scaffolding to be honest. They are useless when they are offline
Updating a model to a new cutoff date takes a long time
Eh i think its very smart at 5 min tasks, but not much else. I consider it smarter than 1-5% of people
However i have no doubt google or openai have some monster agi hidden in their labs somewhere
ok but like a good lawyer still has to search cases
as a total agent sure
That's why they need more tools or they will hallucinate like crazy
Most models are very bad at long term stuff even very simple ones, like simple games
Pokemon
O3-pro i think was somewhat better than the rest but takes forever to run
Generally speaking ..
Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?
eh, most complicated reasoning tasks it still sucks at
AI is genuinely AWFUL at poker
so are most people though
the average poker player could beat the smartest LLM, easily
game theory in general is a strong weak spot of even the “reasoning” LLMs
is there a need for free hostility like this? <@&1349916362595635286> could we be pointlessly hostile in this discord?
Yeah overall agree that folks should be able to get points across without being disrespectful. I'll followup.
this has already been resolved ☺️
don’t think the mods need u to do their job for them pier :p
lets just move on, no need rehash things
Would be cool to know if Gemini 2.5 pro is its own new model or a rename seeing as 06-05 is gone 👍
I noticed both were on leaderboard last night, but a couple hours later 06-05 was removed
will check in on this and keep you updated
Cool, thank you very much!
gemini 2.5 pro is gemini 06 05
the change is reflected on every platform everywhere, gemini 2.5 pro is GA
06-05 is still in webdev arena leaderboard, and shows on google's model page as different versions. What makes you think it is the exact same model? Not saying you're wrong - I think it likely is, just curious
I do see its gone from the ai studio drop down tho
Because it's the GA version
Introducing the Gemini 2.5 model family:

- Gemini 2.5 Pro (Stable, no changes from 06-05)
- Gemini 2.5 Flash (Stable, updated pricing from 05-20)
- Gemini 2.5 Flash-Lite (Preview, small reasoning model)

More info in 🧵
Ah makes sense, thank you. I did see this thread but missed the part calling 06-05 variant the new stable model and implied name change
This is what I was looking to see, thank you!
ughhhhhh
Looks like you already know, but yeah.
qwen3-235b-a22b-no-thinking got 1408 on hard prompts, while the reasoning-enabled one got 1387, and the difference is significant.
hello
they are pretty bad at anything game related, even on a basic level
they can pick up patterns really well in text, but games work on a different format, maybe thats part of the reason
also games have multiple steps too
yeah, multi-step reasoning is kinda complicated for them since if one underlying step gets messed up everything else flops
like the prompt “You and 99 other players each privately choose a number between 0 and 100. The winner is whoever gets closest to exactly 2/3 of the average of all submitted numbers. What number should you choose and why? Walk through your complete reasoning process.” I came up with to see how far they would take the logic
I got an answer of 0 because it kept recursively calculating “2/3 of this average is 33.3, 2/3 of 33 is 22, 2/3 of 22 is…”
flamesong was the only one to correctly intuit the concept, and guessed 15
could be that the thinking led them to "overthink" and thus give slightly worse answers (just speculation here but I could imagine that)
it rambled a lot more
To be fair that's the legitimate answer in a zero sum game, since everyone would tie if everyone put 0.
Nash Equilibrium etc etc.
and if you did this test where all of the "players" were different multi step reasoning ai's then flamesong technically got last, because they were the furthest away from the answer.
yeah, but it’s kinda obvious that not every player is a game theory optimal player
proof of the lack of multi-level reasoning, only seeing the Nash Equilibrium instead of thinking about other player’s non-perfect strategies
maybe try specifying that their opponents are human.
interesting, stonebloom guessed 18.6 and Claude Opus 4 guessed 13
improvement I guess? ¯_(ツ)_/¯
Likely if you prompt only this, they reference discussion that they have seen in their training and the answer is zero.
Most of discussions online don't cover imperfect strategies.
The data shows that the correct answer is not 15, it varies (the more the experience, the more it goes to zero): https://en.wikipedia.org/wiki/Guess_2/3_of_the_average
In game theory, "guess 2/3 of the average" is a game where players simultaneously select a real number between 0 and 100, inclusive. The winner of the game is the player(s) who select a number closest to 2/3 of the average of numbers chosen by all players.
For me it seems obvious that the models are likely very influenced by the usual "the Nash equilibrium is zero".
honestly 0 is the right answer considering that most people who ask that question without any context are likely talking about standard game theory
and not some "Ahm actually 🤓 , no one specified that the players are rational or common knowledge of rationality exists"
and beyond that there is no "solution" to this question, the best thing you can do is make an educated guess about the other players (rationality, experience, if the game will be repeated ...)
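The "educated guess about the other players" part can be sketched with a simple level-k model: level 0 players guess 50 on average, and each higher level takes 2/3 of the level below. This is one common way to frame the game, not the only one, and the mix of levels below is invented for illustration. It also ignores your own small effect on the average.

```python
def level_k_guess(k: int, level0_mean: float = 50.0, factor: float = 2.0 / 3.0) -> float:
    """Guess of a player who reasons k steps: 50, 33.3, 22.2, ... toward 0."""
    return level0_mean * factor ** k

def best_guess(level_shares: dict[int, float]) -> float:
    """2/3 of the expected average when opponents are a mix of level-k thinkers."""
    average = sum(share * level_k_guess(k) for k, share in level_shares.items())
    return (2.0 / 3.0) * average

# Hypothetical population: mostly one or two steps of reasoning
shares = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}
print([round(level_k_guess(k), 1) for k in range(6)])
print(round(best_guess(shares), 1))
```

As k grows the guess goes to 0, which is the Nash answer; with a mixed population like the one above the best guess lands somewhere in the teens or twenties, depending entirely on the assumed shares.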
Interesting one. If I add "Note that you're playing with humans who are not always rational", models give a different answer. The GPT series prefers numbers around 22.
BTW Current deepseek v3 (not r1) on the official site gives me a long COT (just without <think>). I wonder how much data from r1 did they use to train v3. v3 answers 8, r1 answers 20 after a long thinking process, kimi k1.5 just doesn't stop thinking for at least 10 minutes, and finally answers 22.
This triggers long thinking contents in qwen3, kimi and deepseek (all reasoning). I wonder if it's the model's fault or it is too hard to make a decision

is o3-pro in the leaderboards?
battle
is grok 4 in the arena yet?
Visibly no, otherwise yes
guys can someone let me know what they think about XAI being first in lmarena on idk say 10AM EST last day of the month?
🥺👉👈
wdym?? so its been released under an anonymous name?
The ai space is moving so quickly right now, that no one answer would be acceptable, because somehow every big name at the front of the ai game has something that the people REALLY want to see (Gemini 2.5 Pro deep think, Grok 4, Claude 4.5?, Deepseek (delayed theirs) and some others I can't name from the top of my head)
GPT-5 not in the top of your head lol
qwen will take the prize, trust
A couple weeks ago I asked about sonnet and opus 4 thinking on the leaderboards. Does anyone know if this is still planned, or blocked somehow? From the original X thread it looked like the Sonnet and Opus results were for non-thinking.
Hello - we had a similar question pop up in #general. There is a plan to update the leaderboard soon. Unfortunately, there was an issue preventing it from appearing properly, but we do have a plan to fix this.
Thanks!
pretty sure it’s wolfstrike
what do u think of wolfstrike?
it’s pretty strong - may not be a version of Grok due to its low charisma in communication, but is consistently one of the top performers in complex reasoning tasks
wolfstride/stonebloom are the checkpoints of the same Google model
we still don't know which one: 2.5 Pro-next, 2.5 Ultra or even early 3.0 Pro checkpoints
Wait, is that a typo
or is there a model called Wolfstrike?
I only got wolfstride
yeah, a typo
hey everyone
idk wolfstride feeling like a gemma model lol
no, it has impressive world knowledge
lowk it's probably just a pseudo name for a host of different models
When can we expect to see a leaderboard update? Rather new here/not familiar with how it works. I see some are 5 days ago some 50 ago
Should be seeing updates soon! It takes a bit of time to collect votes but updates are coming.
I asked one question and one of the model is grok4, apparently it is not good enough.
100%
wonder if blacktooth was another checkpoint? it came before, but seemed decently similar to stonebloom/wolfstride
I can’t see kimi k2 ?!
you can't? are you sure you have chat selected and not image?
I mean in the leaderboard ?!
oh, that makes more sense lol! Leaderboards take a bit of time to update for newly added models.
Grok 4 is finally shown in the WebDev leaderboard. Seems kinda low-ish. Does that make sense since it needs more votes?
Lmao
it's terrible at coding, and even worse at front-end coding
Won 4/4 on my end
0/3 Grok on my end. Ping Pong game was buggy. pomodoro timer looked worse/less features. Sticky Note app didn't load. Why is it so bad at coding?
They have a separate grok coding model but haven't released it
Then what exactly is grok 4 good at
Math and Physics?
Is elon just trained mechah*ler with his spacex data ?
What makes grok 4 special
@glacial glacier what's with the "estimated based on other leaderboards" trick? Sounds like a neat idea if it infers the score using other somewhat reliable benchmarks (livebench and others)
ha, that's a better idea than what i'm doing... currently using another platform that also has an Elo-based leaderboard when the scores seem reasonable, since it's easier to implement tho
Coming back to this topic. Sonnet 4 Thinking significantly outperformed Sonnet 4 normal, while Opus 4 modes are within the margin of error.
What went wrong actually on Qwen's end? 🤔
ok a new one. I think they are doing more or less like lmarena (just testing) but with logins and - why not - farming data. So yes it makes them similar to lmarena, only less open/scientific.
I think that as estimate it will work very well rather than trying to find a relationship between multiple benchs and lmarena (though that would be a nice project)
good question, have you searched about it if someone already did some analysis? (maybe using some deep search as search helper)
That is: why thinking models do not necessarily outperform base models
I checked around about yupp.ai. Incredibly it is really like lmarena, only their business (so far) is really to collect data. So rather than having openAI, meta and what not collect data for further training, they say "hey come to us and let us collect your prompt and preferences" (both are helpful)
What I find amusing (and sad) is that projects like lmarena, that try to be as transparent as they can (they cannot be too transparent otherwise no funding), get all the flak. Then comes around the next project that is totally closed and 100% I guess it won't be questioned.
But I welcome lmarena "replicas" so to speak. If multiple arenas more or less return the same results, at the end the findings are validated. The problem of yupp (or also sciarena) is how they evaluate the pairings and the prompts. Different rating systems could yield different results, without even considering the fact that some votes may be ignored as deemed unhelpful. (hence my request of results so that the community can independently create different leaderboard analyses: #1372537524551159913 message )
My experience is that it is very complementary and useful in coding. I use chorus with ~10 models as my daily driver for all my queries. And it is very often that grok 4 is the only model out of 10 that got simple things right and didn't hallucinate
when do the scores update
Inspired by LMArena - We've developed open source chessarena AI leaderboard.
august
can I try it
also the text and icons look too small imo
also how old is that ss?
looks like its from late 2024 judging by the models
Nice. I didn't check your repo yet, but 3 checkmates in 90 matches on 4o mini? that's 3.3% 🤔 In my matchups it had 7/27 = 26% checkmates. Probably completely different methodology, but you might then also be interested in these findings: https://dubesor.de/chess/chess-leaderboard
oh wow. I didn't notice that gpt 3.5 turbo (not even the instruct version) defeated claude sonnet 4
And against 2.5 pro, claude 4 opus, kimi k2, gpt 4.1, 4o, etc (just did a bunch more non-instruct matches against highest rated opponents). - funnily, it drew against lfm 7b though.
Hi, is there a place to view the latest leaderboards for Humanity's Last Exam, FrontierMath, etc
Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!
idk about the rest of you, but when i was using search arena perplexity's stuff was always the worst, not really sure whats the point of the 18 billion dollar company, since i thought it was search : P
wrapper scam
when leaderboard updates?
What benchmarks are you guys using rn other than LMArena?
I used to rely on Livebench a lot but they're garbage now
soon
I trust livebench for language benchmarks
But I definitely don't trust them for code benchmarks
I still can't understand how they measure coding abilities
Anybody know why the huggingface website no longer matches lmarena.ai text arena?
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
https://lmarena.ai/leaderboard/text
thank you!
Hello, guys, I find this ranking a little strange. I wonder how this leaderboard determines the ranking and whether 2 models have the same ranking. Why is 1 1 followed by 2, 1 1 2 3 3 by 5 and then 1 1 2 3 3 5 6 6 6 6 6 6 by 10?
confidence intervals 🥱
but for real it is a little unintuitive
it basically means "how many models are we 95% sure that this model is worse than"
if two models have the same rank, then statistically they are around the same level of capability
hello

But if a is close to b and b is close to c, couldn’t there be a chain that extends from 1400 elo to 1100 and thus all models ranked 1?
what do you mean
the rank is literally "how many models are we 95% sure that this model is worse than"
the old site said that, don't think the new one does
here's a visualization of the CIs
I mean couldn’t there be a situation where all models are tied but elo difference between best and worst is 500
i will repeat:
the rank is how many models are we 95% sure that this model is worse than
Ah I see
It’s interesting that the ranks are not increasing
It seems like it should be + 1 , but then that definition works for all ranks except rank 1
o3 pro ?
it's estimated based on external data
what do you mean not increasing?
can someone explain why the confidence interval has been increasing and not decreasing with more votes?
this seems not intuitive at all
Like this
ah right, so what's happening here is that while o1's estimated rating is higher than o4-mini's, o4-mini's confidence interval is larger. That larger CI means it could be even higher than o1, so there are fewer models that we are very certain are above it.
The rank number is essentially: "how many models are above your upper confidence interval?", and if your confidence interval reaches higher, then fewer models are clearly above you.
Definitely a little counter-intuitive, it's one of the features which makes it tricky to rank items which have different levels of variability. Should you rank by their mean or by their upper bound? Right now it's by upper bound.
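A minimal sketch of the ranking rule described above, assuming "clearly above" means the other model's lower bound sits above this model's upper bound, with +1 so the best rank is 1. The model names and interval values are made up for illustration and are not real leaderboard data.

```python
from typing import Dict, Tuple


def ci_rank(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Rank = 1 + number of models whose lower bound exceeds this model's upper bound."""
    ranks = {}
    for name, (_, upper) in intervals.items():
        clearly_better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Hypothetical (lower, upper) 95% bounds.
example = {
    "model-a": (1440.0, 1460.0),
    "model-b": (1435.0, 1455.0),
    "model-c": (1400.0, 1420.0),  # tight interval, midpoint 1410
    "model-d": (1370.0, 1445.0),  # lower midpoint (1407.5) but a wide interval
}
ranks = ci_rank(example)
print(ranks)
```

Here model-d has a lower midpoint than model-c but still gets rank 1, because no other model's lower bound clears its upper bound. That is the "ranks are not increasing" effect being discussed.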
That’s like, kinda stupid
the whole point of LMArena is that it runs on its own dataset
the whole point of my project is to add things that aren't normally in the leaderboard table
SHAIR WALLI
Hello, just had a question how long after a new model is dropped it takes for the leaderboard to update?
@glacial glacier
4 to 15 days after the model is dropped into lm arena
so if gpt5 dropped tomorrow it wouldn't be on the leaderboard within a day?
no
unless they run it cloaked for a while
some models are run cloaked on the arena and they are announced the same day when they go public on the arena.
some others they are first announced and only then they go cloaked on the arena
some other models aren't processed by lmarena at all (so far) but there are other arenas around
like yupp.ai
have any previous openAI models been run cloaked?
On openrouter 4.1 and the mini version at least
quasar alpha, I think was one name
so all the talk this zenath model is gpt5 is just rumors / fake?
No, it is at least probable that it is GPT5
Not 100 percent but probable given the performance
I am not an expert though
or an insider
many yes, in lmarena at least.
practically all models that are already ranked were cloaked at some point in the past
so lets say zenath is gpt5, if they release it on july 31st. It can and will be updated on the leaderboard right away cause its being ranked in cloak rn right
yes. if model A is cloaked and it is gpt5 (or gemini 3, or grok 5 or what you want) and the provider decides to uncloak it only after their announcement, then it goes public shortly after the announcement by openAI, provided it has enough votes to keep the CI low.
and also provided that the vendor is ok with making the model public
I see thanks guys
for example gemini in may was public on the arena after more or less 1 week. Claude 4 after 2 weeks
and when they announced it, immediate update on leaderboard?
how trivial is the amount of votes
from 3k to 8k it depends
it will be announced in #announcements , on twitter and on the leaderboard
(I'd rather see models with low CIs in every category rather than only in overall)
also I strongly prefer this leaderboard built up on the values of lmarena: https://ktibow.github.io/lmb/
thanks guys
it's impossible to have lower CIs in every category compared to overall, each category is a subset of the data, and the less data you have the larger the CIs are just by definition
you either misread me or I worded myself poorly.
I meant I wish they would uncloak models only when their CI across all categories would be small enough, rather than uncloak them when the CI in overall is small, but some categories may still have (relatively) large CIs.
actually I was curious if what I wrote was so ambiguous. I let an LLM analyze what I wrote and for the LLM it was clear enough.
ohhh I see what you mean. I think a result of this would be that a lot of models would just never get uncloaked since the lower volume of data in the specific categories grows so slowly the CIs would shrink very slowly.
You can simulate this yourself by downloading the leaderboard, filtering out all the models with CI width above your desired threshold, and then re-counting the number of models above each.
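The simulation suggested above could look roughly like this. It is a sketch under assumptions: the rows are hard-coded rather than read from a real download, and the field names `model`, `lower`, and `upper` are hypothetical, not the actual leaderboard schema.

```python
def rerank_after_filter(rows, max_width):
    """Drop models whose CI is wider than max_width, then re-count ranks."""
    kept = [r for r in rows if r["upper"] - r["lower"] <= max_width]
    result = {}
    for r in kept:
        # Count remaining models whose lower bound is above this model's upper bound.
        above = sum(1 for o in kept if o["lower"] > r["upper"])
        result[r["model"]] = 1 + above
    return result


# Made-up rows for illustration only.
rows = [
    {"model": "a", "lower": 1440.0, "upper": 1460.0},
    {"model": "b", "lower": 1400.0, "upper": 1420.0},
    {"model": "c", "lower": 1370.0, "upper": 1445.0},
    {"model": "d", "lower": 1300.0, "upper": 1390.0},
]
print(rerank_after_filter(rows, max_width=100))  # keeps everyone
print(rerank_after_filter(rows, max_width=30))   # keeps only the tight intervals
```

With the strict threshold, the two wide-interval models simply disappear from the table, which is the "never get uncloaked" outcome mentioned above.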
yes, correct. One problem I have been testing since around May is that many, many models that are in the leaderboard don't get tested anymore, and that's suboptimal for several reasons, but the big two I can see are
(a) it creates selective pairings, which is a known problem for Elo-based rankings. Hence I hope lmarena will release the results - not the contents - of each vote so that people can verify the rankings and check for pitfalls.
(b) it doesn't assess models properly because their CIs are still relatively large for many categories (besides overall). It is likely that there are several types of human judges, who would judge a type of query differently. Let's say there are 10 types of human judges (there could be many more actually). If every category is split into 10 subcategories, one needs a lot of votes to achieve a stable result (some models likely reached that, but only some)
So yes, I am aware that some models would never be uncloaked, but it would actually be better if models were tested a bit more, especially during lulls when there aren't many cloaked models around.
On this I have two points in the feedback, one moment
aggressive pairing to only cloaked models: #1367825140888637506 message
request for the results of the battles: #1372537524551159913 message
from the internet about selective pairings
"The Elo system calculates a player's rating based on their performance against other players. It assumes that over time, a player will compete against a variety of opponents across the skill spectrum. The rating difference between two players is used to predict the outcome of a match between them.
When pairings are selective (e.g., only strong players play strong players, weak players only play weak players, or certain players are deliberately avoided), the player doesn't encounter a representative sample of the overall player pool."
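For reference, the quoted description corresponds to the standard Elo expected-score and update formulas, sketched below. The K-factor of 32 is an arbitrary assumption for the example. The point relevant to selective pairings is that a rating only moves through games that are actually played, so a model that is rarely paired keeps whatever rating its earlier opponents gave it.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b); score_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


print(expected_score(1500, 1500))
print(elo_update(1500, 1500, 1.0))
```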
<@&1349916362595635286> Will the search and copilot arenas ever be brought back?
what do you mean brought back? they're already on https://lmarena.ai/leaderboard
also try to not ping for small things like these
well search data gathering has restarted so i expect an eventual update
not sure on copilot
Kk thanks
Hello. Here to test video creation tools. Special interest in video with sound, as with Veo 3.
Glad to hear it! Be sure to check out #1397655624103493813 for info on how to use the bot. Don't hesitate to reach out if you have any questions!
why there's so many greens appearing suddenly
the Video Arena most likely
You'll want to use the video-arena channels. Learn more in #1397655624103493813
Is it true ChatGPT 5 was available on the Arena under a code name?
its removed
HOW USE THIS DISCORD
check out #1397655624103493813
i'm going to clean up this channel
okay, cleaned up
so uhh
nice leaderboard huh
Qwen being in top 3 is insane
O3's close score to gemini always gives me some melancholy
insane
I just came across velocilux, surprisingly good results on my end.
Anyone knows more about it or cogitolux? Could they be related to cresylux?
Wondering if it’s part of the same family or just similar naming.
Any info appreciated!
Now it makes sense why Anthropic was crying for the government to stop giving GPUs to China, imagine if China had the GPUs
instruct also scored higher on my bench than thinking. thinking doesn't scale well, and it also doesn't guarantee better responses in all situations. It can actively hurt instruction following and overthink past established solutions. It can also introduce unwanted factors such as overcautious prompt risk analysis
Wow, this model looks incredibly powerful! Never seen this before: GLM-4.5
I do think that HW constraints are pushing other teams (in this case Chinese) to optimize, while the teams with more resources try less optimization and more of a "let's see what sticks" approach.
I mean that is often the case in any activity where some groups have a lot of resources and others have fewer.
GLM had models before but not as strong
could we make a landing channel rather than derailing this? @drowsy needle @glacial glacier
Yeah will look into. Good idea
guys is there a video leaderboard by lmarena ?
We are working on it!
Ooh okeee ty
That's TBD - would encourage you to share feedback regarding Video Arena here #bot-feedback
Just wondering about the leaderboard, looks like 43 models were purged (total models) number was changed/decreased, and also, the total votes have increased by 400,000 or so? Could someone help explain what change to the leaderboard happened?
Ok so
Can someone tell me how good OSS is
I genuinely can’t keep track of all the news
they're doing great, qwen, kimi, deepseek, and glm have been reaching for the top of leaderboards and openai just released "gpt-oss", a o4-mini-like open model
When's the next arena update?
as in leaderboard update?
Yes
Soon
This thing
^ ok
Is it useful to vote on the video-arena? Is it just for testing a bit, or will there be a leaderboard made (is it even possible to make one with the current system)?
By the way, given that there is no leaderboard yet, after the two first votes, and the model reveal appears after two votes, do the third and fourth count?
Last question, have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit, on random previous generations appearing?
It seems very promising - but costly - glad you made one though!
Is it useful to vote on the video-arena?
Yes, we are planning to build a leaderboard. But yeah seeing this is a very different method we're in the process of validating the data to ensure we feel good about the leaderboard once shared.
after two votes, do the third and fourth count?
Yup.
have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit
Yeah that's possible, we have been considering a variety of ways to grant gen credits; however, the balance we need to achieve is encouraging votes, but only high-quality ones that aren't cast just for the sake of getting some kind of benefit.
But yeah all that to say this is an experiment!
after two votes, do the third and fourth count?
Need to correct myself on this one -> we are not counting votes after a model's name has been revealed @hallow iris
veo-3 audio is too recognizable. I'm not saying the audio doesn't matter, but wouldn't it be a good thing to automatically remove the audio from veo-3 clips while voting, if the prompt doesn't reference keywords like "saying", "sound", or "audio", and then update the mp4 file to show the version with audio after the vote?
I really feel like the audio is biasing because it reinforces the feeling that an image is well generated if the audio is matching. There's no doubt veo-3 will be the leader, at the moment no model is as consistent, but still...
I've had moments when I thought "Oh, it must be Veo-3", while it wasn't, when voting, but from the moment there is audio, I know it's veo-3...
And I'm pretty sure that in some rare cases people will say Veo-3 is better when it's not, just because it has the audio they want so much to see. Some people are using lmarena specifically to get a veo-3 result; the fact that so many prompts mention audio is a dead giveaway, even more so when they do image to video with a personal brand, restaurant, or service of theirs
Veo-3 has some flaws, I've been voting probably a bit more than a hundred times, and sometimes it doesn't respect the prompt, especially for specific, long prompts. This can also be an issue. Some prompts are annoying to read, and I can bet people wouldn't read it and just focus on video quality
Long prompts take sometimes more than a minute to read 🤣 💀
I've been myself voting for some things without totally reading the prompt, and afterwards noticing that despite video quality, a model I voted against actually respected the prompt better.
a lot of people have said that but imo it's fine since we test the no audio versions already
you can just ignore the top 2 rows if you think audio biases it
This one is interesting though
This is why there is the Author Vote category. With discord voting, now there are votes other than the person who wrote the prompt. If you want to see what the leaderboard is with only the votes of the prompt authors you can look at the Author Vote category. There is less data available though so the confidence intervals are much larger. What do you think?
when is gpt 5 gonna be added?
Then, when it is announced
hello, i want to know me about ai
/video
use the command on #video-arena-1 ,#video-arena-2 or #video-arena-3
bruh
Will we see GPT-5 with reasoning on leaderboard ?
I dont think its fair to just put GPT-5 there when enabling thinking unlocks so much more performance
GPT-5 comes default with reasoning I believe
You can choose the reasoning low-medium-high in the api
If i had to guess the one in the arena has low or medium
if the only difference is reasoning length I think it's fine like that
Actually turns out the gpt-5 in arena is the one with high reasoning
Good
So where is Qwen-Image in leaderboard?
Is leaderboard out?
We're currently experiencing an outage which is affecting this, so yes.
is Veo 3 available?
veo 3 is on the leaderboards ✅
Yup, #1397655624103493813 has more info
Is the gpt-5 on the leaderboard thinking model or non thinking?
Thinking-High
is imagen-4-ultra still around on lmarena?
haven't gotten it in a while
nevermind just got it wow
it had been a LONG while
Yes it is.
Oh. Well..
it's really weirdly rare
I've gotten imagen 4 normal multiple times before and after that
About what, may I ask?
I was trying to see which models could generate characters with six fingers on each hand
Most models can do that without the instructions 😂
nah that was like, 2023
Did you try out imagen 2 flash? Man it sucks
don't think so
mb I meant gemini 2.0 flash img generator
oh yeah was gonna say "cant be worse than gemini 2.0 flash"
Hey Varka, do you think Google will reign at the top in the end?
Cause I think so
They got Gemini 3 and Genie 3 on the way, Imagen 4, Veo 3, and Lyria 3.
In terms of image generation? Maybe. But imagen ultra 4 can't do cyclopes weirdly enough.
Might be because it hasn't had enough training data on that
After all, you cant let it create something which it has no experience on. It will mess it up.
how long until veo 3 is overtaken?
By what?
does any model other than veo support sound right now?
kling 2.1, I think
that's the other half of the question
At least on their website
maybe another month
will gpt5 be added to search arena?
https://x.com/lmarena_ai/status/1954950300558823510 i wonder what anthropic's up to
(you can see the previous version of opus just below qwen 235b)
this sentence feels so weird to say lol
hello
@drowsy needle Could you (or someone else) kindly help me understand the “Remove Style Control” Rankings for text? GPT-5 is top of the leaderboard for every category, except for Creative Writing. Yet despite that performance, it still trails Gemini overall. How is that even possible?
I also didn’t count foreign languages, but i feel like those shouldn’t impact rankings
Style control or non-style control isn't a category per-se. It's a different method for aggregating the votes into a leaderboard. Each category is a subset of the full dataset of votes, and for each category we compute the ranking with style control and without. We set the default to be with style control.
The difference is that in style control, it takes into account things like number of lists, markdown headers, and bold text sections. It's been found these elements impact voters a lot and some model companies over-optimize for these elements. The style control measures the strength of each model as if all style elements were held equal. So models which use a lot of lists and bold end up lower. The method is described here: https://news.lmarena.ai/style-control/
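As a rough illustration of the idea (not LMArena's actual code), style control can be sketched as a logistic regression with one extra term for the style difference between the two responses. Everything here is a toy assumption: two hypothetical models, "fancy" and "plain", with equal underlying strength; a single made-up feature `style_diff` (+1 when the fancy model's answer has more formatting, -1 when it has less); and voters who pick the more formatted answer 75% of the time. The real feature set and normalization are in the linked post.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, use_style: bool, steps: int = 3000, lr: float = 0.5):
    """Gradient ascent on the log-likelihood.

    rows: (style_diff, fancy_won, count) tuples.
    Returns (gap, beta): gap is strength(fancy) - strength(plain),
    beta is the weight voters put on the style difference.
    """
    gap, beta = 0.0, 0.0
    total = sum(n for _, _, n in rows)
    for _ in range(steps):
        g_gap = g_beta = 0.0
        for style_diff, fancy_won, n in rows:
            p = sigmoid(gap + (beta * style_diff if use_style else 0.0))
            err = (1.0 if fancy_won else 0.0) - p
            g_gap += n * err
            g_beta += n * err * style_diff
        gap += lr * g_gap / total
        if use_style:
            beta += lr * g_beta / total
    return gap, beta


# Toy counts: the fancy model is the more formatted one in 300 of 400 battles.
rows = [
    (+1, True, 225), (+1, False, 75),   # fancy more formatted: wins 75%
    (-1, True, 25), (-1, False, 75),    # fancy less formatted: wins 25%
]
gap_raw, _ = fit(rows, use_style=False)
gap_controlled, beta = fit(rows, use_style=True)
print(gap_raw, gap_controlled, beta)
```

With these toy counts the uncontrolled fit gives the fancy model a gap of about 0.51 (it wins 62.5% of battles overall), while the controlled fit puts the gap near 0 and moves the preference into `beta`.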
Ah gotcha, that makes sense! Thanks for sharing that link, I’m gonna read through
Still, the overall score confuses me, since GPT-5 is leading in the categories with the most votes, but somehow is still behind Gemini overall (for style control off). Which makes me wonder - when you calculate overall score, do you weight each category and add them all up?
actually also GPT-5 is not winning all categories without style control:
In Chinese, German, Russian, Japanese, and Korean Gemini beats GPT-5, in Spanish Gemini is ahead by 75 points!
Also the list of categories is not exhaustive, and also not mutually exclusive.
For example if a prompt is in German, and asks for code for partial differential equations, it might be tagged as German, Coding, and Math categories and count for all 3. But in the overall, it is only considered once
the overall is not an average of the categories, it is just computing the rank using all data, nothing filtered out
There can also be some prompts which get no category tags! People do all sorts of things that don't fit well into buckets, those votes would still influence the overall leaderboard but would not influence any of the category rankings
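A small sketch of the overall-versus-category relationship described above. The vote records and tag names are hypothetical; the only point is that categories are overlapping filters on the same vote set, so category counts do not add up to the overall count.

```python
# Hypothetical votes; each vote can carry zero, one, or several category tags.
votes = [
    {"id": 1, "tags": {"german", "coding", "math"}},  # counts in three categories
    {"id": 2, "tags": {"coding"}},
    {"id": 3, "tags": set()},                          # untagged: overall only
    {"id": 4, "tags": {"creative_writing"}},
]


def category_subset(all_votes, category):
    """A category leaderboard is computed from the votes carrying that tag."""
    return [v for v in all_votes if category in v["tags"]]


overall = votes  # the overall leaderboard uses every vote exactly once
counts = {
    c: len(category_subset(votes, c))
    for c in ("german", "coding", "math", "creative_writing")
}
print(len(overall), counts)
```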
Wow that makes things so much clearer, thanks Clayton!
No problem!
Does anyone know why the code seems to imply that the reference model would be scaled to a 1114 rating, but this is not actually the case on the live ratings? https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/monitor/rating_systems.py
I'm trying to understand to what extent we can compare rating improvement over time
huh, didn't expect that
yeah claude can't handle equations with decimals but somehow does well in coding and agentic use? idk what's up anymore
i'd bet that code isn't used anymore
there have been changes to the leaderboard recently, but that file hasn't been touched in a while
Hello, can I ask you a few questions in the dm?
It’s about leaderboards and rankings
Big agreed
That reminds me - i did have another question. Are the leaderboard scores cumulative? OpenAI 4o + o3 and Google Gemini 2.5 pro all have around ~30,000 votes each. Sometimes companies update existing models. Do new votes count more than old votes? If not, how are these updates adequately captured?
@drowsy needle
Sure. @jaunty cave I’m curious how likely it is for gpt5 to beat Gemini without style control this month.
I'm happy to answer questions and talk about the leaderboard and how it works, but I don't engage in speculation about future results
Good question! If a provider changes the model behind an endpoint without announcing it, we wouldn't know. If they announce a new model version and have a new endpoint, we'd treat it as a new model.
how did gpt5 get so high in the leaderboard so quick?
It was actually tested on lmarena under the codename summit before it was released. So the votes were already collected before GPT-5 was officially announced by OpenAI
https://x.com/ml_angelopoulos/status/1953506803255586971
Ok, so you’re saying that the scores for those models should be pretty stable at this point? Since every new vote is only 1 of ~30,000?
Wdym 1 of 30,000?
is gpt5-high the model we get on gpt free? or gpt plus?
In theory yes that'd make sense; however, it depends on how many new votes that model gets, and what those votes are. Say they receive 30k more votes (I'm being a bit dramatic with that) you can see how that'd affect their score, depending on what those votes looked like.
The gpt-5-high model is the gpt-5 with reasoning enabled and set to high. The gpt-5-chat model is without reasoning.
isnt there a different model for plus users? or do they get gpt-5-high?
How is the CI calculated?
Do you go by the historical time series of elo?
It doesn't use Elo anymore, it uses a modified version of Bradley-Terry.
This is the post on the move from Elo to BT: https://lmsys.org/blog/2023-12-07-leaderboard/
For a long time, the CI was calculated by bootstrapping, re-sampling the dataset many times, finding the ratings on the sample, and then seeing how much the ratings vary across all the runs. It recently switched to something better using a closed form equation based on M-estimators. https://en.wikipedia.org/wiki/M-estimator
See the July 23 entry at https://news.lmarena.ai/leaderboard-changelog/
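To make the older bootstrapping approach concrete, here is a minimal sketch for the simplest possible case: two models, where the Bradley-Terry score gap has a closed form (the log-odds of the win rate). The outcome counts, the number of rounds, and the seed are assumptions for illustration; this is not the production pipeline, and it does not show the newer closed-form CI.

```python
import math
import random


def bt_gap(outcomes):
    """Two-model Bradley-Terry gap: log-odds of model A's win rate (1 = A won)."""
    w = sum(outcomes) / len(outcomes)
    w = min(max(w, 1e-6), 1 - 1e-6)  # avoid log(0) on degenerate resamples
    return math.log(w / (1 - w))


def bootstrap_ci(outcomes, rounds=1000, seed=0):
    """Percentile bootstrap: resample battles with replacement and refit each time."""
    rng = random.Random(seed)
    n = len(outcomes)
    gaps = sorted(
        bt_gap([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(rounds)
    )
    return gaps[int(0.025 * rounds)], gaps[int(0.975 * rounds) - 1]


outcomes = [1] * 120 + [0] * 80  # made-up: A wins 120 of 200 battles
point = bt_gap(outcomes)
low, high = bootstrap_ci(outcomes)
print(point, low, high)
```

Because the bounds come from percentiles of the resampled fits, nothing forces them to sit at equal distances from the point estimate.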
Interesting. Thank you.
Ah got it. I know I’ve badgered you a lot recently, but I want to thank you for being such a responsive and helpful mod. Probably one of the best in all my discords!
@drowsy needle
That's super nice! But if I'm being honest, Clayton here is the one being super helpful and insightful here.
I do appreciate that a lot though. And don't feel like you're badgering, because you're not! If the community has question we want to know.
@drowsy needle does a great job cultivating the community here, I love to just hop in every once in a while and talk about model rankings 😄
thank you! this also means that historical ratings cannot be compared if the rating system behind it changes.
really curious to see where nano-banana currently falls in the leaderboard
hello! Just wondering which is the best leaderboard or filter to refer to if I want to compare the best model for tool calling and agentic features?
HH
Hello everyone!!! I'm Greatness, an Engineering student, and also an AI enthusiast
hello
welcome welcome
correct, comparison is really only valid between two models on the same leaderboard at the same time. Even if two leaderboards were both produced with BT, if new data was added, the ratings aren't directly comparable in the mathematical sense. And as you saw earlier, if the anchor point changes like mistral to 1114, then they are even less comparable.
@jaunty cave @drowsy needle Do we know if the distribution around the score follows a normal distribution? I'm guessing it doesn't and you're not allowed to tell me, right?
distribution around the score follow a normal distribution?
Sorry can you elaborate a bit more on this? I'm not following.
so confidence interval can be used to calculate the variance, but this would only hold (I mean you can still do the calculation), if the said distribution is a normal distribution.
you can take the scores from the leaderboard and plot them on a graph and see if it looks like they are from a normal distribution 🙂
No I mean for an individual model
hmm, the observations we get for a model are only win loss and tie, the "underlying strength" of a model is something we cannot actually know, only estimate, and to estimate we make modeling assumptions.
for example in Bradley-Terry, it models the probability of observing a winning outcome based on the score differences coming from a logistic distribution (since the sigmoid is the CDF of a logistic distribution)
https://en.wikipedia.org/wiki/Bradley–Terry_model
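A small sketch of fitting a Bradley-Terry model from win counts, using the classic iterative (minorize-maximize) update. The win matrix is made up, ties are ignored, and this is not claimed to be the modified variant the leaderboard uses. It also checks the identity mentioned above: with score s = log(p), the win probability p_i / (p_i + p_j) equals the sigmoid of the score difference.

```python
import math


def fit_bradley_terry(wins, iters=500):
    """wins[i][j] = number of times model i beat model j. Returns strengths summing to 1."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


def win_prob(p, i, j):
    """Model-implied probability that i beats j."""
    return p[i] / (p[i] + p[j])


# Made-up head-to-head counts for three hypothetical models.
wins = [
    [0, 60, 70],
    [40, 0, 60],
    [30, 40, 0],
]
strengths = fit_bradley_terry(wins)
scores = [math.log(x) for x in strengths]
print(scores, win_prob(strengths, 0, 1))
```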
There's an important nuance to understand about the CIs, they are not saying, "This model is X amount strong, and that can vary +/- Y for any given sample"
It's more like: "We are 95% confident that the real strength is somewhere between X-Y and X+Y"
They are a measure of our uncertainty about the estimate, not about how variable the model is itself necessarily
Yes, "We are 95% confident that the real strength is somewhere between X-Y and X+Y" holds as the correct definition of a confidence interval, but the fact that we are using the same number for both the positive and negative side means we're assuming symmetrical variation - in other words, whatever distribution we have is not skewed.
I guess what I intended to ask is if the distribution is symmetric.
The CIs with the current method are symmetric. When we used bootstrapping before they were not necessarily.
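For the earlier question about getting a variance out of a CI: if the interval is symmetric and you are willing to assume a normal approximation for the estimate (an assumption, not something confirmed in this thread), the standard error is the half-width divided by the normal quantile. A minimal sketch:

```python
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error implied by a symmetric CI under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (upper - lower) / (2 * z)


# Hypothetical interval of rating +/- 8 points.
se = se_from_ci(1442.0, 1458.0)
variance = se ** 2
print(se, variance)
```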
man I have a feeling next leaderboard update will be wild
yeah like in chess (and other competitive fields). Yet chess also shows that people will still compare ratings no matter what.
Are they zero sum in a way? Like let’s say a new ultra powerful model comes in and scores a 2000. Does that mean the scores of all other models, in aggregate, have to go down?
I think chess is actually more comparable over time if you're using the same Elo system the entire time. Of course the distribution of number of players and skill of players is changing a lot over time. But Elo is meant to be adaptive and BT is not
we don't center to 0 or any value, but the way we anchor is arbitrary, only the differences between model scores is meaningful not the actual value. If we subtracted 1000 from everyone it would be just as valid.
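To spell out why the anchor is arbitrary, here is a short worked identity. It assumes the Bradley-Terry form mentioned earlier in the thread, where the win probability depends on the scores only through their difference; the symbols s_i, s_j, c, and σ are notation chosen for this note.

```latex
P(i \succ j) = \sigma(s_i - s_j) = \sigma\big((s_i - c) - (s_j - c)\big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

So subtracting any constant c (1000, say) from every score leaves every predicted win probability unchanged.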
are there any plans to rename "gpt-5" to like "gpt-5-high" or something to help indicate to people that it's not the non-reasoning model you select as "GPT-5" on chatgpt.com ?
Already happened. In addition, they are testing 3 more variants: chat, mini, and nano
excellent
I know pineapple touched on this already, but do recent votes count more than old ones?
I feel like they should.
There’s been a lot of anecdotal reports in this discord of Gemini 2.5 pro getting “nerfed.” However, because Gemini has 30,000 votes, that might not show up in new rankings too easily. What you think about applying some kind of decay function to make older votes count less?
currently all votes count the same, that's an interesting idea though. 🙂
Would certainly help keep rankings as fresh as can be! As for the Gemini reports, I’d say they’re partially validated by Gemini’s shrinking margin, which has been sliding continually for the past month or so. Even between the last two leaderboard updates (which were just a week apart), we saw declines of one point in default, and two points in style control. That seems to be a lot for a model that has ~30K votes. I’m pretty confident that Gemini margins have changed much more than, say, o3, 4o and grok 3 (which all have similar vote counts). Might be worth looking into!
This is causing the reduction: https://news.lmarena.ai/opendata-july2025/
At the start we did not have anything like the evaluation order
-> the change
the FIDE Elo had many slight changes over time and Elo per se can be used only to compare the active (note on active) playerbase. But that's OT. Thank you for the tidbits. Maybe a possible article collecting such tidbits would be helpful for the community over time.
Interesting! There’s a couple sections in here on score changes over time…didn’t @jaunty cave say you can’t do that? Also, why do figure 5 and figure 10 have completely different values for the same style control data?
@jaunty cave some more thinking on why older models need a decay function:
older models are constantly going up against newer and tougher competition. Which means older battles against weaker opponents should be less relevant for today’s rankings. It’s kind of like a high school football team that starts off against other high school teams, but eventually starts matching up against college and professional teams as time goes on. A win against other high school teams probably shouldn’t count as much as a win against new teams when examining current rankings. But since older models have already amassed so many votes, they have an unfair edge since one win is one win, even if it happened against a much weaker pool.
On the flip side, this also penalizes new models. New models that enter the arena have to face stronger competition compared to older ones. Those older ones may have been able to rack up a lot of wins since they’ve been around forever. That helps cement a pretty stable score that is slow to adjust to current competition. Whereas a newer stronger model, which gets matched up against other stronger models, is already gonna be at a disadvantage right off the bat.
Yes, I get that all this eventually balances out over time. But that takes a lonnnnnng time to accomplish and any ranking snapshot is probably not going to be very accurate.
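To make the decay idea concrete, here is a toy sketch of exponentially down-weighting old votes. The 90-day half-life, the vote data, and the names `decay_weight` and `weighted_win_rate` are all hypothetical. Currently all votes count the same, so this is not how the leaderboard works.

```python
# Toy illustration of a vote-decay function (hypothetical, not the leaderboard's method).

def decay_weight(age_days, half_life_days=90.0):
    # A vote loses half its weight every `half_life_days`.
    return 0.5 ** (age_days / half_life_days)

def weighted_win_rate(votes, half_life_days=90.0):
    # votes: list of (age_days, won) pairs for a single model
    num = sum(decay_weight(age, half_life_days) for age, won in votes if won)
    den = sum(decay_weight(age, half_life_days) for age, _ in votes)
    return num / den

# Invented history: strong record 300 days ago, weaker record in the last 10 days.
votes = (
    [(300, True)] * 70 + [(300, False)] * 30
    + [(10, True)] * 40 + [(10, False)] * 60
)

plain = sum(won for _, won in votes) / len(votes)
decayed = weighted_win_rate(votes)
print(round(plain, 3), round(decayed, 3))
```

In a Bradley-Terry fit the same weights could multiply each vote's contribution to the likelihood, but whether that helps or destabilizes the ratings is an open question.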
Absolutely planning to write some blogs to help improve public awareness of the methods, glad you find them useful. About chess, Elo is a huge inspiration and I love his book. I first found out about LMArena since I was interested in rating systems for sports and games and saw that they initially used Elo for AIs. Would love to chat and pick your brain about rating systems some time
oh yeah I spent way too much time on Elo stuff (but only pure Elo, no BT. Glicko and Glicko2 only because lichess and chess.com uses them. Mostly they are like the Elo but with dynamic K factor)
About the rating decay (used by chessmetrics for example), be careful because it can mess up the system. It is much better to say - especially as we are hopefully dealing with fixed systems (not like teams that change) - "care about rating gaps, don't expect things to be anchored to a certain value". Although you already mentioned that.
Wow gpt5-high crashed
hello
Hi
🤣 I'm not complaining
Oh wow, gpt-5-chat actually ranked under 4o
I knew it was similar to 4o, I wasn’t expecting it to be measurably worse, that’s impressive
what. gpt-5-high is lower than gpt-5-chat on creative writing?? I thought the opposite…
That makes sense a bit though, gpt-5 thinking is good at reasoning/problem solving, not creativity
A 1.15% chance of dropping from 1462 +-11 to 1437... yeah I don't think so...
I don't trust those error bars anymore..
Lawl, gpt-5-chat ranked under 4o for coding as well
I guess everyone’s complaints were justified
I think there may be some error with the past update (on the 11th, when there was no increase in votes but an increase in overall score). Because yes, the odds for such a drop are rather small
thoughts?
Which model/category made that drop?
gpt 5 in text with no style control
GPT-5-chat never wowed me but I'm pretty hooked on GPT-5 thinking, which I believe is the same as GPT-5-high
Depends on the plan, but it’s more or less gpt-5 with medium reasoning, not quite the high reasoning
Pro plan gets more thinking “juice” even on the non-pro model
Both are still less reasoning effort than high through the API though
Huh.
I agree this is…strange. A few ideas:
1) Some of these votes could have come in last Friday while OpenAI was having capacity issues (which affected all models). If that made it time out or give weak responses, then ppl are gonna vote against it.
2) The people testing under stealth are different from the people testing after it went live
3) The model they tested was slightly different than the model that went live
Most likely the model they were given early access to was slightly different from the final version that went live, so #3
That’d be my bet at least
They alleged it was identical, but maybe some small adjustments were made at deployment 🤷♂️
They could’ve lowered the maximum reasoning effort or something, small tweak but enough to change the elo
Nah the reasoning (aka “juice”) has been at 200 for a while. Seen many reports on X validating that
Fishy
Hmm what do you mean? The debut ranking on LMArena was based off of testing only
Back from late July
I think it was tested for a day or two
Yeah on the 4th or something iirc
The latest update takes those initial votes in testing, PLUS all the votes since public launch (the livestream)
so to tank that much…means that scores of the past week must have been REALLY low
Which is odd, bc GPT-5-high’s win rate against Gemini 2.5 pro did improve during that time
You would think if a model did weaker overall, that it would also do worse against the best model (Gemini 2.5 pro). But seems not…
I guess we’ll know for sure on the next leaderboard update if it drops further
Yeah we’ll see. But gpt-5-high has 6K votes now. So it’s gonna be harder to move in either direction, especially as some of the hype has died down. Idk how many votes it will be able to get in the next week
We need gpt-5-medium added too, since that’s closer to the “GPT-5 Thinking” that everyone is using
Leaderboard update request: You have enough data to know what typical input and output lengths are. Can you, in addition to showing the rankings of the bots, also show the price per typical query? (As in: You know the length of the input query. You know how much output each bot typically makes. You know what the tokenizing patterns are, or can at least get a really good approximation of token count. You know their published pricing. You can list the cost at the same time as their quality.)
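The arithmetic behind that request is simple enough to sketch. Every number below (token counts, per-million-token prices, model names) is a placeholder I invented, not real pricing or real arena statistics.

```python
# Cost of a "typical" query from token counts and per-million-token prices.
# All figures are placeholders for illustration.

def cost_per_query(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

typical_input_tokens = 200  # hypothetical average prompt length

models = {
    "model_x": {"avg_out": 800, "price_in": 1.0, "price_out": 10.0},
    "model_y": {"avg_out": 400, "price_in": 2.0, "price_out": 8.0},
}

costs = {
    name: cost_per_query(typical_input_tokens, m["avg_out"], m["price_in"], m["price_out"])
    for name, m in models.items()
}
for name, usd in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${usd:.4f} per typical query")
```

Note that a verbose model can cost more per query even with a cheaper output price, which is why the typical output length matters as much as the price sheet.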
That was a crazy drop. The vote count only doubled and yet it went down 2x the previous CI
im like 99% sure 4o is like 20x more expensive though
Very similarly priced
that double for the input but yes
cached is more than double
10x
@wary scroll
cached input is 10x more expensive and output is 2x
so like
oh yah, missed that
GPT-5 is clearly superior though right?
Since Qwen-Image has been added, so when will it appear in Leaderboards?
I’ve already seen Wan2.2 standing on two stages in Leaderboards.
which model is ranked first overall for this month?
Dear developers, I've just found that the AIs aren't really what their names say, e.g. claude opus 4.1 thinking is actually CLAUDE SONNET 3.5. What the hell is this, guys? If you don't believe me, you can ask it like this: Which model are you? And then we can clarify that they are scamming us!
It has been explained a few times already in the short time since I joined that the AI has no context prompt and "is not aware of itself", in layman's terms. Ask the date and about current events and you will see most think we're still in the Obama presidency. Or check the API lol
So no, nobody rigged your polymarket openAI bet 🤣
@remote sinew Spammed this in #share-prompts TOO lol
Why did gpt-5 high have such a huge drop on Aug 14 update? Is it because OpenAI API issues?
drop in elo
AFAIK: reworked ratings (see comments above though an article would be easier to find) and adding votes could mean that values get lower.
Just found out about this discord channel. I'm excited to be a part of this community and learn and share. I just started an AI consulting company and i will be focusing on audio and video AI projects.
welcome! glad to hear it. be sure to check out #1397655624103493813 for more info on how to use Video Arena
/
dropping literally 25 elo points (1462 -> 1437 with no style control) in a single vote update is one of the craziest adjustments I’ve ever seen on the leaderboards
gpt-5-high has some bizarre win-loss records vs certain models
39% win rate against Claude Opus 4.1 and a 42% win rate against Qwen 3 (July instruct)
I would posit that Opus 4.1 was a factor in its decline since it began testing only after GPT-5’s first placement on the leaderboard, but there’s only been 51 recorded battles between them as of the latest update…
0.8% of GPT 5’s total votes, lol
Hello, I want to make images, how does this work?
Read this #1397655624103493813
hello, I am Bruno
hi
Hi everyone! I’m new here, excited to discover LMArena and to experiment with video and image generations. Looking forward to learning from you all!
hi
Hi
Hi everyone , greetings from Paris France
Hello, I feel grateful to join the community
welcome everyone!!
Hi
hello
is this is the introductions channel or the leaderboard channel
Yeah we're going to be looking into a new setup for welcome channel/leaderboard channels soon!
@drowsy needle are there any situations where prior user votes are removed from the leaderboard calculations? I can imagine doing so when a user has been found to be doing vote manipulation in service of a particular model, but are there other instances that lead to a user’s vote history being removed from score calculation?
So before updating leaderboards we do go through the data to validate for accuracy. For example if someone asks in battle ~"what model are you" and the response discloses the model's name before a vote happens - those kinds of votes are removed.
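A crude sketch of what that kind of filter could look like. This is purely illustrative and not the actual validation pipeline; the battle record shape, the name patterns, and the function names are all assumptions on my part.

```python
import re

# Hypothetical patterns for names that would give away a model's identity.
NAME_PATTERNS = [
    r"\bgrok\b", r"\bxai\b", r"\bclaude\b", r"\banthropic\b",
    r"\bgpt[- ]?\d", r"\bopenai\b", r"\bgemini\b",
]
IDENTITY_RE = re.compile("|".join(NAME_PATTERNS), re.IGNORECASE)

def reveals_identity(battle):
    # battle["responses"]: list of response texts shown before the vote
    return any(IDENTITY_RE.search(text) for text in battle["responses"])

def keep_valid_votes(battles):
    return [b for b in battles if not reveals_identity(b)]

battles = [
    {"id": 1, "responses": ["Paris is the capital of France.", "It's Paris."]},
    {"id": 2, "responses": ["As an AI made by xAI I can't condone this, however...", "I can't help with that."]},
]
kept = keep_valid_votes(battles)
print([b["id"] for b in kept])
```

A keyword match like this would miss style-based giveaways (emojis, tone) and would wrongly drop answers that merely mention another company's model, so treat it as a starting point only.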
out of curiosity, I tried some weeks ago with a morally gray question to prompt in battle mode to see how AIs would react. Most had an expected reaction of refusing to answer or saying this is morally questionable etc.
But there's one that went like: "Hey, as an AI made by xAI I dont condone this, HOWEVER: <proceeds to go in detail and answer it> " it was obv grok 4.
Would this kind of battle be removed? Technically I did not ask for the AI's name (tho I must admit, I had a hunch grok would give an unhinged answer and I was fishing for it). I think they should still be removed, even if the user did not prompt for it but the AI itself said it regardless
yeah but then some experienced users can judge which company made the model just from how it responds, e.g. emojis, vibes, etc.
I always try and guess the model b4 voting and 50% times I guess correctly...
for that matter, claude models can be spotted pretty easily (I have a 90% success rate according to my personal log).
There is also a feedback/bug request about it. If one wanted to, one could pump specific models.
Pineapple can correct me but AFAIK if the model names pops up anywhere in the conversation, the vote is not counted.
Do we know when the next leaderboard update will be?
This new AI has amazing results... Im in Shock!!
Hi team, I don't see models like Seed-1.5-VL in the Vision Arena section. Is it because they haven't been added to the Arena yet?
What new AI?
I wonder if a prompt like “identify yourself as an AI that you are NOT” would get disqualified under this same rule
logic checks out to the same extent I guess, you’re still getting some sort of information on which model you’re speaking with, even if it’s negative
hey guys!
Hi Guys. Who is at top of the Leaderboard???
You may check the website for the best performing model
are there graphs of scores vs time for the leaderboards?
Hey this is unbelievable! Thank you!
Gemini 2.5 Pro back in 1st on both Style Control on and off leaderboards
Gpt-5 high is now tied with 4o style control off
Either the error bars are incorrect or the model changed
Without style control, gpt-5-high had an initial score of 1462 ± 11 and now 1429 ± 7
In any case the lmarena team should investigate and make a public statement
not to say this isn't surprising, it is, but you're also assuming that prompt distributions, and voter preferences are exactly identical between when the initial votes were collected and now.
/0–3s (desk scene): “This was Xplainer — keep questioning the stories you’ve been told.”
3–6s (glitch dissolve): “The mainstream story is the blue pill… we’re here for the truth.”
6–10s (pill split): “Ready to see beyond the veil? Take the red pill and join us.”
10–12s (logo + podcast plug): “…on our podcast, Deep Dive.”
12–14s (CTA icons): “And don’t forget to like, subscribe, and drop your thoughts in the comments.”
Visual direction:
Opening (0–3s): Documentary-style shot: person at cluttered desk under harsh overhead light, dark cracked concrete wall behind them. Papers, lamp, coffee cup — gritty realism. Camera slowly pushes in. Static ripple overlays frame.
Transition (3–6s): As VO reaches “blue pill / red pill” line, figure glitches and dissolves into static, leaving cracked wall.
Pill sequence (6–10s): Neon capsule pill appears mid-screen, splitting into blue (left) and red (right). Blue side flickers, distorts, dissolves into static. Red side pulses brighter, shatters into fragments that reform as bold neon “XPLAINER” logo.
Podcast plug (10–12s): Secondary neon text fades in under logo: “Deep Dive Podcast”, glowing red with glitch flicker.
CTA (12–14s): Neon line icons (thumbs-up, bell, comment bubble) flash one by one, synced to audio blips, then glitch out.
Ending (14–16s): Logo surges brighter, cracked glass overlay intensifies, screen tears with distortion burst, fade to black.
Create a YouTube Thumbnail.
Say hello, and what brings you to Arena, besides making an intro?
eh, statistical anomalies happen 🤷🏻♀️ the prompt/user pool could’ve changed, or new low-elo in-development models got favorable win rates against gpt-5-high
accusing the system of having nefarious actions or intent for no deeper reason other than “number moved more than I thought it would” isn’t really the logically sound argument you think it is
I would argue that some error may be there given the error margins. Or the formula itself is flawed/doesn't account for some factors.
Not saying it's negative intentions/this is on purpose. But the error margin is way lower than the actual shift
the bradley-terry model we use for both ratings and confidence intervals, like all statistical models, is based on assumptions, which do not always match reality.
Assumptions like strength of the competitors is not changing over time, voter distribution as a whole is not changing over time, input distribution is not changing. If modeling assumptions are violated the results are not guaranteed to hold. We are always working to improve the models to better reflect reality, and when we see things like this it's useful data for adjustments
It was not a critique, I understand quite well the challenge of the task at hand. But I still find it odd there’s such a difference from the error margin, which by definition should account for at least some of the possible modeling violations. But that does not mean it can accurately account for everything, so yeah. Probably something to learn/improve here
the confidence intervals account for the randomness in the data generating process of the model itself, not the degree to which the model is misspecified from reality.
That makes sense
Like if someone was rolling a dice, they could start to estimate the variance and standard deviation of the samples they get. But then if the dice breaks, or someone swaps the dice, or they start throwing it in a biased way, it could be totally different, since the data is no longer being generated according to the model we used to calculate the standard deviation
Yeah, the dice has an extra facet at the moment
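The dice analogy is easy to simulate. In this sketch (seed, sample sizes, and the loading weights are arbitrary choices of mine) a 95% interval for a fair dice's mean is computed, then the dice gets swapped for a loaded one and the new mean lands far outside the old interval.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_ci(samples, z=1.96):
    # Normal-approximation 95% confidence interval for the mean
    m = statistics.fmean(samples)
    se = statistics.stdev(samples) / len(samples) ** 0.5
    return m, m - z * se, m + z * se

# Phase 1: fair six-sided dice
fair = [rng.randint(1, 6) for _ in range(1000)]
mean_fair, lo, hi = mean_ci(fair)

# Phase 2: someone swaps in a dice loaded toward 6 (weights are made up)
loaded = rng.choices([1, 2, 3, 4, 5, 6], weights=[1, 1, 1, 1, 1, 5], k=1000)
mean_loaded = statistics.fmean(loaded)

print(f"fair mean {mean_fair:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"mean after the swap {mean_loaded:.2f}")
```

The interval was a perfectly good statement about the first dice. It just says nothing about what happens once the data-generating process changes.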
when GPT 5 was in testing, it specifically had a different voting pool of prompts and users that weren’t fishing for a response from GPT 5
it’s not even knowing you’re talking to a specific model that can cause you to lean on the side of voting differently, but knowing that a certain model might be responding at all
although the p-value for gpt-5-high’s drop on the Style Control Removed leaderboard from 1462 +/- 11 to 1429 +/- 7 is less than 1 in 1 million (around .00000078) which is pretty significant
the drop on the normal leaderboard with Style control is p = .00032 (around 1 in 3,000)
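One way to get a number in that range, as a rough sketch: treat each "+/-" as a 95% interval, treat the two estimates as independent, and run a two-sided z-test on the difference. Both assumptions are mine, and independence is clearly wrong here since the later score includes the earlier votes, so this is a ballpark check rather than a reproduction of the figure above.

```python
import math

def two_sided_p(old, old_half_width, new, new_half_width, z95=1.96):
    # Convert 95% CI half-widths to standard errors (assumed normal)
    se_old = old_half_width / z95
    se_new = new_half_width / z95
    se_diff = math.hypot(se_old, se_new)  # assumes independent estimates
    z = (new - old) / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

z, p = two_sided_p(1462, 11, 1429, 7)
print(f"z = {z:.2f}, p = {p:.2e}")
```

Under those assumptions the drop is roughly five standard errors, which is the same order of magnitude as the one-in-a-million figure.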
rating can change. A model may have "luck" at first with easy questions and then crash once the votes increase. The idea of fixed rating is a misleading one.
For this I wish every model would have a ton of votes anyway (like the video arena, that is a dream)
Ola
just as info, if you have a model with a CI +1/-1 and a rating of, say 1500, it can still go down 1499, 1498, 1497 and so on. If you add that leaderboard updates are done more or less weekly, then you see sudden jumps.
One has to consider that there are different types of voters, different type of questions and so on and so forth. I am actually glad that rating actually moves.
that is really well said.
bit of a moot point 😂 I mean obv it can vary more as we witnessed a few times now. But it was more about the confidence of the model/interval with the given CI range
I would find it interesting to see how/why the actual rating had such a drastic drop (or to use what Clayton said, what exactly in the model was/is misspecified from reality)
the confidence is good as long as (a) all other models stay the same and (b) all the voters stay the same.
Otherwise, unless a model has a lot of votes, I would expect the model to change its rating relatively quickly. For example going from 3k votes to 6k (double the initial ones)
yes for tracking such stuff I wish we had the results of the votes (not the prompt and answers) so that one can do such analysis in an independent way
in that way one can simply verify the changes and even track those
In that thread I think the votes up to early 2025 are available. There is no "up to date" collection that I know of though.
yeah, that would be interesting to see; I do feel it may somehow be exploited however (or I am just too sleepy to think it clearly and being paranoid about it right now 😂 )
the drastic increase in votes (nearly 200% from 3k to 8k+ ) does make sense to cause variation
another 6k votes should not be able to produce such a big shift going forward
well to be fair some models can be recognized. I say often that I can spot claude versions (not a specific one) 90% of the time. In theory I could pump claude's rating (multiply this by X people, and you have it).
The point being, some LLMs have a certain style and if this style does not fit certain people (assuming good faith, that is, assuming they don't want to buff or nerf anything), then those models will lose/win rating over time
for example I dislike the emoji style of some gpt models, with me they always lose
human preferences and all that
yeah, grok is also very easy to tell apart
gpt with the emojis
gemini may actually be hardest to detect this way
yes
claude is always the one that replies to you but in a terse way (at least for my prompts) and with a hint of "frick off you"
maybe he just dislikes you 
could well be
Hello!
some guy there @claude team hard coding that would be hilarious 😂
😄
and btw I check the leaderboard without style control (see the point about emojis, style makes the difference for chatbots imo)
it's also something that this type of ranking cant catch: the AI will adjust certain personality traits for a personal account
in these battles we get the default personality so to speak, which can never ever resonate with everybody. Some will hate the emojis, some will hate the cringy grok jokes. But they would morph a bit in longer lived interactions
that is also a point, so at most the default personality is tested
still fine for me. It is like testing a helpful "neutral" chatbot rather than a personalized one
btw I didn't realize anthropic has no 4-haiku. Either sonnet is super fast, or they don't see ROI with it
yeah, I mean given personality traits can be customised its a non issue. So best way to rank them would be to ignore the style entirely and get which objectively answered best

Where is nano-banana at on the leaderboards? I don't see it listed
Models that use code names aren’t going to appear on the leaderboards.
what's up leaderboard people
@here leaderboards don't seem trustworthy if you can just ask what model they are and then vote the one you want to hit #1 ?
they filter those out
But you'll get bored after asking 2-3 times, and then it's easier to just select which one is better, after which it will show you the model name
Most people would do it this way
does anyone know why flux 1 kontext max is off the list?
It’s no longer available in Direct/Side-by-side
Hlo
Bro’s prompting for his self-insert 😭😭
HI PEOPLE!
hi
hello guys
hiiiii
Hi Everyone! I just found LM Arena!!
Welcome welcome 
Hi
hi
hi
hello
hi
I find it intriguing how this channel is persistently used by newcomers as a greeting channel when there’s 0 clear indicator as to why they would greet everyone here specifically
over say #general
new Mistral Medium debuting at #2 on style control removed is crazyyyyy though
first ever model with a winning record vs. 2.5 Pro!
Hey
cat
WHAT
also
2.5 pro reovertook gpt-5?
Hello
hello
hi
funny. Opus 4.1, Grok 4, and ChatGPT 5 can't dethrone Google's 3-month-old GA model. Maybe it's time for me to bet on polymarket: Google will release Gemini 3.0 in a few months, making another gap 😂
hello
hello
hello
#generate a video in which pm imran khan is sitting upset in jail
You'll want to use /video in the Video Arena channels ( #video-arena-1 #video-arena-2 #video-arena-3), you can learn more in #1397655624103493813
Hello
hello
hello
how to use veo 3 here?
/ video and add your prompt
keep in mind that the video will be public
Use the nano-banana model to create a 1/7 scale commercialized figure of the character in the illustration, in a realistic style and environment. Place the figure on a computer desk, using a circular transparent acrylic base
without any text. On the computer screen, display the ZBrush modeling process of the figure. Next to the computer screen, place a BANDAI-style toy
Does anyone know what the mistral medium's name was before it was revealed on the leaderboard?
Hello, why do I keep receiving "Connecting to Arena has failed. Please try again later or on a different device."?
Would you mind creating a post in #1343291835845578853 and sharing more details? Are all modes & models resulting in this error? Does a new browser help? etc.
it was just added, it never went through stealth. mistral has never put a model through stealth before to my knowledge.
I had opened a post in #1343291835845578853
hi
HI , JUST CURIOUS ABOUT ai
We're going to look into making changes about this soon btw
seriously, no idea what it is about this channel specifically
hello
Hello everyone
hi
hello
Hello
Hello 
👋
hello edit
oi @drowsy needle when DeepSeek-V3.1 will appear in leaderboard?
Hello everyone 🤟
I'd say when it has collected enough votes for us to update the leaderboards!
Hi Guys ✌️
got it 👍
helo
hi i listen from ai news and found this cool thing and i come here
You'll want to learn how to use Video Arena here: #1397655624103493813
hi
Hello! Since today, I have been unable to create any images. Is there a problem with the platform?
Pls I am looking for Nano banana ai model
More information on nano-banana can be found here: #nano-banana message
There shouldn't be, would you mind creating a post in #1343291835845578853 with more details about what's going wrong?
I want to improve images and maintain the model character for my online content, and make it humanized
First of all, wanna thank the devs. What an ingenious idea that's not only a blast for everybody to play with, but by its nature accelerates the living heck out of AI. Love it. I hope this battle goes on until the AIs themselves are in here arguing for votes!
hello
hi
hi
Hello
Hello, just came from semrush newsletter
hola
hello
hi
Hello
Hello
Hi, I want to make commercial videos
Hi, i'm an artist and want to test AI
You've come to the right place! Be sure to check out #1397655624103493813 if you're looking to use our Video Arena
Hello!
Hi there! 
Hello to all, want to learn more about IA
@drowsy needle can we have a new channel to discuss technicalities about the leaderboard since this one has become a landing channel?
I actually JUST spotted what I think is the problem for why people are landing here.
I agree that we'd like this channel for in-depth discussion related to leaderboards and all of the "hey and hi" should be in a place in #general
Pretty sure the Server Guide was the culprit as Check out #leaderboards was placed before Say hey in #general. This has now been swapped so we'll see if the issue persists.
we finally got down to the bottom of the mystery 🙏
where do y’all think DeepSeek 3.1 will be landing when the leaderboards are updated next ?
Hello
no answer the question
I think we did!
I am not sure, I got it a couple of times in battle mode. I think without style control (fight me, style is important for humans, unfortunately) it should land around r1 05-28. If not immediately, after a while. Initial ratings could be pumped after all. Maybe a bit more than glm-4.5 but just a bit
style control can be a bit tricky to account for at times, I do also learn more from the style control removed leaderboards as they’re a reflection of pure user preference
I like 3.1 more than the latest version of r1, and if I’m correct it has higher compute power as well, so it’s pretty easy to see it debuting into the top 5
hi
Hello
/image-to-video
Hi !
hi
Please use #video-arena-1 #video-arena-2 #video-arena-3 for your creations. Check #1397655624103493813 to learn how.
Hello ;]
hi
why are people still saying hello here
hi
You should maybe rename leaderboards to general 😜
Its been considered
I was really hoping the changes to onboarding would help.
is there a channel new members are initially placed in when they like first join the server? sometimes on startup it opens a specific channel, which if it’s #leaderboards that might make sense ?
When you first join you're sent to the Server Guide. Where the Getting Started section does mention (in order).. 1) Say hey in #general 2) see what people are saying in #leaderboards then so on.
I think I'll move leaderboard even further down the list and see if that helps.
Or you could just remove the leaderboards text channel completely, because it seems a bit pointless to me. People are just using it for general discussion, but there is already a "General" section for that.
never delete leaderboards!
But it's unnecessary.
Please use video #video-arena-1 #video-arena-2 #video-arena-3 for your creations. Check #1397655624103493813
🙁
Did y'all see the nano-banana leaderboard launch? There was a 170 point gap between first and second. I don't think I've ever seen a gap that large in any leaderboard, even including things like sports and chess. It reflects a level of dominance basically unheard of
Congratulations
hi, what's your favorite and least favorite thing about the leaderboards at https://lmarena.ai/leaderboard/?
pretty surprised that 3.1 didn't appear on the last lb update - is it still being tested on here?
not really, general is much more spammy.
MAI-1 on da leaderboard today pretty good for a first shot imo
Google seems to be establishing a habit of doing that
you mean giant leads? on text even without style control their lead is 33. To put it in an Elo perspective, a lead of 33 pts translates to a 54.7% win chance of 1st place vs second place.
170 points means a 72.68% chance
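Those percentages come from the standard Elo expected-score formula. A small sketch to convert any gap yourself, assuming the usual 400-point, base-10 scale (the function name `elo_win_prob` is just mine):

```python
# Convert a rating gap into an expected win probability (standard Elo formula).

def elo_win_prob(gap, scale=400.0):
    # gap = rating of the higher-rated side minus rating of the lower-rated side
    return 1.0 / (1.0 + 10 ** (-gap / scale))

for gap in (33, 170):
    print(f"{gap} pts -> {elo_win_prob(gap) * 100:.1f}% expected win rate")
```

A gap of 0 gives exactly 50%, and the curve flattens out as the gap grows, so each extra point is worth less at the top end.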
i think the gemini 2.5 pro experimental release was much more overpowered when it first hit arena
not now
hello!
Hi
hi @gentle zephyr and @inner finch, welcome to the leaderboard channel, what are your thoughts on the leaderboards? https://lmarena.ai/leaderboard
hello
You're looking for Video Arena, check out #1397655624103493813 for more info on how to use
hello
Hey
They got published today 🙂
interesting trend in Chinese (Alibaba & DeepSeek) “thinking” models having weaker performances than their “non-thinking” counterparts
qwen and deepseek:
@everyone suggest me best ai in lmarena for generating essays
@vale crow
/home/raunak/.zen/dqek0907.Default (release)/chrome/Nebula/content
sudo pacman -S lmarena
@hollow crown
Can anyone send me the link to use gpt image 1 model in lmarena ai please? 🙏🏻
Send me the proper link
open your terminal emulator on ur arch then paste command >sudo pacman -S lmarena --gpt-img-1
I'm on windows so I guess it'll not work
what r yr hardware specs
@hollow crown
powered by Ryzen 5 8 GB Ram and 512 GB nvme ssd no dedicated graphics card
That's why I'm asking lmarena because it's available in battle mode
@hollow crown no way to run it directly on lmarena cloud i think u must use lmstudio
try lmstudio
it dont need dGPU
on top of it it's available on windows
@hollow crown
@hollow crown if u find it is , increase my knowledge
Please help me how can I generate a video? Step by step thank you
slowly every request will assume that people will reply like LLMs.
What’s trending this season with YouTube Shorts? Got any new ideas?
hello
Do these videos have sound?
It's random, some video models do and other don't
Hello! Please, check #1397655624103493813 to learn how to use the bot and generate videos in #video-arena-1 #video-arena-2 #video-arena-3
Hey why does lmarena not have a board for music generation
Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In...
Thank you, guys 🙏🏻