#leaderboards


acoustic aurora
#

Yea, curious on the general arena as well..the community is curious

drowsy needle
#

it's updated now 👍

drowsy needle
zealous sable
soft crater
#

I was about to come here and ask for an update 😄

#

why only webdev?

willow holly
#

@soft crater
I guess they want a nice day for Claude and the site because in the general lmarena it won't look so nice and people will write again how it is a bad benchmark :P.

pastel orbit
#

How many runs have you done?

#

Where does opus stack against Gemini in your tests?

willow holly
#

As it is the non-reasoning version, it loses in logic, math, drawing, listing content, formatting a certain way...

It does answer normal questions very well. Also puts out short answers when a long isn't needed. It is better in creating some unique ideas, the writing sounds nice.

It is a good model but it really depends on what you test.

soft crater
#

I don't want to look anxious but this is becoming weird... Is there any reason I missed?

drowsy needle
tulip shadow
#

Maybe they're implementing sentiment control?

pastel orbit
#

Maybe someone on the team is on poly market 💀💀

soft crater
drowsy needle
drowsy needle
queen jewel
#

too many gamblers here😅

pastel orbit
#

hey im not gonna pretend i don't have skin in the game, but it does sting when the leaderboard that the results are based on just doesn't update before the market resolves, despite the other leaderboards updating

twin wharf
#

gamblers gonna gamble

scarlet grove
#

lol

twin valve
willow holly
#

Also, the Claude models won't be #1 in the Text benchmark; they are not general enough. So it doesn't matter for the bet on 1st place.
Of course it would be nice to have an update soon anyway

pastel orbit
# twin valve then make your own leaderboard with your own benchmark and the problem is solved...

it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website), it's just odd to not update the boards for so long after a major release, despite the webdev board being updated so quickly.

Ppl keep saying Claude is for coding only so it wouldn't be good at text, but it's possible that the additional tooling and agentic nature would allow it to provide more useful results, even when just chatting with it. If ppl are so confident that it'll be worse than google's model, there should be no harm in updating the leaderboard to reflect that. Waiting abnormally long right as a market is about to resolve just is sus from an optics perspective. Devs could easily selectively update the board to win bets

twin valve
# pastel orbit it's not about the fact that it's free (they just got $100m in funding btw, for ...

"it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website)"

did you pay part of those 100m? If not, for you and me it is free, the rest is fiction.

"Waiting abnormally long"
That sounds really like /r/choosingbeggars . The update cycle is always around a week and the more votes you get (note that you need to filter them) the better to assess the score. Who cares about gamblers, one wants proper assessments.

Again make your own leaderboard if you need it so badly. Behaving with such entitlement is never a good sign.

pastel orbit
#

lmao

twin valve
#

eh indeed, maybe the best answer to your post would be lmao

pastel orbit
#

you'd think with a $600m valuation you'd have a public schedule for updating, you love making excuses huh

twin valve
#

no need to refute entitlement

#

nah fam, you are just too entitled

#

one needs votes for accurate scoring, it is statistics

pastel orbit
#

what a reddit tier response

twin valve
#

otherwise one does scoring against static questions, not human driven

pastel orbit
#

calling having basic company practices for a 9 figure business entitlement

twin valve
#

you are willingly missing the point.

pastel orbit
#

champion for mediocrity

twin valve
#

"lmao"
"you love making excuses huh"
"what a reddit tier response"
"champion for mediocrity"

champion of proper arguments.

#

ask an LLM to help

#

writing another non-argument

#

I am still waiting a rebuttal against "not having collected enough votes"

pastel orbit
#

your responses sound like one, just give up bro. you're defending a startup that got a 2021 level seed round from a crypto hype VC and can't handle being a proper oracle for data. The whole value of these types of leaderboards are devalued anyways due to the models being optimized for it

#

i bet you when it does get updated, it has way more votes than were needed for good data

twin valve
#

"just give up bro" to a choosing beggar? lmao

pastel orbit
#

you'll see other results on there with less than 1/3 the votes

twin valve
#

actually they publish models with already way too few votes

#

3000 votes aren't many. I mean for some stats process yes, but for the variables at play they aren't

pastel orbit
#

they just coincidentally decided to increase their threshold for this week all of a sudden

twin valve
#

I don't get why you are mad about it

#

I mean even if they want to wait a year, it is in their rights

#

again make an alternative leaderboard

#

What irks me is not many of your arguments, that I can simply ignore because they aren't such, rather the fact that you demand something from someone that owes you nothing.

#

in general I would greatly prefer a proper leaderboard that tries to assess the best

#

even if it is updated randomly once a quarter

#

again when they publish some models with 3k votes that aren't enough. There are categories with barely 600 votes. Considering the variables at play (the knowledge of the person judging, the difficulty of the question, the pairings, etc...) 600 votes for some categories are nothing.
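A rough sketch of why a few hundred votes is thin: the normal-approximation margin of error on a single pairwise win rate. This is a simplification under stated assumptions (independent votes, one matchup, true win rate near 50%), not how the leaderboard computes its intervals, and it ignores the extra variance from raters, question difficulty and pairings mentioned above. The function name `win_rate_margin` is made up for illustration.

```python
import math

def win_rate_margin(votes, p=0.5, z=1.96):
    """Approximate 95% margin of error on a win rate estimated from `votes` votes.

    Assumes independent votes and a true win rate of `p` (illustrative only).
    """
    return z * math.sqrt(p * (1 - p) / votes)

for n in (600, 3000, 10000):
    print(f"{n} votes: +/- {win_rate_margin(n) * 100:.1f} percentage points")
```

Under these assumptions 600 votes gives roughly plus or minus 4 percentage points on a win rate, and 3000 votes a bit under 2, before accounting for any of the other sources of noise.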

#

further if they publish things in a way that let people bet even more, there will be even more possible "rigging" in action. So actually making (entitled) gamblers mad is a good thing.

whole wharf
#

come on update the leaderboards 😳

queen jewel
#

Google will be #1 anyway, so what difference does it make whether the leaderboard is updated or not?
The gamblers here begging for leaderboard updates are not only gambling addicts but also have some unrealistic delusions. Perhaps they are one of those who bought Anthropic at 20c lol

drowsy needle
zealous sable
#

poly will go crazy with the last minute leaderboard updates

twin wharf
sinful falcon
#

conditioned on the fact they are waiting for more votes to update the main tells me it’s at least slightly closer than the market is pricing it at

#

to be clear i still think gemini is gonna be up top

#

but i don’t think it’s gonna be a blowout

twin valve
#

for what it's worth, from my own testing on the leaderboard, I would be surprised to see claude higher than #5 overall. But since claude is very recognizable (dry answers af as long as it is not a technical question), people could also pump its score.

wind vale
#

I think the great leaderboard should be updated on a regular schedule. The influence of this page is growing and attracting more and more attention, and a regular schedule would also be fairer to AI models. It would also benefit the development of this website.

whole wharf
#

I agree

queen jewel
#

They still need to collect enough votes to get a credible score. Without enough votes, even if the leaderboard is updated regularly, models with insufficient votes or those whose scores are withheld by the model vendor for reasons such as product schedules will still be hidden on it. This has nothing to do with "fairer to AI models," nor will it give your bets any advantage.

upbeat swift
#

When a model gets deprecated, which of the following happens:

  1. Its sampling is reduced to 0, but previous battles it was a part of are still used in the next leaderboard calculation
  2. Its sampling is reduced to 0, and all battles it was part of are removed from the dataset used to construct the leaderboard
#

or something else

sinful falcon
#

if ppl wanted regular updates

#

idrc when it updates

#

i’m indifferent

#

i like when there’s more uncertainty in the market

twin valve
#

otherwise things get messy as many pairings do not happen

#

then one has "cliques" in the rating model

upbeat swift
#

I think they probably continue to use the already collected data but would like a confirmation. One of the criticisms from "The Leaderboard Illusion" is that deprecating models makes the comparison graph disconnected and the ratings unreliable.

But that would only be the case if when they deprecate, they remove those rows from the dataset
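The connectivity concern can be checked mechanically. Below is a minimal sketch, assuming battles are available as (model, model) pairs; the model names and the `connected_components` helper are hypothetical. It counts connected components of the comparison graph with and without a deprecated model's rows.

```python
from collections import defaultdict

def connected_components(battles):
    """Count connected components of the comparison graph.

    Models are nodes; every battle adds an undirected edge between two models.
    """
    graph = defaultdict(set)
    for a, b in battles:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return components

# Hypothetical data: "old" is the only model that faced both groups.
battles = [("a", "b"), ("c", "d"), ("old", "a"), ("old", "c")]
kept = connected_components(battles)
dropped = connected_components([(x, y) for x, y in battles if "old" not in (x, y)])
print(kept, dropped)
```

In this toy case, keeping the deprecated model's rows leaves one connected graph, while dropping them splits it into two groups whose ratings can no longer be compared with each other.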

twin valve
tulip shadow
devout canyon
# twin valve I like this, but instead of every friday, every other friday.

Excuse me for saying this, but I’ve been following the messages in this chat for quite a while, and it’s very clear that you’re constantly defending the lmarena team, almost like you’re their lawyer. It’s honestly hard to believe. Come on, we’re just users, and it’s perfectly fair to ask for an update on the most prominent AI leaderboard, especially after major releases like Claude 4 and the new DeepSeek R1 update. I’m sure many people feel the same way.

twin valve
# devout canyon Excuse me for saying this, but I’ve been following the messages in this chat for...

because I think it is a good project (in determining some LLM abilities) and some critique is undeserved. I rather agree with technical critique, like the "leaderboard illusion", rather than critique from gamblers that want to know early what is going on only for themselves.

As some complain about leaderboard updates, I can complain about their complaints.

I don't see why the complaints of the gamblers should be left alone.

soft crater
#

First I felt weird now I laugh because it is too weird

west lodge
#

Hi everyone, please allow us a little bit more time to update the leaderboard result. We've been going through a big UI/backend transition last week to the new website and the team is working super hard on finishing the new leaderboard pipeline. We want to make sure we get everything right. The result will be ready very soon! Thanks for your patience. 🙏

tender sigil
willow holly
#

the output speed certainly didn't change ~130 token. Stealth update on the API which is called 04-16 would be surprising but possible of course

half stream
#

The leaderboard got updated just recently.

#

No R1 May yet. Not enough battles? 🤔

zealous sable
sinful falcon
#

a little surprised claude wasn't higher tbh

#

like did not expect higher than gemini ofc

tender sigil
#

Sentiment control might boost it a little higher, its very bland textual style hurts it a bit in the eyes of voters

willow holly
#

@sinful falcon not surprised, it is the non-thinking version. It fails at so much logic, complex prompt following and math against the reasoning models

soft crater
#

If not using thinking, possibly is the #1

wind vale
#

I'd like to know the difference between different AIs in terms of web search functionality, and there doesn't seem to be an option for that

#

can we do that? It's really useful for users, not only in text

twin valve
twin valve
#

for my personal tally I can recognize claude answers very quickly (though I cannot tell which claude model it is) and it often loses. That is when the questions aren't hard. On hard questions it performs a bit better.

willow holly
#

They also have the lazy mindset.

"List all official German cities with less than 300k citizens and less than 6 letters."

and opus always puts out a couple and puts a note under it:
"... , so this represents just a selection of the more notable ones."

when you reprompt it to put out as many as it knows it is not bad but always needs two prompts.

other llms just do what you ask them to do

tender sigil
#

unless they’ve had a strong redesign of the thinking feature expecting such a large jump is a tad unrealistic

twin valve
soft crater
#

In Text-to-Image Arena, gpt-image-1 just surpassed imagen3

willow holly
#

image-1 is the gpt4o image model?

heavy thorn
soft crater
#

AI Supremacy

willow holly
#

In every single category improved and in every single one on first place now. Also huge jumps 50+ elo in language(Chinese, French...) . They didn't leave anything out. Also crushed on aider. So even agentic coding seems to be #1 again.

tulip shadow
devout canyon
# soft crater AI Supremacy

Why do they update the leaderboard just after the release of the new Google model, but it took them two weeks to update it when Claude launched their claude 4 series ?

queen jewel
twin valve
autumn granite
#

Can we have a Svelte 5 mode in Web Arena? Most models can't use Runes properly yet. Adding it to the arena would incentivize companies to make their models better at Svelte, instead of just focusing on React.

sinful falcon
peak prairie
#

heyy

#

any new news about kingsfall?

hallow comet
#

anyone tried out o3 pro yet?

sinful falcon
rain fiber
sinful falcon
hallow comet
tender sigil
#

Were there any new models placed on the leaderboard with the update yesterday? as far as I see the only changes are just Google models dropping a few spots, o3 moved over 2.5 Pro 05-06, Opus 4 moved over 2.5 Flash 05-20, plus both GPT 4.1 and Grok 3 moved over 2.5 Flash 04-17

willow holly
#

If they only do updates every 1-2 weeks it would be nice to have an arrow up/down for placement shifts and to mark new models.

tender sigil
#

You can compare to previous Twitter posts of the leaderboards, but that appears to be the only snapshot feature

sinful falcon
#

last one:

gemini-2.5-pro-preview-06-05: 1476.34 (18.59)
gemini-2.5-pro-preview-05-06: 1446.0 (9.0)
gemini-2.5-flash-preview-05-20: 1420.3 (12.46)
o3-2025-04-16: 1420.24 (5.41)
chatgpt-4o-latest-20250326: 1416.86 (4.79)
grok-3-preview-02-24: 1412.27 (4.93)
early-grok-3: 1410.43 (5.53)
gpt-4.5-preview-2025-02-27: 1406.12 (5.75)
llama-4-maverick-03-26-experimental: 1404.25 (6.74)
gemini-2.5-flash-preview-04-17: 1392.36 (5.85)
placid glen
#

What about way back machine

glacial glacier
#

@drowsy needle you can delete messages from this @frosty osprey guy right
*also spammed in #prompt-to-leaderboard

frosty osprey
#

Yes

brittle pine
#

The reason Google models have such a high position in these lists is that they use RAW outputs, just like in AI Studio with safety filters off

#

Less censored, more detailed, just what people wanted

#

But if you use the web/mobile app, outputs are trash because of dumb safety filters and a dumb system prompt

tender sigil
#

Gemini being better in LMArena than in-app for the ppl who pay for it is kinda funny lowk

tender sigil
#

oh, Coolio!

#

just Nova Experimental then

tame surge
#

Why is the new R1 not on leaderboards yet?

brittle pine
#

If their membership includes ai studio in future ill gladly pay it

orchid kestrel
#

best ai for image generation so far in lmarena?

drowsy needle
errant scroll
#

Any plan to add (deep) research leaderboard? I'm sure it would be expensive but I know I'd love to give you guys my opinion on models 😂

autumn granite
#

Could the blank response rates (per provider, if applicable) be reported for each model on Web Arena? Blank responses are pretty annoying and I'm curious whether it's a function call or API issue. Either way, it might incentivize model makers/providers to fix it.

sinful falcon
twin valve
#

I do for some large testing - that is, to see how different models reply to the same query. Otherwise I should go the openrouter way

tender sigil
devout canyon
#

I apologize for this, but it’s not normal at all. Why do you always release the update of the leaderboard in the same hour when Google launches a new model, but not with the other releases?

drowsy needle
# devout canyon I apologize for this, but it’s not normal at all. Why do you always release the ...

Thanks for the question, totally fair to ask. We do sometimes coordinate leaderboard updates with model providers if they request it. This gives them a chance to celebrate how you, the community, responded to the model as part of their timed announcement to the public.

That option is open to everyone, but not all providers choose to use it. Many updates happen independently, based on when we’ve collected and validated enough new voting data, a process that usually takes about a week. We’re also exploring ways to make updates more frequent and automatic, so the process feels more consistent no matter who’s involved.

We appreciate your attention to fairness, as it’s something we also care deeply about.

devout canyon
native pecan
#

Hi! I was wondering if there is a way to access the historical leaderboards of LM Arena? I'm looking for previous rankings for academic citation purposes. Thanks in advance!

drowsy needle
native pecan
#

thanks for your reply

crude hawk
#

it's nice they maintain that as a public resource.. tho i'm doubtful we'll ever get raw Arena chat data again, at least in large volume.. (it's valuable.. and they've got investors behind them now aha)

#

they do refer to their past release of some of the chat data in that blog post.. so who knows.. i'd be happy to have my cynicism proven wrong aha

twin valve
#

About the "i'm doubtful we'll ever get raw Arena chat data again": I wouldn't like to get the raw prompts and answers, as those could be used for benchmaxxing. Maybe those that are very old (2+ years) could be released.

What I would really like to see, for confirming the leaderboard and making alternative ones, would be at least the results of the votes. Those would be very interesting. See ... oh it is gone. So I created a request - dunno where it is now - to ask about the results of the battles, not the text, only:

model A
model B
result

so that one could independently verify that the leaderboard is correctly computed and also apply other rating systems and/or do other analyses.
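As a sketch of what such a release would enable: with only (model A, model B, result) rows, anyone could recompute a rating. The example below uses plain online Elo as a stand-in; it is not claimed to be the method the leaderboard uses, and the model names, the `elo_ratings` helper, the K-factor and the base rating are all illustrative assumptions.

```python
from collections import defaultdict

def elo_ratings(battles, k=4.0, base=1000.0):
    """Online Elo over (model_a, model_b, result) rows.

    result is "a" (A wins), "b" (B wins) or "tie". Illustrative only.
    """
    ratings = defaultdict(lambda: base)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, result in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of A under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = score_for_a[result]
        delta = k * (actual_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta  # zero-sum update
    return dict(ratings)

# Hypothetical vote results, text of prompts and answers not needed.
battles = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
    ("model-z", "model-y", "b"),
]
ratings = elo_ratings(battles)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The same rows could be fed to other rating systems (for example a Bradley-Terry fit) to see how much the ordering depends on the method.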

crude hawk
#

yeah that's prob a fair point re benchmaxing (but i still think something should be released; like it's the public who's casting the votes.. but atm the data only goes to LMarena and the providers who serve models, at least partially.. doesn't feel ideal but yeah i do hear your point)

twin valve
#

yeah for that I say: releasing with a delay (2+ years), sooner or later everything will be released and models cannot benchmaxx too quickly.

On another side, the more they collect, the more their data become actually a valuable dataset and they can fund themselves selling it. That would be ok too for me. Only I really wish they could release the result of the votes, that is already one way to verify the system.

drowsy needle
fleet stag
#

Not sure if tables are a part of "style control", but if not, I’d definitely recommend including them. Feel like o3 is so aggressive with them, so it could be a big factor

left cobalt
#

Hi, I wonder how to participate in the maintenance of lmarena. is there any official manner?

drowsy needle
upbeat swift
clear pulsar
#

Hello. I'm doing some research based on LMArena and I wonder which part of data is included in the leaderboard/Arena Overview table (In the screenshot).

  • Does it include only text or also multimodal chats and text2image?
  • Does it include data from other arenas such as Copilot Arena?
median briar
#

*as far as i know

left cobalt
gleaming storm
#

hello, I'm here to check out this arena

autumn granite
#

Imagine if there was a code interpreter built right into LMArena... coding rankings would be far more reliable. There are already open source libraries for this.

See LiveCodes (MIT licensed), which runs entirely client-side and supports 90+ languages & frameworks: https://livecodes.io/docs

twin valve
#

running the interpreter for every question would cost a bit I'd imagine

autumn granite
rotund burrow
#

Should we be expecting Claude 4 models with thinking on the leaderboards at some point?

brittle pine
#

even if claude's answers are really good, people still care about those things

rotund burrow
tender sigil
#

“hey this model has been in the arena and named for a while now, will we see it on the leaderboard soon?”

#

“it doesn’t use emojis”

#

thank u for that brilliant insight 😭

#

To answer your question Brian, Claude 4 thinking models will likely be included in the next leaderboard update! a bunch of new models have been added in the last week, I believe we’ll see an update in the next few days, or when Gemini 2.5 Pro Deep Think releases 👍

sonic junco
#

I think it’s useful to get more clarity on how often/when leaderboards get updated

#

It would be hugely useful to get something akin to a weekly update and additional updates when there are new models releasing that are coordinated with the lmarena team

brittle pine
# tender sigil thank u for that brilliant insight 😭

Well, I just misunderstood the question. Is there a reason to be rude? You guys are more polite to AI than to humans, which is kinda dystopian, not gonna lie. Anyway, Claude wrote a whole paper about how human feedback turns models lame, and they talked about exactly my points, like style, using graphs or emojis. So yes

drowsy needle
sonic junco
tender sigil
brittle pine
#

facts

#

agree

tender sigil
#

Claude wrote a whole paper about human feedback turning models “lame” and they talked exactly about your points? That’s so cool! Was the original question asking about leaderboards, or Claude emoji usage?

#

I’ll give you a hint, we are currently in the #leaderboards channel 😄

tawdry socket
#

@drowsy needle Was 0605 name changed to 2.5pro? How does 2.5pro have 10,000+ votes otherwise?!

hallow comet
sinful falcon
#

already

#

i'd rather have AI do my homework for an upper level college class than a random person w that major

hallow comet
sinful falcon
#

if it was trained on case law i think it would be better than most lawyers

#

but it is 100% better than any junior associate

lean lotus
#

Updating a model to a new cutoff date takes a long time

hallow comet
sinful falcon
lean lotus
hallow comet
#

Most models are very bad at long term stuff even very simple ones, like simple games

lean lotus
#

Pokemon

hallow comet
#

O3-pro i think was somewhat better than the rest but takes forever to run

hallow comet
lean lotus
#

They still lack common sense

solemn edge
#

Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?

tender sigil
#

AI is genuinely AWFUL at poker

sinful falcon
tender sigil
tender sigil
#

game theory in general is a strong weak spot of even the “reasoning” LLMs

twin valve
drowsy needle
tender sigil
#

this has already been resolved ☺️

#

don’t think the mods need u to do their job for them pier :p

drowsy needle
#

lets just move on, no need rehash things

placid jungle
#

Would be cool to know if Gemini 2.5 pro is its own new model or a rename seeing as 06-05 is gone 👍

#

I noticed both were on leaderboard last night, but a couple hours later 06-05 was removed

drowsy needle
placid jungle
timber owl
placid jungle
#

I do see its gone from the ai studio drop down tho

lean lotus
placid jungle
#

This is what I was looking to see, thank you!

drowsy needle
half stream
#

qwen3-235b-a22b-no-thinking got 1408 on hard prompts, while the reasoning-enabled one got 1387, and the difference is significant.

grave rampart
#

hello

vocal rain
#

they can pick up patterns really well in text, but games work on a different format, maybe thats part of the reason

#

also games have multiple steps too

tender sigil
#

yeah, multi-step reasoning is kinda complicated for them since if one underlying step gets messed up everything else flops

#

like the prompt “You and 99 other players each privately choose a number between 0 and 100. The winner is whoever gets closest to exactly 2/3 of the average of all submitted numbers. What number should you choose and why? Walk through your complete reasoning process.” I came up with to see how far they would take the logic

#

I got an answer of 0 because it kept recursively calculating “2/3 of this average is 33.3, 2/3 of 33 is 22, 2/3 of 22 is…”

#

flamesong was the only one to correctly intuit the concept, and guessed 15
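The recursion described above is the usual level-k picture of this game. A minimal sketch, assuming a level-0 player guesses the midpoint 50 and each higher level best-responds to the level below by taking 2/3 of it (the function name and the level-0 anchor are assumptions for illustration):

```python
def level_k_guess(k, level0=50.0, factor=2 / 3):
    """Guess of a level-k reasoner: apply the 2/3 step k times to the level-0 guess."""
    return level0 * factor ** k

for k in range(6):
    print(f"level {k}: {level_k_guess(k):.1f}")
```

Under these assumptions the sequence is 50, 33.3, 22.2, 14.8, ... and only reaches 0 in the limit, so a guess of 15 corresponds to stopping after about three steps rather than following the recursion all the way down.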

twin valve
tender sigil
#

it rambled a lot more

idle ocean
#

and if you did this test where all of the "players" were different multi step reasoning ai's then flamesong technically got last, because they were the furthest away from the answer.

tender sigil
#

yeah, but it’s kinda obvious that not every player is a game theory optimal player

#

proof of the lack of multi-level reasoning, only seeing the Nash Equilibrium instead of thinking about other player’s non-perfect strategies

idle ocean
#

maybe try specifying that their opponents are human.

tender sigil
idle ocean
#

improvement I guess? ¯_(ツ)_/¯

twin valve
# tender sigil like the prompt “You and 99 other players each privately choose a number between...

Likely if you prompt only this, they reference discussion that they have seen in their training and the answer is zero.

Most of discussions online don't cover imperfect strategies.

The data shows that the correct answer is not 15, it varies (the more the experience, the more it goes to zero): https://en.wikipedia.org/wiki/Guess_2/3_of_the_average

In game theory, "guess ⁠2/3⁠ of the average" is a game where players simultaneously select a real number between 0 and 100, inclusive. The winner of the game is the player(s) who select a number closest to ⁠2/3⁠ of the average of numbers chosen by all players.

#

For me it seems obvious that the models are likely very influenced by the usual "the Nash equilibrium is zero".

median briar
#

honestly 0 is the right answer considering that most people who ask that question without any context are likely talking about standard game theory

#

and not some "Ahm actually 🤓 , no one specified that the players are rational or common knowledge of rationality exists"

#

and beyond that there is no "solution" to this question, the best thing you can do is make an educated guess about the other players (rationality, experience, if the game will be repeated ...)

clear pulsar
#

Interesting one. If I add "Note that you're playing with humans who are not always rational", models give a different answer. GPT series prefer numbers around 22.

#

BTW Current deepseek v3 (not r1) on the official site gives me a long COT (just without <think>). I wonder how much data from r1 did they use to train v3. v3 answers 8, r1 answers 20 after a long thinking process, kimi k1.5 just doesn't stop thinking for at least 10 minutes, and finally answers 22.

#

This triggers long thinking contents in qwen3, kimi and deepseek (all reasoning). I wonder if it's the model's fault or it is too hard to make a decision

analog cliff
halcyon grove
#

is o3-pro in the leaderboards?

fallen slate
#

HU

#

Here to learn

forest schooner
#

battle

sinful falcon
#

is grok 4 in the arena yet?

hallow comet
sinful falcon
#

guys can someone let me know what they think about XAI being first in lmarena on idk say 10AM EST last day of the month?

#

🥺👉👈

rain fiber
cobalt crest
thorny ginkgo
rotund burrow
#

A couple of weeks ago I asked about sonnet and opus 4 thinking on the leaderboards. Does anyone know if this is still planned, or blocked somehow? From the original X thread it looked like Sonnet and Opus for non-thinking.

drowsy needle
tender sigil
sinful falcon
tender sigil
rustic quartz
#

wolfstride/stonebloom are the checkpoints of the same Google model

#

we still don't know which one: 2.5 Pro-next, 2.5 Ultra or even early 3.0 Pro checkpoints

rapid cobalt
#

Wait, is that a typo

#

or is there a model called Wolfstrike?

#

I only got wolfstride

rustic quartz
#

yeah, a typo

tawdry hearth
#

hey everyone

marble olive
rustic quartz
marble olive
#

lowk it's probably just a pseudo name for a host of different models

robust zodiac
#

When can we expect to see a leaderboard update? Rather new here/not familiar with how it works. I see some are 5 days ago some 50 ago

drowsy needle
heady patio
tender sigil
sacred basalt
#

I can’t see kimi k2 ?!

drowsy needle
sacred basalt
drowsy needle
tawdry hearth
#

Grok 4 is finally shown in the WebDev leaderboard. Seems kinda low-ish. Does that make sense since it needs more votes?

obsidian flint
reef moth
tawdry hearth
#

Won 4/4 on my end

obsidian flint
#

0/3 Grok on my end. Ping Pong game was buggy. pomodoro timer looked worse/less features. Sticky Note app didn't load. Why is it so bad at coding?

reef moth
brittle pine
#

Math and Physic ?

#

Is elon just trained mechah*ler with his spacex data ?

#

What makes grok 4 special

twin valve
#

@glacial glacier what's with the "estimated based on other leaderboards" trick? Sounds like a neat idea if it infers the score using other somewhat reliable benchmarks (LiveBench and others)

glacial glacier
half stream
twin valve
twin valve
twin valve
# twin valve ok a new one. I think they are doing more or less like lmarena (just testing) bu...

I checked around about yupp.ai . Incredibly it is really like lmarena, only their business (so far) is really to collect data. So rather having openAI, meta and what not collect data for further training, they say "hey come to us and let us collect your prompt and preferences" (both are helpful)

What I find amusing (and sad) is that projects like lmarena, that try to be as transparent as they can (they cannot be too transparent otherwise no funding), get all the flak. Then comes around the next project that is totally closed and 100% I guess it won't be questioned.

But I welcome lmarena "replicas" so to speak. If multiple arenas more or less return the same results, at the end the findings are validated. The problem of yupp (or also sciarena) is how they evaluate the pairings and the prompts. Different rating systems could yield different results, without even considering the fact that some votes may be ignored as deemed unhelpful. (hence my request of results so that the community can independently create different leaderboard analyses: #1372537524551159913 message )

thorny ginkgo
velvet lichen
#

when do the scores update

gloomy raven
#

Inspired by LMArena - We've developed open source chessarena AI leaderboard.

vale lodge
#

also the text and icons look too small imo

#

also how old is that ss?

#

looks like its from late 2024 judging by the models

gloomy raven
twilit echo
# gloomy raven Inspired by LMArena - We've developed open source chessarena AI leaderboard.

Nice. I didn't check your repo yet, but 3 checkmates in 90 matches on 4o mini? that's 3.3% 🤔 In my matchups it had 7/27 = 26% checkmates. Probably completely different methodology, but you might then also be interested in these findings: https://dubesor.de/chess/chess-leaderboard

twin valve
twilit echo
crisp rampart
#

Hi, is there a place to view the latest leaderboards for Humanity's Last Exam, FrontierMath, etc

low bough
# crisp rampart Hi, is there a place to view the latest leaderboards for Humanity's Last Exam, F...

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.

idle ocean
#

idk about the rest of you, but when i was using search arena perplexity's stuff was always the worst, not really sure whats the point of the 18 billion dollar company, since i thought it was search : P

reef moth
#

wrapper scam

sterile jacinth
#

when leaderboard updates?

glass sun
#

What benchmarks are you guys using rn other than LMArena?

#

I used to rely on Livebench a lot but they're garbage now

drowsy needle
brittle pine
#

But i definitely don't trust them for code benchmarks

#

I still can't understand how they measure coding abilities

narrow cloud
clear pulsar
#

Hello, guys, I find this ranking a little strange. I wonder how this leaderboard determines the ranking and whether 2 models have the same ranking. Why is 1 1 followed by 2, 1 1 2 3 3 by 5 and then 1 1 2 3 3 5 6 6 6 6 6 6 by 10?

glacial glacier
#

but for real it is a little unintuitive

#

it basically means "how many models are we 95% sure that this model is worse than"

#

if two models have the same rank, then statistically they are around the same level of capability

exotic peak
#

hello

drowsy needle
small notch
#

Hello

drowsy needle
pseudo rivet
glacial glacier
#

the rank is literally "how many models are we 95% sure that this model is worse than"

#

the old site said that, don't think the new one does

#

here's a visualization of the CIs

pseudo rivet
glacial glacier
pseudo rivet
#

Ah I see

#

It’s interesting that the ranks are not increasing

#

It seems like it should be + 1 , but then that definition works for all ranks except rank 1

tender sigil
glacial glacier
jaunty cave
sinful falcon
#

can someone explain why the confidence interval has been increasing and not decreasing with more votes?

#

this seems not intuitive at all

pseudo rivet
jaunty cave
# pseudo rivet Like this

ah right, so what's happening here is that while o1's estimated rating is higher than o4-mini's, o4-mini's confidence interval is larger. That larger CI means it could be even higher than o1 and that there are fewer models that we are very certain are above it.

The rank number is essentially: "how many models are above your upper confidence interval?", and if your confidence interval reaches higher, then fewer models are clearly above you.

Definitely a little counter-intuitive, it's one of the features which makes it tricky to rank items which have different levels of variability. Should you rank their mean or rank their upper bound? Right now it's by upper bound
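
A rough sketch of that rank rule, for anyone who wants to play with it. It assumes rank = 1 + the number of models whose lower CI bound sits above your upper bound (the "+ 1" is a guess from the discussion above, not something confirmed here). The model names and scores are made up.

```python
def arena_ranks(models):
    """models: dict of name -> (lower, upper) bounds of the 95% CI on the score.

    Assumed rule: rank = 1 + number of models whose lower bound is above
    this model's upper bound (i.e. models that are clearly better).
    """
    ranks = {}
    for name, (_, upper) in models.items():
        clearly_better = sum(
            1
            for other, (lower, _) in models.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Hypothetical scores, not real leaderboard values.
models = {
    "model-a": (1450, 1470),
    "model-b": (1445, 1465),
    "model-c": (1420, 1440),  # mean 1430, narrow CI
    "model-d": (1400, 1452),  # mean 1426, wide CI
}

ranks = arena_ranks(models)
print(ranks)
# {'model-a': 1, 'model-b': 1, 'model-c': 3, 'model-d': 1}
```

model-d has a lower mean than model-c but a better rank, because its wide interval means no model is clearly above it. That is the same effect as the o1 vs o4-mini case, and it also shows why ranks can skip numbers.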

tender sigil
#

the whole point of LMArena is that it runs on its own dataset

glacial glacier
jovial sapphire
#

SHAIR WALLI

brittle pond
#

Hello, just had a question how long after a new model is dropped it takes for the leaderboard to update?

marble rivet
#

@glacial glacier

pliant cedar
brittle pond
twin valve
#

no

#

unless they run it cloaked for a while

#

some models are run cloaked on the arena and they are announced the same day when they go public on the arena.
some others they are first announced and only then they go cloaked on the arena

#

some other models aren't processed by lmarena at all (so far) but there are other arenas around

brittle pond
hallow comet
#

quasar alpha, I think was one name

brittle pond
#

so all the talk this zenath model is gpt5 is just rumors / fake?

hallow comet
#

Not 100 percent but probable given the performance

#

I am not an expert though

#

or an insider

twin valve
#

practically all models that are already ranked were cloaked at some point in the past

brittle pond
#

so lets say zenath is gpt5, if they release it on july 31st. It can and will be updated on the leaderboard right away cause its being ranked in cloak rn right

twin valve
#

yes. if model A is cloaked and it is gpt5 (or gemini 3, or grok 5 or what you want) and the provider decides to uncloak it only after their announcement, then it gets public shortly after the announcement by openAI, provided it has enough votes to keep the CI low.

#

and also provided that the vendor is ok with making the model public

brittle pond
#

I see thanks guys

twin valve
#

for example gemini in may was public on the arena after more or less 1 week. Claude 4 after 2 weeks

brittle pond
#

and when they announced it, immediate update on leaderboard?

#

how trivial is the amount of votes

twin valve
#

from 3k to 8k it depends

#

it will be announced in #announcements , on twitter and on the leaderboard

#

(I'd rather see models with low CIs in every category rather than only in overall)

brittle pond
#

thanks guys

jaunty cave
twin valve
#

actually I was curious if what I wrote was so ambiguous. I let an LLM analyze what I wrote and for the LLM it was clear enough.

jaunty cave
# twin valve you either misread me or I worded myself poorly. I meant I wish they would uncl...

ohhh I see what you mean. I think a result of this would be that a lot of models would just never get uncloaked since the lower volume of data in the specific categories grows so slowly the CIs would shrink very slowly.

You can simulate this yourself by downloading the leaderboard, filtering out all the models with CI width above your desired threshold, and then re-counting the number of models above each.
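
That simulation could look roughly like this. The row layout (`model`, `lower`, `upper`) is a hypothetical stand-in for whatever the downloaded leaderboard file actually contains, and the numbers are invented.

```python
def ranks_after_ci_filter(rows, max_ci_width):
    """Drop models whose CI is wider than max_ci_width, then re-count ranks.

    rows: list of dicts with hypothetical keys "model", "lower", "upper".
    Rank is re-counted as 1 + number of remaining models clearly above.
    """
    kept = [r for r in rows if r["upper"] - r["lower"] <= max_ci_width]
    result = {}
    for r in kept:
        clearly_above = sum(1 for o in kept if o["lower"] > r["upper"])
        result[r["model"]] = 1 + clearly_above
    return result


# Hypothetical leaderboard rows.
rows = [
    {"model": "model-a", "lower": 1450, "upper": 1470},
    {"model": "model-b", "lower": 1400, "upper": 1452},  # CI width 52
    {"model": "model-c", "lower": 1420, "upper": 1440},
]

filtered = ranks_after_ci_filter(rows, max_ci_width=30)
print(filtered)  # {'model-a': 1, 'model-c': 2}
```

With a threshold of 30, model-b would simply not be shown, which is the "never gets uncloaked" outcome if its category CI stays wide.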

twin valve
#

yes, correct. One problem I have been testing since around May is that many, many models that are in the leaderboard don't get tested anymore, and that's suboptimal for several reasons, but the big two I can see are

(a) it creates selective pairings, which is a known problem for Elo-based rankings. Hence I hope lmarena will release the results - not the contents - of each vote so that people can verify the rankings and check for pitfalls.
(b) it doesn't assess models properly because their CIs are still relatively large for many categories (besides overall). It is likely that there are several types of human judges, who would judge a type of query differently. Let's say there are 10 types of human judges (there could be many more actually). If every category is split into 10 subcategories, one needs a lot of votes to achieve a stable result (some models likely reached that, but only some)

So yes, I am aware that some models would never be uncloaked, but it would actually be better if models were tested a bit more, especially in lulls when there aren't many cloaked models around.

On this I have two points in the feedback, one moment

#

from the internet about selective pairings

"The Elo system calculates a player's rating based on their performance against other players. It assumes that over time, a player will compete against a variety of opponents across the skill spectrum. The rating difference between two players is used to predict the outcome of a match between them.
When pairings are selective (e.g., only strong players play strong players, weak players only play weak players, or certain players are deliberately avoided), the player doesn't encounter a representative sample of the overall player pool."

glass sun
#

<@&1349916362595635286> Will the search and copilot arenas ever be brought back?

glacial glacier
#

also try to not ping for small things like these

glass sun
#

They haven't been updated in 2 months

#

Mb

glacial glacier
#

well search data gathering has restarted so i expect an eventual update

#

not sure on copilot

glass sun
#

Kk thanks

modest pecan
#

Hello. Here to test video creation tools. Special interest in video with sound, as with Veo 3.

drowsy needle
normal yew
#

why there's so many greens appearing suddenly

drowsy needle
drowsy needle
topaz forge
#

Is it true ChatGPT 5 was available on the Arena under a code name?

lost willow
#

HOW USE THIS DISCORD

drowsy needle
glacial glacier
#

i'm going to clean up this channel

#

okay, cleaned up

#

so uhh

#

nice leaderboard huh

hallow comet
brittle pine
#

O3's close score to gemini always gives me some melancholy

harsh steeple
#

insane

brittle pine
#

why is base model much better than thinking one ?

#

qwen3

idle ocean
#

Qwen3 how

#

That's impressive

#

How much is alibaba spending on it?

oak tendon
#

I just came across velocilux, surprisingly good results on my end.
Anyone knows more about it or cogitolux? Could they be related to cresylux?
Wondering if it’s part of the same family or just similar naming.
Any info appreciated!

final shore
# harsh steeple insane

Now it makes sense why Anthropic was crying for the government to stop giving GPUs to China, imagine if China had the GPUs

twilit echo
# brittle pine why is base model much better than thinking one ?

instruct also scored higher on my bench than thinking. thinking doesn't scale well, and it also doesn't guarantee better responses in all situations. It can actively hurt instruction following and overthink past established solutions. It can also introduce unwanted factors such as overcautious prompt risk analysis

sharp apex
#

Wow, this model looks incredibly powerful! Never seen before. GLM-4.5

twin valve
twin valve
#

could we make a landing channel rather than derailing this? @drowsy needle @glacial glacier

drowsy needle
tame flint
#

guys is there a video leaderboard by lmarena ?

drowsy needle
tame flint
#

Ooh okeee ty

steady hinge
#

When will veo3 or video generation be released on lmarena.ai

#

The website itself

drowsy needle
twin wharf
#

Just wondering about the leaderboard, looks like 43 models were purged (total models) number was changed/decreased, and also, the total votes have increased by 400,000 or so? Could someone help explain what change to the leaderboard happened?

glass sun
#

Ok so

#

Can someone tell me how good OSS is

#

I genuinely can’t keep track of all the news

glacial glacier
glass sun
#

When's the next arena update?

drowsy needle
glass sun
#

Yes

drowsy needle
#

Soon

glass sun
#

Kk

#

Also is the copilot arena still a thing?

glass sun
#

This thing

agile flower
hallow iris
#

Is it useful to vote on the video-arena? Is it just for testing a bit, or will there be a leaderboard made (is it even possible to make one with the current system)?

By the way, given that there is no leaderboard yet, after the two first votes, and the model reveal appears after two votes, do the third and fourth count?

Last question, have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit, on random previous generations appearing?

It seems very promising - but costly - glad you made one though!

drowsy needle
# hallow iris Is it useful to vote on the video-arena? Is it just for testing a bit, or will t...

Is it useful to vote on the video-arena?
Yes, we are planning to build a leaderboard. But yeah seeing this is a very different method we're in the process of validating the data to ensure we feel good about the leaderboard once shared.

after two votes, do the third and fourth count?
Yup.

have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit
Yeah that's possible, we have been considering a variety of ways to grant gen credits; however, the balance we need to achieve is encouraging votes, but only if they're high quality that aren't just votes for sake of getting some kind of benefit.

#

But yeah all that to say this is an experiment!

drowsy needle
#

after two votes, do the third and fourth count?
Need to correct myself on this one -> we are not counting votes after a model's name has been revealed @hallow iris

hallow iris
#

veo-3 audio is too recognizable. I'm not saying the audio doesn't matter, but if the prompt doesn't reference keywords like "saying", "sound" or "audio", wouldn't it be good to automatically remove the audio from veo-3 videos before voting, and then update the mp4 file afterwards to show the version with audio?

I really feel like the audio is biasing because it reinforces the feeling that an image is well generated if the audio is matching. There's no doubt veo-3 will be the leader, at the moment no model is as consistent, but still...

I've had moments when I thought "Oh, it must be Veo-3", while it wasn't, when voting, but from the moment there is audio, I know it's veo-3...

#

And I'm pretty sure that in some rare cases people will say Veo-3 is better when it's not, just because there is the audio they want so much to see. Some people are using lmarena specifically to get a veo-3 result; the fact that so many people include audio in their prompts is a dead giveaway, even more so when they do image to video with a personal brand or restaurant or service of theirs

#

Veo-3 has some flaws, I've been voting probably a bit more than a hundred times, and sometimes it doesn't respect the prompt, especially for specific, long prompts. This can also be an issue. Some prompts are annoying to read, and I can bet people wouldn't read it and just focus on video quality

#

Long prompts take sometimes more than a minute to read 🤣 💀

#

I've been myself voting for some things without totally reading the prompt, and afterwards noticing that despite video quality, a model I voted against actually respected the prompt better.

glacial glacier
#

you can just ignore the top 2 rows if you think audio biases it

hallow iris
#

This one is interesting though

jaunty cave
rain fiber
#

when is gpt 5 gonna be added?

woeful walrus
hallow comet
#

hello

#

has anyone tested gpt oss?

#

I wonder how it is in real use case.

tall terrace
#

hello, i want to know me about ai

meager pulsar
#

/video

hallow comet
final shore
hallow comet
#

Will we see GPT-5 with reasoning on leaderboard ?
I dont think its fair to just put GPT-5 there when enabling thinking unlocks so much more performance

idle ocean
hallow comet
idle ocean
#

if the only difference is reasoning length I think it's fine like that

hallow comet
idle ocean
#

Good

glass sun
#

Will it ever come back?

#

Does anyone know?

rich echo
#

So where is Qwen-Image in leaderboard?

soft crater
#

Is leaderboard out?

drowsy needle
compact parcel
#

is Veo 3 available?

glacial glacier
drowsy needle
thorny ginkgo
#

Is the gpt-5 on the leaderboard thinking model or non thinking?

void yew
#

is imagen-4-ultra still around on lmarena?

#

haven't gotten it in a while

#

nevermind just got it wow

#

it had been a LONG while

rare pelican
rare pelican
void yew
#

it's really weirdly rare

#

I've gotten imagen 4 normal multiple times before and after that

rare pelican
void yew
#

I was trying to see which models could generate characters with six fingers on each hand

rare pelican
void yew
#

nah that was like, 2023

rare pelican
void yew
#

don't think so

rare pelican
#

mb I meant gemini 2.0 flash img generator

void yew
#

oh yeah was gonna say "cant be worse than gemini 2.0 flash"

rare pelican
#

Hey Varka, do you think Google will reign at the top in the end?

#

Cause I think so

#

They got Gemini 3 and Genie 3 on the way, Imagen 4, Veo 3, and Lyria 3.

void yew
#

In terms of image generation? Maybe. But imagen ultra 4 can't do cyclopes weirdly enough.

rare pelican
#

After all, you cant let it create something which it has no experience on. It will mess it up.

glacial glacier
#

how long until veo 3 is overtaken?

pseudo rivet
#

By what?

brittle pine
#

does any model other than veo support sound right now?

hallow comet
glacial glacier
hallow comet
#

At least on their website

vague dome
idle ocean
#

will gpt5 be added to search arena?

glacial glacier
#

🚨 Leaderboard Update:
Claude Opus 4.1 climbs to #2 overall on the Arena and now becomes the best non-thinking model, matching GPT-5 at #1 across key categories:

- Coding
- Instruction Following
- Hard Prompts
- Longer Queries

Congrats to @AnthropicAI on this impressive

#

(you can see the previous version of opus just below qwen 235b)

glacial glacier
torn lotus
#

hello

cunning steppe
#

@drowsy needle Could you (or someone else) kindly help me understand the “Remove Style Control” Rankings for text? GPT-5 is top of the leaderboard for every category, except for Creative Writing. Yet despite that performance, it still trails Gemini overall. How is that even possible?

I also didn’t count foreign languages, but i feel like those shouldn’t impact rankings

jaunty cave
#

Style control or non-style control isn't a category per-se. It's a different method for aggregating the votes into a leaderboard. Each category is a subset of the full dataset of votes, and for each category we compute the ranking with style control and without. We set the default to be with style control.

The difference is that in style control, it takes into account things like number of lists, markdown headers, and bold text sections. It's been found these elements impact voters a lot and some model companies over-optimize for these elements. The style control measures the strength of each model as if all style elements were held equal. So models which use a lot of lists and bold end up lower. The method is described here: https://news.lmarena.ai/style-control/

LMArena Blog

We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in Chatbot Arena leaderboard.
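
A toy illustration of the idea, not the actual LMArena pipeline: add a style feature as an extra covariate in the logistic regression, so the model coefficients reflect strength "with style held equal". The two models, the battles, and the single `style_diff` feature (a normalized length difference) are all hypothetical, and the fit is plain gradient ascent.

```python
import math


def fit_logistic(rows, n_features, lr=0.5, steps=5000):
    """Fit logistic regression weights by gradient ascent on the log-likelihood.

    rows: list of (feature_vector, outcome) with outcome 1 if side A won.
    """
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for x, y in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for k in range(n_features):
                grad[k] += (y - p) * x[k]
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


# Hypothetical battles as (style_diff, a_won). "verbose" is always side A and
# "terse" is always side B. When verbose writes much more (style_diff = 1) it
# wins 8 of 10; when lengths match (style_diff = 0) it wins 5 of 10.
battles = [(1.0, 1)] * 8 + [(1.0, 0)] * 2 + [(0.0, 1)] * 5 + [(0.0, 0)] * 5

# Model features: +1 for the side-A model, -1 for the side-B model.
raw_rows = [([1.0, -1.0], won) for _, won in battles]
ctl_rows = [([1.0, -1.0, diff], won) for diff, won in battles]

raw = fit_logistic(raw_rows, 2)
ctl = fit_logistic(ctl_rows, 3)

raw_gap = raw[0] - raw[1]   # verbose minus terse, no style control
ctl_gap = ctl[0] - ctl[1]   # verbose minus terse, style held equal
style_coef = ctl[2]         # how much the style feature itself sways votes

print(round(raw_gap, 3), round(ctl_gap, 3), round(style_coef, 3))
```

Without the style feature, verbose looks clearly stronger. With it, the gap between the two models shrinks to roughly zero and the advantage is attributed to the style coefficient instead, which is the "models which use a lot of lists and bold end up lower" effect.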

cunning steppe
cunning steppe
jaunty cave
#

actually also GPT-5 is not winning all categories without style control:
In Chinese, German, Russian, Japanese, and Korean Gemini beats GPT-5, in Spanish Gemini is ahead by 75 points!

Also the list of categories is not exhaustive, and also not mutually exclusive.
For example if a prompt is in German, and asks for code for partial differential equations, it might be tagged as German, Coding, and Math categories and count for all 3. But in the overall, it is only considered once

#

the overall is not an average of the categories, it is just computing the rank using all data, nothing filtered out

#

There can also be some prompts which get no category tags! People do all sorts of things that don't fit well into buckets, those votes would still influence the overall leaderboard but would not influence any of the category rankings

cunning steppe
#

Wow that makes things so much clearer, thanks Clayton!

jaunty cave
#

No problem!

radiant hollow
#

I'm trying to understand to what extent we can compare rating improvement over time

heavy pivot
#

yeah claude can't handle equations with decimals but somehow does well in coding and agentic use? idk what's up anymore

glacial glacier
#

there have been changes to the leaderboard recently, but that file hasn't been touched in a while

keen finch
#

It’s about leaderboards and rankings

cunning steppe
#

It’d be nice if you shared with the class

#

Just as Clayton and I did last night

drowsy needle
manic frost
cunning steppe
#

That reminds me - i did have another question. Are the leaderboard scores cumulative? OpenAI 4o + o3 and Google Gemini 2.5 pro all have around ~30,000 votes each. Sometimes companies update existing models. Do new votes count more than old votes? If not, how are these updates adequately captured?

#

@drowsy needle

keen finch
jaunty cave
drowsy needle
gleaming thistle
#

how did gpt5 get so high in the leaderboard so quick?

jaunty cave
# gleaming thistle how did gpt5 get so high in the leaderboard so quick?

It was actually tested on lmarena before it was released, under the codename summit. So the votes were already collected before GPT-5 was officially announced by OpenAI
https://x.com/ml_angelopoulos/status/1953506803255586971

Millions of people have used GPT-5 under the codename summit on LMArena over the past couple weeks 🏔️

The people have spoken: GPT-5 is #1 on EVERYTHING in LMArena.

🧮 Math

💻 Coding

🖋️ Creative writing

Check out an example of its multifaceted intelligence in the 🧵

cunning steppe
gleaming thistle
#

is gpt5-high the model we get on gpt free? or gpt plus?

drowsy needle
drowsy needle
gleaming thistle
#

isnt there a different model for plus users? or do they get gpt-5-high?

keen finch
#

Do you go by the historical time series of elo?

jaunty cave
#

It doesn't use Elo anymore, it uses a modified version of Bradley-Terry.
This is the post on the move from Elo to BT: https://lmsys.org/blog/2023-12-07-leaderboard/

For a long time, the CI was calculated by bootstrapping, re-sampling the dataset many times, finding the ratings on the sample, and then seeing how much the ratings vary across all the runs. It recently switched to something better using a closed form equation based on M-estimators. https://en.wikipedia.org/wiki/M-estimator
See the July 23 entry at https://news.lmarena.ai/leaderboard-changelog/

Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over 1...

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-es...

LMArena Blog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

August 11, 2025
New model announcement: Claude Opus 4.1 is on Text and WebDev leaderboards.

August 7, 2025

New model
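
For reference, the older bootstrap approach described above can be sketched like this for the simplest possible case: two models, no ties. In that case the Bradley-Terry estimate of the rating gap has a closed form, log(wins / losses). The 400 * log10 scaling is an assumption made to match Elo-style numbers, and the vote counts are made up.

```python
import math
import random


def rating_gap(wins, losses):
    """Two-model Bradley-Terry gap, on an assumed Elo-style 400*log10 scale."""
    return 400.0 * math.log10(wins / losses)


def bootstrap_ci(outcomes, rounds=500, seed=0):
    """Percentile bootstrap: resample votes, refit the gap, take the 2.5/97.5 cut."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(rounds):
        sample = [rng.choice(outcomes) for _ in outcomes]
        wins = sum(sample)
        losses = len(sample) - wins
        if wins and losses:  # skip degenerate resamples
            gaps.append(rating_gap(wins, losses))
    gaps.sort()
    lower = gaps[int(0.025 * len(gaps))]
    upper = gaps[int(0.975 * len(gaps)) - 1]
    return lower, upper


# Hypothetical votes: 600 wins and 400 losses for model A against model B.
outcomes = [1] * 600 + [0] * 400
point = rating_gap(600, 400)
lower, upper = bootstrap_ci(outcomes)
print(round(point, 1), round(lower, 1), round(upper, 1))
```

Because the interval comes from percentiles of the resampled estimates, it does not have to be symmetric around the point estimate, unlike a closed-form interval built as estimate ± z * standard error.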

keen finch
#

Interesting. Thank you.

cunning steppe
#

Ah got it. I know I’ve badgered you a lot recently, but I want to thank you for being such a responsive and helpful mod. Probably one of the best in all my discords!

drowsy needle
#

I do appreciate that a lot though. And don't feel like you're badgering, because you're not! If the community has question we want to know.

jaunty cave
#

@drowsy needle does a great job cultivating the community here, I love to just hop in every once in a while and talk about model rankings 😄

twin valve
void yew
#

really curious to see where nano-banana currently falls in the leaderboard

faint sigil
#

hello! Just wondering which is the best leaderboard or filter to refer to if I want to compare the best model for tool calling and agentic features?

vestal fulcrum
#

HH

fluid field
#

Hello everyone!!! I'm Greatness, an Engineering student, and also an AI enthusiast

jaunty cave
keen finch
#

@jaunty cave @drowsy needle Do we know if the distribution around the score follows a normal distribution? I'm guessing it doesn't and you're not allowed to tell me, right?

drowsy needle
keen finch
#

so confidence interval can be used to calculate the variance, but this would only hold (I mean you can still do the calculation), if the said distribution is a normal distribution.
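
The calculation being referred to, as a quick sketch. It assumes the published ± value is the half-width of a symmetric 95% interval and that the estimate is approximately normal, which is exactly the assumption in question. The ±11 is just an example value.

```python
def ci_to_standard_error(half_width, z=1.96):
    """Standard error implied by a symmetric 95% CI, assuming normality."""
    return half_width / z


# Example: a score reported as +/- 11.
se = ci_to_standard_error(11.0)
variance = se ** 2
print(round(se, 2), round(variance, 1))  # 5.61 31.5
```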

jaunty cave
#

you can take the scores from the leaderboard and plot them on a graph and see if it looks like they are from a normal distribution 🙂

keen finch
jaunty cave
#

hmm, the observations we get for a model are only win loss and tie, the "underlying strength" of a model is something we cannot actually know, only estimate, and to estimate we make modeling assumptions.

for example in Bradley-Terry, it models the probability of observing a winning outcome based on the score differences coming from a logistic distribution (since the sigmoid is the CDF of a logistic distribution)
https://en.wikipedia.org/wiki/Bradley–Terry_model

The Bradley–Terry model is a probability model for the outcome of pairwise comparisons between items, teams, or objects. Given a pair of items i and j drawn from some population, it estimates the probability that the pairwise comparison i > j turns out true, as

$$\Pr(i > j) = \frac{p_i}{p_i + p_j}$$

where $p_i$ is a positive real-valued score assigned to individual i. The comparison ...
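
In leaderboard terms, that formula turns a score difference into a predicted win rate. A minimal sketch, assuming scores sit on an Elo-style scale where p_i = 10^(score_i / 400); that scale is an assumption here, and the scores are invented.

```python
def win_probability(score_i, score_j, scale=400.0):
    """P(i beats j) under Bradley-Terry with p = 10 ** (score / scale).

    Equivalent to p_i / (p_i + p_j), written as a sigmoid of the score gap.
    """
    return 1.0 / (1.0 + 10 ** ((score_j - score_i) / scale))


# Hypothetical scores 25 points apart.
p = win_probability(1462, 1437)
print(round(p, 3))  # 0.536
```

So a 25 point gap corresponds to roughly a 54/46 split in head-to-head votes under these assumptions, which is why small score differences are hard to see in practice.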

#

There's an important nuance to understand about the CIs, they are not saying, "This model is X amount strong, and that can vary +/- Y for any given sample"

It's more like: "We are 95% confident that the real strength is somewhere between X-Y and X+Y"
They are a measure of our uncertainty about the estimate, not about how variable the model is itself necessarily

keen finch
#

Yes, "We are 95% confident that the real strength is somewhere between X-Y and X+Y" holds as the correct definition of a confidence interval, but the fact that we are assuming the same number for both the positive and negative side means we're assuming symmetrical variation - in other words whatever distribution we have is not skewed.

#

I guess what I intended to ask is if the distribution is symmetric.

jaunty cave
#

The CIs with the current method are symmetric. When we used bootstrapping before they were not necessarily.

keen finch
#

man I have a feeling next leaderboard update will be wild

twin valve
cunning steppe
jaunty cave
jaunty cave
wary scroll
#

are there any plans to rename "gpt-5" to like "gpt-5-high" or something to help indicate to people that it's not the non-reasoning model you select as "GPT-5" on chatgpt.com ?

cunning steppe
wary scroll
#

excellent

cunning steppe
# jaunty cave we don't center to 0 or any value, but the way we anchor is arbitrary, only the ...

I know pineapple touched on this already, but do recent votes count more than old ones?

I feel like they should.

There’s been a lot of anecdotal reports in this discord of Gemini 2.5 pro getting “nerfed.” However, because Gemini has 30,000 votes, that might not show up in new rankings too easily. What you think about applying some kind of decay function to make older votes count less?
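
To make the decay idea concrete, here is one possible shape for it: exponential decay with a half-life, applied to a simple win rate. This is purely hypothetical (currently all votes count the same), and the half-life and votes are made-up values.

```python
def decayed_win_rate(votes, half_life_days=30.0):
    """Win rate where a vote's weight halves every half_life_days.

    votes: list of (age_days, won) pairs, won being 1 or 0.
    """
    weighted_wins = 0.0
    total_weight = 0.0
    for age_days, won in votes:
        weight = 0.5 ** (age_days / half_life_days)
        weighted_wins += weight * won
        total_weight += weight
    return weighted_wins / total_weight


# Hypothetical model: 10 wins from 90 days ago, 10 losses from today.
votes = [(90, 1)] * 10 + [(0, 0)] * 10

plain = sum(won for _, won in votes) / len(votes)
decayed = decayed_win_rate(votes)
print(plain, round(decayed, 3))  # 0.5 0.111
```

The unweighted rate hides the recent slide completely, while the decayed one reacts to it. The trade-off is that down-weighting old votes also shrinks the effective sample size, so the CIs would get wider.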

jaunty cave
cunning steppe
# jaunty cave currently all votes count the same, that's an interesting idea though. 🙂

Would certainly help keep rankings fresh as can be! As for the Gemini reports, I’d say they’re partially validated by Gemini’s shrinking margin, which has seen a continual slide the past month or so. Even between the last two leaderboard updates (which were just a week apart), we saw declines of one point in default, and two points in style control. That seems to be a lot for a model that has ~30K votes. I’m pretty confident that Gemini margins have changed much more than, say, o3, 4o and grok 3 (which all have similar vote counts). Might be worth looking into!

median briar
#

At the start we did not have anything like the evaluation order

#

-> the change

twin valve
cunning steppe
cunning steppe
# jaunty cave currently all votes count the same, that's an interesting idea though. 🙂

@jaunty cave some more thinking on why older models need a decay function:

older models are constantly going up against newer and tougher competition. Which means older battles against weaker opponents should be less relevant for today’s rankings. It’s kind of like a high school football team that starts off against other high school teams, but eventually starts matching up against college and professional teams as time goes on. A win against other high school teams probably shouldn’t count as much as a win against new teams when examining current rankings. But since older models have already amassed so many votes, they have an unfair edge since one win is one win, even if it happened against a much weaker pool.

On the flip side, this also penalizes new models. New models that enter the arena have to face stronger competition compared to older ones. Those older ones may have been able to rack up a lot of wins since they’ve been around forever. That helps cement a pretty stable score that is slow to adjust to current competition. Whereas a newer stronger model, which gets matched up against other stronger models, is already gonna be at a disadvantage right off the bat.

Yes, I get that all this eventually balances out over time. But that takes a lonnnnnng time to accomplish and any ranking snapshot is probably not going to be very accurate.

jaunty cave
twin valve
#

oh yeah I spent way too much time on Elo stuff (but only pure Elo, no BT. Glicko and Glicko2 only because lichess and chess.com uses them. Mostly they are like the Elo but with dynamic K factor)

About the rating decay (used by chessmetrics for example), be careful because it can mess up the system. It is much better to say - especially as we are hopefully dealing with fixed systems (not like teams that change) - "care about rating gaps, don't expect things to be anchored to a certain value". Although you already mentioned that.

keen finch
#

Wow gpt5-high crashed

blissful latch
#

hello

keen finch
jaunty cave
keen finch
#

🤣 I'm not complaining

wary scroll
#

Oh wow, gpt-5-chat actually ranked under 4o

#

I knew it was similar to 4o, I wasn’t expecting it to be measurably worse, that’s impressive

void yew
#

what. gpt-5-high is lower than gpt-5-chat on creative writing?? I thought the opposite…

wary scroll
#

That makes sense a bit though, gpt-5 thinking is good at reasoning/problem solving, not creativity

cosmic harness
#

A 1.15% chance of dropping from 1462 +-11 to 1437... yeah I don't think so...

#

I don't trust those error bars anymore..

wary scroll
#

Lawl, gpt-5-chat ranked under 4o for coding as well

#

I guess everyone’s complaints were justified

robust zodiac
#

thoughts?

wary scroll
robust zodiac
#

gpt 5 in text with no style control

void yew
#

GPT-5-chat never wowed me but I'm pretty hooked on GPT-5 thinking, which I believe is the same as GPT-5-high

wary scroll
#

Pro plan gets more thinking “juice” even on the non-pro model

#

Both are still less reasoning effort than high through the API though

void yew
#

Huh.

cunning steppe
#

I agree this is…strange. A few ideas:

  1. Some of these votes could have come in last Friday while OpenAI was having capacity issues (which affected all models). If that made it time out or give weak responses, then ppl are gonna vote against it.

  2. The people testing under stealth are different from the people testing after it went live

  3. The model they tested was slightly different than the model that went live

wary scroll
#

Most likely the model they were given early access to was slightly different from the final version that went live, so #3

#

That’d be my bet at least

cunning steppe
#

They alleged it was identical, but maybe some small adjustments were made at deployment 🤷‍♂️

wary scroll
#

They could’ve lowered the maximum reasoning effort or something, small tweak but enough to change the elo

cunning steppe
#

Nah the reasoning (aka “juice”) has been at 200 for a while. Seen many reports on X validating that

#

Fishy

wary scroll
#

Was it at 200 before the livestream though?

#

But was being ranked in the arena

cunning steppe
#

Hmm what do you mean? The debut ranking on LMArena was based off of testing only

#

Back from late July

#

I think it was tested for a day or two

wary scroll
#

Yeah on the 4th or something iirc

cunning steppe
#

The latest update takes those initial votes in testing, PLUS all the votes since public launch (the livestream)

#

so to tank that much…means that scores of the past week must have been REALLY low

#

Which is odd, bc GPT-5-high’s win rate against Gemini 2.5 pro did improve during that time

#

You would think if a model did weaker overall, that it would also do worse against the best model (Gemini 2.5 pro). But seems not…

wary scroll
#

I guess we’ll know for sure on the next leaderboard update if it drops further

cunning steppe
#

Yeah we’ll see. But gpt-5-high has 6K votes now. So it’s gonna be harder to move in either direction, especially as some of the hype has died down. Idk how many votes it will be able to get in the next week

wary scroll
#

We need gpt-5-medium added too, since that’s closer to the “GPT-5 Thinking” that everyone is using

sinful acorn
#

Leaderboard update request: You have enough data to know what a typical input and output length are. Can you, in addition to showing the rankings of the bots, also show the price per typical query? (As in: You know the length of the input query. You know how much output each bot typically makes. You know what the tokenizing patterns are, or can at least get a really good approximation of token count. You know their published pricing. You can list the cost at the same time as their quality.)
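
The arithmetic behind that request is simple enough to sketch. The token counts and prices below are placeholders, not any provider's real pricing.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD of one query, given prices per million tokens."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


# Hypothetical typical query: 200 tokens in, 800 tokens out,
# at $1.25 per million input tokens and $10 per million output tokens.
cost = cost_per_query(200, 800, 1.25, 10.0)
print(f"${cost:.5f}")  # $0.00825
```

For reasoning models the output count would need to include thinking tokens, which can dominate the bill.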

wicked sapphire
vale lodge
wary scroll
#

Very similarly priced

idle ocean
#

that's double for the input but yes

vale lodge
#

10x

#

@wary scroll

#

cached input is 10x more expensive and output is 2x

#

so like

idle ocean
#

oh yah, missed that

silver condor
#

GPT-5 is clearly superior though right?

rich echo
#

I’ve already seen Wan2.2 standing on two stages in Leaderboards.

indigo grail
#

which model is ranked first overall for this month?

remote sinew
#

Dear developers, I've just found that the AIs aren't really what their names say, e.g. claude opus 4.1 thinking is originally CLAUDE SONNET 3.5. What the hell is this, guys? If you don't believe me, you can ask it: Which model are you? And then we can clarify that they are scamming us!

robust zodiac
#

So no, nobody rigged your polymarket openAI bet 🤣

hallow comet
tawdry socket
#

Why did gpt-5 high have such a huge drop on Aug 14 update? Is it because OpenAI API issues?

#

drop in elo

twin valve
#

AFAIK: reworked ratings (see comments above though an article would be easier to find) and adding votes could mean that values get lower.

gleaming sandal
#

Just found out about this discord channel. I'm excited to be a part of this community and learn and share. I just started an AI consulting company and i will be focusing on audio and video AI projects.

drowsy needle
dusky stone
#

/

tender sigil
#

gpt-5-high has some bizarre win-loss records vs certain models

#

39% win rate against Claude Opus 4.1 and a 42% win rate against Qwen 3 (July instruct)

#

I would posit that Opus 4.1 was a factor in its decline since it began testing only after GPT-5’s first placement on the leaderboard, but there’s only been 51 recorded battles between them as of the latest update…

#

0.8% of GPT 5’s total votes, lol

nocturne fable
#

Hello, I want to make images, how does this work?

lapis hearth
#

hello, I am Bruno

rancid geode
#

hi

warm saffron
#

Hi everyone! I’m new here, excited to discover LMArena and to experiment with video and image generations. Looking forward to learning from you all!

drifting mulch
#

hi

light latch
#

Hi

lapis ether
#

Hi everyone , greetings from Paris France

shy bramble
#

Hello, I feel grateful to join the community

drowsy needle
#

welcome everyone!!

mossy seal
#

Hi

dim raven
#

hello

tender sigil
#

is this the introductions channel or the leaderboard channel

drowsy needle
tender sigil
#

@drowsy needle are there any situations where prior user votes are removed from the leaderboard calculations? I can imagine doing so when a user has been found to be doing vote manipulation in service of a particular model, but are there other instances that lead to a user’s vote history being removed from score calculation?

drowsy needle
robust zodiac
# drowsy needle So before updating leaderboards we do go through the data to validate for accura...

out of curiosity, I tried some weeks ago with a morally gray question to prompt in battle mode to see how AIs would react. Most had an expected reaction of refusing to answer or saying this is morally questionable etc.
But there's one that went like: "Hey, as an AI made by xAI I dont condone this, HOWEVER: <proceeds to go in detail and answer it> " it was obv grok 4.
Would this kind of battle be removed? Technically I did not ask for the AI's name (tho I must admit, I had a hunch grok would give an unhinged answer and I was fishing for it). I think such battles should still be removed, even if the user did not prompt for it and the AI itself gave its name regardless

hallow comet
#

yeah but then some experienced users can guess the model's company just from how it responds, e.g. emojis, vibes, etc.

I always try and guess the model b4 voting and 50% times I guess correctly...

twin valve
silver condor
#

Do we know when the next leaderboard update will be?

gusty yew
#

This new AI has amazing results... Im in Shock!!

errant light
#

Hi team, I don't see models like Seed-1.5-VL in the Vision Arena section. Is it because they haven't been added to the Arena yet?

tawdry socket
tender sigil
#

logic checks out to the same extent I guess, you’re still getting some sort of information on which model you’re speaking with, even if it’s negative

tribal bone
#

hey guys!

lost fog
#

Hi Guys. Who is at top of the Leaderboard???

hallow comet
dull oyster
#

are there graphs of scores vs time for the leaderboards?

hallow river
#

Hey this is unbelievable! Thank you!

tender sigil
#

Gemini 2.5 Pro back in 1st on both Style Control on and off leaderboards

wicked sapphire
#

Gpt-5 high is now tied with 4o style control off

cosmic harness
#

Either the error bars are incorrect or the model changed
Without style control, gpt-5-high had an initial score of 1462 ± 11 and now 1429 ± 7
In any case the lmarena team should investigate and make a public statement

jaunty cave
winged hollow
#

/0–3s (desk scene): “This was Xplainer — keep questioning the stories you’ve been told.”

3–6s (glitch dissolve): “The mainstream story is the blue pill… we’re here for the truth.”

6–10s (pill split): “Ready to see beyond the veil? Take the red pill and join us.”

10–12s (logo + podcast plug): “…on our podcast, Deep Dive.”

12–14s (CTA icons): “And don’t forget to like, subscribe, and drop your thoughts in the comments.”

Visual direction:

Opening (0–3s): Documentary-style shot: person at cluttered desk under harsh overhead light, dark cracked concrete wall behind them. Papers, lamp, coffee cup — gritty realism. Camera slowly pushes in. Static ripple overlays frame.

Transition (3–6s): As VO reaches “blue pill / red pill” line, figure glitches and dissolves into static, leaving cracked wall.

Pill sequence (6–10s): Neon capsule pill appears mid-screen, splitting into blue (left) and red (right). Blue side flickers, distorts, dissolves into static. Red side pulses brighter, shatters into fragments that reform as bold neon “XPLAINER” logo.

Podcast plug (10–12s): Secondary neon text fades in under logo: “Deep Dive Podcast”, glowing red with glitch flicker.

CTA (12–14s): Neon line icons (thumbs-up, bell, comment bubble) flash one by one, synced to audio blips, then glitch out.

Ending (14–16s): Logo surges brighter, cracked glass overlay intensifies, screen tears with distortion burst, fade to black.

#

Create a YouTube Thumbnail.Say hello, and what brings you to Arena, besides making an intro?

tender sigil
#

accusing the system of having nefarious actions or intent for no deeper reason other than “number moved more than I thought it would” isn’t really the logically sound argument you think it is

robust zodiac
jaunty cave
# robust zodiac I would argue that some error may be there given the error margins. Or they form...

the Bradley-Terry model we use for both ratings and confidence intervals, like all statistical models, is based on assumptions, which do not always match reality.

Assumptions like: the strength of the competitors is not changing over time, the voter distribution as a whole is not changing over time, the input distribution is not changing. If modeling assumptions are violated, the results are not guaranteed to hold. We are always working to improve the models to better reflect reality, and when we see things like this it's useful data for adjustments

robust zodiac
jaunty cave
#

the confidence intervals account for the randomness in the data-generating process of the model itself, not the degree to which the model is misspecified from reality.
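
For anyone who wants to see the basic idea, here is a minimal sketch of a Bradley-Terry fit in Python. It is not the actual leaderboard pipeline: it uses the plain iterative (minorization-maximization) update, ignores ties, style control and confidence intervals, and the win counts are invented. `fit_bradley_terry` and `to_elo_scale` are hypothetical names, and the 1000-point base is arbitrary.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """wins[i][j] = number of battles model i won against model j.

    Returns one positive strength per model, normalised to average 1.
    Under the model, P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


def to_elo_scale(p, base=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10).

    Assumes every model has at least one win, so no strength is zero.
    """
    return [base + 400.0 * math.log10(x) for x in p]


# Invented counts: 100 battles per pair among three models.
wins = [
    [0, 60, 70],
    [40, 0, 55],
    [30, 45, 0],
]
strengths = fit_bradley_terry(wins)
print([round(r) for r in to_elo_scale(strengths)])
```

The fit treats each model's strength as a single fixed number, which is exactly the assumption that breaks if a model or the voter population changes partway through data collection.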

zenith kindle
#

oh fuq

#

u bashtards

#

why

jaunty cave
#

Like if someone was rolling a die, they could start to estimate the variance and standard deviation of the samples they get. But then if the die breaks, or someone swaps the die, or they start throwing it in a biased way, it could be totally different, since the data is no longer being generated according to the model we used to calculate the standard deviation

robust zodiac
#

Yeah, the dice has an extra facet at the moment

tender sigil
#

it’s not even knowing you’re talking to a specific model that can cause you to lean on the side of voting differently, but knowing that a certain model might be responding at all

#

although the p-value for gpt-5-high’s drop on the Style Control Removed leaderboard from 1462 +/- 11 to 1429 +/- 7 is less than 1 in 1 million (around .00000078) which is pretty significant

#

the drop on the normal leaderboard with Style control is p = .00032 (around 1 in 3,000)
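
A sketch of how a figure like the Style Control Removed one can be reproduced, in Python. Assumptions: the ± values are 95% intervals (so standard error = half-width / 1.96), the scores are roughly normal, and the two estimates are treated as independent. That last one is not really true, since the later score includes the earlier votes, so this is only a ballpark. `drop_p_value` is a hypothetical helper name.

```python
import math


def drop_p_value(old_score, old_ci, new_score, new_ci, z_ci=1.96):
    """Two-sided p-value for the difference between two scores.

    old_ci and new_ci are the half-widths of the reported intervals.
    """
    se_old = old_ci / z_ci
    se_new = new_ci / z_ci
    z = (old_score - new_score) / math.hypot(se_old, se_new)
    # For a standard normal, P(|Z| > z) = erfc(z / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = drop_p_value(1462, 11, 1429, 7)
print(f"z = {z:.2f}, two-sided p = {p:.2e}")
```

With these inputs the z-score comes out just under 5 and the p-value below one in a million, in line with the number quoted above.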

twin valve
civic scroll
#

Ola

twin valve
robust zodiac
#

I would find it interesting to see how/why the actual rating had such a drastic drop (or to use what Clayton said, what exactly in the model was/is misspecified from reality)

twin valve
#

the confidence is good as long as (a) all other models stay the same and (b) all the voters stay the same.

Otherwise, unless a model has a lot of votes, I would expect the model to change its rating relatively quickly. For example going from 3k votes to 6k (double the initial ones)

#

yes for tracking such stuff I wish we had the results of the votes (not the prompt and answers) so that one can do such analysis in an independent way

#

in that way one can simply verify the changes and even track those

#

In that thread I think the votes up to early 2025 are available. There is no "up to date" collection that I know of though.

robust zodiac
#

yeah, that would be interesting to see; I do feel it may somehow be exploited however (or I am just too sleepy to think it clearly and being paranoid about it right now 😂 )

#

the drastic increase in votes (roughly 170%, from 3k to 8k+) would plausibly cause variation

#

another 6k votes should not be able to produce such a big shift going forward

twin valve
#

well to be fair some models can be recognized. I often say that I can spot claude versions (not a specific one) 90% of the time. In theory I could pump claude's rating (multiply this by X people, and you have it).

The point being, some LLMs have a certain style and if this style does not fit certain people (assuming good faith, that is, assuming they don't want to buff or nerf anything), then those models will lose/win rating over time

#

for example I dislike the emoji style of some gpt models, with me they always lose

#

human preferences and all that

robust zodiac
#

yeah, grok is also very easy to tell apart

#

gpt with the emojis

#

gemini may actually be hardest to detect this way

twin valve
#

yes

#

claude is always the one that replies to you but in a terse way (at least for my prompts) and with an hint of "frick off you"

robust zodiac
#

maybe he just dislikes you KEKW

twin valve
#

could well be

hollow zinc
#

Hello!

robust zodiac
#

some guy there @claude team hard coding that would be hilarious 😂

twin valve
#

😄

#

and btw I check the leaderboard without style control (see the point about emojis, style makes the difference for chatbots imo)

robust zodiac
#

it's also something that this type of ranking cant catch: the AI will adjust certain personality traits for a personal account

#

in these battles we get the default personality so to speak, which can never ever resonate with everybody. Some will hate the emojis, some will hate the cringy grok jokes. But they would morph a bit in longer lived interactions

twin valve
#

that is also a point, so at most the default personality is tested

#

still fine for me. It is like testing an helpful "neutral" chatbot rather than a personalized one

#

btw I didn't realize anthropic has no 4-haiku. Either sonnet is super fast, or they don't see ROI with it

robust zodiac
#

yeah, I mean given personality traits can be customised its a non issue. So best way to rank them would be to ignore the style entirely and get which objectively answered best

hot hearth
fresh cave
#

Where is nano-banana at on the leaderboards? I don't see it listed

drowsy needle
jaunty cave
#

what's up leaderboard people

dull oyster
#

@here leaderboards don't seem trustworthy if you can just ask what model they are and then vote the one you want to hit #1 ?

brittle pine
#

Most people would do it this way

vast nova
#

does anyone know why flux 1 kontext max is off the list?

drowsy needle
keen zealot
#

Hlo

tender sigil
#

Bro’s prompting for his self-insert 😭😭

iron owl
#

HI PEOPLE!

dire prawn
#

hi

fluid pilot
#

hello guys

wide cape
#

hiiiii

bold rover
#

Hi Everyone! I just find you LM Arena!!

drowsy needle
white orbit
#

Hi

ornate sinew
#

hi

sand vale
#

hi

sharp hollow
#

hello

sly cosmos
#

hi

tender sigil
#

I find it intriguing how this channel is persistently used by newcomers as a greeting channel when there’s 0 clear indicator as to why they would greet everyone here specifically

#

new Mistral Medium debuting at #2 on style control removed is crazyyyyy though

#

first ever model with a winning record vs. 2.5 Pro!

wide ether
#

Hey

opal temple
#

cat

modest pewter
#

also

#

2.5 pro reovertook gpt-5?

sterile grail
#

Hello

pure quartz
#

hello

vocal spear
#

hi

obsidian flint
# modest pewter 2.5 pro reovertook gpt-5?

funny. Opus 4.1, Grok 4, and ChatGPT 5 can't dethrone Google's 3-month-old GA model. Maybe it's time for me to bet on polymarket, Google will release Gemini 3.0 in a few months, opening up another gap 😂

wide osprey
#

hello

harsh cliff
#

hello

eternal root
#

hello

sharp goblet
#

#generate a video in which pm imran khan is sitting upset in jail

drowsy needle
elfin forum
#

Hello

dense current
#

hello

dire yew
#

hello

pure charm
#

how to use veo 3 here?

plucky moon
plucky moon
zinc panther
#

Use the nano-banana model to create a 1/7 scale commercialized figure of the character in the illustration, in a realistic style and environment. Place the figure on a computer desk, using a circular transparent acrylic base
without any text. On the computer screen, display the ZBrush modeling process of the figure. Next to the computer screen, place a BANDAI-style toy

tawdry socket
#

Does anyone know what mistral medium's name was before it was revealed on the leaderboard?

agile cloud
#

Hello, why I kept receiving "Connecting to Arena has failed. Please try again later or on a different device." ?

drowsy needle
shell cape
broken rune
#

hi

#

for all people

reef torrent
#

hi

manic frost
#

Can you say Hi in #general and not here? Thanks

indigo onyx
#

HI , JUST CURIOUS ABOUT ai

drowsy needle
tender sigil
velvet acorn
#

hello

red belfry
#

Hello everyone

past python
#

hi

arctic cypress
#

hello

swift fractal
#

Hello

viral current
#

Hello wave_animated

modern compass
fair sleet
#

👋

hidden sentinel
#

hello edit

thorn osprey
#

oi @drowsy needle when DeepSeek-V3.1 will appear in leaderboard?

upper leaf
#

Hello everyone 🤟

drowsy needle
winged juniper
#

Hi Guys ✌️

forest onyx
#

helo

languid hare
#

hi i listen from ai news and found this cool thing and i come here

drowsy needle
ember umbra
#

hi

#

Hello! Since today, I have been unable to create any images. Is there a problem with the platform?

mellow zealot
#

Pls I am looking for Nano banana ai model

drowsy needle
drowsy needle
mellow zealot
#

I want a want to improve image and Martian the model character for my online content and humanized

pseudo adder
#

First of all, wanna thank the devs. What an ingenious idea that's not only a blast for everybody to play with, but by its nature accelerates the living heck out of AI. Love it. I hope this battle goes on until the AIs themselves are in here arguing for votes!

fervent shadow
#

hello

near spoke
#

hi

rough pilot
#

hi

bold robin
fathom ice
#

Hello

severe hornet
#

Hello, just came from semrush newsletter

abstract bolt
#

hola

trim mauve
#

hello

lucid pawn
#

hi

lapis tendon
#

Hello

burnt root
#

Hello

hollow imp
#

Hi, I want to make commercial videos

fathom osprey
#

Hi, i'm an artist and want to test AI

drowsy needle
spring orbit
#

Hello!

green fossil
#

Hi there! battle3d

vernal wagon
#

Hello to all, want to learn more about IA

twin valve
#

@drowsy needle can we have a new channel to discuss technicalities about the leaderboard since this one has become a landing channel?

drowsy needle
#

I agree that we'd like this channel for in-depth discussion related to leaderboards and all of the "hey and hi" should be in a place in #general

#

Pretty sure the Server Guide was the culprit, as "Check out #leaderboards" was placed before "Say hey in #general". This has now been swapped so we'll see if the issue persists.

chilly laurel
#

Hello, Hola, 你好, नमस्ते, مرحبا

#

Yup the server Guide got me

tender sigil
#

we finally got down to the bottom of the mystery 🙏

#

where do y’all think DeepSeek 3.1 will be landing when the leaderboards are updated next ?

formal orbit
#

Hello

tender sigil
drowsy needle
twin valve
tender sigil
#

style control can be a bit tricky to account for at times, I do also learn more from the style control removed leaderboards as they’re a reflection of pure user preference

#

I like 3.1 more than the latest version of r1, and if I’m correct it has higher compute power as well, so it’s pretty easy to see it debuting into the top 5

cerulean latch
#

hi

whole pivot
#

Hello

lost summit
#

/image-to-video

glossy viper
#

Hi !

zealous cave
#

hi

drifting hornet
cerulean rapids
#

Hello ;]

lyric lion
#

hi

drowsy needle
#

pikaconfused why are people still saying hello here

pallid thistle
#

hi

wide ether
drowsy needle
tender sigil
#

is there a channel new members are initially placed in when they like first join the server? sometimes on startup it opens a specific channel, which if it’s #leaderboards that might make sense ?

drowsy needle
wide ether
wide ether
robust zodiac
#

Adding a message cd maybe

#

To discourage off topic spamming/chatting

marble bluff
jaunty cave
#

Did y'all see the nano-banana leaderboard launch? There was a 170 point gap between first and second. I don't think I've ever seen a gap that large in any leaderboard, even including things like sports and chess. It reflects a level of dominance basically unheard of

frosty wadi
#

Congratulations

jaunty cave
slate compass
#

pretty surprised that 3.1 didn't appear on the last lb update - is it still being tested on here?

twin valve
jaunty cave
#

MAI-1 on da leaderboard today pretty good for a first shot imo

slate compass
jaunty cave
#

you mean giant leads? on text even without style control their lead is 33. To put it in an Elo perspective, a lead of 33 pts translates to a 54.7% win chance of 1st place vs second place.

170 points means a 72.68% chance
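
Those two percentages come from the standard Elo expected-score formula, which can be checked with a few lines of Python. `elo_win_probability` is just an illustrative name, and the 400-point scale is the usual Elo convention, assumed here to match the leaderboard's scale:

```python
def elo_win_probability(rating_gap):
    """Expected win rate of the higher-rated side for a given point gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


for gap in (33, 170):
    print(f"{gap} pts -> {100 * elo_win_probability(gap):.2f}% win chance")
```

Note the formula only depends on the gap, not on the absolute scores, so a 170-point lead means the same thing at any rating level.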

slate compass
#

not now

gentle zephyr
#

hello!

inner finch
#

Hi

jaunty cave
molten stone
#

hello

drowsy needle
tropic cloud
#

hello

short yarrow
#

Hey

tender sigil
#

interesting trend in Chinese (Alibaba & DeepSeek) “thinking” models having weaker performances than their “non-thinking” counterparts

vale crow
#

@everyone suggest me best ai in lmarena for generating essays

hollow crown
#

Can anyone send me the link to use gpt image 1 model in lmarena ai

#

?

#

🙏🏻

vale crow
#

@vale crow

#

/home/raunak/.zen/dqek0907.Default (release)/chrome/Nebula/content

#

sudo pacman -S lmarena

vale crow
hollow crown
#

Can anyone send me the link to use gpt image 1 model in lmarena ai please? 🙏🏻

vale crow
#

i said

#

sudo pacman -S lmarena

hollow crown
#

Send me the proper link

vale crow
#

it is available on arch linux

#

i use it there

hollow crown
#

In lmarena ai it's available just like nano Banana model

#

But I couldn't find it

vale crow
#

open yout terminal emulator on ur arch then paste command >sudo pacman -S lmarena --gpt-img-1

hollow crown
#

I'm on windows so I guess it'll not work

vale crow
#

wsl

#

use wsl @hollow crown

hollow crown
#

Wal?

#

Wsl?

vale crow
#

what r yr hardware specs

#

@hollow crown

hollow crown
#

powered by Ryzen 5 8 GB Ram and 512 GB nvme ssd no dedicated graphics card

vale crow
#

so dont do this , it need dGPU

#

wait

#

i m finding another way

hollow crown
#

That's why I'm asking lmarena because it's available in battle mode

vale crow
#

@hollow crown no way to run it directly on lmarena cloud i think u must use lmstudio

#

try lmstudio

#

it dont need dGPU

#

on top of it it's available on windows

#

@hollow crown

#

@hollow crown if u find it is , increase my knowledge

lunar kayak
#

Please help me how can I generate a video? Step by step thank you

twin valve
#

slowly every request will assume that people will reply like LLMs.

slim rampart
#

What’s trending this season with YouTube Shorts? Got any new ideas?

knotty juniper
#

hello

lilac current
#

Do these videos have sound?

drowsy needle
drifting hornet
ivory shoal
#

Hey why does lmarena not have a board for music generation

lunar kayak
#

Thank you, guys 🙏🏻