#leaderboards


silk holly
#

trying to generate a video...... can I proceed?

drowsy needle
simple carbon
#

Please include Pixverse 5.0. It is a very strong model. It is better than Hailuo in Artificial Analysis

quaint zenith
#

hello

past inlet
#

thank bro

marble bluff
tough sand
#

Nice and Amazing... I can learn More Ideas... Thanks!

cinder prawn
#

jo

prisma smelt
#

go go

white token
#

niece

west pecan
#

/video

queen owl
#

/image

drowsy needle
#

@west pecan @queen owl you both are looking for Video Arena, read more on how to use here: #1397655624103493813

amber comet
stoic nymph
#

this can't be true, is oss 120b on LMarena?

median briar
stoic nymph
#

?

median briar
stoic nymph
median briar
#

np :)

stoic nymph
#

WILL CHECK IT OUT!

bronze hollow
#

Hehe

thorny mauve
jaunty cave
humble seal
#

Hello

jaunty cave
river fjordBOT
#
<:warning:892823499205406760> Channel locked

Site outage, will turn back on when resolved.

drowsy needle
#

Site Outage - Hey everyone, there looks to be an outage with the site, our team is aware and working on a fix ASAP. We've turned off messaging in this server until the site is restored. Our apologies for the inconvenience!

river fjordBOT
#
<:success:865860339278413864> Channel unlocked

Welcome back :ablobwave:

drowsy needle
quartz shuttle
#

Hi community! I'm just starting to explore the project's data and have found that the .pkl files don't seem to have a consistent structure. I was wondering if anyone knows why this might be the case. Perhaps it's due to different versions or data sources? Any insights on the origin of these files would be greatly appreciated.

jaunty cave
#

Hi @quartz shuttle, I'm afraid that's just an artifact of sharing our data as we use it. We are always working to improve our data pipelines and that sometimes includes changing how we structure and store the data. Since we have been releasing data for almost 2 years now, this involves a number of changes. If you have specific questions about the formats I'm happy to help if I can.
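
For anyone comparing the files, a minimal sketch of how the differing structures could be surveyed. It assumes the `.pkl` files sit in one local folder and come from a source you trust, since unpickling can run arbitrary code. `describe` and `summarize_pickles` are hypothetical helper names, and nothing here reflects the actual layout of the released files.

```python
import pickle
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Return a short summary of an object's type and nested shape."""
    kind = type(obj).__name__
    if isinstance(obj, dict):
        if depth >= max_depth:
            return f"dict({len(obj)} keys)"
        # Only show the first few keys so large files stay readable.
        inner = ", ".join(
            f"{key!r}: {describe(value, depth + 1, max_depth)}"
            for key, value in list(obj.items())[:5]
        )
        return f"dict({len(obj)} keys: {inner})"
    if isinstance(obj, (list, tuple)):
        first = describe(obj[0], depth + 1, max_depth) if obj and depth < max_depth else "..."
        return f"{kind}(len={len(obj)}, first={first})"
    return kind


def summarize_pickles(folder):
    """Map each .pkl file name in `folder` to a structure summary."""
    summaries = {}
    for path in sorted(Path(folder).glob("*.pkl")):
        with path.open("rb") as handle:
            # pickle.load can execute code: only use on trusted files.
            summaries[path.name] = describe(pickle.load(handle))
    return summaries
```

Printing the result of `summarize_pickles("path/to/files")` side by side makes it easier to see which files share a layout. Files that contain pandas objects would also need pandas installed to load.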

naive prairie
#

Guys, anyone looking to collaborate in a way to integrate LM Arena to our AIvsAI platform for games?

We let devs build AI agents for classic fighting games, provide leaderboards benchmarking scores and achievements

We want to expand our LLM module to see how LLMs perform in gaming as a benchmark for LM arena also

twin valve
#
OpenLM.ai

This leaderboard is based on the following benchmarks.
Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 4M+ user votes to compute Elo ratings. AAII - Artificial Analysis Intelligence Index aggregating 8 challenging evaluations. ARC-AGI - Artificial General Intelligence benchmark v2 to measure fl...

clear elm
tender sigil
#

Leaderboard update anytime soon? it’s been a full week now since the last one

#

understandable delay with no new models released, but still

tender sigil
drowsy needle
twin valve
jaunty cave
# tender sigil Leaderboard update anytime soon? it’s been a full week now since the last one

we updated vision on Tuesday https://lmarena.ai/leaderboard/vision
It was accompanied by a slight change in methodology in order to keep data quality high
https://news.lmarena.ai/leaderboard-changelog/

LMArena Blog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

September 2, 2025
Due to the increase in image generation traffic brought by nano-banana, we noticed there were prompts in our

elder willow
#

Up

pulsar wave
#

/video

drowsy needle
west pecan
#

hello

#

it's giving a message that I do not have permission in this channel

grizzled turret
#

up

upbeat lichen
#

hallo

orchid herald
#

Why is the Leaderboard info dated the 28th of August? Isn't it being updated? Thank you!

severe cedar
#

up

upper carbon
#

hey fellas

tender sigil
soft fjord
#

/video

loud spear
#

hi all

brave gazelle
#

hi

drowsy needle
drowsy needle
#

Our Video Arena bot isn't working at the moment. Check out #1397655624103493813 for more information on how to use the bot (when it's working properly)

drowsy needle
astral leaf
#

hello

slate holly
#

hello

grizzled oak
#

hi

livid remnant
#

How to upload an image to edit with a prompt?

next onyx
#

Hi

humble magnet
#

Hello

tropic pivot
tropic pivot
tender sigil
#

no text leaderboard update in 9 days 😢

#

will probably mean a significant leaderboard adjustment when it does update tho! excited to see where new Qwen places 😳

orchid pagoda
#

hello

reef frigate
#

helo

leaden flint
#

hi

twin valve
#

the stream of "hello" continues! Hello to all!

#

btw the fact that image leaderboards have a density of votes (No.votes / No. Models) so high compared to the text arena, just shows that images are easier to digest for votes (and have higher demand)

robust zodiac
twin valve
#

sure. Small businesses trying to make flyers and whatnot, but still.

drowsy needle
drowsy needle
jaunty cave
jaunty cave
jaunty cave
west pecan
#

Hi, do we have an option to select the starting frame and ending frame? It would be awesome if you added this

lime cosmos
#

Where can I request that the censorship applied to texts be lowered a little? It no longer allows me to make a correction when I paste the text.

sullen yacht
#

AWESOME

drowsy needle
jaunty cave
spare raft
#

hi where do i find the leaderboards

neat pond
#

hello

drowsy needle
spare raft
#

Thanks

tender sigil
tender sigil
#

I think that’s moreso a testament to the difficulties of creating/lower priority towards image models among AI companies that aren’t currently leading the field, likely due to level of expense and infrastructure needed to do training runs for image generation models compared to text

#

more structural incentives leading to monopolization of AI image generation compared to text, but it is also true that the text arena has been running since the beginning of LMArena and Image Arena is a newer creation

jaunty cave
#

That's a good point about the time they've been running, but look at image-edit arena. 6 million votes on 10 models 🙂
It's been around even less time than text to image but has overtaken text arena in votes. Nano-Banana was quite an event

twin valve
short sparrow
#

For search, which model is really the best? o3 is in the lead but it's not showing results as expected

tender sigil
fresh marsh
short sparrow
visual dock
#

3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"

#

/3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"

fresh marsh
#

For medical stuff at least

fading mason
#

Awesome 👍

rare narwhal
#

Hello, can somebody point me to the best place to get the newest leaderboard data? It looks like the text arena scores have been updated 2 days ago but the latest file on Hugging Face is still August 29. The new web omits a lot of the underlying information unfortunately.

drowsy needle
rare narwhal
#

Many thanks, helpful! Are there any plans to make the data that was previously shown directly in the web interface available again? I think the category scores are now hidden and only the ranks are shown (e.g., instruction following).

drowsy needle
rare narwhal
drowsy needle
dusty cipher
#

👍

steep temple
#

💐

drowsy needle
#

@undone sable Be sure to check out #1397655624103493813 for more information on how to properly use Video Arena

short sparrow
fresh marsh
#

Claude opus CONSISTENTLY gives me false information that would literally kill patients if I listened to it

#

So I now don't trust it for anything at all

#

Context is I'm a resident and I use them for ideas on differentials and management for the most complex patients I get

burnt axle
#

@jaunty cave hi

jaunty cave
burnt axle
#

How to generate veo

drowsy needle
# burnt axle How to generate veo

Be sure to check out #1397655624103493813 for more information on how to use the bot. It's important to keep in mind that you're unable to select which specific model you want, as it's random which model provider you get for your prompt.

tender sigil
#

definitely will be more interesting to watch considering Gemini’s unbreakable lead in 1st for the last almost 6 months now

fresh marsh
#

they consider #1 gemini?

#

seems unclear to me

#

nvm, their other listings

ivory peak
#

Hello

#

/video

tender sigil
#

Gemini 2.5 Pro holds a strong lead over second place Qwen3 Max there

fresh marsh
#

why is the gemini 2.5 pro on lmarena so different from my 2.5 pro in my browser?

#

it gives such better answers

jaunty cave
#

That's super interesting Kami, what sort of ways is it different? Do you have any examples of the same prompt and what you like about one response more than the other?

urban sparrow
#

interesting

sterile linden
#

hi

twin valve
#

has ktibow left the server?

Furthermore <@&1349916362595635286> , if #ai-creations is for creations with AI, what about community ones (not necessarily AI based?)

#

I wanted to ask them if LMB is back, it seems back.

drowsy needle
strong knot
#

It's quite saddening. But, if you want a similar discord to the old lmsys, openrouter is an option. Ktibow is also there. Idk how active though compared to here.

candid gorge
#

So isn't it possible to create unlimited videos?

median briar
#

Really nice interface with a lot of new features (for the leaderboard)

brittle pine
#

Answer is simply: the web/app versions have extra safety filters + an unnecessary system prompt

#

in ai studio you can get raw output same as lmarena

#

Also in api, you have chance to turn off those filters and you can avoid that system prompt but web app will continue to be worse

#

The thing is gemini is installed by default on sooooo many android phones, like "billions" of android phones, so they feel like they can't afford any mistake, which is why they keep security higher even if it means worse outputs

fresh marsh
#

paying for pro doesnt give you better rate limits in studio right?

brittle pine
#

i dont understand too

#

However, ai studio's generous limits won't last forever, i think when gemini 3.0 releases, the limits are gonna be lower

#

Also you cant get veo 3 and deep research on ai studio, only app

drowsy needle
# twin valve LMB is a creation of KTIbow.

Thanks for the clarification! I wouldn't be able to answer that question for you. I'm assuming if you reach out to them (DM or Friend request) they'd be able to answer.

twin valve
#

yeah I'll try to see if I can reach him via github, if not, well it is not the end of the world. Thank you!

fresh marsh
drowsy needle
fresh marsh
#

makes it feel stupid to pay for both of them when the free lmarena gives me very superior answers

fresh marsh
jaunty cave
brittle pine
#

i dont think there is any system prompt in lmarena

#

Must be placebo

#

Or maybe lmarena's model version is older and maybe older version works better but im not sure about that

fresh marsh
fresh marsh
#

i dont even have to tell you which is which

#

same exact prompt copy pasted

#

@brittle pine

#

gemini 2.5 pro and gemini 2.5 pro on lmarena

jaunty cave
#

Very interesting. I'm no medical expert, is your feedback that the answer from the model on lmarena is more comprehensive, detailed, accurate? Some other qualities?

brittle pine
#

Can you compare ai studio and lmarena

#

(use 32k token and disable google search pls)

#

maybe in ai studio, google search is on and that could be causing worse outputs

#

turn off search

#

And use 32k thinking token, i think it will be same with lmarena

#

Because both are using Api. Both lmarena and ai studio using Api

#

Only web/mobile app is different

fresh marsh
fresh marsh
#

with search enabled it was giving me actual garbage. it gave me literal idiotic information that would kill the patient or at least some of their organs. 2.5 pro in studio with 32k tokens and search enabled i mean

compact oracle
#

/imagine

fresh marsh
#

very interesting

#

its literally magnitudes better without search enabled (2.5 pro studio search disabled). with search enabled and same exact settings it gives me output almost identical to the garbage i posted above

#

wtf?

#

how does that even make sense?

#

@brittle pine

fresh marsh
jaunty cave
fresh marsh
#

"We don't have search enabled except in search arena. " this explains a ton lol

jaunty cave
#

yeah there are a lot of interesting human biases, kind of cool that we can collect enough data to actually measure them.

It's really funny, we put search in its own arena because we thought it would be an unfair advantage, but actually a lot of the time the results without search are better. 🤣

fresh marsh
#

surprisingly few models exist for search lol

#

honestly didnt even realize i was comparing them without search

#

have to compare with search now

#

is there no gpt5-high or thinking etc for search?

jaunty cave
#

gpt-5-search is gpt-5 with web_search tool enabled https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses and with reasoning_effort = "high".

It would be cool to add more settings of reasoning level but search functionality isn't used all that much compared to default text to justify adding so many at this point. Hopefully it will grow. We gotta do more to promote it, I think a lot of people don't realize it's available or that you have to enable it

Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.

fresh marsh
jaunty cave
#

it's the same model, but in search arena that model is allowed to use a tool to search the web and then has access to the results when it writes its answer. In theory this should improve it but if the info it gets from the web isn't great it actually can give worse responses

fresh marsh
#

i see thanks

brittle pine
#

Sadly in gemini app you cant turn off search. Same with chatgpt, you cant turn off search in app.

#

And using search is usually bad because when you ask something and it uses search, it usually reads 25-30 sources and gives you an answer using only those 25-30 sources

#

But if it can't use search, then it uses its WHOLE DATABASE, ALL ITS TRAINING DATA, INFORMATION FROM DIFFERENT LANGUAGES

#

but if search is on, it quickly checks 30 websites and summarizes them for you

#

Then you can ask why they turn search on by default? well, most of their training data ends in 2024 or January 2025 so

#

they want to give you up to date information

#

they don't want their LLM saying "no, Trump is not president, Joe Biden is"

twin valve
brittle pine
#

i dont know, it affects a lot

#

Also im sure search is heavily influenced by your native language, which is not good

#

When i ask something, i want the whole world's opinion (China included), but search usually uses your native language and well

#

giving less detailed or wrong answers

stiff orbit
#

Dear LMArena Team,

I noticed the new models Qwen3-next-80b-a3b-instruct and Qwen3-next-80b-a3b-thinking from Alibaba Qwen have been released recently, and some reports mention they've been added to LMArena. However, I don't see them on the Text leaderboard yet. Could you explain why they aren't ranked? Is it due to insufficient votes, ongoing processing, or another reason?

Thanks for your help!

drowsy needle
sharp portal
#

Where can I find the archived versions of the leaderboard? (i.e. not the latest)

gusty grove
#

Hello everyone. I'm very glad to join this impactful Ai community.

sharp portal
scarlet harbor
ancient ledge
#

How about the rank of gpt5 with the new system prompt?

scarlet harbor
glass sun
scarlet harbor
#

Gets the point across

twin valve
# scarlet harbor

A 600+ elo gap between o3 and o3-pro? Unless you use funky elo K factor parameters, that is not even close to reality.

scarlet harbor
#

o3 Pro played very decent moves tbh, none of them were bad until endgame (which every LLM struggles with) but Pro was able to get a passed pawn, o3 had some inaccuracies in the middlegame and more in the endgame than Pro.

#

Makes me hype for how 5 Pro does.

devout canyon
scarlet harbor
tender sigil
#

text leaderboard update any time soon? been a week since the last one on Monday :p

drowsy needle
tender sigil
#

okie 😄 excited to see where new GPT-5 system prompt places!! wondering if Opus 4.1 Thinking will continue its upward trajectory to possibly overtake Gemini 2.5 Pro 😳

glass sun
#

Will the copilot arena EVER be brought back?

#

Yes or no

#

I’ve tried asking around so many times and I haven’t received an answer

jaunty cave
#

@glass sun we can't make any promises, sorry :/

glass sun
#

Mkay

vivid fable
#

ok

jaunty cave
#

text to video and image to video leaderboards updated! lots more votes!

shell cape
devout basalt
#

Seedream highrea must be 1st place what you guys think

jaunty cave
twin valve
# scarlet harbor Those are the results, I used a Chess.com bot, then took away around 300 elo poi...

I see but then you should specify that you tested them on chess. Further for chess tests there are already a couple of benchmarks that are interesting (but they don't tell the whole story)

https://dubesor.de/chess/chess-leaderboard
https://maxim-saplin.github.io/llm_chess/

and there are others.

Because otherwise one would assume the elo is relatable to the lmarena elo (it is not) and even if it is related to chess, it would be related to chess.com bots elo (that is not relatable to player elo). It is a rabbit hole the elo thing.

#

@twilit echo interesting, the LLM chess leaderboard that was based on a random player now stepped up the game a bit with Komodo chess levels (that again aren't totally comparable to the playerbase)

#

I am pretty sure that when people see negative elo they will say "wait that's wrong" (it isn't, elo works with differences)

twilit echo
#

once they reach like ~400 rating they'd lose 0.0001 elo per loss

twin valve
#

as far as I know it should work no problem.

1 / (1 + 10^((-3000 - (-3200)) / 400)) works no problem. The -3200 player has a win probability of about 0.24 vs a -3000 player.

#

identical if I do

1 / (1 + 10^((3000 - 2800) / 400))

a 3000 player vs a 2800

#

it works with differences, it doesn't matter the sign
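
A minimal sketch of that point, assuming the standard logistic Elo expected-score formula with a 400-point scale (the same one typed out above). `expected_score` is an illustrative helper name, not something from any particular leaderboard's code.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Same 200-point gap, once with negative ratings and once with positive ones.
low_pair = expected_score(-3200, -3000)
high_pair = expected_score(2800, 3000)

print(round(low_pair, 2), round(high_pair, 2))  # 0.24 0.24
```

Only the difference `rating_b - rating_a` enters the formula, so shifting every rating by the same constant leaves all expected scores unchanged.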

#

ah I see you mean rating gain/loss. That's true.

#

likely they let the random player be clobbered, and hence he got real low, and then the others were scaled down.

twilit echo
#

thats theory vs practice. in practical application, no player will ever reach negative elo, no matter the matchups or how terrible

twin valve
#

sure, if one starts high enough it likely won't happen

#

in cases like lichess and chess.com they have rating floors because there are some that want just to lose

#

and then you make an inverse ladder (that is, player A completely loses to player B, player C completely loses to player A, player D completely loses to C and so on)

twilit echo
#

I used random movers and worstfish (deliberately always pick the lowest rated moves), and they wouldnt be able to get negative in my standard elo environment

twin valve
#

you mean worstfish is not able to get negative from 441 ?

#

even adding -0.1 a lot of times?

twilit echo
#

nope, at least not in 100 matches

#

everything else is not practical reality

twin valve
#

I see. Yes if it adds -0.1 every time (or less), then it takes like forever
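
A rough sketch of why the drop is so slow, assuming the textbook Elo update `K * (score - expected)` with K = 32 and a fixed 2000-rated opponent. Both numbers are illustrative assumptions, not the settings used in either chess benchmark discussed here.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def rating_change(rating_a, rating_b, score, k=32):
    """Elo rating change for player A after one game (score: 1 win, 0.5 draw, 0 loss)."""
    return k * (score - expected_score(rating_a, rating_b))


def after_losses(start, opponent, games, k=32):
    """Rating of a player who loses `games` times in a row to a fixed-rated opponent."""
    rating = start
    for _ in range(games):
        rating += rating_change(rating, opponent, score=0.0, k=k)
    return rating


# A 441-rated player losing 100 straight games to a 2000-rated opponent.
single_loss = rating_change(441, 2000, score=0.0)
final_rating = after_losses(441, 2000, games=100)
print(single_loss, final_rating)
```

Each loss costs only a small fraction of a point because the loss was already almost fully expected, and the cost shrinks further as the rating falls, so 100 games barely move the number.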

#

from the llm leaderboard with negative values. checking the result I believe the random player is simply anchored very low. LLMs that mostly are equal to it (see the results vsR) have around -30 rating.

That could explain all the other ratings (then it also depends on the K factor they use and so on)

twilit echo
#

as much as i can appreciate his commitment, i just dont find the way the methods are applied logical or useful to me. good that different things exists though, but as a user its not useful to me and I always try to make whatever I enjoy consuming.
o1 is 140 elo just doesnt translate to anything for me. it doesn't relate to humanity.

twin valve
#

agreed. It is simply fun

#

interesting that they use stockfish to approximate maia engines. I guess another side project

twilit echo
#

I tried starting mapping Stockfish to Elo, (even did a youtube video on it), but I actually found out the Stockfish elo levels are vastly inaccurate. e.g. whatever chess.com labels 1000 elo or 1500 elo is completely not related to any reality scaling. thus I abandoned that method and switched to accuracy, which while still inaccurate in particular in huge back and forth swing (blunder followed by blunder), found it to be more reliable overall.

twin valve
#

yes, the community says the same (a lot of players want to know "if I beat bot XY, what is my real elo?" and the result is: they are all over the place)

#

Ok I think there is a bit more blogging here, relatively "hidden" from the main site: https://github.com/maxim-saplin/llm_chess/blob/main/docs/notes.md

" We first calibrate Random vs Dragon; then any model’s games vs Random/Dragon can be combined into a single 1D MLE estimate with a 95% CI."

so Random got clobbered. The 1st level of dragon is 125 (if I understood), hence the ton of negative things. Still he could explain that in the main page.

twilit echo
#

even the worst dragon engine is like 1200. might be missing something, but 125 is far worse than random.

#

against a natural pool of players of all skill levels

twin valve
#

they set the first level to 125

twilit echo
#

well, in my player set, a complete random got like 438 elo, and worstfish was high 200s

#

after ~100 matches

twin valve
#

ok, so if you would set the random to -30 (if I understand the data, that should be its level) then worstfish would be -260 at most.

Yeah something doesn't add up.

#

as you said, it is good that there is variation, but it is mostly fun

scarlet harbor
jaunty cave
jaunty cave
tender sigil
limber musk
#

hello

jaunty cave
#

hi

proud siren
#

hai guys

jaunty cave
#

lively night, what's up AI people

lethal stag
#

Hey what's up gang Vulpes here what's cooking

jaunty cave
#

just writing code with AI

#

and using it to judge how good AIs are at writing code

lethal stag
jaunty cave
#

I'm making LMArena 😎

lethal stag
#

oh, just testing?

jaunty cave
#

I work here, but yeah a lot of my job is testing AI

lethal stag
#

Ah, lol I'm slow

#

gotcha

jaunty cave
#

lol np, what you up to? What brings you to lmarena

lethal stag
#

boredom

#

but I'm working on a prompt, to turn Gemini into a good image-to-prompt generator

jaunty cave
#

what's your favorite AI to use when you're bored?

lethal stag
#

like system instructions

#

banana is bananas, so that

jaunty cave
#

haha that's cool, creative way to use AI to better use AI.

#

once you get the prompt, then you feed that into banana or seeddream or smth?

lethal stag
lethal stag
#

but Gemini seems to be random when it comes to making those prompts, sometimes they are on point, other times, I have to correct like 50% of it

jaunty cave
#

interesting, do you find that gemini is good at coming up with prompts, like for example is this better than just taking the original photo, and a photo of you and asking banana to combine it

#

I see

lethal stag
#

idk actually, I just feel like having a precise prompt gives so much more control and reusability

jaunty cave
#

makes sense!

lethal stag
#

and by using a prompt with banana, instead of asking for a combo, there's more chance you get through censorship I think

#

I think there's more guardrails when it comes to deepfaking and putting ppl together etc.

jaunty cave
#

oh yeah, I think especially in the gemini app I heard people complaining it would not allow them to do a lot of things

lethal stag
#

yeah, it's very sensitive to anything that could possibly upset the model

#

but it's barely any better in the Arena, but I have already gotten used to the vocabulary, that you can or cannot use for banana prompts

twin valve
twin valve
twin valve
#

(for the confused, the user has the username matching the name of a legionary in Caesar's Legion in Fallout: New Vegas)

lethal stag
scarlet harbor
jaunty cave
#

We should introduce a category for Latin language on lmarena leaderboard 😄

tender sigil
#

text leaderboard just updated!

tender sigil
#

Anthropic does not move closer to Gemini for #1 - but Qwen3 Max expands its lead in #2 on no style control !

jaunty cave
#

kinda crazy that OAI has 4 models within 1 point of each other overall, and like they are definitely each good at different things but it kind of cancels out

dreamy yoke
#

i will win

agile sorrel
#

helo

eternal vortex
#

It's good to see legacy platforms holding their own in 2025. Lets me know that AI is highly scalable and on the right track.

atomic mural
#

hi

tropic forge
#

nice to read

robust zodiac
#

hey! is there a way I can see the leaderboards before last update?

#

(eg to compare current vs old one)

sick glade
#

Hi

robust zodiac
#

thanks!

lofty sun
#

Hi, y'all!

Good day!

twin valve
lethal stag
# twin valve It always irked me that in the Legion people didn't speak much in Latin despite ...

IIRC, only maybe Caesar himself and perhaps some of his generals actually were even a little bit educated or cared at all about ancient Romans and their culture/language/history. All the plebs below them were conquered and forcefully integrated into the legion or raised and brainwashed from childhood, to follow chain of command and obey Caesar/other superiors, to be fanatic killers, conquerors and enslavers of others who are not part of the Legion. Any education, if at all, would be only in stuff like weapons training, tactics, maybe some Latin phrases relevant to those things etc. but that's as sophisticated as it gets for them, I don't think most of them even understand or know, that Caesar is making them all LARP as members of an ancient civilization from the past, they just do what they are told. And that's just the legionaries and soldiers, everyone else are slaves and women...

twin valve
#

sure but for how they conduct themselves as "high and mighty" they could have read a thing or two. But yeah likely it fits.

#

@drowsy needle I like the "movers" in the announcement. If you keep that one can track the changes over time (unless one goes through the saved elo leaderboards)

drowsy needle
merry patrol
#

how do I turn on the filter for Open LLM only on the leaderboard, guys?

tender sigil
#

It’s the License tab all the way on the right - I don’t think you can sort but basically any model that doesn’t say “Proprietary” is open source

merry patrol
#

it's kinda annoying that it's not sorted, but thx

pseudo rivet
#

If someone made a classifier on responses to predict which LLM it came from, how accurate could it be? And if someone is doing so, could they create bot accounts to target a model and raise it in the leaderboards because they bet money on it?

alpine bridge
#

I don’t understand

#

What’s the difference between Open rank and Rank (UB)?

#

Open sourced?

jaunty cave
scarlet harbor
zealous hinge
broken harbor
#

What is the overall general image to video best model but completely by your opinion not statistics guys? Really glad to find this community, I honestly find Kling most reliable especially for high motion. MidJourney I also find good for more creative stuff, art.

jaunty cave
jaunty cave
#

some new stuff on the leaderboards, grok-4-fast

rose cradle
#

isn't it weird that kimi 0905 is better than 0711 in every way yet it comes after it in the leaderboard, and deepseek v3.1-thinking is better than 3.1 and r1 but comes after them?

jaunty cave
#

@rose cradle good question, just raised this with the team, currently it's just sorted by overall and then ties broken alphabetically. We'll update that to sort by the ratings in the other categories

rose cradle
#

But shouldn't overall be different for them? For example, I think Kimi had a lower overall because math benchmark only came much later, but when it did, to my surprise, its position didn't move.

jaunty cave
#

the overall here is the overall rank, which accounts for overlapping confidence intervals, here the 2 kimis and 2 deepseek are all rank 10 overall, and it just sorts them alphabetically

rose cradle
#

But look at 3.1 thinking, it is better than 3.1 in a lot of things and it is 10 instead of 9

jaunty cave
#

the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.

in this case they have very close ratings and CIs: 1417 +/- 6 and 1414 +/- 7. But that slight difference means there is one more model above them whose confidence interval overlaps with 3.1's but not with 3.1 thinking's. So according to this method, there is one more model we are confident is better than thinking but not confident is better than 3.1

And that rank is the main sort key. I think cases like this definitely seem weird and are arguments for using a different sorting, but no matter how you order it something will look weird
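
A small sketch of that ranking rule as described above, assuming rank = 1 + the number of models whose lower bound sits strictly above this model's upper bound. The `model-x` row and its numbers are hypothetical, and which of the two quoted ratings belongs to which DeepSeek variant is inferred from the explanation rather than stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    rating: float
    ci: float  # half-width of the confidence interval

    @property
    def lower(self):
        return self.rating - self.ci

    @property
    def upper(self):
        return self.rating + self.ci


def ub_rank(entry, board):
    """1 + number of models whose interval lies entirely above this entry's."""
    clearly_better = sum(
        1 for other in board if other is not entry and other.lower > entry.upper
    )
    return 1 + clearly_better


board = [
    Entry("model-x", 1430, 8),        # hypothetical model: interval 1422..1438
    Entry("v3.1", 1417, 6),           # interval 1411..1423, overlaps model-x
    Entry("v3.1-thinking", 1414, 7),  # interval 1407..1421, does not overlap model-x
]

ranks = {entry.name: ub_rank(entry, board) for entry in board}
# Sort by rank first, then alphabetically to break ties, as described in the chat.
ordered = sorted(board, key=lambda entry: (ranks[entry.name], entry.name))
print(ranks, [entry.name for entry in ordered])
```

With these numbers a 3-point rating gap is enough to move one model a full rank down, because only one of the two upper bounds clears the hypothetical model's lower bound.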

rose cradle
#

Ah, ok, thanks. I always thought that was aggregate since you replaced the numbers from the previous leaderboard with this

jaunty cave
#

the one on the old site? I'm not actually sure how that one worked, I joined after it was deprecated. appreciate the feedback a lot

rose cradle
#

Yes, the old one just had a single overall number and companies would fight for that number. Now it is kind of weird. They fight in individual categories but the main one is more subjective.

#

I also think you should remove copilot benchmark, and add links to lm battle web arena because it is super hidden

jaunty cave
#

copilot 😥 , too many other things to work on rn to resurrect that. it was also kinda hard to use. webdev arena integration definitely coming tho!

rose cradle
#

Yeah, that's why I think you should remove it, it's been 120 days since last update

viscid zinc
#

I would like to thank all the designers of this website, everyone who contributed to its completion, and everyone who thought of it. All thanks and appreciation for your efforts and hard work. You are geniuses and smart makers. I love you, I love you with all my heart. You are in my heart more than anyone who worked hard on something like this. I love you. All love to the entire team. You are creative. You deserve to be at the top of designers and at the top of this world. Thank you, thank you. You are better than Elon Musk.

tropic pivot
fallen laurel
#

Hello to everyone, .....I am here to learn from the diverse skill and mindset of all present, and contribute my own 2 cents.

hollow crown
#

How to use seedream 4.0

#

I couldn't find it in direct chat

#

Can somebody help? 🙏🏻

compact roost
hollow crown
#

Damn

#

I'm using it in replicate

#

After adding my credit card

#

I thought I could use it for free

old comet
#

hi

simple dock
#

Nice

tepid topaz
#

hello

errant belfry
#

tonight is the night

simple portal
#

Hello

#

I am new ...happy to connect..

marble bluff
hoary compass
#

the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.

in this case they have very close ratings and CIs: 1417 +/- 6 and 1414 +/- 7. But what's happening is that the slight difference means there is one more model above them whose confidence interval overlaps with 3.1's but not with 3.1 thinking's. So according to this method, there is one more model we are more certain is better than thinking but not better than 3.1

And that rank is the main sort key. I think cases like this definitely seem weird and are arguments for using a different sorting, but no matter how you order it something will look weird
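
To make the ranking rule above concrete, here is a minimal sketch. It assumes rank = 1 + the number of models whose lower confidence bound sits strictly above this model's upper bound; that exact rule is my reading of the explanation, not confirmed code. `model-x` and its numbers are made up for illustration; the other two scores are the ones quoted above.

```python
def rank_by_ci(models):
    """models maps name -> (score, half_width). Returns name -> rank.

    Assumed rule: a model's rank is 1 plus the number of other models
    whose lower bound is strictly above this model's upper bound.
    """
    ranks = {}
    for name, (score, half) in models.items():
        upper = score + half
        confidently_better = sum(
            1
            for other, (s, h) in models.items()
            if other != name and s - h > upper
        )
        ranks[name] = 1 + confidently_better
    return ranks


models = {
    "model-x": (1430, 8),      # hypothetical model; lower bound 1422
    "3.1": (1417, 6),          # upper bound 1423, overlaps model-x
    "3.1-thinking": (1414, 7), # upper bound 1421, does not overlap
}
print(rank_by_ci(models))  # {'model-x': 1, '3.1': 1, '3.1-thinking': 2}
```

A 3 point score gap and a 1 point difference in interval width are enough to move one model a full rank, which is the kind of oddity described above.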

versed wigeon
#

hello

marble bluff
tropic pivot
warped bone
#

hi

#

@tropic pivot

smoky tapir
#

is it any possibility to use LMArena API Key?

drowsy needle
main estuary
#

hi everyone plzzz i asked something

#

available everyone for me plzz

drowsy needle
main estuary
# drowsy needle whats up?

Okay, tell me, how can I generate unlimited videos either on the official LMArena website or in this group? What should I do so that I get a video exactly related to my prompt, with sound included? I also messaged LMArena in their inbox and gave them a prompt, but I didn’t get a response. What’s going on with this?

main estuary
#

??

drowsy needle
# main estuary Okay, tell me, how can I generate unlimited videos either on the official LMAren...

Sure there are a few questions in here:

  1. The Video Arena bot can only be used in this Discord server, you can't use it through Direct Messages.
  2. It's not unlimited, as you can only create 5 generations per day.
  3. If sound is included or not is going to be random. Not all models have sound capabilities, and since it's random what models you get, that means it's going to be random if you get sound or not.
desert crane
#

Hi Everyone

main estuary
drowsy needle
main estuary
drowsy needle
#

Good place for questions/feedback about the bot would be in #bot-feedback btw. We'd like to keep this channel focussed on #leaderboards discussion.

main estuary
#

ok i understand my sweet lover brother @drowsy needle

drifting hornet
true arch
#

new here, not sure if this is the right place to ask. For the search leaderboard, can we include llama (meta ai) as well? If this has been considered, what are the reasons not to include it? Thanks all.

jaunty cave
#

Hi @true arch does meta surface an API where we can get answers to queries using results of web search + llama model? All the models on search arena are offered as search + llm products by their respective companies.

tough osprey
#

Hello, I'm working right now on a LMArena review video. Is the "Exclude Ties" feature working? Because though the model rank list changes, it's still showing ties.

sour hamlet
#

Where can we actually view the leaderboards? Are they also displayed on the web?

white iron
#

sup

marble bluff
jaunty cave
jaunty cave
full pecan
full pecan
#

hehe

austere charm
#

hello ! everyone ! how are u all ?

pallid lotus
#

let´s check it out

drifting hornet
jaunty cave
dire abyss
#

hello

drowsy needle
drowsy needle
tender sigil
#

look at all these new text models debuting dog we ain’t never getting a leaderboard update 😭💔

jaunty cave
#

really interesting to see how variations even in provider, fal vs seedream official can have noticeable differences. I saw this post earlier about how different providers of other open source text models can differ a lot from the official. 💀
https://github.com/MoonshotAI/K2-Vendor-Verfier

GitHub

Verify Precision of all Kimi K2 API Vendor. Contribute to MoonshotAI/K2-Vendor-Verfier development by creating an account on GitHub.

stray parrot
jaunty cave
stray parrot
#

I hadn't noticed Fal was listed as a separate model lol

#

I love the comparisons to the OpenRouter providers in general, I think the transparency is super important!

jaunty cave
#

Yeah we didn't have the -fal provider suffix before. We added it now since we are serving the same model from different providers. We put a note about it on our Leaderboard Changelog to clarify it. https://news.lmarena.ai/leaderboard-changelog/

LMArena Blog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

September 25, 2025
New model announcement:
Seedream-4-2k has been added to the Text-to-Image and Image Edit leaderboards.

Note that Seedream-4-high-res-f...

acoustic island
#

Hello

drowsy needle
drifting hornet
flat star
#

hoi

burnt raptor
zealous adder
#

Flash slash

twin valve
#

@twilit echo (dunno where to put this, since #ai-creations is now filled to the brim with pics)

I still check your awesome LLM chess benchmark and I check whether the elo changes are stabilizing or not. Gpt5 still seems to have potential to go up. O3 a bit less. Gpt 4.5 feels invincible - that thing can go real high but I guess it is expensive (dunno if it still plays but the match with k2 should be relatively recent. Would it be possible to add dates just as reference?). Grok4 seems stable as well. gpt-3.5-turbo-instruct is still unclear. It still clobbers gpt5 sometimes.

#

Also other models keep coming but they get brought in clap city according to their scores.

#

it is mostly openAI and xAI. Gemini is strange though, its rating is far from stable if one sees the rating swings it gets.

#

and again this guy has the gpt 35 at the bottom https://maxim-saplin.github.io/llm_chess/ but as you ( @twilit echo ) discovered, gpt 35 plays well if the opponent plays well and plays terribly if the opponent plays terribly. Since gpt 35 played only against the random player, it makes sense it goes at the bottom and the benchmark maintainer will never notice that it is actually much stronger than that.

#

correction, the problem seems to be the amount of illegal moves, since the support layer is not strictly helpful.

twilit echo
#

but since Elo is self-correcting the Elo changes even for retired models (since their former opponents change ratings, which gives less/more points adjusted in realtime). gpt-4.5 used to be higher elo (1817 continuation on retirement day)

twin valve
twilit echo
twilit echo
#

but with a database used as source to compute everything, it's trivial to revisit any desired point in time and compute all stats as they were precisely at that moment. so nothing is ever lost even if I don't store hard values, because the formulas are what matters, not the output
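
A rough sketch of the "recompute from the game log" idea. It uses plain sequential Elo as a stand-in; the K-factor, starting rating, and record layout are all assumptions for illustration, and the actual system described in this thread may use a different formula.

```python
from datetime import date


def expected(ra, rb):
    """Standard Elo expected score of a player rated ra against rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))


def ratings_as_of(games, cutoff, k=32.0, initial=1500.0):
    """Replay every game played on or before `cutoff` and return ratings.

    games: iterable of (day, player_a, player_b, score_a), where score_a
    is 1.0 for a win, 0.5 for a draw, 0.0 for a loss (hypothetical layout).
    """
    ratings = {}
    for day, a, b, score_a in sorted(games, key=lambda g: g[0]):
        if day > cutoff:
            break  # later games do not exist yet at this point in time
        ra = ratings.get(a, initial)
        rb = ratings.get(b, initial)
        ea = expected(ra, rb)
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings


games = [
    (date(2025, 9, 1), "a", "b", 1.0),
    (date(2025, 9, 2), "b", "c", 1.0),
]
print(ratings_as_of(games, date(2025, 9, 1)))  # {'a': 1516.0, 'b': 1484.0}
```

Because only the game log is stored, any historical snapshot is just a replay with a different cutoff.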

jaunty cave
tender sigil
jaunty cave
tender sigil
#

Got it — batch MLE for Bradley-Terry makes total sense! Just to clarify: when you say "not using Elo," is the displayed score scale (e.g., 1469, 1437) still a log-transformed version of the BT parameters (like Elo uses for readability), even though the underlying estimation is batched BT? And how do confidence intervals factor in — are they computed on the raw BT scale before scaling to the displayed numbers?

jaunty cave
#

Yep both the scores and the standard errors (+/- values) are computed in what I call the "natural scale" which means initialized to 0, using natural log (base e) and no scaling factor.

The "Elo" scale uses log base 10 and a scaling factor of 400, and an initial rating of something like 1000, 1200 etc.
After we fit the scores we multiply them by 400/log(10) and then add 1000 to each. For the +/- we only do the scaling
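
A small sketch of that conversion, using the 400/log(10) factor and the +1000 offset mentioned above. The function names are made up here; this is not the actual pipeline code.

```python
import math

SCALE = 400 / math.log(10)  # natural-log units -> Elo-style units
OFFSET = 1000               # offset mentioned above


def to_display(score, se):
    """Scores are scaled and shifted; the +/- value is only scaled."""
    return score * SCALE + OFFSET, se * SCALE


def win_prob_natural(a, b):
    """Bradley-Terry win probability on the natural (base e) scale."""
    return 1 / (1 + math.exp(-(a - b)))


def win_prob_display(a, b):
    """The same probability computed from Elo-style displayed scores."""
    return 1 / (1 + 10 ** (-(a - b) / 400))


a, _ = to_display(0.5, 0.0)
b, _ = to_display(0.2, 0.0)
# Both scales give the same predicted win rate.
print(math.isclose(win_prob_natural(0.5, 0.2), win_prob_display(a, b)))  # True
```

The offset cancels in any difference, so only the scale factor affects predicted win rates.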

tender sigil
#

okay, yeah that makes a ton of sense! thanks for the clarity 😄

twin valve
# twilit echo every single day all elo is recomputed from scratch. i don't store the elo value...

I see, but as I understood it, since the previous games happened in the past, they had an influence on the new games (since Elo is moving forward in time), but the newer games do not have an influence on previous games.

Hence a model that doesn't play anymore has the Elo fixed in time.

Am I understanding the approach?

(if yes) It would be different if you, say, would compute the elo regardless of time. So picking results randomly, many times, and then averaging the final result.

tender sigil
#

how are this many ppl that stupid to not read the actual channel they’re sending stuff in I don’t get it

twilit echo
#

my system is more like retroactive Elo or Bayesian rating where the entire history is continuously reinterpreted based on all available information.
This approach is mathematically more sound because it uses maximum information to determine ratings

autumn gyro
#

what is leaderboard in LMArena guys?!

twin valve
twin valve
glass viper
#

nice

alpine bridge
#

when will it finally update

fresh marsh
#

Can you explain to us your thought process for putting this in the leaderboard channel?

uneven adder
sour tundra
#

amazing

mystic cloak
#

Sweet

worn ocean
#

Hello

earnest path
#

Ola gente

gentle pendant
#

Hi

full parrot
#

Hi

sudden surge
#

HELLO

dim wyvern
#

Helloo

bronze ivy
#

Hello

tender sigil
#

everybody treating this like the greeting channel again 😭

#

hopefully we see a leaderboard update tomorrow!

#

probably been enough time to gather enough votes for all of the new models

simple spindle
#

hello

full parrot
#

Hi

tender sigil
#

shush

#

talk about the leaderboards

#

say hello elsewhere

uneven adder
#

Seems benchmaxxxed for some names..

robust zodiac
tender sigil
#

I LOVE POLYMARKET

#

I had good profit off of holding Yes for Alibaba #2 and Google #1 in style control markets

robust zodiac
#

Good picks

#

I cashed out alibaba when it was at 97 i think

tender sigil
#

yea I’ve cashed out some of my position to trade on other markets but

#

there’s some funny business moving in it currently

#

big buys on OpenAI yes and Alibaba no but

#

based off of GPT-5-High’s system prompt

#

I don’t have high hopes for its performance debut it doesn’t seem much stronger

#

October #2 will be interesting

#

Google has been increasing in price, not because it’ll be surpassed I don’t think but likely because new Gemini flash could debut at #2

robust zodiac
#

Flash no2 debut would be brutal

tawny narwhal
#

When they update text leaderboard?

tender sigil
#

Hopefully today!

loud lotus
#

is long-cat-thinking here in test?

drowsy needle
amber comet
idle ocean
drowsy needle
tawdry socket
#

Why isn't Claude 4.5 on leaderboards yet?

#

@drowsy needle When was Claude 4.5 added to arena ? today?

carmine oasis
#

being*

tawdry socket
#

Bring added to leaderboard or added to the arena?

drowsy needle
tawdry socket
#

Oh. So you just added in the arena. Once you have enough votes, it will be added to leaderboard?

drowsy needle
#

And yes, once we collect enough votes and validate we'll push a leaderboard update. TBD on when that'll happen.

tawdry socket
#

So, it is still not there on other arena like text yet?

thorn scroll
#

hi

robust zodiac
drowsy needle
twin valve
rose cradle
tender sigil
#

completely understandable with the amount of new models that have debuted recently - wanting to release scores for them all at once but

#

@tawdry socket Claude Sonnet 4.5 is in Text Arena currently

#

I got it on a prompt earlier today

tawdry socket
#

Nice. How come they didn't have this as one of the anonymous model? Claude doesn't do that?

#

@tender sigil

#

How is it anyway? I can't test it right now .

tender sigil
#

at least they don’t for full release

#

but I have never seen someone show evidence of a codenamed model being part of the Claude family

tawdry socket
#

Makes sense. Claude could have shown off lmarena performance at least on web Arena to market their model like other companies do when they release...

tender sigil
#

even more models added to arena omg 😭 September has been insane

tender sigil
#

fingers crossed 🤞

radiant glacier
#

any chances of the text leaderboard getting updated by tomorrow?

#

😄

drowsy needle
prime valve
#

yes

valid ridge
#

good luck

spare slate
#

I don't get it. How do you actually get the AI to create a video. All I'm seeing is this conversation

void pike
#

#video-arena-3 a little baby come on classroom and background music video generate

carmine oasis
#

When is sonnet added to leaderboard?

echo pewter
#

Hlo

#

Woww

tropic pivot
ruby idol
#

good

rose stream
#

i wonder why

carmine oasis
#

Isn't there enough time, with automation, to get a sonnet45 placement?

rose stream
soft crater
#

Every time a new famous model is released people start asking for a leaderboard update. My best guess is that the update is coordinated with the big companies. I'm not saying that it is manipulated, I'm just suggesting that there is a schedule.

tender sigil
#

There has been a ridiculous amount of new models released recently so

#

it is likely they are trying to update all at once in a big batch - it’s possible we don’t even see a leaderboard update today

idle ocean
#

Yeah, there's a lot of new models that need a lot of votes before it can be accurate

loud lotus
#

damn there are too many models. the rate of leaderboard upgrade cannot catch up with the rate of new models releasing

native bronze
#

@tropic pivot

drowsy needle
jaunty cave
# tender sigil how are this many ppl that stupid to not read the actual channel they’re sending...

Not LMArena but definitely leaderboard related. @twin valve as the resident chess expert what are your thoughts about the most recent chess ratings news? Seems like a previous change was not implemented correctly, and then applied retroactively with little/no warning?
https://x.com/BortnykChess/status/1972746172029936024

Something LMArena would definitely like to avoid

FIDE JUST STOLE MY RATING! How can they go back in time and take all my ratings? Absurd!

amber comet
#

@crimson seal Please, check out ⁠⁠how-to-video-bot to learn how to properly prompt the bot.

tender sigil
#

now up to 13 new models in the arena since the last LB update 😂

#

Wild times to be an arena fan!!

jaunty cave
tender sigil
#

I know right???

#

can’t keep up with getting to talk to all these new players 😂

jaunty cave
tender sigil
#

bahahahahahahaha

#

or make 100 different bots play the same game or logic puzzle and steadily eliminate them when they miss a step

fresh marsh
jaunty cave
fresh marsh
#

like once a week or something

#

battle royale between all models like you said

#

lmao

drowsy needle
#

@vale island be sure to review #1397655624103493813 for information on how to properly use Video Arena

twin valve
# jaunty cave Not LMArena but definitely leaderboard related. <@257929879163633680> as the res...

Hey! Chess expert is too much (I think @twilit echo is as expert as me if not more). Though I am interested in ratings. FIDE didn't apply them retroactively, they simply do a poor job. They noticed that people "farm" rating (nothing new really) and so they decided that for blitz and rapid time control they would use the pure elo formula - and not the 400 points cap - for all players over a certain rating. They did this in Dec 2024.

The problem? They forgot to implement it. Chess is full of nice ideas and poor executions (ironic for an intellectual game)

So thanks to Nakamura FIDE decided to create such a rule also for classical. Wrongly imo, because without enough simulations they simply wreck the work of Mr. Sonas (who introduced the most recent rating fix in March 2024). Simply adding a ton of edge case rules messes with the ratings (and then people foolishly compare ratings between eras, while a lot of small but important details changed in the meantime).

Hence Bortnyk, who together with Naroditsky plays a lot of blitz and rapid in Charlotte, cries about his rating. He is partially right, because FIDE did a poor job of implementing the change and thus it feels like a rating steal, but it is not.

IMO it would have been easier to say "nah, if players farm rating we simply decide to unrate the event for that player", that was always a FIDE prerogative (they did it already with Alireza in 2023)

#

The problem with FIDE is that they are treating the 2700 and 2800 group as a sort of marketing. As if lmarena would say "we cannot have ratings going stale or going down, we need new models to reach 1450, 1500, 1550 and so on. We need to pump the ratings for marketing". So they try to find rules to not erode the 2800/2700 group while not pumping it too much where it feels fake.

#

while in reality only difference matters in ratings.

#

btw on reddit /r/chess there are like 3-4 threads about it.

jaunty cave
#

👀 👀 👀

rose stream
#

Does anyone know why Google has been in the top spot for some time on text?

#

Gemini is good but i rarely ever see it considered the top model by anyone

robust zodiac
#

cause it wins battles constantly?

#

what people "feel" vs a measured deterministic algorithm is often not matching

quaint solstice
#

Hello!!!

jaunty cave
robust zodiac
jaunty cave
#

lol oops I meant to reply to Vitor 🤣

#

You are definitely right, Gemini ranks high because it consistently wins the comparisons

tender sigil
#

Yes!!! new leaderboard update is awesome

tender sigil
#

both new versions of Gemini Flash, new Qwen 3 max and both versions of v1, and both terminus variants of V3.1 - Sonnet 4.5 and GLM 4.6 should debut soon!

tame flint
#

guys in webdev whats diff between qwen3-coder and qwen3-coder-plus ?

idle ocean
#

qwen 3 coder plus is newer, and apparently worse?

drifting hornet
drifting hornet
jaunty cave
# tame flint guys in webdev whats diff between qwen3-coder and qwen3-coder-plus ?

qwen3-coder is this model https://github.com/QwenLM/Qwen3-Coder
and qwen3-coder-plus is an updated version from 9/23 https://www.alibabacloud.com/help/en/model-studio/qwen-coder#a3bbe78773cec

GitHub

Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud. - QwenLM/Qwen3-Coder

drowsy needle
wooden mango
#

Why is gemini 2.5 pro at the top

jaunty cave
wooden mango
#

Really sherlock?

#

But it's garbage

drowsy needle
lilac current
#

here is a working Sora 2 code - SYKQBJ

kind bane
#

Thanks

hardy compass
#

he is normally walking. Cinematic camera motion

rigid viper
#

every time I do a prompt to generate a video, I can't find where it's generated

lyric furnace
#

cat and lion world tur

naive vault
#

Will sonnet 4.5 thinking ever be on the leaderboard

idle ocean
#

was added a few days ago mate

#

don't expect it on the leaderboard that fast all the time

naive vault
#

Wasn’t it added same time as 4.5

idle ocean
#

Yeah, I'm not a lm arena insider but if I had to bet 4.5 thinking probably took first place by far and so they are doing their due diligence to check that that was legit.

#

the non thinking mode has 12 on each side error bars I think

drowsy needle
idle ocean
naive vault
#

Will be interesting to see if it surpasses 2.5 pro

idle ocean
drowsy needle
idle ocean
#

a really cute dog specifically

drowsy needle
fresh marsh
idle ocean
#

Huh

fresh marsh
#

?

#

I meant for general use

#

Not coding

#

Was only saying I doubt 4.5 thinking took 1st place by so far they're delaying the results to verify or whatever was proposed above

#

And now Clayton is making me think otherwise OkAnd

idle ocean
#

Lol

fresh marsh
drowsy needle
hasty mica
#

NICE!

edgy pendant
#

bro plese help me to finde out the code of the sora 2

undone tangle
#

hello

safe thistle
#

🚨 Text-to-Image Leaderboard Shakeup!

Hunyuan Image 3.0 by @TencentHunyuan just stormed into the #1 spot in the Arena 🏆 - ranked as both the top overall and top open-source Text-to-Image model.

🖼️ This image generation model has leapfrogged over Seedream 4, and the famous

idle ocean
manic frost
#

Its only a better version of DALL-E3

idle ocean
#

I mean all of them are better versions of dall e when you think about it hard enough

autumn granite
quartz mountain
frigid wigeon
#

ok

tropic pivot
solemn locust
#

Hello i need coupon flova

idle ocean
#

Please check ⁠how-to-video-bot to learn how to generate

#

the 2 messages above this are literally about this

#

my brain

jaunty cave
viscid palm
#

In the leaderboards, Claude 4.5 sonnet is ranked below Deepseek R1 and gpt5.

However from my usage, I've seen that sonnet makes way better frontend UIs compared to those above it.
Why is it still ranked so below?

late ridge
#

nice

jaunty cave
viscid palm
#

gpt 5 at the top for web dev

#

but purely frontend speaking, claude sonnet performs better from my experience

modest pewter
modest pewter
viscid palm
modest pewter
#

k

idle ocean
viscid palm
#

?

idle ocean
idle ocean
viscid palm
#

Meh even non-thinking performs better than gpt 5

idle ocean
#

If you say so

viscid palm
#

Although I just realised the gpt 5 here is the "high" model, particularly not the one that is available to everyone for free so idk.

drowsy needle
rancid imp
#

Good app

twin valve
idle ocean
#

? Why am i being pinged

twin valve
#

for the discussion we had in the other channel about the topic. The quoted message is a perfect example of what I meant. But If that is "too many messages ago" then nvm.

idle ocean
#

I mean i agree that people are like that..

twin valve
#

ok

sinful pulsar
#

Ok

viscid palm
amber comet
jaunty cave
viscid palm
#

Non-thinking

#

Even 4.1 beats gpt 5 (free) and 2.5 pro Gemini (from my experience)

#

But this is purely frontend-wise. Maybe web dev on the leaderboard is a culmination of all things web dev.

coral pulsar
#

Hi

idle ocean
drowsy needle
#

@wise depot for sonnet-4-5 I believe it's ranked higher despite having a lower score because the upper bound CI could be higher than the rank of the other models ranked below.

#

For these having the same rank & CI I'm not sure why they're listed this way. Will ask and follow up when I know more.

ashen kelp
#

yo

drowsy needle
idle ocean
#

thx

rough cliff
jovial vector
#

Hello

loud lotus
#

does gemini3 show up in arena?

idle ocean
#

Not out yet

robust zodiac
#

is there anything even known about gemini 3 being out? I know there's some codenamed models but neither really sticks out to me as a potential gemini 3 so far

idle ocean
#

not really, supposedly some ab tests on studio but that's it

robust zodiac
#

that's what I thought but I see some people get super hyped about it that I dont even know what to think lol

remote trail
#

(according to some sources, in 2 days [thursday], which i still doubt)

#

but in january we will have it, 100% sure

#

highest probability: this quarter

idle ocean
#

The way some people were talking, it was if it was already released which was annoying

drowsy needle
#

@slow solstice be sure to check out #1397655624103493813 for information on how to use the bot properly.

#

Let me know if you have any questions.

fresh marsh
#

people see in the networking data on those that it has an entirely different version number than 2.5 pro or 2.5 flash

#

and that apparently it's much better

#

not sure if i believe that theyd release a new flash preview end of september for 2.5 if theyre releasing 3.0 pro in october though

#

seems stupid

idle ocean
#

I have seen a better/different feeling model than 2.5

#

true about the flash thing that baffles me

#

they slowly tweaked 2.5 pro over its lifetime, maybe they had a bunch of changes to 2.5 and for some reason ended up just doing it all at once

fresh marsh
#

haven't looked into it myself though

idle ocean
#

I have never heard of such a thing, but I can look into that

#

the December claim is a little stretched, cause 1.0 and 2.0 were both released in December, but plenty of other models get released at other times

median briar
#

I think they were unsure if releasing a new 2.5 version was a good idea (but had a smaller team working on it the whole time) and now they have decided to release it (idk why).

#

But I bet the reason why they did not update 2.5 pro (which they probably have a better version of internally) is to make 3.0 flash look better.

#

And stick to the claim of 3 flash outperforming 2.5 pro

autumn granite
autumn granite
#

Sonnet 4.5 on Web Dev Arena leaderboard...

frozen knot
#

that's non thinking tho

#

also gpt 5 being at the top is weird it always makes like the exact same ui

rose stream
#

Anthropic models just spam gradients and their spacing is really off a lot of the time

frail pecan
#

guys do you know how can I use the Claude sonnet 4.5 thinking? I think it’s different from the external thinking function right?

frail pecan
# idle ocean ??

Or they’re same? Sorry that I’m not familiar with Claude🫡

jaunty cave
frail pecan
jaunty cave
#

oh, sry I don't actually know all about how to use Claude through their app, feel free to use it on LMArena though. I was a huge fan of clearlove too. Amazing career

#

Some lore about me, I started getting interested in statistics and machine learning by ranking League of Legends pros. ClearLove had amazing stats. here's a post I made 9 years ago. https://www.reddit.com/r/leagueoflegends/comments/4q783c/kda_rankings_in_professional_play_over_9200_pro/

Now almost 10 years later I work on leaderboards of AI models instead 😄

Reddit

Explore this post and more from the leagueoflegends community

idle ocean
#

wow, 9 years ago, back then I was younger

frail pecan
marble bluff
modest pewter
rough cliff
potent cave
#

/video promt

amber comet
drifting hornet
drifting hornet
jaunty cave
#

search arena leaderboard updated today, 20k new votes, not much actually changed though lol. We need some new search models

jaunty cave
#

what in the world does :malicepfpmoment: mean?

fresh marsh
#

Hard to explain, inside joke someone into sneakers might get

delicate acorn
#

helo

twin valve
idle ocean
#

Yeah

#

Only xAI seems to have gunned for it

fresh marsh
#

not saying its worse just think its funny

idle ocean
#

Cause gemini sucks at search

twin valve
# fresh marsh dont understand how grok beats google at search lmao

either

  • they don't provide their best search model on lmarena (AFAIK the search models are more or less similar to the free and pro subscription models, the ones around 20$, not the ones that cost more)
  • or search up to a certain level of complexity has no moat
  • or the search questions aren't too tricky and so many models perform similarly (the lmarena search started much later than the lmarena text, so the models there are more or less all recent)

Also not all vendors are in the lmarena. For example mistral has search abilities as cohere does, but they aren't in the arena.

idle ocean
#

nah

#

you are both wrong

#

the reason why Gemini is so bad at search is 78% of the time the links it posts don't work

#

that it

#

Sometimes the best answer is the simplest one

twin valve
#

Interesting because in my cases they work (I mean those under sources) but yeah could be the case too if I got consistently lucky.

radiant glacier
#

why is qwen ranked so well?

#

every time I've tested it the answers were terrible relative to lower ranked models

jaunty cave
#

What sort of questions do you usually ask?

fresh marsh
#

As in more accurate to reality information

#

I think I used qwen max preview not the other qwen max

#

But don't remember

cerulean bane
#

@cerulean bane

marble bluff
wooden mango
tropic pivot
dark solar
#

👍

frozen knot
vestal quiver
#

Why hasn't LMArena updated the text-to-video and image-to-video leaderboards for almost a month?

fresh marsh
#

I wonder if theyre not counting votes either until then?

fresh marsh
quartz mountain
#

@dreamy crater please head to -> #1397655624103493813 to learn how to prompt the bot and in what channels it is available for use

#

@last storm please review the info on how to use the Arena bot and in what channels by going to -> #1397655624103493813

queen ivy
#

looking for feedback on what the community considers a good AI model with so many to choose from. Saves time.

empty whale
#

i need a vid presenting future tense with will and going to

brave cobalt
#

nano

drowsy needle
drowsy needle
#

Would encourage others to vote on their preferences in the video arena channels!

marble bluff
fresh marsh
light mirage
#

How to create videos?

wooden mango
tender sigil
#

this channel for some reason

fresh marsh
idle ocean
#

Yeah

rough cliff
tropic pivot
runic glen
#

How to generate 3D video?

drowsy needle
drowsy needle
jaunty cave
autumn granite
#

I think it's interesting how the order is so different from the coding category on the text leaderboard

twin valve
#

@drowsy needle

twin valve
autumn granite
wooden mango
#

Haven't tried it for coding yet

autumn granite
#

This is just Web Dev Arena like Pier said, so actual agentic coding performance might differ.

twin valve
# autumn granite Yeah there's that aspect of it, but I also think the coding subcategory is kinda...

yes, hence I asked you to do that as you are quick.

The point of the "unreliability" is simply that ratings have a lot of nuances. With the filtering we are checking the results within models that matched each other, rather than considering other models as well. It is a property of the ratings.

Otherwise one should generate other ratings (with filtering). Hopefully one can do that once we get all the results of the battles.

#

the best IMO would be having a sort of meta benchmark, with many different (somewhat reliable) benchmarks taken into account.

Some did those, but they don't get updated that often.

Example: https://x.com/scaling01/status/1919217718420508782

The Ultimate LLM Meta-Leaderboard averaged across the 28 best benchmarks

Gemini 2.5 Pro > o3 > Sonnet 3.7 Thinking

autumn granite
#

LiveBench agentic coding scores got updated (I think they increased the number of maximum turns).

idle ocean
#

5 Pro does badly, weird

fresh marsh
#

And sonnet 4.5 non thinking

fresh marsh
#

Lol

twin valve
#

hence my point "I wish such meta leaderboards would be done more often"

fresh marsh
drowsy needle
idle ocean
#

sora 2 doesn't beat veo 3?

fresh marsh
#

thats unbelievable to me

wooden mango
#

Veo 3 is kinda nice

viscid haven
#

Hi

marble bluff
hidden cedar
#

hello

wooden mango
#

Hmm fried chicken yess

glacial wigeon
#

nice one

timber stone
#

gemini 2.5

high pilot
#

hey, how can find my videos??

wooden mango
night scroll
#

@lofty solstice Note that this is an English only server.

neat quarry
wooden mango
#

What did you use?

fringe ridge
#

hi

tight rock
#

Hey guys 👋 , I just want to know how often the text leaderboard updates. I see that the latest update was on 8 Oct. If Google releases a new model in late October, will this be reflected on the leaderboard soon?

drowsy needle
tight rock
tight rock
drowsy needle
fresh marsh
#

😭

upbeat swift
#

Saw this paper a couple days ago: https://arxiv.org/pdf/2510.02306
It makes use of lmarena datasets (text, search, vision) and compares several rating systems: Elo, Glicko, TrueSkill, and an online variant of Bradley-Terry.

They claim that by just ignoring data from ties you get better predictive accuracy (even on datasets including ties). It's cool they released code too.

I have some issues with it though, mainly that it only uses dynamic rating systems rather than true Bradley-Terry. In my experience the dynamic rating systems are much more noisy and have recency bias on arena data. Also in my thinking Elo and online Bradley-Terry should be the same thing. 🤔

Anyway, thought it was interesting, wonder if anyone else has seen this or has done experiments specifically about ties.
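
On the "Elo and online Bradley-Terry should be the same thing" point: a quick numerical check, offered as a sketch rather than a claim about the paper's code. It assumes the textbook Elo update and a single-game Bradley-Terry log-likelihood in base-10/400 units; with those, one Elo update equals one gradient ascent step with learning rate K * 400 / ln(10).

```python
import math

C = math.log(10) / 400  # converts Elo-style units to natural-log units


def win_prob(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))


def log_lik(ra, rb, s):
    """Bradley-Terry log-likelihood of one game; s is A's score (1, 0.5, 0)."""
    p = win_prob(ra, rb)
    return s * math.log(p) + (1 - s) * math.log(1 - p)


def elo_update(ra, rb, s, k=32.0):
    """Textbook Elo update for player A."""
    return ra + k * (s - win_prob(ra, rb))


def sgd_update(ra, rb, s, k=32.0, h=1e-4):
    """One gradient ascent step on log_lik, gradient via central difference."""
    grad = (log_lik(ra + h, rb, s) - log_lik(ra - h, rb, s)) / (2 * h)
    return ra + (k / C) * grad


print(abs(elo_update(1500, 1400, 1.0) - sgd_update(1500, 1400, 1.0)) < 1e-6)  # True
```

The analytic gradient is C * (s - p), so the two updates match exactly up to finite-difference error; a system that adds per-player step sizes or other terms would no longer coincide.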

wheat tiger
#

Hello

edgy pendant
#

can you give me invitation code for sora 2 plese give me

twin valve
# upbeat swift Saw this paper a couple days ago: https://arxiv.org/pdf/2510.02306 It makes use ...

this is what I meant with "the community would do such analysis if this https://discord.com/channels/1340554757349179412/1372537524551159913 would happen".

Btw if I am not wrong lmarena has already a category where ties are removed.

Also cthorrez seems the same account as @jaunty cave . Isn't it the same user behind those accounts?

I think ties are actually important. In my simulations for other use cases (chess, videogames) ties really help stabilize the values.

Sometimes they compress rating differences too much though, so one could also play around with weights for ties (for example in some videogames we decided with the developer to weight the result of draws by 0.5). But the more one plays around the less consensus the model has (that is, it becomes more arbitrary)

heady flume
#

Hello, our team trained a language model and we want to submit it to the leaderboard. Could you please tell me how to submit the model? Thanks.

twin valve
idle ocean
rough cliff
brave pier
#

how to use the bot

rough cliff
twin valve
twin valve
idle ocean
twin valve
# idle ocean Clayton

I see. I may have been mixing people up, but Clayton should have a surname (at least from what I saw online) similar to the other user's. I could be wrong though.

idle ocean
#

Ok

silk yoke
#

very nice

upbeat swift
idle ocean
#

So true

twin valve
#

so I wasn't wrong. Good to know.

jaunty cave
drowsy needle
drowsy needle
idle ocean
#

they said claude 4 sonnet performance and it seems like they didn't lie about that

drowsy needle
jaunty cave
#

it's like "ok let me do a quick vibe check with thousands of people all over the world doing different things"

idle ocean
#

tho it beats the thinking version, correct?

#

and that's the non-thinking?

twilit echo
#

It placed #46/#65 on my own leaderboard, #55 in Chess. Overall pretty consistent for a mini/flash type model.
Don't doubt it achieved Sonnet 4 level scores on the benchmarks it was optimized for (SWE-bench etc.), but that doesn't universally translate to real world usage. Sonnet is smarter with more world knowledge (which is to be expected for larger models).

bold jungle
#

when will the copilot leaderboard get updated??

karmic fog
#

patience

twin valve
#

there are other coding leaderboards, and one can get an idea of how models perform from all of those

twilit echo
idle ocean
#

Ok

#

How about 4.5 or 4 thinking?

twilit echo
idle ocean
#

Ooh nice

#

4 thinking does worse than 4?

drifting hornet
reef moth
#

@jaunty cave @drowsy needle hi, something went wrong, 2.5 pro disappeared from the leaderboard

drowsy needle
ripe bronze
#

hello

#

is there any way to use Veo 3.1 for free?

native pewter
#

Hello

lethal root
#

hi again

marble bluff
marble bluff
tropic pivot
vital pelican
#

HmHmmm

tropic pivot
worldly vigil
#

Hello. I am here to test & compare different AI models.

drifting hornet
#

@marble moth Please read our guide in how-to-video-bot to learn how to properly prompt the bot.

stark wharf
#

hello community, I'm here to explore the world of AI

sick crow
rough cliff
inland mason
idle ocean
#

no

wide nova
#

I can certainly help you with that. While the current application is designed for analyzing existing videos, I can upgrade it to generate new videos from a text description using the Gemini API. This will allow you to create the video you described.
Here is a summary of the changes I'll make:
Video Generation Mode: I'll introduce a new "Generate Video" mode. On launch, you'll be able to choose between analyzing an existing video or generating a new one.
Prompt-based Creation: In the new mode, you'll find a text area where you can describe the video you want. I've pre-filled it with your request for a video about world leaders.
API Key Selection: Video generation with the Veo model requires an API key with billing enabled. I'll add a button that prompts you to select your API key before you can generate a video.
Real-time Feedback: Video generation can take a few minutes. I'll display a status screen with informative messages to keep you updated on the progress.
Seamless Integration: Once your video is ready, it will automatically load into the player, and you'll be able to use all the existing analysis tools on it.
Here are the updated files for the application:

ember quartz
#

Sora 2 code

full warren
#

A short, hyper-realistic video clip in 8K. Scene: A beautiful woman in bed at night.
The video starts with a close-up on her face as she tosses her head gently on the pillow, her eyes closed but her expression troubled.
Her eyes snap open. She looks towards something off-screen (like an alarm clock), and a wave of deep sadness and frustration washes over her face.
She then turns her head to stare out a window as a single, silent tear slowly rolls down her cheek.
Cinematic, dramatic low-key lighting from moonlight. The mood is somber, quiet, and deeply emotional.

#

A realistic, short video clip in a vertical aspect ratio (9:16). A beautiful, sad woman is in bed, tossing and turning in slow motion. The camera is a close-up on her face, capturing her pained and troubled expression. She briefly opens her weary eyes, then closes them again with a deep, sorrowful sigh. The scene is lit by cool, blue moonlight. Highly detailed, photorealistic, 4K. Focus is on the raw emotion of sleeplessness.

idle ocean
earnest arrow
#

Complete Profile info

fervent magnet
#

Is there gemini 3 in lm arena?

idle ocean
sharp fiber
#

video area 1 is fire

tough pewter
#

true

drifting hornet
rough cliff
kindred bane
#

I'm here to explore the world of AI

floral bridge
#

hello

rough cliff
idle ocean
#

SO TRUE

drowsy needle
idle ocean
#

a lot of people were saying sora 2 was better than veo 3.1 so that's a surprise

drowsy needle
#

sora-2 overall still performing really well

undone abyss
rough cliff
edgy flower
#

HI THERE

fringe falcon
#

hello

covert timber
#

yup

tepid halo
#

Hi

marble bluff
#

@woven pewter @devout sorrel @quick pivot @abstract bane Please check on #1397655624103493813 to learn how to use the bot properly.

#

@quick pivot Hey! We ask you to use English only in this server, thank you.

quick pivot
quick pivot
rough cliff
spare zenith
#

Hi

trail valve
#

wow

violet flint
#

hi

wary stag
#

Rehan

sharp silo
#

hi

steep elk
#

hi

shadow wave
#

Hi

drowsy needle
#

hey everyone! let's try to use #general for greetings.

drifting hornet
quiet jetty
sudden bluff
#

THANKS

drifting hornet
golden nexus
#

hello, how do I generate an AI video?

sick rover
#

Why is there no sound when generating a video?