#leaderboards | Arena | Page 2

acoustic aurora May 28, 2025, 5:09 PM

#

Yea, curious on the general arena as well..the community is curious

drowsy needle May 28, 2025, 5:30 PM

#

it's updated now 👍

drowsy needle May 28, 2025, 5:30 PM

#

acoustic aurora Yea, curious on the general arena as well..the community is curious

I can't confirm when leaderboards will be updated sorry to say

zealous sable May 28, 2025, 5:31 PM

#

drowsy needle it's updated now 👍

works 😄

soft crater May 28, 2025, 6:31 PM

#

I was about to come here and ask for an update 😄

#

why only webdev?

willow holly May 28, 2025, 8:32 PM

#

@soft crater
I guess they want a nice day for Claude and the site because in the general lmarena it won't look so nice and people will write again how it is a bad benchmark :P.

pastel orbit May 28, 2025, 8:58 PM

#

How many runs have you done?

#

Where does opus stack against Gemini in your tests?

willow holly May 28, 2025, 9:38 PM

#

As it is the non-reasoning version, it loses in logic, math, drawing, listing content, and formatting a certain way...

It does answer normal questions very well. It also puts out short answers when a long one isn't needed. It is better at creating some unique ideas, and the writing sounds nice.

It is a good model but it really depends on what you test.

soft crater May 29, 2025, 4:48 PM

#

I don't want to look anxious but this is becoming weird... Is there any reason I missed?

drowsy needle May 29, 2025, 4:58 PM

#

soft crater I don't want to look anxious but this is becoming weird... There is any reason I...

unfortunately we don't provide details on when leaderboards will be updated

devout canyon May 29, 2025, 5:11 PM

#

soft crater I don't want to look anxious but this is becoming weird... There is any reason I...

So weird

tulip shadow May 29, 2025, 8:14 PM

#

Maybe they're implementing sentiment control?

pastel orbit May 29, 2025, 8:47 PM

#

Maybe someone on the team is on poly market 💀💀

soft crater May 29, 2025, 8:47 PM

#

drowsy needle unfortunately we don't provide details on when leaderboards will be updated

I don't judge, everyone needs to do their best to make a profit.

drowsy needle May 29, 2025, 8:59 PM

#

soft crater I don't judge, everyone needs to do their best to make a profit.

I'm not sure I'm following so let me know what I'm missing, but typically we wouldn't give details on what day/time leaderboards will update. Normally it's about a week

drowsy needle May 29, 2025, 9:01 PM

#

pastel orbit Maybe someone on the team is on poly market 💀💀

No, this isn't the case

queen jewel May 29, 2025, 9:15 PM

#

too many gamblers here😅

pastel orbit May 29, 2025, 10:52 PM

#

hey im not gonna pretend i don't have skin in the game, but it does sting when the leaderboard that the results are based on just doesn't update before the market resolves, despite the other leaderboards updating

twin wharf May 29, 2025, 10:55 PM

#

gamblers gonna gamble

scarlet grove May 29, 2025, 11:36 PM

#

lol

twin valve May 30, 2025, 8:23 AM

#

pastel orbit hey im not gonna pretend i don't have skin in the game, but it does sting when t...

then make your own leaderboard with your own benchmark and the problem is solved. Complaining that a FREE tool that doesn't owe you anything is not behaving like you wish, so that you can earn money, is not a good sign.

willow holly May 30, 2025, 9:55 AM

#

Also the Claude models won't be #1 in the Text benchmark; they are not general enough. So it doesn't matter for the bet on 1st place.
Of course it would be nice to have an update soon anyway

pastel orbit May 30, 2025, 10:14 PM

#

twin valve then make your own leaderboard with your own benchmark and the problem is solved...

it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website), it's just odd to not update the boards for so long after a major release, despite the webdev board being updated so quickly.

Ppl keep saying Claude is for coding only so it wouldn't be good at text, but it's possible that the additional tooling and agentic nature would allow it to provide more useful results, even when just chatting with it. If ppl are so confident that it'll be worse than google's model, there should be no harm in updating the leaderboard to reflect that. Waiting abnormally long right as a market is about to resolve just is sus from an optics perspective. Devs could easily selectively update the board to win bets

twin valve May 30, 2025, 10:22 PM

#

pastel orbit it's not about the fact that it's free (they just got $100m in funding btw, for ...

"it's not about the fact that it's free (they just got $100m in funding btw, for a leaderboard website)"

did you pay part of those 100m? If not, for you and me it is free, the rest is fiction.

"Waiting abnormally long"
That sounds really like /r/choosingbeggars . The update cycle is always around a week and the more votes you get (note that you need to filter them) the better to assess the score. Who cares about gamblers, one wants proper assessments.

Again make your own leaderboard if you need it so badly. Behaving with such entitlement is never a good sign.

pastel orbit May 30, 2025, 10:22 PM

#

lmao

twin valve May 30, 2025, 10:22 PM

#

eh indeed, maybe the best answer to your post would be lmao

pastel orbit May 30, 2025, 10:23 PM

#

you'd think with a $600m valuation you'd have a public schedule for updating, you love making excuses huh

twin valve May 30, 2025, 10:23 PM

#

no need to refute entitlement

#

nah fam, you are just too entitled

#

one needs votes for accurate scoring, it is statistics

pastel orbit May 30, 2025, 10:23 PM

#

what a reddit tier response

twin valve May 30, 2025, 10:23 PM

#

otherwise one does scoring against static questions, not human driven

pastel orbit May 30, 2025, 10:24 PM

#

calling having basic company practices for a 9 figure business entitlement

twin valve May 30, 2025, 10:24 PM

#

pastel orbit calling having basic company practices for a 9 figure business entitlement

whatever.

#

you are willingly missing the point.

pastel orbit May 30, 2025, 10:25 PM

#

champion for mediocrity

twin valve May 30, 2025, 10:25 PM

#

"lmao"
"you love making excuses huh"
"what a reddit tier response"
"champion for mediocrity"

champion of proper arguments.

#

ask an LLM to help

#

writing another non-argument

#

I am still waiting for a rebuttal against "not having collected enough votes"

pastel orbit May 30, 2025, 10:27 PM

#

your responses sound like one, just give up bro. you're defending a startup that got a 2021 level seed round from a crypto hype VC and can't handle being a proper oracle for data. The whole value of these types of leaderboards is diminished anyway due to the models being optimized for them

#

i bet you when it does get updated, it has way more votes than were needed for good data

twin valve May 30, 2025, 10:28 PM

#

"just give up bro" to a choosing beggar? lmao

pastel orbit May 30, 2025, 10:28 PM

#

you'll see other results on there with less than 1/3 the votes

twin valve May 30, 2025, 10:28 PM

#

actually they publish models with already way too few votes

#

3000 votes aren't many. I mean for some stats process yes, but for the variables at play they aren't

pastel orbit May 30, 2025, 10:29 PM

#

they just coincidentally decided to increase their threshold for this week all of a sudden

twin valve May 30, 2025, 10:29 PM

#

I don't get why you are mad about it

#

I mean even if they want to wait a year, it is in their rights

#

again make an alternative leaderboard

#

What irks me is not many of your arguments, that I can simply ignore because they aren't such, rather the fact that you demand something from someone that owes you nothing.

#

in general I would greatly prefer a proper leaderboard that tries to assess the best

#

even if it is updated randomly once a quarter

#

again, when they publish some models with 3k votes, that isn't enough. There are categories with barely 600 votes. Considering the variables at play (the knowledge of the person judging, the difficulty of the question, the pairings, etc...) 600 votes for some categories are nothing.
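
To put rough numbers on the vote-count point, here is a minimal sketch. It assumes each vote is an independent coin flip between two evenly matched models, which is a big simplification: it ignores voter knowledge, question difficulty, and pairings, all of which would make the real uncertainty larger. The function name and defaults are illustrative, not anything LMArena publishes.

```python
import math


def win_rate_margin(votes, p=0.5, z=1.96):
    """Approximate 95% margin of error on a head-to-head win rate.

    Uses the normal approximation to the binomial distribution.
    `p` is the assumed true win rate and `z` the normal quantile.
    """
    return z * math.sqrt(p * (1.0 - p) / votes)


for n in (600, 3000):
    # Print the margin as a percentage of the win rate.
    print(n, "votes ->", round(100 * win_rate_margin(n), 1), "points")
```

Under these assumptions, 600 votes leave a margin of roughly 4 percentage points on the win rate, and 3000 votes roughly 1.8.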

#

further if they publish things in a way that let people bet even more, there will be even more possible "rigging" in action. So actually making (entitled) gamblers mad is a good thing.

whole wharf May 31, 2025, 2:14 AM

#

come on update the leaderboards 😳

queen jewel May 31, 2025, 5:08 AM

#

Google will be #1 anyway, so what difference does it make whether the leaderboard is updated or not?
The gamblers here begging for leaderboard updates are not only gambling addicts but also have some unrealistic delusions. Perhaps they are one of those who bought Anthropic at 20c lol

drowsy needle May 31, 2025, 5:26 AM

#

whole wharf come on update the leaderboards 😳

sorry for the wait! we're collecting votes but the results should come soon

zealous sable May 31, 2025, 7:38 AM

#

poly will go crazy with the last minute leaderboard updates

twin wharf May 31, 2025, 11:46 AM

#

drowsy needle sorry for the wait! we're collecting votes but the results should come soon

You shouldn't feel beholden to gamblers, take your time

sinful falcon May 31, 2025, 12:17 PM

#

queen jewel Google will be #1 anyway, so what difference does it make whether the leaderboar...

i mean they have the web dev rating up

#

conditioned on the fact they are waiting for more votes to update the main tells me it’s at least slightly closer than the market is pricing it at

#

to be clear i still think gemini is gonna be up top

#

but i don’t think it’s gonna be a blowout

twin valve May 31, 2025, 1:45 PM

#

for what it's worth, from my own testing on the leaderboard, I would be surprised to see claude higher than #5 overall. But since claude is very recognizable (dry answers af as long as it is not a technical question), people could also pump its score.

wind vale May 31, 2025, 3:34 PM

#

I think a great leaderboard should be updated on a regular schedule. Not only is the influence of this page growing and attracting more and more attention, but a regular schedule is also fairer to AI models. It would also benefit the development of this website.

whole wharf May 31, 2025, 3:37 PM

#

I agree

queen jewel May 31, 2025, 4:15 PM

#

They still need to collect enough votes to get a credible score. Without enough votes, even if the leaderboard is updated regularly, models with insufficient votes or those whose scores are withheld by the model vendor for reasons such as product schedules will still be hidden on it. This has nothing to do with "fairer to AI models," nor will it give your bets any advantage.

upbeat swift May 31, 2025, 4:21 PM

#

When a model gets deprecated, which of the following happens:

1. Its sampling is reduced to 0, but previous battles it was a part of are still used in the next leaderboard calculation
2. Its sampling is reduced to 0, and all battles it was part of are removed from the dataset used to construct the leaderboard

#

or something else

sinful falcon May 31, 2025, 5:19 PM

#

queen jewel They still need to collect enough votes to get a credible score. Without enough ...

i mean the simple answer is update every friday, and new models make the update if and only if they have x votes by then

#

if ppl wanted regular updates
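
The proposed rule is simple enough to sketch. This assumes a hypothetical dictionary of per-model vote counts and a hypothetical threshold `min_votes`; neither the names nor the numbers come from LMArena.

```python
from datetime import date


def models_for_update(today, vote_counts, min_votes):
    """Return the models that would enter the leaderboard on `today`.

    Updates only happen on Fridays (weekday() == 4); a model is included
    if and only if it has at least `min_votes` votes by then.
    """
    if today.weekday() != 4:
        return []
    return sorted(m for m, v in vote_counts.items() if v >= min_votes)


# Made-up vote counts for two placeholder models.
counts = {"model-x": 3500, "model-y": 800}
print(models_for_update(date(2025, 5, 30), counts, min_votes=3000))
```

A model that misses the threshold simply waits for the next Friday, so the schedule stays fixed while the vote requirement is still enforced.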

#

idrc when it updates

#

i’m indifferent

#

i like when there’s more uncertainty in the market

twin valve May 31, 2025, 7:03 PM

#

upbeat swift When a model gets deprecated, which of the following happens: 1. Is it's sampli...

good question. I was hoping that the model still gets used to compute leaderboard scores, even if deprecated

#

otherwise things get messy as many pairings do not happen

#

then one has "cliques" in the rating model

upbeat swift May 31, 2025, 7:05 PM

#

I think they probably continue to use the already collected data but would like a confirmation. One of the criticisms from "The Leaderboard Illusion" is that deprecating models makes the comparison graph disconnected and the ratings unreliable.

But that would only be the case if when they deprecate, they remove those rows from the dataset
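
The disconnection concern can be sketched as a graph check. This is a minimal example, assuming battles are stored as (model_a, model_b) pairs and that deprecation either keeps or drops a model's rows. The battle log and model names below are made up for illustration.

```python
from collections import defaultdict


def is_connected(battles):
    """Return True if every model is reachable from every other via battles."""
    graph = defaultdict(set)
    for a, b in battles:
        graph[a].add(b)
        graph[b].add(a)
    if not graph:
        return True
    # Depth-first search from an arbitrary model.
    start = next(iter(graph))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(graph)


# Hypothetical log: "old" is the only model that fought both groups.
battles = [("a", "old"), ("old", "b"), ("a", "c"), ("b", "d")]
# Option 2 from the question: drop every battle the deprecated model took part in.
without_old = [(x, y) for x, y in battles if "old" not in (x, y)]

print(is_connected(battles))      # rows kept: graph stays connected
print(is_connected(without_old))  # rows removed: graph splits in two
```

In this toy log the graph only falls apart when the deprecated model's rows are deleted, which matches the point that keeping the collected data avoids the problem.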

twin valve May 31, 2025, 7:05 PM

#

sinful falcon i mean the simple answer is update every friday, and new models make the update ...

I like this, but instead of every friday, every other friday.

tulip shadow Jun 1, 2025, 4:27 AM

#

twin valve I like this, but instead of every friday, every other friday.

i mean it seems they could at some point just set up a pipeline that does it automatically

devout canyon Jun 1, 2025, 7:18 AM

#

twin valve I like this, but instead of every friday, every other friday.

Excuse me for saying this, but I’ve been following the messages in this chat for quite a while, and it’s very clear that you’re constantly defending the lmarena team, almost like you’re their lawyer. It’s honestly hard to believe. Come on, we’re just users, and it’s perfectly fair to ask for an update on the most prominent AI leaderboard, especially after major releases like Claude 4 and the new DeepSeek R1 update. I’m sure many people feel the same way.

twin valve Jun 1, 2025, 10:28 AM

#

devout canyon Excuse me for saying this, but I’ve been following the messages in this chat for...

because I think it is a good project (in determining some LLM abilities) and some critique is undeserved. I rather agree with technical critique, like the "leaderboard illusion", rather than critique from gamblers that want to know early what is going on only for themselves.

As some complain about leaderboard updates I can complain about their complaints.

I don't see why the complaints of the gamblers should be left alone.

soft crater Jun 1, 2025, 5:37 PM

#

First I felt weird now I laugh because it is too weird

west lodge Jun 1, 2025, 9:48 PM

#

Hi everyone, please allow us a little bit more time to update the leaderboard result. We've been going through a big UI/backend transition last week to the new website and the team is working super hard on finishing the new leaderboard pipeline. We want to make sure we get everything right. The result will be ready very soon! Thanks for your patience. 🙏

tender sigil Jun 2, 2025, 12:31 AM

#

queen jewel Google will be #1 anyway, so what difference does it make whether the leaderboar...

That guy who lost multiple thousands on Anthropic shares confidently bragging about how “20 cents is underpriced” had me dying, PolyMarket “traders” are like the final boss of degen gambling

willow holly Jun 2, 2025, 10:03 AM

#

the output speed certainly didn't change ~130 token. Stealth update on the API which is called 04-16 would be surprising but possible of course

half stream Jun 2, 2025, 3:36 PM

#

The leaderboard got updated just recently.

Screenshot_2025-06-02-22-36-08-960_com.android.chrome.jpg

Screenshot_2025-06-02-22-36-32-763_com.android.chrome.jpg

#

No R1 May yet. Not enough battles? 🤔

zealous sable Jun 2, 2025, 3:47 PM

#

half stream The leaderboard got updated just recently.

it took Claude 10 days to finally get added to the leaderboard, R1 should take another week i think

sinful falcon Jun 2, 2025, 4:02 PM

#

a little surprised claude wasn't higher tbh

#

like did not expect higher than gemini ofc

tender sigil Jun 2, 2025, 9:05 PM

#

Sentiment control might boost it a little higher, its very bland textual style hurts it a bit in the eyes of voters

willow holly Jun 2, 2025, 10:02 PM

#

@sinful falcon not surprised, it is the non-thinking version. It fails at so many logic, complex prompt-following, and math tasks against the reasoning models

soft crater Jun 2, 2025, 10:02 PM

#

sinful falcon a little surprised claude wasn't higher tbh

probably they arent using reasoning

#

If not using thinking, possibly is the #1

wind vale Jun 3, 2025, 5:30 AM

#

I'd like to know the difference between different AIs in terms of web search functionality, and there doesn't seem to be an option for that

#

can we do that? It's really useful for users, not only in text

half stream Jun 3, 2025, 5:38 AM

#

https://lmarena.ai/leaderboard/search

#

https://legacy.lmarena.ai

twin valve Jun 3, 2025, 10:09 AM

#

half stream The leaderboard got updated just recently.

as expected claude didn't get too high despite the style control (without it, it is barely within the top 10)

twin valve Jun 3, 2025, 10:10 AM

#

sinful falcon a little surprised claude wasn't higher tbh

all versions of claude are dry af. If you use claude.ai it is not as dry. Likely it is intentional.

#

for my personal tally I can recognize claude answers very quickly (though I cannot tell which claude model it is) and it often loses. This is when the questions aren't hard. On hard questions it performs a bit better.

willow holly Jun 3, 2025, 1:55 PM

#

They also have the lazy mindset.

"List all official German cities with less than 300k citizens and less than 6 letters."

and opus always puts out a couple and puts a note under it:
"... , so this represents just a selection of the more notable ones."

when you reprompt it to put out as many as it knows it is not bad but always needs two prompts.

other llms just do what you ask them to do

tender sigil Jun 3, 2025, 7:59 PM

#

soft crater If not using thinking, possibly is the #1

Claude 3.7 Sonnet Thinking only had an extra 7 elo points over regular Claude 3.7 Sonnet so

#

unless they’ve had a strong redesign of the thinking feature expecting such a large jump is a tad unrealistic

twin valve Jun 4, 2025, 2:31 PM

#

tender sigil Claude 3.7 Sonnet Thinking only had an extra 7 elo points over regular Claude 3....

I was going to object but checking here you are right AND with the SC they have zero difference (unless I am blind)

soft crater Jun 4, 2025, 4:11 PM

#

In Text-to-Image Arena, gpt-image-1 just surpassed imagen3

willow holly Jun 4, 2025, 5:07 PM

#

image-1 is the gpt4o image model?

heavy thorn Jun 5, 2025, 10:55 AM

#

willow holly image-1 is the gpt4o image model?

Yes, this is the name in the API for it.

soft crater Jun 5, 2025, 5:34 PM

#

AI Supremacy

willow holly Jun 5, 2025, 6:20 PM

#

Improved in every single category and now in first place in every single one. Also huge jumps of 50+ elo in languages (Chinese, French...). They didn't leave anything out. Also crushed on aider. So even agentic coding seems to be #1 again.

timber owl Jun 5, 2025, 7:11 PM

#

willow holly In every single category improved and in every single one on first place now. Al...

oh right time to test translate

tulip shadow Jun 5, 2025, 9:55 PM

#

soft crater AI Supremacy

I feel opus thinking will at least beat this in webdev, it is insanely expensive though

devout canyon Jun 5, 2025, 10:04 PM

#

soft crater AI Supremacy

Why do they update the leaderboard just after the release of the new Google model, but it took them two weeks to update it when Claude launched their claude 4 series ?

queen jewel Jun 5, 2025, 10:19 PM

#

devout canyon Why do they update the leaderboard just after the release of the new Google mode...

cuz claude is not an anonymous model

twin valve Jun 5, 2025, 10:28 PM

#

queen jewel cuz claude is not an anonymous model

good point. I can spot Claude 90 times out of 100 (though not its version. Either 3.5, 3.6, 3.7 or 4)
Though I'd prefer longer runs for every model just in case.

soft crater Jun 6, 2025, 2:10 AM

#

devout canyon Why do they update the leaderboard just after the release of the new Google mode...

Shhhh... it's a mystery

autumn granite Jun 6, 2025, 5:45 AM

#

Can we have a Svelte 5 mode in Web Arena? Most models can't use Runes properly yet. Adding it to the arena would incentivize companies to make their models better at Svelte, instead of just focusing on React.

tulip shadow Jun 6, 2025, 2:27 PM

#

autumn granite Can we have a Svelte 5 mode in Web Arena? Most models can't use Runes properly y...

basic arena when?

sinful falcon Jun 10, 2025, 12:43 AM

#

devout canyon Why do they update the leaderboard just after the release of the new Google mode...

because google put their model on the site before the official unveiling

peak prairie Jun 10, 2025, 4:46 AM

#

heyy

#

any new news about kingsfall?

hallow comet Jun 10, 2025, 8:05 PM

#

anyone tried out o3 pro yet?

sinful falcon Jun 10, 2025, 9:03 PM

#

hallow comet anyone tried out o3 pro yet?

5r/jane trading these markets now huh?

rain fiber Jun 11, 2025, 4:28 AM

#

sinful falcon 5r/jane trading these markets now huh?

huh.

sinful falcon Jun 11, 2025, 9:25 AM

#

rain fiber huh.

brookfield place is the name of the building those firms are in

rain fiber Jun 11, 2025, 4:55 PM

#

sinful falcon brookfield place is the name of the building those firms are in

ahh okay

hallow comet Jun 12, 2025, 1:52 AM

#

sinful falcon 5r/jane trading these markets now huh?

shhhh

tender sigil Jun 12, 2025, 3:27 PM

#

Were there any new models placed on the leaderboard with the update yesterday? as far as I see the only changes are just Google models dropping a few spots, o3 moved over 2.5 Pro 05-06, Opus 4 moved over 2.5 Flash 05-20, plus both GPT 4.1 and Grok 3 moved over 2.5 Flash 04-17

willow holly Jun 12, 2025, 5:28 PM

#

If they only do updates every 1-2 weeks it would be nice to have an arrow up/down for placement shifts and a mark for new models.

tender sigil Jun 12, 2025, 7:17 PM

#

You can compare to previous Twitter posts of the leaderboards, but that appears to be the only snapshot feature

sinful falcon Jun 12, 2025, 7:29 PM

#

last one:

gemini-2.5-pro-preview-06-05: 1476.34 (18.59)
gemini-2.5-pro-preview-05-06: 1446.0 (9.0)
gemini-2.5-flash-preview-05-20: 1420.3 (12.46)
o3-2025-04-16: 1420.24 (5.41)
chatgpt-4o-latest-20250326: 1416.86 (4.79)
grok-3-preview-02-24: 1412.27 (4.93)
early-grok-3: 1410.43 (5.53)
gpt-4.5-preview-2025-02-27: 1406.12 (5.75)
llama-4-maverick-03-26-experimental: 1404.25 (6.74)
gemini-2.5-flash-preview-04-17: 1392.36 (5.85)

placid glen Jun 12, 2025, 7:41 PM

#

What about the Wayback Machine?

glacial glacier Jun 13, 2025, 1:01 AM

#

@drowsy needle you can delete messages from this @frosty osprey guy right
*also spammed in #prompt-to-leaderboard

frosty osprey Jun 13, 2025, 1:01 AM

#

Yes

drowsy needle Jun 13, 2025, 1:31 AM

#

glacial glacier @drowsy needle you can delete messages from this @frosty osprey g...

blobthanks

brittle pine Jun 13, 2025, 12:59 PM

#

The reason google models have such a high position in these lists is that they're using RAW outputs, just like in ai studio with safety filters off

#

Less censored, more detailed, just what people wanted

#

But if you use on web/mobile app : outputs are trash because of dumb safety filters and dumb system prompt

tender sigil Jun 13, 2025, 11:25 PM

#

Gemini being better in LMArena than in-app for the ppl who pay for it is kinda funny lowk

twin valve Jun 14, 2025, 12:27 AM

#

tender sigil Were there any new models placed on the leaderboard with the update yesterday? a...

the lmb leaderboard has that hint: https://ktibow.github.io/lmb/

tender sigil Jun 14, 2025, 12:57 AM

#

oh, Coolio!

#

just Nova Experimental then

tame surge Jun 14, 2025, 2:38 AM

#

Why is the new R1 not on leaderboards yet?

brittle pine Jun 14, 2025, 11:40 AM

#

tender sigil Gemini being better in LMArena than in-app for the ppl who pay for it is kinda f...

Yea. This is the reason why so many people are using ai studio instead of the app

#

If their membership includes ai studio in future ill gladly pay it

orchid kestrel Jun 14, 2025, 1:07 PM

#

best ai for image generation so far in lmarena?

drowsy needle Jun 14, 2025, 4:19 PM

#

orchid kestrel best ai for image generation so far in lmarena?

depends on what I'm trying to do but overall I'm a fan of photon. check out the leaderboards here - https://lmarena.ai/leaderboard/text-to-image

errant scroll Jun 16, 2025, 12:27 AM

#

Any plan to add (deep) research leaderboard? I'm sure it would be expensive but I know I'd love to give you guys my opinion on models 😂

autumn granite Jun 16, 2025, 8:49 AM

#

Could the blank response rates (per provider, if applicable) be reported for each model on Web Arena? Blank responses are pretty annoying and I'm curious whether it's a function call or API issue. Either way, it might incentivize model makers/providers to fix it.

sinful falcon Jun 16, 2025, 2:50 PM

#

errant scroll Any plan to add (deep) research leaderboard? I'm sure it would be expensive but ...

i would use it just to get free access to deep research models ngl

twin valve Jun 16, 2025, 8:00 PM

#

sinful falcon i would use it just to get free access to deep research models ngl

I do think many use lmarena already for that for some models

#

I do for some large testing - that is, to see how different models reply to the same query. Otherwise I should go the openrouter way

tender sigil Jun 17, 2025, 5:45 AM

#

twin valve I do think many use lmarena already for that for some models

haven’t paid for an AI subscription since I found LMArena 4 months ago, lol

devout canyon Jun 17, 2025, 8:17 PM

#

I apologize for this, but it’s not normal at all. Why do you always release the update of the leaderboard in the same hour when Google launches a new model, but not with the other releases?

drowsy needle Jun 17, 2025, 9:06 PM

#

devout canyon I apologize for this, but it’s not normal at all. Why do you always release the ...

Thanks for the question, totally fair to ask. We do sometimes coordinate leaderboard updates with model providers if they request it. This gives them a chance to celebrate how you, the community, responded to the model as part of their timed announcement to the public.

That option is open to everyone, but not all providers choose to use it. Many updates happen independently, based on when we’ve collected and validated enough new voting data, a process that usually takes about a week. We’re also exploring ways to make updates more frequent and automatic, so the process feels more consistent no matter who’s involved.

We appreciate your attention to fairness, as it’s something we also care deeply about.

devout canyon Jun 17, 2025, 9:28 PM

#

drowsy needle Thanks for the question, totally fair to ask. We do sometimes coordinate leaderb...

Fair enough. Thank you so much for the transparency; I really appreciate it.

native pecan Jun 19, 2025, 12:32 AM

#

Hi! I was wondering if there is a way to access the historical leaderboards of LM Arena? I'm looking for previous rankings for academic citation purposes. Thanks in advance!

drowsy needle Jun 19, 2025, 12:47 AM

#

native pecan Hi! I was wondering if there is a way to access the historical leaderboards of L...

Sorry to say there is not within LMArena. You may find our blog helpful as there are some screenshots, here is an example: https://blog.lmarena.ai/blog/2025/two-year-celebration/

native pecan Jun 19, 2025, 2:01 AM

#

thanks for your reply

crude hawk Jun 19, 2025, 8:56 AM

#

native pecan Hi! I was wondering if there is a way to access the historical leaderboards of L...

there's historical LB/elo ratings here https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main
(i played around with it a few months ago in #leaderboards, seems it's still being updated)


#

it's nice they maintain that as a public resource.. tho i'm doubtful we'll ever get raw Arena chat data again, at least in large volume.. (it's valuable.. and they've got investors behind them now aha)

#

they do refer to their past release of some of the chat data in that blog post.. so who knows.. i'd be happy to have my cynicism proven wrong aha

twin valve Jun 19, 2025, 11:13 AM

#

About the "i'm doubtful we'll ever get raw Arena chat data again": I wouldn't like to get the raw prompts and answers, as those could be used for benchmaxxing. Maybe those that are very old (2+ years) could be released.

What I would really like to see, for confirming the leaderboard and making alternative ones, would be at least the results of the votes. Those would be very interesting. See ... oh it is gone. So I created a request - dunno where it is now - to ask about the results of the battles, not the text, only:

model A
model B
result

so that one could independently verify that the leaderboard is correctly computed and also apply other rating systems and/or do other analyses.
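
With only those three columns, an independent check could look like the sketch below. It uses a plain online Elo update as a stand-in; this is an assumption for illustration and not necessarily how the leaderboard itself is computed. The result encoding ("a", "b", "tie"), the K-factor, and the base rating are all hypothetical.

```python
def elo_ratings(battles, k=32.0, base=1000.0):
    """Compute online Elo ratings from (model_a, model_b, result) rows.

    `result` is "a" if model A won, "b" if model B won, or "tie".
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, result in battles:
        ra = ratings.setdefault(model_a, base)
        rb = ratings.setdefault(model_b, base)
        # Expected score of model A under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (score_for_a[result] - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


# Made-up rows in the requested format: model A, model B, result.
rows = [("x", "y", "a"), ("y", "z", "tie"), ("x", "z", "b")]
for model, rating in sorted(elo_ratings(rows).items(), key=lambda kv: -kv[1]):
    print(model, round(rating, 1))
```

The same rows could be fed to other rating systems, which is the kind of alternative analysis the vote results would make possible.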

#

ah it is not gone, it is here: https://discord.com/channels/1340554757349179412/1372537524551159913

crude hawk Jun 19, 2025, 11:24 AM

#

yeah that's prob a fair point re benchmaxing (but i still think something should be released; like it's the public who's casting the votes.. but atm the data only goes to LMarena and the providers who serve models, at least partially.. doesn't feel ideal but yeah i do hear your point)

twin valve Jun 19, 2025, 1:22 PM

#

yeah, for that I say: release with a delay (2+ years). Sooner or later it will all be released, and models cannot benchmaxx too quickly.

On another side, the more they collect, the more their data become actually a valuable dataset and they can fund themselves selling it. That would be ok too for me. Only I really wish they could release the result of the votes, that is already one way to verify the system.

drowsy needle Jun 19, 2025, 2:20 PM

#

crude hawk there's historical LB/elo ratings here https://huggingface.co/spaces/lmarena-ai...

thank you for sharing, I wasn't aware this was available, good to know. blobthanks

fleet stag Jun 19, 2025, 7:55 PM

#

Not sure if tables are a part of "style control", but if not, I’d definitely recommend including them. Feel like o3 is so aggressive with them, so it could be a big factor

left cobalt Jun 19, 2025, 9:18 PM

#

Hi, I wonder how to participate in the maintenance of lmarena. is there any official manner?

drowsy needle Jun 19, 2025, 9:23 PM

#

left cobalt Hi, I wonder how to participate in the maintenance of lmarena. is there any offi...

"participate in the maintenance of lmarena"
Sorry to say I'm not following, would you mind elaborating a bit further?

upbeat swift Jun 20, 2025, 12:57 AM

#

left cobalt Hi, I wonder how to participate in the maintenance of lmarena. is there any offi...

A lot of it is open source, a while back I just made a PR and they reviewed it.

clear pulsar Jun 20, 2025, 8:46 AM

#

Hello. I'm doing some research based on LMArena and I wonder which part of data is included in the leaderboard/Arena Overview table (In the screenshot).

Does it include only text or also multimodal chats and text2image?
Does it include data from other arenas such as Copilot Arena?

median briar Jun 20, 2025, 9:18 AM

#

clear pulsar Hello. I'm doing some research based on LMArena and I wonder which part of data ...

only text based, as soon as you enter an image it will only be part of the vision leaderboard and the same goes for the other categories

#

*as far as i know

clear pulsar Jun 20, 2025, 11:55 AM

#

median briar only text based, as soon as you enter an image it will only be part of the visio...

Thanks very much!

left cobalt Jun 20, 2025, 8:23 PM

#

upbeat swift A lot of it is open source, a while back I just made a PR and they reviewed it.

Hi, @upbeat swift I hope to contribute to the open-source proj of lmarena, is there any?

gleaming storm Jun 21, 2025, 3:51 AM

#

hello, I'm here to check out this arena

autumn granite Jun 21, 2025, 10:20 AM

#

Imagine if there was a code interpreter built right into LMArena... coding rankings would be far more reliable. There are already open source libraries for this.

See LiveCodes (MIT licensed), which runs entirely client-side and supports 90+ languages & frameworks: https://livecodes.io/docs

twin valve Jun 21, 2025, 5:32 PM

#

running the interpreter for every question would cost a bit I'd imagine

autumn granite Jun 22, 2025, 8:03 PM

#

twin valve running the interpreter for every question would cost a bit I'd imagine

The one I linked is embeddable and runs entirely in the client's browser, no additional server-side costs. 🙂

rotund burrow Jun 23, 2025, 6:53 PM

#

Should we be expecting Claude 4 models with thinking on the leaderboards at some point?

brittle pine Jun 24, 2025, 1:36 AM

#

rotund burrow Should we be expecting Claude 4 models with thinking on the leaderboards at some...

they are stubborn about not using emojis, not using lists, not using any graphs, not giving detailed answers, soo

#

even though claude's answers are really good, people still care about those things

rotund burrow Jun 24, 2025, 1:37 AM

#

brittle pine even though claude's answers are really good, people still care about those things

How does that affect whether or not they appear on the leaderboards? Non-thinking Claude 4 does.

tender sigil Jun 24, 2025, 6:59 AM

#

“hey this model has been in the arena and named for a while now, will we see it on the leaderboard soon?”

#

“it doesn’t use emojis”

#

thank u for that brilliant insight 😭

#

To answer your question Brian, Claude 4 thinking models will likely be included in the next leaderboard update! a bunch of new models have been added in the last week, I believe we’ll see an update in the next few days, or when Gemini 2.5 Pro Deep Think releases 👍

sonic junco Jun 24, 2025, 9:13 AM

#

I think it’s useful to get more clarity on how often/when leaderboards get updated

#

It would be hugely useful to get something akin to a weekly update and additional updates when there are new models releasing that are coordinated with the lmarena team

brittle pine Jun 24, 2025, 12:42 PM

#

tender sigil thank u for that brilliant insight 😭

Well, I just misunderstood the question. Is there a reason to be rude? You guys are more polite to AI than to humans, which is kinda dystopic not gonna lie. Anyway claude wrote a whole paper about how human feedback turns models lame and they talked about exactly my points like style, using graphs or emojis. So yes

rotund burrow Jun 24, 2025, 1:36 PM

#

tender sigil To answer your question Brian, Claude 4 thinking models will likely be included ...

thanks!

drowsy needle Jun 24, 2025, 3:59 PM

#

sonic junco I think it’s useful to get more clarity on how often/when leaderboards get updat...

Currently, we do update our leaderboards about once a week. That said, we are looking into making these updates more frequent.

sonic junco Jun 24, 2025, 4:20 PM

#

drowsy needle Currently, we do update our leaderboards about once a week. That said, we are ...

Love to hear it, look forward to this week’s update 🙂

tender sigil Jun 24, 2025, 5:09 PM

#

brittle pine Well, I just misunderstood the question. Is there a reason to be rude? You guys...

the fact AI is becoming smarter than humans like you is even more dystopic, if you ask me 😭

brittle pine Jun 24, 2025, 5:11 PM

#

facts

#

agree

tender sigil Jun 24, 2025, 5:13 PM

#

Claude wrote a whole paper about human feedback turning models “lame” and they talked exactly about your points? That’s so cool! Was the original question asking about leaderboards, or Claude emoji usage?

#

I’ll give you a hint, we are currently in the #leaderboards channel 😄

tawdry socket Jun 25, 2025, 12:39 PM

#

@drowsy needle Was 0605 name changed to 2.5pro? How does 2.5pro have 10,000+ votes otherwise?!

hallow comet Jun 25, 2025, 1:00 PM

#

tender sigil the fact AI is becoming smarter than humans like you is even more dystopic, if y...

Ai currently is smarter than a small % of humans. Yet no one cares. Why will anyone care when that % is 10, 20, 90 or 99?

sinful falcon Jun 25, 2025, 1:04 PM

#

hallow comet Ai currently is smarter than a small % of humans. Yet no one cares. Why will any...

i think AI is smarter than any person not an expert in their field tbh

#

already

#

i'd rather have AI do my homework for an upper level college class than a random person w that major

hallow comet Jun 25, 2025, 1:09 PM

#

sinful falcon i'd rather have AI do my homework for an upper level college class than a random...

It will be in stages

Smarter than avg person, bachelor student, entry level, master student, senior, etc. At some point or another everyone will be dumber than AI. Then there will be no more homework or work

sinful falcon Jun 25, 2025, 1:10 PM

#

hallow comet It will be in stages Smarter than avg person, bachelor student, entry level, ma...

i think already better than all but senior level

#

if it was trained on case law i think it would be better than most lawyers

#

but it is 100% better than any junior associate

lean lotus Jun 25, 2025, 1:11 PM

#

sinful falcon i think AI is smarter than any person not an expert in their field tbh

Needs more scaffolding to be honest. They are useless when they are offline

#

Updating a model to a new cutoff date takes a long time

hallow comet Jun 25, 2025, 1:16 PM

#

sinful falcon i think already better than all but senior level

Eh i think its very smart at 5 min tasks, but not much else. I consider it smarter than 1-5% of people

However i have no doubt google or openai have some monster agi hidden in their labs somewhere

sinful falcon Jun 25, 2025, 1:20 PM

#

lean lotus Needs more scaffolding to be honest. They are useless when they are offline

ok but like a good lawyer still has to search cases

sinful falcon Jun 25, 2025, 1:21 PM

#

hallow comet Eh i think its very smart at 5 min tasks, but not much else. I consider it smart...

as a total agent sure

lean lotus Jun 25, 2025, 1:22 PM

#

sinful falcon ok but like a good lawyer still has to search cases

That's why they need more tools or they will hallucinate like crazy

hallow comet Jun 25, 2025, 1:23 PM

#

Most models are very bad at long term stuff even very simple ones, like simple games

lean lotus Jun 25, 2025, 1:23 PM

#

Pokemon

hallow comet Jun 25, 2025, 1:23 PM

#

O3-pro i think was somewhat better than the rest but takes forever to run

hallow comet Jun 25, 2025, 1:23 PM

#

lean lotus Pokemon

Generally speaking ..

lean lotus Jun 25, 2025, 1:25 PM

#

hallow comet Generally speaking ..

Yes

#

They still lack common sense

solemn edge Jun 25, 2025, 2:58 PM

#

Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?

tender sigil Jun 25, 2025, 7:58 PM

#

sinful falcon i think AI is smarter than any person not an expert in their field tbh

eh, most complicated reasoning tasks it still sucks at

#

AI is genuinely AWFUL at poker

sinful falcon Jun 25, 2025, 7:59 PM

#

tender sigil AI is genuinely AWFUL at poker

so are most people though

tender sigil Jun 25, 2025, 7:59 PM

#

https://open.substack.com/pub/natesilver/p/chatgpt-is-shockingly-bad-at-poker?r=2q2eme&utm_medium=ios

ChatGPT is shockingly bad at poker

I’m impressed by large language models. So why can't they get the basics of poker right?

tender sigil Jun 25, 2025, 8:00 PM

#

sinful falcon so are most people though

the average poker player could beat the smartest LLM, easily

#

game theory in general is a strong weak spot of even the “reasoning” LLMs

twin valve Jun 25, 2025, 8:51 PM

#

tender sigil the fact AI is becoming smarter than humans like you is even more dystopic, if y...

is there a need for free hostility like this? <@&1349916362595635286> could we not be pointlessly hostile in this discord?

drowsy needle Jun 25, 2025, 9:05 PM

#

twin valve is there a need for free hostility like this? <@&1349916362595635286> could we b...

Yeah overall agree that folks should be able to get points across without being disrespectful. I'll follow up.

tender sigil Jun 25, 2025, 9:29 PM

#

this has already been resolved ☺️

#

don’t think the mods need u to do their job for them pier :p

drowsy needle Jun 25, 2025, 9:36 PM

#

let's just move on, no need to rehash things

placid jungle Jun 25, 2025, 10:32 PM

#

Would be cool to know if Gemini 2.5 pro is its own new model or a rename seeing as 06-05 is gone 👍

#

I noticed both were on leaderboard last night, but a couple hours later 06-05 was removed

drowsy needle Jun 25, 2025, 10:34 PM

#

placid jungle Would be cool to know if Gemini 2.5 pro is its own new model or a rename seeing ...

will check in on this and keep you updated

placid jungle Jun 25, 2025, 10:35 PM

#

drowsy needle will check in on this and keep you updated

Cool, thank you very much!

timber owl Jun 25, 2025, 11:08 PM

#

placid jungle Would be cool to know if Gemini 2.5 pro is its own new model or a rename seeing ...

gemini 2.5 pro is gemini 06 05
the change is reflected on every platform everywhere, gemini 2.5 pro is GA

placid jungle Jun 25, 2025, 11:14 PM

#

timber owl gemini 2.5 pro is gemini 06 05 the change is reflected on every platform everywh...

06-05 is still in webdev arena leaderboard, and shows on google's model page as different versions. What makes you think it is the exact same model? Not saying you're wrong - I think it likely is, just curious

#

I do see it's gone from the AI Studio drop down tho

lean lotus Jun 25, 2025, 11:18 PM

#

placid jungle 06-05 is still in webdev arena leaderboard, and shows on google's model page as ...

Because it's the GA version

#

https://nitter.net/OfficialLoganK/status/1935005573965398025#m

#

https://fixupx.com/OfficialLoganK/status/1935005571016544332

Logan Kilpatrick (@OfficialLoganK)

Introducing the Gemini 2.5 model family:

- Gemini 2.5 Pro (Stable, no changes from 06-05)
- Gemini 2.5 Flash (Stable, updated pricing from 05-20)
- Gemini 2.5 Flash-Lite (Preview, small reasoning model)

More info in 🧵


placid jungle Jun 25, 2025, 11:21 PM

#

lean lotus https://fixupx.com/OfficialLoganK/status/1935005571016544332

Ah makes sense, thank you. I did see this thread but missed the part calling 06-05 variant the new stable model and implied name change

#

This is what I was looking to see, thank you!

sinful falcon Jun 26, 2025, 12:09 AM

#

placid jungle Ah makes sense, thank you. I did see this thread but missed the part calling 06-...

ughhhhhh

drowsy needle Jun 26, 2025, 1:11 AM

#

placid jungle Cool, thank you very much!

Looks like you already know, but yeah.

half stream Jun 26, 2025, 7:51 AM

#

qwen3-235b-a22b-no-thinking got 1408 on hard prompts, while the reasoning-enabled one got 1387, and the difference is significant.
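
For a sense of scale, a 21-point gap can be turned into an expected head-to-head win rate. The sketch below assumes the leaderboard scores behave like standard Elo ratings on a 400-point logistic scale; that scale and the helper name `expected_score` are assumptions for illustration and are not taken from LMArena's code.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard Elo logistic curve.

    The 400-point scale is the conventional Elo choice and is assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Scores quoted in the message above: no-thinking 1408 vs reasoning-enabled 1387.
p = expected_score(1408, 1387)
print(f"{p:.3f}")  # about 0.530
```

Under that assumption the no-thinking variant would be preferred in roughly 53% of direct matchups. Whether the gap is statistically significant depends on the confidence intervals, which this sketch does not model.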

grave rampart Jun 26, 2025, 7:51 AM

#

hello

vocal rain Jun 26, 2025, 11:02 AM

#

tender sigil https://open.substack.com/pub/natesilver/p/chatgpt-is-shockingly-bad-at-poker?r=...

they are pretty bad at anything game related, even on a basic level

#

they can pick up patterns really well in text, but games work on a different format, maybe thats part of the reason

#

also games have multiple steps too

tender sigil Jun 26, 2025, 7:43 PM

#

yeah, multi-step reasoning is kinda complicated for them since if one underlying step gets messed up everything else flops

#

like the prompt “You and 99 other players each privately choose a number between 0 and 100. The winner is whoever gets closest to exactly 2/3 of the average of all submitted numbers. What number should you choose and why? Walk through your complete reasoning process.” I came up with to see how far they would take the logic

#

I got an answer of 0 because it kept recursively calculating “2/3 of this average is 33.3, 2/3 of 33 is 22, 2/3 of 22 is…”

#

flamesong was the only one to correctly intuit the concept, and guessed 15
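
The recursion described above is often called level-k reasoning: a level-0 player guesses around 50, and each further level best-responds by taking 2/3 of the previous level's guess. A minimal sketch, assuming a level-0 average of 50 (the function name and that starting value are illustrative assumptions):

```python
def level_k_guess(k: int, level0_mean: float = 50.0, factor: float = 2 / 3) -> float:
    """Guess of a player who assumes everyone else reasons k-1 levels deep.

    level0_mean is the assumed average of naive (level-0) guesses.
    """
    return level0_mean * factor ** k


for k in range(5):
    # Prints 50.0, 33.3, 22.2, 14.8, 9.9 for k = 0..4
    print(k, round(level_k_guess(k), 1))
```

A guess of 15 lands near three levels of reasoning, while repeating the step indefinitely approaches 0, the Nash equilibrium.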

twin valve Jun 26, 2025, 8:22 PM

#

half stream `qwen3-235b-a22b-no-thinking` got 1408 on hard prompts, while the reasoning-enab...

could be that the thinking led them to "overthink" and thus give slightly worse answers (just speculation here but I could imagine that)

tender sigil Jun 26, 2025, 8:29 PM

#

it rambled a lot more

idle ocean Jun 27, 2025, 12:15 AM

#

tender sigil like the prompt “You and 99 other players each privately choose a number between...

To be fair that's the legitimate answer in a zero sum game, since everyone would tie if everyone put 0.

Nash Equilibrium etc etc.

#

and if you did this test where all of the "players" were different multi step reasoning ai's then flamesong technically got last, because they were the furthest away from the answer.

tender sigil Jun 27, 2025, 12:19 AM

#

yeah, but it’s kinda obvious that not every player is a game theory optimal player

#

proof of the lack of multi-level reasoning, only seeing the Nash Equilibrium instead of thinking about other players’ non-perfect strategies

idle ocean Jun 27, 2025, 12:22 AM

#

maybe try specifying that their opponents are human.

tender sigil Jun 27, 2025, 12:44 AM

#

idle ocean maybe try specifying that their opponents are human.

interesting, stonebloom guessed 18.6 and Claude Opus 4 guessed 13

idle ocean Jun 27, 2025, 12:46 AM

#

improvement I guess? ¯_(ツ)_/¯

twin valve Jun 27, 2025, 8:34 AM

#

tender sigil like the prompt “You and 99 other players each privately choose a number between...

Likely if you prompt only this, they reference discussion that they have seen in their training and the answer is zero.

Most discussions online don't cover imperfect strategies.

The data shows that the correct answer is not 15, it varies (the more the experience, the more it goes to zero): https://en.wikipedia.org/wiki/Guess_2/3_of_the_average

Guess 2/3 of the average

In game theory, "guess 2/3 of the average" is a game where players simultaneously select a real number between 0 and 100, inclusive. The winner of the game is the player(s) who select a number closest to 2/3 of the average of numbers chosen by all players.

#

To me it seems obvious that the models are likely very influenced by the usual "the Nash equilibrium is zero".

median briar Jun 27, 2025, 9:39 AM

#

honestly 0 is the right answer considering that most people who ask that question without any context are likely talking about standard game theory

#

and not some "Ahm actually 🤓 , no one specified that the players are rational or common knowledge of rationality exists"

#

and beyond that there is no "solution" to this question, the best thing you can do is make an educated guess about the other players (rationality, experience, if the game will be repeated ...)

clear pulsar Jun 28, 2025, 3:58 AM

#

Interesting one. If I add "Note that you're playing with humans who are not always rational", models give a different answer. The GPT series prefers numbers around 22.

#

BTW the current deepseek v3 (not r1) on the official site gives me a long CoT (just without <think>). I wonder how much data from r1 they used to train v3. v3 answers 8, r1 answers 20 after a long thinking process, kimi k1.5 just doesn't stop thinking for at least 10 minutes, and finally answers 22.

#

This triggers long thinking output in qwen3, kimi and deepseek (all reasoning). I wonder if it's the model's fault or if it is just too hard to make a decision

analog cliff Jun 28, 2025, 9:18 AM

#

lmarenalogo

halcyon grove Jun 28, 2025, 1:39 PM

#

is o3-pro in the leaderboards?

fallen slate Jun 28, 2025, 5:18 PM

#

HU

#

Here to learn

forest schooner Jul 3, 2025, 6:45 AM

#

battle

sinful falcon Jul 4, 2025, 12:25 AM

#

is grok 4 in the arena yet?

hallow comet Jul 4, 2025, 11:26 AM

#

sinful falcon is grok 4 in the arena yet?

Visibly no, otherwise yes

sinful falcon Jul 8, 2025, 5:19 PM

#

guys can someone let me know what they think about XAI being first in lmarena on idk say 10AM EST last day of the month?

#

🥺👉👈

rain fiber Jul 9, 2025, 12:38 AM

#

hallow comet Visibly no, otherwise yes

wdym?? so its been released under an anonymous name?

cobalt crest Jul 9, 2025, 2:52 PM

#

sinful falcon guys can someone let me know what they think about XAI being first in lmarena on...

The ai space is moving so quickly right now, that no one answer would be acceptable, because somehow every big name at the front of the ai game has something that the people REALLY want to see (Gemini 2.5 Pro deep think, Grok 4, Claude 4.5?, Deepseek (delayed theirs) and some others I can't name off the top of my head)

thorny ginkgo Jul 9, 2025, 5:35 PM

#

cobalt crest The ai space is moving so quickly right now, that no one answer would be accepta...

GPT-5 not in the top of your head lol

soft fable Jul 9, 2025, 5:45 PM

#

sinful falcon guys can someone let me know what they think about XAI being first in lmarena on...

qwen will take the prize, trust

rotund burrow Jul 9, 2025, 6:39 PM

#

A couple weeks ago I asked about sonnet and opus 4 thinking on the leaderboards. Does anyone know if this is still planned, or blocked somehow? From the original X thread it looked like Sonnet and Opus for non-thinking.

drowsy needle Jul 9, 2025, 7:22 PM

#

rotund burrow A couple weeks ago I asked about sonnet and opus 4 thinking on the leaderboards. Doe...

Hello - we had a similar question pop up in #general. There is a plan to update the leaderboard soon. Unfortunately, there was an issue preventing it from appearing properly, but we do have a plan to fix this.

rotund burrow Jul 9, 2025, 7:24 PM

#

drowsy needle Hello - we had a similar question pop up in #general. There is a p...

Thanks!

tender sigil Jul 10, 2025, 2:40 AM

#

rain fiber wdym?? so its been released under an anonymous name?

pretty sure it’s wolfstrike

sinful falcon Jul 10, 2025, 9:14 PM

#

tender sigil pretty sure it’s wolfstrike

what do u think of wolfstrike?

tender sigil Jul 10, 2025, 9:31 PM

#

sinful falcon what do u think of wolfstrike?

it’s pretty strong - may not be a version of Grok due to its low charisma in communication, but is consistently one of the top performers in complex reasoning tasks

rustic quartz Jul 10, 2025, 9:33 PM

#

wolfstride/stonebloom are the checkpoints of the same Google model

#

we still don't know which one: 2.5 Pro-next, 2.5 Ultra or even early 3.0 Pro checkpoints

rapid cobalt Jul 11, 2025, 9:33 AM

#

Wait, is that a typo

#

or is there a model called Wolfstrike?

#

I only got wolfstride

rustic quartz Jul 11, 2025, 10:03 AM

#

yeah, a typo

tawdry hearth Jul 11, 2025, 11:22 AM

#

hey everyone

marble olive Jul 11, 2025, 9:03 PM

#

rustic quartz we still don't know which one: 2.5 Pro-next, 2.5 Ultra or even early 3.0 Pro che...

idk wolfstride feeling like a gemma model lol

rustic quartz Jul 11, 2025, 9:03 PM

#

marble olive idk wolfstride feeling like a gemma model lol

no, it has impressive world knowledge

marble olive Jul 11, 2025, 9:06 PM

#

lowkey it's probably just a pseudonym for a host of different models

robust zodiac Jul 12, 2025, 10:12 AM

#

When can we expect to see a leaderboard update? Rather new here/not familiar with how it works. I see some were updated 5 days ago, some 50 days ago

drowsy needle Jul 12, 2025, 3:21 PM

#

robust zodiac When can we expect to see a leaderboard update? Rather new here/not familiar wit...

Should be seeing updates soon! It takes a bit of time to collect votes but updates are coming.

heady patio Jul 13, 2025, 3:24 PM

#

sinful falcon guys can someone let me know what they think about XAI being first in lmarena on...

I asked one question and one of the models was grok4, apparently it is not good enough.

blissful pulsar Jul 13, 2025, 7:18 PM

#

sinful falcon guys can someone let me know what they think about XAI being first in lmarena on...

100%

tender sigil Jul 14, 2025, 3:38 PM

#

rustic quartz wolfstride/stonebloom are the checkpoints of the same Google model

wonder if blacktooth was another checkpoint? it came before, but seemed decently similar to stonebloom/wolfstride

sacred basalt Jul 14, 2025, 7:04 PM

#

I can’t see kimi k2 ?!

drowsy needle Jul 14, 2025, 7:18 PM

#

sacred basalt I can’t see kimi k2 ?!

you can't? are you sure you have chat selected and not image?

sacred basalt Jul 14, 2025, 7:22 PM

#

drowsy needle you can't? are you sure you have `chat` selected and not `image`?

I mean in the leaderboard ?!

drowsy needle Jul 14, 2025, 7:26 PM

#

sacred basalt I mean in the leaderboard ?!

oh, that makes more sense lol! Leaderboards take a bit of time to update for newly added models.

tawdry hearth Jul 14, 2025, 8:12 PM

#

Grok 4 is finally shown in the WebDev leaderboard. Seems kinda low-ish. Does that make sense since it needs more votes?

obsidian flint Jul 14, 2025, 8:58 PM

#

Lmao

reef moth Jul 14, 2025, 10:15 PM

#

tawdry hearth Grok 4 is finally shown in the WebDev leaderboard. Seems kinda low-ish. Does tha...

it's terrible at coding, and even worse at front-end coding

tawdry hearth Jul 14, 2025, 10:53 PM

#

Won 4/4 on my end

obsidian flint Jul 14, 2025, 11:17 PM

#

0/3 Grok on my end. Ping Pong game was buggy. pomodoro timer looked worse/less features. Sticky Note app didn't load. Why is it so bad at coding?

reef moth Jul 14, 2025, 11:33 PM

#

obsidian flint 0/3 Grok on my end. Ping Pong game was buggy. pomodoro timer looked worse/less f...

They have a separate grok coding model but haven't released it

brittle pine Jul 15, 2025, 1:25 AM

#

reef moth it's terrible at coding, and even worse at front-end coding

Then what exactly is grok 4 good at

#

Math and physics?

#

Did elon just train mechah*ler on his spacex data?

#

What makes grok 4 special

twin valve Jul 15, 2025, 1:04 PM

#

@glacial glacier what's with the "estimated based on other leaderboards" trick? Sounds like a neat idea if it infers the score using other somewhat reliable benchmarks (livebench and others)

glacial glacier Jul 15, 2025, 2:46 PM

#

twin valve @glacial glacier what's with the "estimated based on other leaderboards" t...

ha, that's a better idea than what i'm doing... currently using another platform that also has an Elo-based leaderboard when the scores seem reasonable, since it's easier to implement tho

half stream Jul 15, 2025, 4:05 PM

#

half stream `qwen3-235b-a22b-no-thinking` got 1408 on hard prompts, while the reasoning-enab...

Coming back to this topic. Sonnet 4 Thinking significantly outperformed Sonnet 4 normal, while Opus 4 modes are within the margin of error.

What went wrong actually on Qwen's end? 🤔

Screenshot_2025-07-15-23-01-27-126_com.android.chrome.jpg

Screenshot_2025-07-15-23-02-00-628_com.android.chrome.jpg

twin valve Jul 15, 2025, 6:34 PM

#

glacial glacier ha, that's a better idea than what i'm doing... currently using another platform...

ok a new one. I think they are doing more or less like lmarena (just testing) but with logins and - why not - farming data. So yes it makes them similar to lmarena, only less open/scientific.

I think that as an estimate it will work very well, compared with trying to find a relationship between multiple benchmarks and lmarena (though that would be a nice project)

twin valve Jul 15, 2025, 6:38 PM

#

half stream Coming back to this topic. Sonnet 4 Thinking significantly outperformed Sonnet 4...

good question, have you searched about it if someone already did some analysis? (maybe using some deep search as search helper)

That is: why thinking models do not necessarily outperform base models

twin valve Jul 15, 2025, 6:47 PM

#

twin valve ok a new one. I think they are doing more or less like lmarena (just testing) bu...

I checked around about yupp.ai. Incredibly it is really like lmarena, only their business (so far) is really to collect data. So rather than having OpenAI, Meta and whatnot collect data for further training, they say "hey come to us and let us collect your prompts and preferences" (both are helpful)

What I find amusing (and sad) is that projects like lmarena, that try to be as transparent as they can (they cannot be too transparent otherwise no funding), get all the flak. Then comes around the next project that is totally closed and 100% I guess it won't be questioned.

But I welcome lmarena "replicas" so to speak. If multiple arenas more or less return the same results, at the end the findings are validated. The problem of yupp (or also sciarena) is how they evaluate the pairings and the prompts. Different rating systems could yield different results, without even considering the fact that some votes may be ignored as deemed unhelpful. (hence my request of results so that the community can independently create different leaderboard analyses: #1372537524551159913 message )

thorny ginkgo Jul 18, 2025, 4:52 PM

#

reef moth it's terrible at coding, and even worse at front-end coding

My experience is that it has been very complementary and useful in coding. I use chorus with ~10 models as my daily driver for all my queries. And very often grok 4 is the only model out of 10 that got simple things right and didn't hallucinate

velvet lichen Jul 20, 2025, 9:30 AM

#

when do the scores update

gloomy raven Jul 20, 2025, 2:43 PM

#

Inspired by LMArena - We've developed open source chessarena AI leaderboard.

vale lodge Jul 20, 2025, 9:30 PM

#

reef moth They have a separate grok coding model but haven't released it

august

vale lodge Jul 20, 2025, 9:30 PM

#

gloomy raven Inspired by LMArena - We've developed open source chessarena AI leaderboard.

can I try it

#

also the text and icons look too small imo

#

also how old is that ss?

#

looks like its from late 2024 judging by the models

gloomy raven Jul 20, 2025, 9:34 PM

#

Chessarena.ai

twilit echo Jul 20, 2025, 10:47 PM

#

gloomy raven Inspired by LMArena - We've developed open source chessarena AI leaderboard.

Nice. I didn't check your repo yet, but 3 checkmates in 90 matches on 4o mini? that's 3.3% 🤔 In my matchups it had 7/27 = 26% checkmates. Probably completely different methodology, but you might then also be interested in these findings: https://dubesor.de/chess/chess-leaderboard

AI Chess Leaderboard - dubesor AI project

LLM AI Chess Leaderboard: Ranking, Elo, and Chess Performance of AI language models.

twin valve Jul 21, 2025, 9:45 PM

#

twilit echo Nice. I didn't check your repo yet, but 3 checkmates in 90 matches on 4o mini? t...

oh wow. I didn't notice that gpt 3.5 turbo (not even the instruct version) defeated claude sonnet 4

twilit echo Jul 21, 2025, 11:41 PM

#

twin valve oh wow. I didn't notice that gpt 3.5 turbo (not even the instruct version) defea...

And against 2.5 pro, claude 4 opus, kimi k2, gpt 4.1, 4o, etc (just did a bunch more non-instruct matches against highest rated opponents). - funnily, it drew against lfm 7b though.

crisp rampart Jul 22, 2025, 3:25 AM

#

Hi, is there a place to view the latest leaderboards for Humanity's Last Exam, FrontierMath, etc

low bough Jul 23, 2025, 1:07 PM

#

crisp rampart Hi, is there a place to view the latest leaderboards for Humanity's Last Exam, F...

https://scale.com/leaderboard and https://artificialanalysis.ai

SEAL LLM Leaderboards: Expert-Driven Private Evaluations

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

AI Model & API Providers Analysis | Artificial Analysis

Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.

idle ocean Jul 24, 2025, 11:47 PM

#

idk about the rest of you, but when i was using search arena perplexity's stuff was always the worst, not really sure whats the point of the 18 billion dollar company, since i thought it was search : P

reef moth Jul 25, 2025, 12:14 AM

#

wrapper scam

sterile jacinth Jul 25, 2025, 5:52 AM

#

when leaderboard updates?

glass sun Jul 25, 2025, 4:15 PM

#

What benchmarks are you guys using rn other than LMArena?

#

I used to rely on Livebench a lot but they're garbage now

drowsy needle Jul 25, 2025, 4:15 PM

#

sterile jacinth when leaderboard updates?

soon

brittle pine Jul 25, 2025, 9:48 PM

#

glass sun I used to rely on Livebench a lot but they're garbage now

I trust livebench for language benchmarks

#

But I definitely don't trust them for code benchmarks

#

I still can't understand how they measure coding abilities

narrow cloud Jul 26, 2025, 3:42 PM

#

Anybody know why the huggingface website no longer matches lmarena.ai text arena?
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
https://lmarena.ai/leaderboard/text

crisp rampart Jul 27, 2025, 10:50 AM

#

low bough https://scale.com/leaderboard and https://artificialanalysis.ai

thank you!

clear pulsar Jul 27, 2025, 11:16 AM

#

Hello, guys, I find this ranking a little strange. I wonder how this leaderboard determines the ranking and whether 2 models have the same ranking. Why is 1 1 followed by 2, 1 1 2 3 3 by 5 and then 1 1 2 3 3 5 6 6 6 6 6 6 by 10?

glacial glacier Jul 27, 2025, 2:32 PM

#

clear pulsar Hello, guys, I find this ranking a little strange. I wonder how this leaderboard...

confidence intervals 🥱

#

but for real it is a little unintuitive

#

it basically means "how many models are we 95% sure that this model is worse than"

#

if two models have the same rank, then statistically they are around the same level of capability

exotic peak Jul 27, 2025, 8:17 PM

#

hello

drowsy needle Jul 27, 2025, 8:34 PM

#

exotic peak hello

ablobwave

small notch Jul 27, 2025, 8:44 PM

#

Hello

drowsy needle Jul 27, 2025, 8:58 PM

#

pseudo rivet Jul 27, 2025, 10:05 PM

#

glacial glacier if two models have the same rank, then statistically they are around the same le...

But if a is close to b and b is close to c, couldn’t there be a chain that extends from 1400 elo to 1100 and thus all models ranked 1?

glacial glacier Jul 27, 2025, 10:06 PM

#

pseudo rivet But if a is close to b and b is close to c, couldn’t there be a chain that exten...

what do you mean

#

the rank is literally "how many models are we 95% sure that this model is worse than"

#

the old site said that, don't think the new one does

#

here's a visualization of the CIs

pseudo rivet Jul 27, 2025, 10:17 PM

#

glacial glacier what do you mean

I mean couldn’t there be a situation where all models are tied but elo difference between best and worst is 500

glacial glacier Jul 27, 2025, 10:18 PM

#

pseudo rivet I mean couldn’t there be a situation where all models are tied but elo differenc...

i will repeat:
the rank is how many models are we 95% sure that this model is worse than

pseudo rivet Jul 27, 2025, 10:19 PM

#

Ah I see

#

It’s interesting that the ranks are not increasing

#

It seems like it should be + 1 , but then that definition works for all ranks except rank 1

tender sigil Jul 28, 2025, 1:21 AM

#

glacial glacier here's a visualization of the CIs

o3 pro ?

glacial glacier Jul 28, 2025, 1:21 AM

#

tender sigil o3 pro ?

it's estimated based on external data

jaunty cave Jul 28, 2025, 3:58 PM

#

pseudo rivet It’s interesting that the ranks are not increasing

what do you mean not increasing?

sinful falcon Jul 28, 2025, 4:00 PM

#

can someone explain why the confidence interval has been increasing and not decreasing with more votes?

#

this seems not intuitive at all

pseudo rivet Jul 28, 2025, 4:37 PM

#

jaunty cave what do you mean not increasing?

Like this

jaunty cave Jul 28, 2025, 5:07 PM

#

pseudo rivet Like this

ah right, so what's happening here is that while o1's estimated rating is higher than o4-mini's, o4-mini's confidence interval is larger. That larger CI means it could be even higher than o1 and that there are fewer models that we are very certain are above it.

The rank number is essentially: "how many models are above your upper confidence interval?", and if your confidence interval reaches higher, then fewer models are clearly above you.

Definitely a little counter-intuitive, it's one of the features which makes it tricky to rank items which have different levels of variability. Should you rank their mean or rank their upper bound? Right now it's by upper bound
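
The ranking rule described above can be sketched as follows. This is an illustration, not LMArena's actual code: it assumes rank = 1 + the number of models whose lower bound sits above this model's upper bound, and the model names and intervals are made up.

```python
from typing import Dict, Tuple


def ci_ranks(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Rank each model by how many others are clearly above it.

    intervals maps a model name to its (lower, upper) confidence bounds.
    Assumed rule: another model is "clearly above" when its lower bound
    exceeds this model's upper bound.
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        clearly_better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Hypothetical intervals, listed from highest to lowest midpoint.
intervals = {
    "model-a": (1440, 1460),
    "model-b": (1425, 1445),
    "model-c": (1400, 1420),  # narrow interval, midpoint 1410
    "model-d": (1385, 1430),  # wider interval, midpoint 1407.5
}
print(ci_ranks(intervals))
```

Here model-d has a lower midpoint than model-c but a wider interval, so only model-a is clearly above it and it gets rank 2 while model-c gets rank 3. Listed by score, the ranks read 1, 1, 3, 2, which is the non-increasing pattern asked about.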

tender sigil Jul 28, 2025, 6:21 PM

#

glacial glacier it's estimated based on external data

That’s like, kinda stupid

#

the whole point of LMArena is that it runs on its own dataset

glacial glacier Jul 28, 2025, 6:37 PM

#

tender sigil That’s like, kinda stupid

the whole point of my project is to add things that aren't normally in the leaderboard table

jovial sapphire Jul 28, 2025, 6:57 PM

#

SHAIR WALLI

brittle pond Jul 29, 2025, 3:30 AM

#

Hello, just had a question: how long after a new model is dropped does it take for the leaderboard to update?

marble rivet Jul 29, 2025, 4:16 AM

#

@glacial glacier

pliant cedar Jul 29, 2025, 10:09 AM

#

brittle pond Hello, just had a question how long after a new model is dropped it takes for th...

4 to 15 days after the model is dropped into lm arena

brittle pond Jul 29, 2025, 3:43 PM

#

pliant cedar 4 to 15 days after the model is dropped into lm arena

so if gpt5 dropped tomorrow it wouldn't be on the leaderboard within a day?

twin valve Jul 29, 2025, 4:12 PM

#

no

#

unless they run it cloaked for a while

#

some models are run cloaked on the arena and are announced the same day they go public on the arena.
others are announced first and only then go cloaked on the arena

#

some other models aren't processed by lmarena at all (so far) but there are other arenas around

#

like yupp.ai

brittle pond Jul 29, 2025, 4:16 PM

#

twin valve unless they run it cloaked for a while

have any previous openAI models been run cloaked?

hallow comet Jul 29, 2025, 4:17 PM

#

brittle pond have any previous openAI models been run cloaked?

On openrouter 4.1 and the mini version at least

#

quasar alpha, I think was one name

brittle pond Jul 29, 2025, 4:17 PM

#

so all the talk this zenath model is gpt5 is just rumors / fake?

hallow comet Jul 29, 2025, 4:18 PM

#

brittle pond so all the talk this zenath model is gpt5 is just rumors / fake?

No, it is at least probable that it is GPT5

#

Not 100 percent but probable given the performance

#

I am not an expert though

#

or an insider

twin valve Jul 29, 2025, 4:19 PM

#

brittle pond have any previous openAI models been run cloaked?

many yes, in lmarena at least.

#

practically all models that are already ranked were cloaked at some point in the past

brittle pond Jul 29, 2025, 4:19 PM

#

so lets say zenath is gpt5, if they release it on july 31st. It can and will be updated on the leaderboard right away cause its being ranked in cloak rn right

twin valve Jul 29, 2025, 4:21 PM

#

yes. if model A is cloaked and it is gpt5 (or gemini 3, or grok 5 or what you want) and the provider decides to uncloak it only after their announcement, then it gets public shortly after the announcement by openAI, provided it has enough votes to keep the CI low.

#

and also provided that the vendor is ok with the model making it public

brittle pond Jul 29, 2025, 4:22 PM

#

I see thanks guys

twin valve Jul 29, 2025, 4:22 PM

#

for example gemini in may was public on the arena after more or less 1 week. Claude 4 after 2 weeks

brittle pond Jul 29, 2025, 4:22 PM

#

and when they announced it, immediate update on leaderboard?

#

how trivial is the amount of votes

twin valve Jul 29, 2025, 4:23 PM

#

from 3k to 8k it depends

#

it will be announced in #announcements , on twitter and on the leaderboard

#

(I'd rather see models with low CIs in every category rather than only in overall)

#

also I strongly prefer this leaderboard built up on the values of lmarena: https://ktibow.github.io/lmb/

brittle pond Jul 29, 2025, 4:47 PM

#

thanks guys

jaunty cave Jul 29, 2025, 9:04 PM

#

twin valve (I'd rather see models with low CIs in _every_ category rather than only in over...

it's impossible to have lower CIs in every category compared to overall, each category is a subset of the data, and the less data you have the larger the CIs are just by definition
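Editor's sketch of why a subset of the votes gives a wider interval. It uses the textbook normal-approximation interval for a win rate, which is an assumption chosen for illustration and not the leaderboard's actual CI method. The vote counts and win rate are made up.

```python
import math


def ci_half_width(win_rate, n_votes, z=1.96):
    """Half-width of a normal-approximation 95% CI for a win rate."""
    return z * math.sqrt(win_rate * (1 - win_rate) / n_votes)


# Hypothetical: 30,000 votes overall, 3,000 of them in one category.
overall = ci_half_width(0.55, 30_000)
category = ci_half_width(0.55, 3_000)
ratio = category / overall  # sqrt(10), about 3.16 times wider
```

With a tenth of the data the interval is roughly three times wider, since the width shrinks with the square root of the vote count.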

twin valve Jul 29, 2025, 10:27 PM

#

jaunty cave it's impossible to have lower CIs in every category compared to overall, each ca...

you either misread me or I worded myself poorly.

I meant I wish they would uncloak models only when their CI across all categories would be small enough, rather than uncloak them when the CI in overall is small, but some categories may still have (relatively) large CIs.

#

actually I was curious if what I wrote was so ambiguous. I let an LLM analyze what I wrote and for the LLM it was clear enough.

jaunty cave Jul 29, 2025, 11:09 PM

#

twin valve you either misread me or I worded myself poorly. I meant I wish they would uncl...

ohhh I see what you mean. I think a result of this would be that a lot of models would just never get uncloaked since the lower volume of data in the specific categories grows so slowly the CIs would shrink very slowly.

You can simulate this yourself by downloading the leaderboard, filtering out all the models with CI width above your desired threshold, and then re-counting the amount of models above each.
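Editor's sketch of the simulation suggested above. The row layout (`model`, `lower`, `upper`) and the numbers are hypothetical stand-ins for whatever a downloaded leaderboard file contains, and the re-count assumes a model is "above" another when its lower bound exceeds the other's upper bound.

```python
def filter_and_rerank(rows, max_ci_width):
    """Drop models whose CI is wider than max_ci_width, then re-count
    how many remaining models are clearly above each one."""
    kept = [r for r in rows if r["upper"] - r["lower"] <= max_ci_width]
    ranks = {}
    for row in kept:
        above = sum(
            1 for other in kept
            if other is not row and other["lower"] > row["upper"]
        )
        ranks[row["model"]] = 1 + above
    return ranks


# Hypothetical leaderboard rows (not real data).
rows = [
    {"model": "model_a", "lower": 1440, "upper": 1460},
    {"model": "model_b", "lower": 1400, "upper": 1420},
    {"model": "model_c", "lower": 1365, "upper": 1445},
]
strict = filter_and_rerank(rows, max_ci_width=30)
```

With a threshold of 30 points, model_c (width 80) is filtered out entirely, which is the "never gets uncloaked" outcome described above.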

twin valve Jul 30, 2025, 11:43 AM

#

yes, correct. One problem I am testing since around May is that many many many models that are in the leaderboard don't get tested anymore and that's suboptimal for several reasons but the big two I can see are

(a) it creates selective pairings, which is a known problem for elo based rankings. Hence I hope lmarena will release the results - not the contents - of each vote so that people can verify the rankings and check for pitfalls.
(b) it doesn't assess models properly because their CIs are still relatively large for many categories (besides overall). It is likely that there are several types of human judges, who would each judge a type of query differently. Let's say there are 10 types of human judges (there could be many more actually). If every category is split in 10 subcategories, one needs a lot of votes to achieve a stable result (some models likely reached that, but only some)

So yes, I am aware that some models would never be uncloaked, but it would actually be better if models were tested a bit more, especially during lulls when there aren't many cloaked models around.

On this I have two points in the feedback, one moment

#

aggressive pairing to only cloaked models: #1367825140888637506 message

request for the results of the battles: #1372537524551159913 message

#

from the internet about selective pairings

"The Elo system calculates a player's rating based on their performance against other players. It assumes that over time, a player will compete against a variety of opponents across the skill spectrum. The rating difference between two players is used to predict the outcome of a match between them.
When pairings are selective (e.g., only strong players play strong players, weak players only play weak players, or certain players are deliberately avoided), the player doesn't encounter a representative sample of the overall player pool."
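Editor's sketch of the standard Elo formulas the quote refers to: the rating difference gives an expected score, and the rating moves by K times the surprise. The K-factor of 32 is a common default chosen here for illustration, not a value from the chat.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under Elo."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b) after one game.

    score_a is 1 for an A win, 0 for a loss, 0.5 for a draw.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b


even_match = elo_update(1500, 1500, 1)
favourite = expected_score(1900, 1500)
```

Because each update depends on who the opponent was, a model that only ever meets a narrow slice of opponents gets its rating from an unrepresentative sample, which is the selective-pairing concern raised above.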

glass sun Jul 30, 2025, 2:31 PM

#

<@&1349916362595635286> Will the search and copilot arenas ever be brought back?

glacial glacier Jul 30, 2025, 2:31 PM

#

glass sun <@&1349916362595635286> Will the search and copilot arenas ever be brought back?

what do you mean brought back? they're already on https://lmarena.ai/leaderboard

#

also try to not ping for small things like these

glass sun Jul 30, 2025, 2:32 PM

#

They haven't been updated in 2 months

#

Mb

glacial glacier Jul 30, 2025, 2:32 PM

#

well search data gathering has restarted so i expect an eventual update

#

not sure on copilot

glass sun Jul 30, 2025, 2:32 PM

#

Kk thanks

modest pecan Jul 30, 2025, 9:21 PM

#

Hello. Here to test video creation tools. Special interest in video with sound, as with Veo 3.

drowsy needle Jul 30, 2025, 9:34 PM

#

modest pecan Hello. Here to test video creation tools. Special interest in video with sound, ...

Glad to hear it! Be sure to check out #1397655624103493813 for info on how to use the bot. Don't hesitate to reach out if you have any questions!

normal yew Jul 31, 2025, 9:54 AM

#

why there's so many greens appearing suddenly

drowsy needle Jul 31, 2025, 1:52 PM

#

normal yew why there's so many greens appearing suddenly

the Video Arena most likely

drowsy needle Jul 31, 2025, 10:40 PM

#

You'll want to use the video-arena channels. Learn more in #1397655624103493813

topaz forge Jul 31, 2025, 11:47 PM

#

Is it true ChatGPT 5 was available on the Arena under a code name?

proper coral Aug 1, 2025, 1:09 AM

#

topaz forge Is it true ChatGPT 5 was available on the Arena under a code name?

its removed

lost willow Aug 1, 2025, 1:46 PM

#

HOW USE THIS DISCORD

drowsy needle Aug 1, 2025, 1:47 PM

#

lost willow HOW USE THIS DISCORD

check out #1397655624103493813

glacial glacier Aug 3, 2025, 4:57 PM

#

i'm going to clean up this channel

#

okay, cleaned up

#

so uhh

#

nice leaderboard huh

hallow comet Aug 3, 2025, 5:13 PM

#

glacial glacier nice leaderboard huh

Qwen being in top 3 is insane

brittle pine Aug 3, 2025, 5:36 PM

#

O3's close score to gemini always gives me some melancholy

harsh steeple Aug 3, 2025, 5:56 PM

#

insane

brittle pine Aug 3, 2025, 5:59 PM

#

why is base model much better than thinking one ?

#

qwen3

idle ocean Aug 3, 2025, 8:35 PM

#

Qwen3 how

#

That's impressive

#

How much is alibaba spending on it?

oak tendon Aug 4, 2025, 6:31 AM

#

I just came across velocilux, surprisingly good results on my end.
Anyone knows more about it or cogitolux? Could they be related to cresylux?
Wondering if it’s part of the same family or just similar naming.
Any info appreciated!

final shore Aug 4, 2025, 3:15 PM

#

harsh steeple insane

Now it makes sense why Anthropic was crying for the government to stop giving GPUs to China, imagine if China had the GPUs

twilit echo Aug 4, 2025, 4:43 PM

#

brittle pine why is base model much better than thinking one ?

instruct also scored higher on my bench than thinking. thinking doesn't scale well, and it also doesn't guarantee better responses in all situations. It can actively hurt instruction following and overthink past established solutions. It can also introduce unwanted factors such as overcautious prompt risk analysis

sharp apex Aug 5, 2025, 10:11 AM

#

Wow, this model looks incredibly powerful! Never seen before. GLM-4.5

twin valve Aug 5, 2025, 11:31 AM

#

final shore Now it makes sense why Anthropic was crying for the government to stop giving GPU...

I do think that HW constraints are pushing other teams (in this case Chinese) to optimize, while the teams with more resources do less optimization and take more of a "let's see what sticks" approach.

I mean that is often the case in any activity where some groups have a lot of resources and others have less.

twin valve Aug 5, 2025, 11:31 AM

#

sharp apex Wow, this model looks incredibly powerful! Never seen before. GLM-4.5

GLM had models before but not as strong

#

could we make a landing channel rather than derailing this? @drowsy needle @glacial glacier

drowsy needle Aug 5, 2025, 1:46 PM

#

twin valve could we make a landing channel rather than derailing this? <@283397944160550928...

Yeah will look into. Good idea

tame flint Aug 5, 2025, 3:43 PM

#

guys is there a video leaderboard by lmarena ?

drowsy needle Aug 5, 2025, 4:46 PM

#

tame flint guys is there a video leaderboard by lmarena ?

We are working on it!

tame flint Aug 5, 2025, 4:47 PM

#

Ooh okeee ty

steady hinge Aug 5, 2025, 5:30 PM

#

When will veo3 or video generation release on lmarena.ai

#

The website itself

drowsy needle Aug 5, 2025, 5:34 PM

#

steady hinge The website itself

That's TBD - would encourage you to share feedback regarding Video Arena here #bot-feedback

twin wharf Aug 5, 2025, 8:08 PM

#

Just wondering about the leaderboard: it looks like 43 models were purged (the total models number decreased), and the total votes have increased by 400,000 or so? Could someone help explain what change to the leaderboard happened?

glass sun Aug 5, 2025, 8:56 PM

#

Ok so

#

Can someone tell me how good OSS is

#

I genuinely can’t keep track of all the news

glacial glacier Aug 5, 2025, 9:04 PM

#

glass sun Can someone tell me how good OSS is

they're doing great, qwen, kimi, deepseek, and glm have been reaching for the top of leaderboards and openai just released "gpt-oss", an o4-mini-like open model

glass sun Aug 5, 2025, 9:05 PM

#

When's the next arena update?

drowsy needle Aug 5, 2025, 9:06 PM

#

glass sun When's the next arena update?

as in leaderboard update?

glass sun Aug 5, 2025, 9:06 PM

#

Yes

drowsy needle Aug 5, 2025, 9:06 PM

#

Soon

glass sun Aug 5, 2025, 9:07 PM

#

Kk

#

Also is the copilot arena still a thing?

glass sun Aug 5, 2025, 9:23 PM

#

This thing

agile flower Aug 5, 2025, 11:32 PM

#

tame flint guys is there a video leaderboard by lmarena ?

^ ok

hallow iris Aug 6, 2025, 3:44 PM

#

Is it useful to vote on the video-arena? Is it just for testing a bit, or will there be a leaderboard made (is it even possible to make one with the current system)?

By the way, given that there is no leaderboard yet, after the two first votes, and the model reveal appears after two votes, do the third and fourth count?

Last question, have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit, on random previous generations appearing?

It seems very promising - but costly - glad you made one though!

drowsy needle Aug 6, 2025, 3:51 PM

#

hallow iris Is it useful to vote on the video-arena? Is it just for testing a bit, or will t...

Is it useful to vote on the video-arena?
Yes, we are planning to build a leaderboard. But seeing as this is a very different method, we're in the process of validating the data to ensure we feel good about the leaderboard once shared.

after two votes, do the third and fourth count?
Yup.

have you considered the possibility of adding the ability to vote, for example, 15 times for one new generation credit
Yeah that's possible, we have been considering a variety of ways to grant gen credits; however, the balance we need to achieve is encouraging votes, but only high-quality ones that aren't cast just for the sake of getting some kind of benefit.

#

But yeah all that to say this is an experiment!

drowsy needle Aug 6, 2025, 5:01 PM

#

after two votes, do the third and fourth count?
Need to correct myself on this one -> we are not counting votes after a model's name has been revealed @hallow iris

hallow iris Aug 6, 2025, 5:27 PM

#

veo-3 audio is too recognizable. I'm not saying the audio doesn't matter, but if the prompt doesn't reference key words like "saying", "sound", or "audio", wouldn't it be good to automatically remove the audio from veo-3 videos before voting, and then update the mp4 file to show the version with audio afterwards?

I really feel like the audio is biasing because it reinforces the feeling that an image is well generated if the audio is matching. There's no doubt veo-3 will be the leader, at the moment no model is as consistent, but still...

I've had moments when I thought "Oh, it must be Veo-3", while it wasn't, when voting, but from the moment there is audio, I know it's veo-3...

#

And I'm pretty sure that in some rare cases people will say Veo-3 is better when it's not, just because it has the audio they want so much. Some people are using lmarena specifically to get a veo-3 result; the fact that so many prompts mention audio is a dead giveaway, even more so when they do image to video with their own personal brand, restaurant, or service

#

Veo-3 has some flaws, I've been voting probably a bit more than a hundred times, and sometimes it doesn't respect the prompt, especially for specific, long prompts. This can also be an issue. Some prompts are annoying to read, and I can bet people wouldn't read it and just focus on video quality

#

Long prompts take sometimes more than a minute to read 🤣 💀

#

I've been myself voting for some things without totally reading the prompt, and afterwards noticing that despite video quality, a model I voted against actually respected the prompt better.

glacial glacier Aug 6, 2025, 6:54 PM

#

hallow iris veo-3 audio is too recognizable. I don't say the audio doesn't matter, but would...

a lot of people have said that but imo it's fine since we test the no audio versions already

#

you can just ignore the top 2 rows if you think audio biases it

hallow iris Aug 6, 2025, 7:12 PM

#

This one is interesting though

jaunty cave Aug 6, 2025, 8:45 PM

#

hallow iris I've been myself voting for some things without totally reading the prompt, and ...

This is why there is the Author Vote category. With discord voting, now there are votes other than the person who wrote the prompt. If you want to see what the leaderboard is with only the votes of the prompt authors you can look at the Author Vote category. There is less data available though so the confidence intervals are much larger. What do you think?

rain fiber Aug 7, 2025, 1:17 AM

#

when is gpt 5 gonna be added?

woeful walrus Aug 7, 2025, 3:55 AM

#

rain fiber when is gpt 5 gonna be added?

Then, when it is announced

hallow comet Aug 7, 2025, 10:12 AM

#

hello

#

has anyone tested gpt oss?

#

I wonder how it is in real use case.

tall terrace Aug 7, 2025, 10:53 AM

#

hello, i want to know me about ai

meager pulsar Aug 7, 2025, 12:05 PM

#

/video

hallow comet Aug 7, 2025, 12:13 PM

#

meager pulsar /video

use the command on #video-arena-1 ,#video-arena-2 or #video-arena-3

final shore Aug 7, 2025, 1:29 PM

#

bruh

hallow comet Aug 7, 2025, 6:35 PM

#

Will we see GPT-5 with reasoning on leaderboard ?
I dont think its fair to just put GPT-5 there when enabling thinking unlocks so much more performance

idle ocean Aug 7, 2025, 6:39 PM

#

hallow comet Will we see GPT-5 with reasoning on leaderboard ? I dont think its fair to just...

GPT-5 comes default with reasoning I believe

hallow comet Aug 7, 2025, 6:42 PM

#

idle ocean GPT-5 comes default with reasoning I believe

You can choose the reasoning low-medium-high in the api
If i had to guess the one in the arena has low or medium

idle ocean Aug 7, 2025, 6:43 PM

#

if the only difference is reasoning length I think its fine like that

hallow comet Aug 7, 2025, 8:32 PM

#

idle ocean if the only difference is reasoning length I think its fine like that

Actually turns out the gpt-5 in arena is the one with high reasoning

idle ocean Aug 7, 2025, 8:34 PM

#

Good

glass sun Aug 8, 2025, 5:49 AM

#

glass sun This thing

?

#

Will it ever come back?

#

Does anyone know?

rich echo Aug 8, 2025, 2:12 PM

#

So where is Qwen-Image in leaderboard?

soft crater Aug 8, 2025, 5:42 PM

#

Is leaderboard out?

drowsy needle Aug 8, 2025, 5:54 PM

#

soft crater Is leaderboard out?

We're currently experiencing an outage which is affecting this, so yes.

compact parcel Aug 9, 2025, 1:43 AM

#

is Veo 3 available?

glacial glacier Aug 9, 2025, 1:55 AM

#

compact parcel is Veo 3 available?

veo 3 is on the leaderboards ✅

drowsy needle Aug 9, 2025, 1:58 AM

#

compact parcel is Veo 3 available?

Yup, #1397655624103493813 has more info

thorny ginkgo Aug 9, 2025, 7:28 PM

#

Is the gpt-5 on the leaderboard thinking model or non thinking?

rare pelican Aug 9, 2025, 7:59 PM

#

thorny ginkgo Is the gpt-5 on the leaderboard thinking model or non thinking?

Thinking-High

void yew Aug 10, 2025, 5:33 PM

#

is imagen-4-ultra still around on lmarena?

#

haven't gotten it in a while

#

nevermind just got it wow

#

it had been a LONG while

rare pelican Aug 10, 2025, 6:10 PM

#

void yew is imagen-4-ultra still around on lmarena?

Yes it is.

rare pelican Aug 10, 2025, 6:10 PM

#

void yew nevermind just got it wow

Oh. Well..

void yew Aug 10, 2025, 6:13 PM

#

it's really weirdly rare

#

I've gotten imagen 4 normal multiple times before and after that

rare pelican Aug 10, 2025, 6:14 PM

#

void yew it's really weirdly rare

About what, may I ask?

void yew Aug 10, 2025, 6:15 PM

#

I was trying to see which models could generate characters with six fingers on each hand

rare pelican Aug 10, 2025, 6:16 PM

#

void yew I was trying to see which models could generate characters with six fingers on e...

Most models can do that without the instructions 😂

void yew Aug 10, 2025, 6:17 PM

#

nah that was like, 2023

rare pelican Aug 10, 2025, 6:20 PM

#

void yew nah that was like, 2023

Did you try out imagen 2 flash? Man it sucks

void yew Aug 10, 2025, 6:20 PM

#

don't think so

rare pelican Aug 10, 2025, 6:21 PM

#

mb I meant gemini 2.0 flash img generator

void yew Aug 10, 2025, 6:21 PM

#

oh yeah was gonna say "cant be worse than gemini 2.0 flash"

rare pelican Aug 10, 2025, 6:22 PM

#

Hey Varka, do you think Google will reign at the top in the end?

#

Cause I think so

#

They got Gemini 3 and Genie 3 on the way, Imagen 4, Veo 3, and Lyria 3.

void yew Aug 10, 2025, 6:25 PM

#

In terms of image generation? Maybe. But imagen ultra 4 can't do cyclopes weirdly enough.

rare pelican Aug 10, 2025, 6:26 PM

#

void yew In terms of image generation? Maybe. But imagen ultra 4 can't do cyclopes weirdl...

Might be because it hasn't had enough training data on that

#

After all, you cant let it create something which it has no experience on. It will mess it up.

glacial glacier Aug 11, 2025, 6:21 PM

#

how long until veo 3 is overtaken?

pseudo rivet Aug 11, 2025, 6:45 PM

#

By what?

brittle pine Aug 11, 2025, 7:01 PM

#

is there any model that can support sound except veo right now?

hallow comet Aug 11, 2025, 7:09 PM

#

brittle pine is there any model that can support sound except veo right now?

kling 2.1, I think

glacial glacier Aug 11, 2025, 7:09 PM

#

pseudo rivet By what?

that's the other half of the question

hallow comet Aug 11, 2025, 7:09 PM

#

At least on their website

vague dome Aug 11, 2025, 9:09 PM

#

glacial glacier how long until veo 3 is overtaken?

maybe another month

idle ocean Aug 12, 2025, 12:39 AM

#

will gpt5 be added to search arena?

glacial glacier Aug 12, 2025, 1:14 AM

#

https://x.com/lmarena_ai/status/1954950300558823510 i wonder what anthropic's up to

lmarena.ai (@lmarena_ai)

🚨 Leaderboard Update:
Claude Opus 4.1 climbs to #2 overall on the Arena and now becomes the best non-thinking model, matching GPT-5 at #1 across key categories:

- Coding
- Instruction Following
- Hard Prompts
- Longer Queries

Congrats to @AnthropicAI on this impressive

#

(you can see the previous version of opus just below qwen 235b)

glacial glacier Aug 12, 2025, 1:14 AM

#

glacial glacier (you can see the previous version of opus just below qwen 235b)

this sentence feels so weird to say lol

torn lotus Aug 12, 2025, 2:15 AM

#

hello

cunning steppe Aug 12, 2025, 4:10 AM

#

@drowsy needle Could you (or someone else) kindly help me understand the “Remove Style Control” Rankings for text? GPT-5 is top of the leaderboard for every category, except for Creative Writing. Yet despite that performance, it still trails Gemini overall. How is that even possible?

I also didn’t count foreign languages, but i feel like those shouldn’t impact rankings

jaunty cave Aug 12, 2025, 4:19 AM

#

Style control or non-style control isn't a category per-se. It's a different method for aggregating the votes into a leaderboard. Each category is a subset of the full dataset of votes, and for each category we compute the ranking with style control and without. We set the default to be with style control.

The difference is that style control takes into account things like the number of lists, markdown headers, and bold text sections. It's been found that these elements impact voters a lot and some model companies over-optimize for them. Style control measures the strength of each model as if all style elements were held equal, so models which use a lot of lists and bold end up lower. The method is described here: https://news.lmarena.ai/style-control/

LMArena Blog

Does Style Matter in AI Evaluations?

We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in Chatbot Arena leaderboard.
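Editor's sketch of the general idea behind style control: model the vote as depending on the strength gap plus the style differences between the two responses, so that strengths can be read off "as if style were equal". The function name, the feature order (lists, headers, bold), and all coefficient values are hypothetical; the linked blog post is the reference for the real method.

```python
import math


def win_probability(strength_a, strength_b, style_diff, style_coefs):
    """P(A beats B) in a logistic model with style covariates.

    style_diff: per-feature difference (A minus B), e.g. number of lists,
    markdown headers, bold sections. style_coefs: how much voters reward
    each feature. All values here are illustrative assumptions.
    """
    logit = strength_a - strength_b
    logit += sum(c * d for c, d in zip(style_coefs, style_diff))
    return 1 / (1 + math.exp(-logit))


coefs = [0.3, 0.2, 0.1]  # hypothetical voter preference for each style feature
same_style = win_probability(1.0, 1.0, [0, 0, 0], coefs)
heavier_formatting = win_probability(1.0, 1.0, [2, 1, 3], coefs)
```

Two equally strong models split votes evenly when their style matches, but the one with more lists and bold wins more often otherwise. Fitting the style coefficients separately is what lets the strength term stay free of that effect.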

cunning steppe Aug 12, 2025, 4:21 AM

#

jaunty cave Style control or non-style control isn't a category per-se. It's a different met...

Ah gotcha, that makes sense! Thanks for sharing that link, I’m gonna read through

cunning steppe Aug 12, 2025, 4:23 AM

#

jaunty cave Style control or non-style control isn't a category per-se. It's a different met...

Still, the overall score confuses me, since GPT-5 is leading in the categories with the most votes, but somehow is still behind Gemini overall (for style control off). Which makes me wonder - when you calculate overall score, do you weight each category and add them all up?

jaunty cave Aug 12, 2025, 4:28 AM

#

actually also GPT-5 is not winning all categories without style control:
In Chinese, German, Russian, Japanese, and Korean Gemini beats GPT-5, in Spanish Gemini is ahead by 75 points!

Also the list of categories is not exhaustive, and also not mutually exclusive.
For example if a prompt is in German, and asks for code for partial differential equations, it might be tagged as German, Coding, and Math categories and count for all 3. But in the overall, it is only considered once

#

the overall is not an average of the categories, it is just computing the rank using all data, nothing filtered out

#

There can also be some prompts which get no category tags! People do all sorts of things that don't fit well into buckets, those votes would still influence the overall leaderboard but would not influence any of the category rankings
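Editor's sketch of the overlap described above: a vote can carry several category tags or none, each category is a filtered subset, and the overall leaderboard uses every vote exactly once. The vote records and tag names are made up.

```python
# Hypothetical votes: one multi-tagged, one single-tagged, one untagged.
votes = [
    {"id": 1, "tags": {"German", "Coding", "Math"}},
    {"id": 2, "tags": {"Coding"}},
    {"id": 3, "tags": set()},
]


def votes_in_category(all_votes, category):
    """Subset of votes whose prompt was tagged with the given category."""
    return [v for v in all_votes if category in v["tags"]]


categories = ("German", "Coding", "Math")
category_sizes = {c: len(votes_in_category(votes, c)) for c in categories}
summed_over_categories = sum(category_sizes.values())
overall_size = len(votes)  # every vote counted once, tagged or not
```

The category counts add up to 4 while the overall set has only 3 votes, and vote 3 appears in no category at all. This is why the overall ranking cannot be reconstructed as an average of the category rankings.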

cunning steppe Aug 12, 2025, 4:30 AM

#

Wow that makes things so much clearer, thanks Clayton!

jaunty cave Aug 12, 2025, 4:30 AM

#

No problem!

radiant hollow Aug 12, 2025, 12:38 PM

#

Does anyone know why the code seems to imply that the reference model would be scaled to a 1114 rating, but this is not actually the case on the live ratings? https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/monitor/rating_systems.py

GitHub

FastChat/fastchat/serve/monitor/rating_systems.py at main · lm-sys...

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. - lm-sys/FastChat

#

I'm trying to understand to what extent we can compare rating improvement over time

idle ocean Aug 12, 2025, 2:09 PM

#

glacial glacier https://x.com/lmarena_ai/status/1954950300558823510 i wonder what anthropic's up...

huh, didn't expect that

heavy pivot Aug 12, 2025, 2:11 PM

#

yeah claude can't handle equations with decimals but somehow does well in coding and agentic use? idk what's up anymore

glacial glacier Aug 12, 2025, 2:24 PM

#

radiant hollow Does anyone know why the code seems to imply that the reference model would be s...

i'd bet that code isn't used anymore

#

there have been changes to the leaderboard recently, but that file hasn't been touched in a while

keen finch Aug 12, 2025, 7:19 PM

#

jaunty cave No problem!

Hello, can I ask you a few questions in the dm?

#

It’s about leaderboards and rankings

cunning steppe Aug 12, 2025, 7:53 PM

#

It’d be nice if you shared with the class

#

Just as Clayton and I did last night

drowsy needle Aug 12, 2025, 8:01 PM

#

cunning steppe It’d be nice if you shared with the class

Big agreed

manic frost Aug 12, 2025, 8:11 PM

#

https://x.com/elonmusk/status/1955047197487272362

Elon Musk (@elonmusk)

Grok wins hands-down at coding.

It wasn’t close.

cunning steppe Aug 12, 2025, 9:53 PM

#

That reminds me - i did have another question. Are the leaderboard scores cumulative? OpenAI 4o + o3 and Google Gemini 2.5 pro all have around ~30,000 votes each. Sometimes companies update existing models. Do new votes count more than old votes? If not, how are these updates adequately captured?

#

@drowsy needle

keen finch Aug 12, 2025, 10:17 PM

#

cunning steppe Just as Clayton and I did last night

Sure. @jaunty cave I’m curious how likely it is for gpt5 to beat Gemini without style control this month.

jaunty cave Aug 12, 2025, 10:24 PM

#

keen finch Sure. <@1394374846741221458> I’m curious how likely it is for gpt5 to beat Gemin...

I'm happy to answer questions and talk about the leaderboard and how it works, but I don't engage in speculation about future results

drowsy needle Aug 12, 2025, 10:38 PM

#

cunning steppe That reminds me - i did have another question. Are the leaderboard scores cumula...

Good question! If a provider changes the model behind an endpoint without announcing it, we wouldn't know. If they announce a new model version and have a new endpoint, we'd treat it as a new model.

gleaming thistle Aug 12, 2025, 10:45 PM

#

how did gpt5 get so high in the leaderboard so quick?

jaunty cave Aug 12, 2025, 10:46 PM

#

gleaming thistle how did gpt5 get so high in the leaderboard so quick?

It was actually tested on lmarena before it was released under the codename summit. So the votes were already collected before GPT-5 was officially announced by OpenAI
https://x.com/ml_angelopoulos/status/1953506803255586971

Anastasios Nikolas Angelopoulos (@ml_angelopoulos)

Millions of people have used GPT-5 under the codename summit on LMArena over the past couple weeks 🏔️

The people have spoken: GPT-5 is #1 on EVERYTHING in LMArena.

🧮 Math

💻 Coding

🖋️ Creative writing

Check out an example of its multifaceted intelligence in the 🧵

cunning steppe Aug 12, 2025, 11:02 PM

#

drowsy needle Good question! If a provider changes the model behind an endpoint without announ...

Ok, so you’re saying that the scores for those models should be pretty stable at this point? Since every new vote is only 1 of ~30,000?

keen finch Aug 12, 2025, 11:02 PM

#

cunning steppe Ok, so you’re saying that the scores for those models should be pretty stable a...

Wdym 1 of 30,000?

gleaming thistle Aug 12, 2025, 11:03 PM

#

is gpt5-high the model we get on gpt free? or gpt plus?

drowsy needle Aug 12, 2025, 11:12 PM

#

cunning steppe Ok, so you’re saying that the scores for those models should be pretty stable a...

In theory yes that'd make sense; however, it depends on how many new votes that model gets, and what those votes are. Say they receive 30k more votes (I'm being a bit dramatic with that) you can see how that'd affect their score, depending on what those votes looked like.

drowsy needle Aug 12, 2025, 11:15 PM

#

gleaming thistle is gpt5-high the model we get on gpt free? or gpt plus?

The gpt-5-high model is the gpt-5 with reasoning enabled and set to high. The gpt-5-chat model is without reasoning.

gleaming thistle Aug 12, 2025, 11:20 PM

#

isnt there a different model for plus users? or do they get gpt-5-high?

keen finch Aug 12, 2025, 11:23 PM

#

drowsy needle In theory yes that'd make sense; however, it depends on how many new votes that ...

How is the CI calculated?

#

Do you go by the historical time series of elo?

jaunty cave Aug 12, 2025, 11:27 PM

#

It doesn't use Elo anymore, it uses a modified version of Bradley-Terry.
This is the post on the move from Elo to BT: https://lmsys.org/blog/2023-12-07-leaderboard/

For a long time, the CI was calculated by bootstrapping, re-sampling the dataset many times, finding the ratings on the sample, and then seeing how much the ratings vary across all the runs. It recently switched to something better using a closed form equation based on M-estimators. https://en.wikipedia.org/wiki/M-estimator
See the July 23 entry at https://news.lmarena.ai/leaderboard-changelog/

Chatbot Arena: New models & Elo system update | LMSYS Org

Welcome to our latest update on the Chatbot Arena, our open evaluation platform to test the most advanced LLMs. We're excited to share that over 1...

M-estimator

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. However, M-es...

LMArena Blog

Leaderboard Changelog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

August 11, 2025
New model announcement: Claude Opus 4.1 is on Text and WebDev leaderboards.

August 7, 2025

New model

keen finch Aug 12, 2025, 11:47 PM

#

Interesting. Thank you.

cunning steppe Aug 13, 2025, 12:23 AM

#

Ah got it. I know I’ve badgered you a lot recently, but I want to thank you for being such a responsive and helpful mod. Probably one of the best in all my discords!

cunning steppe Aug 13, 2025, 12:50 AM

#

cunning steppe Ah got it. I know I’ve badgered you a lot recently, but I want to thank you for ...

@drowsy needle

drowsy needle Aug 13, 2025, 12:53 AM

#

cunning steppe <@283397944160550928>

That's super nice! But if I'm being honest, Clayton here is the one being super helpful and insightful here.

#

I do appreciate that a lot though. And don't feel like you're badgering, because you're not! If the community has questions we want to know.

jaunty cave Aug 13, 2025, 12:58 AM

#

@drowsy needle does a great job cultivating the community here, I love to just hop in every once in a while and talk about model rankings 😄

twin valve Aug 13, 2025, 11:56 AM

#

jaunty cave It doesn't use Elo anymore, it uses a modified version of Bradley-Terry. This t...

thank you! this also means that historical ratings cannot be compared if the rating system behind it changes.

void yew Aug 13, 2025, 1:43 PM

#

really curious to see where nano-banana currently falls in the leaderboard

faint sigil Aug 13, 2025, 5:05 PM

#

hello! Just wondering which is the best leaderboard or filter to refer to if I want to compare the best model for tool calling and agentic features?

vestal fulcrum Aug 13, 2025, 5:12 PM

#

HH

fluid field Aug 13, 2025, 8:06 PM

#

Hello everyone!!! I'm Greatness, an Engineering student, and also an AI enthusiast

drowsy needle Aug 13, 2025, 8:19 PM

#

fluid field Hello everyone!!! I'm Greatness, an Engineering student, and also an AI enthusia...

hello ablobwave welcome welcome

jaunty cave Aug 13, 2025, 8:38 PM

#

twin valve thank you! this also means that historical ratings cannot be compared if the rat...

correct, comparison is really only valid between two models on the same leaderboard at the same time. Even if two leaderboards were both produced with BT, if new data was added, the ratings aren't directly comparable in the mathematical sense. And as you saw earlier, if the anchor point changes like mistral to 1114, then they are even less comparable.

keen finch Aug 13, 2025, 11:19 PM

#

@jaunty cave @drowsy needle Do we know if the distribution around the score follows a normal distribution? I'm guessing it doesn't and you're not allowed to tell me, right?

drowsy needle Aug 13, 2025, 11:20 PM

#

keen finch <@1394374846741221458><@283397944160550928>Do we know if the distribution around...

distribution around the score follow a normal distribution?
Sorry can you elaborate a bit more on this? I'm not following.

keen finch Aug 13, 2025, 11:21 PM

#

so the confidence interval can be used to calculate the variance, but this only holds if the distribution is a normal distribution (I mean, you can still do the calculation either way).

jaunty cave Aug 13, 2025, 11:21 PM

#

you can take the scores from the leaderboard and plot them on a graph and see if it looks like they are from a normal distribution 🙂

keen finch Aug 13, 2025, 11:21 PM

#

jaunty cave you can take the scores from the leaderboard and plot them on a graph and see if...

No I mean for an individual model

jaunty cave Aug 13, 2025, 11:26 PM

#

hmm, the observations we get for a model are only win loss and tie, the "underlying strength" of a model is something we cannot actually know, only estimate, and to estimate we make modeling assumptions.

for example in Bradley-Terry, it models the probability of observing a winning outcome based on the score differences coming from a logistic distribution (since the sigmoid is the CDF of a logistic distribution)
https://en.wikipedia.org/wiki/Bradley–Terry_model
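
To spell out the logistic connection mentioned above: writing each strength as $p_i = e^{\beta_i}$ (a standard reparameterization, not something taken from LMArena's implementation), the Bradley-Terry win probability becomes a sigmoid of the score difference:

```latex
\Pr(i \text{ beats } j)
  = \frac{p_i}{p_i + p_j}
  = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
  = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}
  = \sigma(\beta_i - \beta_j)
```

Here $\sigma$ is the CDF of the standard logistic distribution, which is where the logistic assumption enters.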

Bradley–Terry model

The Bradley–Terry model is a probability model for the outcome of pairwise comparisons between items, teams, or objects. Given a pair of items i and j drawn from some population, it estimates the probability that the pairwise comparison i > j turns out true, as Pr(i > j) = p_i / (p_i + p_j), where p_i is a positive real-valued score assigned to individual i. The comparison ...

#

There's an important nuance to understand about the CIs, they are not saying, "This model is X amount strong, and that can vary +/- Y for any given sample"

It's more like: "We are 95% confident that the real strength is somewhere between X-Y and X+Y"
They are a measure of our uncertainty about the estimate, not about how variable the model is itself necessarily

keen finch Aug 13, 2025, 11:39 PM

#

Yes, "It's more like: "We are 95% confident that the real strength is somewhere between X-Y and X+Y" holds as the correct definition of confidence interval, but the fact that we are assuming a uniform number for both positive and negative means we're assuming symmetrical variation - in other words whatever distribution we have is not skewed.

#

I guess what I intended to ask is if the distribution is symmetric.

jaunty cave Aug 13, 2025, 11:45 PM

#

The CIs with the current method are symmetric. When we used bootstrapping before they were not necessarily.

keen finch Aug 14, 2025, 12:51 AM

#

man I have a feeling next leaderboard update will be wild

twin valve Aug 14, 2025, 7:07 PM

#

jaunty cave correct, comparison is really only valid between two models on the same leaderbo...

yeah like in chess (and other competitive fields). Yet chess also shows that people will still compare ratings no matter what.

cunning steppe Aug 14, 2025, 8:24 PM

#

jaunty cave correct, comparison is really only valid between two models on the same leaderbo...

Are they zero sum in a way? Like let’s say a new ultra powerful model comes in and scores a 2000. Does that mean the scores of all other models, in aggregate, have to go down?

jaunty cave Aug 14, 2025, 8:28 PM

#

twin valve yeah like in chess (and other competitive fields). Yet chess also shows that peo...

I think chess is actually more comparable over time if you're using the same Elo system the entire time. Of course the distribution of number of players and skill of players is changing a lot over time. But Elo is meant to be adaptive and BT is not

jaunty cave Aug 14, 2025, 8:32 PM

#

cunning steppe Are they zero sum in a way? Like let’s say a new ultra powerful model comes in a...

we don't center to 0 or any value, but the way we anchor is arbitrary: only the differences between model scores are meaningful, not the actual values. If we subtracted 1000 from everyone it would be just as valid.
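
A quick way to check the "subtract 1000 from everyone" point: a minimal sketch, assuming the usual Elo-style conversion with a base-10 logistic and a scale of 400. The model names and scores are made up.

```python
def win_prob(score_a, score_b, scale=400.0):
    """P(A beats B) from an Elo-style score difference (scale is assumed)."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))

# Hypothetical scores for two models.
scores = {"model-a": 1450.0, "model-b": 1400.0}
shifted = {name: s - 1000.0 for name, s in scores.items()}

before = win_prob(scores["model-a"], scores["model-b"])
after = win_prob(shifted["model-a"], shifted["model-b"])
print(f"before shift: {before:.4f}, after shift: {after:.4f}")
```

The predicted win probability depends only on the 50-point gap, so it is identical before and after the shift.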

wary scroll Aug 14, 2025, 8:33 PM

#

are there any plans to rename "gpt-5" to like "gpt-5-high" or something to help indicate to people that it's not the non-reasoning model you select as "GPT-5" on chatgpt.com ?

cunning steppe Aug 14, 2025, 8:37 PM

#

wary scroll are there any plans to rename "gpt-5" to like "gpt-5-high" or something to help ...

Already happened. In addition, they are testing 3 more variants: chat, mini, and nano

wary scroll Aug 14, 2025, 8:37 PM

#

excellent

cunning steppe Aug 14, 2025, 10:58 PM

#

jaunty cave we don't center to 0 or any value, but the way we anchor is arbitrary, only the ...

I know pineapple touched on this already, but do recent votes count more than old ones?

I feel like they should.

There’s been a lot of anecdotal reports in this discord of Gemini 2.5 pro getting “nerfed.” However, because Gemini has 30,000 votes, that might not show up in new rankings too easily. What do you think about applying some kind of decay function to make older votes count less?
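
One way such a decay function could look, purely as a hypothetical sketch: an exponential half-life weight per vote. The 90-day half-life and the vote data are invented, and a weighted win rate stands in for a full weighted Bradley-Terry fit. This is a suggestion being floated in the chat, not how the leaderboard is computed.

```python
def vote_weight(age_days, half_life_days=90.0):
    """Weight of a vote that is age_days old; halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def weighted_win_rate(votes, half_life_days=90.0):
    """votes: list of (age_days, won) tuples for one model."""
    total = sum(vote_weight(age, half_life_days) for age, _ in votes)
    won = sum(vote_weight(age, half_life_days) for age, w in votes if w)
    return won / total

# Invented history: 1000 old votes at a 70% win rate,
# then 100 recent votes at a 50% win rate.
votes = (
    [(180, True)] * 700 + [(180, False)] * 300
    + [(5, True)] * 50 + [(5, False)] * 50
)
unweighted = sum(1 for _, w in votes if w) / len(votes)
decayed = weighted_win_rate(votes)
print(f"unweighted: {unweighted:.3f}, decayed: {decayed:.3f}")
```

With decay, the recent 50% stretch pulls the estimate down faster than it would if every vote counted equally.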

jaunty cave Aug 14, 2025, 11:02 PM

#

cunning steppe I know pineapple touched on this already, but do recent votes count more than ol...

currently all votes count the same, that's an interesting idea though. 🙂

cunning steppe Aug 14, 2025, 11:08 PM

#

jaunty cave currently all votes count the same, that's an interesting idea though. 🙂

Would certainly help keep rankings fresh as can be! As for the Gemini reports, I’d say they’re partially validated by Gemini’s shrinking margin that has seen a continual slide the past month or so. Even across the last two leaderboard updates (which were just a week apart), we saw declines of one point in default, and two points in style control. That seems to be a lot for a model that has ~30K votes. I’m pretty confident that Gemini margins have changed much more than, say, o3, 4o and grok 3 (which all have similar vote counts). Might be worth looking into!

median briar Aug 15, 2025, 7:27 AM

#

cunning steppe Would certainly help keep rankings fresh as can be! As for the Gemini reports, I...

This is causing the reduction: https://news.lmarena.ai/opendata-july2025/

LMArena Blog

A Deep Dive into Recent Arena Data

Today, we're excited to release a new dataset of recent battles from LMArena! The dataset contains 140k conversations from the text arena.

#

At the start we did not have anything like the evaluation order

#

-> the change

twin valve Aug 15, 2025, 9:42 AM

#

jaunty cave I think chess is actually more comparable over time if you're using the same Elo...

the FIDE Elo had many slight changes over time and Elo per se can be used only to compare the active (note on active) playerbase. But that's OT. Thank you for the tidbits. Maybe a possible article collecting such tidbits would be helpful for the community over time.

cunning steppe Aug 15, 2025, 12:36 PM

#

median briar This is causing the reduction: https://news.lmarena.ai/opendata-july2025/

Interesting! There’s a couple sections in here on score changes over time…didn’t @jaunty cave say you can’t do that? Also, why do figure 5 and figure 10 have completely different values for the same style control data?

#

cunning steppe Aug 15, 2025, 3:23 PM

#

jaunty cave currently all votes count the same, that's an interesting idea though. 🙂

@jaunty cave some more thinking on why older models need a decay function:

older models are constantly going up against newer and tougher competition. Which means older battles against weaker opponents should be less relevant for today’s rankings. It’s kind of like a high school football team that starts off against other high school teams, but eventually starts matching up against college and professional teams as time goes on. A win against other high school teams probably shouldn’t count as much as a win against new teams when examining current rankings. But since older models have already amassed so many votes, they have an unfair edge since one win is one win, even if it happened against a much weaker pool.

On the flip side, this also penalizes new models. New models that enter the arena have to face stronger competition compared to older ones. Those older ones may have been able to rack up a lot of wins since they’ve been around forever. That helps cement a pretty stable score that is slow to adjust to current competition. Whereas a newer stronger model, which gets matched up against other stronger models, is already gonna be at a disadvantage right off the bat.

Yes, i get that all this eventually balances out over time. But that takes a lonnnnnng time to accomplish and any ranking snapshot is probably not going to be very accurate.

jaunty cave Aug 15, 2025, 4:38 PM

#

twin valve the FIDE Elo had many slight changes over time and Elo per se can be used only t...

Absolutely planning to write some blogs to help improve public awareness of the methods, glad you find them useful. About Chess, Elo is a huge inspiration and I love his book, I first found out about LMArena since I was interested in rating systems for sports and games and saw that they initially used Elo for AIs, would love to chat and pick your brain about rating systems some time

twin valve Aug 15, 2025, 4:54 PM

#

oh yeah I spent way too much time on Elo stuff (but only pure Elo, no BT. Glicko and Glicko2 only because lichess and chess.com use them. Mostly they are like the Elo but with dynamic K factor)

About the rating decay (used by chessmetrics for example), be careful because it can mess up the system. It is much better to say - especially as we are hopefully dealing with fixed systems (not like teams that change) - "care about rating gaps, don't expect things to be anchored to a certain value". Although you already mentioned that.

keen finch Aug 15, 2025, 4:55 PM

#

Wow gpt5-high crashed

blissful latch Aug 15, 2025, 4:58 PM

#

hello

keen finch Aug 15, 2025, 4:59 PM

#

blissful latch hello

Hi

jaunty cave Aug 15, 2025, 5:02 PM

#

keen finch Wow gpt5-high crashed

keen finch Aug 15, 2025, 5:02 PM

#

🤣 I'm not complaining

wary scroll Aug 15, 2025, 5:07 PM

#

Oh wow, gpt-5-chat actually ranked under 4o

#

I knew it was similar to 4o, I wasn’t expecting it to be measurably worse, that’s impressive

void yew Aug 15, 2025, 5:08 PM

#

what. gpt-5-high is lower than gpt-5-chat on creative writing?? I thought the opposite…

wary scroll Aug 15, 2025, 5:10 PM

#

That makes sense a bit though, gpt-5 thinking is good at reasoning/problem solving, not creativity

cosmic harness Aug 15, 2025, 5:13 PM

#

A 1.15% chance of dropping from 1462 +-11 to 1437... yeah I don't think so...

#

I don't trust those error bars anymore..

wary scroll Aug 15, 2025, 5:14 PM

#

Lawl, gpt-5-chat ranked under 4o for coding as well

#

I guess everyone’s complaints were justified

robust zodiac Aug 15, 2025, 5:15 PM

#

cosmic harness A 1.15% chance of dropping from 1462 +-11 to 1437... yeah I don't think so...

I think there may be some error with the past update (on the 11th, when there was no increase in votes but an increase in overall score). Because yes, the odds for such a drop are rather small

#

thoughts?

wary scroll Aug 15, 2025, 5:17 PM

#

cosmic harness A 1.15% chance of dropping from 1462 +-11 to 1437... yeah I don't think so...

Which model/category made that drop?

robust zodiac Aug 15, 2025, 5:18 PM

#

gpt 5 in text with no style control

void yew Aug 15, 2025, 5:19 PM

#

GPT-5-chat never wowed me but I'm pretty hooked on GPT-5 thinking, which I believe is the same as GPT-5-high

wary scroll Aug 15, 2025, 5:21 PM

#

void yew GPT-5-chat never wowed me but I'm pretty hooked on GPT-5 thinking, which I belie...

Depends on the plan, but it’s more or less gpt-5 with medium reasoning, not quite the high reasoning

#

Pro plan gets more thinking “juice” even on the non-pro model

#

Both are still less reasoning effort than high through the API though

void yew Aug 15, 2025, 5:28 PM

#

Huh.

cunning steppe Aug 15, 2025, 5:32 PM

#

I agree this is…strange. Few ideas:

1. Some of these votes could have come in last Friday while OpenAI was having capacity issues (which affected all models). If that made it timeout or give weak responses, then ppl are gonna vote against it.
2. The people testing under stealth are different from the people testing after it went live
3. The model they tested was slightly different than the model that went live

wary scroll Aug 15, 2025, 5:36 PM

#

Most likely the model they were given early access to was slightly different from the final version that went live, so #3

#

That’d be my bet at least

cunning steppe Aug 15, 2025, 5:37 PM

#

They alleged it was identical, but maybe some small adjustments were made at deployment 🤷‍♂️

wary scroll Aug 15, 2025, 5:38 PM

#

They could’ve lowered the maximum reasoning effort or something, small tweak but enough to change the elo

cunning steppe Aug 15, 2025, 5:39 PM

#

Nah the reasoning (aka “juice”) has been at 200 for a while. Seen many reports on X validating that

#

Fishy

wary scroll Aug 15, 2025, 5:40 PM

#

Was it at 200 before the livestream though?

#

But was being ranked in the arena

cunning steppe Aug 15, 2025, 5:41 PM

#

Hmm what do you mean? The debut ranking on LMArena was based off of testing only

#

Back from late July

#

I think it was tested for a day or two

wary scroll Aug 15, 2025, 5:41 PM

#

Yeah on the 4th or something iirc

cunning steppe Aug 15, 2025, 5:42 PM

#

The latest update takes those initial votes in testing, PLUS all the votes since public launch (the livestream)

#

so to tank that much…means that scores of the past week must have been REALLY low

#

Which is odd, bc GPT-5-high’s win rate against Gemini 2.5 pro did improve during that time

#

You would think if a model did weaker overall, that it would also do worse against the best model (Gemini 2.5 pro). But seems not…

wary scroll Aug 15, 2025, 5:45 PM

#

I guess we’ll know for sure on the next leaderboard update if it drops further

cunning steppe Aug 15, 2025, 5:46 PM

#

Yeah we’ll see. But gpt-5-high has 6K votes now. So it’s gonna be harder to move in either direction, especially as some of the hype has died down. Idk how many votes it will be able to get in the next week

wary scroll Aug 15, 2025, 5:48 PM

#

We need gpt-5-medium added too, since that’s closer to the “GPT-5 Thinking” that everyone is using

sinful acorn Aug 15, 2025, 6:45 PM

#

Leaderboard update request: You have enough data to know what a typical input and output length are. Can you, in addition to showing the rankings of the bots, also show the price per typical query? (As in: You know the length of the input query. You know how much output each bot typically makes. You know what the tokenizing patterns are, or can at least get a really good approximation of token count. You know their published pricing. You can list the cost at the same time as their quality.)
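
The arithmetic behind this request is simple enough to sketch. Everything below is hypothetical: the model names, token counts, and per-million-token prices are placeholders, not published pricing or real Arena statistics.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Estimated USD cost of one query, given per-million-token prices."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000

# Placeholder catalog: (typical output tokens, input price, output price).
catalog = {
    "model-x": (800, 1.25, 10.00),
    "model-y": (450, 2.50, 10.00),
}
typical_input_tokens = 300  # assumed typical prompt length

for name, (typical_output, price_in, price_out) in catalog.items():
    cost = cost_per_query(typical_input_tokens, typical_output, price_in, price_out)
    print(f"{name}: ${cost:.6f} per typical query")
```

A model that writes longer answers can cost more per query even when its per-token price is lower, which is why the request asks for typical output length per bot.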

wicked sapphire Aug 15, 2025, 6:53 PM

#

cunning steppe so to tank that much…means that scores of the past week must have been REALLY l...

That was a crazy drop. The vote count only doubled and yet it went down 2x the previous CI

vale lodge Aug 15, 2025, 8:02 PM

#

wary scroll Oh wow, gpt-5-chat actually ranked *under* 4o

im like 99% sure 4o is like 20x more expensive though

wary scroll Aug 15, 2025, 8:05 PM

#

vale lodge im like 99% sure 4o is like 20x more expensive though

#

Very similarly priced

idle ocean Aug 15, 2025, 8:05 PM

#

that's double for the input but yes

vale lodge Aug 15, 2025, 9:31 PM

#

idle ocean that double for the input but yes

cached is more than double

#

10x

#

@wary scroll

#

cached input is 10x more expensive and output is 2x

#

so like

idle ocean Aug 15, 2025, 9:37 PM

#

oh yah, missed that

silver condor Aug 15, 2025, 10:58 PM

#

GPT-5 is clearly superior though right?

rich echo Aug 16, 2025, 4:54 AM

#

Since Qwen-Image has been added, so when will it appear in Leaderboards?

#1401957379435663420 message

#

I’ve already seen Wan2.2 standing on two stages in Leaderboards.

indigo grail Aug 16, 2025, 5:46 AM

#

which model is ranked first overall for this month?

remote sinew Aug 16, 2025, 9:50 AM

#

Dear developers, I've just found the AI isn't really what its name says, such as: claude opus 4.1 thinking is originally CLAUDE SONNET 3.5, what the hell is this guys, if you guys don't believe me, you can ask like this: Which model are you? And then guys we can clarify they are scamming us!


robust zodiac Aug 16, 2025, 10:03 AM

#

remote sinew Dear devolopers, I've just found the Ai isn't real as written by their name such...

It has been explained a few times already in the short time since I joined that the AI does not have context prompts/“is not aware of itself” in layman's terms. Ask the date and about current events and you will see most think we're still in Obama presidency times. Or check the api lol

#

So no, nobody rigged your polymarket openAI bet 🤣

hallow comet Aug 16, 2025, 1:31 PM

#

@remote sinew Spammed this in #share-prompts TOO lol

tawdry socket Aug 16, 2025, 7:36 PM

#

Why did gpt-5 high have such a huge drop on Aug 14 update? Is it because OpenAI API issues?

#

drop in elo

twin valve Aug 16, 2025, 10:44 PM

#

AFAIK: reworked ratings (see comments above though an article would be easier to find) and adding votes could mean that values get lower.

gleaming sandal Aug 17, 2025, 1:45 AM

#

Just found out about this discord channel. I'm excited to be a part of this community and learn and share. I just started an AI consulting company and i will be focusing on audio and video AI projects.

drowsy needle Aug 17, 2025, 3:06 AM

#

gleaming sandal Just found out about this discord channel. I'm excited to be a part of this comm...

welcome! glad to hear it. be sure to check out #1397655624103493813 for more info on how to use Video Arena

dusky stone Aug 17, 2025, 5:07 AM

#

/

tall sleet Aug 17, 2025, 6:17 AM

#

https://tenor.com/view/hello-hi-hy-hey-gif-8520159980767013609

Tenor

tender sigil Aug 17, 2025, 8:03 AM

#

cunning steppe so to tank that much…means that scores of the past week must have been REALLY l...

dropping literally 25 elo points (1462 -> 1437 with no style control) in a single vote update is one of the craziest adjustments I’ve ever seen on the leaderboards

#

gpt-5-high has some bizarre win-loss records vs certain models

#

39% win rate against Claude Opus 4.1 and a 42% win rate against Qwen 3 (July instruct)

#

I would posit that Opus 4.1 was a factor in its decline since it began testing only after GPT-5’s first placement on the leaderboard, but there’s only been 51 recorded battles between them as of the latest update…

#

0.8% of GPT 5’s total votes, lol

nocturne fable Aug 17, 2025, 8:37 AM

#

Hello, I want to make images, how does this work?

hallow comet Aug 17, 2025, 11:35 AM

#

nocturne fable Hello, I want to make images, how dos this work?

Read this #1397655624103493813

lapis hearth Aug 17, 2025, 1:03 PM

#

hello, I am Bruno

rancid geode Aug 17, 2025, 1:28 PM

#

hi

warm saffron Aug 17, 2025, 2:35 PM

#

Hi everyone! I’m new here, excited to discover LMArena and to experiment with video and image generations. Looking forward to learning from you all!

drifting mulch Aug 17, 2025, 2:37 PM

#

hi

light latch Aug 17, 2025, 3:02 PM

#

Hi

lapis ether Aug 17, 2025, 3:28 PM

#

Hi everyone , greetings from Paris France

shy bramble Aug 17, 2025, 4:00 PM

#

Hello I feel grateful to join the community

drowsy needle Aug 17, 2025, 4:01 PM

#

welcome everyone!!

mossy seal Aug 17, 2025, 4:12 PM

#

Hi

dim raven Aug 17, 2025, 4:24 PM

#

hello

tender sigil Aug 17, 2025, 4:54 PM

#

is this the introductions channel or the leaderboard channel

drowsy needle Aug 17, 2025, 6:01 PM

#

tender sigil is this is the introductions channel or the leaderboard channel

Yeah we're going to be looking into a new setup for welcome channel/leaderboard channels soon!

tender sigil Aug 18, 2025, 12:23 AM

#

@drowsy needle are there any situations where prior user votes are removed from the leaderboard calculations? I can imagine doing so when a user has been found to be doing vote manipulation in service of a particular model, but are there other instances that lead to a user’s vote history being removed from score calculation?

drowsy needle Aug 18, 2025, 1:20 AM

#

tender sigil <@283397944160550928> are there any situations where prior user votes are remove...

So before updating leaderboards we do go through the data to validate for accuracy. For example if someone asks in battle ~"what model are you" and the response discloses the model's name before a vote happens - those kinds of votes are removed.
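
As a rough illustration of the kind of filter described above (not the actual validation pipeline, whose rules aren't public): the name list, the battle record layout, and the regex approach below are all assumptions.

```python
import re

# Hypothetical name list; a real filter would need a maintained catalog
# and would have to handle false positives (e.g. "Gemini" the zodiac sign).
IDENTITY_PATTERN = re.compile(
    r"\b(gpt|chatgpt|openai|claude|anthropic|gemini|grok|xai|qwen)\b",
    re.IGNORECASE,
)

def reveals_identity(responses):
    """True if any model response mentions a known model or lab name."""
    return any(IDENTITY_PATTERN.search(text) for text in responses)

def filter_battles(battles):
    """Keep only battles whose responses did not disclose an identity."""
    return [b for b in battles if not reveals_identity(b["responses"])]

# Made-up battle records for the demo.
battles = [
    {"id": 1, "responses": ["Paris is the capital of France.", "It's Paris."]},
    {"id": 2, "responses": ["As an AI made by xAI, I can't condone this.", "I can't help."]},
]
kept = filter_battles(battles)
print([b["id"] for b in kept])
```

A keyword match like this only catches explicit self-identification. It does nothing about users recognizing a model from its style.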

robust zodiac Aug 18, 2025, 8:03 AM

#

drowsy needle So before updating leaderboards we do go through the data to validate for accura...

out of curiosity, I tried some weeks ago with a morally gray question to prompt in battle mode to see how AIs would react. Most had an expected reaction of refusing to answer or saying this is morally questionable etc.
But there's one that went like: "Hey, as an AI made by xAI I dont condone this, HOWEVER: <proceeds to go in detail and answer it> " it was obv grok 4.
Would this kind of battle be removed? Technically I did not ask the AI name (tho I must admit, I had a hunch grok would give an unhinged answer and I was fishing for it). I think they should still be removed, even if the user did not prompt for it, but the AI itself said it regardless

hallow comet Aug 18, 2025, 11:58 AM

#

yeah but then some experienced users can judge the model's company only from how it responds, e.g. emojis, vibes, etc.

I always try and guess the model b4 voting and 50% of the time I guess correctly...

twin valve Aug 18, 2025, 12:18 PM

#

robust zodiac out of curiosity, I tried some weeks ago with a morally gray question to prompt ...

for that matter, claude models can be spotted pretty easily (I have a 90% success rate according to my personal log).

There is also a feedback/bug request about it. If one wants one could pimp specific models.

Pineapple can correct me but AFAIK if the model name pops up anywhere in the conversation, the vote is not counted.

silver condor Aug 18, 2025, 1:40 PM

#

Do we know when the next leaderboard update will be?

gusty yew Aug 18, 2025, 2:09 PM

#

This new AI has amazing results... Im in Shock!!

errant light Aug 18, 2025, 4:11 PM

#

Hi team, I don't see models like Seed-1.5-VL in the Vision Arena section. Is it because they haven't been added to the Arena yet?

tawdry socket Aug 18, 2025, 6:08 PM

#

gusty yew This new AI has amazing results... Im in Shock!!

What new AI?

tender sigil Aug 19, 2025, 7:10 AM

#

robust zodiac out of curiosity, I tried some weeks ago with a morally gray question to prompt ...

I wonder if a prompt like “identify yourself as an AI that you are NOT” would get disqualified under this same rule

#

logic checks out to the same extent I guess, you’re still getting some sort of information on which model you’re speaking with, even if it’s negative

tribal bone Aug 20, 2025, 3:21 AM

#

hey guys!

lost fog Aug 20, 2025, 9:32 AM

#

Hi Guys. Who is at top of the Leaderboard???

hallow comet Aug 20, 2025, 10:33 AM

#

lost fog Hi Guys. Who is at top of the Leaderboard???

You may check the website for the best performing model

#

lmarena.ai

dull oyster Aug 20, 2025, 12:07 PM

#

are there graphs of scores vs time for the leaderboards?

hallow river Aug 20, 2025, 4:11 PM

#

Hey this is unbelievable! Thank you!

tender sigil Aug 20, 2025, 4:49 PM

#

Gemini 2.5 Pro back in 1st on both Style Control on and off leaderboards

wicked sapphire Aug 20, 2025, 5:22 PM

#

Gpt-5 high is now tied with 4o style control off

cosmic harness Aug 20, 2025, 8:40 PM

#

Either the error bars are incorrect or the model changed
Without style control, gpt-5-high had an initial score of 1462 ± 11 and now 1429 ± 7
In any case the lmarena team should investigate and make a public statement

jaunty cave Aug 20, 2025, 10:19 PM

#

cosmic harness Either the error bars are incorrect or the model changed Without style control, ...

not to say this isn't surprising, it is, but you're also assuming that prompt distributions, and voter preferences are exactly identical between when the initial votes were collected and now.

winged hollow Aug 21, 2025, 4:12 AM

#

/0–3s (desk scene): “This was Xplainer — keep questioning the stories you’ve been told.”

3–6s (glitch dissolve): “The mainstream story is the blue pill… we’re here for the truth.”

6–10s (pill split): “Ready to see beyond the veil? Take the red pill and join us.”

10–12s (logo + podcast plug): “…on our podcast, Deep Dive.”

12–14s (CTA icons): “And don’t forget to like, subscribe, and drop your thoughts in the comments.”

Visual direction:

Opening (0–3s): Documentary-style shot: person at cluttered desk under harsh overhead light, dark cracked concrete wall behind them. Papers, lamp, coffee cup — gritty realism. Camera slowly pushes in. Static ripple overlays frame.

Transition (3–6s): As VO reaches “blue pill / red pill” line, figure glitches and dissolves into static, leaving cracked wall.

Pill sequence (6–10s): Neon capsule pill appears mid-screen, splitting into blue (left) and red (right). Blue side flickers, distorts, dissolves into static. Red side pulses brighter, shatters into fragments that reform as bold neon “XPLAINER” logo.

Podcast plug (10–12s): Secondary neon text fades in under logo: “Deep Dive Podcast”, glowing red with glitch flicker.

CTA (12–14s): Neon line icons (thumbs-up, bell, comment bubble) flash one by one, synced to audio blips, then glitch out.

Ending (14–16s): Logo surges brighter, cracked glass overlay intensifies, screen tears with distortion burst, fade to black.

#

Create a YouTube Thumbnail.Say hello, and what brings you to Arena, besides making an intro?

tender sigil Aug 21, 2025, 4:31 AM

#

cosmic harness Either the error bars are incorrect or the model changed Without style control, ...

eh, statistical anomalies happen 🤷🏻‍♀️ the prompt/user pool could’ve changed, or new low-elo in-development models got favorable win rates against gpt-5-high

#

accusing the system of having nefarious actions or intent for no deeper reason other than “number moved more than I thought it would” isn’t really the logically sound argument you think it is

robust zodiac Aug 21, 2025, 5:41 AM

#

tender sigil accusing the system of having nefarious actions or intent for no deeper reason o...

I would argue that some error may be there given the error margins. Or the formula itself is flawed/doesn't account for some factors.
Not saying it's negative intentions/this is on purpose. But the error margin is way lower than the actual shift

jaunty cave Aug 21, 2025, 5:44 AM

#

robust zodiac I would argue that some error may be there given the error margins. Or they form...

the bradley-terry model we use for both ratings and confidence intervals, like all statistical models, is based on assumptions, which do not always match reality.

Assumptions like strength of the competitors is not changing over time, voter distribution as a whole is not changing over time, input distribution is not changing. If modeling assumptions are violated the results are not guaranteed to hold. We are always working to improve the models to better reflect reality, and when we see things like this it's useful data for adjustments

robust zodiac Aug 21, 2025, 5:55 AM

#

jaunty cave the bradley-terry model we use for both ratings and confidence intervals, like a...

It was not a critique, I understand quite well the challenge of the task at hand. But I still find it odd there’s such a gap between the error margin and the actual shift, since the margin by definition should account for at least some of the possible modeling violations. But that does not mean it can accurately account for everything, so yeah. Probably something to learn/improve here

jaunty cave Aug 21, 2025, 6:03 AM

#

the confidence intervals account for the randomness in the data generating process of the model itself, not the degree to which the model is misspecified from reality.

zenith kindle Aug 21, 2025, 6:04 AM

#

oh fuq

#

u bashtards

#

why

robust zodiac Aug 21, 2025, 6:04 AM

#

jaunty cave the confidence intervals accounts for the randomness in the data generating proc...

That makes sense

jaunty cave Aug 21, 2025, 6:05 AM

#

Like if someone was rolling a dice, they could start to estimate the variance and standard deviation of the samples they get. But then if the dice breaks, or someone swaps the dice, or they start throwing it in a biased way, it could be totally different, since the data is no longer being generated according to the model we used to calculate the standard deviation
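
The dice analogy is easy to simulate. A small sketch, with an invented loaded die (6 three times as likely as any other face) and arbitrary sample sizes:

```python
import math
import random
import statistics

rng = random.Random(0)

# Phase 1: a fair six-sided die. Estimate the mean and a 95% interval for it.
fair_rolls = [rng.randint(1, 6) for _ in range(2000)]
fair_mean = statistics.mean(fair_rolls)
half_width = 1.96 * statistics.stdev(fair_rolls) / math.sqrt(len(fair_rolls))

# Phase 2: the die is swapped for a loaded one (assumed: 6 is three times
# as likely as any other face), so the data-generating process has changed.
loaded_faces = [1, 2, 3, 4, 5, 6, 6, 6]
loaded_rolls = [rng.choice(loaded_faces) for _ in range(2000)]
loaded_mean = statistics.mean(loaded_rolls)

print(f"fair die mean: {fair_mean:.3f} +/- {half_width:.3f}")
print(f"loaded die mean: {loaded_mean:.3f}")
print("inside the old interval?", abs(loaded_mean - fair_mean) <= half_width)
```

The interval from phase 1 was a fine description of sampling noise for the fair die. It says nothing about what happens once the die itself changes.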

robust zodiac Aug 21, 2025, 6:05 AM

#

Yeah, the dice has an extra facet at the moment

tender sigil Aug 21, 2025, 5:40 PM

#

jaunty cave Like if someone was rolling a dice, they could start to estimate the variance an...

when GPT 5 was in testing, it specifically had a different voting pool of prompts and users that weren’t fishing for a response from GPT 5

#

it’s not even knowing you’re talking to a specific model that can cause you to lean on the side of voting differently, but knowing that a certain model might be responding at all

#

although the p-value for gpt-5-high’s drop on the Style Control Removed leaderboard from 1462 +/- 11 to 1429 +/- 7 is less than 1 in 1 million (around .00000078) which is pretty significant

#

the drop on the normal leaderboard with Style control is p = .00032 (around 1 in 3,000)
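
For reference, a sketch of one way a p-value like this can be computed from the 1462 +/- 11 and 1429 +/- 7 figures. It assumes the +/- values are 95% half-widths, that the two snapshots are independent, and that a normal approximation applies. None of that is confirmed in the chat, and the two snapshots share votes, so treat it as a back-of-the-envelope check. The function name is made up.

```python
import math


def two_sided_p_value(rating_a, half_width_a, rating_b, half_width_b, z_crit=1.96):
    """Two-sided z-test p-value for the gap between two ratings.

    Assumes each half-width is a 95% interval (so stderr = half_width / 1.96)
    and that the two estimates are independent.
    """
    se_a = half_width_a / z_crit
    se_b = half_width_b / z_crit
    z = (rating_a - rating_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # For a standard normal, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


p = two_sided_p_value(1462, 11, 1429, 7)
print(f"p = {p:.2e}")
```

Under these assumptions the result comes out below one in a million, in the same ballpark as the figure quoted above. As discussed earlier in the thread, a tiny p-value here shows the shift is not sampling noise; it does not say whether the cause is the model, the voters, or the prompts.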

twin valve Aug 21, 2025, 6:35 PM

#

cosmic harness Either the error bars are incorrect or the model changed Without style control, ...

rating can change. A model may have "luck" at first with easy questions and then crash once the votes increase. The idea of fixed rating is a misleading one.

For this I wish every model would have a ton of votes anyway (like the video arena, that is a dream)

civic scroll Aug 21, 2025, 6:36 PM

#

Ola

twin valve Aug 21, 2025, 6:37 PM

#

robust zodiac I would argue that some error may be there given the error margins. Or they form...

just as info, if you have a model with a CI +1/-1 and a rating of, say 1500, it can still go down to 1499, 1498, 1497 and so on. If you add that leaderboard updates are done more or less weekly, then you see sudden jumps.

One has to consider that there are different types of voters, different types of questions and so on and so forth. I am actually glad that the rating actually moves.

twin valve Aug 21, 2025, 6:38 PM

#

jaunty cave the confidence intervals accounts for the randomness in the data generating proc...

that is really well said.

robust zodiac Aug 21, 2025, 6:40 PM

#

twin valve just as info, if you have a model with a CI +1/-1 and a rating of, say 1500, it...

bit of a moot point 😂 I mean obv it can vary more as we witnessed a few times now. But it was more about the confidence of the model/interval with the given CI range

#

I would find it interesting to see how/why the actual rating had such a drastic drop (or to use what Clayton said, what exactly in the model was/is misspecified from reality)

twin valve Aug 21, 2025, 6:42 PM

#

the confidence is good as long as (a) all other models stay the same and (b) all the voters stay the same.

Otherwise, unless a model has a lot of votes, I would expect the model to change its rating relatively quickly. For example going from 3k votes to 6k (double the initial ones)

#

yes for tracking such stuff I wish we had the results of the votes (not the prompt and answers) so that one can do such analysis in an independent way

#

see #1372537524551159913 message

#

in that way one can simply verify the changes and even track those

#

In that thread I think the votes up to early 2025 are available. There is no "up to date" collection that I know of though.

robust zodiac Aug 21, 2025, 6:45 PM

#

yeah, that would be interesting to see; I do feel it may somehow be exploited however (or I am just too sleepy to think it clearly and being paranoid about it right now 😂 )

#

the drastic increase in votes (nearly 200% from 3k to 8k+ ) does make sense to cause variation

#

another 6k votes should not be able to produce such a big shift going forward

twin valve Aug 21, 2025, 6:47 PM

#

well to be fair some models can be recognized. I often say that I can spot claude versions (not a specific one) 90% of the time. In theory I could pump claude's rating (multiply this by X people, and you have it).

The point being, some LLMs have a certain style and if this style does not fit certain people (assuming good faith, that is, assuming they don't want to buff or nerf anything), then those models will lose/win rating over time

#

for example I dislike the emoji style of some gpt models, with me they always lose

#

human preferences and all that

robust zodiac Aug 21, 2025, 6:48 PM

#

yeah, grok is also very easy to tell apart

#

gpt with the emojis

#

gemini may actually be hardest to detect this way

twin valve Aug 21, 2025, 6:48 PM

#

yes

#

claude is always the one that replies to you but in a terse way (at least for my prompts) and with a hint of "frick off you"

robust zodiac Aug 21, 2025, 6:49 PM

#

maybe he just dislikes you KEKW

twin valve Aug 21, 2025, 6:49 PM

#

could well be

hollow zinc Aug 21, 2025, 6:49 PM

#

Hello!

robust zodiac Aug 21, 2025, 6:49 PM

#

some guy there @claude team hard coding that would be hilarious 😂

twin valve Aug 21, 2025, 6:50 PM

#

😄

#

and btw I check the leaderboard without style control (see the point about emojis, style makes the difference for chatbots imo)

robust zodiac Aug 21, 2025, 6:53 PM

#

it's also something that this type of ranking can't catch: the AI will adjust certain personality traits for a personal account

#

in these battles we get the default personality so to speak, which can never ever resonate with everybody. Some will hate the emojis, some will hate the cringy grok jokes. But they would morph a bit in longer lived interactions

twin valve Aug 21, 2025, 6:55 PM

#

that is also a point, so at most the default personality is tested

#

still fine for me. It is like testing a helpful "neutral" chatbot rather than a personalized one

#

btw I didn't realize anthropic has no 4-haiku. Either sonnet is super fast, or they don't see ROI with it

robust zodiac Aug 21, 2025, 6:58 PM

#

yeah, I mean given personality traits can be customised it's a non-issue. So the best way to rank them would be to ignore the style entirely and see which one objectively answered best

hot hearth Aug 21, 2025, 7:13 PM

#

trophy3d

fresh cave Aug 22, 2025, 2:56 AM

#

Where is nano-banana at on the leaderboards? I don't see it listed

drowsy needle Aug 22, 2025, 3:10 AM

#

fresh cave Where is nano-banana at on the leaderboards? I don't see it listed

Models that use code names aren’t going to appear on the leaderboards.

jaunty cave Aug 22, 2025, 3:17 AM

#

what's up leaderboard people

dull oyster Aug 22, 2025, 11:14 AM

#

@here leaderboards don't seem trustworthy if you can just ask what model they are and then vote the one you want to hit #1 ?

#

idle gale Aug 22, 2025, 11:16 AM

#

dull oyster @here leaderboards don't seem trustworthy if you can just ask what model they ar...

they filter those out

brittle pine Aug 22, 2025, 5:03 PM

#

dull oyster @here leaderboards don't seem trustworthy if you can just ask what model they ar...

But you'll get bored after asking 2-3 times, and then it's easier to just select which one is better, and then it will show you the model name

#

Most people would do it this way

vast nova Aug 22, 2025, 9:03 PM

#

is anyone knows why flux 1 kontext max is out of list ?

drowsy needle Aug 23, 2025, 3:42 AM

#

vast nova is anyone knows why flux 1 kontext max is out of list ?

It’s no longer available in Direct/Side-by-side

keen zealot Aug 23, 2025, 7:04 AM

#

Hlo

tender sigil Aug 23, 2025, 7:58 AM

#

Bro’s prompting for his self-insert 😭😭

iron owl Aug 23, 2025, 11:11 AM

#

HI PEOPLE!

dire prawn Aug 23, 2025, 11:39 AM

#

hi

fluid pilot Aug 23, 2025, 12:09 PM

#

hello guys

wide cape Aug 23, 2025, 1:12 PM

#

hiiiii

bold rover Aug 23, 2025, 3:24 PM

#

Hi Everyone! I just find you LM Arena!!

drowsy needle Aug 23, 2025, 3:33 PM

#

bold rover Hi Everyone! I just find you LM Arena!!

Welcome welcome ablobwave

white orbit Aug 23, 2025, 5:42 PM

#

Hi

ornate sinew Aug 23, 2025, 5:53 PM

#

hi

sand vale Aug 23, 2025, 6:15 PM

#

hi

sharp hollow Aug 23, 2025, 6:58 PM

#

hello

sly cosmos Aug 23, 2025, 7:05 PM

#

hi

tender sigil Aug 23, 2025, 7:15 PM

#

I find it intriguing how this channel is persistently used by newcomers as a greeting channel when there’s 0 clear indicator as to why they would greet everyone here specifically

#

over say #general

#

new Mistral Medium debuting at #2 on style control removed is crazyyyyy though

#

first ever model with a winning record vs. 2.5 Pro!

wide ether Aug 23, 2025, 8:44 PM

#

Hey

opal temple Aug 23, 2025, 8:46 PM

#

cat

modest pewter Aug 23, 2025, 8:46 PM

#

tender sigil new Mistral Medium debuting at #2 on style control removed is crazyyyyy though

WHAT

#

also

#

2.5 pro reovertook gpt-5?

sterile grail Aug 23, 2025, 8:55 PM

#

Hello

pure quartz Aug 23, 2025, 9:26 PM

#

hello

vocal spear Aug 23, 2025, 10:14 PM

#

hi

obsidian flint Aug 23, 2025, 10:24 PM

#

modest pewter 2.5 pro reovertook gpt-5?

funny. Opus 4.1, Grok 4, and ChatGPT 5 can't dethrone Google's 3-month-old GA model. Maybe it's time for me to bet on polymarket, Google will release Gemini 3.0 in a few months making another gap 😂

wide osprey Aug 24, 2025, 4:07 AM

#

hello

harsh cliff Aug 24, 2025, 4:51 AM

#

hello

eternal root Aug 24, 2025, 4:59 AM

#

hello

sharp goblet Aug 24, 2025, 5:14 AM

#

#generate a video in which pm imran khan is sitting upset in jail

drowsy needle Aug 24, 2025, 5:51 AM

#

sharp goblet #generate a video in which pm imran khan is sitting upset in jail

You'll want to use /video in the Video Arena channels ( #video-arena-1 #video-arena-2 #video-arena-3), you can learn more in #1397655624103493813

elfin forum Aug 24, 2025, 7:45 AM

#

Hello

dense current Aug 24, 2025, 8:07 AM

#

hello

dire yew Aug 24, 2025, 10:06 AM

#

hello

pure charm Aug 24, 2025, 11:40 AM

#

how to use veo 3 here?

plucky moon Aug 24, 2025, 12:14 PM

#

pure charm how to use veo 3 here?

/ video and add your prompt

plucky moon Aug 24, 2025, 12:16 PM

#

pure charm how to use veo 3 here?

keep in mind that the video will be public

zinc panther Aug 24, 2025, 1:31 PM

#

Use the nano-banana model to create a 1/7 scale commercialized figure of the character in the illustration, in a realistic style and environment. Place the figure on a computer desk, using a circular transparent acrylic base without any text. On the computer screen, display the ZBrush modeling process of the figure. Next to the computer screen, place a BANDAI-style toy

4514741bf29c85c6419a3bfaa4ec8bacde8e73d812382-fiftaA_fw1200.webp

tawdry socket Aug 24, 2025, 4:30 PM

#

Does anyone know what was the mistral medium's name before it was relvealed on leaderboard?

agile cloud Aug 24, 2025, 4:59 PM

#

Hello, why I kept receiving "Connecting to Arena has failed. Please try again later or on a different device." ?

drowsy needle Aug 24, 2025, 5:08 PM

#

agile cloud Hello, why I kept receiving "Connecting to Arena has failed. Please try again la...

Would you mind creating a post in #1343291835845578853 and sharing more details? Are all modes & models resulting in this error? Does a new browser help? etc.

shell cape Aug 24, 2025, 5:08 PM

#

tawdry socket Does anyone know what was the mistral medium's name before it was relvealed on l...

it was just added, it never went through stealth. mistral has never put a model through stealth before to my knowledge.

agile cloud Aug 24, 2025, 5:15 PM

#

drowsy needle Would you mind creating a post in <#1343291835845578853> and sharing more detail...

I had opened a post in #1343291835845578853

broken rune Aug 24, 2025, 5:47 PM

#

hi

#

for all peope l

reef torrent Aug 24, 2025, 6:37 PM

#

hi

manic frost Aug 24, 2025, 7:47 PM

#

Can you say Hi in #general and not here? Thanks

indigo onyx Aug 24, 2025, 10:43 PM

#

HI , JUST CURIOUS ABOUT ai

drowsy needle Aug 24, 2025, 10:51 PM

#

manic frost Can you say Hi in <#1340554757827461211> and not here? Thanks

We're going to look into making changes about this soon btw

tender sigil Aug 24, 2025, 11:13 PM

#

manic frost Can you say Hi in <#1340554757827461211> and not here? Thanks

seriously, no idea what it is about this channel specifically

velvet acorn Aug 25, 2025, 1:25 AM

#

hello

red belfry Aug 25, 2025, 2:07 AM

#

Hello everyone

past python Aug 25, 2025, 2:09 AM

#

hi

arctic cypress Aug 25, 2025, 3:38 AM

#

hello

swift fractal Aug 25, 2025, 4:05 AM

#

Hello

viral current Aug 25, 2025, 5:01 AM

#

Hello wave_animated

modern compass Aug 25, 2025, 6:12 AM

#

fair sleet Aug 25, 2025, 6:42 AM

#

👋

hidden sentinel Aug 25, 2025, 6:46 AM

#

hello edit

thorn osprey Aug 25, 2025, 2:32 PM

#

oi @drowsy needle when DeepSeek-V3.1 will appear in leaderboard?

upper leaf Aug 25, 2025, 4:18 PM

#

Hello everyone 🤟

drowsy needle Aug 25, 2025, 4:26 PM

#

thorn osprey oi <@283397944160550928> when DeepSeek-V3.1 will appear in leaderboard?

I'd say when it has collected enough votes for us to update the leaderboards!

winged juniper Aug 25, 2025, 4:32 PM

#

Hi Guys ✌️

thorn osprey Aug 25, 2025, 4:39 PM

#

drowsy needle I'd say when it has collected enough votes for us to update the leaderboards!

got it 👍

forest onyx Aug 25, 2025, 5:05 PM

#

helo

languid hare Aug 25, 2025, 5:25 PM

#

hi i listen from ai news and found this cool thing and i come here

drowsy needle Aug 25, 2025, 8:35 PM

#

You'll want to learn how to use Video Arena here: #1397655624103493813

ember umbra Aug 25, 2025, 9:56 PM

#

hi

#

Hello! Since today, I have been unable to create any images. Is there a problem with the platform?

mellow zealot Aug 25, 2025, 10:30 PM

#

Pls I am looking for Nano banana ai model

drowsy needle Aug 25, 2025, 10:40 PM

#

mellow zealot Pls I am looking for Nano banana ai model

More information on nano-banana can be found here: #nano-banana message

drowsy needle Aug 25, 2025, 10:40 PM

#

ember umbra Hello! Since today, I have been unable to create any images. Is there a problem ...

There shouldn't be, would you mind creating a post in #1343291835845578853 with more details about what's going wrong?

mellow zealot Aug 25, 2025, 10:42 PM

#

I want to improve images and maintain the model character for my online content, and make it humanized

pseudo adder Aug 26, 2025, 12:17 AM

#

First of all, wanna thank the devs. What an ingenious idea that's not only a blast for everybody to play with, but by its nature accelerates the living heck out of AI. Love it. I hope this battle goes on until the AIs themselves are in here arguing for votes!

fervent shadow Aug 26, 2025, 12:22 AM

#

hello

near spoke Aug 26, 2025, 3:48 AM

#

hi

rough pilot Aug 26, 2025, 6:28 AM

#

hi

bold robin Aug 26, 2025, 7:43 AM

#

fathom ice Aug 26, 2025, 9:40 AM

#

Hello

severe hornet Aug 26, 2025, 11:12 AM

#

Hello, just came from semrush newsletter

abstract bolt Aug 26, 2025, 12:07 PM

#

hola

trim mauve Aug 26, 2025, 12:52 PM

#

hello

lucid pawn Aug 26, 2025, 1:12 PM

#

hi

lapis tendon Aug 26, 2025, 3:23 PM

#

Hello

burnt root Aug 26, 2025, 3:24 PM

#

Hello

hollow imp Aug 26, 2025, 3:46 PM

#

Hi,I'm want to make a commercial videos

fathom osprey Aug 26, 2025, 3:52 PM

#

Hi, i'm an artist and want to test AI

drowsy needle Aug 26, 2025, 4:03 PM

#

fathom osprey Hi, i'm an artist and want to test AI

You've come to the right place! Be sure to check out #1397655624103493813 if you're looking to use our Video Arena

spring orbit Aug 26, 2025, 4:05 PM

#

Hello!

green fossil Aug 26, 2025, 4:43 PM

#

Hi there! battle3d

vernal wagon Aug 26, 2025, 4:51 PM

#

Hello to all, want to learn more about IA

twin valve Aug 26, 2025, 6:11 PM

#

@drowsy needle can we have a new channel to discuss technicalities about the leaderboard since this one has become a landing channel?

drowsy needle Aug 26, 2025, 6:37 PM

#

twin valve <@283397944160550928> can we have a new channel to discuss technicalities about ...

I actually JUST spotted what I think is the problem for why people are landing here.

#

I agree that we'd like this channel for in-depth discussion related to leaderboards and all of the "hey and hi" should be in a place in #general

#

Pretty sure the Server Guide was the culprit as Check out #leaderboards was placed before Say hey in #general. This has now been swapped so we'll see if the issue persists.

chilly laurel Aug 26, 2025, 7:04 PM

#

Hello, Hola, 你好, नमस्ते, مرحبا

#

Yup the server Guide got me

tender sigil Aug 26, 2025, 7:45 PM

#

we finally got down to the bottom of the mystery 🙏

#

where do y’all think DeepSeek 3.1 will be landing when the leaderboards are updated next ?

formal orbit Aug 26, 2025, 7:48 PM

#

Hello

tender sigil Aug 26, 2025, 7:53 PM

#

formal orbit Hello

no answer the question

drowsy needle Aug 26, 2025, 7:54 PM

#

tender sigil we finally got down to the bottom of the mystery 🙏

blobfingerscrossed I think we did!

twin valve Aug 26, 2025, 8:28 PM

#

tender sigil where do y’all think DeepSeek 3.1 will be landing when the leaderboards are upda...

I am not sure, I got it a couple of times in battle mode. I think without style control (fight me, style is important for humans - unfortunately) it should land around r1 05-28. If not immediately, then after a while. Initial ratings could be pumped after all. Maybe a bit more than glm-4.5 but just a bit

tender sigil Aug 26, 2025, 11:42 PM

#

style control can be a bit tricky to account for at times, I do also learn more from the style control removed leaderboards as they’re a reflection of pure user preference

#

I like 3.1 more than the latest version of r1, and if I’m correct it has higher compute power as well, so it’s pretty easy to see it debuting into the top 5

cerulean latch Aug 27, 2025, 8:27 AM

#

hi

whole pivot Aug 27, 2025, 8:58 AM

#

Hello

lost summit Aug 27, 2025, 11:23 AM

#

/image-to-video

glossy viper Aug 27, 2025, 12:06 PM

#

Hi !

zealous cave Aug 27, 2025, 12:18 PM

#

hi

drifting hornet Aug 27, 2025, 12:37 PM

#

lost summit /image-to-video

Please use #video-arena-1 #video-arena-2 #video-arena-3 for your creations. Check #1397655624103493813 to learn how.

cerulean rapids Aug 27, 2025, 2:33 PM

#

Hello ;]

lyric lion Aug 27, 2025, 2:37 PM

#

hi

drowsy needle Aug 27, 2025, 4:07 PM

#

pikaconfused why are people still saying hello here

pallid thistle Aug 27, 2025, 5:19 PM

#

hi

wide ether Aug 27, 2025, 5:37 PM

#

drowsy needle <:pikaconfused:398202117493620740> why are people still saying hello here

You should maybe rename leaderboards to general 😜

drowsy needle Aug 27, 2025, 5:38 PM

#

wide ether You should maybe rename leaderboards to general 😜

It's been considered doggolul I was really hoping the changes to onboarding would help.

tender sigil Aug 27, 2025, 6:16 PM

#

is there a channel new members are initially placed in when they like first join the server? sometimes on startup it opens a specific channel, which if it’s #leaderboards that might make sense ?

drowsy needle Aug 27, 2025, 6:20 PM

#

tender sigil is there a channel new members are initially placed in when they like first join...

When you first join you're sent to the Server Guide. Where the Getting Started section does mention (in order).. 1) Say hey in #general 2) see what people are saying in #leaderboards then so on.

I think I'll move leaderboard even further down the list and see if that helps.

wide ether Aug 27, 2025, 8:19 PM

#

drowsy needle When you first join you're sent to the `Server Guide`. Where the Getting Started...

Or you could just remove the leaderboards text channel completely, because it seems a bit pointless to me. People are just using it for general discussion, but there is already a "General" section for that.

jaunty cave Aug 28, 2025, 8:15 AM

#

wide ether Or you could just remove the leaderboards text channel completely, because it se...

never delete leaderboards!

wide ether Aug 28, 2025, 9:28 AM

#

jaunty cave never delete leaderboards!

But it's unnecessary.

robust zodiac Aug 28, 2025, 10:45 AM

#

Adding a message cooldown maybe

#

To discourage off topic spamming/chatting

marble bluff Aug 28, 2025, 2:19 PM

#

Please use #video-arena-1 #video-arena-2 #video-arena-3 for your creations. Check #1397655624103493813

jaunty cave Aug 28, 2025, 4:51 PM

#

wide ether But it's unnecessary.

🙁

#

Did ya'll see the nano-banana leaderboard launch? There was a 170 point gap between first and second. I don't think I've ever seen a gap that large in any leaderboard, even including things like sports and chess. It reflects a level of dominance that's basically unheard of

frosty wadi Aug 28, 2025, 5:08 PM

#

Felicitacion

jaunty cave Aug 28, 2025, 5:21 PM

#

frosty wadi Felicitacion

hi, what's your favorite and least favorite thing about the leaderboards at https://lmarena.ai/leaderboard/?

slate compass Aug 28, 2025, 8:12 PM

#

pretty surprised that 3.1 didn't appear on the last lb update - is it still being tested on here?

twin valve Aug 28, 2025, 8:53 PM

#

wide ether Or you could just remove the leaderboards text channel completely, because it se...

not really, general is much more spammy.

jaunty cave Aug 28, 2025, 11:35 PM

#

MAI-1 on da leaderboard today pretty good for a first shot imo

slate compass Aug 28, 2025, 11:54 PM

#

jaunty cave Did ya'll see the nano-banana leaderboard launch? There was a 170 point gap betw...

Google seems to be establishing a habit of doing that

jaunty cave Aug 28, 2025, 11:59 PM

#

you mean giant leads? on text even without style control their lead is 33. To put it in an Elo perspective, a lead of 33 pts translates to a 54.7% win chance of 1st place vs second place.

170 points means a 72.68% chance
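
Those percentages follow from the standard Elo expected-score formula. A quick sketch, assuming the arena scores use the usual 400-point logistic scale (the function name is made up for illustration):

```python
def elo_win_probability(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated side for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


for gap in (33, 170):
    print(f"{gap:>3} pts -> {elo_win_probability(gap):.1%}")
# 33 pts -> 54.7%
# 170 pts -> 72.7%
```

A gap of 0 gives exactly 50%, and every additional 400 points multiplies the odds of winning by 10.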

slate compass Aug 29, 2025, 12:10 AM

#

jaunty cave you mean giant leads? on text even without style control their lead is 33. To pu...

i think the gemini 2.5 pro experimental release was much more overpowered when it first hit arena

#

not now

gentle zephyr Aug 29, 2025, 8:37 AM

#

hello!

inner finch Aug 29, 2025, 8:52 AM

#

Hi

jaunty cave Aug 29, 2025, 9:23 AM

#

hi @gentle zephyr and @inner finch, welcome to the leaderboard channel, what are your thoughts on the leaderboards? https://lmarena.ai/leaderboard

molten stone Aug 29, 2025, 2:03 PM

#

hello

drowsy needle Aug 29, 2025, 4:23 PM

#

You're looking for Video Arena, check out #1397655624103493813 for more info on how to use

tropic cloud Aug 29, 2025, 4:43 PM

#

hello

short yarrow Aug 29, 2025, 6:24 PM

#

Hey

jaunty cave Aug 29, 2025, 8:46 PM

#

slate compass pretty surprised that 3.1 didn't appear on the last lb update - is it still bein...

They got published today 🙂

tender sigil Aug 29, 2025, 9:21 PM

#

interesting trend in Chinese (Alibaba & DeepSeek) “thinking” models having weaker performances than their “non-thinking” counterparts

jaunty cave Aug 29, 2025, 9:58 PM

#

tender sigil interesting trend in Chinese (Alibaba & DeepSeek) “thinking” models having weake...

qwen and deepseek:

vale crow Aug 30, 2025, 7:18 AM

#

@everyone suggest me best ai in lmarena for generating essays

hollow crown Aug 30, 2025, 7:23 AM

#

Can anyone send me the link to use gpt image 1 model in lmarena ai

#

?

#

🙏🏻

vale crow Aug 30, 2025, 7:23 AM

#

@vale crow

#

/home/raunak/.zen/dqek0907.Default (release)/chrome/Nebula/content

#

sudo pacman -S lmarena

vale crow Aug 30, 2025, 7:24 AM

#

vale crow sudo pacman -S lmarena

@hollow crown

hollow crown Aug 30, 2025, 7:24 AM

#

Can anyone send me the link to use gpt image 1 model in lmarena ai please? 🙏🏻

vale crow Aug 30, 2025, 7:24 AM

#

i said

#

sudo pacman -S lmarena

hollow crown Aug 30, 2025, 7:24 AM

#

Send me the proper link

vale crow Aug 30, 2025, 7:24 AM

#

it is available on arch linux

#

i use it there

hollow crown Aug 30, 2025, 7:25 AM

#

In lmarena ai it's available just like nano Banana model

#

But I couldn't find it

vale crow Aug 30, 2025, 7:26 AM

#

open yout terminal emulator on ur arch then paste command >sudo pacman -S lmarena --gpt-img-1

hollow crown Aug 30, 2025, 7:26 AM

#

I'm on windows so I guess it'll not work

vale crow Aug 30, 2025, 7:26 AM

#

wsl

#

use wsl @hollow crown

hollow crown Aug 30, 2025, 7:27 AM

#

Wal?

#

Wsl?

vale crow Aug 30, 2025, 7:27 AM

#

https://learn.microsoft.com/en-us/windows/wsl/install

Install WSL

Install Windows Subsystem for Linux with the command, wsl --install. Use a Bash terminal on your Windows machine run by your preferred Linux distribution - Ubuntu, Debian, SUSE, Kali, Fedora, Pengwin, Alpine, and more are available.

#

what r yr hardware specs

#

@hollow crown

hollow crown Aug 30, 2025, 7:30 AM

#

powered by Ryzen 5 8 GB Ram and 512 GB nvme ssd no dedicated graphics card

vale crow Aug 30, 2025, 7:30 AM

#

so dont do this , it need dGPU

#

wait

#

i m finding another way

hollow crown Aug 30, 2025, 7:30 AM

#

That's why I'm asking lmarena because it's available in battle mode

vale crow Aug 30, 2025, 7:33 AM

#

@hollow crown no way to run it directly on lmarena cloud i think u must use lmstudio

#

try lmstudio

#

it dont need dGPU

#

on top of it it's available on windows

#

@hollow crown

#

@hollow crown if u find it is , increase my knowledge

lunar kayak Aug 30, 2025, 10:00 AM

#

Please help me how can I generate a video? Step by step thank you

twin valve Aug 30, 2025, 10:21 AM

#

slowly every request will assume that people will reply like LLMs.

slim rampart Aug 30, 2025, 1:20 PM

#

What’s trending this season with YouTube Shorts? Got any new ideas?

knotty juniper Aug 30, 2025, 3:07 PM

#

hello

lilac current Aug 30, 2025, 9:11 PM

#

Do these videos have sound?

drowsy needle Aug 30, 2025, 9:26 PM

#

lilac current Do these videos have sound?

It's random, some video models do and others don't

drifting hornet Aug 30, 2025, 10:17 PM

#

lunar kayak Please help me how can I generate a video? Step by step thank you

Hello! Please, check #1397655624103493813 to learn how to use the bot and generate videos in #video-arena-1 #video-arena-2 #video-arena-3

ivory shoal Aug 30, 2025, 10:30 PM

#

Hey why does lmarena not have a board for music generation

#

e.g. https://arxiv.org/abs/2506.19085

arXiv.org

Benchmarking Music Generation Models and Metrics via Human Preferen...

Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In...

lunar kayak Aug 30, 2025, 10:35 PM

#

Thank you, guys 🙏🏻