#leaderboards

vague plume
#

cool to see 4.5 in first place imo

keen plaza
vague plume
keen plaza
vague plume
balmy moon
#

claude 3.7 thinking still not in there?

vague plume
coral moat
#

It seems extremely easy to game the lm dashboard and just upvote GPT 4.5 content. Why do y'all put so much weight on this leaderboard that seems pretty fickle?

wicked sapphire
#

it's just one metric of many. And I think the people who try and game the leaderboard are pretty few in number.

coral moat
coral moat
pseudo rivet
coral moat
coral moat
coral moat
# vague plume cool to see 4.5 in first place imo

Curious how 3k votes was determined as a big enough sample size to put in the leaderboards since the rest of the top 5 have 10k+ votes. Maybe there were a lot of votes where 4.5 beat other top 5 models?

vague plume
#

I think a few other models that showed up in the leaderboard had ~6k - 10k votes before first showing up there. Might be the fact it's pricey, might be hype amount, idk

#

In any case it's a win for the more casual LLM users/the general public imo, it'll be on Plus this week

left zephyr
#

@west lodge I DM'd you for a collaboration.

fast crown
#

could we get the servers to have different logos?

vague plume
glacial glacier
vague plume
hallow comet
#

in style control there's more of a substantial gap

coral moat
#

These Elo ratings are pretty similar; for example, 4.5 only wins 54% of the time against Google's most advanced model
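To put a number on "pretty similar": under the standard Elo logistic curve, a win rate maps directly to a rating gap. A minimal sketch, assuming the usual 400-point scale (the function names are made up for illustration, and the leaderboard's exact scaling is not confirmed here):

```python
import math

def win_probability(rating_gap: float) -> float:
    """Expected win rate of the higher-rated model under the Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

def rating_gap(win_rate: float) -> float:
    """Inverse mapping: rating gap implied by a head-to-head win rate (0 < win_rate < 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# A 54% win rate corresponds to a gap of roughly 28 points.
print(round(rating_gap(0.54), 1))
print(round(win_probability(28), 3))
```

So a 54% head-to-head win rate is a gap of only a few dozen points, which is easy to lose inside the confidence intervals.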

coral moat
#

any theories on how this is possible in the span of a day via a legitimate process instead of gaming by that twitter guy??

blissful pulsar
#

elon knows in real time what the estimate for the arena score will be

#

like there's been over 1000 votes in the last day, so every few minutes he receives a new vote so can know he'll probably be #1 when it enters the arena

coral moat
surreal drum
craggy zenith
#

hi

low bough
#

Also, he's the richest guy in the world. Some donations to LMSYS team may not be game changer but may give extra perks.

blissful pulsar
#

he is not faking votes or bribing anyone, he is obviously just optimizing the model for user preferences, which is a fine thing to optimize for. he is just cringe about it

blissful grotto
#

OK stupid noob question, is there a way to recreate leaderboards from the past...

I see that https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=C5H_wlbqGwCJ

Reaches down into a google cloud storage. I'm trying to look at trends in the Arena scores over time

url = "https://storage.googleapis.com/arena_external_data/public/clean_battle_20240814_public.json"

But what are the other valid things to enter, it would be great to get the current ones, is there a listing somewhere?

I tried to just list it, but of course it looks like you need authentication

gcloud storage ls --recursive gs://arena_external_data/public
ERROR: (gcloud.storage.ls) HTTPError 403: rich@tne.ai does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist). This command is authenticated as rich@tne.ai which is the active account specified by the [core/account] property.

# but accessing individual objects works
gcloud storage ls --recursive gs://arena_external_data/public/clean_battle_20240814_public.json
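Since listing the bucket is denied but fetching a known object works, one workaround is to guess dated file names and probe them one by one. This is only a sketch: the single confirmed file is `clean_battle_20240814_public.json`, and the idea that other snapshots follow the same `clean_battle_YYYYMMDD_public.json` pattern is an assumption. The helper names are hypothetical.

```python
from datetime import date, timedelta
import urllib.error
import urllib.request

BASE = "https://storage.googleapis.com/arena_external_data/public"

def candidate_urls(start: date, end: date):
    """Yield one clean_battle_YYYYMMDD_public.json URL per day in [start, end]."""
    day = start
    while day <= end:
        yield f"{BASE}/clean_battle_{day:%Y%m%d}_public.json"
        day += timedelta(days=1)

def exists(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; True on a 2xx response, False on an HTTP error (403/404)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        return False

# Usage (makes network requests, so it is left commented out):
# for url in candidate_urls(date(2024, 8, 1), date(2024, 8, 31)):
#     if exists(url):
#         print(url)
```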
low bough
blissful pulsar
#

usually when you have cracked employees you have to keep them by not pushing them to be obviously fraudulent. talented ppl just quit the job when they're being asked to be obviously fraudulent, bc they want their cracked engineer status points which come from not being caught being an obvious fraud

#

just doesn't seem likely to me

coral moat
#

Thanks guys for your thoughts on this. I agree with Bayesian that this seems pretty hard to game but I also agree with Larry that he is not a good boy and cringe most of the time.

twilit relic
#

only kalshi gamblers here

surreal drum
#

I hate elon musk too, but i cant lie, grok can cook (not to mention he didn’t develop it at all). It's a great balance of precise reasoning and proper formatting + clarity. The problem with most other models doesn't lie in any sole benchmark; usually, it's imbalance, it's due to prioritizing Markdown formatting over a precise response (or vice versa)
GPT 4.5 also seems to strike this balance well.
Idk if grok can overtake it. we'll see

#

You’ll never catch me paying $30/mo for grok though. No way

low bough
#

I don't know, man, I've tried grok and gemini many times when my GPT PLUS acc was out of limits. They just don't satisfy my needs. The only time another LLM surprised me more than GPT was Sonnet 3.7. It was the first LLM to answer my ultra niche prompt. Grok does not understand it.

coral moat
low bough
#

Drawing a Hanning window (context: signal processing) using ASCII given the constraints and explaining the formula. Every model up to last month used to draw triangles, which is BS.
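For reference, the expected shape is easy to produce directly. A minimal sketch, assuming the symmetric Hann definition w[n] = 0.5 * (1 - cos(2πn / (N - 1))) and a 40-character line limit; the plotting helper is made up for illustration:

```python
import math

def hann(n_points: int) -> list[float]:
    """Symmetric Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    if n_points < 2:
        raise ValueError("need at least 2 points")
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_points - 1)))
            for n in range(n_points)]

def ascii_plot(values: list[float], height: int = 10) -> str:
    """Render values in [0, 1] as a column chart, one character column per sample."""
    rows = []
    for level in range(height, 0, -1):
        threshold = (level - 0.5) / height
        rows.append("".join("#" if v >= threshold else " " for v in values).rstrip())
    rows.append("-" * len(values))
    return "\n".join(rows)

print(ascii_plot(hann(40)))
```

The output is a smooth bell that flattens toward both ends, not a triangle with straight edges.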

coral moat
#

That is pretty niche!!

blissful pulsar
#

the non-reasoning models will often be worse at reasoning tasks like coding / math

#

but they're as good or better at non-reasoning tasks

random jewel
#

I don't think that grok should be the best in coding... Been reading about advanced non-Elo algos yesterday, a/b algos are a part of my trailblazing successes...

dapper gull
low bough
#

Oh god 😄 Not sure if you're joking but it's funny 😄

dapper gull
low bough
#

To be fair this is more accurate 😄

#

However, this is what it should make https://en.wikipedia.org/wiki/Window_function

In signal processing and statistics, a window function (also known as an apodization function or tapering function) is a mathematical function that is zero-valued outside of some chosen interval. Typically, window functions are symmetric around the middle of the interval, approach a maximum in the middle, and taper away from the middle. Mathemat...

dapper gull
#

Like everytime for coding stuff Sonnet is just better idk. Even tho other models are close on the leaderboard

torpid matrix
#

im being forced by this server to write random stuff here to remove these annoying popups.

coral moat
#

What popups?

unborn plover
thick totem
#

Hi~all, Does the Visual Language Multimodal Leaderboard include other languages besides English and Chinese? For example, French and Spanish?

west lodge
#

unfortunately we have too few votes in other languages for us to compile the leaderboard. could you say more why you're interested in seeing other languages?

thick totem
#

@west lodge I am curious about the evaluation of the multimodal model. Do the votes in the category "overall" only include Chinese and English, and exclude votes outside of these two languages?

frozen creek
#

Does anyone know the AI big model rankings that polish Chinese papers?

#

Why can't GPT 4.5 take part in model battles?

#

Why can't GPT 4.5 compete?

#

Why can't gpt 4.5 compete

waxen portal
#

hello

coral moat
#

Any safeguards to prevent this type of jailbreak from influencing the leaderboard rankings?

wide schooner
unborn plover
lucid thistle
wide schooner
lucid thistle
#

At the prashis, use "it" right after think would be kinda weird, anyway great Pfp xD

lofty pasture
# coral moat Any safeguards to prevent this type of jailbreak from influencing the leaderboar...

Without asking the model those questions, you can know its name easily. Each model has a specific style, mistakes and answers, so you can't prevent it. For example chatgpt 4o answers are like

1️⃣

--

2️⃣

--

3️⃣
With some kind of emojis, and I can distinguish 4o from gpt4.5 by the accent.
The only method to prevent it is to play fair and to vote for the model that deserves it.

sullen tiger
#

Is the grok-3-preview in the leaderboard actually grok3 or grok3-think?

true yoke
#

I wonder, why claude-3-7-sonnet-20250219-thinking-32k is on the chat available but is not on the leaderboard 🤔

marsh iron
#

when can i chat with o1 generate image with dall e 3 generate video with veo2

strong knot
marsh iron
#

in arena battle under chat now you can see "Expand to see the descriptions of 90 models", in direct chat you also see it. what do you think that means?

fierce bear
formal shuttle
#

Hi everyone

#

There is a problem with the website there is high traffic

hallow comet
#

Hi, just an amateur AI user here who has taken a gander at the leaderboard off and on. NGL, the ranking system confuses me.

minor acorn
#

hello

wide schooner
# hallow comet Hi, just an amateur AI user here who has taken a gander at the leaderboard off a...

The Bradley–Terry model is a probability model for the outcome of pairwise comparisons between items, teams, or objects. Given a pair of items i and j drawn from some population, it estimates the probability that the pairwise comparison i > j turns out true, as

$$\Pr(i > j) = \frac{p_i}{p_i + p_j}$$

where p_i is a positive real-valued score assigned to individual i. The comparison i ...
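In plain terms: the chance that i beats j is i's score divided by the sum of both scores, Pr(i > j) = p_i / (p_i + p_j). A small sketch of how that connects to Elo-style numbers, assuming the common mapping p = 10^(rating / 400); the two ratings below are made up:

```python
def bt_win_probability(p_i: float, p_j: float) -> float:
    """Bradley-Terry: probability that item i beats item j, given positive scores."""
    if p_i <= 0 or p_j <= 0:
        raise ValueError("scores must be positive")
    return p_i / (p_i + p_j)

def strength_from_rating(rating: float, scale: float = 400.0) -> float:
    """Map an Elo-style rating to a Bradley-Terry score (assumed convention)."""
    return 10.0 ** (rating / scale)

# Hypothetical ratings, 100 points apart.
model_a, model_b = 1300.0, 1200.0
p = bt_win_probability(strength_from_rating(model_a), strength_from_rating(model_b))
print(round(p, 3))  # about 0.64
```

Only the gap between ratings matters, so adding the same constant to every rating leaves all win probabilities unchanged.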

#

Perhaps you can paste that into your favorite AI model and have it simplify it for you

old pagoda
#

I have a question concerning the AI models used on https://lmarena.ai/. It is a great opportunity to check the models before installing. I was testing command-r-08-2024. It gave very good output for my prompt and was reproducible with similar results. Then I installed the same model through ollama (https://ollama.com/library/command-r), but it gave very poor results for the same prompt. Do you have any suggestions? Thanks!

twilit echo
old pagoda
stable basin
#

Hello

timid wind
#

will grok text-to-image be added to text-to-image LB?

mystic dock
strong knot
#

I thought it was flux dev with a lora

glacial glacier
fleet stag
#

Hi all!

I was wondering if Claude 3.7 Sonnet - Thinking would be added to the leaderboards soon?

I think it’s important to differentiate the thinking model from the non-thinking model. In my opinion, there’s a good chance that 3.7 Sonnet Thinking could even be #1 in coding!

Best,
Mason

fleet stag
mental bay
#

Hi is there a way to get leaderboard changes programmatically? The notebook mentions a file that’s 6 mo old and simple playwright script hits cloudflare…
I just want the data for the little dashboard I have for myself that aggregates llm related news etc

mental bay
#

why 😂 ? is it against the rules somehow?

unborn plover
#

You wont get the realtime data

#

Possibly to prevent cheating

#

At least make it harder

mental bay
#

huh? if any human can see it how does that prevent cheating? i'm fine if it's delayed like a day, too...

glacial glacier
#

(ignore the unused code)

crude hawk
#

if that is the case, ig there isn't really a 'live' leaderboard per se - just the accumulation of new votes / data

azure stirrup
#

When do the arena score leaderboards update?

#

It doesn't appear to be done in realtime and instead manually?

wide schooner
#

I assume they will unveil their new gemma models

supple glade
#

hello, I want to know how to add the company's model to the competition ranking.

median briar
true yoke
#

i don't understand why a model that just was released is included, but still no claude-3-7-sonnet-thinking. Or qwq-32b

frozen creek
#

What is the difference between Gemma 3 and Gemini 2, which one is better

#

What is the difference between Gemma 3 and Gemini 2, which one is better

true yoke
lofty pasture
delicate sable
#

i think the gemma team wanted it released (they are really proud of their arena score)

#

and i'm guessing claude and alibaba didnt make those requests? just a hunch

mystic dock
#

on the coding leaderboard its quite low tho

#

(with style control)

#

without its actually on rank 15

#

i want qwq on the leaderboard

brave herald
lofty pasture
#

Oh it told me now that it was zizou 10

brave herald
thorny ginkgo
#

I tested gemma 27b in AI studio. Is it the same in lmarena? It is not even close to Claude and Deepseek

stark grove
#

i noticed that too, something is off there

glacial glacier
#

google's best private models have only around a 56% chance of winning over gemma 3 27b

crude hawk
wispy pumice
crude hawk
#

yeah i know what it looks like, and skimmed the wiki page to get a sense of what's going on - but yeah at a technical level, it means nothing to me ha

#

i've tried prompting with both Hann and Hanning Window. With the former, some LLMs point out that the latter is the more conventional phrasing, others say the former 🤷‍♂️ either way, they know what you mean so it's kinda redundant to the task imo

#

4o initially says one thing (Hanning), then the other (Hann) in the very next response lol

blissful pulsar
#

is there an api to get the latest lb rankings

#

i'd like to get alerted when it gets updated basically

fleet stag
#

Not sure if anyone knows, but I was curious, is the “Grok-3-Preview-02-24” the thinking or non-thinking model?

low bough
#

Interestingly, o3-mini blew my mind when I first asked it to do this with a maximum of 40 characters per line. It chose 27 as the desired maximum for "aesthetics". Which is weird, because models usually go with whatever maximum you give them. This was the first time I sensed some kind of "free will" coming from a model. Also, it drew the Hanning window perfectly. Better than I would.

#

However, when I ran the same prompt on the ChatGPT website, even when using the same model, the output was much worse. Why that would be, I don't know. Maybe they limit the thinking tokens based on current server capabilities?

#

I hope they don't run prompts from LMARENA with higher performance limitations. That would ruin the benchmark.

hallow comet
low bough
#

I have plus

hallow comet
#

oh did u use o3 mini high?

low bough
#

Yes

hallow comet
low bough
#

No, maybe this is the cause. Will check it 🙂

hallow comet
#

afaik lm arena has o3 mini medium by default

low bough
#

Perfection of o3-mini-high

hallow comet
low bough
#

o3-mini-medium great too

#

When trying this on arena now (o3-mini) it returns garbage

#

There really is a replication problem. Performance is inconsistent.

hallow comet
low bough
#

Could you define "sampling"?

hallow comet
#

temperature, top_p, etc

low bough
#

The params are not the same on gpt web and arena?

#

Or are they trying performance on different params?

hallow comet
#

models arent even deterministic anyway with greedy decoding. with the same seed youll get basically deterministic results (though it depends, and it might not be guaranteed)
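A toy illustration of what "sampling" params do and why a fixed seed makes draws repeatable. This is only a sketch with made-up logits for four candidate tokens; real served models have other sources of nondeterminism (hardware, batching) that this does not capture, and nothing here reflects how any provider or the arena actually configures decoding.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick an index from logits using temperature scaling and nucleus (top-p) filtering."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for four candidate tokens

def generate(seed, steps=10):
    """Draw a short sequence of token indices from a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature=0.8, top_p=0.9, rng=rng)
            for _ in range(steps)]

print(generate(1) == generate(1))  # True: same seed, same draws
print(generate(1), generate(2))    # different seeds will usually differ
```

Lower temperature and lower top_p both concentrate the draws on the top tokens, so two front ends with different defaults can behave quite differently on the same prompt.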

low bough
#

I would hope they would "set.seed(X)" in the script 😄

hallow comet
low bough
#

Still I would expect models to perform slightly worse or better due to randomness and not complete collapse

blissful pulsar
hallow comet
blissful pulsar
blissful pulsar
#

but if anyone knows if there is a way to get quick data when the lb updates lmk

coral moat
#

Thoughts on Google Gemma 3?

coarse jacinth
#

Will the next multimodal Gemini Flash now be added to the text-to-image leaderboard? Doesn't seem to be in the mix currently

wispy pumice
wispy pumice
surreal drum
blissful pulsar
#

ty mate

open scaffold
#

Anyone knows why there is no qwq-32b in the leaderboard?

terse mirage
crude hawk
#

thanks for clarifying / explaining - makes sense now why the LLMs also seemingly get it confused in their explanations of the terminology (but they still know what you mean in the context of the 'window' and task.. most just aren't great at ASCII art .. or get that it's meant to be a bell shape, but instead go for the easier - but wrong - triangle shape... ig anyway ha )

true yoke
surreal drum
terse mirage
terse mirage
steady umbra
#

How is 4.5-preview ranked so high? I've heard a lot of people say that it's bad, is that not the case?

mystic dock
#

sometimes, because gpt4.5 is such a large model, and has a lot of knowledge stored in its weights, it can find the error in things that other models sometimes can't find

#

it's also the model which hallucinates the least of all models, simply because a lot of stuff is stored in its weights, such as even all the restaurants of my local place, no other model has that much knowledge stored in its weights

#

the smaller models can of course use web-search, but that is something different than directly knowing something, which can also be better for explaining things

thorn birch
#

Is the grok3 on leaderboard thinking model?

#

It doesn't seem that it is. Are they saving the other one? Or it's one of the code names in testing?

fossil cipher
#

Hello

blissful pulsar
#

The current grok 3 thinking model is called grok 3 thinking beta, it’s not trained as much as they'd want, so we'll probably get grok 3 thinking in a few weeks and it'll probably be very good

half stream
#

Was just thinking about this. Maybe we should also consider emoji usage for style control. Some models are notorious for using them more than others.

Oh, and also block quotes.

#

And would like to know the coefficients of those two factors. 👀

lofty pasture
#

Am I the only one feeling that
Specter is the new Gemini 2.0 flash thinking (experimental) updated by google yesterday

Phantom is the old Gemini 2.0 flash thinking experimental

Just a feeling ... 🥲🥲💔

brave herald
lofty pasture
# brave herald Phantom is also a new model, not old

Phantom has the same problems as the old Gemini thinking, while the new version looks improved and has fewer mistakes and problems, but I am not sure, just a feeling, because when I tried the new Gemini thinking on the app it looked improved. Maybe google wanted to compare them secretly.

wispy pumice
fringe tree
#

hi guys, thanks for this great work! I would like to ask you about the "early-grok-3" model. This model isn't on the market, is it? Per the info on their site, there is no API yet for grok3. Also I can't find it on OpenRouter or Glama, for example. So my question is: how are people voting on it, and who are the voters (fewer than 5 thousand) who can say whether it is good for coding or not? Thanks!

blissful pulsar
#

It says it’s deprecated which prolly means it wont show up in the arena anymore

#

Seems likely the grok3 available on grok.com is the grok3 preview so early grok 3 may just be gone permanently

#

But it was good at coding, prolly not the best, depends on the task

#

You usually get more out of reasoning models

lofty pasture
weary nebula
#

Do you guys think phantom is gemini 2.0 pro thinking?

blissful pulsar
#

Yeah

#

Uninformed guess but seems likely

twin wharf
#

Interesting how Hunyuan-Turbo is on the leaderboard with only 1473 votes, I thought it had to be over 3k, or did the company request personally?

steady umbra
#

Is there a benchmark/leaderboard for llms with access to web search?

rapid cobalt
#

Does anyone know if the grok answers at the Arena are with Think mode or without?

lofty pasture
blissful pulsar
#

I also misread pro, seems more likely to be another iteration of flash thinking

glacial glacier
#

QwQ has a 1312 elo

glacial glacier
#

how are they saying this when gemini flash is right there

hallow comet
glacial glacier
hallow comet
#

you can run command a more easily than deepseek ig

glacial glacier
#

i guess command a is just the "enterprise" experience

crude hawk
glacial glacier
#

thats what happens when you distill from gemini

hallow comet
#

the base models are very bad though

crude hawk
#

tbf, command-a does seem 'smarter' (though i guess it must lose points to gemma in coding or creative writing or some other tasks i don't prompt for) but only marginally (definitely not 3 times smarter)

#

it also has a longer context window, and i believe is optimised for agentic stuff and RAG.. but yeah still..

hallow comet
#

what i personally found was interesting about gemma 3 outside of the focus on human preference was afaik they revealed what they basically did to gemini 1.5 exp/2 math

crude hawk
# hallow comet

I assume "large IT teacher" is a domain-specific LLM (which is 'teaching' the model being distilled i.e. gemma3), right?

hallow comet
crude hawk
#

but like couldn't they find (build / tune) a math teacher?

#

though there is something special about programming language ig - seems LLMs learn and kinda generalise a lot from it

hallow comet
hallow comet
hallow comet
#

unlike regular synthetic datasets, they train it on 256 probabilities from the gem 2 model - effectively copying a lot more of the "solution" better or worse
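A toy sketch of the difference between training on a teacher's sampled text and training on a slice of its probability distribution. The vocabulary, probabilities, and `k` here are made up, and picking the top-k entries is an assumption for illustration (the message above only says "256 probabilities"):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_distillation_loss(teacher_probs, student_logits, k):
    """Cross-entropy of the student against the teacher's k most likely tokens, renormalized."""
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    mass = sum(teacher_probs[i] for i in top)
    student_probs = softmax(student_logits)
    return -sum(teacher_probs[i] / mass * math.log(student_probs[i]) for i in top)

teacher = [0.60, 0.25, 0.10, 0.05]          # hypothetical next-token distribution
matching_student = [math.log(p) for p in teacher]
off_student = [0.0, 0.0, 0.0, 3.0]          # puts most mass on the wrong token

print(topk_distillation_loss(teacher, matching_student, k=2))
print(topk_distillation_loss(teacher, off_student, k=2))
```

With a single sampled token per position the student only learns "this token was right"; with k probabilities it also learns how plausible the runner-ups were, which is the extra information being copied.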

#

also comparison between gemma 3 base models and qwen 2.5 (released in september 2024)

#

ignore the metric column xd (gemini added it without it really matching)

crude hawk
hallow comet
#

oh i said gem 2 pro but its not known which one they used

hallow comet
frozen creek
#

Is this website ranking accurate? Ranking number one is o3 high?

glacial glacier
#

that said, artificial analysis benchmarks stem and o3 mini is a good stem model

glacial glacier
#

why do we still not have claude thinking on the leaderboard

low bough
#

The text-to-image has so few models. I wonder why we have so many LLMs but so few text-to-image. Harder to manage data? Harder to train? Less financial incentive?

autumn granite
crude hawk
blissful pulsar
#

any guess whether o1-pro is on the lmarena now that it has an api thing

split anvil
#

how often does the leaderboard typically update?

blissful pulsar
#

every week or so

split anvil
#

is there a way to find the history of past leaderboards?

desert gorge
sinful falcon
#

the repo i see

#

doesn't update for leaderboard updates

#

just when they change column names and whatnot

desert gorge
tawdry socket
#

Is o1 pro in the arena yet? People are saying it is most advanced open ai model. Curious how much better it is compared to Gpt 4.5..

blissful pulsar
#

It’d plausibly be worse but prolly pretty similar

tawdry socket
#

What makes you say that? If it is not better, why is OpenAI charging lot more for it?

glacial glacier
#

if you would please consult the leaderboard

split anvil
#

Any info on if it will be ranked?

timid perch
#

will we get o1-pro or GPT-4.5 ranked? or is the price gate too high? i did see 4.5-preview in rankings and in cursor

mystic dock
tawdry socket
#

what about sonnet thinking?

mystic dock
#

Rank #14

hallow iris
glacial glacier
#

Leaderboard update: claude thinking and nemotron 49b

fleet estuary
#

hi

#

guys do you know this chat model called nebula on arena ? i cant find any info on it

autumn granite
# mystic dock Rank #14

#1 for hard prompts with Style Control enabled, but also tied with o1, R1, Grok 3 and gpt-4.5 when taking confidence interval into account.

autumn granite
lofty pasture
fleet estuary
hard tangle
#

Anyone know if any Deep Research models exist in the arena’s Search or just light research ones

vague garden
#

Hi, how are the new models evaluated? The original paper, https://arxiv.org/pdf/2403.04132, evaluates preferences using humans. How is this done for new models, is this also using human preferences? If so, which dataset is it using?

glacial glacier
#

It's an arena

#

People go to vote every day

rapid cobalt
#

How does it work that some models like Nebula aren't on the LB but I get them on battles?

#

also true for march

glacial glacier
#

Well the company has to contact the arena and then it adds them under a pseudonym

rapid cobalt
#

So does it score towards some of the listed models in the leaderboard?

#

I mean, does the pseudonym I see on the battle have a counterpart on the leaderboard?

glacial glacier
rapid cobalt
#

My question is basically if there are models taking part on the battle that are not being scored for on the lb because that would not make much sense from what I understood of how the elo system works

glacial glacier
#

No the anonymous model system is typically used for testing early tunes or checkpoints

#

They can just drop battles that aren't entirely public models

rapid cobalt
#

So if a listed model wins a battle against an anonymous one, how would the elo gain be calculated if the formula takes elo diff into account?

glacial glacier
#

Well then they include all battles

#

Not too hard to reason about

rapid cobalt
glacial glacier
#

The match goes into a DB

rapid cobalt
#

B gains elo and A loses elo, but that gain/loss is based on their current elo and how they performed relative to their expected winrate against each other, right?

#

I understand that the match goes into a DB and LMArena uses Bradley-Terry with bootstrapping to get a closer approximation of models' long-run performance, which makes things fairer between newer models (with lower vote counts) and the ones that already have a lot of votes
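For anyone who wants to see the shape of that computation, here is a minimal sketch of a Bradley-Terry fit plus bootstrap intervals. It is a generic textbook version (the iterative MM update), not LMArena's actual pipeline; the battle counts are made up, and the `prior` pseudo-count and the 1000-point base are my own choices.

```python
import math
import random
from collections import defaultdict

def fit_bradley_terry(battles, models=None, iterations=200, prior=0.1):
    """Estimate strengths from (winner, loser) pairs with the classic MM update.

    `prior` adds a few virtual wins in both directions for every pair, which keeps
    every strength positive even if a model never wins in a resample.
    """
    if models is None:
        models = sorted({m for battle in battles for m in battle})
    wins = defaultdict(float)  # wins[(a, b)] = number of times a beat b
    for a in models:
        for b in models:
            if a != b:
                wins[(a, b)] += prior
    for winner, loser in battles:
        wins[(winner, loser)] += 1.0
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            total_wins = sum(wins[(a, b)] for b in models if b != a)
            denom = sum((wins[(a, b)] + wins[(b, a)]) / (strength[a] + strength[b])
                        for b in models if b != a)
            updated[a] = total_wins / denom
        # Strengths are only defined up to a common factor, so fix the scale.
        geo_mean = math.exp(sum(math.log(v) for v in updated.values()) / len(updated))
        strength = {m: v / geo_mean for m, v in updated.items()}
    return strength

def to_rating(strength, base=1000.0, scale=400.0):
    """Put strengths on an Elo-like scale (base and scale are arbitrary here)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}

def bootstrap_intervals(battles, rounds=100, seed=0):
    """Rough 95% percentile intervals from refitting on resampled battles."""
    rng = random.Random(seed)
    models = sorted({m for battle in battles for m in battle})
    samples = {m: [] for m in models}
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        ratings = to_rating(fit_bradley_terry(resample, models))
        for m in models:
            samples[m].append(ratings[m])
    intervals = {}
    for m, values in samples.items():
        values.sort()
        low = values[int(0.025 * rounds)]
        high = values[min(rounds - 1, int(0.975 * rounds))]
        intervals[m] = (low, high)
    return intervals

# Made-up results between three hypothetical models.
battles = ([("A", "B")] * 60 + [("B", "A")] * 40 +
           [("B", "C")] * 55 + [("C", "B")] * 45 +
           [("A", "C")] * 70 + [("C", "A")] * 30)
print(to_rating(fit_bradley_terry(battles)))
print(bootstrap_intervals(battles))
```

Unlike sequential Elo, the fit uses all battles at once, so the order in which votes arrived does not matter, and the bootstrap spread shows how much a rating could move with a different sample of votes.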

#

But still, for it to go into the DB it should be scoring A against B, and the way it is supposed to do that is to get an approximation (through current elo) of how A and B should perform against each other and then take into account how the result panned out

vague garden
#

Yes this is exactly my question. Also model generated output cannot be automatically evaluated unless it is a classification problem

rapid cobalt
#

A has a probabilistic chance of winning against B based on their elo diff.

vague garden
#

So how can the results of a model be evaluated so quickly? Unless it is simple classification and not generated long answers, which require human eval for reliable preference judgement

rapid cobalt
#

The only way it makes sense is if the pseudonym models are having their elo calculated and updated but just not being openly presented within the leaderboard

vague garden
#

How is win determined?

rapid cobalt
#

Ah I think I ended up answering my own question lol

#

Prob they have elo ratings just it's in the shadow. Bummer.

rapid cobalt
#

One prompt, two answers, user picks his favorite answer. Winner increases elo, loser decreases
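As a sketch of that mechanic, this is the classic online Elo update, where the size of the gain depends on how surprising the result was. The K-factor and ratings are made-up values, and this is the textbook sequential rule rather than whatever the leaderboard actually runs:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B implied by the current rating gap."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, score_a: float, k: float = 4.0):
    """Return new (A, B) ratings; score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Hypothetical ratings: the favorite gains little for an expected win...
print(update(1300.0, 1200.0, score_a=1.0))
# ...and loses more when the underdog pulls off an upset.
print(update(1300.0, 1200.0, score_a=0.0))
```

This also shows why an opponent needs some rating, public or hidden, for the update to be computed at all.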

rapid cobalt
glacial glacier
crude hawk
#

though in most cases.. they just come and go.. presumably the voting data goes to the lab responsible for the anon model, but unless it is performant, seems generally they just pull it and move on

#

like based on this (h/t @hallow comet) nebula is potentially the 38th anon model google has added to the arena.. meta I feel adds anon models just as frequently.. though again, mostly they just come and then quietly go.. ("success has many fathers, but failure is an orphan" kinda thing ha)

hallow comet
#

yeah google put a lot of models on the arena, quite a lot of them never see the light of day (checkpoints and the likes)

vague garden
glacial glacier
#

if youre gonna be a new user at least do your homework

#

browse the arena a bit

#

battle yourself

#

see if you agree with the rankings

vague garden
#

The point is that model eval is about the user vote distribution and extends beyond a single user. LMArena's voting and question selection are not very clear when they release scores for new models

glacial glacier
#

everything we need is already here

hard tangle
#

Again, Anyone know if any Deep Research models exist in the arena’s Search or just light research ones

glacial glacier
#

of course it is

#

nobodys waiting 10 minutes for a report

vague garden
#

Hi Team, I am using the notebook; how come some judges are predominantly the same person? If it were all live users through the webapp, I would expect the number of votes per user (judge) to be fairly uniformly distributed. I am surprised that some users have contributed so many more votes than others

hard tangle
hard tangle
glacial glacier
#

would be cool if so but i doubt lmsys users will wait that long

delicate sable
#

that doesnt make sense

vague garden
delicate sable
#

why would user votes be normally distributed? It's not like people are choosing at random whether to vote every day

#

Some people are gonna vote more than others and its gonna create fatter tails

#

Zipf's law (German pronunciation: [tsɪpf]) is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.
The best known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language:

...

#

kinda looks like this to me 🤷‍♂️

vague garden
delicate sable
#

So you think they should preprocess the data

vague garden
# delicate sable So you think they should preprocess the data

At this point, I am trying to understand how the votes are consolidated, trying to get context behind the numbers. But yes, if the vote is dominated by one or a few select users then the leaderboard becomes less reliable when it comes to representing human preferences (how people feel about the model output). If it is only getting experts on the subject matter to vote, then that is a different path and a different user vote distribution, I guess.
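One way to quantify that concern is to measure how much of the vote comes from the heaviest voters, and to see what a per-judge cap would leave. A minimal sketch on made-up judge IDs; with the notebook data you would pass in one judge ID per battle (the exact field name is an assumption), and the cap is a hypothetical preprocessing choice, not something the leaderboard is known to do:

```python
from collections import Counter

def top_share(judge_ids, top_k=1):
    """Fraction of all votes cast by the top_k most active judges."""
    counts = sorted(Counter(judge_ids).values(), reverse=True)
    return sum(counts[:top_k]) / sum(counts)

def cap_per_judge(judge_ids, cap):
    """Indices of the votes to keep if each judge is limited to `cap` votes."""
    seen = Counter()
    keep = []
    for index, judge in enumerate(judge_ids):
        if seen[judge] < cap:
            seen[judge] += 1
            keep.append(index)
    return keep

# Made-up example: one very heavy voter, one moderate, 45 one-time voters.
judges = ["u1"] * 50 + ["u2"] * 5 + [f"v{i}" for i in range(45)]
print(top_share(judges, top_k=1))        # 0.5
print(len(cap_per_judge(judges, cap=10)))  # 60
```

Refitting the ratings on the capped subset and comparing against the full fit would show whether the heavy voters actually move the rankings.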

crude hawk
tawdry socket
#

@crude hawk

tawdry socket
#

@crude hawk Where did you get that screenshot from?

spare scroll
#

Hello there. I am new here and I just discovered Arena and I am looking for resources to understand the leaderboards a bit better, beyond just the mechanics. For example, I would like to know when and how often it is updated, how is it decided if a model gets a "preview" or not, how many updates a certain model gets, if there's a test in the pipeline etc. Can someone point me in the right direction? Thank you.

crude hawk
karmic mortar
#

does anyone know if the current Deepseek V3 leaderboard listing is for DeepSeek-V3-0324?

#

or is it the original?

karmic mortar
#

thank you! i'm curious if they'll add the newer one anytime soon

wanton turret
sleek cosmos
#

Is there a Search leaderboard?

crude hawk
#

will be interesting to see the leaderboard

sullen tiger
#

In the arena, are models matched against each other randomly, or is there a higher chance of matching models with closer scores? Cause I think the latter makes more sense.

surreal drum
#

Higher chance of closer scores, but certain models appear in the arena much more frequently, and the pairings are usually more random for those bots

which likely happens because the Elo/strength of such models is much less certain (because it’s new, fewer votes, etc.) , thus they get matches more often to lower the rating spread.

so in conclusion, models with high variance are more likely to be matched in general, but yes, also ones close in Elo

(this is not based off any conclusive evidence, I’m making all of this up)

brisk topaz
#

@west lodge hi, I’m new here and have recently started studying the chatbot arena. I noticed that LMArena has released multiple datasets on huggingface, such as Chatbot Arena Conversations (https://huggingface.co/datasets/lmsys/chatbot_arena_conversations), Arena Human Pref-100K (https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k), and Arena Human Pref-55K (https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k). But I also saw that the tutorial notebooks, like the BT model notebook (https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=ViV11W8l9NfL), the ELO rating notebook 1 (https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU?usp=sharing#scrollTo=-0rg26TQxFQv), and the ELO rating notebook 2 (https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing), include a lot of Arena data as well. I’m just curious: are the datasets used in these notebooks new, or do they overlap with the ones already released on huggingface? What's the relationship between these datasets? Thank you so much! and btw, chatbot arena is really an impactful project for the community.

timid wind
#

I got my first answer from "spider".

It was amazing, looked far from anything before

#

spider feels good fr fr

vernal hatch
#

if you see something like this when doing an arena battle does that mean its real name is hidden?

strong knot
vernal hatch
#

same for neptune or something, that was 4.5, i think

strong knot
#

It would be funny to see models like harry potter or gandalf rocking the #1 place xd

#

(those aren't in the arena, but were fyi ^^)

sinful falcon
#

is there a historic high elo score over time chart?

crude hawk
#

i used 4o and o3-mini to provide code to pull and manipulate all the data. haven't cross-checked or anything, but it looks about right to me

hasty flint
#

guys

#

give me a top model for working with cline, preferably free

#

right now I'm using gemini 2 flash

#

but it sometimes gets such dementia

#

that it's awful

#

💀

gusty lily
#

deepseek-v3-0324:free

timid wind
#

IDK, on LMArena spider just wins every round (for me)

stable shore
#

Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly kind of good?
Do they fall in a certain category that is not ranked?
Because I want to know how good they really are, I don't trust internal benchmarks

timid wind
lofty pasture
crude hawk
#

any idea when we can expect to see the first Search leaderboard @west lodge ?

grand loom
#

Hello, I would like to know which model is Stradale in Chatbot Arena. It's not on the leaderboard, and I can't find it either. Thank you very much.

#

stradale is the name I gave to the model after temporarily conversing and evaluating it in Arena (battle).

unique nimbus
# crude hawk

This would be interesting to see in terms of open source Vs proprietary

rapid cobalt
#

do votes for a model in the webdev arena count towards the general LMarena leaderboard or is it 100% segregated?

surreal drum
#

100% segregated

stable shore
timid wind
#

can someone explain what exactly is "style control"?

rapid cobalt
marsh condor
timid wind
#

the problem is that I barely understand how exactly the "style control" check works, and I just want to see the leaderboard without it

stable shore
#

Hope they do

stable shore
#

If they end up being a solid top 15 in the long term run, knowing their starting point would surely be interesting

finite ibex
#

I wonder what an llm would be like if it scored over 1500

lofty pasture
glacial glacier
#

llama 4 is now tied for first place with a 1417 elo

glacial glacier
#

that green dot is llama 4 maverick

tawdry socket
glacial glacier
#

being more accurate than the official leaderboard is a first for me

tawdry socket
#

I dont understand. Where did you get your rank from?

glacial glacier
#

yeah 1439.18 - 10.02 is definitely under 1416.58 + 13.5 so idk whats up

#

ranking comes from my own math and website
* lmb uses style control by default, turn it off for the mainstream results
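
A minimal sketch of how a rank can be derived from scores and confidence intervals, assuming the rule is "rank = 1 + the number of models whose lower bound sits above your upper bound". The two half-widths 10.02 and 13.5 come from the message above; the symmetric intervals, the model names, and the third entry are hypothetical, added only for illustration.

```python
def rank_by_ci(models):
    """Rank each model as 1 + the count of models that are clearly better.

    `models` maps a name to (score, minus, plus), where the interval is
    [score - minus, score + plus]. A model is "clearly better" when its
    lower bound is above this model's upper bound.
    """
    ranks = {}
    for name, (score, _minus, plus) in models.items():
        upper = score + plus
        clearly_better = sum(
            1
            for other, (s, m, _p) in models.items()
            if other != name and s - m > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Assumption: intervals are symmetric; only one side of each is quoted in the chat.
models = {
    "top model": (1439.18, 10.02, 10.02),
    "llama 4 maverick": (1416.58, 13.5, 13.5),
    "hypothetical model c": (1380.0, 8.0, 8.0),  # made-up entry
}
ranks = rank_by_ci(models)
print(ranks)
```

Under this rule overlapping intervals share a rank, which is also why several models with different scores can sit at the same position.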

tawdry socket
#

Nice. So, take the data from arena repository and calculate the elo/rank yourself?

glacial glacier
#

well i don't calculate the elo, just the rank

#

i take their elo for granted

#

(since they stopped publishing the anonymized battle data you need to recalculate elos)

half stream
#

Llama 4 Maverick... Experimental?

timid wind
glacial glacier
#

turns out accidentally shifting the CI down by 0.5% causes problems

jaunty steeple
#

Hi, love to see new llms and how they're used

tawdry socket
glacial glacier
half stream
# half stream Llama 4 Maverick... Experimental?

Back to this topic.

Tested and compared between the Experimental and the full-release Maverick in my native tongue (officially instruction-tuned). Even when I cranked the temperature of Experimental down to zero, it still sounds overly friendly, instead of the much more bearable full-release.

The Experimental should be deprecated at this point, and let people test the full-release herd, so that it won't be misleading.

wispy pumice
glacial glacier
#

https://x.com/lmarena_ai/status/1909397817434816562

In addition, we're also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)

steel chasm
#

Meta so desperate

twin wharf
#

Absolute clowns

#

We knew something was fishy with so much skepticism the last few days

crude hawk
# glacial glacier https://x.com/lmarena_ai/status/1909397817434816562 > In addition, we're also ad...

thanks - glad to see this... was just about to post a rant about how this 'experimental' (i.e. arena/human-preference primed) version violates LMarena's existing policies about stealth models.. like based on their blog post, they can only be unmasked and added to the leaderboard if the unmasked model is made publicly available... which isn't the case with this maverick-experimental thing (aside from Direct Chat in LmArena afaik)

#

all a bunch of nonsense from meta

#

did they think no one would notice or something lol

#

it should be deprecated / removed immediately. this model is not publicly available (and the only place it's 'available online' is LMarena itself via Direct Chat..) and so imo shouldn't have been added to the leaderboard in the first place

glacial glacier
#

grok 3 is now a frontier model without and with style control (if you exclude llama 4) (when prices are mixed 2:1 and are adjusted for thinking)

rapid cobalt
#

@glacial glacier whats your data based upon?

glacial glacier
rapid cobalt
#

You mean based on your personal battles DB?

glacial glacier
#

no, the elo scores are the general arena scores

rapid cobalt
#

Ah, okay. So based on current arena standings

glacial glacier
#

yes

haughty rapids
#

Llama 4's rating just dropped dramatically (finally the one it deserves)

glacial glacier
strong knot
#

Is super on reasoning mode on the arena?
also double damn: instruction following + style control:

crude hawk
slate compass
slate compass
twin valve
#

visualize scores "with moats" made me chuckle 😄

#

would it be possible to select some options and categories in the leaderboard and get a link? That way discussions online would be easier, rather than saying "go to this page and click this and that"

#

it would also be cool to see the number of votes. I know that the CI would tell that, but ordering by number of votes could help too.

glacial glacier
cosmic harness
#

What happened to Llama-4? 💀

winged quiver
#

Amazed chatgpt-4o is even on the leaderboard

raven heron
#

so two different llama-4 that presumably differ only in style have 50 Elo difference with style control? Can we think of ways to improve style control so that this difference goes down to a negligible one?
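
For a sense of scale, a small sketch of what a 50-point gap means as a head-to-head win rate, assuming the standard Elo logistic curve with base 10 and a scale of 400 (an assumption about how the arena scores are scaled; the function name is made up here).

```python
def expected_win_rate(elo_gap, scale=400.0, base=10.0):
    """Expected win probability of the higher-rated model for a given rating gap."""
    return 1.0 / (1.0 + base ** (-elo_gap / scale))


gap_50 = expected_win_rate(50)
print(f"50 Elo gap -> about {gap_50:.1%} expected win rate")
```

Under that assumption a 50-point gap is roughly a 57/43 split, so not huge, but clearly more than noise for two variants of the same model.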

twin valve
#

some models are in the middle of the rankings, without great publicity, but they are new as well.

glacial glacier
#

that's an interesting idea... i might implement something similar like a new badge

glacial glacier
glacial glacier
#

just type, category, and style control?

twin valve
twin valve
#

neat! Thank you! (hopefully lmarena integrates something similar or directly asks you to do the leaderboards)

#

checking your github. You have a lot of repos and advent of code as well!

solemn flint
#

gemini 3.0🗿

atomic ice
#

What's the process for LM Arena adding a new model?

chilly moat
#

gpt-4.1 on the leaderboard soon?

lofty pasture
twin valve
twilit echo
echo lake
#

Can't wait to check the score of gpt4.1 and nvidia/Llama-3_1-Nemotron-Ultra-253B-v1, curious on their actual performance.

#

it would be quite fun to find nvidia version of llama3 surpass the score of llama4

raven heron
errant tusk
#

how often is the leaderboard updated?

neon narwhal
#

what is style control?

sterile rune
#

Who tried 4.1 ?

#

Is it mostly for coding ?

terse mirage
terse mirage
crude hawk
#

right they're updated periodically (roughly weekly)?

#

i know i'm being pedantic.. but that isn't 'live' per se

twin valve
#

they are updated like weekly, can confirm.

Why? Because with too few votes the score would be all over the place so they need to accumulate votes and waiting a week is not that bad.

twilit echo
# sterile rune Who tried 4.1 ?

I tried them, and I'd say they are mostly optimized for coding. As general models, for usage in a multitude of broad real world tasks, they don't stand out:

Tested **GPT-4.1** series:

GPT-4.1 Nano:
Cheap tiny model, roughly comparable to Qwen2.5 14B.
Substantially beaten on price & performance by e.g. Google's flash models.

GPT-4.1 Mini:
Versatile fast model, roughly comparable to Gemini 2.0 flash (but more expensive).
Quite a solid coder, and performed on par with the larger model in my STEM segment.

**GPT-4.1**:
"flagship" of the series, roughly as strong as Llama 3.3 70B (but weaker STEM) & DeepSeek V3 0324 (but weaker coder).
Behind 7 other OpenAI models in my testing.
The "Maverick" type model of OpenAI.

All models are non-reasoning models and not very verbose when compared to other recent model releases (1.15x / 1.23x / 1.35x token verbosity as size increases in testing).
All models, including Nano, are fairly competent coders! though none excel at my backend testing
None of these were particularly good in my STEM segment.

I have also added 0-shot examples for UI impressions and simplistic game design for each model on my shared assets (NOT part of any scoring, just for additional curiosity/comparison).

As always, YMMV!

pine wasp
#

someone add gpt 1.4

proper granite
slate compass
winged quiver
#

Was meta caught in the act gaming the lmarena benchmarks?

surreal drum
#

Yes

winged quiver
#

Haha. How did they get caught?

#

I also thought grok was cheating but didn't say anything

chilly moat
#

tbh it was suspicious from the beginning

#

anyways o3 and o4-mini seem promising

ancient cedar
#

super promising

#

gonna beat gemini easily

delicate sable
#

does your expertise in on chain derivatives help you forecast this?

winged quiver
#

Wait, so were they caught in the act or is it just assumed they were gaming the leaderboards?

sinful falcon
#

that's why they were removed

sinful falcon
#

well i just got my first head to head of o4 mini and gemini 2.5

sinful falcon
#

and i think the o models do very well w coding

#

so not sure that's representative of how it's gonna stack up

hallow comet
#

o4 mini or 4o mini?

sinful falcon
hallow comet
#

np 🤣 its hella confusing

desert gorge
#

Yeah o4 mini seems v good for coding, but o3 is a weird model, seems more slanted towards generalization than memorization, maybe because it is supposed to have access to search/tools, also they seem to have more 'personality' than the old o models

tawdry socket
#

So, dragontail/night whisperer are from google and they are better or similar to 2.5 experimental? Thats what the april 22 announcement is, isnt it?

twin wharf
#

looks like o3-mini-high is still on the leaderboard, looks to be deprecated for o4-mini-high so I think it should be deleted now?

twin valve
twin valve
twin valve
#

I see, so in the data on github there is nothing like "Last updated: 2025-04-09." ?

hallow comet
#

When will the leaderboard include o3 ?

hallow comet
sinful falcon
hallow comet
errant tusk
#

why did flash land on the leaderboard so quickly and o3 and o4 mini are still not there

hallow comet
#

o3 and o4 mini were just added

errant tusk
#

thanks for your answer

sinful falcon
#

when was the gemini 4-17 model added to vote on?

hallow comet
sinful falcon
#

but when was the “anon” model added

#

to the arena for ppl to vote on

glacial glacier
#

uhhhhh

#

i might skip this one since i can't think of a good place to put it

#

also uhhh @twilit echo lmk when you calculate the thinking ratios for gemini 2.5 flash

twilit echo
hallow comet
#

Do you guys think o3 > 2.5 ?

timber owl
sinful falcon
#

i almost exclusively use llms for working on a codebase

#

so my preferences are highly correlated w the swe and coding benchmarks

#

and the o models do better on those

#

it’s hard for me to say which will do better in the arena

#

but my personal preference is oai over gemini

sinful falcon
hard violet
#

guys do you think 2.5 will be the top model on leaderboard on apr 30? say 10am ET? if you could assign % chances to your views that would be helpful. also how often does leaderboard update. thank you

hallow comet
delicate sable
hallow comet
timid wind
hallow comet
twin valve
#

btw o3 and the new oai models have somewhat limited replies. When I get them in battle mode with non-trivial questions, their answers are nice but pretty concise. So they lose sometimes as other models add details.

hallow comet
#

when gemini 2.5 was released, it appeared on the leaderboard exactly 1 day and 3h after; for o3 it has been 1 day and 19h and still no leaderboard :/

twilit echo
#

the leaderboard is just a marketing tool nowadays and doesn't represent real life, so who cares. Whether its #1 by a huge margin or #35 makes no difference anyway, the former just means they gamed better.

rancid moat
#

the leaderboard is my favorite way to check the status of AI

hallow comet
scenic moon
tawdry socket
#

2.5 pro was being tested as nebular weeks before its release...

#

nebula*

hallow comet
scenic moon
#

flash 2.5 was being tested under a different name as well... it was probably dragontail.

tawdry socket
#

There were many google models introduced in last month. So, not sure which one was flash.

tawdry socket
#

Although since o3 is expensive, I dont know if arena has resources to test it often like o4-mini...

hallow comet
#

Ive tried 10 questions myself with o3 and gemini, no problems so far

tawdry socket
#

Yes, that should count. Do that 2000 times and you will see them on leaderboard tomorrow 😅 Side by side votes do not count

#

what do you notice btw? o3 is clearly better or similar to gemini?

timber owl
#

o3 is better and similar

tawdry socket
sinful falcon
hallow comet
hallow comet
#

Still, you can rig it, because it's easy to detect which llm an output belongs to. For instance, just ask which llm they are and they will say openai or google. But way too much effort

scenic moon
#

they claim they exclude responses where the model reveals itself

hallow comet
scenic moon
#

true, their style is very different. chatgpt is probably the easiest to spot id say

hallow comet
scenic moon
#

because its ahead?

#

in polymarket*

hallow comet
# scenic moon in polymarket*

no, because ive seen a lot of google shilling happening, and google also promoted that they were nr 1 in this arena (which they can also optimise for) , maybe rig, maybe bias call it what you want

scenic moon
#

theres one dude betting $14k against openai and $22k on gemini. he sure thinks google is going to win

hallow comet
#

that's actually somewhat more in favor of openai than what the market thinks. he's saying 38% chance for openai, 62% for gemini; the market is at 28% for openai atm

slate compass
#

just gonna go all-in on the top placer the second the leaderboard updates with o4-mini and o3 (hopefully at the same time)

lofty pasture
scenic moon
#

what do you use them for? I think it is better at almost everything

#

o3

#

except creative stuff which I havent tested

#

its formatting is a little odd though, this might be a problem

lofty pasture
# scenic moon except creative stuff which I havent tested

Since I am a medical student preparing for exams, I asked them for some tricks to memorize different drug names and doses by linking the names and doses to something familiar and logical. No OpenAI model got it, but the Gemini models are perfect and DeepSeek is next after Gemini... they are just wow. If those models are truly reasoning, why do they fail a task like that?!

scenic moon
#

oh cool

#

yeah maybe gemini is better at that

junior flame
#

Which free large language model is best for summarizing YouTube videos? I'm torn between Gemini 2.0 Flash and Gemini 2.5 Flash Preview 04-17 because they differ in tokens per minute

scenic moon
#

summarizing actual videos or the transcript? if actual video, then you will need 2.5 pro

#

if not then id use 2.5 flash

#

ive used 2.0 flash but it is kind of bad with timestamps, ok at summarizing

twin wharf
#

Id use 2.5 pro, its great at that

twin valve
twin valve
# twin valve disagree

On this, I strongly think that lmarena closely represents the average chatbot use.

It doesn't represent api calls, that's true. For that we have openrouter rankings.

hallow comet
vocal rain
pliant vessel
#

how is the score decided for the leaderboard?

hallow comet
hallow comet
# vocal rain

intelligence race and who has better model are two different things

my guess is that google and openai have achieved agi internally and are looking to control it (OpenAI researcher)

vocal rain
hallow comet
vocal rain
#

why would they have agi internally?

hallow comet
# vocal rain why would they have agi internally?

they have the compute for 10x more powerful models
but they havent published nor reported to have trained such models

Which means, A they havent done so for whatever reason
or B they have and they are hiding it

Option B is very much the more likely for me. Powerful AI would automate jobs, automated jobs lead to protests and revolts, the last thing billionaires want (like those backing google and openai)

vocal rain
#

it could simply be that we dont scale as we would like

sinful falcon
#

🤣

hallow comet
sinful falcon
#

every single one (even those that quit)

#

and that the billionaires behind openai wouldn't stand to gain so much more from releasing it

#

there are also a ton of independent companies all working on this so your logic would have to apply to them too

sinful falcon
#

are you saying no one will have a job at all because of AI?

#

some billionaire who is concerned about riots wouldn't have to worry about riots if they had the first company to make AGI. they would become a multi trillionaire

hallow comet
# sinful falcon do you think that every single employee at openai is just keeping their mouth sh...

AGI is nothing short of a manhattan like project where everything is at stake, so its very clear there would be measures, first one that comes to mind is restricted access, and nda forms. You talk you lose everything and no one would believe you without proof.

other independent companies dont have google and openai resources

no1 would have a job cause of ai -> almost, very few people would be needed

making AGI = rich , yes, but publishing it and automating jobs collapses the system. so they are developing it, but quietly

lofty pasture
sinful falcon
hallow iris
#

You put 10 people right now and make them prompt for 10 hours a day, you can really rig o4 scores and make it look like trash by the end of the month.

hallow comet
slate compass
rapid cobalt
#

it has safeguards, I looked into it awhile ago

lofty pasture
lofty pasture
sinful falcon
#

100 man hours

#

for that

hallow comet
# sinful falcon 100 man hours

there are people who have bet 20k and stand to win 5x if openai (or gemini) wins. thats 80k profit. Sure that is not enough to rig or bribe ? I hope not but idk

vocal rain
# vocal rain
Poll: who is currently leading the intelligence race in your opinion?

Leading answer: 7 of 18 votes

#

some bets on polymarket are not serious at all, you can literally bet on jesus coming back to earth in 2025

#

these ai bets are kinda like a joke

#

and enables some rather uneducated ppl to bet their money and lose it

#

and yeah, the ai bet markets are not really liquid either

surreal drum
#

What bro tells me after losing a $10 parlay

hallow comet
hallow iris
hallow comet
hallow iris
#

Don't know why this message was rejected by the system, probably a banned word

hallow comet
hallow iris
#

Yeah and then they install Tuxler which has undetectable residential IPs for free and switch at every round.

hallow comet
#

i imagine its way more ez for a whale that bets 20k , to pay 5-10k on supporting lmarena and so getting to know what goes on .. early. ez 3x

hallow comet
#

if theres huge difference between new peoples voting and people who have been around for 2+ years then thats clear rigging indicator

hallow iris
#

Hopefully...

hallow comet
#

Yeah still could go that way, or whale bribe way .. thats why you never bet early, you must be the first to react

#

The only time to bet before news is if you know the news before the news is public 😅

hoary wind
#

it go down so fast 😭

twin valve
#

also did the order of the categories on LMB change? My preferred ones are easier to reach.

pearl marsh
#

When will o3 and o4-mini be put on the leaderboard?

hallow comet
twin valve
obsidian egret
twin valve
#

well people can bet on everything, then everything becomes life changing. I don't think that's an argument.

feral shale
glacial glacier
smoky bramble
#

hello

thorny ginkgo
#

Llama 4 maverick was rated ~1400 on launch and now it is 1271. Is it just due to error margin or something else

slate compass
twin valve
timber owl
#

i hate that formatting because they like putting 5 newlines after each sentence

timber owl
#

and they love bullet points as everyone knows which is usually not the best way to format

crude hawk
# hallow iris Don't know why this message was rejected by the system, probably a banned word

yeah it introduces incentives among all sorts of actors who have no stake in the actual performance of a model (like i've always thought major labs wouldn't waste their time trying to rig the votes, because the performance of their models in the wild will always ultimately speak for itself.. even if they generate initial 'hype', the net result after being found out is likely to be negative [see Meta's llama4 maverick-arena-version fallout.. they ofc didn't manipulate votes .. but just released a model specifically juiced for the arena.. and it's been a disaster for them and the model release imo])

#

but betting markets and a crowd-sourced ranking project to my mind do not make a healthy mix .. there will invariably be those who at least consider ways to manipulate the system, if not have a go... it's like the inherent nature of betting...

#

doesn't bother me that people bet on the leaderboard - why not (and it does kinda seem illiquid anyway).. but it's not really an ideal situation in terms of the integrity of the voting system.. i dunno if LMarena asking polymarket not to provide coverage would be possible / worthwhile.. ig if there's enough interest, the market would just be created / facilitated elsewhere.. be a game of whack-a-mole

pliant vessel
#

how come there are 4 models at rank 2 even if they have different arena scores?

noble coral
pliant vessel
#

what's the margin

twin valve
#

something like this

sinful falcon
#

so it’s w/e

#

but like adverse selection is part of every single market

#

u think ppl don’t know earnings reports before they’re released to the public?

#

do u think that the ppl running lmarena aren’t telling their friends which model “looks promising”?

#

million ways to get an edge in any market

#

if you’re trading it you’re willingly accepting the risk your bet doesn’t play out

#

no crying in the casino

sinful falcon
glacial glacier
#

i've tried before but probably won't implement anytime soon

crude hawk
#

i mean what if there is a substantial difference between p1 and p2? what if there's no p2 at all? the underlying assumptions seem purely arbitrary

sinful falcon
#

seems arbitrary

crude hawk
#

what?

#

i'm responding to your post

#

let’s say there’s some probability p1 of someone gaming the system in favor of oai, and there’s some other probability, p2, of someone gaming it in favor of gemini

sinful falcon
#

nevermind i thought u were the person who said someone could rig

#

either way it’s kinda silly when ppl freak out about this cuz like

#

there’s information asymmetry in every market

#

and if u think it’s worse here

#

then don’t trade

#

no crying in the casino

crude hawk
#

discussing it ≠ freaking out..

crude hawk
#

but not all markets are the same...

#

rigging LMArena leaderboard would be easier than say rigging the result of the SuperBowl or a US pres election

crude hawk
#

crowd sourced anything is inherently vulnerable... doesn't mean the arena is being rigged (i don't think it is) - but the idea that 'well, there'd be multiple manipulators and it'd all just net out anyway so what's it matter' just seems kinda naive

crude hawk
#

no one is complaining about losing money lol
i think you're missing the point (like i said, it doesn't bother me at all that people are betting on the leaderboard – i gamble all the time, on stocks, horse racing, roulette; whatever), i just find the fact that they are able to problematic in terms of introducing incentives that otherwise would not exist for actors to attempt to manipulate the leaderboard

sinful falcon
#

they could just make it so you need to make an lmarena account

#

that way if there’s ever any funny business

#

they know who to go after

hallow iris
#

And Gemini 2.5 Pro-exp isn't going through hard testing on lmarena right now so it's harder to find this one in particular

surreal drum
#

are you guys suggesting that this rigging is happening right now? or that it’s only a matter of time? because current leaderboard scores are pretty well-aligned with benchmark results and user sentiment.
a baseline “vibe” that the arena is being gamed is not conclusive. there’s literally zero evidence to prove it.

#

also, yes you’re absolutely correct that it is extremely easy to rig, but on the same token, users have been very quick to detect BS’ed scores. Look at Maverick 4, for example.

#

Though, to give yall credit, the absence of evidence is not the evidence of absence. it’s definitely POSSIBLE that this is happening.

delicate sable
#

I think it's pretty unlikely the leaderboard is currently being rigged and I actually concluded that rigging the lmarena leaderboard is a more difficult way to market manipulate than some other things you could do.

You'd think this would protect the integrity of the leaderboard but the main thing is that it's not particularly clever to rig it. Every degen and their mom has realized that it's possible someone could try.

sinful falcon
#

(asking for a friend)

woven hearth
#

Hello

#

will there be a leaderboard update on the 29th? Or 30th?

delicate sable
#

April 30 at 11 AM

surreal drum
drowsy umbra
#

Any plan for a leaderboard for video generation models?

twin valve
slate compass
#

i think that, with the level of sophistication required to actually rig the arena without being found out, you'd probably realize as much

delicate sable
timid wind
timid wind
delicate sable
#

kalshi also markets itself as a securities trading platform so from that perspective insider trading is very bad

timid wind
crude hawk
#

if either are acceptable... literally what would be the point of the market

delicate sable
crude hawk
#

yeah exactly

delicate sable
#

so the forecasts will just get worse and worse over time

timid wind
crude hawk
#

if you literally know that gemini will be #1 because you are the one compiling the leaderboard

#

that's not an information asymmetry

#

it's a totally broken market

delicate sable
#

the whole idea behind prediction markets is that there is a financial incentive to make good predictions. liquid markets result in better predictions since the financial incentive is stronger

timid wind
#

I can't get it TBH

delicate sable
#

if you know you will trade against someone who has the answer you cannot make money

#

so people will just stop trading the market

crude hawk
#

yeah that sounds like a natural law of economics to me ha

timid wind
timid wind
crude hawk
#

if you know you will trade against someone who has the answer you cannot make money
so people will just stop trading the market

delicate sable
#

what does it mean to be 'competing with the real world'

timid wind
crude hawk
#

the odds that you get for it do

delicate sable
#

they do lol

#

you're trading against real other people

#

if i know the answer and buy 50% of the open interest on a strike who do you think that money i win will come from

timid wind
delicate sable
#

say i forecast the odds of gemini winning to be 70% and openAI to be 30%

#

someone with a large bankroll who knows the answer buys openAI to 40%

#

what trade do i do
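
To make the question concrete, a minimal sketch of the expected value of the obvious trade (selling openAI at 40 when you think it's worth 30), assuming $1 binary contracts and no fees. Both of those, and the function name, are assumptions for illustration only.

```python
def ev_of_selling_yes(price, true_prob):
    """Expected profit per $1 contract from selling YES at `price`.

    The seller collects `price` up front and pays out $1 if YES happens,
    so the expected profit is price - true_prob.
    """
    return price - true_prob


# What the trade looks like if my 30% forecast is right.
ev_if_forecast_right = ev_of_selling_yes(0.40, 0.30)

# What it looks like if the buyer already knew openAI wins.
ev_if_buyer_knew = ev_of_selling_yes(0.40, 1.0)

print(f"EV if my forecast is right: {ev_if_forecast_right:+.2f}")
print(f"EV if the buyer knew the result: {ev_if_buyer_knew:+.2f}")
```

The trade that looks like ten cents of edge turns into a sixty cent loss per contract once the other side is informed, which is the adverse selection problem in one line.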

timid wind
timid wind
delicate sable
#

i feel like you dont understand the basics of prediction markets

timid wind
#

guys I am sorry but you didnt provide a single argument

delicate sable
#

not gonna keep arguing we can just agree to disagree on this

#

fwiw some people do share your opinion

delicate sable
#

yes you won the argument

crude hawk
sinful falcon
#

i wish i could win arguments like you

#

please let me know what trades you make next time so i can trade accordingly

timid wind
timid wind
sinful falcon
#

why do you think i’m trolling

#

i am very dumb

#

i need advice from you to make my next trade

#

please let me know what you want to trade next

timid wind
tawdry socket
#

So, you are saying insider trading should be legal?

sinful falcon
timid wind
timid wind
sinful falcon
#

is there a problem with my grammar

timid wind
tawdry socket
delicate sable
#

i feel like you are just making really big claims about how prediction markets work but aren't very experienced with them

surreal drum
delicate sable
#

i don't think they're dumb though

#

i feel like its just a very nuanced and random thing

surreal drum
#

Very confidently incorrect

#

No, for most people, it’s very clear why it’s not ethical to insider trade

delicate sable
surreal drum
#

yeah, probably bad word choice. i mean, though, it’s obvious why it’s wrong, unfair

delicate sable
#

i felt like this was only a discussion about whether it results in better/worse predictions

#

i also dont think the answer is obvious a lot of people disagree on it

timid wind
delicate sable
#

why do you speak so confidently about these things lol

#

you don't even know kalshi exists?

timid wind
#

if someone does know the answer in advance and vote for it, it can only improve predictions overall, am I wrong?

delicate sable
#

its just surface level

#

IMO

timid wind
delicate sable
#

it is a 'serious and active prediction market'

#

i'd also argue any prediction market where you trade real money is serious

#

im not sure why residency requirements make a pm less 'serious or active'

tawdry socket
# timid wind if someone does know the answer in advance and vote for it, it can only improve ...

You are assuming that allowing insider trades does not affect market and volume. If I know that insiders can buy up orders as soon as they get information, I am not going to place any order. If there are no orders, then there is no opportunity for insiders. So, it basically kills the markets even for insiders.

Your argument works only when people dont know that there are insiders. Then, insiders will bring markets to correct values sooner than otherwise. If people know i.e., insider trading is legal --> no volume --> no trading, not even insider trading --> wrong predictions.

#

Trust me, big players on prediction markets dont gamble. They trade because they think they have an edge. If they know that they dont, they leave, i.e., only gamblers are left in the market, which does not help predictions.

#

So, if you want accurate predictions, you increase the publicity of prediction markets and make them popular or competitive. Not by outright legalizing insider trading which basically kills the markets.

sinful falcon
#

idk man dinahu seemed like he was very educated on the subject

#

are you sure you want to take an opposing take?

#

seems unwise to oppose someone so smart

crude hawk
# tawdry socket So, if you want accurate predictions, you increase the publicity of prediction m...

yeah.. i mean.. it's not a prediction market if a participant (either directly or indirectly) has literally perfect information about the thing on which everyone else is speculating/gambling (e.g. which model will be number one on the leaderboard after the next update).. they aren't predicting anything if they quite literally know with 100% certainty what the outcome will be
[it is a prediction market if individuals with such perfect information do not use it or disclose it to others - then it's a fair playing field. but it's basically an honour system as far as i can tell lol.. or at least it's not in LMsys' interests to exploit their info advantage - because the moment they were found out to be doing so, their and the Arena's integrity collapses]

crude hawk
glacial glacier
#

i'm surprised r1 has held up this well

crude hawk
#

lol are people really this dumb

#

the LMArena leaderboard isn't like the weather, or the outcome of a sporting game or election

#

if some pollster has info not publicly released, they can use that to bet on the outcome of a particular election. they have an info advantage (as an 'insider' of a polling company), but they don't know the literal outcome of the election; and so yeah, it theoretically leads to a more accurate prediction market

#

a boxer's trainer might know that they are injured before a bout, and use that info to place a bet (e.g. that the boxer will lose); the 'insider' has privileged info but again, they don't actually know what the outcome will be for certain.. so sure.. more accurate prediction market yadada

#

but the #1 spot on the LMArena leaderboard is straight-up known by insiders before an update is actually published.. they don't have an info advantage; they literally know the actual outcome, on which people are betting, before it is released publicly..

#

how that difference and the problems it potentially entails is not blindingly obvious is beyond baffling

slim marsh
#

hi

gritty moat
#

Guys stop yapping ong

neon willow
#

At this point they should just rename this to #gambling

sinful falcon
#

is there supposed to be an update tomorrow?

#

with a new model?

sinful falcon
#

because the mods tell me in advance

#

well ig for all of y'all plebs it is

#

nevermind

twin valve
#

can we move the polymarket discussion elsewhere?

woven hearth
#

Besides the discussion, will there be a leaderboard update today/tomorrow?

twin valve
#

normally it is about 1 week

#

4-5 updates per month

hollow patrol
#

Hey guys, I am working on a model recommending system and I would love to use the lmarena leaderboard as source to base the recommendations on. For now the way I go about it is simply downloading this dataset maintained by someone:

https://huggingface.co/datasets/mathewhe/chatbot-arena-elo

And filter within that dataset. But this is just for the LLMs.

Is there any intent to create an API for the leaderboard? Or something similar with which I can easily access the leaderboard data in my applications?
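For reference, a minimal sketch of the download-and-filter approach described above, assuming the dataset has already been exported to CSV text. The column names (`Model`, `Arena Score`, `License`), the helper `top_models`, and the sample rows are all hypothetical and would need to be matched to the real file.

```python
import csv
import io


def top_models(csv_text, min_score, license_substr=None):
    """Return (model, score) pairs at or above min_score, best first.

    Column names are assumptions, not the dataset's confirmed schema.
    """
    picked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["Arena Score"])
        if score < min_score:
            continue
        # Optional case-insensitive filter on the license column
        if license_substr and license_substr.lower() not in row["License"].lower():
            continue
        picked.append((row["Model"], score))
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Made-up rows, only to show the expected shape of the data
SAMPLE = """Model,Arena Score,License
model-a,1300,Proprietary
model-b,1250,Apache 2.0
model-c,1100,MIT
"""

print(top_models(SAMPLE, min_score=1200))
```

Keeping the filter as a plain function over CSV text means the same code would work unchanged if the data later came from an API or a flat-file export instead.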

glacial glacier
hollow patrol
twin valve
woven hearth
#

guys

#

will there be a leaderboard update today?

low patrol
#

Hey Hi
I am from BharatGen; we have built our own LLM with Indic nuance.
May I know how we can integrate our model API into Chatbot Arena to get a better comparison?

glacial glacier
#
hollow patrol
hallow comet
#

Is there a way to find old leaderboards? For example the leaderboard on the 14th of April

twin valve
hollow patrol
twin valve
#

well when they tested it, it was legit. For me the p2l model (it simply picked the model it thought was going to answer your question best) never lost in my cases.
The only drawback is that they have to compute the p2l every now and then. Last time was in January I think.

twin valve
#

was nvidia nemotron 253B tested via lmarena? I cannot remember its score

glacial glacier
hollow patrol
twin valve
twin valve
#

so yeah if you want to go at it, go for it. My initial objection was only to understand if the prompt to leaderboard was the same thing you wanted to do

hollow patrol
hollow patrol
twilit loom
#

😂
Chatgpt with fancy layers

timber owl
#

also according to 5 minutes of research, i cannot find any info on the llm itself, only the organization

twin valve
#

Imagine every week they have X thousands new battles that have to be classified, rejected (if something is obvious) and then evaluated. In the past, if I am not wrong, they used a llama model for classification. I am not sure if it is still the same. Still it takes compute to go through all that, plus manual checks. I think the lmarena team is small.

slate compass
#

$30M (equivalent) in funding, not too shabby

timber owl
slate compass
#

oh

#

@low patrol do you have public model weights?

wind vale
#

Hello guys , May I know When will the leaderboard be updated again?

woven hearth
#

No you may not

#

they think everyone who asks that question is here for the money

timber owl
#

that's because the ratio of people who ask that question that are here for the money vs that aren't is significant

twin wharf
woven hearth
#

Or...

#

He is just curious about where QWEN or other releases get placed on the leaderboard

#

The polymarket bet ends every end of the month...

#

Stop being toxic

woven hearth
#

ffs

#

you re right

timber owl
timber owl
pine wasp
#

who's ready for grok 3.5 to break the leaderboard

pseudo rivet
#

Grok 3.5 will be just below chatgpt 4o

pine wasp
#

rightt

twilit echo
glacial glacier
#

(where glicko is elo with uncertainty taken into account)
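To make that parenthetical concrete, here is a rough sketch of the textbook Elo and Glicko-1 expected-score formulas. This is the standard published math, not necessarily what the leaderboard being discussed computes; the function names are just illustrative.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def elo_expected(r_a, r_b):
    """Elo: expected score of A against B, using ratings only."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def glicko_g(rd):
    """Glicko-1 attenuation factor: shrinks toward 0 as the rating deviation grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko_expected(r_a, r_b, rd_b):
    """Glicko-1: expected score of A against B, discounted by B's uncertainty."""
    return 1.0 / (1.0 + 10 ** (-glicko_g(rd_b) * (r_a - r_b) / 400))


print(elo_expected(1600, 1500))
print(glicko_expected(1600, 1500, rd_b=200))
```

With a rating deviation of 0 the Glicko expectation reduces to plain Elo; the larger the opponent's deviation, the closer the expectation is pulled toward 0.5.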

hollow patrol
#

What is the reason that a model like gemma 3-4b can be one place below claude 3.5 sonnet?

I have used a lot of small models, and I have never seen any that can consistently produce quality answers

twilit echo
#

even so, 40 ranks is still very little considering the models are worlds apart in capability (obviously, since it's 4B..) for reference here is my own comparison (no censor):

hollow patrol
twin valve
twin valve
#

@glacial glacier would it be possible for you to add the history of the score of a model? I know it is asking for quite some work and I know that models with small CI change their values slowly, but still it would be interesting to know.
Another one would be to check the changes in the amount of votes (if those are reported in the stats distributed via github). One could see if a model, despite being not yet deprecated, practically stops getting votes (without votes: no further rating or CI change).

drowsy needle
median briar
#

New Gemini 2 Flash image generation is legit more expensive than Imagen 3 002 🤡

twin valve
#

if needed I can request the same for the official leaderboard but I won't expect changes because the team is surely super busy

drowsy needle
glacial glacier
twin valve
# glacial glacier interesting... are both of these to accomplish the same goal (understanding when...

no, the model should stay (allegedly) the same. Rather the rating should change (if at all) because other models change or the judges change (different questions, different judgments).

The score history is what is most interesting. The vote history (or CI history, though the latter is a bit more cryptic) is interesting to see because it tells whether the score is really stable despite further battles, or only stable because almost no battles happen.

For example I am pretty sure many models, though listed, are receiving very few battles.
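A rough sketch of that vote-history check, assuming leaderboard snapshots are available as plain dicts mapping model name to `(rating, votes)`. The snapshot format, the helper `stalled_models`, the threshold, and the numbers are all hypothetical.

```python
def stalled_models(snapshots, min_new_votes=50):
    """Return models that gained fewer than min_new_votes between the last two snapshots.

    snapshots: list of {model: (rating, votes)} dicts, ordered oldest to newest.
    """
    if len(snapshots) < 2:
        return []
    prev, last = snapshots[-2], snapshots[-1]
    stalled = []
    for model, (_rating, votes) in last.items():
        if model not in prev:
            continue  # newly listed model, no history to compare against
        if votes - prev[model][1] < min_new_votes:
            stalled.append(model)
    return sorted(stalled)


# Made-up snapshots for illustration only
history = [
    {"model-a": (1300, 10000), "model-b": (1250, 8000)},
    {"model-a": (1302, 11200), "model-b": (1250, 8010), "model-c": (1280, 3000)},
]
print(stalled_models(history))
```

A model flagged here has a rating that looks stable mainly because it is barely being sampled, which is a different situation from a rating that holds steady under many new battles.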

glacial glacier
timid wind
#

is it OK that in 50%+ of cases I judge an LLM by the first 100-200 characters of its response and don't read the rest when the start looks bad to me?

timber owl
#

no

twin valve
# glacial glacier sorry i meant "understanding when a model stops getting battles / moving around ...

🤔 the number of votes achieves what you say (showing when a model stops getting battles). The rating history shows how the "playerbase" changes. Even with the BT model instead of Elo (or any Elo variant), over time a model can be "farmed" so to speak, provided that it gets battles. Or a model can still hold up pretty well (claude 3.5 from Oct 24). Sure, the rating moves slowly (within the expected CI), but it can move nonetheless as the playerbase changes and votes come in, because the people voting change their judgment and the questions change.

glacial glacier
twin valve
#

yeah no worries, I throw ideas because I like your leaderboard a lot, but you don't have to implement them

echo lake
#

Is there any information regarding the Qwen3 32B model? I think it's more commonly used than those MoE models, since MoE = Memory Overconsumption Ensemble in practice.

dreamy bluff
#

how could we add new models to the LMArena leaderboard?

glacial glacier
#

what is this leaderboard?!

#

where's gpt-4o-2024-11-20?
why is gemini flash down there?
and also having a wide CI is giving it an advantage here

twin valve
glad elk
#

Hello everyone I'm new here; I'm wondering can I push my own fine-tuned model to the Chatbot Arena to let the users blindly test it with other models? Thanks!

drowsy needle
crude hawk
#

but yeah the ranking in the ss you shared seems counterintuitive.. i assume there's a methodological explanation.. o3's seems convincing but eh i dunno

empty hill
#

🤔

wispy sapphire
#

Aloha hi, I am sorry if this is already documented somewhere; I am responsible at my company for compiling a list of models that our product can run, and I thought it would be nice if I could query the contents of the Leaderboard via an API, rather than picking through the list on a browser and copying and pasting.

Is the data that populates the current leaderboard available via an API, or an export of that data available as a flat-file in some manner?

peak yacht
#

Hi 🫡

#

Hi everyone, I'm a newcomer, how could we push our company's model to the Chatbot Arena? Thanks !

drowsy needle
peak yacht
valid sun
#

how to join?

drowsy needle
viscid lynx
#

hi

unborn talon
#

hello

crisp tartan
#

Hello Guys, Found this website through YT video... Hope to learn something more in AI Universe 😉

wispy sapphire
drowsy needle
drowsy needle
wispy sapphire
drowsy needle
wispy sapphire
wispy sapphire
# drowsy needle Thank you!

One last thing, just to hopefully make this as innocently lightweight as possible, I added this to the issue description:

I think what I am asking is that the leaderboard-table-file (gradio_web_server_multi.py#L337) (docs) is copied to an S3 bucket, or even just served raw via the site's http server (sans captcha, please), so that we can look at the raw data.

drowsy needle
frozen knot
#

maybe with the upcoming sentiment control claude 4 opus might actually be #1

#

until gemini deepthink at least

tulip shadow
zealous sable
#

When will the leaderboard update next? like with the scores of new claude 4 models?

mortal sedge
#

Why isn't claude 4 in the leaderboard?

carmine spade
#

Leaderboard is down

drowsy needle
#

should be fixed now 👍

pale cloud
#

Is there a channel to suggest improvements to the existing benchmarks?

drowsy needle
pale cloud
#

Thank you

zealous sable
#

Hi, when do the leaderboards update usually?

drowsy needle
wicked plume
sinful falcon
#

lol

zealous sable
willow holly
#

In my own LMArena stats Sonnet 4 does worse than 3.7.
Curious where it ends up on the public leaderboard.

It is really just optimised for the agentic coding workflow huh?

Opus does well for me tho. I like how it is efficient with tokens.

feral shale
queen jewel
#

I hope the old website can last longer, or at least remain accessible as a legacy URL after the update, as tools like p2l and arena explorer on it have not been fully replicated on the beta website yet

pale hare
#

when is the leaderboard updating to show sonnet 4?

twin valve
#

in randint(1,5) days

twin valve
indigo cliff
#

What is the "Arena Overview" leaderboard? Is it the same as the text leaderboard (with style control)?

willow holly
#

@indigo cliff Yes. If you mean here and the default is now Style controlled.

indigo cliff
#

I mean the one at the bottom of the page. Is it influenced by webdev, vision, search, copilot, and text-to-image? or is it just the text? By looking at the page one would think its a combination of all the leaderboards above, but that doesn't seem to be the case.

willow holly
#

@indigo cliff yes it is also just the Text, just split in the different categories.

It is just the overview you also have in the old one, but worse because you can't switch to the Elo points. The new design looks better but hopefully all the functionality will also come back.

indigo cliff
#

I see, thanks

zealous sable
#

Hi, just saw the X post for the Claude 4 update but the site doesn't show it?

drowsy needle
zealous sable