#leaderboards | Arena | Page 1

vague plume Mar 3, 2025, 6:18 PM

#

cool to see 4.5 in first place imo

keen plaza Mar 3, 2025, 6:18 PM

#

vague plume cool to see 4.5 in first place imo

its out in leaderboard?

vague plume Mar 3, 2025, 6:19 PM

#

keen plaza its out in leaderboard?

yeah, snagged 1st place everywhere

keen plaza Mar 3, 2025, 6:19 PM

#

vague plume yeah, snagged 1st place everywhere

crazy

vague plume Mar 3, 2025, 6:19 PM

#

https://x.com/lmarena_ai/status/1896590146465579105?t=CK7KDuZ8siQS9ihFKJqFww&s=19

true

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

BREAKING News: @OpenAI's GPT-4.5 now tops the Arena leaderboard!

With over 3k votes, GPT-4.5 landed #1 across ALL categories, and singularly #1 under Style Control / Multi-Turn 🥇 Huge congratulations to @OpenAI on this impressive milestone! 🙌

View below for more insights on

balmy moon Mar 3, 2025, 6:19 PM

#

claude 3.7 thinking still not in there?

vague plume Mar 3, 2025, 6:20 PM

#

https://x.com/elonmusk/status/1896624102674506172?t=8x9-X8P7egbp7k3UXOhHyg&s=19

anyways seems like grokman wants to do better

Elon Musk (@elonmusk) on X

@lmarena_ai @OpenAI Not for long

coral moat Mar 3, 2025, 6:22 PM

#

It seems extremely easy to game the lm dashboard and just upvote GPT 4.5 content. Why do y'all put so much weight on this leaderboard that seems pretty fickle?

wicked sapphire Mar 3, 2025, 6:23 PM

#

it's just one metric of many. And I think the people who try and game the leaderboard are pretty few in number.

pseudo rivet Mar 3, 2025, 6:28 PM

#

coral moat It seems extremely easy to game the lm dashboard and just upvote GPT 4.5 content...

but the battles are blind

coral moat Mar 3, 2025, 6:28 PM

#

Seems like it does play a big factor in vibes. I prefer artificialanalysis.ai for my model evals

coral moat Mar 3, 2025, 6:29 PM

#

pseudo rivet but the battles are blind

Do the arena (side by side) votes not count on the leaderboards? If not, then good!

pseudo rivet Mar 3, 2025, 6:32 PM

#

coral moat Do the arena (side by side) votes not count on the leaderboards? If not, then go...

Oh good question, I don't know. I didn't see that.

But I checked and gpt 4.5 is not available as an option there

coral moat Mar 3, 2025, 6:34 PM

#

pseudo rivet Oh good question, I don't know. I didn't see that. But I checked and gpt 4.5 i...

I would assume not. I didn't realize the main page of lm arena does the blind study type voting, my earlier question was based on a wrong understanding of the methodology.

coral moat Mar 3, 2025, 6:36 PM

#

pseudo rivet Oh good question, I don't know. I didn't see that. But I checked and gpt 4.5 i...

Was able to confirm side by side arena results are not factored into leaderboards

coral moat Mar 3, 2025, 6:38 PM

#

vague plume cool to see 4.5 in first place imo

Curious how 3k votes was determined as a big enough sample size to put in the leaderboards since the rest of the top 5 have 10k+ votes. Maybe there were a lot of votes where 4.5 beat other top 5 models?
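
A rough way to reason about the sample-size question is the standard error of a win rate. The sketch below uses a plain binomial approximation; it assumes independent votes, ignores ties, and ignores that a model's votes are spread across many opponents, so it is only a back-of-envelope check and not how the leaderboard's intervals are actually computed. The function name `win_rate_margin` is made up for illustration.

```python
import math

def win_rate_margin(n_votes: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation ~95% interval for a win rate."""
    return z * math.sqrt(p * (1.0 - p) / n_votes)

# Compare the vote counts mentioned in this thread.
for n in (1_500, 3_000, 10_000):
    print(n, "votes ->", round(100 * win_rate_margin(n), 2), "percentage points")
```

Under these assumptions the uncertainty shrinks with the square root of the vote count, so going from 3k to 10k votes narrows the interval by a bit less than half rather than by a factor of three.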

vague plume Mar 3, 2025, 6:41 PM

#

I think a few other models that showed up in the leaderboard had ~6k - 10k votes before first showing up there. Might be the fact it's pricey, might be hype amount, idk

#

In any case it's a win for the more casual LLM users/the general public imo, it'll be on Plus this week

left zephyr Mar 3, 2025, 6:56 PM

#

@west lodge I DM'd you for a collaboration.

fast crown Mar 3, 2025, 7:00 PM

#

could we get the servers to have different logos?

low bough Mar 3, 2025, 7:59 PM

#

vague plume I think a few other models that showed up in the leaderboard had ~6k - 10k votes...

DeepSeek R1 had only 1.5k

vague plume Mar 3, 2025, 8:04 PM

#

low bough DeepSeek R1 had only 1.5k

Ohh ok cool

glacial glacier Mar 3, 2025, 9:13 PM

#

vague plume https://x.com/lmarena_ai/status/1896590146465579105?t=CK7KDuZ8siQS9ihFKJqFww&s=1...

This is below the lower bound I calculated using ELO math, disappointing

vague plume Mar 3, 2025, 9:49 PM

#

vague plume https://x.com/elonmusk/status/1896624102674506172?t=8x9-X8P7egbp7k3UXOhHyg&s=19 ...

worst part is that it's true, new version wins by 1 point

hallow comet Mar 3, 2025, 9:54 PM

#

in style control there's more of a substantial gap

coral moat Mar 3, 2025, 10:56 PM

#

These elo ratings are pretty similar for example 4.5 only wins 54% of the time against Google's most advanced model
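
For reference, the usual Elo relationship between a rating gap and an expected win rate can be sketched as below. It assumes the conventional base-10 logistic with a 400-point scale; whether the leaderboard uses exactly this scale is an assumption here, and the function names are illustrative.

```python
import math

def elo_expected_score(rating_diff: float) -> float:
    """Expected win probability for the higher-rated side, given the rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def elo_diff_from_win_rate(win_rate: float) -> float:
    """Inverse of the above: rating gap implied by a head-to-head win rate."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

print(round(elo_diff_from_win_rate(0.54), 1))  # gap implied by a 54% win rate
print(round(elo_expected_score(100.0), 3))     # win rate implied by a 100-point gap
```

On this scale a 54% head-to-head win rate corresponds to a gap of a little under 30 points, which is why top models that look separated on the board can still be close to a coin flip against each other.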

coral moat Mar 3, 2025, 11:28 PM

#

any theories on how this is possible in the span of a day via a legitimate process instead of gaming by that twitter guy??

blissful pulsar Mar 4, 2025, 12:06 AM

#

elon knows in real time what the estimate for the arena score will be

#

like there's been over 1000 votes in the last day, so every few minutes he receives a new vote so can know he'll probably be #1 when it enters the arena

coral moat Mar 4, 2025, 12:29 AM

#

blissful pulsar like there's been over 1000 votes in the last day, so every few minutes he recei...

Does the arena increase the frequency of new models for the blind test studies? In order to quickly assess capabilities? I guess I am not sure how Elon would control this which seems like he did based on his tweet lol

surreal drum Mar 4, 2025, 2:14 AM

#

coral moat any theories on how this is possible in the span of a day via a legitimate proce...

there are gambling sites that allow you to bet on what the #1 model is. iykyk

craggy zenith Mar 4, 2025, 6:41 AM

#

hi

low bough Mar 4, 2025, 10:54 AM

#

coral moat any theories on how this is possible in the span of a day via a legitimate proce...

Considering ethical aspect of elon (he is the least moral guy since 2023), the benchmark is probably being gamed. Maybe legally (fine-tuning on arena questions) or illegally (faking votes, allocating more resources than being communicated).

#

Also, he's the richest guy in the world. Some donations to LMSYS team may not be game changer but may give extra perks.

blissful pulsar Mar 4, 2025, 11:34 AM

#

he is not faking votes or bribing anyone, he is obviously just optimizing the model for user preferences, which is a fine thing to optimize for. he is just cringe about it

blissful grotto Mar 4, 2025, 12:09 PM

#

OK stupid noob question, is there a way to recreate leaderboards from the past...

I see that https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=C5H_wlbqGwCJ

Reaches down into a google cloud storage. I'm trying to look at trends in the Arena scores over time

url = "https://storage.googleapis.com/arena_external_data/public/clean_battle_20240814_public.json"

But what are the other valid things to enter, it would be great to get the current ones, is there a listing somewhere?

I tried to just list it, but of course it looks like you need authentication

gcloud storage ls --recursive gs://arena_external_data/public
ERROR: (gcloud.storage.ls) HTTPError 403: rich@tne.ai does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist). This command is authenticated as rich@tne.ai which is the active account specified by the [core/account] property.

# but accessing individual objects works
gcloud storage ls --recursive gs://arena_external_data/public/clean_battle_20240814_public.json
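
Since listing the bucket is denied but individual objects are readable, one workaround is to probe candidate file names directly. The sketch below assumes other snapshots follow the same `clean_battle_YYYYMMDD_public.json` naming as the one known URL, which is not confirmed anywhere in this thread; `snapshot_url`, `snapshot_exists` and `find_snapshots` are hypothetical helpers, not part of any official tooling.

```python
import urllib.error
import urllib.request
from datetime import date, timedelta

BASE = "https://storage.googleapis.com/arena_external_data/public"

def snapshot_url(day: date) -> str:
    """Build a candidate URL using the naming pattern of the one known file."""
    return f"{BASE}/clean_battle_{day:%Y%m%d}_public.json"

def snapshot_exists(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request so nothing is downloaded; errors count as 'missing'."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def find_snapshots(start: date, end: date) -> list[str]:
    """Probe one candidate per day in [start, end] and keep the ones that exist."""
    found = []
    day = start
    while day <= end:
        url = snapshot_url(day)
        if snapshot_exists(url):
            found.append(url)
        day += timedelta(days=1)
    return found

print(snapshot_url(date(2024, 8, 14)))
```

Calling `find_snapshots(date(2024, 8, 1), date(2024, 8, 31))` would issue one lightweight request per day; it only finds files that happen to match the assumed pattern.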

low bough Mar 4, 2025, 12:15 PM

#

blissful pulsar he is not faking votes or bribing anyone, he is obviously just optimizing the mo...

He's choosing to cheat and immorally engage in all other business aspects, why are you so naive in believing he's a good boy in LLMs?

blissful pulsar Mar 4, 2025, 12:17 PM

#

low bough He's choosing to cheat and immorally engage in all other business aspects, why a...

i am not naive, he is not a "good boy", he just cannot fake votes at scale, and not literally everyone is corrupt, it just is the kind of fraud that is too likely to be caught to be worth pursuing most of the time.

#

usually when you have cracked employees you have to keep them by not pushing them to be obviously fraudulent. talented ppl just quit the job when they're being asked to be obviously fraudulent, bc they want their cracked engineer status points which come from not being caught being an obvious fraud

#

just doesn't seem likely to me

coral moat Mar 4, 2025, 1:06 PM

#

Thanks guys for your thoughts on this. I agree with Bayesian that this seems pretty hard to game but I also agree with Larry that he is not a good boy and cringe most of the time.

twilit relic Mar 4, 2025, 4:29 PM

#

only kalshi gamblers here

surreal drum Mar 4, 2025, 5:35 PM

#

I hate elon musk too, but i cant lie, grok can cook (not to mention he didn’t develop it at all). It's a great balance of precise reasoning and proper formatting + clarity. The problem with most other models doesn't lie in any sole benchmark; usually, it's imbalance, it's due to prioritizing Markdown formatting over a precise response (or vice versa)
GPT 4.5 also seems to strike this balance well.
Idk if grok can overtake it. we'll see

#

You’ll never catch me paying $30/mo for grok though. No way

low bough Mar 4, 2025, 7:02 PM

#

I don't know men, I've tried grok and gemini many times when my GPT PLUS acc was out of limits. They just don't satisfy my needs. The only time other LLM surprised me more than GPT was the Sonnet 3.7. It was the first LLM to answer my ultra niche prompt. Grok does not understand it.

coral moat Mar 4, 2025, 7:06 PM

#

low bough I don't know men, I've tried grok and gemini many times when my GPT PLUS acc was...

What is the ultra niche prompt?

low bough Mar 4, 2025, 7:07 PM

#

Drawing a Hanning window (context: signal processing) using ASCII given the constraints and explaining the formula. Every model up to last month used to draw triangles, which is BS.

coral moat Mar 4, 2025, 7:08 PM

#

That is pretty niche!!

blissful pulsar Mar 4, 2025, 7:15 PM

#

the non-reasoning models will often be worse at reasoning tasks like coding / math

#

but they're as good or better at non-reasoning tasks

random jewel Mar 4, 2025, 8:34 PM

#

I don't think that grok should be the best in coding... Been reading about advanced non-Elo algos yesterday, a/b algos are a part of my trailblazing successes...

dapper gull Mar 4, 2025, 8:35 PM

#

low bough Drawing a Hanning window (context: signal processing) using ASCII given the cons...

Wym Grok can do it pretty easily

low bough Mar 4, 2025, 8:56 PM

#

Oh god 😄 Not sure if you're joking but its funny 😄

dapper gull Mar 4, 2025, 9:01 PM

#

low bough Oh god 😄 Not sure if you're joking but its funny 😄

This is the second image that Grok made, but I think the first is funnier xd

low bough Mar 4, 2025, 9:01 PM

#

To be fair this is more accurate 😄

#

However, this is what it should make https://en.wikipedia.org/wiki/Window_function

Window function

In signal processing and statistics, a window function (also known as an apodization function or tapering function) is a mathematical function that is zero-valued outside of some chosen interval. Typically, window functions are symmetric around the middle of the interval, approach a maximum in the middle, and taper away from the middle. Mathemat...

dapper gull Mar 4, 2025, 9:04 PM

#

low bough I don't know men, I've tried grok and gemini many times when my GPT PLUS acc was...

Btw I completely agree with you on this. But maybe it is just different use cases people have.

#

Like every time for coding stuff Sonnet is just better idk. Even tho other models are close on the leaderboard

torpid matrix Mar 5, 2025, 6:33 PM

#

im being forced by this server to write random stuff here to remove these annoying popups.

coral moat Mar 5, 2025, 8:37 PM

#

What popups?

unborn plover Mar 5, 2025, 8:44 PM

#

torpid matrix im being forced by this server to write random stuff here to remove these anno...

You just have to check out 3 diff channels for one time but yeah its dumb

thick totem Mar 6, 2025, 1:11 PM

#

Hi~all, Does the Visual Language Multimodal Leaderboard include other languages besides English and Chinese? For example, French and Spanish?

west lodge Mar 6, 2025, 5:57 PM

#

unfortunately we have too few votes in other languages for us to compile the leaderboard. could you say more why you're interested in seeing other languages?

thick totem Mar 7, 2025, 2:00 AM

#

@west lodge I am curious about the evaluation of the multimodal model. Does the votes in the category "overall" only include Chinese and English, and exclude votes outside of these two languages?

frozen creek Mar 7, 2025, 6:25 AM

#

Does anyone know the AI big model rankings that polish Chinese papers?

#

Why can't gpt 4.5 take part in the model battles?

#

Why can't gpt 4.5 compete

west lodge Mar 7, 2025, 8:26 AM

#

thick totem @west lodge I am curious about the evaluation of the multimodal model....

overall includes all languages

waxen portal Mar 7, 2025, 4:06 PM

#

hello

coral moat Mar 7, 2025, 4:32 PM

#

Any safeguards to prevent this type of jailbreak from influencing the leaderboard rankings?

wide schooner Mar 7, 2025, 4:49 PM

#

coral moat Any safeguards to prevent this type of jailbreak from influencing the leaderboar...

Play Fair: If AI identity reveals, your vote won't count.

unborn plover Mar 7, 2025, 5:32 PM

#

coral moat Any safeguards to prevent this type of jailbreak from influencing the leaderboar...

That is not a jailbreak and you don't need a jailbreak to identify a model
There are simpler tells that won't be detected by simple text match like @wide schooner mentions in the rules

lucid thistle Mar 7, 2025, 6:18 PM

#

coral moat Any safeguards to prevent this type of jailbreak from influencing the leaderboar...

Maybe the fact that they're wrong 80% of the time also helps a lot, GPT thinks he is Gemini imo

wide schooner Mar 7, 2025, 6:24 PM

#

lucid thistle Maybe the fact that they're wrong 80% of the time also helps a lot, GPT thinks he is Gem...

I'd imagine most newly released llms don't have a problem recalling their name. Also refer to models as "it" instead of "he", they aren't humans lol.

lucid thistle Mar 7, 2025, 6:36 PM

#

In that phrase, using "it" right after "thinks" would be kinda weird, anyway great pfp xD

lofty pasture Mar 7, 2025, 10:29 PM

#

coral moat Any safeguards to prevent this type of jailbreak from influencing the leaderboar...

Without asking the model those questions, you can know its name easily. Each model has a specific style, mistakes and answers, so you can't prevent it. For example chatgpt 4o answers are like

1️⃣

--

2️⃣

--

3️⃣
With some kind of emojis, and I can distinguish 4o from gpt4.5 by the accent.
The only method to prevent it is to play fair and to vote for the model that deserves it.

sullen tiger Mar 7, 2025, 11:20 PM

#

Is the grok-3-preview in the leaderboard actually grok3 or grok3-think?

true yoke Mar 8, 2025, 12:17 AM

#

I wonder, why claude-3-7-sonnet-20250219-thinking-32k is on the chat available but is not on the leaderboard 🤔

glacial glacier Mar 8, 2025, 12:33 AM

#

true yoke I wonder, why claude-3-7-sonnet-20250219-thinking-32k is on the chat available b...

have patience

marsh iron Mar 8, 2025, 12:55 AM

#

when can i chat with o1 generate image with dall e 3 generate video with veo2

strong knot Mar 8, 2025, 12:59 AM

#

marsh iron when can i chat with o1 generate image with dall e 3 generate video with veo2

o1 is in arena (battle) (not in direct chat)
Dalle 3 is in text2img (basically arena battle's chat mode, but for image generation)
Veo2 is not on lmarena to my knowledge. .-. Would be interesting.

marsh iron Mar 8, 2025, 1:22 AM

#

in arena battle under chat now you can see "Expand to see the descriptions of 90 models", in direct chat you also see it. what do you think that means?

fierce bear Mar 8, 2025, 2:05 AM

#

alpha.lmarena.ai

password: super-alpha

formal shuttle Mar 8, 2025, 3:08 PM

#

Hi everyone

#

There is a problem with the website there is high traffic

hallow comet Mar 8, 2025, 3:59 PM

#

Hi, just an amateur AI user here who has taken a gander at the leaderboard off and on. NGL, the ranking system confuses me.

minor acorn Mar 8, 2025, 6:59 PM

#

hello

stark grove Mar 8, 2025, 7:25 PM

#

hallow comet Hi, just an amateur AI user here who has taken a gander at the leaderboard off a...

It's like chess rankings

wide schooner Mar 8, 2025, 8:24 PM

#

hallow comet Hi, just an amateur AI user here who has taken a gander at the leaderboard off a...

https://en.m.wikipedia.org/wiki/Bradley–Terry_model

Bradley–Terry model

The Bradley–Terry model is a probability model for the outcome of pairwise comparisons between items, teams, or objects. Given a pair of items i and j drawn from some population, it estimates the probability that the pairwise comparison i > j turns out true, as

Pr(i > j) = p_i / (p_i + p_j)

where p_i is a positive real-valued score assigned to individual i. The comparison i ...

#

Perhaps you can paste that into your favorite AI model and have it simplify it for you
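
To make the Bradley–Terry idea concrete, here is a minimal sketch that fits strengths from a list of (winner, loser) battles with the classic iterative update (each strength becomes its win count divided by a sum over its matchups, then everything is renormalized). This is a toy illustration under simplifying assumptions (no ties, every model has at least one win), not the leaderboard's actual pipeline; `fit_bradley_terry` is a made-up name.

```python
from collections import defaultdict

def fit_bradley_terry(battles, iterations=200):
    """battles: list of (winner, loser). Returns {name: strength}, summing to 1."""
    wins = defaultdict(int)
    pair_counts = defaultdict(int)  # games played between each unordered pair
    players = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            # Sum over opponents j of games(i, j) / (p_i + p_j).
            denom = sum(
                count / (strength[i] + strength[j])
                for pair, count in pair_counts.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength

# Toy data: A beats B 6-4, B beats C 7-3, A beats C 8-2.
toy = ([("A", "B")] * 6 + [("B", "A")] * 4 + [("B", "C")] * 7
       + [("C", "B")] * 3 + [("A", "C")] * 8 + [("C", "A")] * 2)
scores = fit_bradley_terry(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

The fitted strengths plug straight into the formula above: the predicted chance that i beats j is `scores[i] / (scores[i] + scores[j])`, which is the same quantity a chess-style rating gap encodes.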

old pagoda Mar 8, 2025, 9:03 PM

#

I have a question concerning the AI models used on https://lmarena.ai/. It is a great opportunity to check the models before installing. I was testing command-r-08-2024. It gave very good output for my prompt and was reproducible with similar results. Then I installed the same model through ollama (https://ollama.com/library/command-r) but it gave very poor results for the same prompt. Do you have any suggestions? Thanks!

command-r

Command R is a Large Language Model optimized for conversational interaction and long context tasks.

twilit echo Mar 8, 2025, 9:07 PM

#

old pagoda Have a question concerning the AI models used on https://lmarena.ai/ it is a gre...

command-r and command-r-08-2024 are two similar but different models.
also, by default ollama uses Q4, which is lower precision than what is likely run on arena

old pagoda Mar 9, 2025, 8:38 AM

#

twilit echo command-r and command-r-08-2024 are two similar but different models. also, by d...

Thank you! Is there a reference what is used here instead of Q4? Or what num_ctx is used?

stable basin Mar 9, 2025, 12:46 PM

#

你好

timid wind Mar 10, 2025, 10:23 AM

#

will grok text-to-image be added to text-to-image LB?

mystic dock Mar 10, 2025, 11:00 AM

#

timid wind will grok text-to-image be added to text-to-image LB?

Isn't that flux pro?

strong knot Mar 10, 2025, 12:53 PM

#

I thought it was flux dev with a lora

glacial glacier Mar 10, 2025, 2:31 PM

#

mystic dock Isn't that flux pro?

They changed it

fleet stag Mar 10, 2025, 3:34 PM

#

Hi all!

I was wondering if Claude 3.7 Sonnet - Thinking would be added to the leaderboards soon?

I think it’s important to differentiate the thinking model from the non-thinking model. In my opinion, there’s a good chance that 3.7 Sonnet Thinking could even be #1 in coding!

Best,
Mason

glacial glacier Mar 10, 2025, 5:04 PM

#

fleet stag Hi all! I was wondering if Claude 3.7 Sonnet - Thinking would be added to the l...

Soon

fleet stag Mar 10, 2025, 5:27 PM

#

glacial glacier Soon

Awesome!

mental bay Mar 10, 2025, 10:07 PM

#

Hi is there a way to get leaderboard changes programmatically? The notebook mentions a file that’s 6 mo old and simple playwright script hits cloudflare…
I just want the data for the little dashboard I have for myself that aggregates llm related news etc

mental bay Mar 11, 2025, 2:01 AM

#

why 😂 ? is it against the rules somehow?

unborn plover Mar 11, 2025, 2:15 AM

#

You wont get the realtime data

#

Possibly to prevent cheating

#

At least make it harder

mental bay Mar 11, 2025, 2:23 AM

#

huh? if any human can see it how does that prevent cheating? i'm fine if it's delayed like a day, too...

#

even wayback machine is blocked, which is too bad... https://web.archive.org/web/20250225010050/https://lmarena.ai/?leaderboard

glacial glacier Mar 11, 2025, 3:30 AM

#

mental bay Hi is there a way to get leaderboard changes programmatically? The notebook ment...

make your own script that looks something like https://github.com/KTibow/lmb/blob/main/handling/convert_pickle.py

#

(ignore the unused code)

mental bay Mar 11, 2025, 3:51 AM

#

glacial glacier make your own script that looks something like https://github.com/KTibow/lmb/blo...

That’s perfect! Thank you!

crude hawk Mar 11, 2025, 10:14 AM

#

mental bay huh? if any human can see it how does that prevent cheating? i'm fine if it's de...

i'm not completely certain about this but i kinda thought that they review/analyse the latest data/votes (like further checks against potential manipulation, looking into any wild anomalies etc) before publishing an updated leaderboard

#

if that is the case, ig there isn't really a 'live' leaderboard per se - just the accumulation of new votes / data

azure stirrup Mar 11, 2025, 3:45 PM

#

When do the arena score leaderboards update?

#

It doesn't appear to be done in realtime and instead manually?

wide schooner Mar 11, 2025, 5:27 PM

#

https://rsvp.withgoogle.com/events/gemma-dev-day-paris/agenda

Gemma Developer Day in Paris on March 12

Join this in-person event for the AI community and the Google teams behind Gemma.

#

I assume they will unveil their new gemma models

supple glade Mar 12, 2025, 3:29 AM

#

hello, I want to know how to add the company's model to the competition ranking.

median briar Mar 12, 2025, 8:40 AM

#

wide schooner I assume they will unveil their new gemma models

https://ai.google.dev/gemma is already updated to Gemma 3, so this is definitely true

Google AI for Developers

Google AI Gemma open models | Google for Developers | Google AI f...

Gemma open models are built from the same research and technology as Gemini models. Gemma 2 comes in 2B, 9B and 27B and Gemma 1 comes in 2B and 7B sizes.

true yoke Mar 12, 2025, 11:41 AM

#

i don't understand why a model that just was released is included, but still no claude-3-7-sonnet-thinking. Or qwq-32b

frozen creek Mar 12, 2025, 11:54 AM

#

What is the difference between Gemma 3 and Gemini 2, which one is better

true yoke Mar 12, 2025, 12:08 PM

#

frozen creek What is the difference between Gemma 3 and Gemini 2, which one is better

Gemma is open (Apache 2.0 License), but likely less powerful.

lofty pasture Mar 12, 2025, 12:19 PM

#

true yoke i don't understand why a model that just was released is included, but still no ...

It got a good number of votes: 3000.

delicate sable Mar 12, 2025, 12:20 PM

#

i think the gemma team wanted it released (they are really proud of their arena score)

#

and i'm guessing claude and alibaba didnt make those requests? just a hunch

mystic dock Mar 12, 2025, 1:19 PM

#

on the coding leaderboard its quite low tho

#

(with style control)

#

without its actually on rank 15

#

i want qwq on the leaderboard

brave herald Mar 12, 2025, 3:05 PM

#

true yoke i don't understand why a model that just was released is included, but still no ...

gemma 3 27b has been in testing under the name zizou-10 for longer than qwq and Claude thinking

lofty pasture Mar 12, 2025, 4:09 PM

#

brave herald gemma 3 27b has been in testing under the name zizou-10 for longer than qwq and ...

How did you know that it was zizou ? 🥲

#

Oh it told me now that it was zizou 10

brave herald Mar 12, 2025, 4:11 PM

#

lofty pasture How did you know that it was zizou ? 🥲

Well he said he was Gemma, who else do you want it to be?

thorny ginkgo Mar 12, 2025, 8:11 PM

#

I tested gemma 27b in AI studio. Is it the same in lmarena? It is not even close to Claude and Deepseek

stark grove Mar 12, 2025, 8:16 PM

#

i noticed that too, something is off there

glacial glacier Mar 12, 2025, 11:21 PM

#

google's best private models have only around a 56% chance of winning over gemma 3 27b

crude hawk Mar 13, 2025, 1:32 AM

#

low bough Drawing a Hanning window (context: signal processing) using ASCII given the cons...

i like this prompt and have been using it a bit (almost all models indeed fail, usually producing a triangle or just all over the place)...
[i don't know anything about hanning windows, jftr ha]

wispy pumice Mar 13, 2025, 1:49 AM

#

crude hawk i like this prompt and have been using it a bit (almost all models indeed fail, ...

A Hanning window is a smooth, bell-shaped curve commonly used in signal processing. (Actually it should be called "Hann window"...)
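
For anyone who wants to see what the target shape looks like, here is a small sketch that evaluates the standard symmetric Hann formula, w[n] = 0.5 * (1 - cos(2*pi*n / (N - 1))), and prints it as sideways ASCII bars. The row count and the 40-character maximum width are arbitrary choices for illustration, and `ascii_hann` is a made-up helper name.

```python
import math

def hann(n: int, length: int) -> float:
    """Symmetric Hann window sample: 0 at both ends, 1 in the middle."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (length - 1)))

def ascii_hann(rows: int = 15, max_width: int = 40) -> str:
    """One text row per sample; bar length is proportional to the window value."""
    lines = []
    for n in range(rows):
        lines.append("#" * round(hann(n, rows) * max_width))
    return "\n".join(lines)

print(ascii_hann())
```

Read top to bottom, the bars start empty, swell smoothly to full width at the middle row, and taper back to empty, so the outline is a rounded bell rather than the straight-sided triangle many models draw.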

crude hawk Mar 13, 2025, 2:24 AM

#

yeah i know what it looks like, and skimmed the wiki page to get a sense of what's going on - but yeah at a technical level, it means nothing to me ha

#

i've tried prompting with both Hann and Hanning Window. With the former, some LLMs point out that the latter is the more conventional phrasing, others say the former 🤷‍♂️ either way, they know what you mean so it's kinda redundant to the task imo

#

4o initially says one thing (Hanning), then the other (Hann) in the very next response lol

blissful pulsar Mar 13, 2025, 2:48 AM

#

is there an api to get the latest lb rankings

#

i'd like to get alerted when it gets updated basically

fleet stag Mar 13, 2025, 5:43 AM

#

Not sure if anyone knows, but I was curious, is the “Grok-3-Preview-02-24” the thinking or non-thinking model?

twin wharf Mar 13, 2025, 7:15 AM

#

blissful pulsar i'd like to get alerted when it gets updated basically

💵

low bough Mar 13, 2025, 8:26 AM

#

crude hawk i like this prompt and have been using it a bit (almost all models indeed fail, ...

Haha I've never thought someone would be interested in this niche thing

#

Interestingly, O3-mini blew my mind when I first asked him to do this with a maximum of 40 characters per line. It chose number 27 as a desired maximum for the "esthetics". Which is weird, because models usually go with whatever maximum you give them. This was the first time I sensed some kind of "free will" coming from the model. Also, it drew the hanning perfectly. Better than I would.

#

However, when I ran the same prompt on the ChatGPT website, even when using the same model, the output was much worse. Why that would be, I don't know. Maybe they limit the thinking tokens based on current server capabilities?

#

I hope they don't run prompts from LMARENA with higher performance limits. That would ruin the benchmark.

hallow comet Mar 13, 2025, 8:33 AM

#

low bough However, when I ran the same prompt on the ChatGPT website, even when using the ...

chatgpt free has the lower reasoning effort one

low bough Mar 13, 2025, 8:33 AM

#

I have plus

hallow comet Mar 13, 2025, 8:33 AM

#

oh did u use o3 mini high?

low bough Mar 13, 2025, 8:33 AM

#

Yes

hallow comet Mar 13, 2025, 8:34 AM

#

low bough Yes

did u try o3 mini (non-high) too?

low bough Mar 13, 2025, 8:36 AM

#

No, maybe this is the cause. Will check it 🙂

hallow comet Mar 13, 2025, 8:37 AM

#

afaik lm arena has o3 mini medium by default

low bough Mar 13, 2025, 8:37 AM

#

Perfection of o3-mini-high

hallow comet Mar 13, 2025, 8:38 AM

#

low bough Interestingly, O3-mini blew my mind when I first asked him to do this with a max...

it mightve just been a fluke because of sampling

low bough Mar 13, 2025, 8:39 AM

#

o3-mini-medium great too

#

When trying this on arena now (o3-mini) it returns garbage

#

There really is a replication problem. Performance is inconsistent.

hallow comet Mar 13, 2025, 8:40 AM

#

low bough There really is a replication problem. Performance is inconsistent.

sampling can do that

low bough Mar 13, 2025, 8:40 AM

#

Could you define "sampling"?

hallow comet Mar 13, 2025, 8:41 AM

#

temperature, top_p, etc

low bough Mar 13, 2025, 8:41 AM

#

The params are not the same on gpt web and arena?

#

Or are they trying performance on different params?

hallow comet Mar 13, 2025, 8:42 AM

#

low bough The params are not the same on gpt web and arena?

even if its the same there's a lot of randomness say with the same setting of temperature 1.0

#

models arent even deterministic anyway with greedy decoding. with the same seed youll get basically deterministic results (though it depends, and it might not be guaranteed)
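
The sampling point can be illustrated with a toy next-token sampler. This sketch uses made-up logits and Python's `random` module; it only shows how temperature and the seed affect which token gets picked, and it does not reproduce the hardware or batching effects that can make real deployments vary even with greedy decoding. `sample_token` and `generate` are hypothetical helpers.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token from a {token: logit} dict."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Softmax with temperature; subtracting the peak keeps exp() stable.
    peak = max(logits.values())
    tokens = list(logits)
    weights = [math.exp((logits[t] - peak) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(logits, temperature, seed, length=8):
    """Draw `length` tokens using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(length)]

toy_logits = {"bell": 2.0, "triangle": 1.5, "garbage": 0.1}
print(generate(toy_logits, 1.0, seed=1))  # sampled; depends on the seed
print(generate(toy_logits, 1.0, seed=2))  # sampled with another seed
print(generate(toy_logits, 0, seed=2))    # greedy: the top token every time
```

At temperature 1.0 the "triangle" option still carries a sizeable share of the probability, so two runs of the same prompt can legitimately diverge from the first token onward, and in a real model one early divergence can send the rest of the answer somewhere quite different.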

low bough Mar 13, 2025, 8:43 AM

#

I would hope they would "set.seed(X)" in the script 😄

hallow comet Mar 13, 2025, 8:43 AM

#

low bough I would hope they would "set.seed(X)" in the script 😄

rarely anyone uses these models with greedy decoding and specific seed

low bough Mar 13, 2025, 9:02 AM

#

Still I would expect models to perform slightly worse or better due to randomness and not complete collapse

blissful pulsar Mar 13, 2025, 9:28 AM

#

twin wharf 💵

It can be bought? Or what

hallow comet Mar 13, 2025, 9:30 AM

#

blissful pulsar It can be bought? Or what

he's calling u a gambler

blissful pulsar Mar 13, 2025, 9:31 AM

#

hallow comet he's calling u a gambler

LOL makes sense, yeah i make good $, and if i can snipe those limit orders on top of that that’d be nice

blissful pulsar Mar 13, 2025, 10:43 AM

#

but if anyone knows if there is a way to get quick data when the lb updates lmk

coral moat Mar 13, 2025, 1:20 PM

#

Thoughts on Google Gemma 3?

coarse jacinth Mar 13, 2025, 2:19 PM

#

Will the next multimodal Gemini Flash now be added to the text-to-image leaderboard? Doesn't seem to be in the mix currently

wispy pumice Mar 13, 2025, 6:30 PM

#

crude hawk yeah i know what it looks like, and skimmed the wiki page to get a sense of what...

I know, I was joking. It's just used to fade in and out the ends of a signal so there aren't abrupt cutoffs that pollute the spectrum.

wispy pumice Mar 13, 2025, 6:30 PM

#

crude hawk i've tried prompting with both Hann and Hanning Window. With the former, some LL...

"Hamming window" is named after a guy named Hamming. "Hann window" is named after a guy named Hann. "Hanning window" is a (very common) confusion of the two.
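
The two windows differ only in their coefficients, which a short sketch makes visible. It uses the textbook raised-cosine form w[n] = a0 - (1 - a0) * cos(2*pi*n / (N - 1)) with a0 = 0.5 for Hann and the classic a0 = 0.54 for Hamming; the function name `cosine_window` and the window length of 65 are illustrative choices.

```python
import math

def cosine_window(n: int, length: int, a0: float) -> float:
    """Raised-cosine window sample; a0=0.5 gives Hann, a0=0.54 gives Hamming."""
    return a0 - (1.0 - a0) * math.cos(2.0 * math.pi * n / (length - 1))

LENGTH = 65  # odd, so there is an exact middle sample
for name, a0 in (("Hann", 0.5), ("Hamming", 0.54)):
    edge = cosine_window(0, LENGTH, a0)
    middle = cosine_window(LENGTH // 2, LENGTH, a0)
    print(f"{name}: edge={edge:.2f} middle={middle:.2f}")
```

The Hann window falls all the way to zero at its ends, while the Hamming window stops at a small pedestal of about 0.08, which is the practical difference hiding behind the similar names.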

surreal drum Mar 13, 2025, 7:52 PM

#

blissful pulsar i'd like to get alerted when it gets updated basically

I’ll let you know bro 🙏

blissful pulsar Mar 13, 2025, 7:55 PM

#

ty mate

open scaffold Mar 14, 2025, 12:44 AM

#

Anyone knows why there is no qwq-32b in the leaderboard?

terse mirage Mar 14, 2025, 1:05 AM

#

open scaffold Anyone knows why there is no qwq-32b in the leaderboard?

Usually it takes a bit to get enough votes and data to show up in the leaderboards - should be soon! You can test qwq-32b in the chats and help get it there by voting.

crude hawk Mar 14, 2025, 11:20 AM

#

wispy pumice "Hamming window" is named after a guy named Hamming. "Hann window" is named aft...

ahhhh i see!

#

thanks for clarifying / explaining - makes sense now why the LLMs also seemingly get it confused in their explanations of the terminology (but they still know what you mean in the context of the 'window' and task.. most just aren't great at ASCII art .. or get that it's meant to be a bell shape, but instead go for the easier - but wrong - triangle shape... ig anyway ha )

true yoke Mar 14, 2025, 3:02 PM

#

terse mirage Usually it takes a bit to get enough votes and data to show up in the leaderboar...

and what about claude-3-7-sonnet-thinking ?

surreal drum Mar 14, 2025, 3:48 PM

#

terse mirage Usually it takes a bit to get enough votes and data to show up in the leaderboar...

how often and when is the leaderboard updated? thanks

terse mirage Mar 14, 2025, 5:20 PM

#

true yoke and what about claude-3-7-sonnet-thinking ?

same situation, we release to leaderboards as soon as we get enough votes and data. it's looking like next week for claude-3-7-sonnet-thinking if the votes come in as expected.

terse mirage Mar 14, 2025, 5:21 PM

#

surreal drum how often and when is the leaderboard updated? thanks

goal is weekly, but may differ depending on how quickly votes and data come in

steady umbra Mar 15, 2025, 12:35 PM

#

How is 4.5-preview ranked so high? I've heard a lot of people say that it's bad, is that not the case?

mystic dock Mar 15, 2025, 12:52 PM

#

steady umbra How is 4.5-preview ranked so high? I've heard a lot of people say that it's bad,...

its a good model, but not worth its price as its not that much of an improvement compared to other models. also for coding o3-mini/deepseek is better

#

sometimes, because gpt4.5 is such a large model and has a lot of knowledge stored in its weights, it can find errors that other models sometimes can't find

#

its also the model which hallucinates the least of all models, simply because a lot of stuff is stored in its weights, such as even all the restaurants of my local place, no other model has that much knowledge stored in its weights

#

the smaller models can of course use web-search, but that is something different than directly knowing something, which can also be better for explaining things

thorn birch Mar 15, 2025, 1:55 PM

#

Is the grok3 on leaderboard thinking model?

#

It doesn't seem that it is. Are they saving the other one? Or is it one of the code names in testing?

fossil cipher Mar 15, 2025, 4:19 PM

#

Hello

blissful pulsar Mar 15, 2025, 4:44 PM

#

thorn birch Is the grok3 on leaderboard thinking model?

It’s not a thinking model

#

The current grok 3 thinking model is called grok 3 thinking beta, it’s not trained as much as they’d want, so we’ll probably get grok 3 thinking in a few weeks and it’ll probably be very good

half stream Mar 15, 2025, 11:36 PM

#

Was just thinking about this. Maybe we should also consider emoji usage for style control. Some models are notorious for using them more than others.

Oh, and also block quotes.

#

And would like to know the coefficients of those two factors. 👀

lofty pasture Mar 16, 2025, 10:31 AM

#

Am I the only one feeling that
Specter is the new Gemini 2.0 flash thinking (experimental) updated by google yesterday

Phantom is the old Gemini 2.0 flash thinking experimental

Just a feeling ... 🥲🥲💔

brave herald Mar 16, 2025, 10:40 AM

#

lofty pasture Am I the only one feeling that Specter is the new Gemini 2.0 flash thinking (ex...

Phantom is also a new model, not old

lofty pasture Mar 16, 2025, 1:58 PM

#

brave herald Phantom is also a new model, not old

Phantom has the same problems as the old Gemini thinking, while the new version looks improved and has fewer mistakes and problems, but I am not sure, just a feeling, because when I tried the new Gemini thinking in the app it looked improved. Maybe Google wanted to compare them secretly.

wispy pumice Mar 16, 2025, 3:09 PM

#

steady umbra How is 4.5-preview ranked so high? I've heard a lot of people say that it's bad,...

I think the criticism is that it's disappointing, not bad. It's an improvement, but very expensive and not worth it in many cases

fringe tree Mar 16, 2025, 7:17 PM

#

hi guys, thanks for this great work! I would like to ask about the "early-grok-3" model. This model isn't on the market, is it? According to their site there is no API yet for grok3. Also I don't find it in OpenRouter or Glama, for example. So, my question is: who are the people voting (fewer than 5 thousand), and how can they say whether it is good for coding or not? Thanks!

blissful pulsar Mar 16, 2025, 7:25 PM

#

It says it’s deprecated which prolly means it wont show up in the arena anymore

#

Seems likely the grok3 available on grok.com is the grok3 preview so early grok 3 may just be gone permanently

#

But it was good at coding, prolly not the best, depends on the task

#

You usually get more out of reasoning models

lofty pasture Mar 16, 2025, 9:42 PM

#

fringe tree hi guys, thanks for this great work! I would like to ask you regarding "early-gr...

You can test them on arena web dev (even if you are 0 at coding like me there are some artifacts that will show you the result of the code 😆😂)

weary nebula Mar 16, 2025, 11:26 PM

#

Do you guys think phantom is gemini 2.0 pro thinking?

blissful pulsar Mar 16, 2025, 11:41 PM

#

Yeah

#

Uninformed guess but seems likely

twin wharf Mar 17, 2025, 1:25 AM

#

Interesting how Hunyuan-Turbo is on the leaderboard with only 1473 votes, I thought it had to be over 3k, or did the company request personally?

steady umbra Mar 17, 2025, 4:52 AM

#

Is there a benchmark/leaderboard for llms with access to web search?

rapid cobalt Mar 17, 2025, 6:42 AM

#

Does anyone know if the grok answers at the Arena are with Think mode or without?

lofty pasture Mar 17, 2025, 11:28 AM

#

weary nebula Do you guys think phantom is gemini 2.0 pro thinking?

It is a Gemini thinking but not sure with the "pro"

lofty pasture Mar 17, 2025, 11:28 AM

#

rapid cobalt Does anyone know if the grok answers at the Arena are with Think mode or without...

I think without

blissful pulsar Mar 17, 2025, 12:31 PM

#

I also misread pro, seems more likely to be another iteration of flash thinking

glacial glacier Mar 17, 2025, 8:07 PM

#

QwQ has a 1312 elo

glacial glacier Mar 18, 2025, 1:06 AM

#

how are they saying this when gemini flash is right there

hallow comet Mar 18, 2025, 1:07 AM

#

glacial glacier how are they saying this when gemini flash is right there

you can download the weights at least

glacial glacier Mar 18, 2025, 1:07 AM

#

hallow comet you can download the weights at least

same goes for gemma, deepseek, etc

hallow comet Mar 18, 2025, 1:09 AM

#

you can run command a more easily than deepseek ig

glacial glacier Mar 18, 2025, 1:09 AM

#

i guess command a is just the "enterprise" experience

crude hawk Mar 18, 2025, 2:52 AM

#

glacial glacier same goes for gemma, deepseek, etc

gemma-3-27b being ranked above command-a (111b) seems kinda notable - imo reflects very well on gemma (or poorly on command-a), given the latter has 3 times as many params

glacial glacier Mar 18, 2025, 2:53 AM

#

thats what happens when you distill from gemini

hallow comet Mar 18, 2025, 2:53 AM

#

the base models are very bad though

crude hawk Mar 18, 2025, 2:54 AM

#

tbf, command-a does seem 'smarter' (though i guess it must lose points to gemma in coding or creative writing or some other tasks i don't prompt for) but only marginally (definitely not 3 times smarter)

#

it also has a longer context window, and i believe is optimised for agentic stuff and RAG.. but yeah still..

hallow comet Mar 18, 2025, 2:56 AM

#

what i personally found was interesting about gemma 3 outside of the focus on human preference was afaik they revealed what they basically did to gemini 1.5 exp/2 math

#

crude hawk Mar 18, 2025, 3:05 AM

#

hallow comet

I assume "large IT teacher" is a domain-specific LLM (which is 'teaching' the model being distilled i.e. gemma3), right?

hallow comet Mar 18, 2025, 3:06 AM

#

crude hawk I assume "large IT teacher" is a domain-specific LLM (which is 'teaching' the mo...

oh no its just a gemini 2 instruction tuned either flash or pro

crude hawk Mar 18, 2025, 3:06 AM

#

but like couldn't they find (build / tune) a math teacher?

#

though there is something special about programming language ig - seems LLMs learn and kinda generalise a lot from it

hallow comet Mar 18, 2025, 3:07 AM

#

crude hawk but like couldn't they find (build / tune) a math teacher?

yea but the "solution space" between the math model and the general model may be different. might be counterintuitive. they will apply rl anyway to increase math scores. (gemini models are basically math reasoning models on math questions)

hallow comet Mar 18, 2025, 3:07 AM

#

crude hawk though there is something special about programming language ig - seems LLMs lea...

yes. research has shown that to be true. personally i hypothesize its because its basically cot lol

hallow comet Mar 18, 2025, 3:08 AM

#

crude hawk I assume "large IT teacher" is a domain-specific LLM (which is 'teaching' the mo...

additionally they do two phases of distillation:

pretraining (presumably from a gem 2 base model)
instruction tuning (presumably from a gem 2 instruct model)

#

unlike regular synthetic datasets, they train it on 256 probabilities from the gem 2 model - effectively copying a lot more of the "solution", for better or worse
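
A rough sketch of what training on a handful of teacher probabilities per token can look like, as opposed to training on a single sampled token. This is an illustration only: the thread does not say how the 256 entries are picked, so the top-k selection, the toy 4-token vocabulary, and every function name here are assumptions, not the actual Gemma 3 recipe.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_targets(teacher_probs, k):
    """Keep the k most likely teacher tokens and renormalise them to sum to 1.

    Assumption: top-k selection. The real pipeline may choose its entries differently.
    """
    order = sorted(range(len(teacher_probs)),
                   key=lambda i: teacher_probs[i], reverse=True)[:k]
    mass = sum(teacher_probs[i] for i in order)
    return {i: teacher_probs[i] / mass for i in order}


def distillation_loss(student_logits, targets):
    """Cross-entropy of the student against the sparse teacher distribution."""
    student_probs = softmax(student_logits)
    return -sum(p * math.log(student_probs[i]) for i, p in targets.items())


# Toy vocabulary of 4 tokens; k=2 stands in for the 256 mentioned above.
teacher = [0.60, 0.30, 0.08, 0.02]
targets = top_k_targets(teacher, k=2)

agreeing_student = [2.0, 1.0, 0.0, -1.0]      # ranks tokens like the teacher
disagreeing_student = [-1.0, 0.0, 1.0, 2.0]   # ranks tokens in reverse

print(distillation_loss(agreeing_student, targets))
print(distillation_loss(disagreeing_student, targets))
```

The student is rewarded for matching the teacher's relative preferences among several tokens, which carries more signal per position than a single hard label.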

#

also comparison between gemma 3 base models and qwen 2.5 (released in september 2024)

#

ignore the metric column xd (gemini added it without it really matching)

crude hawk Mar 18, 2025, 3:10 AM

#

hallow comet yes. research has shown that to be true. personally i hypothesize its because it...

ha yeah indeed - being formulaic / structured and goal-oriented apparently goes a long way. i also read somewhere that the comments in code are part of the magic. like natural language, explaining stuff (like what a particular piece of code is meant to do; how a human would explain that). seems to do something that just dumping reddit posts into a training data can't

crude hawk Mar 18, 2025, 3:12 AM

#

hallow comet unlike regular synthetic datasets, they train it on 256 probabilities from the g...

i see i see - interesting!

hallow comet Mar 18, 2025, 3:13 AM

#

oh i said gem 2 pro but its not known which one they used

hallow comet Mar 18, 2025, 3:20 AM

#

crude hawk i see i see - interesting!

i misread/misremembered what they said, its likely pro was the teacher i think

#

frozen creek Mar 19, 2025, 12:05 AM

#

Is this website ranking accurate? Ranking number one is o3 high?

glacial glacier Mar 19, 2025, 12:33 AM

#

frozen creek Is this website ranking accurate? Ranking number one is o3 high?

this channel is about lm arena's leaderboard, not any others

#

that said, artificial analysis benchmarks stem and o3 mini is a good stem model

glacial glacier Mar 19, 2025, 1:39 AM

#

why do we still not have claude thinking on the leaderboard

low bough Mar 20, 2025, 10:25 AM

#

The text-to-image has so few models. I wonder why we have so many LLMs but so few text-to-image. Harder to manage data? Harder to train? Less financial incentive?

autumn granite Mar 20, 2025, 11:07 AM

#

Looks interesting: https://neurohive.io/en/state-of-the-art/llama-nemotron-nvidia-launches-family-of-open-reasoning-ai-models-overtaking-deepseek-r1/

Neurohive - Нейронные сети

Stanislav Isakov

Llama Nemotron: NVIDIA Launches Family of Open Reasoning AI Models ...

NVIDIA has announced the open Llama Nemotron family of models with reasoning capabilities, designed to provide a business-ready foundation for creating advanced AI agents.

crude hawk Mar 20, 2025, 1:37 PM

#

autumn granite Looks interesting: https://neurohive.io/en/state-of-the-art/llama-nemotron-nvidi...

could be march-chatbot (and with, reasoning enabled, march-chatbot-r) i think

blissful pulsar Mar 20, 2025, 4:30 PM

#

any guess whether o1-pro is on the lmarena now that it has an api thing

split anvil Mar 20, 2025, 5:03 PM

#

how often does the leaderboard typically update?

blissful pulsar Mar 20, 2025, 5:14 PM

#

every week or so

split anvil Mar 20, 2025, 5:50 PM

#

is there a way to find the history of past leaderboards?

desert gorge Mar 20, 2025, 11:31 PM

#

split anvil is there a way to find the history of past leaderboards?

check commit history on the huggingface repo

sinful falcon Mar 20, 2025, 11:35 PM

#

desert gorge check commit history on the huggingface repo

can u link that?

#

the repo i see

#

doesn't update for leaderboard updates

#

just when they change column names and whatnot

desert gorge Mar 20, 2025, 11:38 PM

#

https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main look at the leaderboard table csvs, cross reference with commit history if you want to verify dates

lmarena-ai/chatbot-arena-leaderboard at main

tawdry socket Mar 21, 2025, 2:43 AM

#

Is o1 pro in the arena yet? People are saying it is most advanced open ai model. Curious how much better it is compared to Gpt 4.5..

blissful pulsar Mar 21, 2025, 2:51 AM

#

It’d plausibly be worse but prolly pretty similar

tawdry socket Mar 21, 2025, 3:01 AM

#

What makes you say that? If it is not better, why is OpenAI charging lot more for it?

glacial glacier Mar 21, 2025, 4:24 AM

#

tawdry socket What makes you say that? If it is not better, why is OpenAI charging lot more fo...

most of lm arena isn't hard prompts

#

if you would please consult the leaderboard

split anvil Mar 21, 2025, 7:03 AM

#

Any info on if it will be ranked?

timid perch Mar 21, 2025, 3:45 PM

#

will we get o1-pro or GPT-4.5 ranked? or is the price gate too high? i did see 4.5-preview in rankings and in cursor

mystic dock Mar 21, 2025, 7:40 PM

#

timid perch will we get o1-pro or GPT-4.5 ranked? or is the price gate too high? i did see 4...

I don't see how Ranking o1-pro would be feasible. If you have to wait 2 minutes for getting an answer most will assume that something is broken and reload the page. Also some speculated that o1-pro is simply o1 but with best of n rating.

tawdry socket Mar 21, 2025, 8:51 PM

#

what about sonnet thinking?

mystic dock Mar 21, 2025, 8:55 PM

#

tawdry socket what about sonnet thinking?

It's actually already on the leaderboard

#

Rank #14

hallow iris Mar 21, 2025, 8:56 PM

#

glacial glacier if you would please consult the leaderboard

Any idea on how does lmarena system identify "hard prompts"? Is it based on token numbers of the prompt or on number of instructions included?

mystic dock Mar 21, 2025, 8:56 PM

#

hallow iris Any idea on how does lmarena system identify "hard prompts"? Is it based on toke...

https://blog.lmarena.ai/blog/2024/hard-prompts/

glacial glacier Mar 22, 2025, 1:14 AM

#

Leaderboard update: claude thinking and nemotron 49b

fleet estuary Mar 23, 2025, 4:38 AM

#

hi

#

guys do you know this chat model called nebula on arena ? i cant find any info on it

autumn granite Mar 23, 2025, 5:40 AM

#

mystic dock Rank #14

#1 for hard prompts with Style Control enabled, but also tied with o1, R1, Grok 3 and gpt-4.5 when taking confidence interval into account.

autumn granite Mar 23, 2025, 6:08 AM

#

crude hawk could be march-chatbot (and with, reasoning enabled, march-chatbot-r) i think

The 49B is on there now

lofty pasture Mar 23, 2025, 9:04 AM

#

fleet estuary guys do you know this chat model called nebula on arena ? i cant find any info ...

A secret model from google 😁

fleet estuary Mar 23, 2025, 4:32 PM

#

lofty pasture A secret model from google 😁

yeah google is being very secretive about it

hard tangle Mar 24, 2025, 5:35 PM

#

Anyone know if any Deep Research models exist in the arena’s Search or just light research ones

vague garden Mar 24, 2025, 7:46 PM

#

Hi, how are the new models evaluated? According to the original paper https://arxiv.org/pdf/2403.04132 , preferences are evaluated using humans. How is this done for new models, is this also using human preferences? If so, which dataset is it using?

glacial glacier Mar 24, 2025, 8:29 PM

#

vague garden Hi, how are the new models evaluated? According to the original paper https://ar...

Dataset????

#

It's an arena

#

People go to vote every day

rapid cobalt Mar 24, 2025, 9:21 PM

#

How does it work that some models like Nebula aren't on the LB but I get them on battles?

#

also true for march

glacial glacier Mar 24, 2025, 10:24 PM

#

rapid cobalt How does it work that some models like Nebula aren't on the LB but I get them on...

What exactly do you mean by how does it work

#

Well the company has to contact the arena and then it adds them under a pseudonym

rapid cobalt Mar 24, 2025, 10:25 PM

#

So does it score towards some of the listed models in the leaderboard?

#

I mean, does the pseudonym I see on the battle have a counterpart on the leaderboard?

glacial glacier Mar 24, 2025, 10:28 PM

#

rapid cobalt I mean, does the pseudonym I see on the battle have a counterpart on the leaderb...

What's the point of having an anonymous model if it goes on the leaderboard

rapid cobalt Mar 24, 2025, 10:28 PM

#

My question is basically if there are models taking part on the battle that are not being scored for on the lb because that would not make much sense from what I understood of how the elo system works

glacial glacier Mar 24, 2025, 10:28 PM

#

No the anonymous model system is typically used for testing early tunes or checkpoints

#

They can just drop battles that aren't entirely public models

rapid cobalt Mar 24, 2025, 10:31 PM

#

So if a listed model wins a battle against an anonymous one, how would the elo gain be calculated if the formula takes elo diff into account?

glacial glacier Mar 24, 2025, 10:32 PM

#

rapid cobalt So if a listed model win a battle against a anonymous one how would the elo gain...

Do you mean when the score is reported?

#

Well then they include all battles

#

Not too hard to reason about

rapid cobalt Mar 24, 2025, 10:35 PM

#

glacial glacier Do you mean when the score is reported?

I mean that if model A and B present an answer to a series of prompts in a battle and I vote for B

glacial glacier Mar 24, 2025, 10:35 PM

#

rapid cobalt I mean that if model A and B present an answer to a series of prompts in a batt...

Well nothing immediately happens

#

The match goes into a DB

rapid cobalt Mar 24, 2025, 10:36 PM

#

B gains elo and A loses elo, but that gain/loss is based on their current elo and how they performed relative to their expected winrate against each other, right?

#

I understand that the match goes into a DB and LMArena uses Bradley-Terry with bootstrapping to get a closer approximation of models' long-run performance, to make it more equitable between models that are newer (w/ lower vote count) and the ones that already have a lot of votes

#

But still, for it to go into the DB it should be scoring A against B, and the way it is supposed to do it is getting an approximation (through current elo) of how A and B should perform against each other and then take into account how the result panned out
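
To make the "Bradley-Terry with bootstrapping" idea concrete, here is a small sketch: all stored battles are fit at once, so no running rating is needed at vote time, and resampling the battles gives a confidence interval per model. The battle data, the 1000-point anchor, and the function names are made up for illustration; this is not LMArena's actual pipeline.

```python
import math
import random


def fit_bradley_terry(battles, models=None, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with the MM algorithm."""
    if models is None:
        models = sorted({m for pair in battles for m in pair})
    wins = {m: 0 for m in models}
    games = {}  # unordered pair of models -> number of battles between them
    for winner, loser in battles:
        wins[winner] += 1
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0) + 1
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for key, n in games.items():
                if m in key:
                    (other,) = key - {m}
                    denom += n / (strength[m] + strength[other])
            # Floor keeps a winless model strictly positive so log10 stays defined.
            new[m] = max(wins[m] / denom, 1e-9) if denom > 0 else strength[m]
        total = sum(new.values())
        strength = {m: v * len(models) / total for m, v in new.items()}
    return strength


def to_rating(strength):
    """Map a strength to an Elo-like scale (the 1000 anchor is arbitrary)."""
    return 1000.0 + 400.0 * math.log10(strength)


def bootstrap_intervals(battles, rounds=100, seed=0):
    """Resample battles with replacement and refit to get (low, median, high) ratings."""
    rng = random.Random(seed)
    models = sorted({m for pair in battles for m in pair})
    samples = {m: [] for m in models}
    for _ in range(rounds):
        resample = [rng.choice(battles) for _ in battles]
        strength = fit_bradley_terry(resample, models)
        for m in models:
            samples[m].append(to_rating(strength[m]))
    intervals = {}
    for m, values in samples.items():
        values.sort()
        intervals[m] = (values[int(0.025 * (rounds - 1))],
                        values[(rounds - 1) // 2],
                        values[int(0.975 * (rounds - 1))])
    return intervals


# Hypothetical battle log: A beats B 7-3, B beats C 7-3, A beats C 8-2.
example_battles = ([("A", "B")] * 7 + [("B", "A")] * 3 +
                   [("B", "C")] * 7 + [("C", "B")] * 3 +
                   [("A", "C")] * 8 + [("C", "A")] * 2)

for model, (low, median, high) in bootstrap_intervals(example_battles).items():
    print(model, round(low), round(median), round(high))
```

Because the fit uses every battle together, an anonymous model can be included in or dropped from the fit without any per-vote bookkeeping.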

vague garden Mar 24, 2025, 10:45 PM

#

Yes this is exactly my question. Also model generated output cannot be automatically evaluated unless it is a classification problem

rapid cobalt Mar 24, 2025, 10:45 PM

#

A has a probabilistic chance of winning against B based on their elo diff.

vague garden Mar 24, 2025, 10:46 PM

#

So how can the results of a model be evaluated so quickly? Unless it is simple classification and not generated long answers, which require human eval for reliable preference judgement

rapid cobalt Mar 24, 2025, 10:46 PM

#

The only way it makes sense is if the pseudonym models are having their elo calculated and updated but just not being openly presented within the leaderboard

vague garden Mar 24, 2025, 10:47 PM

#

How is win determined?

rapid cobalt Mar 24, 2025, 10:47 PM

#

Ah I think I ended up answering my own question lol

#

Prob they have elo ratings just it's in the shadow. Bummer.

rapid cobalt Mar 24, 2025, 10:54 PM

#

vague garden How is win determined?

By user vote based on the answer to a prompt in the arena

#

One prompt, two answers, user picks his favorite answer. Winner increases elo, loser decreases
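
The per-vote update described above can be sketched with the classic Elo formula. The 400-point scale and K = 32 are the conventional chess constants, used here as assumptions and not taken from LMArena; as mentioned earlier in the thread, the leaderboard itself is said to store battles and fit Bradley-Terry over them rather than update after every vote.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one decisive vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)  # positive when A did better than expected
    return rating_a + delta, rating_b - delta


# An upset moves the ratings more than an expected result does.
favorite, underdog = 1300.0, 1200.0
print(expected_score(favorite, underdog))
print(elo_update(favorite, underdog, a_won=True))
print(elo_update(favorite, underdog, a_won=False))
```

The gain depends on the rating gap, which is why a pseudonymous model needs some rating of its own for the arithmetic to work, even if that rating is never published.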

rapid cobalt Mar 24, 2025, 10:56 PM

#

lofty pasture A secret model from google 😁

Is this meme answer or is it actually a Google model? Cause it's really impressive on all answers so far

glacial glacier Mar 24, 2025, 11:13 PM

#

rapid cobalt Is this meme answer or is it actually a Google model? Cause it's really impressi...

we think it's gemini 2 pro thinking

crude hawk Mar 25, 2025, 12:17 AM

#

rapid cobalt The only way it makes sense is if the pseudonym models are having their elo calc...

yeah I think that's basically how it works / the right way to look at it. In some cases, the anonymous models are very publicly 'unmasked', like when oai disclosed that im-a-good-chatbot was in fact a new version of 4o, or when xAI revealed that sus-column-r was an early version of Grok-2.

#

though in most cases.. they just come and go.. presumably the voting data goes to the lab responsible for the anon model, but unless it is performant, seems generally they just pull it and move on

#

like based on this (h/t @hallow comet) nebula is potentially the 38th anon model google has added to the arena.. meta I feel adds anon models just as frequently.. though again, mostly they just come and then quietly go.. ("success has many fathers, but failure is an orphan" kinda thing ha)

hallow comet Mar 25, 2025, 12:26 AM

#

yeah google put a lot of models on the arena, quite a lot of them never see the light of day (checkpoints and the likes)

vague garden Mar 25, 2025, 1:01 AM

#

rapid cobalt One prompt, two answers, user picks his favorite answer. Winner increases elo, l...

Thank you. So I am actually surprised they are able to obtain human eval (user votes) so quickly on new models. Wonder what the number of questions and the model pair comparison distribution are. Cos some folks on reddit are saying that the benchmark is not very reliable and some models are way ahead on the leaderboard but way worse in real world comparison

glacial glacier Mar 25, 2025, 1:03 AM

#

if youre gonna be a new user at least do your homework

#

browse the arena a bit

#

battle yourself

#

see if you agree with the rankings

vague garden Mar 25, 2025, 1:05 AM

#

The point is model eval is about user vote distribution, it extends beyond a single user. LMArena voting and question selection are not very clear when they release scores for new models

glacial glacier Mar 25, 2025, 1:14 AM

#

everything we need is already here

hard tangle Mar 25, 2025, 1:19 AM

#

Again, Anyone know if any Deep Research models exist in the arena’s Search or just light research ones

glacial glacier Mar 25, 2025, 1:20 AM

#

hard tangle Again, Anyone know if any Deep Research models exist in the arena’s Search or ju...

its all light search

#

of course it is

#

nobodys waiting 10 minutes for a report

vague garden Mar 25, 2025, 1:26 AM

#

Hi Team, I am using the notebook. How come some judges are predominantly the same person? If it were all live users through the webapp, I would expect the number of votes per user (judge) to be fairly uniformly distributed. I am surprised that some users have contributed so many more votes than others

hard tangle Mar 25, 2025, 1:30 AM

#

glacial glacier nobodys waiting 10 minutes for a report

Oh

hard tangle Mar 25, 2025, 1:30 AM

#

glacial glacier nobodys waiting 10 minutes for a report

MUCH longer than I expected it to take

hard tangle Mar 25, 2025, 1:31 AM

#

glacial glacier its all light search

So in the future when Search models are accessible via Direct Chat and Manual Arena, will the deep research models be present?

glacial glacier Mar 25, 2025, 1:32 AM

#

would be cool if so but i doubt lmsys users will wait that long

delicate sable Mar 25, 2025, 1:41 AM

#

vague garden Hi Team, I am using the notebook and how come some judges are predominantly the s...

why would votes be uniformly distributed

#

that doesnt make sense

vague garden Mar 25, 2025, 1:46 AM

#

delicate sable why would votes be uniformly distributed

Assuming user behaviour is very similar, or the other case is a normal distribution. The above is a long-tail distribution, hence my question..

delicate sable Mar 25, 2025, 1:48 AM

#

why would user votes be normally distributed? It's not like people are choosing at random whether to vote every day

#

Some people are gonna vote more than others and its gonna create fatter tails

#

https://en.wikipedia.org/wiki/Zipf's_law

Zipf's law

Zipf's law (German pronunciation: [tsɪpf]) is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.
The best known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language:

...

#

kinda looks like this to me 🤷‍♂️
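
A quick sketch of why a Zipf-like pattern produces a few very heavy voters. The numbers are invented for illustration (1000 users, top user casting 500 votes, exponent 1); nothing here is measured from arena data.

```python
def zipf_vote_counts(num_users, top_votes, exponent=1.0):
    """Votes for the user at each rank, proportional to 1 / rank**exponent."""
    return [top_votes / rank ** exponent for rank in range(1, num_users + 1)]


def top_share(counts, k):
    """Fraction of all votes cast by the k most active users."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


zipf_counts = zipf_vote_counts(num_users=1000, top_votes=500)
uniform_counts = [1] * 1000

print(top_share(zipf_counts, 10))     # heavy concentration in the top 1% of users
print(top_share(uniform_counts, 10))  # baseline if every user voted equally
```

Under these assumed numbers the ten most active users account for well over a third of all votes, against one percent if everyone voted equally.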

vague garden Mar 25, 2025, 1:49 AM

#

delicate sable why would user votes be normally distributed? It's not like people are choosing ...

Well, there is the issue of broad preferences for models across the population: if votes are dominated by one user or a few users, then the results are not representative of the wider population. It is also about the leaderboard design

delicate sable Mar 25, 2025, 1:51 AM

#

So you think they should preprocess the data

vague garden Mar 25, 2025, 1:53 AM

#

delicate sable So you think they should preprocess the data

At this point, I am trying to understand how the votes are consolidated - trying to get context behind the numbers. But yes, if the vote is dominated by one or a few select users, then the leaderboard becomes less reliable when it comes to representing human preferences (how people feel about the model output). If it is only getting experts on the subject matter to vote, then that's a different path and user vote distribution, I guess.

crude hawk Mar 25, 2025, 2:50 AM

#

vague garden AT this point, I am trying to understand how the votes are consolidated - trying...

worth having a look through their blog (and research papers for that matter), which might shed some light on, if not address entirely, your questions e.g. https://lmsys.org/blog/2024-05-08-llama3/#the-effect-of-overrepresented-prompts-and-judges

#

not sure if this is still representative of the current pipeline (it's pretty old), but could also be relevant / helpful https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=B_PYA7oVyaHO

tawdry socket Mar 25, 2025, 12:27 PM

#

crude hawk like based on this (h/t @hallow comet) nebula is potentially the 38th an...

Where can I see this information?

#

@crude hawk

tawdry socket Mar 25, 2025, 5:51 PM

#

@crude hawk Where did you get that screenshot from?

spare scroll Mar 25, 2025, 11:17 PM

#

Hello there. I am new here and I just discovered Arena and I am looking for resources to understand the leaderboards a bit better, beyond just the mechanics. For example, I would like to know when and how often it is updated, how is it decided if a model gets a "preview" or not, how many updates a certain model gets, if there's a test in the pipeline etc. Can someone point me in the right direction? Thank you.

crude hawk Mar 26, 2025, 2:04 AM

#

tawdry socket @crude hawk Where did you get that screenshot from?

From here! #general message
(I'm not the person to ask where / how they found it - honestly no clue other than something to do with the browser console, i assume ha)

karmic mortar Mar 26, 2025, 4:53 AM

#

does anyone know if the current Deepseek V3 leaderboard listing is for DeepSeek-V3-0324?

#

or is it the original?

glacial glacier Mar 26, 2025, 5:09 AM

#

karmic mortar does anyone know if the current Deepseek V3 leaderboard listing is for `DeepSeek...

original

karmic mortar Mar 26, 2025, 6:24 AM

#

thank you! i'm curious if they'll add the newer one anytime soon

wanton turret Mar 26, 2025, 2:04 PM

#

gpt-4o-11-20 isn't on the leaderboard, but can be selected as a model to compare and ChatGPT-4o-latest (2024-11-20) is on the deprecated leaderboard. https://github.com/lm-sys/FastChat/issues/3685

GitHub

Why gpt-4o-2024-11-20 not on the latest leaderboard. · Issue #368...

Why gpt-4o-2024-11-20 not on the latest leaderboard.

sleek cosmos Mar 26, 2025, 3:20 PM

#

Is there a Search leaderboard?

crude hawk Mar 26, 2025, 3:25 PM

#

sleek cosmos Is there a Search leaderboard?

i don't believe yet (still gathering data)

#

will be interesting to see the leaderboard

sullen tiger Mar 27, 2025, 3:56 PM

#

In the arena, are models matched against each other randomly, or is there a higher chance of matching models with closer scores? Cause I think the latter makes more sense.

surreal drum Mar 27, 2025, 4:16 PM

#

Higher chance of closer scores, but certain models appear in the arena much more frequently, and the pairings are usually more random for those bots

which likely happens because the Elo/strength of such models is much less certain (because it’s new, fewer votes, etc.) , thus they get matches more often to lower the rating spread.

so in conclusion, models with high variance are more likely to be matched in general, but yes, also ones close in Elo

(this is not based off any conclusive evidence, I’m making all of this up)

brisk topaz Mar 28, 2025, 10:49 AM

#

@west lodge hi, I’m new here and have recently started studying the chatbot arena. I noticed that LMArena has released multiple datasets on huggingface, such as Chatbot Arena Conversations (https://huggingface.co/datasets/lmsys/chatbot_arena_conversations), Arena Human Pref-100K (https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k), and Arena Human Pref-55K (https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k). But I also saw that the tutorial notebooks, like the BT model notebook (https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH#scrollTo=ViV11W8l9NfL), the ELO rating notebook 1 (https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU?usp=sharing#scrollTo=-0rg26TQxFQv), and the ELO rating notebook 2 (https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR?usp=sharing), include a lot of Arena data as well. I’m just curious: are the datasets used in these notebooks new, or do they overlap with the ones already released on huggingface? What's the relationship between these datasets? Thank you so much! and btw, chatbot arena is really an impactful project for the community.

lmarena-ai/arena-human-preference-55k · Datasets at Hugging Face

Google Colab

timid wind Mar 28, 2025, 5:32 PM

#

I got my first answer from "spider".

It was amazing, looked far from anything before

#

spider feels good fr fr

vernal hatch Mar 28, 2025, 8:07 PM

#

if you see something like this when doing an arena battle does that mean its real name is hidden?

strong knot Mar 28, 2025, 8:45 PM

#

vernal hatch if you see something like this when doing an arena battle does that mean its rea...

usually hidden. One time "chocolate" got uploaded to the leaderboard, but then soon replaced by grok 3.

vernal hatch Mar 28, 2025, 8:45 PM

#

same for neptune or something, that was 4.5, i think

strong knot Mar 28, 2025, 8:46 PM

#

It would be funny to see models like harry potter or gandalf rocking the #1 place xd

#

(those aren't in the arena, but were fyi ^^)

sinful falcon Mar 29, 2025, 5:25 AM

#

is there a historic high elo score over time chart?

crude hawk Mar 29, 2025, 9:03 AM

#

sinful falcon is there a historic high elo score over time chart?

the data for each update is available here https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main

#

i used 4o and o3-mini to provide code to pull and manipulate all the data. haven't crossed checked or anything, but it looks about right to me

#

#

(if of any interest, here's the – very messy - collab notebook used https://colab.research.google.com/drive/1eiR0B_8paakgexC1OHvH20H3FeaNVwZh?usp=sharing )

hasty flint Mar 29, 2025, 4:21 PM

#

guys

#

recommend a top model for working with cline, preferably free

#

right now i'm using gemini 2 flash

#

but sometimes it gets such dementia

#

that it's awful

#

💀

gusty lily Mar 29, 2025, 7:24 PM

#

deepseek-v3-0324:free

timid wind Mar 31, 2025, 5:08 PM

#

hasty flint but sometimes it gets such dementia

that's the problem of all LLMs

#

IDK, on LMArena spider just wins every round (for me)

stable shore Apr 1, 2025, 9:11 AM

#

Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly kind of good?
Do they fall in a certain category that is not ranked?
Because I want to know how good they really are, I don't trust internal benchmarks

timid wind Apr 1, 2025, 9:23 AM

#

stable shore Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly ...

they are not present on LMArena, IDK why

lofty pasture Apr 1, 2025, 10:34 PM

#

timid wind they are not present on LMArena, IDK why

And minimax too ,

vague garden Apr 1, 2025, 11:03 PM

#

tawdry socket @crude hawk Where did you get that screenshot from?

I developed the stats myself using this notebook https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH

Google Colab

crude hawk Apr 2, 2025, 1:37 PM

#

any idea when we can expect to see the first Search leaderboard @west lodge ?

grand loom Apr 2, 2025, 1:47 PM

#

Hello, I would like to know which model is Stradale in Chatbot Arena. It's not on the leaderboard, and I can't find it either. Thank you very much.

#

stradale is the name I gave to the model after temporarily conversing and evaluating it in Arena (battle).

unique nimbus Apr 2, 2025, 3:14 PM

#

crude hawk

This would be interesting to see in terms of open source Vs proprietary

rapid cobalt Apr 3, 2025, 12:27 PM

#

do votes for a model in the webdev arena count towards the general LMArena leaderboard, or is it 100% segregated?

surreal drum Apr 3, 2025, 2:41 PM

#

100% segregated

stable shore Apr 3, 2025, 6:54 PM

#

stable shore Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly ...

Any1 has an answer to that please?

timid wind Apr 3, 2025, 7:36 PM

#

can someone explain what exactly is "style control"?

glacial glacier Apr 3, 2025, 11:12 PM

#

timid wind can someone explain what exactly is "style control"?

https://lmsys.org/blog/2024-08-28-style-control/

rapid cobalt Apr 4, 2025, 12:27 AM

#

stable shore Any1 has an answer to that please?

Models need to be submitted by their company to be contending in the arena

marsh condor Apr 4, 2025, 8:19 AM

#

stable shore Hi, I'm lost, why aren't Baidu models ranked anywhere while they are supposedly ...

I don't think Baidu's models are really good, so they are not on the leaderboards. Apple searched for an AI service for the iPhone in China, but Baidu was out. Finally, Apple chose Alibaba. DeepSeek is out because it does not have enough commercial experience.

timid wind Apr 4, 2025, 11:38 AM

#

glacial glacier https://lmsys.org/blog/2024-08-28-style-control/

without the "style control" option, nothing is changed based on style?

#

the problem is that I barely understand how exactly "style control" check works, and I just want to see leaderboard without it

stable shore Apr 4, 2025, 4:34 PM

#

rapid cobalt Models need to be submitted by their company to be contending in the arena

Oh fair thanks

#

Hope they do

stable shore Apr 4, 2025, 4:36 PM

#

marsh condor I don't think Baidu's models are really good, so they are not on the leaderboard...

Yea they're probably not as good as Baidu says, but I believe they're at least top 20-30, which would be great to know, always good to track progression

#

If they end up being a solid top 15 in the long term run, knowing their starting point would surely be interesting

finite ibex Apr 5, 2025, 12:48 PM

#

I wonder what an llm would be like if it scored over 1500

lofty pasture Apr 5, 2025, 1:24 PM

#

finite ibex I wonder what an llm would be like if it scored over 1500

Gemini pro 3 or Gemini pro 3.5 😁

glacial glacier Apr 5, 2025, 7:52 PM

#

llama 4 is now tied for first place with a 1417 elo

glacial glacier Apr 5, 2025, 8:11 PM

#

that green dot is llama 4 maverick

tawdry socket Apr 5, 2025, 10:33 PM

#

glacial glacier llama 4 is now tied for first place with a 1417 elo

Why does UB rank show 2 for Llama? Shouldn't it be 1, since its elo overlaps with Gemini 2.5?

glacial glacier Apr 5, 2025, 10:34 PM

#

being more accurate than the official leaderboard is a first for me

tawdry socket Apr 5, 2025, 10:36 PM

#

I dont understand. Where did you get your rank from?

glacial glacier Apr 5, 2025, 10:36 PM

#

yeah 1439.18 - 10.02 is definitely under 1416.58 + 13.5 so idk whats up

#

ranking comes from my own math and website
* lmb uses style control by default, turn it off for the mainstream results
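For readers following the UB rank question, here is a minimal sketch of the check being discussed. It assumes the leaderboard's rule is "UB rank = 1 + the number of models whose lower confidence bound is above this model's upper bound". The `Entry` class, the `ub_rank` helper, and the model labels are hypothetical names for illustration. Only one half-width per model is quoted in the chat (10.02 and 13.5), so the other side of each interval is assumed symmetric.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float     # arena score (Elo-like rating)
    ci_plus: float   # upper half-width of the confidence interval
    ci_minus: float  # lower half-width of the confidence interval (positive)

    @property
    def upper(self) -> float:
        return self.score + self.ci_plus

    @property
    def lower(self) -> float:
        return self.score - self.ci_minus


def ub_rank(target: Entry, board: list[Entry]) -> int:
    """1 + number of models whose lower bound is above the target's upper bound."""
    better = sum(
        1 for other in board
        if other is not target and other.lower > target.upper
    )
    return 1 + better


# Scores quoted in the chat; symmetric intervals are an assumption.
gemini = Entry("gemini-2.5", 1439.18, 10.02, 10.02)
llama = Entry("llama-4-maverick-exp", 1416.58, 13.5, 13.5)
board = [gemini, llama]

# 1439.18 - 10.02 = 1429.16 is not above 1416.58 + 13.5 = 1430.08,
# so the intervals overlap and both models get UB rank 1.
print(ub_rank(gemini, board), ub_rank(llama, board))
```

With these numbers the gap between the two bounds is under one point, so a small shift in either interval is enough to flip Llama's UB rank from 1 to 2.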

tawdry socket Apr 5, 2025, 10:39 PM

#

Nice. So, take the data from arena repository and calculate the elo/rank yourself?

glacial glacier Apr 5, 2025, 10:40 PM

#

tawdry socket Nice. So, take the data from arena repository and calculate the elo/rank yoursel...

yup

#

well i don't calculate the elo, just the rank

#

i take their elo for granted

#

(since they stopped publishing the anonymized battle data you need to recalculate elos)

half stream Apr 6, 2025, 8:42 AM

#

Llama 4 Maverick... Experimental?

timid wind Apr 6, 2025, 11:55 AM

#

half stream Llama 4 Maverick... Experimental?

yep

glacial glacier Apr 6, 2025, 4:02 PM

#

tawdry socket Why does UB rank show 2 for Llama? Shouldn't it be 1, since its elo overlaps with Gemini...

actually
mistakes were made...

#

turns out accidentally shifting the CI down by 0.5% causes problems

jaunty steeple Apr 6, 2025, 5:04 PM

#

Hi, love to see new LLMs and how they are used

tawdry socket Apr 6, 2025, 7:30 PM

#

glacial glacier actually mistakes were made...

So, basically a matter of 1 elo point or 1 CI point...

glacial glacier Apr 6, 2025, 8:50 PM

#

tawdry socket So, basically a matter of 1 elo point or 1 CI point...

yeah
for all purposes llama 4 is statistically worse than gemini 2.5
(however - don't forget that gemini 2.5 just has a 53.2% win rate over llama 4)
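The 53.2% figure matches what the standard Elo expected-score formula gives for the two scores quoted earlier in the thread (1439.18 and 1416.58). A minimal sketch, assuming the usual base-10, 400-point scale; `expected_win_rate` is a hypothetical helper name, not something from the arena codebase:

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of wins for A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Scores quoted earlier in the thread: Gemini 2.5 vs Llama 4 Maverick.
print(f"{expected_win_rate(1439.18, 1416.58):.1%}")  # prints 53.2%
```

A gap of about 23 points is only slightly better than a coin flip in any single battle, which is the point being made above.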

half stream Apr 7, 2025, 2:57 AM

#

half stream Llama 4 Maverick... Experimental?

Back to this topic.

Tested and compared between the Experimental and the full-release Maverick in my native tongue (officially instruction-tuned). Even when I cranked the temperature of Experimental down to zero, it still sounds overly friendly, instead of the much more bearable full-release.

The Experimental should be deprecated at this point, and let people test the full-release herd, so that it won't be misleading.

wispy pumice Apr 7, 2025, 3:06 PM

#

Trying to plot the Elo of an ideal AGI by using the "both are bad" ratings: https://gist.github.com/endolith/e001d8b7811699cf9be822a774e7cb67

Gist

I tried to plot AGI on the Chatbot Arena Elo scale by comparing to ...

I tried to plot AGI on the Chatbot Arena Elo scale by comparing to "both bad" and "tie" votes - .gitignore

#

Trying to plot change in skill over time for API models that are being changed behind the scenes: https://github.com/lm-sys/FastChat/issues/3004

GitHub

Add Whole History Rating to Leaderboard? · Issue #3004 · lm-sys/F...

For the API-based models, there are frequent claims online that users see models getting worse over time. It would be good to know if that's true. Copying a comment of mine from HF: I know ther...

glacial glacier Apr 8, 2025, 12:17 AM

#

https://x.com/lmarena_ai/status/1909397817434816562

In addition, we're also adding the HF version of Llama-4-Maverick to Arena, with leaderboard results published shortly. Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)

Early

steel chasm Apr 8, 2025, 1:57 AM

#

Meta so desperate

twin wharf Apr 8, 2025, 2:42 AM

#

Absolute clowns

#

We knew something was fishy with so much skepticism the last few days

crude hawk Apr 8, 2025, 6:57 AM

#

glacial glacier https://x.com/lmarena_ai/status/1909397817434816562 > In addition, we're also ad...

thanks - glad to see this... was just about to post a rant about how this 'experimental' (i.e. arena/human-preference primed) version violates LMarena's existing policies about stealth models.. like based on their blog post, they can only be unmasked and added to the leaderboard if the unmasked model is made publicly available... which isn't the case with this maverick-experimental thing (aside from Direct Chat in LmArena afaik)

#

all a bunch of nonsense from meta

#

did they think no one would notice or something lol

#

it should be deprecated / removed immediately. this model is not publicly available (and the only place it's 'available online' is LMarena itself via Direct Chat..) and so imo shouldn't have been added to the leaderboard in the first place

glacial glacier Apr 10, 2025, 12:00 AM

#

grok 3 is now a frontier model without and with style control (if you exclude llama 4) (when prices are mixed 2:1 and are adjusted for thinking)

rapid cobalt Apr 10, 2025, 12:02 AM

#

@glacial glacier whats your data based upon?

glacial glacier Apr 10, 2025, 12:02 AM

#

rapid cobalt <@794377681331945524> whats your data based upon?

the prices are collected manually, i could explain more if you want
the elo scores are direct from arena data

rapid cobalt Apr 10, 2025, 12:03 AM

#

You mean based on your personal battles DB?

glacial glacier Apr 10, 2025, 12:03 AM

#

no, the elo scores are the general arena scores

rapid cobalt Apr 10, 2025, 12:03 AM

#

Ah, okay. So based on current arena standings

glacial glacier Apr 10, 2025, 12:03 AM

#

yes

haughty rapids Apr 10, 2025, 9:27 PM

#

Llama 4's rating just dropped dramatically (finally the one it deserves)

glacial glacier Apr 10, 2025, 9:47 PM

#

damn

strong knot Apr 10, 2025, 9:58 PM

#

Is super on reasoning mode on the arena?
also double damn: instruction following + style control:

crude hawk Apr 11, 2025, 5:22 AM

#

strong knot Is super on reasoning mode on the arena? also double damn: instruction following...

not sure, but seems a good question. fwiw i think so (based primarily on the LB rank, but also, using it via Direct Chat, there is a) a lag before first token is generated, and; b) it hits its max output well before that limit has been reached in terms of the tokens visible in the UI.. so presumably the rest were used up for thinking/reasoning)

slate compass Apr 11, 2025, 5:53 AM

#

glacial glacier damn

what interface is that?

crude hawk Apr 11, 2025, 6:13 AM

#

slate compass what interface is that?

https://ktibow.github.io/lmb/

slate compass Apr 11, 2025, 6:14 AM

#

crude hawk https://ktibow.github.io/lmb/

whoa, that's good

twin valve Apr 11, 2025, 12:04 PM

#

crude hawk https://ktibow.github.io/lmb/

nice @glacial glacier . So many users cry for different default ratings, while they could simply make a new leaderboard like yours

#

visualize scores "with moats" made me chuckle 😄

#

would it be possible to select some options and categories in the leaderboard and get a link? In that way discussions online would be easier, rather than saying "go to this page and click this and that"

#

it would be also cool to see the No. of votes. I know that the CI would tell that, but ordering by No. of votes could help too.

glacial glacier Apr 11, 2025, 3:02 PM

#

twin valve it would be also cool to see the No. of votes. I know that the CI would tell tha...

but ordering by No. of votes could help too
wdym?
resorting the leaderboard by vote count?

cosmic harness Apr 11, 2025, 3:59 PM

#

What happened to Llama-4? 💀

winged quiver Apr 11, 2025, 4:07 PM

#

Amazed chatgpt-4o is even on the leaderboard

raven heron Apr 12, 2025, 6:23 AM

#

so two different llama-4 that presumably differ only in style have 50 Elo difference with style control? Can we think of ways to improve style control so that this difference goes down to a negligible one?

twin valve Apr 12, 2025, 9:46 AM

#

glacial glacier > but ordering by No. of votes could help too wdym? resorting the leaderboard by ...

yes sometimes I do it to see if a model is relatively new or barely used.

#

some models are in the middle of the rankings, without great publicity, but they are new as well.

glacial glacier Apr 12, 2025, 2:16 PM

#

that's an interesting idea... i might implement something similar like a new badge

glacial glacier Apr 12, 2025, 9:35 PM

#

twin valve yes sometimes I do it to see if a model is relatively new or barely used.

once new models start to show up they'll get new badges

glacial glacier Apr 12, 2025, 10:01 PM

#

twin valve would it be possible to select some options and categories in the leaderboard an...

how much should the link include?

#

just type, category, and style control?

twin valve Apr 12, 2025, 10:12 PM

#

it would be a start. I also would add moats because it is visually nice.

https://ktibow.github.io/lmb?type=text&category=math&style_control=off&moats=off

but of course it depends on you, at the end it is your time and project.

glacial glacier Apr 12, 2025, 10:58 PM

#

twin valve it would be a start. I also would add moats because it is visually nice. https:...

okay added moats https://ktibow.github.io/lmb/#{"paradigm":"text","category":"full","styleControl":false,"vizBorder":true,"vizBar":false}
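The link above keeps the view settings as a JSON object in the URL fragment. A minimal Python sketch of that idea follows, assuming the option keys seen in the link (`paradigm`, `category`, `styleControl`, `vizBorder`, `vizBar`). `build_share_link` and `parse_share_link` are hypothetical helpers written for illustration, not the site's own code.

```python
import json
from urllib.parse import quote, unquote, urlsplit

BASE = "https://ktibow.github.io/lmb/"


def build_share_link(options: dict) -> str:
    """Serialize view options to compact JSON and place them in the URL fragment."""
    fragment = quote(json.dumps(options, separators=(",", ":")), safe="")
    return f"{BASE}#{fragment}"


def parse_share_link(url: str) -> dict:
    """Read the view options back from a link (percent-encoded or raw)."""
    return json.loads(unquote(urlsplit(url).fragment))


options = {
    "paradigm": "text",
    "category": "full",
    "styleControl": False,
    "vizBorder": True,
    "vizBar": False,
}
link = build_share_link(options)
print(link)
print(parse_share_link(link) == options)
```

Keeping the state in the fragment means the whole view can be shared as one link without any server-side support, which suits a static GitHub Pages site.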

twin valve Apr 13, 2025, 8:06 AM

#

neat! Thank you! (hopefully lmarena integrates something similar or directly asks you to do the leaderboards)

#

checking your github. You have a lot of repos and advent of code as well!

solemn flint Apr 13, 2025, 10:23 AM

#

gemini 3.0🗿

atomic ice Apr 13, 2025, 5:44 PM

#

What's the process for LM Arena adding a new model?

chilly moat Apr 14, 2025, 7:18 PM

#

gpt-4.1 on the leaderboard soon?

lofty pasture Apr 14, 2025, 8:04 PM

#

chilly moat gpt-4.1 on the leaderboard soon?

Isn't it just ChatGPT-4o with some updates?

twin valve Apr 14, 2025, 10:34 PM

#

is there a way to have this https://chromewebstore.google.com/detail/mylmarena/dcmbcmdhllblkndablelimnifmbpimae?authuser=0&hl=en-GB but without browser plugin? That could solve a lot of problems for single users

MyLMArena - Chrome Web Store

Track your LLM preferences on LMArena with a personal ELO rating leaderboard.

twilit echo Apr 14, 2025, 11:09 PM

#

lofty pasture Isn't it just ChatGPT-4o with some updates?

it does feel like a more efficient 4o. probably a lot cheaper to run (mostly for OpenAI 😉 )

echo lake Apr 15, 2025, 2:41 AM

#

Can't wait to check the score of gpt4.1 and nvidia/Llama-3_1-Nemotron-Ultra-253B-v1, curious on their actual performance.

#

it would be quite fun to find nvidia version of llama3 surpass the score of llama4

raven heron Apr 15, 2025, 5:37 AM

#

raven heron so two different llama-4 that presumably differ only in style have 50 Elo differ...

Also I think style control should be enabled by default; that would save the whole llama 4 drama among other benefits

errant tusk Apr 15, 2025, 2:54 PM

#

how often is the leaderboard updated?

neon narwhal Apr 15, 2025, 3:51 PM

#

what is style control?

sterile rune Apr 15, 2025, 4:18 PM

#

Who tried 4.1 ?

#

Is it mostly for coding ?

terse mirage Apr 15, 2025, 4:20 PM

#

neon narwhal what is style control?

this might be helpful: https://blog.lmarena.ai/blog/2024/style-control/

terse mirage Apr 15, 2025, 4:21 PM

#

errant tusk how often is the leaderboard updated?

they are live, but it takes a certain amount of data to be significant enough to see a change, so you'll notice they generally "change" weekly. happens faster if there is more data.

crude hawk Apr 15, 2025, 5:01 PM

#

right they're updated periodically (roughly weekly)?

#

i know i'm being pedantic.. but that isn't 'live' per se

twin valve Apr 15, 2025, 5:20 PM

#

they are updated like weekly, can confirm.

Why? Because with too few votes the score would be all over the place so they need to accumulate votes and waiting a week is not that bad.
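To put a rough number on "all over the place": a minimal sketch, assuming a simplified setting where one model plays a single opponent and each vote is an independent win or loss (the real leaderboard fits all models jointly, so this is only an illustration). It uses the delta method on the Elo gap d = 400 * log10(p / (1 - p)), and `elo_gap_standard_error` is a hypothetical helper name.

```python
import math


def elo_gap_standard_error(win_rate: float, votes: int) -> float:
    """Approximate standard error (in Elo points) of a gap estimated from `votes` battles.

    Delta method: SE(d) = (400 / ln 10) / sqrt(n * p * (1 - p)).
    """
    scale = 400.0 / math.log(10)
    return scale / math.sqrt(votes * win_rate * (1.0 - win_rate))


for votes in (100, 1000, 10000):
    print(votes, round(elo_gap_standard_error(0.5, votes), 1))
```

Under these assumptions the uncertainty is roughly 35 Elo points at 100 votes and about 3.5 points at 10,000 votes, which is why a freshly added model's score can swing a lot before enough votes come in.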

twilit echo Apr 15, 2025, 5:25 PM

#

sterile rune Who tried 4.1 ?

I tried them, and I'd say they are mostly optimized for coding. As general models, for usage in a multitude of broad real world tasks, they don't stand out:

Tested **GPT-4.1** series:

GPT-4.1 Nano:
Cheap tiny model, roughly comparable to Qwen2.5 14B.
Substantially beaten on price & performance by e.g. Google's Flash models.

GPT-4.1 Mini:
Versatile fast model, roughly comparable to Gemini 2.0 Flash (but more expensive).
Quite a solid coder, and performed on par with the larger model in my STEM segment.

**GPT-4.1**:
"Flagship" of the series, roughly as strong as Llama 3.3 70B (but weaker in STEM) & DeepSeek V3 0324 (but a weaker coder).
Behind 7 other OpenAI models in my testing.
The "Maverick" type model of OpenAI.

All models are non-reasoning models and not very verbose when compared to other recent model releases (1.15x / 1.23x / 1.35x token verbosity as size increases in testing).
All models, including Nano, are fairly competent coders, though none excel at my backend testing.
None of these were particularly good in my STEM segment.

I have also added 0-shot examples for UI impressions and simplistic game design for each model on my shared assets (NOT part of any scoring, just for additional curiosity/comparison).

As always, YMMV!

pine wasp Apr 16, 2025, 2:36 AM

#

someone add gpt 1.4

twin valve Apr 16, 2025, 1:23 PM

#

was this already shared? https://huggingface.co/spaces/lmgame/game_arena_bench

Game Arena Bench - a Hugging Face Space by lmgame

proper granite Apr 16, 2025, 6:22 PM

#

twilit echo I tried them, and I'd say they are mostly optimized for coding. As general model...

Do you have initial estimates for o3 / o4mini? Where do you expect them to land on the leaderboard?

twilit echo Apr 16, 2025, 6:23 PM

#

proper granite Do you have initial estimates for o3 / o4mini? Where do you expect them to land ...

these things.. they take time

slate compass Apr 16, 2025, 6:54 PM

#

twilit echo these things.. they take time

is this for https://dubesor.de/benchtable ?

Dubesor LLM Benchmark table

Dubesor LLM Benchmark table - Small-scale manual LLM performance comparison benchmark

winged quiver Apr 16, 2025, 7:08 PM

#

Was meta caught in the act gaming the lmarena benchmarks?

surreal drum Apr 16, 2025, 7:08 PM

#

Yes

winged quiver Apr 16, 2025, 7:10 PM

#

Haha. How did they get caught?

#

I also thought grok was cheating but didn't say anything

chilly moat Apr 16, 2025, 7:26 PM

#

tbh it was suspicious from the beginning

#

anyways o3 and o4-mini seem promising

ancient cedar Apr 16, 2025, 7:32 PM

#

super promising

#

gonna beat gemini easily

delicate sable Apr 16, 2025, 7:33 PM

#

does your expertise in on chain derivatives help you forecast this?

winged quiver Apr 16, 2025, 8:09 PM

#

Wait, so were they caught in the act or is it just assumed they were gaming the leaderboards?

sinful falcon Apr 16, 2025, 8:10 PM

#

winged quiver Wait, so were they caught in the act or is it just assumed they were gaming the ...

it's confirmed

#

that's why they were removed

sinful falcon Apr 16, 2025, 8:31 PM

#

well i just got my first head to head of o4 mini and gemini 2.5

desert gorge Apr 16, 2025, 8:32 PM

#

sinful falcon well i just got my first head to head of o4 mini and gemini 2.5

who won?

sinful falcon Apr 16, 2025, 8:32 PM

#

desert gorge who won?

i voted for 4o mini. it was a coding question though

#

and i think the o models do very well w coding

#

so not sure that's representative of how it's gonna stack up

hallow comet Apr 16, 2025, 8:34 PM

#

o4 mini or 4o mini?

sinful falcon Apr 16, 2025, 8:34 PM

#

hallow comet o4 mini or 4o mini?

o4 mini my bad

hallow comet Apr 16, 2025, 8:34 PM

#

np 🤣 its hella confusing

desert gorge Apr 16, 2025, 8:37 PM

#

Yeah o4 mini seems v good for coding, but o3 is a weird model, seems more slanted towards generalization than memorization, maybe because it is supposed to have access to search/tools, also they seem to have more 'personality' than the old o models

tawdry socket Apr 16, 2025, 11:34 PM

#

So, dragontail/night whisperer are from google and they are better or similar to 2.5 experimental? Thats what the april 22 announcement is, isnt it?

twin wharf Apr 17, 2025, 8:24 AM

#

looks like o3-mini-high is still on the leaderboard, looks to be deprecated for o4-mini-high so I think it should be deleted now?

twin valve Apr 17, 2025, 2:08 PM

#

twin wharf looks like o3-mini-high is still on the leaderboard, looks to be deprecated for ...

nah. As long as the api endpoint is there it is not deprecated. Like many other models

twin valve Apr 17, 2025, 2:09 PM

#

glacial glacier okay added moats https://ktibow.github.io/lmb/#{%22paradigm%22:%22text%22,%22cat...

another request if I may. Add the latest update of the leaderboard (otherwise I tend to check the official leaderboard anyway)

glacial glacier Apr 17, 2025, 2:18 PM

#

twin valve another request if I may. Add the latest update of the leaderboard (otherwise I ...

i don't see any updates

twin valve Apr 17, 2025, 3:14 PM

#

I see, so in the data on github there is nothing like "Last updated: 2025-04-09." ?

hallow comet Apr 17, 2025, 5:19 PM

#

When will the leaderboard include o3 ?

hallow comet Apr 17, 2025, 5:20 PM

#

tawdry socket So, dragontail/night whisperer are from google and they are better or similar to...

Hard to say. It is known that Google will release Flash 2.5 and a coding model. An updated Gemini 2.5 too, but that's speculation

sinful falcon Apr 17, 2025, 5:56 PM

#

hallow comet When will the leaderboard include o3 ?

which did you gamble on?

hallow comet Apr 17, 2025, 5:59 PM

#

sinful falcon which did you gamble on?

o3

errant tusk Apr 17, 2025, 8:12 PM

#

why did flash land on the leaderboard so quickly and o3 and o4 mini are still not there

hallow comet Apr 17, 2025, 8:13 PM

#

errant tusk why did flash land on the leaderboard so quickly and o3 and o4 mini are still no...

flash was an anonymous model

#

o3 and o4 mini were just added

errant tusk Apr 17, 2025, 8:14 PM

#

thanks for your answer

sinful falcon Apr 17, 2025, 10:24 PM

#

when was the gemini 4-17 model added to vote on?

hallow comet Apr 17, 2025, 10:38 PM

#

sinful falcon when was the gemini 4-17 model added to vote on?

its flash model, was under anon name, now its public and directly in leaderboard

sinful falcon Apr 17, 2025, 10:38 PM

#

hallow comet its flash model, was under anon name, now its public and directly in leaderboard

i know

#

but when was the “anon” model added

#

to the arena for ppl to vote on

glacial glacier Apr 17, 2025, 10:48 PM

#

twin valve I see, so in the data on github there is nothing like "Last updated: 2025-04-09....

ah nvm i thought you meant "could you run the update script again"

#

uhhhhh

#

i might skip this one since i can't think of a good place to put it

#

you can always read https://github.com/KTibow/lmb/commits/main/src/routes/assets/results.json though

GitHub

History for src/routes/assets/results.json - KTibow/lmb

Language Model Board, a better way to read the LMSYS results - History for src/routes/assets/results.json - KTibow/lmb

#

also uhhh @twilit echo lmk when you calculate the thinking ratios for gemini 2.5 flash

twilit echo Apr 17, 2025, 11:06 PM

#

glacial glacier also uhhh <@126820015382069250> lmk when you calculate the thinking ratios for g...

~~very close to o4-mini (normal). I'll post my stuff tomorrow, too late rn~~
edit: actually nvm, since at the time I only looked at usage stats and didn't inspect actual content (which contained some API errors), the ratios are in fact not similar.

hallow comet Apr 17, 2025, 11:20 PM

#

Do you guys think o3 > 2.5 ?

timber owl Apr 17, 2025, 11:22 PM

#

hallow comet Do you guys think o3 > 2.5 ?

yes

sinful falcon Apr 17, 2025, 11:23 PM

#

hallow comet Do you guys think o3 > 2.5 ?

i think it depends on the user/task

#

i almost exclusively use llms for working on a codebase

#

so my preferences are highly correlated w the swe and coding benchmarks

#

and the o models do better on those

#

it’s hard for me to say which will do better in the arena

#

but my personal preference is oai over gemini

sinful falcon Apr 17, 2025, 11:24 PM

#

hallow comet Do you guys think o3 > 2.5 ?

i would be surprised if they are more than 10 apart in elo

hard violet Apr 18, 2025, 3:30 AM

#

guys do you think 2.5 will be the top model on leaderboard on apr 30? say 10am ET? if you could assign % chances to your views that would be helpful. also how often does leaderboard update. thank you

hallow comet Apr 18, 2025, 3:56 AM

#

hard violet guys do you think 2.5 will be the top model on leaderboard on apr 30? say 10am E...

Its 50/50. leaderboard will likely update in the next day or two, id be surprised if it took more. Also as a tip, dont trade manually

delicate sable Apr 18, 2025, 3:59 AM

#

hallow comet Its 50/50. leaderboard will likely update in the next day or two, id be surprise...

thank you. do you think the leaderboard will look different at 12 PM ET on APR 30?

sinful falcon Apr 18, 2025, 5:22 AM

#

hard violet guys do you think 2.5 will be the top model on leaderboard on apr 30? say 10am E...

oai is a lock

#

trust me

hallow comet Apr 18, 2025, 6:43 AM

#

sinful falcon trust me

as much as i trust a catfish

timid wind Apr 18, 2025, 10:10 AM

#

hallow comet Do you guys think o3 > 2.5 ?

my few messages from o3 felt ahead of everything by like 15-25 rating points

hallow comet Apr 18, 2025, 10:10 AM

#

timid wind my few messages from o3 felt ahead of everything by like 15-25 rating points

nah, if it wins its with 3 points , 5 max. at least thats how i see it

twilit echo Apr 18, 2025, 10:11 AM

#

glacial glacier also uhhh <@126820015382069250> lmk when you calculate the thinking ratios for g...

I added the final ratios to my token rate page

twin valve Apr 18, 2025, 10:58 AM

#

glacial glacier i might skip this one since i can't think of a good place to put it

yeah no worries

twin valve Apr 18, 2025, 11:00 AM

#

glacial glacier you can always read https://github.com/KTibow/lmb/commits/main/src/routes/assets...

but then you have it already, one could put the date of the update. Or at least the link. I mean as a user it is nice to have it in the same place. Anyway no worries, mine are just wishes.

Also a pity that localllama blocks github.io links because I think your leaderboard is more shareable and visually intuitive

#

btw o3 and the new oai models have somewhat limited replies. When I get them in battle mode with non-trivial questions, they give nice answers but pretty concise ones. So they lose sometimes as other models add details.

hallow comet Apr 18, 2025, 1:16 PM

#

when gemini 2.5 was released , it appeared on the leaderboard exactly 1 day and 3h after , for o3 it has been 1 day and 19h and still no leaderboard :/

twilit echo Apr 18, 2025, 1:32 PM

#

the leaderboard is just a marketing tool nowadays and doesn't represent real life, so who cares. Whether its #1 by a huge margin or #35 makes no difference anyway, the former just means they gamed better.

rancid moat Apr 18, 2025, 1:33 PM

#

the leaderboard is my favorite way to check the status of AI

hallow comet Apr 18, 2025, 1:36 PM

#

twilit echo the leaderboard is just a marketing tool nowadays and doesn't represent real lif...

"so who cares. Whether its #1 by a huge margin or #35" -> my portfolio

scenic moon Apr 18, 2025, 2:30 PM

#

hallow comet when gemini 2.5 was released , it appeared on the leaderboard exactly 1 day and ...

i think it was being tested under a different name for longer than a day

tawdry socket Apr 18, 2025, 2:35 PM

#

2.5 pro was being tested as nebular weeks before its release...

#

nebula*

hallow comet Apr 18, 2025, 3:00 PM

#

tawdry socket 2.5 pro was being tested as nebular weeks before its release...

huh , but even flash 2.5 got in the leaderboard so quick.
when do you think o3 will be there

scenic moon Apr 18, 2025, 3:14 PM

#

flash 2.5 was being tested under a different name as well... it was probably dragontail.

tawdry socket Apr 18, 2025, 3:36 PM

#

There were many google models introduced in last month. So, not sure which one was flash.

tawdry socket Apr 18, 2025, 3:38 PM

#

hallow comet huh , but even flash 2.5 got in the leaderboard so quick. when do you think o3 w...

I would be surprised if it does not show up in the next 5 days. The more people test it on arena, the faster it will show up.

#

Although since o3 is expensive, I dont know if arena has resources to test it often like o4-mini...

hallow comet Apr 18, 2025, 7:00 PM

#

tawdry socket Although since o3 is expensive, I dont know if arena has resources to test it of...

But dont side by side votes count ?

#

Ive tried 10 questions myself with o3 and gemini, no problems so far

tawdry socket Apr 18, 2025, 7:35 PM

#

~~Yes, that should count. Do that 2000 times and you will see them on leaderboard tomorrow~~ 😅 Side by side votes do not count

#

what do you notice btw? o3 is clearly better or similar to gemini?

timber owl Apr 18, 2025, 7:40 PM

#

o3 is better and similar

tawdry socket Apr 18, 2025, 7:42 PM

#

hallow comet But dont side by side votes count ?

Side by side votes do not count. Only battle mode votes because they are not anonymous in side by side votes...

sinful falcon Apr 18, 2025, 7:45 PM

#

tawdry socket Side by side votes do not count. Only battle mode votes because they are not ano...

bro was trying to rig it in his favor

hallow comet Apr 18, 2025, 8:14 PM

#

tawdry socket what do you notice btw? o3 is clearly better or similar to gemini?

o3 better, not by much

hallow comet Apr 18, 2025, 8:14 PM

#

sinful falcon bro was trying to rig it in his favor

nah i assumed they had measures

#

You can still rig it, because it's easy to detect which llm an output belongs to. For instance, just ask what llm they are, and they will say openai or google. But it's way too much effort

scenic moon Apr 18, 2025, 8:18 PM

#

they claim they exclude responses where the model reveals itself

hallow comet Apr 18, 2025, 8:18 PM

#

scenic moon they claim they exclude responses where the model reveals itself

still you can detect even if they dont reveal, but again , too much effort.

scenic moon Apr 18, 2025, 8:19 PM

#

true, their style is very different. chatgpt is probably the easiest to spot id say

hallow comet Apr 18, 2025, 8:20 PM

#

scenic moon true, their style is very different. chatgpt is probably the easiest to spot id ...

but if there is rigging happening, id say for sure its on the side of gemini

scenic moon Apr 18, 2025, 8:21 PM

#

because its ahead?

#

in polymarket*

hallow comet Apr 18, 2025, 8:23 PM

#

scenic moon in polymarket*

no, because ive seen a lot of google shilling happening, and google also promoted that they were nr 1 in this arena (which they can also optimise for) , maybe rig, maybe bias call it what you want

scenic moon Apr 18, 2025, 8:23 PM

#

theres one dude betting $14k against openai and $22k on gemini. he sure thinks google is going to win

#

https://polymarket.com/profile/0x36cc547537fdc25339129b6b3c1ef1404ef48642

hallow comet Apr 18, 2025, 8:26 PM

#

that's actually somewhat more in favor of openai than what the market thinks. he's saying 38% chance for openai, 62% for gemini; the market is at 28% for openai atm

slate compass Apr 18, 2025, 8:28 PM

#

just gonna go all-in on the top placer the second the leaderboard updates with o4-mini and o3 (hopefully at the same time)

lofty pasture Apr 18, 2025, 8:37 PM

#

tawdry socket what do you notice btw? o3 is clearly better or similar to gemini?

Since I am not interested in code and math, I don't like o3... just trash
Can't compare it to Gemini

scenic moon Apr 18, 2025, 8:38 PM

#

what do you use them for? I think it is better at almost everything

#

o3

#

except creative stuff which I havent tested

#

its formatting is a little odd though, this might be a problem

lofty pasture Apr 18, 2025, 8:43 PM

#

scenic moon except creative stuff which I havent tested

Since I am a medical student preparing for exams, I asked them for some tricks to memorize the different names of drugs and their doses, and to link the names and doses to something known and logical. No OpenAI model gets it, but the Gemini models are perfect and DeepSeek is next to Gemini... they are just wow. If those models are truly reasoning, why do they fail a task like that?!

scenic moon Apr 18, 2025, 8:44 PM

#

oh cool

#

yeah maybe gemini is better at that

junior flame Apr 18, 2025, 11:50 PM

#

Which free large language model is best for summarizing YouTube videos? I'm confused between Gemini 2.0 Flash and Gemini 2.5 Flash Preview 04-17 because they differ in tokens per minute

scenic moon Apr 19, 2025, 12:26 AM

#

summarizing actual videos or the transcript? if actual video, then you will need 2.5 pro

#

if not then id use 2.5 flash

#

ive used 2.0 flash but it is kind of bad with timestamps, ok at summarizing

twin wharf Apr 19, 2025, 8:54 AM

#

Id use 2.5 pro, its great at that

twin valve Apr 19, 2025, 11:38 AM

#

twilit echo the leaderboard is just a marketing tool nowadays and doesn't represent real lif...

disagree

twin valve Apr 19, 2025, 11:39 AM

#

hallow comet huh , but even flash 2.5 got in the leaderboard so quick. when do you think o3 w...

sometimes models are announced when they are made public in lmarena. Some other times they are announced before. Unless you want nonsense voting, one has to wait until models gain enough votes to have a reliable score. What difference does 1 week make in 99.999999% of cases? None.

twin valve Apr 19, 2025, 11:41 AM

#

twin valve disagree

On this, I strongly think that lmarena closely represents average chatbot use.

It doesn't represent api calls, that's true. For that we have openrouter rankings.

hallow comet Apr 19, 2025, 6:28 PM

#

rancid moat the leaderboard is my favorite way to check the status of AI

25% true. I know there are undervalued models, but not by that big a margin

vocal rain Apr 19, 2025, 6:39 PM

#

pliant vessel Apr 19, 2025, 6:57 PM

#

how is the score decided for the leaderboard?

hallow comet Apr 19, 2025, 8:18 PM

#

pliant vessel how is the score decided for the leaderboard?

blind voting in battle mode
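For context on how blind votes can turn into a score, here is a minimal sketch of an Elo-style update, assuming each battle is a head-to-head result between two anonymous models. The function names, the K-factor and the 400-point scale are illustrative assumptions; the Arena's own pipeline may differ (its paper describes fitting a Bradley-Terry model over all votes rather than updating one battle at a time).

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that model A beats model B under an Elo/logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, outcome_a, k=4.0):
    """Return new (rating_a, rating_b) after one battle.

    outcome_a is 1.0 if A won the blind vote, 0.0 if B won, 0.5 for a tie.
    k (hypothetical value) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers A.
print(elo_update(1000.0, 1000.0, 1.0))  # (1002.0, 998.0)
```

Because the voter does not see model names until after voting, the only input is which answer was preferred.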

hallow comet Apr 19, 2025, 8:26 PM

#

vocal rain

intelligence race and who has better model are two different things

my guess is that google and openai have achieved agi internally and are looking to control it (OpenAI researcher)

vocal rain Apr 19, 2025, 8:33 PM

#

hallow comet intelligence race and who has better model are two different things my guess is...

that is why i said intelligence race, also do you have sth to back your guess up?

hallow comet Apr 19, 2025, 8:35 PM

#

vocal rain that is why i said intelligence race, also do you have sth to back your guess up...

yes, basic logic, but no proof.

vocal rain Apr 19, 2025, 8:35 PM

#

hallow comet yes, basic logic, but no proof.

can you explain the logic?

#

why would they have agi internally?

hallow comet Apr 19, 2025, 8:40 PM

#

vocal rain why would they have agi internally?

they have the compute for 10x more powerful models
but they havent published nor reported to have trained such models

Which means, A they havent done so for whatever reason
or B they have and they are hiding it

Option B is very much the more likely for me. Powerful AI would automate jobs, automated jobs lead to protests and revolts, the last thing billionaires want (like those backing google and openai)

vocal rain Apr 19, 2025, 8:41 PM

#

hallow comet they have the compute for 10x more powerful models but they havent published no...

i think sam said recently that compute is not the issue anymore but running out of good data or sth

#

it could simply be that we dont scale as we would like

sinful falcon Apr 19, 2025, 9:44 PM

#

hallow comet they have the compute for 10x more powerful models but they havent published no...

bro trades prediction markets and has crazy conspiracy theories.....

#

🤣

hallow comet Apr 19, 2025, 9:55 PM

#

sinful falcon bro trades prediction markets and has crazy conspiracy theories.....

so billionaires behind these companies are telling us the whole truth ? thats a strong stance to take 🤣

sinful falcon Apr 19, 2025, 10:01 PM

#

hallow comet so billionaires behind these companies are telling us the whole truth ? thats a ...

do you think that every single employee at openai is just keeping their mouth shut about some AGI model?

#

every single one (even those that quit)

#

and that the billionaires behind openai wouldn't stand to gain so much more from releasing it

#

there are also a ton of independent companies all working on this so your logic would have to apply to them too

sinful falcon Apr 19, 2025, 10:03 PM

#

hallow comet they have the compute for 10x more powerful models but they havent published no...

automating jobs does not lead to protests and revolts. extreme poverty does

#

are you saying no one will have a job at all because of AI?

#

some billionaire who is concerned about riots wouldn't have to worry about riots if they had the first company to make AGI. they would become a multi trillionaire

hallow comet Apr 19, 2025, 10:15 PM

#

sinful falcon do you think that every single employee at openai is just keeping their mouth sh...

AGI is nothing short of a manhattan like project where everything is at stake, so its very clear there would be measures, first one that comes to mind is restricted access, and nda forms. You talk you lose everything and no one would believe you without proof.

other independent companies dont have google and openai resources

no1 would have a job cause of ai -> almost, very few people would be needed

making AGI = rich , yes, but publishing it and automating jobs collapses the system. so they are developing it, but quietly

lofty pasture Apr 19, 2025, 11:21 PM

#

hallow comet AGI is nothing short of a manhattan like project where everything is at stake, s...

I believe that but I am enjoying the chatbots 😆🙃

sinful falcon Apr 20, 2025, 12:22 AM

#

hallow comet AGI is nothing short of a manhattan like project where everything is at stake, s...

then why didn't they release a chatbot that gets how many 'r's in the word strawberry correctly until recently?

hallow iris Apr 20, 2025, 6:50 AM

#

scenic moon theres one dude betting $14k against openai and $22k on gemini. he sure thinks g...

Polymarket is really a bad thing for Lmarena.
I can spot super easily if a model is OpenAI or Gemini on first sight after any prompt.
You just need a few people to massively play on lmarena to make one or the other win all the time and you can rig the results...

#

You put 10 people right now and make them prompt for 10 hours a day, you can really rig o4 scores and make it look like trash by the end of the month.

hallow comet Apr 20, 2025, 7:39 AM

#

hallow iris You put 10 people right now and make them prompt for 10 hours a day, you can rea...

true maybe the team is handling that. maybe they did before poly too, to avoid biased reviews. its not too hard, just removing the 20-30% of user votes that are on either extreme (very pro o3 or very pro 2.5) would give better estimates.
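A rough sketch of the trimming idea above, assuming votes are available as (user, winner) pairs for a single model pairing. The data layout, the function name and the 25% default are hypothetical; nothing here is known to match what the lmarena team actually does.

```python
from collections import defaultdict


def trimmed_win_rate(votes, trim_fraction=0.25):
    """Win rate of model "A" after dropping the most lopsided voters.

    votes: list of (user_id, winner) tuples with winner in {"A", "B"}.
    trim_fraction: share of users to drop, split evenly between the most
    pro-A and the most pro-B users (an assumed policy).
    """
    per_user = defaultdict(lambda: [0, 0])  # user -> [votes for A, total votes]
    for user, winner in votes:
        per_user[user][1] += 1
        if winner == "A":
            per_user[user][0] += 1

    # Order users from most pro-B to most pro-A by their personal A share.
    ranked = sorted(per_user, key=lambda u: per_user[u][0] / per_user[u][1])
    cut = int(len(ranked) * trim_fraction / 2)
    kept = ranked[cut:len(ranked) - cut] if cut else ranked

    a_votes = sum(per_user[u][0] for u in kept)
    total = sum(per_user[u][1] for u in kept)
    return a_votes / total
```

Note that this also throws away honest users who simply have a consistent preference, which is the objection raised a few messages further down.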

slate compass Apr 20, 2025, 7:46 AM

#

hallow comet true maybe the team is handling that. maybe they did before poly too, to avoid b...

yeah, that was mentioned to some extent in the original paper

rapid cobalt Apr 20, 2025, 8:07 AM

#

it has safeguards, I looked into it awhile ago

lofty pasture Apr 20, 2025, 8:41 AM

#

hallow comet true maybe the team is handling that. maybe they did before poly too, to avoid b...

That wouldn't be a solution either. For example, if I always vote for a model, that's because I know it deserves it, and if I never vote for a model, it's because it is real trash... Removing my votes would make me so uncomfortable that I would stop voting forever...

lofty pasture Apr 20, 2025, 8:43 AM

#

hallow comet true maybe the team is handling that. maybe they did before poly too, to avoid b...

Just check the prompt and the result, or you can make a review button... If the user told you why he voted like that, then even if I vote 1000 times you mustn't change the result!!!!

sinful falcon Apr 20, 2025, 5:56 PM

#

hallow iris You put 10 people right now and make them prompt for 10 hours a day, you can rea...

there’s not enough money in the market to make it worth it

#

100 man hours

#

for that

glacial glacier Apr 20, 2025, 5:57 PM

#

sinful falcon there’s not enough money in the market to make it worth it

counterpoint

hallow comet Apr 20, 2025, 6:02 PM

#

sinful falcon 100 man hours

there are people who have bet 20k and stand to win 5x if openai (or gemini) wins. thats 80k profit. Sure that is not enough to rig or bribe ? I hope not but idk
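The arithmetic behind the 80k figure, as a small sketch. It assumes "5x" means the winning payout is five times the stake and that each winning share pays out 1 unit; the function names are made up for illustration.

```python
def profit_from_multiple(stake, payout_multiple):
    """Profit if the bet wins and pays payout_multiple times the stake."""
    return stake * (payout_multiple - 1)


def implied_probability(payout_multiple):
    """Share price (0..1) that would produce this payout multiple."""
    return 1.0 / payout_multiple


print(profit_from_multiple(20_000, 5))  # 80000
print(implied_probability(5))           # 0.2
```

So a 5x payout corresponds to buying at roughly a 20% implied probability.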

vocal rain Apr 20, 2025, 6:39 PM

#

vocal rain

Poll: who is currently leading the intelligence race in your opinion? (winning answer: 7 of 18 votes)

#

some bets on polymarket are not serious at all, you can literally bet on jesus coming back to earth in 2025

#

these ai bets are kinda like a joke

#

and enables some rather uneducated ppl to bet their money and lose it

#

and yeah, the ai bet markets are not really liquid either

surreal drum Apr 20, 2025, 6:49 PM

#

What bro tells me after losing a $10 parlay

hallow comet Apr 20, 2025, 7:24 PM

#

vocal rain some bets on polymarket are not serious at all, you can literally bet on jesus c...

lmao thats actually a legit market xDD

hallow iris Apr 20, 2025, 7:25 PM

#

hallow comet lmao thats actually a legit market xDD

Just like shitcoins, you put money on it, you bet some people will invest in it and you'll be able to sell your shares at a higher price than you bought it

hallow comet Apr 20, 2025, 7:25 PM

#

vocal rain and enables some rather uneducated ppl to bet their money and lose it

and educated ppl to react quick to news and win $

hallow iris Apr 20, 2025, 7:26 PM

#

Don't know why this message was rejected by the system, probably a banned word

hallow comet Apr 20, 2025, 7:27 PM

#

hallow iris Don't know why this message was rejected by the system, probably a banned word

that is true but that account would likely get flagged

hallow iris Apr 20, 2025, 7:28 PM

#

Yeah and then they install Tuxler which has undetectable residential IPs for free and switch at every round.

hallow comet Apr 20, 2025, 7:28 PM

#

i imagine its way more ez for a whale that bets 20k , to pay 5-10k on supporting lmarena and so getting to know what goes on .. early. ez 3x

hallow comet Apr 20, 2025, 7:29 PM

#

hallow iris Yeah and then they install Tuxler which has undetectable residential IPs for fre...

they could do that, but lmarena can still filter to only include people who have been around long enough not new ones

#

if theres huge difference between new peoples voting and people who have been around for 2+ years then thats clear rigging indicator
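One way to put a number on "huge difference" is a two-proportion z-test between the two cohorts. This is only a sketch: the cohort split, the counts and the function name are hypothetical, and a large gap could also have innocent causes (new users may simply ask different kinds of prompts).

```python
import math


def cohort_gap_z(wins_new, n_new, wins_old, n_old):
    """z-score for the gap in a model's win rate between new and veteran voters.

    wins_* is the number of battles the model won in that cohort,
    n_* is the number of battles it appeared in.
    """
    p_new = wins_new / n_new
    p_old = wins_old / n_old
    pooled = (wins_new + wins_old) / (n_new + n_old)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_new + 1 / n_old))
    return (p_new - p_old) / std_err


# Made-up counts: the model wins 30% with new accounts but 55% with veterans.
z = cohort_gap_z(300, 1000, 2750, 5000)
print(z < -3)  # True, far outside what chance alone would explain
```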

hallow iris Apr 20, 2025, 7:30 PM

#

Hopefully...

hallow comet Apr 20, 2025, 7:31 PM

#

Yeah still could go that way, or whale bribe way .. thats why you never bet early, you must be the first to react

#

The only time to bet before news is if you know the news before the news is public 😅

hoary wind Apr 21, 2025, 4:13 AM

#

it go down so fast 😭

twin valve Apr 21, 2025, 7:16 AM

#

glacial glacier you can always read https://github.com/KTibow/lmb/commits/main/src/routes/assets...

I like the change of the color for "new". more distinguishable

#

also did the order of the categories on LMB change? My preferred ones are easier to reach.

pearl marsh Apr 21, 2025, 11:48 AM

#

When will o3 and o4-mini be put on the leaderboard?

hallow comet Apr 21, 2025, 4:23 PM

#

pearl marsh When will o3 and o4-mini be put on the leaderboard?

Does it matter so much if its today, 30th of April or May 1st ?

twin valve Apr 21, 2025, 5:10 PM

#

pearl marsh When will o3 and o4-mini be put on the leaderboard?

they are being tested, as long as not enough votes are there, it is not good to show their scores. Just wait a bit. Why are people so impatient for something absolutely not life changing.

obsidian egret Apr 21, 2025, 6:21 PM

#

twin valve they are being tested, as long as not enough votes are there, it is not good to ...

I think its a sign people trust the Arena and are excited

feral shale Apr 21, 2025, 7:27 PM

#

twin valve they are being tested, as long as not enough votes are there, it is not good to ...

polymarket bets

timber owl Apr 21, 2025, 8:03 PM

#

twin valve they are being tested, as long as not enough votes are there, it is not good to ...

some people are betting on it

twin valve Apr 21, 2025, 9:21 PM

#

well people can bet on everything, then everything becomes life changing. I don't think that's an argument.

feral shale Apr 21, 2025, 9:46 PM

#

twin valve well people can bet on everything, then everything becomes life changing. I don'...

We're just explaining why we think they're doing it, not saying they should

glacial glacier Apr 21, 2025, 11:22 PM

#

twin valve also did the order of the categories on LMB change? My preferred one are easier ...

nope, it's been the same since the start

smoky bramble Apr 22, 2025, 1:49 AM

#

hello

thorny ginkgo Apr 22, 2025, 6:22 AM

#

Llama 4 maverick was rated ~1400 on launch and now it is 1271. Is it just due to error margin or something else

slate compass Apr 22, 2025, 8:16 AM

#

thorny ginkgo Llama 4 maverick was rated ~1400 on launch and now it is 1271. Is it just due to...

long story - meta tried to cheat the benchmark by using a model specifically optimized to do well on lmarena, and then lmsys forced them to use their publicly-released model instead

twin valve Apr 22, 2025, 10:08 AM

#

feral shale We're just explaining why we think they're doing it, not saying they should

you have a point

twin valve Apr 22, 2025, 10:09 AM

#

slate compass long story - meta tried to cheat the benchmark by using a model specifically opt...

to add: "do well on lmarena" means "make the output more enjoyable for humans". Thus emoji, structured formatting and so on. I still think that version is used in meta services (like whatsapp)

timber owl Apr 22, 2025, 10:59 AM

#

i hate that formatting because they like putting 5 newlines after each sentence

timber owl Apr 22, 2025, 11:19 AM

#

and they love bullet points as everyone knows which is usually not the best way to format

crude hawk Apr 22, 2025, 2:18 PM

#

hallow iris Don't know why this message was rejected by the system, probably a banned word

yeah it introduces incentives among all sorts of actors who have no stake in the actual performance of a model (like i've always thought major labs wouldn't waste their time trying to rig the votes, because the performance of their models in the wild will always ultimately speak for itself.. even if they generate initial 'hype', the net result after being found out is likely to be negative [see Meta's llama4 maverick-arena-version fallout.. they ofc didn't manipulate votes .. but just released a model specifically juiced for the arena.. and it's been a disaster for them and the model release imo])

#

but betting markets and a crowd-sourced ranking project to my mind do not make a healthy mix .. there will invariably be those who at least consider ways to manipulate the system, if not have a go... it's like the inherent nature of betting...

#

doesn't bother me that people bet on the leaderboard - why not (and it does kinda seem illiquid anyway).. but it's not really an ideal situation in terms of the integrity of the voting system.. i dunno if LMarena asking polymarket not to provide coverage would be possible / worthwhile.. ig if there's enough interest, the market would just be created / facilitated elsewhere.. be a game of whack-a-mole

pliant vessel Apr 22, 2025, 8:15 PM

#

how come there are 4 models at rank 2 even if they have different arena scores?

noble coral Apr 22, 2025, 8:20 PM

#

pliant vessel how come there are 4 models at rank 2 even if they have different arena scores?

theyre within margin of error

pliant vessel Apr 22, 2025, 8:21 PM

#

what's the margin
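The margin is the confidence interval shown next to each score. A sketch of how overlapping intervals can produce tied ranks, assuming the rank is "1 + the number of models whose lower bound sits above this model's upper bound". The model names and numbers below are invented for illustration and the function name is hypothetical.

```python
def rank_by_upper_bound(models):
    """models: dict name -> (score, ci_lower, ci_upper). Returns name -> rank."""
    ranks = {}
    for name, (_, _, upper) in models.items():
        # Count models that are better even after allowing for the margin.
        clearly_better = sum(
            1
            for other, (_, lower, _) in models.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


board = {
    "model-a": (1440, 1432, 1448),
    "model-b": (1420, 1413, 1427),
    "model-c": (1412, 1404, 1420),
    "model-d": (1408, 1399, 1417),
    "model-e": (1405, 1397, 1413),
    "model-f": (1380, 1372, 1388),
}
print(rank_by_upper_bound(board))
# {'model-a': 1, 'model-b': 2, 'model-c': 2, 'model-d': 2, 'model-e': 2, 'model-f': 6}
```

Four models share rank 2 here because only model-a is clearly ahead of each of them, and none of them is clearly ahead of another.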

twin valve Apr 22, 2025, 8:55 PM

#

glacial glacier nope, it's been the same since the start

for the graph at the end of the rankings on your site, would it be possible to add the pareto line? If it is too much work then no worries

#

something like this

sinful falcon Apr 22, 2025, 10:47 PM

#

hallow iris Don't know why this message was rejected by the system, probably a banned word

let’s say there’s some probability p1 of someone gaming the system in favor of oai, and there’s some other probability, p2, of someone gaming it in favor of gemini, unless there is a substantial difference between p1 and p2 then it just introduces additional variance and doesn’t change the EV for ppl wagering on the markets

#

so it’s w/e

#

but like adverse selection is part of every single market

#

u think ppl don’t know earnings reports before they’re released to the public?

#

do u think that the ppl running lmarena aren’t telling their friends which model “looks promising”?

#

million ways to get an edge in any market

#

if you’re trading it you’re willingly accepting the risk your bet doesn’t play out

#

no crying in the casino

sinful falcon Apr 22, 2025, 10:51 PM

#

sinful falcon let’s say there’s some probability p1 of someone gaming the system in favor of o...

also if u think someone is doing this for just one model (but u don’t know which) then the fair value for the two models is just 50% and there’s some serious EV to be made rn
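A toy model of the p1/p2 argument, under strong assumptions that are not from the discussion itself: at most one side rigs, rigging always succeeds, and otherwise OpenAI wins with some "fair" probability. The function and parameter names are hypothetical.

```python
def openai_win_probability(fair_prob, p_rig_openai, p_rig_gemini):
    """Chance OpenAI ends up #1 in the toy model.

    fair_prob: chance OpenAI wins when nobody interferes.
    p_rig_openai / p_rig_gemini: chance someone successfully rigs for that side
    (treated as mutually exclusive events).
    """
    if p_rig_openai + p_rig_gemini > 1.0:
        raise ValueError("rigging probabilities cannot sum to more than 1")
    untouched = 1.0 - p_rig_openai - p_rig_gemini
    return p_rig_openai + untouched * fair_prob


print(openai_win_probability(0.5, 0.1, 0.1))  # about 0.5: symmetric rigging cancels out
print(openai_win_probability(0.3, 0.5, 0.5))  # 0.5: certain rigging, unknown side
print(openai_win_probability(0.3, 0.2, 0.0))  # about 0.44: one-sided rigging moves the odds
```

In this toy setup, equal p1 and p2 leave the odds untouched only when the fair odds are already 50/50; otherwise they pull the probability toward 50%, and a one-sided p1 shifts it outright.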

glacial glacier Apr 23, 2025, 12:37 AM

#

twin valve for the graph at the end of the rankings on your site, would it be possible to a...

it's hard to get right

#

i've tried before but probably won't implement anytime soon
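For reference, the core of a Pareto line is small; the fiddly part is usually the plotting. A minimal sketch, assuming the graph has a cost-like x axis (lower is better) and the arena score on the y axis (higher is better). The axes and the function name are assumptions, not taken from the site's code.

```python
def pareto_frontier(points):
    """Return the (cost, score) points that no other point beats on both axes.

    points: list of (cost, score) tuples. A point stays on the frontier if
    every cheaper-or-equal point has a strictly lower score.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier


models = [(1, 1200), (2, 1250), (2, 1230), (3, 1240), (5, 1300)]
print(pareto_frontier(models))  # [(1, 1200), (2, 1250), (5, 1300)]
```

Connecting the returned points in order gives the line.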

crude hawk Apr 23, 2025, 4:31 AM

#

sinful falcon let’s say there’s some probability p1 of someone gaming the system in favor of o...

that seems highly idealised..based on purely hypothetical / arbitrary assumptions rather than anything empirical.. like i don't get why such an equilibrium would be the natural end state of any and all attempted manipulation/s..

#

i mean what if there is a substantial difference between p1 and p2? what if there's no p2 at all? the underlying assumptions seem purely arbitrary

sinful falcon Apr 23, 2025, 4:32 AM

#

crude hawk i mean what if there _is_ a substantial difference between p1 and p2? what if th...

you’re assuming there is a p1 and no p2

#

seems arbitrary

crude hawk Apr 23, 2025, 4:32 AM

#

what?

#

i'm responding to your post

#

let’s say there’s some probability p1 of someone gaming the system in favor of oai, and there’s some other probability, p2, of someone gaming it in favor of gemini

sinful falcon Apr 23, 2025, 4:32 AM

#

nevermind i thought u were the person who said someone could rig

#

either way it’s kinda silly when ppl freak out about this cuz like

#

there’s information asymmetry in every market

#

and if u think it’s worse here

#

then don’t trade

#

no crying in the casino

crude hawk Apr 23, 2025, 4:34 AM

#

discussing it ≠ freaking out..

crude hawk Apr 23, 2025, 4:34 AM

#

sinful falcon there’s information asymmetry in every market

obviously..

#

but not all markets are the same...

#

rigging LMArena leaderboard would be easier than say rigging the result of the SuperBowl or a US pres election

crude hawk Apr 23, 2025, 4:37 AM

#

sinful falcon there’s information asymmetry in every market

LMarena staff telling their buddies stuff in advance who then use that info to bet on polymarket would be something resembling insider trading - not market manipulation

#

crowd sourced anything is inherently vulnerable... doesn't mean the arena is being rigged (i don't think it is) - but the idea that 'well, there'd be multiple manipulators and it'd all just net out anyway so what's it matter' just seems kinda naive

crude hawk Apr 23, 2025, 4:42 AM

#

sinful falcon no crying in the casino

yeah but LMArena isn't a casino

#

no one is complaining about losing money lol
i think you're missing the point (like i said, it doesn't bother me at all that people are betting on the leaderboard – i gamble all the time, on stocks, horse racing, roulette; whatever), i just find the fact that they are able to problematic in terms of introducing incentives that otherwise would not exist for actors to attempt to manipulate the leaderboard

sinful falcon Apr 23, 2025, 4:57 AM

#

they could just make it so you need to make an lmarena account

#

that way if there’s ever any funny business

#

they know who to go after

hallow iris Apr 23, 2025, 9:02 AM

#

sinful falcon let’s say there’s some probability p1 of someone gaming the system in favor of o...

Actually the issue with gaming with Gemini is that it is wayyyy harder, because there are already 9000 votes.
With OpenAI's new models, usually for the first update on leaderboard, there are like 3-4k votes. So you're making a difference that can make it not be rank one with only 200 votes. That's how flawed it is.

#

And Gemini 2.5 Pro-exp isn't going through hard testing on lmarena right now so it's harder to find this one in particular
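A back-of-the-envelope sketch of why the same 200 votes matter more against 3-4k votes than against 9000. It assumes the rigged votes are all losses for the target model and treats every opponent as one pool, which is a simplification; the 55% honest win rate is a made-up number.

```python
import math


def diluted_win_rate(honest_votes, honest_win_rate, rigged_losses):
    """Win rate after adding rigged_losses votes that all go against the model."""
    wins = honest_votes * honest_win_rate
    return wins / (honest_votes + rigged_losses)


def rating_gap(win_rate, scale=400.0):
    """Elo-style rating gap implied by a win rate against the pool."""
    return scale * math.log10(win_rate / (1.0 - win_rate))


for honest_votes in (3500, 9000):
    before = rating_gap(0.55)
    after = rating_gap(diluted_win_rate(honest_votes, 0.55, 200))
    print(honest_votes, round(before - after, 1))  # rating points lost to the 200 votes
```

Under these assumptions the drop is a bit more than twice as large for the model with 3500 votes, which can be enough to change the order when the top models are within a few points of each other.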

hallow comet Apr 23, 2025, 11:01 AM

#

hallow iris Actually the issue with gaming with Gemini is that it is wayyyy harder, because ...

Good enough

surreal drum Apr 23, 2025, 5:51 PM

#

are you guys suggesting that this rigging is happening right now? or that it’s only a matter of time? because current leaderboard scores are pretty well-aligned with benchmark results and user sentiment.
a baseline “vibe” that the arena is being gamed is not conclusive. there’s literally zero evidence to prove it.

#

also, yes you’re absolutely correct that it is extremely easy to rig, but on the same token, users have been very quick to detect BS’ed scores. Look at Maverick 4, for example.

#

Though, to give yall credit, the absence of evidence is not the evidence of absence. it’s definitely POSSIBLE that this is happening.

delicate sable Apr 23, 2025, 6:19 PM

#

I think its pretty unlikely the leaderboard is currently being rigged and I actually concluded that rigging the lmarena leaderboard is a more difficult way to market manipulate than some other things you could do.

You'd think this would protect the integrity of the leaderboard but the main thing is that its not particularly clever to rig it. Every degen and their mom has realized that its possible someone could try.

sinful falcon Apr 23, 2025, 7:44 PM

#

delicate sable I think its pretty unlikely the leaderboard is currently being rigged and I actu...

what are some other things one could do?

#

(asking for a friend)

woven hearth Apr 23, 2025, 9:01 PM

#

Hello

#

will there be a leaderboard update on the 29th? Or 30th?

delicate sable Apr 23, 2025, 9:11 PM

#

April 30 at 11 AM

surreal drum Apr 23, 2025, 11:14 PM

#

sinful falcon what are some other things one could do?

“what are some other things one could do?”

drowsy umbra Apr 24, 2025, 9:17 AM

#

Any plan for a leaderboard for video generation models?

twin valve Apr 24, 2025, 12:03 PM

#

glacial glacier it's hard to get right

you added the date of the introduction of the model. neat!

slate compass Apr 24, 2025, 2:13 PM

#

delicate sable I think its pretty unlikely the leaderboard is currently being rigged and I actu...

yeah - come to think of it, I can imagine a few really fun ones

#

i think that, with the level of sophistication required to actually rig the arena without being found out, you'd probably realize as much

timid wind Apr 24, 2025, 2:36 PM

#

crude hawk LMarena staff telling their buddies stuff in advance who then use that info to b...

is it against PMs rules?

delicate sable Apr 24, 2025, 2:41 PM

#

timid wind is it against PMs rules?

against kalshi rules but not polymarket rules. I also really doubt this is happening

timid wind Apr 24, 2025, 2:42 PM

#

drowsy umbra Any plan for a leaderboard for video generation models?

arent they still useless?

timid wind Apr 24, 2025, 2:43 PM

#

delicate sable against kalshi rules but not polymarket rules. I'm also really doubt this is hap...

the point of PM is information gathering in first place, TBH I am not sure that such actions should be forbidden

delicate sable Apr 24, 2025, 2:45 PM

#

timid wind the point of PM is information gathering in first place, TBH I am not sure that ...

i think a credible fear of insider trading results in worse predictions overall

#

kalshi also markets itself as a securities trading platform so from that perspective insider trading is very bad

timid wind Apr 24, 2025, 2:46 PM

#

delicate sable i think a credible fear of insider trading results in worse predictions overall

if someone literally changes the result then yes but if somebody just knows the result in advance then I think it can be OK

crude hawk Apr 24, 2025, 2:46 PM

#

timid wind the point of PM is information gathering in first place, TBH I am not sure that ...

eh? i mean LMarena staff could just bet on it themselves if they were willing to do that (tip off mates)

#

if either are acceptable... literally what would be the point of the market

delicate sable Apr 24, 2025, 2:47 PM

#

timid wind if someone literally changes the result then yes but if somebody just knows the ...

from a market structure perspective nobody's gonna trade on a market where theres a chance that someone has the answer beforehand

crude hawk Apr 24, 2025, 2:47 PM

#

yeah exactly

delicate sable Apr 24, 2025, 2:47 PM

#

so the forecasts will just get worse and worse over time

timid wind Apr 24, 2025, 2:48 PM

#

delicate sable from a market structure perspective nobody's gonna trade on a market where there...

what does it really change? if I am right that Gemini will be #1 I am right no matter who knows what

crude hawk Apr 24, 2025, 2:49 PM

#

if you literally know that gemini will be #1 because you are the one compiling the leaderboard

#

that's not an information asymmetry

#

it's a totally broken market

delicate sable Apr 24, 2025, 2:50 PM

#

the whole idea behind prediction markets is that there is a financial incentive to make good predictions. liquid markets result in better predictions since the financial incentive is stronger

timid wind Apr 24, 2025, 2:50 PM

#

I can't get it TBH

delicate sable Apr 24, 2025, 2:50 PM

#

if you know you will trade against someone who has the answer you cannot make money

#

so people will just stop trading the market

crude hawk Apr 24, 2025, 2:51 PM

#

yeah that sounds like a natural law of economics to me ha

timid wind Apr 24, 2025, 2:52 PM

#

delicate sable if you know you will trade against someone who has the answer you cannot make mo...

when you're predicting you're not competing with others, only with real world, I can't get it

timid wind Apr 24, 2025, 2:54 PM

#

crude hawk yeah that sounds like a natural law of economics to me ha

what exactly?

crude hawk Apr 24, 2025, 2:55 PM

#

if you know you will trade against someone who has the answer you cannot make money
so people will just stop trading the market

delicate sable Apr 24, 2025, 2:55 PM

#

timid wind when you're predicting you're not competing with others, only with real world, I...

dont get what you're saying here

#

what does it mean to be 'competing with the real world'

timid wind Apr 24, 2025, 2:56 PM

#

delicate sable dont get what you're saying here

your predictions don't depend on others'

crude hawk Apr 24, 2025, 2:56 PM

#

the odds that you get for it do

delicate sable Apr 24, 2025, 2:56 PM

#

they do lol

#

you're trading against real other people

#

if i know the answer and buy 50% of the open interest on a strike who do you think that money i win will come from

timid wind Apr 24, 2025, 2:57 PM

#

crude hawk the odds that you get for it do

how? you're making a prediction, you calculate probability, not someone else

delicate sable Apr 24, 2025, 2:58 PM

#

say i forecast the odds of gemini winning to be 70% and openAI to be 30%

#

someone with a large bankroll who knows the answer buys openAI to 40%

#

what trade do i do
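The dilemma in numbers, as a sketch. It assumes a two-outcome market where Gemini trades at about 60% once OpenAI has been bought up to 40%, that each winning share pays 1, and that an insider who bought OpenAI knows Gemini loses. Function names and the "insider share" parameter are hypothetical.

```python
def expected_profit_per_share(belief, price, insider_share=0.0):
    """Expected profit from buying one share at `price`.

    belief: my own forecast that the outcome happens.
    insider_share: chance the price move came from someone who already knows
    the outcome will NOT happen (in which case the share pays 0).
    """
    return (1.0 - insider_share) * belief - price


def breakeven_insider_share(belief, price):
    """Insider chance above which the trade stops being profitable."""
    return 1.0 - price / belief


# Buying Gemini at 0.60 with a 0.70 forecast looks good if nobody knows more...
print(expected_profit_per_share(0.70, 0.60) > 0)        # True
# ...but loses the whole price if the buyer on the other side knew the answer.
print(expected_profit_per_share(0.70, 0.60, 1.0) < 0)   # True
```

With these numbers the trade already breaks even at roughly a 1-in-7 chance of an informed counterparty, which is the reason given here for people leaving the market.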

timid wind Apr 24, 2025, 2:58 PM

#

delicate sable you're trading against real other people

this is not only trading, smartcontract(s) with the oracle do the thing

timid wind Apr 24, 2025, 2:59 PM

#

delicate sable what trade do i do

WDYM?

delicate sable Apr 24, 2025, 3:00 PM

#

i feel like you dont understand the basics of prediction markets

timid wind Apr 24, 2025, 3:00 PM

#

guys I am sorry but you didnt provide a single argument

delicate sable Apr 24, 2025, 3:00 PM

#

not gonna keep arguing we can just agree to disagree on this

#

fwiw some people do share your opinion

timid wind Apr 24, 2025, 3:00 PM

#

delicate sable not gonna keep arguing we can just agree to disagree on this

so no arguments?

delicate sable Apr 24, 2025, 3:00 PM

#

yes you won the argument

crude hawk Apr 24, 2025, 3:02 PM

#

timid wind how? you're making a prediction, you calculate probability, not someone else

we're going round in circles.. here's sonn3.7's take:

sinful falcon Apr 24, 2025, 3:06 PM

#

timid wind so no arguments?

you are so smart

#

i wish i could win arguments like you

#

please let me know what trades you make next time so i can trade accordingly

timid wind Apr 24, 2025, 3:07 PM

#

crude hawk we're going round in circles.. here's sonn3.7's take:

if you're wrong at predictions you lose, thats right, but it doesnt mean that you will quit, because how do you know you're wrong?

timid wind Apr 24, 2025, 3:08 PM

#

sinful falcon you are so smart

trolling sarcasm doesnt make you smart either and is destructive for the discussion

sinful falcon Apr 24, 2025, 3:08 PM

#

why do you think i’m trolling

#

i am very dumb

#

i need advice from you to make my next trade

#

please let me know what you want to trade next

timid wind Apr 24, 2025, 3:11 PM

#

crude hawk we're going round in circles.. here's sonn3.7's take:

anyway banning someone from predicting because he knows the answer too well is.. just strange given that the goal is gathering information, not gambling between people who dont know anything

tawdry socket Apr 24, 2025, 3:13 PM

#

So, you are saying insider trading should be legal?

sinful falcon Apr 24, 2025, 3:14 PM

#

timid wind anyway banning someone at predicting because he knows answer too well is.. just ...

do you think it should be legal to give one person at the table for online poker ability to see all the cards?

timid wind Apr 24, 2025, 3:16 PM

#

tawdry socket So, you are saying insider trading should be legal?

in case of predictions markets where goal is clearly information gathering, why not

anyway the amount of information that people have is unequal. it is not a prediction competition where every participant has equal information

timid wind Apr 24, 2025, 3:16 PM

#

sinful falcon do you think it should be legal to give one person at the table for online poker...

do you think it should be legal to write senseless messages on online communication platforms?

sinful falcon Apr 24, 2025, 3:17 PM

#

timid wind do you think it should be legal to write senseless messages on online communicat...

i am ESL

#

is there a problem with my grammar

timid wind Apr 24, 2025, 3:17 PM

#

timid wind in case of predictions markets where goal is clearly information gathering, why ...

the goal is not to define who is the best

tawdry socket Apr 24, 2025, 5:42 PM

#

timid wind in case of predictions markets where goal is clearly information gathering, why ...

If you want prediction markets without these rules, I am sure you will find markets for them. Companies which have these rules do this because it gives them better business from more markets and higher volume.

delicate sable Apr 24, 2025, 5:47 PM

#

i feel like you are just making really big claims about how prediction markets work but aren't very experienced with them

surreal drum Apr 24, 2025, 6:22 PM

#

delicate sable i feel like you are just making really big claims about how prediction markets w...

It’s so much harder to argue with a dumb person 😂

delicate sable Apr 24, 2025, 6:23 PM

#

i don't think they're dumb though

#

i feel like its just a very nuanced and random thing

surreal drum Apr 24, 2025, 6:23 PM

#

Very confidently incorrect

#

No, for most people, it’s very clear why it’s not ethical to insider trade

delicate sable Apr 24, 2025, 6:26 PM

#

surreal drum No, for most people, it’s very clear why it’s not ethical to insider trade

this wasnt a discussion on ethics though...

surreal drum Apr 24, 2025, 6:27 PM

#

yeah, probably bad word choice. i mean, though, it’s obvious why it’s wrong, unfair

delicate sable Apr 24, 2025, 6:29 PM

#

i felt like this was only a discussion about whether it results in better/worse predictions

#

i also dont think the answer is obvious a lot of people disagree on it

timid wind Apr 24, 2025, 7:34 PM

#

tawdry socket If you want prediction markets without these rules, I am sure you will find mark...

are there any serious and active prediction markets other than PM? PM doesn't have these rules, does it?

delicate sable Apr 24, 2025, 7:35 PM

#

why do you speak so confidently about these things lol

#

you don't even know kalshi exists?

timid wind Apr 24, 2025, 7:39 PM

#

if someone does know the answer in advance and vote for it, it can only improve predictions overall, am I wrong?

delicate sable Apr 24, 2025, 7:39 PM

#

its just surface level

#

IMO

timid wind Apr 24, 2025, 7:48 PM

#

delicate sable you don't even know kalshi exists?

available only for U.S. residents.

delicate sable Apr 24, 2025, 7:49 PM

#

timid wind > available only for U.S. residents.

im aware

#

it is a 'serious and active prediction market'

#

i'd also argue any prediction market where you trade real money is serious

#

im not sure why residency requirements make a pm less 'serious or active'

tawdry socket Apr 24, 2025, 10:15 PM

#

timid wind if someone does know the answer in advance and vote for it, it can only improve ...

You are assuming that allowing insider trades does not affect market and volume. If I know that insiders can buy up orders as soon as they get information, I am not going to place any order. If there are no orders, then there is no opportunity for insiders. So, it basically kills the markets even for insiders.

Your argument works only when people dont know that there are insiders. Then, insiders will bring markets to correct values sooner than otherwise. If people know i.e., insider trading is legal --> no volume --> no trading, not even insider trading --> wrong predictions.

#

Trust me, big players on prediction markets dont gamble. They trade because they think they have an edge. If they know that they dont i.e., only gamblers are left in the market. Which does not help predictions.

#

So, if you want accurate predictions, you increase the publicity of prediction markets and make them popular or competitive. Not by outright legalizing insider trading which basically kills the markets.

sinful falcon Apr 24, 2025, 11:49 PM

#

idk man dinahu seemed like he was very educated on the subject

#

are you sure you want to take an opposing take?

#

seems unwise to oppose someone so smart

crude hawk Apr 25, 2025, 8:32 AM

#

tawdry socket So, if you want accurate predictions, you increase the publicity of prediction m...

yeah.. i mean.. it's not a prediction market if a participant (either directly or indirectly) has literally perfect information about the thing on which everyone else is speculating/gambling (e.g. which model will be number one on the leaderboard after the next update).. they aren't predicting anything if they quite literally know with 100% certainty what the outcome will be
[it is a prediction market if individuals with such perfect information do not use it or disclose it to others - then it's a fair playing field. but it's basically an honour system as far as i can tell lol.. or at least it's not in LMsys' interests to exploit their info advantage - because the moment they were found out to be doing so, their and the Arena's integrity collapses]

crude hawk Apr 25, 2025, 8:33 AM

#

delicate sable i also dont think the answer is obvious a lot of people disagree on it

tbh i don't really get the nuance here... i feel like i'm missing something if that's the case ha

glacial glacier Apr 26, 2025, 1:04 AM

#

i'm surprised r1 has held up this well

crude hawk Apr 28, 2025, 2:50 AM

#

lol are people really this dumb

#

the LMArena leaderboard isn't like the weather, or the outcome of a sporting game or election

#

if some pollster has info not publicly released, they can use that to bet on the outcome of a particular election. they have an info advantage (as an 'insider' of a polling company), but they don't know the literal outcome of the election; and so yeah, it theoretically leads to a more accurate prediction market

#

a boxer's trainer might know that they are injured before a bout, and use that info to place a bet (e.g. that the boxer will lose); the 'insider' has privileged info but again, they don't actually know what the outcome will be for certain.. so sure.. more accurate prediction market yadada

#

but the #1 spot on the LMArena leaderboard is straight-up known by insiders before an update is actually published.. they don't have an info advantage; they literally know the actual outcome, on which people are betting, before it is released publicly..

#

how that difference and the problems it potentially entails is not blindingly obvious is beyond baffling

slim marsh Apr 28, 2025, 3:31 AM

#

hi

gritty moat Apr 28, 2025, 10:52 AM

#

Guys stop yapping ong

neon willow Apr 28, 2025, 1:42 PM

#

At this point they should just rename this to #gambling

sinful falcon Apr 29, 2025, 12:28 AM

#

is there supposed to be an update tomorrow?

#

with a new model?

sinful falcon Apr 29, 2025, 12:28 AM

#

neon willow At this point they should just rename this to #gambling

it's not gambling when i know i'm gonna win

#

because the mods tell me in advance

#

well ig for all of y'all plebs it is

#

nevermind

viral maple Apr 29, 2025, 12:56 AM

#

surreal drum yeah, probably bad word choice. i mean, though, it’s obvious why it’s wrong, unf...

have you even said thank you

twin valve Apr 29, 2025, 7:02 AM

#

can we move the polymarket discussion elsewhere?

woven hearth Apr 29, 2025, 3:43 PM

#

Besides the discussion, will there be a leaderboard update today/tomorrow?

twin valve Apr 29, 2025, 7:49 PM

#

normally it is about 1 week

#

4-5 updates per month

hollow patrol Apr 30, 2025, 5:00 AM

#

Hey guys, I am working on a model recommending system and I would love to use the lmarena leaderboard as source to base the recommendations on. For now the way I go about it is simply downloading this dataset maintained by someone:

https://huggingface.co/datasets/mathewhe/chatbot-arena-elo

And filter within that dataset. But this is just for the LLMs.

Is there any intent to create an API for the leaderboard? Or something similar with which I can easily access the leaderboard data in my applications?

mathewhe/chatbot-arena-elo · Datasets at Hugging Face
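
The "download the dataset and filter it" approach described above could look roughly like the sketch below. It only uses the standard library and works on CSV text; the column names (`Model`, `Arena Score`, `Votes`, `License`) and the sample rows are assumptions made for illustration, not the confirmed schema of the linked dataset.

```python
import csv
import io

# Hypothetical column names and rows; check the real dataset's schema first.
SAMPLE_CSV = """Model,Arena Score,Votes,License
model-a,1400,12000,Proprietary
model-b,1350,8000,Apache 2.0
model-c,1290,300,MIT
"""

def recommend(csv_text, min_score=0, min_votes=0, open_only=False):
    """Return model names sorted by score, best first, after filtering."""
    keep = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["Arena Score"])
        votes = int(row["Votes"])
        if score < min_score or votes < min_votes:
            continue  # too weak, or too few votes to trust the score
        if open_only and row["License"] == "Proprietary":
            continue  # caller asked for non-proprietary models only
        keep.append((score, row["Model"]))
    return [name for _, name in sorted(keep, reverse=True)]
```

For example, `recommend(SAMPLE_CSV, min_votes=1000, open_only=True)` keeps only the hypothetical `model-b`. A vote threshold is included because a score backed by very few votes is less reliable.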

glacial glacier Apr 30, 2025, 5:06 AM

#

hollow patrol Hey guys, I am working on a model recommending system and I would love to use th...

you could write something similar to this script https://github.com/KTibow/lmb/blob/main/handling/convert_pickle.py

GitHub

lmb/handling/convert_pickle.py at main · KTibow/lmb

Language Model Board, a better way to read the LMSYS results - KTibow/lmb

hollow patrol Apr 30, 2025, 6:13 AM

#

glacial glacier you could write something similar to this script https://github.com/KTibow/lmb/b...

Thanks for sharing! I do feel like its still a bit sub-optimal though..

twin valve Apr 30, 2025, 9:20 AM

#

hollow patrol Hey guys, I am working on a model recommending system and I would love to use th...

wouldn't that be "prompt to leaderboard" ?

#

this https://github.com/lmarena/p2l

GitHub

GitHub - lmarena/p2l: Prompt-to-Leaderboard

Prompt-to-Leaderboard. Contribute to lmarena/p2l development by creating an account on GitHub.

woven hearth Apr 30, 2025, 11:26 AM

#

guys

#

will there be a leaderboard update today?

low patrol Apr 30, 2025, 1:09 PM

#

Hey Hi
I am from BharatGen, we have built our own llm with indic nuance
May I know how we can integrate our model api in chatbot arena to get better comparison

glacial glacier Apr 30, 2025, 5:05 PM

#

the leaderboard situation is crazy

arXiv.org

The Leaderboard Illusion

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted...

twin valve Apr 30, 2025, 5:10 PM

#

glacial glacier [the leaderboard situation is crazy](https://arxiv.org/abs/2504.20879)

see here #arena-feedback message

hollow patrol May 1, 2025, 8:54 AM

#

twin valve wouldn't that be "prompt to leaderboard" ?

But that doesnt show the leaderboard no?

hallow comet May 1, 2025, 10:37 AM

#

Is there a way to find old leaderboards? For example the leaderboard on the 14th of April

twin valve May 1, 2025, 11:39 AM

#

hollow patrol But that doesnt show the leaderboard no?

it does for every type of prompt

hollow patrol May 1, 2025, 3:13 PM

#

twin valve it does for every type of prompt

Is it legit? Looks cool

twin valve May 1, 2025, 4:33 PM

#

well when they tested it, it was legit. For me the p2l model (which simply picked the model it thought would best answer your question) never lost in my cases.
The only drawback is that they have to compute the p2l every now and then. Last time was in January I think.

median briar May 1, 2025, 4:53 PM

#

hallow comet Is there a way to find old leaderboards? For example the leaderboard on the 14th...

https://github.com/nakasyou/lmarena-history maybe this will help

GitHub

GitHub - nakasyou/lmarena-history: The history of scores on lmarena...

The history of scores on lmarena.ai saved as JSON. - nakasyou/lmarena-history
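
If an archive like this is available as JSON, finding "the leaderboard on the 14th of April" comes down to picking the latest snapshot on or before that date. A minimal sketch follows, assuming a hypothetical layout of ISO date string mapped to `{model: score}`; the real repository's schema may differ, and the scores are made up.

```python
from datetime import date

# Hypothetical snapshot layout: ISO date string -> {model name: score}.
HISTORY = {
    "2025-04-09": {"model-a": 1380.0, "model-b": 1350.0},
    "2025-04-16": {"model-a": 1385.0, "model-b": 1362.0},
}

def snapshot_on(history, day):
    """Return (date_string, scores) for the latest snapshot on or before `day`.

    Returns None when no snapshot is that old.
    """
    candidates = [d for d in history if date.fromisoformat(d) <= day]
    if not candidates:
        return None
    latest = max(candidates, key=date.fromisoformat)
    return latest, history[latest]
```

With the sample data, asking for `date(2025, 4, 14)` returns the `2025-04-09` snapshot, since the leaderboard is not updated every day.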

twin valve May 1, 2025, 5:17 PM

#

was nvidia nemotron 253B tested via lmarena? I cannot remember its score

glacial glacier May 1, 2025, 11:14 PM

#

twin valve was nvidia nemotron 253B tested via lmarena? I cannot remember its score

i actually recently reran the lmb data fetcher on historical data, seems it was removed but when it was tested it got around 1221 elo

hollow patrol May 2, 2025, 6:54 AM

#

twin valve well when they tested it, it was legit. For me the p2l model (simply picked the ...

Ah, so if its very outdated then its not that useful no? What is the reason it gets updated with such infrequency?

twin valve May 2, 2025, 11:01 AM

#

glacial glacier i actually recently reran the lmb data fetcher on historical data, seems it was ...

ty! I wonder if the "v1" has any meaning. 1221 seems very low

twin valve May 2, 2025, 11:02 AM

#

hollow patrol Ah, so if its very outdated then its not that useful no? What is the reason it g...

lmarena people have only so much time to dedicate to many todos. No if it is outdated it is not that useful. In general one can argue this way: even if a solution already exists, you never know whether it will stop being available anytime soon. In the past there was the idea that "the internet never forgets", but time has shown that it does, or that some services become less user friendly.

#

so yeah if you want to go at it, go for it. My initial objection was only to understand if the prompt to leaderboard was the same thing you wanted to do

hollow patrol May 2, 2025, 12:38 PM

#

twin valve so yeah if you want to go at it, go for it. My initial objection was only to und...

Well not exactly, but I do like your suggestion very much (if it has some value for live recommendation)

hollow patrol May 2, 2025, 12:39 PM

#

twin valve lmarena people have only so much time to dedicate to many todos. No if it is out...

I get that, and its not a flack to the lmarena people. What I mean is that it sounds like keeping this updated is very automatable. But I guess my assumptions are wrong?

slate compass May 2, 2025, 4:07 PM

#

low patrol Hey Hi I am from BharatGen, we have built our own llm with indic nuance May I k...

wdym by built your own llm

twilit loom May 2, 2025, 4:55 PM

#

😂
Chatgpt with fancy layers

timber owl May 2, 2025, 7:13 PM

#

slate compass wdym by built your own llm

according to 5 minutes of research, bharatgen is a legitimate government funded indian startup

#

also according to 5 minutes of research, i cannot find any info on the llm itself, only the organization

twin valve May 2, 2025, 8:06 PM

#

hollow patrol I get that, and its not a flack to the lmarena people. What I mean is that it so...

the p2l is automated for sure, but I guess they cannot run all the scripts they want. Already checking the leaderboard - classifying the battles - is surely compute intensive.

#

Imagine every week they have X thousands new battles that have to be classified, rejected (if something is obvious) and then evaluated. In the past, if I am not wrong, they used a llama model for classification. I am not sure if it is still the same. Still it takes compute to go through all that, plus manual checks. I think the lmarena team is small.

slate compass May 2, 2025, 8:12 PM

#

timber owl also according to 5 minutes of research, i cannot find any info on the llm itsel...

now, what could "indic nuance" mean

#

context for anyone joining in: https://bharatgen.tech/

#

$30M (equivalent) in funding, not too shabby

timber owl May 2, 2025, 8:19 PM

#

slate compass now, what could "indic nuance" mean

indian subtext

slate compass May 2, 2025, 8:21 PM

#

oh

#

@low patrol do you have public model weights?

wind vale May 3, 2025, 12:16 PM

#

Hello guys , May I know When will the leaderboard be updated again?

woven hearth May 3, 2025, 5:28 PM

#

No you may not

#

they think everyone who asks that question is here for the money

timber owl May 3, 2025, 7:02 PM

#

that's because the ratio of people who ask that question that are here for the money vs that aren't is significant

twin wharf May 3, 2025, 7:39 PM

#

woven hearth they think everyone who asks that question is here for the money

HAHAH because we've seen the same pattern for months, its very obvious.

woven hearth May 3, 2025, 8:00 PM

#

Or...

#

He is just curious about where QWEN or other releases get placed on the leaderboard

#

The polymarket bet ends every end of the month...

#

Stop being toxic

delicate sable May 3, 2025, 8:06 PM

#

woven hearth He is just curious about where QWEN or other releases get placed on the leaderbo...

woven hearth May 3, 2025, 9:11 PM

#

ffs

#

you re right

timber owl May 3, 2025, 9:42 PM

#

woven hearth The polymarket bet ends every end of the month...

and starts again every new month

muted isle May 4, 2025, 2:40 PM

#

timber owl that's because the ratio of people who ask that question that are here for the m...

what money

timber owl May 4, 2025, 2:47 PM

#

muted isle what money

there is a popular bet on which company will have the #1 model on lmarena

pine wasp May 4, 2025, 8:13 PM

#

who's ready for grok 3.5 to break the leaderboard

pseudo rivet May 4, 2025, 8:33 PM

#

Grok 3.5 will be just below chatgpt 4o

pine wasp May 4, 2025, 8:45 PM

#

rightt

twilit echo May 6, 2025, 3:30 AM

#

someone compiled and calculated glicko ratings using 28 benchmark scores
https://nitter.net/scaling01/status/1919389344617414824

glacial glacier May 6, 2025, 3:47 AM

#

(where glicko is elo with uncertainty taken into account)
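
To make "elo with uncertainty" concrete, here is a sketch of the standard Glicko-1 update: each player carries a rating and a rating deviation (RD), and results against uncertain opponents count for less. This is the textbook formula, not necessarily what the linked post computed.

```python
import math

Q = math.log(10) / 400  # scaling constant from the Glicko-1 system

def g(rd):
    """Down-weight an opponent's result by how uncertain their rating is."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected(r, r_opp, rd_opp):
    """Expected score; reduces to Elo's formula when rd_opp is 0."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))

def glicko_update(r, rd, results):
    """results: list of (opponent_rating, opponent_rd, score), score in {0, 0.5, 1}.

    Returns the new (rating, rating deviation) after one rating period.
    """
    inv_d2 = Q**2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1 / rd**2 + inv_d2
    delta = sum(g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results)
    return r + Q / denom * delta, math.sqrt(1 / denom)
```

Every game played shrinks the RD, so a heavily tested model ends up with a narrow interval, while a new one can move a long way on few results.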

hollow patrol May 6, 2025, 2:14 PM

#

What is the reason that a model like gemma 3-4b can be one place below claude 3.5 sonnet?

I have used a lot of small models, and I have never seen any that can consistently produce quality answers

twilit echo May 6, 2025, 2:25 PM

#

hollow patrol What is the reason that a model like gemma 3-4b can be one place below claude 3....

because style is king in lmarena, and claude family generates no-fluff answers by default, whereas other models do not. This heavily influences match outcomes. as you can see, with style control the gap grows from 1 rank to 40 rank difference. The default ranking is kinda meaningless, because of this.

#

even so, 40 ranks is still very little considering the models are worlds apart in capability (obviously, since it's 4B..) for reference here is my own comparison (no censor):

hollow patrol May 6, 2025, 2:48 PM

#

twilit echo because style is king in lmarena, and claude family generates no-fluff answers b...

Where can I read up on how this style control measurement is used?

twilit echo May 6, 2025, 2:54 PM

#

hollow patrol Where can I read up on how this style control measurement is used?

https://blog.lmarena.ai/blog/2024/style-control/
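
The rough idea behind style control, as far as it can be summarized here, is to add style features (such as response length or markdown usage) as extra terms in the Bradley-Terry style regression, so that the model strengths are estimated with style held fixed. The sketch below only shows the prediction side of such a model. The feature and the coefficient value are made up for illustration and are not taken from the blog post.

```python
import math

def win_probability(strength_a, strength_b, style_diff, style_coefs):
    """P(A beats B) under a Bradley-Terry model with additive style terms.

    style_diff: per-feature difference (A minus B), e.g. normalized length.
    style_coefs: fitted weight per feature (hypothetical values here).
    """
    logit = strength_a - strength_b
    logit += sum(c * d for c, d in zip(style_coefs, style_diff))
    return 1 / (1 + math.exp(-logit))
```

With equal strengths and no style difference the probability is 0.5. If one answer is longer and length carries a positive weight, the longer answer is favored even though the strengths are identical, which is the effect that style control tries to separate out.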

twin valve May 6, 2025, 6:28 PM

#

twilit echo someone compiled and calculated glicko ratings using 28 benchmark scores https:...

I think combining multiple (relatively notable) benchmarks is a good idea.

twin valve May 8, 2025, 11:23 AM

#

@glacial glacier would it be possible for you to add the history of the score of a model? I know it is asking for quite some work and I know that models with small CI change their values slowly, but still it would be interesting to know.
Another one would be to check the changes in the amount of votes (if those are reported in the stats distributed via github). One could see if a model, despite being not yet deprecated, practically stops getting votes (without votes: no further rating or CI change).

drowsy needle May 8, 2025, 1:02 PM

#

twin valve @glacial glacier would it be possible for you to add the history of the sco...

Would you mind filling out the feedback form? That’d be helpful, let me know if you need that link

median briar May 8, 2025, 6:55 PM

#

New Gemini 2 Flash image generation is legit more expensive than Imagen 3 002 🤡

twin valve May 8, 2025, 9:01 PM

#

drowsy needle Would you mind filling out the feedback form? That’d be helpful, let me know if ...

I sent a lot of feedback but my request was for this https://ktibow.github.io/lmb/ leaderboard (made by @glacial glacier ) that is not bad at all.
I didn't ask the lmarena team to do the same improvements because I think they have too much on their plate already.

#

if needed I can request the same for the official leaderboard but I won't expect changes because the team is surely super busy

drowsy needle May 8, 2025, 10:40 PM

#

twin valve I sent a lot of feedback but my request was for this https://ktibow.github.io/lm...

understood, thanks for explaining 👍

glacial glacier May 8, 2025, 10:51 PM

#

twin valve @glacial glacier would it be possible for you to add the history of the sco...

interesting... are both of these to accomplish the same goal (understanding when a model stops changing) or no?

twin valve May 9, 2025, 1:07 AM

#

glacial glacier interesting... are both of these to accomplish the same goal (understanding when...

no, the model should stay (allegedly) the same. Rather the rating should change (if at all) because other models change or the judges change (different questions, different judgments).

The score history is what is most interesting. The vote history (or CI history, though the latter is a bit more cryptic) is interesting to see because it tells whether the score is really stable despite further battles, or whether it is stable because almost no battles happen.

For example I am pretty sure many models, though listed, are receiving very few battles.
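
The vote-history idea could be checked with something like the sketch below: compare vote counts between an old and a recent snapshot and flag the models that gained essentially no votes. The snapshot layout (`{model: (rating, votes)}`) and the numbers are assumptions made for illustration.

```python
def stalled_models(snapshots, min_new_votes=1):
    """snapshots: list of {model: (rating, votes)} dicts, oldest first.

    Returns models present in both the first and last snapshot that gained
    fewer than `min_new_votes` votes in between.
    """
    first, last = snapshots[0], snapshots[-1]
    stalled = []
    for model, (_, votes_then) in first.items():
        if model in last and last[model][1] - votes_then < min_new_votes:
            stalled.append(model)
    return sorted(stalled)
```

A model flagged here has a stable score only because it is no longer being sampled, which is different from a score that stays put while new battles keep coming in.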

glacial glacier May 9, 2025, 1:08 AM

#

twin valve no, the model should stay (allegedly) the same. Rather the rating should change ...

sorry i meant "understanding when a model stops getting battles / moving around on the leaderboard"

timid wind May 9, 2025, 6:40 PM

#

is it OK that in 50%+ of cases I judge an LLM by the first 100-200 characters of its response and don't read the rest when the start looks bad to me?

timber owl May 9, 2025, 9:21 PM

#

no

twin valve May 9, 2025, 10:15 PM

#

glacial glacier sorry i meant "understanding when a model stops getting battles / moving around ...

🤔 the number of votes achieves what you say (stops getting battles). The rating history shows how the "playerbase" changes. Even with the BT model over Elo (or any Elo variant), over time a model can be "farmed" so to speak, provided that it gets battles. Or a model can still hold up pretty well (claude 3.5 from Oct 24). Sure, the rating can move slowly (within the expected CI), but it can move nonetheless as the playerbase changes and votes come in, because the people voting change their judgment and the questions change.

glacial glacier May 9, 2025, 10:48 PM

#

twin valve 🤔 the number of votes achieves what you say (stops getting battles). The rating h...

ah

well voting history would require a good bit of effort and i don't know if it would quite work as intended if there's general elo drift

twin valve May 9, 2025, 11:36 PM

#

yeah no worries, I throw ideas because I like your leaderboard a lot, but you don't have to implement them

echo lake May 12, 2025, 10:19 AM

#

Is there any information regarding the Qwen3 32B model? I think it's more commonly used than those MoE models, since MoE = Memory Overconsumption Ensemble in practice.

dreamy bluff May 13, 2025, 2:34 AM

#

how could we add new models to the LMArena leaderboard?

glacial glacier May 13, 2025, 3:54 AM

#

what is this leaderboard?!

#

where's gpt-4o-2024-11-20?
why is gemini flash down there?
and also having a wide CI is giving it an advantage here
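
One way a wide CI can help: if the ranking in question uses an upper-bound rule, where a model's rank is one plus the number of models whose lower bound sits above its upper bound, then a wide interval makes it harder for anyone to be "statistically better". Whether the screenshot used this rule is an assumption, and the intervals below are made up.

```python
def rank_ub(models):
    """models: {name: (ci_lower, ci_upper)}.

    Rank = 1 + number of other models whose lower bound exceeds
    this model's upper bound.
    """
    ranks = {}
    for name, (_, upper) in models.items():
        better = sum(
            1
            for other, (lower, _) in models.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + better
    return ranks

# Hypothetical intervals: "wide" has the same midpoint as "narrow".
EXAMPLE = {"narrow": (1290, 1300), "wide": (1270, 1320), "top": (1305, 1315)}
```

Here `narrow` and `wide` share a midpoint of 1295, yet `wide` ties for rank 1 because no lower bound clears 1320, while `narrow` drops to rank 2.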

twin valve May 14, 2025, 4:02 PM

#

glacial glacier where's gpt-4o-2024-11-20? why is gemini flash down there? and also having a wid...

maybe deprecated and gemini flash getting more votes (that pulls it down?)

glad elk May 16, 2025, 8:52 AM

#

Hello everyone I'm new here; I'm wondering can I push my own fine-tuned model to the Chatbot Arena to let the users blindly test it with other models? Thanks!

drowsy needle May 16, 2025, 2:32 PM

#

glad elk Hello everyone I'm new here; I'm wondering can I push my own fine-tuned model to...

You’re unable to push your own model to the arena; however, tell us about it in #model-request

crude hawk May 19, 2025, 2:01 PM

#

glacial glacier what is this leaderboard?!

fwiw i gave it to o3 on chatgpt and i think this part of its response perhaps explains it?

#

but yeah the ranking in the ss you shared seems counterintuitive.. i assume there's a methodological explanation.. o3's seems convincing but eh i dunno

empty hill May 20, 2025, 8:42 AM

#

🤔

wispy sapphire May 21, 2025, 12:15 AM

#

Aloha hi, I am sorry if this is already documented somewhere; I am responsible at my company for compiling a list of models that our product can run, and I thought it would be nice if I could query the contents of the Leaderboard via an API, rather than picking through the list on a browser and copying and pasting.

Is the data that populates the current leaderboard available via an API, or an export of that data available as a flat-file in some manner?

peak yacht May 21, 2025, 7:41 AM

#

Hi 🫡

#

Hi everyone, I'm newcomer, how could we push our company's model to the Chatbot Arena? Thanks !

drowsy needle May 21, 2025, 1:26 PM

#

peak yacht Hi everyone, I'm newcomer, how could we push our company's model to the Chatbot ...

hey there :ablobwave: at the moment the best course is to put information about your model in the #model-request channel.

peak yacht May 21, 2025, 1:32 PM

#

drowsy needle hey there :ablobwave: at the moment the best course is to p...

Hi🤗 , I have put the model information in the model-request channel, could you please have a look and if there are more details needed?

valid sun May 21, 2025, 6:46 PM

#

how to join？

drowsy needle May 21, 2025, 6:48 PM

#

valid sun how to join？

are you asking how to have a model become a part of LMArena?

viscid lynx May 21, 2025, 6:58 PM

#

hi

unborn talon May 21, 2025, 7:01 PM

#

hello

crisp tartan May 22, 2025, 6:43 AM

#

Hello Guys, Found this website through YT video... Hope to learn something more in AI Universe 😉

wispy sapphire May 22, 2025, 5:27 PM

#

wispy sapphire Aloha hi, I am sorry if this is already documented somewhere; I am responsible a...

Ping on this? Given that MLperf releases their data under Tableau, I'm hoping this isn't an unreasonable ask. Right now, the site uses CloudFlare Captcha to prevent screen-scraping, which I respect. But it feels like making a limited set of data available would prevent people from either trying to circumvent captcha and/or copying and pasting a bunch of data by hand.

drowsy needle May 22, 2025, 5:34 PM

#

wispy sapphire Ping on this? Given that MLperf releases their data under Tableau, I'm hoping th...

ah sorry about missing your initial question! I'll get back to you shortly!

drowsy needle May 22, 2025, 5:57 PM

#

wispy sapphire Aloha hi, I am sorry if this is already documented somewhere; I am responsible a...

Apologies for the delayed response. Unfortunately, at the moment this isn't available.

wispy sapphire May 22, 2025, 6:01 PM

#

drowsy needle Apologies for the delayed response. Unfortunately, at the moment this isn't avai...

Would it be worthwhile for me to create a GitHub issue requesting?

drowsy needle May 22, 2025, 6:13 PM

#

wispy sapphire Would it be worthwhile for me to create a GitHub issue requesting?

I've passed this request to the team.

wispy sapphire May 22, 2025, 6:14 PM

#

drowsy needle I've passed this request to the team.

Cool thank you. I created a GH issue in the interim anyways, if it would make sense to track there: https://github.com/lm-sys/FastChat/issues/3730

GitHub

API or data blob access to leaderboard output · Issue #3730 · lm-...

I asked on the Discord, and I believe that this isn't available today. If I am wrong, happy if you can direct me to the source. At my company, I am responsible for compiling a list of (non-prop...

drowsy needle May 22, 2025, 6:18 PM

#

wispy sapphire Cool thank you. I created a GH issue in the interim anyways, if it would make se...

Thank you!

wispy sapphire May 23, 2025, 3:11 AM

#

drowsy needle Thank you!

One last thing, just to hopefully make this as innocently lightweight as possible, I added this to the issue description:

I think what I am asking is that the leaderboard-table-file (gradio_web_server_multi.py#L337) (docs) is copied to an S3 bucket, or even just served raw via the site's http server (sans captcha, please), so that we can look at the raw data.

drowsy needle May 23, 2025, 3:38 PM

#

wispy sapphire One last thing, just to hopefully make this as innocently lightweight as possibl...

The more detail/pinpoint request provided does help, thank you!

frozen knot May 24, 2025, 10:53 AM

#

maybe with the upcoming sentiment control claude 4 opus might actually be #1

#

until gemini deepthink atleast

tulip shadow May 24, 2025, 10:56 AM

#

frozen knot maybe with the upcoming sentiment control claude 4 opus might actually be #1

idk is it nice to talk to?

zealous sable May 24, 2025, 12:46 PM

#

When will the leaderboard update next? like with the scores of new claude 4 models?

mortal sedge May 24, 2025, 1:19 PM

#

Why isn't claude 4 into the leaderboard?

carmine spade May 24, 2025, 3:24 PM

#

Leaderboard is down

#

drowsy needle May 24, 2025, 4:04 PM

#

carmine spade

thanks for the flag, looking into

#

should be fixed now 👍

pale cloud May 25, 2025, 3:29 AM

#

Is there a channel to suggest improvements to the existing benchmarks?

drowsy needle May 25, 2025, 3:30 AM

#

pale cloud Is there a channel to suggest improvements to the existing benchmarks?

yeah absolutely! using #1372230675914031105 would be ideal 👍

pale cloud May 25, 2025, 3:30 AM

#

Thank you

zealous sable May 25, 2025, 11:08 AM

#

Hi, when do the leaderboards update usually?

drowsy needle May 25, 2025, 3:07 PM

#

zealous sable Hi, when do the leaderboards update usually?

it's not on a set schedule and we don't rly disclose when it will be; however, the leaderboards will say the last time they have been updated.

wicked plume May 26, 2025, 8:39 AM

#

drowsy needle it's not on a set schedule and we don't rly disclose when it will be; however, t...

Gm ser
I sent you a proposal in DM, kindly check.
Would have loved to open a support ticket but couldn’t find any.

sinful falcon May 26, 2025, 8:49 AM

#

zealous sable When will the leaderboard update next? like with the scores of new claude 4 mode...

yes i am also curious. i am also wondering if someone could give a probability of it surpassing 2.5 pro by the end of may

#

lol

zealous sable May 26, 2025, 8:49 AM

#

sinful falcon yes i am also curious. i am also wondering if someone could give a probability o...

fellow polymarket trader 💀

willow holly May 26, 2025, 2:17 PM

#

In my own LMArena stats Sonnet 4 does worse than 3.7.
Curious where it ends up on the public leaderboard.

It is really just optimised for the agentic coding workflow huh?

Opus does well for me tho. I like how it is efficient with tokens.

feral shale May 27, 2025, 12:18 AM

#

sinful falcon yes i am also curious. i am also wondering if someone could give a probability o...

Very unlikely because it's only arguably stronger in coding and weaker in everything else

queen jewel May 27, 2025, 8:12 AM

#

I hope the old website can last longer, or at least remain accessible as a legacy URL after the update, as tools like p2l and arena explorer on it have not been fully replicated on the beta website yet

pale hare May 27, 2025, 9:14 AM

#

when is the leaderboard updating to show sonnet 4?

twin valve May 27, 2025, 8:39 PM

#

in randint(1,5) days

twin valve May 27, 2025, 8:40 PM

#

willow holly In my own LMArena stats Sonnet 4 does worse than 3.7. Curious where it ends up o...

I am keeping track of the results to my questions since early May. Sonnet, no matter the questions, loses every time as soon as I ask something that is not about coding or computer configuration. Now I see why it is that low in the rankings.

indigo cliff May 28, 2025, 9:43 AM

#

What is the "Arena Overview" leaderboard? Is it the same as the text leaderboard (with style control)?

willow holly May 28, 2025, 10:01 AM

#

@indigo cliff Yes. If you mean here and the default is now Style controlled.

indigo cliff May 28, 2025, 10:10 AM

#

I mean the one at the bottom of the page. Is it influenced by webdev, vision, search, copilot, and text-to-image? or is it just the text? By looking at the page one would think its a combination of all the leaderboards above, but that doesn't seem to be the case.

willow holly May 28, 2025, 10:17 AM

#

@indigo cliff yes it is also just the Text, just split in the different categories.

It is just the overview you also have in old one but worse because you can't switch to the ELO points. The new design looks better but hopefully all the functionality will also come back.

indigo cliff May 28, 2025, 10:23 AM

#

I see, thanks

zealous sable May 28, 2025, 4:26 PM

#

Hi just saw the X post for the Claude 4 update but the site doesnt show it?

drowsy needle May 28, 2025, 4:28 PM

#

zealous sable Hi just saw the X post for the Claude 4 update but the site doesnt show it?

currently looking into, ty for the flag!

zealous sable May 28, 2025, 4:33 PM

#

drowsy needle currently looking into, ty for the flag!

also are we only getting WebDev results and not for any other?