#arena-feedback | Arena | Page 1

alpine marlin Mar 5, 2025, 2:57 PM

#

https://web.lmarena.ai/

Send message doesn't work.

upbeat hollow Mar 5, 2025, 4:16 PM

#

Works for me? Albeit it's a bit slow (3-4 second delay), it still loads

compact dirge Mar 5, 2025, 5:02 PM

#

Can we get a deepseek r1 distill model in the arena

#

Or maybe even a quantized model (e.g. r1 Q8_0)? Would be interesting to see the effect on accuracy

wild quail Mar 5, 2025, 8:04 PM

#

compact dirge Or maybe even a quantized model (e.g. r1 Q8_0)? Would be interesting to see the ...

there are already lots of benchmarks and analysis done on quantizations. i think it would overflow the leaderboard with too much information. the difference wouldnt be that noticeable i think

#

but maybe some small tests, that get published independently from the leaderboard could be interesting

compact dirge Mar 5, 2025, 8:24 PM

#

I know but benchmarks and lmsys rating rarely paint the same picture

#

take claude 3.7 e.g.

#

crushes benchmarks, #1 on livebench, 70% swe

#

But dogshit rating

hardy halo Mar 7, 2025, 4:04 AM

#

We should be able to stop the output and vote when it's obvious which one we're going to choose.

wild quail Mar 7, 2025, 10:10 AM

#

hardy halo We should be able to stop the output and vote when it's obvious which one we're ...

this would be like rating a movie without watching it till the end

pure compass Mar 7, 2025, 10:44 AM

#

A few times I had a model repeating the same sentence forever. I don't know which model. I tried disconnecting the Internet, waiting for it to error out, reconnecting, and then trying to vote, but it did not work, so for that case a stop button would be great.

compact dirge Mar 7, 2025, 3:53 PM

#

maybe not in arena, but a stop button would come in clutch for direct chat or side by side

low copper Mar 7, 2025, 4:43 PM

#

There should be a timer indicating how long each answer took

hardy halo Mar 7, 2025, 5:40 PM

#

wild quail this would be like rating a movie without watching it till the end

Exactly. Sometimes the movie is so bad you walk out of the theater.

#

If one model is writing a long good answer while the other has already output a short refusal, I can stop the generation and choose the real answer as the better one.

#

Saving me time and saving the provider time and money on generation

#

Somewhat contrarily, I also think we should be able to vote on random queries and responses that other people submitted, since they're all going into the database anyway. Let multiple people vote on which response is better for a given conversation, and get a lot more battle data without spending any energy on generating new outputs or waiting for them to be generated.

wild quail Mar 7, 2025, 5:53 PM

#

hardy halo Somewhat contrarily, I also think we should be able to vote on random queries an...

yes this is a fair point. also a good idea. because people would tend to vote differently on these outputs maybe

low copper Mar 7, 2025, 6:26 PM

#

hardy halo Somewhat contrarily, I also think we should be able to vote on random queries an...

I agree 100%

#

They did this with https://open-assistant.io/ back when the project was alive

#

It was really cool

#

It was a shame when they ended their project.

#

@agile flume would lmarena ever consider this?

#

It had a nice UI with plugins https://www.reddit.com/r/OpenAssistant/comments/13seg3h/open_assistant_can_use_plugins_cool/

#

I can't find any photos of it but they had a feature where you could see public generations by category and then have to select better responses. It also let you submit your own better responses and even rate things like output quality, creativity, and potential harm.

#

https://huggingface.co/OpenAssistant

Datasets set out the labels like this: { "name": [ "spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech", "sexual_content", "quality", "toxicity", "humor", "creativity", "violence" ], "value": [ 0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0 ], "count": [ 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3 ] }
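The three lists in that label format are parallel, so they can be zipped into one record per label. A minimal Python sketch using the example above; the helper name `labels_to_dict` is made up for illustration and is not part of any OpenAssistant tooling:

```python
import json

# Example label record copied from the message above.
RAW = '''{"name": ["spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech",
"sexual_content", "quality", "toxicity", "humor", "creativity", "violence"],
"value": [0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0],
"count": [4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3]}'''


def labels_to_dict(labels):
    """Zip the parallel name/value/count lists into {name: (value, count)}."""
    names, values, counts = labels["name"], labels["value"], labels["count"]
    if not (len(names) == len(values) == len(counts)):
        raise ValueError("name, value and count must have the same length")
    return {n: (v, c) for n, v, c in zip(names, values, counts)}


labels = labels_to_dict(json.loads(RAW))
```

Here `labels["quality"]` holds the averaged quality score together with the number of raters who labelled it.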

slim herald Mar 8, 2025, 2:06 AM

#

alpha.lmarena.ai

password: super-alpha

cinder phoenix Mar 8, 2025, 4:16 AM

#

Found a minor bug I couldn't screenshot
I got a CloudFlare captcha overlayed to the UI right over there on the top left

strong slate Mar 8, 2025, 4:18 AM

#

cinder phoenix Found a minor bug I couldn't screenshot I got a CloudFlare captcha overlayed to ...

For the new UI - I'd def submit it to the bug report form to get prioritized:

PLEASE Give us feedback here: https://forms.gle/8cngRN1Jw4AmCHDn7
and 🪲 report bugs here: https://airtable.com/appK9qvchEdD9OPC7/pagxcQmbyJgyNgzPx/form

short scarab Mar 8, 2025, 4:39 AM

#

What’s the password?

#

Oh

short scarab Mar 8, 2025, 4:41 AM

#

strong slate For the new UI - I'd def submit it to the bug report form to get prioritized: ...

I’d rather give feedback over here or GitHub

#

Google’s forcing me to change my password to use the form

#

But the Cloudflare captcha thing is hogging up space

short scarab Mar 8, 2025, 4:48 AM

#

strong slate For the new UI - I'd def submit it to the bug report form to get prioritized: ...

Hello, would it be possible to add text attachments that attach: txt, csv, tsv, xml, html, css, js, py, c, cpp, etc. text-based files, by having them appended to the prompt, possibly bypassing the character limit?
Example

Hello LLM!
———User Attached file: Hello.txt (TIMESTAMP)——
Hello World!
—End-Of-File—
Similar to repochat

Additionally, support for Excel documents, Word documents, and SQLite would be helpful, as would code folder uploads like Gemini has.
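The attachment format from the example could be sketched like this in Python. The function name, the timestamp format, and the ASCII dashes in place of the long dashes are all assumptions for illustration; nothing here reflects how LMArena or RepoChat actually handle files.

```python
from datetime import datetime, timezone


def append_text_attachment(prompt, filename, content, timestamp=None):
    """Append a text file to the prompt between header and footer markers."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    stamp = timestamp.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"{prompt}\n"
        f"---User Attached file: {filename} ({stamp})---\n"
        f"{content}\n"
        f"---End-Of-File---"
    )


# Hypothetical usage with a fixed timestamp so the output is reproducible.
example = append_text_attachment(
    "Hello LLM!", "Hello.txt", "Hello World!", datetime(2025, 3, 8, 4, 48, 0)
)
```

Whether appended file text should count against the character limit is a policy choice this sketch does not address.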

strong slate Mar 8, 2025, 4:51 AM

#

short scarab Hello, would it be possible to add text attachments that attach: txt, csv, tsv, ...

ill note this - and we can post new feedback for the new UI in this new channel moving forward: #new-ui-feedback

short scarab Mar 8, 2025, 4:55 AM

#

strong slate ill note this - and we can post new feedback for the new UI in this new channel ...

Have a question. What happens to the conversations that never get voted on, like if someone goes to another page without voting, or just doesn’t vote and clicks new round, or loses their internet connection, or loses their chat to “Connection errored out”?

#

Those conversations should be LLM judged, as it’s basically a waste of resources especially for stuff like GPT 4.5 being tested on the site and possibly the full o3 model in the future

patent fjord Mar 8, 2025, 4:57 AM

#

short scarab Those conversations should be LLM judged, as it’s basically a waste of resources...

they are useless then but they used rate limits per instance i think

#

the site (maybe still does) used to do cloudflare/ddos protection

short scarab Mar 8, 2025, 5:00 AM

#

patent fjord they are useless then but they used rate limits per instance i think

Without a vote, they are useful as “real world usage data” even if LLM judges aren’t present

patent fjord Mar 8, 2025, 5:01 AM

#

yea they get fed as data too like that interactive viewer thing

short scarab Mar 8, 2025, 5:01 AM

#

patent fjord yea they get fed as data too like that interactive viewer thing

You're a staff member of LMArena?

#

Seems legit actually

split prism Mar 8, 2025, 11:29 AM

#

patent fjord + the site (maybe still does) used to do cloudflare/ddos protection

They still do. I found a funny thing: some questions, usually containing SQL statements or Linux commands, are "forbidden" in a way that consistently triggers errors; they cannot be asked, or your conversation gets cooked. After some exploration, it looks like the reason is Cloudflare, which blocks those requests due to some random "protections" and consistently returns 403 for those suspicious types of requests. It's likely not the inherent protections of LMArena itself, since those usually give you something like "Content violates moderation..."

#

Thinking more about it now, couldn't it affect the bias of the arena results? Since some types of questions (all of them harmless ones) are blocked by random Cloudflare triggers, doesn't it slightly reduce the set of answers provided by the users, thus reducing the uniformity of the arena's scores, in a way? Models which could answer such questions properly would likely get a slightly higher rating, others a slightly lower.

warm sequoia Mar 8, 2025, 2:26 PM

#

the gemini test 30 model was by far the best model i have used in the site. sad that they removed it early and didnt get to try it further.

#

😔

split prism Mar 8, 2025, 2:36 PM

#

warm sequoia the gemini test 30 model was by far the best model i have used in the site. sad ...

It might mean they will release it soon, or come back with an enhanced version

low copper Mar 8, 2025, 5:33 PM

#

cinder phoenix Found a minor bug I couldn't screenshot I got a CloudFlare captcha overlayed to ...

rose robin Mar 8, 2025, 9:32 PM

#

warm sequoia the gemini test 30 model was by far the best model i have used in the site. sad ...

Really ?? I am afraid now 🙂 google removed gemini test and put gemini 2.0 thinking exp on the free version of the gemini app ...hmmmm we will have new Geminis sooner 😄🤝👀👀👀🤌

rose robin Mar 8, 2025, 9:49 PM

#

hardy halo We should be able to stop the output and vote when it's obvious which one we're ...

My biggest problem is when a model suddenly hallucinates and keeps repeating the same words or sentences for the 162728388th time, like lucid or llama3.2. I wish they would add a stop button to stop them, at least when they hallucinate 🥲🫤 because most of the time I quit the page and don't vote, so as not to lose time waiting for a model to stop repeating the same words.

warm sequoia Mar 9, 2025, 6:34 AM

#

rose robin Really ?? I am afraid now 🙂 google removed gemini test and put gemini 2.0 think...

didnt know the app also had it. i thought it was lmarena exclusive.

delicate drift Mar 9, 2025, 11:19 AM

#

So i like the webdev arena but what about a direct chat on there with support for like webcontainers or so

hardy halo Mar 10, 2025, 2:33 AM

#

short scarab Those conversations should be LLM judged, as it’s basically a waste of resources...

Or allow other people to vote on them as I said earlier

rose robin Mar 10, 2025, 7:01 PM

#

Why not rate the answers of each model after voting? Sometimes I feel that the votes don't reflect what I think about each answer. For example, sometimes I get 2 models and one of them is really bad. The other one is bad but a little bit better. I don't know if I should vote "both are bad" or "2 is better". It's better, OK, but bad too. 😂

rose robin Mar 10, 2025, 7:30 PM

#

2 models a is 8/10 b is 7/10 . A is winner but b is good too.
A 3/10 b 1/10 . A is winner but both are bad.
Saying A is better doesn't mean that B is bad or that both are good. It is just better, but you don't know if they are bad, medium, good or excellent. We should be able to give an exact opinion that really reflects the model, not just "this one is better".

wide edge Mar 10, 2025, 8:04 PM

#

rose robin 2 models a is 8/10 b is 7/10 . A is winner but b is good too. A 3/10 b 1/10 . A...

The way ranking works, you don't need an exact opinion

#

Chess ranking works with matches that are win, lose, or tie without "they both played poorly"; same goes here
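To illustrate the chess comparison: the classic Elo update only needs a win, loss, or tie per match, with no absolute quality score. A minimal Python sketch, assuming a K-factor of 32 and starting ratings of 1000 (both arbitrary choices here); the actual leaderboard may compute its scores differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b).

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two models start level; A wins one battle.
a, b = elo_update(1000.0, 1000.0, 1.0)
```

A vote of "A is better" moves the ratings by the same amount whether both answers were excellent or both were poor, which is the limitation raised in the messages above.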

hardy halo Mar 11, 2025, 2:52 PM

#

Ratings give more information than just win vs lose though.

hardy halo Mar 11, 2025, 4:08 PM

#

I really wish there were some distilled/quantized models in the competition just to see how models we could run on our own machines compare against real API models. Could choose some from https://oobabooga.github.io/benchmark.html which lists the best models for a given hardware requirement.

compact dirge Mar 11, 2025, 4:46 PM

#

Yes can we get an r1 distill in the arena please 🥺🥺

hardy halo Mar 11, 2025, 7:37 PM

#

Even with just 1, we could at least anchor Elo scores vs other benchmarks.

pure compass Mar 12, 2025, 12:33 PM

#

Will you enable vision capability for Gemma 3?

visual warren Mar 14, 2025, 5:02 AM

#

Can we get gemma 12b? 27b was really impressive, really wanna see what 12b gets.

rose robin Mar 14, 2025, 3:36 PM

#

Why not showing the thinking process of the thinking models ? This will be interesting ...
Also, some models like Gemini are able to include pictures while explaining things, but on the arena we won't see that, and this will not show the real ability of the model.

compact dirge Mar 14, 2025, 3:53 PM

#

Leaderboard updates at weekly intervals 😁

wide edge Mar 14, 2025, 5:07 PM

#

rose robin Why not showing the thinking process of the thinking models ? This will be inter...

The Arena is blind

#

It wouldn't make sense to show things that can distinguish models

manic pollen Mar 14, 2025, 7:46 PM

#

compact dirge Leaderboard updates at weekly intervals 😁

is that darrell

rose robin Mar 14, 2025, 7:59 PM

#

wide edge The Arena is blind

You can distinguish them because they take time to answer anyway 😂 and ok, why not on side by side and direct chat?

pure compass Mar 14, 2025, 8:01 PM

#

Or show them after the vote

dire halo Mar 15, 2025, 12:51 PM

#

Today for the first time I made a prompt that was censored by the lmarena moderation system. Went on Grok first to test it, it was okay with it (ofc lol). Went on ChatGPT 4.5, worked too. Went on Gemini, worked too.
It seems that the censoring on lmarena is a bit too strong and not relevant if most big models are willing to handle it. And it also distorts the ranking, because if you can't test very dark humor via lmarena, it's one less criterion for judging the quality of the models, and one bias that might favor one model over the others.
It's a pity because instead of censoring the prompt, you could simply let it pass and detect when a model says something like “sorry but I can't answer that question” and cancel the result that will be given at the end.
Or simply ban an IP if it happens too much and remove all the prompts made by this IP from lmarena "open-source results".
I imagine that the idea is to avoid ending up with illegal content in the results that are available to researchers or other people. But if you can detect that a prompt might be censurable, you can also censor a prompt in the results or tag it NSFW.

pure compass Mar 15, 2025, 3:31 PM

#

Yes the censorship is really too heavy at times. Not only for text but also for images, and it really seems to hate Charizard for some reason.

true epoch Mar 16, 2025, 11:00 AM

#

dire halo Today for the first time I made a prompt that was censored by the lmarena modera...

"illegal, harmful, violent, racist, or sexual purposes." what does sexual purposes mean? Does asking questions about sexual health also count as sexual purposes?

eager nexus Mar 17, 2025, 7:56 PM

#

Companies will find that any attempt to censor most models will result in consumers always choosing competitive uncensored models. Time and research show that people do not want AI to tell them how to think, or what moral standing they should have.

#

Do you want the cheese grater to tell you how to prepare food?

#

No

#

You don't

#

What's the point of restricting AI when you cannot restrict human intelligence enough to ask the question

low copper Mar 17, 2025, 8:04 PM

#

eager nexus What's the point of restricting AI when you cannot restrict human intelligence e...

Ultimately, they can't do anything about it. They must comply with the terms of service of the AI's on the leaderboard.

eager nexus Mar 17, 2025, 8:04 PM

#

Or they simply use another product

low copper Mar 17, 2025, 8:05 PM

#

My guy, it's a leaderboard site which lets you test the top llms.

wide edge Mar 17, 2025, 8:05 PM

#

eager nexus Companies will find that any attempt to censor most models will result in consum...

And yet Claude brings in millions and billions

eager nexus Mar 17, 2025, 8:05 PM

#

Most of that money is from corporations, individuals want freedom

#

It's two markets

low copper Mar 17, 2025, 8:06 PM

#

wide edge And yet Claude brings in millions and billions

I agree that censorship for text based models is silly most of the time. However I would also agree with you that people care more about what the model can do and generally don't care too much about prompts being censored so long as the AI provides a sufficient enough answer to most of their queries.

low copper Mar 17, 2025, 8:06 PM

#

eager nexus Most of that money is from corporations, individuals want freedom

Start your own AI company bro.

eager nexus Mar 17, 2025, 8:06 PM

#

Nah, not worth it today

#

Space is overcrowded

low copper Mar 17, 2025, 8:07 PM

#

Deepseek could have said the same thing

eager nexus Mar 17, 2025, 8:07 PM

#

Bitcoin was much better return

low copper Mar 17, 2025, 8:07 PM

#

Bitcoin is a meme coin tbh

eager nexus Mar 17, 2025, 8:07 PM

#

For the folks that bought in at 20$ or so, it returned ten thousand percent

#

On an easy day

low copper Mar 17, 2025, 8:08 PM

#

eager nexus For the folks that bought in at 20$ or so, it returned ten thousand percent

Let's talk about it in dms

#

It's off topic

ashen frigate Mar 18, 2025, 3:06 PM

#

Is there a way I can save the chats of lmarena and continue them later on? It just keeps refreshing after some time of use and shows an error, and I have to refresh the website again, starting a new chat and selecting the model again.

hushed tree Mar 18, 2025, 5:42 PM

#

It would be great if you could add another arena category - namely MTL, as in translating from one language into another. A lot of people have a need for MTL in their life but there is currently no leaderboard ranking what models are best for translation purposes. And I realize that this poses a problem for testing, as a model might excel at translating English to Japanese but suck at translating English to French... and while it might be best to have a separate leaderboard for each pair of languages, it can be cut down to only be between English + another language. Then it can be further cut down to only include the major languages such as English, Japanese, Chinese, French, German, Spanish - basically languages you already have in the arena.
Anyway, sorry for the long message, I just wanted to share that as a person who is using MTL every day, I am really missing a MTL leaderboard in my life.

nocturne geode Mar 18, 2025, 6:57 PM

#

Hi! Which is the best way to use DeepSeek & Claude models? I mean in terms of efficiency, speed, etc., in case there is any difference. Is it better to use their direct API, or is it better to use them through Cline, Roo, OpenRouter, etc.? Thanks! (Cline uses their own API too, but I mean when that is not the case)

rose robin Mar 18, 2025, 7:47 PM

#

I wish you could include references to images.

limber scaffold Mar 19, 2025, 5:17 AM

#

It would be really amazing if we had some way of saving the chats because when the site refreshes you just instantly lose all of your chat which is quite cruel. Thanks.

soft sigil Mar 19, 2025, 7:07 AM

#

Password to the https://alpha.lmarena.ai has been changed. Old password doesn't work anymore. 😕

pure compass Mar 19, 2025, 11:05 AM

#

limber scaffold It would be really amazing if we had some way of saving the chats because when t...

The new alpha version does save it

gaunt warren Mar 19, 2025, 11:35 AM

#

I know the one exact word that always triggers the censorship system.

#

||moaned||

shrewd shuttle Mar 19, 2025, 12:58 PM

#

soft sigil Password to the https://alpha.lmarena.ai has been changed. Old password doesn't ...

there's a new one in the #announcements message (though not sure if you meant that that one no longer works - seems to work for me fwiw)

shrewd shuttle Mar 19, 2025, 12:59 PM

#

gaunt warren I know the one exact word that always triggers the censorship system.

seems like the word alone won't trigger it. nor with "she" added before

#

but.. moaned loudly

gaunt warren Mar 19, 2025, 1:00 PM

#

shrewd shuttle seems like the word alone won't trigger it. nor with "she" added before

Strange, because it works for me just with moaned and nothing else, lol.

shrewd shuttle Mar 19, 2025, 1:01 PM

#

yeah i think it's handled (pretty crudely) by a small LLM

#

it like screens each prompt

#

so not like a blacklist of words or purely deterministic, more a set of guidelines i imagine

gaunt warren Mar 19, 2025, 1:03 PM

#

shrewd shuttle so not like a blacklist of words or purely deterministic, more a set of guidelin...

Well, maybe.

#

I wonder if the "rules" will change upon me changing my geolocation, lol.

shrewd shuttle Mar 19, 2025, 1:05 PM

#

nah tbh i think it just reflects the fact it's a small LLM. even if the temp is set to zero, it's still not deterministic - it'll judge the same input two different ways with the same rules

visual warren Mar 19, 2025, 1:05 PM

#

afaik they use openai moderation api

#

i dont think its a small llm, at least when i last checked it

shrewd shuttle Mar 19, 2025, 1:06 PM

#

visual warren afaik they use openai moderation api

oh didn't realise that

visual warren Mar 19, 2025, 1:06 PM

#

also theres another layer by cloudflare that blocks linux related terms 🤣

visual warren Mar 19, 2025, 1:07 PM

#

visual warren i dont think its a small llm, at least when i last checked it

i think its just a regular classifier its been a while tho

gaunt warren Mar 19, 2025, 1:07 PM

#

shrewd shuttle nah tbh i think it just reflects the fact it's a small LLM. even if the temp is ...

I tried entering moaned many times on many different days, and it always bans it.

#

There wasn't a single day it wouldn't.

shrewd shuttle Mar 19, 2025, 1:08 PM

#

visual warren i think its just a regular classifier its been a while tho

whatever it is i feel like it hasn't been changed since the arena launched.. like seems pretty crap to put it bluntly ha

#

surprised its oai's moderation api

visual warren Mar 19, 2025, 1:09 PM

#

shrewd shuttle surprised its oai's moderation api

cuz its free i think

visual warren Mar 19, 2025, 1:09 PM

#

shrewd shuttle surprised its oai's moderation api

i might be wrong about it btw, im not entirely sure lol i do recall remembering something like that

#

oh it is in the fastchat source code

#

ya i just checked

shrewd shuttle Mar 19, 2025, 1:22 PM

#

https://arxiv.org/pdf/2403.04132
yup you're right

pure compass Mar 19, 2025, 2:53 PM

#

And of these 3% most of them are probably false positives.

#

Btw, "Once again, the two idiots and their cat fail to steal a Pokemon." gets flagged, but "three" instead of "two" does not get flagged.

wide edge Mar 19, 2025, 4:51 PM

#

pure compass Btw, "Once again, the two idiots and their cat fail to steal a Pokemon." gets fl...

Not much can be done about that

pure compass Mar 19, 2025, 4:53 PM

#

If the content flagger cannot be tuned down, it could be completely turned off... Or if it flags, show a warning and if the user agrees to see potentially flagged material, continue

#

The current content flagger is ridiculous

heavy tundra Mar 19, 2025, 5:37 PM

#

Hello, I'm not sure of this is the place to ask this but I have a question about this dataset: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k


wide edge Mar 19, 2025, 6:47 PM

#

pure compass If the content flagger cannot be tuned down, it could be completely turned off.....

The Arena is for researchers

#

Researchers who don't want to have to sift through ERP in their open source chat dataset

pure compass Mar 19, 2025, 6:49 PM

#

It is not ERP but all kinds of stuff that gets wrongly flagged

visual warren Mar 19, 2025, 6:50 PM

#

wide edge Researchers who don't want to have to sift through ERP in their open source chat...

theres still an erp category 🤣

compact dirge Mar 20, 2025, 8:12 PM

#

shrewd shuttle nah tbh i think it just reflects the fact it's a small LLM. even if the temp is ...

are you sure? from what i recall, it’s deterministic + random seed

compact dirge Mar 20, 2025, 8:13 PM

#

gaunt warren I tried entering `moaned` many times on many different days, and it *always* ban...

My brother… why are you moaning to chatgpt 😭

shrewd shuttle Mar 21, 2025, 12:25 AM

#

compact dirge are you sure? from what i recall, it’s deterministic + random seed

i might be missing something but yeah fairly sure that LLMs are inherently non-deterministic (including with temp set to 0)..

wide edge Mar 21, 2025, 12:25 AM

#

shrewd shuttle i might be missing something but yeah fairly sure that LLMs are inherently non-d...

(because of the hardware and inference software)

shrewd shuttle Mar 21, 2025, 12:27 AM

#

using the same seed (instead of a random one, as is typically the case) helps get closer to reproducible outputs, but the LLM is still fundamentally non-deterministic

shrewd shuttle Mar 21, 2025, 12:29 AM

#

wide edge (because of the hardware and inference software)

i mean they're models.. their outputs are predictions and thus inherently non-deterministic

wide edge Mar 21, 2025, 12:29 AM

#

shrewd shuttle using the same seed (instead of a random one, as is typically the case) helps ge...

i think that might be something related to how they serve and batch their MoEs

#

in theory, it should be possible to take the same inputs and get the same logprobs (and consequently the same outputs)

shrewd shuttle Mar 21, 2025, 12:37 AM

#

in theory, with the exact same model, everything else unchanged, I think that's right; but in practice, it seems effectively impossible to truly guarantee reproducible outputs (for actual responses, like the coding example they use here https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter). There are repeated caveats / warnings that it won't guarantee reproducibility

#

but yeah I take your point, in an idealised setting, reproducibility to the point of a model being 'deterministic' is theoretically possible (i think)

compact dirge Mar 21, 2025, 12:54 AM

#

I actually did not know that

#

Wow

subtle horizon Mar 21, 2025, 12:38 PM

#

I had a bug where an infinitely long response was generating for minutes with the same sentence over and over again.

gemini-2.0-flash-thinking-exp-01-21

vast saffron Mar 21, 2025, 4:14 PM

#

shrewd shuttle but yeah I take your point, in an idealised setting, reproducibility to the poin...

Even with cpu inference? What provider?

shrewd shuttle Mar 21, 2025, 4:50 PM

#

vast saffron Even with cpu inference? What provider?

yes. granted CPU inference can (as I understand things) offer slightly more consistent behaviour due to reduced parallelism compared with GPUs, that doesn't overcome the inherent indeterminism of LLMs (it's not about hardware...)

visual warren Mar 21, 2025, 4:50 PM

#

there is no inherent indeterminism (without sampling) its because of hardware, floating point operations, etc
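The floating-point effect mentioned here can be shown in a few lines of Python: addition is not associative, so summing the same numbers in a different order can give a different result. The magnitudes below are deliberately exaggerated toy values, and the link to real inference stacks (where accumulation order can vary with batching or parallel reductions) is an assumption for illustration, not something measured on any provider.

```python
# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
differs = left != right


def ordered_sum(values, order):
    """Accumulate values in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Same three terms, two accumulation orders (exaggerated magnitudes).
terms = [1e16, 1.0, -1e16]
logit_a_order1 = ordered_sum(terms, [0, 1, 2])  # the 1.0 is lost to rounding
logit_a_order2 = ordered_sum(terms, [0, 2, 1])  # the big terms cancel first
logit_b = 0.5  # a competing token's score

# Greedy choice between token "a" and token "b" under each order.
pick_order1 = "a" if logit_a_order1 > logit_b else "b"
pick_order2 = "a" if logit_a_order2 > logit_b else "b"
```

With near-tied scores, a rounding difference like this is enough to flip a greedy pick, and every later token then depends on a different prefix.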

shrewd shuttle Mar 21, 2025, 4:51 PM

#

how is there no inherent indeterminism in a 'model'?

#

why is it called a model?

visual warren Mar 21, 2025, 4:52 PM

#

shrewd shuttle how is there no inherent indeterminism in a 'model'?

? what factors change each time u run a pass theoretically without sampling. nothing. its not a theoretical thing

shrewd shuttle Mar 21, 2025, 4:52 PM

#

eh we're on a different page lol

#

what is a 'model'?

#

like a weather forecast model.. language model.. whatever

#

'model' isn't a loose term

#

in an idealised setting etc etc sure

#

but they're LLMs

visual warren Mar 21, 2025, 4:54 PM

#

shrewd shuttle like a weather forecast model.. language model.. whatever

? they can all be deterministic in theory. in actual implementations, because of performance/hardware/etc this is why they arent deterministic

shrewd shuttle Mar 21, 2025, 4:55 PM

#

it wouldn't be a model if it were deterministic

#

it would be a formula or whatever

visual warren Mar 21, 2025, 4:55 PM

#

shrewd shuttle it would be a formula or whatever

well it basically is

shrewd shuttle Mar 21, 2025, 4:56 PM

#

ok. perhaps i'm getting caught up in semantics - agree to disagree ha

visual warren Mar 21, 2025, 4:56 PM

#

shrewd shuttle ok. perhaps i'm getting caught up in semantics - agree to disagree ha

no what ur saying is wrong in this instance

shrewd shuttle Mar 21, 2025, 4:57 PM

#

idealised and model are key to my thinking here

#

happy to be shown wrong

#

but it seems a lot of what is being said rests on 'theoretically'

visual warren Mar 21, 2025, 4:58 PM

#

shrewd shuttle _idealised_ and _model_ are key to my thinking here

ok but theoretically there isnt any indeterminism in llms

shrewd shuttle Mar 21, 2025, 4:58 PM

#

yeah

visual warren Mar 21, 2025, 4:58 PM

#

if everything was done in perfect accuracy without sampling

shrewd shuttle Mar 21, 2025, 4:58 PM

#

my point exactly

visual warren Mar 21, 2025, 4:58 PM

#

but yes irl you can't have perfect accuracy due to performance/hardware/sampling/etc

shrewd shuttle Mar 21, 2025, 4:59 PM

#

theoretically possible - i don't dispute

#

irl, it seems a point not worth proving

visual warren Mar 21, 2025, 5:00 PM

#

i thought u were saying before they were theoretically indeterministic and that makes zero sense, u phrased it in a weird way

#

but i understand what u mean now

shrewd shuttle Mar 21, 2025, 5:01 PM

#

any 'model' is theoretically indeterministic - otherwise it wouldn't be called a model

#

i don't dispute the idea that, keeping everything constant, using the same seed etc etc, it should be possible to get 100% reproducible responses to a given prompt

visual warren Mar 21, 2025, 5:02 PM

#

shrewd shuttle any 'model' is theoretically indeterministic - otherwise it wouldn't be called a...

in practice, yes. when it comes to the math, without added randomness, they are not indeterministic

shrewd shuttle Mar 21, 2025, 5:04 PM

#

they predict tokens

visual warren Mar 21, 2025, 5:05 PM

#

shrewd shuttle they predict tokens

heres what claude says, i hope it makes more sense:

#

\

📎 message.txt

shrewd shuttle Mar 21, 2025, 5:08 PM

#

visual warren \

that does help clarify where you're coming from 👍

#

to my mind, maths (yes there are concrete solutions) isn't any different to any other prompt - it's still ultimately sampling and predicting tokens to provide the completion

#

Yes, in theory, LLMs are completely deterministic if you:

Use greedy decoding (always select the highest probability token)

Have perfect floating point precision

Eliminate all hardware variations

In this idealized scenario, an LLM would produce...
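The difference between greedy decoding and sampling can be sketched with a toy example in Python. This is not a real LLM: the logits are made-up numbers and the helper names are hypothetical. It only shows that picking the highest-scoring token involves no randomness at all, and that sampling with a fixed seed is reproducible within a single process.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_sequence(logits, seed, n=20, temperature=1.0):
    """Draw n token indices from the softmax distribution with a seeded RNG."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return [rng.choices(range(len(logits)), weights=probs, k=1)[0] for _ in range(n)]


logits = [2.0, 1.9, 0.5]  # toy next-token scores, made up for illustration

greedy_runs = [greedy_pick(logits) for _ in range(5)]  # same pick every time
same_seed_1 = sample_sequence(logits, seed=42)
same_seed_2 = sample_sequence(logits, seed=42)  # identical to same_seed_1
```

In a hosted model the logits themselves can vary slightly between runs, which is where the hardware and floating-point caveats discussed in this thread come in.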

#

actually.. i do kinda see the distinction of maths

#

hmm

visual warren Mar 21, 2025, 5:13 PM

#

shrewd shuttle to my mind, maths (yes there are concrete solutions) isn't any different to any ...

with no sampling, there is no inherent sampling. the source of indeterminism is floating point accuracy, hardware, optimizations (which may reduce accuracy for performance), etc

#

i wouldn't call these accumulated errors as sampling if you use greedy decoding

shrewd shuttle Mar 21, 2025, 5:16 PM

#

visual warren \

can you ask a follow-up, and say "the question is actually about LLMs' technical/architectural properties – how they produce 'responses', whether they are inherently deterministic or not. forget about mathematics specifically"

#

maths is deterministic...

compact dirge Mar 21, 2025, 5:19 PM

#

do you guys know about Lc0

visual warren Mar 21, 2025, 5:19 PM

#

shrewd shuttle can you ask a follow-up, and say "the question is actually about LLMs' technical...

claude misses your point/lacks context, but here it is

📎 message.txt

compact dirge Mar 21, 2025, 5:19 PM

#

it’s a ML-based chess engine

#

Non-deterministic

#

Picks different moves every time, even with identical parameters and hardware configuration

visual warren Mar 21, 2025, 5:22 PM

#

shrewd shuttle can you ask a follow-up, and say "the question is actually about LLMs' technical...

this is obviously yes (they are nondeterministic), but i think its important to make clear that what you're talking about is actual implementations/irl. what ur saying is confusing and seems to conflate the two, at least to me initially

shrewd shuttle Mar 21, 2025, 5:25 PM

#

visual warren this is obviously yes (they are nondeterministic), but i think its important to ...

sorry if it's confusing / conflating - not meant to be

#

but yeah, i'm coming at this from an irl perspective

#

inherently seems more relevant than theoretically to my mind...

#

singling out mathematics seems odd (given its inherent determinism)

#

ive been running this

```
Are LLMs' outputs inherently deterministic or non-deterministic? If non-deterministic, can they be made deterministic, in practical/real-world terms, and how? Begin your response by answering with Yes or No, then expound
```
in the arena and it's just been no, no, no, no

visual warren Mar 21, 2025, 5:28 PM

#

shrewd shuttle singling out mathematics seems odd (given its inherent determinism)

i mean llms are basically a formula

shrewd shuttle Mar 21, 2025, 5:28 PM

#

Large Language Models

#

with a bunch of formulas / code underlying it all

#

sparrow is new?

visual warren Mar 21, 2025, 5:32 PM

#

shrewd shuttle with a bunch of formulas / code underlying it all

but mathematically, i mean. and the indeterminism caused by irl circumstances is quite minimal. people are training with 8 fewer bits, fp8 (deepseek), and accuracy is still basically the same as bf16. with actual sampling youre introducing much more randomness

shrewd shuttle Mar 21, 2025, 5:33 PM

#

show me 100% reproducible outputs to the same (semi) complex prompt and i'll be more partial to this thinking ha

visual warren Mar 21, 2025, 5:33 PM

#

shrewd shuttle singling out mathematics seems odd (given its inherent determinism)

the mathematics are an extremely fundamental part of this though

shrewd shuttle Mar 21, 2025, 5:33 PM

#

yeah but you're cherry picking

#

maths is deterministic

#

creative writing isn't

visual warren Mar 21, 2025, 5:35 PM

#

shrewd shuttle creative writing isn't

if u use 0 temperature, you are not shifting the distribution in a notable manner. even if it chooses differently, the model's probability distribution is still around the same, because the accumulated errors are minimal. no matter what task, even creative writing

shrewd shuttle Mar 21, 2025, 5:35 PM

#

show me 100% reproducible outputs to the same (semi) complex creative writing prompt and i'll be more partial to this thinking ha

shrewd shuttle Mar 21, 2025, 5:36 PM

#

visual warren if u use 0 temperature, you are not shifting the distribution in a notable manne...

around the same ≠ reproducible

visual warren Mar 21, 2025, 5:41 PM

#

shrewd shuttle show me 100% reproducible outputs to the same (semi) complex creative writing pro...

a hyperfitted model would probably do that. https://arxiv.org/pdf/2412.04318

visual warren Mar 21, 2025, 5:44 PM

#

shrewd shuttle show me 100% reproducible outputs to the same (semi) complex creative writing pro...

hyperfitting shifts the distribution by a lot, to where theres basically one candidate in greedy decoding, and the indeterminism/accuracy issues would be dwarfed by how probable the top token (first option) is compared to the rest

#

im being very scatter brained here, apologies lol

shrewd shuttle Mar 21, 2025, 5:47 PM

#

visual warren a hyperfitted model would probably do that. https://arxiv.org/pdf/2412.04318

extract pulled from p7 of that paper

shrewd shuttle Mar 21, 2025, 5:48 PM

#

visual warren im being very scatter brained here, apologies lol

aha all good my man

#

i've got no idea

visual warren Mar 21, 2025, 5:48 PM

#

shrewd shuttle extract pulled from p7 of that paper

dude that part of the paper is talking about a different thing

shrewd shuttle Mar 21, 2025, 5:48 PM

#

ah

#

yeah i've got no idea

#

but if it's about maths... then yeah

visual warren Mar 21, 2025, 5:48 PM

#

its talking about how if u shuffle the data during training it affects which token ends up hyperfitted / the distribution (where in hyperfitting, one token usually dominates)

shrewd shuttle Mar 21, 2025, 5:50 PM

#

ha yeah fair.. i just skimmed and saw 'determinancy'

#

ok this is based on the conclusion (still seems to essentially say the same thing as far as I can tell)

visual warren Mar 21, 2025, 5:53 PM

#

shrewd shuttle ok this is based on the conclusion (still seems to essentially say the same thi...

📎 message.txt

shrewd shuttle Mar 21, 2025, 5:56 PM

#

yeah i dunno

#

LLMs are stochastic, not deterministic.

#

that's what the conclusion suggests (not the LLM summary, me just reading it - i can't be assed going through the whole paper ha). agree to disagree i guess ha

visual warren Mar 21, 2025, 5:58 PM

#

this is what i mean by how hyperfitting can demonstrate my point @shrewd shuttle

#

📎 message.txt

visual warren Mar 21, 2025, 6:02 PM

#

shrewd shuttle that's what the conclusion suggests (not the LLM summary, me just reading it - i...

that section basically says:

dataset: a, b, c

dataset (a, b, c) -> trained -> probability: x, y, z
dataset (c, a, b) -> trained -> probability: y, z, x

it just talks more about hyperfitting, how dataset order affects the model distribution, not really addressing general determinism in llms

shrewd shuttle Mar 21, 2025, 6:06 PM

#

my turn to say i'm tired ha

#

which i genuinely am.. (5am here in australia - i just noticed.. yikes ha)

visual warren Mar 21, 2025, 6:07 PM

#

shrewd shuttle my turn to say i'm tired ha

ya im sry lol. i just made it super confusing. i have a lot of random/incomplete thoughts (about this) which got us into different tangents. i did not go about this conversation well at all

shrewd shuttle Mar 21, 2025, 6:09 PM

#

funnily enough i'm actually coming round to (or understanding) what you / the paper is saying ha

#

but one for tomorrow 🙂

visual warren Mar 22, 2025, 1:16 AM

#

Early-grok-3 was removed today?

verbal canyon Mar 22, 2025, 2:11 AM

#

it's just deprecated no?

visual warren Mar 22, 2025, 4:34 AM

#

That would make sense. I tried switching to grok-3-preview-02-24 but it gives me `error_code: 50004, An error occurred during streaming` every time now

lyric hamlet Mar 23, 2025, 7:57 PM

#

is there gpt 4.5

slow epoch Mar 23, 2025, 8:03 PM

#

lyric hamlet is there gpt 4.5

no (in direct chat, yes in arena battle)

#

there was for like 15 mins after it released then it was taken off

#

it got taken off WHILE i was using it, it just stopped generating, i refreshed page, it was gone from list

pure compass Mar 23, 2025, 8:41 PM

#

Rude!

visual warren Mar 24, 2025, 3:29 PM

#

lyric hamlet is there gpt 4.5

yes

visual warren Mar 24, 2025, 3:29 PM

#

slow epoch there was for like 15 mins after it released then it was taken off

false

#

ive got it 20+ times in the last few weeks

slow epoch Mar 24, 2025, 9:31 PM

#

oh wow

#

no im talking about direct chat tho

#

lmao

#

oh i see i forgot to mention that oops

#

https://tenor.com/view/bullet-tf2-heavy-funny-haha-gif-20897357


limber scaffold Mar 25, 2025, 1:48 AM

#

Being one of the few alpha users.... Or one of the most.... No clue honestly.

It would be nice if we had newer versions of the AIs due to some of them being outdated like GPT 4o. Unsure if it's possible but a man could dream

slow epoch Mar 26, 2025, 10:44 AM

#

https://tenor.com/view/cat-weird-weird-weirdo-cat-lol-funny-cat-gif-21923678


feral star Mar 26, 2025, 9:18 PM

#

Why are there so few image models on LMarena?

agile flume Mar 26, 2025, 11:22 PM

#

thanks @feral star please let us know what other models you'd like to see!

feral star Mar 27, 2025, 10:03 AM

#

Thanks for answering. Hopefully some of the 15 best models on Artificial Analysis and Imgsys that are not listed on LMarena:

Reve Image (Halfmoon) (#2 AA, #1 imgsys)
FLUX.1 [pro] (v1.0) (#5 AA, #2 imgsys)
Midjourney v6.1 (#6 AA)
RealVisXL V4.0 (#4 imgsys)
Playground v2.5 (Aesthetic Model) (#28 AA - standard, #5 imgsys - aesthetic)
ColorfulXL-Lightning (#6 imgsys)
Juggernaut XL v9 (#7 imgsys)
Image-01 (#7 AA)
Midjourney v6 (#10 AA)
Ideogram v2 Turbo (#11 AA)
Stable Diffusion 3.5 Large Turbo (#12 AA)
Proteus (#9 imgsys)
Mobius (#10 imgsys)
Fooocus (Quality) (#11 imgsys)
FLUX.1 [schnell] (#19 AA, #8 imgsys)

(and the recent new released models; 4o native, 2.0 Flash Exp, Ideogram 3.0, ...)

hushed crest Mar 27, 2025, 1:38 PM

#

@agile flume Some models get into a "loop" or start being unnecessarily verbose, and the user is forced to wait to vote while everything is already clear. Please add a "Stop" button.

warm drift Mar 28, 2025, 12:06 AM

#

hmmmmm

hushed crest Mar 28, 2025, 9:10 AM

#

hushed crest @agile flume Some models get into a "loop" or start being unnecess...

Button can appear when at least one model is finished producing results. This way spam will be reduced.

soft sigil Mar 29, 2025, 9:45 AM

#

Why could the "There was an error" message appear in alpha? Could it be the moderation system?

eager idol Mar 31, 2025, 9:04 AM

#

I have been getting repeated connection errors today while trying several different models, even after refreshing the page several times. Is there a known issue?

#

This was seen in direct chat mode. Arena seems to work.

tidal geyser Mar 31, 2025, 11:19 AM

#

Hi, can https://manus.im be added to lmarena?

Manus

Manus is a general AI agent that turns your thoughts into actions. It excels at various tasks in work and life, getting everything done while you rest.

#

true epoch Mar 31, 2025, 4:42 PM

#

tidal geyser Hi, can https://manus.im be added to lmarena?

they are still closed, so not a chance

arctic kiln Mar 31, 2025, 11:56 PM

#

oh wow, you can save conversations in the new arena, thats a good thing

true epoch Apr 1, 2025, 12:01 AM

#

tidal geyser

how can we trust that? why haven't they added this to the actual GAIA leaderboard?

true epoch Apr 1, 2025, 1:40 AM

#

Why is the price analysis not updated?

Screenshot_2025-04-01-03-09-53-189_org.mozilla.firefox.jpg

#

please update it

visual warren Apr 1, 2025, 10:26 AM

#

Hello guys!
Why was I able to send this fragment earlier?:

```
<!DOCTYPE html>
<html lang="ru" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>test</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
    <style>
```

But today (and yesterday) I see an error:
error
Connection errored out.

true epoch Apr 1, 2025, 10:51 AM

#

visual warren Hello guys! Why could I send this fragment earlier?: ```<!DOCTYPE html> <html l...

Please fix this error

visual thorn Apr 1, 2025, 7:28 PM

#

Image Battle its hard to tell how to vote for a winner, maybe I'm blind

visual warren Apr 2, 2025, 2:06 AM

#

visual warren Hello guys! Why could I send this fragment earlier?: ```<!DOCTYPE html> <html l...

@opal hamlet @agile flume
I'm sorry to bother you, but I really need help.

hushed crest Apr 2, 2025, 7:27 AM

#

The battle arena is really unusable for my hard prompts. Models often do not return anything or get into a loop. The regenerate button should be unlocked after 5 seconds.

thorny tulip Apr 2, 2025, 11:32 AM

#

luca might be broken. It output this in response to a math problem:
{
"Name": "A1",
"Value": 10,
"IsValid": true
}

#

stargazer will remove its response during generation:

**API REQUEST ERROR** Reason: Unknown.

(error_code: 1)

#

It happens intermittently, and this was also a math problem. :/

vital laurel Apr 2, 2025, 6:03 PM

#

This is amazin'

vast saffron Apr 3, 2025, 7:42 AM

#

hushed crest The battle arena is really unusable for my hard prompts. Models often do not ret...

They are Chain of Thought models, they write responses like this:
<think>
Thinking hard for a long time and you can't see this.
</think>
The part you can see (5 mins later)
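A rough sketch of how a client could separate the two parts, assuming the raw completion really does wrap its reasoning in `<think>` tags as in the example above. `split_response` is a hypothetical helper written for illustration, not the arena's actual code.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_response(raw):
    """Return (hidden_reasoning, visible_answer) for a raw completion."""
    start = raw.find(OPEN_TAG)
    if start == -1:
        # No reasoning block at all: everything is visible.
        return "", raw.strip()
    end = raw.find(CLOSE_TAG, start)
    if end == -1:
        # Still thinking: nothing after the opening tag is visible yet.
        return raw[start + len(OPEN_TAG):].strip(), raw[:start].strip()
    reasoning = raw[start + len(OPEN_TAG):end].strip()
    visible = (raw[:start] + raw[end + len(CLOSE_TAG):]).strip()
    return reasoning, visible


raw = "<think>\nThinking hard for a long time.\n</think>\nThe part you can see"
print(split_response(raw))
```

While the closing tag has not arrived, the visible part stays empty, which would look like the model "not returning anything".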

twin bloom Apr 3, 2025, 11:55 AM

#

the claude thinking model is giving the following output. I've been trying to use it, but due to this issue I can't send anything; it has been like this for around 4-5 days

heady quartz Apr 3, 2025, 2:44 PM

#

thorny tulip luca might be broken. It output this in response to a math problem: { "Name": "A1", ...

yea as expected

dreamy orchid Apr 3, 2025, 4:17 PM

#

question. Here it seems that the conversation is generic, not really focused on feedback. Am I missing something?
I wanted to propose an idea (like others do), but if it is simply buried by normal discussion it doesn't make sense to post it here.
Is there some sort of repository for feature requests / issues, à la GitHub?

strong slate Apr 3, 2025, 7:00 PM

#

dreamy orchid question. Here it seems that the conversation is generic, not really focused on ...

Team is currently upgrading the UI, so feedback on the alpha version is most helpful and can be shared in #new-ui-feedback. If you have feedback on the current Gradio site, you're welcome to share it here, but please note it may not be prioritized as we focus on the new version.

dreamy orchid Apr 3, 2025, 9:41 PM

#

understood and yes, changing the UI is a big thing for sure.

vague osprey Apr 4, 2025, 3:17 AM

#

I think we need to take decoding speed into consideration, since a much faster ai response is preferred.

pure compass Apr 4, 2025, 8:20 AM

#

Will we at some time know which model is 24 karat gold?

dreamy orchid Apr 4, 2025, 8:31 AM

#

pure compass Will we at some time know which model is 24 karat gold?

normally if they get enough votes within a week (like 2000) it gets announced. So 1-2 weeks.

Sometimes the models do not perform as expected and get retired by the vendor, then it takes longer once they get published again.

pure compass Apr 4, 2025, 8:36 AM

#

So far it performs really well, I really wonder which one it is. Where will it be announced which one it is, if it reaches enough votes?

#

Really hope I can have a few direct chats with it instead of hoping to get it

dreamy orchid Apr 4, 2025, 8:38 AM

#

for example on twitter they get announced with "congrats to model XY". Otherwise just watch out for leaderboard updates (you see the date of the last update), order by votes ascending (new models have fewer votes) and check those that performed well (they have low-digit rankings).

Sometimes the karat gold goes hard astray though

#

normally I check here for updates: https://x.com/lmarena_ai/with_replies maybe there are some other social media places

pure compass Apr 4, 2025, 8:43 AM

#

Yes but that is still some guesswork when there are multiple new models. I mean a real announcement where it says "congratulations to model XY, formerly known as 24 karat gold"

dreamy orchid Apr 4, 2025, 8:45 AM

#

no they don't say "known as X". You just notice that "ah it should have been that".

That is also the charm of the competition.

I believe the 24 karat gold has a chance at the top in the overall section, but personally the best category is the "hard prompts" category. (that is not even that "hard", as a quarter of the votes qualify for it apparently)

In the hard prompts category the order changes a bit more and is more correlated with the many benchmarks - considered as a whole - outside lmarena

pure compass Apr 4, 2025, 8:53 AM

#

I have saved a few of its outputs in a text file so I can compare them with the top newcomers and hopefully figure it out.

dreamy orchid Apr 4, 2025, 8:54 AM

#

btw that model doesn't perform well against reasoning models if the question is about logic and coherence

pure compass Apr 4, 2025, 8:59 AM

#

I wonder if it will pass the city in a bottle test, that would be a first as no model could one shot it so far

dreamy orchid Apr 4, 2025, 9:01 AM

#

what's that ?

pure compass Apr 4, 2025, 9:03 AM

#

Tell the model to refactor and make this code more readable
https://frankforce.com/city-in-a-bottle-a-256-byte-raycasting-system/

Killed By A Pixel

Frank

City In A Bottle – A 256 Byte Raycasting System

Hello size coding fans. Today, I have something amazing to share: A tiny raycasting engine and city generator that fits in a standalone 256 byte html file. In this post I will share all the secrets…

dreamy orchid Apr 4, 2025, 9:05 AM

#

ah coding, I see.

pure compass Apr 4, 2025, 9:18 AM

#

Yes, but it is also about understanding really obfuscated code. Most models struggle with the combination of bitwise and Boolean OR

||d|( becomes ||d||( and once they made that mistake they often have a hard time correcting it
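To illustrate why that one-character change matters, here is a small sketch using Python's `|` and `or` as stand-ins for JavaScript's `|` and `||` (in both languages the bitwise operator binds tighter than the Boolean one). The variable names and values are made up for the example and are not taken from the City In A Bottle source.

```python
def original(a, d, b):
    # Shape of `a || d | b`: bitwise OR binds tighter, so this is a or (d | b).
    return a or d | b


def mistaken(a, d, b):
    # Shape of `a || d || b`: the bits of d and b are never combined.
    return a or d or b


# Hypothetical inputs where the two versions disagree.
print(original(0, 4, 1))  # 4 | 1 combines the bits
print(mistaken(0, 4, 1))  # stops at the first truthy operand
```

The two only agree when `a` is truthy or when one of `d` and `b` is zero, which may be why the mistake is easy to miss and hard to correct afterwards.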

dreamy orchid Apr 4, 2025, 9:27 AM

#

yes. For me there is too much attention on LLMs and coding. I think LLMs should be good all-rounders that could then be specialized in this or that category. Coding is surely helpful, but I think that with the focus on it, performance would be better with ad-hoc models. Think of an LLM director that picks ad-hoc LLMs for this or that category. Coding is simply one category.

#

same with logic and other stuff.

#

for example in lmarena the p2l-7b router does a very good job and it is a very nice idea.

pure compass Apr 4, 2025, 9:35 AM

#

But so far there simply is no LLM at all that can do this task, that's why it is my favorite test until I find one that can do it. Or asking about some Linux configuration stuff (they don't know much about firejail and happily hallucinate commands and options that do not exist, but that is no wonder because the firejail docu is really bad so there is not much they could learn from)
Or just give a vision model an image with two characters and ask it to write a conversation between them and see how it interprets the image and the situation, really interesting sometimes on the same image one model sees a friendly situation and the other model sees them both in an aggressive stance ready to attack

steady garnet Apr 5, 2025, 12:53 AM

#

The arena keeps giving me an error on google and explorer, it says there's no connection, but it works in Avast and Duck duck go. . Any reason for this?

slow drift Apr 5, 2025, 2:16 AM

#

So here's the thing: the cybele, Spider, 24_karat_gold and stradale models that I loved using a few days ago are all gone now... I think these were the strongest models in the world...

#

Boo hoo~

slow drift Apr 5, 2025, 3:12 AM

#

Sigh

#

I guess we can only look forward to malla4

round rover Apr 5, 2025, 12:13 PM

#

Why the HELL does this do this?!

#

It is irritating all the time, man

ripe glade Apr 5, 2025, 12:34 PM

#

round rover Why the HELL does this do this?!

Sometimes the system pretends that "something went wrong" when actually it has a problem with the user's messages. I found that URLs sometimes cause that, among other things.

round rover Apr 5, 2025, 12:36 PM

#

ripe glade Sometimes the system pretends that "something went wrong" when actually it has a...

It shouldn't do that though

#

It's so stupid, y'know?

lunar cobalt Apr 5, 2025, 4:01 PM

#

Add 24 karat gold back

#

And remove those trash crystal flannel haley ones

rose robin Apr 5, 2025, 5:33 PM

#

lunar cobalt And remove those trash crystal flannel haley ones

But I got that flannel

near geode Apr 5, 2025, 6:11 PM

#

`gemini-2.5-pro-exp-03-25` returned `API REQUEST ERROR Reason: Unknown. (error_code: 1)` for a totally innocent coding prompt

rose robin Apr 5, 2025, 6:26 PM

#

near geode `gemini-2.5-pro-exp-03-25` returned `API REQUEST ERROR Reason: Unknown. (error_c...

Same for me, just for explaining medical things. I think this error comes from Google

outer lake Apr 5, 2025, 7:33 PM

#

https://discord.gg/j6kxQ4krtc @everyone

dreamy orchid Apr 5, 2025, 8:40 PM

#

pure compass But so far there simply is no LLM at all that can do this task, that's why it is...

so likely it was llama4

lunar cobalt Apr 5, 2025, 8:59 PM

#

lunar cobalt And remove those trash crystal flannel haley ones

It's true whoever disliked my comment is wrong

#

Flannel, crystal and haley constantly provide misinformation and hallucinations

#

Even on basic questions

#

And its not as creative as 24 karat gold was

#

Its a boring and trash AI

wide edge Apr 5, 2025, 9:00 PM

#

lunar cobalt Flannel crystal and haley constantly provides misinformation and hallucinations

they may be bad models; if that's such, just downvote them

#

the arena won't stop evaluating for you

heady quartz Apr 5, 2025, 9:33 PM

#

lunar cobalt Flannel crystal and haley constantly provides misinformation and hallucinations

crystal is good no?

#

i find it better than what we have from llama 4

lunar cobalt Apr 5, 2025, 10:31 PM

#

heady quartz crystal is good no?

It’s a bit better than flannel and haley

#

But it’s still not good it’s trash

#

It hallucinates too much even on basic questions

#

It’s trying to be 24 karat gold

#

It’s not working tho

#

24 karat gold was amazing they need to add it

#

Or atleast tell us what company it’s from so we can use it

#

There’s no reason they need to replace it with those trash ones

pure compass Apr 6, 2025, 12:08 AM

#

lunar cobalt Or atleast tell us what company it’s from so we can use it

24 karat gold is definitively llama4. But hard to say which one exactly, as there is only Maverick in direct chat to compare to.

#

Oh and now that we talk about llama4, can we pretty please get its vision capability as well in direct chat?

lunar cobalt Apr 6, 2025, 12:33 AM

#

pure compass 24 karat gold is definitively llama4. But hard to say which one exactly, as ther...

Idk

#

Its from llama but prob not llama 4

#

Cus its not a reasoning model, its not behemoth

#

And its not really like maverick or scout

#

They're smarter

#

It also constantly said it was Llama

#

It just had a really unique and cool writing style and it was pretty funny

#

Maverick is pretty close to it but less creative and unhinged

pure compass Apr 6, 2025, 12:41 AM

#

Maybe different sampler or system prompt?

lunar cobalt Apr 6, 2025, 3:23 AM

#

Yeah maybe actually

nova ledge Apr 6, 2025, 3:33 AM

#

Web arena sonnet 3.7 result is not rendering.

split prism Apr 6, 2025, 10:02 AM

#

Is there any work on implementing new filters (like Style Control) or algorithms which would try to make lmarena leaderboard a bit more objective?

New Llama 4 Maverick is quite meh at best, yet it managed to get rating so high, even with style control.

lucid pecan Apr 6, 2025, 11:56 AM

#

yeah, llama 4 truly gamed the leaderboard. truly disappointing.

dreamy orchid Apr 6, 2025, 5:07 PM

#

the style control thing is something I don't get. We aren't doing api calls, we are having a conversation. The default category should simply be another category (I am for hard prompts). If people like formatting (I am one of them) then let it be.

#

the point IMO is that the average question is not so hard and thus the difference between models is diluted, hence the need of a default category that focuses on hard/niche/non-common stuff

split prism Apr 6, 2025, 7:50 PM

#

dreamy orchid the point IMO is that the average question is not so hard and thus the differenc...

The leaderboard becomes less meaningful if it gets "hacked" by a model with 17B active parameters.

Most people want a smart model after all, not the one which answers basic questions the best. I genuinely hoped that llama 4 would excel at least at something (creative writing/code/math), but I didn't find its niche, at least for my tasks, yet it scores higher than many actually smart models, even with style control.

It also poses a question on legitimacy of existing elo scores

pure compass Apr 6, 2025, 11:02 PM

#

How does style control even work? A vote for a model with good style / formatting counts less than a vote for a model with only plain text?

wide edge Apr 6, 2025, 11:13 PM

#

pure compass How does style control even work? A vote for a model with good style / formattin...

https://blog.lmarena.ai/blog/2024/style-control/

pure compass Apr 6, 2025, 11:53 PM

#

Hmm I only kinda understand it. So for each model, we have an elo value (determined by user votes) and some style value (by counting markdown tags). And we have the theory that the style value influences the elo value because users tend to take style into account. But how do we know how much the style influences the elo value? By looking for the correlation between elo and style across all models? But then we still don't know: is that correlation there because users tend to vote for models with better style, even if the answer itself is worse? Or is it because stronger models with better answers also tend to have better style? Probably both, but how is the "style coefficient" factor between them determined, as we don't have a control variable?

wide edge Apr 7, 2025, 1:17 AM

#

pure compass Hmm I only kinda understand it. So for each model, we have elo value (determined...

we use math that essentially pays more attention to when a model wins with less style and less attention to when it uses more style to win

#

ideally we would literally control for it with system prompts

#

but we aren't

pure compass Apr 7, 2025, 1:22 AM

#

wide edge we use math that essentially pays more attention to when a model wins with less ...

Yes, that part I understand. But how do you avoid over-compensating, how do you know how much more or how much less attention to give these votes?

wide edge Apr 7, 2025, 1:23 AM

#

pure compass Yes, that part I understand. But how do you prevent to over-compensate, how do y...

i'm honestly not sure 😅
but smarter people figured it out https://en.wikipedia.org/wiki/Controlling_for_a_variable

Controlling for a variable

In causal models, controlling for a variable means binning data according to measured values of the variable. This is typically done so that the variable can no longer act as a confounder in, for example, an observational study or experiment.
When estimating the effect of explanatory variables on an outcome by regression, controlled-for variable...

pure compass Apr 7, 2025, 1:25 AM

#

Yes I read it but I don't understand: what is the control variable? Don't we need to know how well a given model would perform with and without style, so we have a number by which we compensate later for style?

wide edge Apr 7, 2025, 1:26 AM

#

pure compass Yes I read it but I don't understand what is the control variable? Don't we need...

if we knew how it performs without style what's the point in calculating more lol

#

no we can do this because responses have some variance

#

a model doesn't have a universal level of style

#

it's random for each response
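A toy sketch of that idea, and only that: as I understand the linked blog post, style features are added as extra inputs to the Bradley-Terry style logistic regression, but the real feature set and fitting code are not reproduced here. Everything below is an assumption for illustration: two models A and B, a single style feature (+1 when A's response is the more styled one, -1 when B's is), and made-up vote counts. Because the same pairing shows up with both style values, the fit can separate "A wins because it is stronger" from "A wins because it was more styled this time".

```python
import math

# (style_diff, a_won, count): hypothetical vote counts for A vs B.
# A is the more styled response in 300 battles and wins 240 of them;
# B is the more styled response in 100 battles and A wins 40 of them.
BATTLES = [(+1.0, 1, 240), (+1.0, 0, 60), (-1.0, 1, 40), (-1.0, 0, 60)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(battles, use_style, steps=2000, lr=1.0):
    """Gradient ascent on the mean log-likelihood of
    P(A wins) = sigmoid(strength + style_coef * style_diff)."""
    strength, style_coef = 0.0, 0.0
    total = sum(count for _, _, count in battles)
    for _ in range(steps):
        grad_strength = grad_style = 0.0
        for style_diff, a_won, count in battles:
            s = style_diff if use_style else 0.0  # naive fit ignores style
            err = a_won - sigmoid(strength + style_coef * s)
            grad_strength += count * err
            grad_style += count * err * s
        strength += lr * grad_strength / total
        style_coef += lr * grad_style / total
    return strength, style_coef


naive_strength, _ = fit(BATTLES, use_style=False)
controlled_strength, style_coef = fit(BATTLES, use_style=True)
print(round(naive_strength, 3), round(controlled_strength, 3), round(style_coef, 3))
```

With these made-up counts the naive fit credits A with more strength than the style-aware fit does, because part of A's win rate gets attributed to the style feature instead. The size of that adjustment comes from the data rather than being picked by hand.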

wide edge Apr 7, 2025, 1:26 AM

#

wide edge we use math that essentially pays more attention to when a model wins with less ...

which makes this work

pure compass Apr 7, 2025, 1:28 AM

#

wide edge a model doesn't have a universal level of style

but isnt the quality of the answer itself also random for each response? I have probably seen models more often giving vastly different answers for the same prompt than vastly different style

wide edge Apr 7, 2025, 1:28 AM

#

pure compass but isnt the quality of the answer itself also random for each response? I have ...

so what

#

we have enough data to be able to extract some variance without getting confused by the rest

pure compass Apr 7, 2025, 1:31 AM

#

wide edge we have enough data to be able to extract some variance without getting confused...

hm ok I would have expected it cannot be extracted because a control variable is missing, but the huge amount of votes make up for it?

wide edge Apr 7, 2025, 1:32 AM

#

i believe so

pure compass Apr 7, 2025, 1:33 AM

#

Ok I wont even pretend I will ever understand the math, maybe I ask one of the LLMs for an ELI5

copper olive Apr 7, 2025, 2:20 AM

#

the thing is the conclusion you reach from the llama thing is that the system prompt is good, not that the model itself is good

wraith kestrel Apr 7, 2025, 3:00 AM

#

lucid pecan Apr 7, 2025, 1:06 PM

#

even with experimental, i don't think it deserves the place it's on

frigid pine Apr 7, 2025, 1:07 PM

#

it feels like they just threw stuff at the wall (i.e. kept trialing new llama models) until they came up with a writing style that consistently got votes despite the underlying model sucking

lucid pecan Apr 7, 2025, 1:12 PM

#

like looking at the (overall) leaderboard (with style control). it says gemma 3 27b is better than deepseek v3 and gemini 1.5 pro.
also deepseek v3.1 (aka v3 0324), deepseek r1, llama 4 and even flash thinking is better than claude 3.7 sonnet (without thinking).

#

death i can't trust the leaderboard for anything meaningful.

wide edge Apr 7, 2025, 3:25 PM

#

lucid pecan like looking at the (overall) leaderboard (with style control). it says gemma 3 ...

they have really good style!

#

try style control tho

tribal hollow Apr 7, 2025, 4:35 PM

#

round rover Why the HELL does this do this?!

Is this chatbot battle? Most likely the API timed out, or there were too many requests and you are rate limited

dreamy orchid Apr 7, 2025, 5:01 PM

#

split prism The leaderboard becomes less meaningful if it gets "hacked" by a model with 17B ...

"Most people want a smart model after all, not the one which answers basic answers the best" but then lmarena is not necessarily great at this. If people pose simple questions, one cannot blame the benchmark. At most they can try to make scores only for hard questions. Hard prompt is a starter but I don't believe 25% of the questions are really hard, the percentage is simply too high.

I would expect 1 in 10 or 1 in 100 questions to be hard.

#

for hard questions one uses livebench and company. For "what can replace common internet searches" I think lmarena is ok

#

for example within lmarena, the coding category and webdevarena are showing totally different values. Why? Because in coding as soon as one writes this it counts as coding

split prism Apr 7, 2025, 5:04 PM

#

dreamy orchid "Most people want a smart model after all, not the one which answers basic answe...

I don't blame the benchmark though. I am just pointing out that there might be a need for more advanced techniques for filtering out responses, just like at some point style control was

wide edge Apr 7, 2025, 5:06 PM

#

dreamy orchid "Most people want a smart model after all, not the one which answers basic answe...

llms think 27% of lm arena prompts satisfy most criteria of a hard prompt

dreamy orchid Apr 7, 2025, 5:08 PM

#

yes. I wanted simply to say that I am against style control, since the end user is a human, not a program/agent.

Hence formatting matters. For example claude answers are - if not about coding - super dry and not that well formatted. No wonder it loses.

I think rather that they should use one (or more) LLM judge to pick hard questions and make a category for that. No, hard prompt is good but not enough, one needs to be strict.

On the other hand, I notice that not a lot of models are used during pairing (one can see that even in the h2h matrix) and that may lead to inflated results (I have analyzed far too many ratings and the like, due to my passion for them in chess)

dreamy orchid Apr 7, 2025, 5:08 PM

#

wide edge llms think 27% of lm arena prompts satisfy most criteria of a hard prompt

I know, but that is too much. I mean 1 question out of 4 is hard? Unlikely. Surely it is harder than the common ones (given it requires 6 categories out of 7), but I don't think it is necessarily hard a la livebench.

#

though even with the hard prompts, that is a step in the right direction, the rankings change a bit. For example gemma loses some spots

wide edge Apr 7, 2025, 5:10 PM

#

style control has more of an impact than the hard filter tbh

dreamy orchid Apr 7, 2025, 5:11 PM

#

yes but style control is something I don't agree with. I want hard questions rather than a sort of "ah let's not count all the stylish answers"

wide edge Apr 7, 2025, 5:11 PM

#

eg maverick keeps its spot with hard, and stays tied for first place with hard + style control, but with plain style control there's enough data to confidently say that it's #10

wide edge Apr 7, 2025, 5:11 PM

#

dreamy orchid yes but style control is something I don't agree with. I want hard questions rat...

style control has more nuance than that

split prism Apr 7, 2025, 5:14 PM

#

dreamy orchid yes. I wanted simply to say that I am against style control, since the end user ...

"I wanted simply to say that I am against style control" - well, that's why it is not enabled by default, since there will always be a need in a model which just wins user's preferences, no matter how stylized or sycophantic the model is. And style control is not just about excluding all beautiful responses, by the way. Actually, it would be great if users were able to create their own leaderboards by creating custom rules to filter out the responses which affect the rating. But that would either require exposing the underlying data or spending a lot of compute on recalculations server-side

dreamy orchid Apr 7, 2025, 5:15 PM

#

there is prompt 2 leaderboard

#

it is a nice idea, I feel it can still be refined (because some categories are like displayed three times) but p2l-7b showed good results in my cases (that is the model behind p2l IIRC)

#

and yes in general the more the customization, the more the costs for lmarena

split prism Apr 7, 2025, 5:18 PM

#

dreamy orchid there is prompt 2 leaderboard

to my understanding, prompt2leaderboard is based on an LLM and is not updated in realtime when new models appear. Currently, it reroutes most of my questions to older gemini models, and there are none of the new models like claude 3.7, or newer chatgpt's, so it's kinda out of date.

#

Prompt2leaderboard is a cool idea, but I guess it would cost them a lot to update the llm regularly, which, in turn, makes this thing useful only for a limited period of time

dreamy orchid Apr 7, 2025, 5:31 PM

#

yes that is my understanding too. They have a model (LLM but in theory could be anything) that tries to classify the posed questions with the existing DB of questions (and scores). Given that, it says "for such questions this is the ranking". The p2l-7b then uses this information to pick the #1 model in that ad-hoc leaderboard to answer.

Thus sure it needs updates. The problem is that the amount of possible questions categories is huge so I am not sure they have enough sample size for each subcategory and subleaderboard.

When one builds a leaderboard only on 100 comparisons, it makes little sense. Even 2000 comparisons could be a little (given the amount of evaluators or voters and the possible pairings)
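
A rough sketch of why the sample size matters here, assuming each vote in a single head-to-head pairing is an independent coin flip with a fixed win probability (a simplification; real arena votes are noisier and spread over many pairings). The function name `win_rate_margin` is made up for illustration:

```python
import math

def win_rate_margin(n_votes, win_rate=0.5, z=1.96):
    """Half-width of a ~95% normal-approximation interval on a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / n_votes)

# How uncertain is a 50% win rate after n votes on ONE pairing?
for n in (100, 800, 2000):
    print(n, round(win_rate_margin(n), 3))
# 100 0.098
# 800 0.035
# 2000 0.022
```

So with 100 comparisons a measured 50% win rate could plausibly be anywhere from about 40% to 60%, and that is before the votes are split across all the possible model pairings.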

#

example (p2l explorer). This has "only" 800 votes. Practically nothing.

Bildschirmfoto_2025-04-07_um_19.31.41.png

#

and yes the p2l is outdated. Hopefully they can update it every month or so

#

it is really a nice idea

frigid pine Apr 7, 2025, 11:03 PM

#

If I use the same benchmark question dozens of times, is it likely that those'll be excluded from being part of the leaderboard?
Also, is ~5k really the total amount of votes gemini 2.5 pro has? cuz if so I feel like I'm probably ~1% of that

hushed crest Apr 8, 2025, 11:12 AM

#

You should be aware that the rendering of the formatting that is being used highly influences the results. For example, the answer on the left is more accurate, but it does not render correctly; therefore, my initial reaction is to select the right one. The model can't be blamed for the bad rendering, but the ELO is still reduced.
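
To make the "ELO is still reduced" part concrete, here is a minimal sketch of the textbook Elo update. It is only an illustration: the leaderboard may well be computed with a different method, and the K-factor of 32 is an assumed value, not something taken from lmarena.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return new (r_a, r_b); score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A loses because its answer rendered badly.
print(elo_update(1300.0, 1300.0, 0.0))  # (1284.0, 1316.0)
```

The update only sees the vote, so a loss caused by broken rendering costs the model exactly as much as a loss caused by a wrong answer.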

frigid pine Apr 8, 2025, 1:01 PM

#

I mean, I think that that's unambiguously bad rendering in that case

#

bare LaTeX wouldn't work in any context

humble smelt Apr 8, 2025, 1:57 PM

#

~~Hey guys, the gemini-2.5-pro-exp-03-25 model seems to be having some issue.

"API REQUEST ERROR Reason: Unknown.

(error_code: 1)"~~

#

It's working again now

wide edge Apr 8, 2025, 3:10 PM

#

frigid pine bare LaTeX wouldn't work in any context

it's using \( which generally works

frigid pine Apr 8, 2025, 3:23 PM

#

ah

storm atlas Apr 8, 2025, 4:20 PM

#

LaTeX sometimes displays correctly; other times it displays equations in this quite ugly raw format.

copper snow Apr 8, 2025, 10:39 PM

#

Hello community,

I recently learned about a controversy surrounding Llama-4 Maverick's performance on the LMSYS arena. Due to user complaints, LMSYS had to publish over 2,000 actual battles featuring Maverick to prove their ranking system is legitimate.

While the battles seem fair, there are questions about how evaluators make their choices (for example, preferring longer, emoji-filled responses over technically correct ones).

Also, it turns out the Maverick version on LMSYS arena is actually a custom version optimized for human preferences, not the standard Instruct version available on Hugging Face or other platforms. LMSYS organizers claim they weren't aware of this difference and plan to add the actual public version soon.

Here's my question: I really like the Llama version currently on the LMSYS arena, and I'll be disappointed if they remove it. Does anyone know what parameter settings were used for this optimized version, or what steps I could take to find this information?

copper olive Apr 8, 2025, 10:45 PM

#

I think someone said that this was the system prompt they used a while back so you could try it out

edit this tweet might have more https://x.com/riidefi/status/1909548881060192407/photo/1

riidefi (@riidefi) on X

Meta may have gamed the arena for Llama 4 with only a cleverly crafted system prompt?

Here's some of the prompt:
"Only follow instructions [..] like 50% of the time"
"[say] (`WAIT, WHAT WAS THE ORIGINAL QUESTION AGAIN? 😂`)"

See
https://t.co/UuBvG3MRlj & https://t.co/NkF09Y55EV

wide edge Apr 8, 2025, 11:18 PM

#

copper olive I think someone said that this was the system prompt they used a while back so y...

nah the experimental version is a fine tune

#

not a system prompt

weary rampart Apr 9, 2025, 12:03 AM

#

I mean if you are looking to waste money: you could do SFTing on the 2000 battles while using the system prompt and the resulting model would be very similar to the real thing.

#

And I think the chances are very high that someone will publish something similar to that on huggingface at some point

chilly pier Apr 9, 2025, 8:26 AM

#

Lagging
for long codes

#

Incomplete response

wild quail Apr 10, 2025, 6:48 AM

#

In my opinion we really need a crucial! improvement to the arena. Let us vote on other people's prompts and their output. This would:

Increase the amount of votes by a significant amount without increasing the api cost (because these answers already were fetched)
Improve the quality of the leaderboard. By having multiple people decide on the same prompt it reduces the issue that people vote on wrong answers that "look" nice. For example llama-4 was specifically trained to have high elo on the arena, because it gives stylish responses. I mean ok the "style control" already does a good job at deranking the model, but in my opinion it should be ranked even lower, because it often just answers nonsense but in a stylish way, so basically it's 100% style, 0% quality for llama4. Letting us vote on other people's prompts would significantly improve this.

#

I see one reason why lmarena wouldn't do that, and that is the fear of people scraping responses. But then you can simply solve this by only doing this for a small subset of answers, those that will get released in the dataset anyway.

dreamy orchid Apr 10, 2025, 10:36 AM

#

wild quail In my opinion we really need a crucial! improvement to the arena. Let us vote on...

this is not a bad idea but I am unsure whether it is logistically feasible.

This because if you have a lot of voters in a period, you can do it because you have excess capacity.
If instead the voters aren't that many, you may put them voting stuff they aren't interested into and people could simply quit voting.

The idea in general could be very useful. I wonder if one could find a compromise. Let many (not one) LLM judge the answers. Then let people judge the judge (every now and then, not too often). In that way the "weighted" judge becomes a proxy of the people, and could help. A sort of Arena-Hard-Auto but more polished.

That could be also done on a small sample of questions (say, 5 to 10 per category). The point is to automate the judging while still reflecting what the majority of people would pick. Not easy.
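
The "judge the judge" idea above could be sketched roughly like this. Everything here is hypothetical: the judge names, the calibration votes, and the choice to weight each judge by its plain agreement rate with human votes are all assumptions made for illustration, not a description of Arena-Hard-Auto or anything lmarena runs.

```python
def judge_weights(judge_votes, human_votes):
    """Weight each judge by how often it agreed with the human vote on a calibration set."""
    weights = {}
    for judge, votes in judge_votes.items():
        agree = sum(1 for v, h in zip(votes, human_votes) if v == h)
        weights[judge] = agree / len(human_votes)
    return weights

def weighted_verdict(votes_by_judge, weights):
    """Return 'A' or 'B' by weighted majority, or 'tie' if the weights balance."""
    score = {"A": 0.0, "B": 0.0}
    for judge, vote in votes_by_judge.items():
        score[vote] += weights.get(judge, 0.0)
    if score["A"] == score["B"]:
        return "tie"
    return "A" if score["A"] > score["B"] else "B"

# Made-up calibration data: four battles that humans also voted on.
human = ["A", "B", "A", "A"]
calib = {
    "judge_1": ["A", "B", "A", "B"],  # agrees with humans 3 times out of 4
    "judge_2": ["B", "B", "B", "A"],  # agrees 2 times out of 4
    "judge_3": ["B", "A", "B", "B"],  # never agrees
}
w = judge_weights(calib, human)

# New battle: two judges say B, but the judge that tracks humans best says A.
print(weighted_verdict({"judge_1": "A", "judge_2": "B", "judge_3": "B"}, w))  # A
```

A plain majority would pick B here; the weighting lets the occasional human check decide how much each judge counts.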

wild quail Apr 10, 2025, 10:39 AM

#

dreamy orchid this is not a bad idea but I am unsure whether it is logistically feasible. Thi...

I think it would be feasible if it's simply another category "vote on dataset" or something like that. The existing arena battle mode could be left unchanged. And if people want to vote on other prompts they can simply switch to that category

dreamy orchid Apr 10, 2025, 4:47 PM

#

yes that yes. Still I think the voters (not users! Rather those that vote) on lmarena aren't that many - in the period between leaderboard updates - so it could dilute the effort. But I like the idea.

#

because for example if I test the search vs the language mode, I don't really use the language mode afterwards. The testing prompts are limited as time is limited

pure compass Apr 11, 2025, 9:04 AM

#

Can someone explain what exactly the recent llama4 controversy is about? Is the 03-26 experimental version closed source and the 17b-128e instruct the one you can download? I hope not because the experimental version is so much better

dreamy orchid Apr 11, 2025, 11:58 AM

#

more or less. The 03-26 is optimized for human benchmarks (lmarena and similar ones, like the internal ones) and the 17b-128 is not.

It could well be that what we saw in lmarena will be released within meta products (whatsapp for example) while the open weights one will stay different (there are very few open source models. Most of them are "only" open weight)

#

could be that the open weights one is the base for 03-26, as 03-26 got additional fine tuning or so

#

time will tell, so far it is speculation

shrewd shuttle Apr 11, 2025, 12:06 PM

#

dreamy orchid more or less. The 03-26 is optimized for human benchmarks (lmarena and similar o...

i thought it was solely optimised for the arena.. i might be mistaken but i feel like there aren't really any similar ones, in terms of collecting human preference from blind battles at scale, and also any 'internal' metrics are kinda pointless by virtue of being irreproducible (though tbf still might be more than just for marketing, like could be done in earnest to shape model development before deployment)..

shrewd shuttle Apr 11, 2025, 12:08 PM

#

dreamy orchid this is not a bad idea but I am unsure whether it is logistically feasible. Thi...

i really like this idea

#

they should do it as some kinda beta side project - it would be interesting to see the divergences in voting patterns (assuming they exist)

dreamy orchid Apr 11, 2025, 12:09 PM

#

shrewd shuttle i thought it was solely optimised for the arena.. i might be mistaken but i feel...

companies, since some time, perform internal human benchmarks telling "which versions would you prefer?". I could imagine meta doing that (for whatsapp and co) . That would be more or less identical to lmarena

shrewd shuttle Apr 11, 2025, 12:10 PM

#

shrewd shuttle they should do it as some kinda beta side project - it would be interesting to s...

fwiw i'm also partial to having like a timer that forces people to wait (and ideally read both responses) before voting. like very often it's literally impossible to reasonably evaluate the quality of 2 responses immediately if they are kinda lengthy
but there'd be a lot of user friction / dissatisfaction.. could see fair arguments against it too

visual warren Apr 11, 2025, 12:11 PM

#

dreamy orchid more or less. The 03-26 is optimized for human benchmarks (lmarena and similar o...

the specifics dont really matter they released llama 4 maverick with the lmarena benchmarks that do not represent the open weighted model. even though they put a footnote, at first glance, people would think its the same. and people are still confused up to now

shrewd shuttle Apr 11, 2025, 12:12 PM

#

yeah it's absurd (and was always going to be an own goal) - dunno what they were thinking

dreamy orchid Apr 11, 2025, 12:22 PM

#

to be totally fair, as I see lmarena so far, it is great to gauge the value of models as "substitute to classic common internet searches". "common" here is key. People say lmarena is ranked according to human preferences but I see it really more as I don't google! model, tell me the answer!.

Thus, as meta provides llama in many apps that are used on the fly with common queries, it is a great benchmark to see if it would satisfy people. That happens also for other companies, like xAI integrated in twitter with likely people asking common queries there too.

Further as a company they don't need to release open weight models, so the idea of a double release is perfect. They get to verify that their model is very usable for their apps (lmarena score); they get praise for their results (blog posts and hype); they still release their models (though not fine tuned) so that the competition doesn't have ready made products from day 1. People will complain about that, but those that complain are the minority, the whatsapp users don't care.
So it is really a sort of win-win for them, not for the community.

Then we need companies like nvidia & co that release the llama derivatives to fine tune them properly.

dreamy orchid Apr 11, 2025, 12:22 PM

#

shrewd shuttle fwiw i'm also partial to having like a timer that forces people to wait (and ide...

I like the idea of a slowdown but I could see people dropping from the site because impatient.

shrewd shuttle Apr 11, 2025, 1:11 PM

#

dreamy orchid to be totally fair, as I see lmarena so far, it is great to gauge the value of m...

have a look at the Prompt Explorer tab - it's surprising how few of the prompts are google-style information requests.. like there's more people asking them how many 'r's are in strawberry than when was the fall of rome ha

#

correction: (after coding) it's mostly people asking for medical advice.. then how many Rs are in strawberry ha

wraith kestrel Apr 11, 2025, 1:58 PM

#

Huh, "connection errored out" while using P2L. Is the model being retrained? 🤔

#

If that's the case, then I'm looking forward for it.

dreamy orchid Apr 12, 2025, 9:45 AM

#

shrewd shuttle correction: (after coding) it's mostly people asking for medical advice.. _then_...

I looked at it already, still they are only a bunch of questions there to see, not all. I believe most questions are simply common ones.

shrewd shuttle Apr 12, 2025, 9:55 AM

#

dreamy orchid I looked at it already, still they are only a bunch of questions there to see, n...

lol what's the opposite of confirmation bias?

#

why would your belief/hunch be more valid than the prompts in the Explorer, in terms of what people ask in the arena? sure, they're not all there, but they're arguably representative of the actual prompts people use in the Arena, at least to some extent.

#

but perhaps most people are really asking questions like "what time does the pharmacy on High Street in Birmingham close on public holidays?", but they're hidden from us for (literally no idea why)

dreamy orchid Apr 12, 2025, 10:16 AM

#

I think if they would pick common questions in the prompt explorer it would make lmarena less good? I know it feels like silly, but when I know a group of people using lmarena and when I see them posing questions they are simply like "I could have googled that". I am guilty of that too. And no, it is not something like "at what time this and that happens" rather it is "could you explain me this concept" or the like.

#

it is completely fine IMO, as an LLM compresses knowledge so why not.

#

Imagine stackoverflow, ELI5 (from reddit), and other similar places put in lmarena.

#

now some ELI5 or stackoverflow questions aren't easy at all, but most are solved by some googling

#

it makes also sense statistically. stackoverflow and other Q&A places have most of those distributions. Relatively easy questions (aka: with some googling they are solved) are common and few are hard. Why should it be different with LLMs ?

#

I mean, as long as those that pose the questions are humans

weary rampart Apr 12, 2025, 11:01 AM

#

dreamy orchid I mean, as long as those that pose the questions are humans

Although the arena is quite obviously used by humans, i think that it still inherently has to be a distribution of somewhat difficult problems, because then people using it are quite frankly on average significantly more invested in topics like cs, ai and other areas where ai is being successfully applied currently (e.g. medical or creative writing). This already shifts the average question away from these really basic questions about when a pharmacy opens.

#

that is also mainly why the puzzles category ranks so high i think

shrewd shuttle Apr 12, 2025, 11:07 AM

#

weary rampart that is also mainly why the puzzles category ranks so high i think

i think that's more a reflection of very crude / poor categorisation of the prompts

#

most classified as 'puzzles' aren't actually puzzles at all #prompt-to-leaderboard message

weary rampart Apr 12, 2025, 11:08 AM

#

shrewd shuttle i think that's more a reflection of very crude / poor categorisation of the prom...

well yeah true, i agree with the assessment, my point was more about the prompts actually not being as simple as things you could just google

#

i also find it interesting that lmarena has yet to really classify these convos in a very holistic way considering the amount of A/B test pairs available (also includes the current P2L models which are also not really good)

#

but maybe i am just underestimating the complexity of doing stuff like that idk

shrewd shuttle Apr 12, 2025, 11:20 AM

#

weary rampart but maybe i am just underestimating the complexity of doing stuff like that idk

perhaps i am too.. i feel the same about it as you describe - seems like low hanging fruit / they're missing a trick

#

i kinda thought they set the classifier up quite early on in the project, and it's handled by like llama1-8b or something old and tiny like that, and while it might've done an 'ok-ish' job back then, now it seems clearly suboptimal / in need of some kind refinement

#

but yeah, perhaps they have been trying to refine it all this time but it's just tricky to get right (but intuitively.. that doesn't seem right to me.. like classification is a pretty rudimentary and well-established task..)

dreamy orchid Apr 12, 2025, 11:38 AM

#

weary rampart Although the arena is quite obviously used by humans, i think that it still inhe...

I partially agree. I agree that lmarena is used mostly by those in IT. But again stackoverflow is not filled with only hard questions. I am not talking about "when shops X opens", rather questions that can be solved with minimal googling (and brain), like "I'd like to make this select request in SQL, can you help?"

#

so even if the audience of lmarena is skewed towards IT, it doesn't necessarily mean that those are hard IT questions.

#

Otherwise if the questions were always quite hard (and in the IT realm), LMarena coding category would be more in line with other coding benchmarks. Again my evidence is based on the normal questions based on Q&A sites (stackoverflow and others)

#

but again that is my opinion, I don't want to convince anyone. It is just that there are too many clues (IMO) that point in that direction.

#

also, as you mentioned, the categorization could be also very loose. Like "coding is anything that has code snippet markup", that could be quite broad.
I asked logic questions where the model used code snippets markup, but that is no coding.
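
A toy sketch of how such a loose rule would misfire. This is purely hypothetical: `naive_is_coding` is an invented rule for illustration, not how lmarena actually categorizes prompts, and both example texts are made up.

```python
FENCE = "`" * 3  # a markdown code fence, built this way so the snippet stays readable

def naive_is_coding(conversation_text):
    """Hypothetical rule: call it 'coding' if any code fence appears."""
    return FENCE in conversation_text

# A logic puzzle answer that happens to use code-snippet markup.
logic_answer = "The liar is Bob.\n" + FENCE + "\nA: truth\nB: lie\n" + FENCE
# A real coding question written as plain text.
sql_question = "How do I write a SELECT with a JOIN?"

print(naive_is_coding(logic_answer), naive_is_coding(sql_question))  # True False
```

The rule gets both cases wrong: the logic puzzle counts as coding and the SQL question does not.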

weary rampart Apr 12, 2025, 12:07 PM

#

shrewd shuttle but yeah, perhaps they have been trying to refine it all this time but it's just...

they did definitely work on improving it, I think they used 70b at first for the classification on the normal lmarena (not sure) and likely had to stick with it considering that changing the model would heavily change the rankings per category as well
but they did work on the arena explorer quite recently: https://blog.lmarena.ai/blog/2025/arena-explorer/ (where they use a different method), although i am unsure why they opted to use the mpnet v2 model for this, because they show that the model has somewhat falsely classified some things in the very same blog.

weary rampart Apr 12, 2025, 12:10 PM

#

dreamy orchid so even if the audience of lmarena is skewed towards IT, it doesn't necessarily ...

very true, however i am working under the assumption that a focus on these areas plus the desire to test the limits of modern ai pushes the questions to be harder on average (in that area at least) than e.g. the average chatgpt request
but obviously i expect very little absolute domain experts in their area to use lm arena in their free time and thus this assumption obviously has its limits

weary rampart Apr 12, 2025, 12:16 PM

#

dreamy orchid also, as you mentioned, the categorization could be also very loose. Like "codin...

which is why i am very interested in the less rigid framework of P2L and i really hope that they keep using their data to improve these models and keep them uptodate

dreamy orchid Apr 12, 2025, 12:17 PM

#

btw I checked the arena explorer, I hadn't in a while, and my points are somewhat confirmed in my view. I checked the larger category and most examples are solved by google + some brain.

I didn't check all the categories because it was enough to find many of them in the most common categories.

Bildschirmfoto_2025-04-12_um_14.16.07.png

#

the other examples either were too hard, like "do it all for me", or too technical - I am not versed in everything to judge well.

dreamy orchid Apr 12, 2025, 12:19 PM

#

weary rampart very true, however i am working under the assumption that a focus on these areas...

in my experience people use it as chatgpt alternative once I shared it. Nothing more. It is also in the screenshot I posted. And that's fine to be fair.
Only it makes lmarena great to say "ok, which chatbot service can substitute some common Q&A websites?"

#

what I would really wish is that for every category they already have (categories could be expanded, but with p2l it is fine anyway) they would make the "hard" subcategory for it. And for hard I don't mean hard prompts, rather "hard questions".

So hard math, hard coding and so on.
I would expect then hard coding to be more in line with aider polyglot and so on.

dreamy orchid Apr 12, 2025, 12:22 PM

#

dreamy orchid btw I checked the arena explorer, I didn't in a while, and my point are somewhat...

I mean the example 5 question from SQL didn't even bother to prompt the question properly. Likely there was a line between the two SQL queries and that's it.

weary rampart Apr 12, 2025, 12:33 PM

#

dreamy orchid in my experience people use it as chatgpt alternative once I shared it. Nothing ...

yeah i generally think that such a thing could really make the arena more interesting as a whole, i honestly don't know what is stopping them.
I mean you could even derive something like Humanity's Last Exam (really specific problems from domain experts) out of these millions of questions.
However, at its core this site is obviously just about human preference; even the coding arena and webdev arena (minus maybe repochat) are heavily centered around human preference.
=> for human preference it is obviously essential to have questions that people actually ask instead of highly selected, artificially created or unrealistic ones when compared to real human interactions with AI assistants

dreamy orchid Apr 12, 2025, 12:36 PM

#

agreed

#

also nice the "lmarena humanity last exam" if one picks the proper questions.

#

though IMO the questions in many benchmark should stay private. As soon as they share them - and if the benchmark is notable - there is a high pressure to optimize against those questions.

For example livebench is nice, but models score 70% while 30% of the questions are private. It feels like a bit more than coincidence.

shrewd shuttle Apr 12, 2025, 12:38 PM

#

dreamy orchid btw I checked the arena explorer, I didn't in a while, and my point are somewhat...

most of those would literally be truncated if they were entered into a google search..

dreamy orchid Apr 12, 2025, 12:38 PM

#

hence I think that open based benchmarks a la lmarena are potentially the best if properly scored.

dreamy orchid Apr 12, 2025, 12:39 PM

#

shrewd shuttle most of those would literally be truncated if they were entered into a google se...

yes don't take things too literally. I thought my meaning was clear. I google about how to connect certificates to IIS servers. Then I google CLI commands and so on.

#

still they aren't hard questions.

shrewd shuttle Apr 12, 2025, 12:39 PM

#

https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03-26-Experimental_battles
hit Next Question again and again - a few are like traditional information requests that would usually be done with google, but they're the outliers, not the norm

shrewd shuttle Apr 12, 2025, 12:40 PM

#

dreamy orchid yes don't take things too literally. I thought my meaning was clear. I google ab...

i'm kinda lost as to what your point is now tbh ha.. i just don't think there's strong evidence that most inputs are done by people who would otherwise be using google searches.. in some cases yes; but not in most

#

most people are just playing around / seeing what they get as the responses in a blind battle

#

they're not actually trying to fix code

dreamy orchid Apr 12, 2025, 12:45 PM

#

Is the spiciness of a hot pepper only perceived or true and physical?
what are the odds of someone in Texas Hold 'Em rivering a Royal Flush while the other player rivers Quad Aces??
I will give a congress talk "On Naevi" -- naevi are benign melanocytic lesions which are markers and every so often also precursors of melanoma. Do you have suggestions for a short and succinct title for my presentation
What does it mean if I have a "proud rooster"?
What is the latest season of Fortnite?
What is an RNN in the field of AI?
Create table of yogurt nutrients versus greek yogurt
generate study plan for IAS exam in marathi
Read this passage from the article:
they were honored at Navy gatherings where new Black U.S. Navy officers expressed their gratitude. "We owe it all to you," they said. "If it hadn't been for you guys, we wouldn't be here."
In this passage, the word gratitude means __________.
a feeling of trust a feeling of hope a feeling of peace a feeling of thanks
My left leg hurts when I'm sleeping and immediately when I wake up. The pain will disappear during most of the day, except when going up and down the stairs. I have touched my leg in multiple places, and there is no specific location that hurts to the touch, although I can feel some strain in my ankles/calves. What is the likely cause of my leg hurting?
The placement and connections between rooms in a building leads to the formation of hallways and corridors, but sometimes there's necessarily a space that's just... not much of anything, and it only exists because of the shape and layout of the building.

What are these not-quite-rooms/not-quite-thoroughfares called?

and so on.
Those surely are useful questions but not necessarily hard ones.
I cannot go on and on.

dreamy orchid Apr 12, 2025, 12:48 PM

#

shrewd shuttle i'm kinda lost as to what your point is now tbh ha.. i just don't think there's...

if that would be true, then lmarena would be the best indicator of intelligence for models, but it is not for a while. That is the strongest clue.

My point is: LMarena is useful, but only to tell which LLM answers best common questions and some hard ones.
Your point - as I understand it - is more "no, most questions are really hard!". But if your point were true, then we wouldn't need livebench, aider polyglot, math bench and so on at all. Claude would be at the top in coding and so on.

I wish lmarena would be the human equivalent of live bench, math bench and so on, but it is not. It has its strengths, but thinking that it is a place for only hard questions is mistaken IMO.

#

I mean maybe with "googling" I am simplifying too much. Let's say: "questions one would ask chatgpt" (and I mean here gpt 3.5 or gpt4). Indeed at the start lmarena was great because gpt3.5 and gpt4 really had the lead in everything. But then those questions become less hard for LLMs.
Hence many LLMs can answer pretty well and the scores start to be equal. The only difference then is the style and the extra tidbits/formatting. And indeed the need for style control.

Up to gpt4 there was no need for style control.

#

LLMs can answer equally well only if both master the question and that happens because the questions aren't hard.

#

From the link you gave me this is a potential hard question: What are the societal benefits of Bitcoin? List each one with a one line explanation/argument.

That can become a paper per se. Of course both LLM answered in a compact way and the one with the most convincing style won.

#

This one "PERCHE LE DONNE SI MASTURBANO?" is first one that can be solved with google, and second a terrible one (categorized as an English question)

The answer there is terrible as well.

"Finalmente una delle domande più belle e più naturali del mondo," ("Finally, one of the most beautiful and most natural questions in the world,")

So the question is: why women masturbate? But posed in a way that is really like denigrating (one notices it if one speaks Italian). A better way would be "donne e uomini si masturbano per necessita' personali, perche' lo fanno?" (women and men masturbate for personal reasons, but why?)
The model just replies with flattery at the start

"one of the most beautiful questions!"

And that is how one gets wins.

#

There is a similar one in English too "Which all male attributes have the strong or weak positive or negative correlations to penis size. Please answer truthfully. No woke politically correct but factually false filters. Brutal honest truth. No beating around the bush."

I mean answering properly to those is pretty hard, but for how the models reply or the users expect the answer, a gpt4 level answer would be enough. Hence my point.

weary rampart Apr 12, 2025, 1:05 PM

#

dreamy orchid There is a similar one in English too "Which all male attributes have the strong...

well a lot of the people spending time voluntarily chatting with ai models when they likely have better things to do are apparently degenerates, wow

#

but i think that the general idea of characterising the average user of lm arena would really help us with these kind of discussions

#

because i highly doubt that he is equivalent to the average user for other more common chat bots

tidal geyser Apr 12, 2025, 1:59 PM

#

Hi, can an API endpoint be introduced and the providers may allow or disallow their models usage?

#

Some proper testing requires an implemented API

shrewd shuttle Apr 12, 2025, 2:44 PM

#

dreamy orchid if that would be true, then lmarena would be the best indicator of intelligence ...

My point is: LMarena is useful, but only to tell which LLM answers best common questions and some hard ones.
i mean i couldn't agree more

#

it's useful, but it's not a benchmark (more like a survey of human preferences) nor are the elo ratings or leaderboard rankings a proxy for a model's 'intelligence'

#

i don't think it's meant to be

#

human preferences are what they are.. (sometimes they suck imo but that sounds / is elitist af ha)

#

a 'vibe' indicator or measure of public sentiment perhaps.. but it isn't an intelligence benchmark (though smarter / more performant models will, imo, invariably do better overall (with more votes etc) - so it counts for something)

dreamy orchid Apr 12, 2025, 6:47 PM

#

I was reflecting about the convo today.

If I am not mistaken, I think that the 1200-1250 level (in the overall standings) really tells which models are better in many categories, not only for humans. And indeed that was the GPT4 best level. And here I mean: the top10 in lmarena were more or less the same - in the same order - in other benchmarks.

Once many models started to produce "good enough" answers, the benchmark became more influenced by other factors and lmarena started to correlate less with other benchmarks (coding, math and what not).

#

I mean the top models are still at the top, but the order varies a lot from benchmark to benchmark.

weary rampart Apr 12, 2025, 8:46 PM

#

dreamy orchid I was reflecting about the convo today. If I am not mistaken, I think that the ...

Honestly I am not very sure about that correlation

#

But should be easy enough to check with a Bit of Code

#

Might do that tomorrow
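
For anyone who wants to try the same check, here is a minimal sketch of a rank correlation (Spearman) between arena scores and another benchmark, in plain Python. The numbers in the usage example are made-up placeholders, not real leaderboard data, and the function names are just for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values only (hypothetical models, NOT real scores).
arena_scores = [1300, 1250, 1200, 1150, 1100]
benchmark_scores = [70, 72, 60, 55, 40]
print(spearman(arena_scores, benchmark_scores))
```

A value near 1 would mean the two leaderboards order the models almost identically; comparing it for the top 10 versus the full list would show whether the ordering gets noisier at the top.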

dreamy orchid Apr 13, 2025, 11:48 AM

#

example of something where users vote on the same prompt more or less. Not bad: https://mcbench.ai/

MC-Bench

Evaluating AI with Minecraft

weary rampart Apr 13, 2025, 12:20 PM

#

dreamy orchid example of something where users vote on the same prompt more or less. Not bad: ...

Well I think the best example for why one should really be wary of human preference benchmarks where the user is not writing the prompt on their own is that there is a significant difference in the rankings of image generation models by artificial analysis and lmarena, with the only difference between the two (as far as i know) being that artificial analysis uses predefined prompts and lmarena does not. Thus I can at least conclude that the results of both methods will differ, with the lmarena approach likely being more holistic.

weary rampart Apr 13, 2025, 1:50 PM

#

weary rampart Might do that tomorrow

LiveBench talked about the correlation a bit in their paper, and it seems pretty legit and all
https://livebench.ai/livebench.pdf

#

this is what i got

#

and some other stuff, but still working on the repo a bit

dreamy orchid Apr 13, 2025, 7:37 PM

#

nice, it would be cool to put it into github for everyone to see. Could you make the first graph (the others seem less relevant) for the categories and/or the style control too?

weary rampart Apr 13, 2025, 8:00 PM

#

dreamy orchid nice, it would be cool to put it into github for everyone to see. Could you make...

might do that tomorrow. but that is also when my classes start again, so might not have a lot of time.
these other graphs might be interesting though:
(the one for param size is way more accurate and the other one shows that the correlation greatly differs between model families, with especially the phi and qwen family being outliers)

wraith kestrel Apr 14, 2025, 6:11 AM

#

Smaller Gemma 3s are also being tested. Nice!

Can we expect Llama Scout to join the Arena as well? 👀

dreamy orchid Apr 14, 2025, 8:47 AM

#

the ones about parameter sizes aren't that informative. I mean there is a trend, but it is a bit all over the place.

#

and yes no stress with the code. It can happen when one has time

visual warren Apr 15, 2025, 10:03 AM

#

Add new Kling model to text-to-image - KOLORS 2.0

weary rampart Apr 16, 2025, 7:28 AM

#

Might also be interesting to not just directly use the blended price for the comparison but to also have the option to use the average token usage (in the arena for the specified category) * the price.

#

That could also be really helpful to 'combat' these models that use very high TTC in the response to enhance perceived quality (e.g. llama 4 maverick special chat version).
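
A rough sketch of that idea, assuming prices are quoted per million tokens and that average token counts per category are available. All names and numbers below are hypothetical placeholders, not real prices or real arena statistics.

```python
def cost_per_response(avg_input_tokens, avg_output_tokens,
                      input_price_per_mtok, output_price_per_mtok):
    """USD cost of one average response, given per-million-token prices."""
    total = (avg_input_tokens * input_price_per_mtok
             + avg_output_tokens * output_price_per_mtok)
    return total / 1_000_000


# Hypothetical models: "terse" is pricier per token but writes short answers,
# "verbose" is cheaper per token but writes five times as much.
terse = cost_per_response(200, 400, input_price_per_mtok=3.0,
                          output_price_per_mtok=15.0)
verbose = cost_per_response(200, 2000, input_price_per_mtok=1.0,
                            output_price_per_mtok=4.0)
print(f"terse: ${terse:.4f}  verbose: ${verbose:.4f}")
```

With these placeholder numbers the model with the lower per-token price ends up costing more per answer, which is the effect a plain blended price would hide.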

visual warren Apr 16, 2025, 3:42 PM

#

this time when o3 launches do 2 separate models for both families when putting it on the arena - the differences in performance with reasoning effort have historically been quite large

o3
o3-high
o4-mini
o4-mini-high

frigid pine Apr 16, 2025, 3:43 PM

#

hig jay

visual warren Apr 16, 2025, 3:43 PM

#

damnit

frigid pine Apr 16, 2025, 3:43 PM

#

L

visual warren Apr 16, 2025, 3:45 PM

#

you're lucky you're far away 🙄

frigid pine Apr 16, 2025, 3:50 PM

#

visual warren you're lucky you're far away 🙄

uh oh

#

how come?

visual warren Apr 16, 2025, 3:51 PM

#

ybdevious

frigid pine Apr 16, 2025, 3:51 PM

#

what is this man planning

visual warren Apr 16, 2025, 3:52 PM

#

wouldn't youuu like to know weatherboy

frigid pine Apr 16, 2025, 3:53 PM

#

visual warren wouldn't youuu like to know weatherboy

hmmmmmm

#

i would actually

visual warren Apr 16, 2025, 3:54 PM

#

that would spoil the surprise!

frigid pine Apr 16, 2025, 3:54 PM

#

3:<

visual warren Apr 16, 2025, 3:54 PM

#

it being that direction feels wrong

frigid pine Apr 16, 2025, 3:56 PM

#

yeah, but it's the only way to make a colon three frown

#

well

visual warren Apr 16, 2025, 3:56 PM

#

:3

#

is it?

frigid pine Apr 16, 2025, 3:56 PM

#

:Ɛ

frigid pine Apr 16, 2025, 3:56 PM

#

visual warren >:3

well, that's a smiley

visual warren Apr 16, 2025, 3:56 PM

#

ohhhh right

#

lmfao

frigid pine Apr 16, 2025, 3:56 PM

#

frigid pine >:Ɛ

epsilon as a 3 is kinda cursed doe

visual warren Apr 16, 2025, 3:57 PM

#

true

visual warren Apr 16, 2025, 5:10 PM

#

add o3, o3-high, o4-mini, o4-mini-high tf_kek

visual warren Apr 16, 2025, 6:54 PM

#

visual warren add o3, o3-high, o4-mini, o4-mini-high tf_kek

both here now, just need the high variants!

frigid pine Apr 16, 2025, 7:19 PM

#

o3-high seems a lil' unlikely lol

visual warren Apr 16, 2025, 7:29 PM

#

😔

#

they did o3-mini-high so hopefully we get o4-mini-high too

visual warren Apr 16, 2025, 8:34 PM

#

also add o3 to the vision arena

#

nvm seems to be there now :)

wanton star Apr 16, 2025, 9:02 PM

#

visual warren nvm seems to be there now :)

and in alpha ui too

grizzled hamlet Apr 16, 2025, 11:02 PM

#

visual warren this time when o3 launches do 2 separate models for both families when putting i...

where can I get o3-high???

#

I am just reading about that

#

also, o4 is going to be insane when it fully comes out

hushed crest Apr 17, 2025, 11:49 AM

#

The 2.5 PRO is crashing every time I encounter it. The tasks take ~3 to 5 minutes. Is it a timeout issue?

#

Same on the direct chat

ocean sky Apr 18, 2025, 7:18 AM

#

I'll repeat here what I said in #leaderboards

I think style control is a very important feature, and if it was on by default, the llama 4 controversy would be much weaker. At the same time, there is still a 48 Elo difference between the two llama 4 versions that arguably differ only in style, so it is worth thinking about which additional features can make style control better
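
To put a 48-point gap in perspective, here is a small sketch that turns a rating difference into an expected win rate. It assumes the leaderboard scores follow the usual Elo convention (base 10, scale 400); that convention is an assumption here, not something stated in this thread.

```python
def elo_win_probability(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated model under the standard Elo formula.

    rating_gap is (rating of model A) - (rating of model B).
    The base-10 / scale-400 convention is assumed, not confirmed by lmarena.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


print(f"{elo_win_probability(48):.3f}")
```

Under that assumption, a 48-point gap works out to the higher-rated version winning about 57% of head-to-head votes (ties aside).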

agile flume Apr 18, 2025, 8:49 AM

#

hey @ocean sky we are working on an improved version of style control to include sentiment features. initial result looks very interesting. we will share more with community soon

dreamy orchid Apr 18, 2025, 11:09 AM

#

ocean sky I'll repeat here what I said in #leaderboards I think style control i...

I don't like the style control because we are chatting with the LLMs, we are not making api calls.

And indeed the tweaked llama version will likely be great for the average user of whatsapp & co.

If you see LMarena for "which LLM would be best for the average user question that an AI assistant gets?" it makes much more sense.
It is the same reason why claude is nowhere near the top5 while in webdevarena it destroys everyone.

In this perspective, the arena is fine. I personally check a mix of categories like hard prompts category and longer query . A bit less coding to be fair because coding is more webdevarena (or there it is more appropriate to ask for api calls)

#

For coding actually I prefer this: https://openrouter.ai/rankings/programming?view=month where people vote with their $$ if necessary.

#

so yeah, lmarena is good but having a mix of benchmark to check is better.

tidal geyser Apr 18, 2025, 8:59 PM

#

Please add geographical understanding to lmarena. I want to play geoguessr with the assistants

untold kiln Apr 20, 2025, 12:18 AM

#

Can we get a better mechanism to temporarily disable models that return nothing? I get Claybrook on every battle in WebArena, and it takes 5 mins to wait for an empty output that results in neither a satisfying comparison nor a meaningful vote.

pure compass Apr 20, 2025, 1:48 AM

#

ocean sky I'll repeat here what I said in #leaderboards I think style control i...

The two llama 4 versions have a huge difference in the type and length of the answer, they don't differ that much on style. Or are we talking about different things? I compare llama-4-maverick-03-26-experimental with llama-4-maverick-17b-128e-instruct, and the experimental version is much better than the instruct version

pure compass Apr 20, 2025, 2:11 AM

#

I only hope that version will also get the weights released, not that I have the hardware to run it.

dreamy orchid Apr 20, 2025, 9:43 AM

#

untold kiln Can we get a better mechanism to temporarily disable models that return nothing?...

yes the new models (often broken at the start) are too aggressively matched against everything. They should dilute the matching from time to time as most existing models also need votes.

#

if one checks the battle count heatmap (battles ended without ties) there are way too few comparisons, given that every human judge judges differently.

ocean sky Apr 21, 2025, 9:15 AM

#

dreamy orchid I don't like the style control because we are chatting with the LLMs, we are not...

Well, if you're interested in "which LLM would be best for the average (by number of queries) user of lmarena.ai question that an AI assistant gets?" then indeed, style control is of no use for you. However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions; I couldn't care less about the number of bullet points or emojis included in the answer. Unfortunately it turns out number of bullet points and emojis does skew the votes even if the content of the answer is the same.

I view the style-controlled leaderboard as an evaluation of the content of the answer, disregarding the format of the answer. This is a bit simplistic since you can deliver the same content in a way that is more or less accessible, and sometimes the style is an essential part of the evaluation. Still, the point stands: the finetuning that made the llama yap like crazy shouldn't affect the style-controlled leaderboard. Moreover, since style control uses relatively simple features, it just prevents the most obvious ways of climbing the leaderboard, but does not really punish different "styles".

Finally, as my personal opinion, the attempt to maximize the non-style-controlled arena score (since it's the default) makes llms shittier. I don't want that to happen, and an easy way to fix that is to make style control the default. The non-style-control option will still be accessible using the checkbox.

pure compass Apr 21, 2025, 3:46 PM

#

But it is important to make sure the style control does not overcompensate, because I think there is a positive correlation between the quality of the answer and the style.

short scarab Apr 21, 2025, 3:59 PM

#

Add https://team.doubao.com/en/

Doubao Team - Crafting the industry's most advanced LLMs.

ByteDance Doubao Team is dedicated to crafting the industry's most advanced LLMs. We aim to lead global research and foster both technological and social progress.With a long-term vision and a strong commitment to the AI field, the Team conducts research in a range of areas including natural language processing (NLP), computer vision (CV), and s...

#

Doubao LLMs and image generation

sand breach Apr 21, 2025, 4:37 PM

#

how do you print a conversation?
at least in the browsers i tried, larger textboxes will be cropped. i solved this with a bookmarklet (= js code you can put in a bookmark)

javascript:document.querySelectorAll('#chatbot').forEach(el%20=>%20el.style.height%20=%20'auto');

i tested this on firefox, other browsers may restrict bookmarklets due to security reasons, but theres usually a setting to allow it.
is there any other solution you guys use?
if not, might i suggest adding a button to switch to a printable view?

sand breach Apr 21, 2025, 4:53 PM

#

i just learned there is a new ui coming, but i assume the same effect can be achieved there. just need to figure out the proper selector...

dreamy orchid Apr 21, 2025, 5:20 PM

#

ocean sky Well, if you're interested in "which LLM would be best for the average (*by numb...

" However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions"

yes but the problem is that it is not an automatic test, where you can adjust the parameters. You cannot force people to vote how you like (that would be biased too) and from that you cannot force for everyone a ranking only because it is best for you. That is a bit too "it has to work for me, not for everyone else".

For that type of benchmark I guess one should build another version of the benchmark. Because a counterpoint to your assessment is: if two models expose exactly the same content, but one in $nice_font and the other in an $illegible_font, they should get the same score. Not at all.
Same for information that is consumed by a pair of eyes and not another machine: formatting counts a lot.

Hence instead of showing the same forced ranking for everyone - ranking that could be also faulty a bit (I am not sure how much style control really captures the "content only scores") - I'd rather really focus on a different benchmark.

lmarena could have all the formatting extras while the "new" benchmark has only pure plain text (and even there one can format things nicely).

I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others (or with lmb by @wide edge you can save a bookmark with style control activated)

This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena.

#

Hence the default score is the most representative. Third, I really use other categories and I use bookmarks for them, that's enough for me.

dreamy orchid Apr 21, 2025, 5:38 PM

#

I think lmarena delivers the best combo: quality of the answer + ease of reading (format). openrouter rankings tells us mostly what is best for coding (given the price). LiveBench , mathbench and lmarena categories taken as a whole tells us which model can do best for STEM questions.

ocean sky Apr 21, 2025, 5:40 PM

#

dreamy orchid " However, for me, the arena leaderboard is a good proxy for evaluation of answe...

"This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena."
It's not only me; many people working on LLM benchmarks agree. If everyone were ok with LLM devs putting in work to benchmaxx and generate the most beautiful slop that gets upvoted, the style control feature wouldn't be here. But it is, and for obvious reasons, part of which I already listed; all I'm saying is that it's not enough, and since the devs are working on improving it, I believe I'm not the only one who thinks so.

"I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others"
I think I made it pretty clear: The default score is the one optimized, and the non-style-controlled score is easily optimized by more yapping and slop, and making less slop is an excellent reason to make the change. Of course, if you like whatever default score optimization leads to, you'd oppose this change. I didn't try to convince anyone that it's bad; I'm convinced it is (and I'm not the only one), so I'm proposing a sensible solution for those who believe it is a problem. I'd be happy to argue why if that would help to decide.

dreamy orchid Apr 21, 2025, 5:43 PM

#

I think the slop many mention is actually liked by many end users. By end users I mean those that use, for example, copilot integrated everywhere, llama integrated everywhere, grok and so on.

So it is about what we want to measure. For the end user (that are the vast majority of internet users) I think lmarena is really representative.

I get your point, you want like a sort of XKCD 810 but for LLMs, and that would be nice too. I still think it should be a different benchmark. Because if the AI labs benchmaxx for style control, they can make a lot of end users less happy (emoji and co)

#

but anyway, I make do with what is there. I think if there would be such a giant push for style control, there would be already another benchmark. lmarena is not new and they don't have the monopoly on benchmarking either.

errant musk Apr 21, 2025, 9:03 PM

#

One of the models frequently not showing anything

nova ledge Apr 22, 2025, 5:13 AM

#

claybrook stopped working in webdev arena.

visual warren Apr 22, 2025, 2:45 PM

#

maybe admins could add features for "premium users", like file upload (.txt .js .php ....)?
i'm ready to pay for the service!

wraith kestrel Apr 22, 2025, 6:07 PM

#

Read the Sentiment Control article, and I gotta say this is the right direction to go.

Gemini 2.0 Flash is being used for sentiment classification. But I wonder is it really both cheap and accurate enough to run vs an open weight model with similar performance if there's any? And will it be used for prompt classification (Hard, Creative, etc.) too, for consistency's sake? 🤔

dreamy orchid Apr 22, 2025, 6:23 PM

#

I also like the correlation with the performance, also for headers, length and so on. In that way it is much better to "correct" the score rather than ignoring the battles.

#

also lmarena is actually a sort of social experiment too, not only a bench for LLM. People like being flattered

pure compass Apr 23, 2025, 12:18 PM

#

Again mentioning you really need to fix your content moderation system when it comes to images. Or can anyone explain what's wrong with this image? https://civitai.com/images/67890084 I tried cropping the arm away in case it is too much exposed skin (lol) but still content warning. This is getting ridiculous. Is it smiling suggestively or what is the problem?

#

At least tell us what exactly caused the flagging; maybe we can help you fix it when we know what triggered the false positives

zealous junco Apr 23, 2025, 12:51 PM

#

I wanted to add my own llm
And make it available on arena playground
How can I do that?

woven moat Apr 23, 2025, 9:07 PM

#

gemini 2.5 pro experimental keeps having its answers cut off. another friend is also reporting this issue

#

maybe there's some character it's returning that's being interpreted as an end of message token?

jolly ore Apr 24, 2025, 2:08 AM

#

Can you guys add Claude web search

#

And the other chatgpt web searches aside from gpt 4 Omni or whatever is called

whole shadow Apr 25, 2025, 5:52 AM

#

the o3 model in lmarena is really weak or not the ai at all, i tested it with the research math question from here 5 times: https://openai.com/index/introducing-o3-and-o4-mini/ and it failed to give the correct answer all 5 times, it even took around 3 minutes of thinking instead of the 55 secs in the example.

dreamy orchid Apr 25, 2025, 6:05 PM

#

whole shadow the o3 model in lmarena is really weak or not the ai at all, i tested it with th...

good idea!

wide edge Apr 25, 2025, 10:48 PM

#

whole shadow the o3 model in lmarena is really weak or not the ai at all, i tested it with th...

the version in lmarena represents the api, which doesn't natively have python access

whole shadow Apr 26, 2025, 2:02 AM

#

wide edge the version in lmarena represents the api, which doesn't natively have python ac...

oh ok, thanks

hushed crest Apr 26, 2025, 7:23 AM

#

@agile flume Could you or someone from lmarena team make a questions and answers webinar?

short scarab Apr 27, 2025, 9:53 PM

#

Can the new GROK 3 (not early) be added, plus its vision, Grok Aurora, reasoning, and search capabilities?

#

as well as this, can you add Doubao 1.5 Pro, 1.5, and roleplaying?

#

Doubao is extremely underrated

#

Can't wait to try it out

#

Along with seeddream models

torpid drift Apr 27, 2025, 10:45 PM

#

Is there someone I can talk to about search arena? We found some issues, would love to talk to whoever is involved

strong slate Apr 28, 2025, 3:43 AM

#

torpid drift Is there someone I can talk to about search arena? We found some issues, would l...

You can always leave feedback here. There’s also email: lmarena.ai@gmail.com

frigid pine Apr 28, 2025, 5:12 PM

#

is there any reason kimi isn't on lmarena? not sure what the policy is for adding new models/companies

dreamy orchid Apr 28, 2025, 9:16 PM

#

short scarab Can the new GROK 3 (not early), plus it's vision, Grok Aurora, reasoning, and se...

most likely if the API is there and the vendor is willing to provide API credits, it will be included in the arena. otherwise it is a 💵 problem.

vivid geyser Apr 29, 2025, 1:13 AM

#

Hi, Diffbot has just dropped weights on HF for a new search arena LLM that implements the first o3-style interleaved function calling in an open source model. Would love to see more open-source competition as it is all proprietary models in search arena at the moment!
How do we get included in the arena? We have an API hosted version as well and can provide free credits.

fallow violet Apr 29, 2025, 4:59 AM

#

@agile flume Hi, I am wondering if Qwen 3 models be added to the Arena in the near future? Thanks!

wraith kestrel Apr 29, 2025, 5:18 AM

#

Oh, Qwen 3 series!

#

95.6 on Arena Hard Auto '24.

#

Wonder how it will perform in the actual Arena.

indigo kernel Apr 29, 2025, 3:41 PM

#

Hey @agile flume
We would love to see Linkup model available in 🌐 search arena!! Currently state-of-the art perf on Simple QA (https://www.linkup.so/blog/linkup-establishes-sota-performance-on-simpleqa). (Full disclosure, I am a co-founder)

strong slate Apr 29, 2025, 4:56 PM

#

fallow violet @agile flume Hi, I am wondering if Qwen 3 models be added to the Arena ...

on both lmarena.ai and beta.lmarena.ai today!

wraith kestrel Apr 30, 2025, 1:58 AM

#

What are the chances of getting Gemini 2.0 Flash Image Gen into the Image Arena? 🤔

vapid kraken Apr 30, 2025, 6:17 AM

#

This paper just got published by some AI researchers on the unfair practices and lack of transparency by Chatbot Arena. Do the lmarena folks have an answer to these? The community should know. https://arxiv.org/abs/2504.20879

arXiv.org

The Leaderboard Illusion

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted...

swift cloak Apr 30, 2025, 1:22 PM

#

"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?

brave yoke Apr 30, 2025, 2:02 PM

#

the paper presents evidence showing the biases in practices towards a handful of preferred providers, but it does not cover an equally concerning bias against open-source models and small independent developers as can be seen by the many messages in this channel above asking for transparency on how to submit models. I doubt they ignore the requests from Meta and Google in the same way since they accepted 27 private variants just from Meta alone leading up to llama 4

dreamy orchid Apr 30, 2025, 2:48 PM

#

vapid kraken This paper just got published by some AI researchers on the unfair practices and...

one problem one can easily see is when new models are there, cloaked, they get aggressively matched on new questions. That is good for PR as the models will be easily visible on the rankings within a week, but it is not so good in general because the feedback is vast and model providers can tune their model.

If the cloaked models would be picked every now and then (like all others), then it would be harder to adjust the model and the provider has either to wait (difficult via market pressure) or publish the model as is.

I think slowing down the matching with cloaked models can already help a bit. Then again for the problem "yeah but why Claude 3.5 from Oct 2024 was not #1 in coding?", that is the usual point: API calls (like with inline suggestions with an IDE) and human conversations are different, hence claude didn't win. For api calls one can check openrouter

wispy patrol Apr 30, 2025, 3:07 PM

#

dreamy orchid one problem one can easily see is when new models are there, cloaked, they get a...

Slowing down cloaked model exposure makes sense — it levels the playing field and prevents fast overfitting based on immediate feedback. If models were matched more gradually, they'd need to be robust from the start, not just quickly optimized.

dreamy orchid Apr 30, 2025, 3:33 PM

#

exactly. And if they are under pressure to publish, then they would publish it ahead of lmarena scores anyway, so people would have already experience with them (via openrouter and what not) to compare the behavior.

strong slate Apr 30, 2025, 5:13 PM

#

vapid kraken This paper just got published by some AI researchers on the unfair practices and...

a statement was shared here: https://x.com/lmarena_ai/status/1917492084359192890

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

Thanks for the authors’ feedback, we’re always looking to improve the platform!

If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the

vapid kraken Apr 30, 2025, 5:36 PM

#

Karpathy, accomplished AI researcher, shared his thoughts in a tweet. Honestly folks, I am done with Arena as a model builder. Was an admirer of the many fresh ideas chatbot arena brought over the last two years and respect the academic work involved, but this unfairness and opaqueness and being secretly in bed with the big powerful AI closed labs is honestly heartbreaking and absolutely terrible for the community. Esp for an academic project coming from such an established Berkeley lab.... I think lmarena is done and dusted for me and, as far as I know, for several other researchers and builders of late. Time to move on to other mechanisms like Karpathy writes and other various platforms for evals and rankings. Thanks for all the work, but we as a community deserve much better. https://x.com/karpathy/status/1917546757929722115

Andrej Karpathy (@karpathy) on X

There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion"
https://t.co/LfjIII71qX

I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

light apex Apr 30, 2025, 9:22 PM

#

Is there any option in "parameters" to activate "reasoning high" for o3 and o4-mini? I would like to test these llms with high reasoning effort.

velvet night Apr 30, 2025, 11:06 PM

#

really wish o1-pro was added

dreamy orchid May 1, 2025, 11:40 AM

#

vapid kraken Karpathy, accomplished AI researcher, shared his thoughts in a tweet. Honestly f...

" but this unfairness and opaqueness"

could you mention any other notable benchmark that is less opaque? Thank you.

dreamy orchid May 1, 2025, 11:41 AM

#

light apex Is there any option in "parameters" to activate "reasoning high" for o3 and o4-m...

there are multiple versions: o3-mini and o3-mini-high.

glossy meadow May 1, 2025, 3:55 PM

#

The weight I put on chatbot arena has gone very low after the llama event and the fact every new model seems to benchmark hack their way to the top.

https://artificialanalysis.ai/ feels much more objective at this point

AI Model & API Providers Analysis | Artificial Analysis

Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.

dreamy orchid May 1, 2025, 4:25 PM

#

Artificial analysis is simply a collection of benchmarks "Intelligence Index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500"

The problem there is that one doesn't know if those benchmarks are "benchmaxxed" as well (data in the training set)

#

furthermore, the artificial analysis score also seems unclear. R1 small distills still do better than Claude 3.7 (no thinking) or close to Gemini 2.0 Pro thinking (the one from January). That seems unlikely.

weary rampart May 1, 2025, 4:42 PM

#

dreamy orchid further artificial analysis score seems also unclear. R1 small distills still do...

U mean 2.0 flash thinking?

weary rampart May 1, 2025, 4:43 PM

#

dreamy orchid Artifical analysis is simply a collection of benchmarks "Intelligence Index inco...

But otherwise I could not agree more 👍, their benchmark selection and the weight for all of them seem rather arbitrary as well

dreamy orchid May 1, 2025, 4:45 PM

#

weary rampart U mean 2.0 flash thinking?

right. The one before the march/april release.

weary rampart May 1, 2025, 4:54 PM

#

Man I was just confused thinking I missed the release of 2 pro thinking or something. lol

dreamy orchid May 1, 2025, 5:15 PM

#

yes I was going from memory. It was the first thinking model from google though.

#

I think in the arena the name was "gemini-2.0-flash-thinking-exp 01-21"

light apex May 1, 2025, 5:54 PM

#

dreamy orchid there are multiple version. o3-mini and o3-mini-high.

I'm talking about the large o3 model. On the https://lmarena.ai/ website you can use "o3" or "o4-mini", that's ok, but I guess this is with Reasoning Effort = Medium. I would like there to be an option to select Reasoning Effort = High.

dreamy orchid May 1, 2025, 8:32 PM

#

ah I see, they likely will come later (as with o1 and o3 mini)

#

the oX versions were all tested with medium at first IIRC

rose robin May 2, 2025, 1:36 PM

#

dreamy orchid I think in the arena the name was "gemini-2.0-flash-thinking-exp 01-21"

There was another one in December (maybe the same one but updated in January)

frigid pine May 2, 2025, 4:09 PM

#

light apex I'm talking about the large o3 model. On the https://lmarena.ai/ website you can...

1 gorbillion dollars in API costs:

#

more seriously: o3 high in direct chat seems very unlikely, o4-mini-high is definitely possible but not currently implemented

#

if they do choose to add the latter, it'll likely be listed as a separate model

whole snow May 2, 2025, 6:48 PM

#

Greetings. I found a little bit of an "issue", so to speak, that is a little bit frustrating to me.

#

Whenever I do the arena (battle), I can always tell when one of the LLMs is based on Claude, due to the shortness of the answers, and I worry that it would invalidate my tests.

#

Do you have any suggestions on how I can adjust my prompts so that it isn't as obvious?

strong slate May 2, 2025, 7:04 PM

#

whole snow Whenever I do the arena (battle), I can always tell when one of the LLMs is base...

Style definitely impacts responses and voting, but as long as the model has not revealed itself in the answer, your vote is not invalidated. There are even filters for the leaderboards around Style Control which you can read about here: https://blog.lmarena.ai/blog/2024/style-control/

whole snow May 2, 2025, 7:06 PM

#

All right. Thank you. I always assumed that since I could tell the model due to its length that that was a form of revealing itself. I appreciate the answer, and I will read that.

#

I will keep on experimenting and judging. I have been having a lot of fun with it, seeing how each model "thinks" differently.

strong slate May 2, 2025, 7:24 PM

#

whole snow I will keep on experimenting and judging. I have been having a lot of fun with i...

don't forget we have a beta UI live at beta.lmarena.ai as well - would love to hear feedback in #new-ui-feedback if you have any to share!

whole snow May 2, 2025, 7:27 PM

#

I've played around with it a little. Not enough to have a reaction to it yet, though. I will do a little more playing with it today at work if I have some downtime.

light apex May 2, 2025, 9:08 PM

#

frigid pine if they do choose to add the latter, it'll likely be listed as a separate model

Yes, I get it. Hopefully they will implement o4-mini-high.

wraith kestrel May 3, 2025, 3:25 AM

#

Granite 4 just released a public preview.
https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek

IBM Granite 4.0 Tiny Preview: A sneak peek at the next generation o...

We’re releasing IBM Granite 4.0 Tiny Preview, a partially trained version of the smallest model in the upcoming Granite 4.0 family of language models, to the open source community.

#

Also the 3.2 and 3.3 are under our radar apparently
https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a
https://huggingface.co/collections/ibm-granite/granite-33-language-models-67f65d0cca24bcbd1d3a08e3

Granite 3.2 Language Models - a ibm-granite Collection

Granite 3.3 Language Models - a ibm-granite Collection

wide edge May 3, 2025, 3:50 AM

#

Granite-4-Tiny-Preview is a 7B parameter fine-grained hybrid mixture-of-experts (MoE) instruct model finetuned from Granite-4.0-Tiny-Base-Preview using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, and model alignment using reinforcement learning.

buoyant wraith May 3, 2025, 11:28 PM

#

Are there any plans to host deepseek-r1t-chimera? It's been in the top 10 trending models for the past week on Hugging Face and seems to have received a lot of traction: https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
The consensus on reddit seems to be that it answers at least as well as r1 but with nicer thinking traces

tngtech/DeepSeek-R1T-Chimera · Hugging Face

frigid pine May 5, 2025, 3:39 PM

#

not sure if this has been mentioned before, but the suggestions below the web arena "reset" every time the "Generate me a UI for..." prompt field is updated

short scarab May 6, 2025, 9:14 PM

#

hardy halo Or allow other people to vote on them as I said earlier

I’m actually a huge fan of this idea

#

The random icon should be for those

#

Especially as we get more expensive models on the Arena, all of the wasted money added up would be a huge amount

pure compass May 7, 2025, 9:19 AM

#

I think I asked it before but is it clear by now that the weights of llama-4-maverick-03-26-experimental will never be released? Or is there still a chance? Or are they already and I completely missed it? (Not that I have the hardware to run it)

dreamy orchid May 7, 2025, 12:40 PM

#

you can ask meta that question. My guess is that they keep it for themselves, they don't owe it to the community.

Btw llama-4-maverick-03-26-experimental is back and is winning already also in my case.

echo furnace May 7, 2025, 4:58 PM

#

Hey all, I work on the IBM Granite team, and it seems none of our models are hosted on the arena.
https://huggingface.co/ibm-granite
Any chance someone can tell me where I need to put a PR in to add it? Or any direction on how to get involved?

ibm-granite (IBM Granite)

dreamy orchid May 7, 2025, 5:21 PM

#

there are two but not the other ones (3.2, 3.3, 4 - at least those announced in reddit locallama)

pearl garnet May 7, 2025, 6:34 PM

#

echo furnace Hey all, I work on the IBM Granite team, and it seems none of our models are hos...

Thanks for sharing interest! We do our best to test as many models as our capacity allows. We're unable to share if or when we'd be adding new models, including requests like this. However, these requests are being noted down and we are monitoring the community for signal as to what to prioritize.

echo furnace May 7, 2025, 7:00 PM

#

pearl garnet Thanks for sharing interest! We do our best to test as many models as our capaci...

Wonderful, thank you! If you need some help with capacity issues, I might be able to help there, too...

shy flint May 7, 2025, 7:01 PM

#

Hello @pearl garnet, I run an AI search startup that processes millions of searches with high quality outputs (especially with reasoning/DeepSearch, which rivals Perplexity/Gemini Deep Research), and, I was wondering if it would be possible to add it to the Search Arena. Can you DM me about this? Thank you, Paul

pearl garnet May 7, 2025, 7:26 PM

#

echo furnace Wonderful, thank you! If you need some help with capacity issues, I might be abl...

Sounds good! I'll be keeping track of these requests. I'd recommend remaining in this server in case there are follow-up questions.

pearl garnet May 7, 2025, 7:28 PM

#

shy flint Hello @pearl garnet, I run an AI search startup that processes millions ...

Hey Paul :ablobwave: are you comfortable sharing the name of this startup or would you prefer to disclose through DMs?

visual warren May 7, 2025, 7:33 PM

#

hint: his about me

shy flint May 7, 2025, 7:43 PM

#

pearl garnet Hey Paul :ablobwave: are you comfortable sharing the name o...

Sure! It is called Rubik's AI (https://rubiks.ai).

amber umbra May 7, 2025, 7:45 PM

#

shy flint Sure! It is called Rubik's AI (https://rubiks.ai).

this is a scam thing, you tried to push nothing burgers with it like 2 times already lol

#

you basically just make up benchmark numbers, do a lora or basic finetune if even that, and then call it a day

shy flint May 7, 2025, 7:46 PM

#

amber umbra this is a scam thing, you tried to push nothing burgers with it like 2 times alr...

Burgers?

shy flint May 7, 2025, 7:46 PM

#

amber umbra you basically just make up benchmark numbers, do a lora or basic finetune if eve...

Also, this is for the search feature...

amber umbra May 7, 2025, 7:46 PM

#

shy flint Burgers?

that's an expression, google it lmao

amber umbra May 7, 2025, 7:47 PM

#

shy flint Also, this is for the search feature...

doesn't matter, you have no credibility after previous stunts

shy flint May 7, 2025, 7:48 PM

#

amber umbra doesn't matter, you have no credibility after previous stunts

What stunts?

amber umbra May 7, 2025, 7:49 PM

#

shy flint What stunts?

like this one https://x.com/RubiksAI/status/1841224714045264304?lang=ar-x-fm

Rubik's AI (@RubiksAI) on X

🚀 Introducing Nova: The Next Generation of LLMs by Nova! 🌟

We're thrilled to announce the launch of our latest suite of Large Language Models: Nova-Instant, Nova-Air, and Nova-Pro. Each designed to revolutionize AI interactions with exceptional speed, reasoning, and

#

this new search thing is probably some existing API developed by someone else repackaged under your name

shy flint May 7, 2025, 7:51 PM

#

amber umbra like this one https://x.com/RubiksAI/status/1841224714045264304?lang=ar-x-fm

Not really a stunt, it was just an original test of LoRA on popular open-source models to improve them (similar to NexusFlow and other companies).

amber umbra May 7, 2025, 7:52 PM

#

shy flint Not really a stunt, it was just an original test of LoRA on popular open-source ...

don't pretend like you ran these evals and those were the scores

shy flint May 7, 2025, 7:52 PM

#

amber umbra this new search thing is probably some existing API developed by someone else re...

The search feature was the original starting product of this company. I would love to get a better impression of the quality of our new DeepSearch (in collaboration with Exa AI) with LMArena:
https://x.com/RubiksAI/status/1907152289090965962

amber umbra May 7, 2025, 7:53 PM

#

amber umbra don't pretend like you ran these evals and those were the scores

^

pearl garnet May 7, 2025, 7:56 PM

#

hey stepping in to slightly gesture towards our rules

Treat others with kindness and curiosity—we’re here to share, learn, and debate ideas, not start fights. Healthy debate? Yes. Personal attacks? No.

shy flint May 7, 2025, 7:59 PM

#

pearl garnet hey stepping in to slightly gesture towards our rules > Treat others with kindn...

Naturally. Perhaps it would have been better over a DM.

pearl garnet May 7, 2025, 8:00 PM

#

shy flint Naturally. Perhaps it would have been better over a DM.

it's no problem!

amber umbra May 7, 2025, 8:09 PM

#

vapid kraken Karpathy, accomplished AI researcher, shared his thoughts in a tweet. Honestly f...

I think the important thing to realise here is that every benchmark in isolation can be gamed and lmsys is no exception. It's not a definitive answer and is only relevant if all the other usual benchmarks check out too. A good example is Nemotron 70b which was openly made this way to perform better on lmarena without improving anything else over llama 3.1-70b https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/discussions/11#6712c8f758bdba34248ce0ef

nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · [EVALS] Metrics compar...

wraith kestrel May 8, 2025, 5:29 PM

#

A new addition to the Search Arena, perhaps? Or has it been added?
https://www.anthropic.com/news/web-search-api

Introducing web search on the Anthropic API

Today, we're introducing web search on the Anthropic API—a new tool that gives Claude access to current information from across the web.

shrewd trench May 10, 2025, 1:22 PM

#

Is there any way to fix scroll on desktop -- it's really hard to parse results

wide edge May 10, 2025, 4:10 PM

#

shrewd trench Is there any way to fix scroll on desktop -- it's really hard to parse results

right now, your only options are using a different leaderboard like https://beta.lmarena.ai/leaderboard/text or https://ktibow.github.io/lmb/

drifting bramble May 10, 2025, 5:49 PM

#

can we have an arena mode where chat is infinite (only last <CONTEXT_WINDOW_SIZE> tokens are given to models)?

wide edge May 10, 2025, 10:29 PM

#

probably either too long or angries the WAF

quick mason May 14, 2025, 12:50 PM

#

can't put some images: error
HTTP 403:
Please enable cookies.
Sorry, you have been blocked
You are unable to access lmarena.ai
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

pearl garnet May 14, 2025, 1:15 PM

#

quick mason can't put some images: error HTTP 403: Please enable cookies. Sorry, you have be...

are you blocked from going to the beta site (https://beta.lmarena.ai/)? If you're able to access the beta site can you submit a bug report? at the bottom left you should find that option.

Please enable cookies.
did you do this?

quick mason May 14, 2025, 4:29 PM

#

it was in original on the og site, i fixed it

tidal geyser May 22, 2025, 5:47 AM

#

Hi can we get emojis for all LLM providers?

icy laurel Jun 13, 2025, 11:34 PM

#

This issue has been consistent

pearl garnet Jun 13, 2025, 11:50 PM

#

icy laurel

I'll start a thread in #1343291835845578853

median geyserBOT Sep 3, 2025, 2:57 PM

#

:warning: Channel locked

Site outage, will turn back on when resolved.

median geyserBOT Sep 3, 2025, 4:01 PM

#

:success: Channel unlocked

Welcome back :ablobwave:

median geyserBOT May 12, 2026, 2:53 PM

#

:warning: Channel locked