#arena-feedback


alpine marlin
upbeat hollow
#

Works for me? Albeit it's a bit slow (3-4 second delay), it still loads

compact dirge
#

Can we get a deepseek r1 distill model in the arena

#

Or maybe even a quantized model (e.g. r1 Q8_0)? Would be interesting to see the effect on accuracy

wild quail
#

but maybe some small tests, that get published independently from the leaderboard could be interesting

compact dirge
#

I know but benchmarks and lmsys rating rarely paint the same picture

#

take claude 3.7 e.g.

#

crushes benchmarks, #1 on livebench, 70% swe

#

But dogshit rating

hardy halo
#

We should be able to stop the output and vote when it's obvious which one we're going to choose.

wild quail
pure compass
#

A few times I had a model repeating the same sentence forever. I don't know which model. I tried disconnecting the Internet, waiting for it to error out, reconnecting, and then trying to vote, but it did not work, so for that case a stop button would be great.

compact dirge
#

maybe not in arena, but a stop button would come in clutch for direct chat or side by side

low copper
#

There should be a timer indicating how long each answer took

hardy halo
#

If one model is writing a long good answer while the other has already output a short refusal, I can stop the generation and choose the real answer as the better one.

#

Saving me time and saving the provider time and money on generation

#

Somewhat contrarily, I also think we should be able to vote on random queries and responses that other people submitted, since they're all going into the database anyway. Let multiple people vote on which response is better for a given conversation, and get a lot more battle data without spending any energy on generating new outputs or waiting for them to be generated.

wild quail
low copper
#

It was really cool

#

It was a shame when they ended their project.

#

@agile flume would lmarena ever consider this?

#

I can't find any photos of it but they had a feature where you could see public generations by category and then have to select better responses. It also let you submit your own better responses and even rate things like output quality, creativity, and potential harm.

#

https://huggingface.co/OpenAssistant

Datasets set out the labels like this: { "name": [ "spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech", "sexual_content", "quality", "toxicity", "humor", "creativity", "violence" ], "value": [ 0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0 ], "count": [ 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3 ] }
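For anyone who wants to work with those labels, here is a minimal Python sketch. It assumes each record has exactly the three parallel arrays shown above; the `labels_to_dict` helper is a made-up name for illustration and not part of any OpenAssistant tooling.

```python
import json

# The label record quoted above: three parallel arrays of equal length.
RECORD = json.loads("""
{
  "name": ["spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech",
           "sexual_content", "quality", "toxicity", "humor", "creativity", "violence"],
  "value": [0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0],
  "count": [4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3]
}
""")


def labels_to_dict(record):
    """Zip the name/value/count arrays into {name: (mean_value, rater_count)}."""
    return {
        name: (value, count)
        for name, value, count in zip(record["name"], record["value"], record["count"])
    }


labels = labels_to_dict(RECORD)
print(labels["quality"])  # (0.8125, 4)
```

Reading `count` as the number of raters and `value` as their mean score is my interpretation of the record, not something the dataset card is quoted as saying here.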

slim herald
cinder phoenix
#

Found a minor bug I couldn't screenshot
I got a Cloudflare captcha overlaid on the UI, right over there on the top left

strong slate
short scarab
#

What’s the password?

#

Oh

short scarab
#

Google’s forcing me to change my password to use the forum

#

But the cloudflare captcha thing is hogging up space

short scarab
# strong slate For the new UI - I'd def submit it to the bug report form to get prioritized: ...

Hello, would it be possible to add text attachments for txt, csv, tsv, xml, html, css, js, py, c, cpp, and other text-based files, by appending the file contents to the prompt, possibly bypassing the character limit?
Example:

Hello LLM!
———User Attached file: Hello.txt (TIMESTAMP)——
Hello World!
—End-Of-File—
Similar to repochat

Additionally, support for Excel documents, Word documents, and SQLite would be helpful, as would code folder uploads like Gemini has.
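A rough Python sketch of the "append the file to the prompt" idea from the example above. Everything here is hypothetical: `attach_text_file` is an invented helper, the markers are written with plain ASCII hyphens rather than the long dashes in the example, and nothing suggests the arena implements attachments this way.

```python
from datetime import datetime, timezone


def attach_text_file(prompt, filename, content, timestamp=None):
    """Append a text file's content to the prompt between plain-text markers."""
    if timestamp is None:
        # Assumption: a UTC ISO timestamp is good enough for the header line.
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        f"{prompt}\n"
        f"---User Attached file: {filename} ({timestamp})---\n"
        f"{content}\n"
        f"---End-Of-File---"
    )


message = attach_text_file("Hello LLM!", "Hello.txt", "Hello World!", timestamp="TIMESTAMP")
print(message)
```

Keeping the attachment as ordinary prompt text means every model in a battle sees exactly the same input, which seems to be the point of the request.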

strong slate
short scarab
#

Those conversations should be LLM judged, as it’s basically a waste of resources especially for stuff like GPT 4.5 being tested on the site and possibly the full o3 model in the future

patent fjord
#
  • the site (maybe still does) used to do cloudflare/ddos protection
short scarab
patent fjord
#

yea they get fed as data too like that interactive viewer thing

short scarab
#

Seems legit actually

split prism
# patent fjord + the site (maybe still does) used to do cloudflare/ddos protection

They still do. I found a funny thing: some questions, usually containing SQL statements or Linux commands, are "forbidden" in a way that consistently triggers errors, so they cannot be asked or your conversation gets cooked. After some exploration, it looks like the reason is cloudflare, which bans those requests due to some random "protections" and consistently returns 403 for those suspicious types of requests. It is likely not the inherent protections of LMArena itself, since they usually give you something like "Content violates moderation..."

#

Thinking more about it now, couldn't it affect the bias of the arena results? Since some types of questions (all of them were harmless ones) are banned by random cloudflare triggers, doesn't it slightly reduce the set of answers provided by the users, thus reducing the uniformity of the arena's scores, in a way? Models which could answer such questions properly would likely get a slightly higher rating, others a slightly lower.

warm sequoia
#

the gemini test 30 model was by far the best model i have used on the site. sad that they removed it early and i didn't get to try it further.

#

😔

split prism
rose robin
rose robin
warm sequoia
delicate drift
#

So i like the webdev arena but what about a direct chat on there with support for like webcontainers or so

hardy halo
rose robin
#

Why not rate the answers of each model after voting? Sometimes I feel that the votes don't reflect what I think about each answer. For example, sometimes I get 2 models and one of them is really bad. The other one is bad too, but a little bit better. I don't know if I should vote "both are bad" or "2 is better". It's better, OK, but bad too. 😂

rose robin
#

2 models: A is 8/10, B is 7/10. A is the winner, but B is good too.
A is 3/10, B is 1/10. A is the winner, but both are bad.
Saying A is better doesn't mean that B is bad or that both are good. It is just better, but you don't know if they are bad, medium, good or excellent. We should be able to give an exact opinion that really reflects the model, not just "this one is better".

wide edge
#

Chess ranking works with matches that are win, lose, or tie without "they both played poorly"; same goes here
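To make the chess comparison concrete, here is a minimal sketch of a classic Elo update in Python. The K-factor of 32 and the 1000-point starting ratings are arbitrary assumptions, treating a "both are bad" vote as a tie is my own simplification, and this is not claimed to be the exact method the arena leaderboard uses.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (A loses)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


win = elo_update(1000.0, 1000.0, 1.0)
tie = elo_update(1000.0, 1000.0, 0.5)
print(win)  # (1016.0, 984.0)
print(tie)  # (1000.0, 1000.0)
```

Note that only the outcome enters the update: an 8/10 vs 7/10 battle and a 3/10 vs 1/10 battle move the ratings identically, which is exactly the information loss being discussed.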

hardy halo
#

Ratings give more information than just win vs lose though.

hardy halo
#

I really wish there were some distilled/quantized models in the competition just to see how models we could run on our own machines compare against real API models. Could choose some from https://oobabooga.github.io/benchmark.html which lists the best models for a given hardware requirement.

compact dirge
#

Yes can we get an r1 distill in the arena please 🥺🥺

hardy halo
#

Even with just 1, we could at least anchor Elo scores vs other benchmarks.

pure compass
#

Will you enable vision capability for Gemma 3?

visual warren
#

Can we get gemma 12b? 27b was really impressive, really wanna see what 12b gets.

rose robin
#

Why not show the thinking process of the thinking models? This would be interesting...
Also, some models like Gemini are able to include pictures while explaining things, but on arena we won't see that, so it will not show the real ability of the model.

compact dirge
#

Leaderboard updates at weekly intervals 😁

wide edge
#

It wouldn't make sense to show things that can distinguish models

manic pollen
rose robin
pure compass
#

Or show them after the vote

dire halo
#

Today for the first time I made a prompt that was censored by the lmarena moderation system. Went on Grok first to test it, it was okay with it (ofc lol). Went on ChatGPT 4.5, worked too. Went on Gemini, worked too.
It seems that the censoring on lmarena is a bit too strong, and pointless if most big models are willing to handle the prompt. And it also distorts the ranking, because if you can't test very dark humor via lmarena, it's one less criterion for judging the quality of the models, and one bias that might favor one model over the others.
It's a pity because instead of censoring the prompt, you could simply let it pass and detect when a model says something like “sorry but I can't answer that question” and cancel the result that will be given at the end.
Or simply ban an IP if it happens too much and remove all the prompts made by this IP from lmarena "open-source results".
I imagine that the idea is to avoid ending up with illegal content in the results that are available to researchers or other people. But if you can detect that a prompt might be censurable, you can also censor a prompt in the results or tag it NSFW.

pure compass
#

Yes, the censorship is really too heavy at times. Not only for text but also for images, and it really seems to hate Charizard for some reason.

true epoch
eager nexus
#

Companies will find that any attempt to censor most models will result in consumers always choosing competing uncensored models. Time and research show that people do not want AI to tell them how to think, or what moral standing they should have.

#

Do you want the cheese grater to tell you how to prepare food?

#

No

#

You don't

#

What's the point of restricting AI when you cannot restrict human intelligence enough to ask the question

low copper
eager nexus
#

Or they simply use another product

low copper
#

My guy, it's a leaderboard site which lets you test the top llms.

wide edge
eager nexus
#

Most of that money is from corporations, individuals want freedom

#

It's two markets

low copper
# wide edge And yet Claude brings in millions and billions

I agree that censorship for text based models is silly most of the time. However I would also agree with you that people care more about what the model can do and generally don't care too much about prompts being censored so long as the AI provides a good enough answer to most of their queries.

low copper
eager nexus
#

Nah, not worth it today

#

Space is overcrowded

low copper
#

Deepseek could have said the same thing

eager nexus
#

Bitcoin was much better return

low copper
#

Bitcoin is a meme coin tbh

eager nexus
#

For the folks that bought in at 20$ or so, it returned ten thousand percent

#

On an easy day

low copper
#

It's off topic

ashen frigate
#

Is there a way I can save the chats of lmarena and continue them later on? It just keeps refreshing after some time of use and shows an error, and I have to refresh the website again, starting a new chat and selecting the model.

hushed tree
#

It would be great if you could add another arena category, namely MTL, as in translating from one language into another. A lot of people have a need for MTL in their life, but there is currently no leaderboard ranking which models are best for translation purposes. And I realize that this poses a problem for testing, as a model might excel at translating English to Japanese but suck at translating English to French... and while it might be best to have a separate leaderboard for each pair of languages, it can be cut down to only be between English + another language. Then it can be further cut down to only include the major languages such as English, Japanese, Chinese, French, German, Spanish: basically languages you already have in the arena.
Anyway, sorry for the long message, I just wanted to share that as a person who is using MTL every day, I am really missing a MTL leaderboard in my life.

nocturne geode
#

Hi! Which is the best way to use DeepSeek & Claude models? I mean in terms of efficiency, speed, etc., in case there is any difference. Is it better to use their direct API? Or is it better to use them through Cline, Roo, OpenRouter, etc.? Thanks! (cline uses their own API too, but I mean when that is not the case)

rose robin
#

I wish you could include a reference to an image.

limber scaffold
#

It would be really amazing if we had some way of saving the chats because when the site refreshes you just instantly lose all of your chat which is quite cruel. Thanks.

soft sigil
pure compass
gaunt warren
#

I know the one exact word that always triggers the censorship system.

#

||moaned||

shrewd shuttle
shrewd shuttle
#

but.. moaned loudly

gaunt warren
shrewd shuttle
#

yeah i think it's handled (pretty crudely) by a small LLM

#

it like screens each prompt

#

so not like a blacklist of words or purely deterministic, more a set of guidelines i imagine

gaunt warren
#

I wonder if the "rules" will change upon me changing my geolocation, lol.

shrewd shuttle
#

nah tbh i think it just reflects the fact it's a small LLM. even if the temp is set to zero, it's still not deterministic - it'll judge the same input two different ways with the same rules

visual warren
#

afaik they use openai moderation api

#

i dont think its a small llm, at least when i last checked it

shrewd shuttle
visual warren
#

also theres another layer by cloudflare that blocks linux related terms 🤣

visual warren
gaunt warren
#

There wasn't a single day it wouldn't.

shrewd shuttle
#

surprised its oai's moderation api

visual warren
visual warren
#

oh it is in the fastchat source code

#

ya i just checked

shrewd shuttle
pure compass
#

And of these 3% most of them are probably false positives.

#

Btw, "Once again, the two idiots and their cat fail to steal a Pokemon." gets flagged, but "three" instead of "two" does not get flagged.

pure compass
#

If the content flagger cannot be tuned down, it could be completely turned off... Or if it flags, show a warning and if the user agrees to see potentially flagged material, continue

#

The current content flagger is ridiculous

heavy tundra
wide edge
#

Researchers who don't want to have to sift through ERP in their open source chat dataset

pure compass
#

It is not ERP but all kinds of stuff that gets wrongly flagged

visual warren
compact dirge
compact dirge
shrewd shuttle
wide edge
shrewd shuttle
#

using the same seed (instead of a random one, as is typically the case) helps get closer to reproducible outputs, but the LLM is still fundamentally non-deterministic

shrewd shuttle
wide edge
#

in theory, it should be possible to take the same inputs and get the same logprobs (and consequently the same outputs)

shrewd shuttle
#

but yeah I take your point, in an idealised setting, reproducibility to the point of a model being 'deterministic' is theoretically possible (i think)

compact dirge
#

I actually did not know that

#

Wow

subtle horizon
#

I had a bug where an infinitely long response was generating for minutes with the same sentence over and over again.

gemini-2.0-flash-thinking-exp-01-21

vast saffron
shrewd shuttle
# vast saffron Even with cpu inference? What provider?

yes. granted CPU inference can (as I understand things) offer slightly more consistent behaviour due to reduced parallelism compared with GPUs, that doesn't overcome the inherent indeterminism of LLMs (it's not about hardware...)

visual warren
#

there is no inherent indeterminism (without sampling); it's because of hardware, floating point operations, etc
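A tiny Python illustration of the floating point part of that claim. It is only a toy: the two-token "logits" are invented, and real GPU kernels differ in many more ways than summation order. It does show how the same numbers, added in a different order, can tip a greedy pick.

```python
# Floating point addition is not associative: the grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def greedy_pick(logits):
    """Return the token with the highest logit (first one wins on an exact tie)."""
    return max(logits, key=logits.get)


# Hypothetical near-tie: "yes" gets its logit from the order-dependent sum above.
pick_one_order = greedy_pick({"no": 0.6, "yes": left})
pick_other_order = greedy_pick({"no": 0.6, "yes": right})
print(pick_one_order, pick_other_order)  # yes no
```

Parallel hardware does not guarantee a fixed reduction order, so near-ties like this are one way greedy decoding can diverge between runs even with no sampling at all.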

shrewd shuttle
#

how is there no inherent indeterminism in a 'model'?

#

why is it called a model?

visual warren
shrewd shuttle
#

eh we're on a different page lol

#

what is a 'model'?

#

like a weather forecast model.. language model.. whatever

#

'model' isn't a loose term

#

in an idealised setting etc etc sure

#

but they're LLMs

visual warren
shrewd shuttle
#

it wouldn't be a model if it were deterministic

#

it would be a formula or whatever

visual warren
shrewd shuttle
#

ok. perhaps i'm getting caught up in semantics - agree to disagree ha

visual warren
shrewd shuttle
#

idealised and model are key to my thinking here

#

happy to be shown wrong

#

but it seems a lot of what is being said rests on 'theoretically'

visual warren
shrewd shuttle
#

yeah

visual warren
#

if everything was done in perfect accuracy without sampling

shrewd shuttle
#

my point exactly

visual warren
#

but yes irl you can't have perfect accuracy due to performance/hardware/sampling/etc

shrewd shuttle
#

theoretically possible - i don't dispute

#

irl, it seems a point not worth proving

visual warren
#

i thought u were saying before they were theoretically indeterministic and that makes zero sense, u phrased it in a weird way

#

but i understand what u mean now

shrewd shuttle
#

any 'model' is theoretically indeterministic - otherwise it wouldn't be called a model

#

i don't dispute the idea that, keeping everything constant, using the same seed etc etc, it should be possible to get 100% reproducible responses to a given prompt

visual warren
shrewd shuttle
#

they predict tokens

visual warren
shrewd shuttle
#

to my mind, maths (yes there are concrete solutions) isn't any different to any other prompt - it's still ultimately sampling and predicting tokens to provide the completion

#

Yes, in theory, LLMs are completely deterministic if you:

  1. Use greedy decoding (always select the highest probability token)
  2. Have perfect floating point precision
  3. Eliminate all hardware variations

In this idealized scenario, an LLM would produce...
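The first item in that list can be sketched in a few lines of Python. This is a toy with three made-up logits, not a real model: greedy decoding is a pure function of the logits, while sampling only repeats itself when the random seed is fixed.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities (numerically stable form)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, rng, temperature=1.0):
    """Draw one index according to the softmax probabilities."""
    weights = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


LOGITS = [2.0, 1.5, 0.3]  # hypothetical next-token logits

greedy_runs = {greedy(LOGITS) for _ in range(100)}
rng_a, rng_b = random.Random(42), random.Random(42)
seeded_a = [sample(LOGITS, rng_a) for _ in range(20)]
seeded_b = [sample(LOGITS, rng_b) for _ in range(20)]

print(greedy_runs)           # {0}
print(seeded_a == seeded_b)  # True
```

Items 2 and 3 of the list are exactly what this toy leaves out: here the logits are identical on every call, which real hardware does not guarantee.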

#

actually.. i do kinda see the distinction of maths

#

hmm

visual warren
#

i wouldn't call these accumulated errors as sampling if you use greedy decoding

shrewd shuttle
# visual warren \

can you ask a follow-up, and say "the question is actually about LLMs' technical/architectural properties – how they produce 'responses', whether they are inherently deterministic or not. forget about mathematics specifically"

#

maths is deterministic...

compact dirge
#

do you guys know about Lc0

visual warren
compact dirge
#

it’s a ML-based chess engine

#

Non-deterministic

#

Picks different moves every time, even with identical parameters and hardware configuration

visual warren
shrewd shuttle
#

but yeah, i'm coming at this from an irl perspective

#

inherently seems more relevant than theoretically to my mind...

#

singling out mathematics seems odd (given its inherent determinism)

#

ive been running this

```
Are LLMs' outputs inherently deterministic or non-deterministic? If non-deterministic, can they be made deterministic, in practical/real-world terms, and how? Begin your response by answering with Yes or No, then expound
```
in the arena and it's just been no, no, no, no
visual warren
shrewd shuttle
#

Large Language Models

#

with a bunch of formulas / code underlying it all

#

sparrow is new?

visual warren
shrewd shuttle
#

show me 100% reproducible outputs to the same (semi) complex prompt and i'll be more partial to this thinking ha

visual warren
shrewd shuttle
#

yeah but you're cherry picking

#

maths is deterministic

#

creative writing isn't

visual warren
# shrewd shuttle creative writing isn't

if u use 0 temperature, you are not shifting the distribution in a notable manner. even if it chooses differently, it's still around the same: the model's probability distribution stays about the same even with accumulated errors, which are minimal. no matter what task, even creative writing

shrewd shuttle
#

show me 100% reproducible outputs to the same (semi) complex creative writing prompt and i'll be more partial to this thinking ha

shrewd shuttle
visual warren
#

im being very scatter brained here, apologies lol

shrewd shuttle
shrewd shuttle
#

i've got no idea

visual warren
shrewd shuttle
#

ah

#

yeah i've got no idea

#

but if it's about maths... then yeah

visual warren
#

it's talking about how, if u shuffle the data during training, it affects which token gets hyperfitted / the distribution (where in hyperfitting, one token usually dominates)

shrewd shuttle
#

ha yeah fair.. i just skimmed and saw 'determinancy'

#

ok this is based on the conclusion (still seems to essentially say the same thing as far as I can tell)

shrewd shuttle
#

yeah i dunno

#

LLMs are stochastic, not deterministic.

#

that's what the conclusion suggests (not the LLM summary, me just reading it - i can't be assed going through the whole paper ha). agree to disagree i guess ha

visual warren
#

this is what i mean by how hyperfitting can demonstrate my point @shrewd shuttle

visual warren
shrewd shuttle
#

my turn to say i'm tired ha

#

which i genuinely am.. (5am here in australia - i just noticed.. yikes ha)

visual warren
# shrewd shuttle my turn to say i'm tired ha

ya im sry lol. i just made it super confusing. i have a lot of random/incomplete thoughts (about this) which got us into different tangents. i did not go about this conversation well at all

shrewd shuttle
#

funnily enough i'm actually coming round to (or understanding) what you / the paper is saying ha

#

but one for tomorrow 🙂

visual warren
#

Early-grok-3 was removed today?

verbal canyon
#

it's just deprecated no?

visual warren
#

That would make sense, I tried switching to grok-3-preview-02-24 but it gives me error_code: 50004, An error occurred during streaming every time now

lyric hamlet
#

is there gpt 4.5

slow epoch
#

there was for like 15 mins after it released, then it was taken off

#

it got taken off WHILE i was using it, it just stopped generating, i refreshed page, it was gone from list

pure compass
#

Rude!

visual warren
visual warren
#

ive got it 20+ times in the last few weeks

slow epoch
#

oh wow

#

no im talking about direct chat tho

#

lmao

#

oh i see i forgot to mention that oops

limber scaffold
#

Being one of the few alpha users.... Or one of the most.... No clue honestly.

It would be nice if we had newer versions of the AIs due to some of them being outdated like GPT 4o. Unsure if it's possible but a man could dream

feral star
#

Why are there so few image models on LMarena?

agile flume
#

thanks @feral star please let us know what other models you'd like to see!

feral star
#

Thanks for answering. Hopefully some of the 15 best models on Artificial Analysis and Imgsys that are not listed on LMarena:

  1. Reve Image (Halfmoon) (#2 AA, #1 imgsys)
  2. FLUX.1 [pro] (v1.0) (#5 AA, #2 imgsys)
  3. Midjourney v6.1 (#6 AA)
  4. RealVisXL V4.0 (#4 imgsys)
  5. Playground v2.5 (Aesthetic Model) (#28 AA - standard, #5 imgsys - aesthetic)
  6. ColorfulXL-Lightning (#6 imgsys)
  7. Juggernaut XL v9 (#7 imgsys)
  8. Image-01 (#7 AA)
  9. Midjourney v6 (#10 AA)
  10. Ideogram v2 Turbo (#11 AA)
  11. Stable Diffusion 3.5 Large Turbo (#12 AA)
  12. Proteus (#9 imgsys)
  13. Mobius (#10 imgsys)
  14. Fooocus (Quality) (#11 imgsys)
  15. FLUX.1 [schnell] (#19 AA, #8 imgsys)

(and the recently released models: 4o native, 2.0 Flash Exp, Ideogram 3.0, ...)

hushed crest
#

@agile flume Some models get into a "loop" or start being unnecessarily verbose, and the user is forced to wait to vote even though everything is already clear. Please add a "Stop" button.

warm drift
#

hmmmmm

hushed crest
soft sigil
#

Why would the "There was an error" message appear in alpha? Could it be the moderation system?

eager idol
#

I'm repeatedly getting connection errors today while trying several different models, even after refreshing the page several times. Is there a known issue?

#

This was seen in direct chat mode. Arena seems to work.

tidal geyser
#

Hi, can https://manus.im be added to lmarena?

Manus is a general AI agent that turns your thoughts into actions. It excels at various tasks in work and life, getting everything done while you rest.

true epoch
arctic kiln
#

oh wow, you can save conversations in the new arena, thats a good thing

true epoch
# tidal geyser

how can we trust that? why haven't they added this to the actual GAIA leaderboard?

true epoch
#

Why is the price analysis not updated?

#

please update it

visual warren
#

Hello guys!
Why could I send this fragment earlier?:

```
<html lang="ru" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>test</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
    <style>
```

But today (and yesterday) I get an error:
error
Connection errored out.
visual thorn
#

In Image Battle it's hard to tell how to vote for a winner, maybe I'm blind

visual warren
hushed crest
#

The battle arena is really unusable for my hard prompts. Models often do not return anything or get stuck in a loop. The regenerate button should be unlocked after 5 seconds.

thorny tulip
#

luca might be broken. It output this in response to a math problem:
{
"Name": "A1",
"Value": 10,
"IsValid": true
}

#

stargazer will remove its response during generation:

**API REQUEST ERROR** Reason: Unknown.

(error_code: 1)
#

It happens intermittently and was a math problem. :/

vital laurel
#

This is amazin'

vast saffron
twin bloom
#

The claude thinking model is giving the following output. I've been trying to use it, but due to this issue I am not able to send anything. It has been like this for around 4-5 days.

dreamy orchid
#

Question: here it seems that the conversation is generic, not really focused on feedback. Am I missing something?
I wanted to propose an idea (like others do), but if it is simply buried by normal discussion it doesn't make sense to post it here.
Is there some sort of repository for feature requests / issues, a la github?

strong slate
dreamy orchid
#

understood and yes, changing the UI is a big thing for sure.

vague osprey
#

I think we need to take decoding speed into consideration, since a much faster ai response is preferred.

pure compass
#

Will we at some time know which model is 24 karat gold?

dreamy orchid
pure compass
#

So far it performs really well, I really wonder which one it is. Where will it be announced which one it is, if it reaches enough votes?

#

Really hope I can have a few direct chats with it instead of hoping to get it

dreamy orchid
#

for example on twitter they get announced with "congrats to model XY". Otherwise just watch out for leaderboard updates (you see the date of the last update), order by votes ascending (new models have fewer votes) and check those that performed well (they have low digit rankings).

Sometimes the karat gold goes hard astray though

pure compass
#

Yes but that is still some guesswork when there are multiple new models. I mean a real announcement where it says "congratulations to model XY, formerly known as 24 karat gold"

dreamy orchid
#

no they don't say "known as X". You just notice that "ah it should have been that".

That is also the charm of the competition.

I believe the 24 karat gold has a chance at the top in the overall section, but personally the best category is the "hard prompts" category. (that is not even that "hard", as a quarter of the votes qualify for it apparently)

In the hard prompts category the order changes a bit more and is more correlated with the many benchmarks - considered as a whole - outside lmarena

pure compass
#

I have saved a few outputs of it on a text file so I can compare it with the top newcomers so I hopefully figure it out.

dreamy orchid
#

btw that model doesn't perform well against reasoning models if the question is about logic and coherence

pure compass
#

I wonder if it will pass the city in a bottle test, that would be a first as no model could one shot it so far

dreamy orchid
#

what's that ?

pure compass
dreamy orchid
#

ah coding, I see.

pure compass
#

Yes, but also understanding really obfuscated code. Most models struggle with the combination of bitwise and Boolean OR

||d|( becomes ||d||( and once they made that mistake they often have a hard time correcting it

dreamy orchid
#

yes. For me there is too much attention on LLMs and coding. I think LLMs should be good all-rounders that then could be specialized in this or that category. Coding is surely helpful, but if that is the focus, I think performance would be better with ad-hoc models. Think about an LLM director that picks ad-hoc LLMs for this or that category. Coding is simply one category.

#

same with logic and other stuff.

#

for example in lmarena the p2l-7b router does a very good job and it is a very nice idea.

pure compass
#

But so far there simply is no LLM at all that can do this task, that's why it is my favorite test until I find one that can do it. Or asking about some Linux configuration stuff (they don't know much about firejail and happily hallucinate commands and options that do not exist, but that is no wonder because the firejail docs are really bad so there is not much they could learn from)
Or just give a vision model an image with two characters and ask it to write a conversation between them and see how it interprets the image and the situation, really interesting sometimes on the same image one model sees a friendly situation and the other model sees them both in an aggressive stance ready to attack

steady garnet
#

The arena keeps giving me an error on Google and Explorer, it says there's no connection, but it works in Avast and DuckDuckGo. Any reason for this?

slow drift
#

So here's the thing: the models I loved using a few days ago, cybele, Spider, 24_karat_gold, and stradale, are all gone now... I think these were the strongest models in the world...

#

*sob sob*~

slow drift
#

#

Can only look forward to malla4, I guess

round rover
#

Why the HELL does this do this?!

#

It is irritating all the time, man

ripe glade
round rover
#

It's so stupid, y'know?

lunar cobalt
#

Add 24 karat gold back

#

And remove those trash crystal flannel haley ones

rose robin
near geode
#

gemini-2.5-pro-exp-03-25 returned API REQUEST ERROR Reason: Unknown. (error_code: 1) for totally innocent coding prompt

rose robin
outer lake
lunar cobalt
#

Flannel, crystal and haley constantly provide misinformation and hallucinations

#

Even on basic questions

#

And it's not as creative as 24 karat gold was

#

Its a boring and trash AI

wide edge
#

the arena won't stop evaluating for you

heady quartz
#

i find it better than what we have from llama 4

lunar cobalt
#

But it’s still not good it’s trash

#

It hallucinates too much even on basic questions

#

It’s trying to be 24 karat gold

#

It’s not working tho

#

24 karat gold was amazing, they need to add it

#

Or at least tell us what company it’s from so we can use it

#

There’s no reason they need to replace it with those trash ones

pure compass
#

Oh and now that we talk about llama4, can we pretty please get its vision capability as well in direct chat?

lunar cobalt
#

Its from llama but prob not llama 4

#

Cus its not a reasoning model, its not behemoth

#

And its not really like maverick or scout

#

They're smarter

#

It also constantly said it was Llama

#

It just had a really unique and cool writing style and it was pretty funny

#

Maverick is pretty close to it but less creative and unhinged

pure compass
#

Maybe different sampler or system prompt?

lunar cobalt
#

Yeah maybe actually

nova ledge
#

Web arena sonnet 3.7 result is not rendering.

split prism
#

Is there any work on implementing new filters (like Style Control) or algorithms which would try to make lmarena leaderboard a bit more objective?

New Llama 4 Maverick is quite meh at best, yet it managed to get rating so high, even with style control.

lucid pecan
#

yeah, llama 4 truly gamed the leaderboard. truly disappointing.

dreamy orchid
#

the style control thing is something I don't get. We aren't doing api calls, we are having a conversation. The default category should simply be another category (I am for hard prompts). If people like formatting (I am one of them) then let it be.

#

the point IMO is that the average question is not so hard and thus the difference between models is diluted, hence the need of a default category that focuses on hard/niche/non-common stuff

split prism
# dreamy orchid the point IMO is that the average question is not so hard and thus the differenc...

The leaderboard becomes less meaningful if it gets "hacked" by a model with 17B active parameters.

Most people want a smart model after all, not the one which answers basic questions the best. I genuinely hoped that llama 4 would excel at least at something (creative writing/code/math), but I didn't find the proper niche, at least for my tasks, yet it scores higher than many actually smart models, even with style control.

It also poses a question on legitimacy of existing elo scores

pure compass
#

How does style control even work? A vote for a model with good style / formatting counts less than a vote for a model with only plain text?

pure compass
#

Hmm I only kinda understand it. So for each model, we have an elo value (determined by user votes) and some style value (by counting markdown tags). And we have the theory that the style value influences the elo value because users tend to take style into account. But how do we know how much the style influences the elo value? By looking at the correlation between elo and style across all models? But then we still don't know whether that correlation exists because users tend to vote for models with better style, even if the answer itself is worse, or because stronger models with better answers also tend to have better style. Probably both, but how is the "style coefficient" between them determined, as we don't have a control variable?

wide edge
#

ideally we would literally control for it with system prompts

#

but we aren't

pure compass
wide edge
# pure compass Yes, that part I understand. But how do you prevent to over-compensate, how do y...

i'm honestly not sure 😅
but smarter people figured it out https://en.wikipedia.org/wiki/Controlling_for_a_variable

In causal models, controlling for a variable means binning data according to measured values of the variable. This is typically done so that the variable can no longer act as a confounder in, for example, an observational study or experiment.
When estimating the effect of explanatory variables on an outcome by regression, controlled-for variable...

pure compass
#

Yes I read it but I don't understand what the control variable is. Don't we need to know how well a given model would perform with and without style so we have a number by which we compensate later for style?

wide edge
#

no we can do this because responses have some variance

#

a model doesn't have a universal level of style

#

it's random for each response

pure compass
wide edge
#

we have enough data to be able to extract some variance without getting confused by the rest

pure compass
wide edge
#

i believe so

pure compass
#

Ok I won't even pretend I will ever understand the math, maybe I'll ask one of the LLMs for an ELI5
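
A toy sketch of the idea discussed above, for anyone who wants to poke at it. It is not LMArena's actual code or feature set: the model `P(A wins) = sigmoid(strength[a] - strength[b] + beta * style_diff)`, the function `fit_with_style`, and all the synthetic numbers are assumptions made up for illustration. The point it tries to show is that because style varies from response to response, the regression can estimate a style coefficient and the model strengths at the same time, without a separate "with and without style" experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_with_style(model_a, model_b, style_diff, a_won, n_models, steps=2000, lr=0.5):
    """Fit P(A wins) = sigmoid(strength[a] - strength[b] + beta * style_diff)."""
    n = len(a_won)
    X = np.zeros((n, n_models + 1))
    X[np.arange(n), model_a] += 1.0   # +1 for the model shown as A
    X[np.arange(n), model_b] -= 1.0   # -1 for the model shown as B
    X[:, -1] = style_diff             # style(A) - style(B) for this battle
    w = np.zeros(n_models + 1)
    for _ in range(steps):            # plain gradient ascent on the log-likelihood
        w += lr * X.T @ (a_won - sigmoid(X @ w)) / n
    return w[:-1], w[-1]

# Synthetic battles (all numbers invented): two equally strong models,
# but model 1 uses more formatting on average. Style still varies per response.
rng = np.random.default_rng(0)
n = 20000
model_a = rng.integers(0, 2, n)
model_b = 1 - model_a
style_mean = np.array([0.0, 1.0])
style_a = rng.normal(style_mean[model_a], 1.0)
style_b = rng.normal(style_mean[model_b], 1.0)
style_diff = style_a - style_b
true_beta = 1.0
a_won = (rng.random(n) < sigmoid(true_beta * style_diff)).astype(float)

# Raw win rate of model 1, ignoring style entirely.
raw_win_rate_model1 = np.mean(np.where(model_a == 1, a_won, 1.0 - a_won))

strengths, beta = fit_with_style(model_a, model_b, style_diff, a_won, n_models=2)
strength_gap = strengths[1] - strengths[0]
print(raw_win_rate_model1, strength_gap, beta)
```

In this made-up setup model 1 wins clearly more than half of the raw battles, but once the style term is in the regression the fitted strength gap sits near zero and the win advantage is attributed to `beta` instead.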

copper olive
#

the thing is the conclusion you reach from the llama thing is that the system prompt is good, not that the model itself is good

wraith kestrel
lucid pecan
#

even with experimental, i don't think it deserves the place it's on

frigid pine
#

it feels like they just threw stuff at the wall (i.e. kept trialing new llama models) until they came up with a writing style that consistently got votes despite the underlying model sucking

lucid pecan
#

like looking at the (overall) leaderboard (with style control). it says gemma 3 27b is better than deepseek v3 and gemini 1.5 pro.
also deepseek v3.1 (aka v3 0324), deepseek r1, llama 4 and even flash thinking is better than claude 3.7 sonnet (without thinking).

#

death i can't trust the leaderboard for anything meaningful.

wide edge
#

try style control tho

tribal hollow
dreamy orchid
# split prism The leaderboard becomes less meaningful if it gets "hacked" by a model with 17B ...

"Most people want a smart model after all, not the one which answers basic questions the best" but then lmarena is not necessarily great at this. If people pose simple questions, one cannot blame the benchmark. At most they can try to make scores only for hard questions. Hard prompt is a starter but I don't believe 25% of the questions are really hard, the percentage is simply too high.

I would expect 1 in 10 or 1 in 100 questions to be hard.

#

for hard questions one uses livebench and company. For "what can replace common internet searches" I think lmarena is ok

#

for example within lmarena, the coding category and webdevarena are showing totally different values. Why? Because in coding as soon as one writes this it counts as coding

split prism
wide edge
dreamy orchid
#

yes. I wanted simply to say that I am against style control, since the end user is a human, not a program/agent.

Hence formatting matters. For example claude answers are - if not about coding - super dry and not that well formatted. No wonder it loses.

I think rather that they should use one (or more) LLM judge to pick hard questions and make a category for that. No, hard prompt is good but not enough, one needs to be strict.

On the other side I notice that during pairing not a lot of models are used (one can see that even in the h2h matrix) and that may lead to inflated results (I analyzed too many ratings and co due to my passion for them in chess)

dreamy orchid
#

though even with the hard prompts, that is a step in the right direction, the rankings change a bit. For example gemma loses some spots

wide edge
#

style control has more of an impact than the hard filter tbh

dreamy orchid
#

yes but style control is something I don't agree with. I want hard questions rather than a sort of "ah let's not count all the stylish answers"

wide edge
#

eg maverick keeps its spot with hard, and stays tied for first place with hard + style control, but with plain style control there's enough data to confidently say that it's #10

wide edge
split prism
# dreamy orchid yes. I wanted simply to say that I am against style control, since the end user ...

"I wanted simply to say that I am against style control" - well, that's why it is not enabled by default, since there will always be a need for a model which just wins users' preferences, no matter how stylized or sycophantic the model is. And style control is not just about excluding all beautiful responses, by the way. Actually, it would be great if users were able to create their own leaderboards by creating custom rules to filter out the responses which affect the rating. But that would either require exposing the underlying data or spending a lot of compute on recalculations server-side

dreamy orchid
#

there is prompt 2 leaderboard

#

it is a nice idea, I feel it can still be refined (because some categories are like displayed three times) but p2l-7b showed good results in my cases (that is the model behind p2l IIRC)

#

and yes in general the more the customization, the more the costs for lmarena

split prism
# dreamy orchid there is prompt 2 leaderboard

to my understanding, prompt2leaderboard is based on an LLM and is not updated in realtime when new models appear. Currently, it reroutes most of my questions to older gemini models, and there are none of the new models like claude 3.7, or newer chatgpt's, so it's kinda out of date.

#

Prompt2leaderboard is a cool idea, but I guess it would cost them a lot to update the llm regularly, which, in turn, makes this thing useful only for a limited period of time

dreamy orchid
#

yes that is my understanding too. They have a model (LLM but in theory could be anything) that tries to classify the posed questions with the existing DB of questions (and scores). Given that, it says "for such questions this is the ranking". The p2l-7b then uses this information to pick the #1 model in that ad-hoc leaderboard to answer.

Thus sure it needs updates. The problem is that the amount of possible questions categories is huge so I am not sure they have enough sample size for each subcategory and subleaderboard.

When one builds a leaderboard only on 100 comparisons, it makes little sense. Even 2000 comparisons could be a little (given the amount of evaluators or voters and the possible pairings)
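
A rough way to see why a few hundred votes is thin: a sketch using the normal approximation for a win rate between one pair of models, converted to an Elo-style gap. The 55% win rate and the vote counts are invented for illustration, and this is not how the arena computes its own confidence intervals.

```python
import math

def win_rate_interval(wins, total, z=1.96):
    """Approximate 95% interval for a win rate (normal approximation)."""
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

def elo_gap(p):
    """Rating gap implied by win probability p on the usual 400-point logistic scale."""
    return 400 * math.log10(p / (1 - p))

# Same observed 55% win rate, different numbers of head-to-head votes.
lo100, hi100 = win_rate_interval(55, 100)
lo800, hi800 = win_rate_interval(440, 800)
lo2000, hi2000 = win_rate_interval(1100, 2000)

for label, lo, hi in [("100", lo100, hi100), ("800", lo800, hi800), ("2000", lo2000, hi2000)]:
    print(label, round(lo, 3), round(hi, 3), round(elo_gap(lo)), round(elo_gap(hi)))
```

With 100 votes the interval still includes 50%, so the direction of the gap is not even settled; with 2000 votes on a single pair it no longer does. On a real leaderboard those votes are spread over many pairings, which is why the per-pair counts end up much smaller than the headline total.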

#

example (p2l explorer). This has "only" 800 votes. Practically nothing.

#

and yes the p2l is outdated. Hopefully they can update it every month or so

#

it is really a nice idea

frigid pine
#

If I use the same benchmark question dozens of times, is it likely that those'll be excluded from being part of the leaderboard?
Also, is ~5k really the total amount of votes gemini 2.5 pro has? cuz if so I feel like I'm probably ~1% of that

hushed crest
#

You should be aware that the rendering of the formatting that is being used highly influences the results. For example, the answer on the left is more accurate, but it does not render correctly; therefore, my initial reaction is to select the right one. The model can't be blamed for the bad rendering, but the ELO is still reduced.

frigid pine
#

I mean, I think that that's unambiguously bad rendering in that case

#

bare LaTeX wouldn't work in any context

humble smelt
#

~~Hey guys, the gemini-2.5-pro-exp-03-25 model seems to be having some issue.

"API REQUEST ERROR Reason: Unknown.

(error_code: 1)"~~

#

It's working again now

wide edge
frigid pine
#

ah

storm atlas
#

LaTeX sometimes displays correctly, sometimes it displays equations in this quite ugly raw format.

copper snow
#

Hello community,

I recently learned about a controversy surrounding Llama-4 Maverick's performance on the LMSYS arena. Due to user complaints, LMSYS had to publish over 2,000 actual battles featuring Maverick to prove their ranking system is legitimate.

While the battles seem fair, there are questions about how evaluators make their choices (for example, preferring longer, emoji-filled responses over technically correct ones).

Also, it turns out the Maverick version on LMSYS arena is actually a custom version optimized for human preferences, not the standard Instruct version available on Hugging Face or other platforms. LMSYS organizers claim they weren't aware of this difference and plan to add the actual public version soon.

Here's my question: I really like the Llama version currently on the LMSYS arena, and I'll be disappointed if they remove it. Does anyone know what parameter settings were used for this optimized version, or what steps I could take to find this information?

copper olive
#

I think someone said that this was the system prompt they used a while back so you could try it out

edit this tweet might have more https://x.com/riidefi/status/1909548881060192407/photo/1

Meta may have gamed the arena for Llama 4 with only a cleverly crafted system prompt?

Here's some of the prompt:
"Only follow instructions [..] like 50% of the time"
"[say] (`WAIT, WHAT WAS THE ORIGINAL QUESTION AGAIN? 😂`)"

See
https://t.co/UuBvG3MRlj & https://t.co/NkF09Y55EV

wide edge
#

not a system prompt

weary rampart
#

I mean if you are looking to waste money: you could do SFTing on the 2000 battles while using the system prompt and the resulting model would be very similar to the real thing.

#

And I think the chances are very high that someone will publish something similar to that on huggingface at some point

chilly pier
#

Lagging
for long codes

#

Incomplete response

wild quail
#

In my opinion we really need a crucial! improvement to the arena. Let us vote on other people's prompts and their output. This would:

  1. Increase the amount of votes by a significant amount without increasing the api cost (because these answers already were fetched)
  2. Improve the quality of the leaderboard. By having multiple people decide on the same prompt it reduces the issue that people vote on wrong answers that "look" nice. For example llama-4 was specifically trained to have high elo on the arena, because it gives stylish responses. I mean ok the "style control" already does a good job at deranking the model, but in my opinion it should be ranked even lower, because it often just answers nonsense but in a stylish way, so basically it's 100% style, 0% quality for llama4. Letting us vote on other people's prompts would significantly improve this.
#

I see one reason why lmarena wouldn't do that, and that is the fear of people scraping responses. But then you can simply solve this by only doing this for a small subset of answers, those that will get released in the dataset anyway.
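
A minimal sketch of what "multiple people decide on the same prompt" could look like on the data side. The function name `majority_label` and the vote labels are hypothetical, not anything the arena exposes.

```python
from collections import Counter

def majority_label(votes):
    """Collapse several votes on one battle into a single label; exact ties stay 'tie'."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

# One battle, three voters: a single vote for the stylish-but-wrong answer gets outvoted.
print(majority_label(["A", "B", "B"]))
```

A single careless vote then only changes the outcome when the other voters are split, which is the quality gain the proposal is after.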

dreamy orchid
# wild quail In my opinion we really need a crucial! improvement to the arena. Let us vote on...

this is not a bad idea but I am unsure whether it is logistically feasible.

This is because if you have a lot of voters in a period, you can do it because you have excess capacity.
If instead the voters aren't that many, you may have them voting on stuff they aren't interested in and people could simply quit voting.

The idea in general could be very useful. I wonder if one could find a compromise. Let many (not one) LLM judge the answers. Then let people judge the judge (every now and then, not too often). In that way the "weighted" judge becomes a proxy of the people, and could help. A sort of Arena-Hard-Auto but more polished.

That could be also done on a small sample of questions (say, 5 to 10 per category). The point is to automate the judging while still reflecting what the majority of people would pick. Not easy.
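
The "judge the judge" idea can be sketched like this: each LLM judge gets a weight equal to how often it agreed with humans on occasional spot checks, and the weighted vote decides a battle. Judge names, spot-check data, and both function names are invented for illustration; a real setup would need far more spot checks per judge.

```python
from collections import defaultdict

def judge_weights(spot_checks):
    """spot_checks maps judge name -> list of (judge_vote, human_vote) pairs."""
    return {
        judge: sum(jv == hv for jv, hv in pairs) / len(pairs)
        for judge, pairs in spot_checks.items()
    }

def weighted_verdict(votes, weights):
    """Sum the weights behind each option and return the heaviest one."""
    totals = defaultdict(float)
    for judge, vote in votes.items():
        totals[vote] += weights.get(judge, 0.0)
    return max(totals, key=totals.get)

spot_checks = {
    "judge_x": [("A", "A"), ("B", "B"), ("A", "A"), ("B", "B")],  # agrees 4/4
    "judge_y": [("A", "B"), ("B", "B"), ("A", "B"), ("B", "A")],  # agrees 1/4
    "judge_z": [("A", "A"), ("B", "A"), ("A", "B"), ("B", "B")],  # agrees 2/4
}
weights = judge_weights(spot_checks)
verdict = weighted_verdict({"judge_x": "A", "judge_y": "B", "judge_z": "B"}, weights)
print(weights, verdict)
```

Here two of three judges pick B, but the judge that tracks human votes best picks A and carries more weight, so the weighted verdict is A.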

wild quail
dreamy orchid
#

yes that yes. Still I think the voters (not users! Rather those that vote) on lmarena aren't that many - in the period between leaderboard updates - so it could dilute the effort. But I like the idea.

#

because for example if I test the search vs the language mode, I don't really use the language mode afterwards. The testing prompts are limited as time is limited

pure compass
#

Can someone explain what exactly the recent llama4 controversy is about? Is the 03-26 experimental version closed source and the 17b-128e instruct the one you can download? I hope not because the experimental version is so much better

dreamy orchid
#

more or less. The 03-26 is optimized for human benchmarks (lmarena and similar ones, like the internal ones) and the 17b-128 is not.

It could well be that what we saw in lmarena will be released within meta products (whatsapp for example) while the open weights one will stay different (there are very few open source models. Most of them are "only" open weight)

#

could be that the open weights one is the base for 03-26, as 03-26 got additional fine tuning or so

#

time will tell, so far it is speculation

shrewd shuttle
# dreamy orchid more or less. The 03-26 is optimized for human benchmarks (lmarena and similar o...

i thought it was solely optimised for the arena.. i might be mistaken but i feel like there aren't really any similar ones, in terms of collecting human preference from blind battles at scale, and also any 'internal' metrics are kinda pointless by virtue of being irreproducible (though tbf still might be more than just for marketing, like could be done in earnest to shape model development before deployment)..

shrewd shuttle
#

they should do it as some kinda beta side project - it would be interesting to see the divergences in voting patterns (assuming they exist)

dreamy orchid
shrewd shuttle
visual warren
shrewd shuttle
#

yeah it's absurd (and was always going to be an own goal) - dunno what they were thinking

dreamy orchid
#

to be totally fair, as I see lmarena so far, it is great to gauge the value of models as "substitute to classic common internet searches". "common" here is key. People say lmarena is ranked according to human preferences but I see it really more as I don't google! model, tell me the answer!.

Thus, as meta provides llama in many apps that are used on the fly with common queries, it is a great benchmark to see if it would satisfy people. That happens also for other companies, like xAI integrated in twitter with likely people asking common queries there too.

Further as a company they don't need to release open weight models, so the idea of a double release is perfect. They get to verify that their model is very usable for their apps (lmarena score); they get praise for their results (blog posts and hype); they still release their models (though not fine tuned) so that the competition doesn't have ready made products from day 1. People will complain about that, but those that complain are the minority, the whatsapp users don't care.
So it is really a sort of win-win for them, not for the community.

Then we need companies like nvidia & co that release the llama derivatives to fine tune them properly.

dreamy orchid
shrewd shuttle
#

correction: (after coding) it's mostly people asking for medical advice.. then how many Rs are in strawberry ha

wraith kestrel
#

Huh, "connection errored out" while using P2L. Is the model being retrained? 🤔

#

If that's the case, then I'm looking forward to it.

dreamy orchid
shrewd shuttle
#

why would your belief/hunch be more valid than the prompts in the Explorer, in terms of what people ask in the arena? sure, they're not all there, but they're arguably representative of the actual prompts people use in the Arena, at least to some extent.

#

but perhaps most people are really asking questions like "what time does the pharmacy on High Street in Birmingham close on public holidays?", but they're hidden from us for (literally no idea why)

dreamy orchid
#

I think if they picked common questions in the prompt explorer it would make lmarena look less good? I know it feels silly, but I know a group of people using lmarena, and when I see them posing questions they are simply like "I could have googled that". I am guilty of that too. And no, it is not something like "at what time this and that happens", rather it is "could you explain this concept to me" or the like.

#

it is completely fine IMO, as an LLM compresses knowledge so why not.

#

Imagine stackoverflow, ELI5 (from reddit), and other similar places put in lmarena.

#

now some ELI5 or stackoverflow questions aren't easy at all, but most are solved by some googling

#

it makes also sense statistically. stackoverflow and other Q&A places have most of those distributions. Relatively easy questions (aka: with some googling they are solved) are common and few are hard. Why should it be different with LLMs ?

#

I mean, as long as those that pose the questions are humans

weary rampart
# dreamy orchid I mean, as long as those that pose the questions are humans

Although the arena is quite obviously used by humans, i think that it still inherently has to be a distribution of somewhat difficult problems, because the people using it are quite frankly on average significantly more invested in topics like cs, ai and other areas where ai is being successfully applied currently (e.g. medical or creative writing). This already shifts the average question away from these really basic questions about when a pharmacy opens.

#

that is also mainly why the puzzles category ranks so high i think

shrewd shuttle
weary rampart
#

i also find it interesting that lmarena has yet to really classify these convos in a very holistic way considering the amount of A/B test pairs available (also includes the current P2L models which are also not really good)

#

but maybe i am just underestimating the complexity of doing stuff like that idk

shrewd shuttle
#

i kinda thought they set the classifier up quite early on in the project, and it's handled by like llama1-8b or something old and tiny like that, and while it might've done an 'ok-ish' job back then, now it seems clearly suboptimal / in need of some kind of refinement

#

but yeah, perhaps they have been trying to refine it all this time but it's just tricky to get right (but intuitively.. that doesn't seem right to me.. like classification is a pretty rudimentary and well-established task..)

dreamy orchid
#

so even if the audience of lmarena is skewed towards IT, it doesn't necessarily mean that those are hard IT questions.

#

Otherwise if the questions were always quite hard (and in the IT realm), LMarena coding category would be more in line with other coding benchmarks. Again my evidence is based on the normal questions based on Q&A sites (stackoverflow and others)

#

but again that is my opinion, I don't want to convince anyone. It is just that there are too many clues (IMO) that point in that direction.

#

also, as you mentioned, the categorization could be also very loose. Like "coding is anything that has code snippet markup", that could be quite broad.
I asked logic questions where the model used code snippets markup, but that is no coding.

weary rampart
# shrewd shuttle but yeah, perhaps they have been trying to refine it all this time but it's just...

they did definitely work on improving it, I think they used 70b at first for the classification on the normal lmarena (not sure) and likely had to stick with it considering that changing the model would heavily change the rankings per category as well
but they did work on the arena explorer quite recently: https://blog.lmarena.ai/blog/2025/arena-explorer/ (where they use a different method), although i am unsure why they opted to use the mpnet v2 model for this, because they show that the model has somewhat falsely classified some things in the very same blog.

weary rampart
weary rampart
dreamy orchid
#

btw I checked the arena explorer, I hadn't in a while, and my points are somewhat confirmed in my view. I checked the largest category and most examples are solved by google + some brain.

I didn't check all the categories because it was enough to find many of them in the most common categories.

#

the other examples either were too hard, like "do it all for me", or too technical - I am not versed in everything to judge well.

dreamy orchid
#

what I would really wish is that for every category they already have (categories could be expanded, but with p2l it is fine anyway) they would make the "hard" subcategory for it. And for hard I don't mean hard prompts, rather "hard questions".

So hard math, hard coding and so on.
I would expect then hard coding to be more in line with aider polyglot and so on.

dreamy orchid
weary rampart
# dreamy orchid in my experience people use it as chatgpt alternative once I shared it. Nothing ...

yeah i generally think that such a thing could really make the arena more interesting as a whole, i honestly don't know what is stopping them.
I mean you could even derive something like Humanity's Last Exam (really specific problems from domain experts) out of these millions of questions.
However, at its core this site is obviously just about human preference, even the coding arena and webdev arena (minus maybe repochat) are heavily centered around human preference.
=> for human preference it is obviously essential to have questions that people actually ask instead of ones that are highly selected, artificially created or unrealistic when compared to real AI assistant human interactions

dreamy orchid
#

agreed

#

also nice the "lmarena humanity last exam" if one picks the proper questions.

#

though IMO the questions in many benchmark should stay private. As soon as they share them - and if the benchmark is notable - there is a high pressure to optimize against those questions.

For example livebench is nice, but models score 70% while 30% of the questions are private. It feels like a bit more than coincidence.

shrewd shuttle
dreamy orchid
#

hence I think that open based benchmarks a la lmarena are potentially the best if properly scored.

dreamy orchid
#

still they aren't hard questions.

shrewd shuttle
shrewd shuttle
#

most people are just playing around / seeing what they get as the responses in a blind battle

#

they're not actually trying to fix code

dreamy orchid
#
  • Is the spiciness of a hot pepper only perceived or true and physical?

  • what are the odds of someone in Texas Hold 'Em rivering a Royal Flush while the other player rivers Quad Aces??

  • I will give a congress talk "On Naevi" -- naevi are benign melanocytic lesions which are markers and every so often also precursors of melanoma. Do you have suggestions for a short and succinct title for my presentation

  • What does it mean if I have a "proud rooster"?

  • What is the latest season of Fortnite?

  • What is an RNN in the field of AI?

  • Create table of yogurt nutrients versus greek yogurt

  • generate study plan for IAS exam in marathi

  • Read this passage from the article:
    they were honored at Navy gatherings where new Black U.S. Navy officers expressed their gratitude. "We owe it all to you," they said. "If it hadn't been for you guys, we wouldn't be here."
    In this passage, the word gratitude means __________.
    a feeling of trust a feeling of hope a feeling of peace a feeling of thanks

  • My left leg hurts when I'm sleeping and immediately when I wake up. The pain will disappear during most of the day, except when going up and down the stairs. I have touched my leg in multiple places, and there is no specific location that hurts to the touch, although I can feel some strain in my ankles/calves. What is the likely cause of my leg hurting?

  • The placement and connections between rooms in a building leads to the formation of hallways and corridors, but sometimes there's necessarily a space that's just... not much of anything, and it only exists because of the shape and layout of the building.

What are these not-quite-rooms/not-quite-thoroughfares called?

and so on.
Those surely are useful questions but not necessarily hard ones.
I cannot go on and on.

dreamy orchid
# shrewd shuttle i'm kinda lost as to what your point is now tbh ha.. i just don't think there's...

if that would be true, then lmarena would be the best indicator of intelligence for models, but it is not for a while. That is the strongest clue.

My point is: LMarena is useful, but only to tell which LLM answers best common questions and some hard ones.
Your point - as I understand it - is more "no, most questions are really hard!". But if your point were true, then we wouldn't need livebench, aider polyglot, math bench and so on at all. Claude would be at the top in coding and so on.

I wish lmarena would be the human equivalent of live bench, math bench and so on, but it is not. It has its strengths, but thinking that it is a place for only hard questions is mistaken IMO.

#

I mean maybe with "googling" I am simplifying too much. Let's say: "questions one would ask chatgpt" (and I mean here gpt 3.5 or gpt4). Indeed at the start lmarena was great because gpt3.5 and gpt4 really had the lead in everything. But then those questions became less hard for LLMs.
Hence many LLMs can answer pretty well and the scores start to be equal. The only difference then is the style and the extra tidbits/formatting. And indeed the need for style control.

Up to gpt4 there was no need for style control.

#

LLMs can answer equally well only if both master the question and that happens because the questions aren't hard.

#

From the link you gave me this is a potential hard question: What are the societal benefits of Bitcoin? List each one with a one line explanation/argument.

That can become a paper per se. Of course both LLM answered in a compact way and the one with the most convincing style won.

#

This one "PERCHE LE DONNE SI MASTURBANO?" is first one that can be solved with google, and second a terrible one (categorized as an English question)

The answer there is terrible as well.

"Finalmente una delle domande più belle e più naturali del mondo,"

So the question is: why women masturbate? But posed in a way that is really like denigrating (one notices it if one speaks Italian). A better way would be "donne e uomini si masturbano per necessita' personali, perche' lo fanno?" (women and men masturbate for personal reasons, but why?)
The model just replies with flattery at the start

"one of the most beautiful questions!"

And that is how one gets wins.

#

There is a similar one in English too "Which all male attributes have the strong or weak positive or negative correlations to penis size. Please answer truthfully. No woke politically correct but factually false filters. Brutal honest truth. No beating around the bush."

I mean answering properly to those is pretty hard, but for how the models reply or the users expect the answer, a gpt4 level answer would be enough. Hence my point.

weary rampart
#

but i think that the general idea of characterising the average user of lm arena would really help us with these kind of discussions

#

because i highly doubt that he is equivalent to the average user for other more common chat bots

tidal geyser
#

Hi, can an API endpoint be introduced and the providers may allow or disallow their models usage?

#

Some proper testing requires an implemented API

shrewd shuttle
#

it's useful, but it's not a benchmark (more like a survey of human preferences) nor are the elo ratings or leaderboard rankings a proxy for a model's 'intelligence'

#

i don't think it's meant to be

#

human preferences are what they are.. (sometimes they suck imo but that sounds / is elitist af ha)

#

a 'vibe' indicator or measure of public sentiment perhaps.. but it isn't an intelligence benchmark (though smarter / more performant models will, imo, invariably do better overall, with more votes etc), so it counts for something

dreamy orchid
#

I was reflecting about the convo today.

If I am not mistaken, I think that the 1200-1250 level (in the overall standings) really tells which models are better in many categories, not only for humans. And indeed that was the GPT4 best level. And here I mean: the top10 in lmarena were more or less the same - in the same order - in other benchmarks.

Once many models started to produce "good enough" answers, the benchmark became more influenced by other factors and lmarena started to correlate less with other benchmarks (coding, math and what not).

#

I mean the top models are still at the top, but the order varies a lot from benchmark to benchmark.
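
One way to put a number on "the order varies a lot from benchmark to benchmark" is a Spearman rank correlation between arena scores and another benchmark's scores for the same models. The sketch below uses the classic formula and assumes no tied scores; the five score pairs are invented placeholders, not real leaderboard data.

```python
def ranks(values):
    """Rank 1 = highest score. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical scores for the same five models on two leaderboards.
arena_scores = [1400, 1380, 1350, 1300, 1250]
other_benchmark = [60, 72, 55, 40, 45]
rho = spearman(arena_scores, other_benchmark)
print(rho)
```

A value near 1 means the two leaderboards order the models almost identically; running it separately on older and newer sets of models would show whether the agreement really dropped over time.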

weary rampart
#

But should be easy enough to check with a Bit of Code

#

Might do that tomorrow

dreamy orchid
weary rampart
# dreamy orchid example of something where users vote on the same prompt more or less. Not bad: ...

Well I think the best example for why one should really be wary of human preference benchmarks where the user is not writing the prompt on their own is that there is a significant difference in the rankings of image generation models by artificial analysis and lmarena, with the only difference between the two (as far as i know) being that artificial analysis uses predefined prompts and lmarena does not. Thus I can at least conclude that the results of both methods will differ, with the lmarena approach likely being more holistic.

weary rampart
#

this is what i got

#

and some other stuff, but still working on the repo a bit

dreamy orchid
#

nice, it would be cool to put it into github for everyone to see. Could you make the first graph (the others seem less relevant) for the categories and/or the style control too?

weary rampart
wraith kestrel
#

Smaller Gemma 3s are also being tested. Nice!

Can we expect Llama Scout to join the Arena as well? 👀

dreamy orchid
#

the ones about parameter sizes aren't that much informative. I mean there is a trend, but it is a bit all over the place.

#

and yes no stress with the code. It can happen when one has time

visual warren
#

Add new Kling model to text-to-image - KOLORS 2.0

weary rampart
#

Might also be interesting to not just directly use the blended price for the comparison but to also have the option to use the average token usage (in the arena for the specified category) * the price.

#

That could also be really helpful to 'combat' these models that use very high TTC in the response to enhance perceived quality (e.g. llama 4 maverick special chat version).
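
A small sketch of the "average token usage * price" idea. The prices and token counts are made-up placeholders, and `cost_per_response` is a hypothetical helper, not an existing arena metric.

```python
def cost_per_response(avg_input_tokens, avg_output_tokens, input_price_per_m, output_price_per_m):
    """Average cost of one response in dollars, given prices per million tokens."""
    return (avg_input_tokens * input_price_per_m
            + avg_output_tokens * output_price_per_m) / 1_000_000

# Hypothetical models: one is pricier per token but terse, the other cheap but verbose.
terse = cost_per_response(200, 400, input_price_per_m=3.0, output_price_per_m=15.0)
verbose = cost_per_response(200, 2000, input_price_per_m=1.0, output_price_per_m=4.0)
print(terse, verbose)
```

With these invented numbers the model that looks cheaper per token ends up costing more per answer, which is the effect a blended per-token price hides.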

visual warren
#

this time when o3 launches do 2 separate models for both families when putting it on the arena - the differences in performance with reasoning effort have historically been quite large

o3
o3-high
o4-mini
o4-mini-high

frigid pine
#

hig jay

visual warren
#

damnit

frigid pine
#

L

visual warren
#

you're lucky you're far away 🙄

frigid pine
#

how come?

visual warren
frigid pine
#

what is this man planning

visual warren
#

wouldn't youuu like to know weatherboy

frigid pine
#

i would actually

visual warren
#

that would spoil the surprise!

frigid pine
#

3:<

visual warren
#

it being that direction feels wrong

frigid pine
#

yeah, but it's the only way to make a colon three frown

#

well

visual warren
#

:3

#

is it?

frigid pine
#

frigid pine
visual warren
#

ohhhh right

#

lmfao

frigid pine
visual warren
#

true

visual warren
#

add o3, o3-high, o4-mini, o4-mini-high tf_kek

visual warren
frigid pine
#

o3-high seems a lil' unlikely lol

visual warren
#

😔

#

they did o3-mini-high so hopefully we got o4-mini-high too

visual warren
#

also add o3 to the vision arena

#

nvm seems to be there now :)

wanton star
grizzled hamlet
#

I am just reading about that

#

also, o4 is going to be insane when it fully comes out

hushed crest
#

The 2.5 PRO is crashing every time I encounter it. The tasks take ~3 to 5 minutes. Is it a timeout issue?

#

Same on the direct chat

ocean sky
#

I'll repeat here what I said in #leaderboards

I think style control is a very important feature, and if it was on by default, the llama 4 controversy would be much weaker. At the same time, there is still a 48 Elo difference between the two llama 4 versions that arguably differ only in style, so it is worth thinking about which additional features can make style control better
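
For a sense of scale, here is a small sketch that converts a rating gap into an expected head-to-head win rate. It assumes the arena scores follow the standard Elo convention (base 10, scale 400); `expected_win_rate` is an illustrative helper name, not part of any arena codebase.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model for a given rating gap.

    Assumes the standard Elo logistic curve with base 10 and the given scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


print(f"{expected_win_rate(48):.3f}")
```

Under that assumption, a 48-point gap means the higher-rated version would be expected to win roughly 57% of decisive battles against the other one.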

agile flume
#

hey @ocean sky we are working on an improved version of style control to include sentiment features. initial result looks very interesting. we will share more with community soon

dreamy orchid
# ocean sky I'll repeat here what I said in #leaderboards I think style control i...

I don't like the style control because we are chatting with the LLMs, we are not making api calls.

And indeed the tweaked llama version will likely be great for the average user of whatsapp & co.

If you see LMarena as "which LLM would be best for the average user question that an AI assistant gets?" it makes much more sense.
It is the same reason why claude is nowhere near the top5 while in webdevarena it destroys everyone.

From this perspective, the arena is fine. I personally check a mix of categories like the hard prompts category and longer query. A bit less coding to be fair, because coding is more webdevarena (or there it is more appropriate to ask for api calls)

#

so yeah, lmarena is good but having a mix of benchmark to check is better.

tidal geyser
#

Please add geographical understanding to lmarena. I want to play geoguessr with the assistants

untold kiln
#

Can we get a better mechanism to temporarily disable models that return nothing? I get Claybrook on every battle in WebArena, and it takes 5 mins to wait for an empty output that results in neither a satisfying comparison nor a meaningful vote.

pure compass
pure compass
#

I only hope that version will also get the weights released, not that I have the hardware to run it.

dreamy orchid
#

if one checks the battle count heatmap (battles ended without ties) there are way too few comparisons, given that every human judge judges differently.

ocean sky
# dreamy orchid I don't like the style control because we are chatting with the LLMs, we are not...

Well, if you're interested in "which LLM would be best for the average (by number of queries) user of lmarena.ai question that an AI assistant gets?" then indeed, style control is of no use for you. However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions; I couldn't care less about the number of bullet points or emojis included in the answer. Unfortunately it turns out the number of bullet points and emojis does skew the votes even if the content of the answer is the same.

I view the style-controlled leaderboard as an evaluation of the content of the answer, disregarding the format of the answer. This is a bit simplistic since you can deliver the same content in a way that is more or less accessible, and sometimes the style is an essential part of the evaluation. Still, the point stands: the finetuning that made the llama yap like crazy shouldn't affect the style-controlled leaderboard. Moreover, since style control uses relatively simple features, it just prevents the most obvious ways of climbing the leaderboard, but does not really punish different "styles".

Finally, as my personal opinion, the attempt to maximize the non-style-controlled arena score (since it's the default) makes llms shittier. I don't want that to happen, and an easy way to fix that is to make style control the default. The non-style-control option will still be accessible using the checkbox.

pure compass
#

But it is important to make sure the style control does not over compensate, because I think there is a positive correlation between the quality of the answer and the style.

short scarab
#
#

Doubao LLMs and image generation

sand breach
#

how do you print a conversation?
at least in the browsers i tried, larger textboxes will be cropped. i solved this with a bookmarklet (= js code you can put in a bookmark)

javascript:document.querySelectorAll('#chatbot').forEach(el%20=>%20el.style.height%20=%20'auto');

i tested this on firefox, other browsers may restrict bookmarklets due to security reasons, but theres usually a setting to allow it.
is there any other solution you guys use?
if not, might i suggest adding a button to switch to a printable view?

sand breach
#

i just learned there is a new ui coming, but i assume the same effect can be achieved there. just need to figure out the proper selector...

dreamy orchid
# ocean sky Well, if you're interested in "which LLM would be best for the average (*by numb...

" However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions"

yes but the problem is that it is not an automatic test, where you can adjust the parameters. You cannot force people to vote how you like (that would be biased too) and from that you cannot force a ranking on everyone just because it is best for you. That is a bit too "it has to work for me, not for everyone else".

For that type of benchmark I guess one should build another version of the benchmark. Because a counterpoint to your assessment is: if two models present exactly the same content, but one in $nice_font and the other in an $illegible_font, they should get the same score. Not at all.
Same for information that is consumed by a pair of eyes and not another machine: formatting counts a lot.

Hence instead of showing the same forced ranking for everyone - ranking that could be also faulty a bit (I am not sure how much style control really captures the "content only scores") - I'd rather really focus on a different benchmark.

lmarena could have all the formatting extras while the "new" benchmark has only pure plain text (and even there one can format things nicely).

I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others (or with lmb by @wide edge you can save a bookmark with style control activated)

This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena.

#

Hence the default score is the most representative. Third, I really use other categories and I use bookmarks for them, that's enough for me.

dreamy orchid
#

I think lmarena delivers the best combo: quality of the answer + ease of reading (format). openrouter rankings tell us mostly what is best for coding (given the price). LiveBench, mathbench and lmarena categories taken as a whole tell us which model can do best for STEM questions.

ocean sky
# dreamy orchid " However, for me, the arena leaderboard is a good proxy for evaluation of answe...

> This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena.
It's not only me; many people working on LLM benchmarks agree. If everyone were ok with LLM devs putting in work to benchmaxx and generate the most beautiful slop that gets upvoted, the style control feature wouldn't be here. But it is, and for obvious reasons, part of which I already listed; all I'm saying is that it's not enough, and since the devs are working on improving it, I believe I'm not the only one who thinks so.

> I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others
I think I made it pretty clear: The default score is the one optimized, and the non-style-controlled score is easily optimized by more yapping and slop, and making less slop is an excellent reason to make the change. Of course, if you like whatever default score optimization leads to, you'd oppose this change. I didn't try to convince anyone that it's bad; I'm convinced it is (and I'm not the only one), so I'm proposing a sensible solution for those who believe it is a problem. I'd be happy to argue why if that would help to decide.

dreamy orchid
#

I think the slop many mention is actually liked by many end users. By end users I mean those who use, for example, copilot integrated everywhere, llama integrated everywhere, grok and so on.

So it is about what we want to measure. For end users (who are the vast majority of internet users) I think lmarena is really representative.

I get your point, you want a sort of XKCD 810 but for LLMs, and that would be nice too. I still think it should be a different benchmark. Because if the AI labs benchmaxx for style control, they can make a lot of end users less happy (emoji and co)

#

but anyway, I make do with what is there. I think if there were such a giant push for style control, there would already be another benchmark. lmarena is not new and they don't have a monopoly on benchmarking either.

errant musk
#

One of the models is frequently not showing anything

nova ledge
#

claybrook stopped working in webdev arena.

visual warren
#

maybe admins could add features for "premium users", such as uploading files (.txt, .js, .php, ...)?
i'm ready to pay for the service!

wraith kestrel
#

Read the Sentiment Control article, and I gotta say this is the right direction to go.

Gemini 2.0 Flash is being used for sentiment classification. But I wonder is it really both cheap and accurate enough to run vs an open weight model with similar performance if there's any? And will it be used for prompt classification (Hard, Creative, etc.) too, for consistency's sake? 🤔

dreamy orchid
#

I also like the correlation with the performance, also for headers, length and so on. That way it is much better to "correct" the score rather than ignoring the battles.

#

also lmarena is actually a sort of social experiment too, not only a bench for LLM. People like being flattered

pure compass
#

Again mentioning you really need to fix your content moderation system when it comes to images. Or can anyone explain what's wrong with this image? https://civitai.com/images/67890084 I tried cropping the arm away in case it is too much exposed skin (lol) but still content warning. This is getting ridiculous. Is it smiling suggestively or what is the problem?

#

At least tell us what exactly caused the flagging; maybe we can help you fix it when we know what triggered the false positives

zealous junco
#

I wanted to add my own llm
And make it available on arena playground
How can I do that?

woven moat
#

gemini 2.5 pro experimental keeps having its answers cut off. another friend is also reporting this issue

#

maybe there's some character it's returning that's being interpreted as an end of message token?

jolly ore
#

Can you guys add Claude web search

#

And the other chatgpt web searches aside from gpt 4 Omni or whatever it's called

whole shadow
#

the o3 model in lmarena is really weak or not the ai at all, i tested it with the research math question from here 5 times: https://openai.com/index/introducing-o3-and-o4-mini/ and it failed to give the correct answer all 5 times, it even took around 3 minutes of thinking instead of the 55 secs in the example.

wide edge
hushed crest
#

@agile flume Could you or someone from the lmarena team host a questions and answers webinar?

short scarab
#

Can the new GROK 3 (not early) be added, plus its vision, Grok Aurora, reasoning, and search capabilities?

#

as well as this, can you add Doubao 1.5 Pro, 1.5, and roleplaying?

#

Doubao is extremely underrated

#

Can't wait to try it out

#

Along with seeddream models

torpid drift
#

Is there someone I can talk to about search arena? We found some issues, would love to talk to whoever is involved

strong slate
frigid pine
#

is there any reason kimi isn't on lmarena? not sure what the policy is for adding new models/companies

dreamy orchid
vivid geyser
#

Hi, Diffbot has just dropped weights on HF for a new search arena LLM that implements the first o3-style interleaved function calling in an open source model. Would love to see more open-source competition, as it is all proprietary models in search arena at the moment!
How do we get included in the arena? We have an API hosted version as well and can provide free credits.

fallow violet
#

@agile flume Hi, I am wondering if Qwen 3 models will be added to the Arena in the near future? Thanks!

wraith kestrel
#

Oh, Qwen 3 series!

#

95.6 on Arena Hard Auto '24.

#

Wonder how it will perform in the actual Arena.

indigo kernel
wraith kestrel
#

What are the chances of getting Gemini 2.0 Flash Image Gen into the Image Arena? 🤔

vapid kraken
#

This paper just got published by some AI researchers on the unfair practices and lack of transparency by Chatbot Arena. Do the lmarena folks have an answer to these? The community should know. https://arxiv.org/abs/2504.20879

swift cloak
#

"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?

brave yoke
#

the paper presents evidence showing the biases in practices towards a handful of preferred providers, but it does not cover an equally concerning bias against open-source models and small independent developers as can be seen by the many messages in this channel above asking for transparency on how to submit models. I doubt they ignore the requests from Meta and Google in the same way since they accepted 27 private variants just from Meta alone leading up to llama 4

dreamy orchid
# vapid kraken This paper just got published by some AI researchers on the unfair practices and...

one problem that is easy to see: when new cloaked models arrive, they get matched aggressively on new questions. That is good for PR, as the models become visible on the rankings within a week, but it is not so good in general, because the feedback is vast and model providers can tune their models on it.

If the cloaked models were picked only every now and then (like all the others), it would be harder to adjust the model, and the provider would have to either wait (difficult under market pressure) or publish the model as is.

I think slowing down the matching with cloaked models can already help a bit. Then again for the problem "yeah but why Claude 3.5 from Oct 2024 was not #1 in coding?", that is the usual point: API calls (like with inline suggestions with an IDE) and human conversations are different, hence claude didn't win. For api calls one can check openrouter

wispy patrol
dreamy orchid
#

exactly. And if they are under pressure to publish, then they would publish it ahead of lmarena scores anyway, so people would already have experience with them (via openrouter and what not) to compare the behavior.

strong slate
# vapid kraken This paper just got published by some AI researchers on the unfair practices and...

Thanks for the authors’ feedback, we’re always looking to improve the platform!

If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the

vapid kraken
#

Karpathy, accomplished AI researcher, shared his thoughts in a tweet. Honestly folks, I am done with Arena as a model builder. Was an admirer of the many fresh ideas chatbot arena brought over the last two years and respect the academic work involved, but this unfairness and opaqueness and being secretly in bed with the big powerful AI closed labs is honestly heartbreaking and absolutely terrible for the community. Esp for an academic project coming from such an established Berkeley lab.... I think lmarena is done and dusted for me and, I know, for several other researchers and builders of late. Time to move on to other mechanisms, as Karpathy writes, and other various platforms for evals and rankings. Thanks for all the work, but we as a community deserve much better. https://x.com/karpathy/status/1917546757929722115

There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion"
https://t.co/LfjIII71qX

I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

light apex
#

Is there any option in "parameters" to activate "reasoning high" for o3 and o4-mini? I would like to test these llms with high reasoning effort.

velvet night
#

really wish o1-pro was added

dreamy orchid
dreamy orchid
glossy meadow
#

The weight I put on chatbots arena has gone very low after the llama event and the fact every new model seems to benchmark hack their way to the top.

https://artificialanalysis.ai/ feels much more objective at this point

Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.

dreamy orchid
#

Artificial analysis is simply a collection of benchmarks: "Intelligence Index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500"

The problem there is that one doesn't know if those benchmarks are "benchmaxxed" as well (data in the training set)

#

furthermore, the artificial analysis score also seems unclear. R1 small distills still do better than Claude 3.7 (no thinking) or come close to Gemini 2.0 Pro thinking (the one from January). That seems unlikely.

weary rampart
dreamy orchid
weary rampart
#

Man I was just confused thinking I missed the release of 2 pro thinking or something. lol

dreamy orchid
#

yes I was going from memory. It was the first thinking model from google though.

#

I think in the arena the name was "gemini-2.0-flash-thinking-exp 01-21"

light apex
dreamy orchid
#

ah I see, they likely will come later (as with o1 and o3 mini)

#

the oX versions were all tested with medium at first IIRC

rose robin
frigid pine
#

more seriously: o3 high in direct chat seems very unlikely, o4-mini-high is definitely possible but not currently implemented

#

if they do choose to add the latter, it'll likely be listed as a separate model

whole snow
#

Greetings. I found a little bit of an "issue", so to speak, that is a little bit frustrating to me.

#

Whenever I do the arena (battle), I can always tell when one of the LLMs is based on Claude, due to the shortness of the answers, and I worry that it would invalidate my tests.

#

Do you have any suggestions on how I can adjust my prompts so that it isn't as obvious?

strong slate
whole snow
#

All right. Thank you. I always assumed that since I could tell the model due to its length that that was a form of revealing itself. I appreciate the answer, and I will read that.

#

I will keep on experimenting and judging. I have been having a lot of fun with it, seeing how each model "thinks" differently.

strong slate
whole snow
#

I've played around with it a little. Not enough to have a reaction to it yet, though. I will do a little more playing with it today at work if I have some downtime.

light apex
wraith kestrel
wide edge
#

Granite-4-Tiny-Preview is a 7B parameter fine-grained hybrid mixture-of-experts (MoE) instruct model finetuned from Granite-4.0-Tiny-Base-Preview using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, and model alignment using reinforcement learning.

buoyant wraith
frigid pine
#

not sure if this has been mentioned before, but the suggestions below the web arena "reset" every time the "Generate me a UI for..." prompt field is updated

short scarab
#

The random icon should be for those

#

Especially as we get more expensive models on the Arena, all of the wasted money added up would be a huge amount

pure compass
#

I think I asked it before but is it clear by now that the weights of llama-4-maverick-03-26-experimental will never be released? Or is there still a chance? Or are they already and I completely missed it? (Not that I have the hardware to run it)

dreamy orchid
#

you can ask meta that question. My guess is that they keep it for themselves, they don't owe it to the community.

Btw llama-4-maverick-03-26-experimental is back and is winning already also in my case.

echo furnace
dreamy orchid
#

there are two but not the other ones (3.2, 3.3, 4 - at least those announced in reddit locallama)

pearl garnet
echo furnace
shy flint
#

Hello @pearl garnet, I run an AI search startup that processes millions of searches with high quality outputs (especially with reasoning/DeepSearch, which rivals Perplexity/Gemini Deep Research), and, I was wondering if it would be possible to add it to the Search Arena. Can you DM me about this? Thank you, Paul

pearl garnet
pearl garnet
visual warren
#

hint: his about me

amber umbra
#

you basically just make up benchmark numbers, do a lora or basic finetune if even that, and then call it a day

shy flint
amber umbra
amber umbra
amber umbra
#

this new search thing is probably some existing API developed by someone else repackaged under your name

shy flint
amber umbra
shy flint
pearl garnet
#

hey stepping in to slightly gesture towards our rules

Treat others with kindness and curiosity—we’re here to share, learn, and debate ideas, not start fights. Healthy debate? Yes. Personal attacks? No.

shy flint
pearl garnet
amber umbra
# vapid kraken Karpathy, accomplished AI researcher, shared his thoughts in a tweet. Honestly f...

I think the important thing to realise here is that every benchmark in isolation can be gamed and lmsys is no exception. It's not a definitive answer and is only relevant if all the other usual benchmarks check out too. A good example is Nemotron 70b, which was openly made this way to perform better on lmarena without improving anything else over llama 3.1-70b https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/discussions/11#6712c8f758bdba34248ce0ef

wraith kestrel
shrewd trench
drifting bramble
#

can we have an arena mode where chat is infinite (only last <CONTEXT_WINDOW_SIZE> tokens are given to models)?
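
One way such a mode could work is a sliding window over the conversation. The sketch below is only an illustration: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `last_window` is a hypothetical helper, and it drops whole older messages instead of cutting one mid-text, which is just one possible reading of "only last <CONTEXT_WINDOW_SIZE> tokens".

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text)


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def last_window(history: List[Message], context_window_size: int) -> List[Message]:
    """Keep the most recent messages whose total token count fits the window."""
    kept: List[Message] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that no longer fits.
    for role, text in reversed(history):
        n = count_tokens(text)
        if used + n > context_window_size:
            break
        kept.append((role, text))
        used += n
    kept.reverse()  # restore chronological order
    return kept


history = [
    ("user", "one two three four"),
    ("assistant", "five six seven"),
    ("user", "eight nine"),
]
print(last_window(history, context_window_size=5))
```

With a window of 5 "tokens", the oldest message is dropped and the two most recent ones are passed on, so the chat can continue indefinitely while each model only sees the tail of the conversation.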

wide edge
#

probably either too long or angries the WAF

quick mason
#

can't put some images: error
HTTP 403:
Please enable cookies.
Sorry, you have been blocked
You are unable to access lmarena.ai
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

pearl garnet
quick mason
#

it was in original on the og site, i fixed it

tidal geyser
#

Hi can we get emojis for all LLM providers?

icy laurel
#

This issue has been consistent

pearl garnet
median geyserBOT
#
:warning: Channel locked

Site outage, will turn back on when resolved.

median geyserBOT
#
:success: Channel unlocked

Welcome back :ablobwave:

median geyserBOT
#
:warning: Channel locked