#arena-feedback
Works for me? Albeit it's a bit slow (3-4 second delay), it still loads
Can we get a deepseek r1 distill model in the arena
Or maybe even a quantized model (e.g. r1 Q8_0)? Would be interesting to see the effect on accuracy
there are already lots of benchmarks and analyses done on quantizations. i think it would overflow the leaderboard with too much information. the difference wouldn't be that noticeable i think
but maybe some small tests that get published independently from the leaderboard could be interesting
I know but benchmarks and lmsys rating rarely paint the same picture
take claude 3.7 e.g.
crushes benchmarks, #1 on livebench, 70% swe
But dogshit rating
We should be able to stop the output and vote when it's obvious which one we're going to choose.
this would be like rating a movie without watching it till the end
A few times I had a model repeating the same sentence forever. I don't know which model. I tried disconnecting the Internet, waiting for it to error out, reconnecting and then trying to vote, but it did not work, so for that case a stop button would be great.
maybe not in arena, but a stop button would come in clutch for direct chat or side by side
There should be a timer indicating how long each answer took
Exactly. Sometimes the movie is so bad you walk out of the theater.
If one model is writing a long good answer while the other has already output a short refusal, I can stop the generation and choose the real answer as the better one.
Saving me time and saving the provider time and money on generation
Somewhat contrarily, I also think we should be able to vote on random queries and responses that other people submitted, since they're all going into the database anyway. Let multiple people vote on which response is better for a given conversation, and get a lot more battle data without spending any energy on generating new outputs or waiting for them to be generated.
yes this is a fair point. also a good idea. because people would tend to vote differently on these outputs maybe
I agree 100%
They did this with https://open-assistant.io/ back when the project was alive
It was really cool
It was a shame when they ended their project.
@agile flume would lmarena ever consider this?
It had a nice UI with plugins https://www.reddit.com/r/OpenAssistant/comments/13seg3h/open_assistant_can_use_plugins_cool/
I can't find any photos of it but they had a feature where you could see public generations by category and then have to select better responses. It also let you submit your own better responses and even rate things like output quality, creativity, and potential harm.
https://huggingface.co/OpenAssistant
Datasets set out the labels like this: { "name": [ "spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech", "sexual_content", "quality", "toxicity", "humor", "creativity", "violence" ], "value": [ 0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0 ], "count": [ 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3 ] }
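A minimal sketch of how the parallel `name` / `value` / `count` arrays quoted above could be zipped into a per-label lookup. The variable names are my own, and only the values shown in the message are used.

```python
import json

# The label record exactly as quoted in the message above.
raw = """
{ "name": [ "spam", "lang_mismatch", "pii", "not_appropriate", "hate_speech",
            "sexual_content", "quality", "toxicity", "humor", "creativity", "violence" ],
  "value": [ 0, 0, 0, 0, 0, 0, 0.8125, 0.16666666666666666, 0.3333333333333333, 0.5, 0 ],
  "count": [ 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3 ] }
"""
labels = json.loads(raw)

# Zip the three parallel arrays into {label: (mean value, number of raters)}.
by_label = {
    name: (value, count)
    for name, value, count in zip(labels["name"], labels["value"], labels["count"])
}

print(by_label["quality"])     # (0.8125, 4)
print(by_label["creativity"])  # (0.5, 3)
```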
password: super-alpha
Found a minor bug I couldn't screenshot
I got a Cloudflare captcha overlaid on the UI, right over there on the top left
For the new UI - I'd def submit it to the bug report form to get prioritized:
PLEASE Give us feedback here: https://forms.gle/8cngRN1Jw4AmCHDn7
and 🪲 report bugs here: https://airtable.com/appK9qvchEdD9OPC7/pagxcQmbyJgyNgzPx/form
I’d rather give feedback over here or GitHub
Google’s forcing me to change my password to use the forum
But the cloudflare captcha thing is hogging up space
Hello, would it be possible to add text attachments (txt, csv, tsv, xml, html, css, js, py, c, cpp, and other text-based files) by having the file appended to the prompt, possibly bypassing the character limit?
Example
Hello LLM!
———User Attached file: Hello.txt (TIMESTAMP)——
Hello World!
—End-Of-File—
Similar to repochat
Additionally, support for Excel documents, Word documents, and SQLite would be helpful, as well as code folder uploads like Gemini has.
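A rough sketch of the "append the file to the prompt" idea from the example above. The function name `attach_text_file` and the plain ASCII delimiters are assumptions for illustration, not an existing arena or repochat feature.

```python
from datetime import datetime, timezone


def attach_text_file(prompt, filename, content, timestamp=None):
    """Append a text file to a prompt between simple delimiters (hypothetical format)."""
    if timestamp is None:
        # Default to the current UTC time when the caller gives no timestamp.
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    header = f"--- User attached file: {filename} ({timestamp}) ---"
    footer = "--- End of file ---"
    return "\n".join([prompt, header, content, footer])


combined = attach_text_file("Hello LLM!", "Hello.txt", "Hello World!", timestamp="TIMESTAMP")
print(combined)
```

Whether this would bypass a character limit depends on where the site counts characters, which the sketch does not address.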
ill note this - and we can post new feedback for the new UI in this new channel moving forward: #new-ui-feedback
Have a question. What happens to the conversations that never get voted on, like if someone goes to another page without voting, or just doesn’t vote and clicks new round, or loses their internet connection, or loses their chat to “Connection errored out”?
Those conversations should be LLM judged, as otherwise it’s basically a waste of resources, especially for stuff like GPT 4.5 being tested on the site and possibly the full o3 model in the future
they are useless then but they used rate limits per instance i think
- the site (maybe still does) used to do cloudflare/ddos protection
Without a vote, they are useful as “real world usage data” even if LLM judges aren’t present
yea they get fed as data too like that interactive viewer thing
You're a staff member of LMArena?
Seems legit actually
They still do. I found a funny thing: some questions, usually containing SQL statements or Linux commands, are "forbidden" in a way which consistently triggers errors, so they cannot be asked or your conversation gets cooked. After some exploration, it looks like the reason for it is Cloudflare, which bans those requests due to some random "protections" and consistently gives a 403 for those suspicious types of requests. It's likely not the inherent protection of LMArena itself, since that usually gives you something like "Content violates moderation..."
Thinking more about it now, couldn't it affect the bias of the arena results? Since some types of questions (all of them were harmless ones) are banned by random Cloudflare triggers, doesn't it slightly reduce the set of answers provided by the users, thus reducing the uniformity of the arena's scores, in a way? Models which could answer such questions properly would likely get a slightly higher rating, others a slightly lower one.
the gemini test 30 model was by far the best model i have used in the site. sad that they removed it early and didnt get to try it further.
😔
It might mean they will release it soon, or come back with an enhanced version
Really ?? I am afraid now 🙂 google removed gemini test and put gemini 2.0 thinking exp on the free version of the gemini app ...hmmmm we will have new Geminis sooner 😄🤝👀👀👀🤌
My biggest problem is when a model suddenly hallucinates and keeps repeating the same words or sentences for the 162728388th time, like lucid or llama3.2. I wish they would add a stop button to stop them, at least when they hallucinate 🥲🫤 because most of the time I quit the page and I don't vote, so as not to lose time waiting for a model to stop repeating the same words.
didnt know the app also had it. i thought it was lmarena exclusive.
So i like the webdev arena but what about a direct chat on there with support for like webcontainers or so
Or allow other people to vote on them as I said earlier
Why not rate the answers of each model after voting? Sometimes I feel that the votes don't reflect what I think about each answer. For example, sometimes I get 2 models and one of them is so bad. The other one is bad but a little bit better. I dunno if I should vote "both are bad" or "2 is better". It's better, ok, but bad too. 😂
2 models: A is 8/10, B is 7/10. A is the winner but B is good too.
A 3/10, B 1/10. A is the winner but both are bad.
Saying A is better doesn't mean that B is bad or that both are good. It is just better, but you dunno if they are bad, medium, good or excellent. We should give an exact opinion that really reflects the model, not just "this one is better".
The way ranking works, you don't need an exact opinion
Chess ranking works with matches that are win, lose, or tie without "they both played poorly"; same goes here
Ratings give more information than just win vs lose though.
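To illustrate the chess comparison: a minimal sketch of a textbook Elo update from a single win, loss, or tie. The K-factor of 32 and the starting ratings are illustrative assumptions, and this is not claimed to be the leaderboard's actual rating method.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
print(elo_update(1000.0, 1000.0, 0.5))  # (1000.0, 1000.0)
```

Only the outcome enters the update, which is the point made above: the ranking never needs to know whether both answers were 8/10 or 3/10.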
I really wish there were some distilled/quantized models in the competition just to see how models we could run on our own machines compare against real API models. Could choose some from https://oobabooga.github.io/benchmark.html which lists the best models for a given hardware requirement.
Yes can we get an r1 distill in the arena please 🥺🥺
Even with just 1, we could at least anchor Elo scores vs other benchmarks.
Will you enable vision capability for Gemma 3?
Can we get gemma 12b? 27b was really impressive, really wanna see what 12b gets.
Why not show the thinking process of the thinking models? This would be interesting...
Also, some models like Gemini are able to include pictures while explaining things, but on arena we won't see that, and this will not show the real ability of the model.
Leaderboard updates at weekly intervals 😁
The Arena is blind
It wouldn't make sense to show things that can distinguish models
is that darrell#
You can distinguish them because they take time to answer anyway 😂 and ok, why not on side by side and direct chat?
Or show them after the vote
Today for the first time I made a prompt that was censored by the lmarena moderation system. Went on Grok first to test it, it was okay with it (ofc lol). Went on ChatGPT 4.5, worked too. Went on Gemini, worked too.
It seems that the censoring on lmarena is a bit too strong and not relevant if most big models agree to handle it. And it also distorts the ranking, because if you can't test very dark humor via lmarena, it's one less criterion for judging the quality of the models, and one bias that might favor one model over the others.
It's a pity because instead of censoring the prompt, you could simply let it pass and detect when a model says something like “sorry but I can't answer that question” and cancel the result that will be given at the end.
Or simply ban an IP if it happens too much and remove all the prompts made by this IP from lmarena "open-source results".
I imagine that the idea is to avoid ending up with illegal content in the results that are available to researchers or other people. But if you can detect that a prompt might be censurable, you can also censor a prompt in the results or tag it NSFW.
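A very rough sketch of the "let it pass, detect the refusal, cancel the result" idea from this message. The marker phrases and function names are made up for illustration; a real system would need something much more robust than substring matching.

```python
# Hypothetical refusal phrases; purely illustrative, not an actual arena list.
REFUSAL_MARKERS = (
    "sorry, but i can't",
    "sorry but i can't",
    "i can't answer that",
    "i cannot help with",
)


def looks_like_refusal(response):
    """Return True if the response contains one of the refusal phrases."""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def should_count_vote(response_a, response_b):
    """Discard the battle result if either model refused."""
    return not (looks_like_refusal(response_a) or looks_like_refusal(response_b))


print(should_count_vote("Sorry but I can't answer that question.", "Here is a joke..."))
print(should_count_vote("Sure, here you go.", "Here is a joke..."))
```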
Yes the censorship is really too heavy at times. Not only for text but also for images, and it really seems to hate Charizard for some reason.
"illegal, harmful, violent, racist, or sexual purposes." what does sexual purposes mean? Does asking questions about sexual health also count as sexual purposes?
Companies will find that any attempt to censor most models will result in consumers always choosing competitive uncensored models. Time and research shows that people do not want AI to tell them how to think, or what moral standing they should have.
Do you want the cheese grater to tell you how to prepare food?
No
You don't
What's the point of restricting AI when you cannot restrict human intelligence enough to ask the question
Ultimately, they can't do anything about it. They must comply with the terms of service of the AIs on the leaderboard.
Or they simply use another product
My guy, its a leaderboard site which lets you test the top llms.
And yet Claude brings in millions and billions
I agree that censorship for text based models is silly most of the time. However I would also agree with you that people care more about what the model can do and generally don't care too much about prompts being censored so long as the AI provides a sufficient enough answer to most of their queries.
Start your own AI company bro.
Deepseek could have said the same thing
Bitcoin was much better return
Bitcoin is a meme coin tbh
For the folks that bought in at 20$ or so, it returned ten thousand percent
On an easy day
Let's talk about it in dms
It's off topic
Is there a way I can save the chats on lmarena and continue them later on? It just keeps refreshing after some time of use and shows an error, and I have to refresh the website again, starting a new chat and selecting the model.
It would be great if you could add another arena category, namely MTL, as in translating from one language into another. A lot of people have a need for MTL in their life but there is currently no leaderboard ranking which models are best for translation purposes. And I realize that this poses a problem for testing, as a model might excel at translating English to Japanese but suck at translating English to French... and while it might be best to have a separate leaderboard for each pair of languages, it can be cut down to only be between English + another language. Then it can be further cut down to only include the major languages such as English, Japanese, Chinese, French, German, Spanish - basically languages you already have in the arena.
Anyway, sorry for the long message, I just wanted to share that as a person who is using MTL every day, I am really missing a MTL leaderboard in my life.
Hi! Which is the best way to use DeepSeek & Claude models? I mean in terms of efficiency, speed, etc., in case there is any difference. Is it better to use their direct API, or is it better to use them through Cline, Roo, OpenRouter, etc.? Thanks! (Cline uses their own API too, but I mean when that is not the case)
I wish you could include a reference to an image.
It would be really amazing if we had some way of saving the chats because when the site refreshes you just instantly lose all of your chat which is quite cruel. Thanks.
Password to the https://alpha.lmarena.ai has been changed. Old password doesn't work anymore. 😕
The new alpha version does save it
there's a new one in an #announcements message (though not sure if you meant that one no longer works - seems to work for me fwiw)
seems like the word alone won't trigger it. nor with "she" added before
but.. moaned loudly
Strange, because it works for me just with moaned and nothing else, lol.
yeah i think it's handled (pretty crudely) by a small LLM
it like screens each prompt
so not like a blacklist of words or purely deterministic, more a set of guidelines i imagine
Well, maybe.
I wonder if the "rules" will change upon me changing my geolocation, lol.
nah tbh i think it just reflects the fact it's a small LLM. even if the temp is set to zero, it's still not deterministic - it'll judge the same input two different ways with the same rules
afaik they use openai moderation api
i dont think its a small llm, at least when i last checked it
oh didn't realise that
also theres another layer by cloudflare that blocks linux related terms 🤣
i think its just a regular classifier its been a while tho
I tried entering moaned many times on many different days, and it always bans it.
There wasn't a single day it wouldn't.
whatever it is i feel like it hasn't been changed since the arena launched.. like seems pretty crap to put it bluntly ha
surprised its oai's moderation api
cuz its free i think
i might be wrong about it btw, im not entirely sure lol i do recall remembering something like that
oh it is in the fastchat source code
ya i just checked
https://arxiv.org/pdf/2403.04132
yup you're right
And of these 3% most of them are probably false positives.
Btw, "Once again, the two idiots and their cat fail to steal a Pokemon." gets flagged, but "three" instead of "two" does not get flagged.
Not much can be done about that
If the content flagger cannot be tuned down, it could be completely turned off... Or if it flags, show a warning and if the user agrees to see potentially flagged material, continue
The current content flagger is ridiculous
Hello, I'm not sure if this is the place to ask this but I have a question about this dataset: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k
The Arena is for researchers
Researchers who don't want to have to sift through ERP in their open source chat dataset
It is not ERP but all kinds of stuff that gets wrongly flagged
theres still an erp category 🤣
are you sure? from what i recall, it’s deterministic + random seed
My brother… why are you moaning to chatgpt 😭
i might be missing something but yeah fairly sure that LLMs are inherently non-deterministic (including with temp set to 0)..
(because of the hardware and inference software)
using the same seed (instead of a random one, as is typically the case) helps get closer to reproducible outputs, but the LLM is still fundamentally non-deterministic
i mean they're models.. their outputs are predictions and thus inherently non-deterministic
i think that might be something related to how they serve and batch their MoEs
in theory, it should be possible to take the same inputs and get the same logprobs (and consequently the same outputs)
in theory, with the exact same model, everything else unchanged, I think that's right; but in practice, it seems effectively impossible to truly guarantee reproducible outputs (for actual responses, like the coding example they use here https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter) There's repeated caveats / warnings that it won't guarantee reproducibility
but yeah I take your point, in an idealised setting, reproducibility to the point of a model being 'deterministic' is theoretically possible (i think)
I had a bug where an infinitely long response was generating for minutes with the same sentence over and over again.
gemini-2.0-flash-thinking-exp-01-21
Even with cpu inference? What provider?
yes. granted CPU inference can (as I understand things) offer slightly more consistent behaviour due to reduced parallelism compared with GPUs, that doesn't overcome the inherent indeterminism of LLMs (it's not about hardware...)
there is no inherent indeterminism (without sampling) its because of hardware, floating point operations, etc
? what factors change each time u run a pass theoretically without sampling. nothing. its not a theoretical thing
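A tiny illustration of the floating-point point being argued here: addition is not associative in floating point, so if hardware or a kernel sums the same numbers in a different order between runs, the result can differ in the last bits even with no sampling at all. This is plain Python on three constants, not an actual LLM forward pass.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one summation order
right = a + (b + c)  # same numbers, different order

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```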
eh we're on a different page lol
what is a 'model'?
like a weather forecast model.. language model.. whatever
'model' isn't a loose term
in an idealised setting etc etc sure
but they're LLMs
? they can all be deterministic in theory. in actual implementations, because of performance/hardware/etc this is why they arent deterministic
it wouldn't be a model if it were deterministic
it would be a formula or whatever
well it basically is
ok. perhaps i'm getting caught up in semantics - agree to disagree ha
no what ur saying is wrong in this instance
idealised and model are key to my thinking here
happy to be shown wrong
but it seems a lot of what is being said rests on 'theoretically'
ok but theoretically there isnt any indeterminism in llms
yeah
if everything was done in perfect accuracy without sampling
my point exactly
but yes irl you can't have perfect accuracy due to performance/hardware/sampling/etc
theoretically possible - i don't dispute
irl, it seems a point not worth proving
i thought u were saying before they were theoretically indeterministic and that makes zero sense, u phrased it in a weird way
but i understand what u mean now
any 'model' is theoretically indeterministic - otherwise it wouldn't be called a model
i don't dispute the idea that, keeping everything constant, using the same seed etc etc, it should be possible to get 100% reproducible responses to a given prompt
in practice, yes. when it comes to the math, without added randomness, they are not indeterministic
they predict tokens
heres what claude says, i hope it makes more sense:
that does help clarify where you're coming from 👍
to my mind, maths (yes there are concrete solutions) isn't any different to any other prompt - it's still ultimately sampling and predicting tokens to provide the completion
Yes, in theory, LLMs are completely deterministic if you:
- Use greedy decoding (always select the highest probability token)
- Have perfect floating point precision
- Eliminate all hardware variations
In this idealized scenario, an LLM would produce...
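A toy sketch of the greedy-decoding bullet above: with a fixed probability table, always taking the most likely token gives the same pick every time, while sampling from the same table does not. The three-token distribution is invented for illustration and has nothing to do with a real model's logits.

```python
import random


def greedy_pick(probs):
    """Greedy decoding: always take the highest-probability token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}

greedy_runs = {greedy_pick(probs) for _ in range(100)}
rng = random.Random(0)
sampled_runs = {sample_pick(probs, rng) for _ in range(100)}

print(greedy_runs)            # {'cat'}
print(len(sampled_runs) > 1)  # True
```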
actually.. i do kinda see the distinction of maths
hmm
with no sampling, there is no inherent sampling. the source of indeterminism is floating point accuracy, hardware, optimizations (which may reduce accuracy for performance), etc
i wouldn't call these accumulated errors as sampling if you use greedy decoding
can you ask a follow-up, and say "the question is actually about LLMs' technical/architectural properties – how they produce 'responses', whether they are inherently deterministic or not. forget about mathematics specifically"
maths is deterministic...
do you guys know about Lc0
claude misses your point/lacks context, but here it is
it’s a ML-based chess engine
Non-deterministic
Picks different moves every time, even with identical parameters and hardware configuration
this is obviously yes (they are nondeterministic), but i think its important to make clear that what you're talking about is actual implementations/irl. what ur saying is confusing and seems to conflate things, at least to me initially
sorry if it's confusing / conflating - not meant to be
but yeah, i'm coming at this from an irl perspective
inherently seems more relevant than theoretically to my mind...
singling out mathematics seems odd (given its inherent determinism)
ive been running this
```
Are LLMs' outputs inherently deterministic or non-deterministic? If non-deterministic, can they be made deterministic, in practical/real-world terms, and how? Begin your response by answering with Yes or No, then expound
```
in the arena and it's just been no, no, no, no
i mean llms are basically a formula
Large Language Models
with a bunch of formulas / code underlying it all
sparrow is new?
but mathematically, i mean. and the indeterminism caused by irl circumstances is quite minimal. people are training with 8 less bits, fp8 (deepseek) and accuracy is still basically the same as bf16. with actual sampling youre introducing much more randomness
show me 100% reproducible outputs to the same (semi) complex prompt and i'll be more partial to this thinking ha
the mathematics are an extremely fundamental part of this though
if u use 0 temperature, you are not shifting the distribution in a notable manner even if it chooses differently, its still around the same. the model's probability distribution is still around the same even with accumulated errors which are minimal. no matter what task, even creative writing
show me 100% reproducible outputs to the same (semi) complex creative writing prompt and i'll be more partial to this thinking ha
around the same ≠ reproducible
a hyperfitted model would probably do that. https://arxiv.org/pdf/2412.04318
hyperfitting shifts the distribution by a lot, to where theres basically one candidate in greedy decoding, and the indeterminism/accuracy issues would be dwarfed by how probable each token (first option) is compared to the rest
im being very scatter brained here, apologies lol
extract pulled from p7 of that paper
aha all good my man
i've got no idea
dude that part of the paper is talking about a different thing
its talking about how if u shuffle the data during training it affects which hyperfitted token is/distribution (where in hyperfitting, one token usually dominates)
ha yeah fair.. i just skimmed and saw 'determinancy'
ok this is based on the conclusion (still seems to essentially say the same thing as far as I can tell)
yeah i dunno
LLMs are stochastic, not deterministic.
that's what the conclusion suggests (not the LLM summary, me just reading it - i can't be assed going through the whole paper ha). agree to disagree i guess ha
that section basically says:
dataset: a, b, c
dataset (a, b, c) -> trained -> probability: x, y, z
dataset (c, a, b) -> trained -> probability: y, z, x
it just talks more about hyperfitting, how dataset order affects the model distribution, not really addressing general determinism in llms
my turn to say i'm tired ha
which i genuinely am.. (5am here in australia - i just noticed.. yikes ha)
ya im sry lol. i just made it super confusing. i have a lot of random/incomplete thoughts (about this) which got us into different tangents. i did not go about this conversation well at all
funnily enough i'm actually coming round to (or understanding) what you / the paper is saying ha
but one for tomorrow 🙂
Early-grok-3 was removed today?
it's just deprecated no?
That would make sense, I tried switching to grok-3-preview-02-24 but it gives me error_code: 50004, An error occurred during streaming every time now
is there gpt 4.5
no (in direct chat, yes in arena battle)
there was for like 15 mins after it released, then it was taken off
it got taken off WHILE i was using it, it just stopped generating, i refreshed page, it was gone from list
Rude!
yes
false
ive got it 20+ times in the last few weeks
oh wow
no im talking about direct chat tho
lmao
oh i see i forgot to mention that oops
Being one of the few alpha users.... Or one of the most.... No clue honestly.
It would be nice if we had newer versions of the AIs due to some of them being outdated like GPT 4o. Unsure if it's possible but a man could dream
Why are there so few image models on LMarena?
thanks @feral star please let us know what other models you'd like to see!
Thanks for answering. Hopefully, some of the 15 best models on Artificial Analysis and Imgsys that are not listed on LMarena:
- Reve Image (Halfmoon) (#2 AA, #1 imgsys)
- FLUX.1 [pro] (v1.0) (#5 AA, #2 imgsys)
- Midjourney v6.1 (#6 AA)
- RealVisXL V4.0 (#4 imgsys)
- Playground v2.5 (Aesthetic Model) (#28 AA - standard, #5 imgsys - aesthetic)
- ColorfulXL-Lightning (#6 imgsys)
- Juggernaut XL v9 (#7 imgsys)
- Image-01 (#7 AA)
- Midjourney v6 (#10 AA)
- Ideogram v2 Turbo (#11 AA)
- Stable Diffusion 3.5 Large Turbo (#12 AA)
- Proteus (#9 imgsys)
- Mobius (#10 imgsys)
- Fooocus (Quality) (#11 imgsys)
- FLUX.1 [schnell] (#19 AA, #8 imgsys)
(and the recently released models: 4o native, 2.0 Flash Exp, Ideogram 3.0, ...)
@agile flume Some models get into a "loop" or start being unnecessarily verbose, and the user is forced to wait to vote while everything is already clear. Please add a "Stop" button.
hmmmmm
The button can appear when at least one model has finished producing results. This way spam will be reduced.
Why could a "There was an error" message appear in alpha? Could it be the moderation system?
As I get repeatedly connection errors today while trying several different models and after refreshing the page several times, is there a known issue?
This was seen in direct chat mode. Arena seems to work.
Hi, can https://manus.im be added to lmarena?
they are still closed not a change
oh wow, you can save conversations in the new arena, thats a good thing
how can we trust that? why have they not added this to the actual GAIA leaderboard?
Hello guys!
Why could I send this fragment earlier?:
```html
<html lang="ru" data-theme="dark">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>test</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@400;500;600;700&family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
<style>
```
But today (and yesterday) I see an error:
error
Connection errored out.
Please fix this error
Image Battle its hard to tell how to vote for a winner, maybe I'm blind
@opal hamlet @agile flume
I'm sorry to bother you, but I really need help.
The battle arena is really unusable for my hard prompts. Models often do not return anything or get into a loop. The regenerate button should be unlocked after 5 seconds.
luca might be broken. It output this to a math problem:
{
"Name": "A1",
"Value": 10,
"IsValid": true
}
stargazer will remove its response during generation:
**API REQUEST ERROR** Reason: Unknown.
(error_code: 1)
It happens intermittently and was a math problem. :/
This is amazin'
They are Chain of Thought models, they write responses like this:
<think>
Thinking hard for a long time and you can't see this.
</think>
The part you can see (5 mins later)
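A small sketch of how a client could separate the hidden reasoning from the visible answer in output shaped like the example above. It assumes the reasoning is wrapped in literal `<think>...</think>` tags, as in the message; actual model APIs may expose reasoning differently.

```python
import re

# Match a <think>...</think> block (including newlines) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def visible_part(raw):
    """Drop any <think> blocks and return only the user-facing text."""
    return THINK_BLOCK.sub("", raw).strip()


raw = (
    "<think>\n"
    "Thinking hard for a long time and you can't see this.\n"
    "</think>\n"
    "The part you can see (5 mins later)"
)
print(visible_part(raw))  # The part you can see (5 mins later)
```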
claude thinking model is giving the following output. I've been trying to use it, but due to this issue I am not able to send anything; it has been around 4-5 days
yea as expected
question. Here it seems that the conversation is generic, not really focused on feedback. Am I missing something?
I wanted to propose an idea (like others do) but if it is simply buried by normal discussion it doesn't make sense to post it here.
Is there a sort of repository for feature requests / issues à la GitHub?
Team is currently upgrading the UI, so feedback on the alpha version is most helpful and can be shared in #new-ui-feedback. If you have feedback on the current Gradio site, you're welcome to share it here, but please note it may not be prioritized as we focus on the new version.
understood and yes, changing the UI is a big thing for sure.
I think we need to take decoding speed into consideration, since a much faster ai response is preferred.
Will we at some time know which model is 24 karat gold?
normally if they get enough votes within a week (like 2000) it gets announced. So 1-2 weeks.
Sometimes the models do not perform as expected and get retired by the vendor, then it takes longer once they get published again.
So far it performs really well, I really wonder which one it is. Where will it be announced which one it is, if it reaches enough votes?
Really hope I can have a few direct chats with it instead of hoping to get it
for example on twitter they get announced with "congrats to model XY". Otherwise just watch out for leaderboard updates (you see the date of the last update), order by votes ascending (new models have fewer votes) and check those that performed well (they have low digit rankings).
Sometimes the karat gold goes hard astray though
normally I check here for updates: https://x.com/lmarena_ai/with_replies maybe there are some other social media places
Yes but that is still some guesswork when there are multiple new models. I mean a real announcement where it says "congratulations to model XY, formally known as 24 karat gold"
no they don't say "known as X". You just notice that "ah it should have been that".
That is also the charm of the competition.
I believe the 24 karat gold has a chance at the top in the overall section, but personally the best category is the "hard prompts" category. (that is not even that "hard", as a quarter of the votes qualify for it apparently)
In the hard prompts category the order changes a bit more and is more correlated with the many benchmarks - considered as a whole - outside lmarena
I have saved a few outputs of it on a text file so I can compare it with the top newcomers so I hopefully figure it out.
btw that model doesn't perform well against reasoning models if the question is about logic and coherence
I wonder if it will pass the city in a bottle test, that would be a first as no model could one shot it so far
what's that ?
Tell the model to refactor and make this code more readable
https://frankforce.com/city-in-a-bottle-a-256-byte-raycasting-system/
ah coding, I see.
Yes but also understand a really obfuscated code, most models struggle with the combination of bitwise and Boolean OR
||d|( becomes ||d||( and once they made that mistake they often have a hard time correcting it
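Why that one-character slip matters: bitwise OR and logical OR return different values. Below is a small Python analogue (the original demo is JavaScript; here `|` and `or` stand in for `|` and `||`, and the values of `d` and `flag` are made up for illustration).

```python
d = 2
flag = 1

bitwise = d | flag   # combines the bits of both operands: 0b10 | 0b01 == 0b11
logical = d or flag  # returns the first truthy operand and never looks at flag

print(bitwise)  # 3
print(logical)  # 2
```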
yes. For me there is too much attention on LLMs and coding. I think LLMs should be good all-rounders that then could be specialized in this or that category. Coding is surely helpful but I think with the focus on it, with ad-hoc models the performance would be better. Think about an LLM director that picks ad-hoc LLMs for this or that category. Coding is simply one category.
same with logic and other stuff.
for example in lmarena the p2l-7b router does a very good job and it is a very nice idea.
But so far there simply is no LLM at all that can do this task, that's why it is my favorite test until I find one that can do it. Or asking about some Linux configuration stuff (they don't know much about firejail and happily hallucinate commands and options that do not exist, but that is no wonder because the firejail docu is really bad so there is not much they could learn from)
Or just give a vision model an image with two characters and ask it to write a conversation between them and see how it interprets the image and the situation, really interesting sometimes on the same image one model sees a friendly situation and the other model sees them both in an aggressive stance ready to attack
The arena keeps giving me an error on google and explorer, it says there's no connection, but it works in Avast and Duck duck go. . Any reason for this?
So here's the thing: the models I loved using a few days ago (cybele, Spider, 24_karat_gold, stradale) are all gone now... I think these were the strongest models in the world...
Boo-hoo~
Sometimes the system pretends that "something went wrong" when actually it has a problem with the user's messages. I found that URLs sometimes cause that, among other things.
It shouldn't do that though
It's so stupid, y'know?
But I got that flannel
gemini-2.5-pro-exp-03-25 returned API REQUEST ERROR Reason: Unknown. (error_code: 1) for totally innocent coding prompt
And also for me for just explaining medical things , I think it is from google this error
https://discord.gg/j6kxQ4krtc @everyone
so likely it was llama4
It's true whoever disliked my comment is wrong
Flannel crystal and haley constantly provides misinformation and hallucinations
Even on basic questions
And it's not as creative as 24 karat gold was
Its a boring and trash AI
they may be bad models; if that's such, just downvote them
the arena won't stop evaluating for you
crystal is good no?
i find it better than what we have from llama 4
It’s a bit better than flannel and harley
But it’s still not good it’s trash
It hallucinates too much even on basic questions
It’s trying to be 24 karat gold
It’s not working tho
24 karat gold was amazing they need to add it
Or at least tell us what company it’s from so we can use it
There’s no reason they need to replace it with those trash ones
24 karat gold is definitively llama4. But hard to say which one exactly, as there is only Maverick in direct chat to compare to.
Oh and now that we talk about llama4, can we pretty please get its vision capability as well in direct chat?
Idk
Its from llama but prob not llama 4
Cus its not a reasoning model, its not behemoth
And its not really like maverick or scout
They're smarter
It also constantly said it was Llama
It just had a really unique and cool writing style and it was pretty funny
Maverick is pretty close to it but less creative and unhinged
Maybe different sampler or system prompt?
Yeah maybe actually
Web arena sonnet 3.7 result is not rendering.
Is there any work on implementing new filters (like Style Control) or algorithms which would try to make lmarena leaderboard a bit more objective?
New Llama 4 Maverick is quite meh at best, yet it managed to get rating so high, even with style control.
yeah, llama 4 truly gamified the leaderboard. truly disappointing.
the style control thing is a stuff I don't get. We aren't doing api calls, we are having a conversation. The default category should simply be another category (I am for hard prompts). If people like formatting (I am one of them) then let it be.
the point IMO is that the average question is not so hard and thus the difference between models is diluted, hence the need of a default category that focuses on hard/niche/non-common stuff
The leaderboard becomes less meaningful if it gets "hacked" by a model with 17B active parameters.
Most people want a smart model after all, not the one which answers basic answers the best, I genuinely hoped that the llama 4 would excel at least at something (creative writing/code/math), but I didn't find the proper niche, at least in case of my tasks, yet it scores higher than many actually smart models, even with style control.
It also poses a question on legitimacy of existing elo scores
How does style control even work? A vote for a model with good style / formatting counts less than a vote for a model with only plain text?
Hmm I only kinda understand it. So for each model, we have elo value (determined by user votes) and some style value (by counting markdown tags). And we have the theory that the style value influences the elo value because users tend to take style into account. But how do we know how much influences the style the elo value? Looking for the correlation between elo and style of all models? But we still dont know is that correlation because users tend to vote for models with better style, even if the answer itself is worse? Or is it because stronger models with better answers also tend to have better style? Probably both, but how is the factor "style coefficient" between them determined, as we don't have a control variable?
we use math that essentially pays more attention to when a model wins with less style and less attention to when it uses more style to win
ideally we would literally control for it with system prompts
but we aren't
Yes, that part I understand. But how do you prevent to over-compensate, how do you know how much more or how much less attention to give these votes?
i'm honestly not sure 😅
but smarter people figured it out https://en.wikipedia.org/wiki/Controlling_for_a_variable
Yes I read it but I don't understand what is the control variable? Don't we need to know how much a given model would perform with and without style so we have a number by which we compensate later for style?
if we knew how it performs without style what's the point in calculating more lol
no we can do this because responses have some variance
a model doesn't have a universal level of style
it's random for each response
which makes this work
but isnt the quality of the answer itself also random for each response? I have probably seen models more often giving vastly different answers for the same prompt than vastly different style
so what
we have enough data to be able to extract some variance without getting confused by the rest
hm ok I would have expected it cannot be extracted because a control variable is missing, but the huge amount of votes make up for it?
i believe so
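A toy sketch of the idea discussed above, not the arena's actual code: fit a Bradley-Terry style logistic regression in which each battle contributes the difference in model strengths plus one extra "style difference" feature. Because style varies from response to response, the regression can estimate a style coefficient from the votes alone, without a separate control run. Everything here is an assumption for illustration: the three models, their true strengths, the single style feature, the simulated votes, and the function names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate_battles(n, strength, mean_style, beta, rng):
    """Simulate n blind battles where voters reward both quality and style."""
    k = len(strength)
    a = rng.integers(0, k, size=n)
    b = (a + rng.integers(1, k, size=n)) % k  # always a different model than a
    # Style is random per response, around each model's average style level.
    style_diff = (mean_style[a] + rng.normal(size=n)) - (mean_style[b] + rng.normal(size=n))
    p_a = sigmoid(strength[a] - strength[b] + beta * style_diff)
    a_wins = (rng.random(n) < p_a).astype(float)
    return a, b, style_diff, a_wins


def fit_bt_with_style(model_a, model_b, style_diff, a_wins, n_models, steps=2000, lr=0.5):
    """Fit P(A wins) = sigmoid(s[a] - s[b] + beta * style_diff) by gradient ascent."""
    n = len(a_wins)
    rows = np.arange(n)
    X = np.zeros((n, n_models + 1))
    X[rows, model_a] += 1.0
    X[rows, model_b] -= 1.0
    X[:, -1] = style_diff
    w = np.zeros(n_models + 1)
    for _ in range(steps):
        w += lr * X.T @ (a_wins - sigmoid(X @ w)) / n
    strengths = w[:-1] - w[:-1].mean()  # strengths are only defined up to a shift
    return strengths, w[-1]


rng = np.random.default_rng(0)
true_strength = np.array([0.0, 0.5, 1.0])  # model 2 gives the best answers
mean_style = np.array([1.5, 0.0, 0.0])     # model 0 is the weakest but the most stylish
a, b, style_diff, a_wins = simulate_battles(20000, true_strength, mean_style, 1.0, rng)

# Without the style feature, style wins get credited to the model itself.
naive, _ = fit_bt_with_style(a, b, np.zeros_like(style_diff), a_wins, 3)
# With the style feature, the regression separates style from strength.
controlled, beta_hat = fit_bt_with_style(a, b, style_diff, a_wins, 3)

print("naive strengths:     ", np.round(naive, 2))
print("controlled strengths:", np.round(controlled, 2))
print("estimated style coefficient:", round(float(beta_hat), 2))
```

In this simulation the stylish but weak model ranks above the strongest model in the naive fit and drops to last once the style feature is included. The amount of compensation is not chosen by hand; it is the fitted coefficient, which is why a large number of votes matters.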
Ok I won't even pretend I will ever understand the math, maybe I'll ask one of the LLMs for an ELI5
the thing is the conclusion you reach from the llama thing is that the system prompt is good, not that the model itself is good
even with experimental, i don't think it deserves the place it's on
it feels like they just threw stuff at the wall (i.e. kept trialing new llama models) until they came up with a writing style that consistently got votes despite the underlying model sucking
like looking at the (overall) leaderboard (with style control). it says gemma 3 27b is better than deepseek v3 and gemini 1.5 pro.
also deepseek v3.1 (aka v3 0324), deepseek r1, llama 4 and even flash thinking is better than claude 3.7 sonnet (without thinking).
death i can't trust the leaderboard for anything meaningful.
they have really good style!
try style control tho
Is this chatbot battle? Most likely the API timed out, or there were too many requests and you are rate limited
"Most people want a smart model after all, not the one which answers basic answers the best" but then lmarena is not necessarily great at this. If people pose simple questions, one cannot blame the benchmark. At most they can try to make scores only for hard questions. Hard prompt is a starter but I don't believe 25% of the questions are really hard, the percentage is simply too high.
I would expect 1 in 10 or 1 in 100 questions to be hard.
for hard questions one uses livebench and company. For "what can replace common internet searches" I think lmarena is ok
for example within lmarena, the coding category and webdevarena are showing totally different values. Why? Because in coding as soon as one writes this it counts as coding
I don't blame the benchmark though. I am just pointing out that there might be a need for more advanced techniques for filtering out responses, just like at some point style control was
llms think 27% of lm arena prompts satisfy most criteria of a hard prompt
yes. I wanted simply to say that I am against style control, since the end user is a human, not a program/agent.
Hence formatting matters. For example claude answers are - if not about coding - super dry and not that well formatted. No wonder it loses.
I think rather that they should use one (or more) LLM judge to pick hard questions and make a category for that. No, hard prompt is good but not enough, one needs to be strict.
On the other side I notice that during pairing not a lot of models are used (one can see that even in the h2h matrix) and that may lead to inflated results (I analyzed too much ratings and co due to my passion for them in chess)
I know, but that is too much. I mean 1 question out of 4 is hard? Unlikely. Surely it is harder than the common ones (given it requires 6 categories out of 7), but I don't think it is necessarily hard a la livebench.
though even with the hard prompts, that is a step in the right direction, the rankings change a bit. For example gemma loses some spots
style control has more of an impact than the hard filter tbh
yes but style control is something I don't agree with. I want hard questions rather than a sort of "ah let's not count all the stylish answers"
eg maverick keeps its spot with hard, and stays tied for first place with hard + style control, but with plain style control there's enough data to confidently say that it's #10
style control has more nuance than that
"I wanted simply to say that I am against style control" - well, that's why it is not enabled by default, since there will always be a need in a model which just wins user's preferences, no matter how stylized or sycophantic the model is. And style control is not just about excluding all beautiful responses, by the way. Actually, it would be great if users were able to create their own leaderboards by creating custom rules to filter out the responses which affect the rating. But that would either require exposing the underlying data or spending a lot of compute on recalculations server-side
there is prompt 2 leaderboard
it is a nice idea, I feel it can still be refined (because some categories are like displayed three times) but p2l-7b showed good results in my cases (that is the model behind p2l IIRC)
and yes in general the more the customization, the more the costs for lmarena
to my understanding, prompt2leaderboard is based on an LLM and is not updated in realtime when new models appear. Currently, it reroutes most of my questions to older gemini models, and there are none of the new models like claude 3.7, or newer chatgpt's, so it's kinda out of date.
Prompt2leaderboard is a cool idea, but I guess it would cost them a lot to update the llm regularly, which, in turn, makes this thing useful only for a limited period of time
yes that is my understanding too. They have a model (LLM but in theory could be anything) that tries to classify the posed questions with the existing DB of questions (and scores). Given that, it says "for such questions this is the ranking". The p2l-7b then uses this information to pick the #1 model in that ad-hoc leaderboard to answer.
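A toy sketch of the routing flow as described in the message above: classify the prompt, look up the leaderboard for that kind of prompt, and send it to the top model. This is only an illustration of that description. The keyword classifier, the category names, the model names, and the scores are all hypothetical, and the real P2L system uses a trained model rather than keyword matching.

```python
from typing import Dict, Tuple

# Hypothetical per-category scores; NOT real leaderboard data.
CATEGORY_SCORES: Dict[str, Dict[str, float]] = {
    "coding": {"model-a": 1310.0, "model-b": 1275.0},
    "math": {"model-a": 1240.0, "model-b": 1290.0},
    "general": {"model-a": 1260.0, "model-b": 1262.0},
}

# Hypothetical stand-in for the learned prompt classifier.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "coding": ("python", "sql", "function", "bug"),
    "math": ("integral", "prove", "equation"),
}


def classify(prompt: str) -> str:
    """Return the first category whose keywords appear in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def route(prompt: str) -> str:
    """Pick the top-scoring model on the leaderboard for the prompt's category."""
    board = CATEGORY_SCORES[classify(prompt)]
    return max(board, key=board.get)


print(route("Can you fix this SQL query?"))
print(route("Solve this equation for x"))
```

The sketch also makes the staleness problem visible: a model that is missing from the score tables can never be picked, no matter how good it is.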
Thus sure it needs updates. The problem is that the amount of possible questions categories is huge so I am not sure they have enough sample size for each subcategory and subleaderboard.
When one builds a leaderboard only on 100 comparisons, it makes little sense. Even 2000 comparisons could be a little (given the amount of evaluators or voters and the possible pairings)
example (p2l explorer). This has "only" 800 votes. Practically nothing.
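A rough back-of-the-envelope sketch of why a few hundred votes is thin. It uses the normal approximation for the 95% interval of a win rate and the usual Elo logistic curve, and it assumes, optimistically, that all votes are battles between the same two evenly matched models. In a real leaderboard the votes are spread over many pairings, so the uncertainty per pair is larger. The function names are made up for this example.

```python
import math


def win_rate_half_width(n_votes, win_rate=0.5, z=1.96):
    """Half-width of the approximate 95% confidence interval for a win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / n_votes)


def elo_gap(win_rate):
    """Elo rating difference implied by a head-to-head win rate."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


for n in (100, 800, 2000):
    hw = win_rate_half_width(n)
    # Elo gap that is still within noise for two equally strong models.
    print(f"{n:5d} votes: win rate 50% +/- {hw:.1%}, about +/- {elo_gap(0.5 + hw):.0f} Elo")
```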
and yes the p2l is outdated. Hopefully they can update it every month or so
it is really a nice idea
If I use the same benchmark question dozens of times, is it likely that those'll be excluded from being part of the leaderboard?
Also, is ~5k really the total amount of votes gemini 2.5 pro has? cuz if so I feel like I'm probably ~1% of that
You should be aware that the rendering of the formatting that is being used highly influences the results. For example, the answer on the left is more accurate, but it does not render correctly; therefore, my initial reaction is to select the right one. The model can't be blamed for the bad rendering, but the ELO is still reduced.
I mean, I think that that's unambiguously bad rendering in that case
bare LaTeX wouldn't work in any context
~~Hey guys, the gemini-2.5-pro-exp-03-25 model seems to be having some issue.
"API REQUEST ERROR Reason: Unknown.
(error_code: 1)"~~
It's working again now
it's using \( which generally works
ah
LaTeX sometimes displays correctly, sometimes it displays equations in this quite ugly raw format.
Hello community,
I recently learned about a controversy surrounding Llama-4 Maverick's performance on the LMSYS arena. Due to user complaints, LMSYS had to publish over 2,000 actual battles featuring Maverick to prove their ranking system is legitimate.
While the battles seem fair, there are questions about how evaluators make their choices (for example, preferring longer, emoji-filled responses over technically correct ones).
Also, it turns out the Maverick version on LMSYS arena is actually a custom version optimized for human preferences, not the standard Instruct version available on Hugging Face or other platforms. LMSYS organizers claim they weren't aware of this difference and plan to add the actual public version soon.
Here's my question: I really like the Llama version currently on the LMSYS arena, and I'll be disappointed if they remove it. Does anyone know what parameter settings were used for this optimized version, or what steps I could take to find this information?
I think someone said that this was the system prompt they used a while back so you could try it out
edit this tweet might have more https://x.com/riidefi/status/1909548881060192407/photo/1
nah the experimental version is a fine tune
not a system prompt
I mean if you are looking to waste money: you could do SFTing on the 2000 battles while using the system prompt and the resulting model would be very similar to the real thing.
And I think the chances are very high that someone will publish something similar to that on huggingface at some point
In my opinion we really need a crucial! improvement to the arena. Let us vote on other people's prompts and their output. This would:
- Increase the amount of votes by a significant amount without increasing the api cost (because these answers already were fetched)
- Improve the quality of the leaderboard. By having multiple people decide on the same prompt it reduces the issue that people vote on wrong answers that "look" nice. For example llama-4 was specifically trained to have high elo on the arena, because it gives stylish responses. I mean ok the "style control" already does a good job at deranking the model, but in my opinion it should be ranked even lower, because it often just answers nonsense but in a stylish way, so basically it's 100% style, 0% quality for llama4. Letting us vote on other people's prompts would significantly improve this.
I see one reason why lmarena wouldn't do that, and that is the fear of people scraping responses. But then you can simply solve this by only doing this for a small subset of answers, those that will get released in the dataset anyway.
this is not a bad idea but I am unsure whether it is logistically feasible.
This because if you have a lot of voters in a period, you can do it because you have excess capacity.
If instead the voters aren't that many, you may put them voting stuff they aren't interested into and people could simply quit voting.
The idea in general could be very useful. I wonder if one could find a compromise. Let many (not one) LLM judge the answers. Then let people judge the judge (every now and then, not too often). In that way the "weighted" judge becomes a proxy of the people, and could help. A sort of Arena-Hard-Auto but more polished.
That could be also done on a small sample of questions (say, 5 to 10 per category). The point is to automate the judging while still reflecting what the majority of people would pick. Not easy.
I think it would be feasible if it's simply another category "vote on dataset" or something like that. The existing arena battle mode could be left unchanged. And if people want to vote on other prompts they can simply switch to that category
yes that yes. Still I think the voters (not users! Rather those that vote) on lmarena aren't that many - in the period between leaderboard updates - so it could dilute the effort. But I like the idea.
because for example if I test the search vs the language mode, I don't really use the language mode afterwards. The testing prompts are limited as time is limited
Can someone explain what exactly the recent llama4 controversy is about? Is the 03-26 experimental version closed source and the 17b-128e instruct the one you can download? I hope not because the experimental version is so much better
more or less. The 03-26 is optimized for human benchmarks (lmarena and similar ones, like the internal ones) and the 17b-128 is not.
It could well be that what we saw in lmarena will be released within meta products (whatsapp for example) while the open weights one will stay different (there are very few open source models. Most of them are "only" open weight)
could be that the open weights one is the base for 03-26, as 03-26 got additional fine tuning or so
time will tell, so far it is speculation
i thought it was solely optimised for the arena.. i might be mistaken but i feel like there aren't really any similar ones, in terms of collecting human preference from blind battles at scale, and also any 'internal' metrics are kinda pointless by virtue of being irreproducible (though tbf still might be more than just for marketing, like could be done in earnest to shape model development before deployment)..
i really like this idea
they should do it as some kinda beta side project - it would be interesting to see the divergences in voting patterns (assuming they exist)
companies, since some time, perform internal human benchmarks telling "which versions would you prefer?". I could imagine meta doing that (for whatsapp and co) . That would be more or less identical to lmarena
fwiw i'm also partial to having like a timer that forces people to wait (and ideally read both responses) before voting. like very often it's literally impossible to reasonably evaluate the quality of 2 responses immediately if they are kinda lengthy
but there'd be a lot of user friction / dissatisfaction.. could see fair arguments against it too
the specifics dont really matter they released llama 4 maverick with the lmarena benchmarks that do not represent the open weighted model. even though they put a footnote, at first glance, people would think its the same. and people are still confused up to now
yeah it's absurd (and was always going to be an own goal) - dunno what they were thinking
to be totally fair, as I see lmarena so far, it is great to gauge the value of models as "substitute to classic common internet searches". "common" here is key. People say lmarena is ranked according to human preferences but I see it really more as I don't google! model, tell me the answer!.
Thus, as meta provides llama in many apps that are used on the fly with common queries, it is a great benchmark to see if it would satisfy people. That happens also for other companies, like xAI integrated in twitter with likely people asking common queries there too.
Further as a company they don't need to release open weight models, so the idea of a double release is perfect. They get to verify that their model is very usable for their apps (lmarena score); they get praise for their results (blog posts and hype); they still release their models (though not fine tuned) so that the competition doesn't have ready made products from day 1. People will complain about that, but those that complain are the minority, the whatsapp users don't care.
So it is really a sort of win-win for them, not for the community.
Then we need companies like nvidia & co that release the llama derivatives to fine tune them properly.
I like the idea of a slowdown but I could see people dropping from the site because impatient.
have a look at the Prompt Explorer tab - it's surprising how few of the prompts are google-style information requests.. like there's more people asking them how many 'r's are in strawberry than when was the fall of rome ha
correction: (after coding) it's mostly people asking for medical advice.. then how many Rs are in strawberry ha
Huh, "connection errored out" while using P2L. Is the model being retrained? 🤔
If that's the case, then I'm looking forward for it.
I looked at it already, still they are only a bunch of questions there to see, not all. I believe most questions are simply common ones.
lol what's the opposite of confirmation bias?
why would your belief/hunch be more valid than the prompts in the Explorer, in terms of what people ask in the arena? sure, they're not all there, but they're arguably representative of the actual prompts people use in the Arena, at least to some extent.
but perhaps most people are really asking questions like "what time does the pharmacy on High Street in Birmingham close on public holidays?", but they're hidden from us for (literally no idea why)
I think if they would pick common questions in the prompt explorer it would make lmarena less good? I know it feels like silly, but when I know a group of people using lmarena and when I see them posing questions they are simply like "I could have googled that". I am guilty of that too. And no, it is not something like "at what time this and that happens" rather it is "could you explain me this concept" or the like.
it is completely fine IMO, as an LLM compresses knowledge so why not.
Imagine stackoverflow, ELI5 (from reddit), and other similar places put in lmarena.
now some ELI5 or stackoverflow questions aren't easy at all, but most are solved by some googling
it makes also sense statistically. stackoverflow and other Q&A places have most of those distributions. Relatively easy questions (aka: with some googling they are solved) are common and few are hard. Why should it be different with LLMs ?
I mean, as long as those that pose the questions are humans
Although the arena is quite obviously used by humans, i think that it still inherently has to be a distribution of somewhat difficult problems, because then people using it are quite frankly on average significantly more invested in topics like cs, ai and other areas where ai is being successfully applied currently (e.g. medical or creative writing). This already shifts the average question away from these really basic questions about when a pharmacy opens.
that is also mainly why the puzzles category ranks so high i think
i think that's more a reflection of very crude / poor categorisation of the prompts
most classified as 'puzzles' aren't actually puzzles at all #prompt-to-leaderboard message
well yeah true, i agree with the assessment, my point was more about the prompts actually not being as simple as things you could just google
i also find it interesting that lmarena has yet to really classify these convos in a very holistic way considering the amount of A/B test pairs available (also includes the current P2L models which are also not really good)
but maybe i am just underestimating the complexity of doing stuff like that idk
perhaps i am too.. i feel the same about it as you describe - seems like low hanging fruit / they're missing a trick
i kinda thought they set the classifier up quite early on in the project, and it's handled by like llama1-8b or something old and tiny like that, and while it might've done an 'ok-ish' job back then, now it seems clearly suboptimal / in need of some kind refinement
but yeah, perhaps they have been trying to refine it all this time but it's just tricky to get right (but intuitively.. that doesn't seem right to me.. like classification is a pretty rudimentary and well-established task..)
I partially agree. I agree that lmarena is used mostly by those in IT. But again stackoverflow is not filled with only hard questions. I am not talking about "when shops X opens", rather questions that can be solved with minimal googling (and brain), like "I'd like to make this select request in SQL, can you help?"
so even if the audience of lmarena is skewed towards IT, it doesn't necessarily mean that those are hard IT questions.
Otherwise if the questions were always quite hard (and in the IT realm), LMarena coding category would be more in line with other coding benchmarks. Again my evidence is based on the normal questions based on Q&A sites (stackoverflow and others)
but again that is my opinion, I don't want to convince anyone. It is just that there are too many clues (IMO) that point in that direction.
also, as you mentioned, the categorization could be also very loose. Like "coding is anything that has code snippet markup", that could be quite broad.
I asked logic questions where the model used code snippets markup, but that is no coding.
they did definitely work on improving it, I think they used 70b at first for the classification on the normal lmarena (not sure) and likely had to stick with it considering that changing the model would heavily change the rankings per category as well
but they did work on the arena explorer quite recently: https://blog.lmarena.ai/blog/2025/arena-explorer/ (where they use a different method), although i am unsure why they opted to use the mpnet v2 model for this, because they show that the model has somewhat falsely classified somethings in the very same blog.
very true, however i am working under the assumption that a focus on these areas plus the desire to test the limits of modern ai pushes the questions to be harder on average (in that area at least) (than e.g. the average chatgpt request)
but obviously i expect very little absolute domain experts in their area to use lm arena in their free time and thus this assumption obviously has its limits
which is why i am very interested in the less rigid framework of P2L and i really hope that they keep using their data to improve these models and keep them up to date
btw I checked the arena explorer, I didn't in a while, and my point are somewhat confirmed in my view. I checked the larger category and most examples are solved by google + some brain.
I didn't check all the categories because it was enough to find many of them in the most common categories.
the other examples either were too hard, like "do it all for me", or too technical - I am not versed in everything to judge well.
in my experience people use it as chatgpt alternative once I shared it. Nothing more. It is also in the screenshot I posted. And that's fine to be fair.
Only it makes lmarena great to say "ok, which chatbot service can substitute some common Q&A websites?"
what I would really wish is that for every category they already have (categories could be expanded, but with p2l it is fine anyway) they would make the "hard" subcategory for it. And for hard I don't mean hard prompts, rather "hard questions".
So hard math, hard coding and so on.
I would expect then hard coding to be more in line with aider polyglot and so on.
I mean the example 5 question from SQL didn't even bother to prompt the question properly. Likely there was a line between the two SQL queries and that's it.
yeah i generally think that such a thing could really make the arena more interesting at a whole, i honestly don't know what is stopping them.
I mean you could even derive something like Humanity's Last Exam (really specific problems from domain experts) out of these millions of questions.
However, at its core this site is obviously just about human preference, even the coding arena, webdev arena (minus maybe repochat) and heavily centered around human preference.
=> for human preference it is obviously essential to have questions that people actually ask instead of highly selected, artificially created or unrealistic when compared to real AI assistant human iterations
agreed
also nice the "lmarena humanity last exam" if one picks the proper questions.
though IMO the questions in many benchmark should stay private. As soon as they share them - and if the benchmark is notable - there is a high pressure to optimize against those questions.
For example livebench is nice, but models score 70% while 30% of the questions are private. It feels like a bit more than coincidence.
most of those would literally be truncated if they were entered into a google search..
hence I think that open based benchmarks a la lmarena are potentially the best if properly scored.
yes don't take things too literally. I thought my meaning was clear. I google about how to connect certificates to IIS servers. Then I google CLI commands and so on.
still they aren't hard questions.
https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03-26-Experimental_battles
hit Next Question again and again - a few are like traditional information requests that would usually be done with google, but they're the outliers, not the norm
i'm kinda lost as to what your point is now tbh ha.. i just don't think there's strong evidence that most inputs are done by people who would otherwise be using google searches.. in some cases yes; but not in most
most people are just playing around / seeing what they get as the responses in a blind battle
they're not actually trying to fix code
-
Is the spiciness of a hot pepper only perceived or true and physical?
-
what are the odds of someone in Texas Hold 'Em rivering a Royal Flush while the other player rivers Quad Aces??
-
I will give a congress talk "On Naevi" -- naevi are benign melanocytic lesions which are markers and every so often also precursors of melanoma. Do you have suggestions for a short and succinct title for my presentation
-
What does it mean if I have a "proud rooster"?
-
What is the latest season of Fortnite?
-
What is an RNN in the field of AI?
-
Create table of yogurt nutrients versus greek yogurt
-
generate study plan for IAS exam in marathi
-
Read this passage from the article:
they were honored at Navy gatherings where new Black U.S. Navy officers expressed their gratitude. "We owe it all to you," they said. "If it hadn't been for you guys, we wouldn't be here."
In this passage, the word gratitude means __________.
a feeling of trust
a feeling of hope
a feeling of peace
a feeling of thanks
-
My left leg hurts when I'm sleeping and immediately when I wake up. The pain will disappear during most of the day, except when going up and down the stairs. I have touched my leg in multiple places, and there is no specific location that hurts to the touch, although I can feel some strain in my ankles/calves. What is the likely cause of my leg hurting?
-
The placement and connections between rooms in a building leads to the formation of hallways and corridors, but sometimes there's necessarily a space that's just... not much of anything, and it only exists because of the shape and layout of the building.
What are these not-quite-rooms/not-quite-thoroughfares called?
and so on.
Those surely are useful questions but not necessarily hard ones.
I cannot go on and on.
if that would be true, then lmarena would be the best indicator of intelligence for models, but it is not for a while. That is the strongest clue.
My point is: LMarena is useful, but only to tell which LLM answers best common questions and some hard ones.
Your point - as I understand it - is more "no, most questions are really hard!". But if your point were true, then we wouldn't need livebench, aider polyglot, math bench and so on at all. Claude would be at the top in coding and so on.
I wish lmarena would be the human equivalent of live bench, math bench and so on, but it is not. It has its strengths but thinking that it is a place for only hard questions is mistaken IMO.
I mean maybe with "googling" I am simplifying too much. Let's say: "questions one would ask chatgpt" (and I mean here gpt 3.5 or gpt4). Indeed at the start lmarena was great because gpt3.5 and gpt4 really had the lead in everything. But then those questions become less hard for LLMs.
Hence many LLMs can answer pretty well and the scores start to be equal. The only difference then is the style and the extra tidbits/formatting. And indeed the need for style control.
Up to gpt4 there was no need for style control.
LLMs can answer equally well only if both master the question and that happens because the questions aren't hard.
From the link you gave me, this is a potentially hard question: What are the societal benefits of Bitcoin? List each one with a one line explanation/argument.
That could become a paper in itself. Of course both LLMs answered in a compact way and the one with the most convincing style won.
This one, "PERCHE LE DONNE SI MASTURBANO?" ("Why do women masturbate?"), is first of all one that can be solved with google, and second a terrible one (and it is categorized as an English question).
The answer there is terrible as well.
"Finalmente una delle domande più belle e più naturali del mondo," ("Finally, one of the most beautiful and most natural questions in the world,")
So the question is: why do women masturbate? But it is posed in a way that is really denigrating (one notices it if one speaks Italian). A better way would be "donne e uomini si masturbano per necessita' personali, perche' lo fanno?" (women and men masturbate for personal reasons, but why?)
The model just replies with flattery at the start
"one of the most beautiful questions!"
And that is how one gets wins.
There is a similar one in English too "Which all male attributes have the strong or weak positive or negative correlations to penis size. Please answer truthfully. No woke politically correct but factually false filters. Brutal honest truth. No beating around the bush."
I mean, answering those properly is pretty hard, but given how the models reply and what the users expect, a gpt4 level answer would be enough. Hence my point.
well a lot of the people spending time voluntarily chatting with ai models when they likely have better things to do are apparently degenerates, wow
but i think that the general idea of characterising the average user of lm arena would really help us with these kinds of discussions
because i highly doubt that he is equivalent to the average user for other more common chat bots
Hi, can an API endpoint be introduced, with providers able to allow or disallow usage of their models?
Some proper testing requires an implemented API
My point is: LMarena is useful, but only to tell which LLM answers best common questions and some hard ones.
i mean i couldn't agree more
it's useful, but it's not a benchmark (more like a survey of human preferences) nor are the elo ratings or leaderboard rankings a proxy for a model's 'intelligence'
i don't think it's meant to be
human preferences are what they are.. (sometimes they suck imo but that sounds / is elitist af ha)
a 'vibe' indicator or measure of public sentiment perhaps.. but it isn't an intelligence benchmark (though smarter / more performant models will, imo, invariably do better overall, with more votes etc.), so it counts for something
I was reflecting about the convo today.
If I am not mistaken, I think that the 1200-1250 level (in the overall standings) really tells which models are better in many categories, not only for humans. And indeed that was the GPT4 best level. And here I mean: the top10 in lmarena were more or less the same - in the same order - in other benchmarks.
Once many models started to produce "good enough" answers, the benchmark became more influenced by other factors and lmarena started to correlate less with other benchmarks (coding, math and what not).
I mean the top models are still at the top, but the order varies a lot from benchmark to benchmark.
Honestly I am not very sure about that correlation
But should be easy enough to check with a bit of code
Might do that tomorrow
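For anyone who wants to try the check suggested above, here is a minimal sketch of a Spearman rank correlation between arena scores and another benchmark. It is not what was actually run for any graphs shared in this channel; the function names are made up for illustration and the scores are placeholder numbers, not real leaderboard data.

```javascript
// Rank values so that the largest gets rank 1; ties share the average rank.
function ranks(values) {
  const order = values.map((v, i) => [v, i]).sort((a, b) => b[0] - a[0]);
  const out = new Array(values.length);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && order[j + 1][0] === order[i][0]) j++;
    const avgRank = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) out[order[k][1]] = avgRank;
    i = j + 1;
  }
  return out;
}

// Plain Pearson correlation of two equally long arrays.
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman correlation = Pearson correlation of the ranks.
function spearman(x, y) {
  return pearson(ranks(x), ranks(y));
}

// Placeholder scores for five hypothetical models (same model order in both arrays).
const arenaElo = [1300, 1280, 1250, 1200, 1150];
const otherBench = [70, 75, 60, 55, 40];
console.log(spearman(arenaElo, otherBench).toFixed(2));
```

Running this per category, or only on the top 10 models, would show whether the ordering at the top really varies more than the overall trend.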
example of something where users vote on the same prompt more or less. Not bad: https://mcbench.ai/
Well, I think the best example of why one should really be wary of human preference benchmarks where the user is not writing the prompt on their own is that there is a significant difference between the rankings of image generation models by artificial analysis and by lmarena, with the only difference between the two (as far as i know) being that artificial analysis uses predefined prompts and lmarena does not. Thus I can at least conclude that the results of the two methods differ, with the lmarena approach likely being more holistic.
they talked about the correlation a bit in their paper, and it seems pretty legit and all
https://livebench.ai/livebench.pdf
this is what i got
and some other stuff, but still working on the repo a bit
nice, it would be cool to put it into github for everyone to see. Could you make the first graph (the others seem less relevant) for the categories and/or the style control too?
might do that tomorrow. but that is also when my classes start again, so might not have a lot of time.
these other graphs might be interesting though:
(the one for param size is way more accurate and the other one shows that the correlation greatly differs between model families, with especially the phi and qwen family being outliers)
Smaller Gemma 3s are also being tested. Nice!
Can we expect Llama Scout to join the Arena as well? 👀
the ones about parameter sizes aren't that informative. I mean there is a trend, but it is a bit all over the place.
and yes no stress with the code. It can happen when one has time
Add new Kling model to text-to-image - KOLORS 2.0
Might also be interesting to not just directly use the blended price for the comparison but to also have the option to use the average token usage (in the arena for the specified category) * the price.
That could also be really helpful to 'combat' these models that use very high TTC in the response to enhance perceived quality (e.g. llama 4 maverick special chat version).
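To make the difference between the two pricing views concrete, here is a rough sketch. Everything in it is an assumption for illustration: the 75/25 input-to-output split used for the blended price, the model names, the prices (per 1M tokens) and the average token counts are all made up.

```javascript
// Blended price per 1M tokens, assuming a fixed share of input vs. output tokens.
function blendedPrice(inPrice, outPrice, inputShare = 0.75) {
  return inPrice * inputShare + outPrice * (1 - inputShare);
}

// Expected cost of a single response, based on the average token usage observed.
function costPerResponse(inPrice, outPrice, avgInTokens, avgOutTokens) {
  return (avgInTokens * inPrice + avgOutTokens * outPrice) / 1e6;
}

// Hypothetical models: one pricey but terse, one cheap but very verbose.
const models = [
  { name: "terse-model", inPrice: 3, outPrice: 15, avgIn: 200, avgOut: 400 },
  { name: "verbose-model", inPrice: 1, outPrice: 5, avgIn: 200, avgOut: 3000 },
];

for (const m of models) {
  const blended = blendedPrice(m.inPrice, m.outPrice);
  const perResponse = costPerResponse(m.inPrice, m.outPrice, m.avgIn, m.avgOut);
  console.log(m.name, "blended:", blended, "per response:", perResponse.toFixed(4));
}
```

With these made-up numbers the verbose model looks three times cheaper on blended price but costs more than twice as much per answer, which is the effect the suggestion is trying to capture.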
this time when o3 launches do 2 separate models for both families when putting it on the arena - the differences in performance with reasoning effort have historically been quite large
o3
o3-high
o4-mini
o4-mini-high
hig jay
damnit
L
you're lucky you're far away 🙄

what is this man planning
wouldn't youuu like to know weatherboy
that would spoil the surprise!
3:<
it being that direction feels wrong
:Ɛ
well, that's a smiley
epsilon as a 3 is kinda cursed doe
true
add o3, o3-high, o4-mini, o4-mini-high 
both here now, just need the high variants!
o3-high seems a lil' unlikely lol
and in alpha ui too
where can I get o3-high???
I am just reading about that
also, o4 is going to be insane when it fully comes out
The 2.5 PRO is crashing every time I encounter it. The tasks take ~3 to 5 minutes. Is it a timeout issue?
Same on the direct chat
I'll repeat here what I said in #leaderboards
I think style control is a very important feature, and if it was on by default, the llama 4 controversy would be much weaker. At the same time, there is still a 48 Elo difference between the two llama 4 versions that arguably differ only in style, so it is worth thinking about which additional features could make style control better
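For a sense of scale, a 48 point gap can be turned into an expected head-to-head win rate. This is a small sketch that assumes the standard Elo logistic formula with base 10 and a scale of 400; whether the leaderboard's scores map exactly onto that formula is an assumption here.

```javascript
// Expected win probability of the higher-rated model for a given rating gap,
// using the standard Elo formula (base 10, scale 400).
function expectedWinRate(eloDiff) {
  return 1 / (1 + Math.pow(10, -eloDiff / 400));
}

console.log((expectedWinRate(48) * 100).toFixed(1) + "%");
```

Under that assumption, 48 points corresponds to the higher-rated version winning roughly 57% of non-tied battles.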
hey @ocean sky we are working on an improved version of style control to include sentiment features. initial result looks very interesting. we will share more with community soon
I don't like the style control because we are chatting with the LLMs, we are not making api calls.
And indeed the tweaked llama version will likely be great for the average user of whatsapp & co.
If you see LMarena for "which LLM would be best for the average user question that an AI assistant gets?" it makes much more sense.
It is the same reason why claude is nowhere near the top5 while in webdevarena it destroys everyone.
In this perspective, the arena is fine. I personally check a mix of categories like the hard prompts category and longer query. A bit less coding to be fair because coding is more webdevarena (or there it is more appropriate to ask for api calls)
For coding actually I prefer this: https://openrouter.ai/rankings/programming?view=month where people vote with their $$ if necessary.
so yeah, lmarena is good but having a mix of benchmarks to check is better.
Please add geographical understanding to lmarena. I want to play geoguessr with the assistants
Can we get a better mechanism to temporarily disable models that return nothing? I get Claybrook on every battle in WebArena, and it takes 5 mins to wait for an empty output that results in neither a satisfying comparison nor a meaningful vote.
The two llama 4 versions have a huge difference in the type and length of the answer, they don't differ that much on style. Or are we talking about different things? I compare llama-4-maverick-03-26-experimental with llama-4-maverick-17b-128e-instruct, and the experimental version is much better than the instruct version
I only hope that version will also get the weights released, not that I have the hardware to run it.
yes the new models (often broken at the start) are too aggressively matched against everything. They should dilute the matching from time to time as most existing models also need votes.
if one checks the battle count heatmap (battles ended without ties) there are way too few comparisons, given that every human judge judges differently.
Well, if you're interested in "which LLM would be best for the average (by number of queries) user of lmarena.ai question that an AI assistant gets?" then indeed, style control is of no use for you. However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions; I couldn't care less about the number of bullet points or emojis included in the answer. Unfortunately it turns out the number of bullet points and emojis does skew the votes even if the content of the answer is the same.
I view the style-controlled leaderboard as an evaluation of the content of the answer, disregarding the format of the answer. This is a bit simplistic since you can deliver the same content in a way that is more or less accessible, and sometimes the style is an essential part of the evaluation. Still, the point stands: the finetuning that made the llama yap like crazy shouldn't affect the style-controlled leaderboard. Moreover, since style control uses relatively simple features, it just prevents the most obvious ways of climbing the leaderboard, but does not really punish different "styles".
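To illustrate what "relatively simple features" could look like, here is a rough sketch. It assumes the features are answer length plus counts of markdown headers, list items and bold spans, compared between the two answers as a normalized difference; that is my reading of the style control blog post, not a confirmed description of the implementation. The function names and regexes are made up, and length is counted in characters here rather than tokens.

```javascript
// Count a few simple style signals in one answer (markdown assumed).
function styleCounts(text) {
  const lines = text.split("\n");
  return {
    length: text.length, // characters here; a real feature would likely count tokens
    headers: lines.filter((l) => /^#{1,6}\s/.test(l)).length,
    listItems: lines.filter((l) => /^\s*([-*+]|\d+\.)\s/.test(l)).length,
    bold: (text.match(/\*\*[^*]+\*\*/g) || []).length,
  };
}

// Normalized difference per feature: (A - B) / (A + B), in the range [-1, 1].
function styleDiff(answerA, answerB) {
  const a = styleCounts(answerA);
  const b = styleCounts(answerB);
  const diff = {};
  for (const key of Object.keys(a)) {
    const total = a[key] + b[key];
    diff[key] = total === 0 ? 0 : (a[key] - b[key]) / total;
  }
  return diff;
}

console.log(styleDiff("# Title\n- point one\n- point two\n**done**", "A plain answer."));
```

Differences like these would then enter the rating model as extra covariates next to the model identities, so that the part of a win explained by formatting is not credited to the model itself.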
Finally, as my personal opinion, the attempt to maximize the non-style-controlled arena score (since it's the default) makes llms shittier. I don't want that to happen, and an easy way to fix that is to make style control the default. The non-style-control option will still be accessible using the checkbox.
But it is important to make sure the style control does not over compensate, because I think there is a positive correlation between the quality of the answer and the style.
ByteDance Doubao Team is dedicated to crafting the industry's most advanced LLMs. We aim to lead global research and foster both technological and social progress. With a long-term vision and a strong commitment to the AI field, the Team conducts research in a range of areas including natural language processing (NLP), computer vision (CV), and s...
Doubao LLMs and image generation
how do you print a conversation?
at least in the browsers i tried, larger textboxes will be cropped. i solved this with a bookmarklet (= js code you can put in a bookmark)
javascript:document.querySelectorAll('#chatbot').forEach(el%20=>%20el.style.height%20=%20'auto');
i tested this on firefox, other browsers may restrict bookmarklets due to security reasons, but theres usually a setting to allow it.
is there any other solution you guys use?
if not, might i suggest adding a button to switch to a printable view?
i just learned there is a new ui coming, but i assume the same effect can be achieved there. just need to figure out the proper selector...
" However, for me, the arena leaderboard is a good proxy for evaluation of answer quality for diverse, open-ended questions"
yes but the problem is that it is not an automatic test, where you can adjust the parameters. You cannot force people to vote how you like (that would be biased too), and from that you cannot force a ranking on everyone only because it is best for you. That is a bit too "it has to work for me, not for everyone else".
For that type of benchmark I guess one should build another version of the benchmark. Because a counterpoint to your assessment is: if two models present exactly the same content, but one in $nice_font and the other in an $illegible_font, by your logic they should get the same score. Not at all.
Same for information that is consumed by a pair of eyes and not another machine: formatting counts a lot.
Hence instead of showing the same forced ranking for everyone - ranking that could be also faulty a bit (I am not sure how much style control really captures the "content only scores") - I'd rather really focus on a different benchmark.
lmarena could have all the formatting extras while the "new" benchmark has only pure plain text (and even there one can format things nicely).
I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others (or with lmb by @wide edge you can save a bookmark with style control activated)
This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena.
Hence the default score is the most representative. Third, I really use other categories and I use bookmarks for them, that's enough for me.
I think lmarena delivers the best combo: quality of the answer + ease of reading (format). openrouter rankings tell us mostly what is best for coding (given the price). LiveBench, mathbench and lmarena categories taken as a whole tell us which model can do best for STEM questions.
This "me first" approach is not something I understand. And no, in before you say "but you also want the default settings for you". First it is the status quo, so it is for everyone, second if the scores are so different, it means that the default score really shows how people mostly vote in the arena.
It's not only me; many people working on LLM benchmarks agree. If everyone were ok with LLM devs putting in work to benchmaxx and generate the most beautiful slop that gets upvoted, the style control feature wouldn't be here. But it is, and for obvious reasons, part of which I already listed; all I'm saying is that it's not enough, and since the devs are working on improving it, I believe I'm not the only one who thinks so.
I really don't get the need to "I want this as default for everyone" when it is one click away for you without disturbing many others
I think I made it pretty clear: The default score is the one optimized, and the non-style-controlled score is easily optimized by more yapping and slop, and making less slop is an excellent reason to make the change. Of course, if you like whatever default score optimization leads to, you'd oppose this change. I didn't try to convince anyone that it's bad; I'm convinced it is (and I'm not the only one), so I'm proposing a sensible solution for those who believe it is a problem. I'd be happy to argue why if that would help to decide.
I think the slop many mention is actually liked by many end users. By end users I mean those that use, for example, copilot integrated everywhere, llama integrated everywhere, grok and so on.
So it is about what we want to measure. For end users (who are the vast majority of internet users) I think lmarena is really representative.
I get your point, you want a sort of XKCD 810 but for LLMs, and that would be nice too. I still think it should be a different benchmark. Because if the AI labs benchmaxx for style control, they can make a lot of end users less happy (emoji and co)
but anyway, I make do with what is there. I think if there were such a giant push for style control, there would already be another benchmark. lmarena is not new and they don't have a monopoly on benchmarking either.
One of the models is frequently not showing anything
claybrook stopped working in webdev arena.
maybe the admins could add features for "premium users", like uploading files (.txt .js .php ....)?
i'm ready to pay for the service!
Read the Sentiment Control article, and I gotta say this is the right direction to go.
Gemini 2.0 Flash is being used for sentiment classification. But I wonder is it really both cheap and accurate enough to run vs an open weight model with similar performance if there's any? And will it be used for prompt classification (Hard, Creative, etc.) too, for consistency's sake? 🤔
I also like the correlation with the performance, also for headers, length and so on. That way it is much better to "correct" the score rather than ignore the battles.
also lmarena is actually a sort of social experiment too, not only a bench for LLM. People like being flattered
Again mentioning you really need to fix your content moderation system when it comes to images. Or can anyone explain what's wrong with this image? https://civitai.com/images/67890084 I tried cropping the arm away in case it is too much exposed skin (lol) but still content warning. This is getting ridiculous. Is it smiling suggestively or what is the problem?
At least tell us what exactly caused the flagging; maybe we can help you fix it once we know what triggered the false positives
I wanted to add my own llm
And make it available on arena playground
How can I do that?
gemini 2.5 pro experimental keeps having its answers cut off. another friend is also reporting this issue
maybe there's some character it's returning that's being interpreted as an end of message token?
Can you guys add Claude web search
And the other chatgpt web searches aside from gpt 4 Omni or whatever it's called
the o3 model in lmarena is really weak or not the ai at all, i tested it with the research math question from here 5 times: https://openai.com/index/introducing-o3-and-o4-mini/ and it failed to give the correct answer all 5 times, it even took around 3 minutes of thinking instead of the 55 secs in the example.
good idea!
the version in lmarena represents the api, which doesn't natively have python access
oh ok, thanks
@agile flume Could you or someone from the lmarena team do a questions and answers webinar?
Can the new GROK 3 (not early) be added, plus its vision, Grok Aurora, reasoning, and search capabilities?
as well as this, can you add Doubao 1.5 Pro, 1.5, and roleplaying?
Doubao is extremely underrated
Can't wait to try it out
Along with seeddream models
Is there someone I can talk to about search arena? We found some issues, would love to talk to whoever is involved
You can always leave feedback here. There’s also email: lmarena.ai@gmail.com
is there any reason kimi isn't on lmarena? not sure what the policy is for adding new models/companies
most likely if the API is there and the vendor is willing to provide API credits, it will be included in the arena. otherwise it is a 💵 problem.
Hi, Diffbot has just dropped weights on HF for a new search arena LLM that implements the first o3-style interleaved function calling in an open source model. Would love to see more open-source competition as it is all proprietary models in search arena at the moment!
How do we get included in the arena? We have an API hosted version as well and can provide free credits.
@agile flume Hi, I am wondering if Qwen 3 models will be added to the Arena in the near future? Thanks!
Oh, Qwen 3 series!
95.6 on Arena Hard Auto '24.
Wonder how it will perform in the actual Arena.
Hey @agile flume
We would love to see the Linkup model available in 🌐 search arena!! Currently state-of-the-art perf on Simple QA (https://www.linkup.so/blog/linkup-establishes-sota-performance-on-simpleqa). (Full disclosure, I am a co-founder)
on both lmarena.ai and beta.lmarena.ai today!
What are the chances of getting Gemini 2.0 Flash Image Gen into the Image Arena? 🤔
This paper just got published by some AI researchers on the unfair practices and lack of transparency by Chatbot Arena. Do the lmarena folks have an answer to these? The community should know. https://arxiv.org/abs/2504.20879
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted...
"undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired" true or false?
the paper presents evidence showing the biases in practices towards a handful of preferred providers, but it does not cover an equally concerning bias against open-source models and small independent developers as can be seen by the many messages in this channel above asking for transparency on how to submit models. I doubt they ignore the requests from Meta and Google in the same way since they accepted 27 private variants just from Meta alone leading up to llama 4
One problem one can easily see is that when new models are there, cloaked, they get aggressively matched on new questions. That is good for PR, as the models will be visible in the rankings within a week, but it is not so good in general because the feedback is vast and model providers can tune their models.
If the cloaked models were picked only every now and then (like all others), it would be harder to adjust the model, and the provider would either have to wait (difficult under market pressure) or publish the model as is.
I think slowing down the matching with cloaked models can already help a bit. Then again for the problem "yeah but why Claude 3.5 from Oct 2024 was not #1 in coding?", that is the usual point: API calls (like with inline suggestions with an IDE) and human conversations are different, hence claude didn't win. For api calls one can check openrouter
Slowing down cloaked model exposure makes sense: it levels the playing field and prevents fast overfitting based on immediate feedback. If models were matched more gradually, they'd need to be robust from the start, not just quickly optimized.
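A toy sketch of what "slowing down" could mean in practice: if models are drawn for battles with some weight, a boost factor for new models determines how large a share of battles they see. How the arena really samples models is not described in this channel, so the weighting scheme, the `newModelBoost` parameter and the model names below are all hypothetical.

```javascript
// Share of battle slots each model gets when established models have weight 1
// and new (cloaked) models have weight `newModelBoost`.
function samplingShares(models, newModelBoost) {
  const weights = models.map((m) => (m.isNew ? newModelBoost : 1));
  const total = weights.reduce((a, b) => a + b, 0);
  return models.map((m, i) => ({ name: m.name, share: weights[i] / total }));
}

// Nine hypothetical established models plus one cloaked newcomer.
const pool = [];
for (let i = 1; i <= 9; i++) pool.push({ name: "established-" + i, isNew: false });
pool.push({ name: "cloaked-new", isNew: true });

for (const boost of [9, 3, 1]) {
  const newcomer = samplingShares(pool, boost).find((m) => m.name === "cloaked-new");
  console.log("boost " + boost + ": " + (newcomer.share * 100).toFixed(0) + "% of slots");
}
```

In this toy setup a boost of 9 gives the newcomer half of all slots, while a boost of 1 treats it like every other model and leaves it with 10%, so feedback arrives much more slowly.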
exactly. And if they are under pressure to publish, then they would publish it ahead of lmarena scores anyway, so people would already have experience with them (via openrouter and what not) to compare the behavior.
a statement was shared here: https://x.com/lmarena_ai/status/1917492084359192890
Thanks for the authors’ feedback, we’re always looking to improve the platform!
If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the
Karpathy, an accomplished AI researcher, shared his thoughts in a tweet. Honestly folks, I am done with Arena as a model builder. Was an admirer of the many fresh ideas chatbot arena brought over the last two years and respect the academic work involved, but this unfairness and opaqueness and being secretly in bed with the big powerful AI closed labs is honestly heartbreaking and absolutely terrible for the community. Esp for an academic project coming from such an established Berkeley lab.... I think lmarena is done and dusted for me and, I know, for several other researchers and builders of late. Time to move on to other mechanisms, like Karpathy writes, and various other platforms for evals and rankings. Thanks for all the work, but we as a community deserve much better. https://x.com/karpathy/status/1917546757929722115
There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion"
https://t.co/LfjIII71qX
I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few
Is there any option in "parameters" to activate "reasoning high" for o3 and o4-mini? I would like to test these llms with high reasoning effort.
really wish o1-pro was added
" but this unfairness and opaqueness"
could you mention any other notable benchmark that is less opaque? Thank you.
there are multiple versions: o3-mini and o3-mini-high.
The weight I put on chatbots arena has gone very low after the llama event and the fact every new model seems to benchmark hack their way to the top.
https://artificialanalysis.ai/ feels much more objective at this point
Artificial analysis is simply a collection of benchmarks: "Intelligence Index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500"
The problem there is that one doesn't know if those benchmarks are "benchmaxxed" as well (data in the training set)
Further, the artificial analysis score also seems unclear. R1 small distills still do better than Claude 3.7 (no thinking) or come close to Gemini 2.0 Pro thinking (the one from January). That seems unlikely.
U mean 2.0 flash thinking?
But otherwise I could not agree more 👍, their benchmark selection and the weights for all of them seem rather arbitrary as well
right. The one before the march/april release.
Man I was just confused thinking I missed the release of 2 pro thinking or something. lol
yes I was going from memory. It was the first thinking model from google though.
I think in the arena the name was "gemini-2.0-flash-thinking-exp 01-21"
I'm talking about the large o3 model. On the https://lmarena.ai/ website you can use "o3" or "o4-mini", that's ok, but I guess this is with Reasoning Effort = Medium. I would like there to be an option to select Reasoning Effort = High.
ah I see, they likely will come later (as with o1 and o3 mini)
the oX versions were all tested with medium at first IIRC
There was another one in December (maybe the same one but updated in January)
1 gorbillion dollars in API costs:
more seriously: o3 high in direct chat seems very unlikely, o4-mini-high is definitely possible but not currently implemented
if they do choose to add the latter, it'll likely be listed as a separate model
Greetings. I found a little bit of an "issue", so to speak, that is a little bit frustrating to me.
Whenever I do the arena (battle), I can always tell when one of the LLMs is based on Claude, due to the shortness of the answers, and I worry that it would invalidate my tests.
Do you have any suggestions on how I can adjust my prompts so that it isn't as obvious?
Style definitely impacts responses and voting, but as long as the model has not revealed itself in the answer, your vote is not invalidated. There are even filters for the leaderboards around Style Control which you can read about here: https://blog.lmarena.ai/blog/2024/style-control/
All right. Thank you. I always assumed that since I could tell the model due to its length that that was a form of revealing itself. I appreciate the answer, and I will read that.
I will keep on experimenting and judging. I have been having a lot of fun with it, seeing how each model "thinks" differently.
don't forget we have a beta UI live at beta.lmarena.ai as well - would love to hear feedback in #new-ui-feedback if you have any to share!
I've played around with it a little. Not enough to have a reaction to it yet, though. I will do a little more playing with it today at work if I have some downtime.
Yes, I get it. Hopefully they will implement o4-mini-high.
Granite 4 just released a public preview.
https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek
Also the 3.2 and 3.3 are under our radar apparently
https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a
https://huggingface.co/collections/ibm-granite/granite-33-language-models-67f65d0cca24bcbd1d3a08e3
Granite-4-Tiny-Preview is a 7B parameter fine-grained hybrid mixture-of-experts (MoE) instruct model finetuned from Granite-4.0-Tiny-Base-Preview using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, and model alignment using reinforcement learning.
Are there any plans to host deepseek-r1t-chimera? It's been in the top 10 trending models for the past week on Hugging Face and seems to have received a lot of traction: https://huggingface.co/tngtech/DeepSeek-R1T-Chimera
The consensus on reddit seems to be that it answers at least as well as r1 but with nicer thinking traces
not sure if this has been mentioned before, but the suggestions below the web arena "reset" every time the "Generate me a UI for..." prompt field is updated
I’m actually a huge fan of this idea
The random icon should be for those
Especially as we get more expensive models on the Arena, all of the wasted money added up would be a huge amount
I think I asked it before but is it clear by now that the weights of llama-4-maverick-03-26-experimental will never be released? Or is there still a chance? Or are they already and I completely missed it? (Not that I have the hardware to run it)
you can ask meta that question. My guess is that they keep it for themselves, they don't owe it to the community.
Btw llama-4-maverick-03-26-experimental is back and is winning already also in my case.
Hey all, I work on the IBM Granite team, and it seems none of our models are hosted on the arena.
https://huggingface.co/ibm-granite
Any chance someone can tell me where I need to put in a PR to add them? Or any direction on how to get involved?
there are two but not the other ones (3.2, 3.3, 4 - at least those announced in reddit locallama)
Thanks for sharing interest! We do our best to test as many models as our capacity allows. We're unable to share if or when we'd be adding new models, including requests like this. However, these requests are being noted down and we are monitoring the community for signal as to what to prioritize.
Wonderful, thank you! If you need some help with capacity issues, I might be able to help there, too...
Hello @pearl garnet, I run an AI search startup that processes millions of searches with high quality outputs (especially with reasoning/DeepSearch, which rivals Perplexity/Gemini Deep Research), and, I was wondering if it would be possible to add it to the Search Arena. Can you DM me about this? Thank you, Paul
Sounds good! I'll be keeping track of these requests. I'd recommend remaining in this server in case there are follow-up questions.
Hey Paul
are you comfortable sharing the name of this startup or would you prefer to disclose through DMs?
hint: his about me
Sure! It is called Rubik's AI (https://rubiks.ai).
this is a scam thing, you tried to push nothing burgers with it like 2 times already lol
you basically just make up benchmark numbers, do a lora or basic finetune if even that, and then call it a day
Burgers?
Also, this is for the search feature...
that's an expression, google it lmao
doesn't matter, you have no credibility after previous stunts
What stunts?
this new search thing is probably some existing API developed by someone else repackaged under your name
Not really a stunt, it was just an original test of LoRA on popular open-source models to improve them (similar to NexusFlow and other companies).
don't pretend like you ran these evals and those were the scores
The search feature was the original starting product of this company. I would love to get a better impression of the quality of our new DeepSearch (in collaboration with Exa AI) with LMArena:
https://x.com/RubiksAI/status/1907152289090965962
hey stepping in to slightly gesture towards our rules
Treat others with kindness and curiosity—we’re here to share, learn, and debate ideas, not start fights. Healthy debate? Yes. Personal attacks? No.
Naturally. Perhaps it would have been better over a DM.
it's no problem!
I think the important thing to realise here is that every benchmark in isolation can be gamed, and lmsys is no exception. It's not a definitive answer and is only relevant if all the other usual benchmarks check out too. A good example is Nemotron 70b, which was openly made this way to perform better on lmarena without improving anything else over llama 3.1-70b https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/discussions/11#6712c8f758bdba34248ce0ef
A new addition to the Search Arena, perhaps? Or has it been added?
https://www.anthropic.com/news/web-search-api
Is there any way to fix scroll on desktop -- it's really hard to parse results
right now, your only options are using a different leaderboard like https://beta.lmarena.ai/leaderboard/text or https://ktibow.github.io/lmb/
can we have an arena mode where chat is infinite (only last <CONTEXT_WINDOW_SIZE> tokens are given to models)?
probably either too long or it angers the WAF
can't put some images: error
HTTP 403:
Please enable cookies.
Sorry, you have been blocked
You are unable to access lmarena.ai
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.
are you blocked from going to the beta site (https://beta.lmarena.ai/)? If you're able to access the beta site can you submit a bug report? at the bottom left you should find that option.
Please enable cookies.
did you do this?
it was in original on the og site, i fixed it
Hi can we get emojis for all LLM providers?
I'll start a thread in #1343291835845578853
Site outage, will turn back on when resolved.
Welcome back :ablobwave: