#leaderboards
1 messages · Page 3 of 1
You can learn more here: #1397655624103493813
Please include Pixverse 5.0. It is a very strong model. It is better than Hailuo in Artificial Analysis
hello
thank bro
Hello! Please, #1397655624103493813 to to learn how to use the bot and generate videos in #video-arena-1 #video-arena-2 #video-arena-3
Nice and Amazing... I can learn More Ideas... Thanks!
jo
go go
niece
/video
/image
@west pecan @queen owl you both are looking for Video Arena, read more on how to use here: #1397655624103493813
Hey @solid cipher please check this channel for instructions to generate videos. https://discord.com/channels/1340554757349179412/1397655624103493813
this can't be true, is oss 120b on LMarena?
LiveCodeBench is notoriously bad and has been detached from reality for a long time
Which one would you say is the most reliable benchmark to refer to
?
the tried and test one: swe-bench (smaller agentic coding tasks)
the new one (that is less contaminated or optimised for): https://brokk.ai/power-ranking (very large agentic coding tasks)
for** tool use** specifically: terminal bench
for front-end: webdev arena
for machine learning specifically: https://htihle.github.io/weirdml.html (ai models train a machine learning model, score = accuracy)
wow thank you, this is super insightful. THANKS!1
WILL CHECK IT OUT!
Hehe
Does lmarena has leaderboard around controlling mobile phones via agents like this by google research https://google-research.github.io/android_world/ ?
Or is there a plan in near future?
A Dynamic Benchmarking Environment for Autonomous Agents
no we don't have something like that at this point
Hello
Hi, what are your thoughts on the leaderboard? https://lmarena.ai/leaderboard
Site outage, will turn back on when resolved.
Site Outage - Hey everyone, there looks to be an outage with the site, our team is aware and working on a fix ASAP. We've turned off messagin in this server until the site is restored. Our apologies for the inconvenience!
Welcome back :ablobwave:
Check out #1397655624103493813 for information on how to use the bot @signal idol
Hi community! I'm just starting to explore the project's data and have found that the .pkl files don't seem to have a consistent structure. I was wondering if anyone knows why this might be the case. Perhaps it's due to different versions or data sources? Any insights on the origin of these files would be greatly appreciated.
Hi @quartz shuttle, I'm afraid that's just an artifact of sharing our data as we use it. We are always working to improve our data pipelines and that sometimes includes changing how we structure and store the data. Since the data has been releasing for almost 2 years now, this involves a nubmer of changes. If you have specific questions about the formats I'm happy to help if I can.
Guys, anyone looking to collaborate in a way to integrate LM Arena to our AIvsAI platform for games?
We let devs build AI agents for classic fighting games, provide leaderboards benchmarking scores and achievements
We want to expand our LLM module to see how LLMs perform in gaming as a benchmark for LM arena also
This leaderboard is based on the following benchmarks.
Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 4M+ user votes to compute Elo ratings. AAII - Artificial Analysis Intelligence Index aggregating 8 challenging evaluations. ARC-AGI - Artificial General Intelligence benchmark v2 to measure fl...
wild pfp
Leaderboard update anytime soon? it’s been a full week now since the last one
understandable delay with no new models released, but still
bruh it’s just the EU 😂
You'll want to read #1397655624103493813 for more inforamtion on how to use the Video Arena bot.
we updated vision on Tuesday https://lmarena.ai/leaderboard/vision
It was accompanied by a slight change in methodology in order to keep data quality high
https://news.lmarena.ai/leaderboard-changelog/
This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!
For model deprecations, check the public updates on GitHub.
September 2, 2025
Due to the increase in image generation traffic brought by nano-banana, we noticed there were prompts in our
Up
/video
The bot isn't working. Check out #1397655624103493813 to understand how to use the bot properly
up
hallo
Why is Leaderboard info is dated by the 28th of August? Isn't it being updated? Thank you?!
up
hey fellas
Coolio!
sometimes it goes longer without updates, 8-10 days is normally the max that the text leaderboard will go without updating, so we’ll likely see one sometime soon 🙂
/video
hi all
hi
Our Video Arena bot isn't working at the moment. Check out #1397655624103493813 for more information on how to use the bot (when it's workoing properly)
Our Video Arena bot isn't working at the moment. Check out #1397655624103493813 for more information on how to use the bot (when it's workoing properly)
You'll want to read the instruction in #1397655624103493813 for more information on the bot.
hello
hello
hi
How to upload a image to edit with prompt?
Hi
Hello
Hi! Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to do that on our video arena channels 🙂
Please check to learn how to generate on our video arena channels 😉 https://discordapp.com/channels/1340554757349179412/1397655624103493813
no text leaderboard update in 9 days 😢
will probably mean a significant leaderboard adjustment when it does update tho! excited to see where new Qwen places 😳
hello
helo
hi
the stream of "hello" continues! Hello to all!
btw the fact that image leaderboards have a density of votes (No.votes / No. Models) so high compared to the text arena, just shows that images are easier to digest for votes (and have higher demand)
Pretty sure there’s also some abuse going on with people doing some automated prompts for various needs
sure. Small business trying to make flyiers and what not, but still.
@molten dome Please read #1397655624103493813 for information on how to use the bot.
@grim barn be sure to read #1397655624103493813
that's actually like super amazing data and feedback, people using for business cases is very high signal, since we know they really care about which result is better
ratio of votes to models is so interesting to watch. The extreme differences makes me thing we aught to be investigating entirely different methodologies for them
thank you for your patience, new update today 🙂
Hi, do we have option to select starting frame and ending frame? That would be awesome if you add this
Where can I request that the censorship applied to texts be lowered a little? It no longer allows me to make a correction when I paste the text.
AWESOME
Sorry to say there isn't a way to get around the content filters; however, if you believe there are some false positives I'd encourage you to share them in this forum post #1376956905016004759
hell yea
hi were do i find the leaderboards
hello
You can ind them here - https://lmarena.ai/leaderboard
Thanks
Qwen3-Max !!!
I mean, there’s less than half as many votes in the image arena (1.7m vs 4m) as there is text, also split between 239 text models vs. 21 image models
I think that’s moreso a testament to the difficulties of creating/lower priority towards image models among AI companies that aren’t currently leading the field, likely due to level of expense and infrastructure needed to do training runs for image generation models compared to text
more structural incentives leading to monopolization of AI image generation compared to text, but it is also true that the text arena has been running since the beginning of LMArena and Image Arena is a newer creation
That's a good point about the time they've been running, but look at image-edit arena. 6 million votes on 10 models 🙂
It's been around even less time that text to image but has overtaken text arena in votes. Nano-Banana was quite an event
that is discounting how the image arena exploded. That 1.7m votes was accumulated real quick compared to the text arena. If the text arena had the same progress it would have like 10m or more.
But sure the barrier of entry is surely not trivial.
For search what model is really best o3 is on led but its not showing results as expected
true! Nano Banana was easily the biggest marketing push in the history of LMArena
I have no idea how o3 is considered better than Gemini for search, I find it noticably worse
There are lot of parameters for evaluating it but still o3 doesn’t gives accurately what i need i think gemini and gpt 5 are better
3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"
/3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"
I find the formatting of o3 so bad ngl
For medical stuff at least
Awesome 👍
Hello, can somebody point me to the best place to get the newest leaderboard data? It looks like the text arena scores have been updated 2 days ago but the latest file on Hugging Face is still August 29. The new web omits a lot of the underlying information unfortunately.
What we have posted to leaderboards/hugging face is going to be the most up to date data we have released.
Many thanks, helpful! Is there any plans to make the data that was previously shown directly in the web interface available again? I think the category scores are now hidden and only the ranks are shown (e.g., instruction following).
It's possible! Do we want to enhance our leaderboards and are currently working on a future update. The best way to tell us about what specifically you're looking for should be shared in the #1372230675914031105 forum channel. Posting there the specifics would be a big help!
Greatly appreciate the reply, thank you! Will post in feedback once I have some time!
No problem! Don't hesitate to ping me if you ever have any questions or run into problems on the site.
👍
💐
@undone sable Be sure to check out #1397655624103493813 for more information on how to properly use Video Arena
what search model you personally think its good
Gpt 5 pro and Gemini 2.5 pro seem the best two to me. Gpt 5 high, thinking on pro subscriptio, and o3 give me noticably worse results. I think gpt5 pro is noticably better than Gemini 2.5 but it's not exactly a fair comparison
Claude opus CONSISTENTLY gives me false information that would literally kill patients if I listened to it
So I now don't trust it for anything at all
Context is I'm a resident and I use them for ideas on differentials and management for the most complex patients I get
@jaunty cave hi
sup
How to generate veo
Note to check out #1397655624103493813 for more inforamtion on how to use the bot. It's important to keep in mind that you're unable to select which specific model you want, as it's random which model provider you get for your prompt.
Polymarket is now trading a market for the #2 ranked model at the end of the month https://polymarket.com/event/which-company-has-second-best-ai-model-end-of-september/will-alibaba-have-the-second-best-ai-model-on-september-30
definitely will be more interesting to watch considering Gemini’s unbreakable lead in 1st for the last almost 6 months now
i dont get it
they consider #1 gemini?
seems unclear to me
nvm, their other listings
yes - based off of arena score with style control removed
Gemini 2.5 Pro holds a strong lead over second place Qwen3 Max there
why is the gemini 2.5 pro on lmarena so different from my 2.5 pro in my browser?
it gives such better answers
That's super interesting Kami, what sort of ways is it different? Do you have any examples of the same prompt and what you like about one response more than the other?
interesting
hi
has ktibow left the server?
Furthermore <@&1349916362595635286> , if #ai-creations is for creations with AI, what about community ones (not necessarily AI based?)
I wanted to ask them if LMB is back, it seems back.
Hey - yeah KTibow unfortunately decided to leave the server. They played an important part in this community and will be very much missed.
#ai-creations what about community ones (not necessarily AI based?)
I'd prefer this remain just AI based/related for now.
LMB
What does this stand for? Sorry I'm still waking up.
It's quite saddening. But, if you want a similar discord to the old lmsys, openrouter is an option. Ktibow is also there. Idk how active though compared to here.
So isn't it possible to create unlimited videos?
LMB is something ktibow build:
Really nice interface with a lot of new features (for the leaderboard)
this is well known thing for a long time. In Ai studio website, you also will get better and more detailed outputs
Answer is simply: web/app version have extra safety filters + unnecesary system prompt
in ai studio you can get raw output same as lmarena
Also in api, you have chance to turn off those filters and you can avoid that system prompt but web app will continue to be worse
The thing is gemini is installed default in sooooo many android phones, like "billions" android phones, so they feel like they have no chance to make any mistake which causes they keep security higher even if means worse outputs
why do i bother paying then?
paying for pro doesnt give you better rate limits in studio right?
no, pro membership not affects ai studio and ai studio already offers very high limit free
i dont understand too
Homewer, ai studio's generous limits wont lasts forever, i think when gemini 3.0 releases, limits gonna be lesser
Also you cant get veo 3 and deep research on ai studio, only app
LMB is a creation of KTIbow.
Thanks for the clarification! I wouldn't be able to answer that question for you. I'm assuming if you reach out to them (DM or Friend request) they'd be able to answer.
yeah I'll try to see if I can reach him via github, if not, well it is not the end of the world. Thank you!
what is it
Keep me updated if you're unable to. I'm still in contact, so I can nudge if it'll be helpful.
does lmarena use a specific system prompt or instructions different from what users get default in a respective LLMs web interface? gemini 2.5 in lmarena gives me far better and different responses with identical prompt vs both ai studio and the gemini web app. same with claude opus 4.1 thinking
makes it feel stupid to pay for both of them when the free lmarena gives me very superior answers
Not in anonymous battle mode. Im talking about side by side mode
it's not different in battle vs SBS. And if the creator gives us a system prompt, we use it, otherwise we don't. We don't want to insert any of our own biases by designing our own system prompt
i dont think there is any system prompt in lmarena
Must be plasebo
Or maybe lmarena's model version is older and maybe older version works better but im not sure about that
.
absolutely not placebo. ill upload a practice case from boards review questions one second
i dont even have to tell you which is which
same exact prompt copy pasted
@brittle pine
gemini 2.5 pro and gemini 2.5 pro on lmarena
Very interesting. I'm no medical expert, is your feedback that the answer from the model on lmarena is more comprehensive, detailed, accurate? Some other qualities?
But you using app
Can you compare ai studio and lmarena
(use 32k token and disable google search pls)
maybe in ai studio, google search is open and that could be cause worse outputs
turn off search
And use 32k thinking token, i think it will be same with lmarena
Because both are using Api. Both lmarena and ai studio using Api
Only web/mobile app is different
all of the above. the app gemini is missing some stuff that is basically common sense to anyone that would admit this patient as well
interesting. DISABLE search? ill try it
with search enabled it was giving me actual garbage. it gave me literal idiotic information that would kill the patient or at least some of their organs. 2.5 pro in studio with 32k tokens and search enabled i mean
/imagine
wow
very interesting
its literally magnitudes better without search enabled (2.5 pro studio search disabled). with search enabled and same exact settings it gives me output almost identical to the garbage i posted above
wtf?
how does that even make sense?
@brittle pine
why would anyone want to use something worse thats paid vs free thats better...
I bet that's a big part of it. We don't have search enabled except in search arena. When search is enabled models tend to reference the results they find even if they are low quality and would be better off generating a response with their own knowledge. This is super interesting stuff thanks for sharing the feedback
no problem. im guessing this can be related to the bias i read about on your website, where people favor model responses with more source citations even if they arent necessarily quality sources
"We don't have search enabled except in search arena. " this explains a ton lol
yeah there are a lot of interesting human biases, kind of cool that we can collect enough data to actually measure them.
It's really funny, we put search in its own arena because we thought it would be an unfair advantage, but actually a lot of the time the results without searh are better. 🤣
surprisingly few models exist for search lol
honestly didnt even realize i was comparing them without search
have to compare with search now
is there no gpt5-high or thinking etc for search?
gpt-5-search is gpt-5 with web_search tool enabled https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses and with reasoning_effort = "high".
It would be cool to add more settings of reaosning level but search functionality isn't used all that much compared to default text to justify adding so many at this point. Hopefully it will grow. We gotta do more to promote it, I think a lot of people don't realize it's available or that you have to enable it
by this "gpt-5-search is gpt-5 with web_search tool enabled and with reasoning_effort = "high"."
do you m ean its the same thing as gpt-5-high but searches the web?
it's the same model, but in search arena that model is allowed to use a tool to search the web and then has access to the results when it writes its answer. In theory this should improve it but if the info it gets from the web isn't great it actually can give worse responses
i see thanks
Usually, i mean logically api is always better because you have more control on it. Claude is same too. And api is always more expensive. Problem is google giving you api usage FREE
Sadly in gemini app you cant turn off search. Same with chatgpt, you cant turn off search in app.
And using search is usually bad because when you ask something, and if it uses search, it usually reads 25-30 source and giving you a answer with using 25-30 source
But if it cant use search, then it uses WHOLE DATABASE, WHOLE TRAIN DATA, DIFFERENT LANGUAGES' INFORMATIONS
but if search is open, quickly check 30 website and summaries to you
Then you can ask why they making search default open ? well, most of their trained data ending with 2024 or 2025 january so
they want to give you up to date information
they not want to that their LLM saying "no, Trump is not president, Joe Biden is"
what do you mean with this? They always use their training data. Only the search results (summarized) go in their context window and influence the answer. It is RAG at the end.
i dont know, it effects a lot
Also im sure search is causing influencing by your native language a lot which is not good
When i ask something, i want what whole world's opinion(included China), but search usually using your native language and well
giving less detailed or wrong answers
Dear LMArena Team,
I noticed the new models Qwen3-next-80b-a3b-instruct and Qwen3-next-80b-a3b-thinking from Alibaba Qwen have been released recently, and some reports mention they've been added to LMArena. However, I don't see them on the Text leaderboard yet. Could you explain why they aren't ranked? Is it due to insufficient votes, ongoing processing, or another reason?
Thanks for your help!
Hello
good question - collecting enough vote data to update the leaderboards can take a little bit of time. It depends but this can take up to about a week generally.
Where can I find the archived versions of the leaderboard? (i.e. not the latest)
But this repository is no longer actively updated
Hello everyone. I'm very glad to join this impactful Ai community.
Oh I see…I can't find the elo scores except very old versions there, but thanks
How are about new system prompt gpt5 rank?
Havent tested it, but probably still worse then o3 pro, I doubt it would gain 200+ elo from just a system prompt change
The website is 100% vibe coded
No I created it lol just for a screenshot
Gets the point across
600+ elo ratings between o3 and o3-pro ? Unless you use funky elo K factor parameters, is not even close to reality.
Those are the results, I used a Chess.com bot, then took away around 300 elo points for inflation(More specific but in summary this). o3 beat a 1100 chess bot while o3 Pro beat a 1800 chess bot
o3 Pro played very decent moves tbh, none of them were bad until endgame(Which every LLM struggles with) but Pro was able to get a past pawn, o3 had some inaccuraces in middle and more in endgame then Pro.
Makes me hype for how 5 Pro does.
what do you think about gpt 5 pro vs o3 pro ?
GPT 5 Pro would probably destroy o3 tbh
text leaderboard update any time soon? been a week since the last one on Monday :p
Shouldn't be too much longer
okie 😄 excited to see where new GPT-5 system prompt places!! wondering if Opus 4.1 Thinking will continue it’s upward trajectory to possibly overtake Gemini 2.5 Pro 😳
Will the copilot arena EVER be brought back?
Yes or no
I’ve tried asking around so many times and I haven’t received an answer
@glass sun we can't make any promises, sorry :/
Mkay
ok
text to video and image to video leaderboards updated! lots more votes!
Read #1397655624103493813 and do the /video command in #video-arena-1, not here.
Seedream highrea must be 1st place what you guys think
new leaderboard is out! you're right seedream-4-high-res is first on text-to-image, but it's behind nano-banana on image editing by a lot
I see but then you should specify that you tested them on chess. Further for chess tests there are already a couple of benchmarks that are interesting (but they don't tell the whole story)
https://dubesor.de/chess/chess-leaderboard
https://maxim-saplin.github.io/llm_chess/
and there are others.
Because otherwise one would assume the elo is relatable to the lmarena elo (it is not) and even if it is related to chess, it would be related to chess.com bots elo (that is not relatable to player elo). It is a rabbit hole the elo thing.
LLM AI Chess Leaderboard: Ranking, Elo, and Chess Performance of AI language models.
LLM Chess Leaderboard comparing LLMs in simulated chess vs Random and Komodo Dragon. Rankings by Elo, win/loss, game duration, and tokens.
@twilit echo interesting, the LLM chess leaderboard that was based on a random player now stepped up the game a bit with Komodo chess levels (that again aren't totally comparable to the playerbase)
I am pretty sure that when people will see negative elo will say "wait that's wrong" (it isn't, elo works with differences)
yea minus elo makes no sense though, because in real elo maths its not possible to get negative. the worst player possible cannot get close to 0, due to the probablility rating
once they reach like ~400 rating they'd lose 0.0001 elo per loss
as far as I know it should work no problem.
1 /(1+10 ^((-3000-(-3200))/400) works no problem. The -3200 player has a probability of 0.24 vs a -3000 player .
identical if I do
1 /(1+10 ^((3000-(2800))/400)
a 3000 player vs a 2800
it works with differences, it doesn't matter the sign
ah I see you mean rating gain/loss. That's true.
likely they let the random player be clobbered, and hence he got real low, and then the others were scaled down.
thats theory vs practise. in practicial application, no player will ever reach negative elo, no matter the matchups or how terrible
sure, if one starts high enough it likely won't happen
in cases like lichess and chess.com they have rating floors because there are some that want just to lose
and then you make an inverse ladder (that is, player A completely loses to player B, player C completely loses to player A, player D completely loses to C and so on)
I used random movers and worstfish (deliberately always pick the lowest rated moves), and they wouldnt be able to get negative in my standard elo environment
you mean worstfish is not able to get negative from 441 ?
even adding -0.1 a lot of times?
I see. Yes if it adds -0.1 every time (or less), then it takes like forever
from the llm leaderboard with negative values. checking the result I believe the random player is simply anchored very low. LLMs that mostly are equal to it (see the results vsR) have around -30 rating.
That could explain all the other ratings (then it also depends on the K factor they use and so on)
as much as i can appreciate his commitment, i just dont find the way the methods are applied logical or useful to me. good that different things exists though, but as a user its not useful to me and I always try to make whatever I enjoy consuming.
o1 is 140 elo just doesnt translate to anything for me. it doesn't relate to humanity.
agreed. It is simply fun
interesting that they use stockfish to approximate maia engines. I guess another side project
I tried starting mapping Stockfish to Elo, (even did a youtube video on it), but I actually found out the Stockfish elo levels are vastly inaccurate. e.g. whatever chess.com labels 1000 elo or 1500 elo is completely not related to any reality scaling. thus I abandoned that method and switched to accuracy, which while still inaccurate in particular in huge back and forth swing (blunder followed by blunder), found it to be more reliable overall.
yes, the community says the same (a lot of players want to know "if I beat bot XY, what is my real elo?" and the result is: they are all over the place)
Ok I think there is a bit more blogging here, relatively "hidden" from the main site: https://github.com/maxim-saplin/llm_chess/blob/main/docs/notes.md
" We first calibrate Random vs Dragon; then any model’s games vs Random/Dragon can be combined into a single 1D MLE estimate with a 95% CI."
so Random got clobbered. The 1st level of dragon is 125 (if I understood), hence the ton of negative things. Still he could explain that in the main page.
even the worst dragon engine is like 1200. might be missing something, but 125 is far worse than random.
against a natural pool of players of all skill levels
well, in my player set, a complete random got like 438 elo, and worstfish was high 200s
after ~100 matches
ok, so if you would set the random to -30 (if I understand the data, that should be its level) then worstfish would be -260 at most.
Yeah something doesn't add up.
ah I see. If I am not wrong they don't do it like you do (or FIDE and others).
They collect the info and then run a MLE process, with anchored ratings.
So it is not game by game.
I was trying to search the K factor but I was unable to find it.
https://github.com/maxim-saplin/llm_chess/blob/main/data_processing/calc_elos.py#L20-L42
as you said, it is good that there is variation, but it is mostly fun
I think accuracy is better
Also I dont find those accurate and they dont always have the models i'm looking for
in live rated chess tournaments sure, but "real life" is a broad term. In online video games with botting and smurfing it's definitely possible for Elo to go negative if the rating system designer hasn't planned for it by adding a floor or something
this is so cool actually I'm glad I saw this
Can you run worstfish locally? I’ve wanted to play with it for a while
hello
hi
hai guys
lively night, what's up AI people
Hey what's up gang Vulpes here what's cooking
what are you making?
I'm making LMArena 😎
oh, just testing?
I work here, but yeah a lot of my job is testing AI
lol np, what you up to? What bings you to lmarena
boredom
but I'm working on a prompt, to turn Gemini into a good image-to-prompt generator
what's your favorite AI to use when you're board?
haha that's cool, creative way to use AI to better use AI.
once you get the prompt, then you feed that into banana or seeddream or smth?
My plan is - I throw a random photo for Gemini Pro to analyze. Gemini turns it into a prompt, so I can recreate thAt photo as closely as possible. So then, I upload my photo to banana and give it that prompt, basically recreating that photo, but with my likeness
So you can like, browse, lets say, Pinterest. You see a photo you like, throw it in AI Studio for Gemini, get a prompt, give it to banana, super easy
but Gemini seems to be random when it comes to making those prompts, sometimes they are on point, other times, I have to correct like 50% of it
intereting, do you find that gemini is good at coming up with prompts, like for example is this better than just taking the original photo, and a photo of you and asking banana to combine it
I see
idk actually, I just feel like having a precise prompt gives so much more control and reusability
makes sense!
and by using a prompt wit banana, instead of asking for a combo, there's more chance you get through censorship I think
I think there's more guardrails when it comes to deepfaking and putting ppl together etc.
oh yeah, I think especilly in the gemini app I heard people complaining it would not allow them to do a lot of things
yeah, it's very sensitive to anything that could possibly upset the model
but it's barely any better in the Arena, but I have already gotten used to the vocabulary, that you can or cannot use for banana prompts
you find your method more accurate than the other two? Ok, to each their own I guess.
likely it is possible but online is doable too: https://lichess.org/@/WorstFish
back to Cesar with you. Didn't the curier already clarified that?
(for the confused, the user has the username matching the name of a Legionaire in the Cesar Legion in Fallout News Vegas)
Ave! But not true to Caesar, I just like the name, apparently it means 'desert fox' in Latin
It's a more balanced method for sure
We should introduce a category for latin langauge on lmarena leaderboard 😄
text leaderboard just updated!
Anthropic does not move closer to Gemini for #1 - but Qwen3 Max expands its lead in #2 on no style control !
kinda crazy that OAI has 4 models within 1 point of each other overall, and like they are definitely each good at diffent things but it kind of cancels out
i will win
helo
It's good to see legacy platforms holding their own in 2025. Let's me know that AI is highly scalable and on the right track.
hi
nice to read
hey! is there a way I can see the leaderboards before last update?
(eg to compare current vs old one)
Hi
There is! It's not available on the site but you can find here - https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard/tree/main
thanks!
Hi, y'all!
Good day!
It always irked me that in the Legion people didn't speak much in Latin despite their fanaticism.
IIRC, only maybe Ceasar himself and perhaps some of his generals actually were even a little bit educated or cared at all about ancient romans and their culture/language/history. All the plebs below them were conquered and forcefully integrated into the legion or raised and brainwashed from childhood, to follow chain of command and obey Caesar/other superiors, to be fanatic killers, conquerors and enslavers of others who are not part of the Legion. Any education, if at all, would be only in stuff like weapons training, tactics, maybe some Latin phrases relevant to those things etc. but that's as sophisticated as it gets for them, I don't think most of them even understand or know, that Ceasar is making them all LARP as members of a ancient civilization from the past, they just do what they are told. And that's just the legionaries and soldiers, everyone else are slaves and women...
sure but for how they conduct themselves as "high and mighty" they could have read a thing or two. But yeah likely it fits.
@drowsy needle I like the "movers" in the announcement. If you keep that one can track the changes over time (unless one goes through the saved elo leaderboards)
Glad to hear it! I'll be sure to share this with the team.
how do I turn on the filter for Open LLM only on the leaderboard, guys?
It’s the License tab all the way on the right - I don’t think you can sort but basically any model that doesn’t say “Proprietary” is open source
it's kinda annoying that it's not sorted, but thx
If someone made a classifier on responses to predict which LLM it came from, how accurate could it be? And if someone is doing so, could they create bot accounts to target a model and raise it in the leaderboards because they bet money on it?
I don’t understand
What’s the difference between Open rank and Rank (UB)?
Open sourced?
who is the legacy platform in question?
Most scalable tech in history
Then, there will be two: Gemini 3 Pro and Flash, and maybe also Ultra will here.
What is the overall general image to video best model but completely by your opinion not statistics guys? Really glad to find this community, I honestly find Kling most reliable especially for high motion. MidJourney I also find good for more creative stuff, art.
but Google's models do not all have the same score as eah other
some new stuff on the leaderboards, grok-4-fast
isn't it weird that kimi 0905 is better than 0711 in every way yet it comes after it in the leaderboard, and deepseek v3.1-thinking is better than 3.1 and r1 but comes after them?
@rose cradle good question, just raised this with the team, currently it's just sorted by overall and then ties broken alphabetically. We'll update that to sort by the ratings in the other categories
But shouldn't overall be different for them? For example, I think Kimi had a lower overall because math benchmark only came much later, but when it did, to my surprise, its position didn't move.
the overall here is the overall rank, which accounts for overlapping confidence intervals, here the 2 kimis and 2 deepseek are all rank 10 overall, and it just sorts them alphaetically
But look at 3.1 thinking, it is better than 3.1 in a lot of things and it is 10 instead of 9
the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.
in this case they have very close ratings and cis 1417 +/- 6 and 1414 +/- 7. But what's happening is that slight difference makes there one more model above them which 3.1 overlaps confidence intervals with and 3.1 thinking does not. So according to this method, there is one more model we are more certain is better than thinking but not better than 3.1
And that rank is the main sort key. I think cases like this definitely seem weird and are arguents for using a different sorting, but no matter how you order it something will look weird
Ah, ok, thanks. I always thought that was aggregate since you replaced the numbers from the previous leaderboard with this
the one on the old site? I'm not actually sure how that one worked, I joined after it was deprecated. appreciate the feedback a lot
Yes, the old one just had a single overall number and companies would fight for that number. Now it is kind of weird. They fight in individual categories but the main one is more subjective.
I also think you should remove copilot benchmark, and add links to lm battle web arena because it is super hidden
oh, the main overall on text is still the same methodology. https://lmarena.ai/leaderboard/text
copilot 😥 , too many other things to work on rn to resurrect that. it was also kinda hard to use. webdev arena integration definitely coming tho!
Yeah, that's why I think you should remove it, it's been 120 days since last update
I would like to thank all the designers of this website, everyone who contributed to its completion, and everyone who thought of it. All thanks and appreciation for your efforts and hard work. You are geniuses and smart makers. I love you, I love you with all my heart. You are in my heart more than anyone who worked hard on something like this. I love you. All love to the entire team. You are creative. You deserve to be at the top of designers and at the top of this world. Thank you, thank you. You are better than Elon Musk.
@wicked rapids @pulsar grove Hi! If you´re trying to generate your prompts please check: https://discordapp.com/channels/1340554757349179412/1397655624103493813 🙂
Hello to everyone, .....I am here to learn from the diverse skill and mindset of all present, and contribute my own 2 cents.
they removed it
Damn
I'm using it in replicate
After adding my credit card
I thought I could use it for free
hi
Nice
hello
tonight is the night
Hello! Please check #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.
Hello! Please #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.
the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.
in this case they have very close ratings and cis 1417 +/- 6 and 1414 +/- 7. But what's happening is that slight difference makes there one more model above them which 3.1 overlaps confidence intervals with and 3.1 thinking does not. So according to this method, there is one more model we are more certain is better than thinking but not better than 3.1
And that rank is the main sort key. I think cases like this definitely seem weird and are arguents for using a different sorting, but no matter how you order it something will look weird
hello
awe this was really sweet
Hello! Please #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.
Hi! Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 🙂
is it any possibility to use LMArena API Key?
We don't have an API for LMArena available.
whats up?
Okay, tell me, how can I generate unlimited videos either on the official LMArena website or in this group? What should I do so that I get a video exactly related to my prompt, with sound included? I also messaged LMArena in their inbox and gave them a prompt, but I didn’t get a response. What’s going on with this?

??
Sure there are a few questions in here:
- The Video Arena bot can only be used in this Discord server, you can't use through Direct Messages.
- It's not unlimited, as you can only create 5 generations per day.
- If sound is included or not is going to be random. Not all models have sound capabilities, and since it's random what models you get, that means it's going to be random if you get sound or not.
Hi Everyone
So tell me, what’s the maximum length of a video we can make, and how do we get it made? Where should I give the prompt and what should I do?
The length is around 5-8 seconds. The #1397655624103493813 channel can answer your other questions.
thnks you soooo much i understand👍 💯 💓
No problem!
Good place for questions/feedback about the bot would be in #bot-feedback btw. We'd like to keep this channel focussed on #leaderboards discussion.
ok i understand my sweet lover brother @drowsy needle
Please, check #1397655624103493813 to learn how to use the bot.
new here, not sure if this is right place to ask, for search leaderboard, can we include llama (meta ai) as well? if this has been consider, what are reasons not to include? Thanks all.
Hi @true arch does meta surface an API where we can get answers to queries using results of web search + llama model? All the models on search arena are offered as search + llm products by their respective companies.
Hello, i'm working right now on a LMArena review video . The "Exclude Ties" feature is working? Because though the models rank list changes they're still showing ties.
Where can we actually View the leaderboards does this also displayed on the Web?
sup
Hello! Please #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.
Hello! Please #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.
ah! the exclude ties means to exclude votes where the person voted "tie" or "tie both bad". The resulting leaderboard can still ahve models which are tied with each other
Hi, the leaderboards are available here: https://lmarena.ai/leaderboard/ and it's visable on web or desktop
Hello everyone! If anyone knows, I'm curious as to how often is this leaderboard https://lmarena.ai/leaderboard/text/overall-no-style-control is updated
we ❤️ polymarket
hehe
hello ! everyone ! how are u all ?
let´s check it out
Please, head to #1397655624103493813 to learn to how to properly prompt the bot.
doing good how are you?
hello
hello 
look at all these new text models debuting dog we ain’t never getting a leaderboard update 😭💔
really interesting to see how variations even in provider, fal vs seedream official can have noticeable differences. I saw this post earlier about how different providers of other open source text models can differ a lot from the official. 💀
https://github.com/MoonshotAI/K2-Vendor-Verfier
This is fascinating, do other research companies do this or is someone doing this for multiple models?
I think we're the only ones doing large scale tests with real user requests to measure the quality 😎
I hadn't noticed Fal was listed as a separate model lol
I love the comparisons to the OpenRouter providers in general, I think the transparency is super important!
Yeah we didn't have the -fal provider suffix before. We added it now since we are serving the same model from different providers. We put a note about it on our Leaderboard Changelog to clarify it. https://news.lmarena.ai/leaderboard-changelog/
This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!
For model deprecations, check the public updates on GitHub.
September 25, 2025
New model announcement:
Seedream-4-2k has been added to the Text-to-Image and Image Edit leaderboards.
Note that Seedream-4-high-res-f...
Hello
Hello
Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
hoi
What models are used for video?
all the models are listed on the leaderboards: https://lmarena.ai/leaderboard/text-to-video https://lmarena.ai/leaderboard/image-to-video
Flash slash
@twilit echo (dunno where to put this, since #ai-creations is now filled to the brim with pics)
I still check your awesome LLM chess benchmark and I check whether the elo changes are stabilizing or not. Gpt5 seems still to have potential to go up. O3 a bit less. Gpt 4.5 feels like invincible - that thing can go real high but I guess it is expensive (dunno if it still plays but the match with k2 should be relatively recent. Would it be possible to add dates just as reference?). Grok4 seems stable as well. gpt-3.5-turbo-instruct ist still unclear. It still clobbers gpt5 sometimes.
Also other models keep coming but they get brought in clap city according to their scores.
it is mostly openAI and xAI. Gemini is strange though, its rating is far from stable if one sees the rating swings it gets.
and again this guy has the gpt 35 at the bottom https://maxim-saplin.github.io/llm_chess/ but as you ( @twilit echo ) discovered, gpt 35 plays well if the opponent plays well and plays terribly if the opponent plays terribly. Since gptt 35 played only against the random player, it makes sense it goes at the bottom and the benchmark maintainer will never notice that is actually much stronger than that.
LLM Chess Leaderboard comparing LLMs in simulated chess vs Random and Komodo Dragon. Rankings by Elo, win/loss, game duration, and tokens.
correction, the problem seem the amount of illegal moves, since the support layer is not strictly helpful.
4.5 was deprecated and shut down more than 2 months ago, so no newer games possible https://platform.openai.com/docs/deprecations#2025-04-14-gpt-4-5-preview
dates and other info doesnt fit the UI and overbloats everything. newest is top tho.
but since Elo is self-correcting the Elo changes even for retired models (since their former opponents change ratings, which gives less/more points adjusted in realtime). gpt-4.5 used to be higher elo (1817 continuation on retirement day)
I thought that was deprecated but still available (though pricey)
Oh I didn't notice that you recompute the elo for everything rather than keeping it fix "as it were".
every single day all elo is recomputed from scratch. i don't store the elo values anywhere, everything is calculated on the fly with every new game
but with a database used as source to compute everything, its trivial to revisit any desired point in time and compute all stats as they were precisely at that moment. so nothing is ever lost even if I don't store hard values, because the formulas are what matters, not the output
we do this at lmarena too haha
but because we're not actually using Elo, but Bradley-Terry which treats matches identicaly regardless of time
batch estimation if i’m correct? cool process 😄
yep, maximum likelihood estimation over the entire dataset rather than one row at a time
Got it — batch MLE for Bradley-Terry makes total sense! Just to clarify: when you say "not using Elo," is the displayed score scale (e.g., 1469, 1437) still a log-transformed version of the BT parameters (like Elo uses for readability), even though the underlying estimation is batched BT? And how do confidence intervals factor in — are they computed on the raw BT scale before scaling to the displayed numbers?
Yep both the scores and the standard errors (+/- values) are computed in what I call the "natural scale" which means initialized to 0, using natural log (base e) and no scaling factor.
The "Elo" scale uses log base 10 and a scaling factor of 400, and an initial rating of something like 1000, 1200 etc.
After we fit the scores we multiply them by 400/log(10) and then add 1000 to each. For the +/- we only do the scaling
okay, yeah that makes a ton of sense! thanks for the clarity 😄
I see but how I understood it, since the previous games happened in the past, they had an influence about the new games (since Elo is moving forward in time), but the newer games do not have an influence on previous games.
Hence a model that doesn't play anymore has the Elo fixed in time.
Am I understanding the approach?
(if yes) It would be different if you, say, would compute the elo regardless of time. So picking results randomly, many times, and then averaging the final result.
how are this many ppl that stupid to not read the actual channel they’re sending stuff in I don’t get it
no because the elo gains/losses are not static. for example, even if gpt-4.5 never ever plays any games anymore, if at its playtime the elo of an opponent was higher and thus they gained +30 if 1 year later the elo adjusted and their former opponents are now lower elo, it will auto adjust to e.g. +21 instead, automatically autocorrecting even for past games
my system is more like retroactive Elo or Bayesian rating where the entire history is continuously reinterpreted based on all available information.
This approach is mathematically more sound because it uses maximum information to determine ratings
what is leaderboard in LMArena guys?!
I see, so it is different from FIDE / chess.com / lichess approaches.
You look also forward to adjust values rather than computing purely chronologically.
While I disagree on the tone, I agree on the gist of it.
I mean we are already at AGI/ASI level if people refuse to read instructions that LLM would read and spam the wrong channel.
nice
when will it finally update
Can you explain to us your thought process for putting this in the leaderboard channel?


That's up above his true paygrade, Dude. Lol..
amazing
Sweet
Hello
Ola gente
Hi
Hi
HELLO
Helloo
Hello
everybody treating this like the greeting channel again 😭
hopefully we see a leaderboard update tomorrow!
probably been enough time to gather enough votes for all of the new models
hello
Hi
Which bet did u take haha 😂
pfft 😂😂
I LOVE POLYMARKET
I had good profit off of holding Yes for Alibaba #2 and Google #1 in style control markets
yea I’ve cashed out some of my position to trade on other markets but
there’s some funny business moving in it currently
big buys on OpenAI yes and Alibaba no but
based off of GPT-5-High’s system prompt
I don’t have high hopes for its performance debut it doesn’t seem much stronger
October #2 will be interesting
Google has been increasing in price, not because it’ll be surpassed I don’t think but likely because new Gemini flash could debut at #2
Oh snap i didnt even consider that option
Flash no2 debut would be brutal
When they update text leaderboard?
Hopefully today!
is long-cat-thinking here in test?
This can vary as it depends on how many votes we're collecting, but would expect updates around a ~week
@hallow grove Hello! Please check #1397655624103493813 to learn how to use the bot.
is there any rule for this? Like 10 thousand votes per or something?
There isn't going to be one single metric we go off of as before we post an update we're going to validate the vote data to ensure accuracy before posting an update.
Why isn't Claude 4.5 on leaderboards yet?
@drowsy needle When was Claude 4.5 added to arena ? today?
Bring added to leaderboard or added to the arena?
Right now 
That's TBD
Oh. So you just added in the arena. Once you have enough votes, it will be added to leaderboard?
WebDev Arena specifically atm. I'll post another update when added to Text.
And yes, once we collect enough votes and validate we'll push a leaderboard update. TBD on when that'll happen.
So, it is still not there on other arena like text yet?
Correct
hi
is this the text arena one?
Yes
8 M votes (yes, I don't use that much video/picture mode)
🤤
We need that many in the text arena!
I guess this is super minor, but mobile has some annoyances
it has been 11 days
completely understandable with the amount of new models that have debuted recently - wanting to release scores for them all at once but
@tawdry socket Claude Sonnet 4.5 is in Text Arena currently
I got it on a prompt earlier today
Nice. How come they didn't have this as one of the anonymous model? Claude doesn't do that?
@tender sigil
How is it anyway? I can't test it right now .
Claude doesn’t do anonymous testing models
at least they don’t for full release
but I have never seen someone show evidence of a codenamed model being part of the Claude family
Makes sense. Claude could have shown off lmarena performance at least on web Arena to market their model like other companies do when they release...
even more models added to arena omg 😭 September has been insane
Need gemini 3

fingers crossed 🤞
TBD! We'll be sure to update when it's ready!
yes
good luck
I don't get it. How do you actually get the AI to create a video. All I'm seeing is this conversation
#video-arena-3 a little baby come on classroom and background music video generate
When is sonnet added to leaderboard?
Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate content 🙂
good
leaderboard hasnt been updated in some time
i wonder why
Its not enough time with automation to put sonnet45 placement?
nah its in the voting already. The leaderboard just doesnt update automatically
Everytime a new famous model is released people start asking for leaderboard update. My best guess is that the update is combined with the big companies. I'm not saying that it is manipulated i I'm just suggesting that there is a schedule.
There has been a ridiculous amount of new models released recently so
it is likely they are trying to update all at once in a big batch - it’s possible we don’t even see a leaderboard update today
Yeah, theres a kot of new models that need a lot of votes before it can be acurate
damn there are too many models. the rate of leaderboard upgrade cannot catch up with the rate of new models releasing
@tropic pivot
Do you need help with something?
Not LMArena but definitely leaderboard related. @twin valve as the resident chess expert what are your thoughts about the most recent chess ratings news? Seems like a previous change was not implemented correctly, and then applied retroactively with little/no warning?
https://x.com/BortnykChess/status/1972746172029936024
Something LMArena would definitely like to avoid
@crimson seal Please, check out how-to-video-bot to learn how to properly prompt the bot.
now up to 13 new models in the arena since the last LB update 😂
Wild times to be an arena fan!!
updated 12 days ago, more than one model a day!!!
we need to make battle royale mode, each prompt goes to 100 LLMs and the user eliminates them one by one
bahahahahahahaha
or make 100 different bots play the same game or logic puzzle and steadily eliminate them when they miss a step
honestly this is pretty cool lol
would be crazy expensive lmao
imagine lmarena hosting an event or something though
like once a week or something
battle royal between all models llike you said
lmao
@vale island be sure to review #1397655624103493813 for information on how to properly use Video Arena
Hey! Chess expert is too much (I think @twilit echo is as expert as me if not more) . Though I am rating interested. FIDE didn't apply them retroactively, they simply do a poor job. They noticed that people "farm" rating (nothing new really) and so they decided that for blitz and rapid time control they would use the pure elo formula - and not the 400 points cap - for all players over a certain rating. They did this in Dec 2024.
The problem? They forgot to implement it. Chess is full of nice ideas and poor executions (ironic for an intelletual game)
So thanks to Nakamura FIDE decided to create such a rule also for classical. Wronly imo, because without enough simulations they simply wreck the work of Mr. Sonas (that introduced the most recent rating fix as per March 2024). Simply adding a ton of edge case rules mess up with the ratings (and then people foolishly compare ratings between eras, while a lot of small but important details changed in the meantime).
Hence Bortnyk, that together with Naroditsky plays a lot of blitz and rapid in Charlotte, cries about his rating. He is partially right, because FIDE did a poor job of implenting the change and thus it feels like a rating steal, but it is not.
IMO it would have been easier to say "nah, if players farm rating we simply decide to unrate the event for that player", that was always a FIDE prerogative (they did it already with Alireza in 2023)
The problem with FIDE is that they are treating the 2700 and 2800 group as a sort of marketing. As if lmarena would say "we cannot have ratings going stale or going down, we need new models to reach 1450, 1500, 1550 and so on. We need to pump the ratings for marketing". So they try to find rules to not erode the 2800/2700 group while not pumping it too much where it feels fake.
while in reality only difference matters in ratings.
btw on reddit /r/chess there are like 3-4 threads about it.
here another from Naroditsky directly: https://www.reddit.com/r/chess/comments/1ntgf7q/fides_major_announcement_on_amendments_in_ratings/ngvwpfr/
👀 👀 👀
Does anyone know why Google has been in the top spot for some time on text?
Gemini is good but i rarely ever see it considered the top model by anyone
cause it wins battles constantly?
what people "feel" vs a measured deterministic algorithm is often not matching
Hello!!!
Hello what's up? See the recent leaderboard update?
hey! I was just replying to Yeeto
lol oops I meant to reply to Vitor 🤣
You are definitely right, Gemini ranks high because it consistently wins the comparisons
Yes!!! new leaderboard update is awesome
both new versions of Gemini Flash, new Qwen 3 max and both versions of v1, and both terminus variants of V3.1 - Sonnet 4.5 and GLM 4.6 should debut soon!
guys in webdev whats diff between qwen3-coder and qwen3-coder-plus ?
qwen 3 coder plus is newer, and apparently worse?
LMAO
Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
qwen3-coder is this model https://github.com/QwenLM/Qwen3-Coder
and qwen3-coder-plus is an updated version from 9/23 https://www.alibabacloud.com/help/en/model-studio/qwen-coder#a3bbe78773cec
Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud. - QwenLM/Qwen3-Coder
Qwen-Coder model capabilities,Alibaba Cloud Model Studio:The Qwen3-Coder model has powerful coding capabilities. You can integrate it into your business using an API.
Why is gemini 2.5 pro at the top
a lot of people vote vote for it over the other models it is compared against 🤷
Reminder of one of our server rules:
✅ Treat others with Respect.
here is a working Sora 2 code - SYKQBJ
Thanks
he is normally walking. Cinematic camera motion
every time I do a prompt to generate a video, I can't find where it's generated
cat and lion world tur
Will sonnet 4.5 thinking ever be on the leaderboard
was added a few days ago mate
don't expect it on the leaderboard that fast all the time
Wasn’t it added same time as 4.5
Yeah, I'm not a lm arena insider but if I had to bet 4.5 thinking probably took first place by far and so they are doing their due diligence to check that that was legit.
the non thinking mode has 12 on each side error bars I think
It looks like you haven't been prompting the bot correctly. Be sure to review the info in #1397655624103493813 as it should be helpful.
yeah I remembered correctly, there's really large error bar for non thinking sonnet rn, so they probably are waiting until the bars get smaller to post sonnet thinking
Will be interesting to see if it surpasses 2.5 pro
Hey pineapple, can you confirm, just for fun?
.
Pls?
Sry confirm what?
This
small baby dog eyes
a really cute dog specifically
Cute dog eyes difficult to resist, but yeah sorry can't provide any insider info
i doubt it honestly. i dont code but opus 4.1 16k thinking seems better for me than sonnet 4.5 32k thinking
Huh
?
I meant for general use
Not coding
Was only saying I doubt 4.5 thinking took 1st place by so far they're delaying the results to verify or whatever was proposed above
And now Clayton is making me think otherwise 
Lol


@proper wave be sure to review #1397655624103493813
NICE!
bro plese help me to finde out the code of the sora 2
hello
Nano-Banana dethorned.
https://x.com/arena/status/1974502371721162982
still got image edit, which is its crown jewel, I'm suprised that hunyuan did it though, would have expected seedream
Bruh but hunywan 3 0 sucks
Its only a better version of DALL-E3
I mean all of them are better versions of dall e when you think about it hard enough
@proud crescent please check #1397655624103493813 to know how to properly use the Arena bot
ok
@calm kraken @summer lantern Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate
Hello i need coupon flova
Please check how-to-video-bot to learn how to generate
the 2 messages above this are literally about this
my brain
Hello @errant tinsel Please check https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to use the bot and https://discord.com/channels/1340554757349179412/1397655695150682194 https://discord.com/channels/1340554757349179412/1400148557427904664 https://discord.com/channels/1340554757349179412/1400148597768720384 for your creations.
super cool analysis! Thanks for sharing
In the leaderboards, Claude 4.5 sonnet is ranked below Deepseek R1 and gpt5.
However from my usage, I've seen that sonnet makes way better frontend UIs compared to those above it.
Why is it still ranked so below?
nice
which leaderboard are you talking about? On our main leaderboard claude 4.5 sonnet is above both of those
https://lmarena.ai/leaderboard/text/overall
gpt 5 at the top for web dev
but purely frontend speaking, claude sonnet performs better from my experience
this channel is full of toddlers
are you ok dude
I love you
k
Thats 4.1 not 4.5
?
If you see 4.5 then it hasnt been put on the leaderboard yet
The 4.5 model that shows up isnt the thinkng model
Meh even non-thinking performs better than gpt 5
If you say so
Although I just realised the gpt 5 here is the "high" model, particularly not the one that is available to everyone for free so idk.
@icy girder be sure to check out #1397655624103493813 for instructions on how the video bot works.
Good app
see what I mean @idle ocean ? "In my use case model X is better than Y, I think it should be higher despite all other use cases that aren't mine"
? Why am i being pinged
for the discussion we had in the other channel about the topic. The quoted message is a perfect example of what I meant. But If that is "too many messages ago" then nvm.
I mean i agree that people are like that..
ok
Ok
I think it's just my unique skill of prompting 😎
Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
Interesting, when you use it are you usng it with thinking or without? sonnet 4.5 with thinking isn't on webdev arena leaderbaord yet
Non-thinking
Even 4.1 beats gpt 5 (free) and 2.5 pro Gemini (from my experience)
But this is purely frontend-wise. Maybe web dev on the leaderboard is a culmination of all things web dev.
Hi
have you actually tried web dev arena?
@wise depot for sonnet-4-5 I believe it's ranked higher despite having a lower score because the upper bound CI could be higher than the rank of the other models ranked below.
For these having the same rank & CI I'm not sure why they're listed this way. Will ask and followup when I know more.
yo
So I don't have an update for you specific to this; however, once raised the team it's created a lot of discussion. @idle ocean
thx
Please head to #1397655624103493813 for a detailed guide on how to use the bot
Hello
does gemini3 show up in arena?
Not out yet
is there anything even known about gemini 3 being out? I know there's some coednamed models but neither really sticks out to me as a potential gemini 3 so far
not really, supposedly some ab tests on studio but that's it
that's what I thought but I see some people get super hyped about it that I dont even know what to think lol
comes out in 2 days/weeks/months
(according to some sources, in 2 days [thursday], which i still doubt)
but in january we will have it, 100% sure
highest probability: this quarter
The way some people were talking, it was if it was already released which was annoying
@slow solstice be sure to check out #1397655624103493813 for information on how to use the bot properly.
Let me know if you have any questions.
might be referring to the AB tests
people see in the networking data on those that it has an entirely different version number than 2.5 pro or 2.5 flash
and that apaprently its much better
not sure if i believe that theyd release a new flash preview end of september for 2.5 if theyre releasing 3.0 pro in october though
seems stupid
I have seen a better/different feeling model then 2.5
true about the flash thing that baffles me
they slowly tweaked 2.5 pro over its lifetime, maybe they had a bunch of changes to 2.5 and for some reason ended up just doing it all at once
ive read they always update their current model in october and release the new one in december 2 years in a row now?
havent looekd into it myslef though
I have never heard of such as thing, but I can look into that
the December claim is a little stretched, cause 1.0 and 2.0 were both released in December, but plenty of other models get released at other times
I think they were unsure if releasing a new 2.5 version was a good idea (but had a smaller team working on it the whole time) and now they have decided to release it (idk why).
But I bet the reason why they did not update 2.5 pro (which they probably have a better version of internally) is to make 3.0 flash look better.
And stick to the claim of 3 flash outperforming 2.5 pro
Well they recently released Gemini for Chrome, maybe that's why
Sonnet 4.5 on Web Dev Arena leaderboard...
thats non thinkin tho
also gpt 5 being at the top is weird it always makes like the exact same ui
but gpt-5 UIs loos so much better than sonnet
Anthropid models just spam gradients and their spacing is really off a lot of the time
guys do you know how can I use the Claude sonnet 4.5 thinking? I think it’s different from the external thinking function right?
?
??
Or they’re same? Sorry that I’m not familiar with Claude🫡
You can select direct chat in the upper left and then select claude sonnet thinking model.
Also are you a fan of clearlove the Chinese jungler?
I mean through the Claude app or through their official website to use the thinking model, not at the Lmarena haha. And yes, I am a fan of the Chinese jungle Clearlove🎉
oh, sry I don't actaully know all about how to use Claude through their app, feel free to use on LMArena though. I was a huge fan of clearlove too. Amazing career
Some lore about me, I started getting interested in statistics and machine learning by ranking League of Legends pros. ClearLove had amazing stats. here's a post I made 9 years ago. https://www.reddit.com/r/leagueoflegends/comments/4q783c/kda_rankings_in_professional_play_over_9200_pro/
Now almost 10 years later I work on leaderboards of AI models instead 😄
I love learning Clayton lore
wow, 9 years ago, back then I was younger
wowww that’s incredible!! As he has retired and became a coach a long time ago, it’s really not easy to meet his fans nowadays. So happy to meet u!
@uncut ginkgo Please check #1397655624103493813 to learn how to use the bot.
You Must Sacrifice Your Wallet To Dario
@cerulean grotto Please head to #1397655624103493813 for a detailed guide on how to use the bot
/video promt
Please head to (https://discord.com/channels/1340554757349179412/1397655624103493813) for a detailed guide on how to use the bot.
@terse hornet Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
@foggy agate Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
Please, check out https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
search arena leaderboard updated today, 20k new votes, not much actually changed though lol. We need some new search models
Gemini 3.0 search

what in the world does :malicepfpmoment: mean?
LMAO

Hard to explain, inside joke someone into sneakers might get
helo
search is so underrated because is one of the most immediate (yet very useful) application of LLMs (or even agentic LLMs for deep search)
dont understand how grok beats google at search lmao
not saying its worse just think its funny
Cause gemini sucks at search
either
- they don't provide their best search model on lmarena (AFAIK the search models are more or less similar to the free and pro subscription models, the one around 20$, not the ones that costs more)
- or search up to a certain level of complexity has no moat
- or the search questions aren't too tricky and so many models performs similarly (the lmarena search started much later than the lmarena text, so the models there are more or less all recent)
Also not all vendors are in the lmarena. For example mistral has search abilities as cohere does, but they aren't in the arena.
nah
you are both wrong
the reason why Gemini is so bad at search is 78% of the time the links it posts don't work
that it
Sometimes the best answer is the simplest one
Interesting because in my cases they work (I mean those under sources) but yeah could be the case too if I got consistently lucky.
why is qwen ranked so well?
every time I've tested it the answers were terrible related to lower ranked models
What sort of questions do you usually ask?
It was better than 2.5 pro for medical questions for me tbh
As in more accurate to reality information
I think I used qwen max preview not the other qwen max
But dont remmebr
@cerulean bane
@jade sundial Please check #1397655624103493813 to lear how to use the bot.
Please go to #1397655624103493813 to learn how to use the bot
Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content 🙂
👍
Because questions are skewed by popularity in deliverables and coding which qwen is fine at, if you look at individual category scores that will give you a better idea
Deliverables?
Why hasn't LMArena updated the text-to-video and image-to-video leaderboards for almost a month?
I know they dont show the model apparently until after people besides the person who generates it votes
I wonder if theyre not counting votes either until then?
@drowsy needle Do you know this?
@dreamy crater please head to -> #1397655624103493813 to learn how to prompt the bot and in what channes it is available for use
@last storm please review the info on how to use the Arena bot and in what channels by going to -> #1397655624103493813
looking for feedback on what the community considers a good AI model with so many to choose from. Saves time.
i need a vid presenting future tense with will and going to
Grok 1
nano
Sry for the delay - it's the first two votes that are counted on the leaderboard, which is prior to the names of the models being revealed. After that, the votes don't contribute. There is a way to filter the leaderboards by votes only from the person that generated it.
It takes a bit to gather votes.
Would encourage others to votes on their preferences in the video arena channels!
@lilac swallow Please check #1397655624103493813 to learn how to use the bot.
If only the creator votes (or only someone else and not creator) and not a second person does it still get collected and used as data?
How to create videos?
Go to #1397655624103493813 and learn how to use the bot to create videos!
Go to #1397655624103493813 and learn how to use the bot to create videos!
this channel for some reason
LMAO
Yeah
@grizzled dust Please head to #1397655624103493813 for a detailed guide on how to use the bot
Please go to https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate 🙂
How to generate 3D video?
IIRC first and second votes are what generates the leaderboards regardless if the creator votes or not. Will double check and keep you updated.
Confirmed if the prompter doesn't vote, that doesn't mean the first 2 votes don't contribute to the leaderboards.

oops I think we miscommunicated here. The first 2 votes would still contribute to the overall leaderboard. But the author vote will not contribute to the "author vote" category leaderboard like it usually would.
GLM 4.6 is on Web Dev leaderboard on lmarena.ai but not on web.lmarena.ai?
From Web Dev category on lmarena.ai
I think it's interesting how the order is so different from the coding category on the text leaderboard
@drowsy needle
well in the webdev arena one expects a visible output (a sort of webpage). The user doesn't look at the code (at least I think)
in text arena one looks at the code.
the evaluation (in terms of what is checked) is different
Yeah there's that aspect of it, but I also think the coding subcategory is kinda unreliable.
It was interesting how filtering out battles where a model faced Sonnet < 15 times changed the rankings so much (Sonnet 4 from #10 to #4, all without style control).
R1 seemed strong when it was released, not sure about now.
This is just Web Dev Arena like Pier said, so actual agentic coding performance might differ.
yes, hence I asked you to do that as you are quick.
The point of the "unreliability" is simply that ratings have a lot of nuances. With the filtering we are checking the results within models that matched each other, rather than considering other models as well. It is a property of the ratings.
Otherwise one should generate other ratings (with filtering). Hopefully one can do that once we get all the results of the battles.
the best IMO would be like having a sort of meta benchmark, with many different (somewhat reliable) benchmarks taken in account.
Some did those, but they don't get updated that often.
Thanks for the insight, I wasn't quite sure what to make of it.
LiveBench agentic coding scores got updated (I think they increased the number of maximum turns).
5 Pro does badly, weird
LOL at gpt 5 pro being worse than gpt 5 medium high and.... mini high
And sonnet 4.5 non thinking
I think the number one indication of unreliability is gemini winning anything rn
Lol
Nvm that's old
that link is a meta leaderboard from May 2025. It is not new. If you meant that gemini 2.5 Pro wins because in that link it wins
hence my point "I wish such meta leaderboards would be done more often"
Yeah nvm didn't look closely enough
sora 2 doesn't beat veo 3?
thats what i was saying
thats unbelievable to me
Veo 3 is kinda nice
Hi
Hello! @hexed patio Please check #1397655624103493813 to learn how to generate videos.
hello
Hmm fried chicken yess
nice one
gemini 2.5
hey, how can find my videos??
Check the Bot's messages @candid hill
@lofty solstice Note that this is an English only server.
hi
Hey guys 👋 , I just want to know how often the text leaderboard updates. I see that the latest update was on 8 Oct. If Google releases a new model in late October, will this be reflected on the leaderboard soon?
It takes around a week normally for an update to happen.
thx! I tried to ask AI but got no reliable answers 👀
so can I expect an update in one or two days?
It's possible! Bit hard to say specifically as it takes time for us to collect votes and validate before posting an update, so it's a bit TBD.
"If Google releases a new model in late October, will this be reflected on the leaderboard soon?"
he's buying on polymarket
😭
Saw this paper a couple days ago: https://arxiv.org/pdf/2510.02306
It makes use of lmarena datasets (text, search, vision) and compare several rating systems. Elo, Glicko, TrueSkill, and an online variant of Bradley-Terry.
They claim that by just ignoring data from ties you get better predictive accuracy (even on datasets including ties) It's cool they released code too.
I have some issues with it though, mainly that it only uses dynamic rating systems rather than true Bradley-Terry. In my experience the dynamic rating systems are much more noisy and have recency bias on arena data. Also in my thinking Elo and online Bradley-Terry should be the same thing. 🤔
Any thought it was interesting, wonder if anyone else has seen this or has done experiments specifically about ties.
Hello
can you give me invitation code for sora 2 plese give me
this is what I meant with "the community would do such analysis if this https://discord.com/channels/1340554757349179412/1372537524551159913 would happen".
Btw if I am not wrong lmarena has already a category where ties are removed.
Also cthorrez seems the same account of @jaunty cave . Is not the same user behind those accounts?
I think ties are actually important. In my simulations for other use cases (chess , videogames) ties really helps stabilize the values.
Sometimes the compress ratings differences too much though, so one could also play around with weights for ties (for example in some videogames we decided with the developer to weight the result of draws by 0.5). But the more one plays around the less consensus the model has (that is, it becomes more arbitrary)
hello, our team train a language model, we want to sumbit on the leardbord, could pls tell me how to sumbimit the modle, thanks.
I guess #1372229840131985540 could help
So true
Whats the evidence of that?
I think the best of all worlds is just where they ignore ties
@brave pier Please head to #1397655624103493813 for a detailed guide on how to use the bot
how to use the bot
Please head to #1397655624103493813 for a detailed guide on how to use the bot
evidence for what in particular?
ties are important IMO. it could be like an extra category. For example "exclude ties" could be default and "include ties" could be extra
Clayton
Yah that works
I see. I may have been mixing people but Clayton should have the surname (at least what I saw online) similar to the other user. I can be wrong though.
Ok
Can confirm cthorrez is clayton thorrez https://cthorrez.github.io/
very nice
Oh yeah I'm Clayton. Just this account is my personal account I had logged in on my phone haha. The other is my work account. I need to set up the auth to use the work account on my phone. 😅
So true
so I wasn't wrong. Good to know.
😄


they said claude 4 sonnet performance and it seems like they didn't lie about that
Lmao
It's really pretty cool when LMArena results back up waht the model providers are saying. Especially in this case they released it yesterday, and by today we have >2k real live diverse human votes on it and it shows it's super close to 4-sonnet
it's like "ok let me do a quick vibe check with thousands of people all over the world doing different things"
It placed #46/#65 on my own leaderboard, #55 in Chess. Overall pretty consistent for a mini/flash type model.
Don't doubt it achieved Sonnet 4 level scores on benchmarks optimized for (Swebench etc.), but that doesn't universally translate to real world usage. Sonnet is smarter with more world knowledge (which is to be expected for larger models).
when will the copilot leaderboard get updated??
patience
I am not sure that gets updated anymore. It requires votes in the VS code extension and I don't know if there are enough.
there are other coding LB and one can get an idea how models perform in all those
What did sonnet 4 get?
very high, #5 on main. and #47 on chess (anthropic isn't great at chess in general).
.... I publish everything, see profile.
@vital osprey Please, read our guide https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
@jaunty cave @drowsy needle hi something goes wrong, 2.5 pro disappered from the leaderboard
Sorry about that! It is fixed now!
Hello
hi agine
@shy thorn you might check #1397655624103493813 to learn how to use the bot.
Hello! @sly geyser please check #1397655624103493813 to learn how to use the bot.
Hi there. Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content
HmHmmm
@vital pelican Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content 🙂
Hello.I am here to test & compare different AI models.
@marble moth Please, read our guide in how-to-video-bot to learn how to properly prompt the bot.
hello community, im here to explore the world of AI
Please head to #1397655624103493813 for a detailed guide on how to use the bot
Make it 9:16 size.
no
I can certainly help you with that. While the current application is designed for analyzing existing videos, I can upgrade it to generate new videos from a text description using the Gemini API. This will allow you to create the video you described.
Here is a summary of the changes I'll make:
Video Generation Mode: I'll introduce a new "Generate Video" mode. On launch, you'll be able to choose between analyzing an existing video or generating a new one.
Prompt-based Creation: In the new mode, you'll find a text area where you can describe the video you want. I've pre-filled it with your request for a video about world leaders.
API Key Selection: Video generation with the Veo model requires an API key with billing enabled. I'll add a button that prompts you to select your API key before you can generate a video.
Real-time Feedback: Video generation can take a few minutes. I'll display a status screen with informative messages to keep you updated on the progress.
Seamless Integration: Once your video is ready, it will automatically load into the player, and you'll be able to use all the existing analysis tools on it.
Here are the updated files for the application:
soooo truee
Sora 2 cod
A short, hyper-realistic video clip in 8K. Scene: A beautiful woman in bed at night.
The video starts with a close-up on her face as she tosses her head gently on the pillow, her eyes closed but her expression troubled.
Her eyes snap open. She looks towards something off-screen (like an alarm clock), and a wave of deep sadness and frustration washes over her face.
She then turns her head to stare out a window as a single, silent tear slowly rolls down her cheek.
Cinematic, dramatic low-key lighting from moonlight. The mood is somber, quiet, and deeply emotional.
A realistic, short video clip in a vertical aspect ratio (9:16). A beautiful, sad woman is in bed, tossing and turning in slow motion. The camera is a close-up on her face, capturing her pained and troubled expression. She briefly opens her weary eyes, then closes them again with a deep, sorrowful sigh. The scene is lit by cool, blue moonlight. Highly detailed, photorealistic, 4K. Focus is on the raw emotion of sleeplessness.
Please head to #1397655624103493813 for a detailed guide on how to use the bot
Complete Profile info
Is there gemini 3 in lm arena?
don't think so
video area 1 is fire
true
@sharp fiber Please, read our guide in https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
@kindred bane Please head to #1397655624103493813 for a detailed guide on how to use the bot
im here to explore the world of AI
hello
@fiery wagon Please head to #1397655624103493813 for a detailed guide on how to use the bot
SO TRUE
a lot of people were saying sora 2 was better than veo 3.1 so that's a surpise
Yeah this was an interesting one to watch for sure
sora-2 overall still performing really well
were you perhaps trying to remove 2.5 pro because there would be another model to replace it..
buy more z ai 
@opaque vapor Please head to #1397655624103493813 for a detailed guide on how to use the bot
HI THEAR
hello
yup
Hi
@woven pewter @devout sorrel @quick pivot @abstract bane Please check on #1397655624103493813 to learn how to use the bot properly.
@quick pivot Hey! We ask you to use English only in this server, thank you.
I read everything in the assistant bot and did everything, but now I don't know where to get the results. Please tell me.
For your creations make sure you go to #video-arena-1 #video-arena-2 #video-arena-3
I went through and entered the Promptravila. Where can I see what was generated based on it?
@torn cove Please head to #1397655624103493813 for a detailed guide on how to use the bot
Hi
wow
hi
Rehan
hi
hi
Hi
@lean spear Please, read our guide in https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
THANKS
@potent island Please, read our guide in https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.
hello how do i generate an ai video
Why is there no sound when generating a video?