#leaderboards | Arena | Page 3

silk holly Aug 31, 2025, 1:21 AM

#

trying to generate a video...... can I proceed?

drowsy needle Aug 31, 2025, 1:23 AM

#

silk holly trying to generate a video...... can I proceed?

You can learn more here: #1397655624103493813

simple carbon Aug 31, 2025, 3:28 AM

#

Please include Pixverse 5.0. It is a very strong model. It is better than Hailuo in Artificial Analysis

quaint zenith Aug 31, 2025, 9:49 AM

#

hello

past inlet Aug 31, 2025, 10:01 AM

#

thank bro

marble bluff Aug 31, 2025, 1:13 PM

#

Hello! Please check #1397655624103493813 to learn how to use the bot and generate videos in #video-arena-1 #video-arena-2 #video-arena-3

tough sand Aug 31, 2025, 2:40 PM

#

Nice and Amazing... I can learn More Ideas... Thanks!

cinder prawn Aug 31, 2025, 8:05 PM

#

jo

prisma smelt Aug 31, 2025, 11:43 PM

#

go go

white token Sep 1, 2025, 1:36 AM

#

niece

west pecan Sep 1, 2025, 7:27 AM

#

/video

queen owl Sep 1, 2025, 7:36 AM

#

/image

drowsy needle Sep 1, 2025, 12:37 PM

#

@west pecan @queen owl you both are looking for Video Arena, read more on how to use here: #1397655624103493813

amber comet Sep 1, 2025, 3:52 PM

#

Hey @solid cipher please check this channel for instructions to generate videos. https://discord.com/channels/1340554757349179412/1397655624103493813

stoic nymph Sep 1, 2025, 6:02 PM

#

this can't be true, is oss 120b on LMarena?

median briar Sep 1, 2025, 6:06 PM

#

stoic nymph this can't be true, is oss 120b on LMarena?

LiveCodeBench is notoriously bad and has been detached from reality for a long time

stoic nymph Sep 1, 2025, 6:10 PM

#

median briar LiveCodeBench is notoriously bad and has been detached from reality for a long t...

Which one would you say is the most reliable benchmark to refer to

#

?

median briar Sep 1, 2025, 6:19 PM

#

stoic nymph Which one would you say is the most reliable benchmark to refer to

the tried and tested one: **swe-bench** (smaller agentic coding tasks)
the new one (that is less contaminated or optimised for): https://brokk.ai/power-ranking (very large agentic coding tasks)

for **tool use** specifically: terminal bench

for front-end: webdev arena

for machine learning specifically: https://htihle.github.io/weirdml.html (ai models train a machine learning model, score = accuracy)

stoic nymph Sep 1, 2025, 6:20 PM

#

median briar the tried and tested one: **swe-bench** (smaller agentic coding tasks) the new one...

wow thank you, this is super insightful. THANKS!1

median briar Sep 1, 2025, 6:21 PM

#

np :)

#

btw you can find a lot of them here: https://epoch.ai/benchmarks

stoic nymph Sep 1, 2025, 6:23 PM

#

WILL CHECK IT OUT!

bronze hollow Sep 2, 2025, 3:44 PM

#

Hehe

thorny mauve Sep 2, 2025, 6:37 PM

#

Does LMArena have a leaderboard for controlling mobile phones via agents, like this one by Google Research: https://google-research.github.io/android_world/ ?
Or is there a plan for one in the near future?

AndroidWorld

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

jaunty cave Sep 2, 2025, 11:03 PM

#

thorny mauve Does LMArena have a leaderboard for controlling mobile phones via agents, like th...

no we don't have something like that at this point

humble seal Sep 2, 2025, 11:04 PM

#

Hello

jaunty cave Sep 2, 2025, 11:04 PM

#

humble seal Hello

Hi, what are your thoughts on the leaderboard? https://lmarena.ai/leaderboard

river fjordBOT Sep 3, 2025, 2:57 PM

#

<:warning:892823499205406760> Channel locked

Site outage, will turn back on when resolved.

drowsy needle Sep 3, 2025, 2:59 PM

#

Site Outage - Hey everyone, there looks to be an outage with the site, our team is aware and working on a fix ASAP. We've turned off messaging in this server until the site is restored. Our apologies for the inconvenience!

river fjordBOT Sep 3, 2025, 4:01 PM

#

<:success:865860339278413864> Channel unlocked

Welcome back :ablobwave:

drowsy needle Sep 3, 2025, 8:41 PM

#

Check out #1397655624103493813 for information on how to use the bot @signal idol

quartz shuttle Sep 4, 2025, 3:25 AM

#

Hi community! I'm just starting to explore the project's data and have found that the .pkl files don't seem to have a consistent structure. I was wondering if anyone knows why this might be the case. Perhaps it's due to different versions or data sources? Any insights on the origin of these files would be greatly appreciated.

jaunty cave Sep 4, 2025, 3:43 AM

#

Hi @quartz shuttle, I'm afraid that's just an artifact of sharing our data as we use it. We are always working to improve our data pipelines, and that sometimes includes changing how we structure and store the data. Since we have been releasing data for almost 2 years now, this adds up to a number of changes. If you have specific questions about the formats, I'm happy to help if I can.
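
For anyone running into the same inconsistency, here is a minimal sketch of how one might survey the structure of several pickle files before writing a loader. The `describe` and `survey` helpers are hypothetical and not part of any LMArena tooling, and nothing here assumes what the released files actually contain. Two general cautions apply: `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source, and any third-party classes stored in a file (pandas objects, for example) must be importable when loading.

```python
import pickle
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Return a short, nested description of an unpickled object's structure."""
    kind = type(obj).__name__
    if depth >= max_depth:
        return kind
    if isinstance(obj, dict):
        # Only look at the first few keys to keep the summary readable.
        keys = list(obj)[:5]
        inner = {str(k): describe(obj[k], depth + 1, max_depth) for k in keys}
        return f"dict({len(obj)} keys) {inner}"
    if isinstance(obj, (list, tuple)):
        first = describe(obj[0], depth + 1, max_depth) if obj else "empty"
        return f"{kind}({len(obj)} items) of {first}"
    return kind


def survey(folder):
    """Map each .pkl file in `folder` to a description of its top-level structure."""
    report = {}
    for path in sorted(Path(folder).glob("*.pkl")):
        with path.open("rb") as fh:
            # pickle can execute arbitrary code: only load trusted files.
            report[path.name] = describe(pickle.load(fh))
    return report
```

Calling `survey("path/to/downloaded/files")` (a placeholder path) and comparing the descriptions side by side should make it easier to see which files share a layout and which come from a different stage of the pipeline.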

naive prairie Sep 4, 2025, 10:01 AM

#

Guys, is anyone looking to collaborate on integrating LM Arena into our AIvsAI platform for games?

We let devs build AI agents for classic fighting games, and provide leaderboards benchmarking scores and achievements

We want to expand our LLM module to see how LLMs perform in gaming, as a benchmark for LM Arena as well

twin valve Sep 4, 2025, 10:06 AM

#

https://openlm.ai/chatbot-arena/

OpenLM.ai

Chatbot Arena +

This leaderboard is based on the following benchmarks.
Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 4M+ user votes to compute Elo ratings. AAII - Artificial Analysis Intelligence Index aggregating 8 challenging evaluations. ARC-AGI - Artificial General Intelligence benchmark v2 to measure fl...

clear elm Sep 4, 2025, 4:29 PM

#

twin valve https://openlm.ai/chatbot-arena/

wild pfp

tender sigil Sep 4, 2025, 4:29 PM

#

Leaderboard update anytime soon? it’s been a full week now since the last one

#

understandable delay with no new models released, but still

tender sigil Sep 4, 2025, 4:31 PM

#

clear elm wild pfp

bruh it’s just the EU 😂

drowsy needle Sep 4, 2025, 4:33 PM

#

You'll want to read #1397655624103493813 for more information on how to use the Video Arena bot.

twin valve Sep 4, 2025, 4:37 PM

#

clear elm wild pfp

agreed

jaunty cave Sep 4, 2025, 8:28 PM

#

tender sigil Leaderboard update anytime soon? it’s been a full week now since the last one

we updated vision on Tuesday https://lmarena.ai/leaderboard/vision
It was accompanied by a slight change in methodology in order to keep data quality high
https://news.lmarena.ai/leaderboard-changelog/

LMArena Blog

Leaderboard Changelog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

September 2, 2025
Due to the increase in image generation traffic brought by nano-banana, we noticed there were prompts in our

elder willow Sep 5, 2025, 1:59 AM

#

Up

pulsar wave Sep 5, 2025, 9:18 AM

#

/video

drowsy needle Sep 5, 2025, 9:21 AM

#

pulsar wave /video

The bot isn't working. Check out #1397655624103493813 to understand how to use the bot properly

west pecan Sep 5, 2025, 9:31 AM

#

hello

#

it's giving a message that I do not have permission in this channel

grizzled turret Sep 5, 2025, 10:13 AM

#

up

upbeat lichen Sep 5, 2025, 11:54 AM

#

hallo

orchid herald Sep 5, 2025, 12:24 PM

#

Why is the Leaderboard info dated the 28th of August? Isn't it being updated? Thank you!

severe cedar Sep 5, 2025, 1:51 PM

#

up

upper carbon Sep 5, 2025, 1:57 PM

#

hey fellas

tender sigil Sep 5, 2025, 5:00 PM

#

jaunty cave we updated vision on Tuesday https://lmarena.ai/leaderboard/vision It was accomp...

Coolio!

tender sigil Sep 5, 2025, 5:01 PM

#

orchid herald Why is the Leaderboard info dated the 28th of August? Isn't it being updated? ...

sometimes it goes longer without updates, 8-10 days is normally the max that the text leaderboard will go without updating, so we’ll likely see one sometime soon 🙂

soft fjord Sep 5, 2025, 5:02 PM

#

/video

loud spear Sep 5, 2025, 5:07 PM

#

hi all

brave gazelle Sep 5, 2025, 5:40 PM

#

hi

drowsy needle Sep 5, 2025, 5:55 PM

#

soft fjord /video

Our Video Arena bot isn't working at the moment. Check out #1397655624103493813 for more information on how to use the bot (when it's working properly)

drowsy needle Sep 5, 2025, 6:38 PM

#

Our Video Arena bot isn't working at the moment. Check out #1397655624103493813 for more information on how to use the bot (when it's working properly)

drowsy needle Sep 5, 2025, 8:19 PM

#

You'll want to read the instructions in #1397655624103493813 for more information on the bot.

astral leaf Sep 6, 2025, 12:01 AM

#

hello

slate holly Sep 6, 2025, 7:15 AM

#

hello

grizzled oak Sep 6, 2025, 8:07 AM

#

hi

livid remnant Sep 6, 2025, 8:28 AM

#

How do I upload an image to edit with a prompt?

next onyx Sep 6, 2025, 8:52 AM

#

Hi

humble magnet Sep 6, 2025, 11:55 AM

#

Hello

tropic pivot Sep 6, 2025, 1:32 PM

#

livid remnant How do I upload an image to edit with a prompt?

Hi! Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to do that on our video arena channels 🙂

tropic pivot Sep 6, 2025, 1:58 PM

#

Please check this channel to learn how to generate on our video arena channels 😉 https://discordapp.com/channels/1340554757349179412/1397655624103493813

tender sigil Sep 6, 2025, 9:27 PM

#

no text leaderboard update in 9 days 😢

#

will probably mean a significant leaderboard adjustment when it does update tho! excited to see where new Qwen places 😳

orchid pagoda Sep 6, 2025, 10:27 PM

#

hello

reef frigate Sep 7, 2025, 3:58 AM

#

helo

leaden flint Sep 7, 2025, 5:44 AM

#

hi

twin valve Sep 7, 2025, 10:14 AM

#

the stream of "hello" continues! Hello to all!

#

btw the fact that image leaderboards have a density of votes (No.votes / No. Models) so high compared to the text arena, just shows that images are easier to digest for votes (and have higher demand)

robust zodiac Sep 7, 2025, 11:34 AM

#

twin valve btw the fact that image leaderboards have a density of votes (No.votes / No. Mod...

Pretty sure there’s also some abuse going on with people doing some automated prompts for various needs

twin valve Sep 7, 2025, 1:38 PM

#

sure. Small businesses trying to make flyers and whatnot, but still.

drowsy needle Sep 7, 2025, 6:22 PM

#

@molten dome Please read #1397655624103493813 for information on how to use the bot.

drowsy needle Sep 8, 2025, 5:44 PM

#

@grim barn be sure to read #1397655624103493813

jaunty cave Sep 8, 2025, 6:31 PM

#

twin valve sure. Small businesses trying to make flyers and whatnot, but still.

that's actually like super amazing data and feedback, people using for business cases is very high signal, since we know they really care about which result is better

jaunty cave Sep 8, 2025, 6:35 PM

#

twin valve btw the fact that image leaderboards have a density of votes (No.votes / No. Mod...

ratio of votes to models is so interesting to watch. The extreme differences make me think we ought to be investigating entirely different methodologies for them

jaunty cave Sep 8, 2025, 6:41 PM

#

tender sigil no text leaderboard update in 9 days 😢

thank you for your patience, new update today 🙂

west pecan Sep 8, 2025, 7:18 PM

#

Hi, do we have option to select starting frame and ending frame? That would be awesome if you add this

lime cosmos Sep 8, 2025, 8:36 PM

#

Where can I request that the censorship applied to texts be lowered a little? It no longer allows me to make a correction when I paste the text.

sullen yacht Sep 9, 2025, 12:13 AM

#

AWESOME

drowsy needle Sep 9, 2025, 12:15 AM

#

lime cosmos Where can I request that the censorship applied to texts be lowered a little? It...

Sorry to say there isn't a way to get around the content filters; however, if you believe there are some false positives I'd encourage you to share them in this forum post #1376956905016004759

jaunty cave Sep 9, 2025, 1:31 AM

#

sullen yacht AWESOME

hell yea

spare raft Sep 9, 2025, 7:05 AM

#

hi where do i find the leaderboards

neat pond Sep 9, 2025, 9:49 AM

#

hello

drowsy needle Sep 9, 2025, 3:43 PM

#

spare raft hi where do i find the leaderboards

You can find them here - https://lmarena.ai/leaderboard

spare raft Sep 9, 2025, 3:45 PM

#

Thanks

tender sigil Sep 9, 2025, 4:38 PM

#

jaunty cave thank you for your patience, new update today 🙂

Qwen3-Max !!!

tender sigil Sep 9, 2025, 4:40 PM

#

twin valve btw the fact that image leaderboards have a density of votes (No.votes / No. Mod...

I mean, there’s less than half as many votes in the image arena (1.7m vs 4m) as there is text, also split between 239 text models vs. 21 image models

#

I think that’s moreso a testament to the difficulties of creating/lower priority towards image models among AI companies that aren’t currently leading the field, likely due to level of expense and infrastructure needed to do training runs for image generation models compared to text

#

more structural incentives leading to monopolization of AI image generation compared to text, but it is also true that the text arena has been running since the beginning of LMArena and Image Arena is a newer creation

jaunty cave Sep 9, 2025, 5:09 PM

#

That's a good point about the time they've been running, but look at image-edit arena. 6 million votes on 10 models 🙂
It's been around even less time than text to image but has overtaken text arena in votes. Nano-Banana was quite an event
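
To put the "density of votes" comparison in numbers, here is a small worked example using only the approximate figures quoted in this thread (4m votes over 239 text models, 1.7m over 21 image models, 6m over 10 image-edit models). These are rounded chat figures, not official statistics, and the simple votes / models ratio is the one used in the discussion above; the arena labels and the `votes_per_model` helper are just names chosen for this sketch.

```python
# Approximate figures quoted in this thread (not official statistics).
arenas = {
    "text": (4_000_000, 239),
    "text-to-image": (1_700_000, 21),
    "image-edit": (6_000_000, 10),
}


def votes_per_model(votes, models):
    """Average number of votes per model in an arena."""
    return votes / models


density = {name: votes_per_model(v, m) for name, (v, m) in arenas.items()}

# Print arenas from lowest to highest density.
for name, value in sorted(density.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: {value:>9,.0f} votes per model")
```

On these figures the ratio comes out at roughly 17 thousand votes per model for text, 81 thousand for text-to-image, and 600 thousand for image edit, so the gap between arenas is well over an order of magnitude.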

twin valve Sep 9, 2025, 8:34 PM

#

tender sigil I mean, there’s less than half as many votes in the image arena (1.7m vs 4m) as ...

that is discounting how the image arena exploded. That 1.7m votes was accumulated real quick compared to the text arena. If the text arena had the same progress it would have like 10m or more.

But sure the barrier of entry is surely not trivial.

short sparrow Sep 9, 2025, 9:00 PM

#

For search, which model is really the best? o3 is in the lead, but it's not showing results as expected

tender sigil Sep 9, 2025, 10:18 PM

#

jaunty cave That's a good point about the time they've been running, but look at image-edit ...

true! Nano Banana was easily the biggest marketing push in the history of LMArena

fresh marsh Sep 10, 2025, 5:01 AM

#

short sparrow For search, which model is really the best? o3 is in the lead, but it's not showing results as...

I have no idea how o3 is considered better than Gemini for search, I find it noticeably worse

short sparrow Sep 10, 2025, 9:22 AM

#

fresh marsh I have no idea how o3 is considered better than Gemini for search, I find it not...

There are a lot of parameters for evaluating it, but o3 still doesn't accurately give me what I need. I think gemini and gpt 5 are better

visual dock Sep 10, 2025, 9:24 AM

#

3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"

#

/3D cartoon style, Princess Nadine with Yusuf by her side, while Laila stands slightly behind them watching with concern, background of ancient temple at night, torches glowing, cinematic dramatic lighting, high quality"

fresh marsh Sep 10, 2025, 11:05 AM

#

short sparrow There are a lot of parameters for evaluating it, but o3 still doesn't accurat...

I find the formatting of o3 so bad ngl

#

For medical stuff at least

fading mason Sep 10, 2025, 11:30 AM

#

Awesome 👍

rare narwhal Sep 10, 2025, 11:39 AM

#

Hello, can somebody point me to the best place to get the newest leaderboard data? It looks like the text arena scores have been updated 2 days ago but the latest file on Hugging Face is still August 29. The new web omits a lot of the underlying information unfortunately.

drowsy needle Sep 10, 2025, 6:38 PM

#

rare narwhal Hello, can somebody point me to the best place to get the newest leaderboard dat...

What we have posted to leaderboards/hugging face is going to be the most up to date data we have released.

rare narwhal Sep 10, 2025, 6:39 PM

#

Many thanks, helpful! Are there any plans to make the data that was previously shown directly in the web interface available again? I think the category scores are now hidden and only the ranks are shown (e.g., instruction following).

drowsy needle Sep 10, 2025, 6:47 PM

#

rare narwhal Many thanks, helpful! Are there any plans to make the data that was previously sh...

It's possible! We do want to enhance our leaderboards and are currently working on a future update. The best way to tell us what specifically you're looking for is to share it in the #1372230675914031105 forum channel. Posting the specifics there would be a big help!

rare narwhal Sep 10, 2025, 6:48 PM

#

drowsy needle It's possible! We do want to enhance our leaderboards and are currently working ...

Greatly appreciate the reply, thank you! Will post in feedback once I have some time!

drowsy needle Sep 10, 2025, 6:49 PM

#

rare narwhal Greatly appreciate the reply, thank you! Will post in feedback once I have some ...

No problem! Don't hesitate to ping me if you ever have any questions or run into problems on the site.

dusty cipher Sep 11, 2025, 5:40 AM

#

👍

steep temple Sep 11, 2025, 3:32 PM

#

💐

drowsy needle Sep 11, 2025, 3:57 PM

#

@undone sable Be sure to check out #1397655624103493813 for more information on how to properly use Video Arena

short sparrow Sep 11, 2025, 4:52 PM

#

fresh marsh I find the formatting of o3 so bad ngl

which search model do you personally think is good?

fresh marsh Sep 11, 2025, 5:30 PM

#

short sparrow which search model do you personally think is good?

Gpt 5 pro and Gemini 2.5 pro seem the best two to me. Gpt 5 high, thinking on the pro subscription, and o3 give me noticeably worse results. I think gpt5 pro is noticeably better than Gemini 2.5 but it's not exactly a fair comparison

#

Claude opus CONSISTENTLY gives me false information that would literally kill patients if I listened to it

#

So I now don't trust it for anything at all

#

Context is I'm a resident and I use them for ideas on differentials and management for the most complex patients I get

burnt axle Sep 11, 2025, 7:00 PM

#

@jaunty cave hi

jaunty cave Sep 11, 2025, 7:18 PM

#

burnt axle @jaunty cave hi

sup

burnt axle Sep 11, 2025, 7:18 PM

#

How to generate veo

drowsy needle Sep 11, 2025, 7:24 PM

#

burnt axle How to generate veo

Be sure to check out #1397655624103493813 for more information on how to use the bot. It's important to keep in mind that you're unable to select which specific model you want, as it's random which model provider you get for your prompt.

tender sigil Sep 11, 2025, 8:27 PM

#

Polymarket is now trading a market for the #2 ranked model at the end of the month https://polymarket.com/event/which-company-has-second-best-ai-model-end-of-september/will-alibaba-have-the-second-best-ai-model-on-september-30

Polymarket

Which company has second best AI model end of September?

Polymarket | This market will resolve according to the company that owns the model with the second-highest arena score based on the Chatbot Arena LLM Leaderb...

#

definitely will be more interesting to watch considering Gemini’s unbreakable lead in 1st for the last almost 6 months now

fresh marsh Sep 11, 2025, 9:47 PM

#

tender sigil Polymarket is now trading a market for the #2 ranked model at the end of the mon...

i dont get it

#

they consider #1 gemini?

#

seems unclear to me

#

nvm, their other listings

ivory peak Sep 12, 2025, 1:47 AM

#

Hello

#

/video

tender sigil Sep 12, 2025, 1:49 AM

#

fresh marsh they consider #1 gemini?

yes - based off of arena score with style control removed

#

Gemini 2.5 Pro holds a strong lead over second place Qwen3 Max there

fresh marsh Sep 12, 2025, 2:30 AM

#

why is the gemini 2.5 pro on lmarena so different from my 2.5 pro in my browser?

#

it gives such better answers

jaunty cave Sep 12, 2025, 4:29 AM

#

That's super interesting Kami, what sort of ways is it different? Do you have any examples of the same prompt and what you like about one response more than the other?

urban sparrow Sep 12, 2025, 12:18 PM

#

interesting

sterile linden Sep 12, 2025, 1:48 PM

#

hi

twin valve Sep 12, 2025, 3:57 PM

#

has ktibow left the server?

Furthermore <@&1349916362595635286> , if #ai-creations is for creations with AI, what about community ones (not necessarily AI based?)

#

I wanted to ask them if LMB is back, it seems back.

drowsy needle Sep 12, 2025, 4:00 PM

#

twin valve has ktibow left the server? Furthermore <@&1349916362595635286> , if <#13447332...

Hey - yeah KTibow unfortunately decided to leave the server. They played an important part in this community and will be very much missed.

> #ai-creations what about community ones (not necessarily AI based?)
I'd prefer this remain just AI based/related for now.

> LMB
What does this stand for? Sorry I'm still waking up.

strong knot Sep 12, 2025, 4:11 PM

#

It's quite saddening. But, if you want a similar discord to the old lmsys, openrouter is an option. Ktibow is also there. Idk how active though compared to here.

candid gorge Sep 12, 2025, 5:16 PM

#

So isn't it possible to create unlimited videos?

median briar Sep 12, 2025, 5:46 PM

#

drowsy needle Hey - yeah KTibow unfortunately decided to leave the server. They played an impo...

LMB is something ktibow built:

https://ktibow.github.io/lmb/

#

Really nice interface with a lot of new features (for the leaderboard)

brittle pine Sep 12, 2025, 9:01 PM

#

fresh marsh why is the gemini 2.5 pro on lmarena so different from my 2.5 pro in my browser?

this has been a well-known thing for a long time. On the AI Studio website, you will also get better and more detailed outputs

#

The answer is simply: the web/app versions have extra safety filters + an unnecessary system prompt

#

in ai studio you can get the same raw output as lmarena

#

Also, in the api you can turn off those filters and avoid that system prompt, but the web app will continue to be worse

#

The thing is, gemini is installed by default on sooooo many android phones, like "billions" of android phones, so they feel like they can't afford to make any mistake, which is why they keep security higher even if it means worse outputs

fresh marsh Sep 12, 2025, 9:25 PM

#

brittle pine this has been a well-known thing for a long time. On the AI Studio website, you will also ge...

why do i bother paying then?

#

paying for pro doesnt give you better rate limits in studio right?

brittle pine Sep 12, 2025, 9:39 PM

#

fresh marsh paying for pro doesnt give you better rate limits in studio right?

no, pro membership doesn't affect ai studio, and ai studio already offers a very high limit for free

#

i don't understand it either

#

However, ai studio's generous limits won't last forever, i think when gemini 3.0 releases the limits are gonna be lower

#

Also you can't get veo 3 and deep research in ai studio, only in the app

twin valve Sep 12, 2025, 11:16 PM

#

drowsy needle Hey - yeah KTibow unfortunately decided to leave the server. They played an impo...

LMB is a creation of KTIbow.

drowsy needle Sep 12, 2025, 11:18 PM

#

twin valve LMB is a creation of KTIbow.

Thanks for the clarification! I wouldn't be able to answer that question for you. I'm assuming if you reach out to them (DM or Friend request) they'd be able to answer.

twin valve Sep 12, 2025, 11:19 PM

#

yeah I'll try to see if I can reach him via github, if not, well it is not the end of the world. Thank you!

fresh marsh Sep 12, 2025, 11:23 PM

#

twin valve LMB is a creation of KTIbow.

what is it

drowsy needle Sep 12, 2025, 11:23 PM

#

twin valve yeah I'll try to see if I can reach him via github, if not, well it is not the e...

Keep me updated if you're unable to. I'm still in contact, so I can nudge if it'll be helpful.

fresh marsh Sep 12, 2025, 11:24 PM

#

drowsy needle Thanks for the clarification! I wouldn't be able to answer that question for you...

does lmarena use a specific system prompt or instructions different from what users get default in a respective LLMs web interface? gemini 2.5 in lmarena gives me far better and different responses with identical prompt vs both ai studio and the gemini web app. same with claude opus 4.1 thinking

#

makes it feel stupid to pay for both of them when the free lmarena gives me very superior answers

fresh marsh Sep 12, 2025, 11:27 PM

#

fresh marsh does lmarena use a specific system prompt or instructions different from what us...

Not in anonymous battle mode. Im talking about side by side mode

jaunty cave Sep 12, 2025, 11:37 PM

#

fresh marsh Not in anonymous battle mode. Im talking about side by side mode

it's not different in battle vs SBS. And if the creator gives us a system prompt, we use it, otherwise we don't. We don't want to insert any of our own biases by designing our own system prompt

brittle pine Sep 13, 2025, 12:05 AM

#

i dont think there is any system prompt in lmarena

#

Must be placebo

#

Or maybe lmarena's model version is older and maybe older version works better but im not sure about that

fresh marsh Sep 13, 2025, 12:08 AM

#

brittle pine Must be placebo

.

fresh marsh Sep 13, 2025, 12:08 AM

#

jaunty cave it's not different in battle vs SBS. And if the creator gives us a system prompt...

absolutely not placebo. ill upload a practice case from boards review questions one second

#

FireShot_Capture_001_-_Google_Gemini_-_gemini.google.com.png

FireShot_Capture_002_-_LMArena_-_lmarena.ai.png

#

i dont even have to tell you which is which

#

same exact prompt copy pasted

#

@brittle pine

#

gemini 2.5 pro and gemini 2.5 pro on lmarena

jaunty cave Sep 13, 2025, 12:35 AM

#

Very interesting. I'm no medical expert, is your feedback that the answer from the model on lmarena is more comprehensive, detailed, accurate? Some other qualities?

brittle pine Sep 13, 2025, 12:38 AM

#

fresh marsh

But you're using the app

#

Can you compare ai studio and lmarena

#

(use 32k tokens and disable google search pls)

#

maybe in ai studio google search is on, and that could be causing worse outputs

#

turn off search

#

And use 32k thinking tokens, i think it will be the same as lmarena

#

Because both are using the API. Both lmarena and ai studio use the API

#

Only the web/mobile app is different

fresh marsh Sep 13, 2025, 3:00 AM

#

jaunty cave Very interesting. I'm no medical expert, is your feedback that the answer from t...

all of the above. the app gemini is missing some stuff that is basically common sense to anyone that would admit this patient as well

fresh marsh Sep 13, 2025, 3:01 AM

#

brittle pine (use 32k tokens and disable google search pls)

interesting. DISABLE search? ill try it

#

with search enabled it was giving me actual garbage. it gave me literal idiotic information that would kill the patient or at least some of their organs. 2.5 pro in studio with 32k tokens and search enabled i mean

compact oracle Sep 13, 2025, 3:03 AM

#

/imagine

fresh marsh Sep 13, 2025, 3:04 AM

#

brittle pine turn off search

wow

#

very interesting

#

its literally magnitudes better without search enabled (2.5 pro studio search disabled). with search enabled and same exact settings it gives me output almost identical to the garbage i posted above

#

wtf?

#

how does that even make sense?

#

@brittle pine

fresh marsh Sep 13, 2025, 3:06 AM

#

brittle pine Only the web/mobile app is different

why would anyone want to use something worse thats paid vs free thats better...

jaunty cave Sep 13, 2025, 3:23 AM

#

fresh marsh with search enabled it was giving me actual garbage. it gave me literal idiotic ...

I bet that's a big part of it. We don't have search enabled except in search arena. When search is enabled models tend to reference the results they find even if they are low quality and would be better off generating a response with their own knowledge. This is super interesting stuff thanks for sharing the feedback

fresh marsh Sep 13, 2025, 3:24 AM

#

jaunty cave I bet that's a big part of it. We don't have search enabled except in search are...

no problem. im guessing this can be related to the bias i read about on your website, where people favor model responses with more source citations even if they arent necessarily quality sources

#

"We don't have search enabled except in search arena. " this explains a ton lol

jaunty cave Sep 13, 2025, 3:25 AM

#

yeah there are a lot of interesting human biases, kind of cool that we can collect enough data to actually measure them.

It's really funny, we put search in its own arena because we thought it would be an unfair advantage, but actually a lot of the time the results without search are better. 🤣

fresh marsh Sep 13, 2025, 3:42 AM

#

surprisingly few models exist for search lol

#

honestly didnt even realize i was comparing them without search

#

have to compare with search now

#

is there no gpt5-high or thinking etc for search?

jaunty cave Sep 13, 2025, 3:47 AM

#

gpt-5-search is gpt-5 with web_search tool enabled https://platform.openai.com/docs/guides/tools-web-search?api-mode=responses and with reasoning_effort = "high".

It would be cool to add more reasoning level settings, but search functionality isn't used enough compared to default text to justify adding so many at this point. Hopefully it will grow. We gotta do more to promote it, I think a lot of people don't realize it's available or that you have to enable it
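
As a rough illustration of the configuration described above (gpt-5 with the web search tool enabled and high reasoning effort), the request parameters might look something like the following. This is only a sketch: it builds the parameters as a plain dictionary and does not call any API, the `build_search_request` helper is hypothetical, and the field names (`tools`, `reasoning`, the `web_search` tool type) are assumptions based on the linked OpenAI web search guide rather than LMArena's actual integration.

```python
def build_search_request(prompt, model="gpt-5", effort="high"):
    """Build keyword arguments for a Responses-style request with web search.

    Field names are assumptions taken from the linked OpenAI guide; check the
    current documentation before sending this to a real endpoint.
    """
    return {
        "model": model,
        "input": prompt,
        # Allow the model to call the hosted web search tool.
        "tools": [{"type": "web_search"}],
        # Ask for high reasoning effort, as described in the message above.
        "reasoning": {"effort": effort},
    }


request = build_search_request("What changed in the latest leaderboard update?")
```

With the official `openai` Python package, a dictionary like this would presumably be passed as `client.responses.create(**request)`, though that call is not exercised here.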

fresh marsh Sep 13, 2025, 6:20 AM

#

jaunty cave gpt-5-search is gpt-5 with web_search tool enabled https://platform.openai.com/d...

by this "gpt-5-search is gpt-5 with web_search tool enabled and with reasoning_effort = "high"."
do you mean it's the same thing as gpt-5-high but searches the web?

jaunty cave Sep 13, 2025, 6:25 AM

#

it's the same model, but in search arena that model is allowed to use a tool to search the web and then has access to the results when it writes its answer. In theory this should improve it but if the info it gets from the web isn't great it actually can give worse responses

fresh marsh Sep 13, 2025, 6:45 AM

#

i see thanks

brittle pine Sep 13, 2025, 8:16 AM

#

fresh marsh why would anyone want to use something worse thats paid vs free thats better...

Usually, i mean logically, the api is always better because you have more control over it. Claude is the same. And the api is always more expensive. The catch is that google is giving you api usage for FREE

#

Sadly, in the gemini app you can't turn off search. Same with chatgpt, you can't turn off search in the app.

#

And using search is usually bad, because when you ask something and it uses search, it usually reads 25-30 sources and gives you an answer based on those 25-30 sources

#

But if it can't use search, then it uses the WHOLE DATABASE, the WHOLE TRAINING DATA, INFORMATION FROM DIFFERENT LANGUAGES

#

but if search is on, it quickly checks 30 websites and summarizes them for you

#

Then you can ask why they make search on by default? well, most of their training data ends in 2024 or january 2025, so

#

they want to give you up-to-date information

#

they don't want their LLM saying "no, Trump is not president, Joe Biden is"

twin valve Sep 13, 2025, 10:02 AM

#

brittle pine But if it cant use search, then it uses WHOLE DATABASE, WHOLE TRAIN DATA, DIFFER...

what do you mean by this? They always use their training data. The only difference is that the search results (summarized) go into their context window and influence the answer. It is RAG in the end.
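
A toy sketch of that point: with search on, the retrieved snippets are simply added to the prompt, and the same model (same weights, same training data) answers either way. The `build_prompt` helper, the prompt wording, and the example snippets are all hypothetical and are not how any particular provider implements search.

```python
def build_prompt(question, snippets=None):
    """Assemble the text a model would see, with or without search results."""
    if not snippets:
        # Search off: the model answers from its training data alone.
        return f"Question: {question}\nAnswer:"
    # Search on: retrieved text is added to the context window.
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Use the sources below if they are relevant.\n"
        f"{context}\n"
        f"Question: {question}\nAnswer:"
    )


plain = build_prompt("Who is the current president?")
augmented = build_prompt(
    "Who is the current president?",
    snippets=["Example snippet from a news site.", "Example snippet from a wiki."],
)
```

The only thing that changes between `plain` and `augmented` is the extra context, which is why low-quality search results can pull the answer in the wrong direction.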

brittle pine Sep 13, 2025, 11:34 AM

#

i don't know, it affects the output a lot

#

Also i'm sure search makes the answer heavily influenced by your native language, which is not good

#

When i ask something, i want the whole world's opinion (China included), but search usually uses your native language and well

#

it gives less detailed or wrong answers

stiff orbit Sep 14, 2025, 1:10 AM

#

Dear LMArena Team,

I noticed the new models Qwen3-next-80b-a3b-instruct and Qwen3-next-80b-a3b-thinking from Alibaba Qwen have been released recently, and some reports mention they've been added to LMArena. However, I don't see them on the Text leaderboard yet. Could you explain why they aren't ranked? Is it due to insufficient votes, ongoing processing, or another reason?

Thanks for your help!

drowsy needle Sep 14, 2025, 3:58 AM

#

stiff orbit Dear LMArena Team, I noticed the new models Qwen3-next-80b-a3b-instruct and Qwe...

Hello :ablobwave: good question - collecting enough vote data to update the leaderboards can take a little bit of time. It depends, but this can generally take up to about a week.

sharp portal Sep 14, 2025, 8:12 AM

#

Where can I find the archived versions of the leaderboard? (i.e. not the latest)

reef moth Sep 14, 2025, 10:02 AM

#

sharp portal Where can I find the archived versions of the leaderboard? (i.e. not the latest)

https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard/commits/main

Commits · lmarena-ai/lmarena-leaderboard

#

But this repository is no longer actively updated

gusty grove Sep 14, 2025, 10:48 AM

#

Hello everyone. I'm very glad to join this impactful Ai community.

sharp portal Sep 14, 2025, 1:55 PM

#

reef moth https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard/commits/main

Oh I see…I can't find the elo scores except very old versions there, but thanks

scarlet harbor Sep 15, 2025, 3:31 AM

#

ancient ledge Sep 15, 2025, 4:03 AM

#

How about the rank of GPT-5 with the new system prompt?

scarlet harbor Sep 15, 2025, 5:03 AM

#

ancient ledge How are about new system prompt gpt5 rank?

Haven't tested it, but it's probably still worse than o3 Pro. I doubt it would gain 200+ Elo from just a system prompt change

glass sun Sep 15, 2025, 6:10 AM

#

scarlet harbor

The website is 100% vibe coded

scarlet harbor Sep 15, 2025, 7:14 AM

#

glass sun The website is 100% vibe coded

No I created it lol just for a screenshot

#

Gets the point across

twin valve Sep 15, 2025, 12:22 PM

#

scarlet harbor

A 600+ Elo gap between o3 and o3-pro? Unless you use funky Elo K-factor parameters, that is not even close to reality.

scarlet harbor Sep 15, 2025, 1:20 PM

#

twin valve 600+ elo ratings between o3 and o3-pro ? Unless you use funky elo K factor param...

Those are the results. I used Chess.com bots, then took away around 300 Elo points for inflation (it's more specific than that, but that's the summary). o3 beat an 1100 chess bot while o3 Pro beat an 1800 chess bot

#

o3 Pro played very decent moves tbh, none of them were bad until the endgame (which every LLM struggles with), but Pro was able to get a passed pawn. o3 had some inaccuracies in the middlegame and more in the endgame than Pro.

#

Makes me hype for how 5 Pro does.

devout canyon Sep 15, 2025, 3:26 PM

#

scarlet harbor Those are the results, I used a Chess.com bot, then took away around 300 elo poi...

what do you think about gpt 5 pro vs o3 pro ?

scarlet harbor Sep 15, 2025, 4:26 PM

#

devout canyon what do you think about gpt 5 pro vs o3 pro ?

GPT 5 Pro would probably destroy o3 tbh

tender sigil Sep 15, 2025, 4:33 PM

#

text leaderboard update any time soon? been a week since the last one on Monday :p

drowsy needle Sep 15, 2025, 4:34 PM

#

tender sigil text leaderboard update any time soon? been a week since the last one on Monday ...

Shouldn't be too much longer

tender sigil Sep 15, 2025, 4:36 PM

#

okie 😄 excited to see where new GPT-5 system prompt places!! wondering if Opus 4.1 Thinking will continue its upward trajectory to possibly overtake Gemini 2.5 Pro 😳

glass sun Sep 15, 2025, 8:18 PM

#

Will the copilot arena EVER be brought back?

#

Yes or no

#

I’ve tried asking around so many times and I haven’t received an answer

jaunty cave Sep 15, 2025, 8:30 PM

#

@glass sun we can't make any promises, sorry :/

glass sun Sep 15, 2025, 8:31 PM

#

Mkay

vivid fable Sep 15, 2025, 10:10 PM

#

ok

jaunty cave Sep 15, 2025, 10:57 PM

#

text to video and image to video leaderboards updated! lots more votes!

shell cape Sep 16, 2025, 6:46 AM

#

Read #1397655624103493813 and do the /video command in #video-arena-1, not here.

devout basalt Sep 16, 2025, 6:05 PM

#

Seedream high-res must be 1st place, what do you guys think?

jaunty cave Sep 16, 2025, 6:48 PM

#

devout basalt Seedream highrea must be 1st place what you guys think

new leaderboard is out! you're right seedream-4-high-res is first on text-to-image, but it's behind nano-banana on image editing by a lot

twin valve Sep 16, 2025, 8:54 PM

#

scarlet harbor Those are the results, I used a Chess.com bot, then took away around 300 elo poi...

I see, but then you should specify that you tested them on chess. Also, for chess tests there are already a couple of interesting benchmarks (though they don't tell the whole story):

https://dubesor.de/chess/chess-leaderboard
https://maxim-saplin.github.io/llm_chess/

and there are others.

Otherwise one would assume the Elo is comparable to the LMArena Elo (it is not), and even as a chess rating it would be tied to Chess.com bot Elo (which is not comparable to player Elo). The Elo thing is a rabbit hole.

AI Chess Leaderboard - dubesor AI project

LLM AI Chess Leaderboard: Ranking, Elo, and Chess Performance of AI language models.

LLM Chess Leaderboard

LLM Chess Leaderboard comparing LLMs in simulated chess vs Random and Komodo Dragon. Rankings by Elo, win/loss, game duration, and tokens.

#

@twilit echo interesting, the LLM chess leaderboard that was based on a random player now stepped up the game a bit with Komodo chess levels (that again aren't totally comparable to the playerbase)

#

I am pretty sure that when people see a negative Elo they will say "wait, that's wrong" (it isn't; Elo works with differences)

twilit echo Sep 16, 2025, 8:58 PM

#

twin valve I am pretty sure that when people will see negative elo will say "wait that's wr...

yea, minus Elo makes no sense though, because in real Elo maths it's not possible to go negative. the worst player possible cannot get close to 0, due to the probability-based rating

#

once they reach like ~400 rating they'd lose 0.0001 elo per loss

twin valve Sep 16, 2025, 9:01 PM

#

as far as I know it should work no problem.

1 / (1 + 10^((-3000 - (-3200)) / 400)) works no problem. The -3200 player has an expected score (win probability) of 0.24 vs a -3000 player.

#

identical if I do

1 / (1 + 10^((3000 - 2800) / 400))

a 3000 player vs a 2800

#

it works with differences, it doesn't matter the sign
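
A quick way to check the two expressions above is to compute them. This is only a minimal sketch of the standard Elo expected-score formula; the function and variable names are made up for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A vs player B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Same 200-point gap, once with negative ratings and once with positive ones.
negative_pair = expected_score(-3200, -3000)  # -3200 player vs -3000 player
positive_pair = expected_score(2800, 3000)    # 2800 player vs 3000 player

print(f"{negative_pair:.2f} {positive_pair:.2f}")
```

Both values come out the same because only the rating difference enters the formula.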

#

ah I see you mean rating gain/loss. That's true.

#

likely they let the random player be clobbered, and hence he got real low, and then the others were scaled down.

twilit echo Sep 16, 2025, 9:03 PM

#

that's theory vs practice. in practical application, no player will ever reach negative Elo, no matter the matchups or how terrible

twin valve Sep 16, 2025, 9:03 PM

#

sure, if one starts high enough it likely won't happen

#

in cases like lichess and chess.com they have rating floors because there are some that want just to lose

#

and then you make an inverse ladder (that is, player A completely loses to player B, player C completely loses to player A, player D completely loses to C and so on)

twilit echo Sep 16, 2025, 9:05 PM

#

I used random movers and worstfish (deliberately always pick the lowest rated moves), and they wouldnt be able to get negative in my standard elo environment

twin valve Sep 16, 2025, 9:05 PM

#

you mean worstfish is not able to get negative from 441 ?

#

even adding -0.1 a lot of times?

twilit echo Sep 16, 2025, 9:06 PM

#

nope, at least not in 100 matches

#

everything else is not practical reality

twin valve Sep 16, 2025, 9:06 PM

#

I see. Yes if it adds -0.1 every time (or less), then it takes like forever
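
To get a feel for how slowly a very weak player sinks, here is a rough sketch of game-by-game Elo updates where the player loses every game. The K factor of 32 and the fixed 1500-rated opponent are assumptions picked for illustration, not values from anyone's setup here; 441 is the starting figure mentioned above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A vs player B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rating_after_losses(start: float, opponent: float, k: float, n_losses: int) -> float:
    """Apply n_losses consecutive losses (actual score 0) against a fixed-rating opponent."""
    rating = start
    for _ in range(n_losses):
        rating += k * (0.0 - expected_score(rating, opponent))
    return rating


K = 32.0           # assumed K factor
OPPONENT = 1500.0  # assumed opponent rating, held fixed for simplicity

first_loss = K * expected_score(441.0, OPPONENT)
final_rating = rating_after_losses(441.0, OPPONENT, K, 100)

print(f"first loss costs {first_loss:.3f} points")
print(f"rating after 100 losses: {final_rating:.1f}")
```

Each loss costs less than the previous one, so under these assumptions 100 losses only shave a few points off. With closer-rated opponents or a larger K the drop would be faster.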

#

from the llm leaderboard with negative values. checking the result I believe the random player is simply anchored very low. LLMs that mostly are equal to it (see the results vsR) have around -30 rating.

That could explain all the other ratings (then it also depends on the K factor they use and so on)

twilit echo Sep 16, 2025, 9:10 PM

#

as much as I can appreciate his commitment, I just don't find the way the methods are applied logical or useful to me. good that different things exist though, but as a user it's not useful to me, and I always try to make whatever I'd enjoy consuming.
"o1 is 140 Elo" just doesn't translate to anything for me. it doesn't relate to human ratings.

twin valve Sep 16, 2025, 9:11 PM

#

agreed. It is simply fun

#

interesting that they use stockfish to approximate maia engines. I guess another side project

#

https://github.com/maxim-saplin/llm_chess/blob/main/stockfish_elos/Readme.md

twilit echo Sep 16, 2025, 9:13 PM

#

I tried mapping Stockfish levels to Elo (even did a YouTube video on it), but I found out the Stockfish Elo levels are vastly inaccurate. E.g. whatever chess.com labels 1000 Elo or 1500 Elo is not related to any real-world scale. So I abandoned that method and switched to accuracy, which, while still inaccurate, particularly in huge back-and-forth swings (blunder followed by blunder), I found to be more reliable overall.

twin valve Sep 16, 2025, 9:14 PM

#

yes, the community says the same (a lot of players want to know "if I beat bot XY, what is my real elo?" and the result is: they are all over the place)

#

Ok I think there is a bit more blogging here, relatively "hidden" from the main site: https://github.com/maxim-saplin/llm_chess/blob/main/docs/notes.md

" We first calibrate Random vs Dragon; then any model’s games vs Random/Dragon can be combined into a single 1D MLE estimate with a 95% CI."

so Random got clobbered. The 1st level of dragon is 125 (if I understood), hence the ton of negative things. Still he could explain that in the main page.

twilit echo Sep 16, 2025, 9:17 PM

#

even the worst dragon engine is like 1200. might be missing something, but 125 is far worse than random.

#

against a natural pool of players of all skill levels

twin valve Sep 16, 2025, 9:18 PM

#

they set the first level to 125

#

should be this one: https://github.com/maxim-saplin/llm_chess/blob/main/data_processing/calc_elos.py#L45

twilit echo Sep 16, 2025, 9:19 PM

#

well, in my player set, a complete random got like 438 elo, and worstfish was high 200s

#

after ~100 matches

twin valve Sep 16, 2025, 9:20 PM

#

ok, so if you would set the random to -30 (if I understand the data, that should be its level) then worstfish would be -260 at most.

Yeah something doesn't add up.

#

ah I see. If I am not wrong, they don't do it like you do (or like FIDE and others).
They collect the results and then run an MLE process with anchored ratings.

So it is not game by game.
I was trying to find the K factor but I couldn't.

https://github.com/maxim-saplin/llm_chess/blob/main/data_processing/calc_elos.py#L20-L42
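
For anyone curious what "MLE with anchored ratings" can look like in its simplest form, here is a rough sketch. It is not the code from that repo: it assumes the opponents' ratings are fixed (anchored), uses the standard Elo curve, and finds the single rating at which the expected total score equals the observed total score. Counting draws as half a point is a common simplification and an assumption here; no confidence interval is computed.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A vs player B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def mle_rating(games, lo: float = -4000.0, hi: float = 4000.0, iters: int = 100) -> float:
    """Estimate one rating from all games at once, not game by game.

    games: list of (anchored_opponent_rating, score) with score 1, 0.5 or 0.
    The likelihood is maximised where observed and expected totals match;
    the expected total rises with the rating, so bisection finds that point.
    With only wins or only losses the estimate runs to the search bound.
    """
    observed = sum(score for _, score in games)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, opp) for opp, _ in games) < observed:
            lo = mid  # rating too low to explain the results
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical results against an opponent anchored at 125:
# 5 wins and 5 losses, so the estimate should sit at the anchor itself.
even_games = [(125.0, 1.0)] * 5 + [(125.0, 0.0)] * 5
estimate = mle_rating(even_games)
print(round(estimate, 1))
```

Since there is no per-game update, there is no K factor in this kind of fit, which would explain why one is hard to find.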

#

as you said, it is good that there is variation, but it is mostly fun

scarlet harbor Sep 16, 2025, 9:47 PM

#

twilit echo yea minus elo makes no sense though, because in real elo maths its not possible ...

I think accuracy is better

scarlet harbor Sep 16, 2025, 9:49 PM

#

twin valve I see but then you should specify that you tested them on chess. Further for che...

Also I dont find those accurate and they dont always have the models i'm looking for

jaunty cave Sep 16, 2025, 11:21 PM

#

twilit echo thats theory vs practise. in practicial application, no player will ever reach n...

in live rated chess tournaments sure, but "real life" is a broad term. In online video games with botting and smurfing it's definitely possible for Elo to go negative if the rating system designer hasn't planned for it by adding a floor or something

jaunty cave Sep 16, 2025, 11:24 PM

#

twin valve ah I see. If I am not wrong they don't do it like you do (or FIDE and others). T...

this is so cool actually I'm glad I saw this

tender sigil Sep 17, 2025, 12:40 AM

#

twilit echo I used random movers and worstfish (deliberately always pick the lowest rated mo...

Can you run worstfish locally? I’ve wanted to play with it for a while

limber musk Sep 17, 2025, 3:41 AM

#

hello

jaunty cave Sep 17, 2025, 3:41 AM

#

hi

proud siren Sep 17, 2025, 4:34 AM

#

hai guys

jaunty cave Sep 17, 2025, 4:35 AM

#

lively night, what's up AI people

lethal stag Sep 17, 2025, 4:36 AM

#

Hey what's up gang Vulpes here what's cooking

jaunty cave Sep 17, 2025, 4:36 AM

#

just writing code with AI

#

and using it to judge how good AIs are at writing code

lethal stag Sep 17, 2025, 4:37 AM

#

jaunty cave just writing code with AI

what are you making?

jaunty cave Sep 17, 2025, 4:37 AM

#

I'm making LMArena 😎

lethal stag Sep 17, 2025, 4:38 AM

#

oh, just testing?

jaunty cave Sep 17, 2025, 4:38 AM

#

I work here, but yeah a lot of my job is testing AI

lethal stag Sep 17, 2025, 4:39 AM

#

Ah, lol I'm slow

#

gotcha

jaunty cave Sep 17, 2025, 4:40 AM

#

lol np, what you up to? What brings you to lmarena

lethal stag Sep 17, 2025, 4:40 AM

#

boredom

#

but I'm working on a prompt, to turn Gemini into a good image-to-prompt generator

jaunty cave Sep 17, 2025, 4:41 AM

#

what's your favorite AI to use when you're bored?

lethal stag Sep 17, 2025, 4:41 AM

#

like system instructions

#

banana is bananas, so that

jaunty cave Sep 17, 2025, 4:42 AM

#

haha that's cool, creative way to use AI to better use AI.

#

once you get the prompt, then you feed that into banana or seedream or smth?

lethal stag Sep 17, 2025, 4:45 AM

#

lethal stag but I'm working on a prompt, to turn Gemini into a good image-to-prompt generato...

My plan is - I throw a random photo at Gemini Pro to analyze. Gemini turns it into a prompt, so I can recreate that photo as closely as possible. Then I upload my photo to banana and give it that prompt, basically recreating that photo, but with my likeness

lethal stag Sep 17, 2025, 4:47 AM

#

jaunty cave once you get the prompt, then you feed that into banana or seeddream or smth?

So you can like, browse, lets say, Pinterest. You see a photo you like, throw it in AI Studio for Gemini, get a prompt, give it to banana, super easy

#

but Gemini seems to be random when it comes to making those prompts, sometimes they are on point, other times, I have to correct like 50% of it

jaunty cave Sep 17, 2025, 4:48 AM

#

interesting, do you find that Gemini is good at coming up with prompts? Like, for example, is this better than just taking the original photo and a photo of you and asking banana to combine them?

#

I see

lethal stag Sep 17, 2025, 4:49 AM

#

idk actually, I just feel like having a precise prompt gives so much more control and reusability

jaunty cave Sep 17, 2025, 4:50 AM

#

makes sense!

lethal stag Sep 17, 2025, 4:51 AM

#

and by using a prompt with banana, instead of asking for a combo, there's more chance you get through censorship I think

#

I think there's more guardrails when it comes to deepfaking and putting ppl together etc.

jaunty cave Sep 17, 2025, 4:53 AM

#

oh yeah, I think especially in the Gemini app; I heard people complaining it would not allow them to do a lot of things

lethal stag Sep 17, 2025, 4:55 AM

#

yeah, it's very sensitive to anything that could possibly upset the model

#

but it's barely any better in the Arena, but I have already gotten used to the vocabulary, that you can or cannot use for banana prompts

twin valve Sep 17, 2025, 11:34 AM

#

scarlet harbor Also I dont find those accurate and they dont always have the models i'm looking...

you find your method more accurate than the other two? Ok, to each their own I guess.

twin valve Sep 17, 2025, 11:35 AM

#

tender sigil Can you run worstfish locally? I’ve wanted to play with it for a while

likely it is possible but online is doable too: https://lichess.org/@/WorstFish

lichess.org

BOT WorstFish (1500)

BOT WorstFish played 289617 games since Jun 28, 2021.

twin valve Sep 17, 2025, 11:36 AM

#

lethal stag Hey what's up gang Vulpes here what's cooking

back to Caesar with you. Didn't the Courier already clarify that?

#

(for the confused, the user's username matches the name of a legionary in Caesar's Legion in Fallout: New Vegas)

lethal stag Sep 17, 2025, 1:34 PM

#

twin valve back to Cesar with you. Didn't the curier already clarified that?

Ave! But not true to Caesar, I just like the name, apparently it means 'desert fox' in Latin

scarlet harbor Sep 17, 2025, 3:46 PM

#

twin valve you find your method more accurate than the other two? Ok, to each their own I g...

It's a more balanced method for sure

jaunty cave Sep 17, 2025, 4:18 PM

#

We should introduce a category for the Latin language on the lmarena leaderboard 😄

tender sigil Sep 17, 2025, 11:03 PM

#

text leaderboard just updated!

tender sigil Sep 17, 2025, 11:27 PM

#

Anthropic does not move closer to Gemini for #1 - but Qwen3 Max expands its lead in #2 on no style control !

jaunty cave Sep 17, 2025, 11:43 PM

#

kinda crazy that OAI has 4 models within 1 point of each other overall, and like they are definitely each good at different things but it kind of cancels out

dreamy yoke Sep 18, 2025, 8:07 AM

#

i will win

agile sorrel Sep 18, 2025, 8:39 AM

#

helo

eternal vortex Sep 18, 2025, 9:16 AM

#

It's good to see legacy platforms holding their own in 2025. Lets me know that AI is highly scalable and on the right track.

atomic mural Sep 18, 2025, 11:14 AM

#

hi

tropic forge Sep 18, 2025, 12:15 PM

#

nice to read

robust zodiac Sep 18, 2025, 1:14 PM

#

hey! is there a way I can see the leaderboards before last update?

#

(eg to compare current vs old one)

sick glade Sep 18, 2025, 1:18 PM

#

Hi

drowsy needle Sep 18, 2025, 1:29 PM

#

robust zodiac hey! is there a way I can see the leaderboards before last update?

There is! It's not available on the site but you can find here - https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard/tree/main

lmarena-ai/lmarena-leaderboard at main

robust zodiac Sep 18, 2025, 1:37 PM

#

thanks!

lofty sun Sep 18, 2025, 1:43 PM

#

Hi, y'all!

Good day!

twin valve Sep 18, 2025, 3:15 PM

#

lethal stag Ave! But not true to Caesar, I just like the name, apparently it means 'desert f...

It always irked me that in the Legion people didn't speak much in Latin despite their fanaticism.

lethal stag Sep 18, 2025, 4:16 PM

#

twin valve It always irked me that in the Legion people didn't speak much in Latin despite ...

IIRC, only maybe Caesar himself and perhaps some of his generals were even a little bit educated or cared at all about the ancient Romans and their culture/language/history. All the plebs below them were conquered and forcefully integrated into the Legion, or raised and brainwashed from childhood to follow the chain of command and obey Caesar/other superiors, to be fanatic killers, conquerors and enslavers of others who are not part of the Legion. Any education, if at all, would be only in stuff like weapons training, tactics, maybe some Latin phrases relevant to those things etc., but that's as sophisticated as it gets for them. I don't think most of them even understand or know that Caesar is making them all LARP as members of an ancient civilization from the past, they just do what they are told. And that's just the legionaries and soldiers, everyone else are slaves and women...

twin valve Sep 18, 2025, 4:50 PM

#

sure but for how they conduct themselves as "high and mighty" they could have read a thing or two. But yeah likely it fits.

#

@drowsy needle I like the "movers" in the announcement. If you keep that one can track the changes over time (unless one goes through the saved elo leaderboards)

drowsy needle Sep 18, 2025, 5:02 PM

#

twin valve @drowsy needle I like the "movers" in the announcement. If you keep that...

Glad to hear it! I'll be sure to share this with the team.

merry patrol Sep 18, 2025, 6:25 PM

#

how do I turn on the filter for Open LLM only on the leaderboard, guys?

tender sigil Sep 18, 2025, 6:26 PM

#

It’s the License tab all the way on the right - I don’t think you can sort but basically any model that doesn’t say “Proprietary” is open source

merry patrol Sep 18, 2025, 6:27 PM

#

it's kinda annoying that it's not sorted, but thx

pseudo rivet Sep 18, 2025, 7:37 PM

#

If someone made a classifier on responses to predict which LLM it came from, how accurate could it be? And if someone is doing so, could they create bot accounts to target a model and raise it in the leaderboards because they bet money on it?

alpine bridge Sep 18, 2025, 8:14 PM

#

I don’t understand

#

What’s the difference between Open rank and Rank (UB)?

#

Open sourced?

jaunty cave Sep 18, 2025, 8:17 PM

#

eternal vortex It's good to see legacy platforms holding their own in 2025. Let's me know that...

who is the legacy platform in question?

scarlet harbor Sep 18, 2025, 10:07 PM

#

eternal vortex It's good to see legacy platforms holding their own in 2025. Let's me know that...

Most scalable tech in history

zealous hinge Sep 19, 2025, 4:57 AM

#

jaunty cave kinda crazy that OAI has 4 models within 1 point of each other overall, and like...

Then there will be two: Gemini 3 Pro and Flash, and maybe Ultra will be here too.

broken harbor Sep 19, 2025, 5:13 AM

#

What is the overall general image to video best model but completely by your opinion not statistics guys? Really glad to find this community, I honestly find Kling most reliable especially for high motion. MidJourney I also find good for more creative stuff, art.

jaunty cave Sep 19, 2025, 5:18 AM

#

zealous hinge Then, there will be two: Gemini 3 Pro and Flash, and maybe also Ultra will here.

but Google's models do not all have the same score as each other

jaunty cave Sep 19, 2025, 11:54 PM

#

some new stuff on the leaderboards, grok-4-fast

rose cradle Sep 20, 2025, 1:29 AM

#

isn't it weird that kimi 0905 is better than 0711 in every way yet it comes after it in the leaderboard, and deepseek v3.1-thinking is better than 3.1 and r1 but comes after them?

jaunty cave Sep 20, 2025, 3:42 AM

#

@rose cradle good question, just raised this with the team, currently it's just sorted by overall and then ties broken alphabetically. We'll update that to sort by the ratings in the other categories

rose cradle Sep 20, 2025, 3:53 AM

#

But shouldn't overall be different for them? For example, I think Kimi had a lower overall because math benchmark only came much later, but when it did, to my surprise, its position didn't move.

jaunty cave Sep 20, 2025, 3:55 AM

#

the overall here is the overall rank, which accounts for overlapping confidence intervals. Here the 2 Kimis and 2 DeepSeeks are all rank 10 overall, and it just sorts them alphabetically

rose cradle Sep 20, 2025, 3:56 AM

#

But look at 3.1 thinking, it is better than 3.1 in a lot of things and it is 10 instead of 9

jaunty cave Sep 20, 2025, 4:01 AM

#

the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.

in this case they have very close ratings and CIs: 1417 +/- 6 and 1414 +/- 7. What's happening is that the slight difference means there is one more model above them whose confidence interval overlaps with 3.1's but not with 3.1 thinking's. So according to this method, there is one more model that we are confident is better than thinking, but not confident is better than 3.1

And that rank is the main sort key. I think cases like this definitely seem weird and are arguments for using a different sorting, but no matter how you order it something will look weird
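
To make the overlap effect concrete, here is a small sketch of one reading of that ranking rule: a model's rank is 1 plus the number of models whose lower CI bound sits above its upper CI bound, with ties broken alphabetically. The exact rule used on the site is an assumption here, "model-x" and its numbers are invented, and the two ratings are read in the order 3.1 then 3.1 thinking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    rating: float
    ci: float  # half-width of the confidence interval

    @property
    def lower(self) -> float:
        return self.rating - self.ci

    @property
    def upper(self) -> float:
        return self.rating + self.ci


def rank_by_overlap(entries):
    """Rank = 1 + number of models that are confidently better (no CI overlap)."""
    return {
        e.name: 1 + sum(1 for other in entries if other.lower > e.upper)
        for e in entries
    }


entries = [
    Entry("deepseek-v3.1", 1417, 6),           # upper bound 1423
    Entry("deepseek-v3.1-thinking", 1414, 7),  # upper bound 1421
    Entry("model-x", 1428, 6),                 # hypothetical model, lower bound 1422
]

ranks = rank_by_overlap(entries)
# Main sort key is the rank, ties broken alphabetically by name.
ordered = sorted(ranks.items(), key=lambda item: (item[1], item[0]))
print(ordered)
```

In this toy case model-x's lower bound of 1422 clears 3.1 thinking's upper bound of 1421 but not 3.1's upper bound of 1423, so thinking drops one rank while 3.1 does not.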

rose cradle Sep 20, 2025, 4:04 AM

#

Ah, ok, thanks. I always thought that was aggregate since you replaced the numbers from the previous leaderboard with this

jaunty cave Sep 20, 2025, 4:05 AM

#

the one on the old site? I'm not actually sure how that one worked, I joined after it was deprecated. appreciate the feedback a lot

rose cradle Sep 20, 2025, 4:05 AM

#

Yes, the old one just had a single overall number and companies would fight for that number. Now it is kind of weird. They fight in individual categories but the main one is more subjective.

#

I also think you should remove copilot benchmark, and add links to lm battle web arena because it is super hidden

jaunty cave Sep 20, 2025, 4:06 AM

#

oh, the main overall on text is still the same methodology. https://lmarena.ai/leaderboard/text

#

copilot 😥 , too many other things to work on rn to resurrect that. it was also kinda hard to use. webdev arena integration definitely coming tho!

rose cradle Sep 20, 2025, 4:10 AM

#

Yeah, that's why I think you should remove it, it's been 120 days since last update

viscid zinc Sep 20, 2025, 8:45 AM

#

I would like to thank all the designers of this website, everyone who contributed to its completion, and everyone who thought of it. All thanks and appreciation for your efforts and hard work. You are geniuses and smart makers. I love you, I love you with all my heart. You are in my heart more than anyone who worked hard on something like this. I love you. All love to the entire team. You are creative. You deserve to be at the top of designers and at the top of this world. Thank you, thank you. You are better than Elon Musk.

tropic pivot Sep 20, 2025, 1:10 PM

#

@wicked rapids @pulsar grove Hi! If you're trying to generate your prompts please check: https://discordapp.com/channels/1340554757349179412/1397655624103493813 🙂

fallen laurel Sep 20, 2025, 8:06 PM

#

Hello to everyone, .....I am here to learn from the diverse skill and mindset of all present, and contribute my own 2 cents.

hollow crown Sep 21, 2025, 1:23 AM

#

How to use seedream 4.0

#

I couldn't find it in direct chat

#

Can somebody help? 🙏🏻

compact roost Sep 21, 2025, 1:27 AM

#

hollow crown How to use seedream 4.0

they removed it

hollow crown Sep 21, 2025, 1:31 AM

#

Damn

#

I'm using it in replicate

#

After adding my credit card

#

I thought I could use it for free

old comet Sep 21, 2025, 3:57 AM

#

hi

simple dock Sep 21, 2025, 4:37 AM

#

Nice

tepid topaz Sep 21, 2025, 5:59 AM

#

hello

errant belfry Sep 21, 2025, 7:38 AM

#

tonight is the night

simple portal Sep 21, 2025, 8:35 AM

#

Hello

#

I am new ...happy to connect..

marble bluff Sep 21, 2025, 11:14 AM

#

Hello! Please check #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.

hoary compass Sep 21, 2025, 11:30 AM

#

the overall column here is not an aggregate of the other columns, it just means the ratings when you include all votes. The categories do not cover all of the prompts. It is odd that it's better overall but worse on most categories.

in this case they have very close ratings and CIs: 1417 +/- 6 and 1414 +/- 7. What's happening is that the slight difference means there is one more model above them whose confidence interval overlaps with 3.1's but not with 3.1 thinking's. So according to this method, there is one more model that we are confident is better than thinking, but not confident is better than 3.1

And that rank is the main sort key. I think cases like this definitely seem weird and are arguments for using a different sorting, but no matter how you order it something will look weird

versed wigeon Sep 21, 2025, 2:24 PM

#

hello

tender sigil Sep 21, 2025, 4:10 PM

#

viscid zinc I would like to thank all the designers of this website, everyone who contribute...

awe this was really sweet

marble bluff Sep 21, 2025, 4:55 PM

#

Hello! Please check #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.

tropic pivot Sep 21, 2025, 5:37 PM

#

Hi! Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 🙂

warped bone Sep 22, 2025, 6:58 AM

#

hi

#

@tropic pivot

smoky tapir Sep 22, 2025, 10:07 AM

#

Is there any possibility to use an LMArena API key?

drowsy needle Sep 22, 2025, 4:25 PM

#

smoky tapir is it any possibility to use LMArena API Key?

We don't have an API for LMArena available.

main estuary Sep 22, 2025, 4:40 PM

#

hi everyone plzzz i asked something

#

available everyone for me plzz

drowsy needle Sep 22, 2025, 4:49 PM

#

main estuary hi everyone plzzz i asked something

whats up?

main estuary Sep 22, 2025, 5:09 PM

#

drowsy needle whats up?

Okay, tell me, how can I generate unlimited videos either on the official LMArena website or in this group? What should I do so that I get a video exactly related to my prompt, with sound included? I also messaged LMArena in their inbox and gave them a prompt, but I didn’t get a response. What’s going on with this?

fresh marsh Sep 22, 2025, 5:10 PM

#

main estuary Okay, tell me, how can I generate unlimited videos either on the official LMAren...

:peepoDirty:

main estuary Sep 22, 2025, 5:12 PM

#

??

drowsy needle Sep 22, 2025, 5:16 PM

#

main estuary Okay, tell me, how can I generate unlimited videos either on the official LMAren...

Sure there are a few questions in here:

1. The Video Arena bot can only be used in this Discord server; you can't use it through Direct Messages.
2. It's not unlimited: you can only create 5 generations per day.
3. Whether sound is included is random. Not all models have sound capabilities, and since the models you get are random, whether you get sound is random too.

desert crane Sep 22, 2025, 5:18 PM

#

Hi Everyone

main estuary Sep 22, 2025, 5:19 PM

#

drowsy needle Sure there are a few questions in here: 1. The Video Arena bot can only be used ...

So tell me, what’s the maximum length of a video we can make, and how do we get it made? Where should I give the prompt and what should I do?

drowsy needle Sep 22, 2025, 5:20 PM

#

main estuary So tell me, what’s the maximum length of a video we can make, and how do we get ...

The length is around 5-8 seconds. The #1397655624103493813 channel can answer your other questions.

main estuary Sep 22, 2025, 5:28 PM

#

drowsy needle The length is around 5-8 seconds. The #1397655624103493813 channel can answer ...

thnks you soooo much i understand👍 💯 💓

drowsy needle Sep 22, 2025, 5:29 PM

#

main estuary thnks you soooo much i understand👍 💯 💓

No problem!

#

Good place for questions/feedback about the bot would be in #bot-feedback btw. We'd like to keep this channel focussed on #leaderboards discussion.

main estuary Sep 22, 2025, 5:30 PM

#

ok i understand my sweet lover brother @drowsy needle

drifting hornet Sep 23, 2025, 1:12 PM

#

Please, check #1397655624103493813 to learn how to use the bot.

true arch Sep 23, 2025, 5:08 PM

#

New here, not sure if this is the right place to ask: for the search leaderboard, can we include Llama (Meta AI) as well? If this has been considered, what are the reasons not to include it? Thanks all.

jaunty cave Sep 23, 2025, 5:10 PM

#

Hi @true arch does meta surface an API where we can get answers to queries using results of web search + llama model? All the models on search arena are offered as search + llm products by their respective companies.

tough osprey Sep 24, 2025, 12:39 AM

#

Hello, I'm working right now on an LMArena review video. Is the "Exclude Ties" feature working? Although the model ranking changes, the list still shows ties.

sour hamlet Sep 24, 2025, 1:11 AM

#

Where can we actually view the leaderboards? Are they also displayed on the web?

white iron Sep 24, 2025, 6:11 AM

#

sup

marble bluff Sep 24, 2025, 1:13 PM

#

Hello! Please check #1397655624103493813 to learn how to use the bot and #video-arena-1 #video-arena-2 #video-arena-3 for your creations.

jaunty cave Sep 24, 2025, 3:35 PM

#

tough osprey Hello, I'm working right now on an LMArena review video. Is the "Exclude Ties" fea...

ah! "exclude ties" means excluding votes where the person voted "tie" or "tie both bad". The resulting leaderboard can still have models which are tied with each other
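
A minimal sketch of that distinction, assuming a made-up vote format (the tuple layout and the outcome labels below are illustrative, not LMArena's actual schema): filtering out tie votes does not stop two models from ending up level.

```python
from collections import Counter

# Hypothetical vote records: (model_a, model_b, outcome).
# The outcome labels are assumptions for illustration only.
votes = [
    ("model-x", "model-y", "a_wins"),
    ("model-x", "model-y", "b_wins"),
    ("model-x", "model-y", "tie"),
    ("model-x", "model-y", "tie_both_bad"),
]

TIE_OUTCOMES = {"tie", "tie_both_bad"}


def exclude_ties(votes):
    """Drop votes where the user picked 'tie' or 'tie (both bad)'."""
    return [v for v in votes if v[2] not in TIE_OUTCOMES]


decisive = exclude_ties(votes)

# Count wins over the remaining, decisive votes only.
wins = Counter(a if outcome == "a_wins" else b for a, b, outcome in decisive)
```

Here both tie votes are gone, yet `wins` gives each model one win, so the two models are still tied with each other.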

jaunty cave Sep 24, 2025, 3:37 PM

#

sour hamlet Where can we actually view the leaderboards? Are they also displayed on the web?

Hi, the leaderboards are available here: https://lmarena.ai/leaderboard/ and they're visible on web or desktop

full pecan Sep 25, 2025, 1:31 AM

#

Hello everyone! If anyone knows, I'm curious as to how often this leaderboard https://lmarena.ai/leaderboard/text/overall-no-style-control is updated

radiant glacier Sep 25, 2025, 1:45 AM

#

full pecan Hello everyone! If anyone knows, I'm curious as to how often is this leaderboard...

we ❤️ polymarket

full pecan Sep 25, 2025, 1:45 AM

#

hehe

austere charm Sep 25, 2025, 6:42 AM

#

hello ! everyone ! how are u all ?

pallid lotus Sep 25, 2025, 6:59 AM

#

let´s check it out

drifting hornet Sep 25, 2025, 1:43 PM

#

Please, head to #1397655624103493813 to learn how to properly prompt the bot.

jaunty cave Sep 25, 2025, 6:02 PM

#

austere charm hello ! everyone ! how are u all ?

doing good how are you?

dire abyss Sep 25, 2025, 6:40 PM

#

hello

drowsy needle Sep 25, 2025, 6:49 PM

#

dire abyss hello

hello ablobwave

drowsy needle Sep 25, 2025, 7:23 PM

#

tender sigil Sep 25, 2025, 7:27 PM

#

look at all these new text models debuting dog we ain’t never getting a leaderboard update 😭💔

jaunty cave Sep 25, 2025, 7:27 PM

#

really interesting to see how variations even in provider, fal vs seedream official can have noticeable differences. I saw this post earlier about how different providers of other open source text models can differ a lot from the official. 💀
https://github.com/MoonshotAI/K2-Vendor-Verfier

GitHub

GitHub - MoonshotAI/K2-Vendor-Verfier: Verify Precision of all Kimi...

Verify Precision of all Kimi K2 API Vendor. Contribute to MoonshotAI/K2-Vendor-Verfier development by creating an account on GitHub.

stray parrot Sep 25, 2025, 7:29 PM

#

jaunty cave really interesting to see how variations even in provider, fal vs seedream offic...

This is fascinating, do other research companies do this or is someone doing this for multiple models?

jaunty cave Sep 25, 2025, 7:31 PM

#

stray parrot This is fascinating, do other research companies do this or is someone doing thi...

I think we're the only ones doing large scale tests with real user requests to measure the quality 😎

stray parrot Sep 25, 2025, 7:33 PM

#

I hadn't noticed Fal was listed as a separate model lol

#

I love the comparisons to the OpenRouter providers in general, I think the transparency is super important!

jaunty cave Sep 25, 2025, 7:37 PM

#

Yeah we didn't have the -fal provider suffix before. We added it now since we are serving the same model from different providers. We put a note about it on our Leaderboard Changelog to clarify it. https://news.lmarena.ai/leaderboard-changelog/

LMArena Blog

Leaderboard Changelog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

For model deprecations, check the public updates on GitHub.

September 25, 2025
New model announcement:
Seedream-4-2k has been added to the Text-to-Image and Image Edit leaderboards.

Note that Seedream-4-high-res-f...

acoustic island Sep 25, 2025, 8:01 PM

#

Hello

drowsy needle Sep 25, 2025, 9:52 PM

#

acoustic island Hello

Hello

drifting hornet Sep 26, 2025, 12:14 AM

#

Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

flat star Sep 26, 2025, 2:04 AM

#

hoi

burnt raptor Sep 26, 2025, 4:52 AM

#

jaunty cave Yeah we didn't have the -fal provider suffix before. We added it now since we ar...

What models are used for video?

jaunty cave Sep 26, 2025, 4:53 AM

#

burnt raptor What models are used for video?

all the models are listed on the leaderboards: https://lmarena.ai/leaderboard/text-to-video https://lmarena.ai/leaderboard/image-to-video

zealous adder Sep 26, 2025, 6:39 AM

#

Flash slash

twin valve Sep 26, 2025, 9:51 AM

#

@twilit echo (dunno where to put this, since #ai-creations is now filled to the brim with pics)

I still check your awesome LLM chess benchmark to see whether the Elo changes are stabilizing or not. Gpt5 still seems to have potential to go up. O3 a bit less. Gpt 4.5 feels invincible - that thing can go real high but I guess it is expensive (dunno if it still plays, but the match with k2 should be relatively recent. Would it be possible to add dates just as reference?). Grok4 seems stable as well. gpt-3.5-turbo-instruct is still unclear. It still clobbers gpt5 sometimes.

#

Also other models keep coming but they get brought in clap city according to their scores.

#

it is mostly openAI and xAI. Gemini is strange though, its rating is far from stable if one sees the rating swings it gets.

#

and again this guy has gpt 35 at the bottom https://maxim-saplin.github.io/llm_chess/ but as you ( @twilit echo ) discovered, gpt 35 plays well if the opponent plays well and plays terribly if the opponent plays terribly. Since gpt 35 played only against the random player, it makes sense that it ends up at the bottom, and the benchmark maintainer will never notice that it is actually much stronger than that.

LLM Chess Leaderboard

LLM Chess Leaderboard comparing LLMs in simulated chess vs Random and Komodo Dragon. Rankings by Elo, win/loss, game duration, and tokens.

#

correction, the problem seems to be the number of illegal moves, since the support layer is not strictly helpful.

twilit echo Sep 26, 2025, 12:54 PM

#

twin valve <@126820015382069250> (dunno where to put this, since <#1344733249628541099> i...

4.5 was deprecated and shut down more than 2 months ago, so no newer games possible https://platform.openai.com/docs/deprecations#2025-04-14-gpt-4-5-preview
dates and other info doesnt fit the UI and overbloats everything. newest is top tho.

#

but since Elo is self-correcting the Elo changes even for retired models (since their former opponents change ratings, which gives less/more points adjusted in realtime). gpt-4.5 used to be higher elo (1817 continuation on retirement day)

twin valve Sep 26, 2025, 1:07 PM

#

twilit echo 4.5 was deprecated and shut down more than 2 months ago, so no newer games possi...

I thought that was deprecated but still available (though pricey)

Oh I didn't notice that you recompute the elo for everything rather than keeping it fixed "as it was".

twilit echo Sep 26, 2025, 1:08 PM

#

twin valve I thought that was deprecated but still available (though pricey) Oh I didn't n...

every single day all elo is recomputed from scratch. i don't store the elo values anywhere, everything is calculated on the fly with every new game

twilit echo Sep 26, 2025, 1:23 PM

#

but with a database used as source to compute everything, its trivial to revisit any desired point in time and compute all stats as they were precisely at that moment. so nothing is ever lost even if I don't store hard values, because the formulas are what matters, not the output
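
A rough sketch of the "recompute from the game log, as of any date" idea. The actual formulas behind the chess benchmark are not shown in this thread, so the standard Elo update, the K-factor, the starting rating and the record layout are all assumptions used as stand-ins.

```python
from datetime import date

K = 32          # assumed K-factor
START = 1500.0  # assumed starting rating


def expected(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))


def ratings_as_of(games, cutoff):
    """Recompute all ratings from scratch using only games up to cutoff.

    games: iterable of (played_on, player_a, player_b, score_a), where
    score_a is 1, 0.5 or 0. No rating is stored between calls.
    """
    ratings = {}
    for played_on, a, b, score_a in sorted(games, key=lambda g: g[0]):
        if played_on > cutoff:
            break
        r_a, r_b = ratings.get(a, START), ratings.get(b, START)
        delta = K * (score_a - expected(r_a, r_b))
        ratings[a], ratings[b] = r_a + delta, r_b - delta
    return ratings


# Hypothetical two-game history.
games = [
    (date(2025, 1, 1), "x", "y", 1),
    (date(2025, 6, 1), "y", "x", 1),
]
snapshot = ratings_as_of(games, date(2025, 3, 1))
```

Because only the game log is stored, calling `ratings_as_of` with an earlier cutoff reproduces the table as it stood on that day.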

jaunty cave Sep 26, 2025, 3:57 PM

#

twilit echo every single day all elo is recomputed from scratch. i don't store the elo value...

we do this at lmarena too haha
but that's because we're not actually using Elo but Bradley-Terry, which treats matches identically regardless of time

tender sigil Sep 26, 2025, 4:16 PM

#

jaunty cave we do this at lmarena too haha but because we're not actually using Elo, but Bra...

batch estimation if i’m correct? cool process 😄

jaunty cave Sep 26, 2025, 4:27 PM

#

tender sigil batch estimation if i’m correct? cool process 😄

yep, maximum likelihood estimation over the entire dataset rather than one row at a time
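
For readers who want to see what "MLE over the entire dataset" can look like, here is a small sketch using the classic MM (minorization-maximization) update for Bradley-Terry. It is not LMArena's code: the function name, the input format and the fixed iteration count are assumptions, and ties are assumed to be filtered out beforehand.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=500):
    """Batch MLE for Bradley-Terry strengths via the MM update.

    battles: list of (winner, loser) pairs. Every model is assumed to
    have at least one win and one loss, otherwise the estimate is not
    finite. Returns natural-log scores centered at 0.
    """
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # number of games per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                n / (p[i] + p[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            new_p[i] = wins[i] / denom
        # Only ratios of strengths are identified, so fix the scale
        # by making the log-scores average to zero.
        norm = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / norm for m, v in new_p.items()}
    return {m: math.log(v) for m, v in p.items()}
```

The whole list of battles is reduced to win counts and pair counts before fitting, so the order in which the votes arrived cannot affect the result, which is the "regardless of time" property mentioned above.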

tender sigil Sep 26, 2025, 5:44 PM

#

Got it — batch MLE for Bradley-Terry makes total sense! Just to clarify: when you say "not using Elo," is the displayed score scale (e.g., 1469, 1437) still a log-transformed version of the BT parameters (like Elo uses for readability), even though the underlying estimation is batched BT? And how do confidence intervals factor in — are they computed on the raw BT scale before scaling to the displayed numbers?

jaunty cave Sep 26, 2025, 5:50 PM

#

Yep both the scores and the standard errors (+/- values) are computed in what I call the "natural scale" which means initialized to 0, using natural log (base e) and no scaling factor.

The "Elo" scale uses log base 10 and a scaling factor of 400, and an initial rating of something like 1000, 1200 etc.
After we fit the scores we multiply them by 400/log(10) and then add 1000 to each. For the +/- we only do the scaling
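
The conversion described above fits in a few lines. The function and constant names are made up for illustration; the factor 400/log(10) and the offset of 1000 come from the message.

```python
import math

SCALE = 400 / math.log(10)  # natural-log units -> Elo-style points
OFFSET = 1000               # added to every score, not to the +/- values


def to_display(score, std_err):
    """Convert a natural-scale score and its +/- to the displayed scale."""
    return score * SCALE + OFFSET, std_err * SCALE


rating, plus_minus = to_display(0.0, 0.05)
```

A natural-scale score of 0 maps to 1000, and a gap of ln(10) between two models (10:1 odds) becomes a gap of 400 displayed points. The +/- is only multiplied by the scale because adding a constant does not change an uncertainty.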

tender sigil Sep 26, 2025, 6:26 PM

#

okay, yeah that makes a ton of sense! thanks for the clarity 😄

twin valve Sep 26, 2025, 6:38 PM

#

twilit echo every single day all elo is recomputed from scratch. i don't store the elo value...

I see, but as I understood it, since the previous games happened in the past, they have an influence on the new games (since Elo moves forward in time), but the newer games do not have an influence on previous games.

Hence a model that doesn't play anymore has the Elo fixed in time.

Am I understanding the approach?

(if yes) It would be different if you, say, would compute the elo regardless of time. So picking results randomly, many times, and then averaging the final result.
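
The "pick results in random order, many times, then average" idea can be sketched like this. It uses the standard Elo update with an assumed K-factor and starting rating; the function names and the game format are hypothetical.

```python
import random


def elo_pass(games, k=32, start=1500.0):
    """Run one sequential Elo pass over games in the given order."""
    ratings = {}
    for a, b, score_a in games:
        r_a, r_b = ratings.get(a, start), ratings.get(b, start)
        exp_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        delta = k * (score_a - exp_a)
        ratings[a], ratings[b] = r_a + delta, r_b - delta
    return ratings


def order_averaged_elo(games, rounds=200, seed=0):
    """Average final Elo ratings over many random orderings of the games."""
    rng = random.Random(seed)
    games = list(games)
    totals = {}
    for _ in range(rounds):
        rng.shuffle(games)
        for player, rating in elo_pass(games).items():
            totals[player] = totals.get(player, 0.0) + rating
    return {player: total / rounds for player, total in totals.items()}


# Hypothetical results: x beats y three times, y beats x once.
games = [("x", "y", 1)] * 3 + [("y", "x", 1)]
averaged = order_averaged_elo(games)
```

Averaging over shuffles removes the dependence on which game happened first, at the cost of running the whole history many times.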

tender sigil Sep 26, 2025, 8:26 PM

#

how are this many ppl that stupid to not read the actual channel they’re sending stuff in I don’t get it

twilit echo Sep 27, 2025, 12:07 AM

#

twin valve I see but how I understood it, since the previous games happened in the past, th...

no, because the elo gains/losses are not static. for example, even if gpt-4.5 never plays another game: if at the time it played, an opponent's elo was higher and gpt-4.5 gained +30 for that game, and 1 year later the ratings have adjusted so that this former opponent is now lower elo, the gain will auto adjust to e.g. +21 instead, automatically correcting even for past games

#

my system is more like retroactive Elo or Bayesian rating where the entire history is continuously reinterpreted based on all available information.
This approach is mathematically more sound because it uses maximum information to determine ratings

autumn gyro Sep 27, 2025, 8:00 AM

#

what is leaderboard in LMArena guys?!

twin valve Sep 27, 2025, 10:50 AM

#

twilit echo my system is more like retroactive Elo or Bayesian rating where the entire histo...

I see, so it is different from FIDE / chess.com / lichess approaches.

You look also forward to adjust values rather than computing purely chronologically.

twin valve Sep 27, 2025, 10:51 AM

#

tender sigil how are this many ppl that stupid to not read the actual channel they’re sending...

While I disagree on the tone, I agree on the gist of it.

I mean we are already at AGI/ASI level if people refuse to read instructions that LLM would read and spam the wrong channel.

idle ocean Sep 27, 2025, 2:33 PM

#

autumn gyro what is leaderboard in LMArena guys?!

https://lmarena.ai/leaderboard

glass viper Sep 27, 2025, 3:45 PM

#

nice

alpine bridge Sep 27, 2025, 3:46 PM

#

when will it finally update

fresh marsh Sep 27, 2025, 4:38 PM

#

Can you explain to us your thought process for putting this in the leaderboard channel?

#

OkAnd teaTime

uneven adder Sep 27, 2025, 4:41 PM

#

fresh marsh Can you explain to us your thought process for putting this in the leaderboard c...

That's up above his true paygrade, Dude. Lol..

sour tundra Sep 27, 2025, 8:53 PM

#

amazing

mystic cloak Sep 28, 2025, 3:08 AM

#

Sweet

worn ocean Sep 28, 2025, 11:28 AM

#

Hello

earnest path Sep 28, 2025, 11:55 AM

#

Ola gente

gentle pendant Sep 28, 2025, 12:52 PM

#

Hi

full parrot Sep 28, 2025, 1:29 PM

#

Hi

sudden surge Sep 28, 2025, 5:47 PM

#

HELLO

dim wyvern Sep 28, 2025, 6:05 PM

#

Helloo

bronze ivy Sep 28, 2025, 6:35 PM

#

Hello

tender sigil Sep 28, 2025, 10:30 PM

#

everybody treating this like the greeting channel again 😭

#

hopefully we see a leaderboard update tomorrow!

#

probably been enough time to gather enough votes for all of the new models

simple spindle Sep 28, 2025, 10:53 PM

#

hello

full parrot Sep 28, 2025, 10:59 PM

#

Hi

tender sigil Sep 28, 2025, 11:07 PM

#

shush

#

talk about the leaderboards

#

say hello elsewhere

uneven adder Sep 28, 2025, 11:23 PM

#

tender sigil shush

They'll keep coming..

#

Seems benchmaxxxed for some names..

robust zodiac Sep 29, 2025, 4:30 AM

#

tender sigil hopefully we see a leaderboard update tomorrow!

Which bet did u take haha 😂

tender sigil Sep 29, 2025, 4:30 AM

#

robust zodiac Which bet did u take haha 😂

pfft 😂😂

#

I LOVE POLYMARKET

#

I had good profit off of holding Yes for Alibaba #2 and Google #1 in style control markets

robust zodiac Sep 29, 2025, 4:31 AM

#

Good picks

#

I cashed out alibaba when it was at 97 i think

tender sigil Sep 29, 2025, 4:34 AM

#

yea I’ve cashed out some of my position to trade on other markets but

#

there’s some funny business moving in it currently

#

big buys on OpenAI yes and Alibaba no but

#

based off of GPT-5-High’s system prompt

#

I don’t have high hopes for its performance debut it doesn’t seem much stronger

#

October #2 will be interesting

#

Google has been increasing in price, not because it’ll be surpassed I don’t think but likely because new Gemini flash could debut at #2

robust zodiac Sep 29, 2025, 9:44 AM

#

tender sigil Google has been increasing in price, not because it’ll be surpassed I don’t thin...

Oh snap i didnt even consider that option

#

Flash no2 debut would be brutal

tawny narwhal Sep 29, 2025, 9:46 AM

#

When they update text leaderboard?

tender sigil Sep 29, 2025, 1:10 PM

#

Hopefully today!

loud lotus Sep 29, 2025, 1:42 PM

#

is long-cat-thinking here in test?

drowsy needle Sep 29, 2025, 3:21 PM

#

tawny narwhal When they update text leaderboard?

This can vary as it depends on how many votes we're collecting, but I would expect updates about once a week

amber comet Sep 29, 2025, 5:10 PM

#

@hallow grove Hello! Please check #1397655624103493813 to learn how to use the bot.

idle ocean Sep 29, 2025, 5:21 PM

#

drowsy needle This can vary as it depends on how many votes we're collecting, but would expect...

is there any rule for this? Like 10 thousand votes per or something?

drowsy needle Sep 29, 2025, 5:55 PM

#

idle ocean is there any rule for this? Like 10 thousand votes per or something?

There isn't going to be one single metric we go off of; before we post an update we validate the vote data to ensure accuracy.

tawdry socket Sep 29, 2025, 5:57 PM

#

Why isn't Claude 4.5 on leaderboards yet?

#

@drowsy needle When was Claude 4.5 added to arena ? today?

carmine oasis Sep 29, 2025, 5:58 PM

#

It's beigng added he said in #general

#

being*

tawdry socket Sep 29, 2025, 6:01 PM

#

Being added to leaderboard or added to the arena?

drowsy needle Sep 29, 2025, 6:01 PM

#

tawdry socket <@283397944160550928> When was Claude 4.5 added to arena ? today?

Right now ablobcheer

drowsy needle Sep 29, 2025, 6:01 PM

#

tawdry socket Being added to leaderboard or added to the arena?

That's TBD

tawdry socket Sep 29, 2025, 6:02 PM

#

Oh. So you just added in the arena. Once you have enough votes, it will be added to leaderboard?

drowsy needle Sep 29, 2025, 6:03 PM

#

tawdry socket Oh. So you just added in the arena. Once you have enough votes, it will be added...

WebDev Arena specifically atm. I'll post another update when added to Text.

#

And yes, once we collect enough votes and validate we'll push a leaderboard update. TBD on when that'll happen.

tawdry socket Sep 29, 2025, 6:04 PM

#

So, it is still not there on other arena like text yet?

drowsy needle Sep 29, 2025, 6:06 PM

#

tawdry socket So, it is still not there on other arena like text yet?

Correct

thorn scroll Sep 29, 2025, 6:22 PM

#

hi

robust zodiac Sep 29, 2025, 6:29 PM

#

drowsy needle Correct

is this the text arena one?

drowsy needle Sep 29, 2025, 6:29 PM

#

robust zodiac is this the text arena one?

Yes

twin valve Sep 29, 2025, 6:29 PM

#

uneven adder Seems benchmaxxxed for some names..

8 M votes (yes, I don't use that much video/picture mode)
🤤

We need that many in the text arena!

rose cradle Sep 29, 2025, 7:56 PM

#

I guess this is super minor, but mobile has some annoyances

tender sigil Sep 29, 2025, 9:17 PM

#

drowsy needle This can vary as it depends on how many votes we're collecting, but would expect...

it has been 11 days

#

completely understandable with the amount of new models that have debuted recently - wanting to release scores for them all at once but

#

@tawdry socket Claude Sonnet 4.5 is in Text Arena currently

#

I got it on a prompt earlier today

tawdry socket Sep 29, 2025, 9:26 PM

#

Nice. How come they didn't have this as one of the anonymous models? Claude doesn't do that?

#

@tender sigil

#

How is it anyway? I can't test it right now .

tender sigil Sep 29, 2025, 9:30 PM

#

tawdry socket Nice. How come they didn't have this as one of the anonymous models? Claude doesn...

Claude doesn’t do anonymous testing models

#

at least they don’t for full release

#

but I have never seen someone show evidence of a codenamed model being part of the Claude family

tawdry socket Sep 29, 2025, 10:19 PM

#

Makes sense. Claude could have shown off lmarena performance at least on web Arena to market their model like other companies do when they release...

tender sigil Sep 29, 2025, 10:36 PM

#

even more models added to arena omg 😭 September has been insane

fresh marsh Sep 29, 2025, 10:57 PM

#

tender sigil even more models added to arena omg 😭 September has been insane

Need gemini 3

#

OkAnd

tender sigil Sep 29, 2025, 10:57 PM

#

fingers crossed 🤞

radiant glacier Sep 29, 2025, 11:13 PM

#

any chances of the text leaderboard getting updated by tomorrow?

#

😄

drowsy needle Sep 29, 2025, 11:18 PM

#

radiant glacier any chances of the text leaderboard getting updated by tomorrow?

TBD! We'll be sure to update when it's ready!

prime valve Sep 30, 2025, 3:21 AM

#

yes

valid ridge Sep 30, 2025, 6:58 AM

#

good luck

spare slate Sep 30, 2025, 8:27 AM

#

I don't get it. How do you actually get the AI to create a video. All I'm seeing is this conversation

void pike Sep 30, 2025, 9:09 AM

#

#video-arena-3 a little baby come on classroom and background music video generate

carmine oasis Sep 30, 2025, 12:04 PM

#

When is sonnet added to leaderboard?

echo pewter Sep 30, 2025, 1:35 PM

#

Hlo

#

Woww

tropic pivot Sep 30, 2025, 1:40 PM

#

Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate content 🙂

ruby idol Sep 30, 2025, 1:41 PM

#

good

rose stream Sep 30, 2025, 1:42 PM

#

carmine oasis When is sonnet added to leaderboard?

leaderboard hasnt been updated in some time

#

i wonder why

carmine oasis Sep 30, 2025, 2:06 PM

#

Its not enough time with automation to put sonnet45 placement?

rose stream Sep 30, 2025, 2:31 PM

#

carmine oasis Its not enough time with automation to put sonnet45 placement?

nah its in the voting already. The leaderboard just doesnt update automatically

soft crater Sep 30, 2025, 2:51 PM

#

Every time a new famous model is released people start asking for a leaderboard update. My best guess is that the update is coordinated with the big companies. I'm not saying that it is manipulated, I'm just suggesting that there is a schedule.

tender sigil Sep 30, 2025, 3:19 PM

#

There has been a ridiculous amount of new models released recently so

#

it is likely they are trying to update all at once in a big batch - it’s possible we don’t even see a leaderboard update today

idle ocean Sep 30, 2025, 3:25 PM

#

Yeah, there's a lot of new models that need a lot of votes before the ratings can be accurate

loud lotus Sep 30, 2025, 3:30 PM

#

damn there are too many models. the rate of leaderboard upgrade cannot catch up with the rate of new models releasing

native bronze Sep 30, 2025, 4:29 PM

#

@tropic pivot

drowsy needle Sep 30, 2025, 4:32 PM

#

native bronze <@1407378663866896577>

Do you need help with something?

jaunty cave Sep 30, 2025, 5:52 PM

#

tender sigil how are this many ppl that stupid to not read the actual channel they’re sending...

Not LMArena but definitely leaderboard related. @twin valve as the resident chess expert what are your thoughts about the most recent chess ratings news? Seems like a previous change was not implemented correctly, and then applied retroactively with little/no warning?
https://x.com/BortnykChess/status/1972746172029936024

Something LMArena would definitely like to avoid

Oleksandr Bortnyk (@BortnykChess)

FIDE JUST STOLE MY RATING! How can they go back in time and take all my ratings? Absurd!

amber comet Sep 30, 2025, 5:57 PM

#

@crimson seal Please, check out #how-to-video-bot to learn how to properly prompt the bot.

tender sigil Sep 30, 2025, 6:05 PM

#

now up to 13 new models in the arena since the last LB update 😂

#

Wild times to be an arena fan!!

jaunty cave Sep 30, 2025, 6:10 PM

#

tender sigil now up to 13 new models in the arena since the last LB update 😂

updated 12 days ago, more than one model a day!!!

tender sigil Sep 30, 2025, 6:10 PM

#

I know right???

#

can’t keep up with getting to talk to all these new players 😂

jaunty cave Sep 30, 2025, 6:15 PM

#

tender sigil can’t keep up with getting to talk to all these new players 😂

we need to make battle royale mode, each prompt goes to 100 LLMs and the user eliminates them one by one

tender sigil Sep 30, 2025, 6:16 PM

#

bahahahahahahaha

#

or make 100 different bots play the same game or logic puzzle and steadily eliminate them when they miss a step

fresh marsh Sep 30, 2025, 6:57 PM

#

jaunty cave we need to make battle royale mode, each prompt goes to 100 LLMs and the user el...

honestly this is pretty cool lol

jaunty cave Sep 30, 2025, 7:12 PM

#

fresh marsh honestly this is pretty cool lol

would be crazy expensive lmao

fresh marsh Sep 30, 2025, 7:42 PM

#

jaunty cave would be crazy expensive lmao

imagine lmarena hosting an event or something though

#

like once a week or something

#

battle royale between all models like you said

#

lmao

drowsy needle Sep 30, 2025, 7:49 PM

#

@vale island be sure to review #1397655624103493813 for information on how to properly use Video Arena

twin valve Sep 30, 2025, 8:18 PM

#

jaunty cave Not LMArena but definitely leaderboard related. <@257929879163633680> as the res...

Hey! Chess expert is too much (I think @twilit echo is as expert as me if not more). I am interested in ratings, though. FIDE didn't apply them retroactively, they simply do a poor job. They noticed that people "farm" rating (nothing new really) and so they decided that for blitz and rapid time controls they would use the pure Elo formula - and not the 400 points cap - for all players over a certain rating. They did this in Dec 2024.

The problem? They forgot to implement it. Chess is full of nice ideas and poor executions (ironic for an intellectual game)

So thanks to Nakamura FIDE decided to create such a rule also for classical. Wrongly imo, because without enough simulations they simply wreck the work of Mr. Sonas (who introduced the most recent rating fix in March 2024). Simply adding a ton of edge case rules messes with the ratings (and then people foolishly compare ratings between eras, while a lot of small but important details changed in the meantime).

Hence Bortnyk, who together with Naroditsky plays a lot of blitz and rapid in Charlotte, cries about his rating. He is partially right, because FIDE did a poor job of implementing the change and thus it feels like a rating steal, but it is not.

IMO it would have been easier to say "nah, if players farm rating we simply decide to unrate the event for that player", which was always a FIDE prerogative (they did it already with Alireza in 2023)

#

The problem with FIDE is that they are treating the 2700 and 2800 group as a sort of marketing. As if lmarena would say "we cannot have ratings going stale or going down, we need new models to reach 1450, 1500, 1550 and so on. We need to pump the ratings for marketing". So they try to find rules to not erode the 2800/2700 group while not pumping it too much where it feels fake.

#

while in reality only difference matters in ratings.

#

btw on reddit /r/chess there are like 3-4 threads about it.

#

here is one: https://old.reddit.com/r/chess/comments/1ntwsen/the_fide_rating_change_is_retroactive/

#

here another from Naroditsky directly: https://www.reddit.com/r/chess/comments/1ntgf7q/fides_major_announcement_on_amendments_in_ratings/ngvwpfr/

jaunty cave Sep 30, 2025, 10:19 PM

#

👀 👀 👀

rose stream Oct 1, 2025, 8:58 AM

#

Does anyone know why Google has been in the top spot for some time on text?

#

Gemini is good but i rarely ever see it considered the top model by anyone

robust zodiac Oct 1, 2025, 9:46 AM

#

cause it wins battles constantly?

#

what people "feel" vs a measured deterministic algorithm is often not matching

quaint solstice Oct 1, 2025, 12:13 PM

#

Hello!!!

jaunty cave Oct 1, 2025, 2:12 PM

#

robust zodiac what people "feel" vs a measured deterministic algorithm is often not matching

Hello what's up? See the recent leaderboard update?

robust zodiac Oct 1, 2025, 2:13 PM

#

jaunty cave Hello what's up? See the recent leaderboard update?

hey! I was just replying to Yeeto

jaunty cave Oct 1, 2025, 2:21 PM

#

lol oops I meant to reply to Vitor 🤣

#

You are definitely right, Gemini ranks high because it consistently wins the comparisons

tender sigil Oct 1, 2025, 4:20 PM

#

Yes!!! new leaderboard update is awesome

tender sigil Oct 1, 2025, 4:59 PM

#

both new versions of Gemini Flash, new Qwen 3 max and both versions of v1, and both terminus variants of V3.1 - Sonnet 4.5 and GLM 4.6 should debut soon!

tame flint Oct 1, 2025, 7:31 PM

#

guys in webdev whats diff between qwen3-coder and qwen3-coder-plus ?

idle ocean Oct 1, 2025, 11:32 PM

#

qwen 3 coder plus is newer, and apparently worse?

fresh marsh Oct 2, 2025, 12:05 AM

#

idle ocean qwen 3 coder plus is newer, and apparently worse?

LMAO

drifting hornet Oct 2, 2025, 2:06 AM

#

Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

drifting hornet Oct 2, 2025, 3:52 PM

#

Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

jaunty cave Oct 2, 2025, 7:26 PM

#

tame flint guys in webdev whats diff between qwen3-coder and qwen3-coder-plus ?

qwen3-coder is this model https://github.com/QwenLM/Qwen3-Coder
and qwen3-coder-plus is an updated version from 9/23 https://www.alibabacloud.com/help/en/model-studio/qwen-coder#a3bbe78773cec

GitHub

GitHub - QwenLM/Qwen3-Coder: Qwen3-Coder is the code version of Qwe...

Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud. - QwenLM/Qwen3-Coder

Qwen-Coder model capabilities - Alibaba Cloud Model Studio - Alib...

Qwen-Coder model capabilities,Alibaba Cloud Model Studio:The Qwen3-Coder model has powerful coding capabilities. You can integrate it into your business using an API.

drowsy needle Oct 2, 2025, 7:59 PM

#

wooden mango Oct 2, 2025, 8:53 PM

#

Why is gemini 2.5 pro at the top

jaunty cave Oct 2, 2025, 11:49 PM

#

wooden mango Why is gemini 2.5 pro at the top

a lot of people vote for it over the other models it is compared against 🤷

wooden mango Oct 2, 2025, 11:50 PM

#

Really sherlock?

#

But it's garbage

drowsy needle Oct 3, 2025, 12:12 AM

#

wooden mango Really sherlock?

Reminder of one of our server rules:

✅ Treat others with Respect.

lilac current Oct 3, 2025, 6:35 AM

#

here is a working Sora 2 code - SYKQBJ

kind bane Oct 3, 2025, 9:26 AM

#

Thanks

hardy compass Oct 3, 2025, 11:03 AM

#

he is normally walking. Cinematic camera motion

rigid viper Oct 3, 2025, 11:53 AM

#

every time I do a prompt to generate a video, I can't find where it's generated

lyric furnace Oct 3, 2025, 1:33 PM

#

cat and lion world tur

naive vault Oct 3, 2025, 2:29 PM

#

Will sonnet 4.5 thinking ever be on the leaderboard

idle ocean Oct 3, 2025, 2:30 PM

#

was added a few days ago mate

#

don't expect it on the leaderboard that fast all the time

naive vault Oct 3, 2025, 2:31 PM

#

Wasn’t it added same time as 4.5

idle ocean Oct 3, 2025, 3:38 PM

#

Yeah, I'm not a lm arena insider but if I had to bet 4.5 thinking probably took first place by far and so they are doing their due diligence to check that that was legit.

#

the non thinking mode has 12 on each side error bars I think

drowsy needle Oct 3, 2025, 3:39 PM

#

rigid viper every time I do a prompt to generate a video, I can't find where it's generated

It looks like you haven't been prompting the bot correctly. Be sure to review the info in #1397655624103493813 as it should be helpful.

idle ocean Oct 3, 2025, 3:40 PM

#

naive vault Wasn’t it added same time as 4.5

yeah I remembered correctly, there's really large error bar for non thinking sonnet rn, so they probably are waiting until the bars get smaller to post sonnet thinking

naive vault Oct 3, 2025, 3:41 PM

#

Will be interesting to see if it surpasses 2.5 pro

idle ocean Oct 3, 2025, 5:54 PM

#

drowsy needle It looks like you haven't been prompting the bot correctly. Be sure to review th...

Hey pineapple, can you confirm, just for fun?

.

Pls?

drowsy needle Oct 3, 2025, 5:56 PM

#

idle ocean Hey pineapple, can you confirm, just for fun? . Pls?

Sry confirm what?

tropic pivot Oct 3, 2025, 5:58 PM

#

https://discordapp.com/channels/1340554757349179412/1397655624103493813

idle ocean Oct 3, 2025, 5:58 PM

#

idle ocean Yeah, I'm not a lm arena insider but if I had to bet 4.5 thinking probably took ...

This

small baby dog eyes

#

a really cute dog specifically

drowsy needle Oct 3, 2025, 6:03 PM

#

idle ocean This *small baby dog eyes*

Cute dog eyes difficult to resist, but yeah sorry can't provide any insider info

fresh marsh Oct 3, 2025, 6:44 PM

#

idle ocean Yeah, I'm not a lm arena insider but if I had to bet 4.5 thinking probably took ...

i doubt it honestly. i dont code but opus 4.1 16k thinking seems better for me than sonnet 4.5 32k thinking

idle ocean Oct 3, 2025, 6:45 PM

#

Huh

jaunty cave Oct 3, 2025, 8:49 PM

#

https://tenor.com/view/obiwan-kenobi-disturbance-in-the-force-star-wars-jedi-gif-10444289

fresh marsh Oct 3, 2025, 9:00 PM

#

?

#

I meant for general use

#

Not coding

#

Was only saying I doubt 4.5 thinking took 1st place by so far they're delaying the results to verify or whatever was proposed above

#

And now Clayton is making me think otherwise OkAnd

idle ocean Oct 3, 2025, 9:05 PM

#

Lol

fresh marsh Oct 3, 2025, 9:06 PM

#

OkAnd BeerTime

drowsy needle Oct 3, 2025, 10:20 PM

#

@proper wave be sure to review #1397655624103493813

hasty mica Oct 4, 2025, 5:03 AM

#

NICE!

edgy pendant Oct 4, 2025, 6:43 AM

#

bro please help me find the code for sora 2

undone tangle Oct 4, 2025, 4:34 PM

#

hello

safe thistle Oct 4, 2025, 5:28 PM

#

Nano-Banana dethroned.
https://x.com/arena/status/1974502371721162982

lmarena.ai (@arena)

🚨 Text-to-Image Leaderboard Shakeup!

Hunyuan Image 3.0 by @TencentHunyuan just stormed into the #1 spot in the Arena 🏆 - ranked as both the top overall and top open-source Text-to-Image model.

🖼️ This image generation model has leapfrogged over Seedream 4, and the famous

idle ocean Oct 4, 2025, 6:16 PM

#

safe thistle Nano-Banana dethroned. https://x.com/arena/status/1974502371721162982

still got image edit, which is its crown jewel. I'm surprised that Hunyuan did it though, would have expected Seedream

manic frost Oct 4, 2025, 7:52 PM

#

safe thistle Nano-Banana dethroned. https://x.com/arena/status/1974502371721162982

Bruh but Hunyuan 3.0 sucks

#

Its only a better version of DALL-E3

idle ocean Oct 4, 2025, 7:56 PM

#

I mean all of them are better versions of dall e when you think about it hard enough

autumn granite Oct 5, 2025, 3:24 AM

#

quartz mountain Oct 5, 2025, 6:30 AM

#

@proud crescent please check #1397655624103493813 to know how to properly use the Arena bot

frigid wigeon Oct 5, 2025, 7:11 AM

#

ok

tropic pivot Oct 5, 2025, 2:07 PM

#

@calm kraken @summer lantern Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate

solemn locust Oct 5, 2025, 3:36 PM

#

Hello i need coupon flova

idle ocean Oct 5, 2025, 4:14 PM

#

Please check ⁠how-to-video-bot to learn how to generate

#

the 2 messages above this are literally about this

#

my brain

ocean roost Oct 5, 2025, 4:17 PM

#

Hello @errant tinsel Please check https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to use the bot and https://discord.com/channels/1340554757349179412/1397655695150682194 https://discord.com/channels/1340554757349179412/1400148557427904664 https://discord.com/channels/1340554757349179412/1400148597768720384 for your creations.

jaunty cave Oct 6, 2025, 3:04 AM

#

autumn granite

super cool analysis! Thanks for sharing

viscid palm Oct 6, 2025, 5:26 AM

#

In the leaderboards, Claude 4.5 sonnet is ranked below Deepseek R1 and gpt5.

However from my usage, I've seen that sonnet makes way better frontend UIs compared to those above it.
Why is it still ranked so below?

late ridge Oct 6, 2025, 5:43 AM

#

nice

jaunty cave Oct 6, 2025, 6:20 AM

#

viscid palm In the leaderboards, Claude 4.5 sonnet is ranked below Deepseek R1 and gpt5. Ho...

which leaderboard are you talking about? On our main leaderboard claude 4.5 sonnet is above both of those
https://lmarena.ai/leaderboard/text/overall

viscid palm Oct 6, 2025, 6:23 AM

#

Here
https://web.lmarena.ai/leaderboard

#

gpt 5 at the top for web dev

#

but purely frontend speaking, claude sonnet performs better from my experience

modest pewter Oct 6, 2025, 9:40 AM

#

edgy pendant bro please help me find out the code for Sora 2

this channel is full of toddlers

modest pewter Oct 6, 2025, 9:41 AM

#

viscid palm gpt 5 at the top for web dev

are you ok dude

viscid palm Oct 6, 2025, 10:01 AM

#

modest pewter are you ok dude

I love you

modest pewter Oct 6, 2025, 10:04 AM

#

k

idle ocean Oct 6, 2025, 12:38 PM

#

viscid palm Here https://web.lmarena.ai/leaderboard

Thats 4.1 not 4.5

viscid palm Oct 6, 2025, 12:38 PM

#

?

idle ocean Oct 6, 2025, 12:38 PM

#

viscid palm Here https://web.lmarena.ai/leaderboard

If you see 4.5 then it hasnt been put on the leaderboard yet

idle ocean Oct 6, 2025, 12:39 PM

#

viscid palm ?

The 4.5 model that shows up isn't the thinking model

viscid palm Oct 6, 2025, 12:40 PM

#

Meh even non-thinking performs better than gpt 5

idle ocean Oct 6, 2025, 12:40 PM

#

If you say so

viscid palm Oct 6, 2025, 12:41 PM

#

Although I just realised the gpt 5 here is the "high" model, particularly not the one that is available to everyone for free so idk.

drowsy needle Oct 6, 2025, 3:39 PM

#

@icy girder be sure to check out #1397655624103493813 for instructions on how the video bot works.

rancid imp Oct 6, 2025, 3:47 PM

#

Good app

twin valve Oct 6, 2025, 3:55 PM

#

viscid palm In the leaderboards, Claude 4.5 sonnet is ranked below Deepseek R1 and gpt5. Ho...

see what I mean @idle ocean ? "In my use case model X is better than Y, I think it should be higher despite all other use cases that aren't mine"

idle ocean Oct 6, 2025, 3:56 PM

#

? Why am i being pinged

twin valve Oct 6, 2025, 3:57 PM

#

for the discussion we had in the other channel about the topic. The quoted message is a perfect example of what I meant. But If that is "too many messages ago" then nvm.

idle ocean Oct 6, 2025, 3:58 PM

#

I mean i agree that people are like that..

twin valve Oct 6, 2025, 3:58 PM

#

ok

sinful pulsar Oct 6, 2025, 4:00 PM

#

Ok

viscid palm Oct 6, 2025, 4:13 PM

#

twin valve see what I mean <@800105797496864779> ? "In my use case model X is better than Y...

I think it's just my unique skill of prompting 😎

amber comet Oct 6, 2025, 5:38 PM

#

Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

jaunty cave Oct 6, 2025, 6:01 PM

#

viscid palm but purely frontend speaking, claude sonnet performs better from my experience

Interesting, when you use it are you using it with thinking or without? sonnet 4.5 with thinking isn't on the webdev arena leaderboard yet

viscid palm Oct 6, 2025, 6:02 PM

#

Non-thinking

#

Even 4.1 beats gpt 5 (free) and 2.5 pro Gemini (from my experience)

#

But this is purely frontend-wise. Maybe web dev on the leaderboard is a culmination of all things web dev.

coral pulsar Oct 6, 2025, 6:24 PM

#

Hi

idle ocean Oct 6, 2025, 6:58 PM

#

viscid palm But this is purely frontend-wise. Maybe web dev on the leaderboard is a culminat...

have you actually tried web dev arena?

drowsy needle Oct 6, 2025, 9:17 PM

#

@wise depot for sonnet-4-5 I believe it's ranked higher despite having a lower score because the upper bound of its CI could be higher than those of the models ranked below it.

#

For those having the same rank & CI I'm not sure why they're listed this way. Will ask and follow up when I know more.
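A minimal sketch of how a rank can end up higher than the raw score suggests, assuming the rank is derived from confidence intervals rather than from the point score. The rule used here (rank = 1 + the number of models whose lower bound is strictly above this model's upper bound) is an assumption for illustration, and the model names and numbers are hypothetical.

```python
from typing import Dict, Tuple

# name -> (score, ci_lower, ci_upper); all values below are made up.
Intervals = Dict[str, Tuple[float, float, float]]


def rank_by_interval(models: Intervals) -> Dict[str, int]:
    """Assumed rule: a model's rank is 1 plus the number of models that are
    clearly better, i.e. whose lower bound exceeds this model's upper bound."""
    ranks = {}
    for name, (_, _, upper) in models.items():
        clearly_better = sum(
            1
            for other, (_, lower, _) in models.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


example: Intervals = {
    "model-a": (1452.0, 1447.0, 1457.0),
    "model-b": (1440.0, 1425.0, 1455.0),  # lowest of the top three, wide interval
    "model-c": (1442.0, 1438.0, 1446.0),  # higher score than b, narrow interval
    "model-d": (1420.0, 1415.0, 1425.0),
}

print(rank_by_interval(example))
# {'model-a': 1, 'model-b': 1, 'model-c': 2, 'model-d': 3}
```

Here model-b has a lower score than model-c but a better rank, because its wide interval still overlaps model-a's while model-c's narrow interval does not.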

idle ocean Oct 6, 2025, 9:46 PM

#

drowsy needle

https://discord.com/channels/1340554757349179412/1423698152959119501

relevant thread

ashen kelp Oct 6, 2025, 11:29 PM

#

yo

drowsy needle Oct 7, 2025, 12:48 AM

#

drowsy needle For these having the same rank & CI I'm not sure why they're listed this way. Wi...

So I don't have an update for you specific to this; however, once raised with the team it created a lot of discussion. @idle ocean

idle ocean Oct 7, 2025, 12:49 AM

#

thx

rough cliff Oct 7, 2025, 2:40 AM

#

Please head to #1397655624103493813 for a detailed guide on how to use the bot

jovial vector Oct 7, 2025, 11:19 AM

#

Hello

loud lotus Oct 7, 2025, 2:09 PM

#

does gemini3 show up in arena?

idle ocean Oct 7, 2025, 2:42 PM

#

Not out yet

robust zodiac Oct 7, 2025, 2:56 PM

#

is there anything even known about gemini 3 being out? I know there's some codenamed models but neither really sticks out to me as a potential gemini 3 so far

idle ocean Oct 7, 2025, 3:24 PM

#

not really, supposedly some ab tests on studio but that's it

robust zodiac Oct 7, 2025, 3:50 PM

#

that's what I thought but I see some people get super hyped about it that I dont even know what to think lol

remote trail Oct 7, 2025, 5:11 PM

#

robust zodiac that's what I thought but I see some people get super hyped about it that I dont...

comes out in 2 days/weeks/months

#

(according to some sources, in 2 days [thursday], which i still doubt)

#

but in january we will have it, 100% sure

#

highest probability: this quarter

idle ocean Oct 7, 2025, 5:17 PM

#

The way some people were talking, it was as if it was already released, which was annoying

drowsy needle Oct 7, 2025, 5:38 PM

#

@slow solstice be sure to check out #1397655624103493813 for information on how to use the bot properly.

#

Let me know if you have any questions.

fresh marsh Oct 7, 2025, 6:51 PM

#

idle ocean The way some people were talking, it was if it was already released which was an...

might be referring to the AB tests

#

people see in the networking data on those that it has an entirely different version number than 2.5 pro or 2.5 flash

#

and that apparently it's much better

#

not sure if i believe that theyd release a new flash preview end of september for 2.5 if theyre releasing 3.0 pro in october though

#

seems stupid

idle ocean Oct 7, 2025, 6:52 PM

#

I have seen a better/different feeling model than 2.5

#

true about the flash thing that baffles me

#

they slowly tweaked 2.5 pro over its lifetime, maybe they had a bunch of changes to 2.5 and for some reason ended up just doing it all at once

fresh marsh Oct 7, 2025, 6:53 PM

#

idle ocean they slowly tweaked 2.5 pro over its lifetime, maybe they had a bunch of changes...

ive read they always update their current model in october and release the new one in december 2 years in a row now?

#

haven't looked into it myself though

idle ocean Oct 7, 2025, 6:54 PM

#

I have never heard of such a thing, but I can look into that

#

the December claim is a little stretched, cause 1.0 and 2.0 were both released in December, but plenty of other models get released at other times

median briar Oct 7, 2025, 8:52 PM

#

I think they were unsure if releasing a new 2.5 version was a good idea (but had a smaller team working on it the whole time) and now they have decided to release it (idk why).

#

But I bet the reason why they did not update 2.5 pro (which they probably have a better version of internally) is to make 3.0 flash look better.

#

And stick to the claim of 3 flash outperforming 2.5 pro

autumn granite Oct 7, 2025, 9:30 PM

#

median briar I think they were unsure if releasing a new 2.5 version was a good idea (but had...

Well they recently released Gemini for Chrome, maybe that's why

autumn granite Oct 8, 2025, 5:11 AM

#

Sonnet 4.5 on Web Dev Arena leaderboard...

frozen knot Oct 8, 2025, 6:21 AM

#

that's non-thinking tho

#

also gpt 5 being at the top is weird it always makes like the exact same ui

rose stream Oct 8, 2025, 7:35 AM

#

frozen knot also gpt 5 being at the top is weird it always makes like the exact same ui

but gpt-5 UIs look so much better than sonnet

#

Anthropic models just spam gradients and their spacing is really off a lot of the time

frail pecan Oct 8, 2025, 4:53 PM

#

guys do you know how can I use the Claude sonnet 4.5 thinking? I think it’s different from the external thinking function right?

fresh marsh Oct 8, 2025, 5:07 PM

#

frail pecan guys do you know how can I use the Claude sonnet 4.5 thinking? I think it’s diff...

?

idle ocean Oct 8, 2025, 5:43 PM

#

frail pecan guys do you know how can I use the Claude sonnet 4.5 thinking? I think it’s diff...

??

frail pecan Oct 8, 2025, 5:44 PM

#

idle ocean ??

Or they’re same? Sorry that I’m not familiar with Claude🫡

jaunty cave Oct 8, 2025, 6:11 PM

#

frail pecan Or they’re same? Sorry that I’m not familiar with Claude🫡

You can select direct chat in the upper left and then select claude sonnet thinking model.

Also are you a fan of clearlove the Chinese jungler?

frail pecan Oct 8, 2025, 6:19 PM

#

jaunty cave You can select direct chat in the upper left and then select claude sonnet think...

I mean through the Claude app or through their official website to use the thinking model, not at the Lmarena haha. And yes, I am a fan of the Chinese jungler Clearlove🎉

jaunty cave Oct 8, 2025, 6:30 PM

#

oh, sry I don't actually know all about how to use Claude through their app, feel free to use it on LMArena though. I was a huge fan of clearlove too. Amazing career

#

Some lore about me, I started getting interested in statistics and machine learning by ranking League of Legends pros. ClearLove had amazing stats. here's a post I made 9 years ago. https://www.reddit.com/r/leagueoflegends/comments/4q783c/kda_rankings_in_professional_play_over_9200_pro/

Now almost 10 years later I work on leaderboards of AI models instead 😄

From the leagueoflegends community on Reddit: KDA Rankings in Profe...

Explore this post and more from the leagueoflegends community

drowsy needle Oct 8, 2025, 6:35 PM

#

jaunty cave Some lore about me, I started getting interested in statistics and machine learn...

I love learning Clayton lore

idle ocean Oct 8, 2025, 6:37 PM

#

wow, 9 years ago, back then I was younger

frail pecan Oct 8, 2025, 6:50 PM

#

jaunty cave Some lore about me, I started getting interested in statistics and machine learn...

wowww that’s incredible!! As he has retired and became a coach a long time ago, it’s really not easy to meet his fans nowadays. So happy to meet u!

marble bluff Oct 9, 2025, 2:06 AM

#

@uncut ginkgo Please check #1397655624103493813 to learn how to use the bot.

modest pewter Oct 9, 2025, 9:46 AM

#

frail pecan I mean through the Claude app or through their official website to use the thin...

You Must Sacrifice Your Wallet To Dario

rough cliff Oct 9, 2025, 1:32 PM

#

@cerulean grotto Please head to #1397655624103493813 for a detailed guide on how to use the bot

potent cave Oct 9, 2025, 5:42 PM

#

/video promt

amber comet Oct 9, 2025, 5:45 PM

#

potent cave /video promt

Please head to (https://discord.com/channels/1340554757349179412/1397655624103493813) for a detailed guide on how to use the bot.

drifting hornet Oct 9, 2025, 7:01 PM

#

@terse hornet Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

#

@foggy agate Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

drifting hornet Oct 9, 2025, 7:52 PM

#

Please, check out ⁠https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

jaunty cave Oct 10, 2025, 2:13 AM

#

search arena leaderboard updated today, 20k new votes, not much actually changed though lol. We need some new search models

fresh marsh Oct 10, 2025, 5:40 AM

#

jaunty cave search arena leaderboard updated today, 20k new votes, not much actually changed...

Gemini 3.0 search

#

OkAnd

jaunty cave Oct 10, 2025, 5:41 AM

#

what in the world does :malicepfpmoment: mean?

fresh marsh Oct 10, 2025, 5:42 AM

#

jaunty cave what in the world does :malicepfpmoment: mean?

LMAO

#

malicepfpmoment

#

Hard to explain, inside joke someone into sneakers might get

delicate acorn Oct 10, 2025, 7:08 AM

#

helo

twin valve Oct 10, 2025, 11:34 AM

#

jaunty cave search arena leaderboard updated today, 20k new votes, not much actually changed...

search is so underrated because it is one of the most immediate (yet very useful) applications of LLMs (or even agentic LLMs for deep search)

idle ocean Oct 10, 2025, 12:10 PM

#

Yeah

#

Only xAI seems to have gunned for it

fresh marsh Oct 10, 2025, 12:28 PM

#

jaunty cave search arena leaderboard updated today, 20k new votes, not much actually changed...

dont understand how grok beats google at search lmao

#

not saying its worse just think its funny

idle ocean Oct 10, 2025, 1:01 PM

#

Cause gemini sucks at search

twin valve Oct 10, 2025, 1:59 PM

#

fresh marsh dont understand how grok beats google at search lmao

either

they don't provide their best search model on lmarena (AFAIK the search models are more or less similar to the free and pro subscription models, the ones around $20, not the ones that cost more)
or search up to a certain level of complexity has no moat
or the search questions aren't too tricky, so many models perform similarly (the lmarena search started much later than the lmarena text, so the models there are more or less all recent)

Also not all vendors are in the lmarena. For example mistral has search abilities as cohere does, but they aren't in the arena.

idle ocean Oct 10, 2025, 2:04 PM

#

nah

#

you are both wrong

#

the reason why Gemini is so bad at search is 78% of the time the links it posts don't work

#

that's it

#

Sometimes the best answer is the simplest one

twin valve Oct 10, 2025, 2:38 PM

#

Interesting because in my cases they work (I mean those under sources) but yeah could be the case too if I got consistently lucky.

radiant glacier Oct 11, 2025, 2:12 AM

#

why is qwen ranked so well?

#

every time I've tested it the answers were terrible relative to lower ranked models

jaunty cave Oct 11, 2025, 2:21 AM

#

What sort of questions do you usually ask?

fresh marsh Oct 11, 2025, 2:38 AM

#

radiant glacier every time I've tested it the answers were terrible related to lower ranked mode...

It was better than 2.5 pro for medical questions for me tbh

#

As in more accurate to reality information

#

I think I used qwen max preview not the other qwen max

#

But don't remember

cerulean bane Oct 11, 2025, 6:06 AM

#

@cerulean bane

marble bluff Oct 11, 2025, 11:15 AM

#

@jade sundial Please check #1397655624103493813 to learn how to use the bot.

wooden mango Oct 11, 2025, 2:49 PM

#

Please go to #1397655624103493813 to learn how to use the bot

tropic pivot Oct 11, 2025, 3:45 PM

#

Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content 🙂

dark solar Oct 11, 2025, 5:12 PM

#

👍

frozen knot Oct 12, 2025, 2:29 AM

#

radiant glacier why is qwen ranked so well?

Because questions are skewed by popularity in deliverables and coding which qwen is fine at, if you look at individual category scores that will give you a better idea

fresh marsh Oct 12, 2025, 2:32 AM

#

frozen knot Because questions are skewed by popularity in deliverables and coding which qwen...

Deliverables?

vestal quiver Oct 12, 2025, 2:32 AM

#

Why hasn't LMArena updated the text-to-video and image-to-video leaderboards for almost a month?

fresh marsh Oct 12, 2025, 2:37 AM

#

vestal quiver Why hasn't LMArena updated the text-to-video and image-to-video leaderboards for...

I know they apparently don't show the model until after people besides the person who generated it vote

#

I wonder if theyre not counting votes either until then?

fresh marsh Oct 12, 2025, 2:37 AM

#

fresh marsh I wonder if theyre not counting votes either until then?

@drowsy needle Do you know this?

quartz mountain Oct 12, 2025, 4:15 AM

#

@dreamy crater please head to -> #1397655624103493813 to learn how to prompt the bot and in what channels it is available for use

#

@last storm please review the info on how to use the Arena bot and in what channels by going to -> #1397655624103493813

queen ivy Oct 12, 2025, 8:34 AM

#

looking for feedback on what the community considers a good AI model with so many to choose from. Saves time.

empty whale Oct 12, 2025, 8:59 AM

#

i need a vid presenting future tense with will and going to

wooden mango Oct 12, 2025, 11:00 AM

#

queen ivy looking for feedback on what the community considers a good AI model with so man...

Grok 1

brave cobalt Oct 12, 2025, 12:06 PM

#

nano

drowsy needle Oct 12, 2025, 2:07 PM

#

fresh marsh I wonder if theyre not counting votes either until then?

Sry for the delay - it's the first two votes that are counted on the leaderboard, which is prior to the names of the models being revealed. After that, the votes don't contribute. There is a way to filter the leaderboards by votes only from the person that generated it.

drowsy needle Oct 12, 2025, 2:07 PM

#

vestal quiver Why hasn't LMArena updated the text-to-video and image-to-video leaderboards for...

It takes a bit to gather votes.

#

Would encourage others to vote on their preferences in the video arena channels!

marble bluff Oct 12, 2025, 3:44 PM

#

@lilac swallow Please check #1397655624103493813 to learn how to use the bot.

fresh marsh Oct 12, 2025, 3:48 PM

#

drowsy needle It takes a bit to gather votes.

If only the creator votes (or only someone else and not creator) and not a second person does it still get collected and used as data?

light mirage Oct 12, 2025, 10:14 PM

#

How to create videos?

wooden mango Oct 12, 2025, 11:06 PM

#

light mirage How to create videos?

Go to #1397655624103493813 and learn how to use the bot to create videos!


tender sigil Oct 13, 2025, 1:01 AM

#

this channel for some reason

fresh marsh Oct 13, 2025, 1:21 PM

#

tender sigil this channel for some reason

LMAO

idle ocean Oct 13, 2025, 1:31 PM

#

Yeah

rough cliff Oct 13, 2025, 1:48 PM

#

@grizzled dust Please head to #1397655624103493813 for a detailed guide on how to use the bot

tropic pivot Oct 13, 2025, 2:18 PM

#

Please go to https://discordapp.com/channels/1340554757349179412/1397655624103493813 to learn how to generate 🙂

runic glen Oct 13, 2025, 2:44 PM

#

How to generate 3D video?

drowsy needle Oct 13, 2025, 3:46 PM

#

fresh marsh If only the creator votes (or only someone else and not creator) and not a secon...

IIRC first and second votes are what generates the leaderboards regardless if the creator votes or not. Will double check and keep you updated.

drowsy needle Oct 13, 2025, 8:00 PM

#

drowsy needle IIRC first and second votes are what generates the leaderboards regardless if th...

Confirmed if the prompter doesn't vote, that doesn't mean the first 2 votes don't contribute to the leaderboards.

fresh marsh Oct 13, 2025, 9:21 PM

#

drowsy needle Confirmed if the prompter doesn't vote, that doesn't mean the first 2 votes don'...

milkleedleleedleleedlelee

jaunty cave Oct 14, 2025, 1:05 AM

#

drowsy needle Confirmed if the prompter doesn't vote, that doesn't mean the first 2 votes don'...

oops I think we miscommunicated here. The first 2 votes would still contribute to the overall leaderboard. But the author vote will not contribute to the "author vote" category leaderboard like it usually would.

autumn granite Oct 14, 2025, 8:33 AM

#

GLM 4.6 is on Web Dev leaderboard on lmarena.ai but not on web.lmarena.ai?

#

#

From Web Dev category on lmarena.ai

#

I think it's interesting how the order is so different from the coding category on the text leaderboard

twin valve Oct 14, 2025, 9:47 AM

#

@drowsy needle

twin valve Oct 14, 2025, 9:49 AM

#

autumn granite I think it's interesting how the order is so different from the coding category ...

well in the webdev arena one expects a visible output (a sort of webpage). The user doesn't look at the code (at least I think)

in text arena one looks at the code.

the evaluation (in terms of what is checked) is different

autumn granite Oct 14, 2025, 10:35 AM

#

twin valve well in the webdev arena one expects a visible output (a sort of webpage). The u...

Yeah there's that aspect of it, but I also think the coding subcategory is kinda unreliable.

It was interesting how filtering out battles where a model faced Sonnet < 15 times changed the rankings so much (Sonnet 4 from #10 to #4, all without style control).

wooden mango Oct 14, 2025, 10:37 AM

#

autumn granite

Is deepseek really decent?

#

Haven't tried it for coding yet

autumn granite Oct 14, 2025, 10:38 AM

#

wooden mango Is deepseek really decent?

R1 seemed strong when it was released, not sure about now.

#

This is just Web Dev Arena like Pier said, so actual agentic coding performance might differ.

twin valve Oct 14, 2025, 2:22 PM

#

autumn granite Yeah there's that aspect of it, but I also think the coding subcategory is kinda...

yes, hence I asked you to do that as you are quick.

The point of the "unreliability" is simply that ratings have a lot of nuances. With the filtering we are checking the results within models that matched each other, rather than considering other models as well. It is a property of the ratings.

Otherwise one should generate other ratings (with filtering). Hopefully one can do that once we get all the results of the battles.

#

the best IMO would be like having a sort of meta benchmark, with many different (somewhat reliable) benchmarks taken in account.

Some did those, but they don't get updated that often.

Example: https://x.com/scaling01/status/1919217718420508782

Lisan al Gaib (@scaling01)

The Ultimate LLM Meta-Leaderboard averaged across the 28 best benchmarks

Gemini 2.5 Pro > o3 > Sonnet 3.7 Thinking
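A rough sketch of what such a meta-benchmark could look like, assuming each benchmark is min-max normalized and a model's meta score is the plain average over the benchmarks it appears in. This is only one possible aggregation and not necessarily what the linked meta-leaderboard does; the benchmark names, model names, and scores are hypothetical.

```python
from statistics import mean
from typing import Dict

# benchmark -> {model: raw score}; all values below are made up.
Results = Dict[str, Dict[str, float]]


def meta_scores(results: Results) -> Dict[str, float]:
    """Min-max normalize each benchmark to [0, 1], then average per model
    over the benchmarks that include that model."""
    normalized: Dict[str, list] = {}
    for scores in results.values():
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for model, raw in scores.items():
            # If every model scored the same, the benchmark carries no signal.
            value = 0.5 if span == 0 else (raw - lo) / span
            normalized.setdefault(model, []).append(value)
    return {model: mean(values) for model, values in normalized.items()}


example: Results = {
    "bench-1": {"x": 80, "y": 60, "z": 70},
    "bench-2": {"x": 50, "y": 60, "z": 20},
}

ranking = sorted(meta_scores(example).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # [('x', 0.875), ('y', 0.5), ('z', 0.25)]
```

The normalization choice matters: min-max scaling lets one outlier model stretch a benchmark's range, so a rank-based or z-score aggregation could give a different order.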

autumn granite Oct 14, 2025, 2:47 PM

#

twin valve yes, hence I asked you to do that as you are quick. The point of the "unreliabi...

Thanks for the insight, I wasn't quite sure what to make of it.

#

LiveBench agentic coding scores got updated (I think they increased the number of maximum turns).

idle ocean Oct 14, 2025, 2:55 PM

#

5 Pro does badly, weird

fresh marsh Oct 14, 2025, 4:02 PM

#

autumn granite LiveBench agentic coding scores got updated (I think they increased the number o...

LOL at gpt 5 pro being worse than gpt 5 medium high and.... mini high

#

And sonnet 4.5 non thinking

fresh marsh Oct 14, 2025, 4:08 PM

#

twin valve the best IMO would be like having a sort of meta benchmark, with many different ...

I think the number one indication of unreliability is gemini winning anything rn

#

Lol

fresh marsh Oct 14, 2025, 4:08 PM

#

twin valve the best IMO would be like having a sort of meta benchmark, with many different ...

Nvm that's old

twin valve Oct 14, 2025, 4:26 PM

#

fresh marsh I think the number one indication of unreliability is gemini winning anything rn

that link is a meta leaderboard from May 2025. It is not new. If you meant that gemini 2.5 Pro wins because in that link it wins

#

hence my point "I wish such meta leaderboards would be done more often"

fresh marsh Oct 14, 2025, 4:28 PM

#

twin valve that link is a meta leaderboard from May 2025. It is not new. If you meant that ...

Yeah nvm didn't look closely enough

drowsy needle Oct 14, 2025, 5:38 PM

#

idle ocean Oct 14, 2025, 5:38 PM

#

sora 2 doesn't beat veo 3?

fresh marsh Oct 14, 2025, 6:01 PM

#

idle ocean sora 2 doesn't beat veo 3?

thats what i was saying

#

thats unbelievable to me

wooden mango Oct 14, 2025, 8:02 PM

#

Veo 3 is kinda nice

viscid haven Oct 15, 2025, 9:13 AM

#

Hi

marble bluff Oct 15, 2025, 11:08 AM

#

Hello! @hexed patio Please check #1397655624103493813 to learn how to generate videos.

hidden cedar Oct 15, 2025, 11:09 AM

#

hello

wooden mango Oct 15, 2025, 11:29 AM

#

Hmm fried chicken yess

glacial wigeon Oct 15, 2025, 1:24 PM

#

nice one

timber stone Oct 15, 2025, 4:09 PM

#

gemini 2.5

high pilot Oct 15, 2025, 5:17 PM

#

hey, how can I find my videos??

wooden mango Oct 15, 2025, 5:20 PM

#

high pilot hey, how can find my videos??

Check the Bot's messages @candid hill

night scroll Oct 15, 2025, 5:23 PM

#

@lofty solstice Note that this is an English only server.

neat quarry Oct 15, 2025, 5:53 PM

#

wooden mango Oct 15, 2025, 7:19 PM

#

neat quarry

That's nice

#

What did you use?

fringe ridge Oct 15, 2025, 9:10 PM

#

hi

tight rock Oct 15, 2025, 9:45 PM

#

Hey guys 👋 , I just want to know how often the text leaderboard updates. I see that the latest update was on 8 Oct. If Google releases a new model in late October, will this be reflected on the leaderboard soon?

drowsy needle Oct 15, 2025, 9:46 PM

#

tight rock Hey guys 👋 , I just want to know how often the text leaderboard updates. I see ...

It takes around a week normally for an update to happen.

tight rock Oct 15, 2025, 9:49 PM

#

drowsy needle It takes around a week normally for an update to happen.

thx! I tried to ask AI but got no reliable answers 👀

tight rock Oct 15, 2025, 9:50 PM

#

drowsy needle It takes around a week normally for an update to happen.

so can I expect an update in one or two days?

drowsy needle Oct 15, 2025, 9:52 PM

#

tight rock so can I expect an update in one or two days?

It's possible! Bit hard to say specifically as it takes time for us to collect votes and validate before posting an update, so it's a bit TBD.

fresh marsh Oct 15, 2025, 9:56 PM

#

drowsy needle It takes around a week normally for an update to happen.

"If Google releases a new model in late October, will this be reflected on the leaderboard soon?"

he's buying on polymarket

#

😭

upbeat swift Oct 16, 2025, 5:53 AM

#

Saw this paper a couple days ago: https://arxiv.org/pdf/2510.02306
It makes use of lmarena datasets (text, search, vision) and compares several rating systems: Elo, Glicko, TrueSkill, and an online variant of Bradley-Terry.

They claim that by just ignoring data from ties you get better predictive accuracy (even on datasets including ties). It's cool they released code too.

I have some issues with it though, mainly that it only uses dynamic rating systems rather than true Bradley-Terry. In my experience the dynamic rating systems are much more noisy and have recency bias on arena data. Also in my thinking Elo and online Bradley-Terry should be the same thing. 🤔

Anyway, thought it was interesting, wonder if anyone else has seen this or has done experiments specifically about ties.
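To make the "Elo and online Bradley-Terry should be the same thing" intuition concrete, here is a small sketch. It assumes the logistic (base-10, 400-point) parametrization of Bradley-Terry and a constant step size; under those assumptions one stochastic gradient step on the Bradley-Terry log-likelihood has the same form as the Elo update, with K equal to the learning rate times ln(10)/400. The online variant in the paper may differ (for example in its step-size schedule), and the function names here are made up for illustration.

```python
import math

# Converts a rating difference on the Elo scale into a logit.
SCALE = math.log(10) / 400


def win_prob(ra: float, rb: float) -> float:
    """Bradley-Terry / Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))


def elo_update(ra: float, rb: float, y: float, k: float = 32.0):
    """Classic Elo update; y is 1.0 if A won, 0.0 if B won."""
    p = win_prob(ra, rb)
    return ra + k * (y - p), rb - k * (y - p)


def bt_sgd_update(ra: float, rb: float, y: float, lr: float):
    """One gradient-ascent step on the Bradley-Terry log-likelihood
    y*log(p) + (1-y)*log(1-p), whose gradient w.r.t. ra is SCALE*(y - p)."""
    p = win_prob(ra, rb)
    grad_a = SCALE * (y - p)
    return ra + lr * grad_a, rb - lr * grad_a


k = 32.0
elo = elo_update(1500.0, 1600.0, 1.0, k=k)
sgd = bt_sgd_update(1500.0, 1600.0, 1.0, lr=k / SCALE)
print(elo, sgd)  # the two pairs agree up to floating-point error
```

The contrast with "true" Bradley-Terry is that the offline fit maximizes the likelihood over all battles at once, so the result does not depend on battle order and has no recency bias from the step size.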

wheat tiger Oct 16, 2025, 9:35 AM

#

Hello

edgy pendant Oct 16, 2025, 12:02 PM

#

can you give me an invitation code for Sora 2, please give me

twin valve Oct 16, 2025, 12:21 PM

#

upbeat swift Saw this paper a couple days ago: https://arxiv.org/pdf/2510.02306 It makes use ...

this is what I meant with "the community would do such analysis if this https://discord.com/channels/1340554757349179412/1372537524551159913 would happen".

Btw if I am not wrong lmarena has already a category where ties are removed.

Also cthorrez seems to be the same account as @jaunty cave . Isn't the same user behind those accounts?

I think ties are actually important. In my simulations for other use cases (chess, videogames) ties really help stabilize the values.

Sometimes they compress rating differences too much though, so one could also play around with weights for ties (for example in some videogames we decided with the developer to weight the result of draws by 0.5). But the more one plays around the less consensus the model has (that is, it becomes more arbitrary)

Bildschirmfoto_2025-10-16_um_14.19.38.png
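One way to read "weight the result of draws by 0.5" is sketched below, assuming an Elo-style update where a tie counts as a score of 0.5 and the size of the update for tied games is scaled by a tie weight. This interpretation, the function names, and the default values are assumptions for illustration; a weight of 1.0 treats ties like any other game and a weight of 0.0 ignores them entirely.

```python
import math


def expected_score(ra: float, rb: float) -> float:
    """Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))


def update(ra: float, rb: float, outcome: str, k: float = 32.0, tie_weight: float = 0.5):
    """outcome is 'a' (A wins), 'b' (B wins) or 'tie'.
    Ties use a score of 0.5 and have their update scaled by tie_weight."""
    score = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    weight = tie_weight if outcome == "tie" else 1.0
    delta = k * weight * (score - expected_score(ra, rb))
    return ra + delta, rb - delta


# A tie between unequal players pulls the ratings together;
# the tie weight controls how strongly.
full = update(1600.0, 1400.0, "tie", tie_weight=1.0)
half = update(1600.0, 1400.0, "tie", tie_weight=0.5)
none = update(1600.0, 1400.0, "tie", tie_weight=0.0)
print(full, half, none)
```

With the full weight, every tie between mismatched players compresses the rating gap, which is the stabilizing effect described above; lowering the weight trades that stability for larger rating differences.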

heady flume Oct 16, 2025, 12:28 PM

#

hello, our team trained a language model and we want to submit it to the leaderboard. Could you please tell me how to submit the model? Thanks.

twin valve Oct 16, 2025, 12:31 PM

#

I guess #1372229840131985540 could help

idle ocean Oct 16, 2025, 12:41 PM

#

heady flume hello, our team train a language model, we want to sumbit on the leardbord, coul...

So true

idle ocean Oct 16, 2025, 12:42 PM

#

twin valve this is what I meant with "the community would do such analysis if this https://...

Whats the evidence of that?

idle ocean Oct 16, 2025, 12:44 PM

#

twin valve this is what I meant with "the community would do such analysis if this https://...

I think the best of all worlds is just where they ignore ties

rough cliff Oct 16, 2025, 1:39 PM

#

@brave pier Please head to #1397655624103493813 for a detailed guide on how to use the bot

brave pier Oct 16, 2025, 1:42 PM

#

how to use the bot

rough cliff Oct 16, 2025, 1:42 PM

#

brave pier how to use the bot

Please head to #1397655624103493813 for a detailed guide on how to use the bot

twin valve Oct 16, 2025, 2:05 PM

#

idle ocean Whats the evidence of that?

evidence for what in particular?

twin valve Oct 16, 2025, 2:06 PM

#

idle ocean I think the best of all worlds is just where they ignore ties

ties are important IMO. it could be like an extra category. For example "exclude ties" could be default and "include ties" could be extra

idle ocean Oct 16, 2025, 2:11 PM

#

twin valve evidence for what in particular?

Clayton

idle ocean Oct 16, 2025, 2:12 PM

#

twin valve ties are important IMO. it could be like an extra category. For example "exclude...

Yah that works

twin valve Oct 16, 2025, 2:15 PM

#

idle ocean Clayton

I see. I may have been mixing people but Clayton should have the surname (at least what I saw online) similar to the other user. I can be wrong though.

idle ocean Oct 16, 2025, 2:24 PM

#

Ok

idle ocean Oct 16, 2025, 2:27 PM

#

twin valve I see. I may have been mixing people but Clayton should have the surname (at lea...

Can confirm cthorrez is clayton thorrez https://cthorrez.github.io/

Clayton Thorrez

silk yoke Oct 16, 2025, 3:18 PM

#

very nice

upbeat swift Oct 16, 2025, 3:44 PM

#

twin valve this is what I meant with "the community would do such analysis if this https://...

Oh yeah I'm Clayton. This is just my personal account, which I was logged into on my phone haha. The other is my work account. I need to set up the auth to use the work account on my phone. 😅

idle ocean Oct 16, 2025, 4:06 PM

#

So true

twin valve Oct 16, 2025, 4:54 PM

#

so I wasn't wrong. Good to know.

jaunty cave Oct 16, 2025, 11:37 PM

#

tight rock so can I expect an update in one or two days?

😄

drowsy needle Oct 16, 2025, 11:38 PM

#

ablobnodfast

idle ocean Oct 16, 2025, 11:43 PM

#

drowsy needle Oct 16, 2025, 11:45 PM

#

ablobnodfast

idle ocean Oct 16, 2025, 11:45 PM

#

they said claude 4 sonnet performance and it seems like they didn't lie about that

fresh marsh Oct 17, 2025, 12:28 AM

#

idle ocean they said claude 4 sonnet performance and it seems like they didn't lie about th...

Lmao

drowsy needle Oct 17, 2025, 12:33 AM

#

jaunty cave Oct 17, 2025, 12:33 AM

#

idle ocean they said claude 4 sonnet performance and it seems like they didn't lie about th...

It's really pretty cool when LMArena results back up what the model providers are saying. Especially in this case: they released it yesterday, and by today we have >2k real live diverse human votes on it, and it shows it's super close to 4-sonnet

#

it's like "ok let me do a quick vibe check with thousands of people all over the world doing different things"

idle ocean Oct 17, 2025, 2:06 AM

#

tho it beats the thinking version correct?

#

and thats the non thinking?

twilit echo Oct 17, 2025, 3:59 AM

#

It placed #46/#65 on my own leaderboard, #55 in Chess. Overall pretty consistent for a mini/flash type model.
Don't doubt it achieved Sonnet 4 level scores on the benchmarks it was optimized for (SWE-bench etc.), but that doesn't universally translate to real-world usage. Sonnet is smarter with more world knowledge (which is to be expected for larger models).

bold jungle Oct 17, 2025, 7:41 AM

#

when will the copilot leaderboard get updated??

karmic fog Oct 17, 2025, 7:44 AM

#

patience

twin valve Oct 17, 2025, 8:15 AM

#

bold jungle when will the copilot leaderboard get updated??

I am not sure that gets updated anymore. It requires votes in the VS code extension and I don't know if there are enough.

#

there are other coding leaderboards, and one can get an idea of how models perform from those

idle ocean Oct 17, 2025, 12:26 PM

#

twilit echo It placed #46/#65 on my own leaderboard, #55 in Chess. Overall pretty consistent...

What did sonnet 4 get?

twilit echo Oct 17, 2025, 1:27 PM

#

idle ocean What did sonnet 4 get?

very high, #5 on main. and #47 on chess (anthropic isn't great at chess in general).

idle ocean Oct 17, 2025, 1:58 PM

#

Ok

#

How about 4.5 or 4 thinking?

twilit echo Oct 17, 2025, 2:41 PM

#

idle ocean How about 4.5 or 4 thinking?

.... I publish everything, see profile.

idle ocean Oct 17, 2025, 2:44 PM

#

Ooh nice

#

4 thinking does worse than 4?

drifting hornet Oct 17, 2025, 11:29 PM

#

@vital osprey Please read our guide at https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

reef moth Oct 18, 2025, 1:21 AM

#

@jaunty cave @drowsy needle hi, something went wrong, 2.5 pro disappeared from the leaderboard

#

drowsy needle Oct 18, 2025, 1:41 AM

#

reef moth @jaunty cave @drowsy needle hi, something went wrong, 2.5 pro di...

Sorry about that! It is fixed now!

ripe bronze Oct 18, 2025, 6:09 AM

#

hello

#

is there any way to use Veo 3.1 for free?

native pewter Oct 18, 2025, 7:19 AM

#

Hello

lethal root Oct 18, 2025, 11:01 AM

#

hi again

marble bluff Oct 18, 2025, 1:08 PM

#

@shy thorn you might check #1397655624103493813 to learn how to use the bot.

marble bluff Oct 18, 2025, 1:39 PM

#

Hello! @sly geyser please check #1397655624103493813 to learn how to use the bot.

tropic pivot Oct 18, 2025, 6:31 PM

#

Hi there. Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content

vital pelican Oct 18, 2025, 6:57 PM

#

HmHmmm

tropic pivot Oct 18, 2025, 6:58 PM

#

@vital pelican Please check https://discordapp.com/channels/1340554757349179412/1397655624103493813 to generate content 🙂

worldly vigil Oct 18, 2025, 9:48 PM

#

Hello. I am here to test & compare different AI models.

drifting hornet Oct 18, 2025, 10:05 PM

#

@marble moth Please read our guide in #how-to-video-bot to learn how to properly prompt the bot.

stark wharf Oct 18, 2025, 11:32 PM

#

hello community, im here to explore the world of AI

sick crow Oct 19, 2025, 3:45 AM

#

#1397655624103493813

rough cliff Oct 19, 2025, 4:54 AM

#

Please head to #1397655624103493813 for a detailed guide on how to use the bot

inland mason Oct 19, 2025, 1:40 PM

#

Make it 9:16 size.

idle ocean Oct 19, 2025, 1:40 PM

#

no

wide nova Oct 19, 2025, 4:28 PM

#

I can certainly help you with that. While the current application is designed for analyzing existing videos, I can upgrade it to generate new videos from a text description using the Gemini API. This will allow you to create the video you described.
Here is a summary of the changes I'll make:
Video Generation Mode: I'll introduce a new "Generate Video" mode. On launch, you'll be able to choose between analyzing an existing video or generating a new one.
Prompt-based Creation: In the new mode, you'll find a text area where you can describe the video you want. I've pre-filled it with your request for a video about world leaders.
API Key Selection: Video generation with the Veo model requires an API key with billing enabled. I'll add a button that prompts you to select your API key before you can generate a video.
Real-time Feedback: Video generation can take a few minutes. I'll display a status screen with informative messages to keep you updated on the progress.
Seamless Integration: Once your video is ready, it will automatically load into the player, and you'll be able to use all the existing analysis tools on it.
Here are the updated files for the application:

idle ocean Oct 19, 2025, 4:28 PM

#

wide nova I can certainly help you with that. While the current application is designed fo...

soooo truee

ember quartz Oct 19, 2025, 4:56 PM

#

Sora 2 cod

full warren Oct 19, 2025, 5:42 PM

#

A short, hyper-realistic video clip in 8K. Scene: A beautiful woman in bed at night.
The video starts with a close-up on her face as she tosses her head gently on the pillow, her eyes closed but her expression troubled.
Her eyes snap open. She looks towards something off-screen (like an alarm clock), and a wave of deep sadness and frustration washes over her face.
She then turns her head to stare out a window as a single, silent tear slowly rolls down her cheek.
Cinematic, dramatic low-key lighting from moonlight. The mood is somber, quiet, and deeply emotional.

#

A realistic, short video clip in a vertical aspect ratio (9:16). A beautiful, sad woman is in bed, tossing and turning in slow motion. The camera is a close-up on her face, capturing her pained and troubled expression. She briefly opens her weary eyes, then closes them again with a deep, sorrowful sigh. The scene is lit by cool, blue moonlight. Highly detailed, photorealistic, 4K. Focus is on the raw emotion of sleeplessness.

idle ocean Oct 19, 2025, 5:43 PM

#

full warren A short, hyper-realistic video clip in 8K. Scene: A beautiful woman in bed at ni...

Please head to #1397655624103493813 for a detailed guide on how to use the bot

earnest arrow Oct 19, 2025, 5:53 PM

#

Complete Profile info

fervent magnet Oct 19, 2025, 6:01 PM

#

Is there gemini 3 in lm arena?

idle ocean Oct 19, 2025, 6:02 PM

#

fervent magnet Is there gemini 3 in lm arena?

don't think so

sharp fiber Oct 19, 2025, 7:47 PM

#

video arena 1 is fire

tough pewter Oct 19, 2025, 7:49 PM

#

true

drifting hornet Oct 19, 2025, 7:50 PM

#

@sharp fiber Please read our guide at https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

rough cliff Oct 20, 2025, 12:56 AM

#

@kindred bane Please head to #1397655624103493813 for a detailed guide on how to use the bot

kindred bane Oct 20, 2025, 1:00 AM

#

im here to explore the world of AI

floral bridge Oct 20, 2025, 3:25 AM

#

hello

rough cliff Oct 20, 2025, 2:06 PM

#

@fiery wagon Please head to #1397655624103493813 for a detailed guide on how to use the bot

idle ocean Oct 20, 2025, 4:40 PM

#

SO TRUE

drowsy needle Oct 20, 2025, 5:30 PM

#

idle ocean Oct 20, 2025, 5:31 PM

#

a lot of people were saying sora 2 was better than veo 3.1 so that's a surprise

drowsy needle Oct 20, 2025, 6:06 PM

#

idle ocean a lot of people were saying sora 2 was better than veo 3.1 so that's a surprise

Yeah this was an interesting one to watch for sure

#

sora-2 overall still performing really well

undone abyss Oct 21, 2025, 7:47 AM

#

drowsy needle Sorry about that! It is fixed now!

were you perhaps trying to remove 2.5 pro because there would be another model to replace it..

robust zodiac Oct 21, 2025, 8:12 AM

#

undone abyss were you perhaps trying to remove 2.5 pro because there would be another model t...

buy more z ai KEKW

rough cliff Oct 21, 2025, 11:27 AM

#

@opaque vapor Please head to #1397655624103493813 for a detailed guide on how to use the bot

edgy flower Oct 21, 2025, 5:03 PM

#

HI THERE

fringe falcon Oct 21, 2025, 5:12 PM

#

hello

covert timber Oct 21, 2025, 11:04 PM

#

yup

tepid halo Oct 22, 2025, 12:00 AM

#

Hi

marble bluff Oct 22, 2025, 11:16 AM

#

@woven pewter @devout sorrel @quick pivot @abstract bane Please check #1397655624103493813 to learn how to use the bot properly.

#

@quick pivot Hey! We ask you to use English only in this server, thank you.

quick pivot Oct 22, 2025, 11:22 AM

#

marble bluff @woven pewter @devout sorrel @quick pivot @abstract bane...

I read everything in the assistant bot and did everything, but now I don't know where to get the results. Please tell me.

marble bluff Oct 22, 2025, 11:23 AM

#

quick pivot I read everything in the assistant bot and did everything, but now I don't know ...

For your creations make sure you go to #video-arena-1 #video-arena-2 #video-arena-3

quick pivot Oct 22, 2025, 11:24 AM

#

marble bluff For your creations make sure you go to #video-arena-1 #video-arena-2...

I went through and entered the prompt. Where can I see what was generated based on it?

rough cliff Oct 22, 2025, 1:05 PM

#

@torn cove Please head to #1397655624103493813 for a detailed guide on how to use the bot

spare zenith Oct 23, 2025, 1:49 AM

#

Hi

trail valve Oct 23, 2025, 5:04 AM

#

wow

violet flint Oct 23, 2025, 7:39 AM

#

hi

wary stag Oct 23, 2025, 8:23 AM

#

Rehan

sharp silo Oct 23, 2025, 12:29 PM

#

hi

steep elk Oct 23, 2025, 6:29 PM

#

hi

shadow wave Oct 23, 2025, 7:34 PM

#

Hi

drowsy needle Oct 23, 2025, 7:39 PM

#

hey everyone! let's try to use #general for greetings.

drifting hornet Oct 23, 2025, 10:03 PM

#

@lean spear Please read our guide at https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

quiet jetty Oct 24, 2025, 10:35 AM

#

https://www.opus.pro/agent?ref_id=EF9RSOLZ0

Agent Opus | AI Video Generator for Social Media

AI video generator from OpusClip. Create authentic, on-brand videos from text, links, audio, blogs, and more.

sudden bluff Oct 24, 2025, 11:05 AM

#

THANKS

drifting hornet Oct 24, 2025, 2:15 PM

#

@potent island Please read our guide at https://discord.com/channels/1340554757349179412/1397655624103493813 to learn how to properly prompt the bot.

golden nexus Oct 24, 2025, 5:41 PM

#

hello how do i generate an ai video

tropic pivot Oct 24, 2025, 5:42 PM

#

golden nexus hello how do i generate an ai video

Hello! Check https://discordapp.com/channels/1340554757349179412/1397655624103493813 🙂

sick rover Oct 25, 2025, 5:42 AM

#

Why is there no sound when generating a video?