#webdev-arena | Arena | Page 1

hot notch Mar 4, 2025, 5:47 AM

#

GPT 4.5 and grok 3 think, where will they rank?

frank nimbus Mar 4, 2025, 6:00 AM

#

4.5 1215, grok 3 1200

#

maybe 4.5 is like 1150

hot notch Mar 4, 2025, 7:23 AM

#

Does making a model think improve it at code? deepseek v3 gains 220 Elo by moving to R1, and o3 mini high gains 180 Elo over GPT 4o 11-20, but gemini flash thinking has about the same Elo as without thinking.

#

and it fits with their livebench scores, where gemini thinking's coding rating is not better than without thinking,
and R1 improves compared to v3 (61 to 66) and o3 mini high improves compared to GPT 4o 11-20 (46 to 82),
so if we follow the grok 3 think rating from livebench (it goes from 54 to 66), it should see a good Elo gain on the webdev arena
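To put the Elo gaps quoted above in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes the leaderboard scores sit on the usual 400-point logistic Elo scale, which is an assumption about the arena rather than something stated in this channel; `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected win probability of the higher-rated side on the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gaps quoted in the messages above (assumed to be on the 400-point scale).
for label, gap in [("deepseek v3 -> R1", 220), ("GPT 4o 11-20 -> o3 mini high", 180)]:
    print(f"{label}: +{gap} Elo, expected win rate {expected_win_rate(gap):.1%}")
```

Under that assumption, a 220-point gap works out to roughly a 78% win rate and a 180-point gap to roughly 74%.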

#

And Claude 3.7 goes from 66 to 74 with think (with what limit of thinking?)

#

we are also waiting for it on the webdev arena

copper coyote Mar 4, 2025, 5:52 PM

#

Depends a lot, since here it's possible to add more reasoning lines for better answers, like "What would be easy if" or "I will get borrowed was example for now:

reef crown Mar 5, 2025, 12:57 PM

#

Is it possible to add some of the R1 Distill models to webdev arena?

fallow gate Mar 5, 2025, 3:45 PM

#

reef crown Is it possible to add some of the R1 Distill models to webdev arena?

I second this, would be cool to see qwen 32b/14b r1 distills

stiff kernel Mar 5, 2025, 3:50 PM

#

reef crown Is it possible to add some of the R1 Distill models to webdev arena?

Optimistically, you would max out at 1302 ELO

#

Which is below V3

#

The story gets a bit better in Hard (English): you could get up to 1307 ELO, slightly above V3

reef crown Mar 5, 2025, 5:39 PM

#

reef crown Is it possible to add some of the R1 Distill models to webdev arena?

@gritty tartan 🙂

wild island Mar 6, 2025, 3:03 PM

#

hello, is anyone else facing an issue where claude 3.7 doesn't show anything in the code for webdev arena?

lament glade Mar 6, 2025, 5:54 PM

#

@wild island could you say more? thanks

hazy trench Mar 6, 2025, 5:59 PM

#

I'll just post what I posted the other server

#

#

It happens on firefox and opera, so unlikely a browser issue, I have only seen it happen for grok and sonnet

cyan oxide Mar 6, 2025, 6:29 PM

#

hazy trench

I've gotten these quite a few times

#

In the past though

#

with other models

lament glade Mar 6, 2025, 6:43 PM

#

interesting, thanks for reporting the issue!

normal copper Mar 7, 2025, 9:07 PM

#

Just came to report the exact same! I am using Chrome

wild island Mar 8, 2025, 10:40 AM

#

lament glade @wild island could you say more? thanks

primal wharf Mar 8, 2025, 2:18 PM

#

wild island

Claude 3.7 is like that most of the time, and grok3 errors 🤌🙂 I don't know what's wrong with them

#

Why isn't the new chatgpt4o version there, only the old one?

clever mortar Mar 10, 2025, 8:55 PM

#

I'm testing the "web dev arena". Every time "claude-3-7-sonnet-20250219" is used, it never generates anything. This is really going to throw off the scores, because we know it's good, but if it's broken it's going to score horribly.

#

You should implement a test that verifies that SOMETHING was generated before considering the test data valid.

stiff kernel Mar 10, 2025, 9:26 PM

#

clever mortar You should implement a test that verifies that SOMETHING was generated before co...

They do

quiet raven Mar 11, 2025, 5:42 AM

#

wild island

99% of the time on webdevarena at least one of them does that, where it doesn't work but is seemingly taking time to generate something

teal tendon Mar 27, 2025, 7:38 AM

#

No response every time with deepseek-v3-0324

high mesa Mar 27, 2025, 12:38 PM

#

Same here, no response with deepseek-v3-0324

Needs to be fixed asap as its ranking is going to be tanked

lament glade Mar 28, 2025, 12:37 AM

#

thanks for reporting the issue! if one of the models generates an empty response, the vote will be marked invalid.
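A minimal sketch of the rule described above: a vote only counts when both sides actually produced output. The function name and the idea of checking raw response text are assumptions for illustration, not the arena's actual implementation.

```python
from typing import Optional


def is_valid_vote(response_a: Optional[str], response_b: Optional[str]) -> bool:
    """Hypothetical check: a vote is valid only if both responses are non-empty."""
    # A missing response or one containing only whitespace counts as empty.
    return all(r is not None and r.strip() != "" for r in (response_a, response_b))


print(is_valid_vote("export default function App() {}", ""))  # empty side, vote discarded
print(is_valid_vote("export default function App() {}", "const x = 1;"))
```

A check like this only catches empty text. Output that is complete but fails to compile or render, as reported elsewhere in this channel, would need a separate check.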

acoustic sapphire Apr 3, 2025, 4:08 PM

#

We need to have a way to ask follow ups separately for each side, because sometimes there is a bug on one side but the other side works, so I just want to give one side the error that is showing to see if it can fix it.

#

Also, I noticed that with nightwhisper it would fail to run the code, but when I picked the other side as winner, it magically loaded, and the output was actually better than the side I chose. It just didn't load before I selected a winner. This might mess up the results for nightwhisper

thick sigil Apr 3, 2025, 7:01 PM

#

acoustic sapphire We need to have a way to ask follow ups separately for each side, because someti...

Tell the LLMs to generate a random number, and tell them to follow just the part of the prompt marked with that number

acoustic sapphire Apr 3, 2025, 7:09 PM

#

thick sigil Tell LLMs to generate random number and tell them to follow just part of the pro...

https://3000-ifetj5ycewj3xxmficgdk-ae4bd0ef.e2b-foxtrot.dev

#

nw

thick sigil Apr 3, 2025, 7:24 PM

#

acoustic sapphire https://3000-ifetj5ycewj3xxmficgdk-ae4bd0ef.e2b-foxtrot.dev

Generate user interface for ...
At the top of your code, write a comment with a random number from 0 to 100. I will use this number to give you follow-up prompts: for my next prompts, follow only the part with your number!

If your number is ..., do this. If your number is ..., do that.
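The trick above can be sketched as two small prompt builders. This is only an illustration of the workaround: the function names are hypothetical, and it assumes you read each side's number from the comment at the top of its generated code before writing the follow-up.

```python
from typing import Mapping


def build_initial_prompt(task: str) -> str:
    """First prompt: ask each model to tag its code with a random number."""
    return (
        f"Generate user interface for {task}\n"
        "At the top of your code, write a comment with a random number from 0 to 100. "
        "I will use this number to give you follow-up prompts: for my next prompts, "
        "follow only the part with your number!"
    )


def build_followup(instructions: Mapping[int, str]) -> str:
    """Follow-up prompt: one branch per number that a model reported."""
    return " ".join(
        f"If your number is {number}, {instruction}."
        for number, instruction in instructions.items()
    )


print(build_followup({37: "fix the error shown in the console", 82: "keep the code as it is"}))
```

This breaks down if both models happen to pick the same number, in which case the first prompt has to be retried.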

acoustic sapphire Apr 3, 2025, 7:29 PM

#

damn i lost my nw session

#

gotta find it again

tardy timber Apr 4, 2025, 5:34 AM

#

I don't know if it's specific to the prompts I am trying, but half the time that I use the webdev arena only one of the LLMs works at all

fallow gate Apr 4, 2025, 6:09 AM

#

tardy timber I don't know if its specific to the prompts I am trying but half the time that I...

Same, the other llm usually writes like 10-20 lines then stops generating

#

If I add a follow-up, it usually works fine though

acoustic sapphire Apr 4, 2025, 7:51 PM

#

RIP nightwhisper

chrome raft Apr 5, 2025, 2:45 PM

#

acoustic sapphire RIP nightwhisper

What happened?

#

I got to know about it tdy and wanted to try 😫

acoustic sapphire Apr 5, 2025, 5:48 PM

#

chrome raft I got to know about it tdy and wanted to try 😫

i can show you some examples of its outputs if you want?

#

https://x.com/DrealR_/status/1907921770184860082
check this page

DrealR (@DrealR_) on X

NightWhisper vs Gemini 2.5 Pokemon sim:
Gemini 2.5:

chrome raft Apr 5, 2025, 6:05 PM

#

acoustic sapphire i can show you some examples of its outputs if you want?

Examples r the reason i wanted to try it myself

#

Literally no other model has fascinated me this much before 🙎

acoustic sapphire Apr 5, 2025, 6:40 PM

#

lmaooo

merry sentinel Apr 6, 2025, 12:39 AM

#

Hi everyone, I was wondering if anyone knows how the arena application scales on demand and runs its inference on the GPUs? I'm building a related project and wanted to know about the infrastructure. More specifically, how do you scale with user demand (do you launch new containers with GPUs for inference?) and how do you keep communicating with those (do you keep a websocket open for inference?). Thanks for any tips!!

cloud blade Apr 6, 2025, 6:29 AM

#

merry sentinel Hi everyone, I was wondering if anyone knows how the arena application scales on...

https://github.com/lmarena/FastChat

GitHub

GitHub - lmarena/FastChat: An open platform for training, serving, ...

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. - lmarena/FastChat

#

or https://github.com/lm-sys/FastChat

GitHub

GitHub - lm-sys/FastChat: An open platform for training, serving, a...

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. - lm-sys/FastChat

native thistle Apr 8, 2025, 9:53 AM

#

Is nightwhisper still available? haven't got it yet

paper merlin Apr 8, 2025, 10:29 AM

#

I read someone said, somewhere on x, it got removed

acoustic sapphire Apr 9, 2025, 1:34 AM

#

native thistle Is nightwhisper still available? haven't got it yet

tmw bro

rocky forge Apr 9, 2025, 4:58 AM

#

Hi, this model is trending on OpenRouter https://openrouter.ai/openrouter/quasar-alpha
And I should say, after a few days of using it, it's on par with Claude 3.5 and Gemini 2.5 pro.

Quasar Alpha - API, Providers, Stats

This is a cloaked model provided to the community to gather feedback. It’s a powerful, all-purpose model supporting long-context tasks, including code generation. Run Quasar Alpha with API

uncut glen Apr 10, 2025, 9:28 PM

#

rocky forge Hi, this model is trending on OpenRouter https://openrouter.ai/openrouter/quasar...

not 3.7 Sonnet?

#

I am still not sure which is better. Quasar or Optimus. Tesla involved?

ruby comet Apr 12, 2025, 7:08 PM

#

webdev arena is soooo broken. It often doesn't render websites

median lodge Apr 13, 2025, 10:15 AM

#

ruby comet webdev arena is soooo broken. It often doesn't render websites

I always only get dragontail and a gemini model. I have never seen claude, or deepseek, or any of the more interesting models.
dragontail produces really poor archaic results (if it runs, which isn't often).

alpine night Apr 14, 2025, 2:43 AM

#

What are we supposed to vote when one of the models doesn't run anyway?

native thistle Apr 14, 2025, 9:26 AM

#

it's really annoying when the output is complete, but the dev server does not start

vale lake Apr 18, 2025, 1:45 PM

#

dayhush keeps failing to compile, is that normal?

silver shadow Apr 23, 2025, 3:35 PM

#

blissful summit Apr 27, 2025, 1:41 AM

#

is there any way to find out what models are on the arena, and which ones are being taken off?

unkempt grove Apr 27, 2025, 5:39 PM

#

hi to all

#

I didn't understand how to select in web dev arena the ai tool that I like, for example claude sonnet 3.7

hushed bane Apr 27, 2025, 5:43 PM

#

unkempt grove I didn't understand how to select in web dev arena the ai tool that I like, for ...

webdev arena only supports battle mode so far. but you can do direct chat and side-by-side across modalities in our beta: beta.lmarena.ai

unkempt grove Apr 27, 2025, 5:48 PM

#

ok thanks

#

@hushed bane so when I click on new chat it doesn't know what tool you are using

hushed bane Apr 27, 2025, 5:57 PM

#

that's correct, in webdev it's only blind battles. the models are revealed after you vote.

ornate pewter Apr 28, 2025, 1:26 AM

#

hey yall. there's some new model (i think it's called sunskirt? or something like that) that consistently does not respond in the battle mode. i normally just vote the other bot (that does respond with something) as better, but i don't wanna knock down a potentially good bot because it's not responding/set up properly. maybe a report button as part of the options after the first chat round? or is this an existing feature and i've missed it?

#

for reference it will say generating but then time out

stiff kernel Apr 28, 2025, 1:33 AM

#

ornate pewter hey yall. theres some new model (i think its called sunskirt? or something like ...

empty responses do not shape the leaderboard

ornate pewter Apr 28, 2025, 1:37 AM

#

ah okay so even if i pick it just skips

#

?

#

just ran into it again - sunstrike is the name not sunskirt

#

of the model

distant relic Apr 29, 2025, 7:51 AM

#

I just came here and wanted to say that sunstrike is amazing and absolutely crushes every other model by a wide margin

flat holly Apr 29, 2025, 9:56 PM

#

Is Dragontail, Dayhush and Claybrook still on the webdev arena?

outer ore Apr 29, 2025, 10:58 PM

#

flat holly Is Dragontail, Dayhush and Claybrook still on the webdev arena?

I believe dragontail was Gemini 2.5 flash but I don't know for certain

#

Hard to keep up

slow pine May 2, 2025, 1:21 PM

#

Hey I have a question

#

Is there a preprompt that web dev arena uses?

#

And also are the models using a different preprompt/system prompt than if you just use the API

stiff kernel May 2, 2025, 2:27 PM

#

slow pine Is there a preprompt that web dev arena uses?

of course - they have to be told to write react

slow pine May 2, 2025, 5:49 PM

#

is the preprompt public? i'd like to use it privately i think

stiff kernel May 2, 2025, 11:16 PM

#

slow pine is the preprompt public? i'd like to use it privately i think

📎 prompt-extracted.md

slow pine May 3, 2025, 9:30 AM

#

Thank you ❤️

slow pine May 3, 2025, 12:07 PM

#

Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH ```typescript or ```javascript or ```tsx or ```.

hahahhahahaha
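Since models do not always obey an instruction like the one quoted above, a post-processing step that strips a wrapping code fence is a common safety net. The sketch below is an assumption about how one might do this privately. It is not taken from the arena's code, and `strip_code_fence` is a hypothetical helper.

```python
import re

# Matches a whole response wrapped in one fence, with an optional language tag
# such as typescript, javascript or tsx after the opening backticks.
FENCE_RE = re.compile(r"^\s*```[a-zA-Z]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the code inside a wrapping fence, or the trimmed text if unfenced."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


fenced = "```tsx\nimport React from 'react';\nexport default function App() { return null; }\n```"
print(strip_code_fence(fenced))
```

Anchoring the pattern to the whole response means backticks inside the code itself (for example in template literals) are left alone.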

steel frigate May 4, 2025, 11:28 AM

#

flat holly Is Dragontail, Dayhush and Claybrook still on the webdev arena?

dayhush is still in

shell scroll May 7, 2025, 8:11 AM

#

Guys where is o4-mini-high in leaderboards?

slow pine May 7, 2025, 4:52 PM

#

really good chance google specifically finetuned the new 2.5 to be better at react and working with the webdev preprompt in general

#

they show off the leaderboard in their post

buoyant nest May 8, 2025, 12:45 PM

#

Guys, webdev arena is just broken. completely unusable in recent days

* Asking to vote with no result displayed
* One or more of the sides usually displaying nothing
* And more such incomplete things. I haven't been able to vote once since the release of the gemini 2.5 pro update.

I know model outputs are unpredictable, and APIs from the model providers may get stuck or be unreliable. But you can at least easily check whether they have given an output or not and show an error message, or most importantly not allow votes on incomplete results.

Voting for empty sides, failed API responses.... this is a big no
I can imagine how this has messed up the voting scores.

#

Another bad response immediately after i posted here. The left one is a complete mess and worst implementation. I am sure most users would vote for it like this. gemini would have obliterated this prompt (as i saw with previous gemini 2.5), but gets downvoted stupidly this way

I know I should not have voted, but it doesn't matter, as I am sure many users mess up votes this way. And I shouldn't have been able to submit.
I am a UI/UX dev too. This is totally preventable and a big flaw that undermines the purpose of the platform

forest verge May 8, 2025, 1:20 PM

#

buoyant nest Guys, webdev arena is just broken. completely unusable in recent days * Asking t...

Thanks for flagging, I’ll try to repro. In the meantime, would you mind creating a post in #1343291835845578853

buoyant nest May 8, 2025, 1:33 PM

#

Ok, just did that

blissful vapor May 8, 2025, 5:56 PM

#

slow pine they show off the leaderboard in their post

i don't think it's better than claude :v

slow pine May 8, 2025, 6:44 PM

#

blissful vapor i don't think it's better than claude :v

You're not alone, but I think they're both good enough to the point where the web dev score loses a lot of meaning because they'll both give you working code

#

The solution to this would be to create an identical benchmark but gatekeep it to only experienced Devs that'll choose based on code structure instead of just output form

#

Some kind of elite benchmark based on people actually using it for commits

valid canyon May 11, 2025, 6:25 PM

#

drakesclaw dead already?

lime gull Jun 3, 2025, 6:31 PM

#

I had the idea of an arena-webdev-auto.

1. Difficult, creative prompts are chosen, similar to arena-hard-auto
2. Prompts are run through models, same as arena-hard-auto
3. Judge model is given a screenshot of the site as well as the source code, to judge both aesthetics and completeness
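The three steps above can be sketched as a small evaluation loop. Everything here is hypothetical: the `generate`, `render` and `judge` callables stand in for a model API, a headless-browser screenshot step and a judge model, none of which are specified in the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Submission:
    model: str
    source_code: str
    screenshot: bytes  # rendered page image handed to the judge


def run_webdev_auto(
    prompts: Iterable[str],
    model: str,
    baseline: str,
    generate: Callable[[str, str], str],  # (model, prompt) -> source code
    render: Callable[[str], bytes],  # source code -> screenshot
    judge: Callable[[str, Submission, Submission], str],  # returns "a", "b" or "tie"
) -> float:
    """Win rate of `model` against `baseline`; a tie counts as half a win."""

    def submit(name: str, prompt: str) -> Submission:
        code = generate(name, prompt)
        return Submission(name, code, render(code))

    score, total = 0.0, 0
    for prompt in prompts:
        # Side "a" is always the candidate model, side "b" the baseline.
        verdict = judge(prompt, submit(model, prompt), submit(baseline, prompt))
        score += {"a": 1.0, "tie": 0.5}.get(verdict, 0.0)
        total += 1
    return score / total if total else 0.0
```

A real setup would probably also swap the two sides at random to reduce position bias in the judge, which this sketch leaves out.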

forest verge Jun 3, 2025, 7:30 PM

#

I'm going to spin up a feedback post for this ^

quasi sableBOT Sep 3, 2025, 2:57 PM

#

:warning: Channel locked

Site outage, will turn back on when resolved.

quasi sableBOT Sep 3, 2025, 4:01 PM

#

:success: Channel unlocked

Welcome back :ablobwave:

quasi sableBOT May 12, 2026, 2:53 PM

#

:warning: Channel locked