#webdev-arena
Does making a model think improve it at code? deepseek v3 gains 220 elo by moving to R1, and o3 mini high gains 180 elo over GPT 4o 11-20, but gemini flash thinking has about the same elo as without thinking.
and that fits with their livebench, where the gemini thinking coding rating is no better than without thinking,
And R1 improves over v3 (61 to 66) and o3 mini high improves over GPT 4o 11-20 (46 to 82)
so if we go by the grok 3 think rating from livebench (it goes from 54 to 66), it should see a good elo gain on the webdev arena
And Claude 3.7 goes from 66 to 74 with thinking (with what thinking limit?)
we are also waiting for it on the webdev arena
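To put the Elo gaps quoted above in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes the standard Elo logistic curve with a 400-point scale; the arena's actual rating fit may differ, and `expected_win_rate` is just an illustrative name.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gaps quoted in the messages above.
for label, gap in [("v3 -> R1", 220), ("GPT 4o 11-20 -> o3 mini high", 180)]:
    print(f"{label}: +{gap} elo, expected win rate {expected_win_rate(gap):.2f}")
```

Under that assumption, a gain of around 200 points means the thinking model would be expected to win roughly three out of four battles against its base model.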
It depends a lot, since here it's possible to add more reasoning lines for better answers, like "What would be easy if" or "I will get borrowed", as an example for now:
Is it possible to add some of the R1 Distill models to webdev arena?
I second this, would be cool to see qwen 32b/14b r1 distills
Optimistically, you would max out at 1302 ELO
Which is below V3
The story gets a bit better in Hard (English): you could get up to 1307 ELO, slightly above V3
@gritty tartan 🙂
hello, is anyone else facing an issue where claude 3.7 doesn't show any code in webdev arena?
@wild island could you say more? thanks
I'll just post what I posted in the other server
It happens on firefox and opera, so unlikely a browser issue, I have only seen it happen for grok and sonnet
I've gotten these quite a few times
In the past though
with other models
interesting, thanks for reporting the issue!
Just came to report the exact same! I am using Chrome
Claude 3.7 is like that most of the time, and grok3 gives an error 🤌🙂 I don't know what's wrong with them
Why isn't the new chatgpt4o version there, only the old one?
I'm testing the "web dev arena". Everytime "claude-3-7-sonnet-20250219" is used, it never generates anything. This is really going to throw off the scores because we know it's good, but if it's broken it's going to score horribly.
You should implement a test that verifies that SOMETHING was generated before considering the test data valid.
They do
99% of the time on webdevarena at least one of them does that, where it doesn't work but is seemingly taking time to generate something
No response every time with deepseek-v3-0324
Same here, no response with deepseek-v3-0324
Needs to be fixed asap as its ranking is going to be tanked
thanks for reporting the issue! if one of the models generates empty response the vote will be marked invalid.
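A minimal sketch of the kind of check described above: a vote only counts if both sides actually produced something. This is not the arena's real code; `BattleSide`, `has_output`, and `is_vote_valid` are hypothetical names, and a real check would probably also need to cover outputs that exist but fail to render.

```python
from dataclasses import dataclass


@dataclass
class BattleSide:
    model: str
    output: str  # raw text returned by the model


def has_output(side: BattleSide, min_chars: int = 1) -> bool:
    """True if the side returned at least min_chars of non-whitespace-trimmed text."""
    return len(side.output.strip()) >= min_chars


def is_vote_valid(left: BattleSide, right: BattleSide) -> bool:
    """A vote is kept only when neither side came back empty."""
    return has_output(left) and has_output(right)
```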
We need to have a way to ask follow ups separately for each side, because sometimes there is a bug on one side but the other side works, so I just want to give one side the error that is showing to see if it can fix it.
Also, I noticed with nightwhisper it would fail to run the code, but when I pick the other side as winner, it magically loads and then the output is actually better than the side I chose but it just didnt load before I selected a winner, this might mess up the results for nightwhisper
Tell the LLMs to generate a random number, and tell them to follow only the part of the prompt marked with that number
nw
Generate user interface for ...
At the top of your code, write a comment with a random number from 0 to 100. I will use this number to give you follow-up prompts: for my next prompts, follow only the part with your number!
If your number is ..., do this. If your number is ..., do that.
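The trick above can be sketched as two small prompt builders, since both sides receive the same follow-up text. The function names and the exact wording are assumptions for illustration; the task text and the per-number instructions are whatever you supply.

```python
NUMBER_SUFFIX = (
    "At the top of your code, write a comment with a random number from 0 to 100. "
    "For my next prompts, follow only the part that matches your number."
)


def first_prompt(task: str) -> str:
    """Append the random-number instruction to the initial task prompt."""
    return f"{task}\n{NUMBER_SUFFIX}"


def routed_followup(branches: list[tuple[int, str]]) -> str:
    """Build one follow-up that addresses each side by the number it picked."""
    return " ".join(f"If your number is {n}, {text}" for n, text in branches)
```

For example, once the two outputs show 17 and 63, `routed_followup([(17, "fix the error shown."), (63, "change nothing.")])` gives a single message that only one side should act on. It breaks down if both models happen to pick the same number.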
I don't know if it's specific to the prompts I am trying, but half the time that I use the webdev arena only one of the LLMs works at all
Same, the other llm usually writes like 10-20 lines then stops generating
If I add a follow-up, it usually works fine though
RIP nightwhisper
What happened?
I got to know about it tdy and wanted to try 😫
i can show you some examples of its outputs if you want?
https://x.com/DrealR_/status/1907921770184860082
check this page
Examples r the reason i wanted to try it myself
Literally no other model has fascinated me this much before 🙎
lmaooo
Hi everyone, I was wondering if anyone knows how the arena application scales on demand and runs its inference on the GPUs? I'm building a related project and wanted to know about the infrastructure. More specifically, how do you scale with user demand (do you launch new containers with GPUs for inference?) and how do you still communicate with those (to keep a websocket open for inference?). Thanks for any tips!!
Is nightwhisper still available? haven't got it yet
I read someone said, somewhere on x, it got removed
tmw bro
Hi, this model is trending on OpenRouter https://openrouter.ai/openrouter/quasar-alpha
And I should say, after using it for a few days, it's on par with Claude 3.5 and Gemini 2.5 pro.
not 3.7 Sonnet?
I am still not sure which is better. Quasar or Optimus. Tesla involved?
webdev arena is soooo broken. It often doesn't render websites
I always only get dragontail and a gemini model. I have never seen claude, or deepseek, or any of the more interesting models.
dragontail produces really poor archaic results (if it runs, which isn't often).
What are we supposed to vote when one of the models doesn't run anyway?
it's really annoying when the output is complete, but the dev server does not start
dayhush keeps failing to compile, is that normal?
is there any way to find out what models are on the arena, and which ones are being taken off?
hi to all
I don't understand how to select the ai tool that I like in web dev arena, for example claude sonnet 3.7
webdev arena only supports battle mode so far. but you can do direct chat and side-by-side across modalities in our beta: beta.lmarena.ai
ok thanks
@hushed bane so when I click on new chat it doesn't know what tool you are using
that's correct, in webdev it's only blind battles. the models are revealed after you vote.
hey yall. theres some new model (i think its called sunskirt? or something like that) that consistently does not respond in the battle mode. i normally just vote the other bot (that does respond something) as better but i dont wanna knock down a potentially good bot because its not responding/setup properly. maybe. a report button as part of the options after the first chat round? or is this an existing feature and ive missed it?
for reference it will say generating but then time out
empty responses do not shape the leaderboard
ah okay so even if i pick it just skips
?
just ran into it again - sunstrike is the name not sunskirt
of the model
I just came here and wanted to say that sunstrike is amazing and absolutely crushes every other model by a wide margin
Are Dragontail, Dayhush and Claybrook still on the webdev arena?
I believe dragontail was Gemini 2.5 flash but I don't know for certain
Hard to keep up
Hey I have a question
Is there a preprompt that web dev arena uses?
And also are the models using a different preprompt/system prompt than if you just use the API
of course - they have to be told to write react
is the preprompt public? i'd like to use it privately i think
Thank you ❤️
Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH ```typescript or ```javascript or ```tsx or ```.
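Models sometimes wrap the code in a markdown fence anyway, so a harness built on a preprompt like this one may want a cleanup step before running the output. Below is a minimal sketch of that step. It is an assumption about how one could handle it, not how webdev arena does, and `strip_code_fence` is a hypothetical helper name.

```python
import re

FENCE = "`" * 3  # three backticks

# An opening fence with an optional language tag, and a trailing closing fence.
_OPEN = re.compile(r"^\s*" + FENCE + r"[A-Za-z]*[ \t]*\n")
_CLOSE = re.compile(r"\n" + FENCE + r"\s*$")


def strip_code_fence(text: str) -> str:
    """Remove one leading and one trailing markdown fence, if present."""
    text = _OPEN.sub("", text, count=1)
    text = _CLOSE.sub("", text, count=1)
    return text.strip()
```

Output that already starts with the imports passes through unchanged apart from whitespace trimming.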
hahahhahahaha
dayhush is still in
Guys where is o4-mini-high in leaderboards?
really good chance google specifically finetuned the new 2.5 to be better at react and working with the webdev preprompt in general
they show off the leaderboard in their post
Guys, webdev arena is just broken. completely unusable in recent days
- Asking to vote with no result displayed
- One or more of the sides usually displaying nothing
- And more such incomplete things. I haven't been able to vote once since the release of gemini 2.5 pro update.
I know model outputs are unpredictable, and APIs from the model providers may get stuck or be unreliable. But you can at least easily check whether they have given an output or not and show an error message, or most importantly not allow voting on incomplete results.
Voting for empty sides, failed API responses.... this is a big no
I can imagine how this has messed up the voting scores.
Another bad response immediately after i posted here. The left one is a complete mess and worst implementation. I am sure most users would vote for it like this. gemini would have obliterated this prompt (as i saw with previous gemini 2.5), but gets downvoted stupidly this way
I know I should not have voted, but it doesn't matter as I am sure many users mess votes this way. And shouldn't have been able to submit
I am a UI/UX dev too. This is totally preventable and a big flaw targeting the purpose of the platform
Thanks for flagging, I’ll try to repro. In the meantime, would you mind creating a post in #1343291835845578853
Ok, just did that
i don't think it's better than claude :v
You're not alone, but I think they're both good enough to the point where the web dev score loses a lot of meaning because they'll both give you working code
The solution to this would be to create an identical benchmark but gatekeep it to only experienced Devs that'll choose based on code structure instead of just output form
Some kind of elite benchmark based on people actually using it for commits
drakesclaw dead already?
I had the idea of an arena-webdev-auto.
- Difficult, creative prompts chosen similar to arena-hard-auto
- Prompts are run through models, same as arena-hard-auto
- Judge model is given a screenshot of the site as well as the source code, to judge both aesthetics and completeness
I'm going to spin up a feedback post for this ^
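A rough sketch of how the arena-webdev-auto idea above could be wired together. Everything here is hypothetical: `generate`, `render`, and `judge` are placeholders you would back with real model calls, a headless browser, and a judge model, and the pairwise setup is an assumption rather than how arena-hard-auto is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Submission:
    model: str
    source: str        # generated site source code
    screenshot: bytes  # rendered screenshot of the site


def run_auto_arena(
    prompts: list[str],
    model_a: str,
    model_b: str,
    generate: Callable[[str, str], str],                  # (model, prompt) -> source
    render: Callable[[str], bytes],                       # source -> screenshot
    judge: Callable[[str, Submission, Submission], str],  # -> winning model name
) -> dict[str, int]:
    """Run every prompt through both models and tally the judge's picks."""
    wins = {model_a: 0, model_b: 0}
    for prompt in prompts:
        subs = []
        for model in (model_a, model_b):
            source = generate(model, prompt)
            subs.append(Submission(model, source, render(source)))
        # The judge sees both the screenshot and the source for each side.
        wins[judge(prompt, subs[0], subs[1])] += 1
    return wins
```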
Site outage, will turn back on when resolved.
Welcome back :ablobwave: