#webdev-arena

hot notch
#

GPT 4.5 and Grok 3 Think: where do you think they will rank?

frank nimbus
#

4.5 1215, grok 3 1200

#

maybe 4.5 is like 1150

hot notch
#

Does making a model think improve it at code? DeepSeek V3 gains 220 Elo by moving to R1, and o3 mini high sits 180 Elo above GPT 4o 11-20, but Gemini Flash Thinking has about the same Elo as the non-thinking version.

#

and it fits with LiveBench, where the Gemini thinking coding score is no better than without thinking,
and R1 improves over V3 (61 to 66) and o3 mini high improves over GPT 4o 11-20 (46 to 82).
So if we go by the Grok 3 Think score on LiveBench (it goes from 54 to 66), it should see a good Elo gain on the WebDev Arena.
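To put the Elo gaps quoted above in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head win rate. It assumes the leaderboard scores follow the usual Elo convention (a logistic curve on a 400-point scale); the function name `expected_win_rate` is made up for illustration.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model for a given Elo gap.

    Assumption: standard Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# The two gaps mentioned in the discussion above.
for gap in (180, 220):
    print(f"{gap} Elo gap -> {expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, a gap of 180 to 220 points means the stronger model would be expected to win roughly three out of four battles.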

#

And Claude 3.7 goes from 66 to 74 with thinking (with what thinking limit?)

#

we are also waiting for it on the webdev arena

copper coyote
#

It depends a lot, since here it is possible to add more reasoning lines for better answers, like "What would be easy if" or "I will get borrowed". That was an example for now:

reef crown
#

Is it possible to add some of the R1 Distill models to webdev arena?

fallow gate
stiff kernel
#

Which is below V3

#

The story gets a bit better in Hard (English): you could get up to 1307 ELO, slightly above V3

wild island
#

hello, is anyone else facing the issue where claude 3.7 doesn't show anything for code on webdev arena?

lament glade
#

@wild island could you say more? thanks

hazy trench
#

I'll just post what I posted on the other server

#

It happens on firefox and opera, so unlikely a browser issue, I have only seen it happen for grok and sonnet

cyan oxide
#

In the past though

#

with other models

lament glade
#

interesting, thanks for reporting the issue!

normal copper
#

Just came to report the exact same! I am using Chrome

primal wharf
# wild island

Claude 3.7 is like that most of the time, and grok3 gives an error 🤌🙂 I dunno what's wrong with them

#

Why isn't the new chatgpt4o version there, only the old one?

clever mortar
#

I'm testing the "web dev arena". Every time "claude-3-7-sonnet-20250219" is used, it never generates anything. This is really going to throw off the scores, because we know it's good, but if it's broken it's going to score horribly.

#

You should implement a test that verifies that SOMETHING was generated before considering the test data valid.

quiet raven
# wild island

99% of the time on webdevarena at least one of them does that, where it doesn't work but is seemingly taking time to generate something

teal tendon
#

No response every time with deepseek-v3-0324

high mesa
#

Same here, no response with deepseek-v3-0324

Needs to be fixed asap as its ranking is going to be tanked

lament glade
#

thanks for reporting the issue! if one of the models generates an empty response, the vote will be marked invalid.
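The rule described above (a vote is invalid when one side returned nothing) could look roughly like the sketch below. This is only an illustration, not the arena's actual code: the `Vote` record and both function names are hypothetical, and "empty" is assumed to mean "no non-whitespace output".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vote:
    """Hypothetical record of one blind battle."""
    left_output: str
    right_output: str
    winner: str  # "left", "right", or "tie"


def is_valid_vote(vote: Vote) -> bool:
    """A vote counts only if both sides produced some non-whitespace output."""
    return bool(vote.left_output.strip()) and bool(vote.right_output.strip())


def filter_valid_votes(votes: List[Vote]) -> List[Vote]:
    """Drop votes where either model returned an empty response."""
    return [v for v in votes if is_valid_vote(v)]
```

A check like this only catches empty responses. Output that is complete but fails to compile or render would need a separate signal, such as the sandbox build status.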

acoustic sapphire
#

We need to have a way to ask follow ups separately for each side, because sometimes there is a bug on one side but the other side works, so I just want to give one side the error that is showing to see if it can fix it.

#

Also, I noticed that with nightwhisper it would fail to run the code, but when I pick the other side as the winner, it magically loads, and then the output is actually better than the side I chose. It just didn't load before I selected a winner. This might mess up the results for nightwhisper.

thick sigil
thick sigil
acoustic sapphire
#

damn i lost my nw session

#

gotta find it again

tardy timber
#

I don't know if it's specific to the prompts I am trying, but half the time that I use the webdev arena only one of the LLMs works at all

fallow gate
#

If I add a follow-up, it usually works fine though

acoustic sapphire
#

RIP nightwhisper

chrome raft
#

I got to know about it tdy and wanted to try 😫

acoustic sapphire
chrome raft
#

Literally no other model has fascinated me this much before 🙎

acoustic sapphire
#

lmaooo

merry sentinel
#

Hi everyone, I was wondering if anyone knows how the arena application scales on demand and runs its inference on GPUs? I'm building a related project and wanted to know about the infrastructure. More specifically, how do you scale with user demand (do you launch new containers with GPUs for inference?) and how do you still communicate with those (by keeping a websocket open for inference?). Thanks for any tips!!

cloud blade
native thistle
#

Is nightwhisper still available? haven't got it yet

paper merlin
#

I read that someone said, somewhere on X, that it got removed

acoustic sapphire
rocky forge
uncut glen
#

I am still not sure which is better. Quasar or Optimus. Tesla involved?

ruby comet
#

webdev arena is soooo broken. It often doesn't render websites

median lodge
alpine night
#

What are we supposed to vote when one of the models doesn't run anyway?

native thistle
#

it's really annoying when the output is complete, but the dev server does not start

vale lake
#

dayhush keeps failing to compile, is that normal?

silver shadow
blissful summit
#

is there any way to find out what models are on the arena, and which ones are being taken off?

unkempt grove
#

hi to all

#

I don't understand how to select the AI tool that I like in web dev arena, for example claude sonnet 3.7

hushed bane
unkempt grove
#

ok thanks

#

@hushed bane so when I click on new chat it doesn't know what tool you are using

hushed bane
#

that's correct, in webdev it's only blind battles. the models are revealed after you vote.

ornate pewter
#

hey yall. theres some new model (i think its called sunskirt? or something like that) that consistently does not respond in the battle mode. i normally just vote the other bot (that does respond with something) as better but i dont wanna knock down a potentially good bot because its not responding/set up properly. maybe a report button as part of the options after the first chat round? or is this an existing feature and ive missed it?

#

for reference it will say generating but then time out

stiff kernel
ornate pewter
#

ah okay so even if i pick it just skips

#

?

#

just ran into it again - sunstrike is the name not sunskirt

#

of the model

distant relic
#

I just came here and wanted to say that sunstrike is amazing and absolutely crushes every other model by a wide margin

flat holly
#

Are Dragontail, Dayhush and Claybrook still on the webdev arena?

outer ore
#

Hard to keep up

slow pine
#

Hey I have a question

#

Is there a preprompt that web dev arena uses?

#

And also are the models using a different preprompt/system prompt than if you just use the API

stiff kernel
slow pine
#

is the preprompt public? i'd like to use it privately i think

slow pine
#

Thank you ❤️

slow pine
#
Please ONLY return the full React code starting with the imports, nothing else. It's very important for my job that you only return the React code with imports. DO NOT START WITH ```typescript or ```javascript or ```tsx or ```

hahahhahahaha
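The prompt quoted above asks the model not to wrap its answer in a code fence. If a model does so anyway, a client could strip the fence before rendering. The sketch below is an assumption about how one might do that, not something the arena is known to use; `strip_code_fence` is a hypothetical helper.

```python
import re

# Three backticks, built at runtime so the fence marker is easy to spot.
FENCE = "`" * 3

# Opening fence with an optional language tag (e.g. tsx), then a newline.
_OPEN = re.compile(r"^\s*" + FENCE + r"[A-Za-z0-9_+-]*[ \t]*\r?\n")
# Closing fence at the very end of the text.
_CLOSE = re.compile(r"\r?\n" + FENCE + r"\s*$")


def strip_code_fence(text: str) -> str:
    """Remove one leading and trailing Markdown fence, if present."""
    stripped = _OPEN.sub("", text, count=1)
    if stripped != text:
        # Only look for a closing fence when an opening one was found.
        stripped = _CLOSE.sub("", stripped, count=1)
    return stripped.strip()
```

Text with no leading fence comes back unchanged apart from surrounding whitespace, so the helper is safe to apply to every response.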
steel frigate
shell scroll
#

Guys where is o4-mini-high in leaderboards?

slow pine
#

really good chance google specifically finetuned the new 2.5 to be better at react and working with the webdev preprompt in general

#

they show off the leaderboard in their post

buoyant nest
#

Guys, webdev arena is just broken. completely unusable in recent days

  • Asking to vote with no result displayed
  • One or more of the sides usually displaying nothing
  • And more such incomplete things. I haven't been able to vote once since the release of gemini 2.5 pro update.

I know model outputs are unpredictable, and APIs from the model providers may get stuck or be unpredictable. But you can at least easily check whether they have given an output or not and show an error message, or most importantly not allow voting on incomplete results.

Voting for empty sides, failed API responses.... this is a big no
I can imagine how this has messed up the voting scores.

#

Another bad response immediately after i posted here. The left one is a complete mess and worst implementation. I am sure most users would vote for it like this. gemini would have obliterated this prompt (as i saw with previous gemini 2.5), but gets downvoted stupidly this way

I know I should not have voted, but it doesn't matter, as I am sure many users mess up votes this way. And I shouldn't have been able to submit.
I am a UI/UX dev too. This is totally preventable and a big flaw targeting the purpose of the platform

forest verge
buoyant nest
#

Ok, just did that

blissful vapor
slow pine
#

The solution to this would be to create an identical benchmark but gatekeep it to only experienced Devs that'll choose based on code structure instead of just output form

#

Some kind of elite benchmark based on people actually using it for commits

valid canyon
#

drakesclaw dead already?

lime gull
#

I had the idea of an arena-webdev-auto.

  1. Difficult, creative prompts chosen similar to arena-hard-auto
  2. Prompts are run through models, same as arena-hard-auto
  3. Judge model is given a screenshot of the site as well as the source code, to judge both aesthetics and completeness
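The three steps above can be sketched as a small loop. This is only an outline of the idea, under stated assumptions: `generate`, `screenshot`, and `judge` are hypothetical callables that the caller supplies (a model API call, a headless-browser render, and a judge model respectively), and none of them refer to an existing library.

```python
from typing import Callable, Dict, List

# Hypothetical callables, supplied by the caller:
Generate = Callable[[str, str], str]    # (model, prompt) -> source code
Screenshot = Callable[[str], bytes]     # source code -> image bytes
# (prompt, code_a, shot_a, code_b, shot_b) -> "a", "b", or "tie"
Judge = Callable[[str, str, bytes, str, bytes], str]


def run_auto_arena(
    prompts: List[str],
    model_a: str,
    model_b: str,
    generate: Generate,
    screenshot: Screenshot,
    judge: Judge,
) -> Dict[str, int]:
    """Run every prompt through both models and tally the judge's verdicts."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for prompt in prompts:
        code_a = generate(model_a, prompt)
        code_b = generate(model_b, prompt)
        # The judge sees both source and rendered screenshot for each side.
        verdict = judge(prompt, code_a, screenshot(code_a),
                        code_b, screenshot(code_b))
        if verdict not in tally:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    return tally
```

Passing the three stages in as functions keeps the loop independent of any particular model provider or rendering setup.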
forest verge
#

I'm going to spin up a feedback post for this ^

quasi sableBOT
#
Channel locked

Site outage, will turn back on when resolved.

quasi sableBOT
#
Channel unlocked

Welcome back

quasi sableBOT
#
Channel locked