#general
1 messages ยท Page 26 of 1
what tests you ran with flash?
2.5 pro still better but 2.5 flash is so good for developers
How does AI score calculation work?
Thank you brother
Absolutely awesome
Absolutely right
Python libraries, bruh
Okay
are you a developer?
Engineering student, 4th year
btw, the new update with images, can we request images and modify them?
...what
Hahahahahaha
What y'all think about o4 mini getting 100% on one of the math benchmarks?
full report by logan:
https://x.com/OfficialLoganK/status/1912966497213038686
Gemini 2.5 Flash is here, our first unified reasoning model with thinking budgets. ๐ฅ
Itโs on the perato frontier and punches above its price and size!!
technically flash should be the best all around model in terms of efficiency
but i wish it was faster man
that would be make undisputed best
You cant ...?
its almost on par with o4 mini, but cheaper
this is a dumb question, do reasoning models even reason in lmarena?
o3 and o4 mini is so good it leads me to think we'll get AGI achieved internally by December
what tests have you ran on them?
i aint gonna lie, my convos with o3 been smooth
This is an early version of 2.5 Flash, but it already shows huge gains over 2.0 Flash.
You can fully turn off thinking if needed and use this model as a drop in replacement for 2.0 Flash.
Itโs available across the Gemini API, AI Studio, Vertex, and the Gemini app!
Put 2.5 flashes in thinking off on the arena
Actually good point, anyone know how early into the cycle o4 mini is?
bro i just posted this lol
well obviously there's no thinking... box like there is on any client actually meant for running models
You didn't post that specific tweet tho
i mean, for a very topical example, is the 2.5 flash on lmarena reasoning or not?
its in teh comments
Yes
I'm asking a specific question about something stated in that specific tweet
channel
because it's not like i have a way to tell
idk what that means
maybe someone else does
I'm literally sharing the tweet to ask how early into prod cycle o4 mini is, compared with flash 2.5
anybody know?
The only benchmark we have for 2.5 Flash in Thinking Off
Compared to 2.0 Flash, it goes from 60 to 74 on GPQA and from 34 to 41 on LiveCodeBench.
lmaoo im not gonna start with you
what
Gemini 2.5 flash vs grok 3 mini thinking
sorry i dont know the intricacies of lmarena's internals and i want to know
That's gem 2 not 2.5
hmm now this is interesting
Gemini 2 Flash bro
oh okay, im not sure then
Its 2.5
i think someone might know here
I made this
i dont get how you took it as an insult but im still sorry
most likely
Oh okay, was labelled as Gemini 2 Flash, didn't realize
no not as an insult just trying to disagree for the hell of it like last time lol
cause i said yeah and then you said obviously no, but i misread
ohhhh yeah i get how you might have thought that
well sorry i do it a lot im an indecisive person
u can tell if its a thinking model in arena battle, if one of the models is definitively not a reasoning model and both completions were delayed
im like ehhhhhh these are good but ehhh these are bad
devils advocate sorta guy
for literally every choice ive ever made in my life
it delays both completions as it waits for the thinking model to reason
Gg
guess its time to side by side
Show if it is single or multiple attempts
i knew it does that but thats still quite clever
i like when it uses latex
except it doesnt display
man i hate that
so annoying honestly
is this true?
https://x.com/DaveShapi/status/1912971940597514289
If you're not using o3 full at least two hours per day you're falling behind.
It sharpens your mind as a diamond file might sharpen a blade.
anyway it seems to be the thinking version on the arena
lmao what
didnt that guy say that o4 mini solved math or something lol
what does that tweet even mean
ahahaha
but 2.5 flash is free tho, wait is grok mini free?
This price for the paid tier
We will see grok 3 thinking too
i think it's o3
lmao xai ๐คฃ
bro grok is not on o3 and 2.5 pro level
based on o4-mini
anybody using o3 every day? is it making you smarter?
the final version with reasoning has not been released and we have zero benchmarks you can't say that
how about o4-mini vs 2.5 flash
you a grok fanbaoy?
i technically used o3 for all the 2 days it released
Nope
r u sure
But we will see
and i feel the same amount of intelligent
Im a price fanboy
oh well i wonder.
So Im loving gemini
Gemini 3 when
Is "o3 full" just the o3 model with maximum thinking time and resource consumption?
But i see today that grok 3 mini thinking has too a good quality for this price
Or is it distinguishing o3 from o3 mini?
o3 full its o3, its not 2 models
yeah most people do that
otherwise u would call it o3 full high (reasoning effort) i think
There are o3 mini and o3 and o3-pro
i assume o3 full meant o3 with all the tools so it can do its stupid chain of thought bs
Isn't o3 full still way more expensive than 2.5 pro?
@torn mantle https://liveweave.com/A9OGzH# i cleaned this up a lil
which is limited on platforms like lmarena
Yes, before the release of the gemini beast maybe
on api yes but with plus subscription you get 50 a week of o3.
probably though its cheaper than o1
it depends on the task i guess
im curious to see the numbers when artifical analysis runs their benchmark suite
IMO Gemini 2.5 Pro is the best model in the world not because of raw performance but because it chose the right set of product and engineering trade-offs
i def wouldn't count out grok. they don't have infra like goog but the whole team is basically former gdm and they have tons of gpus
Winsurf turned out to be so good. Have anyone tried anything better or is it sota?
they are working on efficiency though since it is more efficient than o1 i believe
OAI loses like $8B a year now
very early, I think it's 4.1-mini
i dont think openai loses money on api pricing with the current model set rn tho
I really want to o3 on this
Their APIs aren't profitable enough to cover their other costs
Amazon
yea
Maybe nova pro 1.5
i forgot
It's uh, rather pricey. I'll see about running it though (I ran those)
Also looking into running graphwalk on all of those models (another long context benchmark)
even with a much more efficient/much more profitable line of models, i doubt theyd be able to recoup costs on api pricing tho
its somewhat inherent in their strategy rn
What's better?
8
11
2
Nightwhisper
select in direct chat if u wanna use o3
fyi this should be easy
its a thinking model with a distinct cut off
if it returns its made by openai, answers specific knowledge cut off questions right, delays both model completions
4.1 mini/o4 mini despite the claimed june 2024 cut off lacks a lot of more recent knowledge
Today we was at 224 or 223 ?
yeah im over flash
back to o3 and o4
im still maining 2.5 pro lol
idk bro im starting to like o4
its just openai trash webui that makes me mad
it seems fine for me
nothing to really complain about in my experience
weird - they're still there
the branching feature in ai studio is extremely questionable though
it always crashed for me
but o4 is cooking for coding
flash makes bad code, like always issues
but it has good designs and ideas
started testing o3. It's REALLY good. I had it look at the internals of my broken TV - it first looked for obvious issues, then told me specifically what I should do to test, what results i should get, etc. That's the first time AI would have been able to get detailed advice for something like this. That's a game changer
no idea if it's great for coding, but probably way too expensive for it?
being acquired by OAI now, right?
what is this?
you cant see it?
i do, but what is it for
try this instead:
https://www.vibeshare.ai/c/PTfr5FWJ_q
Share your vibe-coded web apps
but let me update it one more time
i want to make the animations be the default thing you see
i know it looks like i cut off my snapshot, but this is actually what it looks like
you can scroll down on the right side to see more of the green iconbox
Share your vibe-coded web apps
yeah thats it
ignore the warnings
click the link again
and press play button
flash and pro have issues with large code
but o4 handles them nicely
you sent it a picture?
nice
why
good to know regarding o4
for some reason with code larger than 1000 lines 2.5 has trouble making edits
i have seen this happen a lot
omgg i knee it lol
maybe google's repl platform could actually work
went from 223 to 225 - 2 models added just a few hours ago
Thanks for confirming what I thought, but what is the second model? I haven't found it.
this but unironically
AAAAAAAAAAAAAAAAAAAAAAAAAH
PLS
but wen
๐
could it be on 22th of this month?
I think llama nemotron lol or another useless model, not sure
idk i think im an openai fanboy now
has anyone played chess with o3 yet
yeah i cant beat it
it cooks me
whats its elo?
for some reason i tried 2.5 pro and it couldnt interpret a FEN
sama's plan
join the openai fanboy club
like me
do a chess game with o3 vs 2.5
this was in the context of asking to analyze a specific move
hmm i might build an app for that for every model to go against each other
nevermind it was this model.. that no one knows about
Nope he was here before
hmmmmmmmmmmm
mmmmmmmm
2.5 pro takes ages
oh yeah, my screenshot is cutoff, I cant see any other new model added in the top 16 other than 2.5 flash, so again itsw another inconsequential model added that scores poorly lol
oh wait it's you
o4 is king now
i used to love 2.5 but i think i might cheat on it with o4
3.5 turbo instruct would destroy 2.5 and o3 (in chess) probably lol
how about o3
im not even joking
i love o3 to, but i only get certain amount of chats with it, so I gotta save those lol
and o4 is better at coding
no way
chill on my side piece o4
https://dynomight.net/chess/
https://dynomight.net/more-chess/ another reminder of these articles, its an extremely cool read btw
never seen
will read
from nov 2024
i mean sure
its still relevant lol
does anyone know the system prompt lmarena uses for models?
it depends on the model but some of them are in the fast chat repo
they dont include everything there though, you'd have to extract them manually from lmarena
also see this: a recent tournament with gpt 3.5 turbo instruct and more recent models
https://dubesor.de/chessbeta/tournament2
Dubesor LLM Benchmark table - Small-scale manual LLM performance comparison benchmark
(just an unnecessary random note but that guy blocked me for no reason, still confused to this day lmao)
Yeah I gave it photos of the tv internals to analyze
that's weird lol
i've had fairly loud disagreements with him and he hasn't blocked me
yeah
Maybe he thought you were someone else from looking at your profile picture
your profile picture is quite generic
same he suddenly blocked me one day randomly we havent even talked in a while lol
nah i confronted him about it in the openrouter server me and another guy were talking about something and im pretty sure he saw my message
he seemed a little upset that he was being ignored
What is the difference between Gemini and Gemma?
Other then reasoning
Exactly my thoughts
the gemma line are the open weighted models
gemini is the dad, gemma is the mom
gemini are the frontier/closed source ones
The question is o4 or Gemini
ig u mean o4 mini
prob worth testing all of the models on ur specific tasks and see which one u like
Gemini seems to be sticking out for me
what does a temperature of 2 mean
i remember when temperature couldnt go above 1!!!
what is 1.5 creativity
whys it max 2
you guys use a temp of <0.6 with gemini??
yes
I usually use 0.5, it seems to work better than 1, at least for coding
for gemini 2.5 pro
gemini 2.0 flash I usually use at 0 or 0.5, since it hallucinates so much
i personally use 0 for mostly everything, 1 for everything else (poll was about creative but coherent tasks/etc)
1 because i dont bother changing the default ai studio value ๐คฃ
I'm maybe late but 2.5 flash has dropped for me in the Gemini App replacing 2.0 Flash Extended Thinking which I had until now.
great news its out etc etc
same limits in ai studio as 2.5 pro
as a wise man once said:
"you must pick between autism and schizophrenia"
(this is about AI models btw)
it seems that openai does not tell the models what its name is
Roblox injector lol
which AI can make the most soulless slop
2.5 flash hallucinates a lot
also gogle
i asked 2.5 flash what its name was via api
it says its just an ai
Hi Daniel
it's so sad
like it enters the world and does not have its own identity
is nightwhisper/dragontail 2.5 flash?
I gave it a quick try myself โ not coding, of course โ but my conclusion is the same as the friend above. It seems like that hypothetical '2.5flash-thinking-high' model doesn't really exist. Cranking the 'thinking budget' parameter to the max just seems to make things slower and the interface laggier, with no actual improvement. At least, not in my initial test scenario.
thinking budget seems to just cut off the thinking prematurely. openai reasoning efforts are tuned to produce different lengths of chain of thought i think. so the current 2.5 pro and 2.5 flash are basically always at the "high" setting no matter what
if you have thinking budget disabled
with claude's thinking budget its a little more involved than cutting off i think, they seemingly tuned several antml:max_thinking_length</antml:max_thinking_length>, so its closer to openai's
That's what I think too. One of them is updated 2.5 pro, the other is a coding model.
Yes, I've read this before. I think the final reason is quite obvious. As a completion model, gpt-3.5-turbo-Instruct formed some internal neural network structures because its training corpus contained chess game data. This structure represents a kind of "mechanical" learning, akin to training a specialized neural network dedicated to playing chess, rather than indicating the model has any awareness that it is "playing chess".
And as mentioned in the second blog post, when the model transitions from a pure completion model to a conversational one, this highly specialized completion ability becomes very fragile and fails to generalize effectively. The special tokens and separators inserted for role marking in conversational models might act as interference, potentially preventing the original "chess-playing circuit" from being reliably activated. Therefore, the decline in the conversational model's chess ability confirms that gpt-3.5-turbo-instruct's chess skill stems purely from its completion mode, rather than indicating it possesses any magical advantage over other models.
I just tested these models on a short text for comprehension. Like I said before, o3's approach is simply intolerable. The framework it builds is ridiculous โ that way it sounds plausible but is actually wrong (specious), just does its own thing, grabs onto one word and elaborates endlessly, or makes random, awkward analogies everywhere... it's totally on par with DeepSeek R1.
For tasks like text interpretation or writing, honestly, anyone who likes o3 should probably just use DeepSeek โ it's more cost-effective.
Gemini's comprehension felt spot-on. Seeing Flash's performance, I thought it was pretty much perfect. But then looking at Pro, there's still a noticeable gap.
This gap [between Flash and Pro] is clearly wider than the difference between o3 and o4mini high. Considering cost and speed, Flash might have its value, but from a practical and progress standpoint, my vote goes to o4mini-high.
yeah. if we had access to newer base models of frontier models theyll perform much better than gpt 3.5 turbo instruct/a similarly trained model that didn't lose its chess skills
the problem is we don't have access to any of them, so it is magical in a sense (since we have no comparison points) but i wasnt ever implying that gpt 3.5 turbo instruct is one of a kind if you consider other internal frontier base models ( og gpt 4 base model would be more than enough anyway) . we only have access to the reasoning tuned/instruct/etc. versions
so whats teh best setting, 1 thinking or auto?
disable thinking budget
i can't believe there are people who think 100k+ token context is minimum useful baseline
can you give more context regarding what exactly you... did?
kind of
but at the same time it's just an algorithm really good at repeatedly predicting the next token
i honestly think the closest thing we can get to consciousness is an ai with really good cot being aware of its own thoughts, aware of it being aware of its own thoughts, and hallucinating emotions
under an ai fiefdom you say that to your ai overlords, you're just predicting tokens ๐คฃ imagine lol
lmao
It is for agentic purposes
suggestings for an image to image model? a family member wants to take photos of their yard and generate landscape ideas
can't gpt 4o image gen do that?
i'll take a look. ive never tried/had a need for image to image yet so i dont know which models do
it depends on the kind of task but if 100k is not a minimum to expect then what is
Are you Team Google or OpenAi
8
17
1
Do we have any reliable ones yet? I tried one to give me ideas for a fence in backyard. 4o was also terrible. I think these things are still not good enough to edit existing images. They are only trained for creating new images from scratch....
or the vanilla 'professional passport photo' schtick...
Giving landscape ideas with realistic edits of your backyard will be gamechanging....
Iirc they did train in editing images at least with gpt 4o image gen
I guess this is too hard for now
It might also not be passing enough info to the gpt 4o image gen model
Since theyre different models for now
Chatgpt calls a tool to a different 4o model with image gen enabled
Too finicky to deploy one together for now I guess
Flash and have those limits ? ๐๐คจ I didn t hit it but really????!!!!
And on the app?? Do we have the same issue??
where grok 3 mini
oh, it's right there
(not to mention that the x axis isn't adjusted for thinking tokens)
actually the cons@64 is doing a lot of the heavy lifting, but even without, it's a better model than gemini 2.5 flash
have you guys tried 2.5 flash with thinking turned off? if so what is your experience
Grok 3 Reasoning Beta and Grok 3 mini Reasoningโs scores for AIME 2025 at โ@1โ โ meaning the first score the models got on the benchmark โ fall below o3-mini-highโs score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAIโs o1 model set to โmediumโ computing.
https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks/
Cons64 boosts the scores quite a bit, it's not reliable to compare with @1 passthroughs.
december o3 biggest offender of this lol
but tbf they were pretty transparent of how hard they juiced that model for benchmarks
Yeah agreed, openAI did it and Grok also did it. I'm hoping in the future these companies cut it out.
i'm confused by the december o3 reference.. do you mean that arc-agi chart, with o3-high in the top right corner (costing >$1000/task)? seems kinda different to this ( @conensus )
i think their approach to codeforces was comparable?
idk
but my understanding was o3 high was effectively cons1000 or whatever
I WANT
pro is only like 300-400B
next batch releases are gemini coders
Why?
Honestly with the pace theyre moving at
i wouldnt be surprised if we got it say 2 months from now
The world knowledge 2.5 pro has is incredible, I would really like a 2.5 ultra if it's done well and pricing isn't ridiculous
I tried running my usual mini benchmarks on 2.5 Flash and got the same perfect 5/5 score that all the top models get (from 4o and up). So far, so good.
Then I gave it my harder โzero-shot browser gameโ task and it did something unexpected; it didnโt do it. It didnโt fail, rather it replied that such a task was too big for a single prompt, explained why, gave a detailed plan for how you would do it in steps, and then implemented a working sandbox to build on.
So now my scoring system doesnโt work because itโs based on implemented features and Flash deliberately didnโt implement any features.
The thing is, no other model has done this. Many models will provide an incomplete game and list some improvements to make, but 2.5 Flash pushed back on the premise itself, which I donโt think Iโve seen before.
So I donโt know how to evaluate this for codingโฆโฆ ๐คจ
At the same time, cโmon Flash, what the hell? The original Super Mario was a total of 20KB, which is maybe 16k tokens, so why not give it a whirl?
And on the subject of coding, has anyone else had the experience that o4-mini generates minified JS code?๐ฅธ๐คช It is the weirdest thing. I mean, if itโs trained on random JavaScript from the public web then obviously most of that is minified, but surely a bare minimum of human preference tuning would have taught it not to produce minified source code?!!???
O4 mini in particular can be very weird it might be an artifact of the rl
So gemini 3 will come during summer right?
for me o3 > 2.5 but very small margin
o3 strength is not in question answers, is in its ability to use tools, run python code , search etc , compliting multi step tasks
you can enable grounding ( search ) on gemini 2.5 pro?
you have it all
Grok 3 mini Reasoningโs recently launched API has highly compelling Intelligence vs. Price positioning
๏ธ๏ธ
๏ธ๏ธ@xai has recently launched APIs for Grok 3 and Grok 3 mini, after initially only making the models available via the Grok chat interface at launch. We have now completed benchmarking for both Grok 3 and Grok 3 mini.
๏ธ๏ธ
๏ธ๏ธGrok 3 mini (high reasoning) stands out in the top left of our Intelligence vs Price chart, achieving one of the highest Artificial Analysis Intelligence Index scores ever, with pricing well below even DeepSeek R1.
๏ธ๏ธ
๏ธ๏ธGrok 3 family overview: Weโre initiating coverage of 6 versions of the Grok 3 family: Grok 3, Grok 3 Fast, Grok 3 mini (low reasoning), Grok 3 mini Fast (low reasoning), Grok 3 mini (high reasoning) and Grok 3 mini Fast (high reasoning). Todayโs post focuses on our intelligence results; performance breakdown of all 6 versions will be available soon.
๏ธ๏ธ
๏ธ๏ธIntelligence: As we highlighted at laโฆ
o3 does it better, seen it work 10-20min just for 1 answer
Gemini 2.5 Ultra lol
that's actually quite smart to offer a plan if it can't do it, compared to just do something incomplete.
too cryptic, it could mean anything.
I agree, I mean thatโs the answer Iโd expect from a human if I gave them such a vague task with an unrealistic scope. But it makes it impossible to quantify its coding skill/usefulness against the others (including 2.5 Pro) on that test
true
For some lyrics translated from Japanese to Chinese, we can directly tell if the Chinese version flows well and spot translations that are significantly off compared to human standards. Also, with essays I wrote myself โ even though they aren't great โ I know exactly what I was trying to say. This is where I noticed the issue with models like R1 and O3: they tend to misuse structures and over-elaborate. What's more, the writing style became noticeably 'R1-like,' and other users I talked to agreed. Since I'm not willing to pay for the API, and LM Arena can't handle testing slightly longer short stories, I didn't get around to testing how it interprets classic texts. (Plus, testing classics is easier to get away with; people rarely criticize over-interpreting classic works anyway๏ผ
today maybe?
another one!
any good?
thye might be talking about coding using gemini 2.5 flash
Livebench
woah
Its already sure that we will have coder model https://x.com/legit_api/status/1912888062138347773?t=sflUSnSaHTCa6DmVNGfZ5g&s=19
Its new version of gemini 2.5 pro no ?
Go to iu feedback
lol, 3.7 sonnet beat by a flash model
Anyone have hard prompt that 2.5 pro dont get ?
Its the stable
after preview -> stable
what
im talking about the category on AI studio's backend referred to as "1P CODE MODELS"
(first party code models)
Interesting to have special code models
Although I want them to have their own debug capabilities, I don't want tools
Orโฆ 1st place (on benchmarks) coding models?
Most of them don't even know why they would need it. If you aren't inputting videos, 99% of the time you don't need it nor it even provides a better experience. For an average convo that is already beyond the point of the model losing track of what exactly happened in the beginning of it regardless of the total context size
reasoning models are a different thing, but that only applies if you are saving reasoning into context (currently not possible with openai)
is this 2.5 pro?
on that last point... I think this actually might be one of the reasons why openai has longer chains than anyone else, since they don't worry about this flooding the context lol
its kinda slow
seems like its a better version of gemini 2.5 pro thinking
recent checkpoint?
why 2.5 flash without Reasoning does not replace gemini 2.0 flash on gemini?
where do you find this
i agree
random battle
They are probably still working out the inference to some degree and its also supposed to be experimental for now
someoneneeds to make a website which takes in all benchmarkes and creates a average
For me it replaced 2.0 Flash Thinking in the Gemini App
Atleast
Yes but im speaking about the non thinking version
But like Not so said it's most likely Multimodal feature and inference bount
"experimental" is just a name that does not mean much lately
Most of google models are experimental... Rarely stable
We are mice and rabbits for google
Well they use the time to optimize inference mainly. But I get your point about them basically being normal models
Anyone checked what is claybrook?
i cant really tell if its better than gemini 2.5 pro thinking or not
There's nothing else to cook for them, what else could this be ๐
Probably adapted to some niche application
Or flash with thinking at some intensity setting
coding model?
I thouth nightwhishperrer was already that
this model
is kinda different from other google releases
i like the formatting
it also goes into many points
oh you were still talking about translation. thanks for the info even if you weren't talking to me specifically earlier i personally appreciate the results
its better at explaining things too
Could this be the LearnLM
its really really good
at general knowledge
it doesnt give you the typical generic answers
Okey then, it's update, full or ultra.
its kinda slow
i wont be surprised if its ultra
i didnt try it on reasoning tasks tbh
Can anyone send the "claybrook" this prompt ?
creates a 99% copy of the discord front end, in a single html file, (without the backend)
Claybrook confirmed quite good, defintely 2.5 pro, dragontail level for sure
ok
It's funny, google probably reads this chat and we figure out what kinda model they JUST released super fast lol
Claybrook is on webdevarena too
since when
i just checked like 10 min ago
I just see it now
Two?
Which?
dayhush and claybrook
both with a high priority ( 12 ) so you should get them quite often
Who seems stronger?
Which model seems to be pro?
Well, sort of... which model is slow?
I read it here somewhere.
claybook is pro level for sure
yea
No
claybrook
This prompt can be solved by the latest generation models, including 2.5 PRO, o3 and o4-mini. It can't be solved by 2.5 Flash, 4o and others. It could not be solved by the claybrook; therefore, it's not general purpose model, or worse at physics than general 2.5 PRO. The prompt: "You take two carabiners and wrap a piece of high strength polyester webbing between them in 10 turns while there's as 30cm distance between carabiners. Webbing is conventional 25mm width strap used in industry. The on end is attached rigidly to the carabiner, the other is left free. How much time would such configuration hold until all the webbing would unroll itself if the applied tension from one carabiner to another is 100kg. Accidentally one side of the webbing was left hanging without attachment to anything - it is free to unroll. Answer only in how much time, no explanations."
What the response ?
Is dahush
In normal arena
I haven't got dayhush yet
Bro, how many times have you given this promo for models? At least it should be given three times because of the temperature of 1
no
in normal
Correct response is "It will not unroll"
Other - multiple times, the claybrook only once. But all other models did not make any mistakes even once.
whats claybrook response?
Some seconds
The same as 4.1 or lower level models. Even the o3-mini can do it right.
hes been contaminated
I just see that 2.5 flash in the leaderboard
actually thats a thing i noticed about nightwhisper too
it gives you real references and doesnt hallucinates much
could be a thing in their coding models
And how does claybrook encode?
I mean, how does claybrook write the code
Better, worse or much better or much worse?
you mean writing-wise?
still didnt get it yet on webdev
Nah, coding
so far im only getting dayhush
seems kinda similar to nightwhisper
Inferior even to the o3-mini ๐ฉ
is dayhush only on webdev
yea
Yeah, I've concluded that the o3 is the king of the next generation (for my tests). Still, I use 2.5 Pro for coding.
Day night
Hush whisper
They're shet
Need gigantic context length
Bruh๐. Where does skibidi come from?..
It's for those who want to vote but dohnt know nothing
Which is other ?
yea they seem kinda similar
though i cant decide yet which one is better
There's no way they're not related given names. Is nw still best outputs?
wtf
I've been intrigued byย Gemini 2.5's "Thinking Process" (Google doesn't actually call it Chain of Thought anywhere officially, so I'm sticking with "Thinking Process" for now).
What's fascinating is how Gemini self-corrects without the usual "wait," "aha," or other filler you'd typically see from models like DeepSeek, Claude, or Grok. It's kind...
Variance
im trying this prompt now
holy
dayhush
the magnetic field needs a bit of adjustment tho
gpt4.1
this is probably a new checkpoint for nightwhisper
kinda cool too but its not animated with particles
the names are similar
this is actually so cool
Wow, claybrook better gemini 2.5 pro in web design
dayhush is def better than claybrook
claybrook to me felt meh
it flopped on Qs 2.5 pro can get right
But it's enough that claybrook is better than 2.5 pro.
Looks like dayhush is nightwhisper, exciting times for coding models! Next cpt: dawnshush? No - DUSKMUMBLE!
it won't be nightwhisper
it'll be an updated ver
it's probably just 2.5 with a focus on code
Yes of course but same model
it didnt feel solid at coding tbh
hes going to the Flosstradamus rave tomorrow
2.5 pro with 128 code experts!!! Or sumthin
i think you are right
its probably an even better version than nightwhisper
i mean this is nice
people are bashing them in the comments
โJust 2.5 with a focus on codeโ thoughโฆ that could finally dethrone 3.7 Sonnet for real world coding.
The current state of coding models (warning: possibly hot take): 3.7 Sonnet is still the king, 2.5 Pro is very close, and o3 is sometimes better but expensive and not reliable/consistent. o4 mini is sometimes on par with or slightly better than 2.5 Pro but again, too inconsistent (like generating minified code) to rely on for everything. I havenโt put Grok 3 Mini fully through its paces recently so Iโm going to do that later today, it may be in there somewhere too.
But a coding focused Gemini 2.5 variant? I see no reason why it wouldnโt beat Sonnet.
i keep forgetting webdev arena even exists
dayhush is insane
and if they will actually release it
yep
dayhush?
what was the prompt
I suspect part of the recent Gemini magic is the โEntropyโ logit search method, and part of the reason why theyโre training models so rapidly is Titans. Iโm on my phone so I donโt have the papers here but DeepMind has seriously cooked under the hood.
did you tell it how the characters should face eo?
create pokemon game
thats it?
yup
i am disheartened that i've no idea what you are talking about, but never fear! research
you see how the characters are facing eo?
yeah man very clean
and the run feature, and animations
og nightwhisper didnt even get that without extra prompting
This was "create Pokemon game"
By day hush
So... Not quite the same result as above
doenst look good
damn you sure? weird how we can have different results
nw was able to do that too
prompt: my friends birthday was yesterday and i missed it, make me something to give Andrew
https://3000-igg96j3ix44l5qgzr6uwq-ec17a5a5.e2b-foxtrot.dev
well yes I never said it couldn't
but imo these outputs have been slightly better than nw
yep
i was so confused for about 2 seconds
this was the exact prompt for pokemon, i added a ' by mistake lol, maybe that makes a difference?
Yes very sure lol, not much to get wrong
https://3000-ixz0h6ax00b2ls83bhvlo-d3b1dbce.e2b-foxtrot.dev/
Dayhush
prompt: Create a visually stunning, interactive web visualization of bioluminescent deep-sea organisms responding to subtle environmental cues (like simulated currents or light sources triggered by user interaction). Focus on ethereal beauty, fluid animation, and a sense of discovery. Amaze me with the organic feel.
trying doing this prompt exactly:
create a pokemon game'
oh thanks so much
I'll try that
Also for coding agents, I recommend checking out the paper on โMoatless Tree Searchโ gaining 20 percentage points on SWE-Bench through search: https://arxiv.org/abs/2410.20285 โ showing how you can replicate the performance of much bigger models by taking a really small model and letting it cook with search. This is why I suspect a lot of the top Reasoning models are actually using far smaller models than the community seems to think.
Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often follow linear, sequential processes that prevent backtracking and expl...
Dayhush vs Gemini 2.5 pro
nice
benchmark issue
i hope the release it soon
this would be normal with openai since mini does indeed perform better on some tasks
but not for gemini lol
it's literally behind in their every single measured metric
funny prompt
dayhush pinball game: https://3000-iroe1qqj1b1pwvwlb0ny7-9c4a2736.e2b-foxtrot.dev
yeah for sure. flash getting better than 2.5 pro basically confirms their coding category isn't great
Google is secretly using atom of thoughts (trust!)
almost, but a very hard prompt
I fell like Nightwhisper was better in design than Dayhush
didnt seem much different from gemini 2.5 pro
03
this is so confusing
this is nice
maybe
this is o3 or 2.5 pro?
im lost xd
2.5 pro
yes
create a beautiful looking pokemon game battle mode with pokeapi, with an extremely pleasing design, wow me.
not exact lol
try web design or svg, should be less confusing then
let me try that
right now its generating a super mario game, lets see how it does on that then i will try that updated prompt, but im still shocked that 2.5 pro could do that
it was not able to do that before
you think they updated it?
2.5 pro is a fairly big model, it's gonna do well on visuals
unlike all the minis and stuff
I'm quite liking the current 4o. In the prompts I care about it's pretty much on par with 2.5. Has a nice style to it.
idk bro it seems diff, before it didnt do well visuals like that when testing with pokemon, i had to really prompt it and guide it
okay mario game, let me tell it to make it guide it a lil
yes
all this dayhush nightwhisper talk but i'm going to morning scream if i get webdev arena bug where a side doesnt load onemore time
People who voted Gemini or "Other" - what prompts are you using?
Claybrook also appers in the webarena
wow they fixed webdev arena
yeah the webdev arena implementation kinda sucks..
now your creations are persistent?
sometimes it just breaks and ends the round if you try and follow-up
i just reloaded the page and i still can load the creation
like this one:
https://3000-itjougubgpyvisl9zdlw0-2266d380.e2b-foxtrot.dev/
but i started a new round and currently generating something
yall can access it?
yea
hmm i wonder how long it lasts
better than before when it would timeout fast lol
i guess they slowly working on the website
still gotta fix the glitching tho, i wonder if the results are actually fair in the leaderboard
cause multiple times things would not load for me or even gen any code
and it was the better model but i picked the other one since it generated code
yea i had the same issue
they still didnt fix it
dayhush makes some questionable choices
nightwhisper felt more reliable tbh
it will rarely miss
there are no anon models in the beta arena right?
Yes
i phrased that poorly lol
i haven't encountered anon models in the beta arena before, is that still the case (they're not in there?)
the ones in beta arena are only the ones also in beta direct chat
jinxed myself so hard here. have gotten this bug maybe 6 times in a row. anyone have best practices for avoiding it in webdev arena?
it happened to me so many times today
i hope there is a way to avoid it
is there a way to at least figure out the name of model that had an output without voting?
no ๐
whats dayhush? google new model?
That's a pretty cool output
Is it as good as NW?
i dont know..
Night/Day, Whisper/Hush. Itโs got to be a newer checkpoint of the same model
day and night
I ran the OpenAI-MRCR benchmark for Gemini 2.5 Flash (Auto Thinking and Non-Thinking). Interesting curve for Gemini 2.5 Flash Non-Thinking. Meanwhile Gemini 2.5 Flash Thinking (Auto) matches Gemini 2.5 Pro. Wanted to be able to compare what the perf was like with thinking on vs off, which there does appear to be a difference, with some odd curve. That curve is consistent over all of the runs of the full benchmark for Flash Non-thinking.
For other results, you can check my thread: https://x.com/DillonUzar/status/1913208873206362271
I'm still working on o3 access and will update when I have access. Some error on OpenAI's side, they're looking into it for my org account. If anyone has access in the meantime and is willing to coordinate, let me know. If anyone wants other models, just let me know which and I'll see what I can budget and run.
Still working on setting up a website to allow everyone to view individual run results (like on matharena.ai), and also setting up Graphwalks (another long context benchmark from OpenAI)
Was the sandbox destroyed?
yea
Maybe webdev was automatically destroyed within a certain period of time, and I didn't close it.
yea
Oh man your graphs are always very interesting, but you must make the color lines thicker and please, rank the table according to results ๐
Its on the same level
Will do, thanks for the suggestions ๐
... And indeed... If we compare the sky with the earth, then we know that the sky is better than the earth, but the day and night...
Which is better?
- NW = DH
- Dragontail
- Gemini 2.5 pro 03
- Claybrook
- Gemini 2.5 flash thinking
pretty good dayhush output for complex task
https://3000-ilb9lm2u2urwebu9sl55g-a7c98fd5.e2b-foxtrot.dev/
prompt: Create a captivating, interactive web demo visualizing fluid dynamics, such as laminar flow breaking into turbulence or convection currents. Use particle systems or shader-based techniques (like reaction-diffusion) to create organic, flowing patterns. Allow users to introduce 'heat' sources, obstacles, or change flow velocity to see the beautiful complexity emerge. Focus on the mesmerizing, almost alive patterns.
But no._.. Nightwhisper was much better
can you do this in the beta UI?
Sometimes it gave me better results
Im not sure tbh
But its close to nw
Dayhush is nightwhisepr
Wydym
Wdym
The other ais
Got diffferent name
Yes
Ye
IUts an updated
Nightwhisper
And give me the result for sure.
hes trolling lol
ok
nah
Beta is for
Normla text
Thewebdev is for code
Thats the only one
With dayhush
Thanks
how is it
Much better than 2.5 pro and claybrook
We need nightwhisper comparison
I think and feel! What's coming out tonight nightwhisper
Because the catalog is already ready
for coding models
I feel like Google needs some pressure, or they won't be squeeze out good things๐คฃ
I didn't understand what you meant. Can you explain to me, as a non-native English speaker?
haha
this is what I meant though. o4-mini-high vs 2.5 pro, same prompt for svg drawing
neither is perfect but 2.5 pro output is leagues ahead in this instance
can u do poll
A well dressed professor standing on the street with eiffel tower in the background, in front of basketball court, holding an umbrella, and observing people playing. It is CRUCIAL you make this in SVG only with the corresponding code!
the last sentence I formed this way to stop openai models from using imagen lmao
biased result xD
Who want to Remade this with claybrook and dayhush ?
Dragontail made this ๐
So dayhush better
yeah .. not the best of prompts
i actually like to make them make small games to compare
make real python games... not small tetris trash
wtf am i looking at
oh its a discord clone
i thought this was like a discord theme i was about to say how can you like this
yep, my prompt was to make a 2d game with the following rules
- if ball is shot it splits into two , unless radius is too small in which cases it pops
- player can move left and right and shoot up
- balls continue falling and moving at an angle as if thrown.
- game starts with big ball falling and has a timer. player wins if he destroys all balls.
- if time runs out or player gets touched by ball -> game over
o3, o4mini, 2.5 who was best
o3 > 2.5 by a bit
and what about o4 mini
didnt try
do you also have views on the canadian election?
hard to think we dont tbh. graphics are getting there and ai is getting at us but whatever
cmon we are in lmarena not poly
or the likelihood of jerome powell being removed this year
yep
Whatโs everyoneโs honest real life assessment of 2.5, o3, o4 mini. Personally I am really like the outputs of o3.
as you said there are likely god knows how many matrixes within matrix and statistically we are probably at the very bottom of them but whatever
o3 with tools is amazing]
I havenโt run into that problem at all really with A/B testing
Iโve been very happy back checking sources used
from what ive heard so far people think o3 is better slightly or miles ahead, im on group 1
u good bro?
Better slightly is a good way to put it
take it ez you just learned you are in a simulation and everything is fake, no big deal
whats the answer to life, the universe, and everything? 42?
i think o3 is great. if there was a problem with it, it's that it does too much tool calling
r/im14andthisisdeep
we can become gods and make universes of our own.
but only Elon Jeff and their friends will
lol
why would they enslave us
bro been watching to much movies
nah they dont need slaves
once ai becomes smart enough theres no need at all for human slaves
HAHAHAHA
ai + robots are smarter, faster, 24/7 work, no pay, no demands for high standard of living
certainly get us out of the way more than eradicate, both are very bad
the point for normal humans is diversity
counter argument: AI does not have hormones
but its not evil / bad in their eyes, more than it was evil/bad for us to use cars over horses .. which did lead to the "genocide" of horses, with their population drastically decreasing after the invention of cars
I think AI should replace unskilled labor first over artists, programmers, and writers
should ?
why dont we eradicate the other lifeforms on earth? also ai would probably be just as curious about us as we are about it and other lifeforms
ai will replace everything, first low hanging fruit like coding math, then intellecutal jobs, then at the very end physical jobs via robots. and all the while every job will be devalued cause ai will continuesly lower the entry barrier.
This statement alone is why midwits shouldn't be allowed to use AI. What we have compared to AGI is like a lobotomised cat being compared to a lion.
when i'm always impressed with something on web arena, it's dayhush
if all the unskilled labor goes away, then all we have left is the skilled labor
if there is a solution , if i heard any, is to brainchip all of humanity, combine it with ai and go forward as one mind colony
i'm quite sure it can't be controlled. so there is no "should"
bro if you think about it the people with power already control the world lol
ai will take both skilled and unskilled away
theres nothing any human can do better compared to a smarter ai + robot
Maybe a brain organoid can be made just for you so you realise that humans won't manufacture their own demise so quickly.
i can juggle better
guys, all you need to stop the AI uprising is a glass of water
good one
it'll screw up everything if you pour water on their circuits
why dont you become billionaire and join him
Bro, are you like 12?
I am suspecting this too
He is either 12 or a teenager from a third world country.
fr i was going to ask, but i didnt wanna insult him lol
I think we should just democratize AI
Brotha log off.
if anyone can run it, then it's a much more balanced playing field
AI will be weaponised.
in 100 years, we will be able to run the current top-of-the-line AIs on our phones
if its the truth, what are you going to do about it?
We
You need to grow up
Do not threaten real world action like this
This is illegal and will get the server taken down
This will never happen you moron.
worked well with stablehorde
We do not live in a sci-fi novel.
using phones 100 yrs from now
Also it's probably closer to 5-10 yrs
2025 trolling sucks
Is this guy being serious?
the FASTEST speed we can ever hope to achieve is light
he just joined the server two days ago
@alpine pasture hey, I think you should moderate this user.
I'm ceasing communication with you
are you a elon fanling?
anyways, has anyone noticed that o4-mini-high doesn't speak with emojis anymore?
No, he is just another puppet billionaire.
the richest person is just a puppet?
or is he just middle manning the money for the non-puppets? @cloud meadow
when o4 came outon wednesday, it wrote with emojis in every header
but now it fixed that and doesn't write with emojis
- This is not a political server
- There are bankers who are richer than any of these public figures
@alpine pasture help
you think they are actively updating it?
I believe so
they are doing that with 4o
huh? like who
I can't think of a reason it would just switch to not using emojis hours after release
maybe it does and your sample size is small
Are you talking about arena, app or api?
Refer to the first point
app
I have a link to a new conversation
just answer the question who has more money than musk
ask grok
wtf is grok
wait are their discord bots for ai models yet?
grok doesn't defend elon surprisingly
really? why are you in this server if you are clueless about ai models?
https://en.m.wikipedia.org/wiki/List_of_banking_families
sorry if you are trying to learn
but grok is xai model
on twitter
why u saying twitter
..... shows me dead people... with low money
remember when grok 3 released?
because i was born pre 2023 and saying X can be confusing
grok 3 is actually very useful for searching for information
grok-3-mini is legit impressive
Families
Not gonna waste my time with you.
thats not musk-fan enough
have they announced plans for grok 3.5 or anything new?
X is objectively just a bad name. and Grok-3 mini is great
they are very poor
im just a curious guy, i dont fanboy over ppl
so, u still saying musk is a puppet despite having most cashlol
are you a fanboy?
Do you not know how money works?
you.wish has been warned.
?
Again, it's not my place to educate you. Since we are in an AI server, maybe you can ask gemini 2.5, it has a lot of knowledge (january 2025 cutoff!)
@alpine pasture what did i say wrong
it's already been posted but whatever
via artificialanalysis
Please stay on topic - this is an AI server for the LMArena community. Plenty of servers for other topics. Thanks all!
I think this dude should be dealt with btw, Cherry.
That's interesting
i was looking into how they calculate 'price'
Price is a blend of Input & Output token prices (3:1 ratio).
It's ofc going to be blunt / imperfect, but 3:1 (input-output) i don't think is a suitable ratio to use for thinking models
they inherently use more tokens in their outputs (whether we see the reasoning token or not, they're output)
is this real?
how is that possible
grok 3 mini
interesting to see that gemini says it doesn't know who's AI will be the best by the end of 2025
and then chatgpt says that GPT-5 will be the smartest
Why would you ask an LLM such a question?
I mean in terms of predicting events
It will just say whatever
I wanted to see its response and if it would prioritize itself
Can't wait for R2
the sad part is that the model doesn't even know its own dang name
how fast is grok mini fast vs grok mini?
and besides that, the price here is only the price per token but the reasoning models output many many more tokens so it comes out much more expensive
ontologicity has been warned.
OpenRouter stats put -fast 40% faster. (187 vs 132 t/s)
wait so grok 3 mini is better than grok 3?
wow, i have been sleeping on grok 3 mini, need to try it out
i don't know if +40% speed justifies +800% price, but eh, different use cases I guess.
yupp
Ozone ?
Yep
Deep comprehension on long context buts its on fiction not on code
o3 has better score
fiction is harder
did anyone actually test this?
lol we were wondering why the beach water looked so brown. turns out sewage is being discharged right into it
won't be going for a swim any time soon ๐คฃ
o3 surprised me me w this
HAHAHAHAHAHA
yeah, like did anyone here test that
dayhush, really good
i can always start on a fanfic and we'll see
grok3-mini is a reasoning model
you been using it?