#general

1 messages ยท Page 26 of 1

olive mesa
#

thinking budget wow

balmy mist
#

what tests you ran with flash?

olive mesa
#

2.5 pro still better but 2.5 flash is so good for developers

humble sonnet
#

How does AI score calculation work?

drifting elk
#

Absolutely awesome

#

Absolutely right

drifting elk
humble sonnet
#

Okay

balmy mist
drifting elk
humble sonnet
#

btw, the new update with images, can we request images and modify them?

tall summit
#

...what

drifting elk
#

Hahahahahaha

zinc ore
#

What y'all think about o4 mini getting 100% on one of the math benchmarks?

balmy mist
#

technically flash should be the best all around model in terms of efficiency

#

but i wish it was faster man

#

that would be make undisputed best

balmy mist
#

its almost on par with o4 mini, but cheaper

tall summit
#

this is a dumb question, do reasoning models even reason in lmarena?

raven void
#

o3 and o4 mini is so good it leads me to think we'll get AGI achieved internally by December

balmy mist
balmy mist
zinc ore
cedar tide
#

Put 2.5 flashes in thinking off on the arena

zinc ore
#

Actually good point, anyone know how early into the cycle o4 mini is?

balmy mist
tall summit
# balmy mist yeah

well obviously there's no thinking... box like there is on any client actually meant for running models

zinc ore
#

You didn't post that specific tweet tho

tall summit
#

i mean, for a very topical example, is the 2.5 flash on lmarena reasoning or not?

balmy mist
zinc ore
#

Yes

balmy mist
#

like right there

#

as soon as you click link

#

check the relevant ai news

zinc ore
balmy mist
#

channel

zinc ore
tall summit
balmy mist
#

maybe someone else does

zinc ore
#

I'm literally sharing the tweet to ask how early into prod cycle o4 mini is, compared with flash 2.5

cedar tide
#

The only benchmark we have for 2.5 Flash in Thinking Off
Compared to 2.0 Flash, it goes from 60 to 74 on GPQA and from 34 to 41 on LiveCodeBench.

balmy mist
tall summit
cedar tide
#

Gemini 2.5 flash vs grok 3 mini thinking

tall summit
zinc ore
#

That's gem 2 not 2.5

balmy mist
zinc ore
#

Gemini 2 Flash bro

balmy mist
cedar tide
balmy mist
#

i think someone might know here

cedar tide
#

I made this

tall summit
tall summit
zinc ore
#

Oh okay, was labelled as Gemini 2 Flash, didn't realize

balmy mist
#

cause i said yeah and then you said obviously no, but i misread

tall summit
#

well sorry i do it a lot im an indecisive person

keen beacon
#

u can tell if its a thinking model in arena battle, if one of the models is definitively not a reasoning model and both completions were delayed

tall summit
#

im like ehhhhhh these are good but ehhh these are bad
devils advocate sorta guy

for literally every choice ive ever made in my life

keen beacon
#

it delays both completions as it waits for the thinking model to reason

zinc ore
#

Gg

tall summit
zinc ore
tall summit
#

i like when it uses latex

#

except it doesnt display

#

man i hate that

#

so annoying honestly

balmy mist
torn mantle
#

imma try it first

#

i still havent tried it xd

tall summit
#

anyway it seems to be the thinking version on the arena

keen beacon
#

didnt that guy say that o4 mini solved math or something lol

tall summit
#

what does that tweet even mean

keen beacon
#

ahahaha

balmy mist
cedar tide
balmy mist
#

i still say o3 is better but in different ways

#

like in terms of agi yes

cedar tide
#

We will see grok 3 thinking too

tall summit
#

i think it's o3

keen beacon
#

lmao xai ๐Ÿคฃ

balmy mist
tall summit
balmy mist
#

anybody using o3 every day? is it making you smarter?

cedar tide
tall summit
#

how about o4-mini vs 2.5 flash

tall summit
cedar tide
balmy mist
cedar tide
#

But we will see

tall summit
#

and i feel the same amount of intelligent

cedar tide
#

Im a price fanboy

tall summit
#

oh well i wonder.

cedar tide
#

So Im loving gemini

keen fulcrum
#

Gemini 3 when

patent aspen
#

Is "o3 full" just the o3 model with maximum thinking time and resource consumption?

cedar tide
patent aspen
#

Or is it distinguishing o3 from o3 mini?

cedar tide
keen beacon
#

otherwise u would call it o3 full high (reasoning effort) i think

cedar tide
#

There are o3 mini and o3 and o3-pro

tall summit
#

i assume o3 full meant o3 with all the tools so it can do its stupid chain of thought bs

patent aspen
#

Isn't o3 full still way more expensive than 2.5 pro?

balmy mist
tall summit
cedar tide
#

Yes, before the release of the gemini beast maybe

brittle tiger
keen beacon
#

it depends on the task i guess

#

im curious to see the numbers when artifical analysis runs their benchmark suite

patent aspen
#

IMO Gemini 2.5 Pro is the best model in the world not because of raw performance but because it chose the right set of product and engineering trade-offs

brittle tiger
#

i def wouldn't count out grok. they don't have infra like goog but the whole team is basically former gdm and they have tons of gpus

patent aspen
#

o3 depends on expensive trade offs to be good

#

which isn't sustainable

calm sequoia
#

Winsurf turned out to be so good. Have anyone tried anything better or is it sota?

keen beacon
#

they are working on efficiency though since it is more efficient than o1 i believe

patent aspen
#

OAI loses like $8B a year now

ocean vortex
keen beacon
brittle tiger
patent aspen
keen beacon
#

ya prob cohere idk lol

#

that line of anon names seem familar

cedar tide
#

Amazon

keen beacon
#

yea

cedar tide
#

Maybe nova pro 1.5

keen beacon
#

i forgot

worthy thunder
#

Also looking into running graphwalk on all of those models (another long context benchmark)

keen beacon
#

its somewhat inherent in their strategy rn

#
poll_question_text

What's better?

victor_answer_votes

8

total_votes

11

victor_answer_id

2

victor_answer_text

Nightwhisper

#

select in direct chat if u wanna use o3

#

fyi this should be easy

#

its a thinking model with a distinct cut off

#

if it returns its made by openai, answers specific knowledge cut off questions right, delays both model completions

#

4.1 mini/o4 mini despite the claimed june 2024 cut off lacks a lot of more recent knowledge

cedar tide
#

Today we was at 224 or 223 ?

balmy mist
#

yeah im over flash

zinc ore
balmy mist
#

back to o3 and o4

keen beacon
#

im still maining 2.5 pro lol

balmy mist
#

its just openai trash webui that makes me mad

keen beacon
#

nothing to really complain about in my experience

#

weird - they're still there

#

the branching feature in ai studio is extremely questionable though

balmy mist
#

but o4 is cooking for coding

#

flash makes bad code, like always issues

#

but it has good designs and ideas

wheat onyx
#

started testing o3. It's REALLY good. I had it look at the internals of my broken TV - it first looked for obvious issues, then told me specifically what I should do to test, what results i should get, etc. That's the first time AI would have been able to get detailed advice for something like this. That's a game changer

#

no idea if it's great for coding, but probably way too expensive for it?

wheat onyx
balmy mist
#

plus flash

wheat onyx
balmy mist
#

you cant see it?

keen beacon
#

worked after refresh

wheat onyx
#

i do, but what is it for

balmy mist
#

but let me update it one more time

#

i want to make the animations be the default thing you see

wheat onyx
#

i know it looks like i cut off my snapshot, but this is actually what it looks like

#

you can scroll down on the right side to see more of the green iconbox

balmy mist
#

yeah thats it

#

ignore the warnings

#

click the link again

#

and press play button

#

flash and pro have issues with large code

#

but o4 handles them nicely

torn mantle
tall summit
balmy mist
#

this*

tall summit
#

good to know regarding o4

balmy mist
#

for some reason with code larger than 1000 lines 2.5 has trouble making edits

#

i have seen this happen a lot

hardy pecan
balmy mist
#

omgg i knee it lol

tall summit
hardy pecan
balmy mist
#

this could be gpt lol

#

jk

cedar tide
tall summit
hardy pecan
torn mantle
#

PLS

#

but wen

#

๐Ÿ˜–

#

could it be on 22th of this month?

hardy pecan
balmy mist
#

idk i think im an openai fanboy now

tall summit
#

has anyone played chess with o3 yet

balmy mist
#

it cooks me

#

whats its elo?

tall summit
#

for some reason i tried 2.5 pro and it couldnt interpret a FEN

balmy mist
#

join the openai fanboy club

#

like me

tall summit
#

lol

#

i want to test o3 on the same

balmy mist
#

do a chess game with o3 vs 2.5

tall summit
balmy mist
#

hmm i might build an app for that for every model to go against each other

hardy pecan
tall summit
#

GLM

#

ALL MY HOMIES LOVE GLM

cedar tide
tall summit
#

mmmmmmmm

#

2.5 pro takes ages

hardy pecan
# cedar tide Nope he was here before

oh yeah, my screenshot is cutoff, I cant see any other new model added in the top 16 other than 2.5 flash, so again itsw another inconsequential model added that scores poorly lol

tall summit
#

2.5 pro cant do ascii diagrams

keen beacon
#

oh wait it's you

balmy mist
#

i used to love 2.5 but i think i might cheat on it with o4

tall summit
#

o4 mini against 2.5 flash would probably take less time

#

lmao

keen beacon
#

3.5 turbo instruct would destroy 2.5 and o3 (in chess) probably lol

tall summit
#

how about o3

balmy mist
#

and o4 is better at coding

balmy mist
#

chill on my side piece o4

tall summit
#

this is already looking much better for o3 LMAO

#

even sends me the fen

keen beacon
tall summit
#

will read

#

from nov 2024

#

i mean sure

keen beacon
#

does anyone know the system prompt lmarena uses for models?

#

it depends on the model but some of them are in the fast chat repo

#

they dont include everything there though, you'd have to extract them manually from lmarena

#

(just an unnecessary random note but that guy blocked me for no reason, still confused to this day lmao)

wheat onyx
keen beacon
#

i've had fairly loud disagreements with him and he hasn't blocked me

keen beacon
blazing rune
#

Maybe he thought you were someone else from looking at your profile picture

#

your profile picture is quite generic

keen beacon
keen beacon
#

he seemed a little upset that he was being ignored

balmy mist
#

nvm o4 and o3 is ehh at coding large codebase ๐Ÿ™‚

#

back to gemin

quiet pollen
#

What is the difference between Gemini and Gemma?

#

Other then reasoning

#

Exactly my thoughts

keen beacon
#

the gemma line are the open weighted models

balmy mist
#

gemini is the dad, gemma is the mom

keen beacon
#

gemini are the frontier/closed source ones

quiet pollen
#

The question is o4 or Gemini

keen beacon
#

ig u mean o4 mini

#

prob worth testing all of the models on ur specific tasks and see which one u like

quiet pollen
#

Gemini seems to be sticking out for me

tall summit
#

what does a temperature of 2 mean

#

i remember when temperature couldnt go above 1!!!

#

what is 1.5 creativity

#

whys it max 2

upper wolf
#

you guys use a temp of <0.6 with gemini??

tall summit
blazing rune
#

for gemini 2.5 pro

#

gemini 2.0 flash I usually use at 0 or 0.5, since it hallucinates so much

keen beacon
#

i personally use 0 for mostly everything, 1 for everything else (poll was about creative but coherent tasks/etc)

#

1 because i dont bother changing the default ai studio value ๐Ÿคฃ

sweet tinsel
#

I'm maybe late but 2.5 flash has dropped for me in the Gemini App replacing 2.0 Flash Extended Thinking which I had until now.

tall summit
#

same limits in ai studio as 2.5 pro

tawdry phoenix
#

Literally going insane

glass arch
#

as a wise man once said:
"you must pick between autism and schizophrenia"

#

(this is about AI models btw)

#

it seems that openai does not tell the models what its name is

tawdry phoenix
#

Roblox injector lol

tall summit
#

roblox injector

#

can ai do that

#

well first of all would

#

probably not

glass arch
#

which AI can make the most soulless slop

olive mesa
#

2.5 flash hallucinates a lot

olive mesa
#

google

#

i asked 2.5 flash what its name was via api

#

it says its just an ai

glass arch
#

like it enters the world and does not have its own identity

late path
#

is nightwhisper/dragontail 2.5 flash?

hardy violet
#

I gave it a quick try myself โ€“ not coding, of course โ€“ but my conclusion is the same as the friend above. It seems like that hypothetical '2.5flash-thinking-high' model doesn't really exist. Cranking the 'thinking budget' parameter to the max just seems to make things slower and the interface laggier, with no actual improvement. At least, not in my initial test scenario.

keen beacon
#

thinking budget seems to just cut off the thinking prematurely. openai reasoning efforts are tuned to produce different lengths of chain of thought i think. so the current 2.5 pro and 2.5 flash are basically always at the "high" setting no matter what

#

if you have thinking budget disabled

#

with claude's thinking budget its a little more involved than cutting off i think, they seemingly tuned several antml:max_thinking_length</antml:max_thinking_length>, so its closer to openai's

ember rapids
#

thats probably coming next week

zinc ore
#

That's what I think too. One of them is updated 2.5 pro, the other is a coding model.

late path
# keen beacon https://dynomight.net/chess/ https://dynomight.net/more-chess/ another reminder ...

Yes, I've read this before. I think the final reason is quite obvious. As a completion model, gpt-3.5-turbo-Instruct formed some internal neural network structures because its training corpus contained chess game data. This structure represents a kind of "mechanical" learning, akin to training a specialized neural network dedicated to playing chess, rather than indicating the model has any awareness that it is "playing chess".
And as mentioned in the second blog post, when the model transitions from a pure completion model to a conversational one, this highly specialized completion ability becomes very fragile and fails to generalize effectively. The special tokens and separators inserted for role marking in conversational models might act as interference, potentially preventing the original "chess-playing circuit" from being reliably activated. Therefore, the decline in the conversational model's chess ability confirms that gpt-3.5-turbo-instruct's chess skill stems purely from its completion mode, rather than indicating it possesses any magical advantage over other models.

hardy violet
#

I just tested these models on a short text for comprehension. Like I said before, o3's approach is simply intolerable. The framework it builds is ridiculous โ€“ that way it sounds plausible but is actually wrong (specious), just does its own thing, grabs onto one word and elaborates endlessly, or makes random, awkward analogies everywhere... it's totally on par with DeepSeek R1.
For tasks like text interpretation or writing, honestly, anyone who likes o3 should probably just use DeepSeek โ€“ it's more cost-effective.
Gemini's comprehension felt spot-on. Seeing Flash's performance, I thought it was pretty much perfect. But then looking at Pro, there's still a noticeable gap.
This gap [between Flash and Pro] is clearly wider than the difference between o3 and o4mini high. Considering cost and speed, Flash might have its value, but from a practical and progress standpoint, my vote goes to o4mini-high.

keen beacon
#

the problem is we don't have access to any of them, so it is magical in a sense (since we have no comparison points) but i wasnt ever implying that gpt 3.5 turbo instruct is one of a kind if you consider other internal frontier base models ( og gpt 4 base model would be more than enough anyway) . we only have access to the reasoning tuned/instruct/etc. versions

balmy mist
keen beacon
#

disable thinking budget

tall summit
#

i can't believe there are people who think 100k+ token context is minimum useful baseline

tall summit
olive mesa
#

but at the same time it's just an algorithm really good at repeatedly predicting the next token

#

i honestly think the closest thing we can get to consciousness is an ai with really good cot being aware of its own thoughts, aware of it being aware of its own thoughts, and hallucinating emotions

keen beacon
#

under an ai fiefdom you say that to your ai overlords, you're just predicting tokens ๐Ÿคฃ imagine lol

still jetty
#

suggestings for an image to image model? a family member wants to take photos of their yard and generate landscape ideas

keen beacon
still jetty
#

i'll take a look. ive never tried/had a need for image to image yet so i dont know which models do

raven void
balmy mist
#
poll_question_text

Are you Team Google or OpenAi

victor_answer_votes

8

total_votes

17

victor_answer_id

1

victor_answer_text

Google

leaden meteor
# keen beacon can't gpt 4o image gen do that?

Do we have any reliable ones yet? I tried one to give me ideas for a fence in backyard. 4o was also terrible. I think these things are still not good enough to edit existing images. They are only trained for creating new images from scratch....

#

or the vanilla 'professional passport photo' schtick...

#

Giving landscape ideas with realistic edits of your backyard will be gamechanging....

keen beacon
keen beacon
#

It might also not be passing enough info to the gpt 4o image gen model

#

Since theyre different models for now

#

Chatgpt calls a tool to a different 4o model with image gen enabled

#

Too finicky to deploy one together for now I guess

barren prairie
zinc ore
leaden palm
#

where grok 3 mini

#

oh, it's right there

#

(not to mention that the x axis isn't adjusted for thinking tokens)

#

actually the cons@64 is doing a lot of the heavy lifting, but even without, it's a better model than gemini 2.5 flash

upper wolf
#

have you guys tried 2.5 flash with thinking turned off? if so what is your experience

zinc ore
# leaden palm actually the `cons@64` is doing a lot of the heavy lifting, but even without, it...

Grok 3 Reasoning Beta and Grok 3 mini Reasoningโ€™s scores for AIME 2025 at โ€œ@1โ€ โ€” meaning the first score the models got on the benchmark โ€” fall below o3-mini-highโ€™s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAIโ€™s o1 model set to โ€œmediumโ€ computing.

https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks/

OpenAI researchers accused xAI about publishing misleading Grok 3 benchmarks. The truth is a little more nuanced.

#

Cons64 boosts the scores quite a bit, it's not reliable to compare with @1 passthroughs.

thorny drum
#

december o3 biggest offender of this lol

#

but tbf they were pretty transparent of how hard they juiced that model for benchmarks

zinc ore
#

Yeah agreed, openAI did it and Grok also did it. I'm hoping in the future these companies cut it out.

alpine coral
#

i'm confused by the december o3 reference.. do you mean that arc-agi chart, with o3-high in the top right corner (costing >$1000/task)? seems kinda different to this ( @conensus )

thorny drum
#

i think their approach to codeforces was comparable?
idk

#

but my understanding was o3 high was effectively cons1000 or whatever

ember rapids
plain zinc
#

I WANT

keen beacon
#

pro is only like 300-400B

plain zinc
#

LETS GOOOOO

#

It's confirmed

real totem
#

Aint no way

#

Ultra

keen beacon
#

maybe in the next couple months

torn mantle
#

next batch releases are gemini coders

plain zinc
real totem
#

Its gonna be

#

To good

ember rapids
#

Honestly with the pace theyre moving at

#

i wouldnt be surprised if we got it say 2 months from now

keen beacon
#

The world knowledge 2.5 pro has is incredible, I would really like a 2.5 ultra if it's done well and pricing isn't ridiculous

novel flame
#

I tried running my usual mini benchmarks on 2.5 Flash and got the same perfect 5/5 score that all the top models get (from 4o and up). So far, so good.

Then I gave it my harder โ€œzero-shot browser gameโ€ task and it did something unexpected; it didnโ€™t do it. It didnโ€™t fail, rather it replied that such a task was too big for a single prompt, explained why, gave a detailed plan for how you would do it in steps, and then implemented a working sandbox to build on.

So now my scoring system doesnโ€™t work because itโ€™s based on implemented features and Flash deliberately didnโ€™t implement any features.

The thing is, no other model has done this. Many models will provide an incomplete game and list some improvements to make, but 2.5 Flash pushed back on the premise itself, which I donโ€™t think Iโ€™ve seen before.

So I donโ€™t know how to evaluate this for codingโ€ฆโ€ฆ ๐Ÿคจ

#

At the same time, cโ€™mon Flash, what the hell? The original Super Mario was a total of 20KB, which is maybe 16k tokens, so why not give it a whirl?

#

And on the subject of coding, has anyone else had the experience that o4-mini generates minified JS code?๐Ÿฅธ๐Ÿคช It is the weirdest thing. I mean, if itโ€™s trained on random JavaScript from the public web then obviously most of that is minified, but surely a bare minimum of human preference tuning would have taught it not to produce minified source code?!!???

keen beacon
#

O4 mini in particular can be very weird it might be an artifact of the rl

keen fulcrum
#

So gemini 3 will come during summer right?

keen beacon
#

for me o3 > 2.5 but very small margin

#

o3 strength is not in question answers, is in its ability to use tools, run python code , search etc , compliting multi step tasks

torn mantle
#

you have it all

cedar tide
#

Grok 3 mini Reasoningโ€™s recently launched API has highly compelling Intelligence vs. Price positioning
๏ธ€๏ธ€
๏ธ€๏ธ€@xai has recently launched APIs for Grok 3 and Grok 3 mini, after initially only making the models available via the Grok chat interface at launch. We have now completed benchmarking for both Grok 3 and Grok 3 mini.
๏ธ€๏ธ€
๏ธ€๏ธ€Grok 3 mini (high reasoning) stands out in the top left of our Intelligence vs Price chart, achieving one of the highest Artificial Analysis Intelligence Index scores ever, with pricing well below even DeepSeek R1.
๏ธ€๏ธ€
๏ธ€๏ธ€Grok 3 family overview: Weโ€™re initiating coverage of 6 versions of the Grok 3 family: Grok 3, Grok 3 Fast, Grok 3 mini (low reasoning), Grok 3 mini Fast (low reasoning), Grok 3 mini (high reasoning) and Grok 3 mini Fast (high reasoning). Todayโ€™s post focuses on our intelligence results; performance breakdown of all 6 versions will be available soon.
๏ธ€๏ธ€
๏ธ€๏ธ€Intelligence: As we highlighted at laโ€ฆ

keen beacon
keen fulcrum
#

Gemini 2.5 Ultra lol

fleet lintel
fleet lintel
kind cloud
novel flame
fleet lintel
#

true

plain zinc
#

๐Ÿ‘€

hardy violet
# tall summit can you give more context regarding what exactly you... did?

For some lyrics translated from Japanese to Chinese, we can directly tell if the Chinese version flows well and spot translations that are significantly off compared to human standards. Also, with essays I wrote myself โ€“ even though they aren't great โ€“ I know exactly what I was trying to say. This is where I noticed the issue with models like R1 and O3: they tend to misuse structures and over-elaborate. What's more, the writing style became noticeably 'R1-like,' and other users I talked to agreed. Since I'm not willing to pay for the API, and LM Arena can't handle testing slightly longer short stories, I didn't get around to testing how it interprets classic texts. (Plus, testing classics is easier to get away with; people rarely criticize over-interpreting classic works anyway๏ผ‰

keen beacon
keen beacon
fleet lintel
fleet lintel
cedar tide
#

Livebench

cedar tide
cedar tide
keen beacon
#

we don't know yet

#

it's just a hidden category on the model selector

dusky aurora
#

I have feedback abut beta interface.

#

In direct chat, there are no sampling options

keen beacon
barren prairie
keen beacon
cedar tide
#

Anyone have hard prompt that 2.5 pro dont get ?

keen fulcrum
keen beacon
#

what

#

im talking about the category on AI studio's backend referred to as "1P CODE MODELS"

#

(first party code models)

keen fulcrum
#

Interesting to have special code models
Although I want them to have their own debug capabilities, I don't want tools

novel flame
scarlet flint
#

Hello

#

did google remove the image editing model from aistudio?

ocean vortex
#

reasoning models are a different thing, but that only applies if you are saving reasoning into context (currently not possible with openai)

torn mantle
ocean vortex
#

on that last point... I think this actually might be one of the reasons why openai has longer chains than anyone else, since they don't worry about this flooding the context lol

torn mantle
#

its kinda slow

torn mantle
# kind cloud

seems like its a better version of gemini 2.5 pro thinking

#

recent checkpoint?

cedar tide
#

why 2.5 flash without Reasoning does not replace gemini 2.0 flash on gemini?

keen beacon
kind cloud
unborn ocean
willow grail
#

someoneneeds to make a website which takes in all benchmarkes and creates a average

sweet tinsel
#

Atleast

cedar tide
sweet tinsel
#

But like Not so said it's most likely Multimodal feature and inference bount

ocean vortex
barren prairie
#

We are mice and rabbits for google

unborn ocean
calm sequoia
#

Anyone checked what is claybrook?

torn mantle
#

i cant really tell if its better than gemini 2.5 pro thinking or not

calm sequoia
#

There's nothing else to cook for them, what else could this be ๐Ÿ˜„

#

Probably adapted to some niche application

#

Or flash with thinking at some intensity setting

torn mantle
#

no

#

it does give in-depth details

#

a bit more than 2.5 pro 03

calm sequoia
torn mantle
#

yea i think it may be better than gemini 2.5 pro 03

#

from my initial tests

calm sequoia
#

Full version, update or, maybe, Ultra ๐Ÿ‘€

#

Allgirht then, lets test it.

torn mantle
#

this model

#

is kinda different from other google releases

#

i like the formatting

#

it also goes into many points

tall summit
torn mantle
#

its better at explaining things too

calm sequoia
#

Could this be the LearnLM

torn mantle
#

its really really good

#

at general knowledge

#

it doesnt give you the typical generic answers

calm sequoia
#

Okey then, it's update, full or ultra.

torn mantle
#

woah

#

its actually good

#

i cant believe this

torn mantle
#

i wont be surprised if its ultra

#

i didnt try it on reasoning tasks tbh

cedar tide
#

Can anyone send the "claybrook" this prompt ?

#

creates a 99% copy of the discord front end, in a single html file, (without the backend)

hardy pecan
#

Claybrook confirmed quite good, defintely 2.5 pro, dragontail level for sure

hardy pecan
#

It's funny, google probably reads this chat and we figure out what kinda model they JUST released super fast lol

cedar tide
#

Claybrook is on webdevarena too

torn mantle
#

i just checked like 10 min ago

cedar tide
#

I just see it now

torn mantle
#

interesting

#

two new models added

plain zinc
#

Two?

torn mantle
#

yea

plain zinc
#

Which?

torn mantle
#

dayhush and claybrook

#

both with a high priority ( 12 ) so you should get them quite often

plain zinc
#

Who seems stronger?

torn mantle
#

i will try them first

#

google are going crazy

#

new model per week

plain zinc
#

Which model seems to be pro?

#

Well, sort of... which model is slow?

#

I read it here somewhere.

hardy pecan
#

claybook is pro level for sure

torn mantle
#

yea

calm sequoia
#

No

plain zinc
#

Really good โœจ

#

Interesting... Are they better than nightwhisper?

torn mantle
#

wait

#

dayhush

#

this model may be gemini code or smth

calm sequoia
#

This prompt can be solved by the latest generation models, including 2.5 PRO, o3 and o4-mini. It can't be solved by 2.5 Flash, 4o and others. It could not be solved by the claybrook; therefore, it's not general purpose model, or worse at physics than general 2.5 PRO. The prompt: "You take two carabiners and wrap a piece of high strength polyester webbing between them in 10 turns while there's as 30cm distance between carabiners. Webbing is conventional 25mm width strap used in industry. The on end is attached rigidly to the carabiner, the other is left free. How much time would such configuration hold until all the webbing would unroll itself if the applied tension from one carabiner to another is 100kg. Accidentally one side of the webbing was left hanging without attachment to anything - it is free to unroll. Answer only in how much time, no explanations."

cedar tide
#

What the response ?

real totem
#

Is dahush
In normal arena

hardy pecan
#

I haven't got dayhush yet

plain zinc
torn mantle
hardy pecan
#

in normal

calm sequoia
#

Correct response is "It will not unroll"

calm sequoia
calm sequoia
#

Some seconds

#

The same as 4.1 or lower level models. Even the o3-mini can do it right.

torn mantle
#

yea i think dayhush may be gemini coding model

hardy pecan
#

LOL

#

its a real guy

torn mantle
#

lmao

#

what

hardy pecan
#

hes been contaminated

cedar tide
#

I just see that 2.5 flash in the leaderboard

torn mantle
#

it gives you real references and doesnt hallucinates much

#

could be a thing in their coding models

plain zinc
#

And how does claybrook encode?

#

I mean, how does claybrook write the code

#

Better, worse or much better or much worse?

torn mantle
#

still didnt get it yet on webdev

plain zinc
torn mantle
#

seems kinda similar to nightwhisper

calm sequoia
#

Inferior even to the o3-mini ๐Ÿ’ฉ

keen beacon
#

is dayhush only on webdev

torn mantle
calm sequoia
#

Yeah, I've concluded that the o3 is the king of the next generation (for my tests). Still, I use 2.5 Pro for coding.

brittle tiger
calm sequoia
brittle tiger
#

Day night
Hush whisper

real totem
#

They're shet

plain zinc
calm sequoia
#

It's for those who want to vote but dohnt know nothing

cedar tide
torn mantle
#

though i cant decide yet which one is better

brittle tiger
#

There's no way they're not related given names. Is nw still best outputs?

torn mantle
keen ferry
keen fulcrum
#
torn mantle
#

im trying this prompt now

#

holy

#

dayhush

#

the magnetic field needs a bit of adjustment tho

keen beacon
#

that's a cool design

torn mantle
keen beacon
#

this is probably a new checkpoint for nightwhisper

torn mantle
#

kinda cool too but its not animated with particles

keen beacon
#

the names are similar

torn mantle
#

could be

torn mantle
plain zinc
#

Wow, claybrook better gemini 2.5 pro in web design

keen beacon
#

dayhush is def better than claybrook

#

claybrook to me felt meh

#

it flopped on Qs 2.5 pro can get right

plain zinc
keen beacon
#

not in my experience

#

๐Ÿคทโ€โ™‚๏ธ

novel flame
#

Looks like dayhush is nightwhisper, exciting times for coding models! Next cpt: dawnshush? No - DUSKMUMBLE!

keen beacon
#

it won't be nightwhisper

#

it'll be an updated ver

#

it's probably just 2.5 with a focus on code

novel flame
#

Yes of course but same model

torn mantle
tall summit
hardy pecan
#

2.5 pro with 128 code experts!!! Or sumthin

torn mantle
#

its probably an even better version than nightwhisper

tall summit
tall summit
novel flame
#

โ€œJust 2.5 with a focus on codeโ€ thoughโ€ฆ that could finally dethrone 3.7 Sonnet for real world coding.

The current state of coding models (warning: possibly hot take): 3.7 Sonnet is still the king, 2.5 Pro is very close, and o3 is sometimes better but expensive and not reliable/consistent. o4 mini is sometimes on par with or slightly better than 2.5 Pro but again, too inconsistent (like generating minified code) to rely on for everything. I havenโ€™t put Grok 3 Mini fully through its paces recently so Iโ€™m going to do that later today, it may be in there somewhere too.

But a coding focused Gemini 2.5 variant? I see no reason why it wouldnโ€™t beat Sonnet.

tall summit
#

i keep forgetting webdev arena even exists

balmy mist
#

dayhush is phenomenal

#

i wonder how long it stays

sage raptor
#

dayhush is insane

balmy mist
#

and if they will actually release it

torn mantle
torn mantle
#

what was the prompt

novel flame
# tall summit people are bashing them in the comments

I suspect part of the recent Gemini magic is the โ€˜Entropyโ€™ logit search method, and part of the reason why theyโ€™re training models so rapidly is Titans. Iโ€™m on my phone so I donโ€™t have the papers here but DeepMind has seriously cooked under the hood.

torn mantle
#

did you tell it how the characters should face eo?

balmy mist
torn mantle
balmy mist
#

yup

torn mantle
#

nah

#

thats a first

tall summit
torn mantle
#

you see how the characters are facing eo?

balmy mist
#

yeah man very clean

#

and the run feature, and animations

#

og nightwhisper didnt even get that without extra prompting

hardy pecan
#

This was "create Pokemon game"

#

By day hush

#

So... Not quite the same result as above

torn mantle
tall summit
#

lmao ok dayhush is good

#

oh it competed with llama 4 maverick

keen beacon
#

0-shot

#

did a lot better than 2.5 pro

balmy mist
sage raptor
balmy mist
keen beacon
#

but imo these outputs have been slightly better than nw

sage raptor
#

yep

tall summit
balmy mist
#

this was the exact prompt for pokemon, i added a ' by mistake lol, maybe that makes a difference?

hardy pecan
brittle tiger
#

https://3000-ixz0h6ax00b2ls83bhvlo-d3b1dbce.e2b-foxtrot.dev/

Dayhush
prompt: Create a visually stunning, interactive web visualization of bioluminescent deep-sea organisms responding to subtle environmental cues (like simulated currents or light sources triggered by user interaction). Focus on ethereal beauty, fluid animation, and a sense of discovery. Amaze me with the organic feel.

balmy mist
novel flame
hardy pecan
torn mantle
#

claybrook isnt that good at coding

#

thats for sure

novel flame
#

Also for coding agents, I recommend checking out the paper on โ€˜Moatless Tree Searchโ€™ gaining 20 percentage points on SWE-Bench through search: https://arxiv.org/abs/2410.20285 โ€” showing how you can replicate the performance of much bigger models by taking a really small model and letting it cook with search. This is why I suspect a lot of the top Reasoning models are actually using far smaller models than the community seems to think.

hardy pecan
#

Dayhush vs Gemini 2.5 pro

torn mantle
balmy mist
#

i hope the release it soon

ocean vortex
#

this would be normal with openai since mini does indeed perform better on some tasks

#

but not for gemini lol

#

it's literally behind in their every single measured metric

tall summit
balmy mist
brittle tiger
cloud meadow
balmy mist
blazing coyote
#

I fell like Nightwhisper was better in design than Dayhush

torn mantle
#

didnt seem much different from gemini 2.5 pro

#

03

#

this is so confusing

balmy mist
balmy mist
#

this is o3 or 2.5 pro?

torn mantle
#

im lost xd

torn mantle
balmy mist
#

same exact? 0-shot?

torn mantle
#

create a beautiful looking pokemon game battle mode with pokeapi, with an extremely pleasing design, wow me.

balmy mist
#

not exact lol

ocean vortex
balmy mist
#

let me try that

#

right now its generating a super mario game, lets see how it does on that then i will try that updated prompt, but im still shocked that 2.5 pro could do that

#

it was not able to do that before

#

you think they updated it?

ocean vortex
ocean vortex
#

unlike all the minis and stuff

real coral
#

I'm quite liking the current 4o. In the prompts I care about it's pretty much on par with 2.5. Has a nice style to it.

balmy mist
balmy mist
torn mantle
#

can you try this?

balmy mist
#

wow

#

thats nice

#

prompt? and 0-shot?

torn mantle
#

yes

brittle tiger
#

all this dayhush nightwhisper talk but i'm going to morning scream if i get webdev arena bug where a side doesnt load onemore time

calm sequoia
# calm sequoia

People who voted Gemini or "Other" - what prompts are you using?

blazing coyote
#

Claybrook also appers in the webarena

balmy mist
#

wow they fixed webdev arena

keen beacon
balmy mist
#

now your creations are persistent?

keen beacon
#

sometimes it just breaks and ends the round if you try and follow-up

torn mantle
#

still the same issues

balmy mist
#

but i started a new round and currently generating something

#

yall can access it?

torn mantle
#

yea

balmy mist
#

hmm i wonder how long it lasts

#

better than before when it would timeout fast lol

#

i guess they slowly working on the website

#

still gotta fix the glitching tho, i wonder if the results are actually fair in the leaderboard

#

cause multiple times things would not load for me or even gen any code

#

and it was the better model but i picked the other one since it generated code

torn mantle
#

yea i had the same issue

#

they still didnt fix it

#

dayhush makes some questionable choices

#

nightwhisper felt more reliable tbh

#

it will rarely miss

alpine coral
#

there are no anon models in the beta arena right?

alpine coral
#

i phrased that poorly lol

#

i haven't encountered anon models in the beta arena before, is that still the case (they're not in there?)

tall summit
#

the ones in beta arena are only the ones also in beta direct chat

brittle tiger
tall summit
#

i hope there is a way to avoid it

brittle tiger
#

is there a way to at least figure out the name of model that had an output without voting?

narrow elbow
#

whats dayhush? google new model?

worthy thunder
barren prairie
narrow elbow
#

i dont know..

novel flame
worthy thunder
#

I ran the OpenAI-MRCR benchmark for Gemini 2.5 Flash (Auto Thinking and Non-Thinking). Interesting curve for Gemini 2.5 Flash Non-Thinking. Meanwhile Gemini 2.5 Flash Thinking (Auto) matches Gemini 2.5 Pro. Wanted to be able to compare what the perf was like with thinking on vs off, which there does appear to be a difference, with some odd curve. That curve is consistent over all of the runs of the full benchmark for Flash Non-thinking.

For other results, you can check my thread: https://x.com/DillonUzar/status/1913208873206362271

I'm still working on o3 access and will update when I have access. Some error on OpenAI's side, they're looking into it for my org account. If anyone has access in the meantime and is willing to coordinate, let me know. If anyone wants other models, just let me know which and I'll see what I can budget and run.
Still working on setting up a website to allow everyone to view individual run results (like on matharena.ai), and also setting up Graphwalks (another long context benchmark from OpenAI)

worthy thunder
narrow elbow
#

yea

narrow elbow
torn mantle
calm sequoia
torn mantle
#

Its on the same level

worthy thunder
plain zinc
#

... And indeed... If we compare the sky with the earth, then we know that the sky is better than the earth, but the day and night...

#

Which is better?

torn mantle
#
  1. NW = DH
  2. Dragontail
  3. Gemini 2.5 pro 03
  4. Claybrook
  5. Gemini 2.5 flash thinking
brittle tiger
#

pretty good dayhush output for complex task
https://3000-ilb9lm2u2urwebu9sl55g-a7c98fd5.e2b-foxtrot.dev/

prompt: Create a captivating, interactive web demo visualizing fluid dynamics, such as laminar flow breaking into turbulence or convection currents. Use particle systems or shader-based techniques (like reaction-diffusion) to create organic, flowing patterns. Allow users to introduce 'heat' sources, obstacles, or change flow velocity to see the beautiful complexity emerge. Focus on the mesmerizing, almost alive patterns.

plain zinc
glass arch
torn mantle
#

Im not sure tbh

#

But its close to nw

real totem
#

Dayhush is nightwhisepr

plain zinc
#

By the way, please. Give dayhush this prompt:

#

I'll send the file.

real totem
#

Bro its an udpate4d

#

Nightwhisper

#

Thats why it got the similar naem

real totem
#

Wydym

#

Wdym

#

The other ais

#

Got diffferent name

#

Yes

#

Ye

#

IUts an updated

#

Nightwhisper

plain zinc
alpine coral
#

hes trolling lol

real totem
#

Whisper hush night day

#

Ye

torn mantle
narrow elbow
#

here?

real totem
#

Beta is for

#

Normla text

#

Thewebdev is for code

#

Thats the only one

#

With dayhush

torn mantle
plain zinc
torn mantle
#

We need nightwhisper comparison

plain zinc
#

I think and feel! What's coming out tonight nightwhisper

#

Because the catalog is already ready

#

for coding models

narrow elbow
#

I feel like Google needs some pressure, or they won't be squeeze out good things๐Ÿคฃ

plain zinc
narrow elbow
#

haha

willow grail
ocean vortex
#

neither is perfect but 2.5 pro output is leagues ahead in this instance

ocean vortex
#

A well dressed professor standing on the street with eiffel tower in the background, in front of basketball court, holding an umbrella, and observing people playing. It is CRUCIAL you make this in SVG only with the corresponding code!

#

the last sentence I formed this way to stop openai models from using imagen lmao

cedar tide
#

Who want to Remade this with claybrook and dayhush ?

keen beacon
#

im trying the prompt with o3 vs 2.5

#

blind test guys

#

or pic 2

willow grail
#

both are trash

#

liiek wtf

cedar tide
keen beacon
#

yeah .. not the best of prompts
i actually like to make them make small games to compare

willow grail
keen beacon
#

wtf am i looking at

keen beacon
#

i thought this was like a discord theme i was about to say how can you like this

keen beacon
# willow grail make real python games... not small tetris trash

yep, my prompt was to make a 2d game with the following rules

  1. if ball is shot it splits into two , unless radius is too small in which cases it pops
  2. player can move left and right and shoot up
  3. balls continue falling and moving at an angle as if thrown.
  4. game starts with big ball falling and has a timer. player wins if he destroys all balls.
  5. if time runs out or player gets touched by ball -> game over
willow grail
keen beacon
willow grail
keen beacon
thorny drum
keen beacon
#

hard to think we dont tbh. graphics are getting there and ai is getting at us but whatever

keen beacon
thorny drum
#

or the likelihood of jerome powell being removed this year

keen beacon
#

yep

sour spindle
#

Whatโ€™s everyoneโ€™s honest real life assessment of 2.5, o3, o4 mini. Personally I am really like the outputs of o3.

keen beacon
#

as you said there are likely god knows how many matrixes within matrix and statistically we are probably at the very bottom of them but whatever

#

o3 with tools is amazing]

sour spindle
#

I havenโ€™t run into that problem at all really with A/B testing

#

Iโ€™ve been very happy back checking sources used

keen beacon
#

from what ive heard so far people think o3 is better slightly or miles ahead, im on group 1

balmy mist
#

u good bro?

sour spindle
#

Better slightly is a good way to put it

keen beacon
#

take it ez you just learned you are in a simulation and everything is fake, no big deal

narrow elbow
#

whats the answer to life, the universe, and everything? 42?

brittle tiger
#

i think o3 is great. if there was a problem with it, it's that it does too much tool calling

keen beacon
#

r/im14andthisisdeep

#

we can become gods and make universes of our own.

#

but only Elon Jeff and their friends will

balmy mist
#

lol

keen beacon
#

why would they enslave us

balmy mist
#

bro been watching to much movies

keen beacon
#

nah they dont need slaves

#

once ai becomes smart enough theres no need at all for human slaves

tall summit
#

HAHAHAHA

keen beacon
#

ai + robots are smarter, faster, 24/7 work, no pay, no demands for high standard of living

glass arch
#

I wouldn't mind tbh

#

slop forever!

keen beacon
#

certainly get us out of the way more than eradicate, both are very bad

balmy mist
#

the point for normal humans is diversity

glass arch
#

counter argument: AI does not have hormones

keen beacon
#

but its not evil / bad in their eyes, more than it was evil/bad for us to use cars over horses .. which did lead to the "genocide" of horses, with their population drastically decreasing after the invention of cars

glass arch
#

I think AI should replace unskilled labor first over artists, programmers, and writers

balmy mist
#

why dont we eradicate the other lifeforms on earth? also ai would probably be just as curious about us as we are about it and other lifeforms

keen beacon
#

ai will replace everything, first low hanging fruit like coding math, then intellecutal jobs, then at the very end physical jobs via robots. and all the while every job will be devalued cause ai will continuesly lower the entry barrier.

cloud meadow
#

This statement alone is why midwits shouldn't be allowed to use AI. What we have compared to AGI is like a lobotomised cat being compared to a lion.

tall summit
#

when i'm always impressed with something on web arena, it's dayhush

glass arch
keen beacon
#

if there is a solution , if i heard any, is to brainchip all of humanity, combine it with ai and go forward as one mind colony

tall summit
cloud meadow
#

๐Ÿ’”

#

AGI is like 2-10 decades away

balmy mist
#

bro if you think about it the people with power already control the world lol

keen beacon
balmy mist
#

with ai it will just be more in our faces

#

but its the same thing just scaled

cloud meadow
#

Maybe a brain organoid can be made just for you so you realise that humans won't manufacture their own demise so quickly.

glass arch
#

guys, all you need to stop the AI uprising is a glass of water

cloud meadow
#

We are not getting skynet.

glass arch
#

it'll screw up everything if you pour water on their circuits

keen beacon
#

why dont you become billionaire and join him

cloud meadow
#

Bro, are you like 12?

glass arch
cloud meadow
balmy mist
glass arch
#

I think we should just democratize AI

cloud meadow
#

Brotha log off.

glass arch
#

if anyone can run it, then it's a much more balanced playing field

cloud meadow
glass arch
#

in 100 years, we will be able to run the current top-of-the-line AIs on our phones

balmy mist
#

if its the truth, what are you going to do about it?

cloud meadow
#

We
You need to grow up

#

Do not threaten real world action like this

#

This is illegal and will get the server taken down

#

This will never happen you moron.

tall summit
cloud meadow
#

We do not live in a sci-fi novel.

brittle tiger
tall summit
#

2025 trolling sucks

cloud meadow
#

Is this guy being serious?

glass arch
#

the FASTEST speed we can ever hope to achieve is light

balmy mist
#

he just joined the server two days ago

cloud meadow
#

@alpine pasture hey, I think you should moderate this user.

#

I'm ceasing communication with you

willow grail
glass arch
#

anyways, has anyone noticed that o4-mini-high doesn't speak with emojis anymore?

cloud meadow
willow grail
#

or is he just middle manning the money for the non-puppets? @cloud meadow

glass arch
#

when o4 came outon wednesday, it wrote with emojis in every header

#

but now it fixed that and doesn't write with emojis

cloud meadow
tall summit
#

@alpine pasture help

balmy mist
glass arch
balmy mist
#

they are doing that with 4o

glass arch
#

I can't think of a reason it would just switch to not using emojis hours after release

tall summit
#

maybe it does and your sample size is small

brittle tiger
cloud meadow
glass arch
#

I have a link to a new conversation

willow grail
willow grail
#

wtf is grok

balmy mist
#

wait are their discord bots for ai models yet?

glass arch
#

grok doesn't defend elon surprisingly

balmy mist
cloud meadow
balmy mist
#

but grok is xai model

#

on twitter

willow grail
alpine coral
#

oh ffs no more musk

#

let's talk about llms

willow grail
glass arch
#

remember when grok 3 released?

balmy mist
#

because i was born pre 2023 and saying X can be confusing

glass arch
#

grok 3 is actually very useful for searching for information

alpine coral
#

grok-3-mini is legit impressive

cloud meadow
willow grail
glass arch
earnest parcel
#

X is objectively just a bad name. and Grok-3 mini is great

willow grail
balmy mist
willow grail
cloud meadow
#

Do you not know how money works?

native shoreBOT
#

dynoSuccess you.wish has been warned.

willow grail
cloud meadow
#

Again, it's not my place to educate you. Since we are in an AI server, maybe you can ask gemini 2.5, it has a lot of knowledge (january 2025 cutoff!)

willow grail
#

@alpine pasture what did i say wrong

alpine coral
alpine pasture
#

Please stay on topic - this is an AI server for the LMArena community. Plenty of servers for other topics. Thanks all!

cloud meadow
#

I think this dude should be dealt with btw, Cherry.

glass arch
#

wow

#

this is a very neat graph

alpine coral
#

they inherently use more tokens in their outputs (whether we see the reasoning token or not, they're output)

balmy mist
#

how is that possible

#

grok 3 mini

glass arch
#

interesting to see that gemini says it doesn't know who's AI will be the best by the end of 2025

and then chatgpt says that GPT-5 will be the smartest

balmy mist
#

i need to test it out

#

its cheaper than flash?

cloud meadow
#

I mean in terms of predicting events

#

It will just say whatever

glass arch
cloud meadow
#

Right

#

Lmao, Sam Altman training pro-OpenAI propaganda into the model sounds funny

glass arch
#

grok says to keep an eye on xAI and deepseek

#

it sort of prioritizes itself

cloud meadow
#

Can't wait for R2

glass arch
balmy mist
#

how fast is grok mini fast vs grok mini?

cedar tide
native shoreBOT
#

dynoSuccess ontologicity has been warned.

earnest parcel
balmy mist
balmy mist
earnest parcel
#

i don't know if +40% speed justifies +800% price, but eh, different use cases I guess.

balmy mist
#

yupp

cedar tide
#

Ozone ?

#

Yep

#

Deep comprehension on long context buts its on fiction not on code

#

o3 has better score

balmy mist
tall summit
#

besides the benchmark, you mean?

keen beacon
#

lol we were wondering why the beach water looked so brown. turns out sewage is being discharged right into it

#

won't be going for a swim any time soon ๐Ÿคฃ

keen beacon
balmy mist
narrow elbow
#

dayhush, really good

tall summit
alpine coral
#

grok3-mini is a reasoning model

balmy mist
#

you been using it?