#general | Arena | Page 26

olive mesa Apr 17, 2025, 8:30 PM

#

thinking budget wow

balmy mist Apr 17, 2025, 8:30 PM

#

what tests you ran with flash?

olive mesa Apr 17, 2025, 8:30 PM

#

2.5 pro still better but 2.5 flash is so good for developers

humble sonnet Apr 17, 2025, 8:30 PM

#

How does AI score calculation work?

drifting elk Apr 17, 2025, 8:30 PM

#

olive mesa 2.5 pro still better but 2.5 flash is so good for developers

Thank you brother

#

Absolutely awesome

#

Absolutely right

drifting elk Apr 17, 2025, 8:31 PM

#

humble sonnet How does AI score calculation work?

Python libraries, bruh

humble sonnet Apr 17, 2025, 8:31 PM

#

Okay

balmy mist Apr 17, 2025, 8:32 PM

#

drifting elk Thank you brother

are you a developer?

drifting elk Apr 17, 2025, 8:32 PM

#

balmy mist are you a developer?

Engineering student, 4th year

humble sonnet Apr 17, 2025, 8:32 PM

#

btw, the new update with images, can we request images and modify them?

tall summit Apr 17, 2025, 8:32 PM

#

...what

drifting elk Apr 17, 2025, 8:33 PM

#

Hahahahahaha

zinc ore Apr 17, 2025, 8:34 PM

#

What y'all think about o4 mini getting 100% on one of the math benchmarks?

balmy mist Apr 17, 2025, 8:36 PM

#

full report by logan:
https://x.com/OfficialLoganK/status/1912966497213038686

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 2.5 Flash is here, our first unified reasoning model with thinking budgets. 🔥

It’s on the perato frontier and punches above its price and size!!

https://t.co/uO1DsPXlE8

#

technically flash should be the best all around model in terms of efficiency

#

but i wish it was faster man

#

that would be make undisputed best

humble sonnet Apr 17, 2025, 8:38 PM

#

humble sonnet btw, the new update with images, can we request images and modify them?

You cant ...?

balmy mist Apr 17, 2025, 8:38 PM

#

its almost on par with o4 mini, but cheaper

tall summit Apr 17, 2025, 8:44 PM

#

this is a dumb question, do reasoning models even reason in lmarena?

raven void Apr 17, 2025, 8:45 PM

#

o3 and o4 mini is so good it leads me to think we'll get AGI achieved internally by December

balmy mist Apr 17, 2025, 8:46 PM

#

tall summit this is a dumb question, do reasoning models even reason in lmarena?

yeah

balmy mist Apr 17, 2025, 8:46 PM

#

raven void o3 and o4 mini is so good it leads me to think we'll get AGI achieved internally...

what tests have you ran on them?

balmy mist Apr 17, 2025, 8:48 PM

#

raven void o3 and o4 mini is so good it leads me to think we'll get AGI achieved internally...

i aint gonna lie, my convos with o3 been smooth

zinc ore Apr 17, 2025, 8:48 PM

#

https://x.com/OfficialLoganK/status/1912966500794654855

Logan Kilpatrick (@OfficialLoganK) on X

This is an early version of 2.5 Flash, but it already shows huge gains over 2.0 Flash.

You can fully turn off thinking if needed and use this model as a drop in replacement for 2.0 Flash.

It’s available across the Gemini API, AI Studio, Vertex, and the Gemini app!

cedar tide Apr 17, 2025, 8:48 PM

#

Put 2.5 flashes in thinking off on the arena

zinc ore Apr 17, 2025, 8:48 PM

#

Actually good point, anyone know how early into the cycle o4 mini is?

balmy mist Apr 17, 2025, 8:48 PM

#

zinc ore https://x.com/OfficialLoganK/status/1912966500794654855

bro i just posted this lol

tall summit Apr 17, 2025, 8:48 PM

#

balmy mist yeah

well obviously there's no thinking... box like there is on any client actually meant for running models

zinc ore Apr 17, 2025, 8:48 PM

#

You didn't post that specific tweet tho

tall summit Apr 17, 2025, 8:49 PM

#

i mean, for a very topical example, is the 2.5 flash on lmarena reasoning or not?

balmy mist Apr 17, 2025, 8:49 PM

#

zinc ore You didn't post that specific tweet tho

its in teh comments

zinc ore Apr 17, 2025, 8:49 PM

#

Yes

balmy mist Apr 17, 2025, 8:49 PM

#

like right there

#

as soon as you click link

#

check the relevant ai news

zinc ore Apr 17, 2025, 8:49 PM

#

balmy mist its in teh comments

I'm asking a specific question about something stated in that specific tweet

balmy mist Apr 17, 2025, 8:49 PM

#

channel

zinc ore Apr 17, 2025, 8:50 PM

#

balmy mist like right there

#general message

tall summit Apr 17, 2025, 8:50 PM

#

tall summit i mean, for a very topical example, is the 2.5 flash on lmarena reasoning or not...

because it's not like i have a way to tell

balmy mist Apr 17, 2025, 8:50 PM

#

zinc ore Actually good point, anyone know how early into the cycle o4 mini is?

idk what that means

#

maybe someone else does

zinc ore Apr 17, 2025, 8:50 PM

#

I'm literally sharing the tweet to ask how early into prod cycle o4 mini is, compared with flash 2.5

balmy mist Apr 17, 2025, 8:51 PM

#

zinc ore I'm literally sharing the tweet to ask how early into prod cycle o4 mini is, com...

anybody know?

cedar tide Apr 17, 2025, 8:51 PM

#

The only benchmark we have for 2.5 Flash in Thinking Off
Compared to 2.0 Flash, it goes from 60 to 74 on GPQA and from 34 to 41 on LiveCodeBench.

gemini_s_b_s_scaling_graphs.original.png

balmy mist Apr 17, 2025, 8:51 PM

#

tall summit well obviously there's no thinking... box like there is on any client actually m...

lmaoo im not gonna start with you

tall summit Apr 17, 2025, 8:51 PM

#

balmy mist lmaoo im not gonna start with you

what

cedar tide Apr 17, 2025, 8:51 PM

#

Gemini 2.5 flash vs grok 3 mini thinking

Screenshot_2025-04-17-22-36-15-728_com.android.chrome-edit.jpg

tall summit Apr 17, 2025, 8:52 PM

#

balmy mist lmaoo im not gonna start with you

sorry i dont know the intricacies of lmarena's internals and i want to know

zinc ore Apr 17, 2025, 8:52 PM

#

That's gem 2 not 2.5

balmy mist Apr 17, 2025, 8:52 PM

#

cedar tide Gemini 2.5 flash vs grok 3 mini thinking

hmm now this is interesting

zinc ore Apr 17, 2025, 8:52 PM

#

Gemini 2 Flash bro

balmy mist Apr 17, 2025, 8:52 PM

#

tall summit sorry i dont know the intricacies of lmarena's internals and i want to know

oh okay, im not sure then

cedar tide Apr 17, 2025, 8:52 PM

#

zinc ore Gemini 2 Flash bro

Its 2.5

balmy mist Apr 17, 2025, 8:52 PM

#

i think someone might know here

cedar tide Apr 17, 2025, 8:52 PM

#

I made this

tall summit Apr 17, 2025, 8:53 PM

#

balmy mist oh okay, im not sure then

i dont get how you took it as an insult but im still sorry

tall summit Apr 17, 2025, 8:53 PM

#

balmy mist i think someone might know here

most likely

zinc ore Apr 17, 2025, 8:53 PM

#

Oh okay, was labelled as Gemini 2 Flash, didn't realize

balmy mist Apr 17, 2025, 8:53 PM

#

tall summit i dont get how you took it as an insult but im still sorry

no not as an insult just trying to disagree for the hell of it like last time lol

#

cause i said yeah and then you said obviously no, but i misread

tall summit Apr 17, 2025, 8:53 PM

#

balmy mist no not as an insult just trying to disagree for the hell of it like last time lo...

ohhhh yeah i get how you might have thought that

#

well sorry i do it a lot im an indecisive person

keen beacon Apr 17, 2025, 8:54 PM

#

u can tell if its a thinking model in arena battle, if one of the models is definitively not a reasoning model and both completions were delayed

tall summit Apr 17, 2025, 8:54 PM

#

im like ehhhhhh these are good but ehhh these are bad
devils advocate sorta guy

for literally every choice ive ever made in my life

keen beacon Apr 17, 2025, 8:54 PM

#

it delays both completions as it waits for the thinking model to reason

cedar tide Apr 17, 2025, 8:54 PM

#

zinc ore Oh okay, was labelled as Gemini 2 Flash, didn't realize

Fix

Screenshot_2025-04-17-22-54-18-000_com.android.chrome-edit.jpg

zinc ore Apr 17, 2025, 8:54 PM

#

Gg

tall summit Apr 17, 2025, 8:54 PM

#

keen beacon u can tell if its a thinking model in arena battle, if one of the models is defi...

guess its time to side by side

zinc ore Apr 17, 2025, 8:56 PM

#

cedar tide Fix

Show if it is single or multiple attempts

tall summit Apr 17, 2025, 8:56 PM

#

keen beacon it delays both completions as it waits for the thinking model to reason

i knew it does that but thats still quite clever

#

i like when it uses latex

#

except it doesnt display

#

man i hate that

#

so annoying honestly

balmy mist Apr 17, 2025, 8:57 PM

#

is this true?
https://x.com/DaveShapi/status/1912971940597514289

David Shapiro ⏩ (@DaveShapi) on X

If you're not using o3 full at least two hours per day you're falling behind.

It sharpens your mind as a diamond file might sharpen a blade.

torn mantle Apr 17, 2025, 8:57 PM

#

imma try it first

#

i still havent tried it xd

tall summit Apr 17, 2025, 8:57 PM

#

anyway it seems to be the thinking version on the arena

tall summit Apr 17, 2025, 8:57 PM

#

balmy mist is this true? https://x.com/DaveShapi/status/1912971940597514289

lmao what

keen beacon Apr 17, 2025, 8:57 PM

#

didnt that guy say that o4 mini solved math or something lol

tall summit Apr 17, 2025, 8:57 PM

#

what does that tweet even mean

keen beacon Apr 17, 2025, 8:57 PM

#

ahahaha

balmy mist Apr 17, 2025, 8:57 PM

#

cedar tide Fix

but 2.5 flash is free tho, wait is grok mini free?

cedar tide Apr 17, 2025, 8:58 PM

#

balmy mist but 2.5 flash is free tho, wait is grok mini free?

This price for the paid tier

balmy mist Apr 17, 2025, 8:58 PM

#

i still say o3 is better but in different ways

#

like in terms of agi yes

cedar tide Apr 17, 2025, 8:59 PM

#

We will see grok 3 thinking too

tall summit Apr 17, 2025, 8:59 PM

#

i think it's o3

keen beacon Apr 17, 2025, 8:59 PM

#

lmao xai 🤣

balmy mist Apr 17, 2025, 8:59 PM

#

cedar tide We will see grok 3 thinking too

bro grok is not on o3 and 2.5 pro level

tall summit Apr 17, 2025, 8:59 PM

#

cedar tide We will see grok 3 thinking too

based on o4-mini

balmy mist Apr 17, 2025, 9:00 PM

#

anybody using o3 every day? is it making you smarter?

cedar tide Apr 17, 2025, 9:00 PM

#

balmy mist bro grok is not on o3 and 2.5 pro level

the final version with reasoning has not been released and we have zero benchmarks you can't say that

tall summit Apr 17, 2025, 9:00 PM

#

how about o4-mini vs 2.5 flash

balmy mist Apr 17, 2025, 9:00 PM

#

cedar tide the final version with reasoning has not been released and we have zero benchmar...

you a grok fanbaoy?

tall summit Apr 17, 2025, 9:01 PM

#

balmy mist anybody using o3 every day? is it making you smarter?

i technically used o3 for all the 2 days it released

cedar tide Apr 17, 2025, 9:01 PM

#

balmy mist you a grok fanbaoy?

Nope

balmy mist Apr 17, 2025, 9:01 PM

#

cedar tide Nope

r u sure

cedar tide Apr 17, 2025, 9:01 PM

#

But we will see

tall summit Apr 17, 2025, 9:01 PM

#

and i feel the same amount of intelligent

cedar tide Apr 17, 2025, 9:01 PM

#

Im a price fanboy

tall summit Apr 17, 2025, 9:01 PM

#

oh well i wonder.

cedar tide Apr 17, 2025, 9:01 PM

#

So Im loving gemini

keen fulcrum Apr 17, 2025, 9:01 PM

#

Gemini 3 when

patent aspen Apr 17, 2025, 9:01 PM

#

Is "o3 full" just the o3 model with maximum thinking time and resource consumption?

cedar tide Apr 17, 2025, 9:02 PM

#

cedar tide So Im loving gemini

But i see today that grok 3 mini thinking has too a good quality for this price

patent aspen Apr 17, 2025, 9:02 PM

#

Or is it distinguishing o3 from o3 mini?

cedar tide Apr 17, 2025, 9:02 PM

#

patent aspen Is "o3 full" just the o3 model with maximum thinking time and resource consumpti...

o3 full its o3, its not 2 models

keen beacon Apr 17, 2025, 9:02 PM

#

patent aspen Or is it distinguishing o3 from o3 mini?

yeah most people do that

#

otherwise u would call it o3 full high (reasoning effort) i think

cedar tide Apr 17, 2025, 9:02 PM

#

There are o3 mini and o3 and o3-pro

tall summit Apr 17, 2025, 9:03 PM

#

i assume o3 full meant o3 with all the tools so it can do its stupid chain of thought bs

patent aspen Apr 17, 2025, 9:03 PM

#

Isn't o3 full still way more expensive than 2.5 pro?

balmy mist Apr 17, 2025, 9:03 PM

#

@torn mantle https://liveweave.com/A9OGzH# i cleaned this up a lil

tall summit Apr 17, 2025, 9:03 PM

#

tall summit i assume o3 full meant o3 with all the tools so it can do its stupid chain of th...

which is limited on platforms like lmarena

cedar tide Apr 17, 2025, 9:04 PM

#

Yes, before the release of the gemini beast maybe

brittle tiger Apr 17, 2025, 9:04 PM

#

patent aspen Isn't o3 full still way more expensive than 2.5 pro?

on api yes but with plus subscription you get 50 a week of o3.

keen beacon Apr 17, 2025, 9:04 PM

#

patent aspen Isn't o3 full still way more expensive than 2.5 pro?

probably though its cheaper than o1

#

it depends on the task i guess

#

im curious to see the numbers when artifical analysis runs their benchmark suite

patent aspen Apr 17, 2025, 9:06 PM

#

IMO Gemini 2.5 Pro is the best model in the world not because of raw performance but because it chose the right set of product and engineering trade-offs

brittle tiger Apr 17, 2025, 9:06 PM

#

i def wouldn't count out grok. they don't have infra like goog but the whole team is basically former gdm and they have tons of gpus

patent aspen Apr 17, 2025, 9:08 PM

#

o3 depends on expensive trade offs to be good

#

which isn't sustainable

calm sequoia Apr 17, 2025, 9:11 PM

#

Winsurf turned out to be so good. Have anyone tried anything better or is it sota?

keen beacon Apr 17, 2025, 9:11 PM

#

they are working on efficiency though since it is more efficient than o1 i believe

patent aspen Apr 17, 2025, 9:11 PM

#

OAI loses like $8B a year now

ocean vortex Apr 17, 2025, 9:11 PM

#

zinc ore Actually good point, anyone know how early into the cycle o4 mini is?

very early, I think it's 4.1-mini

keen beacon Apr 17, 2025, 9:12 PM

#

patent aspen OAI loses like $8B a year now

i dont think openai loses money on api pricing with the current model set rn tho

brittle tiger Apr 17, 2025, 9:12 PM

#

I really want to o3 on this

https://x.com/DillonUzar/status/1912671944459034834/

Dillon Uzar (@DillonUzar) on X

I've updated the MRCR results with the full results for GPT-4.1, o3-mini, and o4-mini, I also added in some Anthropic models.

No tests beyond 200k tokens are included for o3-mini, o4-mini, and Anthropic due to limitations.

patent aspen Apr 17, 2025, 9:13 PM

#

keen beacon i dont think openai loses money on api pricing with the current model set rn tho

Their APIs aren't profitable enough to cover their other costs

keen beacon Apr 17, 2025, 9:13 PM

#

ya prob cohere idk lol

#

that line of anon names seem familar

cedar tide Apr 17, 2025, 9:14 PM

#

Amazon

keen beacon Apr 17, 2025, 9:14 PM

#

yea

cedar tide Apr 17, 2025, 9:14 PM

#

Maybe nova pro 1.5

keen beacon Apr 17, 2025, 9:14 PM

#

i forgot

worthy thunder Apr 17, 2025, 9:15 PM

#

brittle tiger I really want to o3 on this https://x.com/DillonUzar/status/1912671944459034834...

It's uh, rather pricey. I'll see about running it though (I ran those)

#

Also looking into running graphwalk on all of those models (another long context benchmark)

keen beacon Apr 17, 2025, 9:18 PM

#

patent aspen Their APIs aren't profitable enough to cover their other costs

even with a much more efficient/much more profitable line of models, i doubt theyd be able to recoup costs on api pricing tho

#

its somewhat inherent in their strategy rn

#

poll_question_text

What's better?

victor_answer_votes

8

total_votes

11

victor_answer_id

2

victor_answer_text

Nightwhisper

#

select in direct chat if u wanna use o3

#

fyi this should be easy

#

its a thinking model with a distinct cut off

#

if it returns its made by openai, answers specific knowledge cut off questions right, delays both model completions

#

4.1 mini/o4 mini despite the claimed june 2024 cut off lacks a lot of more recent knowledge

cedar tide Apr 17, 2025, 9:26 PM

#

Today we was at 224 or 223 ?

Screenshot_2025-04-17-23-26-11-837_com.android.chrome-edit.jpg

balmy mist Apr 17, 2025, 9:27 PM

#

yeah im over flash

zinc ore Apr 17, 2025, 9:27 PM

#

https://x.com/lmarena_ai/status/1912979323734425948?t=TyWExZCG-cl8BdEcaDFCPA&s=19

lmarena.ai (formerly lmsys.org) (@lmarena_ai) on X

Gemini 2.5 Flash ⚡ has landed in the WebDev Arena, and we can see that its strength in coding has impressed the community enough to land at #7 - solidly in the top 10!

balmy mist Apr 17, 2025, 9:27 PM

#

back to o3 and o4

keen beacon Apr 17, 2025, 9:27 PM

#

im still maining 2.5 pro lol

balmy mist Apr 17, 2025, 9:28 PM

#

keen beacon im still maining 2.5 pro lol

idk bro im starting to like o4

#

its just openai trash webui that makes me mad

keen beacon Apr 17, 2025, 9:29 PM

#

balmy mist its just openai trash webui that makes me mad

it seems fine for me

#

nothing to really complain about in my experience

#

weird - they're still there

#

the branching feature in ai studio is extremely questionable though

balmy mist Apr 17, 2025, 9:30 PM

#

keen beacon it seems fine for me

it always crashed for me

#

but o4 is cooking for coding

#

flash makes bad code, like always issues

#

but it has good designs and ideas

wheat onyx Apr 17, 2025, 9:31 PM

#

started testing o3. It's REALLY good. I had it look at the internals of my broken TV - it first looked for obvious issues, then told me specifically what I should do to test, what results i should get, etc. That's the first time AI would have been able to get detailed advice for something like this. That's a game changer

#

no idea if it's great for coding, but probably way too expensive for it?

wheat onyx Apr 17, 2025, 9:33 PM

#

calm sequoia Winsurf turned out to be so good. Have anyone tried anything better or is it sot...

being acquired by OAI now, right?

balmy mist Apr 17, 2025, 9:33 PM

#

o4: https://liveweave.com/A9OGzH#

#

plus flash

wheat onyx Apr 17, 2025, 9:34 PM

#

balmy mist o4: https://liveweave.com/A9OGzH#

what is this?

balmy mist Apr 17, 2025, 9:34 PM

#

you cant see it?

keen beacon Apr 17, 2025, 9:34 PM

#

#

worked after refresh

wheat onyx Apr 17, 2025, 9:34 PM

#

i do, but what is it for

balmy mist Apr 17, 2025, 9:34 PM

#

try this instead:
https://www.vibeshare.ai/c/PTfr5FWJ_q

Vibeshare

Share your vibe-coded web apps

#

but let me update it one more time

#

i want to make the animations be the default thing you see

wheat onyx Apr 17, 2025, 9:38 PM

#

#

i know it looks like i cut off my snapshot, but this is actually what it looks like

#

you can scroll down on the right side to see more of the green iconbox

balmy mist Apr 17, 2025, 9:40 PM

#

https://www.vibeshare.ai/c/kBaroVM2ln

Vibeshare

Share your vibe-coded web apps

#

yeah thats it

#

ignore the warnings

#

click the link again

#

and press play button

#

flash and pro have issues with large code

#

but o4 handles them nicely

tall summit Apr 17, 2025, 9:45 PM

#

wheat onyx started testing o3. It's REALLY good. I had it look at the internals of my broke...

you sent it a picture?

torn mantle Apr 17, 2025, 9:47 PM

#

wheat onyx

nice

tall summit Apr 17, 2025, 9:48 PM

#

balmy mist yeah im over flash

why

balmy mist Apr 17, 2025, 9:49 PM

#

balmy mist flash and pro have issues with large code

thi

#

this*

tall summit Apr 17, 2025, 9:49 PM

#

good to know regarding o4

balmy mist Apr 17, 2025, 9:49 PM

#

for some reason with code larger than 1000 lines 2.5 has trouble making edits

#

i have seen this happen a lot

hardy pecan Apr 17, 2025, 9:49 PM

#

balmy mist Apr 17, 2025, 9:50 PM

#

omgg i knee it lol

tall summit Apr 17, 2025, 9:50 PM

#

hardy pecan

maybe google's repl platform could actually work

hardy pecan Apr 17, 2025, 9:50 PM

#

cedar tide Today we was at 224 or 223 ?

went from 223 to 225 - 2 models added just a few hours ago

balmy mist Apr 17, 2025, 9:50 PM

#

hardy pecan

send twitter link please

#

this could be gpt lol

#

jk

cedar tide Apr 17, 2025, 9:51 PM

#

hardy pecan went from 223 to 225 - 2 models added just a few hours ago

Thanks for confirming what I thought, but what is the second model? I haven't found it.

tall summit Apr 17, 2025, 9:51 PM

#

balmy mist send twitter link please

this but unironically

hardy pecan Apr 17, 2025, 9:52 PM

#

balmy mist send twitter link please

https://x.com/OfficialLoganK/status/1912968696265343424

Logan Kilpatrick (@OfficialLoganK) on X

@theLance Time for Gemini code

torn mantle Apr 17, 2025, 9:52 PM

#

hardy pecan

AAAAAAAAAAAAAAAAAAAAAAAAAH

#

PLS

#

but wen

#

😖

#

could it be on 22th of this month?

hardy pecan Apr 17, 2025, 9:52 PM

#

cedar tide Thanks for confirming what I thought, but what is the second model? I haven't fo...

I think llama nemotron lol or another useless model, not sure

cedar tide Apr 17, 2025, 9:52 PM

#

hardy pecan I think llama nemotron lol or another useless model, not sure

🥴

balmy mist Apr 17, 2025, 9:53 PM

#

idk i think im an openai fanboy now

tall summit Apr 17, 2025, 9:53 PM

#

has anyone played chess with o3 yet

balmy mist Apr 17, 2025, 9:54 PM

#

tall summit has anyone played chess with o3 yet

yeah i cant beat it

#

it cooks me

#

whats its elo?

tall summit Apr 17, 2025, 9:54 PM

#

for some reason i tried 2.5 pro and it couldnt interpret a FEN

balmy mist Apr 17, 2025, 9:54 PM

#

tall summit for some reason i tried 2.5 pro and it couldnt interpret a FEN

sama's plan

#

join the openai fanboy club

#

like me

tall summit Apr 17, 2025, 9:54 PM

#

lol

#

i want to test o3 on the same

balmy mist Apr 17, 2025, 9:55 PM

#

do a chess game with o3 vs 2.5

tall summit Apr 17, 2025, 9:55 PM

#

tall summit for some reason i tried 2.5 pro and it couldnt interpret a FEN

this was in the context of asking to analyze a specific move

balmy mist Apr 17, 2025, 9:55 PM

#

hmm i might build an app for that for every model to go against each other

hardy pecan Apr 17, 2025, 9:55 PM

#

cedar tide 🥴

nevermind it was this model.. that no one knows about

tall summit Apr 17, 2025, 9:55 PM

#

GLM

#

ALL MY HOMIES LOVE GLM

cedar tide Apr 17, 2025, 9:56 PM

#

hardy pecan nevermind it was this model.. that no one knows about

Nope he was here before

tall summit Apr 17, 2025, 9:56 PM

#

balmy mist hmm i might build an app for that for every model to go against each other

hmmmmmmmmmmm

#

mmmmmmmm

#

2.5 pro takes ages

hardy pecan Apr 17, 2025, 9:57 PM

#

cedar tide Nope he was here before

oh yeah, my screenshot is cutoff, I cant see any other new model added in the top 16 other than 2.5 flash, so again itsw another inconsequential model added that scores poorly lol

tall summit Apr 17, 2025, 9:57 PM

#

Screenshot_2025-04-18-00-57-43-128_org.mozilla.firefox-edit.jpg

#

2.5 pro cant do ascii diagrams

keen beacon Apr 17, 2025, 9:58 PM

#

oh wait it's you

balmy mist Apr 17, 2025, 9:58 PM

#

tall summit 2.5 pro takes ages

o4 is king now

#

i used to love 2.5 but i think i might cheat on it with o4

tall summit Apr 17, 2025, 9:59 PM

#

o4 mini against 2.5 flash would probably take less time

#

lmao

keen beacon Apr 17, 2025, 9:59 PM

#

3.5 turbo instruct would destroy 2.5 and o3 (in chess) probably lol

tall summit Apr 17, 2025, 9:59 PM

#

how about o3

keen beacon Apr 17, 2025, 9:59 PM

#

keen beacon 3.5 turbo instruct would destroy 2.5 and o3 (in chess) probably lol

im not even joking

balmy mist Apr 17, 2025, 9:59 PM

#

tall summit how about o3

i love o3 to, but i only get certain amount of chats with it, so I gotta save those lol

#

and o4 is better at coding

balmy mist Apr 17, 2025, 10:00 PM

#

keen beacon 3.5 turbo instruct would destroy 2.5 and o3 (in chess) probably lol

no way

#

chill on my side piece o4

tall summit Apr 17, 2025, 10:00 PM

#

this is already looking much better for o3 LMAO

Screenshot_2025-04-18-01-00-10-789_org.mozilla.firefox-edit.jpg

#

even sends me the fen

keen beacon Apr 17, 2025, 10:00 PM

#

https://dynomight.net/chess/
https://dynomight.net/more-chess/ another reminder of these articles, its an extremely cool read btw

tall summit Apr 17, 2025, 10:00 PM

#

keen beacon https://dynomight.net/chess/ https://dynomight.net/more-chess/ another reminder ...

never seen

#

will read

#

from nov 2024

#

i mean sure

keen beacon Apr 17, 2025, 10:01 PM

#

tall summit i mean sure

its still relevant lol

#

does anyone know the system prompt lmarena uses for models?

#

it depends on the model but some of them are in the fast chat repo

#

they dont include everything there though, you'd have to extract them manually from lmarena

#

also see this: a recent tournament with gpt 3.5 turbo instruct and more recent models
https://dubesor.de/chessbeta/tournament2

Dubesor LLM Chess tournament 2

Dubesor LLM Benchmark table - Small-scale manual LLM performance comparison benchmark

#

(just an unnecessary random note but that guy blocked me for no reason, still confused to this day lmao)

wheat onyx Apr 17, 2025, 10:07 PM

#

tall summit you sent it a picture?

Yeah I gave it photos of the tv internals to analyze

keen beacon Apr 17, 2025, 10:08 PM

#

keen beacon (just an unnecessary random note but that guy blocked me for no reason, still co...

that's weird lol

#

i've had fairly loud disagreements with him and he hasn't blocked me

keen beacon Apr 17, 2025, 10:08 PM

#

keen beacon that's weird lol

yeah

blazing rune Apr 17, 2025, 10:08 PM

#

Maybe he thought you were someone else from looking at your profile picture

#

your profile picture is quite generic

keen beacon Apr 17, 2025, 10:09 PM

#

keen beacon i've had fairly loud disagreements with him and he hasn't blocked me

same he suddenly blocked me one day randomly we havent even talked in a while lol

keen beacon Apr 17, 2025, 10:09 PM

#

blazing rune Maybe he thought you were someone else from looking at your profile picture

nah i confronted him about it in the openrouter server me and another guy were talking about something and im pretty sure he saw my message

#

he seemed a little upset that he was being ignored

balmy mist Apr 17, 2025, 10:16 PM

#

nvm o4 and o3 is ehh at coding large codebase 🙂

#

back to gemin

quiet pollen Apr 17, 2025, 10:16 PM

#

What is the difference between Gemini and Gemma?

#

Other then reasoning

#

Exactly my thoughts

keen beacon Apr 17, 2025, 10:16 PM

#

the gemma line are the open weighted models

balmy mist Apr 17, 2025, 10:16 PM

#

gemini is the dad, gemma is the mom

keen beacon Apr 17, 2025, 10:16 PM

#

gemini are the frontier/closed source ones

quiet pollen Apr 17, 2025, 10:24 PM

#

The question is o4 or Gemini

keen beacon Apr 17, 2025, 10:24 PM

#

ig u mean o4 mini

#

prob worth testing all of the models on ur specific tasks and see which one u like

quiet pollen Apr 17, 2025, 10:25 PM

#

Gemini seems to be sticking out for me

tall summit Apr 17, 2025, 10:28 PM

#

what does a temperature of 2 mean

#

i remember when temperature couldnt go above 1!!!

#

what is 1.5 creativity

#

whys it max 2

upper wolf Apr 17, 2025, 10:42 PM

#

you guys use a temp of <0.6 with gemini??

tall summit Apr 17, 2025, 10:42 PM

#

upper wolf you guys use a temp of <0.6 with gemini??

yes

blazing rune Apr 17, 2025, 10:46 PM

#

upper wolf you guys use a temp of <0.6 with gemini??

I usually use 0.5, it seems to work better than 1, at least for coding

#

for gemini 2.5 pro

#

gemini 2.0 flash I usually use at 0 or 0.5, since it hallucinates so much

keen beacon Apr 17, 2025, 10:47 PM

#

i personally use 0 for mostly everything, 1 for everything else (poll was about creative but coherent tasks/etc)

#

1 because i dont bother changing the default ai studio value 🤣

sweet tinsel Apr 17, 2025, 11:00 PM

#

I'm maybe late but 2.5 flash has dropped for me in the Gemini App replacing 2.0 Flash Extended Thinking which I had until now.

tall summit Apr 17, 2025, 11:01 PM

#

sweet tinsel I'm maybe late but 2.5 flash has dropped for me in the Gemini App replacing 2.0 ...

great news its out etc etc

#

same limits in ai studio as 2.5 pro

tawdry phoenix Apr 17, 2025, 11:07 PM

#

#

Literally going insane

glass arch Apr 17, 2025, 11:10 PM

#

as a wise man once said:
"you must pick between autism and schizophrenia"

#

(this is about AI models btw)

#

it seems that openai does not tell the models what its name is

tawdry phoenix Apr 17, 2025, 11:17 PM

#

Roblox injector lol

tall summit Apr 17, 2025, 11:19 PM

#

roblox injector

#

can ai do that

#

well first of all would

#

probably not

glass arch Apr 17, 2025, 11:52 PM

#

which AI can make the most soulless slop

olive mesa Apr 18, 2025, 12:03 AM

#

2.5 flash hallucinates a lot

olive mesa Apr 18, 2025, 12:04 AM

#

glass arch it seems that openai does not tell the models what its name is

also gogle

#

google

#

i asked 2.5 flash what its name was via api

#

it says its just an ai

balmy mist Apr 18, 2025, 12:16 AM

#

sweet tinsel I'm maybe late but 2.5 flash has dropped for me in the Gemini App replacing 2.0 ...

Hi Daniel

glass arch Apr 18, 2025, 12:18 AM

#

olive mesa it says its just an ai

it's so sad

#

like it enters the world and does not have its own identity

late path Apr 18, 2025, 12:20 AM

#

is nightwhisper/dragontail 2.5 flash?

hardy violet Apr 18, 2025, 12:35 AM

#

I gave it a quick try myself – not coding, of course – but my conclusion is the same as the friend above. It seems like that hypothetical '2.5flash-thinking-high' model doesn't really exist. Cranking the 'thinking budget' parameter to the max just seems to make things slower and the interface laggier, with no actual improvement. At least, not in my initial test scenario.

keen beacon Apr 18, 2025, 12:38 AM

#

thinking budget seems to just cut off the thinking prematurely. openai reasoning efforts are tuned to produce different lengths of chain of thought i think. so the current 2.5 pro and 2.5 flash are basically always at the "high" setting no matter what

#

if you have thinking budget disabled

#

with claude's thinking budget its a little more involved than cutting off i think, they seemingly tuned several antml:max_thinking_length</antml:max_thinking_length>, so its closer to openai's

ember rapids Apr 18, 2025, 12:40 AM

#

late path is nightwhisper/dragontail 2.5 flash?

no

#

thats probably coming next week

zinc ore Apr 18, 2025, 12:49 AM

#

That's what I think too. One of them is updated 2.5 pro, the other is a coding model.

late path Apr 18, 2025, 12:55 AM

#

keen beacon https://dynomight.net/chess/ https://dynomight.net/more-chess/ another reminder ...

Yes, I've read this before. I think the final reason is quite obvious. As a completion model, gpt-3.5-turbo-Instruct formed some internal neural network structures because its training corpus contained chess game data. This structure represents a kind of "mechanical" learning, akin to training a specialized neural network dedicated to playing chess, rather than indicating the model has any awareness that it is "playing chess".
And as mentioned in the second blog post, when the model transitions from a pure completion model to a conversational one, this highly specialized completion ability becomes very fragile and fails to generalize effectively. The special tokens and separators inserted for role marking in conversational models might act as interference, potentially preventing the original "chess-playing circuit" from being reliably activated. Therefore, the decline in the conversational model's chess ability confirms that gpt-3.5-turbo-instruct's chess skill stems purely from its completion mode, rather than indicating it possesses any magical advantage over other models.

hardy violet Apr 18, 2025, 12:56 AM

#

I just tested these models on a short text for comprehension. Like I said before, o3's approach is simply intolerable. The framework it builds is ridiculous – that way it sounds plausible but is actually wrong (specious), just does its own thing, grabs onto one word and elaborates endlessly, or makes random, awkward analogies everywhere... it's totally on par with DeepSeek R1.
For tasks like text interpretation or writing, honestly, anyone who likes o3 should probably just use DeepSeek – it's more cost-effective.
Gemini's comprehension felt spot-on. Seeing Flash's performance, I thought it was pretty much perfect. But then looking at Pro, there's still a noticeable gap.
This gap [between Flash and Pro] is clearly wider than the difference between o3 and o4mini high. Considering cost and speed, Flash might have its value, but from a practical and progress standpoint, my vote goes to o4mini-high.

keen beacon Apr 18, 2025, 12:56 AM

#

late path Yes, I've read this before. I think the final reason is quite obvious. As a comp...

yeah. if we had access to newer base models of frontier models theyll perform much better than gpt 3.5 turbo instruct/a similarly trained model that didn't lose its chess skills

#

the problem is we don't have access to any of them, so it is magical in a sense (since we have no comparison points) but i wasnt ever implying that gpt 3.5 turbo instruct is one of a kind if you consider other internal frontier base models ( og gpt 4 base model would be more than enough anyway) . we only have access to the reasoning tuned/instruct/etc. versions

balmy mist Apr 18, 2025, 12:59 AM

#

keen beacon thinking budget seems to just cut off the thinking prematurely. openai reasoning...

so whats teh best setting, 1 thinking or auto?

keen beacon Apr 18, 2025, 12:59 AM

#

disable thinking budget

tall summit Apr 18, 2025, 1:01 AM

#

Screenshot_2025-04-18-03-59-56-906_ml.docilealligator.infinityforreddit-edit.jpg

#

i can't believe there are people who think 100k+ token context is minimum useful baseline

tall summit Apr 18, 2025, 1:04 AM

#

hardy violet I just tested these models on a short text for comprehension. Like I said before...

can you give more context regarding what exactly you... did?

olive mesa Apr 18, 2025, 1:16 AM

#

glass arch it's so sad

kind of

#

but at the same time it's just an algorithm really good at repeatedly predicting the next token

#

i honestly think the closest thing we can get to consciousness is an ai with really good cot being aware of its own thoughts, aware of it being aware of its own thoughts, and hallucinating emotions

keen beacon Apr 18, 2025, 1:17 AM

#

under an ai fiefdom you say that to your ai overlords, you're just predicting tokens 🤣 imagine lol

olive mesa Apr 18, 2025, 1:18 AM

#

keen beacon under an ai fiefdom you say that to your ai overlords, you're just predicting to...

lmao

drifting thorn Apr 18, 2025, 1:23 AM

#

tall summit i can't believe there are people who think 100k+ token context is minimum useful...

It is for agentic purposes

still jetty Apr 18, 2025, 1:28 AM

#

suggestings for an image to image model? a family member wants to take photos of their yard and generate landscape ideas

keen beacon Apr 18, 2025, 1:33 AM

#

still jetty suggestings for an image to image model? a family member wants to take photos of...

can't gpt 4o image gen do that?

still jetty Apr 18, 2025, 1:34 AM

#

i'll take a look. ive never tried/had a need for image to image yet so i dont know which models do

raven void Apr 18, 2025, 1:45 AM

#

tall summit i can't believe there are people who think 100k+ token context is minimum useful...

it depends on the kind of task but if 100k is not a minimum to expect then what is

balmy mist Apr 18, 2025, 3:21 AM

#

poll_question_text

Are you Team Google or OpenAi

victor_answer_votes

8

total_votes

17

victor_answer_id

1

victor_answer_text

Google

leaden meteor Apr 18, 2025, 3:22 AM

#

keen beacon can't gpt 4o image gen do that?

Do we have any reliable ones yet? I tried one to give me ideas for a fence in backyard. 4o was also terrible. I think these things are still not good enough to edit existing images. They are only trained for creating new images from scratch....

#

or the vanilla 'professional passport photo' schtick...

#

Giving landscape ideas with realistic edits of your backyard will be gamechanging....

keen beacon Apr 18, 2025, 3:24 AM

#

leaden meteor Do we have any reliable ones yet? I tried one to give me ideas for a fence in ba...

Iirc they did train in editing images at least with gpt 4o image gen

keen beacon Apr 18, 2025, 3:24 AM

#

leaden meteor Giving landscape ideas with realistic edits of your backyard will be gamechangin...

I guess this is too hard for now

#

It might also not be passing enough info to the gpt 4o image gen model

#

Since theyre different models for now

#

Chatgpt calls a tool to a different 4o model with image gen enabled

#

Too finicky to deploy one together for now I guess

barren prairie Apr 18, 2025, 3:44 AM

#

tall summit same limits in ai studio as 2.5 pro

Flash and have those limits ? 🙂🤨 I didn t hit it but really????!!!!
And on the app?? Do we have the same issue??

zinc ore Apr 18, 2025, 3:56 AM

#

https://x.com/bongrandp/status/1913040244963766432

Pierre Bongrand (@bongrandp) on X

For the first time, Google is responding to OpenAI's announcement in < 24 hours

The WAR is officially ON, and Google wants the LLM market

Google is now dominating +90% of the price share

leaden palm Apr 18, 2025, 3:59 AM

#

where grok 3 mini

#

oh, it's right there

#

(not to mention that the x axis isn't adjusted for thinking tokens)

#

actually the cons@64 is doing a lot of the heavy lifting, but even without, it's a better model than gemini 2.5 flash

upper wolf Apr 18, 2025, 4:19 AM

#

have you guys tried 2.5 flash with thinking turned off? if so what is your experience

zinc ore Apr 18, 2025, 4:25 AM

#

leaden palm actually the `cons@64` is doing a lot of the heavy lifting, but even without, it...

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing.

https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks/

TechCrunch

Kyle Wiggers

Did xAI lie about Grok 3's benchmarks? | TechCrunch

OpenAI researchers accused xAI about publishing misleading Grok 3 benchmarks. The truth is a little more nuanced.

#

Cons64 boosts the scores quite a bit, it's not reliable to compare with @1 passthroughs.

thorny drum Apr 18, 2025, 4:27 AM

#

december o3 biggest offender of this lol

#

but tbf they were pretty transparent of how hard they juiced that model for benchmarks

zinc ore Apr 18, 2025, 4:28 AM

#

Yeah agreed, openAI did it and Grok also did it. I'm hoping in the future these companies cut it out.

alpine coral Apr 18, 2025, 5:02 AM

#

i'm confused by the december o3 reference.. do you mean that arc-agi chart, with o3-high in the top right corner (costing >$1000/task)? seems kinda different to this ( @conensus )

thorny drum Apr 18, 2025, 5:07 AM

#

i think their approach to codeforces was comparable?
idk

#

but my understanding was o3 high was effectively cons1000 or whatever

ember rapids Apr 18, 2025, 5:12 AM

#

plain zinc Apr 18, 2025, 5:27 AM

#

I WANT

keen beacon Apr 18, 2025, 5:41 AM

#

pro is only like 300-400B

plain zinc Apr 18, 2025, 5:47 AM

#

ember rapids

#

LETS GOOOOO

#

It's confirmed

real totem Apr 18, 2025, 6:19 AM

#

Aint no way

#

Ultra

keen beacon Apr 18, 2025, 6:25 AM

#

plain zinc

no way

#

maybe in the next couple months

torn mantle Apr 18, 2025, 6:27 AM

#

next batch releases are gemini coders

plain zinc Apr 18, 2025, 6:28 AM

#

real totem Ultra

Why?

real totem Apr 18, 2025, 6:29 AM

#

plain zinc Why?

Cuz

#

Its gonna be

#

To good

ember rapids Apr 18, 2025, 6:29 AM

#

Honestly with the pace theyre moving at

#

i wouldnt be surprised if we got it say 2 months from now

keen beacon Apr 18, 2025, 6:31 AM

#

The world knowledge 2.5 pro has is incredible, I would really like a 2.5 ultra if it's done well and pricing isn't ridiculous

novel flame Apr 18, 2025, 6:32 AM

#

I tried running my usual mini benchmarks on 2.5 Flash and got the same perfect 5/5 score that all the top models get (from 4o and up). So far, so good.

Then I gave it my harder “zero-shot browser game” task and it did something unexpected; it didn’t do it. It didn’t fail, rather it replied that such a task was too big for a single prompt, explained why, gave a detailed plan for how you would do it in steps, and then implemented a working sandbox to build on.

So now my scoring system doesn’t work because it’s based on implemented features and Flash deliberately didn’t implement any features.

The thing is, no other model has done this. Many models will provide an incomplete game and list some improvements to make, but 2.5 Flash pushed back on the premise itself, which I don’t think I’ve seen before.

So I don’t know how to evaluate this for coding…… 🤨

#

At the same time, c’mon Flash, what the hell? The original Super Mario was a total of 20KB, which is maybe 16k tokens, so why not give it a whirl?

#

And on the subject of coding, has anyone else had the experience that o4-mini generates minified JS code?🥸🤪 It is the weirdest thing. I mean, if it’s trained on random JavaScript from the public web then obviously most of that is minified, but surely a bare minimum of human preference tuning would have taught it not to produce minified source code?!!???

keen beacon Apr 18, 2025, 6:41 AM

#

O4 mini in particular can be very weird it might be an artifact of the rl

keen fulcrum Apr 18, 2025, 6:48 AM

#

So gemini 3 will come during summer right?

keen beacon Apr 18, 2025, 6:50 AM

#

for me o3 > 2.5 but very small margin

#

o3 strength is not in question answers, is in its ability to use tools, run python code , search etc , compliting multi step tasks

torn mantle Apr 18, 2025, 6:52 AM

#

keen beacon o3 strength is not in question answers, is in its ability to use tools, run pyth...

you can enable grounding ( search ) on gemini 2.5 pro?

#

you have it all

cedar tide Apr 18, 2025, 6:53 AM

#

https://fixupx.com/ArtificialAnlys/status/1913057626117820438?t=eDD2-6fuPdTFRq7safK1DQ&s=19

Artificial Analysis (@ArtificialAnlys)

Grok 3 mini Reasoning’s recently launched API has highly compelling Intelligence vs. Price positioning
︀︀
︀︀@xai has recently launched APIs for Grok 3 and Grok 3 mini, after initially only making the models available via the Grok chat interface at launch. We have now completed benchmarking for both Grok 3 and Grok 3 mini.
︀︀
︀︀Grok 3 mini (high reasoning) stands out in the top left of our Intelligence vs Price chart, achieving one of the highest Artificial Analysis Intelligence Index scores ever, with pricing well below even DeepSeek R1.
︀︀
︀︀Grok 3 family overview: We’re initiating coverage of 6 versions of the Grok 3 family: Grok 3, Grok 3 Fast, Grok 3 mini (low reasoning), Grok 3 mini Fast (low reasoning), Grok 3 mini (high reasoning) and Grok 3 mini Fast (high reasoning). Today’s post focuses on our intelligence results; performance breakdown of all 6 versions will be available soon.
︀︀
︀︀Intelligence: As we highlighted at la…

keen beacon Apr 18, 2025, 7:03 AM

#

torn mantle you can enable grounding ( search ) on gemini 2.5 pro?

o3 does it better, seen it work 10-20min just for 1 answer

keen fulcrum Apr 18, 2025, 7:08 AM

#

Gemini 2.5 Ultra lol

fleet lintel Apr 18, 2025, 7:17 AM

#

novel flame I tried running my usual mini benchmarks on 2.5 Flash and got the same perfect 5...

that's actually quite smart to offer a plan if it can't do it, compared to just do something incomplete.

fleet lintel Apr 18, 2025, 7:17 AM

#

keen fulcrum Gemini 2.5 Ultra lol

too cryptic, it could mean anything.

kind cloud Apr 18, 2025, 7:19 AM

#

Screenshot_2025-04-18-16-18-51-712-edit_com.android.chrome.jpg

novel flame Apr 18, 2025, 7:22 AM

#

fleet lintel that's actually quite smart to offer a plan if it can't do it, compared to just ...

I agree, I mean that’s the answer I’d expect from a human if I gave them such a vague task with an unrealistic scope. But it makes it impossible to quantify its coding skill/usefulness against the others (including 2.5 Pro) on that test

fleet lintel Apr 18, 2025, 7:23 AM

#

true

plain zinc Apr 18, 2025, 7:26 AM

#

https://x.com/rakyll/status/1913099386793636162

Jaana Dogan ヤナドガン (@rakyll) on X

Time for Gemini Code 💎

#

👀

hardy violet Apr 18, 2025, 7:27 AM

#

tall summit can you give more context regarding what exactly you... did?

For some lyrics translated from Japanese to Chinese, we can directly tell if the Chinese version flows well and spot translations that are significantly off compared to human standards. Also, with essays I wrote myself – even though they aren't great – I know exactly what I was trying to say. This is where I noticed the issue with models like R1 and O3: they tend to misuse structures and over-elaborate. What's more, the writing style became noticeably 'R1-like,' and other users I talked to agreed. Since I'm not willing to pay for the API, and LM Arena can't handle testing slightly longer short stories, I didn't get around to testing how it interprets classic texts. (Plus, testing classics is easier to get away with; people rarely criticize over-interpreting classic works anyway）

keen beacon Apr 18, 2025, 7:28 AM

#

plain zinc https://x.com/rakyll/status/1913099386793636162

today maybe?

keen beacon Apr 18, 2025, 7:28 AM

#

kind cloud

another one!

fleet lintel Apr 18, 2025, 7:30 AM

#

kind cloud

any good?

fleet lintel Apr 18, 2025, 7:32 AM

#

plain zinc https://x.com/rakyll/status/1913099386793636162

thye might be talking about coding using gemini 2.5 flash

cedar tide Apr 18, 2025, 7:40 AM

#

Livebench

Screenshot_2025-04-18-09-36-44-550_com.android.chrome-edit.jpg

torn mantle Apr 18, 2025, 7:42 AM

#

plain zinc https://x.com/rakyll/status/1913099386793636162

woah

cedar tide Apr 18, 2025, 7:49 AM

#

Its already sure that we will have coder model https://x.com/legit_api/status/1912888062138347773?t=sflUSnSaHTCa6DmVNGfZ5g&s=19

ʟᴇɢɪᴛ (@legit_api) on X

WE ALSO SEEM TO BE GETTING CODE MODELS!

WE MIGHT SEE THE REVEAL OF NIGHTWHISPER

cedar tide Apr 18, 2025, 7:56 AM

#

kind cloud

Its new version of gemini 2.5 pro no ?

keen beacon Apr 18, 2025, 8:00 AM

#

we don't know yet

#

it's just a hidden category on the model selector

dusky aurora Apr 18, 2025, 8:02 AM

#

I have feedback abut beta interface.

#

In direct chat, there are no sampling options

keen beacon Apr 18, 2025, 8:03 AM

#

#new-ui-feedback

barren prairie Apr 18, 2025, 8:03 AM

#

dusky aurora In direct chat, there are no sampling options

Go to iu feedback

keen beacon Apr 18, 2025, 8:04 AM

#

cedar tide Livebench

lol, 3.7 sonnet beat by a flash model

cedar tide Apr 18, 2025, 8:05 AM

#

Anyone have hard prompt that 2.5 pro dont get ?

keen fulcrum Apr 18, 2025, 8:17 AM

#

keen beacon it's just a hidden category on the model selector

Its the stable
after preview -> stable

keen beacon Apr 18, 2025, 8:19 AM

#

what

#

im talking about the category on AI studio's backend referred to as "1P CODE MODELS"

#

(first party code models)

keen fulcrum Apr 18, 2025, 8:21 AM

#

Interesting to have special code models
Although I want them to have their own debug capabilities, I don't want tools

novel flame Apr 18, 2025, 8:41 AM

#

keen beacon im talking about the category on AI studio's backend referred to as "1P CODE MOD...

Or… 1st place (on benchmarks) coding models?

scarlet flint Apr 18, 2025, 8:43 AM

#

Hello

#

did google remove the image editing model from aistudio?

ocean vortex Apr 18, 2025, 8:44 AM

#

tall summit i can't believe there are people who think 100k+ token context is minimum useful...

Most of them don't even know why they would need it. If you aren't inputting videos, 99% of the time you don't need it nor it even provides a better experience. For an average convo that is already beyond the point of the model losing track of what exactly happened in the beginning of it regardless of the total context size

#

reasoning models are a different thing, but that only applies if you are saving reasoning into context (currently not possible with openai)

torn mantle Apr 18, 2025, 8:49 AM

#

kind cloud

is this 2.5 pro?

ocean vortex Apr 18, 2025, 8:49 AM

#

on that last point... I think this actually might be one of the reasons why openai has longer chains than anyone else, since they don't worry about this flooding the context lol

torn mantle Apr 18, 2025, 8:49 AM

#

its kinda slow

torn mantle Apr 18, 2025, 8:55 AM

#

kind cloud

seems like its a better version of gemini 2.5 pro thinking

#

recent checkpoint?

cedar tide Apr 18, 2025, 9:02 AM

#

why 2.5 flash without Reasoning does not replace gemini 2.0 flash on gemini?

keen beacon Apr 18, 2025, 9:09 AM

#

kind cloud

where do you find this

tall summit Apr 18, 2025, 9:10 AM

#

ocean vortex Most of them don't even know why they would need it. If you aren't inputting vid...

i agree

kind cloud Apr 18, 2025, 9:11 AM

#

keen beacon where do you find this

random battle

unborn ocean Apr 18, 2025, 9:16 AM

#

cedar tide why 2.5 flash without Reasoning does not replace gemini 2.0 flash on gemini?

They are probably still working out the inference to some degree and its also supposed to be experimental for now

willow grail Apr 18, 2025, 9:16 AM

#

someoneneeds to make a website which takes in all benchmarkes and creates a average

sweet tinsel Apr 18, 2025, 9:16 AM

#

cedar tide why 2.5 flash without Reasoning does not replace gemini 2.0 flash on gemini?

For me it replaced 2.0 Flash Thinking in the Gemini App

#

Atleast

cedar tide Apr 18, 2025, 9:16 AM

#

sweet tinsel For me it replaced 2.0 Flash Thinking in the Gemini App

Yes but im speaking about the non thinking version

sweet tinsel Apr 18, 2025, 9:17 AM

#

But like Not so said it's most likely Multimodal feature and inference bount

ocean vortex Apr 18, 2025, 9:23 AM

#

unborn ocean They are probably still working out the inference to some degree and its also su...

"experimental" is just a name that does not mean much lately

barren prairie Apr 18, 2025, 9:25 AM

#

ocean vortex "experimental" is just a name that does not mean much lately

Most of google models are experimental... Rarely stable

#

We are mice and rabbits for google

unborn ocean Apr 18, 2025, 9:27 AM

#

ocean vortex "experimental" is just a name that does not mean much lately

Well they use the time to optimize inference mainly. But I get your point about them basically being normal models

calm sequoia Apr 18, 2025, 9:33 AM

#

Anyone checked what is claybrook?

torn mantle Apr 18, 2025, 9:36 AM

#

calm sequoia Anyone checked what is claybrook?

google

#

i cant really tell if its better than gemini 2.5 pro thinking or not

calm sequoia Apr 18, 2025, 9:37 AM

#

There's nothing else to cook for them, what else could this be 😄

#

Probably adapted to some niche application

#

Or flash with thinking at some intensity setting

ocean vortex Apr 18, 2025, 9:39 AM

#

calm sequoia There's nothing else to cook for them, what else could this be 😄

coding model?

torn mantle Apr 18, 2025, 9:39 AM

#

no

#

it does give in-depth details

#

a bit more than 2.5 pro 03

calm sequoia Apr 18, 2025, 9:40 AM

#

ocean vortex coding model?

I thouth nightwhishperrer was already that

torn mantle Apr 18, 2025, 9:40 AM

#

yea i think it may be better than gemini 2.5 pro 03

#

from my initial tests

calm sequoia Apr 18, 2025, 9:42 AM

#

Full version, update or, maybe, Ultra 👀

#

Allgirht then, lets test it.

torn mantle Apr 18, 2025, 9:45 AM

#

this model

#

is kinda different from other google releases

#

i like the formatting

#

it also goes into many points

tall summit Apr 18, 2025, 9:46 AM

#

hardy violet For some lyrics translated from Japanese to Chinese, we can directly tell if the...

oh you were still talking about translation. thanks for the info even if you weren't talking to me specifically earlier i personally appreciate the results

torn mantle Apr 18, 2025, 9:46 AM

#

its better at explaining things too

calm sequoia Apr 18, 2025, 9:50 AM

#

Could this be the LearnLM

torn mantle Apr 18, 2025, 9:54 AM

#

its really really good

#

at general knowledge

#

it doesnt give you the typical generic answers

calm sequoia Apr 18, 2025, 9:55 AM

#

Okey then, it's update, full or ultra.

torn mantle Apr 18, 2025, 9:55 AM

#

woah

#

its actually good

#

i cant believe this

torn mantle Apr 18, 2025, 9:55 AM

#

calm sequoia Okey then, it's update, full or ultra.

its kinda slow

#

i wont be surprised if its ultra

#

i didnt try it on reasoning tasks tbh

cedar tide Apr 18, 2025, 9:57 AM

#

Can anyone send the "claybrook" this prompt ?

#

creates a 99% copy of the discord front end, in a single html file, (without the backend)

hardy pecan Apr 18, 2025, 10:00 AM

#

Claybrook confirmed quite good, defintely 2.5 pro, dragontail level for sure

torn mantle Apr 18, 2025, 10:01 AM

#

cedar tide creates a 99% copy of the discord front end, in a single html file, (without the...

ok

hardy pecan Apr 18, 2025, 10:02 AM

#

It's funny, google probably reads this chat and we figure out what kinda model they JUST released super fast lol

cedar tide Apr 18, 2025, 10:02 AM

#

Claybrook is on webdevarena too

torn mantle Apr 18, 2025, 10:02 AM

#

cedar tide Claybrook is on webdevarena too

since when

#

i just checked like 10 min ago

cedar tide Apr 18, 2025, 10:02 AM

#

I just see it now

torn mantle Apr 18, 2025, 10:03 AM

#

interesting

#

two new models added

plain zinc Apr 18, 2025, 10:03 AM

#

Two?

torn mantle Apr 18, 2025, 10:03 AM

#

#

yea

plain zinc Apr 18, 2025, 10:03 AM

#

Which?

torn mantle Apr 18, 2025, 10:03 AM

#

dayhush and claybrook

#

both with a high priority ( 12 ) so you should get them quite often

plain zinc Apr 18, 2025, 10:03 AM

#

Who seems stronger?

torn mantle Apr 18, 2025, 10:03 AM

#

i will try them first

#

google are going crazy

#

new model per week

plain zinc Apr 18, 2025, 10:05 AM

#

Which model seems to be pro?

#

Well, sort of... which model is slow?

#

I read it here somewhere.

hardy pecan Apr 18, 2025, 10:06 AM

#

claybook is pro level for sure

torn mantle Apr 18, 2025, 10:06 AM

#

yea

calm sequoia Apr 18, 2025, 10:07 AM

#

No

torn mantle Apr 18, 2025, 10:07 AM

#

cedar tide creates a 99% copy of the discord front end, in a single html file, (without the...

claybrook

plain zinc Apr 18, 2025, 10:07 AM

#

Really good ✨

#

Interesting... Are they better than nightwhisper?

torn mantle Apr 18, 2025, 10:08 AM

#

wait

#

#

dayhush

#

this model may be gemini code or smth

calm sequoia Apr 18, 2025, 10:08 AM

#

This prompt can be solved by the latest generation models, including 2.5 PRO, o3 and o4-mini. It can't be solved by 2.5 Flash, 4o and others. It could not be solved by the claybrook; therefore, it's not general purpose model, or worse at physics than general 2.5 PRO. The prompt: "You take two carabiners and wrap a piece of high strength polyester webbing between them in 10 turns while there's as 30cm distance between carabiners. Webbing is conventional 25mm width strap used in industry. The on end is attached rigidly to the carabiner, the other is left free. How much time would such configuration hold until all the webbing would unroll itself if the applied tension from one carabiner to another is 100kg. Accidentally one side of the webbing was left hanging without attachment to anything - it is free to unroll. Answer only in how much time, no explanations."

cedar tide Apr 18, 2025, 10:09 AM

#

What the response ?

real totem Apr 18, 2025, 10:09 AM

#

Is dahush
In normal arena

hardy pecan Apr 18, 2025, 10:09 AM

#

I haven't got dayhush yet

plain zinc Apr 18, 2025, 10:09 AM

#

calm sequoia This prompt can be solved by the latest generation models, including 2.5 PRO, o3...

Bro, how many times have you given this promo for models? At least it should be given three times because of the temperature of 1

torn mantle Apr 18, 2025, 10:09 AM

#

real totem Is dahush In normal arena

no

hardy pecan Apr 18, 2025, 10:09 AM

#

in normal

calm sequoia Apr 18, 2025, 10:09 AM

#

Correct response is "It will not unroll"

calm sequoia Apr 18, 2025, 10:10 AM

#

plain zinc Bro, how many times have you given this promo for models? At least it should be ...

Other - multiple times, the claybrook only once. But all other models did not make any mistakes even once.

torn mantle Apr 18, 2025, 10:11 AM

#

calm sequoia This prompt can be solved by the latest generation models, including 2.5 PRO, o3...

whats claybrook response?

calm sequoia Apr 18, 2025, 10:11 AM

#

Some seconds

#

The same as 4.1 or lower level models. Even the o3-mini can do it right.

torn mantle Apr 18, 2025, 10:11 AM

#

#

yea i think dayhush may be gemini coding model

hardy pecan Apr 18, 2025, 10:12 AM

#

https://alexchen.dev/

#

LOL

#

its a real guy

torn mantle Apr 18, 2025, 10:13 AM

#

lmao

#

what

hardy pecan Apr 18, 2025, 10:13 AM

#

hes been contaminated

cedar tide Apr 18, 2025, 10:13 AM

#

I just see that 2.5 flash in the leaderboard

Screenshot_2025-04-18-12-07-58-744_com.android.chrome-edit.jpg

torn mantle Apr 18, 2025, 10:15 AM

#

hardy pecan https://alexchen.dev/

actually thats a thing i noticed about nightwhisper too

#

it gives you real references and doesnt hallucinates much

#

could be a thing in their coding models

plain zinc Apr 18, 2025, 10:21 AM

#

And how does claybrook encode?

#

I mean, how does claybrook write the code

#

Better, worse or much better or much worse?

torn mantle Apr 18, 2025, 10:22 AM

#

plain zinc Better, worse or much better or much worse?

you mean writing-wise?

#

still didnt get it yet on webdev

plain zinc Apr 18, 2025, 10:24 AM

#

torn mantle you mean writing-wise?

Nah, coding

torn mantle Apr 18, 2025, 10:25 AM

#

plain zinc Nah, coding

so far im only getting dayhush

#

seems kinda similar to nightwhisper

calm sequoia Apr 18, 2025, 10:25 AM

#

Inferior even to the o3-mini 💩

torn mantle Apr 18, 2025, 10:25 AM

#

calm sequoia Inferior even to the o3-mini 💩

keen beacon Apr 18, 2025, 10:26 AM

#

is dayhush only on webdev

torn mantle Apr 18, 2025, 10:26 AM

#

keen beacon is dayhush only on webdev

yea

calm sequoia Apr 18, 2025, 10:27 AM

#

Yeah, I've concluded that the o3 is the king of the next generation (for my tests). Still, I use 2.5 Pro for coding.

brittle tiger Apr 18, 2025, 10:27 AM

#

https://x.com/slow_developer/status/1913175292006580688?t=ED94PaxdnFtvRSlstNQgtA&s=19

Haider. (@slow_developer) on X

ok, now this is interesting

LiveBench puts gemini 2.5 flash above 2.5 pro in coding

how's your experience?

calm sequoia Apr 18, 2025, 10:29 AM

#

brittle tiger Apr 18, 2025, 10:29 AM

#

Day night
Hush whisper

real totem Apr 18, 2025, 10:29 AM

#

They're shet

calm sequoia Apr 18, 2025, 10:29 AM

#

brittle tiger https://x.com/slow_developer/status/1913175292006580688?t=ED94PaxdnFtvRSlstNQgtA...

Need gigantic context length

plain zinc Apr 18, 2025, 10:31 AM

#

calm sequoia

Bruh💀. Where does skibidi come from?..

calm sequoia Apr 18, 2025, 10:31 AM

#

It's for those who want to vote but dohnt know nothing

cedar tide Apr 18, 2025, 10:33 AM

#

calm sequoia

Which is other ?

torn mantle Apr 18, 2025, 10:33 AM

#

brittle tiger Day night Hush whisper

yea they seem kinda similar

#

though i cant decide yet which one is better

brittle tiger Apr 18, 2025, 10:34 AM

#

There's no way they're not related given names. Is nw still best outputs?

torn mantle Apr 18, 2025, 10:34 AM

#

keen ferry Apr 18, 2025, 10:35 AM

#

torn mantle

wtf

keen fulcrum Apr 18, 2025, 10:36 AM

#

https://www.rxddit.com/r/Bard/comments/1k1v8vd/a_surprising_reason_why_gemini_25s_thinking/

rxddit.com

A Surprising Reason why Gemini 2.5's thinking models are so cheap (It’s not TPUs)

u/srivatsansam on r/Bard

I've been intrigued by Gemini 2.5's "Thinking Process" (Google doesn't actually call it Chain of Thought anywhere officially, so I'm sticking with "Thinking Process" for now).

What's fascinating is how Gemini self-corrects without the usual "wait," "aha," or other filler you'd typically see from models like DeepSeek, Claude, or Grok. It's kind...

hardy pecan Apr 18, 2025, 10:36 AM

#

brittle tiger https://x.com/slow_developer/status/1913175292006580688?t=ED94PaxdnFtvRSlstNQgtA...

Variance

torn mantle Apr 18, 2025, 10:38 AM

#

im trying this prompt now

#

holy

#

dayhush

#

the magnetic field needs a bit of adjustment tho

keen beacon Apr 18, 2025, 10:44 AM

#

torn mantle

woah what

#

that's a cool design

torn mantle Apr 18, 2025, 10:45 AM

#

torn mantle

gpt4.1

keen beacon Apr 18, 2025, 10:45 AM

#

this is probably a new checkpoint for nightwhisper

torn mantle Apr 18, 2025, 10:45 AM

#

kinda cool too but its not animated with particles

keen beacon Apr 18, 2025, 10:45 AM

#

the names are similar

torn mantle Apr 18, 2025, 10:45 AM

#

keen beacon the names are similar

yea

#

could be

torn mantle Apr 18, 2025, 10:46 AM

#

torn mantle

this is actually so cool

plain zinc Apr 18, 2025, 10:47 AM

#

Wow, claybrook better gemini 2.5 pro in web design

keen beacon Apr 18, 2025, 10:47 AM

#

dayhush is def better than claybrook

#

claybrook to me felt meh

#

it flopped on Qs 2.5 pro can get right

plain zinc Apr 18, 2025, 10:48 AM

#

keen beacon claybrook to me felt meh

But it's enough that claybrook is better than 2.5 pro.

keen beacon Apr 18, 2025, 10:48 AM

#

not in my experience

#

🤷‍♂️

novel flame Apr 18, 2025, 10:49 AM

#

Looks like dayhush is nightwhisper, exciting times for coding models! Next cpt: dawnshush? No - DUSKMUMBLE!

keen beacon Apr 18, 2025, 10:51 AM

#

it won't be nightwhisper

#

it'll be an updated ver

#

it's probably just 2.5 with a focus on code

novel flame Apr 18, 2025, 10:52 AM

#

Yes of course but same model

torn mantle Apr 18, 2025, 10:56 AM

#

plain zinc Wow, claybrook better gemini 2.5 pro in web design

it didnt feel solid at coding tbh

tall summit Apr 18, 2025, 10:56 AM

#

hardy pecan https://alexchen.dev/

hes going to the Flosstradamus rave tomorrow

hardy pecan Apr 18, 2025, 10:57 AM

#

2.5 pro with 128 code experts!!! Or sumthin

torn mantle Apr 18, 2025, 10:58 AM

#

keen beacon this is probably a new checkpoint for nightwhisper

i think you are right

#

its probably an even better version than nightwhisper

tall summit Apr 18, 2025, 10:58 AM

#

torn mantle gpt4.1

i mean this is nice

tall summit Apr 18, 2025, 10:59 AM

#

keen fulcrum https://www.rxddit.com/r/Bard/comments/1k1v8vd/a_surprising_reason_why_gemini_25...

people are bashing them in the comments

novel flame Apr 18, 2025, 11:00 AM

#

“Just 2.5 with a focus on code” though… that could finally dethrone 3.7 Sonnet for real world coding.

The current state of coding models (warning: possibly hot take): 3.7 Sonnet is still the king, 2.5 Pro is very close, and o3 is sometimes better but expensive and not reliable/consistent. o4 mini is sometimes on par with or slightly better than 2.5 Pro but again, too inconsistent (like generating minified code) to rely on for everything. I haven’t put Grok 3 Mini fully through its paces recently so I’m going to do that later today, it may be in there somewhere too.

But a coding focused Gemini 2.5 variant? I see no reason why it wouldn’t beat Sonnet.

balmy mist Apr 18, 2025, 11:03 AM

#

https://3000-ii7v95x3mngfpizk6honw-65401bad.e2b-foxtrot.dev

tall summit Apr 18, 2025, 11:05 AM

#

i keep forgetting webdev arena even exists

balmy mist Apr 18, 2025, 11:06 AM

#

dayhush is phenomenal

#

i wonder how long it stays

sage raptor Apr 18, 2025, 11:06 AM

#

dayhush is insane

balmy mist Apr 18, 2025, 11:06 AM

#

and if they will actually release it

torn mantle Apr 18, 2025, 11:06 AM

#

sage raptor dayhush is insane

yep

torn mantle Apr 18, 2025, 11:07 AM

#

balmy mist https://3000-ii7v95x3mngfpizk6honw-65401bad.e2b-foxtrot.dev

dayhush?

#

what was the prompt

novel flame Apr 18, 2025, 11:07 AM

#

tall summit people are bashing them in the comments

I suspect part of the recent Gemini magic is the ‘Entropy’ logit search method, and part of the reason why they’re training models so rapidly is Titans. I’m on my phone so I don’t have the papers here but DeepMind has seriously cooked under the hood.

torn mantle Apr 18, 2025, 11:07 AM

#

did you tell it how the characters should face eo?

balmy mist Apr 18, 2025, 11:07 AM

#

torn mantle what was the prompt

create pokemon game

torn mantle Apr 18, 2025, 11:07 AM

#

balmy mist create pokemon game

thats it?

balmy mist Apr 18, 2025, 11:08 AM

#

yup

torn mantle Apr 18, 2025, 11:08 AM

#

nah

#

thats a first

tall summit Apr 18, 2025, 11:08 AM

#

novel flame I suspect part of the recent Gemini magic is the ‘Entropy’ logit search method, ...

i am disheartened that i've no idea what you are talking about, but never fear! research

torn mantle Apr 18, 2025, 11:08 AM

#

you see how the characters are facing eo?

balmy mist Apr 18, 2025, 11:08 AM

#

yeah man very clean

#

and the run feature, and animations

#

og nightwhisper didnt even get that without extra prompting

hardy pecan Apr 18, 2025, 11:10 AM

#

https://3000-izzpt6s87uo0372wrmkxh-0b3529ea.e2b-foxtrot.dev

#

This was "create Pokemon game"

#

By day hush

#

So... Not quite the same result as above

torn mantle Apr 18, 2025, 11:11 AM

#

hardy pecan By day hush

doenst look good

tall summit Apr 18, 2025, 11:11 AM

#

lmao ok dayhush is good

#

oh it competed with llama 4 maverick

keen beacon Apr 18, 2025, 11:12 AM

#

0-shot

#

did a lot better than 2.5 pro

balmy mist Apr 18, 2025, 11:12 AM

#

hardy pecan This was "create Pokemon game"

damn you sure? weird how we can have different results

sage raptor Apr 18, 2025, 11:12 AM

#

keen beacon 0-shot

nw was able to do that too

balmy mist Apr 18, 2025, 11:12 AM

#

prompt: my friends birthday was yesterday and i missed it, make me something to give Andrew
https://3000-igg96j3ix44l5qgzr6uwq-ec17a5a5.e2b-foxtrot.dev

keen beacon Apr 18, 2025, 11:13 AM

#

sage raptor nw was able to do that too

well yes I never said it couldn't

#

but imo these outputs have been slightly better than nw

sage raptor Apr 18, 2025, 11:13 AM

#

yep

tall summit Apr 18, 2025, 11:13 AM

#

keen beacon 0-shot

i was so confused for about 2 seconds

balmy mist Apr 18, 2025, 11:13 AM

#

this was the exact prompt for pokemon, i added a ' by mistake lol, maybe that makes a difference?

hardy pecan Apr 18, 2025, 11:14 AM

#

balmy mist damn you sure? weird how we can have different results

Yes very sure lol, not much to get wrong

brittle tiger Apr 18, 2025, 11:14 AM

#

https://3000-ixz0h6ax00b2ls83bhvlo-d3b1dbce.e2b-foxtrot.dev/

Dayhush
prompt: Create a visually stunning, interactive web visualization of bioluminescent deep-sea organisms responding to subtle environmental cues (like simulated currents or light sources triggered by user interaction). Focus on ethereal beauty, fluid animation, and a sense of discovery. Amaze me with the organic feel.

balmy mist Apr 18, 2025, 11:14 AM

#

hardy pecan Yes very sure lol, not much to get wrong

trying doing this prompt exactly:
create a pokemon game'

novel flame Apr 18, 2025, 11:14 AM

#

tall summit i am disheartened that i've no idea what you are talking about, but never fear! ...

Entropy: https://timkellogg.me/blog/2024/10/10/entropix and ~~https://arxiv.org/html/2407.01082v2~~ https://arxiv.org/pdf/2402.10200

What is entropix doing?

tall summit Apr 18, 2025, 11:15 AM

#

novel flame Entropy: https://timkellogg.me/blog/2024/10/10/entropix and ~~https://arxiv.org/...

oh thanks so much

novel flame Apr 18, 2025, 11:15 AM

#

Titans: https://www.llmwatch.com/p/google-releases-transformer-20

🦾 Google Releases Transformer 2.0

And how to train your own version of o1-preview on a budget

hardy pecan Apr 18, 2025, 11:15 AM

#

balmy mist trying doing this prompt exactly: create a pokemon game'

I'll try that

torn mantle Apr 18, 2025, 11:18 AM

#

claybrook isnt that good at coding

#

thats for sure

novel flame Apr 18, 2025, 11:19 AM

#

Also for coding agents, I recommend checking out the paper on ‘Moatless Tree Search’ gaining 20 percentage points on SWE-Bench through search: https://arxiv.org/abs/2410.20285 — showing how you can replicate the performance of much bigger models by taking a really small model and letting it cook with search. This is why I suspect a lot of the top Reasoning models are actually using far smaller models than the community seems to think.

arXiv.org

SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search ...

Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often follow linear, sequential processes that prevent backtracking and expl...

hardy pecan Apr 18, 2025, 11:22 AM

#

Dayhush vs Gemini 2.5 pro

torn mantle Apr 18, 2025, 11:24 AM

#

hardy pecan Dayhush vs Gemini 2.5 pro

nice

ocean vortex Apr 18, 2025, 11:24 AM

#

brittle tiger https://x.com/slow_developer/status/1913175292006580688?t=ED94PaxdnFtvRSlstNQgtA...

benchmark issue

balmy mist Apr 18, 2025, 11:24 AM

#

i hope the release it soon

ocean vortex Apr 18, 2025, 11:25 AM

#

this would be normal with openai since mini does indeed perform better on some tasks

#

but not for gemini lol

#

it's literally behind in their every single measured metric

tall summit Apr 18, 2025, 11:25 AM

#

hardy pecan Dayhush vs Gemini 2.5 pro

funny prompt

balmy mist Apr 18, 2025, 11:25 AM

#

dayhush pinball game: https://3000-iroe1qqj1b1pwvwlb0ny7-9c4a2736.e2b-foxtrot.dev

brittle tiger Apr 18, 2025, 11:25 AM

#

ocean vortex benchmark issue

yeah for sure. flash getting better than 2.5 pro basically confirms their coding category isn't great

cloud meadow Apr 18, 2025, 11:26 AM

#

keen fulcrum https://www.rxddit.com/r/Bard/comments/1k1v8vd/a_surprising_reason_why_gemini_25...

Google is secretly using atom of thoughts (trust!)

balmy mist Apr 18, 2025, 11:26 AM

#

balmy mist dayhush pinball game: https://3000-iroe1qqj1b1pwvwlb0ny7-9c4a2736.e2b-foxtrot.de...

almost, but a very hard prompt

blazing coyote Apr 18, 2025, 11:26 AM

#

I fell like Nightwhisper was better in design than Dayhush

torn mantle Apr 18, 2025, 11:27 AM

#

balmy mist https://3000-ii7v95x3mngfpizk6honw-65401bad.e2b-foxtrot.dev

#

didnt seem much different from gemini 2.5 pro

#

03

#

this is so confusing

balmy mist Apr 18, 2025, 11:27 AM

#

torn mantle

this is nice

torn mantle Apr 18, 2025, 11:27 AM

#

blazing coyote I fell like Nightwhisper was better in design than Dayhush

maybe

balmy mist Apr 18, 2025, 11:27 AM

#

this is o3 or 2.5 pro?

torn mantle Apr 18, 2025, 11:27 AM

#

im lost xd

torn mantle Apr 18, 2025, 11:27 AM

#

balmy mist this is o3 or 2.5 pro?

2.5 pro

balmy mist Apr 18, 2025, 11:28 AM

#

torn mantle 2.5 pro

what prompt?

#

same exact? 0-shot?

torn mantle Apr 18, 2025, 11:28 AM

#

balmy mist same exact? 0-shot?

yes

#

create a beautiful looking pokemon game battle mode with pokeapi, with an extremely pleasing design, wow me.

balmy mist Apr 18, 2025, 11:29 AM

#

not exact lol

ocean vortex Apr 18, 2025, 11:29 AM

#

torn mantle this is so confusing

try web design or svg, should be less confusing then

balmy mist Apr 18, 2025, 11:29 AM

#

let me try that

#

right now its generating a super mario game, lets see how it does on that then i will try that updated prompt, but im still shocked that 2.5 pro could do that

#

it was not able to do that before

#

you think they updated it?

ocean vortex Apr 18, 2025, 11:30 AM

#

balmy mist right now its generating a super mario game, lets see how it does on that then i...

2.5 pro is a fairly big model, it's gonna do well on visuals

balmy mist Apr 18, 2025, 11:30 AM

#

https://3000-iihadi240nvhk3fjc9zoe-ed96e18a.e2b-foxtrot.dev

ocean vortex Apr 18, 2025, 11:30 AM

#

unlike all the minis and stuff

real coral Apr 18, 2025, 11:31 AM

#

I'm quite liking the current 4o. In the prompts I care about it's pretty much on par with 2.5. Has a nice style to it.

balmy mist Apr 18, 2025, 11:31 AM

#

ocean vortex 2.5 pro is a fairly big model, it's gonna do well on visuals

idk bro it seems diff, before it didnt do well visuals like that when testing with pokemon, i had to really prompt it and guide it

balmy mist Apr 18, 2025, 11:32 AM

#

balmy mist https://3000-iihadi240nvhk3fjc9zoe-ed96e18a.e2b-foxtrot.dev

okay mario game, let me tell it to make it guide it a lil

torn mantle Apr 18, 2025, 11:32 AM

#

@balmy mist https://3000-iibslowa8wbqm3106z5gm-337d29f2.e2b-foxtrot.dev

#

can you try this?

balmy mist Apr 18, 2025, 11:33 AM

#

wow

#

thats nice

#

prompt? and 0-shot?

torn mantle Apr 18, 2025, 11:33 AM

#

yes

brittle tiger Apr 18, 2025, 11:37 AM

#

all this dayhush nightwhisper talk but i'm going to morning scream if i get webdev arena bug where a side doesnt load onemore time

calm sequoia Apr 18, 2025, 11:39 AM

#

calm sequoia

People who voted Gemini or "Other" - what prompts are you using?

blazing coyote Apr 18, 2025, 11:39 AM

#

Claybrook also appers in the webarena

balmy mist Apr 18, 2025, 11:40 AM

#

wow they fixed webdev arena

keen beacon Apr 18, 2025, 11:41 AM

#

brittle tiger all this dayhush nightwhisper talk but i'm going to morning scream if i get webd...

yeah the webdev arena implementation kinda sucks..

balmy mist Apr 18, 2025, 11:41 AM

#

now your creations are persistent?

keen beacon Apr 18, 2025, 11:41 AM

#

sometimes it just breaks and ends the round if you try and follow-up

torn mantle Apr 18, 2025, 11:41 AM

#

balmy mist now your creations are persistent?

no

#

still the same issues

balmy mist Apr 18, 2025, 11:41 AM

#

torn mantle no

i just reloaded the page and i still can load the creation

#

like this one:
https://3000-itjougubgpyvisl9zdlw0-2266d380.e2b-foxtrot.dev/

#

but i started a new round and currently generating something

#

yall can access it?

torn mantle Apr 18, 2025, 11:43 AM

#

yea

balmy mist Apr 18, 2025, 11:43 AM

#

hmm i wonder how long it lasts

#

better than before when it would timeout fast lol

#

i guess they slowly working on the website

#

still gotta fix the glitching tho, i wonder if the results are actually fair in the leaderboard

#

cause multiple times things would not load for me or even gen any code

#

and it was the better model but i picked the other one since it generated code

torn mantle Apr 18, 2025, 11:49 AM

#

yea i had the same issue

#

they still didnt fix it

#

dayhush makes some questionable choices

#

nightwhisper felt more reliable tbh

#

it will rarely miss

alpine coral Apr 18, 2025, 12:08 PM

#

there are no anon models in the beta arena right?

cedar tide Apr 18, 2025, 12:18 PM

#

alpine coral there are no anon models in the beta arena right?

Yes

alpine coral Apr 18, 2025, 12:19 PM

#

i phrased that poorly lol

#

i haven't encountered anon models in the beta arena before, is that still the case (they're not in there?)

tall summit Apr 18, 2025, 12:20 PM

#

the ones in beta arena are only the ones also in beta direct chat

brittle tiger Apr 18, 2025, 12:20 PM

#

brittle tiger all this dayhush nightwhisper talk but i'm going to morning scream if i get webd...

jinxed myself so hard here. have gotten this bug maybe 6 times in a row. anyone have best practices for avoiding it in webdev arena?

tall summit Apr 18, 2025, 12:20 PM

#

brittle tiger jinxed myself so hard here. have gotten this bug maybe 6 times in a row. anyone ...

it happened to me so many times today

#

i hope there is a way to avoid it

brittle tiger Apr 18, 2025, 12:22 PM

#

is there a way to at least figure out the name of model that had an output without voting?

calm sequoia Apr 18, 2025, 12:25 PM

#

brittle tiger is there a way to at least figure out the name of model that had an output witho...

no 🙂

narrow elbow Apr 18, 2025, 12:30 PM

#

whats dayhush? google new model?

worthy thunder Apr 18, 2025, 12:32 PM

#

narrow elbow whats dayhush? google new model?

That's a pretty cool output

narrow elbow Apr 18, 2025, 12:32 PM

#

worthy thunder That's a pretty cool output

https://3000-ief2zmf8bfinbq4n4yrsz-884333c4.e2b-foxtrot.dev/

barren prairie Apr 18, 2025, 12:35 PM

#

narrow elbow whats dayhush? google new model?

Is it as good as NW?

narrow elbow Apr 18, 2025, 12:36 PM

#

i dont know..

novel flame Apr 18, 2025, 12:37 PM

#

barren prairie Is it as good as NW?

Night/Day, Whisper/Hush. It’s got to be a newer checkpoint of the same model

balmy mist Apr 18, 2025, 12:37 PM

#

novel flame Night/Day, Whisper/Hush. It’s got to be a newer checkpoint of the same model

day and night

worthy thunder Apr 18, 2025, 12:37 PM

#

I ran the OpenAI-MRCR benchmark for Gemini 2.5 Flash (Auto Thinking and Non-Thinking). Interesting curve for Gemini 2.5 Flash Non-Thinking. Meanwhile Gemini 2.5 Flash Thinking (Auto) matches Gemini 2.5 Pro. Wanted to be able to compare what the perf was like with thinking on vs off, which there does appear to be a difference, with some odd curve. That curve is consistent over all of the runs of the full benchmark for Flash Non-thinking.

For other results, you can check my thread: https://x.com/DillonUzar/status/1913208873206362271

I'm still working on o3 access and will update when I have access. Some error on OpenAI's side, they're looking into it for my org account. If anyone has access in the meantime and is willing to coordinate, let me know. If anyone wants other models, just let me know which and I'll see what I can budget and run.
Still working on setting up a website to allow everyone to view individual run results (like on matharena.ai), and also setting up Graphwalks (another long context benchmark from OpenAI)

worthy thunder Apr 18, 2025, 12:38 PM

#

narrow elbow https://3000-ief2zmf8bfinbq4n4yrsz-884333c4.e2b-foxtrot.dev/

Was the sandbox destroyed?

narrow elbow Apr 18, 2025, 12:39 PM

#

yea

narrow elbow Apr 18, 2025, 12:40 PM

#

worthy thunder Was the sandbox destroyed?

Maybe webdev was automatically destroyed within a certain period of time, and I didn't close it.

torn mantle Apr 18, 2025, 12:43 PM

#

barren prairie Is it as good as NW?

yea

calm sequoia Apr 18, 2025, 12:44 PM

#

worthy thunder I ran the OpenAI-MRCR benchmark for Gemini 2.5 Flash (Auto Thinking and Non-Thin...

Oh man your graphs are always very interesting, but you must make the color lines thicker and please, rank the table according to results 😄

torn mantle Apr 18, 2025, 12:44 PM

#

Its on the same level

worthy thunder Apr 18, 2025, 12:44 PM

#

calm sequoia Oh man your graphs are always very interesting, but you must make the color line...

Will do, thanks for the suggestions 🙂

plain zinc Apr 18, 2025, 12:44 PM

#

... And indeed... If we compare the sky with the earth, then we know that the sky is better than the earth, but the day and night...

#

Which is better?

torn mantle Apr 18, 2025, 12:44 PM

#

NW = DH
Dragontail
Gemini 2.5 pro 03
Claybrook
Gemini 2.5 flash thinking

brittle tiger Apr 18, 2025, 12:45 PM

#

pretty good dayhush output for complex task
https://3000-ilb9lm2u2urwebu9sl55g-a7c98fd5.e2b-foxtrot.dev/

prompt: Create a captivating, interactive web demo visualizing fluid dynamics, such as laminar flow breaking into turbulence or convection currents. Use particle systems or shader-based techniques (like reaction-diffusion) to create organic, flowing patterns. Allow users to introduce 'heat' sources, obstacles, or change flow velocity to see the beautiful complexity emerge. Focus on the mesmerizing, almost alive patterns.

plain zinc Apr 18, 2025, 12:45 PM

#

torn mantle 1. NW = DH 2. Dragontail 3. Gemini 2.5 pro 03 4. Claybrook 5. Gemini 2.5 flash t...

But no._.. Nightwhisper was much better

glass arch Apr 18, 2025, 12:45 PM

#

narrow elbow whats dayhush? google new model?

can you do this in the beta UI?

torn mantle Apr 18, 2025, 12:45 PM

#

plain zinc But no._.. Nightwhisper was much better

Sometimes it gave me better results

#

Im not sure tbh

#

But its close to nw

real totem Apr 18, 2025, 12:46 PM

#

Dayhush is nightwhisepr

plain zinc Apr 18, 2025, 12:46 PM

#

By the way, please. Give dayhush this prompt:

#

I'll send the file.

real totem Apr 18, 2025, 12:46 PM

#

Bro its an udpate4d

#

Nightwhisper

#

Thats why it got the similar naem

plain zinc Apr 18, 2025, 12:46 PM

#

plain zinc By the way, please. Give dayhush this prompt:

📎 for-dayhush.txt

real totem Apr 18, 2025, 12:47 PM

#

Wydym

#

Wdym

#

The other ais

#

Got diffferent name

#

Yes

#

Ye

#

IUts an updated

#

Nightwhisper

plain zinc Apr 18, 2025, 12:48 PM

#

plain zinc

And give me the result for sure.

alpine coral Apr 18, 2025, 12:48 PM

#

hes trolling lol

real totem Apr 18, 2025, 12:48 PM

#

Whisper hush night day

#

Ye

torn mantle Apr 18, 2025, 12:49 PM

#

plain zinc

ok

narrow elbow Apr 18, 2025, 12:49 PM

#

glass arch can you do this in the beta UI?

how?https://beta.lmarena.ai/?

LMArena

An open platform for evaluating AI through human preference

#

here?

real totem Apr 18, 2025, 12:50 PM

#

glass arch can you do this in the beta UI?

nah

#

Beta is for

#

Normla text

#

Thewebdev is for code

#

Thats the only one

#

With dayhush

torn mantle Apr 18, 2025, 12:50 PM

#

plain zinc

https://3000-iic8yfm1u12c8wyynb21b-0ce36864.e2b-foxtrot.dev

plain zinc Apr 18, 2025, 12:52 PM

#

torn mantle https://3000-iic8yfm1u12c8wyynb21b-0ce36864.e2b-foxtrot.dev

Thanks

torn mantle Apr 18, 2025, 12:52 PM

#

plain zinc Thanks

how is it

plain zinc Apr 18, 2025, 12:53 PM

#

torn mantle how is it

Much better than 2.5 pro and claybrook

torn mantle Apr 18, 2025, 12:53 PM

#

We need nightwhisper comparison

plain zinc Apr 18, 2025, 12:54 PM

#

I think and feel! What's coming out tonight nightwhisper

#

Because the catalog is already ready

#

for coding models

narrow elbow Apr 18, 2025, 12:57 PM

#

I feel like Google needs some pressure, or they won't be squeeze out good things🤣

plain zinc Apr 18, 2025, 12:59 PM

#

narrow elbow I feel like Google needs some pressure, or they won't be squeeze out good thing...

I didn't understand what you meant. Can you explain to me, as a non-native English speaker?

narrow elbow Apr 18, 2025, 1:00 PM

#

haha

willow grail Apr 18, 2025, 1:00 PM

#

ocean vortex Apr 18, 2025, 1:00 PM

#

balmy mist idk bro it seems diff, before it didnt do well visuals like that when testing wi...

this is what I meant though. o4-mini-high vs 2.5 pro, same prompt for svg drawing

#

#

neither is perfect but 2.5 pro output is leagues ahead in this instance

willow grail Apr 18, 2025, 1:01 PM

#

ocean vortex neither is perfect but 2.5 pro output is leagues ahead in this instance

can u do poll

ocean vortex Apr 18, 2025, 1:01 PM

#

A well dressed professor standing on the street with eiffel tower in the background, in front of basketball court, holding an umbrella, and observing people playing. It is CRUCIAL you make this in SVG only with the corresponding code!

#

the last sentence I formed this way to stop openai models from using imagen lmao

keen beacon Apr 18, 2025, 1:02 PM

#

ocean vortex the last sentence I formed this way to stop openai models from using imagen lmao

biased result xD

cedar tide Apr 18, 2025, 1:02 PM

#

Who want to Remade this with claybrook and dayhush ?

keen beacon Apr 18, 2025, 1:03 PM

#

im trying the prompt with o3 vs 2.5

#

blind test guys

#

pic 1

#

or pic 2

willow grail Apr 18, 2025, 1:05 PM

#

both are trash

#

liiek wtf

cedar tide Apr 18, 2025, 1:05 PM

#

torn mantle claybrook

Dragontail made this 👇
So dayhush better

keen beacon Apr 18, 2025, 1:05 PM

#

yeah .. not the best of prompts
i actually like to make them make small games to compare

willow grail Apr 18, 2025, 1:06 PM

#

keen beacon yeah .. not the best of prompts i actually like to make them make small games to...

make real python games... not small tetris trash

keen beacon Apr 18, 2025, 1:06 PM

#

wtf am i looking at

keen beacon Apr 18, 2025, 1:07 PM

#

cedar tide Dragontail made this 👇 So dayhush better

oh its a discord clone

#

i thought this was like a discord theme i was about to say how can you like this

keen beacon Apr 18, 2025, 1:09 PM

#

willow grail make real python games... not small tetris trash

yep, my prompt was to make a 2d game with the following rules

if ball is shot it splits into two , unless radius is too small in which cases it pops
player can move left and right and shoot up
balls continue falling and moving at an angle as if thrown.
game starts with big ball falling and has a timer. player wins if he destroys all balls.
if time runs out or player gets touched by ball -> game over

willow grail Apr 18, 2025, 1:09 PM

#

keen beacon yep, my prompt was to make a 2d game with the following rules 1. if ball is sho...

o3, o4mini, 2.5 who was best

keen beacon Apr 18, 2025, 1:09 PM

#

willow grail o3, o4mini, 2.5 who was best

o3 > 2.5 by a bit

willow grail Apr 18, 2025, 1:09 PM

#

keen beacon o3 > 2.5 by a bit

and what about o4 mini

keen beacon Apr 18, 2025, 1:10 PM

#

willow grail and what about o4 mini

didnt try

thorny drum Apr 18, 2025, 1:10 PM

#

keen beacon o3 > 2.5 by a bit

do you also have views on the canadian election?

keen beacon Apr 18, 2025, 1:10 PM

#

hard to think we dont tbh. graphics are getting there and ai is getting at us but whatever

keen beacon Apr 18, 2025, 1:10 PM

#

thorny drum do you also have views on the canadian election?

cmon we are in lmarena not poly

thorny drum Apr 18, 2025, 1:10 PM

#

or the likelihood of jerome powell being removed this year

keen beacon Apr 18, 2025, 1:11 PM

#

yep

sour spindle Apr 18, 2025, 1:11 PM

#

What’s everyone’s honest real life assessment of 2.5, o3, o4 mini. Personally I am really like the outputs of o3.

keen beacon Apr 18, 2025, 1:12 PM

#

as you said there are likely god knows how many matrixes within matrix and statistically we are probably at the very bottom of them but whatever

#

o3 with tools is amazing]

sour spindle Apr 18, 2025, 1:12 PM

#

I haven’t run into that problem at all really with A/B testing

#

I’ve been very happy back checking sources used

keen beacon Apr 18, 2025, 1:13 PM

#

from what ive heard so far people think o3 is better slightly or miles ahead, im on group 1

balmy mist Apr 18, 2025, 1:13 PM

#

u good bro?

sour spindle Apr 18, 2025, 1:13 PM

#

Better slightly is a good way to put it

keen beacon Apr 18, 2025, 1:14 PM

#

take it ez you just learned you are in a simulation and everything is fake, no big deal

narrow elbow Apr 18, 2025, 1:15 PM

#

whats the answer to life, the universe, and everything? 42?

brittle tiger Apr 18, 2025, 1:15 PM

#

i think o3 is great. if there was a problem with it, it's that it does too much tool calling

keen beacon Apr 18, 2025, 1:17 PM

#

r/im14andthisisdeep

#

we can become gods and make universes of our own.

#

but only Elon Jeff and their friends will

balmy mist Apr 18, 2025, 1:20 PM

#

lol

keen beacon Apr 18, 2025, 1:20 PM

#

why would they enslave us

balmy mist Apr 18, 2025, 1:21 PM

#

bro been watching to much movies

keen beacon Apr 18, 2025, 1:21 PM

#

nah they dont need slaves

#

once ai becomes smart enough theres no need at all for human slaves

tall summit Apr 18, 2025, 1:21 PM

#

HAHAHAHA

keen beacon Apr 18, 2025, 1:21 PM

#

ai + robots are smarter, faster, 24/7 work, no pay, no demands for high standard of living

glass arch Apr 18, 2025, 1:22 PM

#

I wouldn't mind tbh

#

slop forever!

keen beacon Apr 18, 2025, 1:22 PM

#

certainly get us out of the way more than eradicate, both are very bad

balmy mist Apr 18, 2025, 1:23 PM

#

the point for normal humans is diversity

glass arch Apr 18, 2025, 1:23 PM

#

counter argument: AI does not have hormones

keen beacon Apr 18, 2025, 1:23 PM

#

but its not evil / bad in their eyes, more than it was evil/bad for us to use cars over horses .. which did lead to the "genocide" of horses, with their population drastically decreasing after the invention of cars

glass arch Apr 18, 2025, 1:24 PM

#

I think AI should replace unskilled labor first over artists, programmers, and writers

keen beacon Apr 18, 2025, 1:24 PM

#

glass arch I think AI should replace unskilled labor first over artists, programmers, and w...

should ?

balmy mist Apr 18, 2025, 1:25 PM

#

why dont we eradicate the other lifeforms on earth? also ai would probably be just as curious about us as we are about it and other lifeforms

keen beacon Apr 18, 2025, 1:25 PM

#

ai will replace everything, first low hanging fruit like coding math, then intellecutal jobs, then at the very end physical jobs via robots. and all the while every job will be devalued cause ai will continuesly lower the entry barrier.

cloud meadow Apr 18, 2025, 1:25 PM

#

This statement alone is why midwits shouldn't be allowed to use AI. What we have compared to AGI is like a lobotomised cat being compared to a lion.

tall summit Apr 18, 2025, 1:25 PM

#

when i'm always impressed with something on web arena, it's dayhush

glass arch Apr 18, 2025, 1:26 PM

#

keen beacon should ?

if all the unskilled labor goes away, then all we have left is the skilled labor

keen beacon Apr 18, 2025, 1:26 PM

#

if there is a solution , if i heard any, is to brainchip all of humanity, combine it with ai and go forward as one mind colony

tall summit Apr 18, 2025, 1:26 PM

#

glass arch if all the unskilled labor goes away, then all we have left is the skilled labor

i'm quite sure it can't be controlled. so there is no "should"

cloud meadow Apr 18, 2025, 1:26 PM

#

💔

#

AGI is like 2-10 decades away

balmy mist Apr 18, 2025, 1:26 PM

#

bro if you think about it the people with power already control the world lol

keen beacon Apr 18, 2025, 1:27 PM

#

glass arch if all the unskilled labor goes away, then all we have left is the skilled labor

ai will take both skilled and unskilled away
theres nothing any human can do better compared to a smarter ai + robot

balmy mist Apr 18, 2025, 1:27 PM

#

with ai it will just be more in our faces

#

but its the same thing just scaled

cloud meadow Apr 18, 2025, 1:27 PM

#

Maybe a brain organoid can be made just for you so you realise that humans won't manufacture their own demise so quickly.

tall summit Apr 18, 2025, 1:27 PM

#

keen beacon ai will take both skilled and unskilled away theres nothing any human can do bet...

i can juggle better

glass arch Apr 18, 2025, 1:27 PM

#

guys, all you need to stop the AI uprising is a glass of water

narrow elbow Apr 18, 2025, 1:27 PM

#

https://tenor.com/view/explosion-mushroom-cloud-atomic-bomb-bomb-boom-gif-4464831

Tenor

Atomic bomb - Explosion

▶ Play video

tall summit Apr 18, 2025, 1:27 PM

#

glass arch guys, all you need to stop the AI uprising is a glass of water

good one

cloud meadow Apr 18, 2025, 1:27 PM

#

https://tenor.com/view/terminator-terminator-robot-looking-flex-cool-robot-gif-16625083

Tenor

#

We are not getting skynet.

glass arch Apr 18, 2025, 1:28 PM

#

it'll screw up everything if you pour water on their circuits

keen beacon Apr 18, 2025, 1:28 PM

#

why dont you become billionaire and join him

cloud meadow Apr 18, 2025, 1:29 PM

#

Bro, are you like 12?

glass arch Apr 18, 2025, 1:29 PM

#

cloud meadow Bro, are you like 12?

I am suspecting this too

cloud meadow Apr 18, 2025, 1:29 PM

#

glass arch I am suspecting this too

He is either 12 or a teenager from a third world country.

balmy mist Apr 18, 2025, 1:29 PM

#

cloud meadow Bro, are you like 12?

fr i was going to ask, but i didnt wanna insult him lol

glass arch Apr 18, 2025, 1:29 PM

#

I think we should just democratize AI

cloud meadow Apr 18, 2025, 1:29 PM

#

Brotha log off.

glass arch Apr 18, 2025, 1:29 PM

#

if anyone can run it, then it's a much more balanced playing field

cloud meadow Apr 18, 2025, 1:30 PM

#

glass arch I think we should just democratize AI

AI will be weaponised.

glass arch Apr 18, 2025, 1:30 PM

#

in 100 years, we will be able to run the current top-of-the-line AIs on our phones

balmy mist Apr 18, 2025, 1:30 PM

#

if its the truth, what are you going to do about it?

cloud meadow Apr 18, 2025, 1:31 PM

#

We
You need to grow up

#

Do not threaten real world action like this

#

This is illegal and will get the server taken down

#

This will never happen you moron.

tall summit Apr 18, 2025, 1:31 PM

#

glass arch I think we should just democratize AI

worked well with stablehorde

cloud meadow Apr 18, 2025, 1:31 PM

#

We do not live in a sci-fi novel.

brittle tiger Apr 18, 2025, 1:32 PM

#

glass arch in 100 years, we will be able to run the current top-of-the-line AIs on our phon...

using phones 100 yrs from now

Also it's probably closer to 5-10 yrs

tall summit Apr 18, 2025, 1:32 PM

#

2025 trolling sucks

cloud meadow Apr 18, 2025, 1:32 PM

#

Is this guy being serious?

glass arch Apr 18, 2025, 1:32 PM

#

the FASTEST speed we can ever hope to achieve is light

balmy mist Apr 18, 2025, 1:32 PM

#

he just joined the server two days ago

cloud meadow Apr 18, 2025, 1:33 PM

#

@alpine pasture hey, I think you should moderate this user.

#

#

I'm ceasing communication with you

willow grail Apr 18, 2025, 1:34 PM

#

cloud meadow I'm ceasing communication with you

are you a elon fanling?

glass arch Apr 18, 2025, 1:34 PM

#

anyways, has anyone noticed that o4-mini-high doesn't speak with emojis anymore?

cloud meadow Apr 18, 2025, 1:34 PM

#

willow grail are you a elon fanling?

No, he is just another puppet billionaire.

willow grail Apr 18, 2025, 1:34 PM

#

cloud meadow No, he is just another puppet billionaire.

the richest person is just a puppet?

#

or is he just middle manning the money for the non-puppets? @cloud meadow

glass arch Apr 18, 2025, 1:35 PM

#

when o4 came outon wednesday, it wrote with emojis in every header

#

but now it fixed that and doesn't write with emojis

cloud meadow Apr 18, 2025, 1:35 PM

#

willow grail the richest person is just a puppet?

This is not a political server
There are bankers who are richer than any of these public figures

tall summit Apr 18, 2025, 1:35 PM

#

@alpine pasture help

balmy mist Apr 18, 2025, 1:35 PM

#

glass arch but now it fixed that and doesn't write with emojis

you think they are actively updating it?

glass arch Apr 18, 2025, 1:35 PM

#

balmy mist you think they are actively updating it?

I believe so

balmy mist Apr 18, 2025, 1:36 PM

#

they are doing that with 4o

willow grail Apr 18, 2025, 1:36 PM

#

cloud meadow - This is not a political server - There are bankers who are richer than any of ...

huh? like who

glass arch Apr 18, 2025, 1:36 PM

#

I can't think of a reason it would just switch to not using emojis hours after release

tall summit Apr 18, 2025, 1:36 PM

#

maybe it does and your sample size is small

brittle tiger Apr 18, 2025, 1:36 PM

#

glass arch I can't think of a reason it would just switch to not using emojis hours after r...

Are you talking about arena, app or api?

cloud meadow Apr 18, 2025, 1:36 PM

#

willow grail huh? like who

Refer to the first point

glass arch Apr 18, 2025, 1:36 PM

#

https://chatgpt.com/share/6802556d-5578-8010-a4bc-bfeb15b4c6c2

ChatGPT

ChatGPT - Universe Expansion Explained

Shared via ChatGPT

glass arch Apr 18, 2025, 1:36 PM

#

brittle tiger Are you talking about arena, app or api?

app

#

I have a link to a new conversation

willow grail Apr 18, 2025, 1:37 PM

#

cloud meadow Refer to the first point

just answer the question who has more money than musk

balmy mist Apr 18, 2025, 1:37 PM

#

willow grail just answer the question who has more money than musk

ask grok

willow grail Apr 18, 2025, 1:37 PM

#

wtf is grok

balmy mist Apr 18, 2025, 1:37 PM

#

wait are their discord bots for ai models yet?

glass arch Apr 18, 2025, 1:37 PM

#

grok doesn't defend elon surprisingly

balmy mist Apr 18, 2025, 1:38 PM

#

willow grail wtf is grok

really? why are you in this server if you are clueless about ai models?

cloud meadow Apr 18, 2025, 1:38 PM

#

willow grail just answer the question who has more money than musk

https://en.m.wikipedia.org/wiki/List_of_banking_families

balmy mist Apr 18, 2025, 1:38 PM

#

willow grail wtf is grok

sorry if you are trying to learn

#

but grok is xai model

#

on twitter

willow grail Apr 18, 2025, 1:38 PM

#

balmy mist on twitter

why u saying twitter

alpine coral Apr 18, 2025, 1:39 PM

#

oh ffs no more musk

#

let's talk about llms

willow grail Apr 18, 2025, 1:39 PM

#

cloud meadow `https://en.m.wikipedia.org/wiki/List_of_banking_families`

..... shows me dead people... with low money

glass arch Apr 18, 2025, 1:39 PM

#

remember when grok 3 released?

balmy mist Apr 18, 2025, 1:39 PM

#

because i was born pre 2023 and saying X can be confusing

glass arch Apr 18, 2025, 1:39 PM

#

grok 3 is actually very useful for searching for information

alpine coral Apr 18, 2025, 1:40 PM

#

grok-3-mini is legit impressive

cloud meadow Apr 18, 2025, 1:40 PM

#

willow grail ..... shows me dead people... with low money

Families
Not gonna waste my time with you.

willow grail Apr 18, 2025, 1:40 PM

#

balmy mist because i was born pre 2023 and saying X can be confusing

thats not musk-fan enough

glass arch Apr 18, 2025, 1:40 PM

#

alpine coral grok-3-mini is legit impressive

have they announced plans for grok 3.5 or anything new?

earnest parcel Apr 18, 2025, 1:40 PM

#

X is objectively just a bad name. and Grok-3 mini is great

willow grail Apr 18, 2025, 1:40 PM

#

cloud meadow >Families Not gonna waste my time with you.

they are very poor

balmy mist Apr 18, 2025, 1:40 PM

#

willow grail thats not musk-fan enough

im just a curious guy, i dont fanboy over ppl

willow grail Apr 18, 2025, 1:41 PM

#

cloud meadow >Families Not gonna waste my time with you.

so, u still saying musk is a puppet despite having most cashlol

balmy mist Apr 18, 2025, 1:41 PM

#

willow grail so, u still saying musk is a puppet despite having most cashlol

are you a fanboy?

cloud meadow Apr 18, 2025, 1:41 PM

#

Do you not know how money works?

native shoreBOT Apr 18, 2025, 1:41 PM

#

dynoSuccess you.wish has been warned.

willow grail Apr 18, 2025, 1:41 PM

#

cloud meadow Do you not know how money works?

?

cloud meadow Apr 18, 2025, 1:42 PM

#

Again, it's not my place to educate you. Since we are in an AI server, maybe you can ask gemini 2.5, it has a lot of knowledge (january 2025 cutoff!)

willow grail Apr 18, 2025, 1:42 PM

#

@alpine pasture what did i say wrong

alpine coral Apr 18, 2025, 1:43 PM

#

it's already been posted but whatever
via artificialanalysis

alpine pasture Apr 18, 2025, 1:43 PM

#

Please stay on topic - this is an AI server for the LMArena community. Plenty of servers for other topics. Thanks all!

cloud meadow Apr 18, 2025, 1:43 PM

#

I think this dude should be dealt with btw, Cherry.

glass arch Apr 18, 2025, 1:43 PM

#

wow

#

this is a very neat graph

cloud meadow Apr 18, 2025, 1:44 PM

#

alpine coral it's already been posted but whatever via [artificialanalysis](https://x.com/Ar...

That's interesting

alpine coral Apr 18, 2025, 1:45 PM

#

alpine coral it's already been posted but whatever via [artificialanalysis](https://x.com/Ar...

i was looking into how they calculate 'price'

Price is a blend of Input & Output token prices (3:1 ratio).
It's ofc going to be blunt / imperfect, but 3:1 (input-output) i don't think is a suitable ratio to use for thinking models

#

they inherently use more tokens in their outputs (whether we see the reasoning token or not, they're output)

balmy mist Apr 18, 2025, 1:45 PM

#

alpine coral it's already been posted but whatever via [artificialanalysis](https://x.com/Ar...

is this real?

#

how is that possible

#

grok 3 mini

glass arch Apr 18, 2025, 1:46 PM

#

interesting to see that gemini says it doesn't know who's AI will be the best by the end of 2025

and then chatgpt says that GPT-5 will be the smartest

balmy mist Apr 18, 2025, 1:46 PM

#

i need to test it out

#

its cheaper than flash?

cloud meadow Apr 18, 2025, 1:46 PM

#

glass arch interesting to see that gemini says it doesn't know who's AI will be the best by...

Why would you ask an LLM such a question?

#

I mean in terms of predicting events

#

It will just say whatever

glass arch Apr 18, 2025, 1:48 PM

#

cloud meadow Why would you ask an LLM such a question?

I wanted to see its response and if it would prioritize itself

cloud meadow Apr 18, 2025, 1:48 PM

#

Right

#

Lmao, Sam Altman training pro-OpenAI propaganda into the model sounds funny

glass arch Apr 18, 2025, 1:49 PM

#

grok says to keep an eye on xAI and deepseek

#

it sort of prioritizes itself

cloud meadow Apr 18, 2025, 1:49 PM

#

Can't wait for R2

glass arch Apr 18, 2025, 1:49 PM

#

cloud meadow Lmao, Sam Altman training pro-OpenAI propaganda into the model sounds funny

the sad part is that the model doesn't even know its own dang name

balmy mist Apr 18, 2025, 1:49 PM

#

how fast is grok mini fast vs grok mini?

cedar tide Apr 18, 2025, 1:49 PM

#

alpine coral i was looking into how they calculate 'price' > Price is a blend of Input & Outp...

and besides that, the price here is only the price per token but the reasoning models output many many more tokens so it comes out much more expensive

native shoreBOT Apr 18, 2025, 1:49 PM

#

dynoSuccess ontologicity has been warned.

earnest parcel Apr 18, 2025, 1:51 PM

#

balmy mist how fast is grok mini fast vs grok mini?

OpenRouter stats put -fast 40% faster. (187 vs 132 t/s)

balmy mist Apr 18, 2025, 1:51 PM

#

alpine coral it's already been posted but whatever via [artificialanalysis](https://x.com/Ar...

wait so grok 3 mini is better than grok 3?

balmy mist Apr 18, 2025, 1:52 PM