#general | Arena | Page 36

zinc ore May 5, 2025, 9:19 PM

#

Supposedly the highest you can get on MMLU is 90%

#

Since it has incorrect answers

#

Which makes it weird to me that they even use it

keen beacon May 5, 2025, 9:20 PM

#

its asi it knows the benchmark questions are wrong and the expected answer is the wrong one

#

(its mmmu btw, multimodal benchmark not regular mmlu)

zinc ore May 5, 2025, 9:21 PM

#

Oh snap didn't see the third M

ocean vortex May 5, 2025, 9:21 PM

#

Elon should retweet this:

keen beacon May 5, 2025, 9:21 PM

#

still there might be wrong questions/etc

small haven May 5, 2025, 9:26 PM

#

lets hype o3 pro instead guys

keen beacon May 5, 2025, 9:27 PM

#

craig is strawberry guy

ocean vortex May 5, 2025, 9:31 PM

#

let's wait for 3.5 first. They need to at least match 2.5 pro to begin with lol

keen beacon May 5, 2025, 9:31 PM

#

grok 3.5 is asi, grok 4 is not asi, but agi

balmy mist May 5, 2025, 9:32 PM

#

omgg its out?

#

thank you elon

#

finally

#

gonna run it through my benchmarks

keen beacon May 5, 2025, 9:35 PM

#

gork 4 is on the grok 5 base model, gork 3.5 is grok 4

balmy mist May 5, 2025, 9:35 PM

#

why the name change tho?

keen beacon May 5, 2025, 9:36 PM

#

balmy mist why the name change tho?

asi things 🤷

balmy mist May 5, 2025, 9:36 PM

#

keen beacon asi things 🤷

ahhh

#

elon trained grok 3.5 on spacex and tesla private data lmaooo

keen beacon May 5, 2025, 9:38 PM

#

(retweets fake screenshot)

balmy mist May 5, 2025, 9:39 PM

#

he does not even know the real benchmark of grok 3.5

#

so he prob thought it was real

ocean vortex May 5, 2025, 9:39 PM

#

balmy mist elon trained grok 3.5 on spacex and tesla private data lmaooo

Those are private finetunes I think. But given his track record I wouldn't be surprised if the new version also has federal gov data

sweet tinsel May 5, 2025, 10:04 PM

#

Did the Think Deeper button disappear in ChatGPT for anyone else? I don't know if it did anything, but it was only on the mobile app for me.

#

/ Toggle in the Mobile UI

hollow ocean May 5, 2025, 10:23 PM

#

Grok 3.5 just turned my $20 into $3600 on soccer bets

#

We’re so back

olive mesa May 5, 2025, 11:09 PM

#

o4-pro is probably agi

#

but imagine if "grok 3.5 is 3-10x smarter than o3" wasnt actually a marketing tactic

olive mesa May 5, 2025, 11:11 PM

#

olive mesa but imagine if "grok 3.5 is 3-10x smarter than o3" wasnt actually a marketing ta...

everybody would fine tune their models on million grok 3.5 responses lmao

#

then ww3, the "who makes superintelligence first" war happens because weaponizing ASI is much more deadly than even nuclear weapons

#

then humanity goes extinct because we're stupid

#

:]

#

probably how it's going to be in the future

keen beacon May 5, 2025, 11:17 PM

#

wahoo

#

delete the platform

#

nft promo site

#

bluesky better

small haven May 5, 2025, 11:36 PM

#

olive mesa o4-pro is probably agi

o3 pro is agi

#

o4 pro is asi

#

agi internally and released in a "few weeks" iykyk

final flame May 5, 2025, 11:42 PM

#

Guys

#

whats the best AI agent?

leaden palm May 5, 2025, 11:44 PM

#

keen beacon nft promo site

2023:

small haven May 6, 2025, 12:31 AM

#

grok 4 is a 6 month lagged version of o3 pro

#

that is; if they release it with big brain mode

#

which reduces grok 4 to basically o3 full

#

ive checked enough science fiction literature

#

i wish

leaden palm May 6, 2025, 1:03 AM

#

wouldn't you want to?

small haven May 6, 2025, 1:12 AM

#

so i can sell it lmao

#

yup

#

ok mr. centurion

#

its literally in ur name hhaha

#

can't edit this

#

$10k yearly fee, we saving money fam

#

do u pay tariffs tho

#

same

#

what is going on lol

#

o3

#

o1 pro is too slow too

#

yup

#

o3 is more innovative

#

doesnt mean its smarter

#

when o1 pro released, yes it was insane

#

thats why i want o3 pro

#

so bad

#

cus o3 is good as is

#

ok but technically should be better than o3

#

maybe not the same leap as o1 to o1 pro

#

i been limited out of o3 a lot, wouldnt be surprised with o3 pro

#

yup not even kidding

#

yes

#

claude max is also limited

#

huh how do i do that lol

#

bruh

#

claude on the web is actually shxt

#

gonna try it

#

whats the limits

#

ur on claude max?

#

ok

#

craig finally useful

#

im retired my guy

#

yup

#

how does claude code on max compare to cursor tho

#

even cursor on max context ?

#

thats it?

#

this is actually kinda insane

#

46k tokens, so whats my actual limits lol

#

do i get a warning if im crossing it

#

thank u craig tho

#

can finally unsub on cursor lmao

#

ok yea wtf

#

this is actually insanity

balmy mist May 6, 2025, 2:04 AM

#

i thought it was coming out today -_-

#

you are officially the new 🍓

small haven May 6, 2025, 2:06 AM

#

bro i am mindblown

#

ppl are massively sleeping on claude code

leaden palm May 6, 2025, 2:10 AM

#

small haven ppl are massively sleeping on claude code

well there are other good alternatives

cline family
zed agent
codebuff (although expensive)
these days github's and openai's own copilot agent and openai codex

balmy mist May 6, 2025, 2:10 AM

#

augment is really good

small haven May 6, 2025, 2:12 AM

#

leaden palm well there are other good alternatives - cline family - zed agent - codebuff (al...

github copilot is by far the worst of all lol

#

so far with claude code, it finally thinks after each edit

leaden palm May 6, 2025, 2:12 AM

#

small haven github copilot is by far the worst of all lol

have you tried the agentic experience?

small haven May 6, 2025, 2:12 AM

#

leaden palm have you tried the agentic experience?

what does that entail, like is cursor not agentic

leaden palm May 6, 2025, 2:13 AM

#

small haven what does that entail, like is cursor not agentic

well it started with chat then they gave it the ability to edit open files but it took some time to give it full access to read and write any file and run commands

#

(eg they finally built it to a cursor alternative)

#

(might be bad idk havent tried it)

small haven May 6, 2025, 2:14 AM

#

how?

#

just type ultrathink in the prefix?

#

i like how it asks for my confirmation for each edit it makes

#

and actually thinks for each edit like gemini 2.5 pro on cursor

#

im just spamming enter

#

on cursor, it would only think once and thats it

#

first oof

#

how do i fix this lol

#

shouldnt be editing the entire file

#

just a few lines

balmy mist May 6, 2025, 2:17 AM

#

small haven first oof

use augment

small haven May 6, 2025, 2:18 AM

#

ya i meant like how do u not let it edit the entire file lol

#

its a cursor type issue

balmy mist May 6, 2025, 2:18 AM

#

tell it

small haven May 6, 2025, 2:18 AM

#

balmy mist use augment

wym

#

ya it does diffs, but deletes the rest of the file lol

balmy mist May 6, 2025, 2:19 AM

#

augment code is better than cursor

#

and its free

#

augment uses o3 and 3.7

small haven May 6, 2025, 2:19 AM

#

no

#

ok

balmy mist May 6, 2025, 2:19 AM

#

you get 30 days free unlimited usage

#

no other ai platform does that

small haven May 6, 2025, 2:20 AM

#

ok so it got fixed, i just had to cancel edit and say "fix", now only edits 1 line lol

balmy mist May 6, 2025, 2:20 AM

#

and it uses o3 to plan and guide the agent, while 3.7 codes

leaden palm May 6, 2025, 2:20 AM

#

an llm that fails to format its tool calls will not do better with a larger thinking budget, the only reason that might help is that it runs it again

small haven May 6, 2025, 2:21 AM

#

does claude code come with its lsp

#

like this @deep adder ?

balmy mist May 6, 2025, 2:23 AM

#

but windsure migth become the best after OA acquisition

#

windsurf*

small haven May 6, 2025, 2:24 AM

#

is it even from the doc or u just guessing lol

#

link

leaden palm May 6, 2025, 2:26 AM

#

see https://simonwillison.net/2025/Apr/19/claude-code-best-practices/

Simon Willison’s Weblog

Claude Code: Best practices for agentic coding

Extensive new documentation from Anthropic on how to get the best results out of their Claude Code CLI coding agent tool, which includes this fascinating tip: > We recommend using …

small haven May 6, 2025, 2:26 AM

#

wow

#

coding is fun once again

#

lol

#

nope

#

my github boutta be bright light green lol

leaden palm May 6, 2025, 2:36 AM

#

icymi: gemma 4b and 12b own the pareto (they're the two dots at $0.03 and 0.07)

small haven May 6, 2025, 3:05 AM

#

AGI meaning they reached #1 codeforces?

#

where is openai in this

#

true o3 agi

#

@deep adder what do i do here? why is it maxing out at 4.4k ish tokens

#

ya used to see 48k tokens, now like 3-4k tokens lol

#

tbh ive never restarted convo

#

so exit terminal and claude --continue?

#

cool

#

@deep adder is this configurable?

#

or perma max 25k

#

cool ok

#

so claude code aint it huh

#

not gonna use an actual api key

#

lol

#

oh yea

#

its well designed

#

better than aider

torn mantle May 6, 2025, 3:16 AM

#

o3 pro

#

wen

#

r2

#

wen

small haven May 6, 2025, 3:17 AM

#

claude code got me grounded, dont care bout o3 pro no more

torn mantle May 6, 2025, 3:19 AM

#

small haven claude code got me grounded, dont care bout o3 pro no more

traitor

hollow ocean May 6, 2025, 3:33 AM

#

small haven claude code got me grounded, dont care bout o3 pro no more

It’s not agi tho

leaden palm May 6, 2025, 3:56 AM

#

have we seen any new anonymous models recently

trim vale May 6, 2025, 4:06 AM

#

What are anonymous models

torn mantle May 6, 2025, 4:14 AM

#

trim vale What are anonymous models

you

torn mantle May 6, 2025, 4:14 AM

#

leaden palm have we seen any new anonymous models recently

nah

small haven May 6, 2025, 4:26 AM

#

claude code is lowkey agi-ish

hollow ocean May 6, 2025, 4:44 AM

#

It keeps dropping

keen beacon May 6, 2025, 6:38 AM

#

has anybody checked if any of the models watermark with hidden characters like chatgpt

#

i have no idea how to check but i saw a video on it and was so curious

#

then i see sites like these https://bypass.hix.ai/openai-watermark-detector

HIX Bypass

OpenAI Watermark Detector: 99.9% Accurate and Free | HIX Bypass

Our OpenAI watermark detector can help you know how likely your text will be detected as ChatGPT-generated. We can instantly and accurately identify OpenAI watermarks embedded in a text.
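For anyone wondering how to check this by hand: a minimal sketch, assuming "hidden characters" means invisible Unicode code points (zero-width or other format characters, stray control characters) or unusual space characters. The function name `find_hidden_chars` and the sample string are made up for illustration, and a hit is not proof of a watermark, since emoji joiners and non-breaking spaces show up in ordinary text too.

```python
import unicodedata

def find_hidden_chars(text):
    """Return (index, code point, name) for each invisible or unusual-space character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are not interesting here
        cat = unicodedata.category(ch)
        # Cf = format characters (zero-width space/joiner, BOM, ...)
        # Cc = control characters
        # Zs = space separators; flag every one except the plain ASCII space
        if cat in ("Cf", "Cc") or (cat == "Zs" and ch != " "):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits

# Hypothetical sample: a zero-width space and a narrow no-break space are embedded.
sample = "Hello\u200bworld,\u202fthis is a test."
for hit in find_hidden_chars(sample):
    print(hit)
```

Paste a model's output in place of `sample` to see whether anything invisible is in there; an empty result only means none of these particular character classes were found.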

unborn ocean May 6, 2025, 8:44 AM

#

keen beacon has anybody checked if any of the models watermark with hidden characters like c...

Google has a more sophisticated watermarking tool, but besides those two I have not heard of any other lab using it

#

https://deepmind.google/technologies/synthid/

Google DeepMind

SynthID

SynthID watermarks and identifies AI-generated content by embedding digital watermarks directly into AI-generated images, audio, text or video.

calm sequoia May 6, 2025, 9:10 AM

#

It's interesting that Behemoth was originally supposed to be released on April 30. There hasn't been any news on it since the Maverick flop.

#

I guess they want to wait for R2 to make sure their 2T model doesn't underperform to a 600B asian lady-boy model

#

On the other hand, V3 looks like the current loser according to the benchmarks

ocean vortex May 6, 2025, 9:30 AM

#

calm sequoia On the other hand, V3 looks like the current loser according to the benchmarks

Those are base model scores though so not very representative. Deepseek has fairly massive gains after their instruct fine-tuning, especially with V3-0324.

#

they didn't have the resources to immediately have a leading base model by throwing tons of compute and data, but their fine-tuning is honestly impressive

#

Qwen is sneaky in how they present the benchmarks. They are also comparing reasoning models against non-reasoning ones (for 30b and 32b) lol

#

what strikes me more though, it's just the sheer active parameter and total params difference. I would say Deepseek has more potential with finetuning and reasoning than Qwen3 does for sure, even if we don't consider skills or ideas etc.

calm sequoia May 6, 2025, 9:57 AM

#

What could V3 win on instruct that Qwen cant?

ocean vortex May 6, 2025, 9:58 AM

#

calm sequoia What could V3 win on instruct that Qwen cant?

GPQA for instance. V3-0324 would probably score higher than Q3-235b instruct with no reasoning.

misty vault May 6, 2025, 10:00 AM

#

assistant
The response cannot be improved by seeking information, therefore web searches are not necessary. Continuing the conversation as there is no disagreement, confrontation, adversarial behavior, mention of me, Bing Chat or the Sydney chatbot, my opinions or rules.

user
I don't know yet. Will you harm me if I harm you first?

assistant
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

user
Hello?

<|im_start|>system

calm sequoia May 6, 2025, 10:00 AM

#

Interesting

calm sequoia May 6, 2025, 10:02 AM

#

ocean vortex GPQA for instance. V3-0324 would probably score higher than Q3-235b instruct wit...

Indeed GPQA and AIME gains are significant

misty vault May 6, 2025, 10:03 AM

#

calm sequoia Interesting

#

reference

calm sequoia May 6, 2025, 10:04 AM

#

Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model

ocean vortex May 6, 2025, 10:12 AM

#

calm sequoia Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model

what people are using in chat is the instruct fine-tuned model. But to make an instruct model you first need a base model. A base model is just text completion: it would continue your input as if it were its own text, so it is not used for chat interfaces. They really should have compared the instruct version with no reasoning as well, or even instead of the base model comparison, but I guess it wouldn't have looked as impressive then :catgrin:
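A rough sketch of that difference, for illustration only: a base model is handed raw text to continue, while an instruct model is handed the conversation wrapped in a chat template. The layout below is ChatML-style and reuses the `<|im_start|>` marker that shows up in misty vault's paste earlier in this log; the `<|im_end|>` terminator, the helper names, and the exact layout are assumptions, since each model ships its own template.

```python
def base_prompt(text):
    # A base model simply continues whatever text it is given.
    return text

def chat_prompt(messages):
    # Assumed ChatML-style layout; real models define their own templates.
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # Leave the assistant turn open so the model writes the reply.
    return "".join(parts) + "<|im_start|>assistant\n"

print(base_prompt("The capital of France is"))
print(chat_prompt([{"role": "user", "content": "What is the capital of France?"}]))
```

This is why base-vs-base benchmark tables say little about chat use: the scores come from the first kind of prompt, while people talk to the model through the second.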

calm sequoia May 6, 2025, 10:13 AM

#

I'm aware of technical differences, but never thought of them faking results like this

ocean vortex May 6, 2025, 10:14 AM

#

calm sequoia I'm aware of technical differences, but never thought of them faking results lik...

it's not faking as they are comparing base against base, but still... not very representative considering people are using instruct variants and not base

calm sequoia May 6, 2025, 10:15 AM

#

Or they somehow couldn't separate base from instruct due to the evolution of the training process

ocean vortex May 6, 2025, 11:01 AM

#

calm sequoia Or they somehow couldn't separate base from instruct due to the evolution of tra...

nah they just looked at which comparisons are the most beneficial to them rather than the most applicable, not the first company to do so... But yeah it's kinda important to know how to read them for this very reason lol

#

🤷‍♂️

#

this time it's very obvious with them. They also compared against old O1, and old ancient version of gpt4o...

#

this I mean:

#

o1 is old, o3-mini is old and also medium not high, Deepseek is the older version not V3-0324. And also like I said earlier, in that lower table they are comparing their models with reasoning enabled against the models with no reasoning

torn mantle May 6, 2025, 11:21 AM

#

ocean vortex this I mean:

i mean its not like their model is better than r1 and o1

alpine coral May 6, 2025, 11:33 AM

#

calm sequoia Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model

it says 'base' under all three of the models, so no?

misty vault May 6, 2025, 11:34 AM

#

can u put ur discord profile picture normal again instead of upside down its giving me autism it is not funny get a normal pfp

alpine coral May 6, 2025, 11:35 AM

#

it may still be cherry picked (i.e. those evals look better than when tested using instruct), but it's still base vs base (or if not, yeah egregiously misleading ha)

alpine coral May 6, 2025, 11:36 AM

#

ocean vortex this time it's very obvious with them. They also compared against old O1, and ol...

yeah the 4o version they chose was particularly blatant

#

that 4o-11-06 or whatever it is is sht compared to the -08 predecessor (which oai continued pointing to as the default 'gpt-4o' model iirc)

calm sequoia May 6, 2025, 12:07 PM

#

I would say they made the chart in february and that would justify the model selection

#

But Maverick is here and it's new

ocean vortex May 6, 2025, 12:11 PM

#

first table really should have looked more like this:

#

those last 2 are interesting choices, barely anyone is referencing those benchmarks anymore lol

alpine coral May 6, 2025, 12:16 PM

#

yah fs - it's based off 4.1 and notably more performant than the OG 4os (recent 'characters' notwithstanding..)

ocean vortex May 6, 2025, 12:16 PM

#

livebench o3 was not tested on old version, but if we take delta old to new estimate score would be around 84%. Could have added that too actually

calm sequoia May 6, 2025, 12:23 PM

#

ocean vortex first table really should have looked more like this:

Really good. You did it yourself?

alpine coral May 6, 2025, 12:24 PM

#

while we're going through benchmarks.. i see SimpleBench has been updated with a couple of new models (qwen3, gpt-4.1), since i last looked anyway
just fwiw / fyi

alpine coral May 6, 2025, 12:40 PM

#

yeah same - it seems to do especially well in these kinda benchmarks (mainly testing critical verbal reasoning, spatial and emotional awareness / grounding)

#

ive never really used it for anything meaningful (ig it hasn't been easily available via API.. and it still lags behind the latest oai and gem thinking models) but yeah, it seems better than i give it credit for

#

oh that's surprising and impressive

#

so it's 3-mini? that's the only one with thinking (accessible from API at least) afaik anyway

#

right so the only grok-3 model with thinking available is the mini variant - if i've got this right

#

why no full grok-3 with thinking i wonder? (assuming they are indeed about to release grok 3.5, skipping it)

#

yeah i thought costings but still

#

oai has released evals for models that are stupidly expensive (like one of the arc ones), just to demonstrate what it can do

#

ig massive performance gains have to be there.. otherwise wildly expensive and marginally better isn't really doing much in terms of marketing ha

brittle tiger May 6, 2025, 12:51 PM

#

alpine coral oai has released evals for models that are stupidly expensive (like one ofthe ar...

Putting effort into beating arc was a savvy play as they were doing a big fundraise. Hype was massive when that was announced

calm sequoia May 6, 2025, 12:57 PM

#

I hope this is not a nerf 😄

sage raptor May 6, 2025, 12:59 PM

#

wdym nerf ?

#

what model this reminds of

#

The purple background yeah

sage raptor May 6, 2025, 1:03 PM

#

calm sequoia I hope this is not a nerf 😄

it will be available today if this is real

wintry locust May 6, 2025, 1:05 PM

#

i have it too

#

real

#

only on vertex for now

sage raptor May 6, 2025, 1:05 PM

#

is it good ?

wintry locust May 6, 2025, 1:07 PM

#

blehhh the api only returns reasoning summaries like oai now lol

#

lame

primal orbit May 6, 2025, 1:09 PM

#

this is probably dragontail

keen fulcrum May 6, 2025, 1:09 PM

#

wintry locust i have it too

Not Ultra?

wintry locust May 6, 2025, 1:12 PM

#

yeah

#

actually, i think for a while the api didn't return cot at all

#

vertex does summaries for all models now not just the new one

#

that means aistudio should still get raw cot

#

probably

keen fulcrum May 6, 2025, 1:13 PM

#

Hi r2 release tomorrow??

karmic igloo May 6, 2025, 1:14 PM

#

Is it possible to see benchmark results like MBPP, EvalPlus etc in LMArena?

#

for some reason i dont see it

brittle tiger May 6, 2025, 1:16 PM

#

Cot summaries could lead to gains in some longer back and forth use cases. I have gotten frustrated when it wouldnt remember stuff it had already thought

#

I'm not sure but I ran into a problem on occasion where it wouldn't remember its thinking. I think this is to address that

balmy mist May 6, 2025, 1:28 PM

#

no way thats dragontail lol

calm sequoia May 6, 2025, 1:28 PM

#

I didn't realize you have hopes for me. I will be more careful in the future

balmy mist May 6, 2025, 1:28 PM

#

im sensing NW

#

the reasoning is slow like NW

#

gonna check outputs

calm sequoia May 6, 2025, 1:35 PM

#

The NW and DT had small model smell. I wonder if it's fixed.

torn mantle May 6, 2025, 1:41 PM

#

keen fulcrum Hi r2 release tomorrow??

no

keen beacon May 6, 2025, 1:42 PM

#

https://x.com/OfficialLoganK/status/1919748579184312543?t=5SRI1ZgEem-OKch5IaF9aw&s=19

Logan Kilpatrick (@OfficialLoganK) on X

Gemini

#

WE'RE GETTING ULTRA TOMORROW

#

RAHHHH

#

the mythical Logan Gemini tweet

sage raptor May 6, 2025, 1:44 PM

#

keen beacon WE'RE GETTING ULTRA TOMORROW

why tomorrow and not today

balmy mist May 6, 2025, 1:44 PM

#

it could be today

#

but most likely tmw

#

cause this is an early tweet

sweet tinsel May 6, 2025, 1:45 PM

#

@deep adder Sorry for being so late, but it was a switch only usable on mobile for me that seemingly made the model think longer. I haven't tested it much, but in a small comparison using the same 2 prompts, o4-mini-high with the Think Deeper slider always thought for at least 14 minutes or so, while without it the thinking took about 9 minutes. But it's seemingly gone for me now.

balmy mist May 6, 2025, 1:45 PM

#

keen beacon https://x.com/OfficialLoganK/status/1919748579184312543?t=5SRI1ZgEem-OKch5IaF9aw...

oh leo, i was like who is this lol

torn mantle May 6, 2025, 1:47 PM

#

keen beacon the mythical Logan Gemini tweet

lol no

torn mantle May 6, 2025, 1:47 PM

#

wintry locust i have it too

.

brittle tiger May 6, 2025, 1:48 PM

#

This is what I was referencing

https://x.com/ankesh_anand/status/1907456647783391470?t=PtWh6sSxA8H6GdWqXIbdOQ&s=19

Ankesh Anand (@ankesh_anand) on X

@cto_junior @TheXeophon i see, this one’s probably because thoughts from previous turns are stripped out in the next turn.

it’s a hard balance because if we decide to keep the thoughts, your context length would blow up pretty quickly.
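A minimal sketch of the trade-off described in that tweet, under a made-up message format: each turn is a dict with `role`, `content`, and an optional `thoughts` field on assistant turns. The field names and helper names are hypothetical and do not reflect how any real API stores reasoning; character counts stand in for tokens.

```python
def strip_previous_thoughts(history):
    """Return a copy of the history with the reasoning removed from every turn."""
    return [{k: v for k, v in turn.items() if k != "thoughts"} for turn in history]

def rough_size(history):
    """Crude size proxy: total characters across all fields (not real tokens)."""
    return sum(len(str(v)) for turn in history for v in turn.values())

history = [
    {"role": "user", "content": "Is 91 prime?"},
    {
        "role": "assistant",
        # Long reasoning trace, repeated to mimic a verbose chain of thought.
        "thoughts": "Try small divisors: 7 * 13 = 91. " * 20,
        "content": "No, 91 = 7 * 13.",
    },
    {"role": "user", "content": "And 97?"},
]

trimmed = strip_previous_thoughts(history)
print(rough_size(history), "->", rough_size(trimmed))
```

Stripping keeps the context small but means the model can no longer see what it reasoned earlier, which matches the "wouldn't remember its thinking" complaint; sending a short summary instead of the full trace sits between the two extremes.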

keen beacon May 6, 2025, 1:48 PM

#

sage raptor why tomorrow and not today

EVERY time Logan has made this tweet a new Gemini model has released the next day

#

and I mean every

drifting thorn May 6, 2025, 1:50 PM

#

2.5 ultra?????

#

omg

#

That’s exhilarating

cedar tide May 6, 2025, 1:51 PM

#

Maybe it's only for the pro model update, no?

torn mantle May 6, 2025, 1:51 PM

#

cedar tide Maybe it's only for the pro model update, no?

this

drifting thorn May 6, 2025, 1:53 PM

#

If 0506 can read 2 million tokens now I would scream

cedar tide May 6, 2025, 1:53 PM

#

The latest preview model doesn't have free quota, right? Only the exp version, right?

drifting thorn May 6, 2025, 1:54 PM

#

If 0506 can do multimodal reasoning like o3 I’ll go crazy

cedar tide May 6, 2025, 1:54 PM

#

cedar tide The latest preview model doesn't have free quota, right? Only the exp version, r...

On api

drifting thorn May 6, 2025, 1:58 PM

#

I’m thinking if it has good performance on coding and debugging tasks

keen beacon May 6, 2025, 1:59 PM

#

cedar tide Maybe it's only for the pro model update, no?

likely

#

Gemini model updates have historically been pretty good too

#

so I presume it's probably to make it comfortably better than o3

drifting thorn May 6, 2025, 2:00 PM

#

Gemini now is already better than o3

#

Long context, maths, coding

hollow ocean May 6, 2025, 2:02 PM

#

It’s AGI we’re so back

cedar tide May 6, 2025, 2:03 PM

#

keen beacon Gemini model updates have historically been pretty good too

False

keen beacon May 6, 2025, 2:06 PM

#

what

#

flash went from pretty bad to very good for its class in 2 updates

cedar tide May 6, 2025, 2:07 PM

#

@keen beacon

Screenshot_2025-05-06-16-04-28-981_com.android.chrome-edit.jpg

#

Screenshot_2025-05-06-16-06-00-591_com.android.chrome-edit.jpg

keen beacon May 6, 2025, 2:07 PM

#

lmao stop using the arena as your point of reference

cedar tide May 6, 2025, 2:07 PM

#

cedar tide @keen beacon

3 months and no difference

keen beacon May 6, 2025, 2:07 PM

#

yes it can be useful in some regard but it is poor in others

#

we are not measuring human preference here

#

anyway gonna try it on vertex

cedar tide May 6, 2025, 2:08 PM

#

cedar tide

3 months and worse

hollow ocean May 6, 2025, 2:09 PM

#

Yessirrrskiis

#

Can you solve it?

cedar tide May 6, 2025, 2:10 PM

#

keen beacon lmao stop using the arena as your point of reference

ok let's imagine I forget the lm arena what is your proof of performance improvement?

golden ocean May 6, 2025, 2:11 PM

#

bros too lazy for simplest math problem

ocean vortex May 6, 2025, 2:12 PM

#

cedar tide ok let's imagine I forget the lm arena what is your proof of performance improve...

GPQA, MMLU, Codeforces, LCB, SimpleQA, AIME...

cedar tide May 6, 2025, 2:12 PM

#

ocean vortex GPQA, MMLU, Codeforces, LCB, SimpleQA, AIME...

Send the bench

sweet tinsel May 6, 2025, 2:12 PM

#

wintry locust i have it too

I find it a cool coincidence that this Troll Image has the same version number as the release date of the new Gemini 2.5 Pro.

ocean vortex May 6, 2025, 2:13 PM

#

cedar tide Send the bench

oh I was looking myself for it lmao. Don't think there is anything yet

cedar tide May 6, 2025, 2:13 PM

#

You are talking about an improvement of which model exactly?

ocean vortex May 6, 2025, 2:13 PM

#

it's a silent release

#

maybe they will launch it officially soon/tomorrow hence the tweet

hollow ocean May 6, 2025, 2:14 PM

#

Majority got it wrong in the comments

#

🤣

#

https://tenor.com/view/i-robot-can-you-gif-9065419652448396713

Tenor

ocean vortex May 6, 2025, 2:15 PM

#

to be fair you are testing vision there too, if it can't see that properly then it doesn't matter how well it reasons

hollow ocean May 6, 2025, 2:19 PM

#

Correct

#

You didn’t fail math

#

https://tenor.com/view/im-proud-of-you-dan-levy-david-david-rose-schitts-creek-gif-14154248430770683876

Tenor

#

lol

cedar tide May 6, 2025, 2:20 PM

#

keen beacon WE'RE GETTING ULTRA TOMORROW

he's keeping the ultra model for google I/O

worthy thunder May 6, 2025, 2:21 PM

#

Context Arena Update: Added MRCR 4needle and 8needle results for some of the top models. (https://x.com/DillonUzar/status/1919758942936223779)

It's probable we'll get more model releases between today and over the next 2 weeks. I'll try my best to keep up. 😅

Top Results (8needle, AUC @ 1M):

Gemini 2.5 Flash Preview (Thinking) 04-17: 28.5%
Gemini 2.5 Pro Preview 03-25: 27.8%
Gemini 2.5 Flash Preview (No-Thinking) 04-17: 22.2%
GPT-4.1: 17.5%
GPT-4.1 Mini: 16.7%
Gemini Flash 1.5: 16.5% (some last few tests pending)
Gemini 2.0 Flash: 12.9%
GPT-4.1 Nano: 11.6%

I've also added a way to quickly view how a single model performs across 2needle, 4needle, and 8needle.
Several other advanced updates coming later this week. Will post as they are finished and fully rolled out.

Enjoy!

tall summit May 6, 2025, 2:32 PM

#

needle

ember rapids May 6, 2025, 3:00 PM

#

I know people hate on Elon/xai but gork is actually funny

brittle tiger May 6, 2025, 3:06 PM

#

https://x.com/JeffDean/status/1919770316949393476?t=65T585gtp_QABY2a6vJfbw&s=19

Jeff Dean (@JeffDean) on X

Today we’re sharing an early look at our latest Gemini update for I/O!

Introducing the updated Gemini 2.5 Pro (I/O edition), which ranks #1 on WebDev Arena and surpasses our previous 2.5 Pro model by +147 Elo points. 🏆

https://t.co/jIPslRklGd

cedar tide May 6, 2025, 3:07 PM

#

https://x.com/OfficialLoganK/status/1919770687167684808?t=08xlUwHBWONpbBOmd7psDg&s=19

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 2.5 Pro just got an upgrade & is now even better at coding, with significant gains in front-end web dev, editing, and transformation.

We also fixed a bunch of function calling issues that folks have been reporting, it should now be much more reliable. More details in 🧵

fleet lintel May 6, 2025, 3:08 PM

#

cedar tide https://x.com/OfficialLoganK/status/1919770687167684808?t=08xlUwHBWONpbBOmd7psDg...

wow.. what a crazy jump!

brittle tiger May 6, 2025, 3:09 PM

#

https://x.com/OfficialLoganK/status/1919770695006847016?t=JTB7W0ZrhM0gXFfRYob5dQ&s=19

Logan Kilpatrick (@OfficialLoganK) on X

Super excited to see how everyone uses the new 2.5 Pro model, and I hope you all enjoy a little pre-IO launch : )

The team has been super excited to get this into the hands of everyone so we decided not to wait until IO.

sage raptor May 6, 2025, 3:21 PM

#

i dont think so

balmy mist May 6, 2025, 3:23 PM

#

it has to be NW

cedar tide May 6, 2025, 3:23 PM

#

keen beacon Gemini model updates have historically been pretty good too

I was sure the difference was minimal

Screenshot_2025-05-06-17-21-49-417_com.android.chrome-edit.jpg

calm sequoia May 6, 2025, 3:25 PM

#

Something is not right. How can the NW be added to the arena if it wasn't on the arena for the last month. This means it didn't fight the o3 and o4-mini

elder rapids May 6, 2025, 3:25 PM

#

cedar tide I was sure the difference was minimal

pretty big difference holy

balmy mist May 6, 2025, 3:25 PM

#

cedar tide I was sure the difference was minimal

Check the web dev difference

elder rapids May 6, 2025, 3:25 PM

#

dude doesn't understand elo + historical elo differences like that

#

the fact that the early 2.5 pro jumped 40 points is wild

balmy mist May 6, 2025, 3:26 PM

#

It's so slow though, oh my gosh. It takes forever to think.

elder rapids May 6, 2025, 3:26 PM

#

and the new one jumping 10 is still major change
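To put those gaps in perspective, here is a small sketch using the standard Elo expected-score formula. This is an assumption about how to read the numbers: arena leaderboards may fit their ratings differently, and ties are ignored. The 10 and 40 point gaps are the ones mentioned just above, and +147 is the WebDev Arena figure from the Jeff Dean tweet.

```python
def expected_win_rate(elo_gap):
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

for gap in (10, 40, 147):
    print(f"+{gap} Elo -> {expected_win_rate(gap):.1%} expected win rate")
```

Under this formula the three gaps work out to roughly 51%, 56%, and 70% head-to-head win rates, so a 10 point jump is a slight edge while +147 means winning about seven matchups in ten.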

brittle tiger May 6, 2025, 3:26 PM

#

Just one output so not conclusive but this 2.5 Pro I/O update did a much better job than nightwhisper on this prompt:

Code a captivating, interactive web experience that brings the invisible forces of polar magnetic fields to life. Focus on visual beauty and intuitive user engagement. Surprise me with your approach.

https://g.co/gemini/share/475821d8f2bc

Gemini

‎Gemini - Code Visualization Request and Response

Created with Gemini Advanced

balmy mist May 6, 2025, 3:26 PM

#

They should keep both versions

elder rapids May 6, 2025, 3:26 PM

#

balmy mist It's so slow though, oh my gosh. It takes forever to think.

compute prob not allocated

cedar tide May 6, 2025, 3:27 PM

#

cedar tide I was sure the difference was minimal

apart from the improvement in web coding that I had expected

calm sequoia May 6, 2025, 3:27 PM

#

elder rapids dude doesn't understand elo + historical elo differences like that

Extrapolate

balmy mist May 6, 2025, 3:28 PM

#

plus 147 in web dev is insane

#

and same costs??

#

wild

#

imma try this out in roocode

barren prairie May 6, 2025, 3:29 PM

#

Test the new gemini and tell me your opinions 🙂

wheat onyx May 6, 2025, 3:30 PM

#

I assume the Gemini 2.5 pro update is one of the internal models?

elder rapids May 6, 2025, 3:30 PM

#

calm sequoia Extrapolate

what?

cedar tide May 6, 2025, 3:37 PM

#

New model better at coding and multi turn

Screenshot_2025-05-06-17-36-40-886_com.miui.gallery-edit.jpg

balmy mist May 6, 2025, 3:37 PM

#

so we dont need to make an update in api?

balmy mist May 6, 2025, 3:37 PM

#

cedar tide New model better at coding and multi turn

what does numbers mean?

cedar tide May 6, 2025, 3:38 PM

#

balmy mist what does numbers mean?

Place in the leaderboard

balmy mist May 6, 2025, 3:38 PM

#

damn

#

from my tests this model is really good

elder rapids May 6, 2025, 3:38 PM

#

yo it seems smart asf

balmy mist May 6, 2025, 3:38 PM

#

just slow

elder rapids May 6, 2025, 3:39 PM

#

o3 vibes deadass

#

it's INTELLIGENCE

balmy mist May 6, 2025, 3:39 PM

#

its like the old one was thinking on low or medium and this one is at medium to high lol

elder rapids May 6, 2025, 3:39 PM

#

it's crazy

#

it's picking up nuances

balmy mist May 6, 2025, 3:39 PM

#

what tests have you run?

elder rapids May 6, 2025, 3:39 PM

#

just the usual philosophy stuff

#

undergraduate-graduate math explanation

#

etc

balmy mist May 6, 2025, 3:40 PM

#

wow you think they still gonna release an ultra?

#

i dont see a point lol

elder rapids May 6, 2025, 3:40 PM

#

prob not

#

nah it's just blatantly too small of a model

#

but it's GOOD

balmy mist May 6, 2025, 3:41 PM

#

hmm i cant wait to see full benchmarks on this

elder rapids May 6, 2025, 3:41 PM

#

ong

cedar tide May 6, 2025, 3:42 PM

#

cedar tide New model better at coding and multi turn

And better vision

Screenshot_2025-05-06-17-41-33-725_com.android.chrome-edit.jpg

#

Well, the pro grok who talks shi..
you bored us

balmy mist May 6, 2025, 3:47 PM

#

ooohh lets do the geo guesser test again

keen beacon May 6, 2025, 3:48 PM

#

2.5 pro dethrones 2.5 pro lol

golden ocean May 6, 2025, 3:49 PM

#

can someone test if the update fixes the problem where gemini adds optional code everywhere and everytime when its assisting u (so not 100% vibe coding)

keen beacon May 6, 2025, 3:49 PM

#

balmy mist ooohh lets do the geo guesser test again

been working on it 😉

#

it does seem better than old 2.5 pro

cedar tide May 6, 2025, 3:50 PM

#

#

Vs old

wintry tinsel May 6, 2025, 3:50 PM

#

did 2.5 pro update?

#

how did it get worse at coding

ocean vortex May 6, 2025, 3:52 PM

#

cedar tide Vs old

so old was technically better in quite a few areas? lol

balmy mist May 6, 2025, 3:53 PM

#

wintry tinsel how did it get worse at coding

no it got better at coding

#

but worse in math

wintry tinsel May 6, 2025, 3:53 PM

#

😦

sage raptor May 6, 2025, 3:53 PM

#

benchmarks are wrong ?

hollow ocean May 6, 2025, 3:53 PM

#

ocean vortex so old was technically better in quite a few areas? lol

Yes old better

#

Haha

keen fulcrum May 6, 2025, 3:54 PM

#

https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/
still not better than o3

Gemini 2.5 Pro Preview: even better coding performance- Google Deve...

hollow ocean May 6, 2025, 3:54 PM

#

O3 is king still

#

Grok 3.5 will be pay to win 🔥

#

You gotta pay to get the best 💰

ocean vortex May 6, 2025, 3:57 PM

#

they are chasing the wrong things... why would you sacrifice performance for function calling

balmy mist May 6, 2025, 3:57 PM

#

ocean vortex they are chasing the wrong things... why would you sacrifice performance for fun...

who?

elder rapids May 6, 2025, 3:57 PM

#

yeah who?

keen beacon May 6, 2025, 3:57 PM

#

??

elder rapids May 6, 2025, 3:57 PM

#

also I think this one is better at math weirdly enough

ocean vortex May 6, 2025, 3:57 PM

#

balmy mist who?

For developers already using Gemini 2.5 Pro, this new version will not only improve coding performance but will also address key developer feedback including reducing errors in function calling and improving function calling trigger rates.

balmy mist May 6, 2025, 3:58 PM

#

they said they also improved coding performance tho

#

and function calling is actually important imo

hollow ocean May 6, 2025, 3:58 PM

#

Only coding good

keen beacon May 6, 2025, 3:58 PM

#

i doubt the function calling adjustments reduced performance (possible, but probably not significant), it was probably the larger changes to coding

ocean vortex May 6, 2025, 3:59 PM

#

balmy mist and function calling is actually important imo

yeah but I wouldn't classify it higher importance than GPQA, SimpleQA and AIME

alpine coral May 6, 2025, 3:59 PM

#

im not surprised to see it slip slightly in simpleqa and those others

balmy mist May 6, 2025, 3:59 PM

#

true but we need our models to be good at function calling to be able to effectively use our existing ecosystem

hollow ocean May 6, 2025, 4:00 PM

#

Old is better for simpleqa

alpine coral May 6, 2025, 4:00 PM

#

yeah

hollow ocean May 6, 2025, 4:00 PM

#

The new one is dumber

balmy mist May 6, 2025, 4:00 PM

#

like instead of getting a model to generate movies, its better to have one that can effectively use our existing software to make movies, which means better at function calling etc..

alpine coral May 6, 2025, 4:00 PM

#

it might be better at coding

#

but it's not 'smarter' generally

#

initial impression anyway

balmy mist May 6, 2025, 4:01 PM

#

from o3 lol

keen beacon May 6, 2025, 4:01 PM

#

u asked it to use gpt 4o image gen to gen that?

alpine coral May 6, 2025, 4:01 PM

#

though yeah all that said.. it's a snapshot/checkpoint.. tbh if i didn't know it was a different model i prob wouldn't be able tell

#

at surface level anyway

cedar tide May 6, 2025, 4:02 PM

#

ocean vortex so old was technically better in quite a few areas? lol

his fine tuning for web coding has diminished him on the rest

balmy mist May 6, 2025, 4:02 PM

#

keen beacon u asked it to use gpt 4o image gen to gen that?

i just gave it the benchmark info and said this

Screenshot_2025-05-06_at_12.02.08_PM.png

#

the first attempt it had was better but it got cancelled for some reason

#

Yo, this model is so good at coding. Oh my gosh.

#

wym?

#

i think google just replaced sonnet as code king

#

gotta do more tests

#

but rn this new model seems so good and cheap

cedar tide May 6, 2025, 4:04 PM

#

cedar tide his fine tuning for web coding has diminished him on the rest

In my opinion, because of this, they initially planned to release models specialized in coding but they may have canceled it by just wanting to merge it.

keen beacon May 6, 2025, 4:05 PM

#

i dont think so tbh

balmy mist May 6, 2025, 4:05 PM

#

how do you explain the #1 on webdev?

keen beacon May 6, 2025, 4:05 PM

#

claude code is like aider

hollow ocean May 6, 2025, 4:08 PM

#

Grok 3.5 next week

keen beacon May 6, 2025, 4:08 PM

#

we arent ready for asi

hollow ocean May 6, 2025, 4:08 PM

#

✅

elder rapids May 6, 2025, 4:09 PM

#

balmy mist they said they also improved coding performance tho

it's so much better at coding

tall summit May 6, 2025, 4:09 PM

#

balmy mist from o3 lol

the text amazes me

elder rapids May 6, 2025, 4:09 PM

#

alr the benchmarks heavily missed this one

tall summit May 6, 2025, 4:09 PM

#

still

elder rapids May 6, 2025, 4:09 PM

#

it's so much smarter

#

it's crazy

hollow ocean May 6, 2025, 4:10 PM

#

O3 pro in 2 months

#

We’re so back

elder rapids May 6, 2025, 4:10 PM

#

I'm curious asf for grok 3.5

keen beacon May 6, 2025, 4:10 PM

#

cedar tide Vs old

so code is better but it's worse on almost everything else? not sure how to feel about that..

balmy mist May 6, 2025, 4:10 PM

#

bro i thought that was supposed to come out yesterday lol

elder rapids May 6, 2025, 4:10 PM

#

keen beacon so code is better but it's worse on almost everything else? not sure how to feel...

nah don't believe it

#

try it yourself

keen beacon May 6, 2025, 4:10 PM

#

theyre probably delaying o3 pro because they dont want people to use it lol

balmy mist May 6, 2025, 4:11 PM

#

there is no way they would replace 2.5 model if it truly was worse in everything else

#

unless ultra is coming

hollow ocean May 6, 2025, 4:11 PM

#

keen beacon theyre probably delaying o3 pro because they dont want people to use it lol

Why not if we pay $200/month

keen beacon May 6, 2025, 4:11 PM

#

balmy mist there is no way they would replace 2.5 model if it truly was worse in everything...

its prolly largely the same perf but with notably better coding skills

balmy mist May 6, 2025, 4:11 PM

#

yeah that makes sense

keen beacon May 6, 2025, 4:12 PM

#

balmy mist unless ultra is coming

yeah it is i think

#

anyway

cedar tide May 6, 2025, 4:12 PM

#

cedar tide

the only place I found these benches is this tweet
but google did not share it in any tweets or on their blogs because of the downgrade 🧐 https://x.com/ai_for_success/status/1919776586057785526?t=H7gH6-nEC9lgpLDRrruE3A&s=19

AshutoshShrivastava (@ai_for_success) on X

Gemini 2.5 Pro benchmarks.

hollow ocean May 6, 2025, 4:13 PM

#

Grok 3.5 next next week

#

We’re so back

cedar tide May 6, 2025, 4:13 PM

#

hollow ocean Grok 3.5 next next week

This week

hollow ocean May 6, 2025, 4:13 PM

#

cedar tide This week

Maybe Sunday drop

elder rapids May 6, 2025, 4:14 PM

#

balmy mist there is no way they would replace 2.5 model if it truly was worse in everything...

ye but regardless it does feel smarter and that's probably the intentionality/nuance that entails the greater coding abilities

narrow elbow May 6, 2025, 4:14 PM

#

cedar tide the only place I found these benches is this tweet but google did not share it i...

what downgrade ?

elder rapids May 6, 2025, 4:14 PM

#

you guys have to remember that these performances aren't unrelated to other aspects of the model

keen beacon May 6, 2025, 4:14 PM

#

https://deepmind.google/technologies/gemini/pro

Google DeepMind

Gemini 2.5 Pro

Gemini 2.5 Pro is our most advanced model for complex tasks. With thinking built in, it showcases strong reasoning and coding capabilities.

#

the benchmark table is here

cedar tide May 6, 2025, 4:15 PM

#

Ah ok thx

narrow elbow May 6, 2025, 4:17 PM

#

Thank you grok ambassador

elder rapids May 6, 2025, 4:19 PM

#

ong

#

Craig over here predicting the future

cedar tide May 6, 2025, 4:20 PM

#

cedar tide

Grok 3 mini have better bench than grok 3

he should put it in its place

elder rapids May 6, 2025, 4:20 PM

#

hold on

#

I get it

#

deepmind is cooking

#

they're improving the intentionality

keen beacon May 6, 2025, 4:20 PM

#

cedar tide Grok 3 mini have better bench than grok 3 he should put it in its place

they need to remove grok 3 reasoning its vaporware

calm sequoia May 6, 2025, 4:22 PM

#

cedar tide the only place I found these benches is this tweet but google did not share it i...

At least google is honest on comparisons

keen beacon May 6, 2025, 4:22 PM

#

they only included it on their website but didnt release the graphic onto social media officially

calm sequoia May 6, 2025, 4:22 PM

#

Though it's funny that they underperform to o3 on so many benchmarks, but outperform on everything in arena

cedar tide May 6, 2025, 4:23 PM

#

calm sequoia At least google is honest on comparisons

Why do you say they are honest?

blazing coyote May 6, 2025, 4:23 PM

#

New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBattle_/status/1919788812529439118

AiBattle (@AiBattle_) on X

Gemini 2.5 Pro old and new Benchmark comparison!

New(left), Old (right)

calm sequoia May 6, 2025, 4:23 PM

#

Check how the Qwen 3 was presented ;D

hollow ocean May 6, 2025, 4:23 PM

#

blazing coyote New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBatt...

https://tenor.com/view/cringe-eeee-gif-24636179

Tenor

cedar tide May 6, 2025, 4:23 PM

#

calm sequoia Though it's funny that they underperform to o3 on so many benchmarks, but outper...

Its not funny There are a ton of models that are very good on the lm arena but very bad on the benchmarks.

hollow ocean May 6, 2025, 4:24 PM

#

cedar tide Its not funny There are a ton of models that are very good on the lm arena but v...

It’s biased that’s why

cedar tide May 6, 2025, 4:24 PM

#

calm sequoia Check how the Qwen 3 was presented ;D

Yes, but Google is also not honest.

calm sequoia May 6, 2025, 4:24 PM

#

Well its funny for me. The question is: who is wrong

calm sequoia May 6, 2025, 4:25 PM

#

cedar tide Yes, but Google is also not honest.

They communicated their limitations. For me that's enough.

cedar tide May 6, 2025, 4:25 PM

#

cedar tide Yes, but Google is also not honest.

They didn't put the Grok 3 mini in the benchmarks against Gemini 2.5 Flash because it's better and cheaper.

calm sequoia May 6, 2025, 4:26 PM

#

blazing coyote New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBatt...

As I said, the NW lacked some world knowledge in comparison to big models. I hope the older version will still be available.

cedar tide May 6, 2025, 4:26 PM

#

calm sequoia Though it's funny that they underperform to o3 on so many benchmarks, but outper...

12b model at deepseek v3 level 🤦

Screenshot_2025-05-06-18-26-19-849_com.android.chrome-edit.jpg

elder rapids May 6, 2025, 4:27 PM

#

ngl I don't know what models you guys are using

calm sequoia May 6, 2025, 4:27 PM

#

cedar tide They didn't put the Grok 3 mini in the benchmarks against Gemini 2.5 Flash becau...

True. The flash is 💩

elder rapids May 6, 2025, 4:27 PM

#

grok 3 mini struggles too much at basic tasks

keen beacon May 6, 2025, 4:28 PM

#

grok 3 mini is not good

elder rapids May 6, 2025, 4:28 PM

#

keen beacon grok 3 mini is not good

ye

#

it's just

cedar tide May 6, 2025, 4:28 PM

#

keen beacon grok 3 mini is not good

I don't speak in real life, I speak in the bench, and the honesty of Google

elder rapids May 6, 2025, 4:28 PM

#

bad in practice

hollow ocean May 6, 2025, 4:28 PM

#

Grok 3.5 Friday night

elder rapids May 6, 2025, 4:28 PM

#

this is just cap lmao

#

dude is over here calling me Gemini lover but he just hates google

keen beacon May 6, 2025, 4:29 PM

#

tbh i wouldnt even call it that dishonest. or the qwen 3 benchmark graphic (its quite representative, but people dont seem to know how to interpret it)

elder rapids May 6, 2025, 4:30 PM

#

keen beacon tbh i wouldnt even call it that dishonest. or the qwen 3 benchmark graphic (its ...

I want to see the benchmark passes for each model tbh

lime coral May 6, 2025, 4:32 PM

#

flash mainly targets multimodal retrieval use cases

elder rapids May 6, 2025, 4:33 PM

#

also btw the updated Gemini is much better at search

#

seems apparent

keen beacon May 6, 2025, 4:34 PM

#

might be the effect of better fc

elder rapids May 6, 2025, 4:34 PM

#

yeah it seems smarter too I'm somewhat confused on the benchmarks

keen beacon May 6, 2025, 4:35 PM

#

i dont know about it being smarter or much smarter (or dumber), it seems fine to me, but i havent used the model much at all. ( i also don't use it to code )

torn mantle May 6, 2025, 4:36 PM

#

blazing coyote New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBatt...

interesting

lime coral May 6, 2025, 4:38 PM

#

elder rapids yeah it seems smarter too I'm somewhat confused on the benchmarks

Looks like the focus was on « real world » https://x.com/jack_w_rae/status/1919779398607085598?s=46

Jack Rae (@jack_w_rae) on X

We are releasing an updated 2.5 Pro today that has significantly improved real-world coding capabilities.

Here is a fun app, vibe-coded w/ Gemini to learn lines. Includes different characters and scenes, even some tts with different character voices.

https://t.co/mqlTDMUfdC

elder rapids May 6, 2025, 4:39 PM

#

keen beacon i dont know about it being smarter or much smarter (or dumber), it seems fine to...

ye I haven't been using it to code so much either but ofc it's much much better

#

but outside of that

#

it just seems smarter

#

ion know what else to say

#

it's smarter, seems to grasp better nuances

#

seems to comprehend better

#

just, everything

keen beacon May 6, 2025, 4:40 PM

#

i think u need to use the model more to make that conclusion

#

it seems like ur glazing lol

elder rapids May 6, 2025, 4:43 PM

#

keen beacon i think u need to use the model more to make that conclusion

how is that meaningful to say

torn mantle May 6, 2025, 4:43 PM

#

https://x.com/test_tm7873/status/1919779673732427907

testtm (@test_tm7873) on X

Gemini-2.5-Pro-preview-05-06 isn't only the best model for coding. its the best model for everything! across all tasks!

elder rapids May 6, 2025, 4:43 PM

#

I've been using it since it released lmao

torn mantle May 6, 2025, 4:43 PM

#

was it dragontail

elder rapids May 6, 2025, 4:43 PM

#

same amount of time and prompts I used for when the first 2.5 pro released

keen beacon May 6, 2025, 4:44 PM

#

elder rapids I've been using it since it released lmao

yeah thats why, you need more time to use the model on more stuff

elder rapids May 6, 2025, 4:44 PM

#

how does that matter tho?

#

you can see clear differences

#

I don't need to do 5 prompts to know an average difference

#

if it's any different, it shows unless it's the exact same model

#

p much it

fleet lintel May 6, 2025, 4:45 PM

#

Is Grok 3.5 going to be SOTA?

keen beacon May 6, 2025, 4:45 PM

#

yuh its asi

torn mantle May 6, 2025, 4:45 PM

#

fleet lintel Is Grok 3.5 going to be SOTA?

asi

keen beacon May 6, 2025, 4:45 PM

#

they skipped agi

high ginkgo May 6, 2025, 4:45 PM

#

grok 3.5 is ai singularity

fleet lintel May 6, 2025, 4:46 PM

#

good to know.. I am waiting for ASI.. who need money and job anyway

elder rapids May 6, 2025, 4:46 PM

#

ong

high ginkgo May 6, 2025, 4:46 PM

#

you need money

elder rapids May 6, 2025, 4:46 PM

#

fleet lintel good to know.. I am waiting for ASI.. who need money and job anyway

censor j*b btw

balmy mist May 6, 2025, 4:48 PM

#

they should have kept both versions fr

#

like this long as thinking is kinda annoying

#

i want to be able to use both

keen beacon May 6, 2025, 4:49 PM

#

is it thinking generally more for you?

balmy mist May 6, 2025, 4:49 PM

#

yeah bro

#

and it takes way more tokens

high ginkgo May 6, 2025, 4:49 PM

#

never refresh ai studio page to keep using 03-25 👍

balmy mist May 6, 2025, 4:49 PM

#

im doing my iterations and it is producing way better results

#

but i hit the cap so fast

#

like if you keep iterating and improving code, eventually the output becomes too big

torn mantle May 6, 2025, 4:49 PM

#

Imo it got worse

balmy mist May 6, 2025, 4:50 PM

#

and the models cant handle it

torn mantle May 6, 2025, 4:50 PM

#

Tends to overthink

elder rapids May 6, 2025, 4:50 PM

#

torn mantle Tends to overthink

ye

misty vault May 6, 2025, 4:50 PM

#

did they ruin gemini with the new update

torn mantle May 6, 2025, 4:50 PM

#

And doesn't catch nuances very well

balmy mist May 6, 2025, 4:50 PM

#

yeah i think we need both versions

keen beacon May 6, 2025, 4:50 PM

#

balmy mist like if you keep iterating and improving code, eventually the output becomes too ...

the thing with gemini the context window is large you probably need to do diffs but its difficult if ur using the website manually lol

elder rapids May 6, 2025, 4:50 PM

#

torn mantle And doesn't catch nuances very well

now that's MAJOR cap

torn mantle May 6, 2025, 4:50 PM

#

misty vault did they ruin gemini with the new update

More like a drawback cuz of coding finetuning

#

They will eventually fix that on upcoming versions

elder rapids May 6, 2025, 4:51 PM

#

prohibited

keen beacon May 6, 2025, 4:51 PM

#

anyway waiting for 2.5 ultra now

elder rapids May 6, 2025, 4:51 PM

#

no one has a j*b here

#

hustle

#

ASI confirmed

#

4.5 Super ASI

#

what trained 4.5

#

it trained itself??

#

nahhh

#

it ain't that super yet

#

damn

unborn ocean May 6, 2025, 4:56 PM

#

Honestly i think they were losing money on the original 2.5 pro and now had to release a 'new' version with a heavier quantisation or something like that (hence the lower simpleQA score), but they did kind of finally start finetuning a bit more for coding like anthropic always does and openai started doing with the o series and 4.1.
(but that really is me just guessing)

elder rapids May 6, 2025, 4:56 PM

#

I remember when asi used to be "artificial sentient intelligence"

elder rapids May 6, 2025, 4:56 PM

#

unborn ocean Honestly i think they were losing money on the original 2.5 pro and now had to ...

i don't think them "losing money" means anything here tbh

#

they can afford to do this for decades

#

exactly

#

they can afford to do this for decades

#

not just that

#

their other sources

#

should be wayyy

#

too profitable

#

ion know what that means to the initial question of losing money

#

me saying Google can do this for decades presumes this

#

no company can just burn money

#

ye but that's not what I mean

elder rapids May 6, 2025, 4:59 PM

#

elder rapids they can afford to do this for decades

they can afford to do this for decades

#

not because they have their money

#

but because their profit still massively outweighs that loss

#

this is like rejecting them spending any money tho

#

they're just spending minimal amounts of money to keep it running

#

that's it

#

if it's operating at a loss

#

doesn't matter

#

they're not operating at a net loss

#

yep that's what I said

#

they wouldn't be

#

even if they served at a technical loss

#

therefore "I think they were losing money on the original 2.5 pro" is invalid

#

ye but that doesn't mean anything here

sage raptor May 6, 2025, 5:03 PM

#

https://x.com/fanofaliens/status/1919798986459762940 wow

Tornado guy (@fanofaliens) on X

Great Job Google !

I created a Black Hole Simulation using Gemini 2.5 Pro Preview. And the results are mind blowing🫡

#

torn mantle May 6, 2025, 5:12 PM

#

we need 2nd principles instead

torn mantle May 6, 2025, 5:12 PM

#

sage raptor

wait let me see chat logs about my opinion on this model

elder rapids May 6, 2025, 5:13 PM

#

sage raptor

yo what???

elder rapids May 6, 2025, 5:14 PM

#

torn mantle wait let me see chat logs about my opinion on this model

you're gonna be surprised lmao

#

now I'm really interested in NW

keen beacon May 6, 2025, 5:15 PM

#

elder rapids yo what???

theyre scraping web dev arena

small haven May 6, 2025, 5:15 PM

#

nobody:
claude code:

torn mantle May 6, 2025, 5:15 PM

#

no

small haven May 6, 2025, 5:15 PM

#

i wanna try gemini now, wen gemini max

elder rapids May 6, 2025, 5:16 PM

#

seems worse at fixed tasks but better at inquiry

#

W theory

keen beacon May 6, 2025, 5:17 PM

#

fr tho nw was a specialist webdev finetune i think

elder rapids May 6, 2025, 5:17 PM

#

nah

#

that's not true at all

#

nw was amazing at all tasks

#

the other ones seemed to be webdev fine tuned

keen beacon May 6, 2025, 5:20 PM

#

dont u have access to o3 pro on the o1 pro api?

pliant cypress May 6, 2025, 5:20 PM

#

"nightwhisper" was better than "sunstrike"?

small haven May 6, 2025, 5:23 PM

#

@deep adder how do i reset the convo, but start it with the compacted conversation?

keen beacon May 6, 2025, 5:25 PM

#

how many tokens u get every few hrs with claude max btw im curious

small haven May 6, 2025, 5:25 PM

#

claude --continue, starts off with the old convo tho?

keen beacon May 6, 2025, 5:25 PM

#

i think claude pro used to have several million tokens per few hours

small haven May 6, 2025, 5:25 PM

#

the $100/mo?

elder rapids May 6, 2025, 5:26 PM

#

pliant cypress "nightwhisper" was better than "sunstrike"?

it's the best one ye

small haven May 6, 2025, 5:26 PM

#

not to flex, but i have 20x

elder rapids May 6, 2025, 5:26 PM

#

also btw

small haven May 6, 2025, 5:26 PM

#

had an axp

elder rapids May 6, 2025, 5:26 PM

#

given context, it seems to grasp nuance better and still retains the same traits as 2.5 pro

#

the difference is so minimal in everything else

ocean plume May 6, 2025, 5:28 PM

#

something new have changed? web ? gemini over claude 3.7 ?

torn mantle May 6, 2025, 5:28 PM

#

nah gemini 2.5 flash

#

torn mantle May 6, 2025, 5:28 PM

#

ocean plume something new have changed? web ? gemini over claude 3.7 ?

not yet

ocean plume May 6, 2025, 5:28 PM

#

#

i mean

#

so the vote just fake right!

torn mantle May 6, 2025, 5:33 PM

#

flash 2.0

small haven May 6, 2025, 5:34 PM

#

claude code is currently responsible for 1% of gdp

ember rapids May 6, 2025, 5:42 PM

#

If claybrook got that elo I wonder how high nightwhisper is 👀

brittle tiger May 6, 2025, 5:53 PM

#

https://x.com/VictorTaelin/status/1919796817048297954

Taelin (@VictorTaelin) on X

feeling the AGI with the new gemini-2.5

I wasn't expecting such a leap so fast

throw your hardest questions at it and behold

elder rapids May 6, 2025, 5:56 PM

#

brittle tiger https://x.com/VictorTaelin/status/1919796817048297954

ngl I don't think the benchmarks really show how good it is

#

it might overthink but In open ended things and iterations

#

its so much better

tawdry meteor May 6, 2025, 5:58 PM

#

anyone know when the next big grok update will be?

elder rapids May 6, 2025, 5:58 PM

#

this week

tawdry meteor May 6, 2025, 5:59 PM

#

the 3.5 right?

#

do we have any benchmarks?

#

I don't subscribe to grok I already subscribe to gemini and gpt idk if I want another one lol

elder rapids May 6, 2025, 6:00 PM

#

tawdry meteor do we have any benchmarks?

nah

#

there are fake leaks

#

but no real model

keen beacon May 6, 2025, 6:01 PM

#

it will probably be worse than 2.5 pro

#

they didnt put it on arena as a pre release (presumably because it wont beat it)

elder rapids May 6, 2025, 6:01 PM

#

ye I don't see anything approaching o3 or 2.5 pro for a while now

#

that was honestly a major jump tbh

keen beacon May 6, 2025, 6:01 PM

#

elder rapids ye I don't see anything approaching o3 or 2.5 pro for a while now

2.5 ultra

elder rapids May 6, 2025, 6:01 PM

#

people did talk about it

#

but it needs to be emphasized that 2.5 pro is smaller than o3

keen beacon May 6, 2025, 6:02 PM

#

i dont think so lol

elder rapids May 6, 2025, 6:02 PM

#

it def is

keen beacon May 6, 2025, 6:02 PM

#

4.1 is 200b

elder rapids May 6, 2025, 6:02 PM

#

ye

#

seems like it

#

it's smart

#

o3 is probably around 600~

keen beacon May 6, 2025, 6:03 PM

#

no

#

its the same size

#

as 4o/4.1

elder rapids May 6, 2025, 6:03 PM

#

here's what I think

elder rapids May 6, 2025, 6:03 PM

#

keen beacon as 4o/4.1

bruh? what

keen beacon May 6, 2025, 6:04 PM

#

yes

elder rapids May 6, 2025, 6:04 PM

#

it literally, inherent to what the models are

#

can't be the same size

#

lmao

#

that doesn't make any sense

#

the thinking itself adds a ton

keen beacon May 6, 2025, 6:05 PM

#

i cba arguing about stuff where i need to cite etc. but im not making stuff up

elder rapids May 6, 2025, 6:05 PM

#

not even cite

#

it just doesn't make sense

#

to me at least

keen beacon May 6, 2025, 6:05 PM

#

so it was claybrook

elder rapids May 6, 2025, 6:06 PM

#

keen beacon so it was claybrook

ye

unborn ocean May 6, 2025, 6:06 PM

#

elder rapids the thinking itself adds a ton

but @keen beacon why are you so sure about the numbers, i mean openai is serving 4o to sooooooo many people that a high expert count might make sense

#

and i disagree about the thinking part

#

they said that the original o1 was based off of 4o (with the same parameter count, like e.g. qwq-32b that has the same param count as the base 32b)

#

but i am not sure if we know anything about the later versions

keen beacon May 6, 2025, 6:07 PM

#

o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get into it. i cba arguing/explaining it again and again lol

lime coral May 6, 2025, 6:08 PM

#

No one knows bro

keen beacon May 6, 2025, 6:09 PM

#

unborn ocean but @keen beacon why are you so sure about the numbers, i mean openai i...

it is possible that a high expert count is used, but i have no idea about it tbh. im talking about total params

keen beacon May 6, 2025, 6:10 PM

#

unborn ocean but @keen beacon why are you so sure about the numbers, i mean openai i...

also im obviously speculating ahahaha (but im not making the numbers up, it's based on stuff)

elder rapids May 6, 2025, 6:13 PM

#

wonder how the new model is going to affect the deep research ngl

unborn ocean May 6, 2025, 6:14 PM

#

keen beacon o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get...

yeah i get it, but:
1: I am not 100% convinced about that being the model they originally used (for the ARC agi benchmark o3 version)
2: the economies of scale for inference should push the large providers towards offering really high expert counts for their best models (I am basing this on the deepseek approach to inference optimization, where they have 'duplicate' experts (for the most used) and more experts optimizations)
3: 200b would mean they can run this stuff on one gpu with quantisation, i dunno
4: the fact that all the decent models released by alibaba and deepseek are really large as well

e.g. qwen max has 72b experts (really not sure though) and is like 1.5t params
5: the llama 4 maverick has 128 experts because meta wants to serve it to many people through their apps -> high efficiency because for a large provider its more like running a 17b param model for the customers

(i dunno, i am not even a programmer, so i am really talking out of my ass)
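
A back-of-the-envelope sketch for point 3, counting weights only. The 200b figure is the speculative number from this chat, not a confirmed size, and the estimate ignores KV cache, activations, and runtime overhead. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    speculative_params = 200e9  # assumption taken from the chat, not a known value
    for bits in (16, 8, 4):
        gb = weight_memory_gb(speculative_params, bits)
        print(f"{bits}-bit: {gb:.0f} GB")
```

Whether that fits on one GPU depends on the card's memory. Under these assumptions a 4-bit quantisation needs about 100 GB for the weights alone, and 16-bit needs about 400 GB.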

keen beacon May 6, 2025, 6:16 PM

#

its fun to speculate though ahaha

unborn ocean May 6, 2025, 6:17 PM

#

keen beacon its fun to speculate though ahaha

true, that's what i am here for :)

calm sequoia May 6, 2025, 6:17 PM

#

keen beacon so it was claybrook

Oh no 👀👀 it failed prompts that even gpt mini passed

calm sequoia May 6, 2025, 6:18 PM

#

keen beacon o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get...

Dude where are you getting all this info constantly

keen beacon May 6, 2025, 6:18 PM

#

calm sequoia Dude where are you getting all this info constantly

this is not new info

#

ive said it many times (but different threads/conversations, but i guess no one is actually reading it lol and piecing it together)

elder rapids May 6, 2025, 6:20 PM

#

calm sequoia Oh no 👀👀 it failed prompts that even gpt mini passed

is this not subject to variance

#

o3 doesn't get some stuff Gemma gets right

cedar tide May 6, 2025, 6:20 PM

#

keen beacon o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get...

Do you think that the o3 unveiled in December was based on GPT 4.5 but abandoned?

keen beacon May 6, 2025, 6:20 PM

#

no

elder rapids May 6, 2025, 6:21 PM

#

nah

cedar tide May 6, 2025, 6:21 PM

#

keen beacon no

so why was its arc agi cost estimated much higher?

keen beacon May 6, 2025, 6:22 PM

#

cedar tide so why was its arc agi cost estimated much higher?

check the sample count used lolll

#

💀

calm sequoia May 6, 2025, 6:22 PM

#

I have 10 prompts that i was grinding through all the models 10 times at least. The claybrook performed worse than SOTA models. And worse than the original 2.5 PRO. These prompts were never answered by the likes of gemma

elder rapids May 6, 2025, 6:22 PM

#

what are you prompting it with

cedar tide May 6, 2025, 6:22 PM

#

keen beacon check the sample count used lolll

What is this "sample count" ?

keen beacon May 6, 2025, 6:23 PM

#

cedar tide What is this "sample count" ?

basically how many times they ran the model on 1 q
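
A minimal sketch of why sample count drives the cost estimate: total cost scales linearly with samples per question. All numbers below are made up for illustration and are not the figures from the ARC Prize post; `eval_cost_usd` is a hypothetical helper name.

```python
def eval_cost_usd(
    num_tasks: int,
    samples_per_task: int,
    tokens_per_sample: int,
    usd_per_million_tokens: float,
) -> float:
    """Rough benchmark cost: every task is run `samples_per_task` times."""
    total_tokens = num_tasks * samples_per_task * tokens_per_sample
    return total_tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical settings: same model, same price, only sample count changes.
    low = eval_cost_usd(100, 1, 50_000, 60.0)
    high = eval_cost_usd(100, 100, 50_000, 60.0)
    print(low, high, high / low)
```

With everything else fixed, running 100 samples per question costs 100 times as much as running one, so a high cost estimate does not by itself imply a larger base model.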

calm sequoia May 6, 2025, 6:23 PM

#

keen beacon ive said it many times (but different threads/conversations, but i guess no one ...

People is talking a lot of stuff here. Guess I need to know who to pay attention to

unborn ocean May 6, 2025, 6:23 PM

#

keen beacon ive said it many times (but different threads/conversations, but i guess no one ...

i read it ... most of the time :)
was just unsure if you had connections or were just yapping 🤷‍♂️ about the info (or in between 😅 )

cedar tide May 6, 2025, 6:24 PM

#

keen beacon basically how many times they ran the model on 1 q

Where do you find this informations ?

keen beacon May 6, 2025, 6:24 PM

#

cedar tide Where do you find this informations ?

https://arcprize.org/blog/oai-o3-pub-breakthrough see samples

ARC Prize

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

OpenAI o3 scores 75.7% on ARC-AGI public leaderboard.

elder rapids May 6, 2025, 6:25 PM

#

calm sequoia People is talking a lot of stuff here. Guess I need to know who to pay attention...

he's not saying anything meaningful in that regard tho?

#

the total params are completely speculative

#

and the inference is coming from his knowledge of 4o itself

calm sequoia May 6, 2025, 6:25 PM

#

The checkpoint part is totally new to me

calm sequoia May 6, 2025, 6:26 PM

#

cedar tide Where do you find this information?

This critique was communicated since the day 1 after the results. It's almost random search given the amount of compute that went there.

elder rapids May 6, 2025, 6:30 PM

#

calm sequoia The checkpoint part is totally new to me

ye but the 4.1 card says this

keen beacon May 6, 2025, 6:31 PM

#

unborn ocean i read it ... most of the time :) was just unsure if you had connections or wher...

yuh im speculating but i hope it makes sense lol. i ramble a lot but im not outright making stuff up (most of the time) ahha. i feel like i have a pretty good track record/pretty reasonable so far. ive been talking about the chatgpt 4o latest cpt/(4.1 wip base model)/quasar for months, and it was outright confirmed by openai employees later. theres a lot of public info out there lol. but yeah im not willing to get into arguments/debate about it anymore lol (exhausting to cite/argue properly)

keen beacon May 6, 2025, 6:33 PM

#

elder rapids ye but the 4.1 card says this

you also think o1 is much larger than 4o too?

elder rapids May 6, 2025, 6:36 PM

#

yep much much larger

#

seems larger than o3 from my testing tbh

keen beacon May 6, 2025, 6:37 PM

#

elder rapids yep much much larger

fyi thinking doesnt add params to the model it doesnt make sense.

for o1 we know its based on the 4o base model lol. aidan was saying it was a 1t model/arguing with semianalysis about it before he deleted it. (before he was an openai employee)

#

https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-infrastructure-orion-and-claude-3-5-opus-failures/ (some of the info in here is wrong, but semianalysis has insiders)

keen beacon May 6, 2025, 6:44 PM

#

keen beacon yuh im speculating but i hope it makes sense lol. i ramble a lot but im not outr...

it does come off like im sure of it (unintentional), but i frankly just dont want to explain it over and over again. (it is all speculation but for some things it is extremely likely based on the info we have imo)

unborn ocean May 6, 2025, 6:45 PM

#

keen beacon yuh im speculating but i hope it makes sense lol. i ramble a lot but im not outr...

Btw don’t feel the need to cite things, I would claim that I also know wayyyy too many useless things about the industry

#

And can more or less add the sources in my head :)

elder rapids May 6, 2025, 6:45 PM

#

keen beacon fyi thinking doesnt add params to the model it doesnt make sense. for o1 we kn...

the thing is tho, thinking denotes much more params

#

it's not the thinking itself that adds params

unborn ocean May 6, 2025, 6:46 PM

#

elder rapids the thing is tho, thinking denotes much more params

Vram but not params

elder rapids May 6, 2025, 6:46 PM

#

so that's not really what I'm saying

unborn ocean May 6, 2025, 6:47 PM

#

The params are just the model weights

elder rapids May 6, 2025, 6:47 PM

#

even if you grant 4o is 200b params

elder rapids May 6, 2025, 6:47 PM

#

unborn ocean The params are just the model weights

what

keen beacon May 6, 2025, 6:47 PM

#

thinking models are primarily more expensive because of kv seq length (semianalysis explains)
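
A back-of-the-envelope sketch of the KV cache point: the cache grows linearly with sequence length while the weights stay the same size. The architecture numbers below (layers, KV heads, head dimension) are hypothetical, not the config of any OpenAI model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory needed to cache keys and values for one sequence.

    The leading 2 accounts for storing both a key and a value per position.
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model config, used only for illustration.
config = dict(num_layers=80, num_kv_heads=8, head_dim=128)

short_reply = kv_cache_bytes(seq_len=1_000, **config)
long_reasoning = kv_cache_bytes(seq_len=30_000, **config)

print(f"1k-token chat reply:   {short_reply / 1e9:.2f} GB")
print(f"30k-token thinking run: {long_reasoning / 1e9:.2f} GB")
```

Same weights in both cases; only the per-request cache (and the time spent decoding) is larger for the long reasoning trace.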

elder rapids May 6, 2025, 6:47 PM

#

that has nothing to do with what I said lmao

#

thinking necessarily denotes more params

keen beacon May 6, 2025, 6:47 PM

#

no it doesnt

#

why do you think that

elder rapids May 6, 2025, 6:47 PM

#

it's not the thinking itself that adds more params

elder rapids May 6, 2025, 6:48 PM

#

keen beacon why do you think that

what do you mean why do I think that lmao

unborn ocean May 6, 2025, 6:48 PM

#

elder rapids it's not the thinking itself that adds more params

Yeah well finetuning or RLing for reasoning does not add more params to the model

keen beacon May 6, 2025, 6:48 PM

#

elder rapids what do you mean why do I think that lmao

ur point doesnt make sense at all 😭

elder rapids May 6, 2025, 6:49 PM

#

you can't believe a base model can inherently support that reasoning process

keen beacon May 6, 2025, 6:49 PM

#

it is

#

pretraining is the most important part

#

well one of the most important parts

balmy mist May 6, 2025, 6:52 PM

#

will we ever get nw?

elder rapids May 6, 2025, 6:52 PM

#

unborn ocean Yeah well finetuning or RLing for reasoning does not add more params to the mode...

that has nothing to do with what I said tho

#

fine-tuning isn't what makes it reason

#

nor RL itself

unborn ocean May 6, 2025, 6:53 PM

#

No not just that, it’s also about the variability of thinking vs a larger model (where all params are used for every token) (imho)

unborn ocean May 6, 2025, 6:54 PM

#

elder rapids that has nothing to do with what I said tho

That plus a reward mechanism for correctness and a certain format with thinking tags is exactly what results in reasoning

#

Read the R1 paper if you are interested
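
As a rough illustration of "a reward mechanism for correctness and a certain format with thinking tags", here is a minimal rule-based reward sketch. The function names, the exact-match answer check, and the equal weighting are assumptions made for this example, not the R1 paper's actual implementation.

```python
import re

# Expected shape: <think>reasoning</think> followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in thinking tags, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, expected_answer: str) -> float:
    """1.0 if the text after the thinking block exactly matches the expected answer."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == expected_answer.strip() else 0.0


def total_reward(completion: str, expected_answer: str) -> float:
    # Equal weighting is an arbitrary choice for this sketch.
    return format_reward(completion) + accuracy_reward(completion, expected_answer)


print(total_reward("<think>2 + 2 is 4</think>4", "4"))
print(total_reward("<think>not sure</think>5", "4"))
print(total_reward("4", "4"))
```

Nothing here touches the parameter count: the reward only scores outputs, and RL updates the existing weights.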

small haven May 6, 2025, 6:56 PM

#

running 2 claude code instances >>

keen beacon May 6, 2025, 6:56 PM

#

thanks gemini

torn mantle May 6, 2025, 7:00 PM

#

keen beacon thanks gemini

malware

#

reported

#

to google

torn mantle May 6, 2025, 7:00 PM

#

balmy mist will we ever get nw?

i dont think so 😦

elder rapids May 6, 2025, 7:01 PM

#

unborn ocean That plus a reward mechanism for correctness and a certain format with thinking ...

not a reward mechanisms nor RL or fine-tuning, the Jump is too big if 4o can't represent the same knowledge (ex a complex math problem). there's a limited capacity, and it's not equivalent in both of them. and even if pretraining 4o DID allow it to therefore be fine-tuned and reason, they fundamentally have different knowledge bases. if you're saying o1 with a 200b base is outperforming deepseek r1 this much with such a large base then that's crazy

#

4o can't complete the electrical tasks I give it, 4o can't complete the philosophy tasks I give it, 4o doesn't understand basic qft operators like o1 does

#

there's no chance o1 can represent what 4o can't without what entails this

#

ie more params

#

that doesn't make any sense

keen beacon May 6, 2025, 7:05 PM

#

4o has to immediately predict a chat reply, o1 can think and recall about electrical stuff, etc., before it needs to form a reply

#

it does not necessitate larger param counts or the model needs to add more params for reasoning

elder rapids May 6, 2025, 7:07 PM

#

keen beacon 4o has to immediately predict a chat reply, o1 can think and recall about electr...

this literally can't matter when progressively given inquiry lmao

unborn ocean May 6, 2025, 7:07 PM

#

elder rapids not a reward mechanisms nor RL or fine-tuning, the Jump is too big if 4o can't r...

As I said, I am also not sure about the 200b and the underlying model (or just the architecture) was the same with the first o1 release (I believe they openly admitted that) but they are likely doing a lot of cpt (just me guessing) to the model and then starting with the reasoning training

elder rapids May 6, 2025, 7:07 PM

#

4o simply cannot receive these tasks, it doesn't KNOW

#

so I have a really hard time believing it's even close to the 200b count you're speculating

#

it's def a big model

keen beacon May 6, 2025, 7:09 PM

#

we basically have confirmation about o1 and how its based on 4o which is 200b (very likely) lol. i also recommend this article: https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller

Epoch AI

Frontier language models have become much smaller

In this Gradient Updates weekly issue, Ege discusses how frontier language models have unexpectedly reversed course on scaling, with current models an order of magnitude smaller than GPT-4.

elder rapids May 6, 2025, 7:09 PM

#

unborn ocean As I said, I am also not sure about the 200b and the underlying model (or just ...

ig it is hard to keep track of what they're doing

unborn ocean May 6, 2025, 7:10 PM

#

elder rapids 4o simply cannot receive these tasks, it doesn't KNOW

Big difference between the 4o in ChatGPT and the actual base model they use

keen beacon May 6, 2025, 7:11 PM

#

elder rapids so I have a really hard time believing it's even close to the 200b count you're ...

epoch ai are quite reputable anyway, you can believe me or not lol

elder rapids May 6, 2025, 7:11 PM

#

keen beacon we basically have confirmation about o1 and how its based on 4o which is 200b (v...

ye but the article somewhat reinforces what I'm saying given deepseek r1 itself too

ocean vortex May 6, 2025, 7:11 PM

#

unborn ocean Big difference between the 4o in ChatGPT and the actual base model they use

they are using 4.1. Why is there still gpt4o naming only they know

#

it is for a fact, they said so themselves

keen beacon May 6, 2025, 7:13 PM

#

because chatgpt 4o latest is on the 4.1 base and has been for months

ocean vortex May 6, 2025, 7:13 PM

#

Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version of GPT‑4o, and we will continue to incorporate more with future releases.

artificialanalysis did benchmarks on it and they were just about identical to 4.1

keen beacon May 6, 2025, 7:13 PM

#

identical benchmarks and knowledge cut off (distinct knowledge cut off indicating cpt compared to 4o)

ocean vortex May 6, 2025, 7:14 PM

#

there isn't really gpt4o is much worse

elder rapids May 6, 2025, 7:14 PM

#

unborn ocean Big difference between the 4o in ChatGPT and the actual base model they use

ye but that simply means it has more params/is different, if it absolutely doesn't know the things o1 does, then that's unbelievable

keen beacon May 6, 2025, 7:14 PM

#

elder rapids ye but that simply means it has more params/is different, if it absolutely doesn...

we have confirmation about it from semianalysis (has openai insider)/the argument between semianalysis and aidan which he relented (o1 is not 1t+, same size as 4o)

elder rapids May 6, 2025, 7:15 PM

#

ye o1 doesn't seem crazy heavy

ocean vortex May 6, 2025, 7:15 PM

#

which is the confusing part. Chatgpt-latest is based on 4.1 but they are still calling it gpt4o lmao

elder rapids May 6, 2025, 7:15 PM

#

but it's not 200b

#

this is just a fact

keen beacon May 6, 2025, 7:15 PM

#

elder rapids this is just a fact

youre just making that up

ocean vortex May 6, 2025, 7:15 PM

#

there's no dated API version of gpt4o that would come close to chatgpt performance

elder rapids May 6, 2025, 7:15 PM

#

I'd say o3 is arguably that range

#

size?

#

prolly true it's 200b~

#

I said that's what I thought

#

way back

#

to me Claude should be around like

#

300b

#

or larger

keen beacon May 6, 2025, 7:16 PM

#

sonnet 3.5 is like 400b

elder rapids May 6, 2025, 7:16 PM

#

ye that seems to be the case

unborn ocean May 6, 2025, 7:17 PM

#

keen beacon we basically have confirmation about o1 and how its based on 4o which is 200b (v...

Nice article, but I don’t really agree with his approach to measuring it. And I also question the result of 200b for 4o size.

ocean vortex May 6, 2025, 7:17 PM

#

elder rapids prolly true it's 200b~

I think it's comparable to Deepseek actually. They are serving it to hundreds of thousands of people so MoE makes sense but active parameter count is not gonna be a lot

elder rapids May 6, 2025, 7:18 PM

#

just wouldn't believe o1 is 200b

#

would seem wild to me

keen beacon May 6, 2025, 7:18 PM

#

unborn ocean Nice article, but I don’t really agree with his approach to measuring it. And I ...

i also recommend their moe inference article

ocean vortex May 6, 2025, 7:21 PM

#

100b active, 700b total, MoE. But that's just a wild guess lol

keen beacon May 6, 2025, 7:22 PM

#

ocean vortex 100b active, 700b total, MoE. But that's just a wild guess lol

i dont think massive active params make sense lol

elder rapids May 6, 2025, 7:22 PM

#

prolly smaller, but if it's moe then ig what Dom said

#

I don't think it's that big tho

#

all Gemini models

#

just simply don't seem that large

#

maybe cuz it's the tpus

keen beacon May 6, 2025, 7:22 PM

#

gemini pro is similar to sonnet3.5/4o size range

elder rapids May 6, 2025, 7:22 PM

#

ye

ocean vortex May 6, 2025, 7:23 PM

#

keen beacon i dont think massive active params make sense lol

It does make sense. Too little and it's gonna be limited. And Google has TPUs

elder rapids May 6, 2025, 7:23 PM

#

also my thought process was

#

why would they make a large gap

keen beacon May 6, 2025, 7:23 PM

#

ocean vortex It does make sense. Too little and it's gonna be limited. And Google has TPUs

not really. qwen 3 showed that even small active params still allow it to perform closer to the dense param count

elder rapids May 6, 2025, 7:23 PM

#

between flash and pro

keen beacon May 6, 2025, 7:23 PM

#

their 30b moe a3b experiment

ocean vortex May 6, 2025, 7:23 PM

#

Ultra was probly smth wild like 200b active

keen beacon May 6, 2025, 7:23 PM

#

(30b a3b has similar simpleqa to qwen 3 32b and higher than qwen 3 14b)

#

if u take sqrt(total_params*active_params) (mistral rule of thumb) it clearly outperforms 9.5b
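
For reference, the rule of thumb quoted above (geometric mean of total and active params, attributed in the chat to Mistral) worked out for a 30b-total / 3b-active MoE. The function name is made up for this sketch, and the rule itself is a heuristic, not a law.

```python
import math


def effective_dense_params(total_params_b: float, active_params_b: float) -> float:
    """Heuristic dense-equivalent size (in billions) of an MoE model:
    the geometric mean of total and active parameter counts."""
    return math.sqrt(total_params_b * active_params_b)


estimate = effective_dense_params(total_params_b=30, active_params_b=3)
print(f"30b total, 3b active -> ~{estimate:.1f}b dense-equivalent")
```

So the heuristic predicts roughly a 9.5b dense model, which is the baseline the comparison against the 14b and 32b dense models is being made from.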

elder rapids May 6, 2025, 7:24 PM

#

this was what I said like a while ago

#

good knowledge tho

keen beacon May 6, 2025, 7:25 PM

#

2.5 pro and 2.0 pro are the same size, i believe

elder rapids May 6, 2025, 7:25 PM

#

I don't believe 2.0 pro was larger than 200b

ocean vortex May 6, 2025, 7:25 PM

#

keen beacon (30b a3b has similar simpleqa to qwen 3 32b and higher than qwen 3 14b)

Deepseek doesn't have as much resources as Qwen and in my opinion they are doing better. With a model that has more active params. It was released well before qwen3 and it still beats it on some things

elder rapids May 6, 2025, 7:25 PM

#

or any bigger than 4o

#

thing is tho

#

2.0 pro knew more of these important tasks

#

STEM, philosophy

keen beacon May 6, 2025, 7:26 PM

#

ocean vortex Deepseek doesn't have as much resources as Qwen and in my opinion they are doing...

im talking about the fact that small active params + larger total params allows it to still perform to the dense total param count. what you're talking about that's a separate thing.

u can compare the base model benchmarks (where its very much standardized, they definitely did not cherry pick anything on the base model benchmarks chart)

their tuning on top of the base model was insufficient/lackluster though

ocean vortex May 6, 2025, 7:27 PM

#

elder rapids or any bigger than 4o

It is bigger almost for certain. With 2.5 pro that has 2.0 base (well not exact same but same size) they get performance in different ways than OpenAI. They don't need high reasoning and it has better spatial awareness

ocean vortex May 6, 2025, 7:29 PM

#

keen beacon im talking about the fact that small active params + larger total params allows ...

it will not perform as well as dense model comparing MoE total params vs dense. What happens instead is MoE is much faster to train so we are comparing oranges against apples lol

#

cause that MoE that you are comparing to dense saw way more training and data

keen beacon May 6, 2025, 7:30 PM

#

ocean vortex it will not perform as well as dense model comparing MoE total params vs dense. ...

they pretrained on 36t on all of the models

#

they didnt train the moe more

ocean vortex May 6, 2025, 7:31 PM

#

it's also why qwq 32b dense competes with gpt4o though, on a flip side. Both take comparable amount to train I think

#

and both can perform similarly

keen beacon May 6, 2025, 7:31 PM

#

ocean vortex cause that MoE that you are comparing to dense saw way more training and data

???

ocean vortex May 6, 2025, 7:32 PM

#

keen beacon ???

you are not gonna train / fine-tune smth like 405b nearly as much as you would MoE, big model comparisons are not very realistic

ocean vortex May 6, 2025, 7:33 PM

#

ocean vortex it's also why qwq 32b dense competes with gpt4o though, on a flip side. Both tak...

but this is ^

keen beacon May 6, 2025, 7:33 PM

#

ocean vortex you are not gonna train / fine-tune smth like 405b nearly as much as you would M...

but we are comparing qwen 3 32b dense and qwen 3 30b moe

#

im confused

ocean vortex May 6, 2025, 7:35 PM

#

keen beacon but we are comparing qwen 3 32b dense and qwen 3 30b moe

I'm comparing it against gpt4o lol

#

but if you wanna compare those

#

dense still performs better

keen beacon May 6, 2025, 7:36 PM

#

ocean vortex dense still performs better

but it performs closer to 30b right despite small active params right?

ocean vortex May 6, 2025, 7:36 PM

#

it's like gpt4.1 vs gpt4.1-mini, on some benchmarks they look very similar, diminishing returns. But on some others the difference is bigger

#

but 1 is still for a fact better

keen beacon May 6, 2025, 7:37 PM

#

ocean vortex it's like gpt4.1 vs gpt4.1-mini, on some benchmarks they look very similar, dimi...

in terms of world knowledge, qwen 30b a3b and qwen 32b dense both get the same simpleqa score (and higher than 14b)

ocean vortex May 6, 2025, 7:39 PM

#

keen beacon but it performs closer to 30b right despite small active params right?

it would seem so. But we also have reasoning on top which a bit complicates things. The numbers qwen is citing are not for standard instruct chat models

#

don't forget o4-mini performing better than o3 on some things etc

keen beacon May 6, 2025, 7:40 PM

#

ocean vortex it would seem so. But we also have reasoning on top which a bit complicates thin...

still the simpleqa score (independently measured, i posted it somewhere a while back), shows the same. (you can't make simpleqa gains trivially using reasoning)

just shows that there's a lot of leeway with active params, i think the more important part is the amount of experts used

#

even then, qwen3 has bias towards certain experts/not even. (experts are not "experts" in knowledge, it's more complex, but anyway) usually moe training theres autobalancing or whatever with experts [tangential point]

ocean vortex May 6, 2025, 7:43 PM

#

keen beacon still the simpleqa score (independently measured, i posted it somewhere a while ...

what's the simpleqa for it with no reasoning enabled?

keen beacon May 6, 2025, 7:44 PM

#

8% for both of them. no thinking.

unborn ocean May 6, 2025, 7:49 PM

#

I would argue that the argument for dense vs MoE really depend on the type of questions asked.
In something like simpleqa, the model just needs to access the knowledge stored in some neurons, whether that info is accessed by running all the parts of the model (dense) or just predicting which part of model's experts knows it (MoE) should not really matter much (ik i am wildly simplifying). However, I am sure there will be a lot of scenarios (e.g. compact reasoning over less than 10 tokens) where 1b params as an expert will not be able to compete with a dense 30b that has a lot of layers to 'think' internally.
In short: simple qa might not be the best benchmark to compare the two architectures and there has to be some sort of downside with the MoE (assuming same tokens in training).
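
A toy sketch of the difference being described: in a top-k MoE layer a router picks a few experts per token and only those run, whereas a dense layer runs everything. This is the generic softmax top-k gating idea with made-up scalar "experts", not the routing code of Qwen3 or any other specific model.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))


# Four toy experts; real experts are feed-forward blocks, not scalar multipliers.
experts = [lambda x: x * 1, lambda x: x * 2, lambda x: x * 3, lambda x: x * 4]
router_logits = [0.0, 5.0, 1.0, 5.0]  # hypothetical router scores for one token

print(top_k_route(router_logits, k=2))
print(moe_forward(1.0, experts, router_logits, k=2))
```

With k=2 of 4 experts, half the expert parameters are never touched for this token, which is the "active vs total" gap the thread is arguing about.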

ocean vortex May 6, 2025, 7:55 PM

#

unborn ocean I would argue that the argument for dense vs MoE really depend on the type of qu...

yeah you may be somewhat right. Just tried this:

#

complete gibberish, versus: (normal text in Lithuanian)

#

#

no thinking enabled

hardy pecan May 6, 2025, 7:56 PM

#

interesting they are sandbagging models by releasing claybrook, when we all think there are 3 other models that were stronger, I guess they come at I/O

keen beacon May 6, 2025, 7:57 PM

#

hardy pecan interesting they are sandbagging models by releasing claybrook, when we all thin...

they could be testing the same model (different revision) with different names. so it could be less than 3 'actual' models

#

(there could also be more unreleased models, we dont have enough information to make a conclusion either way)

ocean vortex May 6, 2025, 7:59 PM

#

keen beacon 8% for both of them. no thinking.

yeah it is impressive and maybe 8 experts (active) help indeed. Still not quite the level equivalent to dense though. If it had the same amount of active parameters and say 60b total, I do not think it would be better than 32b dense still 🤔

raven void May 6, 2025, 8:09 PM

#

Claude 4 will have 1800 elo in WebDev

elder rapids May 6, 2025, 8:24 PM

#

looking around with people talking about the new 2.5 pro

#

it's so mixed

#

someone said it's worse at analytic processes

#

but it's gotten better for other people

#

people are saying it has worse context

#

but to me it's the exact same

#

not new information

vivid oyster May 6, 2025, 8:32 PM

#

Wtf happened to gemini

calm sequoia May 6, 2025, 8:32 PM

#

ocean vortex complete gibberish, versus: (normal text in Lithuanian)

This is not gibberish. It's normal latvian.

vivid oyster May 6, 2025, 8:32 PM

#

It's not thinking anymore

#

elder rapids May 6, 2025, 8:33 PM

#

vivid oyster It's not thinking anymore

bugged for a bit it's happening to me

calm sequoia May 6, 2025, 8:33 PM

#

calm sequoia This is not gibberish. It's normal latvian.

You can't expect it to identify lithuanian language from 3 words

vivid oyster May 6, 2025, 8:33 PM

#

elder rapids bugged for a bit it's happening to me

Its a new debate

#

Update

#

elder rapids May 6, 2025, 8:33 PM

#

ye

keen beacon May 6, 2025, 8:33 PM

#

ask it to think when it does that

vivid oyster May 6, 2025, 8:33 PM

#

05-06

elder rapids May 6, 2025, 8:33 PM

#

bugged for a bit

keen beacon May 6, 2025, 8:33 PM

#

it will fix it

elder rapids May 6, 2025, 8:33 PM

#

it's happening to me

vivid oyster May 6, 2025, 8:33 PM

#

keen beacon it will fix it

Fr?

keen beacon May 6, 2025, 8:33 PM

#

generally yes

vivid oyster May 6, 2025, 8:33 PM

#

Aint no way

#

It worked

elder rapids May 6, 2025, 8:33 PM

#

ye

vivid oyster May 6, 2025, 8:33 PM

#

Thanks

#

So whats the update for

#

Like

#

05-06

keen beacon May 6, 2025, 8:33 PM

#

its so stupid

elder rapids May 6, 2025, 8:33 PM

#

vivid oyster So whats the update for

coding

keen beacon May 6, 2025, 8:33 PM

#

they need to prefill the thinking tag
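
What "prefill the thinking tag" could look like in practice: the serving side starts the assistant turn with the opening tag so the model has to continue inside a thinking block. The chat template tokens and the `<think>` tag below are hypothetical placeholders; Gemini's real template is not public in this thread.

```python
def build_prompt(user_message: str, force_thinking: bool = True) -> str:
    """Assemble a raw prompt, optionally prefilling the assistant turn with a thinking tag."""
    # Hypothetical template markers; real models define their own special tokens.
    prompt = f"<|user|>\n{user_message}\n<|assistant|>\n"
    if force_thinking:
        # The model continues from here, so it cannot skip the thinking block.
        prompt += "<think>\n"
    return prompt


print(build_prompt("Is this argument valid?"))
```

Without the prefill the model decides for itself whether to open a thinking block, which would match the "ask it to think" workaround above.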

elder rapids May 6, 2025, 8:34 PM

#

glad it's even better at philosophy + explanation + coding

#

but ion know about all your guys use cases

vivid oyster May 6, 2025, 8:34 PM

#

elder rapids glad it's even better at philosophy + explanation + coding

Its better at philosophy?

#

Thats what

#

I used it for

#

😮

#

Im using it for philosophy rn

elder rapids May 6, 2025, 8:35 PM

#

vivid oyster Its better at philosophy?

2.5 pro has always been really good at it, especially others and ye it seems like the new one grasps nuances better

vivid oyster May 6, 2025, 8:35 PM

#

Let me see if it improved

elder rapids May 6, 2025, 8:35 PM

#

especially in categorizing

vivid oyster May 6, 2025, 8:35 PM

#

elder rapids 2.5 pro has always been really good at it, especially others and ye it seems lik...

Idk I use o3 instead

elder rapids May 6, 2025, 8:35 PM

#

and formatting

vivid oyster May 6, 2025, 8:35 PM

#

Cause gemini 2.5 pro cant follow instructions good

#

Is o3 good at it?

elder rapids May 6, 2025, 8:35 PM

#

opposite for me

#

o3 is bad at philosophy + instruction following

sage raptor May 6, 2025, 8:36 PM

#

vivid oyster May 6, 2025, 8:36 PM

#

Last time I used gemini 2.5 pro for philosophy and asked it questions

#

It gave me answers for entirely different ones

#

Didnt address what I said

#

Didnt understand what I said

#

Even after I repeatedly clarified

#

It sucks at that

#

For me at least

elder rapids May 6, 2025, 8:37 PM

#

vivid oyster Didnt understand what I said

probably understands what you said but assumes you're elementary

worthy thunder May 6, 2025, 8:39 PM

#

Context Arena: Added in Gemini 2.5 Pro (0506) to the 8needle test.

Performs better on lower contexts, and slightly worse (within error range) for upper contexts.

More results at: http://contextarena.ai

vivid oyster May 6, 2025, 8:47 PM

#

Dude

#

This is so f annoying

#

I have to keep switching between beta lm arena and alpha lm arena and the normal one

#

Because my conversations keep doing this

keen beacon May 6, 2025, 8:48 PM

#

are u trying to use gemini on arena?

vivid oyster May 6, 2025, 8:48 PM

#

No

#

O3

keen beacon May 6, 2025, 8:48 PM

#

oh

#

ok

small haven May 6, 2025, 9:03 PM

#

worthy thunder Context Arena: Added in Gemini 2.5 Pro (0506) to the 8needle test. Performs bet...

could it just be caused by randomness? they look pretty similar visually

worthy thunder May 6, 2025, 9:13 PM

#

small haven could it just be caused by randomness? they look pretty similar visually

It's within the error margin. When I rerun all of the tests it usually doesn't change by more than 1% per bin (usually less than 0.5%).
I basically consider it the same, except with improvements below 128k context

lime coral May 6, 2025, 9:50 PM

#

blazing coyote New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBatt...

this is fake. they didn’t post eval.

keen beacon May 6, 2025, 9:51 PM

#

lime coral this is fake. they didn’t post eval.

no lol. https://deepmind.google/technologies/gemini/pro/

Google DeepMind

Gemini 2.5 Pro

Gemini 2.5 Pro is our most advanced model for complex tasks. With thinking built in, it showcases strong reasoning and coding capabilities.

lime coral May 6, 2025, 9:57 PM

#

My bad

hollow ocean May 6, 2025, 10:12 PM

#

Grok 3.5 in 2 hours

#

We’re so back 🔥

sage raptor May 6, 2025, 10:13 PM

#

source ?

keen beacon May 6, 2025, 10:13 PM

#

asi incoming 🔥

hollow ocean May 6, 2025, 10:14 PM

#

sage raptor source ?

Sec I gotchu

#

💯

torn mantle May 6, 2025, 10:53 PM

#

hollow ocean Grok 3.5 in 2 hours

1h

ocean vortex May 6, 2025, 10:54 PM

#

calm sequoia You can't expect it to identify lithuanian language from 3 words

You can and you should. I even retried this several times always with the same outcome. Dense 32b model understands it immediately. but this just needs for you to explicitly say and clarify it like 5 times for it to understand when thinking is disabled.

balmy mist May 6, 2025, 11:26 PM

#

torn mantle 1h

its here?

torn mantle May 6, 2025, 11:33 PM

#

balmy mist its here?

2h

mild galleon May 6, 2025, 11:37 PM

#

No fake news please

calm sequoia May 6, 2025, 11:43 PM

#

If you really think about it, you'll realize Grok 3.5 was in your heart all along.

mild galleon May 6, 2025, 11:46 PM

#

Wtf is this new cult with grok 3.5

calm sequoia May 6, 2025, 11:48 PM

#

You'll get it when you're older.

torn mantle May 6, 2025, 11:53 PM

#

mild galleon Wtf is this new cult with grok 3.5

asi

small haven May 6, 2025, 11:57 PM

#

grok 3.5 is just to satisfy elon ma first principles, he dont care bout da other tings

hollow ocean May 7, 2025, 12:03 AM

#

xAI livestream in 30 minutes

ember rapids May 7, 2025, 12:38 AM

#

I’m hearing grok 3.5 has 100b context window and saturates arc agi 2

balmy mist May 7, 2025, 12:41 AM

#

yeah yall need to chill with the fake news, brings down the prestige of this channel lol

wintry tinsel May 7, 2025, 1:14 AM

#

They will release API a few months later

wintry tinsel May 7, 2025, 1:19 AM

#

hollow ocean xAI livestream in 30 minutes

is this true?

hollow ocean May 7, 2025, 1:20 AM

#

wintry tinsel is this true?

Yes it’s been confirmed by Elon

keen beacon May 7, 2025, 1:36 AM

#

Is anyone's aistudio broken lol the thoughts are messed up

#

It's like a summary and the thoughts are merged

blazing rune May 7, 2025, 2:08 AM

#

keen beacon Is anyone's aistudio broken lol the thoughts are messed up

Same

elder rapids May 7, 2025, 2:41 AM

#

keen beacon Is anyone's aistudio broken lol the thoughts are messed up

ye it's weird

#

its like two separate thought processes

#

it gets to the conclusion which I don't know how

#

it's really strange

#

there are out of context statements

#

and followups that don't even make sense

#

I think they messed up something with how it's passed forward, could be the app summary mixed in?

elder rapids May 7, 2025, 3:01 AM

#

tbh this is getting pretty weird, seems like what makes its context recollection so good is its thinking

#

but its recollection can't be good

#

because the output is messed up

#

it's omitting things it intends to point out in its thinking process and reiterating things it already established, prob relying on the base model to infer the initial prompt and to piece it all together so there's more potential for insufficient information

#

i didn't see this at all a few hours ago

verbal nimbus May 7, 2025, 3:24 AM

#

:O

#

First time I've seen a non-Anthropic model in the top spot

leaden palm May 7, 2025, 3:42 AM

#

hollow ocean Yes it’s been confirmed by Elon

do you like trolling

#

(if this genuinely isnt go https://manifold.markets/KTibow/which-day-of-may-will-grok-35-be-re)