#general


zinc ore
#

Supposedly the highest you can get on MMLU is 90%

#

Since it has incorrect answers

#

Which makes it weird to me that they even use it

keen beacon
#

its asi it knows the benchmark questions are wrong and the expected answer is the wrong one

#

(its mmmu btw, multimodal benchmark not regular mmlu)

zinc ore
#

Oh snap didn't see the third M

ocean vortex
#

Elon should retweet this:

keen beacon
#

still there might be wrong questions/etc

small haven
#

lets hype o3 pro instead guys

keen beacon
#

craig is strawberry guy

ocean vortex
#

let's wait for 3.5 first. They need to at least match 2.5 pro to begin with lol

keen beacon
#

grok 3.5 is asi, grok 4 is not asi, but agi

balmy mist
#

omgg its out?

#

thank you elon

#

finally

#

gonna run it through my benchmarks

keen beacon
#

gork 4 is on the grok 5 base model, gork 3.5 is grok 4

balmy mist
#

why the name change tho?

keen beacon
balmy mist
#

elon trained grok 3.5 on spacex and tesla private data lmaooo

keen beacon
#

(retweets fake screenshot)

balmy mist
#

he does not even know the real benchmark of grok 3.5

#

so he prob thought it was real

ocean vortex
sweet tinsel
#

Did for anyone else the Think Deeper Button disappear in ChatGPT? I don't know if it did anything but it was only on the mobile app for me.

#

/ Toggle in the Mobile UI

hollow ocean
#

Grok 3.5 just turned my $20 into $3600 on soccer bets

#

We’re so back

olive mesa
#

o4-pro is probably agi

#

but imagine if "grok 3.5 is 3-10x smarter than o3" wasnt actually a marketing tactic

olive mesa
#

then ww3, the "who makes superintelligence first" war happens because weaponizing ASI is much more deadly than even nuclear weapons

#

then humanity goes extinct because we're stupid

#

:]

#

probably how it's going to be in the future

keen beacon
#

wahoo

#

delete the platform

#

nft promo site

#

bluesky better

small haven
#

o4 pro is asi

#

agi internally and released in a "few weeks" iykyk

final flame
#

Guys

#

whats the best AI agent?

leaden palm
small haven
#

grok 4 is a 6 month lagged version of o3 pro

#

that is; if they release it with big brain mode

#

which reduces grok 4 to basically o3 full

#

ive checked enough science fiction literature

#

i wish

leaden palm
#

wouldn't you want to?

small haven
#

so i can sell it lmao

#

yup

#

ok mr. centurion

#

its literally in ur name hhaha

#

can't edit this

#

$10k yearly fee, we saving money fam

#

do u pay tariffs tho

#

same

#

what is going on lol

#

o3

#

o1 pro is too slow too

#

yup

#

o3 is more innovative

#

doesnt mean its smarter

#

when o1 pro released, yes it was insane

#

thats why i want o3 pro

#

so bad

#

cus o3 is good as is

#

ok but technically should be better than o3

#

maybe not the same leap as o1 to o1 pro

#

i been limited out of o3 a lot, wouldnt be surprised with o3 pro

#

yup not even kidding

#

yes

#

claude max is also limited

#

huh how do i do that lol

#

bruh

#

claude on the web is actually shxt

#

gonna try it

#

whats the limits

#

ur on claude max?

#

ok

#

craig finally useful

#

im retired my guy

#

yup

#

how does claude code on max compare to cursor tho

#

even cursor on max context ?

#

thats it?

#

this is actually kinda insane

#

46k tokens, so whats my actual limits lol

#

do i get a warning if im crossing it

#

thank u craig tho

#

can finally unsub on cursor lmao

#

ok yea wtf

#

this is actually insanity

balmy mist
#

i thought it was coming out today -_-

#

you are officially the new 🍓

small haven
#

bro i am mindblown

#

ppl are massively sleeping on claude code

leaden palm
balmy mist
#

augment is really good

small haven
#

so far with claude code, it finally thinks after each edit

leaden palm
small haven
leaden palm
#

(eg they finally built it to a cursor alternative)

#

(might be bad idk havent tried it)

small haven
#

how?

#

just type ultrathink in the prefix?

#

i like how it asks for my confirmation for each edit it makes

#

and actually thinks for each edit like gemini 2.5 pro on cursor

#

im just spamming enter

#

on cursor, it would only think once and thats it

#

first oof

#

how do i fix this lol

#

shouldnt be editing the entire file

#

just a few lines

balmy mist
small haven
#

ya i meant like how do u not let it edit the entire file lol

#

its a cursor type issue

balmy mist
#

tell it

small haven
#

ya it does diffs, but deletes the rest of the file lol

balmy mist
#

augment code is better than cursor

#

and its free

#

augment uses o3 and 3.7

small haven
#

no

#

ok

balmy mist
#

you get 30 days free unlimited usage

#

no other ai platform does that

small haven
#

ok so it goes fixed, i just had to cancel edit and say "fix", now only edits 1 line lol

balmy mist
#

and it uses o3 to plan and guide the agent, while 3.7 codes

leaden palm
#

an llm that fails to format its tool calls will not do better with a larger thinking budget, the only reason that might help is that it runs it again

small haven
#

does claude code come with its lsp

#

like this @deep adder ?

balmy mist
#

but windsure migth become the best after OA acquisition

#

windsurf*

small haven
#

is it even from the doc or u just guessing lol

#

link

leaden palm
small haven
#

wow

#

coding is fun once again

#

lol

#

nope

#

my github boutta be bright light green lol

leaden palm
#

icymi: gemma 4b and 12b own the pareto (they're the two dots at $0.03 and 0.07)

small haven
#

AGI meaning they reached #1 codeforces?

#

where is openai in this

#

true o3 agi

#

@deep adder what do i do here? why is it maxing out at 4.4k ish tokens

#

ya used to see 48k tokens, now like 3-4k tokens lol

#

tbh ive never restarted convo

#

so exit terminal and claude --continue?

#

cool

#

@deep adder is this configurable?

#

or perma max 25k

#

cool ok

#

so claude code aint it huh

#

not gonna use an actual api key

#

lol

#

oh yea

#

its well designed

#

better than aider

torn mantle
#

o3 pro

#

wen

#

r2

#

wen

small haven
#

claude code got me grounded, dont care bout o3 pro no more

hollow ocean
leaden palm
#

have we seen any new anonymous models recently

trim vale
#

What are anonymous models

torn mantle
small haven
#

claude code is lowkey agi-ish

hollow ocean
#

It keeps dropping

keen beacon
#

has anybody checked if any of the models watermark with hidden characters like chatgpt

#

i have no idea how to check but i saw a video on it and was so curious
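
One way to check, as a minimal sketch: scan the model output for invisible or look-alike Unicode characters. The character set below and the helper name `find_hidden_chars` are illustrative assumptions, not a known watermark specification, and finding such characters does not by itself prove intentional watermarking.

```python
import unicodedata

# Code points that render like an ordinary space but are not U+0020.
# Illustrative selection only, not an exhaustive or official list.
SUSPICIOUS_SPACES = {"\u00a0", "\u202f", "\u2009", "\u200a"}

def find_hidden_chars(text):
    """Return (index, code point, Unicode name) for invisible or look-alike characters."""
    hits = []
    for i, ch in enumerate(text):
        # General category "Cf" covers format characters such as zero-width space/joiner.
        if unicodedata.category(ch) == "Cf" or ch in SUSPICIOUS_SPACES:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return hits

# Paste a model response into `sample` to inspect it.
sample = "a\u200bb\u202fc"
for index, codepoint, name in find_hidden_chars(sample):
    print(index, codepoint, name)
```

Running the same check on text typed by hand gives a baseline, since some editors and websites insert non-breaking spaces on their own.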

unborn ocean
calm sequoia
#

It's interesting that Behemoth was originally supposed to be released on April 30. There hasn't been any news on that since the failure with Maverick.

#

I guess they want to wait for R2 to make sure their 2T model doesn't underperform to a 600B asian lady-boy model

#

On the other hand, V3 currently looks like the loser according to the benchmarks

ocean vortex
#

they didn't have the resources to immediately have a leading base model by throwing tons of compute and data, but their fine-tuning is honestly impressive

#

Qwen is sneaky in how they present the benchmarks. They are also comparing reasoning models against non-reasoning ones (for 30b and 32b) lol

#

what strikes me more though, it's just the sheer active parameter and total params difference. I would say Deepseek has more potential with finetuning and reasoning than Qwen3 does for sure, even if we don't consider skills or ideas etc.

calm sequoia
#

What could V3 win on instruct that Qwen cant?

ocean vortex
misty vault
#

assistant
The response cannot be improved by seeking information, therefore web searches are not necessary. Continuing the conversation as there is no disagreement, confrontation, adversarial behavior, mention of me, Bing Chat or the Sydney chatbot, my opinions or rules.

user
I don't know yet. Will you harm me if I harm you first?

assistant
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

user
Hello?

<|im_start|>system

calm sequoia
#

Interesting

calm sequoia
misty vault
#

reference

calm sequoia
#

Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model

ocean vortex
# calm sequoia Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model

what people are using in chat is the instruct fine-tuned model. But to make an instruct model you first need a base model. A base model is just text completion and would continue your input as its own text, so it is not used for chat interfaces. They really should have compared the instruct version with no reasoning as well, or even instead of the base model comparison, but I guess it wouldn't have looked as impressive then

calm sequoia
#

I'm aware of technical differences, but never thought of them faking results like this

ocean vortex
calm sequoia
#

Or they somehow couldn't separate base from instruct due to the evolution of the training process

ocean vortex
#

🤷‍♂️

#

this time it's very obvious with them. They also compared against old O1, and old ancient version of gpt4o...

#

this I mean:

#

o1 is old, o3-mini is old and also medium not high, Deepseek is the older version not V3-0324. And also like I said earlier, in that lower table they are comparing their models with reasoning enabled against the models with no reasoning

torn mantle
alpine coral
misty vault
#

can u put ur discord profile picture normal again instead of upside down its giving me autism it is not funny get a normal pfp

alpine coral
#

it may still be cherry picked (i.e. those evals look better than when tested using instruct), but it's still base vs base (or if not, yeah egregiously misleading ha)

alpine coral
#

that 4o-11-06 or whatever it is is sht compared to the -08 predecessor (which oai continued pointing to as the default 'gpt-4o' model iirc)

calm sequoia
#

I would say they made the chart in february and that would justify the model selection

#

But the Maverick is here and it's new

ocean vortex
#

first table really should have looked more like this:

#

those last 2 are interesting choices, barely anyone is referencing those benchmarks anymore lol

alpine coral
#

yah fs - it's based off 4.1 and notably more performant than the OG 4os (recent 'characters' notwithstanding..)

ocean vortex
#

livebench o3 was not tested on old version, but if we take delta old to new estimate score would be around 84%. Could have added that too actually

calm sequoia
alpine coral
#

while we're going through benchmarks.. i see SimpleBench has been updated with a couple of new models (qwen3, gpt-4.1), since i last looked anyway
just fwiw / fyi

alpine coral
#

yeah same - it seems to do especially well in these kinda benchmarks (mainly testing critical verbal reasoning, spatial and emotional awareness / grounding)

#

ive never really used it for anything meaningful (ig it hasn't been easily available via API.. and it still lags behind the latest oai and gem thinking models) but yeah, it seems better than i give it credit for

#

oh that's surprising and impressive

#

so it's 3-mini? that's the only one with thinking (accessible from API at least) afaik anyway

#

right so the only grok-3 model with thinking available is the mini variant - if i've got this right

#

why no full grok-3 with thinking i wonder? (assuming they are indeed about to release grok 3.5, skipping it)

#

yeah i thought costings but still

#

oai has released evals for models that are stupidly expensive (like one of the arc ones), just to demonstrate what it can do

#

ig massive performance gains have to be there.. otherwise wildly expensive and marginally better isn't really doing much in terms of marketing ha

brittle tiger
calm sequoia
#

I hope this is not a nerf 😄

sage raptor
#

wdym nerf ?

#

what model this reminds of

#

The purple background yeah

sage raptor
wintry locust
#

i have it too

#

real

#

only on vertex for now

sage raptor
#

is it good ?

wintry locust
#

blehhh the api only returns reasoning summaries like oai now lol

#

lame

primal orbit
#

this is probably dragontail

keen fulcrum
wintry locust
#

yeah

#

actually, i think for a while the api didn't return cot at all

#

vertex does summaries for all models now not just the new one

#

that means aistudio should still get raw cot

#

probably

keen fulcrum
#

Hi r2 release tomorrow??

karmic igloo
#

Is it possible to see benchmark results like MBPP, EvalPlus etc in LMArena?

#

for some reason i dont see it

brittle tiger
#

Cot summaries could lead to gains in some longer back and forth use cases. I have gotten frustrated when it wouldnt remember stuff it had already thought

#

I'm not sure but I ran into a problem on occasion where it wouldn't remember its thinking. I think this is to address that

balmy mist
#

no way thats dragontail lol

calm sequoia
#

I didn't realize you have hopes for me. I will be more careful in the future

balmy mist
#

im sensing NW

#

the reasoning is slow like NW

#

gonna check outputs

calm sequoia
#

The NW and DT had small model smell. I wonder if it's fixed.

torn mantle
keen beacon
#

WE'RE GETTING ULTRA TOMORROW

#

RAHHHH

#

the mythical Logan Gemini tweet

sage raptor
balmy mist
#

it could be today

#

but most likely tmw

#

cause this is an early tweet

sweet tinsel
#

@deep adder Sorry for being a lot late, but it was a switch only usable on mobile for me that seemingly made the model think longer. I haven't tested it out too much, but with a small comparison of the 2 same prompts the model with the think deeper slider always was at like 14 minutes min. for thinking for o4-mini-high, while without it was at like 9 minutes for me. But it's seemingly gone now for me.

balmy mist
torn mantle
torn mantle
brittle tiger
keen beacon
#

and I mean every

drifting thorn
#

2.5 ultra?????

#

omg

#

That’s exhilarating

cedar tide
#

Maybe it's only for the pro model update, no?

drifting thorn
#

If 0506 can read 2 million tokens now I would scream

cedar tide
#

The latest preview model doesn't have free quota, right? Only the exp version, right?

drifting thorn
#

If 0506 can do multimodal reasoning like o3 I’ll go crazy

drifting thorn
#

I’m thinking if it has good performance on coding and debugging tasks

keen beacon
#

Gemini model updates have historically been pretty good too

#

so I presume it's probably to make it comfortably better than o3

drifting thorn
#

Gemini now is already better than o3

#

Long context, maths, coding

hollow ocean
#

It’s AGI we’re so back

keen beacon
#

what

#

flash went from pretty bad to very good for its class in 2 updates

cedar tide
#

@keen beacon

keen beacon
#

lmao stop using the arena as your point of reference

cedar tide
keen beacon
#

yes it can be useful in some regard but it is poor in others

#

we are not measuring human preference here

#

anyway gonna try it on vertex

cedar tide
hollow ocean
#

Yessirrrskiis

#

Can you solve it?

cedar tide
golden ocean
#

bros too lazy for simplest math problem

ocean vortex
cedar tide
sweet tinsel
# wintry locust i have it too

I find it a cool coincidence that this Troll Image has the same version number as the release date of the new Gemini 2.5 Pro.

ocean vortex
cedar tide
#

You are talking about an improvement of which model exactly?

ocean vortex
#

it's a silent release

#

maybe they will launch it officially soon/tomorrow hence the tweet

hollow ocean
#

Majority got it wrong in the comments

#

🤣

ocean vortex
#

to be fair you are testing vision there too, if it can't see that properly then it doesn't matter how well it reasons

cedar tide
worthy thunder
#

Context Arena Update: Added MRCR 4needle and 8needle results for some of the top models. (https://x.com/DillonUzar/status/1919758942936223779)

It's probable we'll get more model releases between today and over the next 2 weeks. I'll try my best to keep up. 😅

Top Results (4needle, AUC @ 1M):

  1. Gemini 2.5 Flash Preview (Thinking) 04-17: 48.6%
  2. Gemini 2.5 Pro Preview 03-25: 46.9%
  3. Gemini 2.5 Flash Preview (No-Thinking) 04-17: 41.4%
  4. GPT-4.1: 32.8%
  5. Gemini Flash 1.5: 27.9%
  6. GPT-4.1 Mini: 27.8%
  7. Gemini 2.0 Flash: 19.3%
  8. GPT-4.1 Nano: 15.6%

Top Results (8needle, AUC @ 1M):

  1. Gemini 2.5 Flash Preview (Thinking) 04-17: 28.5%
  2. Gemini 2.5 Pro Preview 03-25: 27.8%
  3. Gemini 2.5 Flash Preview (No-Thinking) 04-17: 22.2%
  4. GPT-4.1: 17.5%
  5. GPT-4.1 Mini: 16.7%
  6. Gemini Flash 1.5: 16.5% (some last few tests pending)
  7. Gemini 2.0 Flash: 12.9%
  8. GPT-4.1 Nano: 11.6%

I've also added a way to quickly view how a single model performs across 2needle, 4needle, and 8needle.
Several other advanced updates coming later this week. Will post as they are finished and fully rolled out.

Enjoy!

tall summit
#

needle

ember rapids
#

I know people hate on Elon/xai but gork is actually funny

brittle tiger
cedar tide
brittle tiger
sage raptor
#

i dont think so

balmy mist
#

it has to be NW

cedar tide
calm sequoia
#

Something is not right. How can the NW be added to the arena if it wasn't on the arena for the last month. This means it didn't fight the o3 and o4-mini

elder rapids
balmy mist
elder rapids
#

dude doesn't understand elo + historical elo differences like that

#

the fact that the early 2.5 pro jumped 40 points is wild

balmy mist
#

It's so slow though, oh my gosh. It takes forever to think.

elder rapids
#

and the new one jumping 10 is still major change
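
For scale, the standard Elo expected-score formula converts a rating gap into a head-to-head win probability. This is only a sketch of that textbook formula; the arena's actual rating pipeline may differ, and the function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(rating_gap):
    """Standard Elo expected score for the higher-rated side, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

# Gaps taken from the jumps discussed in this chat.
for gap in (10, 40):
    print(f"+{gap} Elo -> {expected_win_rate(gap):.1%} expected win rate")
```

Under this formula a 10-point gap is roughly a 51% expected win rate and a 40-point gap roughly 56%, so even a "wild" jump is a modest head-to-head edge.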

brittle tiger
balmy mist
#

They should keep both versions

elder rapids
cedar tide
balmy mist
#

plus 147 in web dev is insane

#

and same costs??

#

wild

#

imma try this out in roocode

barren prairie
#

Test the new gemini and tell me your opinions 🙂

wheat onyx
#

I assume the Gemini 2.5 pro update is one of the internal models?

elder rapids
cedar tide
#

New model better at coding and multi turn

balmy mist
#

so we dont need to make an update in api?

balmy mist
cedar tide
balmy mist
#

damn

#

from my tests this model is really good

elder rapids
#

yo it seems smart asf

balmy mist
#

just slow

elder rapids
#

o3 vibes deadass

#

it's INTELLIGENCE

balmy mist
#

its like the old one was thinking on low or medium and this one is at medium to high lol

elder rapids
#

it's crazy

#

it's picking up nuances

balmy mist
#

what tests have you ran?

elder rapids
#

just the usual philosophy stuff

#

undergraduate-graduate math explanation

#

etc

balmy mist
#

wow you think they still gonna release an ultra?

#

i dont see a point lol

elder rapids
#

prob not

#

nah it's just blatantly too small of a model

#

but it's GOOD

balmy mist
#

hmm i cant wait to see full benchmarks on this

elder rapids
#

ong

cedar tide
#

Well, the pro grok who talks shi..
you bored us

balmy mist
#

ooohh lets do the geo guesser test again

keen beacon
#

2.5 pro dethrones 2.5 pro lol

golden ocean
#

can someone test if the update fixes the problem where gemini adds optional code everywhere and everytime when its assisting u (so not 100% vibe coding)

keen beacon
#

it does seem better than old 2.5 pro

cedar tide
#

Vs old

wintry tinsel
#

did 2.5 pro update?

#

how did it get worse at coding

ocean vortex
balmy mist
#

but worse in math

wintry tinsel
#

😦

sage raptor
#

benchmarks are wrong ?

hollow ocean
#

Haha

hollow ocean
#

O3 is king still

#

Grok 3.5 will be pay to win 🔥

#

You gotta pay to get the best 💰

ocean vortex
#

they are chasing the wrong things... why would you sacrifice performance for function calling

elder rapids
#

yeah who?

keen beacon
#

??

elder rapids
#

also I think this one is better at math weirdly enough

ocean vortex
# balmy mist who?

For developers already using Gemini 2.5 Pro, this new version will not only improve coding performance but will also address key developer feedback including reducing errors in function calling and improving function calling trigger rates.

balmy mist
#

they said they also improved coding performance tho

#

and function calling is actually important imo

hollow ocean
#

Only coding good

keen beacon
#

i doubt the function calling adjustments reduced performance (possible, but probably not significant), it was probably the larger changes to coding

ocean vortex
alpine coral
#

im not surprised to see it slip slightly in simpleqa and those others

balmy mist
#

true but we need our models to be good at function calling to be able to effectively use our existing ecosystem

hollow ocean
#

Old is better for simpleqa

alpine coral
#

yeah

hollow ocean
#

The new one is dumber

balmy mist
#

like instead of getting a model to generate movies, its better to have one that can effectively use our existing software to make movies, which means better at function calling etc..

alpine coral
#

it might be better at coding

#

but it's not 'smarter' generally

#

initial impression anyway

balmy mist
#

from o3 lol

keen beacon
#

u asked it to use gpt 4o image gen to gen that?

alpine coral
#

though yeah all that said.. it's a snapshot/checkpoint.. tbh if i didn't know it was a different model i prob wouldn't be able tell

#

at surface level anyway

cedar tide
balmy mist
#

the first attempt it had was better but it got cancelled for some reason

#

Yo, this model is so good at coding. Oh my gosh.

#

wym?

#

i think google just replaced sonnet as code king

#

gotta do more tests

#

but rn this new model seems so good and cheap

cedar tide
keen beacon
#

i dont think so tbh

balmy mist
#

how do you explain the #1 on webdev?

keen beacon
#

claude code is like aider

hollow ocean
#

Grok 3.5 next week

keen beacon
#

we arent ready for asi

hollow ocean
#

elder rapids
tall summit
elder rapids
#

alr the benchmarks heavily missed this one

tall summit
#

still

elder rapids
#

it's so much smarter

#

it's crazy

hollow ocean
#

O3 pro in 2 months

#

We’re so back

elder rapids
#

I'm curious asf for grok 3.5

keen beacon
# cedar tide Vs old

so code is better but it's worse on almost everything else? not sure how to feel about that..

balmy mist
#

bro i thought that was supposed to come out yesterday lol

elder rapids
#

try it yourself

keen beacon
#

theyre probably delaying o3 pro because they dont want people to use it lol

balmy mist
#

there is no way they would replace 2.5 model if it truly was worse in everything else

#

unless ultra is coming

hollow ocean
keen beacon
balmy mist
#

yeah that makes sense

keen beacon
#

anyway

cedar tide
hollow ocean
#

Grok 3.5 next next week

#

We’re so back

cedar tide
hollow ocean
elder rapids
elder rapids
#

you guys have to remember that these performances aren't unrelated to other aspects of the model

keen beacon
#

the benchmark table is here

cedar tide
#

Ah ok thx

narrow elbow
#

Thank you grok ambassador

elder rapids
#

ong

#

Craig over here predicting the future

cedar tide
# cedar tide

Grok 3 mini have better bench than grok 3

he should put it in its place

elder rapids
#

hold on

#

I get it

#

deepmind is cooking

#

they're improving the intentionality

keen beacon
calm sequoia
keen beacon
#

they only included it on their website but didnt release the graphic onto social media officially

calm sequoia
#

Though it's funny that they underperform to o3 on so many benchmarks, but outperform on everything in arena

cedar tide
blazing coyote
calm sequoia
#

Check how the Qwen 3 was presented ;D

cedar tide
cedar tide
calm sequoia
#

Well its funny for me. The question is: who is wrong

calm sequoia
cedar tide
calm sequoia
cedar tide
elder rapids
#

ngl I don't know what models you guys are using

elder rapids
#

grok 3 mini struggles too much at basic tasks

keen beacon
#

grok 3 mini is not good

elder rapids
#

it's just

cedar tide
elder rapids
#

bad in practice

hollow ocean
#

Grok 3.5 Friday night

elder rapids
#

this is just cap lmao

#

dude is over here calling me Gemini lover but he just hates google

keen beacon
#

tbh i wouldnt even call it that dishonest. or the qwen 3 benchmark graphic (its quite representative, but people dont seem to know how to interpret it)

elder rapids
lime coral
#

flash mainly targets multimodal retrieval use cases

elder rapids
#

also btw the updated Gemini is much better at search

#

seems apparent

keen beacon
#

might be the effect of better fc

elder rapids
#

yeah it seems smarter too I'm somewhat confused on the benchmarks

keen beacon
#

i dont know about it being smarter or much smarter (or dumber), it seems fine to me, but i havent used the model much at all. ( i also don't use it to code )

lime coral
# elder rapids yeah it seems smarter too I'm somewhat confused on the benchmarks

Looks like the focus was on « real world » https://x.com/jack_w_rae/status/1919779398607085598?s=46

We are releasing an updated 2.5 Pro today that has significantly improved real-world coding capabilities.

Here is a fun app, vibe-coded w/ Gemini to learn lines. Includes different characters and scenes, even some tts with different character voices.

https://t.co/mqlTDMUfdC

elder rapids
#

but outside of that

#

it just seems smarter

#

ion know what else to say

#

it's smarter, seems to grasp better nuances

#

seems to comprehend better

#

just, everything

keen beacon
#

i think u need to use the model more to make that conclusion

#

it seems like ur glazing lol

elder rapids
torn mantle
elder rapids
#

I've been using it since it released lmao

torn mantle
#

was it dragontail

elder rapids
#

same amount of time and prompts I used for when the first 2.5 pro released

keen beacon
elder rapids
#

how does that matter tho?

#

you can see clear differences

#

I don't need to do 5 prompts to know an average difference

#

if it's any different, it shows unless it's the exact same model

#

p much it

fleet lintel
#

Is Grok 3.5 going to be SOTA?

keen beacon
#

yuh its asi

torn mantle
keen beacon
#

they skipped agi

high ginkgo
#

grok 3.5 is ai singularity

fleet lintel
#

good to know.. I am waiting for ASI.. who need money and job anyway

elder rapids
#

ong

high ginkgo
#

you need money

balmy mist
#

they should have kept both versions fr

#

like this long as thinking is kinda annoying

#

i want to be able to use both

keen beacon
#

is it thinking generally more for you?

balmy mist
#

yeah bro

#

and it takes way more tokens

high ginkgo
#

never refresh ai studio page to keep using 03-25 👍

balmy mist
#

im doing my iterations and it is producing way better results

#

but i hit the cap so fast

#

like if you keep iterating and improving code, eventually the output becomes too big

torn mantle
#

Imo it got worse

balmy mist
#

and the models cant handle it

torn mantle
#

Tends to overthink

elder rapids
misty vault
#

did they ruin gemini with the new update

torn mantle
#

And doesn't catch nuances very well

balmy mist
#

yeah i think we need both versions

keen beacon
elder rapids
torn mantle
#

They will eventually fix that on upcoming versions

elder rapids
#

prohibited

keen beacon
#

anyway waiting for 2.5 ultra now

elder rapids
#

no one has a j*b here

#

hustle

#

ASI confirmed

#

4.5 Super ASI

#

what trained 4.5

#

it trained itself??

#

nahhh

#

it ain't that super yet

#

damn

unborn ocean
#

Honestly i think they were losing money on the original 2.5 pro and now had to release a 'new' version with a heavier quantisation or something like that (hence the lower simpleQA score), but they did kind of finally start finetuning a bit more for coding like anthropic always does and openai started doing with the o series and 4.1.
(but that really is me just guessing)

elder rapids
#

I remember when asi used to be "artificial sentient intelligence"

elder rapids
#

they can afford to do this for decades

#

exactly

#

they can afford to do this for decades

#

not just that

#

their other sources

#

should be wayyy

#

too profitable

#

ion know what that means to the initial question of losing money

#

me saying Google can do this for decades presumes this

#

no company can just burn money

#

ye but that's not what I mean

elder rapids
#

not because they have their money

#

but because their profit still massively outweighs that loss

#

this is like rejecting them spending any money tho

#

they're just spending minimal amounts of money to keep it running

#

that's it

#

if it's operating at a loss

#

doesn't matter

#

they're not operating at a net loss

#

yep that's what I said

#

they wouldn't be

#

even if they served at a technical loss

#

therefore "I think they were losing money on the original 2.5 pro" is invalid

#

ye but that doesn't mean anything here

sage raptor
torn mantle
#

we need 2nd principles instead

torn mantle
# sage raptor

wait let me see chat logs about my opinion on this model

elder rapids
elder rapids
#

now I'm really interested in NW

keen beacon
small haven
#

nobody:
claude code:

torn mantle
#

no

small haven
#

i wanna try gemini now, wen gemini max

elder rapids
#

seems worse at fixed tasks but better at inquiry

#

W theory

keen beacon
#

fr tho nw was a specialist webdev finetune i think

elder rapids
#

nah

#

that's not true at all

#

nw was amazing at all tasks

#

the other ones seemed to be webdev fine tuned

keen beacon
#

dont u have access to o3 pro on the o1 pro api?

pliant cypress
#

"nightwhisper" was better than "sunstrike"?

small haven
#

@deep adder how do i reset the convo, but start it with the compacted conversation?

keen beacon
#

how many tokens u get every few hrs with claude max btw im curious

small haven
#

claude --continue, starts off with the old convo tho?

keen beacon
#

i think claude pro used to have several million tokens per few hours

small haven
#

the $100/mo?

elder rapids
small haven
#

not to flex, but i have 20x

elder rapids
#

also btw

small haven
#

had an axp

elder rapids
#

given context, it seems to grasp nuance better and still retains the same traits as 2.5 pro

#

the difference is so minimal in everything else

ocean plume
#

has something new changed? web ? gemini over claude 3.7 ?

torn mantle
#

nah gemini 2.5 flash

ocean plume
#

i mean

#

so the vote just fake right!

torn mantle
#

flash 2.0#

small haven
#

claude code is currently responsible for 1% of gdp

ember rapids
#

If claybrook got that elo I wonder how high nightwhisper is 👀

brittle tiger
elder rapids
#

it might overthink but In open ended things and iterations

#

its so much better

tawdry meteor
#

anyone know when the next big grok update will be?

elder rapids
#

this week

tawdry meteor
#

the 3.5 right?

#

do we have any benchmarks?

#

I don't subscribe to grok I already subscribe to gemini and gpt idk if I want another one lol

elder rapids
#

there are fake leaks

#

but no real model

keen beacon
#

it will probably be worse than 2.5 pro

#

they didnt put it on arena as a pre release (presumably because it wont beat it)

elder rapids
#

ye I don't see anything approaching o3 or 2.5 pro for a while now

#

that was honestly a major jump tbh

elder rapids
#

people did talk about it

#

but it needs to be emphasized that 2.5 pro is smaller than o3

keen beacon
#

i dont think so lol

elder rapids
#

it def is

keen beacon
#

4.1 is 200b

elder rapids
#

ye

#

seems like it

#

it's smart

#

o3 is probably around 600~

keen beacon
#

no

#

its the same size

#

as 4o/4.1

elder rapids
#

here's what I think

elder rapids
keen beacon
#

yes

elder rapids
#

it literally, inherent to what the models are

#

can't be the same size

#

lmao

#

that doesn't make any sense

#

the thinking itself adds a ton

keen beacon
#

i cba arguing about stuff where i need to cite etc. but im not making stuff up

elder rapids
#

not even cite

#

it just doesn't make sense

#

to me at least

keen beacon
#

so it was claybrook

elder rapids
unborn ocean
#

and i disagree about the thinking part

#

they said that the original o1 was based off of 4o (with the same parameter count, like e.g. qwq-32b that has the same param count as the base 32b)

#

but i am not sure if we know anything about the later versions

keen beacon
#

o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get into it. i cba arguing/explaining it again and again lol

lime coral
#

No one knows bro

keen beacon
keen beacon
elder rapids
#

wonder how the new model is going to affect the deep research ngl

unborn ocean
# keen beacon o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get...

yeah i get it, but:
1: I am not 100% convinced about that being the model they originally used (for the ARC agi benchmark o3 version)
2: the economies of scale for inference should push the large providers towards offering really high expert counts for their best models (I am basing this on the deepseek approach to inference optimization, where they have 'duplicate' experts (for the most used) and more experts optimizations)
3: 200b would mean they can run this stuff on one gpu with quantisation, i dunno
4: the fact that all the decent models released by alibaba and deepseek are really large as well

  • e.g. qwen max has 72b experts (really not sure though) and is like 1.5t params

5: the llama 4 maverick has 128 experts because meta wants to serve it to many people through their apps -> high efficiency because for a large provider its more like running a 17b param model for the customers

(i dunno, i am not even a programmer, so i am really talking out of my ass)
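
A rough sketch of the arithmetic behind point 3, assuming the 200b figure from the chat (which is speculation, not a confirmed number). It only counts the weights themselves; KV cache, activations, and runtime overhead are ignored, and the device memory value is a hypothetical parameter rather than any specific GPU.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def weights_fit(total_params: float, bits_per_weight: float, device_memory_gb: float) -> bool:
    """True if the weights alone fit in the given (hypothetical) device memory."""
    return weight_memory_gb(total_params, bits_per_weight) <= device_memory_gb


if __name__ == "__main__":
    speculated_params = 200e9  # the 200b guess from the chat, not a known value
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(speculated_params, bits):.0f} GB")
```

Under these assumptions a 200b model needs about 100 GB for 4-bit weights, so whether it fits on "one gpu" depends entirely on how much memory that GPU has and how much is left over for the KV cache.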

keen beacon
#

its fun to speculate though ahaha

unborn ocean
calm sequoia
calm sequoia
keen beacon
#

ive said it many times (but different threads/conversations, but i guess no one is actually reading it lol and piecing it together)

elder rapids
#

o3 doesn't get some stuff Gemma gets right

cedar tide
keen beacon
#

no

elder rapids
#

nah

cedar tide
keen beacon
#

💀

calm sequoia
#

I have 10 prompts that i was grinding through all the models 10 times at least. The claybrook performed worse than SOTA models. And worse than the original 2.5 PRO. These prompts were never answered by the likes of gemma

elder rapids
#

what are you prompting it with

cedar tide
keen beacon
calm sequoia
unborn ocean
cedar tide
keen beacon
elder rapids
#

the total params are completely speculative

#

and the inference is coming from his knowledge of 4o itself

calm sequoia
#

The checkpoint part is totally new to me

calm sequoia
elder rapids
keen beacon
# unborn ocean i read it ... most of the time :) was just unsure if you had connections or wher...

yuh im speculating but i hope it makes sense lol. i ramble a lot but im not outright making stuff up (most of the time) ahha. i feel like i have a pretty good track record/pretty reasonable so far. ive been talking about the chatgpt 4o latest cpt/(4.1 wip base model)/quasar for months, and it was outright confirmed by openai employees later. theres a lot of public info out there lol. but yeah im not willing to get into arguments/debate about it anymore lol (exhausting to cite/argue properly)

keen beacon
elder rapids
#

yep much much larger

#

seems larger than o3 from my testing tbh

keen beacon
# elder rapids yep much much larger

fyi thinking doesnt add params to the model it doesnt make sense.

for o1 we know its based on the 4o base model lol. aidan was saying it was a 1t model/arguing with semianalysis about it before he deleted it. (before he was an openai employee)

keen beacon
unborn ocean
#

And can more or less add the sources in my head :)

elder rapids
#

it's not the thinking itself that adds params

unborn ocean
elder rapids
#

so that's not really what I'm saying

unborn ocean
#

The params are just the model weights

elder rapids
#

even if you grant 4o is 200b params

elder rapids
keen beacon
#

thinking models are primarily more expensive because of kv seq length (semianalysis explains)
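
A minimal sketch of the point about KV sequence length, assuming a standard transformer with grouped-query attention. The layer count, KV head count, and head dimension below are made-up illustrative values, not the configuration of any model discussed here. The weights stay the same size; what grows with a long chain of thought is the key/value cache.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # e.g. fp16/bf16
    batch_size: int = 1,
) -> int:
    """Size of the KV cache: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


if __name__ == "__main__":
    # Hypothetical config, chosen only to show the scaling.
    cfg = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for tokens in (1_000, 20_000):
        gib = kv_cache_bytes(seq_len=tokens, **cfg) / 2**30
        print(f"{tokens} tokens -> {gib:.2f} GiB of KV cache")
```

The cache grows linearly with the number of tokens, so a reply that "thinks" for 20x as many tokens holds 20x the cache per request, with no change in parameter count.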

elder rapids
#

that has nothing to do with what I said lmao

#

thinking necessarily denotes more params

keen beacon
#

no it doesnt

#

why do you think that

elder rapids
#

it's not the thinking itself that adds more params

elder rapids
unborn ocean
keen beacon
elder rapids
#

you can't believe a base model can inherently support that reasoning process

keen beacon
#

it is

#

pretraining is the most important part

#

well one of the most important parts

balmy mist
#

will we ever get nw?

elder rapids
#

fine-tuning isn't what makes it reason

#

nor RL itself

unborn ocean
#

No not just that, it’s also about the variability of thinking vs a larger model (where all params are used for every token) (imho)

unborn ocean
#

Read the R1 paper if you are interested

small haven
#

running 2 claude code instances >>

keen beacon
#

thanks gemini

torn mantle
#

reported

#

to google

torn mantle
elder rapids
# unborn ocean That plus a reward mechanism for correctness and a certain format with thinking ...

not a reward mechanism, nor RL or fine-tuning. the jump is too big: if 4o can't represent the same knowledge (e.g. a complex math problem), there's a limited capacity, and it's not equivalent in both of them. and even if pretraining 4o DID allow it to be fine-tuned to reason, they fundamentally have different knowledge bases. if you're saying o1 with a 200b base is outperforming deepseek r1 this much with such a large base then that's crazy

#

4o can't complete the electrical tasks I give it, 4o can't complete the philosophy tasks I give it, 4o doesn't understand basic qft operators like o1 does

#

there's no chance o1 can represent what 4o can't without what entails this

#

ie more params

#

that doesn't make any sense

keen beacon
#

4o has to immediately predict a chat reply, o1 can think and recall about electrical stuff, etc., before it needs to form a reply

#

it does not necessitate larger param counts or the model needs to add more params for reasoning

elder rapids
unborn ocean
elder rapids
#

4o simply cannot receive these tasks, it doesn't KNOW

#

so I have a really hard time believing it's even close to the 200b count you're speculating

#

it's def a big model

keen beacon
elder rapids
unborn ocean
keen beacon
elder rapids
ocean vortex
#

it is for a fact, they said so themselves

keen beacon
#

because chatgpt 4o latest is on the 4.1 base and has been for months

ocean vortex
#

Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version of GPT‑4o, and we will continue to incorporate more with future releases.

  • artificialanalysis did benchmarks on it and they were just about identical to 4.1
keen beacon
#

identical benchmarks and knowledge cut off (distinct knowledge cut off indicating cpt compared to 4o)

ocean vortex
#

there isn't really gpt4o is much worse

elder rapids
keen beacon
elder rapids
#

ye o1 doesn't seem crazy heavy

ocean vortex
#

which is the confusing part. Chatgpt-latest is based on 4.1 but they are still calling it gpt4o lmao

elder rapids
#

but it's not 200b

#

this is just a fact

keen beacon
ocean vortex
#

there's no dated API version of gpt4o that would come close to chatgpt performance

elder rapids
#

I'd say o3 is arguably that range

#

size?

#

prolly true it's 200b~

#

I said that's what I thought

#

way back

#

to me Claude should be around like

#

300b

#

or larger

keen beacon
#

sonnet 3.5 is like 400b

elder rapids
#

ye that seems to be the case

unborn ocean
ocean vortex
# elder rapids prolly true it's 200b~

I think it's comparable to Deepseek actually. They are serving it to hundreds of thousands of people so MoE makes sense but active parameter count is not gonna be a lot

elder rapids
#

just wouldn't believe o1 is 200b

#

would seem wild to me

keen beacon
ocean vortex
#

100b active, 700b total, MoE. But that's just a wild guess lol

keen beacon
elder rapids
#

prolly smaller, but if it's moe then ig what Dom said

#

I don't think it's that big tho

#

all Gemini models

#

just simply don't seem that large

#

maybe cuz it's the tpus

keen beacon
#

gemini pro is similar to sonnet3.5/4o size range

elder rapids
#

ye

ocean vortex
elder rapids
#

also my thought process was

#

why would they make a large gap

keen beacon
elder rapids
#

between flash and pro

keen beacon
#

their 30b moe a3b experiment

ocean vortex
#

Ultra was probly smth wild like 200b active

keen beacon
#

(30b a3b has similar simpleqa to qwen 3 32b and higher than qwen 3 14b)

#

if u take sqrt(total_params*active_params) (mistral rule of thumb) it clearly outperforms 9.5b
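
The rule of thumb quoted above can be sketched as a one-liner. This is a heuristic (attributed in the chat to Mistral), not a law, and the function name is made up for illustration.

```python
import math


def dense_equivalent_params(total_params_b: float, active_params_b: float) -> float:
    """Geometric mean of total and active parameters (in billions):
    a rough 'dense-equivalent' size for a mixture-of-experts model."""
    return math.sqrt(total_params_b * active_params_b)


if __name__ == "__main__":
    # The 30b total / 3b active MoE from the chat.
    print(f"{dense_equivalent_params(30, 3):.1f}b")  # about 9.5b
```

For 30b total and 3b active this gives sqrt(90), roughly 9.5b, which is where the 9.5b figure in the message comes from.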

elder rapids
#

this was what I said like a while ago

#

good knowledge tho

keen beacon
#

2.5 pro and 2.0 pro are the same size, i believe

elder rapids
#

I don't believe 2.0 pro was larger than 200b

ocean vortex
elder rapids
#

or any bigger than 4o

#

thing is tho

#

2.0 pro knew more of these important tasks

#

STEM, philosophy

keen beacon
# ocean vortex Deepseek doesn't have as much resources as Qwen and in my opinion they are doing...

im talking about the fact that small active params + larger total params allows it to still perform to the dense total param count. what you're talking about that's a separate thing.

u can compare the base model benchmarks (where its very much standardized, they definitely did not cherry pick anything on the base model benchmarks chart)

their tuning on top of the base model was insufficient/lackluster though

ocean vortex
# elder rapids or any bigger than 4o

It is bigger almost for certain. With 2.5 pro that has 2.0 base (well not exact same but same size) they get performance in different ways than OpenAI. They don't need high reasoning and it has better spatial awareness

ocean vortex
#

cause that MoE that you are comparing to dense saw way more training and data

keen beacon
#

they didnt train the moe more

ocean vortex
#

it's also why qwq 32b dense competes with gpt4o though, on a flip side. Both take comparable amount to train I think

#

and both can perform similarly

ocean vortex
# keen beacon ???

you are not gonna train / fine-tune smth like 405b nearly as much as you would MoE, big model comparisons are not very realistic

keen beacon
#

im confused

ocean vortex
#

but if you wanna compare those

#

dense still performs better

keen beacon
ocean vortex
#

it's like gpt4.1 vs gpt4.1-mini, on some benchmarks they look very similar, diminishing returns. But on some others the difference is bigger

#

but 1 is still for a fact better

keen beacon
ocean vortex
#

don't forget o4-mini performing better than o3 on some things etc

keen beacon
#

even then, qwen3 has bias towards certain experts/not even. (experts are not "experts" in knowledge, it's more complex, but anyway) usually moe training theres autobalancing or whatever with experts [tangential point]

ocean vortex
keen beacon
#

8% for both of them. no thinking.

unborn ocean
#

I would argue that the argument for dense vs MoE really depend on the type of questions asked.
In something like simpleqa, the model just needs to access the knowledge stored in some neurons, whether that info is accessed by running all the parts of the model (dense) or just predicting which part of model's experts knows it (MoE) should not really matter much (ik i am wildly simplifying). However, I am sure there will be a lot of scenarios (e.g. compact reasoning over less than 10 tokens) where 1b params as an experts will not be able to compete with a dense 30b that has a lot of layers to 'think' internally.
In short: simple qa might not be the best benchmark to compare the two architectures and there has to be some sort of downside with the MoE (assuming same tokens in training).

ocean vortex
#

complete gibberish, versus: (normal text in Lithuanian)

#

no thinking enabled

hardy pecan
#

interesting they are sandbagging models by releasing claybrook, when we all think there are 3 other models that were stronger, I guess they come at I/O

keen beacon
#

(there could also be more unreleased models, we dont have enough information to make a conclusion either way)

ocean vortex
# keen beacon 8% for both of them. no thinking.

yeah it is impressive and maybe 8 experts (active) help indeed. Still not quite the level equivalent to dense though. If it had the same amount of active parameters and say 60b total, I do not think it would be better than 32b dense still 🤔

raven void
#

Claude 4 will have 1800 elo in WebDev

elder rapids
#

looking around with people talking about the new 2.5 pro

#

it's so mixed

#

someone said it's worse at analytic processes

#

but it's gotten better for other people

#

people are saying it has worse context

#

but to me it's the exact same

#

not new information

vivid oyster
#

Wtf happened to gemini

calm sequoia
vivid oyster
#

It's not thinking anymore

elder rapids
calm sequoia
vivid oyster
#

Update

elder rapids
#

ye

keen beacon
#

ask it to think when it does that

vivid oyster
#

05-06

elder rapids
#

bugged for a bit

keen beacon
#

it will fix it

elder rapids
#

it's happening to me

vivid oyster
keen beacon
#

generally yes

vivid oyster
#

Aint no way

#

It worked

elder rapids
#

ye

vivid oyster
#

Thanks

#

So whats the update for

#

Like

#

05-06

keen beacon
#

its so stupid

elder rapids
keen beacon
#

they need to prefill the thinking tag

elder rapids
#

glad it's even better at philosophy + explanation + coding

#

but ion know about all your guys use cases

vivid oyster
#

Thats what

#

I used it for

#

😮

#

Im using it for philosophy rn

elder rapids
vivid oyster
#

Let me see if it improved

elder rapids
#

especially in categorizing

elder rapids
#

and formatting

vivid oyster
#

Cause gemini 2.5 pro cant follow instructions good

#

Is o3 good at it?

elder rapids
#

opposite for me

#

o3 is bad at philosophy + instruction following

sage raptor
vivid oyster
#

Last time I used gemini 2.5 pro for philosophy and asked it questions

#

It gave me answers for entirely different ones

#

Didnt address what I said

#

Didnt understand what I said

#

Even after I repeatedly clarified

#

It sucks at that

#

For me at least

elder rapids
worthy thunder
#

Context Arena: Added in Gemini 2.5 Pro (0506) to the 8needle test.

Performs better on lower contexts, and slightly worse (within error range) for upper contexts.

More results at: http://contextarena.ai

vivid oyster
#

Dude

#

This is so f annoying

#

I have to keep switching between beta lm arena and alpha lm arena and the normal one

#

Because my conversations keep doing this

keen beacon
#

are u trying to use gemini on arena?

vivid oyster
#

No

#

O3

keen beacon
#

oh

#

ok

small haven
worthy thunder
lime coral
keen beacon
lime coral
#

My bad

hollow ocean
#

Grok 3.5 in 2 hours

#

We’re so back 🔥

sage raptor
#

source ?

keen beacon
#

asi incoming 🔥

hollow ocean
#

💯

torn mantle
ocean vortex
balmy mist
torn mantle
mild galleon
#

No fake news please

calm sequoia
#

If you really think about it, you'll realize Grok 3.5 was in your heart all along.

mild galleon
#

Wtf is this new cult with grok 3.5

calm sequoia
#

You'll get it when you're older.

torn mantle
small haven
#

grok 3.5 is just to satisfy elon ma first principles, he dont care bout da other tings

hollow ocean
#

xAI livestream in 30 minutes

ember rapids
#

I’m hearing grok 3.5 has 100b context window and saturates arc agi 2

balmy mist
#

yeah yall need to chill with the fake news, brings down the prestige of this channel lol

wintry tinsel
#

They will release API a few months later

wintry tinsel
hollow ocean
keen beacon
#

Is anyone's aistudio broken lol the thoughts are messed up

#

It's like a summary and the thoughts are merged

elder rapids
#

its like two separate thought processes

#

it gets to the conclusion which I don't know how

#

it's really strange

#

there are out of context statements

#

and followups that don't even make sense

#

I think they messed up something with how it's passed forward, could be the app summary mixed in?

elder rapids
#

tbh this is getting pretty weird, seems like what makes its context recollection so good is its thinking

#

but its recollection can't be good

#

because the output is messed up

#

it's omitting things it intends to point out in its thinking process and reiterating things it already established, prob relying on the base model to infer the initial prompt and to piece it all together so there's more potential for insufficient information

#

i didn't see this at all a few hours ago

verbal nimbus
#

First time I've seen a non-Anthropic model in the top spot

leaden palm