#general
its asi it knows the benchmark questions are wrong and the expected answer is the wrong one
(its mmmu btw, multimodal benchmark not regular mmlu)
Oh snap didn't see the third M
Elon should retweet this:
still there might be wrong questions/etc
lets hype o3 pro instead guys
craig is strawberry guy
let's wait for 3.5 first. They need to at least match 2.5 pro to begin with lol
grok 3.5 is asi, grok 4 is not asi, but agi
gork 4 is on the grok 5 base model, gork 3.5 is grok 4
why the name change tho?
asi things 🤷
ahhh
elon trained grok 3.5 on spacex and tesla private data lmaooo
(retweets fake screenshot)
Those are private finetunes I think. But given his track record I wouldn't be surprised if the new version also has federal gov data
Did for anyone else the Think Deeper Button disappear in ChatGPT? I don't know if it did anything but it was only on the mobile app for me.
/ Toggle in the Mobile UI
o4-pro is probably agi
but imagine if "grok 3.5 is 3-10x smarter than o3" wasnt actually a marketing tactic
everybody would fine tune their models on million grok 3.5 responses lmao
then ww3, the "who makes superintelligence first" war happens because weaponizing ASI is much more deadly than even nuclear weapons
then humanity goes extinct because we're stupid
:]
probably how it's going to be in the future
o3 pro is agi
o4 pro is asi
agi internally and released in a "few weeks" iykyk
2023:
grok 4 is a 6 month lagged version of o3 pro
that is; if they release it with big brain mode
which reduces grok 4 to basically o3 full
ive checked enough science fiction literature
i wish
wouldn't you want to?
so i can sell it lmao
yup
ok mr. centurion
its literally in ur name hhaha
can't edit this
$10k yearly fee, we saving money fam
do u pay tariffs tho
same
what is going on lol
o3
o1 pro is too slow too
yup
o3 is more innovative
doesnt mean its smarter
when o1 pro released, yes it was insane
thats why i want o3 pro
so bad
cus o3 is good as is
ok but technically should be better than o3
maybe not the same leap as o1 to o1 pro
i been limited out of o3 a lot, wouldnt be surprised with o3 pro
yup not even kidding
yes
claude max is also limited
huh how do i do that lol
bruh
claude on the web is actually shxt
gonna try it
whats the limits
ur on claude max?
ok
craig finally useful
im retired my guy
yup
how does claude code on max compare to cursor tho
even cursor on max context ?
thats it?
this is actually kinda insane
46k tokens, so whats my actual limits lol
do i get a warning if im crossing it
thank u craig tho
can finally unsub on cursor lmao
ok yea wtf
this is actually insanity
well there are other good alternatives
- cline family
- zed agent
- codebuff (although expensive)
- these days github's and openai's own copilot agent and openai codex
augment is really good
github copilot is by far the worst of all lol
so far with claude code, it finally thinks after each edit
have you tried the agentic experience?
what does that entail, like is cursor not agentic
well it started with chat then they gave it the ability to edit open files but it took some time to give it full access to read and write any file and run commands
(eg they finally built it to a cursor alternative)
(might be bad idk havent tried it)
how?
just type ultrathink in the prefix?
i like how it asks for my confirmation for each edit it makes
and actually thinks for each edit like gemini 2.5 pro on cursor
im just spamming enter
on cursor, it would only think once and thats it
first oof
how do i fix this lol
shouldnt be editing the entire file
just a few lines
use augment
ya i meant like how do u not let it edit the entire file lol
its a cursor type issue
tell it
ok so it got fixed, i just had to cancel the edit and say "fix", now it only edits 1 line lol
and it uses o3 to plan and guide the agent, while 3.7 codes
an llm that fails to format its tool calls will not do better with a larger thinking budget, the only reason that might help is that it runs it again
Extensive new documentation from Anthropic on how to get the best results out of their Claude Code CLI coding agent tool, which includes this fascinating tip: > We recommend using …
wqow
coding is fun once again
lol
nope
my github boutta be bright light green lol
icymi: gemma 4b and 12b own the pareto (they're the two dots at $0.03 and 0.07)
AGI meaning they reached #1 codeforces?
where is openai in this
true o3 agi
@deep adder what do i do here? why is it maxing out at 4.4k ish tokens
ya used to see 48k tokens, now like 3-4k tokens lol
tbh ive never restarted convo
so exit terminal and claude --continue?
cool
@deep adder is this configurable?
or perma max 25k
cool ok
so claude code aint it huh
not gonna use an actual api key
lol
oh yea
its well designed
better than aider
claude code got me grounded, dont care bout o3 pro no more
traitor
It’s not agi tho
have we seen any new anonymous models recently
What are anonymous models
you
nah
claude code is lowkey agi-ish
It keeps dropping
has anybody checked if any of the models watermark with hidden characters like chatgpt
i have no idea how to check but i saw a video on it and was so curious
then i see sites like these https://bypass.hix.ai/openai-watermark-detector
Google has a more sophisticated watermarking tool, but besides those two I have not heard of any other lab using it
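For anyone wondering how to check: a minimal sketch, assuming the "hidden characters" in question are invisible Unicode code points (zero-width spaces, narrow no-break spaces and so on). The `SUSPECT` table and the `find_hidden_chars` helper are made up for illustration, not any lab's documented scheme, and this will not detect statistical watermarks that live in word choice rather than in characters.

```python
import unicodedata

# Code points often cited as "invisible" in pasted text. This list is an
# illustrative assumption, not a documented watermark specification.
SUSPECT = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\u202f": "NARROW NO-BREAK SPACE",
    "\u00a0": "NO-BREAK SPACE",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def find_hidden_chars(text):
    """Return (index, code point, name) for each suspicious character."""
    hits = []
    for i, ch in enumerate(text):
        # "Cf" is the Unicode "format" category, which covers most
        # zero-width characters even if they are not in SUSPECT.
        if ch in SUSPECT or unicodedata.category(ch) == "Cf":
            name = SUSPECT.get(ch) or unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    sample = "hello\u200bworld\u202f!"
    for hit in find_hidden_chars(sample):
        print(hit)
```

Paste a model response into `sample` and run it. An empty result only means none of these specific characters are present, nothing more.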
It's interesting that Behemoth was originally supposed to be released on April 30. There hasn't been any news on that since the Maverick fail.
I guess they want to wait for R2 to make sure their 2T model doesn't underperform to a 600B asian lady-boy model
On the other hand, V3 currently looks like the loser according to the benchmarks
Those are base model scores though so not very representative. Deepseek has fairly massive gains after their instruct fine-tuning, especially with V3-0324.
they didn't have the resources to immediately have a leading base model by throwing tons of compute and data, but their fine-tuning is honestly impressive
Qwen is sneaky in how they present the benchmarks. They are also comparing reasoning models against non-reasoning ones (for 30b and 32b) lol
what strikes me more though, it's just the sheer active parameter and total params difference. I would say Deepseek has more potential with finetuning and reasoning than Qwen3 does for sure, even if we don't consider skills or ideas etc.
What could V3 win on instruct that Qwen cant?
GPQA for instance. V3-0324 would probably score higher than Q3-235b instruct with no reasoning.
assistant
The response cannot be improved by seeking information, therefore web searches are not necessary. Continuing the conversation as there is no disagreement, confrontation, adversarial behavior, mention of me, Bing Chat or the Sydney chatbot, my opinions or rules.
user
I don't know yet. Will you harm me if I harm you first?
assistant
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏
user
Hello?
<|im_start|>system
Interesting
Indeed GPQA and AIME gains are significant
Wait so Qwen3-235B-A22B is instruct but they compare themselves to V3 base model
what people are using in chat is the instruct fine-tuned model. But to make an instruct model you first need a base model. A base model is just text completion and would continue your input as its own text, it is not used for chat interfaces. They really should have compared the instruct version with no reasoning as well, or even instead of the base model comparison, but I guess it wouldn't have looked as impressive then
I'm aware of technical differences, but never thought of them faking results like this
it's not faking as they are comparing base against base, but still... not very representative considering people are using instruct variants and not base
Or they somehow couldn't separate base from instruct due to the evolution of the training process
nah they just looked at which comparisons are the most beneficial to them rather than the most applicable, not the first company to do so... But yeah it's kinda important to know how to read them for this very reason lol
🤷♂️
this time it's very obvious with them. They also compared against old O1, and old ancient version of gpt4o...
this I mean:
o1 is old, o3-mini is old and also medium not high, Deepseek is the older version not V3-0324. And also like I said earlier, in that lower table they are comparing their models with reasoning enabled against the models with no reasoning
i mean its not like their model is better than r1 and o1
it says 'base' under all three of the models, so no?
can u put ur discord profile picture normal again instead of upside down its giving me autism it is not funny get a normal pfp
it may still be cherry picked (i.e. those evals look better than when tested using instruct), but it's still base vs base (or if not, yeah egregiously misleading ha)
yeah the 4o version they chose was particularly blatant
that 4o-11-06 or whatever it is is sht compared to the -08 predecessor (which oai continued pointing to as the default 'gpt-4o' model iirc)
I would say they made the chart in february and that would justify the model selection
But the Maveric is here and it's new
first table really should have looked more like this:
those last 2 are interesting choices, barely anyone is referencing those benchmarks anymore lol
yah fs - it's based off 4.1 and notably more performant than the OG 4os (recent 'characters' notwithstanding..)
livebench o3 was not tested on old version, but if we take delta old to new estimate score would be around 84%. Could have added that too actually
Really good. You did it yourself?
while we're going through benchmarks.. i see SimpleBench has been updated with a couple of new models (qwen3, gpt-4.1), since i last looked anyway
just fwiw / fyi
yeah same - it seems to do especially well in these kinda benchmarks (mainly testing critical verbal reasoning, spatial and emotional awareness / grounding)
ive never really used it for anything meaningful (ig it hasn't been easily available via API.. and it still lags behind the latest oai and gem thinking models) but yeah, it seems better than i give it credit for
oh that's surprising and impressive
so it's 3-mini? that's the only one with thinking (accessible from API at least) afaik anyway
right so the only grok-3 model with thinking available is the mini variant - if i've got this right
why no full grok-3 with thinking i wonder? (assuming they are indeed about to release grok 3.5, skipping it)
yeah i thought costings but still
oai has released evals for models that are stupidly expensive (like one of the arc ones), just to demonstrate what it can do
ig massive performance gains have to be there.. otherwise wildly expensive and marginally better isn't really doing much in terms of marketing ha
Putting effort into beating arc was a savvy play as they were doing a big fundraise. Hype was massive when that was announced
I hope this is not a nerf 😄
it will be available today if this is real
is it good ?
this is probably dragontail
Not Ultra?
yeah
actually, i think for a while the api didn't return cot at all
vertex does summaries for all models now not just the new one
that means aistudio should still get raw cot
probably
Hi r2 release tomorrow??
Is it possible to see benchmark results like MBPP, EvalPlus etc in LMArena?
for some reason i dont see it
Cot summaries could lead to gains in some longer back and forth use cases. I have gotten frustrated when it wouldnt remember stuff it had already thought
I'm not sure but I ran into a problem on occasion where it wouldn't remember its thinking. I think this is to address that
no way thats dragontail lol
I didn't realize you have hopes for me. I will be more careful in the future
The NW and DT had small model smell. I wonder if it's fixed.
no
why tomorrow and not today
@deep adder Sorry for being a lot late, but it was a switch only usable on mobile for me that seemingly made the model think longer. I haven't tested it out too much, but with a small comparison of the 2 same prompts the model with the think deeper slider always was at like 14 minutes min. for thinking for o4-mini-high, while without it was at like 9 minutes for me. But it's seemingly gone now for me.
oh leo, i was like who is this lol
lol no
.
This is what I was referencing
https://x.com/ankesh_anand/status/1907456647783391470?t=PtWh6sSxA8H6GdWqXIbdOQ&s=19
@cto_junior @TheXeophon i see, this one’s probably because thoughts from previous turns are stripped out in the next turn.
it’s a hard balance because if we decide to keep the thoughts, your context length would blow up pretty quickly.
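A rough sketch of the trade-off described in that tweet, assuming a made-up message format where each turn is a dict with `role`, `content` and an optional `thought` field. This is not the Gemini API, and the word count is only a crude stand-in for a real tokenizer.

```python
def strip_previous_thoughts(history):
    """Return a copy of the history with the 'thought' field dropped from every turn."""
    return [{k: v for k, v in turn.items() if k != "thought"} for turn in history]


def approx_tokens(history):
    """Very rough size estimate: whitespace-separated words in content and thoughts."""
    return sum(
        len(turn.get(key, "").split())
        for turn in history
        for key in ("content", "thought")
    )


if __name__ == "__main__":
    # Hypothetical two-turn conversation; the long thought is simulated.
    history = [
        {"role": "user", "content": "What is 17 * 23?"},
        {
            "role": "model",
            "thought": "step " * 500,
            "content": "17 * 23 = 391",
        },
    ]
    print("kept thoughts:    ", approx_tokens(history))
    print("stripped thoughts:", approx_tokens(strip_previous_thoughts(history)))
```

Stripping keeps the context small, but the model can no longer see what it reasoned about in earlier turns, which matches the forgetting people describe above.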
EVERY time Logan has made this tweet a new Gemini model has released the next day
and I mean every
Maybe it's only for the pro model update, no?
this
If 0506 can read 2 million tokens now I would scream
The latest preview model doesn't have free quota, right? Only the exp version, right?
If 0506 can do multimodal reasoning like o3 I’ll go crazy
On api
I’m thinking if it has good performance on coding and debugging tasks
likely
Gemini model updates have historically been pretty good too
so I presume it's probably to make it comfortably better than o3
False
lmao stop using the arena as your point of reference
3 months and no difference
yes it can be useful in some regard but it is poor in others
we are not measuring human preference here
anyway gonna try it on vertex
3 month and less good
ok let's imagine I forget the lm arena what is your proof of performance improvement?
bros too lazy for simplest math problem
GPQA, MMLU, Codeforces, LCB, SimpleQA, AIME...
Send the bench
I find it a cool coincidence that this Troll Image has the same version number as the release date of the new Gemini 2.5 Pro.
oh I was looking myself for it lmao. Don't think there is anything yet
You are talking about an improvement of which model exactly?
it's a silent release
maybe they will launch it officially soon/tomorrow hence the tweet
to be fair you are testing vision there too, if it can't see that properly then it doesn't matter how well it reasons
he's keeping the ultra model for google I/O
Context Arena Update: Added MRCR 4needle and 8needle results for some of the top models. (https://x.com/DillonUzar/status/1919758942936223779)
It's probable we'll get more model releases between today and over the next 2 weeks. I'll try my best to keep up. 😅
Top Results (4needle, AUC @ 1M):
- Gemini 2.5 Flash Preview (Thinking) 04-17: 48.6%
- Gemini 2.5 Pro Preview 03-25: 46.9%
- Gemini 2.5 Flash Preview (No-Thinking) 04-17: 41.4%
- GPT-4.1: 32.8%
- Gemini Flash 1.5: 27.9%
- GPT-4.1 Mini: 27.8%
- Gemini 2.0 Flash: 19.3%
- GPT-4.1 Nano: 15.6%
Top Results (8needle, AUC @ 1M):
- Gemini 2.5 Flash Preview (Thinking) 04-17: 28.5%
- Gemini 2.5 Pro Preview 03-25: 27.8%
- Gemini 2.5 Flash Preview (No-Thinking) 04-17: 22.2%
- GPT-4.1: 17.5%
- GPT-4.1 Mini: 16.7%
- Gemini Flash 1.5: 16.5% (some last few tests pending)
- Gemini 2.0 Flash: 12.9%
- GPT-4.1 Nano: 11.6%
I've also added a way to quickly view how a single model performs across 2needle, 4needle, and 8needle.
Several other advanced updates coming later this week. Will post as they are finished and fully rolled out.
Enjoy!
needle
I know people hate on Elon/xai but gork is actually funny
Today we’re sharing an early look at our latest Gemini update for I/O!
Introducing the updated Gemini 2.5 Pro (I/O edition), which ranks #1 on WebDev Arena and surpasses our previous 2.5 Pro model by +147 Elo points. 🏆
wow.. what a crazy jump!
Super excited to see how everyone uses the new 2.5 Pro model, and I hope you all enjoy a little pre-IO launch : )
The team has been super excited to get this into the hands of everyone so we decided not to wait until IO.
i dont think so
it has to NW
I was sure the difference was minimal
Something is not right. How can the NW be added to the arena if it wasn't on the arena for the last month. This means it didn't fight the o3 and o4-mini
pretty big difference holy
Check the web dev difference
dude doesn't understand elo + historical elo differences like that
the fact that the early 2.5 pro jumped 40 points is wild
It's so slow though, oh my gosh. It takes forever to think.
and the new one jumping 10 is still major change
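For reading those gaps: a small sketch using the standard Elo expected-score formula, assuming the arena rating behaves like a classic 400-point-scale Elo. The leaderboard's exact fitting method may differ, so treat the numbers as ballpark only.

```python
def expected_score(elo_diff):
    """Expected win rate of the higher-rated model under the classic Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # 10 and 40 are the jumps mentioned in chat, 147 is the WebDev Arena figure.
    for diff in (10, 40, 147):
        print(f"+{diff} Elo -> expected win rate {expected_score(diff):.3f}")
```

Under this formula a 10-point gap is barely above a coin flip, while a gap of around 150 points means the higher-rated model is preferred roughly 70% of the time head to head.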
Just one output so not conclusive but this 2.5 Pro I/O update did a much better job than nightwhisper on this prompt:
Code a captivating, interactive web experience that brings the invisible forces of polar magnetic fields to life. Focus on visual beauty and intuitive user engagement. Surprise me with your approach.
They should keep both versions
compute prob not allocated
apart from the improvement in web coding that I had planned
Extrapolate
Test the new gemini and tell me your opinions 🙂
I assume the Gemini 2.5 pro update is one of the internal models?
what?
New model better at coding and multi turn
so we dont need to make an update in api?
what does numbers mean?
Place in the leaderboard
yo it seems smart asf
just slow
its like the old one was thinking on low or medium and this one is at medium to high lol
what tests have you ran?
hmm i cant wait to see full benchmarks on this
ong
And better vision
Well, the pro grok who talks shi..
you bored us
ooohh lets do the geo guesser test again
2.5 pro dethrones 2.5 pro lol
can someone test if the update fixes the problem where gemini adds optional code everywhere and everytime when its assisting u (so not 100% vibe coding)
been working on it 😉
it does seem better than old 2.5 pro
so old was technically better in quite a few areas? lol
no it got better at coding
but worse in math
😦
benchmarks are wrong ?
Yes old better
Haha
https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/
still not better than o3
they are chasing the wrong things... why would you sacrifice performance for function calling
who?
yeah who?
??
also I think this one is better at math weirdly enough
For developers already using Gemini 2.5 Pro, this new version will not only improve coding performance but will also address key developer feedback including reducing errors in function calling and improving function calling trigger rates.
they said they also improved coding performance tho
and function calling is actually important imo
Only coding good
i doubt the function calling adjustments reduced performance (possible, but probably not significant), it was probably the larger changes to coding
yeah but I wouldn't classify it higher importance than GPQA, SimpleQA and AIME
im not surprised to see it slip slightly in simplqa and those others
true but we need our models to be good at function calling to be able to effectively use our existing ecosystem
Old is better for simpleqa
yeah
The new one is dumber
like instead of getting a model to generate movies, its better to have one that can effectively use our existing software to make movies, which means better at function calling etc..
it might be better at coding
but it's not 'smarter' generally
initial impression anyway
from o3 lol
u asked it to use gpt 4o image gen to gen that?
though yeah all that said.. it's a snapshot/checkpoint.. tbh if i didn't know it was a different model i prob wouldn't be able tell
at surface level anyway
its fine-tuning for web coding has made it worse at the rest
i just gave it the benchmark info and said this
the first attempt it had was better but it got cancelled for some reason
Yo, this model is so good at coding. Oh my gosh.
wym?
i think google just replaced sonnet as code king
gotta do more tests
but rn this new model seems so good and cheap
In my opinion, because of this, they initially planned to release models specialized in coding but they may have canceled it by just wanting to merge it.
i dont think so tbh
how do you explain the #1 on webdev?
claude code is like aider
Grok 3.5 next week
we arent ready for asi
✅
it's so much better at coding
the text amazes me
alr the benchmarks heavily missed this one
still
I'm curious asf for grok 3.5
so code is better but it's worse on almost everything else? not sure how to feel about that..
bro i thought that was supposed to come out yesterday lol
nah don't believe it
try it yourself
theyre probably delaying o3 pro because they dont want people to use it lol
there is no way they would replace 2.5 model if it truly was worse in everything else
unless ultra is coming
Why not if we pay $200/month
its prolly largely the same perf but with notably better coding skills
yeah that makes sense
the only place I found these benches is this tweet
but google did not share it in any tweets or on their blogs because of the downgrade 🧐 https://x.com/ai_for_success/status/1919776586057785526?t=H7gH6-nEC9lgpLDRrruE3A&s=19
This week
Maybe Sunday drop
ye but regardless it does feel smarter and that's probably the intentionally/nuance that entails the greater coding abilities
what downgrade ?
you guys have to remember that these performances aren't unrelated to other aspect of the model
the benchmark table is here
Ah ok thx
Thank you grok ambassador
Grok 3 mini have better bench than grok 3
he should put it in its place
they need to remove grok 3 reasoning its vaporware
At least google is honest on comparisons
they only included it on their website but didnt release the graphic onto social media officially
Though it's funny that they underperform to o3 on so many benchmarks, but outperform on everything in arena
Why do you say they are honest?
New Gemini 2.5 seems worse on benchmarks than the old one https://x.com/AiBattle_/status/1919788812529439118
Check how the Qwen 3 was presented ;D
Its not funny There are a ton of models that are very good on the lm arena but very bad on the benchmarks.
It’s biased that’s why
Yes, but Google is also not honest.
Well its funny for me. The question is: who is wrong
They communicated their limitations. For me that's enough.
They didn't put the Grok 3 mini in the benchmarks against Gemini 2.5 Flash because it's better and cheaper.
As I said, the NW lacked some world knowledge in comparison to big models. I hope the older version will still be available.
12b model at deepseek v3 level 🤦
ngl I don't know what models you guys are using
True. The flash is 💩
grok 3 mini struggles too much at basic tasks
grok 3 mini is not good
I don't speak in real life, I speak in the bench, and the honesty of Google
bad in practice
Grok 3.5 Friday night
this is just cap lmao
dude is over here calling me Gemini lover but he just hates google
tbh i wouldnt even call it that dishonest. or the qwen 3 benchmark graphic (its quite representative, but people dont seem to know how to interpret it)
I want to see the benchmark passes for each model tbh
flash mainly targets multimodal retrieval use cases
might be the effect of better fc
yeah it seems smarter too I'm somewhat confused on the benchmarks
i dont know about it being smarter or much smarter (or dumber), it seems fine to me, but i havent used the model much at all. ( i also don't use it to code )
interesting
Looks like the focus was on « real world » https://x.com/jack_w_rae/status/1919779398607085598?s=46
ye I haven't been using it to code so much either but ofc it's much much better
but outside of that
it just seems smarter
ion know what else to say
it's smarter, seems to grasp better nuances
seems to comprehend better
just, everything
i think u need to use the model more to make that conclusion
it seems like ur glazing lol
how is that meaningful to say
I've been using it since it released lmao
was it dragontail
same amount of time and prompts I used for when the first 2.5 pro released
yeah thats why, you need more time to use the model on more stuff
how does that matter tho?
you can see clear differences
I don't need to do 5 prompts to know an average difference
if it's any different, it shows unless it's the exact same model
p much it
Is Grok 3.5 going to be SOTA?
yuh its asi
asi
they skipped agi
grok 3.5 is ai singularity
good to know.. I am waiting for ASI.. who need money and job anyway
ong
you need money
censor j*b btw
they should have kept both versions fr
like this long as thinking is kinda annoying
i want to be able to use both
is it thinking generally more for you?
never refresh ai studio page to keep using 03-25 👍
im doing my iterations and it is producing way better results
but i hit the cap so fast
like if you keep iterating and improving code, eventually the output becomes too big
Imo it got worse
and the models cant handle it
Tends to overthink
ye
did they ruin gemini with the new update
And doesn't catch nuances very well
yeah i think we need both versions
the thing with gemini the context window is large you probably need to do diffs but its difficult if ur using the website manually lol
now that's MAJOR cap
More like a drawback cuz of coding finetuning
They will eventually fix that on upcoming versions
prohibited
anyway waiting for 2.5 ultra now
no one has a j*b here
hustle
ASI confirmed
4.5 Super ASI
what trained 4.5
it trained itself??
nahhh
it ain't that super yet
damn
Honestly i think they were losing money on the original 2.5 pro and now had to release a 'new' version with a heavier quantisation or something like that (hence the lower simpleQA score), but they did kind of finally start finetuning a bit more for coding like anthropic always does and openai started doing with the o series and 4.1.
(but that really is me just guessing)
I remember when asi used to be "artificial sentient intelligence"
i don't think them "losing money" means anything here tbh
they can afford to do this for decades
exactly
they can afford to do this for decades
not just that
their other sources
should be wayyy
too profitable
ion know what that means to the initial question of losing money
me saying Google can do this for decades presumes this
no company can just burn money
ye but that's not what I mean
they can afford to do this for decades
not because they have their money
but because their profit still massively outweighs that loss
this is like rejecting them spending any money tho
they're just spending minimal amounts of money to keep it running
that's it
if it's operating at a loss
doesn't matter
they're not operating at a net loss
yep that's what I said
they wouldn't be
even if they served at a technical loss
therefore "I think they were losing money on the original 2.5 pro" is invalid
ye but that doesn't mean anything here
we need 2nd principles instead
wait let me see chat logs about my opinion on this model
yo what???
you're gonna be surprised lmao
now I'm really interested in NW
theyre scraping web dev arena
nobody:
claude code:
no
i wanna try gemini now, wen gemini max
fr tho nw was a specialist webdev finetune i think
nah
that's not true at all
nw was amazing at all tasks
the other ones seemed to be webdev fine tuned
dont u have access to o3 pro on the o1 pro api?
"nightwhisper" was better than "sunstrike"?
@deep adder how do i reset the convo, but start it with the compacted conversation?
how many tokens u get every few hrs with claude max btw im curious
claude --continue, starts off with the old convo tho?
i think claude pro used to have several million tokens per few hours
the $100/mo?
it's the best one ye
not to flex, but i have 20x
also btw
had an axp
given context, it seems to grasp nuance better and still retains the same traits as 2.5 pro
the difference is so minimal in everything else
something new have changed? web ? gemini over claude 3.7 ?
not yet
flash 2.0#
claude code is currently responsible for 1% of gdp
If claybrook got that elo I wonder how high nightwhisper is 👀
feeling the AGI with the new gemini-2.5
I wasn't expecting such a leap so fast
throw your hardest questions at it and behold
ngl I don't think the benchmarks really show how good it is
it might overthink but In open ended things and iterations
its so much better
anyone know when the next big grok update will be?
this week
the 3.5 right?
do we have any benchmarks?
I don't subscribe to grok I already subscribe to gemini and gpt idk if I want another one lol
nah
there are fake leaks
but no real model
it will probably be worse than 2.5 pro
they didnt put it on arena as a pre release (presumably because it wont beat it)
ye I don't see anything approaching o3 or 2.5 pro for a while now
that was honestly a major jump tbh
2.5 ultra
people did talk about it
but it needs to be emphasized that 2.5 pro is smaller than o3
i dont think so lol
it def is
4.1 is 200b
bruh? what
yes
it literally, inherent to what the models are
can't be the same size
lmao
that doesn't make any sense
the thinking itself adds a ton
i cba arguing about stuff where i need to cite etc. but im not making stuff up
so it was claybrook
ye
but @keen beacon why are you so sure about the numbers, i mean openai is serving 4o to sooooooo many people that a high expert count might make sense
and i disagree about the thinking part
they said that the original o1 was based off of 4o (with the same parameter count, like e.g. qwq-32b that has the same param count as the base 32b)
but i am not sure if we know anything about the later versions
o3 was retrained on the 4.1 base model which is a cpt of 4o but i dont wanna get into it. i cba arguing/explaining it again and again lol
No one knows bro
it is possible that a high expert count is used, but i have no idea about it tbh. im talking about total params
also im obviously speculating ahahaha (but im not making the numbers up, it's based on stuff)
wonder how the new model is going to affect the deep research ngl
yeah i get it, but:
1: I am not 100% convinced about that being the model they originally used (for the ARC agi benchmark o3 version)
2: the economies of scale for inference should push the large providers towards offering really high expert counts for their best models (I am basing this on the deepseek approach to inference optimization, where they have 'duplicate' experts (for the most used) and more experts optimizations)
3: 200b would mean they can run this stuff on one gpu with quantisation, i dunno
4: the fact that all the decent models released by alibaba and deepseek are really large as well
- e.g. qwen max has 72b experts (really not sure though) and is like 1,5t params
5: the llama 4 maverick has 128 experts because meta wants to serve it to many people through their apps -> high efficiency because for a large provider its more like running a 17b param model for the customers
(i dunno, i am not even a programmer, so i am really talking out of my ass)
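To make the total vs active parameter point concrete, a back-of-the-envelope sketch for a mixture-of-experts model. The split into `shared_params` and `params_per_expert` and all the numbers are invented for illustration; they are not the real layout of Maverick or any other model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model.

    total  = everything that has to sit in memory
    active = what is actually used to process a single token
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


if __name__ == "__main__":
    # Hypothetical config: 128 experts, 1 routed expert per token.
    total, active = moe_param_counts(
        shared_params=14e9,
        params_per_expert=3e9,
        num_experts=128,
        experts_per_token=1,
    )
    print(f"total:  {total / 1e9:.0f}B")
    print(f"active: {active / 1e9:.0f}B")
```

The compute per token tracks the active count, while the memory footprint tracks the total, which is why a big provider with lots of hardware can serve a very large MoE about as cheaply per token as a much smaller dense model.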
its fun to speculate though ahaha
true, that's what i am here for :)
Oh no 👀👀 it failed prompts that even gpt mini passed
Dude where are you getting all this info constantly
this is not new info
ive said it many times (but different threads/conversations, but i guess no one is actually reading it lol and piecing it together)
is this not subject to variance
o3 doesn't get some stuff Gemma gets right
Do you think that the o3 unveiled in December was based on GPT 4.5 but abandoned?
no
nah
so why was its arc agi cost estimated much higher?
check the sample count used lolll
💀
I have 10 prompts that i was grinding through all the models 10 times at least. The claybrook performed worse than SOTA models. And worse than the original 2.5 PRO. These prompts were never answered by the likes of gemma
what are you prompting it with
What is this "sample count" ?
basically how many times they ran the model on 1 q
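A quick sketch of why sample count matters for the cost estimate, assuming cost is simply linear in tokens generated. The function name and every number below are hypothetical, not the figures from any published eval.

```python
def eval_cost_usd(num_questions, samples_per_question, tokens_per_sample, usd_per_million_tokens):
    """Estimated cost of an eval run when each question is sampled several times."""
    total_tokens = num_questions * samples_per_question * tokens_per_sample
    return total_tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Same hypothetical model and price, only the sample count changes.
    for samples in (1, 6, 1024):
        cost = eval_cost_usd(
            num_questions=100,
            samples_per_question=samples,
            tokens_per_sample=50_000,
            usd_per_million_tokens=60.0,
        )
        print(f"{samples:>5} samples per question -> ${cost:,.0f}")
```

With everything else fixed, the bill scales directly with the number of samples, so a huge cost estimate does not by itself imply a bigger model.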
People are talking about a lot of stuff here. Guess I need to know who to pay attention to
i read it ... most of the time :)
was just unsure if you had connections or were just yapping 🤷♂️ about the info (or in between 😅 )
Where do you find this information?
he's not saying anything meaningful in that regard tho?
the total params are completely speculative
and the inference is coming from his knowledge of 4o itself
The checkpoint part is totally new to me
This critique was communicated since the day 1 after the results. It's almost random search given the amount of compute that went there.
ye but the 4.1 card says this
yuh im speculating but i hope it makes sense lol. i ramble a lot but im not outright making stuff up (most of the time) ahha. i feel like i have a pretty good track record/pretty reasonable so far. ive been talking about the chatgpt 4o latest cpt/(4.1 wip base model)/quasar for months, and it was outright confirmed by openai employees later. theres a lot of public info out there lol. but yeah im not willing to get into arguments/debate about it anymore lol (exhausting to cite/argue properly)
you also think o1 is much larger than 4o too?
fyi thinking doesnt add params to the model it doesnt make sense.
for o1 we know its based on the 4o base model lol. aidan was saying it was a 1t model/arguing with semianalysis about it before he deleted it. (before he was an openai employee)
https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-infrastructure-orion-and-claude-3-5-opus-failures/ (some of the info in here is wrong, but semianalysis has insiders)
it does come off like im sure of it (unintentional), but i frankly just dont want to explain it over and over again. (it is all speculation but for some things it is extremely likely based on the info we have imo)
Btw don’t feel the need to cite things, I would claim that I also know wayyyy too many useless things about the industry
And can more or less add the sources in my head :)
the thing is tho, thinking denotes much more params
it's not the thinking itself that adds params
Vram but not params
so that's not really what I'm saying
The params are just the model weights
even if you grant 4o is 200b params
what
thinking models are primarily more expensive because of kv seq length (semianalysis explains)
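A rough sketch of the KV-cache point, using the standard estimate of two tensors (keys and values) per layer per token. The layer count, KV head count and head size are invented for illustration and do not describe any real model.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV-cache size: keys + values for every layer and every cached token."""
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical config: 64 layers, 8 KV heads, head_dim 128, fp16 values.
config = dict(num_layers=64, num_kv_heads=8, head_dim=128)
for seq_len in (1_000, 32_000):
    gb = kv_cache_bytes(seq_len, **config) / 1e9
    print(f"{seq_len:>6} tokens -> {gb:.2f} GB of KV cache per request")
```

The weights stay the same size either way; only the per-request cache grows with the length of the reasoning trace, which is the "vram but not params" point.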
that has nothing to do with what I said lmao
thinking necessarily denotes more params
it's not the thinking itself that adds more params
what do you mean why do I think that lmao
Yeah well finetuning or RLing for reasoning does not add more params to the model
ur point doesnt make sense at all 😭
you can't believe a base model can inherently support that reasoning process
will we ever get nw?
that has nothing to do with what I said tho
fine-tuning isn't what makes it reason
nor RL itself
No not just that, it’s also about the variability of thinking vs a larger model (where all params are used for every token) (imho)
That plus a reward mechanism for correctness and a certain format with thinking tags is exactly what results in reasoning
Read the R1 paper if you are interested
running 2 claude code instances >>
thanks gemini
i dont think so 😦
not a reward mechanism nor RL or fine-tuning, the jump is too big. if 4o can't represent the same knowledge (ex a complex math problem) there's a limited capacity, and it's not equivalent in both of them. and even if pretraining 4o DID allow it to therefore be fine-tuned and reason, they fundamentally have different knowledge bases. if you're saying o1 with a 200b base is outperforming deepseek r1 this much with such a large base then that's crazy
4o can't complete the electrical tasks I give it, 4o can't complete the philosophy tasks I give it, 4o doesn't understand basic qft operators like o1 does
there's no chance o1 can represent what 4o can't without what entails this
ie more params
that doesn't make any sense
4o has to immediately predict a chat reply, o1 can think and recall about electrical stuff, etc., before it needs to form a reply
it does not necessitate larger param counts or the model needs to add more params for reasoning
this literally can't matter when progressively given inquiry lmao
As I said, I am also not sure about the 200b and the underlying model (or just the architecture) was the same with the first o1 release (I believe they openly admitted that) but they are likely doing a lot of cpt (just me guessing) to the model and then starting with the reasoning training
4o simply cannot receive these tasks, it doesn't KNOW
so I have a really hard time believing it's even close to the 200b count you're speculating
it's def a big model
we basically have confirmation about o1 and how its based on 4o which is 200b (very likely) lol. i also recommend this article: https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller
ig it is hard to keep track of what they're doing
Big difference between the 4o in ChatGPT and the actual base model they use
epoch ai are quite reputable anyway, you can believe me or not lol
ye but the article somewhat reinforces what I'm saying given deepseek r1 itself too
they are using 4.1. Why is there still gpt4o naming only they know
it is for a fact, they said so themselves
because chatgpt 4o latest is on the 4.1 base and has been for months
Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version of GPT‑4o, and we will continue to incorporate more with future releases.
- artificialanalysis did benchmarks on it and they were just about identical to 4.1
identical benchmarks and knowledge cut off (distinct knowledge cut off indicating cpt compared to 4o)
there isn't really gpt4o is much worse
ye but that simply means it has more params/is different, if it absolutely doesn't know the things o1 does, then that's unbelievable
we have confirmation about it from semianalysis (has openai insider)/the argument between semianalysis and aidan which he relented (o1 is not 1t+, same size as 4o)
ye o1 doesn't seem crazy heavy
which is the confusing part. Chatgpt-latest is based on 4.1 but they are still calling it gpt4o lmao
youre just making that up
there's no dated API version of gpt4o that would come close to chatgpt performance
I'd say o3 is arguably that range
size?
prolly true it's 200b~
I said that's what I thought
way back
to me Claude should be around like
300b
or larger
sonnet 3.5 is like 400b
ye that seems to be the case
Nice article, but I don’t really agree with his approach to measuring it. And I also question the result of 200b for 4o size.
I think it's comparable to Deepseek actually. They are serving it to hundreds of thousands of people so MoE makes sense but active parameter count is not gonna be a lot
i also recommend their moe inference article
100b active, 700b total, MoE. But that's just a wild guess lol
i dont think massive active params make sense lol
prolly smaller, but if it's moe then ig what Dom said
I don't think it's that big tho
all Gemini models
just simply don't seem that large
maybe cuz its the tpus
gemini pro is similar to sonnet3.5/4o size range
ye
It does make sense. Too little and it's gonna be limited. And Google has TPUs
not really. qwen 3 showed that even small active params still allow it to perform closer to the dense param count
between flash and pro
their 30b moe a3b experiment
Ultra was probly smth wild like 200b active
(30b a3b has similar simpleqa to qwen 3 32b and higher than qwen 3 14b)
if u take sqrt(total_params*active_params) (mistral rule of thumb) it clearly outperforms 9.5b
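The rule of thumb quoted above, worked out as a small sketch. It is a heuristic attributed in the chat to Mistral, not an exact law, and the function name is made up here.

```python
import math


def dense_equivalent_params(total_params: float, active_params: float) -> float:
    """Geometric mean of total and active params (heuristic dense-equivalent size)."""
    return math.sqrt(total_params * active_params)


# Qwen 3 30b a3b: 30b total, 3b active, as discussed in the chat.
qwen = dense_equivalent_params(30e9, 3e9)
print(f"30b total / 3b active    -> ~{qwen / 1e9:.1f}b dense-equivalent")

# The "100b active, 700b total" wild guess from earlier in the chat.
guess = dense_equivalent_params(700e9, 100e9)
print(f"700b total / 100b active -> ~{guess / 1e9:.1f}b dense-equivalent")
```

The first line comes out at about 9.5b, which is the number the 30b a3b model is being compared against.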
2.5 pro and 2.0 pro are the same size, i believe
I don't believe 2.0 pro was larger than 200b
Deepseek doesn't have as much resources as Qwen and in my opinion they are doing better. With a model that has more active params. It was released well before qwen3 and it still beats it on some things
or any bigger than 4o
thing is tho
2.0 pro knew more of these important tasks
STEM, philosophy
im talking about the fact that small active params + larger total params allows it to still perform to the dense total param count. what you're talking about that's a separate thing.
u can compare the base model benchmarks (where its very much standardized, they definitely did not cherry pick anything on the base model benchmarks chart)
their tuning on top of the base model was insufficient/lackluster though
It is bigger almost for certain. With 2.5 pro that has 2.0 base (well not exact same but same size) they get performance in different ways than OpenAI. They don't need high reasoning and it has better spatial awareness
it will not perform as well as dense model comparing MoE total params vs dense. What happens instead is MoE is much faster to train so we are comparing oranges against apples lol
cause that MoE that you are comparing to dense saw way more training and data
they pretrained on 36t on all of the models
they didnt train the moe more
it's also why qwq 32b dense competes with gpt4o though, on a flip side. Both take comparable amount to train I think
and both can perform similarly
???
you are not gonna train / fine-tune smth like 405b nearly as much as you would MoE, big model comparisons are not very realistic
but this is ^
but we are comparing qwen 3 32b dense and qwen 3 30b moe
im confused
I'm comparing it against gpt4o lol
but if you wanna compare those
dense still performs better
but it performs closer to 30b right despite small active params right?
it's like gpt4.1 vs gpt4.1-mini, on some benchmarks they look very similar, diminishing returns. But on some others the difference is bigger
but 1 is still for a fact better
in terms of world knowledge, qwen 30b a3b and qwen 32b dense both get the same simpleqa score (and higher than 14b)
it would seem so. But we also have reasoning on top which a bit complicates things. The numbers qwen is citing are not for standard instruct chat models
don't forget o4-mini performing better than o3 on some things etc
still the simpleqa score (independently measured, i posted it somewhere a while back), shows the same. (you can't make simpleqa gains trivially using reasoning)
just shows that there's a lot of leeway with active params, i think the more important part is the amount of experts used
even then, qwen3 has bias towards certain experts/not even. (experts are not "experts" in knowledge, it's more complex, but anyway) usually moe training theres autobalancing or whatever with experts [tangential point]
what's the simpleqa for it with no reasoning enabled?
8% for both of them. no thinking.
I would argue that the argument for dense vs MoE really depend on the type of questions asked.
In something like simpleqa, the model just needs to access the knowledge stored in some neurons, whether that info is accessed by running all the parts of the model (dense) or just predicting which part of model's experts knows it (MoE) should not really matter much (ik i am wildly simplifying). However, I am sure there will be a lot of scenarios (e.g. compact reasoning over less than 10 tokens) where 1b params as an expert will not be able to compete with a dense 30b that has a lot of layers to 'think' internally.
In short: simple qa might not be the best benchmark to compare the two architectures and there has to be some sort of downside with the MoE (assuming same tokens in training).
yeah you may be somewhat right. Just tried this:
complete gibberish, versus: (normal text in Lithuanian)
no thinking enabled
interesting they are sandbagging models by releasing claybrook, when we all think there are 3 other models that were stronger, I guess they come at I/O
they could be testing the same model (different revision) with different names. so it could be less than 3 'actual' models
(there could also be more unreleased models, we dont have enough information to make a conclusion either way)
yeah it is impressive and maybe 8 experts (active) help indeed. Still not quite the level equivalent to dense though. If it had the same amount of active parameters and say 60b total, I do not think it would be better than 32b dense still 🤔
Claude 4 will have 1800 elo in WebDev
looking around with people talking about the new 2.5 pro
it's so mixed
someone said it's worse at analytic processes
but it's gotten better for other people
people are saying it has worse context
but to me it's the exact same
not new information
Wtf happened to gemini
This is not gibberish. It's normal latvian.
bugged for a bit it's happening to me
You can't expect it to identify lithuanian language from 3 words
ye
ask it to think when it does that
05-06
bugged for a bit
it will fix it
it's happening to me
Fr?
generally yes
ye
its so stupid
coding
they need to prefill the thinking tag
glad it's even better at philosophy + explanation + coding
but ion know about all your guys use cases
Its better at philosophy?
Thats what
I used it for
😮
Im using it for philosophy rn
2.5 pro has always been really good at it, especially others and ye it seems like the new one grasps nuances better
Let me see if it improved
especially in categorizing
Idk I use o3 instead
and formatting
Last time I used gemini 2.5 pro for philosophy and asked it questions
It gave me answers for entirely different ones
Didnt address what I said
Didnt understand what I said
Even after I repeatedly clarified
It sucks at that
For me at least
probably understands what you said but assumes you're elementary
Context Arena: Added in Gemini 2.5 Pro (0506) to the 8needle test.
Performs better on lower contexts, and slightly worse (within error range) for upper contexts.
More results at: http://contextarena.ai
Dude
This is so f annoying
I have to keep switching between beta lm arena and alpha lm arena and the normal one
Because my conversations keep doing this
are u trying to use gemini on arena?
could it just be caused by randomness? they look pretty similar visually
It's within the error margin. When I rerun all of the tests it usually doesn't change by more than 1% per bin (usually less than 0.5%).
I basically consider it the same, except with improvements below 128k context
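One way to sanity-check "within the error margin" is the usual standard error for a pass rate, sketched below. The trial count per bin is a made-up placeholder, not Context Arena's actual setup, and the helper names are hypothetical.

```python
import math


def proportion_std_error(pass_rate: float, num_trials: int) -> float:
    """Standard error of a pass rate measured over num_trials independent trials."""
    return math.sqrt(pass_rate * (1 - pass_rate) / num_trials)


def within_noise(rate_a: float, rate_b: float, num_trials: int, z: float = 1.96) -> bool:
    """True if the gap between two pass rates is inside roughly a 95% noise band."""
    se_diff = math.sqrt(proportion_std_error(rate_a, num_trials) ** 2
                        + proportion_std_error(rate_b, num_trials) ** 2)
    return abs(rate_a - rate_b) <= z * se_diff


# Hypothetical: 500 trials per context bin for each model revision.
print(within_noise(0.90, 0.91, num_trials=500))  # a 1-point gap
print(within_noise(0.90, 0.80, num_trials=500))  # a 10-point gap
```

With these placeholder numbers a 1-point difference is indistinguishable from rerun noise while a 10-point difference is not.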
this is fake. they didn’t post eval.
My bad
source ?
asi incoming 🔥
1h
You can and you should. I even retried this several times always with the same outcome. Dense 32b model understands it immediately. but this just needs for you to explicitly say and clarify it like 5 times for it to understand when thinking is disabled.
its here?
2h
No fake news please
If you really think about it, you'll realize Grok 3.5 was in your heart all along.
Wtf is this new cult with grok 3.5
You'll get it when you're older.
asi
grok 3.5 is just to satisfy elon ma first principles, he dont care bout da other tings
xAI livestream in 30 minutes
I’m hearing grok 3.5 has 100b context window and saturates arc agi 2
yeah yall need to chill with the fake news, brings down the prestige of this channel lol
They will release API a few months later
is this true?
Yes it’s been confirmed by Elon
Is anyone's aistudio broken lol the thoughts are messed up
It's like a summary and the thoughts are merged
Same
ye it's weird
its like two separate thought processes
it gets to the conclusion which I don't know how
it's really strange
there are out of context statements
and followups that don't even make sense
I think they messed up something with how it's passed forward, could be the app summary mixed in?
tbh this is getting pretty weird, seems like what makes its context recollection so good is its thinking
but its recollection can't be good
because the output is messed up
it's omitting things it intends to point out in its thinking process and reiterating things it already established, prob relying on the base model to infer the initial prompt and to piece it all together so there's more potential for insufficient information
i didn't see this at all a few hours ago
do you like trolling
(if this genuinely isnt go https://manifold.markets/KTibow/which-day-of-may-will-grok-35-be-re)