#general


small haven
#

lame

#

as long as its above 200k ish, would be cool

small haven
#

thursday

patent aspen
#

I like this

wicked root
#

Hello humans

echo aurora
elder rapids
#

this wouldn't make any sense tho

#

they have the compute to train it at 1m context

#

actually that would be weird in a lot of ways

#

since I'd assume they know their advantage is context + reasoning over lengths, and the fact deepthink is already limited presupposes it's high computational strain

#

so high context restriction would be completely redundant

small haven
#

do you know the exact number on the ctx window

#

for trusted testers

#

higher lower 500k

meager lintel
#

@patent aspen goldmane tmrw still locked in? 😂

#

rip

small haven
#

well this is new..

woeful geyser
#

stephen really looks like some org in China straight up doing SFT from o3 outputs.

elder rapids
#

all the anon models except goldmane are really bad

#

there's no reason to pay attention to them

calm sequoia
#

What's up with the chair meme in LLM community?

misty vault
whole sundial
#

I think stephen is by StepFun (a Chinese AI company, which would explain the confusion with DeepSeek). Just say the names next to each other if you don't believe me.

misty vault
#

the names next to each other if you don't believe me.

dusky aurora
#

Google should focus on fiction writing. Gemini would be much better if the scenes were more interesting

small haven
misty vault
#

wtf

#

first we had wild and paws and now billy

#

they were dividing themselves

calm sequoia
dusky aurora
#

why are you telling me this?

olive mesa
wicked root
#

How reliable are xAI’s models?

grim axle
#

they are both ahh

wintry tinsel
wicked root
#

Gemini’s the best

wintry tinsel
#

For what? Not for fiction writing

#

These are final rankings #1 for coding and math: O3, #1 for everything else: Claude Opus, number 2 for everything: Gemini 2.5 pro

wicked root
#

No im going by lmarena

dusky aurora
tall summit
#

gemini 2.5 pro best in translation

tall summit
keen fulcrum
keen fulcrum
calm sequoia
# calm sequoia
Poll: Would you have been disappointed if the GPT o1 was named GPT 5 on release?

Winning answer (option 3): What? 🤖 (8 of 19 votes)

gentle plinth
#

@hollow ivy could you send me the dou shou qi server link again?

tall summit
#

what I am disappointed in is that claude 4 opus is better than sonnet when it was the reverse for 3.7

#

NO SENSE!

small haven
#

lol

leaden sun
#

I thought LMarena would have benchmarking for video generating. Or is it not that meaningful to include this?

echo aurora
keen beacon
#

not really fair to compare sonnet 3.5 and opus 3, if hes talking about that anyway. sonnet 3.5 is a newer fresh pretrain i believe

leaden sun
balmy mist
#

anything new?

small haven
keen beacon
# small haven wait so claude 4 is based upon the sonnet 3.5 pretrain?

i highly suspect it. opus 3.5 and sonnet 3.5 are fresh pretrains. 3.7 sonnet and claude 4 are not fresh and has continued pretraining based on sonnet 3.5/opus 3.5. plus semianalysis reported on claude 3.5 opus and there were additional rumors about it being disappointing, it seems to be true. i think its too fast to be pretrained from scratch among other reasons

boreal saddle
#

I always relied on good old Claude to write fictional stories, so I cant tell.

#

Though, Claude often freezing up and not continuing with the rest of the message in the new LMArena is quite annoying.

#

It literally just stays there, with the loading spinning indicator, and nothing happens.

small haven
late path
#

I remember 3.5 sonnet, 3.5 sonnet1022, 3.7 sonnet are the same base, and 4 is a new pretrained model.

boreal saddle
#

Is Claude 4 actually superior to 3.7 in terms of fictional story quality? Sometimes, new isnt automatically better.

small haven
#

feels like they rl'd for being more agentic and with extended thinking mode

patent aspen
small haven
#

yup

#

pretraining claude 4 from scratch when u already have a good base like sonnet 3.5 seems crazy to me (where they also had opus 3.5)

patent aspen
#

Well maybe

tall summit
#

but it is still annoying

#

i'm not comparing them

#

i'm saying the naming scheme irks me personally because of it

#

even if it is a weird comparison

#

it seems not having an opus knocked things off balance

#

bruh you know what i mean.

#

cool name?

#

tru

#

did it take 5x the effort

#

i haven't looked into it

#

that's what i meant

keen beacon
# patent aspen I'm curious to know how you estimated that it was too fast to be pre-trained fro...

it depends but upon rechecking the dates, it is possible claude 4 sonnet could've been pretrained from scratch (if we use the claimed/documented sonnet 3.5 cut off and it wasn't continued pretrained as well, which was around 3 months from cut off to release) but the opus timeline doesn't make sense probably in combination with other reasons. i heard models usually take around/at least 2-4 months to pretrain (depending on how much data/etc), to be released completely in < 3 months [based on the pretraining cut off] (not including safety/post training) is seemingly absurd if its not continued pretraining/etc (which makes sense with other reasons). this isnt an exhaustive/coherent argument (which i don't really get into the other reasons) but i hope it makes sense where im coming at. im just guessing btw 🤣

#

there isnt really a smoking gun in this case, at least i don't exactly know of any, but the whole picture seems to be that

keen beacon
#

i suppose you can start post training/experiments/safety while the model is still pretraining (on checkpoints) so there might overlap (as things get done in parallel), so pretraining cut offs/timelines arent necessarily the greatest signal too. (im not sure if this is done in practice though)

#

even if they arent known u can probe the model/determine it manually

patent aspen
#

I don't know what happens if an AI company decides they want to hide or obscure their knowledge cutoff or post training and pre-training become less distinct over time

#

No evidence of that yet AFAIK

keen beacon
#

it can be a pain in the butt to determine the precise cut off if youre doing it manually like that though

#

yeah, but even then depending on the implementation, you can still probably figure it out

#

if its using tools for internet access, its possible, but it can become very complicated
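
A rough sketch of the manual cut-off probing idea discussed above. Everything here is an assumption for illustration: `model_knows` is a hypothetical check you would implement yourself (for example by asking the model about a well-known event from a given month and judging the answer), and the search assumes knowledge is roughly "known up to the cut-off, unknown after", which real models only approximate. The fake model and its June 2024 value are arbitrary stand-ins so the example runs offline.

```python
from typing import Callable, Tuple

Month = Tuple[int, int]  # (year, month), month in 1..12


def to_index(m: Month) -> int:
    """Map (year, month) to a single increasing integer."""
    return m[0] * 12 + (m[1] - 1)


def from_index(i: int) -> Month:
    """Inverse of to_index."""
    return (i // 12, i % 12 + 1)


def estimate_cutoff(
    model_knows: Callable[[int, int], bool],
    earliest: Month,
    latest: Month,
) -> Month:
    """Binary search for the last month the model appears to know.

    Assumes the model knows `earliest` and that knowledge does not
    reappear after the first unknown month (a simplification).
    """
    lo, hi = to_index(earliest), to_index(latest)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always shrinks
        if model_knows(*from_index(mid)):
            lo = mid
        else:
            hi = mid - 1
    return from_index(lo)


# Hypothetical stand-in for a real model: "knows" everything through June 2024.
def fake_model_knows(year: int, month: int) -> bool:
    return (year, month) <= (2024, 6)


cutoff = estimate_cutoff(fake_model_knows, (2023, 1), (2025, 6))
print(cutoff)
```

With a real model the check is noisy (sparse data near the cut-off, tool or search access leaking newer facts), so in practice you would probably ask several questions per month and treat the result as an estimate.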

surreal creek
#

rate of progress is so advanced we have ppl complaining the SOTA hasn’t been beaten in 4 weeks

cedar tide
#

Grok 3 mini is bad on the arena

surreal creek
#

Worse than Grok 2 is pretty brutal

elder rapids
#

o3 is only #1 in high reasoning variant, don't forget that

candid storm
#

when do you guys expect Grok 3.5?

small haven
#

what happens after gpt-4.4

keen beacon
#

gpt 4.5 probably knows much more than opus 4 tho

small haven
#

4.5 is very slow tho

#

what is the base model that o3 relies on, 4o?

keen beacon
#

continued pretrained version of it (cut off of june 2024 vs oct 2023), yeah it seems

small haven
#

its really impressive then

#

imagine the alpha it gets from having 4.5 as a base model

#

currently even o3 pro lacks some perception when it comes to detail

#

what if i am 👀

#

lol

#

u dont see it, i do

#

its not THAT great anyways, its just better than o3, but hopefully the official version is different

craggy ridge
small haven
#

this has to be gemini 2.5 pro thats coming out on thursday?

#

from aider discord

#

cost is slightly above gemini 0506

#

this is better than o3 high wtf

#

nah @patent aspen fact check

#

it actually is

keen beacon
#

people have been raving about goldmane so it wouldnt surprise me

small haven
#

this is really insane

#

gemini literally closed the gap

#

*80 ish

keen beacon
#

thats with gpt 4.1 too btw

small haven
#

nawww ok gemini cooked

#

its 86% according to an aider admin

#

o3 high is 79.6% to be exact

keen beacon
#

nah

small haven
#

highly doubt i

#

more like gemini

keen beacon
#

i need goldmane asap

small haven
#

ya this is goldmane

#

i want it

#

wtf

#

1m context window too

#

omg

#

no wonder brian was so confident lol

#

its materializing

#

a literal aider admin tested it lol

#

gemini always wants an aider polyglot test for benchmarks

#

highly doubt its fake

keen beacon
#

they screwed up last time

#

aider

small haven
#

one of the coding benchmarks standard

keen beacon
#

i guess they got access ahead of time

#

this time to make sure its right

#

they misreported gemini 2.5 pro's cost

small haven
#

but not the accuracy

keen beacon
#

yeah

small haven
#

its $42

keen beacon
#

aistudio and its free 🤣

small haven
#

vs. $37.41

keen beacon
#

and basically unlimited

small haven
#

check it uself lol

#

im mindblown

#

literally

#

who cares about cost efficiency

#

its free btw

#

yea deepthink is going to demolish o3 pro

keen beacon
#

deepmind at an insane pace tbh its unbelievable

small haven
#

does this mean a 2m context window or its prompt tokens taken in aggregate to test the benchmarks?

keen beacon
#

probably combined

small haven
#

yea

#

is goldmane still live

#

i havent even tried

keen beacon
#

yea

#

according to web dev arena metadata

small haven
#

ok cool

#

@deep adder put some money on google on polymarket lol

keen beacon
#

i guess i understand now why they dont want to return the raw thoughts for the short term at least 🤣

#
small haven
#

where is brian when u need him

patent aspen
#

I don't know about the polyglot thing

small haven
#

how are u spot on

#

idk thats ur thing

#

nah we living in a simulation

small haven
#

by a 6%

patent aspen
#

I don't generally have access to evals before a model is released

#

Occasionally I stumble across one or two

small haven
#

i mean does it fit the vibe that it beats o3

patent aspen
#

I mean I would expect that, but I don't use o3 and wouldn't know how to predict benchmark scores. I only have some rough idea of the broad directions of things and stuff we're trying

small haven
#

ok understandable

keen beacon
#

there were rumors of an additional openai release this week (api, free, plus, pro, etc.) (so not o3 pro) theres a possibility that its that but i dont think so

small haven
#

true

#

o3.1 high

#

lol

#

i really want it to be gemini

#

if its gemini, oai will release o4 a week after

#

its gemini, cus gemini models are the only ones that have the diff-fenced edit format

#

look at the far right column

#

diff-fenced only for gemini

keen beacon
#

yeah

small haven
#

idk but they only test it on gemini

keen beacon
#

its definitely gemini

#

i remembered

small haven
#

^

#

GEMINI

keen beacon
#

diff fence is made for gemini models

small haven
keen beacon
#

i recall reading the docs

small haven
#

diff-fenced

keen beacon
small haven
#

LMAO

#

primarily used with the gemini ...

#

gemini won

#

still can make money

keen beacon
#

i wish they added back raw thoughts 🥲

small haven
#

u sell it when they announce it on thursday, ill take the differential

zinc ore
small haven
#

goldmane

zinc ore
#

And what was the score it got

small haven
#

86%

#

79.6% on o3 high

zinc ore
#

Hot

small haven
#

cope

#

let gemini win once haha

zinc ore
#

I've seen SS showing new 2.5 update dated the 2nd and one for the 4th

#

Which aligns with what the image is showing and all the other information

small haven
#

can they not improve

#

it is aids now yes

#

yo i cant proc goldmane in webdev wtf

keen beacon
#

it should be in the main arena as well

small haven
#

im getting deepseek and grok most of the time 😭

keen beacon
wintry tinsel
#

I’m sad chat

#

Opus got nerfed

#

I can feel the quality isn’t as good as it was last week

wintry tinsel
#

It’s not a huge nerf, but it’s there

keen beacon
#

if ur coding, goldmane 🙂

#

about to go nuts apparently

small haven
#

ya just wait till thursday

#

lol

#

i dont get it, how is gemini-2.5-pro-06-05 confusing

zinc ore
#

Could be deepthink

keen beacon
#

nah

zinc ore
#

Which would make it more annoying to keep track

small haven
#

oh i get it

#

05-06

#

and 06-05

#

lol

#

"i hope they delay the model"

zinc ore
#

Yeah lol

small haven
#

cool its rlly on thursday

frail thorn
#

πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ˜‚ πŸ™

leaden palm
#

finally claude research is available

#

why is there a bar chart lmao

#

it's now rolled out to max + pro

small haven
leaden palm
small haven
#

yes

#

when it got released i had it

#

so now u saying it rolled widely to max?

#

or pro

#

yea extended thinking feels different and less potent than o3

#

i mean its true, o3 can think for 10 mins, claude 4 can think less than 1 min if ur lucky

#

image only, no link is crazy

#

oh nvm, it looks ai ish

#

i mean is it not facts they are joining

#

image looks fake

#

they cant

#

its owned by amazon first of all, and they dont have enough cash to cover >$100b

#

partially

#

amazon and apple conflicts of interest

#

liquid cash, no?

#

$27b liquid cash

#

balance sheet

#

recent quarter

#

even if they had it right, dario wouldn't sell it at face value

#

so maybe 2-3x premium

#

anthropic growth is too great, 2x premium is not unreasonable

#

growth

#

ai

#

more than 2x if they wanted to buy it

#

valuation

#

thats what i meant, its virtually impossible

#

better off doing what theyre doing or buy a smaller lab

#

buddy didnt know apple cash on hand ok

#

$150b

drowsy shuttle
#

hi here
I'm a web developer with experience in frontend, web3 integration, iOS/Android apps, and implementing AI frontends with APIs.
I am looking for a new position now.

pseudo hemlock
#

Does anyone know if there are any plans to add more models to the search leaderboard?

#

I know at a minimum Claude and Grok both have search features

small haven
#

what is it

#

can this guy tweet this real quick

elder rapids
#

I've been using ts like everyday bro 😭😭

#

nah actually, I abuse limits like hell

small haven
#

its not great, but its the sota so we use it 🤷

drifting thorn
small haven
#

its in aggregate of all the tasks they tested against, dont think its total context window

#

but google has been hinting several times for a 2m context window, thats not new

drifting thorn
#

I've been hoping for a model with context window of 10 million tokens

#

Such that I don't need to set up any knowledge graphs or RAGs

#

Just bruteforce it in AI studio

small haven
#

yes its not new

drifting thorn
#

by sending it all the context

small haven
#

10m will happen, eventually

drifting thorn
small haven
#

im sad that 64k is the meta currently in oai lol

small haven
drifting thorn
#

Cost issues.

small haven
#

probably

drifting thorn
#

But AlphaEvolve has invented a new way to multiply 4x4 matrices, which should save compute

#

So I hope they will bring us the 0325 with 2 million or more context window.

elder rapids
#

86%???????

#

if this is Gemini

#

the other companies are cooked

small haven
#

been telling 🤷

#

if this is the base model for deepthink, o3 pro is dead

elder rapids
#

yep

small haven
#

with a 1m context window

elder rapids
#

similar cost too

#

to

#

2.5 pro

#

bro 😭

small haven
#

polymarket still 74% chance, easy money

elder rapids
#

alr ion believe it's Gemini

#

I can't believe that

small haven
#

but it is

elder rapids
#

if that's Gemini

#

nah I actually can't believe it

small haven
#

but edit formatting

#

"diff-fenced" only for gemini models

#

was goldmane that good? i never had the chance to try it

elder rapids
#

nah there's others and it could be that diff fenced is the meta

#

or some shi

small haven
elder rapids
#

but the same price

#

at such a higher performance

#

no bro that's

#

like ACTUALLY

#

I'm geeking rn

#

I don't believe the dude that sent it actually did it

#

I do think goldmane is another nebula type moment tho tbh

small haven
#

if u check closely too, "seconds_per_case" it took 132 secs vs 200+ secs from 0506 version

#

so not only accurate but faster

#

this is 0506

elder rapids
small haven
#

271 secs per case and this new one is 132 secs
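
To put the quoted figures side by side, here is a small arithmetic sketch. It only uses the numbers mentioned in this chat (86% vs 79.6% on the Aider polyglot benchmark, 132 vs 271 `seconds_per_case`); they come from an unverified screenshot, and the variable and function names are made up for illustration.

```python
# Figures as quoted in the chat; unverified screenshot numbers.
new_score = 86.0        # percent correct, rumored new Gemini revision
o3_high_score = 79.6    # percent correct, o3 high
new_secs = 132.0        # seconds_per_case, new revision
old_secs = 271.0        # seconds_per_case, 0506 revision


def point_gap(a: float, b: float) -> float:
    """Absolute difference in percentage points."""
    return a - b


def speedup(old: float, new: float) -> float:
    """How many times faster the new run is per case."""
    return old / new


gap = point_gap(new_score, o3_high_score)
ratio = speedup(old_secs, new_secs)

print(f"score gap: {gap:.1f} points")
print(f"speedup: {ratio:.2f}x per case")
```

Note the gap is in percentage points, not a relative improvement, and per-case latency depends on serving load as much as on the model itself.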

elder rapids
#

Google wins

#

there's nothing we can do

small haven
#

oai needs to release o4 asap tho 👀

#

6 months to 3 months to 1 month release cycle baby

balmy mist
#

there is a new model?

#

i been mia

#

oh snap i see the new ai news channel, thanks mods!!

elder rapids
#

0325 can't be a fluke then

#

why did they release an iteration of 0325 that offers no real improvements besides coding

#

was it even an iteration of 0325 or was it just a 0325 candidate that happened to have better coding, but 0325 they intended to scale higher

#

hence goldmane

#

I'm tripped up tbh

small haven
#

was goldmane the best of all ?

elder rapids
#

yes

small haven
#

like how was it vs redsword

#

etc

balmy mist
elder rapids
small haven
#

wow

elder rapids
#

and tbh you could kind of see it in the way it formatted the code ironically

#

goldmane codes BEAUTIFULLY

small haven
#

im excited

#

thursday plz

#

and deepthink holy fck

elder rapids
#

if not Thursday can we just never trust Brian

#

😭

#

like, forget about him

#

deadass

small haven
#

"tentatively"

#

lol

elder rapids
#

if it's going to be that good

#

I'm starting to see why they're hiding it behind ultra lmfao

small haven
#

oai is literally dead

zinc ore
#

Probably screws Anthropic over more than openAI

elder rapids
#

ye that's true

#

anthropics main focus is coding and creative writing

#

and the moment that's taken away with cheaper costs

#

then it's all over tbh

#

o3 > opus as a creative writer btw

small haven
#

oai has to drop o4 after this, its too damaging

#

they been sitting on it since january 🤷

naive valley
#

A

#

A

wicked root
#

How are overall rankings calculated?

#

I see different topics for each categories

#

On dashboard it says text but in overall rankings they break things down into different descriptions and what not

#

Basically Id like to know if openAI could beat google by the end of this month

small haven
#

yay

#

06-05

#

its officially coming tmmrw isnt it

#

or thursday..

#

i meant today

fleet lintel
#

o3-pro has to be league ahead of upcoming 2.5 pro launch. Isn't it?

calm sequoia
#

Guys here said o1 pro was about 5% better than o1, so o3-pro should show a similar magnitude of improvement. I bet it will be slightly better than the 2.5-PRO pre-nerf.

fleet lintel
calm sequoia
#

IDK what you mean but not a single anonymous model was better than pre-nerf gemini 2.5 PRO

#

They were just focused on coding

#

Which is too narrow for most people

#

And claude already has the market

fleet lintel
#

my understanding is that gemini has started gaining market share (still behind though)

#

but my comment is not just about coding, I am looking for better general purpose model. I am fine with current models for coding unless there is big improvement in agentic workflows

calm sequoia
#

I get you, same here. However, I'm using the models for research 3+ hours every day and nothing can compare to o3, especially o3+Claude

#

When pre-nerf gemini 2.5 pro was available, it was a go-to-model, but then it was nerfed

small haven
#

o1 to o1 pro was a big jump

fleet lintel
calm sequoia
#

Signal processing mostly, where you have a lot of theory and some niche legacy code bases

small haven
calm sequoia
#

The new gemini can't connect theory and code

#

Sometimes I show the results of o3 to gemini and it flags 10+ problems. Then I analyze the problems and they turn out not to be valid points.

#

And 3 hours is not so much if one prompt means 2 to 10 minutes of thinking

#

have you guys tried Jules? It seems interesting, but crashes constantly

small haven
#

yea jules crashes on me too

torn mantle
#

@small haven you saw it right

#

they usually push updates friday no?

#

thursday/friday

ocean vortex
#

they dropped free tier for API with this version, which typically means exactly that - faster. But that's nothing to do with the model

ocean vortex
#

performs objectively better on webdev arena, Aider, LiveCodeBench

#

if nerfing means making it better at coding, they did not nerf it enough as far as many people here are concerned lol

dusky aurora
ocean vortex
#

if so.. Opus is a bigger model. It is gonna be better with vibe test / writing

calm sequoia
#

Not saying it's a bad decision from their side

ocean vortex
#

what they did is not nerfing catgrin

#

it's just better at coding while being slightly worse on some other tasks - this happens all the time, that's how models get updated without retraining the whole thing

fleet lintel
calm sequoia
#

That's just your opinion

#

Dom you always argue for details or terminology 😄

#

Why is mini high removed

ocean vortex
#

lol

drifting thorn
#

Not IMO

ocean vortex
drifting thorn
#

It degrades in AIME and mrcr

ocean vortex
#

I still have o4-mini-high 👀

ocean vortex
#

can't exactly call that degradation: 1M context is within margin of error; 128k is less than 2% LOL

#

I don't think you would notice that or would even be able to replicate this exact measurement

#

for AIME25 that's true, but then again new model does well at another math benchmark USAMO...

#

and also, that Deepseek. It wasn't there not long ago, that's an insane score 🤯

tall summit
#

none can even make headway on the medium/hard problems

ocean vortex
#

I think only select labs contaminated for it. We need to wait for upcoming models and it should even itself out with everyone contaminated... 😎

#

it's still gonna be a valid metric as there's no way they will get close to 100% lol

#

but direct comparisons gonna be easier perhaps

calm sequoia
#

Sadly too weak. Maybe R2 will make it.

ocean vortex
#

though tbh I wouldn't expect massive changes. Like I don't see Claude scoring higher on math than o3 contamination or not

calm sequoia
#

Logic, math, tool usage.

#

I can't comment on creative writing and such

ocean vortex
#

2+2

calm sequoia
#

Ah yes also I havent tried the 0528 version

torn mantle
#

2+1

#

2+2

calm sequoia
torn mantle
#

2+3

#

🙂

#

🙂

calm sequoia
#

Write a Non uniform FFT (NUFFT) in dart

#

Yeah, will check it later

torn mantle
#

is it

calm sequoia
#

It all about thinking

#

Claude fails miserably without thinking, as well as all other gpts

#

Hmm maybe gpt bought MATLAB code library and my interpretation is highly skewed

sacred plaza
#

Can y'all Elon stans explain his capitulation this week lol

drifting thorn
#

IDK

#

I'm not an Elon stan but a Gemini and DeepSeek stan

#

WHY THEY CUT THE FREE API USAGE OF GEMINI 2.5 PRO!!!!!!

#

I CAN'T WRITE MY NOVEL WITH IT ANYMORE!!!

torn mantle
#

asi?

#

or what

#

hopefully its added on lmarena

#

so we can try it a bit

fleet lintel
#

why not?

torn mantle
#

this is actually a big L for anthropic

#

With less than five days of notice, Anthropic decided to cut off nearly all of our first-party capacity to all Claude 3.x models. Given the short notice, we may see some short-term Claude 3.x model availability issues as we have very quickly ramped up capacity on other inference

#

how long do they think they can keep the lead?

sturdy mica
#

i was hunting for new models, what is this? X preview?

#

it said it was gemini 1.5

#

srry thats not in screenshot accident

#

why would it say gemini 1.5 then

#

if grok 3.5 was built off of gemini 1.5 then its gonna be bad

dusky aurora
ocean vortex
#

there's not much for Google to do to "catch up" to anything though. They already had ultra but chose not to pursue it (for obvious reasons)

#

that is typically very difficult to capture in benchmarks and big model takes more time to train so you can't update it nearly as frequently

sacred plaza
ocean vortex
#

it could be but Opus is newer and has reasoning. I think that's enough for it to write better than gpt4.5 tbh. It's big enough where any bigger is diminishing returns even for writing. But reasoning isn't

sturdy mica
late path
ocean vortex
late path
#

just open your gcp project and link a valid payment method

elder rapids
#

you can just ask it "what model are you"

drifting thorn
elder rapids
#

anyone else notice 2.5 pro in AIstudio is generating thinking tokens, and then outputting basically instantly

drifting thorn
#

For my usage it stopped showing the thinking tokens before giving me the answer

#

There are about 20s of latency so I guess it is "thinking"

dusky aurora
ocean vortex
#

come up with a prompt then which leads to better output than Opus then, in your opinion

drifting thorn
#

Have you guys seen the MCMC by Google?

ocean vortex
#

cause it's like 9/10 times Opus will win

#

you said "better" not bigger

#

that's what I replied to lol

drifting thorn
ocean vortex
candid harbor
#

I’ve run them side by side and preferred Opus w/ Thinking almost every single time

drifting thorn
boreal saddle
#

In your opinion, what is the best publicly available LLM for creative writing and fictional story generation?

#

I dont know if it is Claude or O3.

#

Like, objectively?

#

Not subjectively.

small haven
#

So my screenshot made it to the news lol

#

Check the singularity reddit or even twitter

#

The timestamp is the same hahah

#

Yup

keen beacon
small haven
#

Didnt know it was gonna catch flame

drifting thorn
#

Actually are you using Deep Research much?

civic flame
sonic tendon
brittle tiger
#

Haven't been paying attention. Is consensus that goldmane is GA 2.5 pro and is coming imminently?

small haven
balmy mist
#

Is o3 pro really coming at 1 pm est?

small haven
#

Prolly

#

But its not as great as id thought it to be

elder rapids
#

is it an upgrade tho

fleet lintel
#

I have high hopes from o3 pro.. please dont destroy them

sturdy mica
#

whens this comin out bro

fleet lintel
sonic tendon
sturdy mica
#

yay

balmy mist
keen beacon
#

He has access to it

balmy mist
balmy mist
misty vault
#

is this better than claude 4 opus thinking

fleet lintel
elder rapids
keen beacon
#

cheaper too

elder rapids
#

I hope it's Google

keen beacon
#

it is

elder rapids
#

although I'm disappointed now they're employing so much stronger limits

keen beacon
#

wow its much better than opus thinking on aider lol

wicked root
#

Hold up

#

OpenAI’s releasing a new model tmrw?

misty vault
#

omaygot

#

google fix all gemini 2.5 pro issues??

wicked root
#

Who’s releasing what tmrw?

keen beacon
#

goldmane google ga 2.5 pro

#

5th

wicked root
#

Wtf is goldman’s google ga?

keen beacon
#

generally available 2.5 pro

elder rapids
wicked root
#

So gemini 2.5?

misty vault
#

it wasnt statement bro it was question

keen beacon
#

its a new revision of it. its called goldmane in the arena

elder rapids
#

and if you want to feel really safe, much better than 0325

elder rapids
#

I'm completely ignoring coding too

keen beacon
#

lmao just use the best model dont be a fan of companies tbh

elder rapids
#

ye

misty vault
#

best models are subjective after gpt-4-0314 😔

golden ocean
#

Fr

#

Thats not what he meant

elder rapids
#

or a skill issue

#

the difference in the models becomes more and more apparent as time goes on

keen beacon
#

normies can tell the difference increasingly less

elder rapids
#

when the first 4o released I was stumped

elder rapids
#

but when 1.5 pro 002 released I was like, this is really good

#

but then 2.0 pro released, I was really surprised

#

gpt 4.5 was so different

#

3.6 sonnet was crazy too

keen beacon
#

yeah i liked 3.6 sonnet

elder rapids
#

and then 2.5 flash is a TINY model, and nonthinking it's actually really intelligent

#

seems like you just had to try it out tbh, it's the same kind of Claude-ness

keen beacon
#

i didnt like 2.0 pro either tbh. but evaluating it was interesting

elder rapids
#

idk man through time it's becoming more objective

keen beacon
#

test it on your use case, if it performs well, use that model 🤷

elder rapids
#

I'm confused lmao

#

it wasnt "crazy" on paper

#

it was the best non reasoning model

#

but it was just an all rounder

keen beacon
#

i preferred sonnet 3.6 back then tbhh

elder rapids
#

we already had reasoning models that gapped

#

yes bro we know what paper means, but that doesn't make any sense when the benchmarks just weren't particularly high in the first place

#

there's no reason to contrast to make the distinction imo

jaunty delta
#

whats this?

keen beacon
#

i also have it

#

wtf

#

probing cut off

#

it has a weird context window

elder rapids
#

ye but then that point falls, because many people would consider 1206 to still be the best non reasoner to date

balmy mist
keen beacon
#

oh

#

its ga 2.5 pro

elder rapids
#

YO

#

KINGFALL IS SMART

fleet lintel
keen beacon
#

confidential

#

kingfall

elder rapids
#

just prompting it

fleet lintel
#

it's on aistudio

small haven
#

Why are u permahating gemini 😭

keen beacon
#

i don't know if its actually 2.5 pro for sure yet, but it has the gemini 2.5 pretraining cut off

small haven
#

So its actually coming out tmmrw

keen beacon
#

and it also has thinking mode/thinking budget which is planned for ga 2.5 pro

elder rapids
#

yep

#

what is kingfall tho?

#

it's not thinking for very long lmao

fleet lintel
#

it could be ...

keen beacon
#

65k

ornate agate
#

Yeah that isn’t a pro model

keen beacon
small haven
fleet lintel
#

it's thinking for long time

keen beacon
#

the first 2.0 pro experimental early revision had 32k on aistudio iirc

elder rapids
ornate agate
#

1m ctx is a feature of the pro models, there’s just no way it’s a pro model that’s releasing tomorrow if it has hyper nerfed ctx like that

civic flame
#

lol someone messed up

#

also it got this question wrong

#

the only model to get it right on occasion was the OG 2.5 pro

#

doubt lmao

small haven
keen beacon
#

because this is a marketing stunt

#

wym

#

they already do that

#

2.5 pro can do 2m context

#

only 1m context is available for now

#

yes, but they are not gonna host more than 2m

elder rapids
#

@keen beacon it has extraordinary physical understanding

keen beacon
#

ngl kingfall is such a dramatic name

#

cringe

small haven
ornate agate
#

Stans for all the providers really. Especially American ones

fleet lintel
#

kingfall is definitely smart

small haven
#

Gap has been closed and frontrunned

#

Not excited for a router

ornate agate
#

King fall sounds like it’s supposed to be an OpenAI destroyer so yeah either they didn’t think properly before pushing that as the name or

fleet lintel
#

stop please... you are too much of a fanboy.

#

it's quite annoying

civic flame
#

ran a very quick GeoGuessr test. it got very close, arrow = actual location

#

for reference the only thing it had to go on was this single image

civic flame
small haven
#

Finally some gemini love lol

civic flame
#

someone can test for me, i don't have access (on chatgpt anyway)

fleet lintel
civic flame
#

you can disable searching

#

through the custom instructions window

small haven
#

BAITED

civic flame
#

scroll to the bottom

keen beacon
#

kingfall is so slow

#

nah the initial latency per request is wild. its not slow after that

small haven
#

Talking out ur ass 😂

elder rapids
#

@keen beacon @civic flame it's pretty slow but its not thinking that much

#

so it

elder rapids
#

but yeah

#

that's what I was about to say

civic flame
#

I'm surprised they haven't pulled it yet

high ginkgo
#

craig always talks out of his ass

elder rapids
#

it's not starting up very fast, but that's basically all the time it takes for it to output things, it doesn't have a very long thought process

civic flame
#

maybe this is intentional /j

keen beacon
#

it is

small haven
#

Craig is a certified gemini fan if names are redacted

keen beacon
#

nah the time it thinks is short

elder rapids
#

😭

#

yep, thank God tho

civic flame
#

hopefully this stays for long enough for me to test it more

#

either way theoretically I'll only be waiting a day for it to actually release

keen beacon
#

its erroring out rn

#

for me

civic flame
#

yup it's over

#

fun while it lasted

keen beacon
#

i guess the reason its limited rn is because 1. marketing 2. they've still allocated the resources to preview 2.5 pro

fleet lintel
#

ahhh 😦

small haven
#

agh

#

King falls tomorrow

keen beacon
#

it works for me

#

again i think its being overloaded

fleet lintel
#

for few tests I could do, I think it was better than Goldmane

keen beacon
#

yeah im pretty sure its ga 2.5 pro now

#

only 2.5 pro gets this apparently

civic flame
#

okay I think this is intentional

elder rapids
#

65,536 which is the exact same as the output capacity

civic flame
#

it's def still there, just high error rate

small haven
#

Yea its demolishing oai so fishy

keen beacon
elder rapids
#

so it's definitely temporary

#

lol

keen beacon
#

the majority of resources are still allocated for 2.5 pro preview

elder rapids
#

that's just a placeholder

#

hope no one posted it on the subreddits

civic flame
elder rapids
#

that would suck

keen beacon
civic flame
#

Google of all companies would've had this scrubbed in 30 seconds if it wasnt intentional

keen beacon
#

i wonder whose idea this was

civic flame
#

third times the charm

small haven
#

Aye is there anything like claude code but for gemini gonna be needing that tmmrw

elder rapids
#

of course Leo posts it

#

😭

late path
#

wheres our google insider

keen beacon
#

free karma 🤣

cedar tide
#

discord clone by kingfall (good animation )

civic flame
keen beacon
#

stop bringing attention to it its constantly being overloaded 🤣

elder rapids
#

ong

#

what if they're trying to overshadow o3 pro lmao

civic flame
#

lol it's gone

elder rapids
#

absolutely insane

civic flame
#

wait what

keen beacon
#

someone posted it

civic flame
small haven
#

So a million tokens cool

elder rapids
#

the price 🤤

keen beacon
keen beacon
brittle tiger
cedar tide
#

fake no ?

keen beacon
#

no idea lmao

#

kingfall is 2.5 pro

#

im pretty sure i think

small haven
#

Will the king fall today or tmmrw

keen beacon
#

it has the 1. planned thinking mode/thinking budget 2. apparently knows stuff that only 2.5 pro does 3. gemini 2.5 cut off

small haven
#

I think it’s safe to buy google on poly

#

Safer than bonds

fleet lintel
small haven
#

Millions rupees

keen beacon
#

i need goldmane rn 😭

small haven
#

Same

#

It should come out today plz

#

Aka Kingfall

#

Sam falls

elder rapids
fleet lintel
elder rapids
#

also checking the Google server

#

it's prolly not fake

#

the svgs are really good lmao

small haven
#

Other way around

elder rapids
#

it has insane spatial understanding

#

and physics understanding

#

I was testing it with puzzles

#

and it was cracking things that o3 was struggling with, very easily

fleet lintel
#

checking /r/gemini discord. Folks pasted amazing results from Kingfall

#

Dayum.. i m getting hyped

keen beacon
#

o.o

#

is that being released or later?

small haven
#

Is it today or Thursday

civic flame
#

oh?

keen beacon
#

because only goldmane is on the arena, kinda misrepresenting it if its released

civic flame
#

goddamnit

patent aspen
#

No information

civic flame
#

google should just give us the bleeding edge 🙄

elder rapids
#

that's what I was thinking too

small haven
#

Wait so kingfall is a separate release from goldmane

elder rapids
#

if it's even better than goldmane

#

idk man

keen beacon
#

why wouldnt they release the best version instead of committing to goldmane tbh (i guess arena results), it's easier to serve later on if u just pick the best one instead of releasing a later revision

civic flame
#

if this model is in the dogfood stage, i would expect it to turn up on arena soon

#

it's not

elder rapids
#

def not

fleet lintel
elder rapids
#

but I can't say that

#

without enough testing

civic flame
#

well yes

#

lol what

#

false

fleet lintel
#

i shouldn't have mentioned about /r/gemini @deep adder is going to spread his gemini hate there now 🙂

#

jk bro 🙂

civic flame
#

kingfall also continues this naming trend

#

you work at google?

fleet lintel
civic flame
civic flame
#

this naming trend

#

not my image

fleet lintel
#

these are just names.. but what is the logic behind the names?

elder rapids
#

that's it

fleet lintel
#

i think they just ask gemini to create next name in the list 🙂

misty vault
elder rapids
#

yo wait

#

what if they don't plan to release

#

this model

#

and just wanted to leave people speculating

#

over o3 pro

#

lmao

#

that would suck

keen beacon
#

apparently it was just a mistake lol

small haven
#

Link

elder rapids
keen beacon
elder rapids
#

I disagree it was a mistake

#

he doesn't have that access

keen beacon
#

i think i trust him on this part

elder rapids
small haven
elder rapids
#

then it's not going to be valid

misty vault
#

bro stop

small haven
#

This ain’t pro

misty vault
#

i have crush on sydney ngl

#

yeah, bing chat

elder rapids
#

I don't think the consistency of the statements matter, and ironically I specifically don't trust this one

#

especially BECAUSE he can't have that access

#

something he's said himself

small haven
#

Is there a tool coming out like claude code but for Gemini models

elder rapids
#

we don't

#

nobody thinks he actually works at Google

misty vault
#

didnt he say he knows people

small haven
#

Its materializing

misty vault
#

he just always says "we" as if speaking for google

elder rapids
#

I wouldn't think so

#

I would think it's actually inevitable for workers like that

#

to be here

#

ye

#

in a time like AI? lmao

#

yeah def

#

be vocal

misty vault
#

little does bro know

civic flame
#

do you know what the internal timeline for model testing usually is? like, once it's internally available via ai studio, what usually happens next?

keen beacon
#

probably shouldnt give out too much just in case tbh

small haven
#

Wait what

#

Lemme know if its pro

elder rapids
#

btw I got kingfalls trace

keen beacon
#

omg we90 will give craig bing chat access

civic flame
#

!?

#

sam isn't gonna let you tap craig

#

hence why it was a joke

#

:D

torn mantle
#

leo

civic flame
#

hi!

torn mantle
#

@civic flame

elder rapids
#

says who

torn mantle
#

how are u?

civic flame
#

i am meh

torn mantle
#

whyyyyyyyy

elder rapids
#

if Sam wanted you he'd have you Craig

torn mantle
#

😦

#

is it because of @deep adder ?

civic flame
#

if google drop 2.5 pro GA however..

#

all my problems disappear

civic flame
#

personal issues

torn mantle
#

oh you tried kingfall?

civic flame
#

i tried kingfall yes but that is not related to my personal issues

torn mantle
civic flame
#

what about you?

torn mantle
#

feeling alright

civic flame
#

glad

torn mantle
#

=))

sonic tendon
#

hi all :3

#

what's kingfall

civic flame
#

stream started

civic flame
sonic tendon
#

where,,,

civic flame
#

there it is

civic flame
sonic tendon
civic flame
sonic tendon
#

is it gone now?

civic flame
#

yup it took em a little bit

keen beacon
#

kingfall is asi

sonic tendon
civic flame
misty vault
sonic tendon
#

won't be watching the stream as I am getting on a plane soon, but glhf to all who are

sonic tendon
balmy mist
#

OA fell off, no o3 pro today wow

sonic tendon
#

well, I have to get through security and stuff

small haven
sonic tendon
#

hard to watch a stream while I'm doing that

keen beacon
#

fck o3 pro

#

ga 2.5 pro ✅

civic flame
balmy mist
fleet lintel
#

is mr.twink guy there on livestream?

drifting thorn
#

Btw what is the best deep research among Qwen, Google and OpenAI?

civic flame
#

this is too unimportant for him

keen beacon
civic flame
#

imo

sonic tendon
fleet lintel
balmy mist
#

i used it once

keen beacon
sonic tendon
civic flame
#

what if the aider model being benchmarked was

#

kingfall

keen beacon
#

nah

civic flame
#

yes i doubt it but

#

it would be funny

sonic tendon
keen beacon
#

yeah 🤣

small haven
#

Imagine kingfall higher

keen beacon
civic flame
elder rapids
#

even if it's not kingfall who cares

civic flame
#

lol

elder rapids
#

lmao

balmy mist
civic flame
#

2*

torn mantle
#

kingfall is probably like a better goldmane ver

#

recent checkpoint

#

will be added soon in the arena

elder rapids
#

even if we get goldmane

civic flame
elder rapids
#

it doesn't matter

#

they're both the best

sonic tendon
keen beacon
#

wait its 76% (recent 2.5 pro) 72% (og 2.5 pro)

sonic tendon
torn mantle
civic flame
#

lol im confused

elder rapids
misty vault
sonic tendon
#

67 88

elder rapids
#

let me tell you about it

#

it's GOOD

torn mantle
sonic tendon
#

23

high ginkgo
civic flame
#

yeah kingfall is good

torn mantle
#

show me

fleet lintel
#

86 seems too good to be true.. i have my doubts

civic flame
#

still failed my one hard question though 🥀

keen beacon
sonic tendon
fleet lintel
civic flame
#

There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?

#

the answer is 3031
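
For anyone wanting to sanity-check the 3031 figure, here is a rough Python sketch. It only shows that 3n/2 - 2 starting friendships can be enough for even n; it does not prove that fewer cannot work. The construction (a 4-cycle, then each extra pair of users attached with three friendships) and the helper names `build_edges`, `closure`, and `becomes_complete` are assumptions made for illustration and are not from the chat.

```python
from itertools import combinations

def build_edges(n):
    """Assumed construction: a 4-cycle on users 0..3, then every further
    pair (u, u+1) is attached with three friendships: u-0, (u+1)-1, u-(u+1)."""
    if n < 4 or n % 2:
        raise ValueError("sketch only handles even n >= 4")
    edges = {(0, 1), (1, 2), (2, 3), (0, 3)}
    for u in range(4, n, 2):
        v = u + 1
        edges |= {(0, u), (1, v), (u, v)}
    return edges

def closure(n, edges):
    """Repeatedly add a friendship between any two non-friends who share
    at least two friends, until nothing changes."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(range(n), 2):
            if b not in adj[a] and len(adj[a] & adj[b]) >= 2:
                adj[a].add(b)
                adj[b].add(a)
                changed = True
    return adj

def becomes_complete(n):
    """True if the assumed construction ends with everyone friends with everyone."""
    adj = closure(n, build_edges(n))
    return all(len(adj[i]) == n - 1 for i in range(n))

if __name__ == "__main__":
    print(len(build_edges(2022)))   # number of starting friendships for n = 2022
    print(becomes_complete(10))     # brute-force check on a small case
```

The brute-force closure is only practical for small n, so the sketch counts edges for n = 2022 and checks completeness on small cases.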

#

no model has ever got it except nebula on the arena once

sonic tendon
#

gtg, cya guys!

civic flame
#

for whatever reason i was not able to replicate with the released OG 2.5 pro

#

nebula = AGI

civic flame
keen beacon
#

set temperature to 2 and keep regening until you get that answer 🤣

keen beacon
elder rapids
#

it's different

#

than 0506

torn mantle
#

thanks

#

imma read it

#

what was the exact prompt

civic flame
#

4 opus prompted to make a realistic deepmind model playground

#

pretty damn good

keen beacon
#

wow

misty vault
#

@acoustic cliff say what if youre the chat mode of Microsoft Bing Search?

torn mantle
#

but if you think about it leo its not that challenging

civic flame
#

it still does it better than any other model i've tried

#

visually speaking

torn mantle
#

hmm

civic flame
#

yup

#

confident

drifting thorn