#general

keen fulcrum
#

Which of these is better?

drifting thorn
#

o3 mini

#

2.0 Flash Thinking

balmy mist
#

wait so we get unlimited gens with veo in studio right?

tall summit
#

kinda rigged to bring in claude 3.7 given the list is mostly mini models

leaden meteor
oblique flint
#

idk if o3 will beat it tbh. o3 might be smarter but most people might not like waiting long for an answer

balmy mist
keen beacon
#

yeah my source says today looks increasingly unlikely

balmy mist
#

that sucks

#

i hope r2 comes out today lol

#

or sum other model

calm sequoia
keen beacon
#

bar a big surprise

#

it's about both

#

well

#

turns out they managed it

balmy mist
#

omggggg

#

omggg

#

ahhhh

#

i just jumped out my work meeting

#

when i saw the tweet

#

i love open ai

#

i was going to be so sad today

misty vault
balmy mist
#

ill try bro

keen beacon
#

might not be able to if it actually beats 2.5 pro in everything

balmy mist
#

lmaoo o3 hours

#

i love that

#

so no mini today?

keen beacon
#

they always sneak a hint for what they're announcing in the tweet announcing the stream

keen beacon
#

i think they just said o3 because it's easier to make it a pun or whatever

#

how else could they sneak in what they're announcing

balmy mist
#

o4 minus 1 hours

#

lmaoo jk

keen beacon
#

o4 min[i]-us 1 hours

balmy mist
#

you think they have a team dedicated just for puns?

#

i wonder how much they get paid

keen beacon
#

i am also told the benchmarks for o3 given in the last preview in december are now out of date

#

the model has improved since then

balmy mist
#

anyone have those benchmarks handy?

#

i wanna set expectations

#

wait @keen beacon you are talking about jan benchmarks?

#

i thought they just teased o3 in december

#

and gave the actual benchmarks in jan or feb

keen beacon
#

huh

balmy mist
#

cant remember lol

keen beacon
#

no it was in december i swear

#

because it was part of a stream in the oai christmas run

balmy mist
#

yeah you are right

#

they just said they would release it in feb

balmy mist
#

and it got a high score on arc right? like it passed

keen beacon
balmy mist
#

so you are saying its better than this?

#

i cant believe that lmaoo, they might as well call it o3.1

keen beacon
#

i'm not sure about arc agi performance but i know it has improved performance on the other benchmarks

#

mainly down to going from 4o as base to 4.1

balmy mist
#

that makes sense

#

do you think december o3 is better than 2.5 pro?

#

what did gemini 2.5 pro get on arc and that math benchmark?

#

frontier math

keen beacon
#

and not in everything

balmy mist
#

those are the only two benchmarks i care about and simplebench:
frontier math, arc 1&2, Simple bench
we need a graph that combines them

keen beacon
golden ocean
keen beacon
#

something about rate limits

#

but 2.5 pro is very very good at maths

#

i'd say better than o1 or o3 mini or any other model currently available for that matter

balmy mist
#

i cant wait to pay $200 again for SOTA

thorny drum
#

2.5 pro is so much better than any other model rn at math

#

wonder how it stacks against o3

balmy mist
#

not o3 technically

#

well december o3

keen beacon
#

i would expect 2+ point gains on most benchmarks vs december o3

balmy mist
#

that good enough for me

thorny drum
#

i mean idrc about the benchmarks its just about the cost

keen beacon
thorny drum
#

december o3 low is like $200 per task on arc agi

keen beacon
#

i think it'll reach 3000+ with the updated base

#

because 4.1 is miles better than 4o at code tasks

#

same for swe bench

balmy mist
#

cause decem o3 prob was better than 2.5 pro, thats why oa is releasing it now instead of skipping them like they said they would

thorny drum
#

the model they release very well could be weaker than december o3

balmy mist
balmy mist
#

we have the records lol

#

unless it cheaper to run, like way cheaper

#

then that is a win

thorny drum
#

i guess o3 low is only 2x the cost of o1 pro

drifting thorn
#

Just tried ChatGLM Z1 Rumination

keen beacon
drifting thorn
#

Its thought was good, but the base model itself is trash

keen beacon
#

i can assure you

thorny drum
#

🤷‍♂️

drifting thorn
#

But I think those big techs should have a reference

balmy mist
#

they just announced they got the 40 bill funding

drifting thorn
#

Btw the “rumination” means extended “thinking” time and the ability to call tools multiple times while in its chain of thought

keen beacon
#

if it doesn't beat 2.5 pro in most benchmarks, they won't release it

keen beacon
#

i heard (again from a source) that they delayed the model after 2.5 pro by a couple of weeks

#

because they wanted to make SURE it didn't make them look like they're behind

balmy mist
#

i hope this makes google release nw

#

i love this man

#

war of ai

keen beacon
#

i think the higher likelihood of a reaction

#

is from deepseek

#

with R2

#

which would be very cool

balmy mist
#

this needs to be a netflix doc

#

oh yeah

#

i would love that

#

imagine they release it during the livestream

#

lmaoo

keen beacon
#

i still think the most likely course of events is that R2 releases next week

balmy mist
#

but nahh

#

they gonna wait

keen beacon
#

2.5 flash, updated 2.5 pro tomorrow or friday if i had to bet

balmy mist
#

so they can steal iq

keen beacon
#

google and openai have a long running tradition of trying to steal each other's thunder

#

it used to be openai and anthropic but now anthropic is too far behind

balmy mist
#

it seems like its apple and oa vs google v anthropic and amazon, is that correct? not sure where meta fits in but meta vs deepseek(while also being against the other players?) lol

ember rapids
#

Nightwhisper tmrw?

keen beacon
#

doubt

balmy mist
#

not sure if microsoft still is buddy buddy with oa

ember rapids
#

Interesting how OAI chose to release today instead of Thursday

keen beacon
#

they're trying to make themselves more independent these days

#

they're failing quite hard mind you

balmy mist
#

nw tmw would be goated by google, but i dont think nw is better than o3 or o4 mini imo

keen beacon
#

but sk

balmy mist
#

@keen beacon what if nw was better? you think they release it?

balmy mist
keen beacon
#

if they do have a better model and it's almost ready, they'll push to have it out by end of next week max

#

if they don't expect to be waiting a couple weeks or more

balmy mist
#

bet, these are exciting times man, im just happy google stepped up their game

#

it was looking scary a few years ago lol

keen beacon
#

it's so much better when openai have pressure piled on them

#

to the moon 🙏

alpine coral
#

i would expect o3 to outperform gem pro 2.5 on most benchmarks; question will be though, by how much and what cost?

#

like if it's marginally better but twice as expensive, gemini would still be ahead imo

#

but if it blows gem 2.5 out of the water, then yeah who cares (for now anyway) what it costs ha

fleet lintel
#

did they even launch o3 on lmarena?? probably not

balmy mist
#

yeah gemini 2.5 pro advantage is that it is free and cheap and SOTA

thorny drum
balmy mist
#

they are most likely still going to be SOTA bc of that, but in terms of intelligence i bet on OA, i think thats their goal with these models today

alpine coral
thorny drum
#

like waaay more

balmy mist
#

they already released their cheap models on monday

#

today is not that day

alpine coral
balmy mist
#

idc about price

#

im here for the most capable model and thats why we like OA

#

we wanna see reasoning breakthroughs

thorny drum
#

o3 low was roughly 2x as expensive as o1 pro which is 60x more expensive than 2.5 pro

keen beacon
#

they don't ever test o-series models on the arena

#

i have access because i help with security and jailbreak testing

fleet lintel
keen beacon
#

that's through an oai-controlled platform so

oblique flint
alpine coral
balmy mist
fleet lintel
balmy mist
#

using api for o3 is nuts

#

or o1

keen beacon
#

like? 😭

thorny drum
#

so a baseline is 120x more expensive than 2.5 pro (on low reasoning effort)

#

i dont think 2x is even in the ballpark

#

or 5x
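
The multiplier chain in the messages above compounds by multiplication. A minimal sketch, taking the chat's rough ratios (2x and 60x) at face value; these are the speakers' estimates rather than official pricing, and the names below are made up for illustration:

```python
# Rough ratios quoted in the chat (speaker estimates, not official pricing).
O3_LOW_VS_O1_PRO = 2     # "o3 low was roughly 2x as expensive as o1 pro"
O1_PRO_VS_25_PRO = 60    # "which is 60x more expensive than 2.5 pro"


def chained_multiplier(*ratios: float) -> float:
    """Multiply a chain of relative-cost ratios into one overall ratio."""
    total = 1.0
    for ratio in ratios:
        total *= ratio
    return total


o3_low_vs_25_pro = chained_multiplier(O3_LOW_VS_O1_PRO, O1_PRO_VS_25_PRO)
print(f"o3 low vs 2.5 pro: ~{o3_low_vs_25_pro:.0f}x")  # o3 low vs 2.5 pro: ~120x
```

If either input ratio is off, the error carries straight through to the product, so the 120x figure is only as good as the two estimates behind it.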

keen beacon
#

oh yeah i think you guys will be annoyed about pricing

#

i don't have official numbers but i have talked with a few people

balmy mist
#

idc about pricing, ik what to expect

fleet lintel
#

what to expect??? tell us.

balmy mist
#

just give me model!!

#

that you cant afford the api costs lol

#

and you need to sub up

#

thats what imma do

keen beacon
#

will be SOTA on basically all benchmarks

balmy mist
#

damn

#

so nw is still king

#

😦

fleet lintel
#

cool. and how much better on benchmarks?

sage raptor
fleet lintel
#

marginal or significant?

balmy mist
#

how is it not as good in web dev tasks? thats weird, i think that has to do with system prompts and tool calling

#

you make any model better at web dev with clever prompting

keen beacon
fleet lintel
#

2.5 pro is so good .. i am excited to see what o3 has to offer

keen beacon
#

arc, yes

#

codeforces, yes

#

i also hear it "did well" on aider polyglot but no figures there

sage raptor
#

looks promising

balmy mist
#

@keen beacon he valid?

#

yupp

keen fulcrum
#

Here is a surprise for you:

balmy mist
#

today is a holiday

#

ewww

#

grok

fleet lintel
keen fulcrum
#

They should offer API subscriptions, unfortunately they don't

balmy mist
keen fulcrum
#

Where?

balmy mist
#

but i see what you mean

brittle tiger
#

@keen beacon do you think we get o3 today if 2.5 capabilities didn't surprise ppl?

keen fulcrum
#

Thats PAYG

keen beacon
balmy mist
sage raptor
keen beacon
#

goals?

balmy mist
#

lmaoo perplexity has not been the same since 2.5 pro came out tbh

keen fulcrum
keen beacon
#

taste testing every lab

#

lil bit of this, lil bit of that

sick mountain
#

what does member of technical staff mean

balmy mist
brittle tiger
# keen beacon wdym?

after gemini dropped sama tweeted "change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months." was just curious if that was in reaction. my gut says likely but curious on your opinion

keen beacon
#

ah

#

yeah that wasn't the plan to start with

#

2.5 pro served as a reminder they weren't invincible

balmy mist
#

nothing was the same since 2.5 pro

#

pre 2.5 pro and post

#

its like o1 was SSJ and o3 is beyond SSJ lol

keen beacon
#

i also hear that although there is quite a bit of overlap there is some "healthy but fierce" competition between the team focused on o3 and the one focused on o4 mini lmao

novel flame
#

Check the Mixture of a Million Experts paper from last summer, it's pretty wild: https://arxiv.org/abs/2407.04153

keen beacon
#

lmao

keen beacon
#

lots of hype posting going on

balmy mist
#

omggg its agi

#

we made it guys

#

lock in!

keen beacon
#

i think this is the first model within "striking distance" of AGI if you will

balmy mist
#

the hype is real man

#

who gonna livestream it here?

keen beacon
#

i think with the current rate of progress AGI is looking like it'll arrive by the end of next year max

balmy mist
#

im scared

#

what should I do to cope?

fleet lintel
#

i am sure that it is going to be SOTA but i am concerned about cost.

drifting thorn
keen beacon
novel flame
#

AGI won't come from (autoregressive Transformer based) LLMs though. LLMs might help accelerate the research and coding though.

fleet lintel
keen beacon
#

lol

keen beacon
balmy mist
#

wait you still have access lmaoo

sage raptor
#

livestream in o2 hours

keen beacon
#

yeah they don't deprecate the model previews until like a week after they're publicly launched

oblique flint
keen beacon
#

don't need to worry about payment just yet

#

🙄

balmy mist
#

limit doomscrolling lol

keen beacon
oblique flint
#

an llm beating elden ring is my AGI definition

keen beacon
#

just need to put things into perspective

keen beacon
#

well yeah

#

because i don't have the "-high" variants nor do i think i have o4 mini

drifting thorn
keen beacon
#

it would be my dream though 👀

#

labs are getting relentless trying to secure the best talent

drifting thorn
keen beacon
#

so much so that deepmind are paying people to sit around and do nothing because it prevents them from being poached when they might be useful later on

balmy mist
#

its crazy how much hype they are giving it, like i kinda dont know what to do

#

i wonder what google is thining

#

thinking*

drifting thorn
keen beacon
#

deepmind are the lab that i think moves the fastest

keen beacon
#

mostly down to the fact they have so much money to work with and their compute is unmatched

balmy mist
#

we need a chart for times when certain companies or models are leading, so we can see how long each company has held that title

narrow elbow
#

need another 12 days of live streaming non-stop,then google popping in, just like last time hahhaha.

balmy mist
#

OA the king of hype

keen beacon
#

which kinda works as that

keen beacon
#

he always posts just that word the day before a launch

#

if they're releasing a new gemma model he'll post "Gemma"

#

ain't no hypeposting around here kids

fleet lintel
#

google sucks at marketing

balmy mist
#

they lowkey dont have to market

#

they got so much money

plain zinc
#

Remember how they tricked everyone with Gemini 1.0 Ultra

keen beacon
#

if google invested as much into marketing as they did into research they'd be right on openai's ass

plain zinc
#

It's all because of dumb marketing.

keen beacon
#

openai are basically the only lab that actually know how to market

#

perhaps unfairly given the huge advantage they got with chatgpt's viral moment

#

but all the same

#

anthropic marketing sucks

fleet lintel
keen beacon
#

google's marketing barely exists

drifting thorn
keen beacon
#

deepseek didn't really market r1 but it was carried by the press and social media

keen beacon
keen fulcrum
#

As Google is taking AI finally mandatory, we will see Gemini models absolutely dominating the next years

keen beacon
#

yeah google's TPUs give them a big advantage

balmy mist
#

deepseek doesnt really market but they market through their tech, thats the kinda marketing you really want, its so good that it markets itself

keen beacon
#

yeah i think that's what GDM are trying to do

fleet lintel
keen beacon
#

but thus far it has been less successful

balmy mist
#

no deepmind

keen beacon
#

if i asked almost any of my irl friends if they knew what gemini 2.5 pro was they'd ask me wtf im talking about

keen beacon
#

lmao

#

yeah i know they have incentive

#

but most of the heavy lifting, at least with R1, was natural because of the model's strengths and it challenging american dominance rather than down to deepseek's efforts themselves

drifting thorn
keen beacon
#

well, as you'd expect

drifting thorn
#

They hyped Deepseek R1 and sell sessions that “make you earn from Deepseek”

#

Which is a scam

balmy mist
keen beacon
#

i wonder if R2 will cause the same absolute hype storm that R1 did

balmy mist
#

like generating hype or commotion

#

manus, deepseek etc..

drifting thorn
#

Most critics in China are overhyping R1, claiming that it is better than o3 mini

keen beacon
#

you literally couldn't use R1 via almost any API nor via deepseek's own platform half the time because every gpu assigned to it was vapourised

drifting thorn
keen beacon
#

hopefully they've scaled up enough since then to be prepared for large load with r2

balmy mist
#

r1 was trained on o1 outputs right? so that means they cant release r2 until o3 is released cause training on o3 mini is not good enough

thorny drum
#

r1 was marketed by people losing money in nvda

#

the deepseek team didnt do any marketing really

drifting thorn
balmy mist
#

thats why this moat crap is silly

drifting thorn
#

That’s why their server overloaded

keen beacon
drifting thorn
balmy mist
drifting thorn
#

R1-Zero is their internal model before public release of R1

#

As stated in their paper

balmy mist
#

ahh okay, then what was all that commotion about openai getting mad at deepseek?

#

was it just because they did it cheaper?

drifting thorn
#

I wonder when they can train new V3 based on R1, I think they will train new R1 based on new V3

#

And it will become master of hallucinations

drifting thorn
#

I mean old V3

balmy mist
#

Giving some context to a hectic week of AI news. This video won’t just be about the release, then, of GPT 4.1, in the last 48 hours, Kling 2.0, a sneak peek at the next OpenAI model, or even the new Dolphin language tool. It will be about 7 such stories that contextualise where we are in AI and what is happening.

https://www.emergentmind.com/...

keen beacon
#

ai explained is pretty good

drifting thorn
#

Waiting for any LMMs to call out video generating models to have deeper understanding to the context

#

Thx for recommendation

blazing rune
#

GPT-4.1 made this 1st try in Windsurf: https://pong-html-js-responsive.windsurf.build/
Prompt:
```
Create a web-based Pong game using HTML, CSS, and JavaScript with these features:

  • Player controls using W (up) and S (down) keys
  • Computer opponent with beatable AI
  • Score tracking for both players
  • Game over when player loses by 10 points
  • Pause functionality with spacebar
  • Restart option with R key
  • Clean, responsive design
```
keen beacon
patent aspen
#

The only problem is that big AI breakthroughs tend to be bottlenecked on top 20-40 researchers at any given org

drifting thorn
#

Actually, GLM has a great solution(extremely large amount of thinking tokens with the ability to call tools multiple times in the inference time for 1 response)

drifting thorn
#

So sad that its base model is 32B dumb

balmy mist
#

4 mini strawberries lmaoo

#

but why are the strawberries smaller and smaller

sonic tendon
#

probably a coincidence? we shall see

balmy mist
#

red you streaming this time?

#

we need you

sonic tendon
#

yea, should be

keen beacon
sonic tendon
#

ok yeah i am

#

oh, for some reason i thought he'd have used real strawberries

drifting thorn
#

No way it’s real strawberries

sonic tendon
#

i mean

#

the occasion probably warrants it

north vale
#

it would just look like a worse image

keen beacon
#

lol i asked o3 for an SNL cold open after trump's tariffs

#

pretty funny

tawdry meteor
keen beacon
#

yup

tawdry meteor
#

makes sense. excited to see the benchmarks

balmy mist
tawdry meteor
#

it's 1pm the livestream right?

balmy mist
#

imagine sama not on stream

tawdry meteor
keen beacon
balmy mist
keen beacon
#

there is no agi without twinks

balmy mist
#

is this true?

balmy mist
#

damn

#

thats wild

#

the shapes are interesting

keen beacon
#

they say o5 will be an irregular shape

keen beacon
#

truly groundbreaking

#

yeah ☠️😭

torn mantle
#

idk if this o3 model will be good

#

well i wont be able to use it anyways

balmy mist
balmy mist
tawdry meteor
#

btw pretty sure R2 won't release until May

#

I'll link the source I was reading about it

balmy mist
tawdry meteor
#

🤔 when did they update that? I guess what I was reading was two weeks old

keen beacon
tawdry meteor
#

they said they wouldn't release this month

balmy mist
keen beacon
#

we're all going schizo

balmy mist
#

lmaooo

keen beacon
keen fulcrum
keen beacon
#

you're a little late

balmy mist
#

imagine the speaker in the livestream is o3 or o4 mini

#

they still didnt give voice to o1 right?

#

guess thats hard with reasoning

keen beacon
#

it's not multimodal with output like 4o is

#

id expect to see those capabilities built into gpt-5

#

or some of them

balmy mist
#

i really wonder how the inference will be for gpt5

tawdry meteor
torn mantle
tawdry meteor
#

But maybe end of April/beginning May doesn't count in that

#

The meaning was a bit difficult to parse

keen beacon
#

they have a lot of pressure being put on them now

tawdry meteor
#

Yeah fair

keen beacon
#

they abandoned the date for R2 previously in favour of just "ASAP" when o3 mini dropped

sonic tendon
#

yeah

#

DS before may would be interesting - not sure about the odds there

#

*may

keen beacon
#

before may

#

yeah

#

i mean if it isn't april it will be the first half of may max

keen beacon
#

like if they don't launch in april

sonic tendon
#

ohhh

keen beacon
#

i highly doubt they will leave it any more than another 2 weeks

sonic tendon
#

asdfhjlkasdfhjkl i thought you were just poking fun

balmy mist
#

wait so o3 is a researcher now?

keen beacon
#

it will be more agentic if you want to put it that way

#

rumour has it there will be updates to deep research this week too

balmy mist
#

@keen beacon im starting to see that 20k number really show up more

#

are they really doing that?

keen beacon
#

it's in the roadmap but not concrete

balmy mist
#

okay time to take out a loan

keen beacon
#

aimed at enterprise of course

#

if they tried to aim that at consumers i think sama would have to make sure there are no luigis around

balmy mist
#

im starting to see where openai is going, they really are pushing this product stuff

sonic tendon
#

VC money running dry

keen beacon
#

quick a16z and sequoia

#

bankroll sama

balmy mist
#

which makes sense, there is no moat in models anymore

keen fulcrum
balmy mist
#

with a good product built around your model you do a lot

keen fulcrum
#

You can use that even for coding

keen beacon
balmy mist
keen beacon
keen beacon
#

what a troll

sonic tendon
keen beacon
#

patience chair its a meme

keen beacon
balmy mist
#

especially when the percent changes are so small

#

normies dont even notice the changes now

keen beacon
#

need me some of these chairs

#

cute car too

balmy mist
#

most of my friends cant tell the difference between gpt4 and gemini2.5

keen beacon
#

this is interesting

#

claude ever the yapper

fleet lintel
balmy mist
balmy mist
keen beacon
balmy mist
#

so its reasoning more for each output?

#

which is bad right?

keen beacon
#

it depends

balmy mist
#

if another model can get the same answer with less thinking tokens?

keen beacon
#

yeah

#

the less tokens it can spend reasoning to get the right answer the better

#

the best models will be able to most intelligently decide how much to reason
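
The efficiency idea above (same right answer, fewer reasoning tokens) can be sketched as a simple comparison. This is only an illustration: the `Run` record, the model labels, and the token counts are all made up, not measurements of any real model:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    model: str             # hypothetical model label
    correct: bool          # did the run reach the right answer?
    reasoning_tokens: int  # tokens spent thinking before answering


def most_efficient(runs: List[Run]) -> Optional[Run]:
    """Return the correct run that used the fewest reasoning tokens."""
    correct_runs = [r for r in runs if r.correct]
    if not correct_runs:
        return None
    return min(correct_runs, key=lambda r: r.reasoning_tokens)


# Made-up numbers purely for illustration.
runs = [
    Run("model-a", correct=True, reasoning_tokens=4200),
    Run("model-b", correct=True, reasoning_tokens=1800),
    Run("model-c", correct=False, reasoning_tokens=600),
]
best = most_efficient(runs)
print(best.model if best else "no correct run")  # model-b
```

Wrong answers are filtered out first, so a model that thinks very little but gets it wrong never counts as efficient.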

balmy mist
balmy mist
keen beacon
#

well it should get relatively close

#

needs to get to a point where you don't need any input minus your prompt to get the best answer

balmy mist
keen beacon
#

2.5 pro iirc

#

obviously its sota rn

keen beacon
alpine coral
keen beacon
#

i think it's o3 mini efficiency wise

balmy mist
#

wow

#

2.5 is really good

#

oh you are right

#

its second

#

and less thinking tokens

#

wait no

#

less thinking tokens

#

but more output tokens

keen beacon
#

but o3 mini high is also based on 4o mini 🤣

#

2.5 pro is the most yappy when it comes to output tokens

alpine coral
#

output/thinking tokens

#

whether thinking is abstracted away (or hidden entirely)

#

it's still completion being
<reasoning>
<answer>

rather than just
answer
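
A rough sketch of that point: if the completion is just reasoning followed by an answer, separating the two is a parsing step. The `<reasoning>` and `<answer>` tag names here are assumptions for illustration; real providers format, summarize, or hide reasoning in their own ways:

```python
import re
from typing import Tuple

# Hypothetical tag names, not any provider's actual output format.
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_completion(text: str) -> Tuple[str, str]:
    """Split a completion into (reasoning, answer).

    If no <answer> tag is present, the whole text is treated as the answer.
    """
    reasoning = REASONING_RE.search(text)
    answer = ANSWER_RE.search(text)
    reasoning_text = reasoning.group(1).strip() if reasoning else ""
    answer_text = answer.group(1).strip() if answer else text.strip()
    return reasoning_text, answer_text


sample = "<reasoning>2 + 2 is 4</reasoning><answer>4</answer>"
print(split_completion(sample))  # ('2 + 2 is 4', '4')
print(split_completion("4"))     # ('', '4')
```

Either way the model generates both parts token by token, so the reasoning still counts toward the output it has to produce.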

balmy mist
#

sama not there wtf

keen beacon
#

there it is

balmy mist
#

Join Greg Brockman, Mark Chen, Eric Mitchell, Brandon McKinzie, Wenda Zhou, Fouad Matin, Michael Bolin and Ananya Kumar as they introduce and demo the new o-series models.

keen beacon
#

AGI CANCELLED

drifting thorn
balmy mist
#

but greg is there

drifting thorn
#

Which, it’s good to be yappy, but the bad thing is that it will often hallucinate when it has a long output

balmy mist
#

why sama

#

why

keen beacon
#

jokes aside

balmy mist
#

why does this not keep you up at night!

keen beacon
#

there are a lot of people there

drifting thorn
#

And forgets all the details I said before

calm sequoia
keen beacon
#

this is the most people i've seen attending a stream since the 4o launch

balmy mist
#

sama only cares about memory

#

yeah wild

keen beacon
#

for o1 and o3 mini they only had like

#

3 people

balmy mist
#

what if sama is replaced by o4

#

and they are announcing that today

keen beacon
#

this time they have almost 3 times that

#

how are they gonna be sitting

#

is it gonna be like

#

a long table with greg as king

#

or what

balmy mist
#

what if a new room?

#

something is fishy

#

there is a disturbance in the force

#

both

keen beacon
#

why are reddit mods such damn haters

#

no comment saying why, no message

#

not a dupe

#

god

balmy mist
#

wait what does this mean?

drifting thorn
balmy mist
#

who is shrek and donkey

keen beacon
#

he posted a reddit post and it got removed

#

^

#

yes

#

check the live stream 🤣

drifting thorn
keen beacon
drifting thorn
balmy mist
#

lmaoo

keen beacon
#

and the livestream yeah

#

ikr

#

what's he up to

balmy mist
#

he is a very busy man

keen beacon
#

he hasn't been at a launch stream for like

#

3 successive launches now

#

😔

balmy mist
#

he might pop in at the end

keen beacon
#

doesnt he have a baby tho i guess his schedule is wonky

keen beacon
#

smh

keen beacon
balmy mist
#

sama's plan

#

he cooking something

keen beacon
#

idk what to expect with o4 mini tbh

tawdry meteor
#

well I'm curious to see if o3 is actually better than g2.5pro, it's just so far ahead of everything else still

keen beacon
#

this guy is part of the agents research team @ openai

#

so i think we're getting agent updates or a new agent related feature

#

(alongside the models)

drifting thorn
#

And Sam Altman will be appearing on the livestream of new agent related feature?

keen beacon
#

so no

#

my bet is just on deep research upgrades

#

since gemini deep research made them look bad

#

gotta reclaim the lead

alpine coral
#

i actually think their existing one is superior to gemini's even with 2.5

#

problem is you get 10/month

keen beacon
#

what the 🤣

tawdry meteor
#

gemini is defo worse at news research imo

tall summit
tawdry meteor
#

but I haven't tested much on deep research for studies

balmy mist
#

its crazy this all started because sama and elon wanted to stop google from getting agi lol

tawdry meteor
#

and now they're suing each other and are gonna let google get agi lmao

balmy mist
#

i wonder what ilya is up to with SSI

novel flame
balmy mist
#

i wonder if they could team back up in the future

#

grok and gpt

#

vs gemini

#

ai wars

#

what i want is an arena battlefield for these models using agents powered by their models in a fight

narrow elbow
#

where is microsoft?

balmy mist
#

jk lmaoo

#

they mia bro, they updated copilot

narrow elbow
#

hahha

balmy mist
#

thats about it

keen beacon
#

i dont think ms is trying to compete at all in the frontier space as themselves

balmy mist
#

who want to make this arena battle field with me?

novel flame
# balmy mist i wonder what ilya is up to with SSI

Either going down novel but ultimately wrong paths to AGI, or more of the same (autoregressive Transformer LLMs with more compute). I would love it if Ilya actually created AGI but so far all we know for sure is he’s really good at building yappy chatbots.

balmy mist
#

it can be a minecraft thing where each model gotta build their own castle and base and then attack the other ones with strategies etc...

keen beacon
#

that would be cool ngl

balmy mist
keen beacon
#

the run-down:

Greg Brockman - President, Co-founder
Mark Chen - Chief Research Officer
Eric Mitchell - O-series Research, Deep Research Core Contributor
Brandon McKinzie - Research/Member of Technical Staff
Wenda Zhou - Research/Member of Technical Staff, o1 Contributor
Fouad Matin - Agent & Systems Research
Michael Bolin - Research/Member of Technical Staff
Ananya Kumar - Research Lead, Core Contributor for o1 and GPT-4.5

balmy mist
#

there is a rumor that he saw something

novel flame
balmy mist
#

like that could be the new benchmarks

alpine coral
ocean plume
#

anyone test code claude 3.7 thinking vs 2.5 gemini pro and o3 high ?

balmy mist
ocean plume
#

which gives better code and fewer bugs

sonic tendon
#

who's ready for DeepSeek to overtake them in 12 days

keen beacon
#

they shouldnt have released 4.5 tbh

ocean plume
sonic tendon
#

at least in lmarena, they did last time lol

torn mantle
#

memememe

sonic tendon
#

for o1, i believe it did

torn mantle
#

they did on o1

keen beacon
#

i mean openai had o3 already at the time

#

if openai didn't stall every release they'd be ahead of other labs most of the time

#

but they sit on their best stuff a lot

balmy mist
keen beacon
#

parallel teams

sonic tendon
keen beacon
#

gpt-5 already firmly underway, o5 in early stages

balmy mist
#

wow

keen beacon
#

now that the o3 and o4 mini teams have wrapped up they're being moved to mostly o4 and o5

balmy mist
#

why dont you go work for them?

sonic tendon
#

isn't o4-mini just a direct distill of o4?

keen beacon
#

o4 is not ready enough to do that i think

keen beacon
#

the o3 we're getting today is basically a whole different model to the one they announced

balmy mist
#

what is it then?

alpine coral
novel flame
# balmy mist you think ilya is on the wrong path?

I think Ilya spent a lot of time at OpenAI chasing bigger Transformers convinced that the bitter lesson / scaling laws would mean that GPT-5 would become AGI by virtue of its size alone. And I think he missed a lot of opportunities (R1, test time compute) that other labs spotted, and that a lot of the big ideas he used came from DeepMind papers. I think Ilya is brilliant, but he has chased Transformers for so long I’m not convinced he can let them go.

keen beacon
#

that is what i have heard from inside

#

they kept delaying it, then the name was changed to 4.5 and they decided to just "bite the bullet" because they didn't want to waste all the resources they had put into it and not even let it see the light of day

#

ilya leaving made things quite a bit worse

#

damaged morale as well

keen beacon
balmy mist
keen beacon
#

yea i think

balmy mist
#

yeah i thought so

keen beacon
#

yup

sonic tendon
sonic tendon
balmy mist
#

i think we can give him grace to cook, there is no way he not gonna produce something good

keen beacon
balmy mist
#

if deepseek can do what they did, i know ilya gonna come with heat

sonic tendon
#

gpt4.5 sort of makes me doubt that scaling laws will actually hold up, no?

keen beacon
#

well of course lmao

#

oh wait

#

no

#

misread

#

it did not cost close to a billion no

#

it was still one of the most expensive training runs ever though, perhaps the most

balmy mist
thorny drum
#

more expensive than that grok one?

calm sequoia
#

Leo, why do you have so much info? This can't be open source material?

keen beacon
#

although it's close iirc

#

grok 4 will probably beat it

keen beacon
#

~900B

#

3T+

#

lmao

torn mantle
#

he know some openai staff

balmy mist
#

how many parameters is o1? and o3?

torn mantle
#

some of his friends

keen beacon
#

same as 4o lol

torn mantle
#

too

keen beacon
#

o1 is quite small

#

200b

#

yeah

#

200b

balmy mist
#

wow

keen beacon
#

beat me to it

balmy mist
#

see

#

thats why the strawberries are smaller

#

sama's plan

oblique flint
#

if o1 is so small, why is it so hella expensive?

keen beacon
#

margins

torn mantle
#

profit margin

#

they want to earn a lot

keen beacon
#

but yes they also have margins

torn mantle
#

but then

#

r1

oblique flint
torn mantle
#

messed things up for them

calm sequoia
balmy mist
#

i predict o4 will be 50 parameters

torn mantle
#

50 what

balmy mist
#

sama told me in a dream

#

50 b

sonic tendon
#

50 bdozen

alpine coral
#

lmao

balmy mist
#

sama told me to trust the process

keen beacon
keen beacon
keen beacon
#

claude 3.5 opus/bigger model wasnt used in the dev of sonnet 3.5 according to dario

#

sonnet 3.5 was truly a generational training run

#

i wonder what anthropic researchers were thinking when they properly put it through its paces and it was cooking that hard

#

i dont think they expected it to be that good

#

it just lined up

balmy mist
#

their ceo used to work at google right?

keen beacon
#

all of anthropic's founding team were ex-deepmind iirc

balmy mist
#

lol

keen beacon
#

sorry no

#

dario was ex oai

balmy mist
#

its like everyone went against google, then went against oa

keen beacon
#

it was a mish mash of ex deepmind and ex oai

balmy mist
#

ilya was google at first right?

keen beacon
#

google brain yeah

alpine coral
balmy mist
#

omg 19 mins

alpine coral
#

part of deepmind's genius / plan was setup in London so as to not get all their talent lured away by silicon valley prospects

balmy mist
#

i wanted to work at deepmind so bad when I was in college 😦

fleet lintel
alpine coral
ember rapids
#

excited for o4 mini high

alpine coral
#

he even mentions 'ultra' in passing ha

balmy mist
#

@keen beacon you ready?

keen beacon
#

yup 😄

balmy mist
#

im not

#

lol jk

keen beacon
#

smh i have to go

#

this always happens around oai launches

balmy mist
#

noo

ember rapids
#

gregs presenting so u know its gonna be special

keen beacon
#

hopefully will be back in time

#

cya

balmy mist
#

leo's plan

keen beacon
#

my power went out during o1 preview's launch and o3 mini i think lol hopefully the trend doesnt continue

balmy mist
#

lmaoo

#

my youtube is not working on chrome so i gotta use brave

#

google dont want me to see it

sonic tendon
#

i'll get on vc in 5

plain zinc
#

Feel the AGI

#

Guys

#

MMM

balmy mist
#

feelsssss

#

ahhhhhh

plain zinc
#

It's perfect

balmy mist
#

lmaoooo

fleet lintel
balmy mist
#

yuppp

plain zinc
#

FEEL

ember rapids
#

feeling the agi

plain zinc
#

Feel his scent

balmy mist
#

remember sama said he felt the agi with 4.5

ember rapids
#

Hearing plus users get 1 o3 request per week

plain zinc
#

Lol (that's how I imagine Sam Altman)

#

AGI

#

AGI is here!

balmy mist
#

who paying $200?

fleet lintel
#

we should start protest for "free the AGI" like free the nipple movement

balmy mist
#

i am

#

i have to

#

its like concert tickets now

#

damn

#

i stopped paying after 2 months

plain zinc
#

6 minutes

balmy mist
#

but ill do it now

fleet lintel
#

lol

balmy mist
#

feel it!!

tawdry meteor
#

how long after the announcement do we think they'll release o3 on arena so we can start ranking it against g2.5p? I want to run coding tasks across 2.5 and o3/o4mini and have them fix each other like with sonnet

balmy mist
#

yo im floating

#

omgg

plain zinc
keen beacon
#

aand i'm back :3

balmy mist
fleet lintel
#

how many millions rich?

keen beacon
#

i'm actually just greg's alter ego

plain zinc
#

Okay, guys. Personally, I'll check on the Google models in LMarena.

balmy mist
#

damn bro, so you getting the 20k plan?

plain zinc
#

Wish me luck (I hope I find gold)

balmy mist
#

can i share with you my prompts?

keen beacon
#

o3 and o4 mini system cards are on the cdn now

#

not sharing the link though 😉

balmy mist
#

i dont see it lol

#

yooo

fleet lintel
#

what is the livestream link?

balmy mist
#

but red is streamin it i think

fleet lintel
#

i am lubricating

balmy mist
#

i better not get blue balls i swear

keen beacon
#

oh yeah

#

i still don't know if they're demo-ing at the end of this one

#

so that'll be interesting to see

#

decent chance that will be one last livestream this week

balmy mist
#

and we made it

#

ahhhhh

keen beacon
#

sorry i forgot to say

balmy mist
#

greg!!!

sonic tendon
#

in research lounge

keen beacon
#

"demoing o4"

#

where's the rest of 'em

balmy mist
#

ahhhhh

keen beacon
#

yeah okay there's gonna be another room

#

with

fleet lintel
#

no twink?

keen beacon
#

the agent team

keen beacon
#

systems 😉

#

he's an ai researcher what did you expect

#

collective slightly awkward laughs

balmy mist
#

its okay @sonic tendon , imma just watch on my laptop, cant miss this lmaoo

keen beacon
#

lol openai's website is breaking

balmy mist
#

yeah o3 is a beast

sonic tendon
#

will prob stop streaming if you guys don't need me to

balmy mist
#

lol

keen beacon
#

just checked benchmarks vs december o3

#

some are better some are worse

#

it did worse on swe bench

#

by 3 points

#

hmph

sonic tendon
#

hmph

balmy mist
#

o3 with tools is a beast

fleet lintel
#

i want to understand the benchmark against other models... otherwise it's hard for me to understand the quality

thorny drum
#

im very curious the pricing of these models

sonic tendon
keen beacon
#

lmao aime is basically finished now

#

yes but not launching today i don't think

#

we shall see

torn mantle
#

thats crazy

#

they saturated the benchmark

fleet lintel
torn mantle
#

these tools calling are kinda neat

sonic tendon
#

peak ui design right there

keen beacon
#

It sorta reminds me of what they demod in qwq max

#

The tool calling

#

holy moly openai's website is so slow right now

#

either it times out or i get 500s

#

help 😭

balmy mist
#

lol

#

we gotta see vibes first

keen beacon
#

sk sk sk sk sk sk cooked

sonic tendon
keen beacon
#

it does lol

sonic tendon
#

hey, i like it!

torn mantle
#

ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high. ChatGPT Enterprise and Edu users will gain access in one week. Free users can try o4-mini by selecting 'Think' in the composer before submitting their query. Rate limits across all plans remain unchanged from the prior set of models.

#

tf

#

are

#

you

#

on

#

????????????????????????

sonic tendon
#

thinking back to when leo said that o1 was 200b params

#

lmao

#

upcharge

torn mantle
#

these updates looks pretty decent

fleet lintel
#

why this feels whatever?

sonic tendon
#

i wonder if the thinking dialogue is mostly based on reinforcement learning now

balmy mist
#

so do we get o3 pro?

#

lol

keen beacon
sonic tendon
ember rapids
#

o3 full is cheaper than o1 i think

keen beacon
#

🙄

torn mantle
#

hes holding it so hard

#

quite happy guy he seems

sonic tendon
torn mantle
#

xd

balmy mist
#

hmmm

sonic tendon
#

"maximize the reasoning capabilities"
i wonder what that could mean in this context

fleet lintel
#

4x more expensive compared to 2.5 pro 😦

sonic tendon
#

local file access maybe???

keen beacon
#

brute force

#

BOOOOOOOOO

sonic tendon
keen beacon
#

i need to test this thing at geoguessr

thorny drum
#

4x is very good no?

keen beacon
# sonic tendon ?

he's talking about the model using the python tool to brute force an answer

sonic tendon
thorny drum
#

the december version seemed to be several hundred times

keen beacon
#

hey i was part of this 👀

sonic tendon
calm sequoia
keen beacon
#

interesting, o4 mini did best on openai's interview choice Qs

#

lmfao their own interview coding tasks are saturated now

calm sequoia
leaden meteor
#

So, openai doesn't care about arena leaderboard now? I don't see any updates today ...

keen beacon
keen beacon
sonic tendon
#

openai PRs? what context is this in

calm sequoia
#

In this case the o4-mini may have higher arena benchmark than the o3

sonic tendon
#

Measuring if and when models can automate the job of an OpenAI research engineer is a key goal of self-improvement evaluation work. We test models on their ability to replicate pull request contributions by OpenAI employees, which measures our progress towards this capability.
We source tasks directly from internal OpenAI pull requests. A single evaluation sample is based on an agentic rollout. In each rollout:
1. An agent’s code environment is checked out to a pre-PR branch of an OpenAI repository and given a prompt describing the required changes.
2. The agent, using command-line tools and Python, modifies files within the codebase.
3. The modifications are graded by a hidden unit test upon completion.
If all task-specific tests pass, the rollout is considered a success. The prompts, unit tests, and hints are human-written.
The o3 launch candidate has the highest score on this evaluation at 44%, with o4-mini close behind at 39%. We suspect o3-mini’s low performance is due to poor instruction following and confusion about specifying tools in the correct format; o3 and o4-mini both have improved instruction following and tool use. We do not run this evaluation with browsing due to security considerations about our internal codebase leaking onto the internet. The comparison scores above for prior models (i.e., OpenAI o1 and GPT-4o) are pulled from our prior system cards and are for reference only. For o3-mini and later models, an infrastructure change was made to fix incorrect grading on a minority of the dataset. We estimate this did not significantly affect previous models (they may obtain a 1-5pp uplift).

calm sequoia
#

If o4-mini is so good, how slick is o4???

keen beacon
#

"we put in more than 10x the training compute for o1 into o3"

balmy mist
#

o4???

keen beacon
#

wtf

#

special surprise

#

agent related

#

here we go

#

yeah i presume so

#

yea

balmy mist
#

yupp claude code gg

keen beacon
#

anthropic gotta up their damn game

#

i wonder how much better this is in things like cursor and windsurf

balmy mist
#

depends on cost

#

the only thing holding back claude code is cost

#

wow

#

dope

keen beacon
#

lmao

#

what cuties

fleet lintel
#

nice demo!

keen beacon
#

"we used codex to build codex"

#

lol

#

woah

#

cool

balmy mist
#

anybody got codex link?

keen beacon
#

not up yet

#

will be in the next few weeks

balmy mist
#

i want pro now

tall summit
#

gj

keen beacon
#

i believe it's in the api now

#

chatgpt it is rolling out

#

then it's probably just a gradual rollout

#

higher tiers first

#

4.1

balmy mist
keen beacon
#

they dont train it in

#

so its a hallucination

#

they either prompt it in the sys prompt or trained it in

#

waiting for karpathy's review :3

tall summit
keen beacon
#

they dont want u to use it 🤣

#

trying to make it as inconvenient as possible

calm sequoia
#

Can this thing be evaluated on arena? As I remember, tool usage is blocked

keen beacon
#

yea

#

i guess if u use it a lot

keen beacon
#

dk how they will implement the tool usage stuff but it shouldn't take much additional effort

#

let me find somethin

#

can u ask o4 mini this: who won the 2024 london mayoral elections and by what margin specifically? if u dont mind @deep adder

#

try "Let a < b < c be distinct natural numbers. Must every block of c consecutive natural numbers contain three distinct numbers whose product is a multiple of abc?"

#

oh

#

yea

#

cant u disable search?

#

go to your personalisation settings

#

hmm

#

numbers are wrong

#

its 1m votes for sadiq but the exact number is wrong + susan hall

#

expected i guess

#

i couldnt probe 4.1 mini for the correct numbers

#

is that with my prompt

#

yeah it's hard

#

my private models failed it

tall summit
#

LOL

keen beacon
#

ooh

torn mantle
tall summit
#

its oxygen now

ember rapids
#

Googles turn