#general
wait so we get unlimited gens with veo in studio right?
kinda rigged to bring in claude 3.7 given the list is mostly mini models
idk if o3 will beat it tbh. o3 might be smarter but most people might not like waiting long for an answer
yeah my source says today looks increasingly unlikely
Likely to release or not release? And is this about o3 or o4-mini?
likely will not release today
bar a big surprise
it's about both
well
turns out they managed it
omggggg
omggg
ahhhh
i just jumped out my work meeting
when i saw the tweeet
i love open ai
i was going to be so sad today
Please refrain from busting all over the LMArena community discord chat
ill try bro
might not be able to if it actually beats 2.5 pro in everything
they always sneak a hint for what they're announcing in the tweet announcing the stream
either today (as they have never not launched them at the same time) or if they're feeling quirky tomorrow
i think they just said o3 because it's easier to make it a pun or whatever
how else could they sneak in what they're announcing
o4 min[i]-us 1 hours
i am also told the benchmarks for o3 given in the last preview in december are now out of date
the model has improved since then
lmao
wow sama said they managed to make o3 way better which sounds like a reach, "improved since then" seems more reasonable
anyone have those benchmarks handy?
i wanna set expectations
wait @keen beacon you are talking about jan benchmarks?
i thought they just teased o3 in december
and gave the actual benchmarks in jan or feb
huh
cant remember lol
no it was in december i swear
because it was part of a stream in the oai christmas run
if they improved on this then wow
and it got a high score on arc right? like it passed
so you are saying its better than this?
i cant believe that lmaoo, they might as well call it o3.1
i'm not sure about arc agi performance but i know it has improved performance on the other benchmarks
mainly down to going from 4o as base to 4.1
that makes sense
we were able to really improve on what we previewed for o3 in many ways; i think people will be happy...
do you think december o3 is better than 2.5 pro?
what did gemini 2.5 pro get on arc and that math benchmark?
frontier math
perhaps just about
and not in everything
those are the only two benchmarks i care about and simplebench:
frontier math, arc 1&2, Simple bench
we need a graph that combines them
they haven't run frontiermath on 2.5 pro iirc
this team consists of ai agents
something about rate limits
but 2.5 pro is very very good at maths
i'd say better than o1 or o3 mini or any other model currently available for that matter
that is odd
i cant wait to pay $200 again for SOTA
2.5 pro is so much better than any other model rn at math
wonder how it stacks against o3
i would expect 2+ point gains on most benchmarks vs december o3
that good enough for me
i mean idrc about the benchmarks its just about the cost
codeforces will be very interesting
december o3 low is like $200 per task on arc agi
i think it'll reach 3000+ with the updated base
because 4.1 is miles better than 4o at code tasks
same for swe bench
cause decem o3 prob was better than 2.5 pro, thats why oa is releasing it now instead of skipping them like they said they would
the model they release very well could be weaker than december o3
that competition coding right?
impossible in terms of benchmarks
we have the records lol
unless it cheaper to run, like way cheaper
then that is a win
i guess o3 low is only 2x the cost of o1 pro
Just tried ChatGLM Z1 Rumination
yup
Its thought was good, but the base model itself is trash
it won't be
i can assure you
🤷♂️
But I think those big techs should have a reference
why would they release it if its worse than december lmaoo
they just announced they got the 40 bill funding
Btw the “rumination” means extended “thinking” time and the ability to call tools multiple times while in its chain of thought
if it doesn't beat 2.5 pro in most benchmarks, they won't release it
i heard (again from a source) that they delayed the model after 2.5 pro by a couple of weeks
because they wanted to make SURE it didn't make them look like they're behind
i think the higher likelihood of a reaction
is from deepseek
with R2
which would be very cool
this needs to be a netflix doc
oh yeah
i would love that
imagine they release it during the livestream
lmaoo
i still think the most likely course of events is that R2 releases next week
2.5 flash, updated 2.5 pro tomorrow or friday if i had to bet
so they can steal iq
google and openai have a long running tradition of trying to steal each other's thunder
it used to be openai and anthropic but now anthropic is too far behind
yeah they got that aws partnership so they might be gucci with that
it seems like its apple and oa vs google v anthropic and amazon, is that correct? not sure where meta fits in but meta vs deepseek(while also being against the other players?) lol
Nightwhisper tmrw?
doubt
not sure if microsoft still is buddy buddy with oa
Interesting how OAI chose to release today instead of Thursday
they're trying to make themselves more independent these days
they're failing quite hard mind you
nw tmw would be goated by google, but i dont think nw is better than o3 or o4 mini imo
but 
@keen beacon what if nw was better? you think they release it?
cause i was gonna be sad, @keen beacon pulled in a favor for me 🙂
google are less willing to sit on things than oai are
if they do have a better model and it's almost ready, they'll push to have it out by end of next week max
if they don't expect to be waiting a couple weeks or more
bet, these are exciting times man, im just happy google stepped up their game
it was looking scary a few years ago lol
i would expect o3 to outperform gem pro 2.5 on most benchmarks; question will be though, by how much and what cost?
like if it's marginally better but twice as expensive, gemini would still be ahead imo
but if it blow gem 2.5 out of the water, then yeah who cares (for now anyway) what it costs ha
did they even launch o3 on lmarena?? probably not
yeah gemini 2.5 pro advantage is that it is free and cheap and SOTA
its gonna be way more than 2x as expensive
they are most likely still going to be SOTA bc of that, but in terms of intelligence i bet on OA, i think thats their goal with these models today
well, kinda.. if the discord server counts ha
(or perhaps it's o4-mini) private model
like waaay more
yeah i would expect it'll be more
idc about price
im here for the most capable model and thats why we like OA
we wanna see reasoning breakthroughs
nope
o3 low was roughly 2x as expensive as o1 pro which is 60x more expensive than 2.5 pro
they don't ever test o-series models on the arena
i have access because i help with security and jailbreak testing
2x expensive but significantly better is fine. but 5x expensive and marginally better is not good
that's through an oai-controlled platform so
1k / 1M output tokens incoming
i suspected it was red teaming ha
thats why you buy the subscription
give something juicy to us
so a baseline is 120x more expensive than 2.5 pro (on low reasoning effort)
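quick sketch of the multiplier math above, for anyone who wants to plug in their own numbers. the 2x and 60x ratios are just the rough figures quoted in chat, not official pricing, and the constant and function names here are made up for illustration:

```python
# Rough relative-cost arithmetic using the ratios quoted in chat.
# These are ballpark claims from the conversation, NOT official pricing.
O3_LOW_VS_O1_PRO = 2    # "o3 low was roughly 2x as expensive as o1 pro"
O1_PRO_VS_25_PRO = 60   # "o1 pro ... is 60x more expensive than 2.5 pro"


def chained_ratio(*ratios):
    """Multiply a chain of pairwise cost ratios into one overall ratio."""
    total = 1
    for ratio in ratios:
        total *= ratio
    return total


# o3 low relative to 2.5 pro, going through o1 pro as the middle step.
o3_low_vs_25_pro = chained_ratio(O3_LOW_VS_O1_PRO, O1_PRO_VS_25_PRO)
print(o3_low_vs_25_pro)  # 120
```

if either quoted ratio is off, the combined number moves proportionally, so treat 120x as a ballpark and nothing more.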
i dont think 2x is even in the ballpark
or 5x
oh yeah i think you guys will be annoyed about pricing
i don't have official numbers but i have talked with a few people
idc about pricing, ik what to expect
what to expect??? tell us.
just give me model!!
that you cant afford the api costs lol
and you need to sub up
thats what imma do
better than 2.5 pro in basically every regard but web development, however quite a lot more expensive
will be SOTA on basically all benchmarks
cool. and how much better on benchmarks?
in other coding tasks is SOTA ?
marginal or significant?
how is it not as good in web dev tasks? thats weird, i think that has to do with system prompts and tool calling
you make any model better at web dev with clever prompting
it won't be tiny amounts better, fairly significant. but still in striking distance for deepmind
2.5 pro is so good .. i am excited to see what o3 has to offer
swe bench, yes
arc, yes
codeforces, yes
i also hear it "did well" on aider polyglot but no figures there
looks promising
who is noam?
https://x.com/legit_api/status/1912516320966430928
@keen beacon he valid?
yupp
Here is a surprise for you:
hard pass
They should offer API subscriptions, unfortunately they don't
hmmm, that kind of defeats the purpose a lil tho
Where?
but i see what you mean
@keen beacon do you think we get o3 today if 2.5 capabilities didn't surprise ppl?
Thats PAYG
he's a big contributor in reasoning model research @ OAI
wow, im excited af
wdym?
goals?
lmaoo perplexity has not been the same since 2.5 pro came out tbh
So an AI engineer hopping around
what does member of technical staff mean
lmaoo smart man
after gemini dropped sama tweeted "change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months." was just curious if that was in reaction. my gut says likely but curious on your opinion
ah
yeah that wasn't the plan to start with
2.5 pro served as a reminder they weren't invincible
nothing was the same since 2.5 pro
pre 2.5 pro and post
its like o1 was SSJ and o3 is beyond SSJ lol
i also hear that although there is quite a bit of overlap there is some "healthy but fierce" competition between the team focused on o3 and the one focused on o4 mini lmao
Check the Mixture of a Million Experts paper from last summer, it's pretty wild: https://arxiv.org/abs/2407.04153
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of th...
lmao
i love it
https://x.com/kalinowski007/status/1912506846524633238 member of technical staff @ oai
lots of hype posting going on
i think this is the first model within "striking distance" of AGI if you will
i think with the current rate of progress AGI is looking like it'll arrive by the end of next year max
i am sure that it is going to be SOTA but i am concerned about cost.
Back when APIs for LLMs first became available, I remember it cost several dollars per thousand tokens
ask chatgpt 😉
AGI won't come from (autoregressive Transformer based) LLMs though. LLMs might help accelerate the research and coding though.
hopefully not till 2030. companies wont think twice before firing us
lol
wait you still have access lmaoo
livestream in o2 hours
yeah they don't deprecate the model previews until like a week after they're publicly launched
well, I would suggest checking out claudeplayspokemon and gemini plays pokemon lol. It's not AGI if it can't play a children's game
limit doomscrolling lol
a matter of months ago the best llm couldn't get past even the first part
an llm beating elden ring is my AGI definition
just need to put things into perspective
damn so are you even excited?
Is that u?
wow
lol no it's a linkedin profile i found scrolling through suggested connections
it would be my dream though 👀
labs are getting relentless trying to secure the best talent
Ok it verified my guess on “two teams competing in OpenAI”
so much so that deepmind are paying people to sit around and do nothing because it prevents them from being poached when they might be useful later on
its crazy how much hype they are giving it, like i kinda dont know what to do
i wonder what google is thining
thinking*
I’ve seen this this afternoon and I can hardly get what it means
i think they're probably still quite confident they can retake the lead in the next month or two
deepmind are the lab that i think moves the fastest
Week **
mostly down to the fact they have so much money to work with and their compute is unmatched
we need a chart for times when certain companies or models are leading, so we can see how long each company has held that title
need another 12 days of live streaming non-stop,then google popping in, just like last time hahhaha.
i agree
OA the king of hype
iirc this arena has a video of how the leaderboard has progressed since it was created
which kinda works as that
oh wow, nice
best way to know if google are about to drop is if logan posts "Gemini" on twitter
he always posts just that word the day before a launch
if they're releasing a new gemma model he'll post "Gemma"
ain't no hypeposting around here kids
google sucks at marketing
It's much better than if they relied more on marketing.
Remember how they tricked everyone with Gemini 1.0 Ultra
if google invested as much into marketing as they did into research they'd be right on openai's ass
It's all because of dumb marketing.
openai are basically the only lab that actually know how to market
perhaps unfairly given the huge advantage they got with chatgpt's viral moment
but all the same
anthropic marketing sucks
dont remind me of 1.0 models. they were horrendous
google's marketing barely exists
LLM playing Minecraft is my definition
deepmind didn't really market r1 but it was carried by the press and social media
don't slander my boy 1.0 ultra... he was good at creativity. yes that was about it but 🥹
As Google is taking AI finally mandatory, we will see Gemini models absolutely dominating the next years
yeah google's TPUs give them a big advantage
deepseek doesnt really market but they market through their tech, thats the kinda marketing you really want, its so good that it markets itself
yeah i think that's what GDM are trying to do
you mean deepseek. dont be too sure about it. china has huge power over media houses and they have incentive to hype it up
but thus far it has been less successful
no deepmind
^
if i asked almost any of my irl friends if they knew what gemini 2.5 pro was they'd ask me wtf im talking about
yeah my bad
lmao
yeah i know they have incentive
but most of the heavy lifting, at least with R1, was natural because of the model's strengths and it challenging american dominance rather than down to deepseek's efforts themselves
True, media houses in China hype Chinese AI way more than others
well, as you'd expect
They hyped Deepseek R1 and sell sessions that “make you earn from Deepseek”
Which is a scam
china seems really good at marketing
i wonder if R2 will cause the same absolute hype storm that R1 did
Most critics in China are overhyping R1, claiming that it is better than o3 mini
you literally couldn't use R1 via almost any API nor via deepseek's own platform half the time because every gpu assigned to it was vapourised
wild
Ofc
hopefully they've scaled up enough since then to be prepared for large load with r2
r1 was trained on o1 outputs right? so that means they cant release r2 until o3 is released cause training on o3 mini is not good enough
r1 was marketed by people losing money in nvda
the deepseek team didnt do any marketing really
They were banned from buying Nvidia GPUs
thats why this moat crap is silly
That’s why their server overloaded
they stockpiled tf out of them before the ban went into force though as the chinese do with most things
Nah it was trained by R1-zero’s output
r1 was trained on R1-zero output or you are saying r2 is?
R1-Zero is their internal model before public release of R1
As stated in their paper
ahh okay, then what was all that commotion about openai getting mad at deepseek?
was it just because they did it cheaper?
I wonder when they can train new V3 based on R1, I think they will train new R1 based on new V3
And it will become master of hallucinations
They didn’t state how V3 was trained…
I mean old V3
love this channel:
https://www.youtube.com/watch?v=yPxavsb2rgk&ab_channel=AIExplained
Giving some context to a hectic week of AI news. This video won’t just be about the release, then, of GPT 4.1, in the last 48 hours, Kling 2.0, a sneak-peak at the next OpenAI model, or even the new Dolphin language tool. It will be about 7 such stories that contextualise where we are in AI and what is happening.
ai explained is pretty good
Waiting for any LMMs to call out video generating models to have deeper understanding to the context
Thx for recommendation
GPT-4.1 made this 1st try in Windsurf: https://pong-html-js-responsive.windsurf.build/
Prompt: ```
Create a web-based Pong game using HTML, CSS, and JavaScript with these features:
- Player controls using W (up) and S (down) keys
- Computer opponent with beatable AI
- Score tracking for both players
- Game over when player loses by 10 points
- Pause functionality with spacebar
- Restart option with R key
- Clean, responsive design
```
https://x.com/ananyaku/status/1912523195175039398 from oai researcher - confirmed we're also getting o4 mini today. cc @sonic tendon
The only problem is that big AI breakthroughs tend to be bottlenecked on top 20-40 researchers at any given org
Actually, GLM has a great solution(extremely large amount of thinking tokens with the ability to call tools multiple times in the inference time for 1 response)
thx
So sad that its base model is 32B dumb
i love that
4 mini strawberries lmaoo
but why are the strawberries smaller and smaller
probably a coincidence? we shall see
yea, should be
just the 4o image gen being silly probably
No way it’s real strawberries
it would just look like a worse image
oh so they just come onto the arena after being cleared on safety and fully released?
yup
makes sense. excited to see the benchmarks
put into kling and lets start cooking
it's 1pm the livestream right?
the only real timezone /s
agi cancelled
@keen beacon https://x.com/TheXeophon/status/1912523048411951223
there is no agi without twinks
is this true?
can confirm
they say o5 will be an irregular shape
hes trolling
xd
maybe this means o4 mini will be faster than o3 mini? since it has a smaller strawberry?
bro o3 will change our lives
btw pretty sure R2 won't release until May
I'll link the source I was reading about it
that was the original plan but ppl said they updated that
🤔 when did they update that? I guess what I was reading was two weeks old
it's satire i wouldn't read into it too much
they said they wouldn't release this month
or sama's plan?
we're all going schizo
lmaooo
iirc i thought the expectation was R2 by end of April?
you're a little late
imagine the speaker in the livestream is o3 or o4 mini
they still didnt give voice to o1 right?
guess thats hard with reasoning
it's not multimodal with output like 4o is
id expect to see those capabilities built into gpt-5
or some of them
i really wonder how the inference will be for gpt5
South China Morning Post article from April 6 referenced a message from a business manager at Deepseek to clients that R2 would not be coming out March/April
it did? xd
But maybe end of April/beginning May doesn't count in that
The meaning was a bit difficult to parse
i kinda feel like their deadlines are fluid asf
they have a lot of pressure being put on them now
Yeah fair
they abandoned the date for R2 previously in favour of just "ASAP" when o3 mini dropped
whar?
like if they don't launch in april
ohhh
i highly doubt they will leave it any more than another 2 weeks
asdfhjlkasdfhjkl i thought you were just poking fun
wait so o3 is a researcher now?
it will be more agentic if you want to put it that way
rumour has it there will be updates to deep research this week too
@keen beacon im starting to see that 20k number really show up more
are they really doing that?
it's in the roadmap but not concrete
okay time to take out a loan
aimed at enterprise of course
if they tried to aim that at consumers i think sama would have to make sure there are no luigis around
im starting to see where openai is going, they really are pushing this product stuff
VC money running dry
which makes sense, there is no moat in models anymore
https://www.warp.dev/pricing
this is honestly the best AI sub you can get
with a good product built around your model you do a lot
You can use that even for coding
yoo they better not blue ball me again i swear
https://x.com/i/lists/1676646159539130369 just take a look at this feed of oai employee twitter accounts
nah the models are everything
what a troll
what does that even mean
patience chair its a meme
the patience chair silly
but the models can be copied so it a given that everyone will have a good model
especially when the percent changes are so small
normies dont even notice the changes now
most of my friends cant tell the difference between gpt4 and gemini2.5
other frontier labs can copy each other, but if you arent a frontier lab and youre building products off of them i think ur gonna fail hard eventually
this is interesting
claude ever the yapper
Best quality is almost infinite Gemini 2.5 Pro usage for Free.
People dont know but Google launched Gemini code assist and giving 2.5 pro for coding for free . I dont know why Google is doing it ?
you have a point bc before sama said to build with the fact that models will get better overtime, but now openai is a product company and so are other frontier labs, where they are building tools around their own models and integrating it with stuff, so im not sure anymore
hmm explain this to me please, im kinda slow
they have to keep themselves afloat especially right now with how competitive things are
it depends
if another model can get the same answer with less thinking tokens?
yeah
the less tokens it can spend reasoning to get the right answer the better
the best models will be able to most intelligently decide how much to reason
why did they color code it like that its so hard to see, but the thinking tokens are on the bottom right?
hm?
pretty much what gpt5 should be able to do flawlessly?
well it should get relatively close
needs to get to a point where you don't need any input minus your prompt to get the best answer
for this chart who scores the best?
here's the reference
most of those tokens on the 3.7-thinking are often wasted.. its 'reasoning' yields the same and sometimes inferior results to just what the 3.7 vanilla model produces
i think it's o3 mini efficiency wise
wow
2.5 is really good
oh you are right
its second
and less thinking tokens
wait no
less thinking tokens
but more output tokens
but o3 mini high is also based on 4o mini 🤣
2.5 pro is the most yappy when it comes to output tokens
they're the same functionally
output/thinking tokens
whether thinking is abstracted away (or hidden entirely)
it's still completion being
<reasoning>
<answer>
rather than just
answer
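rough sketch of that point: if a completion is just reasoning followed by an answer, both parts count toward what the model generates. the `<reasoning>`/`<answer>` tag layout and the whitespace word count standing in for a real tokenizer are assumptions for illustration, not any provider's actual format or billing:

```python
import re


def split_completion(text):
    """Split a completion into (reasoning, answer).

    The <reasoning>/<answer> tags are a hypothetical layout used only
    for this sketch; real providers may hide or format reasoning differently.
    """
    match = re.search(
        r"<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>", text, re.S
    )
    if match:
        return match.group(1).strip(), match.group(2).strip()
    # No reasoning section: treat the whole completion as the answer.
    return "", text.strip()


def rough_tokens(s):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(s.split())


def completion_tokens(text):
    """Count reasoning and answer 'tokens' and their combined total."""
    reasoning, answer = split_completion(text)
    r, a = rough_tokens(reasoning), rough_tokens(answer)
    return {"reasoning": r, "answer": a, "total": r + a}


print(completion_tokens("<reasoning>2 plus 2 is 4</reasoning><answer>4</answer>"))
print(completion_tokens("4"))
```

either way the total is what gets generated, whether or not the reasoning part is shown to the user.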
Join Greg Brockman, Mark Chen, Eric Mitchell, Brandon McKinzie, Wenda Zhou, Fouad Matin, Michael Bolin and Ananya Kumar as they introduce and demo the new o-series models.
sama not there wtf
there it is
AGI CANCELLED
It’s the only model that will give me 10000 words sequel from my unfinished novel
but greg is there
Which, it’s good to be yappy, but the bad thing is that it will often hallucinate when it has a long output
jokes aside
why does this not keep you up at night!
there are a lot of people there
And forgets all the details I said before
Smells like brute force search
this is the most people i've seen attending a stream since the 4o launch
this time they have almost 3 times that
how are they gonna be sitting
is it gonna be like
a long table with greg as king
or what
why are reddit mods such damn haters
no comment saying why, no message
not a dupe
god
wait what does this mean?
who is shrek and donkey
The moderators removed his post
Gotta sleep rn
lmaoo
he is a very busy man
he might pop in at the end
doesnt he have a baby tho i guess his schedule is wonky
busy enough to not attend a major launch but free enough to tweet all the time
smh
oh yeah good point
idk what to expect with o4 mini tbh
well I'm curious to see if o3 is actually better than g2.5pro, it's just so far ahead of everything else still
this guy is part of the agents research team @ openai
so i think we're getting agent updates or a new agent related feature
(alongside the models)
And Sam Altman will be appearing on the livestream of new agent related feature?
it's part of the same stream as the one for o3 and o4 mini
so no
my bet is just on deep research upgrades
since gemini deep research made them look bad
gotta reclaim the lead
i actually think their existing one is superior to gemini's even with 2.5
problem is you get 10/month
what the 🤣
gemini is defo worse at news research imo
oai researcher fanfic
but I haven't tested much on deep research for studies
its crazy this all started because sama and elon wanted to stop google from getting agi lol
and now they're suing each other and are gonna let google get agi lmao
i wonder what ilya is up to with SSI
Crazier that things went sour because Sam wanted profits and Elon wanted fascism.
wild bro, kinda sad tbh
i wonder if they could team back up in the future
grok and gpt
vs gemini
ai wars
what i want is an arena battlefield for these models using agents powered by their models in a fight
where is microsoft?
hahha
thats about it
i dont think ms is trying to compete at all in the frontier space as themselves
who want to make this arena battle field with me?
Either going down novel but ultimately wrong paths to AGI, or more of the same (autoregressive Transformer LLMs with more compute). I would love it if Ilya actually created AGI but so far all we know for sure is he’s really good at building yappy chatbots.
it can be a minecraft thing where each model gotta build their own castle and base and then attack the other ones with strategies etc...
that would be cool ngl
you think ilya is on the wrong path?
the run-down:
Greg Brockman - President, Co-founder
Mark Chen - Chief Research Officer
Eric Mitchell - O-series Research, Deep Research Core Contributor
Brandon McKinzie - Research/Member of Technical Staff
Wenda Zhou - Research/Member of Technical Staff, o1 Contributor
Fouad Matin - Agent & Systems Research
Michael Bolin - Research/Member of Technical Staff
Ananya Kumar - Research Lead, Core Contributor for o1 and GPT-4.5
there is a rumor that he saw something
What you’re describing is WW3
i would pay for that tbh
like that could be the new benchmarks
agreed (they've chosen instead to invest $13bn into openai for that)
anyone test code claude 3.7 thinking vs 2.5 gemini pro and o3 high ?
yeah 2.5 pro is daddy
what better code and less bug
who's ready for DeepSeek to overtake them in 12 days
they shouldnt have released 4.5 tbh
so who is grandpa
at least in lmarena, they did last time lol
for o1, i believe it did
they did on o1
i mean openai had o3 already at the time
if openai didn't stall every release they'd be ahead of other labs most of the time
but they sit on their best stuff a lot
is o5 being trained? or is focus on gpt5?
parallel teams
the clone of terence tao i have locked up in my basement
gpt-5 already firmly underway, o5 in early stages
wow
now that the o3 and o4 mini teams have wrapped up they're being moved to mostly o4 and o5
why dont you go work for them?
isn't o4-mini just a direct distill of o4?
i'm happy with what i'm doing now
o4 is not ready enough to do that i think
i don't think so
the o3 we're getting today is basically a whole different model to the one they announced
yeah they said in that interview that they started training gpt-4.5 two years ago (and planning for it a year before that)
I think Ilya spent a lot of time at OpenAI chasing bigger Transformers convinced that the bitter lesson / scaling laws would mean that GPT-5 would become AGI by virtue of its size alone. And I think he missed a lot of opportunities (R1, test time compute) that other labs spotted, and that a lot of the big ideas he used came from DeepMind papers. I think Ilya is brilliant, but he has chased Transformers for so long I’m not convinced he can let them go.
that is what i have heard from inside
they kept delaying it, then the name was changed to 4.5 and they decided to just "bite the bullet" because they didn't want to waste all the resources they had put into it and not even let it see the light of day
ilya leaving made things quite a bit worse
damaged morale as well
i dont think so lol. i think he saw how test time compute would pan out
hmm, we will have to see, i still think he has found a trick that most people are missing
yea i think
yeah i thought so
yup
honestly, maybe they should've just scrapped it lol
hmm
i think we can give him grace to cook, there is no way he not gonna produce something good
to be fair if i had put upwards of $150M into one model i'd literally just drop it out of principle 🙏 😭
if deepseek can do what they did, i know ilya gonna come with heat
gpt4.5 sort of makes me doubt that scaling laws will actually hold up, no?
rip
well of course lmao
oh wait
no
misread
it did not cost close to a billion no
it was still one of the most expensive training runs ever though, perhaps the most
you was right:
https://x.com/btibor91/status/1912544591821132135
more expensive than that grok one?
Leo, why do you have so much info? This can't be open source material?
yes
although it's close iirc
grok 4 will probably beat it
i suppose you could say i am relatively well connected
~900B
3T+
lmao
he has intels
he know some openai staff
how many parameters is o1? and o3?
some of his friends
same as 4o lol
too
wow
beat me to it
if o1 is so small, why is it so hella expensive?
margins
semianalysis had an article on this, its the kv seq length
but yes they also have margins
do you have the link by any chance? Would be interested to read it
messed things up for them
Only oAI lab?
i predict o4 will be 50 parameters
50 what
50 bdozen
lmao
sama told me to trust the process
can confirm this is NOT the case gang 😭
the majority is from openai but there are some scattered among the other ones
its been a while might be this one: https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures/
be wary that some of the article isnt true
claude 3.5 opus/bigger model wasnt used in the dev of sonnet 3.5 according to dario
sonnet 3.5 was truly a generational training run
i wonder what anthropic researchers were thinking when they properly put it through its paces and it was cooking that hard
i dont think they expected it to be that good
it just lined up
their ceo used to work at google right?
all of anthropic's founding team were ex-deepmind iirc
lol
its like everyone went against google, then went against oa
it was a mish mash of ex deepmind and ex oai
ilya was google at first right?
google brain yeah
yeah - mostly oai though i thought
omg 19 mins
part of deepmind's genius / plan was setup in London so as to not get all their talent lured away by silicon valley prospects
i wanted to work at deepmind so bad when I was in college 😦
this is also sad. So much talent in EU and eventually everything is handed over to US
it be like that bro
i really like Demis Hassabis.. was listening to this the other day driving.. worth a listen https://podcasts.apple.com/au/podcast/demis-hassabis-on-ai-game-theory-multimodality-and/id1073226719?i=1000703257125
excited for o4 mini high
he even mentions 'ultra' in passing ha
yup 😄
noo
gregs presenting so u know its gonna be special
my power went out during o1 preview's launch and o3 mini i think lol hopefully the trend doesnt continue
lmaoo
my youtube is not working on chrome so i gotta use brave
google dont want me to see it
It's perfect
lmaoooo
I feel super expensive agi
yuppp
feeling the agi
Feel his scent
remember sama said he felt the agi with 4.5
Hearing plus users get 1 o3 request per week
who paying $200?
we should start protest for "free the AGI" like free the nipple movement
i am
i have to
its like concert tickets now
damn
i stopped paying after 2 months
6 minutes
but ill do it now
lol
feel it!!
how long after the announcement do we think they'll release o3 on arena so we can start ranking it against g2.5p? I want to run coding tasks across 2.5 and o3/o4mini and have them fix each other like with sonnet
PREPARE your prompts
aand i'm back :3
i thought you were gonna be in the livestream
how many millions rich?
you got me..
i'm actually just greg's alter ego
Okay, guys. Personally, I'll check on the Google models in LMArena.
damn bro, so you getting the 20k plan?
Wish me luck (I hope I find gold)
can i share with you my prompts?
what is the livestream link?
Join Greg Brockman, Mark Chen, Eric Mitchell, Brandon McKinzie, Wenda Zhou, Fouad Matin, Michael Bolin and Ananya Kumar as they introduce and demo the new o-series models.
but red is streamin it i think
i am lubricating
i better not get blue balls i swear
lotion at the ready lmao
oh yeah
i still don't know if they're demo-ing at the end of this one
so that'll be interesting to see
decent chance there will be one last livestream this week
sorry i forgot to say
greg!!!
in research lounge
ahhhhh
no twink?
the agent team
stop lubricating it's over
systems 😉
he's an ai researcher what did you expect
collective slightly awkward laughs
its okay @sonic tendon , imma just watch on my laptop, cant miss this lmaoo
kk
lol openai's website is breaking
yeah o3 is a beast
will prob stop streaming if you guys don't need me to
lol
just checked benchmarks vs december o3
some are better some are worse
it did worse on swe bench
by 3 points
hmph
hmph
o3 with tools is a beast
i want to understand the benchmark against other models... otherwise it's hard for me to understand the quality
im very curious about the pricing of these models
one million dollar per token
lmao aime is basically finished now
yes but not launching today i don't think
we shall see
nah
thats crazy
they saturated the benchmark
is this pass@1?
these tools calling are kinda neat
peak ui design right there
It sorta reminds me of what they demod in qwq max
The tool calling
holy moly openai's website is so slow right now
either it times out or i get 500s
help 😭
cooked
i wonder if your secret endpoint still works
it does lol
hey, i like it!
ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high. ChatGPT Enterprise and Edu users will gain access in one week. Free users can try o4-mini by selecting 'Think' in the composer before submitting their query. Rate limits across all plans remain unchanged from the prior set of models.
tf
are
you
on
????????????????????????
o4 mini on free
based
these updates looks pretty decent
why this feels whatever?
i wonder if the thinking dialogue is mostly based on reinforcement learning now
sigh.
yeah, it happens
o3 full is cheaper than o1 i think
🙄
what an enigma
xd
hmmm
"maximize the reasoning capabilities"
i wonder what that could mean in this context
4x more expensive compared to 2.5 pro 😦
local file access maybe???
?
i need to test this thing at geoguessr
4x is very good no?
he's talking about the model using the python tool to brute force an answer
damn that's a good idea
the december version seemed to be several hundred times
nice dude!
Ahhh thats where the API is from 😄
interesting, o4 mini did best on openai's interview multiple choice Qs
lmfao their own interview coding tasks are saturated now
wtf
So, openai doesn't care about arena leaderboard now? I don't see any updates today ...
i did notice that
openai PRs? what context is this in
In this case o4-mini may score higher on the arena leaderboard than o3
o
Measuring if and when models can automate the job of an OpenAI research engineer is a key goal of self-improvement evaluation work. We test models on their ability to replicate pull request contributions by OpenAI employees, which measures our progress towards this capability.
We source tasks directly from internal OpenAI pull requests. A single evaluation sample is based on an agentic rollout. In each rollout:
1. An agent’s code environment is checked out to a pre-PR branch of an OpenAI repository and given a prompt describing the required changes.
2. The agent, using command-line tools and Python, modifies files within the codebase.
3. The modifications are graded by a hidden unit test upon completion.
If all task-specific tests pass, the rollout is considered a success. The prompts, unit tests, and hints are human-written.
The o3 launch candidate has the highest score on this evaluation at 44%, with o4-mini close behind at 39%. We suspect o3-mini’s low performance is due to poor instruction following and confusion about specifying tools in the correct format; o3 and o4-mini both have improved instruction following and tool use. We do not run this evaluation with browsing due to security considerations about our internal codebase leaking onto the internet. The comparison scores above for prior models (i.e., OpenAI o1 and GPT-4o) are pulled from our prior system cards and are for reference only. For o3-mini and later models, an infrastructure change was made to fix incorrect grading on a minority of the dataset. We estimate this did not significantly affect previous models (they may obtain a 1-5pp uplift).
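For anyone trying to picture the grading described in that excerpt, here is a rough sketch of the rollout loop. It is not OpenAI's harness: `Task`, `Agent`, `run_rollout`, and `success_rate` are hypothetical names, the "repository" is just a dict of file contents, and the hidden tests are plain Python callables. The only rule taken from the excerpt is that a rollout succeeds when every task-specific test passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Files = Dict[str, str]  # path -> file contents (stand-in for a checked-out repo)


@dataclass
class Task:
    """Hypothetical task record: a prompt, a pre-PR snapshot, and hidden tests."""
    prompt: str
    files: Files
    hidden_tests: List[Callable[[Files], bool]]


# Hypothetical agent interface: takes the prompt and a workspace, returns modified files.
Agent = Callable[[str, Files], Files]


def run_rollout(task: Task, agent: Agent) -> bool:
    """Run one rollout: the agent edits a copy of the snapshot, then hidden tests grade it."""
    workspace = dict(task.files)  # copy so the snapshot itself is never mutated
    modified = agent(task.prompt, workspace)
    # Success only if every task-specific test passes.
    return all(test(modified) for test in task.hidden_tests)


def success_rate(tasks: List[Task], agent: Agent) -> float:
    """Fraction of tasks whose rollout succeeded (0.0 for an empty task list)."""
    if not tasks:
        return 0.0
    return sum(run_rollout(task, agent) for task in tasks) / len(tasks)
```

A reported figure like 44% would correspond to `success_rate` over the full task set, assuming one rollout per task; the excerpt does not say how many rollouts are run per task.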
If o4-mini is so good, how slick is o4???
"we put in more than 10x the training compute for o1 into o3"
o4???
yupp claude code gg
anthropic gotta up their damn game
i wonder how much better this is in things like cursor and windsurf
nice demo!
anybody got codex link?
i want pro now
gj
i believe it's in the api now
in chatgpt it is rolling out
then it's probably just a gradual rollout
higher tiers first
4.1
they dont train it in
so its a hallucination
they either prompt it in the sys prompt or trained it in
waiting for karpathy's review :3
coolio
Can this thing be evaluated on arena? As I remember, tool usage is blocked
the regular o3 and o4 mini yeah
dk how they will implement the tool usage stuff but it shouldn't take much additional effort
let me find somethin
can u ask o4 mini this: who won the 2024 london mayoral elections and by what margin specifically? if u dont mind @deep adder
try "Let a < b < c be distinct natural numbers. Must every block of c consecutive natural numbers contain three distinct numbers whose product is a multiple of abc?"
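A quick way to sanity-check whatever answer a model gives to that prompt is to brute force small cases. The sketch below is only a finite search, not a proof in either direction; `block_has_triple` and `find_counterexample` are made-up helper names, and "natural numbers" is assumed to start at 1.

```python
from itertools import combinations
from typing import Optional, Tuple


def block_has_triple(a: int, b: int, c: int, start: int) -> bool:
    """True if the block start..start+c-1 has three distinct numbers whose product is a multiple of a*b*c."""
    target = a * b * c
    block = range(start, start + c)
    return any((x * y * z) % target == 0 for x, y, z in combinations(block, 3))


def find_counterexample(max_c: int, max_start: int) -> Optional[Tuple[int, int, int, int]]:
    """Search all a < b < c <= max_c and block starts 1..max_start; return the first failure, if any."""
    for c in range(3, max_c + 1):
        for b in range(2, c):
            for a in range(1, b):
                for start in range(1, max_start + 1):
                    if not block_has_triple(a, b, c, start):
                        return (a, b, c, start)
    return None
```

Calling `find_counterexample(max_c, max_start)` with modest bounds either returns a concrete `(a, b, c, start)` that breaks the claim or `None` if nothing fails within the searched range; a `None` only says the range was clean.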
oh
yea
cant u disable search?
go to your personalisation settings
hmm
numbers are wrong
its 1m votes for sadiq but the exact number is wrong + susan hall
expected i guess
i couldnt probe 4.1 mini for the correct numbers
is that with my prompt
yeah it's hard
my private models failed it
LOL
ooh
its oxygen now
Googles turn