#general
they seem to have dialed back the slop
it is tho
which probably won't fare well for their leaderboard placements
it was recently, i think i got it somewhat recently
i think anonymous-test was, not sure about chatbot
i haven't seen an anon chatbot
oh wait he is Eyad Gomaa, right
they forgot to remove the "You are trained on data up to October 2023." system prompt appendix for both of them 😭
(it's wrong btw)
by the time openai have a model more up to date than oct '23 i'll be dead
it has the same june 2024 cut off as chatgpt 4o latest but they add that appendix to all 4o models
i noticed this with the o3 private model i have
do we think that that's actually a system prompt
i asked it to explain "hawk tuah" and it did it just fine but maintains when directly asked that its knowledge cutoff is oct '23
or just a bunch of posttraining chat data they forgot to clean up
so "our team" he referred to was literally himself and his professor lmfao
yes since it was retroactively added to older 4o models
ah
wait what
that o3 might be on the new chatgpt 4o latest base model that was cont pretrained
wait a minute
"As of today (April 2025) the most recent stable release published on npm is:
discord.js 14.18.0
Because new patch versions appear fairly often, you can always check yourself with one of the following:
• npm view discord.js version
• npm outdated discord.js (if it’s already in your package.json)
• Visit https://www.npmjs.com/package/discord.js
Those commands/web pages will return the definitive latest version at any given moment."
is that just... a good guess... or
is it given tools
hmm
thats a new base model if it knows stuff outside of oct 2023 @keen beacon
chatgpt 4o latest was cont pretrained to june 2024
give me something it would only know if it had a more recent cutoff
oh wait
who won the 2024 london mayoral elections and by what margin?
it just said the knowledge cutoff was june 2024 👀
will try
maybe something that it wouldn't think it would be a good idea to look up?
well, that might work
👀 new o3 model is based on the updated 4o
Sadiq Khan (Labour) was re‑elected Mayor of London in the election held on 2 May 2024.
• Votes:
– Sadiq Khan: 1,088,225
– Susan Hall (Conservative): 811,518
• Margin of victory:
– By votes: 276,707
– By share of the vote: 11.1 percentage points (Khan 43.8 %, Hall 32.7 %).
Because the 2024 contest was run under first‑past‑the‑post for the first time, those first‑round figures are the final result.
lol it got it bang on
More likely the reverse
interesting
confirmed!!!
i don't think it has tools
i'd have thought that going from 4o->o3 would take a few months
when i asked who the US president is
it said "As of the most recent information available to me (knowledge cutoff June 2024), the President of the United States is Joseph R. Biden Jr. If you need confirmation for a date after that, please check a reliable, up‑to‑date news source or the official White House website."
yeah
wild was right about continued pretraining yet again
lmao
wait
not sure i'd wanna be this guy rn
no doubt that is 100% new o3 full model
there's a slim chance this may actually be o4 mini?
since o4 is the one initially rumoured to be the new base
maybe but there have been zero public models/etc. with the new chatgpt 4o cut off (cont. pretraining)
yes its using the new 4o cont pretrained base
@keen beacon bruh u have access to the new o3 with the new base
they may have updated the system prompt in the last week or so
because before
it said it was oct '23
it might be a lie if it knows more than that
but i guess if it was somewhat recent it makes sense too
Yo, I'm seeing a lot of people have the Veo Gen in Google Studio. I don't see it for some reason, and I'm a power user of Studio. I use it like every damn day for hours, and I don't have it 😦
api only and paid i think
or it hasnt rolled out or smthing
my mom has it
your mom sounds cool
and i just setup studio for her last night
Maybe the advanced?
ya it hasnt rolled out to you ig
i've got it
lmaoo
aint no age thing
lmao
Hm
Maybe verification like that phone thing?
pretty crazy how good veo 2 is still
although that tree doesn't make much sense at the end
free or is it charging you on aistudio?
OpenAI need to finally release that gpt4o... like wtf are they waiting for 
completely free
wow
so only api is paid i guess
ai studio is a crazy product
Vertex AI isn't free, just a fyi
make sure ur not getting charged lol @keen beacon its hella expensive i think
yupp the free part is just wild man
yes you can't even select exp on aistudio. It's only preview
yeah i know
lol SOTA video and LLM for free
for 2.5 pro. Which was supposed to be paid for preview version
im really curious whether they removed anon chatbot in the arena because i got it recently and it was matching quasar with the same system prompt
i've never seen it lol
i've been wondering about the fidelity with which you could predict where anon models were on the leaderboard if you just triggered a bunch of matchups and saw what they got paired with
does that sentence make any sense
yeah
that's come in handy for me sometimes but i can't do it on a big enough scale to make concrete conclusions
especially since the web arena still leaks model names and providers when you open the site
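A rough sketch of the matchup idea above, assuming (this is a guess, not confirmed arena behaviour) that the sampler tends to pair models with similar ratings. The model names and scores are made-up placeholders, and `estimate_rating` is a hypothetical helper, not anything lmarena provides.

```python
from statistics import median


def estimate_rating(opponents, known_scores):
    """Estimate an anonymous model's score as the median score of the
    known models it was paired against. Opponents without a public
    score (e.g. other anonymous models) are skipped."""
    scores = [known_scores[name] for name in opponents if name in known_scores]
    if not scores:
        raise ValueError("no opponents with a known score")
    return median(scores)


# Placeholder leaderboard scores (not real data).
known_scores = {"model-a": 1400, "model-b": 1350, "model-c": 1300, "model-d": 1200}

# Opponents observed across a handful of triggered battles.
observed = ["model-a", "model-b", "model-b", "model-c", "another-anon"]

estimate = estimate_rating(observed, known_scores)
print(estimate)
```

With only a handful of battles the estimate is very noisy, which matches the point about not being able to do this at a big enough scale.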
yeah, same
although it seems like the web arena doesn't always have all of the models that the chat arena does
this was a thing with main lmarena site for a while i think
i might have stored a dump or two
oh, hmm
yeah if you have billing set up and you choose the wrong 2.5 (identical date and description)... you are getting billed lmao
exp and preview literally just the names slightly different
@keen beacon did u notice the o3 model change within the last week or so? or was it the same model with an adjusted sys prompt?
based on phases of development we can guess their current timeline
what is this all about btw? o3-mini update?
probably the latter
he has access to new o3 with updated gpt 4o base (june 2024)
what
o3 is only deep research lol
it's not through deep research 👍
then where? Isn't it not released?
it is not released no
i can't share where
sorry
its probably very imminent at this point tbh
yup
im curious how much stronger it is compared to og o3 👀
wdym
is it better than 2.5 ?
i mean theyre using a much stronger base model now, am curious lol
depends on your use case
in coding
well
there are still many specific coding use cases
for very reasoning heavy coding tasks, o3 is probably better
for most others, 2.5 pro is better
is that in your experience using it?
it would make sense to lower the cost with better base more than anything. OG o3 used to be closer to pro I believe. Judging by arc-agi wording in collaboration with openai as well as reported cost
yeah
again though
this is probably o3 medium
and the jumps between reasoning efforts are fairly significant
so we shall see once we have high
could have much higher peaks with a more competitive base model
if they ran it with as much compute as they did before
doesn't seem like o3 pro will launch at the same time as o3 and o4 mini
I'm not a fan of how they are making it look like it's scaling of RL training when they constantly keep low-key updating the base now tbh
going off of the recent frontend changes in preparation
why not? both avenues seem to have no end (to a certain degree) rn, you just have to be smart about it i think
especially base models
they are about to release reasoning models which are based on undocumented base models. Just seems obscure on purpose and wrong lol
they are about to release an api dated instruct version of it tho. and its already released in chatgpt 4o latest
makes it much harder to isolate things and do direct comparisons
about to... Though it is still not there 👀
chatgpt 4o latest post december has the new base model, and that has been released
o3 and o4 is also about to lol
yeah but that doesn't have any official metrics. And it was updated numerous times now. That's what I meant by 'low-key'
so today o3 will release ?
btw given that 4o-mini stayed the same all this time
I do believe mini reasoning variants are distills
likely
the interface is ready for it to launch
i wonder if o4 mini is on an updated base (we know nothing of that publicly though)
if theyre releasing o3/o4 mini, they will launch quasar/anonymous chatbot too i think
anonymous chatbot was replaced quite quickly after their last chatgpt 4o release
so they want the lmarena results fast 👀
I think it's a distill of their best most expensive version of o3
slightly off topic but what's everyone's bet on how they launch it this time
since full o4 is unlikely to exist yet
surprise live stream announcement, just drop the blog post with no warning
etc
this time we won't get to see the system card before launch
which is a shame
idk but they will prob do something big because the quasar thing is largely a huge marketing thing i think
i think they'll probably do some hypeposting beforehand
as ever
L
what will the o4 benchmarks look like
yea, they will
yeah
/gpt-5
just like o1 and o3-mini did I think. o4-mini-high to be comparable to o3-medium but excelling at different things
i wanna know the new arc agi results with the new o3 model 👀
maybe even o5, who knows
will o3 have an api?
ofc
openai stopped staggering launches starting with o1 iirc
when they launch on chatgpt they launch on api
well o1 pro wasnt on api for a bit i thought
yeah that's only pro
assuming oAI just launches o3 with no warning, would it just show up on the leaderboard out of the blue once it got enough votes
we won't be getting o3 pro straight away
isn't o1 pro just regular o1 doing best of 3
it will get added to lmarena like a few hours after api launch
then it'll be like
~1 week until leaderboard appearance
makes sense
i don't think it'll take top spot without stylectrl
but with it it might
again, depends on what reasoning effort version they use
last time they just put medium to start
same w mini until they added o3 mini high
would o3-high be prohibitively expensive?
i'm somewhat unsurprised that openai doesn't care enough about lmsys to sponsor that
ah, that makes sense
didn't think i saw them in the sponsor list
oh, i just meant for lmsys
they're not quite sponsors in that sense so the labs aren't in the list
but it makes sense for openai to fund it because of how much they gain from the data
(i also can't think of much explanation for why all the direct chat limits for models were removed a couple months ago other than that)
Wdym? It’s clearly just a tree being pushed along in the other direction by a friendly gardener, totally normal human behavior my fellow human.
is there a discord bot that summarizes from the last part of the convo you are in to the current? that would be so clutch man
? theyre still present i think
i haven't run into one in months
i used to all the time
then they just stopped
its model specific

i keep being surprised that something as small as lmarena seems to have any impact on these massive companies' decisions
some have user limits, global limits per interval, etc
i haven't had any on claude, 4o or gemini
yeah i remember
i think discord has been testing out ai summaries for like a year now
i had one on an older version of 4o when i tested for the appendix (on quasar and anonymous chatbot) recently
hm
you might still need a client mod to access them
perpetual beta
Explore a dynamic world of politics in this turn-based, political simulator. Create a character, run for political office, write legislation, balance budgets, and more as you move up the political hierarchy. Create a Custom Politician: Play as a Democrat or Republican in the American political system. Customize the name, age, and appearance of you…
if optimus alpha is openai related, its probably gonna be o4 mini
again
oh
optimus
nvm
ohh yeah
yeah idk what else they could put there
they defo wont put o3 its far too expensive
i dont think they have an updated 4o mini yet (at least publicly, we only know of an updated 4o), and it doesnt really match up with the name
the soon-to-be 4o/o4 confusion will be fun
dw it's just the 120574th chatgpt 4o update
🤣 thats quasar tho
i dont think they have another one ready that quickly
i wouldn't put it past them
this next 4o release is gonna be big tho. the whole quasar thing, etc. (no released benchmarks on chatgpt 4o latest with the updated base model, despite massive improvements, etc)
they are REALLY milking 4o
none of us thought it would still be the default almost 1 year on when it launched
To be fair, it’s a pretty solid workhorse model that hasn’t been easy to beat. Up until reasoners came out it was pretty much undefeated except for coding.
it is solid yes, but it doesn't really hold up very well these days
i cant keep up with these convos
and coding is a big weak point given how many people use llms for that specific area
fr
thats their bread and butter it seems
but it def got better
i like it now
it's very good for creative tasks and has got noticeably better at most other things but it is still pretty unacceptably poor in code tasks compared to other llms
quasar is supposed to be a large improvement over the previous 4o in coding
i've tested it
it is an improvement over the old 4o but it's still lagging behind claude 3.7 sonnet
100% but millions of non-techie users don’t care about that, they just want to cheat at schoolwork, generate their marketing slop, or add chatbot support to their enterprise to appease investors.
vencordd :3
yeah fair enough
a few times a year i wake up early and can't fall back asleep because we are launching a new feature ive been so excited about for so long.
today is one of those days!
AYEE
HERE WE GO
"new feature" lol
we're getting o3
oh
wait a minute
hes underplaying it but i think its gonna be multiple model releases
Also talk dirty with their personalized SmutBot, although I don’t think GPT4o lets you do that
yeah
ye :3
What
new 4o is actually quite uncensored
LOL
let me find a screenshot
yeah its just tech bro speak for 2+ new models
he is so random with his tweets, we really in wild times when ceos use twitter to launch stuff lol
i may know where this is going
but im here for it
i wake up early and can't fall back asleep
real
can't find it 😔
i see people complain about sam not using grammar quite often but i like it
we can all agree professionalism sucks right
idk it's like
almost informal enough to feel like genuine internet speak, but not quite
can someone tell him to hurry up
it'll be at least a couple more hours lol
sam will be there
i have no doubt
universal experience
im guessing the next openai releases are:
- o3
- o4 mini
- quasar (4o api dated version, benchmarks, lmarena [was probably anon chatbot], updated stronger base model that was cont. pretrained to june 2024 from oct 2023)
you think we get o3 pro today as well?
it does not seem like it
i dont think so
nope
and it might not happen all today
it'll be "in the coming weeks"
what I dont get is why they're releasing o3. If they have o4 mini, then it's probably distilled from o4 right? So why not release o4
they redid o3
nobody's ever released so much at once
that's just how they've always done o series models
with the new 4o base model
it has knowledge in june 2024 now
now that's an interaction i can stand by
what's the point of releasing the new 4o when we already had it for days lol
like they should just drop it silently lol
u can pay for it now 🙂
and if they're likely going to release a 4o update (quasar) alongside o3 and o4 mini today
then that makes whatever optimus is all the more interesting
i kinda doubt another 4o version
but i also doubt o4 mini if they're releasing it today
it's weird because
the naming scheme + context window with quasar would make you think google (stargazer, moonhowler, nebula...)
but all my testing screams openai
bruh the benchmarks, pretraining knowledge, cut off, tokenizer, using the same openai chatgpt 4o anon name on lmarena, adding the same 4o api appendix, etc. it is 4o
quasar is definitively 4o imho, but idk what optimus could be
so that would also mean that nw is OA?
no thats google
the benchmarks?
i do expect new google models possibly today but likely tomorrow
they will want to respond and we're still waiting on 2.5 flash nevermind all the anon great coding models they've been testing
is anyone gonna livestream the OA livestream here? that might be cute
can u do that on a discord server?
oooh thank you for the info
what is sama automate
play the yt livestream in a video player
perplexity is cooked
and then stream that window
they have no moat lmao
i feel like LLMs are well on their way to becoming commodities at this point
let alone random LLM search applications
tbh any other company that isnt an ai frontier lab building stuff on ai models will get crushed
regularly scheduled programming = resumed
they are important as a part of history
wait what happened
yeah that's the only thing they're gonna be.. history
a trump admin official said the word tariff too much -> investors run screaming
aren't they already
it is the prophecy
lmaoo
pretty much
chaos 😈
after r2, i think things will be mostly over
the trump admin's WH site really sucks
Was obvious since the beginning
gemini can do better
oh god did you see it when it first came out
they like
played a promo video when you opened the site
Its gonna be something stupid
🙏
We know hype sama
yeah and it screamed authoritarian
that too
Always hyping his products
i think hes underplaying it this time lol calling it a "feature"
i'd be down to hang out in a vc, just am gonna be at school for the next 4 hours 😭
so
probably gonna miss the livestream
do you guys remember phind
i used it for the free gpt-4 back in like 2023
i have no idea how they're still in business
hell yeah
this was such a fire lineup
if only joe ran in 2016
maybe we would've never had the orange
obamna? 🥺
i was just about to
have used it a few times
they're still in business maybe because the ai boom is still in progress
.
Is the thinking slider added yet?
nope
It could be that no?
may arrive today
So thats the feature

no chance
ok what the fuck's a thinking slider
the quasar release is very imminent anyway, if its not today
nobody would get hyped about that
Let's see
oh yeah, everyone's burning through VC money right now
xd
Oh he will
he won't
👎
Another thing: how is this not market manipulation?
the thinking slider is already out in beta on some of the clients
doesn't make sense to be doing all this for it
why deploy chatgpt after midnight with anticipated model names tbh if its not coming out soon
^
wdym?
seems to have only gotten worse as the ai boom continues
wrong paste
smh
so it begins
whatever the case the new o3 (on a new gpt 4o base model) is seemingly ready anyway 👀
LOL
i like how xAI took a huge leap off a cliff
maybe for the $1.2m polymarket listing, which he definitely doesn't have any incentive to participate in
Usually insider information - such as hinting at a release - is kept under wraps until an official statement is released, in order to not mess with markets. But I'm genuinely asking because I don’t know how it works
Hmm
-# that's because grok sucks
grok 3 is mid
yeah that was crazy
when it happened on the march market, a couple people made like 50 grand
i imagine it's less of a concern since oAI isn't a public company
although it could still impact competitors
i wonder if oai will ever go public
i feel like that would finally cross the line
yeah, it'd suck
i heard someone talk about this recently and it made me think
most labs have a plan/goal and everything for developing AGI and what they would do if they achieved it first, etc
and the question was - which lab would you rather have control over the first true AGI
google could possibly be worse
yeah i don't know
Private company and it would have to be way more specific to apply. If this were manipulation tons of salesman ceos like altman would be in trouble. Elon has been saying fsd is a year away for almost a decade
possibly anthropic
openai and anthropic are the loudest about what their intentions would be
google are very vague
i wish there were a way we could be comfortable about ai safety without making everything proprietary forever
but, eh
i was about to say none of them but after red said that i'd say anthropic because least of many evils
none of the labs are great if we're talking ethics but anthropic have the best vibes
there's the question of "would you rather it be in the hands of a lab that's too careless or a lab that's too careful"
i would say the latter
mistral but I dont think it's happening 😂
A lab in a democracy would be ideal
lol yeah
anthropic seems like a bunch of idealistic nerds (but, not led by Sam Altman)
if a lab has achieved agi other labs would probably follow shortly tbh
grok agrees too
which seems like the best-case scenario
it would def be a domino effect kinda thing yeah
💀
where are they getting this from 😭
schizo
why mistral?
makes a lot of sense yeah
i haven't heard of that name in a while
you never know
i would definitely hope not
i dont even see any chance of anthropic getting there first
I think I'd trust DS more than Meta
i think it's a 2 horse race atp
GDM vs OAI
anthropic will get there soon after one of those does
wait, what is this in reference to
but they aren't daring enough
got it
well it's eu based (am eu citizen lol), committed to open source and doesnt seem as sketchy as some bigger companies (meta, google etc)
- deepseek
Depends on the lab or how advanced it is. If making sure it is most capable is a top goal of the model or lab it would probably not have a hard time preventing others
I think most people want what's best for the world, DS researchers included
they'll have their spies running back to the china hq once oai crack it wink wink
even if they're operating under a system that's going to pressure them into researching military use
they clearly put the minimum possible effort into the censors, for example
people saying this:
https://x.com/iruletheworldmo/status/1910342414944137344
what would agent gpt do?
like agentspace
ew not that grifter
lmaoo
There's little stopping China. They got tpuv6 specs over a year ago which are like crown jewels level secret
the votes are real tho
idc about what he posts, but the reactions show what people are thinking
beyond oai/deepmind, the chinese frontier labs are probably after them tbh. skill diff and ingenuity tbh. unpopular opinion probably but i dont see anthropic getting there faster than chinese frontier labs (deepseek and qwen)
yeah they don't let silly things like ethics stop them
i respect the dedication
i think it'll be oai > gdm > anthropic > deepseek
the last two can probably be swapped at will
wait, what's gdm?
google deepmind
oh google
anthropic has done zero public image gen work i think
yeah dario made a good point about this
very little real commercial application
yeah anthropic are very focused on just llms really
and imo it's a bad look
for now. but i think this will be a critical aspect in the future
openai are the most innovative lab imo
they are normally the ones "to follow"
multimodal cot will be important
which might include native image gen
(then they normally get beat at their own game... but hey, that's the spirit)
i sort of dislike the oAI leadership
win imo
I dont get why people are hyping up google so much tbh. Do you really want a future where google controls AI as well in addition to already pretty much controlling the internet?
their board also has some big conflicts of interest
no, but they keep burning money to give us free stuff
wut
2.5 pro has a jan 2025 cut off. (they cont. pretrained 2.0 pro/etc) this means the 2.5 pro timeline is absolutely absurd lol (1-2 months)
what market is he manipulating
for now, they'll quickly change once they establish the monopoly
i doubt any lab will be able to form a monopoly rn
So the new feature is either memory improvements or that thinking slider
i plan on riding on their free stuff and then immediately switching to other providers when/if i need to scale up
i will not pay google a dollar for api credits
yeah everyone copies everyone pretty quickly atp
Why
no moats last for long
because google
i loooooove that about the ai space
there's major production companies using veo 2 right now which was born out of imagegen
and this seems to keep getting more and more true
i may stand corrected
its not just copying. companies are neck and neck in research i think
there is little commercial application
what's launching today?
so far i know
- qwen 3
- something from openai (optimus on OR and/or o3/o4 models)
- maybe a google model
qwen3 isnt happening today
didn't one of the qwen researchers dispute this
lol no
hm
2.5 flash just didnt launch yesterday for whatever reason
ok
memory improvements would be incredibly silly to hype up when they're having their lunch ate
it's for sure models. metadata and amount of hype in tweet is basically confirmation
metadata? but yeah agreed
i think that bindureddy lady is off her rocker tbh
if Sama can't sleep because he's so excited over memory improvements he must be more autistic than i thought
lol
im wondering if theyll bunch quasar (updated 4o) with the releases
it would make aense if o3 is based on it
sense
honestly, the confidence and cadence i see in her tweets does sort of remind me of the stuff people write when they're hypomanic
sense
i feel like a lot more CEOs are on the bipolar spectrum than people realize lol
most tech ceos are probably on some kind of spectrum haha
yeah autism and adhd too
they have such a huge adaptive advantage in tech jobs, it's crazy
compared to what every other big tech has done with their lead, Google is the best option that we have (IMO)
ya but its a big release. they did the whole anon model on or thing, its an updated 4o with 1m context, large improvements, api dated probably, super fast, etc. they could dedicate a single announcement with it
they definitely could release it tomorrow via blog post so that the stream isnt too intense
all that really matters to me is the leading AI coming from democratic nations. i don't really see much ethical difference in google vs the worldcoin founder who successfully drove out all the ai safety ppl
i wish i could get the hypothesis that i have adhd confirmed/denied but i can't
optimus alpha is now on openrouter, a reasoner most likely from openai
optimus 💪
just guessing lol
how many deep research runs with gemini do you get a day?
i dont think it reasons
?
it just had a high latency
"hey lemme just absorb myself into neovim for 7 hours straight without taking any breaks" describes like a third of my autistic friends lol
LMAO
sorry guys
no it's not a reasoner
yes I agree, but not gonna lie, my perception of the US being a democratic nation has changed somewhat after.. uh certain events
democracy and demagogues
yup its another gpt 4o with the new base model (june 2024)
wait what
i cant believe they dropped another version
oh right that one
you can't just say "my first question" that makes me curious
for no reason even though it doesn't matter i'm not an ai
i'm gonna do the jar test
haven't tested it with code yet
o4 mini high should be the best all around model if we go by history, but it has to be way cheaper than 2.5 and faster, bc its not gonna be on the same IQ level obviously. if it's slightly worse than like 3.7 but fast af and cheaper than 2.5 then we have a winner
perhaps that's where it improves
how so? just bad mental health systems in your area
?
But we haven't seen anything better on LMArena? did they not release it here?
ahh true
womp womp (bad result)
i have a feeling that it is probably comparable to 2.5 but 1M context
pretty much
wait did you get me to try this with o3
20
im gonna run gpqa diamond and math 500 on it
but im pretty sure its another gpt 4o lol
the $1200 per question o3 high model would be really cool to have public for big research problems
no one needs to do more than 20 deep researches in a day 🙂
i did close yesterday but am mostly testing
deepseek is probably gonna replicate it at 1/3000 the cost in at most a year
was that the one where it got it wrong first but adding "think carefully" it got it right
oh, i have no idea
never tried that, actually
i don't think price would be that starkly different for inference.
maybe not, but R1 was already ~1/50 the cost of o1 with pretty similar performance
I DO
One workable way is to hide the information from yourself before you ever have a chance to see it.
1. Prepare the record.
• Tear the sheet of paper into two slips of exactly the same size.
• On one slip write “GREEN”, on the other write “RED”.
• Fold each slip so the writing is completely hidden and the two folded slips look and feel identical.
2. Load the jars.
• Put the folded “GREEN” slip into the green jar.
• Put the folded “RED” slip into the red jar.
3. Remove your ability to distinguish the jars.
• Close your eyes (or blind‑fold yourself).
• With your eyes shut, move the two jars around on the table for a few seconds until you have genuinely lost track of which is which.
4. Make the selection blindly.
• Still blindfolded, pick up either jar at random.
• Take the ball and the folded slip out of that jar and put both straight into your pocket without opening the slip.
• Return the now‑empty jar to the table and, still blindfolded, shuffle the two jars again for a few seconds.
5. Open your eyes.
• You now see two jars, one green and one red, but because of the blind shuffling you no longer have any idea which was the one you emptied.
Result:
• The ball is in your pocket.
• The folded slip in your pocket accurately states the colour of the jar it came from.
• You yourself never learned (and still don’t know) which jar you chose.
Anyone who later opens the folded note will learn the correct answer, but you remain ignorant, satisfying all the conditions.
o3
lmk how it did
hmm do people have o3 full yet?
yeah that's perfect
How it compares to 2.5 pro?
seems like both o3 and gp2.5 get it, just intermittently
i presume o3 mini and o1 did not?
for that guy who said he was gonna look into ai and roblox
Chatbot Demo.rbxl (657.1 KB) This is a place file that is already set up, requires only that you: Get your Free API token from Huggingface profile settings and insert it here as shown in the image! Publish the place and make sure you have HTTP requests enabled! To Disable a feature you can set the variable to false! You’re maybe thinkin...
forgot what he was gonna do with it
look at this
whats this
leo tested it one other time and it got a plausible-sounding but slightly incorrect answer
wait this might be a new gpt 4o mini
optimus alpha
it's just 49
4o
another update
gpqa diamond: 60.10% (maybe there might be something wrong with my evaluation framework/answer parsing)
o1 did, actually
i think it's the only model from before 2025 that got it right
no full 4o
that's quite the jump
it dropped 7 points
what reasoning effort?
oh nvm i was using the weong point of comparison ☠️
wrong*
not sure, whatever the lmarena default is for o1-2024-12-17
Predictions on how long until a model can solve this?
nice!!!
have you guys seen https://mcbench.ai
yea this seems like openai model
im running math 500 rn, ill review gpqa diamond samples a little after this
yeah 2.5 pro literally murders every other model
if it is 4o mini 60% gpqa diamond is impressive, otherwise something is borked (massive degradation)
it's actually wack
except for sonnet?
even sonnet
o1 does surprisingly bad, also
are we looking at the same bench?
is this better than quasar?
not a ton of votes tho
ignore the leaderboard
start rating
you'll see what i mean
ah, ok
you can click on the models in the leaderboards to see their builds
you could try o3-medium w/ some of these, although setting it up for self-hosting seems like it wouldn't really be worth the effort
i wanna see the "abstract mathematical concept" build, sonnet's impression of that was really cool
Gemini 2.5 Pro kills Sonnet 3.7 99% of the time. It's always like this.
yeah, we were just talking about mcbench specifically
update: i think sonnet is still better in some of these
not publicly available yet
Yeah McBench usually 2.5 is a vast improvement over 3.7 too.
There are weird edge cases of course.
sonnet has some good builds tho
As I understand the optimus is a new open source model of OpenAI
it's still gonna be a bit
i doubt that the oss model is trained yet
Their elo system is bad rn. Test them out. Sonnet is clearly 2nd but 2.5 is in class of its own
okay yeah optimus is worse than quasar
its done worse on basically everything ive thrown at it
well, it doesn't have that many votes yet
kk, i gtg
gpqa diamond: 60%
math 500: 89%
cya
i wrote an eval framework myself lol
its part of the work im doing
quasar alpha:
gpqa diamond: 67.42%
math 500: 90%
optimus alpha:
gpqa diamond: 60%
math 500: 89%
march chatgpt 4o (measured by artificial analysis):
gpqa diamond: 65.5%
math 500: 89.3%
i only used 1 sample for optimus alpha tho (pass@1 estimated w 1 sample). for quasar alpha, pass@1 was estimated with 4 samples on gpqa diamond and 1 sample on math 500
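for anyone wondering what "pass@1 estimated with n samples" means in practice, here's a minimal sketch. it assumes the standard unbiased pass@k estimator (n samples per question, c of them correct), which for k=1 is just c/n averaged over questions. the function names are made up for illustration and this is not the actual eval framework mentioned above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question: n samples drawn, c correct."""
    if n - c < k:
        # Fewer than k incorrect samples, so every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts: list[int], n: int) -> float:
    """Average per-question pass@1 over a benchmark sampled n times per question."""
    return sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical toy data: 4 questions, 4 samples each.
    print(benchmark_pass_at_1([4, 2, 0, 1], n=4))
```

with 1 sample per question each question scores either 0 or 1, so the estimate is noisier than the 4-sample one, which is worth keeping in mind when comparing the two models.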
oh i like that one
optimus must be mini
oh boy
o4 mini is gonna be awesome with a much stronger mini base model
i just hope o4 mini is cheap like o3 mini
it is it seems
is optimus better than quasar?
why do you guys think quasar/optimus are mini models
just because they're fast?
plain 4o is fast too
quasar is updated 4o, optimus is 4o mini updated
4o mini just got continued pretraining with the new cut off too 👀
o4 mini is GONNA GO HARD lol
Hi
Nightwhisper released?
very
i've forgotten the difference at this point
why are there both
please more normal names
i havent seen much evidence though
Honestly I think this is the most likely scenario since LeCun is right about AGI and most other labs are too focused on scaling up autoregressive token yappers to truly deliver AGI. And you know what, despite all the shady things Meta have done, they are the reason we have a flourishing open source landscape in AI. That’s not nothing.
sm1 ran aider + look at my benchmarks
LeCope
o4 mini high is about 3000 elo on codeforces
3000 elo???
lmao
yeah it's going to be one hell of a model for a lil guy
Reminder that this means nothing in real world use case
good things come in small packages (does not apply to me btw)
is this true ??
block the guy if u cant see hes joking
Nah I mean what’s the current highest elo?
part of deepmind still seems very focused on RL
man this 4o mini and o4 mini release is gonna be extremely exciting
3828
R2 is expected to come out with Qwen 3
3000 would make it the 67th best competitive coder in the world
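to put the 3000 vs 3828 gap in perspective, here's a rough sketch using the textbook Elo expected-score formula. big assumption: codeforces ratings are Elo-like but use their own update rules, so treat this as an illustration of the scale only. `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of player A against player B (0 to 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # 3000 (claimed o4 mini high rating) vs 3828 (highest rating quoted above).
    print(elo_expected_score(3000, 3828))
```

under that formula an 828 point gap means the lower-rated player is expected to score under 1% head to head.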
hmmm
yea
u can see in simpleqa
with o1 mini simpleqa regressed but they figured it out in o3
its obviously not 4o based because 4o has more than double the simpleqa of o3 mini
it has the newest reasoning research in it
well up to now 😄
o4 mini is gonna be crazy
with the new much stronger 4o mini base model 😄
Are 3.7 and 2.5 Pro tested on that benchmark?
Hope 1m context will be the new standard for LLM
And I think they should train LLMs for MRCR like tasks
where'd you see this btw
none when o4 mini gets released 😄
o1 better at harder coding and math
Optimus Alpha worth trying? How good is it?
Comparable with 3.7 Sonnet?
Above said it’s worse than Quasar Alpha
i could be wrong but that was my sense from following top mathematicians who use llms
And I thought Quasar Alpha is trashy
it depends what kind of problems u throw at it. 4o has more world knowledge than 4o mini
but see here: https://matharena.ai/
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
o3 mini is generally better
USAMO is best bench there though bc others are judged by llms
A while ago, Sam said they have a model internally that ranks around 50th in competitive programming.
(they use wolfram alpha under the hood)
most likely referring to o4 mini
no it isnt evaluated with llms
its answer match i think
beyond usamo
yea
it checks whether the answer matches
i mean USAMO grades on how you reached solution with human reviewers and others dont. if im remembering right
yes but the others are still valid results
agreed but some people dont
Do you guys think LLMs in the future will be specialised again, like Anthropic for tool calls and coding, OpenAI for Maths and photo generation, and Gemini for long context and multimodal conversations
no
WTF?
..wtf
GPT 4.1 after GPT 4.5?
lol
so this is gpt 4.1o and gpt 4.1 mini
wow
terrible names
oh for god sake
GPT-4.1, which one source describes as a revamped version of OpenAI’s GPT-4o multimodal model.
✅ i was right
hop on archive.is
"OpenAI is also readying the full version of its o3 reasoning model and an o4 mini version that could debut even sooner. AI engineer Tibor Blaho discovered references to o4 mini, o4 mini high, and o3 in a new ChatGPT web version earlier today, suggesting these additions are imminent. I understand o3 and o4 mini are both set to debut next week, unless OpenAI moves the launch plans around."
yeah i have an extension that fixed it dw
"OpenAI CEO Sam Altman teased on X that OpenAI would be launching an exciting feature today, but it’s not clear if this is related to the o3 and o4 mini references in ChatGPT or not. Sources caution that OpenAI has delayed the introduction of some new models recently due to capacity issues, so it’s possible for the new GPT-4.1 model introduction to slip beyond a planned debut next week. I asked OpenAI to comment on this story, but the company didn’t respond in time for publication."
well
so it's still up in the air what we're getting today
capacity issues?
but it will likely be a model
i presume in relation to 4o image gen
Go buy TPU from Google
oh i was wondering lol. they are serving new 4o and 4o mini for free 🤣
no bc it's smaller than 4.5
just way better trained
quasar alpha is probably new 4.1 mini
optimus alpha is probably new 4.1 nano
and there's some 4.1 they haven't tested
jeez this naming is so bad
ikr
it's like they have a meeting when launching a new model and they agree on what the worst one is and that's the one they use
but i mean
they should just scrap it all and restart
i think adding numbers after the version is like the right way to do it
make it make logical sense instead
well yeah
gpt4, gpt4.1, gpt4.2, gpt4.3, wtv
bro the model picker is going to get even worse
ikr??
the average user is going to click that and have a seizure
Yes, it is a poor naming
Gotta screenshot that 404 screen and say we got early access
Lmao
Can 4.5 be better trained?
its too large and inefficient to work with
it can be better trained, it was probably too large relative to the current optimal pretraining scaling laws, etc.
doesn't really make sense for 4.5 to exist
u have to remain agile especially now
I don't think Gemini 2.5 Pro is a smaller model than DeepSeek R1
like it has some utility but limited
if the anon openrouter models are 4.1 i'm automatically disappointed lol
what else could they be lol
they aren't big enough of an improvement to justify a new model version num jump
2.5 pro probably is decently large, has pretty similar price per token as grok 3 surprisingly
ya they shouldve named them 4o and 4o mini
a new version of it
gpt-4.1o-latest-preview
openai style
there's 4.1, 4.1 mini and 4.1 nano! they exactly fit with nano and mini imo
so more context
optimus matches 4.1 nano imo
here's those results for private along with some other oai models (same quiz, single pass)
calling it 4.1 is also funnier because 4.5 already exists
yeah this is o3 medium so that's pretty damn good
I don't think there’s a use for “nano” when it’s not an open-sourced model
i mean the other guy said theyd be disappointed if its 4.1 its definitively 4.1 but we dont know which model is which
my bad i meant to reply to him
At least it is suitable for research to test new algorithms for small models
a lot of people were interested in how we made GPT-4.5 and what comes next.
we did a podcast with alex paino, dan selsam, and @atootoon who helped drive the project.
full episode coming soon, but here are some interesting clips:
and also
Among the new AI models will be a release of what I’m expecting will be branded GPT-4.1, which one source describes as a revamped version of OpenAI’s GPT-4o multimodal model.
this is big boy gpt 4.1 (quasar/gpt 4o provenance), gpt 4.1 mini is optimus, nano we have yet to test
yeah ig "it's 4.1" feels like it's talking about the full model not the smaller versions
why do you say that? quasar alpha is worse than 4o I think?
my prior is 4.1 full would be better than 4o for sure
did u see my benchmarks
yeah and nvm you're right I was dumb 4.1 being alpha makes sense
On what criteria do you guys think o4 mini will take the lead over Gemini 2.5 Pro?
also worth looking at what he replied to
Gotta sleep
@SmokeAwayyy o3, o4-mini are not launching today, they come soon.
today is a new feature.
and it points, i hope, to something very important for the future of how AI will integrate in our lives.
well
that's a shame
Good night from UTC +08:00
what did @misty vault mean by this
yes
bros just a hater
ah... 😦
I think it's going to be another gimmick feature. or something completely irrelevant
if it is then gg openai
"memory improvements"
cant be playing when google releasing heat
💀
why not just start another line of models and call it Gemini 1 or g1 for short
whoops
20 minute message delay
lmfao
10am livestream again?
probably
I just don't get why quasar has 1M context window when 4o has better eval at 120k
Source?
Benchmarking AI Models for Long Context Comprehension
they can flex needle in a haystack like meta 🤣
everyone gravitates to the window size without looking at evals for long context. doesn't really happen with any other element of these models
Gratitude 🙏
besides if u use the higher context window, you give them more money lol even if its massively degraded on a lot of tasks
oai still have a long way making a good coding model
im not talking about a reasoning coding model
but a model that is practically useful
all these benchmarks are impressive but it doesnt reflect real world case scenarios
2.5 long context is legit. I can summarize 900k token video files reliably when 1.5 pro and 2.0 flash didn't really come close
codeforces is a good benchmark for reasoning but the majority arent always asking the model for algorithmic use cases
if the model is good at math then you would expect it to do good on codeforce benchmark
as its all just mathematical algorithms implementations
i still think openai didnt do enough RLHF on coding aesthetics/styling like anthropic and google
and i still dont understand whats the issue/constraint here
deepseek probably used that on their latest update
How long is such video? Do you pass a real video file or transcript
hour long videos. not transcripts i need info on the screen from tickers and charts and stuff
What could be your use case? So interesting! Are current models even capable of taking movie long videos as an input? I would have guessed no
project on how markets react to financial cable news
wait, what timezone
pst/pt i think
i guess no livestream today
it's literally just
memory of your past convos
it's so over
Lmao
told ya
im always right
i mean not always but like 99.9%
🗿
disappointing
I knew it.. irrelevant.
Could have been a small tweet to announce this feature, rather than grandiose can't sleep crap
at least all the clueless ai hype accounts on twitter tweeting stuff like "o4-mini today 👀 " look dumb
@keen beacon btw could it be o4 mini instead thats the private model?
it has a new base model but now we know of a mini version it could be that
the only companies that are on the right path rn for AGI are google/anthropic/deepseek
they are all innovating and trying to understand how the model behave internally
google is showing us that we still havent hit a wall
this constant overhype by OAI is getting on my nerves
deepseek is innovating in hw & sw
anthropic are trying to understand the model black box
he always does that
look at this ratio
I am not very familiar with these things
What does this ratio imply here?
only available for Pro users starting today. "Soon" for Plus, nothing announced for Free. Not available in the EEA, UK, Switzerland, Norway and Liechtenstein
the new memory
lmfao
possibly
Ehhhhhh
it's not cool but memory is really important to openai and altman. the more lock-in they can get the less likely it is people will leave for smarter models. it's real dynamic rn. i know some ppl who don't care about 2.5. they just want their chatgpt memories. that strat works really well for keeping normies on board as long as gulf between models doesn't get too big.
i for one absolutely do not want chatgpt to remember the things i tell it 😇
it's worse than I thought... gawwd
I dont think it will create any lock-in unless there is significant improvement in the responses because of this feature, which I highly doubt.
there already is an element of lockin before this. we're enthusiasts. normies won't move to a model that doesn't know them unless it's significantly better
By the way, I do think that getting it done correctly is a good feature but they should announce it properly
Starting today, memory in ChatGPT can now reference all of your past chats to provide more personalized responses, drawing on your preferences and interests to make it even more helpful for writing, getting advice, learning, and beyond.
thats the most boring feature ever
o3 mini was a coding model wasnt it?