#general


ocean vortex
keen beacon
#

which is another llama slop model

#

not quasar

sonic tendon
#

they seem to have dialed back the slop

keen beacon
#

it is tho

sonic tendon
#

which probably won't fare well for their leaderboard placements

keen beacon
#

it was recently, i think i got it somewhat recently

sonic tendon
keen beacon
ocean vortex
#

oh wait he is Eyad Gomaa, right

keen beacon
#

and i use it quite often

#

will see if i can find it now..

keen beacon
#

(it's wrong btw)

#

by the time openai have a model more up to date than oct '23 i'll be dead

keen beacon
keen beacon
sonic tendon
#

do we think that that's actually a system prompt

keen beacon
#

i asked it to explain "hawk tuah" and it did it just fine but maintains when directly asked that its knowledge cutoff is oct '23

sonic tendon
#

or just a bunch of posttraining chat data they forgot to clean up

ocean vortex
keen beacon
keen beacon
#

that o3 might be on the new chatgpt 4o latest base model that was cont pretrained

#

wait a minute

#

"As of today (April 2025) the most recent stable release published on npm is:

discord.js 14.18.0

Because new patch versions appear fairly often, you can always check yourself with one of the following:

• npm view discord.js version
• npm outdated discord.js (if it’s already in your package.json)
• Visit https://www.npmjs.com/package/discord.js

Those commands/web pages will return the definitive latest version at any given moment."

#

is that just... a good guess... or

#

is it given tools

#

hmm

#

thats a new base model if it knows stuff outside of oct 2023 @keen beacon

sonic tendon
#

good question

#

might be able to ask it

keen beacon
#

chatgpt 4o latest was cont pretrained to june 2024

#

give me something it would only know if it had a more recent cutoff

#

oh wait

keen beacon
#

it just said the knowledge cutoff was june 2024 👀

sonic tendon
#

maybe something that it wouldn't think it would be a good idea to look up?

#

well, that might work

keen beacon
#

👀 new o3 model is based on the updated 4o

#

Sadiq Khan (Labour) was re‑elected Mayor of London in the election held on 2 May 2024.

• Votes:
– Sadiq Khan: 1,088,225
– Susan Hall (Conservative): 811,518

• Margin of victory:
– By votes: 276,707
– By share of the vote: 11.1 percentage points (Khan 43.8 %, Hall 32.7 %).

Because the 2024 contest was run under first‑past‑the‑post for the first time, those first‑round figures are the final result.

#

lol it got it bang on

neat apex
#

More likely the reverse

keen beacon
#

interesting

sonic tendon
#

wait, when did the updated 4o release?

#

i would

keen beacon
#

i don't think it has tools

sonic tendon
#

've thought that going from 4o->o3 would take a few months

keen beacon
#

when i asked who the US president is

#

it said "As of the most recent information available to me (knowledge cutoff June 2024), the President of the United States is Joseph R. Biden Jr. If you need confirmation for a date after that, please check a reliable, up‑to‑date news source or the official White House website."

keen beacon
#

wild was right about continued pretraining yet again

#

lmao

#

wait

sonic tendon
#

not sure i'd wanna be this guy rn

keen beacon
#

there's a slim chance this may actually be o4 mini?

#

since o4 is the one initially rumoured to be the new base

keen beacon
keen beacon
#

@keen beacon bruh u have access to the new o3 with the new base

#

they may have updated the system prompt in the last week or so

#

because before

#

it said it was oct '23

keen beacon
#

but i guess if it was somewhat recent it makes sense too

balmy mist
#

Yo, I'm seeing a lot of people have the Veo Gen in Google Studio. I don't see it for some reason, and I'm a power user of Studio. I use it like every damn day for hours, and I don't have it 😦

keen beacon
#

or it hasnt rolled out or smthing

balmy mist
#

my mom has it

sonic tendon
#

your mom sounds cool

balmy mist
#

and i just setup studio for her last night

neat apex
#

Maybe the advanced?

balmy mist
#

like basic

#

just opened the app

keen beacon
#

i've got it

balmy mist
#

lmaoo

neat apex
#

Hm

#

Maybe age?

balmy mist
#

aint no age thing

keen beacon
#

lmao

neat apex
#

Hm

balmy mist
#

none of the accounts have it 😦

#

now im team Open Ai again

neat apex
#

Maybe verification like that phone thing?

keen beacon
#

although that tree doesn't make much sense at the end

keen beacon
ocean vortex
#

OpenAI need to finally release that gpt4o... like wtf are they waiting for catgrin

keen beacon
#

wow

#

so only api is paid i guess

#

ai studio is a crazy product

hardy pecan
#

Vertex AI isn't free, just a fyi

keen beacon
#

make sure ur not getting charged lol @keen beacon its hella expensive i think

balmy mist
ocean vortex
balmy mist
#

lol SOTA video and LLM for free

keen beacon
#

i've checked

#

these aren't available yet tho

ocean vortex
keen beacon
#

im really curious whether they removed anon chatbot in the arena because i got it recently and it was matching quasar with the same system prompt

#

i've never seen it lol

sonic tendon
#

i've been wondering about the fidelity with which you could predict where anon models were on the leaderboard if you just triggered a bunch of matchups and saw what they got paired with

#

does that sentence make any sense

keen beacon
#

yeah

#

that's come in handy for me sometimes but i can't do it on a big enough scale to make concrete conclusions

sonic tendon
#

especially since the web arena still leaks model names and providers when you open the site

sonic tendon
keen beacon
#

i might have stored a dump or two

sonic tendon
#

oh, hmm

ocean vortex
#

exp and preview literally just the names slightly different

keen beacon
#

@keen beacon did u notice the o3 model change within the last week or so? or was it the same model with an adjusted sys prompt?

#

based on phases of development we can guess their current timeline

ocean vortex
keen beacon
ocean vortex
#

o3 is only deep research lol

keen beacon
#

it's not through deep research 👍

ocean vortex
keen beacon
#

it is not released no

#

i can't share where

#

sorry

#

its probably very imminent at this point tbh

#

yup

#

im curious how much stronger it is compared to og o3 👀

#

wdym

sage raptor
keen beacon
#

depends on your use case

sage raptor
#

in coding

keen beacon
#

well

#

there are still many specific coding use cases

#

for very reasoning heavy coding tasks, o3 is probably better

#

for most others, 2.5 pro is better

keen beacon
ocean vortex
keen beacon
#

yeah

#

unfortunately

keen beacon
#

again though

#

this is probably o3 medium

#

and the jumps between reasoning efforts are fairly significant

#

so we shall see once we have high

keen beacon
#

if they ran it with as much compute as they did before

#

doesn't seem like o3 pro will launch at the same time as o3 and o4 mini

ocean vortex
#

I'm not a fan of how they are making it look like it's scaling of RL training when they constantly keep low-key updating the base now tbh

keen beacon
#

going off of the recent frontend changes in preparation

keen beacon
#

especially base models

ocean vortex
keen beacon
ocean vortex
#

makes it much harder to isolate things and do direct comparisons

ocean vortex
keen beacon
ocean vortex
#

o3 and o4 is also about to lol

ocean vortex
sage raptor
#

so today o3 will release ?

ocean vortex
#

btw given that 4o-mini stayed the same all this time

#

I do believe mini reasoning variants are distills

keen beacon
#

the interface is ready for it to launch

keen beacon
#

if theyre releasing o3/o4 mini, they will launch quasar/anonymous chatbot too i think

#

anonymous chatbot was replaced quite quickly after their last chatgpt 4o release

#

so they want the lmarena results fast 👀

ocean vortex
keen beacon
#

slightly off topic but what's everyone's bet on how they launch it this time

ocean vortex
#

since full o4 that's unlikely to exist yet

keen beacon
#

surprise live stream announcement, just drop the blog post with no warning

#

etc

#

this time we won't get to see the system card before launch

#

which is a shame

keen beacon
#

i think they'll probably do some hypeposting beforehand

#

as ever

sonic tendon
sage raptor
#

how will the o4 benchmarks look like

keen beacon
#

if they do a livestream

#

i wouldn't be surprised if they previewed o4 at the end

sage raptor
#

yea, they will

sonic tendon
#

yeah

keen beacon
#

/gpt-5

sonic tendon
#

arc-agi?

#

and humanity's last exam

ocean vortex
keen beacon
#

i wanna know the new arc agi results with the new o3 model 👀

sage raptor
#

maybe even o5, who knows

thorny drum
#

will o3 have an api?

keen beacon
#

ofc

#

openai stopped staggering launches starting with o1 iirc

#

when they launch on chatgpt they launch on api

thorny drum
#

well o1 pro wasnt on api for a bit i thought

keen beacon
sonic tendon
#

assuming oAI just launches o3 with no warning, would it just show up on the leaderboard out of the blue once it got enough votes

keen beacon
#

we won't be getting o3 pro straight away

sonic tendon
#

isn't o1 pro just regular o1 doing best of 3

keen beacon
#

then it'll be like

#

~1 week until leaderboard appearance

sonic tendon
#

makes sense

keen beacon
#

i don't think it'll take top spot without stylectrl

#

but with it it might

#

again, depends on what reasoning effort version they use

#

last time they just put medium to start

#

same w mini until they added o3 mini high

sonic tendon
#

would o3-high be prohibitively expensive?

keen beacon
#

openai give them credits lol

#

yes (for normal users)

sonic tendon
#

i'm somewhat unsurprised that openai doesn't care enough about lmsys to sponsor that

keen beacon
#

most labs fund their usage

#

labs + sponsors

sonic tendon
#

didn't think i saw them in the sponsor list

sonic tendon
keen beacon
#

but it makes sense for openai to fund it because of how much they gain from the data

#

(i also can't think of much explanation for why all the direct chat limits for models were removed a couple months ago other than that)

novel flame
balmy mist
#

is there a discord bot that summarizes from the last part of the convo you are in to the current? that would be so clutch man

keen beacon
#

i haven't run into one in months

#

i used to all the time

#

then they just stopped

#

its model specific

sonic tendon
#

i keep being surprised that something as small as lmarena seems to have any impact on these massive companies' decisions

keen beacon
#

some have user limits, global limits per interval, etc

keen beacon
keen beacon
sonic tendon
keen beacon
sonic tendon
#

you might still need a client mod to access them

keen beacon
#

oops

#

wrong paste

#

lmao

sonic tendon
#

lmaoo i was

#

wondering

sonic tendon
#

i feel like i recognize that game

#

*recognized

keen beacon
#

if optimus alpha is openai related, its probably gonna be o4 mini

#

again

#

oh

#

optimus

#

nvm

keen beacon
#

yeah idk what else they could put there

#

they defo wont put o3 its far too expensive

#

i dont think they have an updated 4o mini yet (at least publicly, we only know of an updated 4o), and it doesnt really match up with the name

sonic tendon
#

the soon-to-be 4o/o4 confusion will be fun

keen beacon
keen beacon
#

i dont think they have another one ready that quickly

#

i wouldn't put it past them

#

this next 4o release is gonna be big tho. the whole quasar thing, etc. (no released benchmarks on chatgpt 4o latest with the updated base model, despite massive improvements, etc)

#

they are REALLY milking 4o

#

none of us thought it would still be the default almost 1 year after it launched

novel flame
balmy mist
#

@sonic tendon so there is a summary bot?

#

how do I use it?

keen beacon
#

it is solid yes, but it doesn't really hold up very well these days

balmy mist
#

i cant keep up with these convos

keen beacon
#

and coding is a big weak point given how many people use llms for that specific area

balmy mist
#

thats their bread and butter it seems

#

but it def got better

#

i like it now

keen beacon
#

it's very good for creative tasks and has got noticeably better at most other things but it is still pretty unacceptably poor in code tasks compared to other llms

keen beacon
#

i've tested it

sonic tendon
keen beacon
#

it is an improvement over the old 4o but it's still lagging behind claude 3.7 sonnet

sonic tendon
#

you'll need a discord client mod loader

#

it's a bit involved

novel flame
keen beacon
sage raptor
keen beacon
#

AYEE

#

HERE WE GO

#

"new feature" lol

#

we're getting o3

#

oh

#

wait a minute

#

hes underplaying it but i think its gonna be multiple model releases

novel flame
keen beacon
#

yeah

sonic tendon
keen beacon
#

LOL

#

let me find a screenshot

thorny drum
#

yeah its just tech bro speak for 2+ new models

balmy mist
#

he is so random with his tweets, we really in wild times when ceos use twitter to launch stuff lol

sonic tendon
#

i may know where this is going

balmy mist
#

but im here for it

sonic tendon
keen beacon
brittle tiger
keen beacon
#

we can all agree professionalism sucks right

balmy mist
#

fr

#

its all about efficiency of info at this point

#

no need to fluff it up

sonic tendon
#

idk it's like
almost informal enough to feel like genuine internet speak, but not quite

keen beacon
#

we're still waiting on whether this will be a livestream

#

i think it will be

balmy mist
#

can someone tell him to hurry up

keen beacon
#

it'll be at least a couple more hours lol

balmy mist
#

they always livestream lol, and if sam is not on it then its mid

#

yeah prob 1 pm est

keen beacon
#

i have no doubt

tall summit
keen beacon
#

im guessing the next openai releases are:

  • o3
  • o4 mini
  • quasar (4o api dated version, benchmarks, lmarena [was probably anon chatbot], updated stronger base model that was cont. pretrained to june 2024 from oct 2023)
keen beacon
balmy mist
#

it does not seem like it

keen beacon
#

i dont think so

keen beacon
#

and it might not happen all today

#

it'll be "in the coming weeks"

oblique flint
#

what I dont get is why they're releasing o3. If they have o4 mini, then it's probably distilled from o4 right? So why not release o4

tall summit
keen beacon
#

with the new 4o base model

#

it has knowledge in june 2024 now

sonic tendon
balmy mist
#

what's the point of releasing the new 4o when we already had it for days lol

keen beacon
#

wait

#

i've just had

#

a thought

#

so we all know about optimus alpha

balmy mist
#

like they should just drop it silently lol

keen beacon
#

and if they're likely going to release a 4o update (quasar) alongside o3 and o4 mini today

#

then that makes whatever optimus is all the more interesting

#

i kinda doubt another 4o version

#

but i also doubt o4 mini if they're releasing it today

#

it's weird because

#

the naming scheme + context window with quasar would make you think google (stargazer, moonhowler, nebula...)

#

but all my testing screams openai

keen beacon
#

quasar is definitively 4o imho, but idk what optimus could be

balmy mist
#

so that would also mean that nw is OA?

keen beacon
#

no thats google

keen beacon
#

i do expect new google models possibly today but likely tomorrow

keen beacon
#

they will want to respond and we're still waiting on 2.5 flash nevermind all the anon great coding models they've been testing

balmy mist
#

is anyone gonna livestream the OA livestream here? that might be cute

keen beacon
#

can u do that on a discord server?

keen beacon
#

in a vc

#

stream your screen

sonic tendon
#

yeah

sage raptor
#

what is sama automate

sonic tendon
#

play the yt livestream in a video player

keen beacon
sonic tendon
#

and then stream that window

keen beacon
#

they have no moat lmao

sonic tendon
#

i feel like LLMs are well on their way to becoming commodities at this point

#

let alone random LLM search applications

keen beacon
#

tbh any other company that isnt an ai frontier lab building stuff on ai models will get crushed

#

regularly scheduled programming = resumed

tall summit
sonic tendon
keen beacon
lime coral
#

HE IS BACK

keen beacon
keen beacon
#

it is the prophecy

sonic tendon
tall summit
sonic tendon
#

after r2, i think things will be mostly over

keen beacon
#

the trump admin's WH site really sucks

lime coral
keen beacon
#

gemini can do better

sonic tendon
#

they like

keen beacon
#

the biden site was so much better

#

but imo obama's was the best

sonic tendon
#

played a promo video when you opened the site

torn mantle
keen beacon
#

🙏

torn mantle
#

We know hype sama

keen beacon
sonic tendon
#

that too

torn mantle
#

Always hyping his products

keen beacon
sonic tendon
#

i'd be down to hang out in a vc, just am gonna be at school for the next 4 hours 😭

#

so

#

probably gonna miss the livestream

keen beacon
#

obama you've got 8 more years in you buddy

#

come back

sonic tendon
keen beacon
#

hell yeah

#

this was such a fire lineup

#

if only joe ran in 2016

#

maybe we would've never had the orange

sonic tendon
#

obamna? 🥺

sonic tendon
tall summit
sonic tendon
#

.

torn mantle
keen beacon
#

nope

torn mantle
#

It could be that no?

keen beacon
#

may arrive today

sonic tendon
#

leo stole my line :(

#

/j

torn mantle
#

So thats the feature

keen beacon
keen beacon
tall summit
#

ok what the fuck's a thinking slider

keen beacon
#

the quasar release is very imminent anyway, if its not today

#

nobody would get hyped about that

torn mantle
sonic tendon
torn mantle
#

xd

keen beacon
#

sam would not be doing all that for a damn slider

#

even by his standards that's meh

torn mantle
sonic tendon
keen beacon
novel flame
keen beacon
#

the thinking slider is already out in beta on some of the clients

#

doesn't make sense to be doing all this for it

#

why deploy chatgpt after midnight with anticipated model names tbh if its not coming out soon

#

^

tall summit
keen beacon
#

wrong paste

#

smh

#

so it begins

#

whatever the case the new o3 (on a new gpt 4o base model) is seemingly ready anyway 👀

tall summit
keen beacon
#

i like how xAI took a huge leap off a cliff

sonic tendon
# sonic tendon wdym?

maybe for the $1.2m polymarket listing, which he definitely doesn't have any incentive to participate in

novel flame
# sonic tendon wdym?

Usually insider information - such as hinting at a release - is kept under wraps until an official statement is released, in order to not mess with markets. But I'm genuinely asking because I don’t know how it works

tall summit
keen beacon
#

grok 3 is mid

sonic tendon
#

when it happened on the march market, a couple people made like 50 grand

keen beacon
#

good on 'em

#

it was only really a matter of time

sonic tendon
#

although it could still impact competitors

keen beacon
#

i wonder if oai will ever go public

sonic tendon
#

i feel like that would finally cross the line

keen beacon
#

the old mission has been slowly being undone for ages

#

nail in the coffin

sonic tendon
#

yeah, it'd suck

keen beacon
#

i heard someone talk about this recently and it made me think

#

most labs have a plan/goal and everything for developing AGI and what they would do if they achieved it first, etc

#

and the question was - which lab would you rather have control over the first true AGI

sonic tendon
#

google could possibly be worse

keen beacon
#

yeah i don't know

brittle tiger
keen beacon
#

openai and anthropic are the loudest about what their intentions would be

#

google are very vague

sonic tendon
#

i wish there were a way we could be comfortable about ai safety without making everything proprietary forever

#

but, eh

tall summit
#

i was about to say none of them but after red said that i'd say anthropic because it's the least of many evils

keen beacon
#

none of the labs are great if we're talking ethics but anthropic have the best vibes

#

there's the question of "would you rather it be in the hands of a lab that's too careless or a lab that's too careful"

#

i would say the latter

oblique flint
keen beacon
#

yeah no 😭

#

imagine if it's deepseek

#

that would be quite the predicament

brittle tiger
keen beacon
#

lol yeah

sonic tendon
#

anthropic seems like a bunch of idealistic nerds (but, not led by Sam Altman)

keen beacon
#

if a lab has achieved agi other labs would probably follow short tbh

sage raptor
sonic tendon
#

which seems like the best-case scenario

keen beacon
#

it would def be a domino effect kinda thing yeah

keen beacon
sonic tendon
keen beacon
#

schizo

sonic tendon
tall summit
sonic tendon
#

i haven't heard of that name in a while

keen beacon
#

imagine if meta got there first 💀

#

no way lol

tall summit
keen beacon
#

i would definitely hope not

#

i dont even see any chance of anthropic getting there first

sonic tendon
keen beacon
#

GDM vs OAI

#

anthropic will get there soon after one of those does

sonic tendon
keen beacon
#

but they aren't daring enough

oblique flint
# sonic tendon why mistral?

well it's eu based (am eu citizen lol), committed to open source and doesnt seem as sketchy as some bigger companies (meta, google etc)

brittle tiger
sonic tendon
keen beacon
#

they'll have their spies running back to the china hq once oai crack it wink wink

sonic tendon
#

even if they're operating under a system that's going to pressure them into researching military use

#

they clearly put the minimum possible effort into the censors, for example

balmy mist
#

what would agent gpt do?

#

like agentspace

keen beacon
#

ew not that grifter

balmy mist
#

lmaoo

keen beacon
#

i hate iruletheworld

#

😭

brittle tiger
balmy mist
#

the votes are real tho

#

idc about what he posts, but the reactions show what people are thinking

keen beacon
#

beyond oai/deepmind, the chinese frontier labs are probably after them tbh. skill diff and ingenuity tbh. unpopular opinion probably but i dont see anthropic getting there faster than chinese frontier labs (deepseek and qwen)

keen beacon
#

i respect the dedication

keen beacon
#

the last two can probably be swapped at will

sonic tendon
#

wait, what's gdm?

keen beacon
sonic tendon
#

oh google

keen beacon
#

anthropic has done zero public image gen work i think

sonic tendon
#

very little real commercial application

keen beacon
#

yeah anthropic are very focused on just llms really

sonic tendon
#

and imo it's a bad look

keen beacon
#

openai are the most innovative lab imo

#

they are normally the ones "to follow"

#

multimodal cot will be important

#

which might include native image gen

#

(then they normally get beat at their own game... but hey, that's the spirit)

sonic tendon
#

i sort of dislike the oAI leadership

keen beacon
#

openai's corporate structure is ridiculous

#

it's designed to be convoluted

oblique flint
#

I dont get why people are hyping up google so much tbh. Do you really want a future where google controls AI as well in addition to already pretty much controlling the internet?

keen beacon
#

their board also has some big conflicts of interest

sonic tendon
keen beacon
#

2.5 pro has a jan 2025 cut off. (they cont. pretrained 2.0 pro/etc) this means the 2.5 pro timeline is absolutely absurd lol (1-2 months)

thorny drum
#

what market is he manipulating

oblique flint
keen beacon
#

i doubt any lab will be able to form a monopoly rn

torn mantle
#

So the new feature is either memory improvements or that thinking slider

sonic tendon
#

i will not pay google a dollar for api credits

sonic tendon
sonic tendon
#

no moats last for long

sonic tendon
tall summit
brittle tiger
sonic tendon
keen beacon
#

its not just copying. companies are neck in neck in research i think

sonic tendon
#

there is little commercial application

leaden palm
#

what's launching today?
so far i know

  • qwen 3
  • something from openai (optimus on OR and/or o3/o4 models)
  • maybe a google model
keen beacon
#

qwen3 isnt happening today

sonic tendon
leaden palm
#

hm

oblique flint
#

2.5 flash just didnt launch yesterday for whatever reason

keen beacon
#

it is models

#

i am telling you mow

#

now

#

he just worded it weirdly

leaden palm
keen beacon
#

memory improvements would be incredibly silly to hype up when they're having their lunch eaten

brittle tiger
#

it's for sure models. metadata and amount of hype in tweet is basically confirmation

sonic tendon
#

metadata? but yeah agreed

keen beacon
#

i think that bindureddy lady is off her rocker tbh

#

if Sama can't sleep because he's so excited over memory improvements he must be more autistic than i thought

#

lol

brittle tiger
keen beacon
#

it would make aense if o3 is based on it

#

sense

sonic tendon
keen beacon
#

sense

sonic tendon
#

i feel like a lot more CEOs are on the bipolar spectrum than people realize lol

keen beacon
#

most tech ceos are probably on some kind of spectrum haha

sonic tendon
#

yeah autism and adhd too

#

they have such a huge adaptive advantage in tech jobs, it's crazy

fleet lintel
keen beacon
#

i like my guys a little nerdy and/or autistic and/or with adhd

#

:3

sonic tendon
#

same

#

i do definitively have adhd

keen beacon
#

they definitely could release it tomorrow via blog post so that the stream isnt too intense

brittle tiger
tall summit
leaden palm
#

optimus alpha is now on openrouter, a reasoner most likely from openai

keen beacon
#

woah

#

its o4 mini ✅

tall summit
#

optimus 💪

keen beacon
#

just guessing lol

leaden palm
#

NVM

#

NVM

balmy mist
#

how many deep research with gemini do you get a day?

leaden palm
#

i dont think it reasons

tall summit
leaden palm
#

it just had a high latency

sonic tendon
leaden palm
#

so i thought it reasoned

#

just another gpt 4o variant

tall summit
#

LMAO

leaden palm
#

sorry guys

keen beacon
#

no it's not a reasoner

oblique flint
keen beacon
#

got my first question wrong that chatgpt 4o latest gets right

#

damn this sucks

#

☠️

keen beacon
#

yup its another gpt 4o with the new base model (june 2024)

keen beacon
#

i cant believe they dropped another version

sonic tendon
#

oh right that one

tall summit
sonic tendon
#

i'm gonna do the jar test

keen beacon
#

haven't tested it with code yet

balmy mist
#

o4 mini high should be the best all-around model if history is any guide, but it has to be way cheaper and faster than 2.5, bc it's not gonna be on the same IQ level obviously. if it's only slightly worse than like 3.7 but fast af and cheaper than 2.5, then we have a winner

keen beacon
#

perhaps that's where it improves

sonic tendon
#

?

fleet lintel
sonic tendon
fleet lintel
#

i have a feeling that it is probably comparable to 2.5 but 1M context

tall summit
keen beacon
sonic tendon
keen beacon
#

im gonna run gpqa diamond and math 500 on it

#

but im pretty sure its another gpt 4o lol

brittle tiger
#

the $1200 per question o3 high model would be really cool to have public for big research problems

fleet lintel
brittle tiger
sonic tendon
keen beacon
# sonic tendon yeah

was that the one where it got it wrong first but adding "think carefully" it got it right

sonic tendon
#

never tried that, actually

brittle tiger
sonic tendon
keen beacon
# sonic tendon https://discord.com/channels/1340554757349179412/1343302058929033216/13587819770...

One workable way is to hide the information from yourself before you ever have a chance to see it.

  1. Prepare the record.
    • Tear the sheet of paper into two slips of exactly the same size.
    • On one slip write “GREEN”, on the other write “RED”.
    • Fold each slip so the writing is completely hidden and the two folded slips look and feel identical.

  2. Load the jars.
    • Put the folded “GREEN” slip into the green jar.
    • Put the folded “RED” slip into the red jar.

  3. Remove your ability to distinguish the jars.
    • Close your eyes (or blind‑fold yourself).
    • With your eyes shut, move the two jars around on the table for a few seconds until you have genuinely lost track of which is which.

  4. Make the selection blindly.
    • Still blindfolded, pick up either jar at random.
    • Take the ball and the folded slip out of that jar and put both straight into your pocket without opening the slip.
    • Return the now‑empty jar to the table and, still blindfolded, shuffle the two jars again for a few seconds.

  5. Open your eyes.
    • You now see two jars, one green and one red, but because of the blind shuffling you no longer have any idea which was the one you emptied.

Result:
• The ball is in your pocket.
• The folded slip in your pocket accurately states the colour of the jar it came from.
• You yourself never learned (and still don’t know) which jar you chose.

Anyone who later opens the folded note will learn the correct answer, but you remain ignorant, satisfying all the conditions.

#

o3

#

lmk how it did

thorny drum
#

hmm do people have o3 full yet?

calm sequoia
sonic tendon
#

seems like both o3 and gp2.5 get it, just intermittently

keen beacon
#

i presume o3 mini and o1 did not?

#

for that guy who said he was gonna look into ai and roblox

#
#

forgot what he was gonna do with it

#

look at this

torn mantle
#

whats this

sonic tendon
#

leo tested it one other time and it got a plausible-sounding but slightly incorrect answer

keen beacon
#

wait this might be a new gpt 4o mini

#

optimus alpha

#

it's just 49

#

4o

#

another update

#

gpqa diamond: 60.10% (maybe there might be something wrong with my evaluation framework/answer parsing)

sonic tendon
#

i think it's the only model from before 2025 that got it right

leaden palm
keen beacon
#

it dropped 7 points

keen beacon
keen beacon
#

wrong*

sonic tendon
brittle tiger
sonic tendon
torn mantle
#

yea this seems like openai model

keen beacon
#

im running math 500 rn, ill review gpqa diamond samples a little after this

keen beacon
#

if it is 4o mini 60% gpqa diamond is impressive, otherwise something is borked (massive degradation)

#

it's actually wack

sonic tendon
keen beacon
#

even sonnet

sonic tendon
#

o1 does surprisingly bad, also

keen beacon
#

2.5 pro base model diff

#

it just seems to have next level spatial understanding

sonic tendon
#

are we looking at the same bench?

balmy mist
sonic tendon
#

not a ton of votes tho

keen beacon
#

start rating

#

you'll see what i mean

sonic tendon
#

ah, ok

#

you can click on the models in the leaderboards to see their builds

#

you could try o3-medium w/ some of these, although setting it up for self-hosting seems like it wouldn't really be worth the effort

sonic tendon
#

goddamn

keen beacon
#

hmm optimus could be an updated 4o mini

#

im still running math 500 :\

sonic tendon
#

i wanna see the "abstract mathematical concept" build, sonnet's impression of that was really cool

night trout
#

Gemini 2.5 Pro kills Sonnet 3.7 99% of the time. It's always like this.

sonic tendon
sonic tendon
#

not publicly available yet

night trout
#

Yeah McBench usually 2.5 is a vast improvement over 3.7 too.

#

There are weird edge cases of course.

sonic tendon
#

sonnet has some good builds tho

calm sequoia
#

As I understand it, optimus is a new open source model from OpenAI

sonic tendon
#

it's still gonna be a bit

keen beacon
#

i doubt that the oss model is trained yet

brittle tiger
keen beacon
#

okay yeah optimus is worse than quasar

#

its done worse on basically everything ive thrown at it

sonic tendon
#

kk, i gtg

keen beacon
#

gpqa diamond: 60%
math 500: 89%

keen beacon
#

i wrote an eval framework myself lol

#

its part of the work im doing

#

quasar alpha:
gpqa diamond: 67.42%
math 500: 90%

optimus alpha:
gpqa diamond: 60%
math 500: 89%

march chatgpt 4o (measured by artificial analysis):
gpqa diamond: 65.5%
math 500: 89.3%

#

i only used 1 sample for optimus alpha (pass@1 estimated with 1 sample) tho; for quasar alpha, pass@1 was estimated with 4 samples on gpqa diamond and 1 sample on math 500
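
For reference, here is a minimal sketch of how pass@1 can be estimated from k samples per question. It assumes each sample is graded as simply correct or incorrect. The per-question estimate uses the unbiased pass@k formula from Chen et al. (2021), which for k = 1 reduces to the fraction of correct samples. The function names and the toy data are hypothetical and are not taken from the eval framework mentioned above. The standard-error helper is only a rough binomial approximation, included to show how noisy a single-sample run can be.

```python
from math import comb, sqrt
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one question (Chen et al., 2021): n samples, c correct."""
    if n - c < k:
        return 1.0  # every size-k subset of the n samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Benchmark pass@1: mean over questions of the per-question estimate.

    With k = 1 the per-question estimate reduces to c / n, so using 4 samples
    simply averages 4 graded attempts per question instead of 1."""
    return sum(pass_at_k(len(r), sum(r), 1) for r in results) / len(results)


def standard_error(p: float, num_questions: int) -> float:
    """Rough binomial standard error of a single-sample pass@1 over num_questions."""
    return sqrt(p * (1 - p) / num_questions)


if __name__ == "__main__":
    # Hypothetical toy run: 3 questions, 4 graded samples each.
    toy = [
        [True, True, False, True],
        [False, False, False, False],
        [True, True, True, True],
    ]
    print(pass_at_1(toy))  # (0.75 + 0.0 + 1.0) / 3 ≈ 0.583
    # GPQA Diamond has 198 questions; one sample at ~60% gives an SE of about 0.035.
    print(standard_error(0.60, 198))
```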

keen beacon
#

optimus must be mini

keen beacon
#

o4 mini is gonna be awesome with a much stronger mini base model

balmy mist
#

i just hope o4 mini is cheap like o3 mini

balmy mist
#

is optimus better than quasar?

leaden palm
#

why do you guys think quasar/optimus are mini models

#

just because they're fast?

#

plain 4o is fast too

keen beacon
#

4o mini just got continued pretraining with the new cut off too 👀

#

o4 mini is GONNA GO HARD lol

balmy mist
#

wow optimus is fast af

#

wow

keen beacon
#

Holy hell is that confusing

#

4o mini
o4 mini

keen fulcrum
#

Hi
Nightwhisper released?

tall summit
#

i've forgotten the difference at this point
why are there both
please more normal names

keen beacon
#

we're all still waiting lol

novel flame
# keen beacon imagine if *meta* got there first 💀

Honestly I think this is the most likely scenario since LeCun is right about AGI and most other labs are too focused on scaling up autoregressive token yappers to truly deliver AGI. And you know what, despite all the shady things Meta have done, they are the reason we have a flourishing open source landscape in AI. That’s not nothing.

ember rapids
#

o4 mini high is about 3000 elo on codeforces

drifting thorn
#

3000 elo???

lime coral
#

Reminder that this means nothing in real world use case

keen beacon
#

good things come in small packages (does not apply to me btw)

sage raptor
#

is this true ??

keen beacon
#

block the guy if u cant see hes joking

keen beacon
#

man this 4o mini and o4 mini release is gonna be extremely exciting

keen fulcrum
#

R2 is expected to come out with Qwen 3

keen beacon
#

3000 would make it the 67th best competitive coder in the world

#

hmmm

#

yea

#

u can see in simpleqa

#

with o1 mini simpleqa regressed but they figured it out in o3

#

its obviously not 4o based because 4o has more than double the simpleqa of o3 mini

#

it has the newest reasoning research in it

#

well up to now 😄

#

o4 mini is gonna be crazy

#

with the new much stronger 4o mini base model 😄

keen beacon
#

its 1m context

#

didnt u see?

#

im not sure about o4 mini though

drifting thorn
#

Hope 1m context will be the new standard for LLMs

#

And I think they should train LLMs for MRCR like tasks

keen beacon
#

none when o4 mini gets released 😄

brittle tiger
#

o1 better at harder coding and math

keen beacon
#

no o1 isnt always better than o3 mini at math

#

o3 mini high is usually better

keen fulcrum
#

Optimus Alpha worth trying? How good is it?
Comparable with 3.7 Sonnet?

drifting thorn
#

Above said it’s worse than Quasar Alpha

brittle tiger
#

i could be wrong but that was the sense i got from following top mathematicians who use llms

drifting thorn
#

And I thought Quasar Alpha is trashy

keen beacon
#

o3 mini is generally better

brittle tiger
#

USAMO is best bench there though bc others are judged by llms

keen fulcrum
#

(they use wolfram alpha under the hood)

ember rapids
#

most likely referring to o4 mini

keen beacon
#

its answer match i think

#

beyond usamo

#

yea

#

it checks whether the answer matches
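
For contrast with proof-graded contests like USAMO, here is a minimal sketch of answer-match grading. It pulls the final `\boxed{...}` answer out of a response and compares it to the reference after light normalization. This assumes the model was prompted to box its final answer, a common convention for MATH-style benchmarks. The function names and normalization rules are hypothetical. Real graders usually go further, for example by checking symbolic equivalence, which is why `0.5` vs `\frac{1}{2}` fails here.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)  # matching close brace of \boxed{
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(ans: str) -> str:
    """Very light normalization: drop all whitespace, surrounding $ and a trailing period."""
    ans = "".join(ans.split())
    return ans.strip("$").rstrip(".")


def answer_matches(response: str, reference: str) -> bool:
    """Grade only the final boxed answer; the reasoning that led there is ignored."""
    pred = extract_boxed(response)
    return pred is not None and normalize(pred) == normalize(reference)


if __name__ == "__main__":
    print(answer_matches("so the answer is \\boxed{\\frac{1}{2}}.", "\\frac{1}{2}"))
    print(answer_matches("\\boxed{0.5}", "\\frac{1}{2}"))  # no equivalence check
```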

brittle tiger
#

i mean USAMO grades on how you reached solution with human reviewers and others dont. if im remembering right

brittle tiger
#

agreed but some people dont

drifting thorn
#

Do you guys think LLMs in the future will be specialised again, like Anthropic for tool calls and coding, OpenAI for Maths and photo generation, and Gemini for long context and multimodal conversations?

drifting thorn
#

WTF?

drifting thorn
#

GPT 4.1 after GPT 4.5?

sage raptor
#

lol

drifting thorn
#

It is dumb

#

Just call it 4.7

keen beacon
#

so this is gpt 4.1o and gpt 4.1 mini

#

wow

#

terrible names

#

oh for god sake

#

GPT-4.1, which one source describes as a revamped version of OpenAI’s GPT-4o multimodal model.
✅ i was right

keen beacon
#

"OpenAI is also readying the full version of its o3 reasoning model and an o4 mini version that could debut even sooner. AI engineer Tibor Blaho discovered references to o4 mini, o4 mini high, and o3 in a new ChatGPT web version earlier today, suggesting these additions are imminent. I understand o3 and o4 mini are both set to debut next week, unless OpenAI moves the launch plans around."

keen beacon
#

"OpenAI CEO Sam Altman teased on X that OpenAI would be launching an exciting feature today, but it’s not clear if this is related to the o3 and o4 mini references in ChatGPT or not. Sources caution that OpenAI has delayed the introduction of some new models recently due to capacity issues, so it’s possible for the new GPT-4.1 model introduction to slip beyond a planned debut next week. I asked OpenAI to comment on this story, but the company didn’t respond in time for publication."

#

well

#

so it's still up in the air what we're getting today

keen beacon
#

but it will likely be a model

drifting thorn
#

Go buy TPU from Google

north vale
#

just way better trained

#

quasar alpha is probably new 4.1 mini
optimus alpha is probably new 4.1 nano
and there's some 4.1 they haven't tested

keen beacon
#

jeez this naming is so bad

north vale
#

ikr

keen beacon
#

it's like they have a meeting when launching a new model and they agree on what the worst one is and that's the one they use

north vale
#

but i mean

keen beacon
#

they should just scrap it all and restart

north vale
#

i think adding numbers after the version is like the right way to do it

keen beacon
#

make it make logical sense instead

north vale
#

gpt4, gpt4.1, gpt4.2, gpt4.3, wtv

keen beacon
#

bro the model picker is going to get even worse

keen beacon
#

the average user is going to click that and have a seizure

drifting thorn
#

Yes, it is a poor naming

jaunty kraken
#

Gotta screenshot that 404 screen and say we got early access

drifting thorn
#

Lmao

balmy mist
#

i am so confused man

#

forget these names

#

just gimmie new model

north vale
#

it can be better trained, it was probably too large relative to the current optimal pretraining scaling laws, etc.
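
As a rough illustration of what "too large relative to the optimal pretraining scaling laws" means, here is a sketch using the widely cited Chinchilla rule of thumb (Hoffmann et al., 2022). Under that rule, training compute is C ≈ 6·N·D FLOPs for N parameters and D tokens, and compute-optimal training sits at roughly D ≈ 20·N. Both constants are approximations. The budget in the example is arbitrary, and nothing here reflects GPT-4.5's actual size or training data, which are not public.

```python
from math import sqrt

TOKENS_PER_PARAM = 20.0      # Chinchilla-style rule of thumb (approximation)
FLOPS_PER_PARAM_TOKEN = 6.0  # C ≈ 6 * N * D for a dense transformer


def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Split a compute budget into (params, tokens) under C = 6*N*D and D = 20*N.

    Substituting D = 20*N gives C = 120 * N**2, so N = sqrt(C / 120)."""
    n = sqrt(c_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n


def tokens_per_param(c_flops: float, n_params: float) -> float:
    """For a fixed budget, a larger model gets fewer training tokens per parameter."""
    d = c_flops / (FLOPS_PER_PARAM_TOKEN * n_params)
    return d / n_params


if __name__ == "__main__":
    budget = 1.2e24  # arbitrary example budget in FLOPs
    print(compute_optimal(budget))          # ≈ (1e11 params, 2e12 tokens)
    print(tokens_per_param(budget, 4e11))   # a 4x larger model: ≈ 1.25 tokens/param
```

In this toy setup, quadrupling the model at a fixed budget drops it from about 20 to about 1.25 tokens per parameter. That is the sense in which a model can be "too large" for the compute spent on it.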

#

doesn't really make sense for 4.5 to exist

keen beacon
#

u have to remain agile especially now

drifting thorn
#

I don't think Gemini 2.5 Pro is a smaller model than DeepSeek R1

north vale
#

like it has some utility but limited

keen beacon
#

if the anon openrouter models are 4.1 i'm automatically disappointed lol

keen beacon
#

they aren't big enough of an improvement to justify a new model version num jump

north vale
#

2.5 pro probably is decently large, has pretty similar price per token as grok 3 surprisingly

keen beacon
#

a new version of it

#

gpt-4.1o-latest-preview

#

openai style

north vale
#

optimus matches 4.1 nano imo

alpine coral
#

here's those results for private along with some other oai models (same quiz, single pass)

keen beacon
#

calling it 4.1 is also funnier because 4.5 already exists

drifting thorn
#

I don't think there’s a use for “nano” when it’s not an open-sourced model

drifting thorn
#

At least it is suitable for research to test new algorithms for small models

keen beacon
#

and also

Among the new AI models will be a release of what I’m expecting will be branded GPT-4.1, which one source describes as a revamped version of OpenAI’s GPT-4o multimodal model.

this is big boy gpt 4.1 (quasar/gpt 4o provenance), gpt 4.1 mini is optimus, nano we have yet to test

north vale
#

yeah ig "it's 4.1" feels like it's talking about the full model not the smaller versions

north vale
#

my prior is 4.1 full would be better than 4o for sure

keen beacon
#

wtf happened between now and when i last checked this

#

it has completely tanked

drifting thorn
#

On what criteria do you guys think o4 mini will take the lead over Gemini 2.5 Pro?

keen beacon
#

also worth looking at what he replied to

drifting thorn
#

Gotta sleep

keen beacon
#

well

#

that's a shame

drifting thorn
#

Good night from UTC +08:00

north vale
#

have ppl thought quasar alpha is better than current 4o?

#

why clown

leaden palm
#

what did @misty vault mean by this

keen beacon
#

bros just a hater

balmy mist
#

if it is then gg openai

keen beacon
#

"memory improvements"

balmy mist
#

cant be playing when google releasing heat

keen beacon
#

💀

sonic tendon
#

why not just start another line of models and call it Gemini 1 or g1 for short

#

whoops

#

20 minute message delay

keen beacon
#

lmfao

leaden palm
#

10am livestream again?

keen beacon
#

probably

torn mantle
#

oai still have a long way to go to make a good coding model

#

im not talking about a reasoning coding model

#

but a model that is practically useful

#

all these benchmarks are impressive but it doesnt reflect real world case scenarios

brittle tiger
#

2.5 long context is legit. I can summarize 900k token video files reliably when 1.5 pro and 2.0 flash didn't really come close

torn mantle
#

codeforces is a good benchmark for reasoning but the majority arent always asking the model for algorithmic use cases

#

if the model is good at math then you would expect it to do well on the codeforces benchmark

#

as it's all just implementations of mathematical algorithms

#

i still think openai didnt do enough RLHF on coding aesthetics/styling like anthropic and google

#

and i still dont understand whats the issue/constraint here

#

deepseek probably used that on their latest update

brittle tiger
#

hour long videos. not transcripts i need info on the screen from tickers and charts and stuff

balmy mist
#

so when is this update coming?

#

also is google releasing anything today?

calm sequoia
#

What could be your use case? So interesting! Are current models even capable of taking movie long videos as an input? I would have guessed no

keen beacon
#

pst/pt i think

#

i guess no livestream today

#

it's literally just

#

memory of your past convos

#

it's so over

#

Lmao

torn mantle
#

im always right

#

i mean not always but like 99.9%

leaden palm
#

🗿

sage raptor
#

disappointing

brittle tiger
#

at least all the clueless ai hype accounts on twitter tweeting stuff like "o4-mini today 👀 " look dumb

keen beacon
#

@keen beacon btw could it be o4 mini instead thats the private model?

#

it has a new base model but now we know of a mini version it could be that

torn mantle
#

the only companies that are in the right path rn for AGI are google/anthropic/ deepseek

#

they are all innovating and trying to understand how models behave internally

#

google is showing us that we still havent hit a wall

fleet lintel
#

this constant overhype by OAI is getting on my nerves

torn mantle
#

deepseek is innovating in hw & sw

#

anthropic are trying to understand the model black box

leaden palm
#

look at this ratio

keen beacon
#

only available for Pro users starting today. "Soon" for Plus, nothing announced for Free. Not available in the EEA, UK, Switzerland, Norway and Liechtenstein

#

the new memory

#

lmfao

brittle tiger
#

it's not cool but memory is really important to openai and altman. the more lock-in they can get the less likely it is people will leave for smarter models. it's real dynamic rn. i know some ppl who don't care about 2.5. they just want their chatgpt memories. that strat works really well for keeping normies on board as long as gulf between models doesn't get too big.

keen beacon
#

i for one absolutely do not want chatgpt to remember the things i tell it 😇

tall summit
#

what

#

what happened!!

fleet lintel
#

By the way, I do think that getting it done correctly is a good feature but they should announce it properly

tall summit
#

Starting today, memory in ChatGPT can now reference all of your past chats to provide more personalized responses, drawing on your preferences and interests to make it even more helpful for writing, getting advice, learning, and beyond.

#

thats the most boring feature ever

oblique flint
#

o3 mini was a coding model wasnt it?