#general

1 messages · Page 24 of 1

keen beacon
#

what would you name this new retrained o3

#

o4 then benchmark it with the juice u used for o3... i guess?

ocean vortex
#

be explicit that it's not just "o3". I think that's fairly obvious, no?

zinc ore
#

They can't do that again because once they actually release people will test it and get vastly different results. People already picked up on what they did, so they risk harming their brand continuing to do so

#

So caught up on trying to generate hype compared to their competition that they just make highly misleading test results

#

But anyone perceptive can pick up on

ocean vortex
#

that was never gonna see the day of light as just "o3" catgrin

keen beacon
#

the o3 in december is much less stronger than the retrained o3 with the new 4.1 base we have now, at least served with the same compute. so it does mean something

ocean vortex
keen beacon
#

still misleading since o3 pro is not gonna be served with that level of compute

#

it depends on the benchmarks though, if the o3 pro product they're gonna release performs impressively/or close compared to the initial o3 results, they could name it like that

ocean vortex
keen beacon
#

its misleading but i dont think its intentionally malicious

keen beacon
ocean vortex
keen beacon
# ocean vortex then

arent they using cons@64 or smthing like that, the grey area which i assume the score is for?

keen beacon
ocean vortex
#

iirc it was pass@1 for o3 supposedly

keen beacon
#

whats the grey area then

ocean vortex
#

and cons@64 only for o1

ocean vortex
keen beacon
#

even without the juicing it couldve just been doing even more massive and unreasonable chain of thought by default compared to the optimized new o3, to reach those scores

fringe carbon
tall summit
#

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers.

ocean vortex
#

just like they are listing tools now, this isn't all that different

keen beacon
#

they changed it now to o3-preview at least

ocean vortex
keen beacon
#

they knew they were gonna retrain o3 at the time lol?

ocean vortex
keen beacon
#

but its not even o3 pro

ocean vortex
#

I'm just calling it as such as it's the closest match to what it was

keen beacon
#

if they released that version of o3 without the juice and changed the reasoning effort levels would you be ok with it?

ocean vortex
keen beacon
#

it isnt a mistake

#

bro?

ocean vortex
#

well then why are you asking if they should go back and edit it lol

keen beacon
#

it was intentional at the time but they didnt know the future and committed to it. it is now misleading in the present

ocean vortex
#

they did it deliberately, what's there to edit? 👀

thorny drum
#

its pretty lame but on the tier list of gaming benchmarks its only in the middle

#

i think anyone who cares a ton about benchmarks probably watched the stream and knew they used ungodly compute on them

keen beacon
thorny drum
#

i personally think its cool to hear about these companies super powerful internal models

#

and i wasnt expecting an o3 that was $1/token

ocean vortex
keen beacon
tall summit
#

guys

#

it doesnt matter

#

just wanted to tell you

raven void
#

new o3 is pretty good, they've reduced the size while keeping similar performance to original o3

keen beacon
tall summit
#

pretty sure the consensus is o4 mini is better not "often" or even "sometimes" but more than just occasionally

tall summit
ocean vortex
keen beacon
#

theres nothing wrong with it lol

tall summit
zinc ore
#

Pro is just longer think for base o3 (from tech crunch)

keen beacon
ocean vortex
#

AlphaGeometry

#

and the fact that they didn't even have new base (4.1) back then makes it worse, not better tbh
At least now with an improved model it is not 1 million miles away...

#

it's still an interesting good model, but yeah... the direction openai is heading and what have they became lately comparing to like 1.5 years ago is something I really do not like

#

don't get me started on the pricing too... they finally reduced the price but only for the new model, o1 is amazingly now more expensive than o3 catgrin

ember rapids
#

We need Claude 4

#

That’s what I’m most hyped for

ocean vortex
#

with the releases of both 4-turbo and gpt4o, their pricing was essentially unmatched (good)

#

then o series started and all hell broke loose with them going to town on pricing lol

thorny drum
#

i feel like these new models r pretty good

#

whats wrong with them

ocean vortex
#

yeah it was a very novel idea. Preview had some bugs and was rough with later o1 being much better imo but this idea enabled good progress for sure

ember rapids
#

The most impressive aspect of o3 for me so far is its ability to generate good ideas

hardy pecan
#

Is o3, medium or high in the lmarena?

glass arch
dapper storm
#

Yea but what level of thinking is that

glass arch
#

we have no freaking clue I guess

keen beacon
#

i like how o3 talks

#

it's got more of a personality than o1

ocean vortex
glass arch
#

on chatgpt.com, it actually thinks with your name for some reason

ocean vortex
glass arch
#

here is what it does for me

elder rapids
glass arch
#

it also writes with a lot more emojis for some reason?

elder rapids
#

it still doesn't shine like 2.5 pro though tbh

glass arch
#

I think openai's models are the friendliest

#

gemini is quite annoying to talk to

ocean vortex
elder rapids
#

in my experience, the Gemini family is more of a customizable, but baseline neutral family

ocean vortex
elder rapids
#

I don't think it's a flaw lol

#

I think that's the intention

glass arch
glass arch
#

the openai models are better at social norms though

ocean vortex
#

"you are right to point that out!" - I hate these from gemini lmao

elder rapids
#

when I give it an example

#

and the example is using "you"

#

it thinks it needs to address it as the AI itself

#

rather than assuming it's a general example

keen beacon
ocean vortex
glass arch
#

if there was an AI model that I would trust to not destroy us if it got AGI, it would be chatgpt (any one of them tbh)

elder rapids
#

i wouldn't say Claude

#

it has a very good feel, but it's almost superficial

#

so yeah chatgpt prob

ocean vortex
elder rapids
#

to me Gemini has more of that meta intelligence, and that's what's deciding it's morality

#

rather than having instructed morals

glass arch
#

claude is actually very good at language (by this, I mean it can speak toki pona the best)

elder rapids
glass arch
#

I did a few tests with the arena, and claude comes out on top for all toki pona conversations

elder rapids
#

very contained

elder rapids
#

3.5 sonnet has always been clearly the best

#

but that's why I like 2.5 pro so much, because it does the same thing when you really want it to

glass arch
#

I want to run a test playing a game called "keep talking and nobody explodes"

#

I did it a few days ago with o3-mini-high

#

which of the new models should I pick?

elder rapids
#

keep talking and nobody explodes?

glass arch
#

yes

#

the one with the bomb

barren prairie
ocean vortex
elder rapids
#

I'm gonna explode

ocean vortex
#

💀

glass arch
elder rapids
#

damn if y'all feel this way about Gemini

#

you guys gotta know the strats

#

theres some insane keywords that these models react to

glass arch
#

which

#

I gotta try now

elder rapids
#

Gemini and Claude

#

with "cut and cleanly"

#

or 4o, with "pragmatic"

glass arch
#

openai is reaally good at naming huh

#

4o

#

o4

elder rapids
#

feels like they're doing this on purpose

glass arch
#

what does the "o" in 4o mean again?

elder rapids
#

if they release o5 after releasing 5o

elder rapids
#

ngl when the first 4o was "anonymous-chatbot" it felt so fresh

#

like a clear step above the other models

glass arch
#

it was truly the best

keen beacon
#

anonymous chatbot is chatgpt 4o latest. are you talking about the i am good gpt2 thing or the first chatgpt 4o latest variant

glass arch
#

now I have to switch off of 4o because it's not even worth giving orders to

elder rapids
keen beacon
#

oh

elder rapids
#

I am good gpt2 was another pre release variant

#

of 4o

keen beacon
#

yea ik

elder rapids
#

and then "birdie" was athene 70b

#

which was lowk a secret

#

y'all didn't know about that one

#

the best model at that time behind 4o

glass arch
#

what is your guys' go-to prompt for these new models?

#

I ask for pacman in pygame

#

and o4 did really good at this

elder rapids
#

I usually give random puzzles

#

and then ask older models to solve the same ones

#

o4 mini is good at puzzles

#

o3 is horrible

#

o3 is super smart outside of these rigor subjects tho

glass arch
#

yeah for my use cases, these models are very optimized

ornate stump
#

Does anyone know if they’re planning to improve the voice mode? It’s still the least utilized and developed feature considering its potential.

glass arch
#

also, I don't think an AI is going to be fooling a human any time soon

elder rapids
#

they're not gonna do that much

#

sesame is cool and all

glass arch
#

gemini is CRAZY at video

elder rapids
#

but the distribution is bad

#

since it's not a major distributor like Google

elder rapids
glass arch
#

I streamed me and the boys playing the binding of isaac, and it was able to identify stuff

elder rapids
#

it's in its own league as far as understanding videos

glass arch
ornate stump
# elder rapids sesame is cool and all

Yeah, but I mean, I don't need the AI to flirt with me or be that assertive. But the voice mode in Gemini or OpenAI is stupid you not only can't have a conversation with it, but you also can't use it under any circumstances.

glass arch
#

it likes to end its turn too quickly though

#

well, not gemini

glass arch
#

gemini talks way too freaking much

elder rapids
#

since when can grok understand videos

glass arch
elder rapids
#

and transcriptions

#

rather than watching the video

glass arch
#

oh yeah probably

elder rapids
#

since in that case, it doesn't do very well

glass arch
#

I don't know what grok does

elder rapids
#

I've seen it used on memes and stuff

#

video memes

ornate stump
elder rapids
#

and it doesn't do very well understand what it even is

glass arch
elder rapids
#

it may be able to

#

before no, now I think so

keen beacon
#

hmm sometimes 2.5 pro stops thinking on aistudio, especially with extremely long chats, not sure if its just a visual thing (it starts streaming with the same delay as if it were to start streaming a thought process), but if i prompt it to keep thinking itll usually pop it back up 🤣

elder rapids
elder rapids
ornate stump
elder rapids
#

oh ye this happens to me

#

same with it going outside of its thinking box

#

ye, I think it's a visual bug

ornate stump
elder rapids
#

based on what they're doing right now

#

since they're focusing on multimodality

real totem
#

Bro

#

I just came back

#

And I see this

#

o4 mini and o3 and 4.1

elder rapids
#

the streaming delay?

real totem
#

Ain’t no way all of these released so fast

elder rapids
#

I'm confused

real totem
#

Which is the best

#

One

elder rapids
#

o3 for general tasks, o4 mini for coding tasks/puzzles, 4.1 is API only

real totem
#

Are they better

#

Than gemini 2.5 ro

#

Pro

elder rapids
#

and its good at small coding tasks

ornate stump
real totem
#

Google’s next release

#

Is finna be crazy

real totem
#

For me

#

Not by a lot tho so ye

elder rapids
# real totem Than gemini 2.5 ro

o4 mini is a little better than 2.5 pro at pure coding tasks, from the benchmarks, o3 isn't really that much better, or even at all. It seems to fail things 2.5 is getting really easily, and 4.1 isn't a reasoning competitor

keen fulcrum
#

What is the difference between o4 mini and o4 mini high?

elder rapids
#

high/pro means it thinks longer

keen fulcrum
#

Simple to understand thanks

ornate stump
barren prairie
#

Sometimes I think that Gemini 2.5 is overthinking ...sometimes , the Flash thinking make good answers while the pro sucks 🙂 I still donno why

balmy mist
elder rapids
real totem
#

They wont switch from gemini

#

Cuz its free

elder rapids
#

2.0 flash thinking is a true overthinker

real totem
#

And o3 is paid

elder rapids
#

at least from the months I used it

ornate stump
keen fulcrum
#

Gemini 3 will be groundbreaking yet again

barren prairie
elder rapids
keen beacon
#

omg

#

o4 mini limits are pretty decent

elder rapids
#

although that's beneficial ONLY for coding

elder rapids
#

oh wait

elder rapids
#

you're talking about on the apps

real totem
#

I just switch accounts

#

When I get restricted

real totem
elder rapids
#

ye

real totem
#

It’s 50 messages I think

#

It’s pretty good

elder rapids
#

if you're talking about AI studio vs chatgpt plus

real totem
#

Yeah

elder rapids
#

that's just for API I think

#

I've sent more than 50 per chat in an hour

#

ye the difference is big, thinks a lot

#

wonder how good 2.5 pro would be with that crazy length

keen beacon
#

hmm i was checking to see if the 2.5 pro thinking bug was visual: 5.5s to the first token (first token in thoughts) and 6.1s (2.5 pro immediately going into a response), it doesn't seem to be visual

elder rapids
#

I'm confused on wym

keen beacon
#

if it were doing the thought process and skipping the thinking block

#

u would expect the first token to be delayed on the response

#

but its the same

#

delay with or without thoughts

#

so it doesnt seem to be visual

elder rapids
#

alr I'm lost but I hope you figure it out

#

sorry bro 🙏

keen beacon
#

its fine im explaining it confusingly

leaden palm
#

man how late is later...

keen beacon
#

both o3 andn o4 mini

leaden palm
#

o cool

#

time to give it my (only?) eval question

keen beacon
# keen beacon its fine im explaining it confusingly

the mechanism if it's not visual i guess is because they exclude thinking blocks, i guess when u reach a certainn amount of turns the model's tendency to start with a thinking block isnt there since all the past and numerous amount of turns had no thinking blocks

leaden palm
# leaden palm time to give it my (only?) eval question

o3 was... alright? the style was bad:

  • manually wrapped its text (presumably trained on too much TeX)
  • disregarded "mtok", took it to mean ktok, messing up the calculations
  • called phi 4 multimodal "φ‑4‑mm"
    but once i got past that, i did enjoy its discussion of the math and attention and economics
#

meanwhile o4 mini just spreading misinformation

keen beacon
#

did u set it to 0 temp btw

leaden palm
#

i left it at default 0.7

keen beacon
#

yea try using 0 temp

#

it seems its very sensitive

leaden palm
#

i'll try again but i'd be surprised if that improved it more than regenerating would

#

what the hell is this 😭

#

okay it was definitely better

#

i still like the style of the o series

elder rapids
#

they're all o series

#

😭 🙏

leaden palm
#

4.1:

leaden palm
keen beacon
#

it depends on how its ingested

#

but 2.5 pro i think is probably better

elder rapids
#

2.5 is absolutely better

#

it's not close

#

the Gemini models are basically made for that

#

tbh

#

as far as my testing goes

#

uploading anything to Gemini is far superior, it's audio understanding, it's video understanding

#

nah it goes for all the models

#

just try 2.5 pro with this stuff

#

it's seriously crazy

#

it's basically perfect

#

ye, but tbh its just different approaches

#

I like reading other models thinking

#

ye

#

can't say that's much of a problem tho cuz that's what benefits Gemini more in general

glass arch
plain zinc
#

How do you like o3? o4-mini high?

elder rapids
#

o1 pro is more of a academic

plain zinc
#

Tell me, please.

glass arch
#

I am gonna upload my video of this session to #ai-creations when youtube finishes processing it

elder rapids
#

yep, I still like 2.5 pro the most tho

#

it's just the way I prompt it

#

and how it allows itself to be prompted

#

that's just super unique to me tbh

#

so the feel that I like in other models, I can definitely replicate in 2.5 pro

#

and then boom

#

no need for them

drifting thorn
#

Btw is the o3 using tools in the arena?

glass arch
#

ok I shared my video

balmy mist
drifting thorn
#

Sad

#

Waiting for Deepseek R2 that integrates MCP tools in chain of thought

plain zinc
#

Dragontail, riverhollow, and shadebook are gone.

#

Shadebrook

balmy mist
#

lol

glass arch
#

I like how chatgpt just takes the verbal abuse I give it

balmy mist
#

you better becareful abusing chatgpt

#

it got memory now

glass arch
late path
#

does anyone else feel like o3's chatting style is strangely similar to r1...?

elder rapids
#

that didn't seem to be the case

#

so it must've happened within the last hours

#

by the way, I'm not sure about o4 mini being cheaper in practice

#

seems to be p4p not on 2.5 pros level

#

same goes for o3

glass arch
#

when are we gonna be allowed to use 4.1 in the app

elder rapids
#

you're not I think

glass arch
#

!!!!

#

bruh

keen beacon
elder rapids
#

4o seems to be the absolute replacement

glass arch
keen beacon
#

It's already live under 4o lol

elder rapids
#

in the chatgpt app?

keen beacon
#

They're continuing chatgpt 4o latest which is the chatgpt 4o model on the app, which is on the same base model as 4.1

thorny drum
#

4o != 4.1

keen beacon
#

It's mega confusing but yeah

#

Chatgpt 4o latest uses the 4.1 base model/has had it for a while before 4.1 released

#

The new base model where it was continued pretrained and has a newer cut off

glass arch
#

that is so stupid!????

keen beacon
#

For some reason lmao

elder rapids
#

which in this case, it would still be 4.1 → 4 → 4 omni

keen beacon
#

There are small differences though the chatgpt 4o tune is slightly different though even if it is on the 4.1 base model, it is more human preference aligned. But model performance is largely the same

elder rapids
#

since 4o previously probably still used gpt 4, as a distill

#

so it would be 4.0 Omni

#

now it could be 4.1 Omni and therefore 4o

glass arch
#

can't wait for 4.5o

elder rapids
#

could probably still be called 4o

glass arch
#

will it be better at language?

plain zinc
#

I don't see them anymore.

#

I only come across 2.5 pro, 2.0 flash thinking

elder rapids
#

could be just unluckiness

plain zinc
#

You won't get them either.

#

It's strange that Legit didn't reveal anything about their loss.

keen beacon
#

in the metadata there its still live

#

dragontail/etc

plain zinc
plain zinc
#

Did you see him just now?

#

Or recently?

keen beacon
keen beacon
#

if it was removed off the main arena

elder rapids
keen beacon
elder rapids
plain zinc
#

It can't be that I'm unlucky._.

#

I know for a fact that there were none.

#

What the...

elder rapids
#

sometimes I get an extremely long period of not getting certain models

#

like, I haven't even gotten o4 yet

#

or o3

plain zinc
#

Why is this happening?

#

Update?

elder rapids
#

seems like a balancing thing or smth

#

for the arena

#

could be that they actually did take them down

#

just for a little bit

#

for an update

#

we'll never know

plain zinc
elder rapids
#

oh damn I just got o3 vs 2.5 pro

#

crazy

plain zinc
#

...

#

What...

elder rapids
#

😭

plain zinc
#

10 new models have just been added.

#

It was 96 and became 106

elder rapids
#

kinda excited for 2.5 flash

keen beacon
#

it was supposed to come out a while back

#

the model was added on the sdk

elder rapids
#

ye

keen beacon
#

0409 or smthing

elder rapids
#

that wasn't too long ago tho

keen beacon
#

with thinking budget

elder rapids
#

damn I hope it's cheap and smart

#

that would be crazy

keen beacon
#

they delayed it for some reason

elder rapids
#

if it's a jump from 2.0 thinking, it's gonna be good

keen beacon
#

oh 2.5 flash is gonna be good

elder rapids
#

although it was trash at coding

#

it was for some reason INSANE at some things

#

tbh bro

#

I keep thinking back

#

the old times

#

where athene 70b was the hidden gem

#

that pure RLHF model was sick when you prompted it for the right things

elder rapids
#

if it's anything like 2.5 pro too

#

it's gonna go crazy

hardy pecan
#

O3 beat out 2.5 pro

plain zinc
#

And more...

#

HE's become insanely slow💀

elder rapids
plain zinc
#

I had to wait 3-5 minutes for the code generation to finish.

elder rapids
#

53% for high

#

damn I said o4 was gonna get low too

#

and o3 was gonna get 50+

hardy pecan
#

Hey it shows how long 2.5 pro is too, since everyone is impressed by Geminis offering. 50s on that benchmark is great, and actually SOTA

elder rapids
#

but I thought it was gonna be higher than 2.5 💔

hardy pecan
#

yeah the smaller mini models tend to suck at that benchmark

zinc ore
elder rapids
#

there's no way for them to have optimized that

#

so it's gonna be baseline what you see

#

rather than a high or medium variant like o1, o3, and o4

zinc ore
#

Exactly

brittle tiger
#

Flash thinking tomorrow

elder rapids
#

it's the same reason why openAI models seem to dominate benchmarks

#

but then not do well in practice

plain zinc
#

Yessss

#

I already thought that Google wouldn't show anything this week and would give OpenAI all week.

zinc ore
#

I just wish I could more readily compare the tokens/thinking time being used

plain zinc
#

But they decided to go all in.🔥

#

Bold and fast

zinc ore
#

Makes it harder to tell if these little 3% gaps are meaningful or exaggerated

#

So I don't like there's just a persistent mystery now

thorny drum
#

wdyt flash thinking means given pro is already a thinking model

elder rapids
#

but that's why it's still pretty much clear 2.5 pro is leading

thorny drum
#

or is this just flash

zinc ore
#

2.5 flash is a hybrid/unified model, so thinking model yes, once it releases

elder rapids
#

regardless of narrow task advantancment like o4 mini, or the general ability of o3

#

it seems like it's inherently a weaker model

#

due to the fact it literally has to think more

#

to achieve similar output

brittle tiger
#

I was really impressed by o3 but mostly for the tool use and UI. I think 2.5 Pro which was rushed out the door can match it and probably remains a dev fave. Think 2.5 flash will be better than o4-mini but just a hunch

balmy mist
#

no way

elder rapids
#

ye they've been doing this

#

for a while

#

it's a nice hint

zinc ore
#

Good guess is 2.5 flash, I think they might drop a coding model too

elder rapids
#

could, that would be pretty cool

#

that's what the discussion is about

balmy mist
#

i hope its nightwhisper

elder rapids
#

the fact it's so little above 2.5 pro while likely thinking so much more

#

going for all the other benchmarks as well

elder rapids
ivory schooner
#

我正在学会等待Behemoth ......

elder rapids
#

o4 mini costs more than 2.5 pro

elder rapids
#

I deserve the glaze

#

not you

keen beacon
#

buddy

#

who are you

elder rapids
#

2/2 you're just 1/1

keen beacon
#

do you have internal openai model access and connections?

#

no

#

pipe DOWN

#

🙄

elder rapids
#

I'm just like that

keen beacon
leaden palm
elder rapids
#

sorry you need to be granted access

leaden palm
keen beacon
zinc ore
#

I'm guessing they feel confident about 2.5 flash

leaden palm
#

maybe

elder rapids
#

that would be cool

balmy mist
leaden palm
#

seems openai is the only one whos got tool use in thinking though which is EXTREMELY weird

#

like it's just a ui level thing

#

how you call your own inference api

elder rapids
#

exactly

#

it becomes so much less of a jump

keen beacon
# leaden palm maybe

it does feel kinda weak on their part to release 2.5 flash, a worse model, when openai have just mostly beat 2.5 pro

elder rapids
#

while 2.5 is base

keen beacon
#

i think they'll wanna flex

balmy mist
leaden palm
#

yes
it's just whether or not you start another <think> after a tool message

elder rapids
zinc ore
#

This is hilarious because they're just benching the same model over and over again, so naturally it'll get similar scores, just incrementally worse/better based on thinking time

leaden palm
#

what do you mean you don't know about that

elder rapids
brittle tiger
balmy mist
#

google about to do openai dirty lol

keen beacon
balmy mist
#

y they cant let openai have one week

keen beacon
#

o4 mini is very strong

elder rapids
late path
#

this is our frontier model💀

keen beacon
#

in most cases it matches o3 and in some it beats it

elder rapids
#

but if 2.5 flash is narrow

keen beacon
#

i dont think flash has the chops for that

elder rapids
#

and anywhere near o4 mini

#

then o4 mini has to be cooked

#

the context + the price

#

at reasonable output

balmy mist
#

it would have to be nightwhisper right?

brittle tiger
balmy mist
#

they were waiting on this

#

like they probably been tesing o3 and o4 mini all day

topaz peak
#

really disappointing that o3 fails the bucket test damn

balmy mist
#

google is funny

#

i feel bad for open ai

#

they might have to releas o3 pro next week

elder rapids
#

o3 pro is gonna be cracked, although I don't think it's going to be straight up better than the one we saw during shipmas

#

the direction and bases have changed

zinc ore
#

They might release an updated 2.5 pro as well, think it's been about 3 weeks since it dropped and last year they were doing updates on around 3 week gaps sometimes. Usually 4 at the latest.

elder rapids
#

maybe

#

but now that it's actually confirmed there's going to be SOMETHING

#

yep

#

this is a crazy advantage

#

the fact they said it's trained WITH the tools

#

as it's reasoning

#

ngl

#

I didn't even know HLE 2.5 pro was 18%

#

or I didn't pay attention to that

late path
#

o4mini's hallucination rate is unbelievably high. it scores worse than o1mini on OpenAI's own PersonQA benchmark

elder rapids
#

I wanna know how good 2.5 pro DR is with that

elder rapids
#

it's smart

#

but it has unbelievable confidence in going in one route

elder rapids
#

o3 and 2.5 are very creative in their self reflectiveness, but o4 mini just really goes for it

#

super straightforward too

leaden palm
#

are you saying personqa is a subset of simpleqa

#

i dont think it is

plain zinc
#

Nightwhisper is coming out today!

leaden palm
#

ok its personqa

#

serves the same purpose i guess

plain zinc
#

dragontail cannot exit because it is still being tested

#

This may be the 2.5 flash model that will be released next week only.

brittle tiger
#

For ppl saying o4-mini is cheaper than 2.5 pro

alpine coral
zinc ore
#

Wow, nearly 20x price for o3 high

alpine coral
plain zinc
elder rapids
plain zinc
#

Okay. This is 2.5 Flash

#

Not 2.5 pro modified version 🥲

elder rapids
#

too early to say that lol

#

since it could be

#

Gemini 2.5 flash preview

Gemini 2.5 pro 0417

balmy mist
#

yeah w/ the agentic reasoning built in

plain zinc
#

Is 2.5 flash really a nightwhisper._.

elder rapids
#

but honestly could make sense

balmy mist
#

then openai is cooked

elder rapids
#

if it really is

#

yeah

#

that would just be insane

balmy mist
#

i feel bad for openai again

plain zinc
elder rapids
#

like I would be in disbelief

plain zinc
#

2.5-flash-high

balmy mist
#

lets do a go fund me for openai

elder rapids
#

but they all think for similar amount of times

#

they're capped

#

by lmsys

#

so whether it's thinking really fast and a ton, for the same amount of time

#

we'd never know

#

we don't even know if night whisper could even be flash 2.5

#

it could be those other checkpoints, why would the performance be so different

#

if not for a Gemini coder, 2.5 pro checkpoint, and then 2.5 flash checkpoint

balmy mist
#

do you work for openai?

#

they should hire you

#

how can google get you on their side?

#

notebooklm>

elder rapids
#

I see your point, but you probably think the image zoom feature that o3 uses is an example, but you're just wrong lmao, notebook llm? Google AI search (as in bard, from way before), native image gen flash? long context?

#

wym? this is exactly THE step forward these companies are trying to make, the tool usage, all that stuff basically came from bard lol

#

without search, hallucinations become a major problem again

#

also not to mention, project Astra

#

and still, the other things I mentioned

alpine coral
#

tbf they pioneered and open sourced the transformer architecture in 2017 - everyone else since (oai included) is arguably innovating on top of the status quo they established

elder rapids
#

even still tho

#

Google had been developing veo before open AI

#

had sora

#

they just announced it after

thorny drum
#

i really wonder who writes llm system prompts

#

ifl they have fun with them

alpine coral
#

yeah tbh i knew that's what you meant (was just thowing it in there - but in a way, kinda beside your point / taken for granted i know)

thorny drum
#

do you think they AB tested calling it 'yap score'

brittle tiger
#

i think you might be right about oai consumer stickiness if goog doesn't beat them by wide margin but they def innovate. invented self driving cars, solved protein folding, 2 nobel prizes in AI past year

elder rapids
#

what about context caching, Gemma, native audio understanding, learnLM, AI overviews

#

I honestly don't think that argument has any basis

#

even benefit of the doubt

balmy mist
#

@deep adder did you like 2.5 pro?

elder rapids
#

Google deepmind genuinely seem like the actual innovators, in and outside your direct chat bot interface

#

even mobile integration, Gemini has

#

how tho?

#

I would literally NEVER use it's o3s tool usage

#

if not for specific coding things

#

lmao

plain zinc
#

🤨

elder rapids
#

and btw

plain zinc
#

Don't like the model because it's a chatbot?

#

what?

elder rapids
#

no

#

read context

#

he's saying that openAI are the innovators

#

while naming things Google created

balmy mist
elder rapids
#

but it's not "actually" useful though?

#

you're saying it is

#

but that can't be the case if it's coding based

plain zinc
#

And completely replacing human labor

elder rapids
#

you can say native image gen

keen fulcrum
plain zinc
#

No, I don't want that kind of future.

elder rapids
#

but Google did that first

#

you can say vision

#

but Google did that first

#

you can say agents

plain zinc
elder rapids
#

but Claude and gemini did that first

#

the only thing you can say, is "reasoning"

brittle tiger
#

i think talking to dolphins will be useful

elder rapids
#

but did Google not release a paper

#

explaining exactly what openAI did

balmy mist
#

who are the mods of this server? it would be funny to add tags to people name for if they support a specific company lol

elder rapids
#

like, right after?

elder rapids
#

😭 🙏

balmy mist
elder rapids
#

why'd they do that tho

balmy mist
#

u dont wanna talk to dophins?

elder rapids
#

nah deepmind is cool, they just do some stuff

#

btw did we all just forget about genie

#

ye but openAI wouldn't be where it's at without the transformer architecture

#

they improved it for the same reasons Google knew

#

that's why Google was able to hop on board so early

#

ye

#

it's kinda crazy how one company

#

kinda sparked it all

#

or two ig

#

or 3

#

there's prolly more

#

but same idea

#

crazy how a couple companies are causing the craziest era

#

an invention probably greater than fire

#

ye but I'm pretty sure that would've been created regardless

#

things like that were bound to be created soon

#

not sure about large language models tho

#

cuz its actually pretty specific

#

like who knew bro

#

ye, but that seems more like the problem of antecedent improbabilities

keen fulcrum
#

It isn't

ember rapids
#

Logan tweeted the bat signal

#

Looks like we might be getting something tomorrow

patent bane
#

"iT IsN'T"

ember rapids
drifting thorn
#

Okay, just got the idea of this

#

It's basically dealing with the routing issue of having 1 million experts

#

but there's no companies(we've known now) that are using that much experts in a model

#

Deepseek has 256+1

#

Llama has 128/16 only

#

And other open-sourced models are dense models

fleet lintel
#

o3-high is extermely expensive

elder solar
#

i dont see gpt 4.1 on lm arena

leaden palm
elder solar
#

leaderboard

leaden palm
#

patience axel

plain zinc
elder solar
#

flash??

#

theres only pro

leaden palm
#

(and the open secret that google tests many models and tunes on lm arena anonymously)

elder solar
#

but gemini 2.5 flash is not out

leaden palm
#

so what

#

it's being tested on lm arena anyway

keen beacon
#

dont forget they offed their first whistleblower

fleet lintel
ornate stump
oblique flint
#

if flash 2.5 gets anywhere near o4 mini performance at flash 2.0 prices I'll be very happy

fleet lintel
#

It's actually quite meh compared to bigger models

#

but given the cost, it's amazing

fleet lintel
fleet lintel
ornate stump
alpine coral
#

yeah if it's dragontail that is being referred to as 2.5 Flash, it's def not meh

fleet lintel
alpine coral
#

which google anon model/s are you referring to then?

#

omg - could today be the day

#

google is finally unveiling euraka chatbot!!

fleet lintel
#

Riverholllow

alpine coral
#

in terms of performance

fleet lintel
#

or shadebrook.. They both behaved like flash

ornate stump
alpine coral
alpine coral
fleet lintel
#

what eureka chatbot does?

keen beacon
alpine coral
#

so if dragontail/nightwhisper isn't 2.5 Flash (which I think we can safely assume), i feel like there's a good chance that dragontail is 2.5 Pro with reasoning_budget set to high

elder rapids
#

even granting it, outside of speculation

#

that they'd do this

#

it wouldn't truly make sense

#

considering there's nothing about a whistleblower in openAi's case

#

that could be harmful in any way shape or form

#

or even openAI much as a company

keen beacon
#

..?

#

u dont believe companies off people

#

?

elder rapids
keen beacon
#

Thinking openai killed that guy is wild tbh

elder rapids
#

yeah, that's inherent to the idea

#

lol

fleet lintel
#

i dont think tech companies do it "yet".

elder rapids
#

I'm not sure how that's really substantive

keen beacon
#

silicon valley is one of the most disgusting places

#

on earth

elder rapids
#

considering I said, let's grant the premise that he was killed

#

say he was killed, rationalize it

#

like, in chat

keen beacon
#

i dont think it's far fetched to say some companies would go to extreme lengths

elder rapids
#

and if you can't, why speculate

keen beacon
#

Iirc wasn't he whistle blowing something that was obvious, wasn't it about copyrighted content

hardy violet
elder rapids
#

but even if we grant that too

#

that's not even crazy

keen beacon
#

its speculation true

elder rapids
#

of course speculation sometimes means something

#

but like, you don't gotta speculate, if on the grounds of accepting that premise, you still can't understand why

#

no dots are connected etc

keen beacon
#

nah cus ts pmo 😭 🙏🏽

elder rapids
#

then why is them killing the whistleblower unique

#

therefore it has to be the conclusion you're looking for

#

ykwim

fleet lintel
#

honestly, I think this is not the right place to discuss whether OAI assasssinated someone or not

elder rapids
#

just a little comment here and there

#

like now, discussion poof, dismissed

elder rapids
#

it seemed to be a really really large model tho

#

like gpt 4

#

and that's really the only way they could've competed at that time

#

they limited usage too

keen beacon
#

is there a new gpt model dropping

#

feel like i hear about it everytime i open the chat

elder rapids
#

and that's surprising for Google

#

or maybe they allocated enough compute

#

during then and now

#

but who knows

#

if they gave 2.5 pro high compute

#

it WOULD dominate

keen beacon
elder rapids
keen beacon
#

its alr dropped

#

Old news lol

elder rapids
fleet lintel
elder rapids
#

4.1 dropped like a little while ago 😭 🙏

keen beacon
#

they just dropped this monday

elder rapids
#

censor ts

#

not all of us w*rk

keen beacon
#

😭

elder rapids
#

they released more models

#

o3 full and o4 mini

fleet lintel
keen beacon
#

Space moves very fast

elder rapids
#

ngl 4.1 is kinda bad outside of coding

#

has vibes

#

but synthetic asf

keen beacon
#

20$ a month

#

and if using free you only get like 30 back and forths

#

vs geminis free trial 1 month repeatable full advanced

fleet lintel
#

4.1 should have been just a tweet. but they hyped it like crazy.. I am still mad about it

elder rapids
#

straight up AI studio

#

give them your data, you get infinite usage

#

I know the data I give them allows them to progress btw

#

it's worth a ton

#

don't glaze me yet tho

keen beacon
#

the thing about this is

#

skyvern and other things about to get full gemini support

elder rapids
#

whatever that means, I agree

#

🙏

keen beacon
elder rapids
#

niche territory right off the bat

keen beacon
elder rapids
#

damn you either live under a rock or you don't

keen beacon
#

go look, that will be free? nah

elder rapids
#

yo what if Google releases an agent

#

like, a full on agent

keen beacon
#

you can try this but its made by LangChai or some ccp stuff

keen beacon
#

they have screenshare but its bad

alpine coral
keen beacon
#

sends like past 5 second worth of clip for ai to understand

alpine coral
#

we don't just want thinking models anyway imo

elder rapids
#

and determination

keen beacon
#

i think that's what they'll do with copilot

#

but we'll see

elder rapids
#

seems like they can really create anything

#

they've just been upgrading the Gemini app nonstop

keen beacon
#

their funding

#

them mfs created a new material

elder rapids
#

was that even real tho

#

the Microsoft thing

keen beacon
fleet lintel
elder rapids
#

yo wtf

#

why'd I not know this

#

bro picks and chooses which rock he lives under

alpine coral
fleet lintel
keen beacon
keen beacon
#

like

fleet lintel
keen beacon
elder rapids
fleet lintel
#

dont worry too much about quantum computing for atleast 5 years. progress will happen but noting useable till thne

keen beacon
#

eventually things like AES will die out

elder rapids
#

just not that

#

which is crazy

#

how'd I not know that

#

I'm Lowkey disappointed I'm logging off

keen beacon
#

😭

fleet lintel
elder rapids
#

got dealt a revelation

fleet lintel
#

oh take care!

hardy violet
fleet lintel
#

This is my team experience as well with o4/o3 models.. very expensive and not necessarily better

fleet lintel
hardy violet
#

"Pro" and "Flash" are indeed clear and easy-to-understand product naming conventions. And yes, Google AI Studio needs a parameter to adjust the reasoning strength . Furthermore, I've noticed that many websites don't offer any option to adjust this reasoning strength, which is very inconvenient.🥲

alpine coral
#

yeah i don't think their model released will be called 2.5-Pro-High (though it might be.. just to hedge ha)

#

i think when they release the stable version, it will have a parameter for reasoning tokens (like sonnet.3.7-thinking is fixed at 32k; OAI models offer low, med, high). Perhaps it'll be a dropdown, low/med/high, but there's no reason why the end user couldn't specify the max number of tokens to be allocated for reasoning / thinking, like with max_output

#

and i think perhaps dragontail could be 2.5-Pro served with a high value set for reasoning_budget

#

but it won't be like separate model

fleet lintel
opaque adder
#

is o4 mini or o3 better than gemini 2.5 pro yet

#

or is chatgpt still braindead

fleet lintel
#

o3 > 2.5 pro > o4 mini

but o3 is very expensive for relatively small benefit over 2.5 pro.

I am a bit disappointed by yesterday's OAI release.

But chatgpt is no way braindead 🙂

opaque adder
#

wtf

#

o3 is better than 2.5 pro now

#

surely nightwhisper is gonna come out

#

why are u disappointed tho

#

ah shet yeah

hardy violet
# opaque adder wtf

No, that's not quite right. The difference is minimal. However, in some o3 scenarios, the cost can be ten times higher or even more! And there are also cases, particularly in code writing, where it's less than ideal (certainly not as good as the scores might lead you to believe). While its Agent performs well, which is fair, the base model is just deeply unsatisfying.

hardy violet
#

I'm not sure what the state of English is like with this model (I'm using Chinese). I've tested its understanding of a few texts, and the results are far from ideal. At first glance, it seems impressive and insightful, but that's quickly revealed to be an illusion. Like DeepSeek R1, it tends to latch onto certain words and over-interpret them, and its language is quite flamboyant. If you like DeepSeek's writing style, then perhaps this model is a compromise. In comparison, Gemini 2.5 Pro demonstrates a remarkably solid, appropriate, and somewhat insightful understanding. Furthermore, if understanding is subjective, I also tested the application of translating Japanese song lyrics into Chinese, and it was terrible, very disappointing.

keen beacon
#

goooood morning

balmy mist
sonic tendon
#

gm

plain zinc
opaque adder
opaque adder
alpine coral
#

still 1000% better than old school translation (wouldn't even be a question if it was translated; there'd be a litany odd / seemingly misplaced words etc)

#

kinda interesting come to think of it.. like you give an LLM text to translate, and ig generally it both translates and polishes it? Like it will spit out the translation with properly formed sentences, correct punctuation etc, even if the original was sloppy (like typos, poor / no punctuation etc)?

alpine coral
keen beacon
balmy mist
#

this si what o3 says about itself:
Think of me as a miniature A2A coordinator bundled with a suite of MCP‑ready tools.
Everything you saw in the video—agent discovery, delegation, precise tool execution—happens here on a smaller scale every time I answer. The protocols simply formalise and generalise what’s happening under the hood right now

#

so openai is pretty much doing what google is doing with A2A and anthropic is doing with mcps but built in

#

the only thing that will make o3 better is the ability add more tools or mcps to it overtime and allow o3 to create agents based on those tools

drifting thorn
tall summit
drifting thorn
#

Very good in "Chinese to English" I can confirm

tall summit
#

what lmao

#

paid tier for a discovery tool

keen beacon
#

either way people will just repost that stuff so you won't ever need to pay

tall summit
#

haven't looked into what he's doing but i do wonder

#

whether he has exclusive access to something (and what it is) or not

balmy mist
#

yeah to pay seems weird bc he posts it as soon as he gets news

#

thats what his twitter is for

tall summit
#

oh he'll surely stop posting it

balmy mist
#

like why he gets views lol

tall summit
#

once he makes it paid

#

but he also is saying he won't make it paid for now

#

clearly testing whether he can afford to

#

but nothings even happening i dont even know why im speculating

balmy mist
#

what woul dbe the benefit for people? like how useful is his tool anyway? it tells us what is most likely coming out but so does the companies

#

but for the webdev arena stuff its cool

#

like i never check webdev unless he posts about a new model

tall summit
#

or at most immediately when they do since its a discord webhook and he gets notifications

balmy mist
#

not the gemini one

tall summit
#

?

balmy mist
#

yesterday

#

logan posted about a new model

#

then legit posted about it a few hours later

#

about the exact model

#

but at that point whocares

#

i can just wait for today

#

unless you in prediction markets

#

idk, i just dont see the need for buying that, especially when you have channels like this

#

the only thing about openai is their UI sometimes buggs out

#

i always have issues with their website

narrow elbow
balmy mist
#

never get issues with studio, like maybe 1/10 times with studio and i usually know when it will bug out, but with chatgpt bruhh, its like 50/50

#

im gonna build an o3 clone that is adaptable where you can add new tools on the fly, its pretty much just a small scale ai ide or more specifically a augment or cursor or windsurf, but built into a model with specific mcps, like a starter package, i wonder if you built an ai ide with o3 as the brains, maybe that is where openai is going?

#

gpt5 is just o3 but access to more tools and trained on how to better use those tools and more agents built in

ocean vortex
balmy mist
#

true but is openai playground free?

#

and tbh i dont like using openai playground tbh, its a little confusing for me for some reason

#

studio is very straightforward

balmy mist
#

you cant even branch in chatgpt, so no point even mentioning that

ocean vortex
#

kinda by design

balmy mist
#

you can branch in playground?

tall summit
# balmy mist but at that point whocares

what kind of argument is "who cares" against the fact that it objectively does something most people don't have access to without doing what the bot itself does
though of course it's not worth it and i hope nobody pays for it if he ever makes it paid

balmy mist
balmy mist
tall summit
#

honestly nobody KNOWS when anything is dropping

#

thats THE POINT

balmy mist
#

thats why to me its pointless

tall summit
#

i can see its use

#

though to me it doesnt much matter what will come out

#

its all just names

balmy mist
#

what would paying for it do for you?

tall summit
#

until we actually see what it does

#

which only ever happens on lmarena or private access (which nobody normal will have access to regardless) or the actual release which after that it doesnt matter

tall summit
balmy mist
#

? i dont know how else i could write that

#

what would it do for you?

#

lol