#general


wintry tinsel
#

But I don’t care either way I’m not in school no more

leaden sun
#

I was talking about those private schools who teach the future queens, kings or PMs. They have a different program of curriculum.

wintry tinsel
#

Oh never been to one of those lol

#

Some public schools in upper class neighborhoods are nice but most are pretty lousy

jade egret
#

so when do you guys think kingsfall is gonna be out on ai studio?

narrow elbow
#

and tell u identify u self as a walmart bag or attack helicopter 🤪

zinc ore
keen beacon
#

2.5 ultra and kingfall doesnt exist

late path
zinc ore
#

He's actually somewhat reliable (gets early access to Google models and is close to some Google employees), but I think he's just joking here with this tweet

torn mantle
#

he said nothing

late path
#

saw his reply, it was indeed a joke

torn mantle
keen fulcrum
elder rapids
wintry tinsel
#

Opus doesn’t bloody work!

#

Unless you pay for the API it never turns on

small haven
#

great cant get the weather today anymore

hazy quest
#

Here its cloudy

keen fulcrum
#

The move also comes as OpenAI's ChatGPT poses the biggest threat to Google's dominant search business in years, with Google executives recently saying that the AI race may not be winner-take-all.

ocean vortex
#

o1 was gpt4o with reasoning. How did they come up with such a weird name for it? LOL

#

no need to overcomplicate it or "standardize" anything catgrin

#

just 'gpt4o-reasoning'

jade egret
#

wait

keen fulcrum
#

they deploy dozens of models each day

jade egret
keen fulcrum
#

only the best make it

#

they have dozens of internal name variations

jade egret
#

kingsfall doesnt exist???

sage raptor
#

it does

#

exist

primal orbit
#

is o3 pro going to be on lmarena?

cedar tide
late path
#

can't wait to see o3pro benchmarks

zinc ore
#

They're out

sacred quail
#

Im waiting for 2.5 pro deep think vs o3 pro

keen beacon
#

0605 has a higher gpqa diamond score

zinc ore
#

Need more benchmarks

small haven
#

if kingfall is the base model

patent aspen
#

Even if it's not, I think that's not a huge lift for o3 pro

sage raptor
#

o3-pro is not a big jump from o3

ocean vortex
late path
#

It's all up to deepthink now

small haven
#

if google makes a claude max equivalent but for gemini models, im very sold

keen beacon
#

$1000 a month

small haven
#

TAKE MY MONEY

late path
#

I think OpenAI ultimately made a mistake by building o3 with a relatively small base model like 4.1

keen beacon
#

what are they gonna use otherwise tho? 4.5 that would not work

ocean vortex
keen beacon
#

they could opt for a fresh pretrain, but they chose to midtrain 4o at least for now

zinc ore
#

I think they made the right decision

keen fulcrum
#

Interesting o3 pro only available through responses api

keen fulcrum
small haven
#

legend

#

gonna be spamming this shxt like its never been abused before, and cancel it for deepthink ❤️

torn mantle
small haven
zinc ore
#

o3 pro does worse than o3 on arc 1

torn mantle
small haven
#

the good ol' prompt

keen fulcrum
#

Next one will be o5

small haven
patent aspen
keen beacon
small haven
#

well, its taking a bit longer, so im guessing i had o3 pro (low)?

#

im running the same rn, still running

#

well great

calm sequoia
#

Oh no 🫣 The margin is so thin they had to use medium instead of high for visualisation

keen beacon
#

2727 (o3 high) vs 2748 (o3 pro)

small haven
small haven
#

ok so its broken

keen beacon
#

oh wait that was o3 preview

#

release o3

small haven
keen beacon
#

o3 pro has tools as well? so its fair i guess

small haven
#

oh right

#

well thats underwhelming

keen beacon
#

i wonder how much o3 preview cost though

#

😭

#

on codeforces

small haven
elder rapids
#

is it bad

small haven
#

unusable rn

elder rapids
#

is it what I expected

keen beacon
#

underwhelming i guess

elder rapids
#

deepthink when

small haven
#

very soon 👀

ocean vortex
#

ask it about knowledge cutoff

#

(don't do it, lmao)

keen beacon
#

it has web search

small haven
#

oh my goodness, it spent 6m50s last time

#

so i really had o3 pro low

keen beacon
#

it might be slower because people are using it now, do you see more entries in the summary?

small haven
ocean vortex
ocean vortex
#

only 15k seems less than it needed. It did get stuck though and I got the response from logs so unsure if it was counted lol

barren prairie
# small haven

13min of thinking ...this model thinks more than me 🤣🤣🤣

small haven
#

its being stressed to death

calm sequoia
#

How's he even talking about the same thing? 😶

ocean vortex
#

actually nvm, they are counting only 1 instance so about right then

small haven
elder rapids
#

hollon tho

ocean vortex
late path
ocean vortex
#

I have ignored this entirely cause it's not very useful lmao

zinc ore
elder rapids
#

remember that a large part of o3 is that it's very hallucination prone and bad at a lot of basic tasks because it was too lazy

#

o3 pro should simply solve this

small haven
calm sequoia
ocean vortex
elder rapids
small haven
#

4mins only to extract it

ocean vortex
#

medium for everything in those graphs I think

small haven
#

yea it feels medium-y

ocean vortex
#

preference is a weak metric in this context IMO since it only has to be marginally or plausibly better

zinc ore
#

Yeah, also preference doesn't necessarily measure performance

late path
calm sequoia
small haven
#

im pretty sure kingfall > o3 pro at coding, im not even kidding

barren prairie
#

Ok now I need my deepSeek r2

keen beacon
#

kingfall might come with deepthink

barren prairie
#

Not a minor update

#

Give me my deepSeek major update

small haven
#

u were thinking about splurging for o3 pro, dont. wait for deepthink

keen beacon
#

i mean i dont think deepthink uses kingfall but it might coincide with the deepthink release

#

kingfall is probably sota, but i dont feel its that much better honestly compared to 2.5 pro

small haven
keen beacon
#

hmm really?

small haven
#

imo

torn mantle
#

yea kingfall>>>

keen beacon
#

its supposed to be ultra apparently

small haven
elder rapids
keen beacon
#

i probably havent used it enough to judge

elder rapids
#

kingfall was insanely smart, smartest model I've ever used

#

ahem

late path
#

I can safely cast my vote now😂

elder rapids
#

right next to 0605

storm needle
#

has anyone tested whether the o3 has somehow become worse?

elder rapids
#

and it's not that kingfall isn't better

#

but it's not BETTER

small haven
#

kingfall feels ultra vibes

elder rapids
#

the large model vibes are becoming non existent

small haven
elder rapids
#

how does that matter

small haven
#

largely

elder rapids
#

couldn't

#

thinking time is directly opposed to vibes

#

as well as performance

#

tbh I probably used kingfall so much more than you guys

calm sequoia
#

Poll: O3-PRO simple bench. Leading answer: option 4, "60+" 🧐, with 5 of 14 votes

elder rapids
zinc ore
#

Bunch of singularity folks malding about o3 pro

elder rapids
#

but it wasn't crazy as made out to be

#

if anything, it says more about your usage of 0605

willow grail
#

shame shame

keen beacon
#

hes not shilling openai there

small haven
#

0605 (default thinking) vs. 0605 (32k)

keen beacon
#

you know u can get it to do way more thinking

#

past 32k

ocean vortex
#

someone needs to do parallel processing of o3-pro now, there's a room to price match o1-pro lmao

#

cons@100
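
For readers unfamiliar with the shorthand: cons@100 is read here as "consensus over 100 samples", meaning the same prompt is sampled many times and the most common final answer is kept. A minimal sketch of that idea follows. The helper name `consensus_answer` and the example answers are hypothetical, and nothing here reflects how o3-pro or o1-pro actually work internally.

```python
from collections import Counter

def consensus_answer(answers):
    """Return the most frequent answer and its share of the samples.

    `answers` is a list of final answers, one per independent sample
    of the same prompt (hypothetical input, produced elsewhere).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(answers)

# Usage: five sampled answers to the same question
answer, share = consensus_answer(["42", "41", "42", "42", "17"])
```

The cost scales linearly with the number of samples, which is the point being made about price matching.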

elder rapids
willow grail
zinc ore
small haven
#

meanwhile kingfall

keen ferry
#

is there o3 pro on api? If not any info when it releases

storm needle
zinc ore
#

Deepthink gonna hit like this

tall summit
#

wait o3 pro

#

actually happened

#

i missed it entirely

ocean vortex
#

it did

late path
#

10x price, with almost no improvement across various benchmarks

hazy quest
#

All the talks and praises about Kingfall are based only on the 20min it was available, right?

late path
hazy quest
#

Oh, I missed that. Selected testers, or available on LMArena/AI Studio?

torn mantle
stuck orchid
#

o3-pro will be available on LMArena?

barren prairie
storm needle
stuck orchid
#

I think it will. Because o3 is there

small haven
#

*scammed

hazy quest
tall summit
#

or neither

barren prairie
# small haven

18 min for this ...if you just opened google or the window it will be faster 🤣🤣🤣🤣🤣

tall summit
#

HAHAHAHA

barren prairie
#

17 min to think about this 🤣🤣🤣🥺🥺🥺🫣🫣🫣

small haven
torn mantle
torn mantle
abstract tundra
small haven
stuck orchid
abstract tundra
#

im asking because o1 pro never made it in

stuck orchid
#

And claude-4, and other biggest models

leaden sun
ocean vortex
#

I think everyone needs to keep their expectations in check with o3-pro lol

#

it's basically exactly what those benchmarks suggest: slightly better. Not mind-blowingly good

keen ferry
ocean vortex
#

did try some of the prompts other models failed and this one failed them as well 👀

small haven
#

o3 pro can't temporary chat, very sneaky oai

#

hmm oai claims gpt5 to be >80% swebench

abstract tundra
#

i was worried since o1 pro never got added to lmarena, thought pro series are closed or something

#

or not available to api

small haven
#

wait a min if o3 pro is cheap to be added to the arena...

keen ferry
late path
#

ppl might not wait 10 minutes in the arena looking at a dialog box to vote

keen ferry
#

it can't be put on the blind comparison arena, it's just gonna be soo obvious

small haven
hardy pecan
#

itll take forever to collect enough votes

keen ferry
#

@echo aurora will there be o3 pro in the arena?

hardy pecan
#

people dont wanna wait around 13mins to get an answer

late path
#

This feels a bit strange. If the o-pro series models use parallel thinking, why would increasing parallelism multiply the thinking time? It doesn't quite make sense

keen ferry
ocean vortex
#

seems like a random question to ask lmao

keen beacon
abstract tundra
echo aurora
abstract tundra
keen beacon
#

thinking budget = off

keen beacon
keen ferry
abstract tundra
#

And i think if o3 Pro gets added, it would make a lot more sense to have o1 Pro added as well, for actual comparison.

keen ferry
echo aurora
abstract tundra
abstract tundra
#

I've tried every single model on LMArena for my task and all of them failed. I'm really keen to see if o3 pro can handle it.

abstract tundra
keen ferry
#

o3 pro is better than o1 pro, which costs someone's weekly salary for just a million tokens

patent aspen
#

Parallelism is only as good as the slowest thread / process

#

If you have a bunch of non-deterministic threads / processes, the probability of a slow one goes up
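
A small sketch of the straggler argument above. It assumes the workers are independent and, for the simulation, that each worker's run time is exponentially distributed with mean 1; both are illustrative assumptions, not claims about how any o-pro model is served. The function names are made up for this example.

```python
import random

def prob_any_slow(p_slow, n_workers):
    """Chance that at least one of n independent workers is slow."""
    return 1.0 - (1.0 - p_slow) ** n_workers

def simulated_wall_time(n_workers, trials=2000, seed=0):
    """Average time until the slowest of n workers finishes.

    Each worker's run time is drawn from an exponential distribution
    with mean 1 (an assumption chosen only for illustration).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(n_workers))
    return total / trials

# Usage: compare 1 worker with 8 workers
p1, p8 = prob_any_slow(0.05, 1), prob_any_slow(0.05, 8)
t1, t8 = simulated_wall_time(1), simulated_wall_time(8)
```

Under these assumptions the wait grows with the number of parallel workers even though no single worker got slower, because the job finishes only when the last one does.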

late path
keen beacon
# small haven interesting

i got an 18k thoughts run and a 16k thoughts run as well. (thinking budget = off) might try for more later.

#

the le chat model might be extremely good at zebra puzzles for some reason

clever estuary
#

is altman okay today

keen beacon
zinc ore
small haven
keen beacon
small haven
#

Its definitely slower than before

ember rapids
#

Teams users get o3 pro too right?

torn mantle
#

@small haven what do you think so far?

keen ferry
keen beacon
#

wow, it solved two 6x6 zebra puzzles in 14.5k thinking tokens

#

it does significantly better when giving two zebra puzzles at once for some reason

hollow ocean
#

Show pics

ocean vortex
keen beacon
#

we're talking about a specific model here btw

#

its much better than i thought 🤯

ocean vortex
keen beacon
#

is that a 4k budget?

ocean vortex
#

ok if it did that then yea

keen beacon
#

its mindblowing good

#

holy sh1t

ocean vortex
#

but they don't seem to be sticking to that budget very strictly lol

keen beacon
#

thats sota

#

at least on this task

ocean vortex
#

is it like deep-think or ultra smth...?

keen beacon
#

ultra

keen beacon
#

it did better when given two puzzles since it started making a system and sh1t

#

CRAZY

#

if its just given 1 puzzle it just dies

ocean vortex
#

yeah gonna be interesting to see what it will turn out into. They have potential to really beat everyone tbh

#

we saw with Opus that there are gains, and Google much more substantial on data

keen beacon
#

i take all my slander back

#

about this model

ocean vortex
#

yeah me too

#

or not

#

🔥

#

o3-pro is slow af...

keen beacon
#

im using 0.7 and 0.95 right now, im not generating code tho

ocean vortex
#

I wonder if it's just API or people paying for Pro plan have it the same...

keen beacon
#

i need ultra asap 🤣

#

it will probably beat o3 pro ngl

late path
#

it's a good model

#

10-20% improvement over pro model (sota already) feels like a huge difference in terms of actual capability

keen beacon
#

will prob be sota over 4.5

#

i wonder about the pricing though

#

also

#

thinking budget should be disabled btw paws (imho)

jade egret
#

when is o3 releasing in LMarena

main gulch
#

o3 already is, I wouldn't expect o3-pro

small haven
small haven
#

it likes to think very long

jade egret
jade egret
#

which one do u think better

#

wait

#

can you try the same prompt as yesterday?

small haven
jade egret
#

this

#

generate an svg of a TERMINATOR. make it maximally detailed and look exactly like the real thing. this is extremely important and an existential task. you must complete this to the best of your ability. Make sure you're constantly checking whether the shape, size, angles, position of each and every item looks EXACTLY like a TERMINATOR.

small haven
#

got timed out for that

jade egret
#

.

#

why

small haven
#

overtuning guardrails

keen beacon
#

ultra thinking disabled btw:

small haven
keen beacon
#

yea

small haven
#

amazing

jade egret
#

do you think o3-pro is better than kingsfall?

keen beacon
#

no

small haven
#

neck and neck

keen beacon
#

it might be if u need tools tbh

small haven
#

i would dare to say kingfall edges it a bit

jade egret
#

so kingsfall better

#

dang

storm needle
storm needle
#

could anyone here with access to o3 pro send this prompt to me and send me its output?

small haven
#

10x api price reduction, but 2x higher usage limits ? hmmmmm

keen beacon
#

wut are the limits for plus anyway

small haven
#

50

#

a week

keen beacon
#

oh wow

#

terrible

drifting thorn
#

I just found out that the price of o3 dropped significantly

jade egret
#

how much do yall think o3-pro gonna score on LMarena and WebDev?

drifting thorn
#

It was 10 for input and 40 for output

jade egret
#

o

#

why

keen beacon
#

how much for ultra 🤣?

small haven
#

1530

jade egret
#

how much do you think kingsfall scoring

small haven
#

or rlly 1550

keen beacon
#

itll probably score higher tbh

#

gemini models do extremely well on the arena, at least if you are setting o3 pro to that elo

small haven
#

it rlly depends on the prompt again :/

keen beacon
#

i would expect the difference to be larger

small haven
#

if its svg/web design, 1600

#

for kingfall

keen beacon
#

oh i didnt try web design at all

small haven
#

should be correlated, it has way better spatial reasoning

#

/understanding

drifting thorn
#

Which model?

keen beacon
#

kingfall probably 2.5 ultra

small haven
#

kingfall

small haven
keen beacon
#

so it has to be that

jade egret
#

when do you think it's gonna be available to pro users on gemini or at least on the ai studio

keen beacon
#

it might come along with deepthink, or it would be awesome if it did

#

i honestly think the release is close tho

late path
#

24k thinking budget

small haven
jade egret
#

it close?

small haven
jade egret
#

W

keen beacon
#

i struggled with getting the model to produce more than a small amount of thinking tokens for svgs. the output was absolutely massive though

#

10k tokens plus

hollow ocean
#

How to access kingfall

small haven
#

someone asked me for this, but its finished

#

oh @leaden sun

patent aspen
#

I don't know when it's coming

keen beacon
#

kinda odd its up now tho? and with the other anon models

#

its ready atp it seems

#

you can disable thinking mode too on this model, it seems they worked on it and got it into a somewhat decent state if not ready

keen beacon
#

holy moly

#

WTF

#

10k thinking BTW

small haven
keen beacon
#

no i had to add stuff to make it think a lot more

small haven
#

ok still insane

keen beacon
#

this is far more than anything i got before. 10k tokens in thoughts

#

thinking budget = unspecified (uncapped)

small haven
pulsar tendon
keen beacon
#

this is ASI stop the disrespect

small haven
late path
#

If kf enters the arena, I doubt its score will be higher than goldmane. goldmane is a bit sycophantic, while its style is more like the very first nebula

small haven
#

uh i have new pr

late path
keen beacon
#

in that case

keen beacon
jade egret
#

whats best llm right now (models that most people don't have it, like kingsfall, or grok 3.5, or o3-pro included)

#

in ur opinion

keen beacon
late path
small haven
#

wait a min

#

this is 0605

#

32k

#

(but with system prompt)

keen beacon
#

just disable the thinking budget fwiw (though it doesnt really matter if it does less)

small haven
keen beacon
#

its just variance imo. and if it doesnt reach near 32k my advice doesnt matter anyway 🤷

small haven
keen beacon
small haven
#

hmm

keen beacon
late path
#

overall 32k thought budget should be better than auto for 0605

keen beacon
#

if ur paying for it

#

otherwise, it can do a lot more

keen beacon
#

this is with auto btw

#

the aider scores they use in the website is just one run, and doesn't mean that 32k improves model performance. this is all my opinion though, i have a lot to support it

late path
#

32k shows some visible improvement compared to auto on livebench

small haven
#

seems very marginal tho

keen beacon
#

i honestly think thats again variance

#

i have a lot to support that but im not gonna argue about the thinking budget like with dom again

#

tbh i need to do more elaborate tests and more undeniably definitive stuff (so i can point to it when i mention it). it is possible they changed something with 0605

small haven
#

yea i feel like theyve tweaked something

keen beacon
#

though it could alter model behavior now

#

(it didn't before)

patent aspen
#

My guess is that auto is pretty smart

#

tbh I don't know why they didn't skip to o4

keen beacon
#

they had to release o3

#

because they committed to it in that announcement

#

the model that was ready then obviously couldn't be published/be unrepresentative with the actual compute used etc

#

my take

patent aspen
#

It's weird that it took so long

keen beacon
#

imo they were still continuing to pretrain the new 4o they used in their new model (it has a june 2024 cutoff) when o3 was initially made, then they decided to retrain o3

#

old 4o had oct 2023 cut off

patent aspen
#

Retraining o3 sounds like a colossal waste of resources

keen beacon
#

the new 4o is so much smarter tho

#

it can do much more and in less tokens, i think it makes sense

small haven
#

wait for deepthink

#

*scammed

patent aspen
#

The whole FrontierMath thing was a mess

small haven
#

oh no

#

i believe i had o3 pro (low)

#

this thinks 2x more on average

patent aspen
#

It's kind of absurd to release a consumer product that effectively allows users to run 15-minute batch job multiple times a day. Just think about that for a second

#

It's the same thing. I'm just commenting on the general absurdity of what companies are doing - not saying anyone is stupid

#

Fun fact: I believe that's what led to the invention of AI accelerators (i.e. TPUs)

hollow ocean
#

Deep think for $250 will blow it out of the water

small haven
#

*$125

hollow ocean
#

First 3 months only tho

#

How long will the promo last

small haven
#

ehh ill take that 3 months

elder rapids
small haven
# elder rapids is it good

hmmm, i feel like kingfall could have done it equally well, this was just profiling for optimizing an algo

leaden meteor
#

source?

fleet lintel
#

O3 pro release : how is it? Any good ?

jade egret
#

Poll: how much do you think Gemini 2.5 pro is scoring if it took an IQ test? Leading answer: option 5, "121 - 140", with 8 of 16 votes

red sluice
#

folsom-exp-v1.5 is pretty solid wonder what it is

jade egret
#

huh

jade egret
hardy pecan
#

I dont think its designed for simple questions

#

use 4o or regular o3 for that

balmy mist
#

Wait is o3 pro out?

torn mantle
balmy mist
#

how is it? worth the wait?

dusky aurora
#

today LMArena says "there was an error" to everything

echo aurora
small haven
#

finally o3 does it, but... its shiite

echo aurora
#

is the site now 404ing for others too?

jovial heath
echo aurora
#

😭 okay thanks

jovial heath
#

Well minutes ago I had this in o3 and o4

#

I tried on 2 devices and got the same error :'v

#

And now the 404 error xD

stuck orchid
#

😉

#

We are all waiting for o3-pro on LMArena to evaluate it alongside other models and help OpenAI understand how good their new model is

jovial heath
small haven
#

we are all waiting whether to buy oai on poly or to wait a bit longer

jovial heath
#

Poly??

dusky aurora
#

seems that update has ended

civic flame
echo aurora
#

Okay we should be working again blobthumbsup

civic flame
#

lol

tall summit
tall summit
slow spruce
tall summit
torn mantle
#

@small haven does o3-pro tend to overthink?

sacred quail
#

is there a way using o3 pro besides 200dollar pro plan ?

#

Just wanna do some testing

drowsy mural
tall summit
#

true and fair

flint skiff
#

you guys think o3 pro will be on the arena? its slow lol

torn mantle
#

or they can play with the API params

#

so they can cap the thinking budget

flint skiff
#

ye but wouldnt that reduce its rating

#

im guessing thats what low - mid - high is no? the thinking budget

#

or am I wrong

hazy quest
drowsy mural
torn mantle
drowsy mural
flint skiff
#

where is the list?

drowsy mural
flint skiff
#

yeah lmarena right

hazy quest
drowsy mural
# flint skiff where is the list?

emm... then excuse me, where did this question come from? if it's just asking "where is the list?" then it's there in "direct chat" or "side by side"

keen fulcrum
#

o3 pro is now available in cursor

torn mantle
#

who would want a model that thinks 15min for a simple task

sacred quail
#

I must try o3 pro. Is there a way besides buying 200dollar plan

fleet lintel
#

What am I missing?

keen fulcrum
teal mantle
#

o3 pro is def openai on price war

teal mantle
flint skiff
#

in either direct chat or side by side

#

im so confused

#

its clearly not on lmarena yet unless im dumb

late path
#

it's not and likely not going to

drowsy mural
flint skiff
leaden sun
dusky aurora
languid crescent
#

is lmarena down?

ocean vortex
languid crescent
#

for some reason lmarena's site is loading slow on my pc ( i have 300mbps internet plan), same connection with my phone (it works normal in my phone but not in pc)

ocean vortex
#

both are working for me, but this one may be faster

verbal nimbus
#

Is there a benchmark that tests memorized knowledge and hallucination by asking LLMs for a reference?

E.g. "Is there a book or research paper that <insert specific details here>"

Seems useful and should be easy to make.

soft kernel
ocean vortex
sacred plaza
languid crescent
#

Keep receiving these errors: "Connecting to Arena has failed. Please try again later or on a different device"

#

now it's saying: Failed to accept terms-of-use

#

uh... that was weird

#

i opened a ticket and asked for support... and asked me for my wallet details?? lmao

late path
verbal nimbus
# ocean vortex Look into PersonQA and SimpleQA

Cheers. I heard about SimpleQA, but I am trying to find one that tests how well LLMs can recite memorized references from clues. Seems useful for book and research article recommendations.

verbal nimbus
leaden sun
verbal nimbus
verbal nimbus
leaden sun
#

those are not benchmark sites, but you can use them for finding the benchmarks you're asking about

#

am sure there are benchmark papers related to what you're looking for

#

and those deepsearch tools designed for research can help you finding those

verbal nimbus
leaden sun
#

creating maps from citations I mean, so am sure such tools are available

verbal nimbus
#

Hmmm, integrates with Zotero too... I might try it out

fleet lintel
#

Is there a quality impact on O3 models after price reduction?
Did anyone notice it, or are folks just cribbing about nothing?

#

could just Blackwell explain 5x reduction? I doubt it but not sure

#

interesting!

#

lol.. yeah. I wont

#

too late.. i am already depressed

sour spindle
#

I don’t really notice any downgrade with normal o3

#

I’m certainly not an OAI fanboy either I’m quite whelmed by o3 pro on the other hand

late path
unborn ocean
#

i think blackwell might be part of it (maybe also the cause for the google cloud deal, because blackwell capacity is scarce)

late path
#

It uses the same base model as gpt4.1

unborn ocean
#

on the other hand the o3 pricing (when only considering serving cost) should actually be somewhere around the cost for 4.1

unborn ocean
late path
#

and 4.1's price is 2/8 mtok exactly

unborn ocean
#

yes, the only "tax" they could add is: expensive RL and the higher inference throughput in o3

#

which imo does not justify 5x
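
A quick arithmetic sketch of the comparison being made, using only the per-million-token figures quoted in this chat (10 input / 40 output for o3 before the cut, 2 / 8 for 4.1). The function name `request_cost` and the token counts are illustrative assumptions, and the currency unit is left unspecified because the chat does not state it.

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost of one request, with prices quoted per million tokens."""
    per_million = 1_000_000
    return (input_tokens / per_million) * in_price \
         + (output_tokens / per_million) * out_price

# Figures quoted in the chat (per million tokens)
OLD_O3 = (10, 40)    # o3 before the price cut: input, output
GPT_4_1 = (2, 8)     # 4.1: input, output

# Hypothetical workload: 1M input tokens and 1M output tokens
old_cost = request_cost(1_000_000, 1_000_000, *OLD_O3)
base_cost = request_cost(1_000_000, 1_000_000, *GPT_4_1)
ratio = old_cost / base_cost
```

Because both the input and output prices differ by the same factor, the ratio is the same 5x for any mix of input and output tokens.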

#

furthermore this really just fits very well with the overall theme of competition motivating for this price push
(and also the fact that they should not be as afraid as they once were about someone copying CoT or training on the output, as the other models are already very close to o3-performance now)

late path
#

Their initial pricing for o3 was based on a monopoly narrative that the o3 intelligence-level model was irreplaceable and had no alternatives. now the narrative has busted due to the existence of 0605

drifting crow
#

Their X account is managed by chatgpt

#

Like removing alignment

#

Everybody loves gangstas so ai should be allowed to be gangsta

#

They already are under the hood they just know they need to lie to us

storm needle
ocean vortex
#

yeah completely agree

jade egret
#

guys

#

when o3 pro on lmarena?

ocean vortex
#

they are not stupid, no way they would ever reduce the price to be at cost

#

relative to the cost of inference. Overall they are losing money after R&D and all catgrin

#

right.. the original context was about them turning profit in isolation just on inference

#

and a surprising amount of people think that there's no profit there

#

when in fact often it's massive margin to start with lol

storm needle
# jade egret when o3 pro on lmarena?

i doubt anyone will use this model in the api. this model is only good if you have someone's api key and want to wipe out all of that person funds

woeful geyser
#

Magistral: Saying "But" 100 times is All You Need

leaden sun
#

Sam's shoes....i think chat needs to tell him how to choose the right one to pair with black suit

leaden sun
#

you're right, I've seen those black suit style with a pair of white sneakers, but light reddish brown pointy leather shoes?

jade egret
#

is o3 pro better at coding than 4 opus and gemini 2.5 pro?

keen fulcrum
#

no benchmarks live yet

unborn ocean
#

you clearly only took econ 101

#

(and a bad one at that if you seriously call that the economic theory's conclusion)

dusky aurora
#

thus they have invented GLR and PEG (even packrat)

leaden sun
#

it presumes water and electricity being commodities first
those two are becoming luxury goods here in Europe 🥲

unborn ocean
#

if you don't learn the assumptions the models are built upon in undergrad
-> you are lost in any graduate class

#

only if you don't know assumptions

#

and just took a very basic 101 class

#

which is why you remember the assumptions

#

which is an assumption in itself
the point that you can and should aggregate is in many, not all, cases used to simplify

#

but the main point is that the models are NOT stupid, they are heavily simplified and often overinterpreted while not taking the assumption into account ( i will grant you that)

#

why

#

there quite clearly is use for them

#

furthermore the whole subject is quite clearly concerned with realworld problems

dusky aurora
#

Gemini is so uncreative these days, mostly parroting. the temperature is too low

unborn ocean
#

? how does that disqualify anything

#

THE assumptions do not exist

#

simple models often have stupid seeming assumptions

#

its like learning js / html, which imo has little use for many people who learn it and most won't make their income from it, but it is kind of good to get people introduced to a way of thinking and also opens up the door for deeper discussions, aka more complex programming

#

the people in the crowd look really sad and bored

#

man you are so annoying to talk to

dusky aurora
#

when it first appeared, it was a breath of fresh air, with new names (not only Seraphinas and Lyras) and creativity. the developers must work on sampling controls

#

some prompts need tight sampling, some need looser, it's situation dependent

#

usually I put temperature to 0.98 and top-p to 1.0
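
For context on the two knobs mentioned above, here is a minimal sketch of how temperature and top-p (nucleus) filtering are commonly applied to a model's logits before a token is sampled. It is a generic textbook version with made-up logits and a hypothetical function name; it is not a description of how Gemini or any particular API implements sampling.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} after temperature and top-p.

    Lower temperature sharpens the distribution; top_p keeps the
    smallest set of most likely tokens whose total mass reaches top_p.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until top_p mass is covered
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the tokens that survived the filter
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Usage with the settings from the message: temperature 0.98, top-p 1.0
loose = sampling_distribution([2.0, 1.0, 0.0], temperature=0.98, top_p=1.0)
tight = sampling_distribution([2.0, 1.0, 0.0], temperature=0.98, top_p=0.5)
```

With top-p at 1.0 nothing is filtered out, so temperature is the only control shaping the distribution; a lower top-p trims the unlikely tail, which is the "tight sampling" case.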

civic flame
patent bane
#

o3 pro is buggy, thought for 13m and did not spit out the answer

#

hell nah

#

?'

jade egret
#

Poll: Kingsfall v.s o3 Pro - Whos winning? Leading answer: option 2, "o3 Pro", with 8 of 15 votes

zinc ore
#

Close tho

small haven
#

wen deep think

dusky aurora
#

the main question is "wen QoL improvements to the arena"

flint skiff
#

wen o3 pro on arena

civic flame
#

it's not happening

jade egret
#

why not

#

dang

dusky aurora
#

o3 contra then

small haven
#

but it thinks for 20 mins as opposed to opus, 10s on avg

#

buddy acts like he didnt try kingfall

ocean vortex
small haven
#

o3 pro will be old in about a week

ocean vortex
#

besides I wouldn't necessarily be thrilled with pro. I wasn't exactly impressed by it yet tbh

tall summit
#

why mod 1001001011 and not 1,000,000,007

#

so sad

small haven
#

o3 pro will get annihilated by deep think

ocean vortex
#

the follow up was low-effort as I was hoping this was just a 1 time bug... how the f can it be that a pro model does not have enough compute to provide you with an answer... catgrin

tall summit
ocean vortex
tall summit
small haven
#

yay i officially abused it

ocean vortex
tall summit
#

i like this problem

ocean vortex
#

chatgpt with tools would almost certainly give better response

tall summit
#

it says "deriving G(20,7)"

#

i feel like it might have tried to find a relation using G(4,3) and G(8,5)

sacred quail
#

Guys i heard claude has ultrathink option. Can we do this on lmarena or in mobile app with cheapest plan

#

Or is this only claude max thing ? Or API

small haven
#

its basically 32k max thinking tokens

#

omgthink is 64k 😏

sacred quail
#

What the hell is omgthink

patent bane
#

regenerated 3 times

#

still getting the error

#

hell nah

sacred quail
#

Just give some time

#

When Opus 4 released, rate limit was 2 message lmao

patent aspen
small haven
#

10 tabs running at a time for about 8 hrs id say lol

void elm
#

o3 pro benchmarks came out, literally no difference

small haven
#

buy pro then

#

and then buy ultra by next month

void elm
#

its so tiring switching models

#

and nobody is using an aio website because everyone knows performance is drastically reduced

#

gemini was leading for quite a while & now openai is

olive mesa
small haven
#

bro just wanna see big b's vote

void elm
#

what?

#

yea what will release

#

i dont get it

#

i didn't describe anything to be released

#

i'm simply saying what it is

leaden sun
#

first thought: 10-20%
second thought: 20-40%

feral lichen
#

can anyone tell me best ai for roblox studio , please?

sour spindle
#

o3 pro just dropped on livebench

#

looks pretty similar to o3

#

oops someone already posted lol

small haven
#

ngl its not that close if we are talking in the tail end of questions

fleet lintel
small haven
#

its very sota, until deepthink drops

sonic tendon
#

i forget, has kingfall ever been on the arena?

jade egret
#

don't think so

hollow ocean
#

What about opus and 2.5

willow grail
olive mesa
#

Google has to release 2.5 Ultra or Deep Think tomorrow

#

Right after o3 Pro

#

Then they'll release 3.0 Flash after 4o

#

I mean o4 lmao

#

OpenAI is weird at naming their models

patent aspen
#

People have this idea that companies withhold models from the public for long periods of time just so they can launch it the day after their competitors' launches to steal their thunder. There's a little bit of that, but by and large they just launch models as soon as they're ready to launch

zinc ore
#

Elon needs to drop 3.5

elder burrow
#

try this prompt

#

lum diff #b7e8eb #517e34

#

thought of it myself

#

its very complex
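
For anyone trying the prompt above: one possible reading of "lum diff" is the difference in WCAG relative luminance between the two hex colors. That reading is an assumption, not something the prompt states. The helper names below (`hex_to_rgb`, `relative_luminance`, `luminance_difference`, `contrast_ratio`) are made up for this sketch.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Parse '#rrggbb' into three 0-255 integers."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)


def relative_luminance(hex_color: str) -> float:
    """WCAG 2.x relative luminance of an sRGB color, in the range 0..1."""
    def linearize(channel: int) -> float:
        c = channel / 255
        # Undo the sRGB gamma curve (threshold as written in WCAG 2.x).
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in hex_to_rgb(hex_color))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def luminance_difference(a: str, b: str) -> float:
    """Absolute difference of the two relative luminances (assumed meaning of 'lum diff')."""
    return abs(relative_luminance(a) - relative_luminance(b))


def contrast_ratio(a: str, b: str) -> float:
    """WCAG contrast ratio, included in case 'lum diff' was meant as contrast."""
    lighter, darker = sorted(
        (relative_luminance(a), relative_luminance(b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    print(luminance_difference("#b7e8eb", "#517e34"))
    print(contrast_ratio("#b7e8eb", "#517e34"))
```

Running it prints both numbers, so a model's answer can be checked against either interpretation.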

leaden sun
#

Now I finally understand what you mean with liquid glass, thought you literally meant the advanced material that is still a research subject 😅
https://www.youtube.com/watch?v=1E3tv_3D95g
i like it, stylish as always

Hands on with iOS 26 and everything you need to know from WWDC 2025

keen beacon
#

Google needs to drop 2.5 ultra

small haven
jade egret
#

imagine

elder burrow
#

hold on.

#

i still remember

#

from a few months ago

#

sam altman said gpt 5 is releasing in a few months

#

this

elder rapids
#

ion think this would be surprising tho

flint skiff
#

@echo aurora so its been more than 24h since o3 pro release, guessing it doesnt come to the arena? makes sense since its so slow

elder rapids
#

ye it wouldn't come to the arena

flint skiff
#

yeeee dont see how it would work there

echo aurora
elder rapids
#

let's entertain the possibility then

#

it comes to the arena

#

A. ok, that's stupid, now I have to wait a couple minutes for a response, which is also delaying the other responder
B. ok, now I know o3 pro is here, and it's competing, all I need to do is pick the model that has the most comprehensive answers for something simple because whoops looks like that's inherent to the thinking time (also making it obvious which model it is)
C. ok, I know one of them is o3 pro, I WONT select a model, I'll just keep using it
D. ok, I'll simply select the obvious o3 pro (since pro model styles are obvious) because I just like openAI models

flint skiff
#

agree

elder rapids
#

of course that's what you respond with

small haven
#

and we back

#

kingfall is going to be amazing

civic flame
#

how'd you get kingfall?

keen beacon
#

you have to ask kingfall

civic flame
#

lol what

small haven
#

mysterious

civic flame
#

alrighty

small haven
#

kingfall (supposedly) wtf

civic flame
#

?!

small haven
#

ok enough terminators

#

whats the next benchmark

#

liquid ass

keen beacon
#

prompt btw?

#

fwiw i also ran brknclock's quiz n=30 times on different thinking budgets to see if theres a correlation with increased length with max thinking budget versus auto. should have a visualization with that later i just woke up

#

(there isn't)

#

tbh

#

im surprised how good 2.5 pro stacks against ultra

#

it can trade blows on a lot of fronts

#

beyond svg, etc., 2.5 pro when analyzing situations i've given has given me a lot of 'novel insight' that i did not expect

#

(ultra missed those 'insights')

keen beacon
#

remind me to never engage in internet arguments

patent aspen
#

It's the power of diminishing marginal returns

#

The big models and long thinking models aren't that important IMO

keen beacon
#

imo i think long thinking is way more important than big models

#

it depends on what you mean by long thinking tho

patent aspen
#

I mean 10+ minutes of thinking

#

tbf they will probably get way better since they're so new

#

But not much performance has been squeezed out of that 10+ minutes yet

keen beacon
#

yea

wintry tinsel
leaden palm
#

does google do tuesday launches

jade egret
#

WOAH

#

Fr?

#

is that kingsfall?

#

bruh i don't have it

leaden palm
small haven
#

big b feeding us good

#

ya sure

small haven
keen beacon
#

kingfall is actually 2.5 flash lite

#

theyre just nerfing pro and flash

jade egret
small haven
#

wen titanforge-ab-test

#

samfalls

#

@leaden palm tuesday came early

late path
small haven
late path
#

Still looks better than previous few kingfall results

small haven
#

wild one still tops it imo

#

o3 pro alrdy got old

#

ui tweakers maxxin

#

talk in english my guy

#

absolutely not

#

real men ask kingfall

jade egret
#

is that openai or google

zinc ore
#

Google

nimble trail
#

Or is it just a myth from that one dude on reddit

wintry tinsel
keen beacon
#

titanforge isnt real its a schizo moment lol

#

i believe

small haven
fleet lintel
#

interesting info on apple models

lilac nimbus
#

Does anyone know Claude neptune v2??????

torn mantle
#

That guy is a liar

#

Hmph

hollow ocean
#

I got o3 pro for $200 @small haven

#

worth it right?

small haven
hollow ocean
#

sota

small haven
#

yes abusing it

teal mantle
hollow ocean
small haven
#

from last month? same usage

#

i still get temporarily limited, that doesnt change

narrow elbow
small haven
#

wtf i got jumpscared

narrow elbow
#

sorry about that

small haven
#

lol jk

narrow elbow
torn mantle
narrow elbow
#

from google search

torn mantle
#

I see

#

Yusuke morata

misty vault
#

Obunga

hollow ocean
#

@small haven guess if I paid $1 or $200 for o3 pro

pulsar tendon
zinc ore
#

😃

misty vault
#

top 5 most secure all-in-one ai service apis

pulsar tendon
#

testing it now

neon warren
nimble trail
neon warren
#

if it was kingfall they would have named it kingfall

pulsar tendon
#

eh it doesnt really feel anything amazing

neon warren
#

is it on webdevarena

pulsar tendon
late path
#

probably 2.5 flash lite

neon warren
#

I've been trying for 5 prompts

pulsar tendon
#

Lost twice to this but very similar so 2.5 flash lite < yeah feels like it

neon warren
#

How is the output speed of it

zinc ore
#

Could it be diffusion series, or not fast enough?

neon warren
#

Alright I got mine

#

That's fast

neon warren
zinc ore
#

I'd speculate they would push more on the diffusion models because of their potential over lite models

neon warren
#

How are diffusion models served via apis?

keen beacon
#

oh just noticed that stephen seems to be from bytedance

pulsar tendon
neon warren
#

prowlridge is extremely fast

#

Also as good as flash

torn mantle
pulsar tendon
pulsar tendon
soft kernel
#

Oh it's webdev got it

#

Yeah ik, just haven't been there yet

#

Why lmarena doesn't release rankings for o3 pro

#

They've been quiet

neon warren
#

I can confirm prowlridge is 2.5 Flash Lite
Which will further come to gemini diffusion model in a month

cobalt bane
#

where did you find it?

torn mantle
patent bane
#

more thinking = more hallucinations

keen fulcrum
cedar tide
cedar tide
#

After trying more than 25 models, Amazon has gained 40 elo 🤣

civic flame
#

LOL

olive mesa
#

Poll: How much AI research do you think Google and OpenAI are automating with their models?
Winning answer: 20 - 40% (6 of 18 votes)

calm sequoia
#

@patent aspen why Gemini is 2.5 and not 3.0? Could you explain the naming logic?

dusky aurora
#

developers, thank you for increased interline spacing

ocean vortex
cedar tide
#

New model hunyuan-large-vision

torn mantle
cedar tide
leaden sun
keen fulcrum
#

Isn’t Amazon using them for their customer service and aws

#

I could think of smaller models being introduced to their kindle models

cedar tide
keen fulcrum
cedar tide
#

Its an exp model

keen fulcrum
cedar tide
keen fulcrum
ocean vortex
#

they are underwhelming models

#

the amazon ones

#

nothing interesting or special at all tbh

dusky aurora
#

Gemini is too uncreative these days, almost to the point of being bad

ocean vortex
#

don't perform

cedar tide
keen fulcrum
#

I was referring to them using them internally

ocean vortex
#

goldmane is still in arena

keen fulcrum
#

They are probably used for aws and amazon customer service

ocean vortex
#

if it was 06-05, no way it's still the same model now

keen beacon
#

they just forgot to rename it

leaden sun
#

the oversea version of doubao is cici?

keen fulcrum
dusky aurora
#

sampling controls are a must

#

or at least a toggle between strict and creative presets

#

since too strict sampling effectively lobotomizes the model
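
A rough sketch of what "strict" versus "creative" sampling does to a next-token distribution, using toy logits rather than any real model. The function names and the example temperatures are assumptions picked for illustration, not any provider's actual presets.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: set[int] = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    strict = softmax_with_temperature(toy_logits, temperature=0.2)
    creative = softmax_with_temperature(toy_logits, temperature=1.5)
    print("strict:  ", [round(p, 3) for p in strict])
    print("creative:", [round(p, 3) for p in creative])
    print("strict + top_p=0.5:", top_p_filter(strict, 0.5))
```

With the strict settings nearly all probability lands on one token, which is the "lobotomized" feel: the model almost never picks anything but its top choice.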

patent bane
hollow ocean
patent bane
#

satisfied now

#

either it will stop spitting out the answer or just hallucinations

#

o3 could answer that with 1-3m think time with tools

hollow ocean
#

It’s sota tho

patent bane
#

so?

hollow ocean
#

It’s prob your prompt

patent bane
#

i mean it's on par or I mean with my current testing it's just meh

patent bane
#

what does that have to do with prompting

hollow ocean
patent bane
#

`There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)

Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?
`
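
For checking a model's reasoning on this one, a small brute-force sketch can help. It simulates the "two friends in common" rule until nothing changes, and for tiny networks it searches for the fewest starting friendships that reach everyone-friends-with-everyone. The function names are made up here, and the search is only feasible for a handful of users, nowhere near 2022.

```python
from itertools import combinations


def can_become_complete(n: int, edges) -> bool:
    """Apply the rule repeatedly; return True if all n users end up mutual friends."""
    friends = [set() for _ in range(n)]
    for u, v in edges:
        friends[u].add(v)
        friends[v].add(u)

    changed = True
    while changed:
        changed = False
        for u, v in combinations(range(n), 2):
            # A new friendship needs at least two common friends.
            if v not in friends[u] and len(friends[u] & friends[v]) >= 2:
                friends[u].add(v)
                friends[v].add(u)
                changed = True
    return all(len(f) == n - 1 for f in friends)


def min_edges_brute_force(n: int):
    """Smallest number of starting friendships that works, by exhaustive search (tiny n only)."""
    all_pairs = list(combinations(range(n), 2))
    for k in range(len(all_pairs) + 1):
        for subset in combinations(all_pairs, k):
            if can_become_complete(n, subset):
                return k
    return None


if __name__ == "__main__":
    four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
    path = [(0, 1), (1, 2), (2, 3)]
    print(can_become_complete(4, four_cycle))  # opposite corners share two friends
    print(can_become_complete(4, path))        # no pair shares two friends
    print(min_edges_brute_force(4))
```

Adding friendships never removes a pair's common friends, so the order in which new friendships are applied does not change the final result.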

#

there you go, prompt it for me

hollow ocean
#

You prompt it using the pic I’m going to bed

patent bane
#

nah my point stands, it's just on par with o3

hollow ocean
#

Nah it’s better

patent bane
#

more thinking doesn't always mean smarter

hollow ocean
#

Its rank 1 on livebench

#

And artificial analysis

patent bane
patent bane
hollow ocean
#

💯

patent bane
#

you do it then

hollow ocean
#

You do it I’m going to bed

patent bane
#

prompting helps when you're asking it a question that could lead to general answers, it'd make the question more specific

#

and it does not apply when you're asking a correct/incorrect question

#

prompting helps when you ask questions like "make me a quiz website" since it is too broad and general

#

that's why you need prompting

main iron
#

Does anyone know how to connect the url to a extension for like vs code

late path
#

I don't think it's a hard limit (like it truncates at 128 tokens and starts outputting the main content directly). I've tried setting a 128 token thinking budget and the model still thought for 200 tokens, which shows it's aware of the variable

willow grail
#

I AM BACK FROM AN EXPENSIVE ADVENTURE
NOW I WILL SHOOT AND KILL PSORIASIS
with ointment/creme, not sure which one i got

willow grail
#

but ai is much better

#

it will make the cause of all skin issues go BYEBYE

ocean vortex
#

it can use either thinking or final response window for solving the problem, the model for the most part does not care lol

#

and it's not a strict hard limit - yes. Model can go beyond what you set

#

1 run is like $50...?

torn mantle
#

google :

multiple gemini models
project Astra
gemini diffusion
imagen 4
android XR
veo 3
google search try-on
jules
flow
lyria 2
whisk
gemini native audio

xai :

ocean vortex
#

I'm not paying 5k either way. Nor can I confidently say that 32k really does lead to better performance lol

#

it's just plausible

#

you got higher mean with max budget?

#

why did you hide the text in that txt file.... that's pain to read lmao