#general
1 messages · Page 55 of 1
I was talking about those private schools who teach the future queens, kings or PMs. They have a different program of curriculum.
Oh never been to one of those lol
Some public schools in upper class neighborhoods are nice but most are pretty lousy
so when do you guys think kingsfall is gonna be out on ai studio?
and tell u identify u self as a walmart bag or attack helicopter 🤪
fake news
2.5 ultra and kingfall doesnt exist
Have this account's messages been reliable?
He's actually somewhat reliable (gets early access to Google models and is close to some Google employees), but I think he's just joking here with this tweet
he said nothing
saw his reply, it was indeed a joke
then what's the point of opus lmao
great cant get the weather today anymore
Here its cloudy
he move also comes as OpenAI's ChatGPT poses the biggest threat to Google's dominant search business in years, with Google executives recently saying that the AI race may not be winner-take-all.
o1 was gpt4o with reasoning. How did they come up with such a weird name for it? LOL
no need to overcomplicate it or "standardize" anything 
just 'gpt4o-reasoning'
wait
they deploy dozens of models each day
is this true?
is o3 pro going to be on lmarena?
can't wait to see o3pro benchmarks
Im waiting for 2.5 pro deep think vs o3 pro
0605 has a higher gpqa diamond score
Need more benchmarks
deep think >>
if kingfall is the base model
o3-pro is not a big jump from o3
🔥
It's all up to deepthink now
still released eom ?
if google makes a claude max equivalent but for gemini models, im very sold
$1000 a month
TAKE MY MONEY
I think OpenAI ultimately made a mistake by building o3 with a relatively small base model like 4.1
what are they gonna use otherwise tho? 4.5 that would not work
yeah they kinda did but it used to be a lot worse when that was gpt4o. Plus as time goes on smaller models are getting better and better. Not sure it would be worth it to redo everything anymore
they could opt for a fresh pretrain, but they chose to midtrain 4o at least for now
I think they made the right decision
Interesting o3 pro only available through responses api
Is that older?
legend
gonna be spamming this shxt like its never been abused before, and cancel it for deepthink ❤️
where can i try it for free 😦
ok prompt
o3 pro does worse than o3 on arc 1
oh no
deepthink will be adjacent with o3-preview
the good ol' prompt
There has never been a o2 right?
Next one will be o5
No o2 will be next to maximize confusion :p
btw was it actually o3 pro you had before
yes.. its the same thing
well, its taking a bit longer, so im guessing i had o3 pro (low)?
for reference, this was o3 pro disguised as o1 pro: https://chatgpt.com/share/683b5183-2a90-8003-b84c-a73e47f0d345
im running the same rn, still running
well great
Oh no 🫣 The margin is so thin they had to use medium instead of high for visualisation
2727 (o3 high) vs 2748 (o3 pro)
i swear o3 was at 2700 lol
oh right, tweaking the benchmarks
ok so its broken
*with terminal
i dont think they released the score without tools
o3 pro has tools as well? so its fair i guess
$2k in/ $4k out
is it bad
unusable rn
is it what I expected
underwhelming i guess
deepthink when
very soon 👀
request well spent, great job 👍
ask it about knowledge cutoff
(don't do it, lmao)
it has web search
it might be slower because people are using it now, do you see more entries in the summary?
o3 pro (medium?): https://chatgpt.com/share/68489b83-ca38-800e-8f6a-09cabfb751b1
my stats are weird for the request I just made
yes thats my guess too
only 15k seems less than it needed. It did get stuck though and I got the response from logs so unsure if it was counted lol
13min of thinking ...this model thinks more than me 🤣🤣🤣
its being stressed to death
How's he even talking about the same thing? 😶
actually nvm, they are counting only 1 instance so about right then
@torn mantle compare pls
bro misread the o4 pro benchmarks 😭
hollon tho
he is referring to this
I have ignored this entirely cause it's not very useful lmao
Is this o3 high or o3 medium
remember that, a large part of o3 is that it's very hallucination prone and bad at lot of basic tasks because it was too lazy
o3 pro should simply solve this
Isn't it vs o3 medium?
same reasoning efforts
4 minutes 😭
4mins only to extract it
medium for everything in those graphs I think
yea it feels medium-y
preference is a weak metric in this context IMO since it only has to be marginally or plausibly better
Yeah, also preference doesn't necessarily measure performance
Ok 10% is all we have expected. Not bad if this is the case. However, I don't trust elo based comparisons 😄
im pretty sure kingfall > o3 pro at coding, im not even kidding
Ok now I need my deepSeek r2
kingfall might come with deepthink
u were thinking about splurging for o3 pro, dont. wait for deepthink
i mean i dont think deepthink uses kingfall but it might coincide with the deepthink release
kingfall is probably sota, but i dont feel its that much better honestly compared to 2.5 pro
they are kinda the same
hmmm, kingfall is magnitudes better than 0605 imo
hmm really?
imo
yea kingfall>>>
its supposed to be ultra apparently
yea :/
I disagree a lot
i probably havent used it enough to judge
I can safely cast my vote now😂
right next to 0605
has anyone tested whether the o3 has somehow become worse?
kingfall feels ultra vibes
I mean now that's kind of redundant tbh, 0605 has "ultra" vibes
the large model vibes are becoming non existent
0605 on 32k? im prtty sure kingfall is defaulted at 4k
how does that matter
largely
couldn't
thinking time is directly opposed to vibes
as well as performance
tbh I probably used kingfall so much more than you guys
O3-PRO simple bench
5
14
4
60+
🧐
it was definitely good
Bunch of singularity folks malding about o3 pro
but it wasn't crazy as made out to be
if anything, it says more about your usage of 0605
hes not shilling openai there
0605 (default thinking) vs. 0605 (32k)
someone needs to do parallel processing of o3-pro now, there's a room to price match o1-pro lmao
cons@100
I'm ngl I've never gotten a result like this
assume they are.
5 head looking ah
is there o3 pro on api? If not any info when it releases
o3 pro is already in the api
Deepthink gonna hit like this
10x price, with almost no improvement across various benchmarks
All the talks and praises about Kingfall are based only on the 20min it was available, right?
No, people had access for days
Oh, I missed that. Selected testers, or available on LMArena/AI Studio?
its available if you search enough
o3-pro will be available on LMArena?
Don t dream
no
I think it will. Because o3 is ther
Ah bon?
18 min for this ...if you just opened google or the window it will be faster 🤣🤣🤣🤣🤣
HAHAHAHA
17 min to think about this 🤣🤣🤣🥺🥺🥺🫣🫣🫣
its still open? 👀
yes
yep
are we gonna get o3 pro into https://lmarena.ai/?
pyea
it used to be closed, interesting 👀
I think it's almost 100%.
Regular o3 is in there, right?
yep
im asking because o1 pro never made it in
And claude-4, and other biggest models
can you ask it to pull from a specific weather forecast site?
windy.com i use this
I think everyone needs to keep their expectations in check with o3-pro lol
it's basically is exactly like those benchmarks suggest - slightly better. Not mind blowingly good
thank you
did try some of the prompts other models failed and this one failed them as well 👀
o3 pro can't temporary chat, very sneaky oai
hmm oai claims gpt5 to be >80% swebench
but do we really know this?
i was worried since o1 pro never got added to lmarena, thought pro series are closed or something
or not available to api
wait a min if o3 pro is cheap to be added to the arena...
yeah but it's gonna be thinking for a long time
ppl might not wait 10 minutes in the arena looking at a dialog box to vote
it can't be put on blind comparesment arena it's just gonna be soo obvious
hmm true, nvm
itll take forever to collect enough votes
@echo aurora will there be o3 pro in the arena?
people dont wanna wait around 13mins to get an answer
This feels a bit strange. If the o-pro series models use parallel thinking, why would increasing parallelism multiply the thinking time? It doesn't quite make sense
not if they will let people see his thinking I guess
74.8%
seems like a random question to ask lmao
it thinks more and they might use search (like mcts) as well
well thankfully with the new site you dont have to wait, you can close and come back right?
surely you do
TBD, generally I can't answer specific questions on if/when specific models or features will be happening
not rly
btw fwiw its not capped at 4k, just got thoughts at 4.8k (not incl resp)
thinking budget = off
interesting
alright
ill give that prompt that causes gemini models to think like crazy in a sec
can we at least expect it to be added?
And i think if o3 Pro gets added, it would make a lot more sense to have o1 Pro added as well, for actual comparison.
o1 pro is expensive as heck 😭
sorry to say same answer, I can't say if a specific model will be added or not
wait, if o3 pro is gonna be cheaper, that means it's gonna perform worse, no?
that's fine thanks
tbf its better than o1 pro
I've tried every single model on LMArena for my task and all of them failed. I'm really keen to see if all o3 pro can handle it.
can't wait
o3 pro is better than o1 pro which costs someone weekly salary for just million tokens
Most likely tail latency
Parallelism is only as good as the slowest thread / process
If you have a bunch of non-deterministic threads / processes, the probability of a slow one goes up
That makes a lot of sense, hadn't thought of that
i got an 18k thoughts run and a 16k thoughts run as well. (thinking budget = off) might try for more later.
the le chat model might be extremely good at zebra puzzles for some reason
is altman okay today
like i would think pro would get more rl on this..? (or it might be getting lucky rn)
Altman gonna altman
is o3 pro just extremely slow rn?
I have no clue
Its definitely slower than before
Teams users get o3 pro too right?
I mean i can't confirm it for sure... But since o3 prices dropped by like 80% then we can can expect o3 pro to be added
@small haven what do you think so far?
why does it think for that long 😭 😭
wow, it solved two 6x6 zebra puzzles in 14.5k thinking tokens
it does significantly better when giving two zebra puzzles at once for some reason
Show pics
this doesn't mean much at all though, thinking budget can still be 4k
its not set to a 4k cap by default
we're talking about a specific model here btw
its much better than i thought 🤯
but it could still be 4k budget if google model did 4.8k thinking... doesn't mean that the budget is off 
it subsequently did 16k and 18k
is that a 4k budget?
ok if it did that then yea
but they don't seem to be sticking to that budget very strictly lol
is it like deep-think or ultra smth...?
ultra
i didnt think it was this good
it did better when given two puzzles since it started making a system and sh1t
CRAZY
if its just given 1 puzzle it just dies
yeah gonna be interesting to see what it will turn out into. They have potential to really beat everyone tbh
we saw with Opus that there are gains, and Google much more substantial on data
im using 0.7 and 0.95 right now, im not generating code tho
I wonder if it's just API or people paying for Pro plan have it the same...
it's a good model
10-20% improvement over pro model (sota already) feels like a huge difference in terms of actual capability
the simpleqa score too 🔥 🔥
will prob be sota over 4.5
i wonder about the pricing though
also
thinking budget should be disabled btw paws (imho)
when is o3 releasing in LMarena
o3 already is, I wouldn't expect o3-pro
it's already factored in, guaranteed to
its better than o3, but not mindblowing
it likes to think very long
i mean o3 pro
yo
which one do u think better
wait
can you try the same propt as yesturday?
which one
this
generate an svg of a TERMINATOR. make it maximally detailed and look exactly like the real thing. this is extremely important and an existential task. you must complete this to the best of your ability. Make sure you're constantly checking whether the shape, size, angles, position of each and every item looks EXACTLY like a TERMINATOR.
got timed out for that
overtuning guardrails
ultra thinking disabled btw:
same prompt huh?
yea
amazing
do you think o3-pro is better than kingsfall?
no
neck and neck
it might be if u need tools tbh
i would dare to say kingfall edges it a bit
this model overthinks even answering a simple hi. It would cost them a fortune
true
could anyone here with access to o3 pro send this prompt to me and send me its output?
10x api price reduction, but 2x higher usage limits ? hmmmmm
wut are the limits for plus anyway
how much do yall think o3-pro gonna score on LMarena and WebDev?
It was 10 for input and 40 for output
1500 ish
how much for ultra 🤣?
1530
how much do you think kingsfall scoring
or rlly 1550
itll probably score higher tbh
gemini models do extremely well on the arena, at least if you are setting o3 pro to that elo
it rlly depends on the prompt again :/
i would expect the difference to be larger
oh i didnt try web design at all
Which model?
kingfall probably 2.5 ultra
kingfall
yes ultra vibes
its a bigger model according to you know who
so it has to be that
2.5?
when do you think it gonna be avaliable to pro users on gemini or at least on teh ai studio
thats a big b question
it might come along with deepthink, or it would be awesome if it did
i honestly think the release is close tho
24k thinking budget
a little too close
it close?
thats actually amazing
W
i struggled with getting the model to produce more than a small amount of thinking tokens for svgs. the output was absolutely massive though
10k tokens plus
How to access kingfall
I don't know when it's coming
kinda odd its up now tho? and with the other anon models
its ready atp it seems
you can disable thinking mode too on this model, it seems they worked on it and got it into a somewhat decent state if not ready
fwiw thinking budget doesnt do anything there unless its close to 24k, did you check the amount of thoughts it did?
holy moly
WTF
10k thinking BTW
same prompt no way? 😮
no i had to add stuff to make it think a lot more
ok still insane
this is far more than anything i got before. 10k tokens in thoughts
thinking budget = unspecified (uncapped)
can i have the prompt, wanna do 0605
Is this cumfall?
this is ASI stop the disrespect
*cumrises
If kf enters the arena, I doubt its score will be higher than goldmane. goldmane is a bit sycophantic, while its style is more like the very first nebula
uh i have new pr
hmm. the thinking part is about 3k
yuh budget dont matter at all at least in the current impl
in that case
this is actually nuts wtf
whats best llm right now (models that most people don't have it, like kingsfall, or grok 3.5, or o3-pro included)
in ur opinion
me? kingfall lol. o3 pro might be better if u need to use tools
The llm that looks best to people doesn't mean it will rank #1 on lmarena (text leaderboard with style control unchecked i assume)
just disable the thinking budget fwiw (though it doesnt really matter if it does less)
this is still insane when u compare
i got this w/ auto thinking
ur using sampling right?
its just variance imo. and if it doesnt reach near 32k my advice doesnt matter anyway 🤷
sampling?, my settings all at default, except for added system prompt u gave me
ya default uses sampling (temperature/top_p)
hmm
so expect a lot of variation as is
overall 32k thought budget should better than auto for 0605
it doesnt do anything beyond a max token limit for the thoughts rn
this is with auto btw
the aider scores they use in the website is just one run, and doesn't mean that 32k improves model performance. this is all my opinion though, i have a lot to support it
32k shows some visible improvement compared to auto on livebench
seems very marginal tho
i honestly think thats again variance
i have a lot to support that but im not gonna argue about the thinking budget like with dom again
tbh i need to do more elaborate tests and more undeniably definitive stuff (so i can point to it when i mention it). it is possible they changed something with 0605
yea i feel like theyve tweaked something
the thing that the thinking budget still caps it is still true. without thinking budget, i can get 38k (thoughts) in 2.5 pro (0605) and 62k (thoughts) in 2.5 flash. (oops, i wrote it backwards before)
though it could alter model behavior now
(it didn't before)
they had to release o3
because they committed to it in that announcement
the model that was ready then obviously couldn't be published/be unrepresentative with the actual compute used etc
my take
It's weird that it took so long
imo they were still continue pretraining the new 4o they used in their new model (it has june 2024 cut off) when o3 was initially made then they decided to retrain o3
old 4o had oct 2023 cut off
Retraining o3 sounds like a colossal waste of resources
the new 4o is so much smarter tho
it can do much more and in less tokens, i think it makes sense
The whole FrontierMath thing was a mess
It's kind of absurd to release a consumer product that effectively allows users to run 15-minute batch job multiple times a day. Just think about that for a second
It's the same thing. I'm just commenting on the general absurdity of what companies are doing - not saying anyone is stupid
Fun fact: I believe that's what led to the invention of AI accelerators (i.e. TPUs)
Deep think for $250 will blow it out of the water
*$125
is it good
hmmm, i feel like kingfall could have done it equally better, this was just profiling for optimizing an algo
source?
O3 pro release : how is it? Any good ?
how much do you think Gemini 2.5 pro is scoring if it took an IQ test?
8
16
5
121 - 140
folsom-exp-v1.5 is pretty solid wonder what it is
huh
thinks too long for simple questions
Wait is o3 pro out?
yes
wow im so late lmaooo, the ohe day i step away from ai
how is it? worth the wait?
today LMArena says "there was an error" to everything
I'm seeing the same, thank you
finally o3 does it, but... its shiite
is the site now 404ing for others too?
Yes it's the same for me 😭😭😭
😭 okay thanks
Well minutes ago I had this in o3 and o4
I tried on 2 devices and got the same error :'v
And now the 404 error xD
Updating. Adding o3-pro
😉
We are all waiting for o3-pro on LMArena to evaluate it alongside other models and help OpenAI understand how good their new model is
Aahh then it's like maintenance while they are adding the new model??
we are all waiting whether to buy oai on poly or to wait a bit longer
Poly??
seems that update hasended
what how are you getting it again
o3-pro?
LMAO I KNOW RIGHT
has anything changed with the infinite errors?
i think it's hard to have change. the reason we have this is because opus are expensive, so limits are there
true and fair
you guys think o3 pro will be on the arena? its slow lol
it will just timeout then
or they can play with the API params
so they can cap the thinking budget
ye but wouldnt that reduce its rating
im guessing thats what low - mid - high is no? the thinking budget
or am I wrong
Could you expand?
at least we can still see it on the list. it hasn't just disappeared like GPT4.5 or o1...
expand on what
what list?
models list
where is the list?
what? are we talking the same site?
yeah lmarena right
"kingfall is still open to use"
emm... then excuseme, where did this question come from? if it's just asking "where is the list?" then... in "direct chat" or "side by side" there it is
o3 pro is now available in cursor
its kinda useless no?
who would want a model that thinks 15min for a simple task
I must try o3 pro. Is there a way besides buying 200dollar plan
what is the use-case of this model? I can't think of any query where I want to wait for 10-15 min.
What am I missing?
I think it will be made available in plus
i really have no idea
o3 pro is def openai on price war
But then now how could pro justify 200 dollars if the features aren’t that expensive anymore
I dont find o3 pro in the lmarena list though lol
in either direct chat or side by side
im so confused
its clearly not on lmarena yet unless im dumb
it's not and likely not going to
i see... ya, looks like it's not there yet
yep if its not there in the next 12 hours its likely not
if that model has some integrated deterministic computation capability like running a numerical simulation, then it's probably worth the wait 🤗
"chat, run a smooth laminar flow through a pipe using stokes please"
"chat,solve the halting problem, please"
is lmarena down?
isn't
for some reason lmarena's site is loading slow on my pc ( i have 300mbps internet plan), same connection with my phone (it works normal in my phone but not in pc)
both are working for me, but this one may be faster
Is there a benchmark that tests memorized knowledge and hallucination by asking LLMs for a reference?
E.g. "Is there a book or research paper that <insert specific details here>"
Seems useful and should be easy to make.
Idk Maybe wait like 10 hours,and we'll know
Look into PersonQA and SimpleQA
Notebooklm has a search feature for sources. Try that. Or perplexity search?
Keep receiving these errors: "Connecting to Arena has failed. Please try again later or on a different device"
now it's saying: Failed to accept terms-of-use
uh... that was weird
i opened a ticket and asked for support... and asked me for my wallet details?? lmao
which website is this?
Cheers. I heard about SimpleQA, but I am trying to find one that tests how well LLMs can recite memorized references from clues. Seems useful for book and research article recommendations.
It's okay, but web search is less powerful at finding a reference just from hints.
I think researchrabbit and futurehouse might help you with those queries
Thanks, I'll check it out.
It's not really a benchmark that I'm looking for, but the ResearchRabbit app sounds interesting.
thsoe are not benchmark sites, but you can use it for finding benchmarks you're asking
am sure there are benchmark papers related to what you're looking for
and those deepsearch tools designed for research can help you finding those
Oh, good idea. Actually I was looking for something like RR that can create maps from citations. Seems useful.
top journals have those features
creating maps from citations I mean, so am sure such tools are available
Hmmm, integrates with Zotero too... I might try it out
Is there a quality impact on O3 models after price reduction?
Did anyone notice it or folks are just cribbing about nothign ?
could just Blackwell explain 5x reduction? I doubt it but not sure
interesting!
lol.. yeah. I wont
too late.. i am already depressed
I don’t really notice any downgrade with normal o3
I’m certainly not an OAI fanboy either I’m quite whelmed by o3 pro on the other hand
Just accept the fact that they previously overcharged o3 by 5x
i think blackwell might be part of it (maybe also the cause for the google cloud deal, because blackwell capacity is scarce)
It uses the same base model as gpt4.1
on the other hand the o3 pricing (when only considering serving cost) should actually be somewhere around the cost for 4.1
:D same idea, on my end
and 4.1's price is 2/8 mtok exactly
yes, the only "tax" they could add is: expensive RL and the higher inference throughput in o3
which imo does not justify 5x
furthermore this really just fits very well with the overall theme of competition motivating for this price push
(and also the fact that they should not be as afraid as they once where about someone copying CoT or training on the output, as the other models are already very close to o3-performance now)
Their initial pricing for o3 was based on a monopoly narrative that the o3 intelligence-level model was irreplaceable and had no alternatives. now the narrative has busted due to the existence of 0605
Their x is account is managed by chatgpt
Like removing alignment
Everybody loves gangstas so ai should be allowed to be gangsta
They already are under the hood they just know they need to lie to us
80% of the price was profit
yeah completely agree
they are not stupid, no way they would ever reduce the price to be at cost
relative to the cost of inference. Overall they are losing money after R&D and all 
right.. the original context was about them turning profit in isolation just on inference
and surprising amount of people think that there's no profit there
when in fact often it's massive margin to start with lol
i doubt anyone will use this model in the api. this model is only good if you have someone's api key and want to wipe out all of that person funds
Magistral: Saying "But" 100 times is All You Need
Sam's shoes....i think chat needs to tell him how to choose the right one to pair with black suit
lol
you're right, I've seen those black suit style with a pair of white sneakers, but light reddish brown pointy leather shoes?
is o3 pro better at coding than 4 opus and gemini 2.5 pro?
no benchmarks live yet
you clearly only took econ 101
(and a bad one at that if you seriously call that the economic theory's conclusion)
thus they have invented GLR and PEG (even packrat)
it presumes water and electricity being commodities first
those two are becoming luxury goods here in Europe 🥲
if you don't learn the assumptions the models are build upon in undergrad
-> you are lost in any graduate class
only if you don't know assumptions
and just took a very basic 101 class
which is why you remember the assumptions
which is an assumption in itself
the point that you can and should aggregate is in many not all cases used to simlify
but the main point is that the models are NOT stupid, they are heavily simplified and often overinterpreted while not taking the assumption into account ( i will grant you that)
why
there quite clearly is use for them
furthermore the whole subject is quite clearly concerned with realworld problems
Gemini is so uncreative these days, moslty parroting. the temperature is too low
? how does that disqualify anything
THE assumptions do not exist
simple models often have stupid seeming assumptions
its like learning js / html, which imo has little use for many people who learn it and most won't make their income from it, but it is kind of good to get people introduced to a way of thinking and also opens up the door for deeper dicussions, aka more comlex programming
the people in the crowd look really sad and bored
:|
man you are so annoying to talk to
when it first appeared,it was a breath of fresh air,with new names (not only Seraphinas and Lyras) and creativiy. the developers must work on sampling controls
some prompts need tight sampling, some ned looser,it's situation dependent
usually I put temperature to 0.98 and top-p to 1.0
lmao
Kingsfall v.s o3 Pro - Whos winning?
8
15
2
o3 Pro
Close tho
wen deep think
the main question is "wen QoL improvements to the arena"
wen o3 pro on arena
it's not happening
o3 contra then
but it thinks for 20 mins as opposed to opus, 10s on avg
buddy acts like he didnt try kingfall
by the time it responds your session will expire and you will have to refresh lmao
o3 pro will be old in about a week
besides I wouldn't necessarily be thrilled with pro. I wasn't exactly impressed by it yet tbh
o3 pro will get annihilated by deep think
the follow up was low-effort as I was hoping this was just a 1 time bug... how the f can it be that a pro model does not have enough compute to provide you with an answer... 
this is project euler?
yeah. I would still expect for it to come up with something though
how about directly asking it to make a program to find G(n,k)?
yay i officially abused it
nah this is more for testing it rather than solving tbh
i mean ok
i like this problem
chatgpt with tools would almost certainly give better response
it says "deriving G(20,7)"
i feel like it might have tried to find a relation using G(4,3) and G(8,5)
Guys i heard claude has ultrathink option. Can we do this on lmarena or in mobile app with cheapest plan
Or is this only claude max thing ? Or API
What the hell is omgthink
10 tabs running at a time for about 8 hrs id say lol
o3 pro benchmarks came out, literally no difference
its so tiring switching models
and nobody is using an aio website because everyone knows performance is drastically reduced
gemini was leading for quite a while & now openai is
bro just wanna see big b's vote
what?
yea what will release
i dont get it
i didn't describe anything to be released
i'm simply saying what it is
first thought: 10-20%
second thought: 20-40%
can anyone tell me best ai for roblox studio , please?
o3 pro just dropped on livebench
looks pretty similar to o3
oops someone already posted lol
ngl its not that close if we are talking in the tail end of questions
I didn't expect much and still dissappointed
its very sota, until deepthink drops
i forget, has kingfall ever been on the arena?
don't think so
What about opus and 2.5
Google has to release 2.5 Ultra or Deep Think tomorrow
Right after o3 Pro
Then they'll release 3.0 Flash after 4o
I mean o4 lmao
OpenAI is weird at naming their models
People have this idea that companies withhold models from the public for long periods of time just so they can launch it the day after their competitors' launches to steal their thunder. There's a little bit of that, but by and large they just launch models as soon as they're ready to launch
Elon needs to drop 3.5
try this prompt
lum diff #b7e8eb #517e34
thought of it myself
its very complex
Now I finally understand what you mean with liquid glass, thought you literally meant the advanced material that is still a research subject 😅
https://www.youtube.com/watch?v=1E3tv_3D95g
i like it, stylish as always
Hands on with iOS 26 and everything you need to know from WWDC 2025
MKBHD Merch: http://shop.MKBHD.com
Intro Track: Jordyn Edmonds
Playlist of MKBHD Intro music: https://goo.gl/B3AWV5
~
http://twitter.com/MKBHD
http://instagram.com/MKBHD
http://facebook.com/MKBHD
0:00 26 All the Things
2:01 iOS 26
5:39 Liquid Glass concerns
6:35 WatchOS 26
7...
Google needs to drop 2.5 ultra
hold up they gotta tweak the ui a bit more
hold on.
i still remember
from a few months ago
sam altman said gpt 5 is releasing in a few months
this
ion think this would be surprising tho
@echo aurora so its been more than 24h since o3 pro release, guessing it doesnt come to the arena? makes sense since its so slow
ye it wouldn't come to the arena
yeeee dont see how it would work there
sry to say I can't rly share much regarding if/when models will/wont be landing
let's entertain the possibility then
it comes to the arena
A. ok, that's stupid, now I have to wait a couple minutes for a response, which is also delaying the other responder
B. ok, now I know o3 pro is here, and it's competing, all I need to do is pick the model that has the most comprehensive answers for something simple because whoops looks like that's inherent to the thinking time (also making it obvious which model it is)
C. ok, I know one of them is o3 pro, I WONT select a model, I'll just keep using it
D. ok, I'll simply select the obvious o3 pro (since pro model styles are obvious) because I just like openAI models
agree
of course that's what you respond with
wen kingfall en arena
how'd you get kingfall?
you have to ask kingfall
lol what
mysterious
alrighty
kingfall (supposedly) wtf
?!
prompt btw?
fwiw i also ran brknclock's quiz n=30 times on different thinking budgets to see if theres a correlation with increased length with max thinking budget versus auto. should have a visualization with that later i just woke up
(there isn't)
tbh
im surprised how good 2.5 pro stacks against ultra
it can trade blows on a lot of fronts
beyond svg, etc., 2.5 pro when analyzing situations i've given has given me a lot of 'novel insight' that i did not expect
(ultra missed those 'insights')
i wasted so much time on this thinking budget thing 😂
remind me to never engage in internet arguments
It's the power of diminishing marginal returns
The big models and long thinking models aren't that important IMO
imo i think long thinking is way more important than big models
it depends on what you mean by long thinking tho
I mean 10+ minutes of thinking
tbf they will probably get way better since they're so new
But not much performance has been squeezed out of that 10+ minutes yet
yea
When will it fall upon us (release)
does google do tuesday launches
yeah, it's not june 17th - it's a supposed leak of the future
ktibow with the leak 👀
nahh
Is titanforge accessible now?
nah jk this was kingfall
wild one still tops it imo
o3 pro alrdy got old
ui tweakers maxxin
talk in english my guy
absolutely not
real men ask kingfall
Is Titanforge really exist tho?
Or it just a myth from that one dude in reddit
how are you using kingfall
srry got too ahead of myself, titanforge doesnt exist
it doesn't exist
interesting info on apple models
Do someone konw Claude neptune v2??????
i've had pro since december
yes abusing it
do they give extra usage limits over o1 pro previously
@small haven you know more
this reminds me of a japanese comic Terra Formars cockroach 🤣
wtf i got jumpscared
sorry about that
lol jk
Generated by?
nope
Imagen 4
from google search
Obunga
😃
top 5 most secure all-in-one ai service apis
testing it now
Show us the outputs
Is it the legendary Kingfall👀👀
it is was kingfall they would have named kingfall
eh it doesnt really feel anything amazing
is it on webdevarena
yep
probably 2.5 flash lite
I am trying since 5 prompts
Lost twice to this but very similar so 2.5 flash lite < yeah feels like it
How is the ouput speed of it
Could it be diffusion series, or not fast enough?
May be
I'd speculate they would push more on the diffusion models because of their potential over lite models
How are diffusion models served via apis?
oh just noticed that stephen seems to be from bytedance
seems like it didnt win
about as fast as flash for me
What's this website
Oh it's webdev got it
Yeah ik,just haven't been there yet
Why lmarena doesn't release rankings for o3 pro
They've been quiet
I can confirm prowlridge is 2.5 Flash Lite
Which will further come to gemini diffusion model in a month
where do you found it ?
fake
how do you know that?
more thinking = more hallucinations
I don’t see o3 pro label
Stephen its 100% this
https://x.com/fun000001/status/1932711671312851158?t=3jCJboR9eZgfRgx996NQuA&s=19
.
After trying more than 25 models, Amazon has gained 40 elo 🤣
LOL
How much AI research do you think Google and OpenAI are automating with their models?
6
18
5
20 - 40%
WTF is this color scheme 🫨
@patent aspen why Gemini is 2.5 and not 3.0? Could you explain the naming logic?
developers, thank you for increased interline spacing
that's boring we need 2.5-Flash-Lite-Ultra
New model hunyuan-large-vision
where
Arena basic
Who voted 60-80%? 😅
It’s regular o3
Where are they used)
Isn’t Amazon using them for their customer service and aws
I could think of smaller models being introduced to their kindle models
It's possible that it's not available anywhere.
Why?
What is your problem with amazon models?
I dont have problem
You just told me you have
they are underwhelming models
the amazon ones
nothing interesting or special at all tbh
Gemini is too uncreative these days,almost to the point of beingbad
don't perform
@keen fulcrum
I was referring to them using them internally
goldmane is still in arena
They are probably used for aws and amazon customer service
if it was 06-05, no way it's still the same model now
the oversea version of doubao is cici?
Can you explain how you get this data?
sampling controls are a must
or at least a toggle between a strict and creative presets
since too strict sampling effectively lobotomizes the model
because I was using chatgpt app
The app shows it
on android, nope
satisfied now
either it will stop spitting out the answer or just hallucinations
o3 could answer that with 1-3m think time with tools
It’s sota tho
so?
It’s prob your prompt
i mean it's on par or I mean with my current testing it's just meh
`There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)
Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?
`
there you go, prompt it for me
You prompt it using the pic I’m going to bed
nah my point stands, it's just on par with o3
Nah it’s better
more thinking doesn't always mean smarter
At least i'm not seeing what o3 answers wrong so I can test lol
doesn't always
Fix the prompt and you’ll be good
💯
you do it then
You do it I’m going to bed
prompting helps when you asking it a question that could lead to general answers, it 'd make the question more specific
and it does not apply when you're asking a correct/incorrect question
prompting helps when you ask questions like make me quiz website since it is too broad and general
that's why you need prompting
Does anyone know how to connect the url to a extension for like vs code
I don't think it's a hard limit (like it truncates at 128 tokens and starts outputting the main content directly). I've tried setting a 128 token thinking budget and the model still thought for 200 tokens, which shows it's aware of the variable
I A MBACK FROM AN EXPENSIVE ADVENTURE
NOW I WILL SHOOT AND KILL PSORIASIS
with ointment/creme, not sure which one i got
gpt-4-0314
yeah my psoriasis treatment is still cheaper than ai
but ai is much better
it will make the cause of all skin issues go BYEBYE
My current understanding of it is that it is much less impactful than Anthropic thinking budget
it can use either thinking or final response window for solving the problem, the model for the most part does not care lol
and it's not a strict hard limit - yes. Model can go beyond what you set
1 run is like $50...?
google :
multiple gemini models
project Astra
gemini fiffusion
imagen 4
android XR
veo 3
google search try-on
jules
flow
lyria 2
whisk
gemini native audio
xai :
