#general
Well like I said, I don't think budget matters that much with Gemini
but I wouldn't be shocked if for specific prompts (including aider) max budget does result in slightly more tokens. Also what was the total amount of tokens generated? That would be more telling than stddev IMO
@keen beacon
so... it still shows 32k budget generates more. You are only trying to argue the significance of it
wdym
you are doing some gymnastics here, I don't think this statistical analysis theory applies perfectly here
but it would if you redid the test and got the opposite results
if it's always more tokens for 32k budget the results are meaningful
with the only question remaining how it translates to performance
what you are saying here is that small increase in total tokens generated is impossible...
it has to be big or nothing at all
well it's your assumption though, by design any small increase you would view as noise
are you actually fr here
lol
that's you who is using models to reason here...
simplified, it told you that there's not enough data to say definitively
if you sent it 10 more runs like that all of them showing higher mean, the answer would have been likely different
and all show higher mean for 32k budget?
Well I meant 10 more tests like this one the entire thing
I'm literally not. You saw higher mean and then tried to prove that it's meaningless. 32k budget generated more tokens than max_budget. Not definitive ofc with such limited data, but it can be indicative
we do not know how their budget thing is implemented, way too many assumptions here... It could affect the stddev itself
I'm not dismissing it merely pointing out that it is higher with max budget for that test run
which could mean system prompt changes...
like don't you find it interesting that not only median is higher but also min is lower and max is higher?
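One way to put a number on the "is the higher mean just noise" argument above is a permutation test on the per-run token counts. This is only a sketch: the token counts below are made-up placeholders, not the figures from the test being discussed, and `permutation_test` is a hypothetical helper name.

```python
import random

def avg(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean token counts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(avg(a) - avg(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null, budget setting does not matter.
        rng.shuffle(pooled)
        diff = abs(avg(pooled[:len(a)]) - avg(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical thinking-token counts per run (NOT the data from this chat).
tokens_32k = [9100, 10400, 8700, 11200, 9900, 10100]
tokens_max = [8800, 9700, 9300, 10900, 9000, 9500]

p_value = permutation_test(tokens_32k, tokens_max)
print(f"mean 32k={avg(tokens_32k):.0f} mean max={avg(tokens_max):.0f} p~{p_value:.3f}")
```

A large p-value here would only mean "not enough data to tell", which is the same point made above: ten more runs that all lean the same way would shift the answer.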
Well if you know those you basically know Python, it's an easy language to learn lol
nowadays everything shiny is python
Why do llms use python, like tools and such
cause it's the most flexible language
Why not native c
and used in statistics
so all the graphs etc, data analysis... Python. And machine learning as well
Ancient
I like Rust a lot. I wish it was easy and impactful to migrate to
I think it will eventually progress into natural language being viewed as coding lol. But we are not quite there yet. Currently all the most meaningful stuff is being done in Python tbh
AI can replace coders, but it can't replace those who are making the AI itself (Python), at least not nearly as fast... 🤷♂️
Thank you for your advice Brian 🤜🤛
Oh I'm not talking about the difficulty of using the language. (It is on the more advanced side.) I'm talking about migrating a large code base to Rust
iOS26 design is somewhat growing on me. Not a fan how easy it is to break it with customization but hopefully that's just developer beta things
oh i completely went on a tangent lol
ignore me
No worries. It's a fun language
i cant use anything else loll
isnt that still happening tho?
c++ to rust
lmao
yeah who needs that stuff, honestly you dont need c++
just write everything in assembly
a random tangent but someone reminded me about the illusion of thinking, at least this in the abstract:
We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.
kingfall in the face of increased complexity (2 6x6 zebra puzzles instead of 1 where it constantly fails), builds its own system to exactly solve zebra puzzle. kinda wild 🤣 (still on this, just found it fitting)
i dont think the illusion of thinking will hold up well...
yeah they cant reason... they're pattern matching on these unseen zebra puzzles
i oughta give it more 6x6 zebra puzzles lol
im actually curious if it can do more than 2
nah 🤣
it will fail
oh they can
can they solve two in 14.5k tokens?
I have a link to kingfall but guys dont leak
How is it patched? The llm still responds to me
Oh okay
what sampling setting did u use btw
on 0605 i sampled 10 on thinking budget = 4096 and it was doing 10,000+ tokens in thinking (i triple checked the api requests)
wait what
temperature=0.05
on 0 temperature
yeah i had to switch it to 0.7 or it would produce weird results
50 samples for auto and 50 samples for 24k
honestly the weird graph looks like the bug to me
but i dont know
i didnt test flash at all
what prompt did you use btw
142789*521=?
i intentionally used one that causes very varied distributions/more representative
🤷 i dont know lol
heres the raw response
line 1 auto
line 2 24k
ill look at that later
look at thinking budget 4096 on 0605 (0 temp/1 top p) i dont know lol (with sampling it returns as expected)
not much i just tested it a few on each lower thinking budgets. i only did 30 for max and auto
no thinking budget 0 temp etc for reference btw as well
the results dont make sense 🤣
Could you add log to the first plot?
The chart just now didn't exclude a response where the model got stuck in a repetitive loop, outputting up to 65535 tokens
idk, it seems like a higher thinking budget reduced the overall number of thoughts based on this test, but the accuracy increased significantly
i think youre hitting another bug. it seems the gemini api seems to have a lot of issues idk
Another logical reasoning problem. With temp=0.5, it's also basically the 24k thinking budget that reduced the total amount of thought. didn't expect 2.5flash to have a 100% accuracy rate though
what do you think is happening
i oughta look at this later wut is going on
I think the thinking budget is definitely a training parameter, enabling it adds more constraints compared to 'auto' (leading to a more concentrated distribution of thought lengths), and the model is aware of it
100% accuracy for what?
For 0605 I didn't see that tho
A puzzle
Sroan's safe
Sroan has a private safe. The combination is 5 different digits.
A's guess: 32561
B's guess: 18093
C's guess: 91526
'Sroan then said, "Each of you three has correctly guessed two digits in non-adjacent
positions. (It only counts as correct if both the position and the digit are
correct.) If you can figure out the password, the money inside is yours!"'
Assuming the three are super intelligent, can they get the money? And what
is the password?
||Password: 38596||
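The safe puzzle is small enough to brute-force, which is a handy way to check a model's answer. A minimal sketch, assuming "two digits" means exactly two correct positions per guess and "non-adjacent" means the two positions differ by at least 2 (both are my reading of the puzzle text):

```python
from itertools import permutations

GUESSES = ["32561", "18093", "91526"]

def matching_positions(candidate, guess):
    """Indexes where the digit and the position both agree."""
    return [i for i, (c, g) in enumerate(zip(candidate, guess)) if c == g]

def is_consistent(candidate):
    """True if every guess has exactly two correct, non-adjacent positions."""
    for guess in GUESSES:
        hits = matching_positions(candidate, guess)
        if len(hits) != 2 or hits[1] - hits[0] < 2:
            return False
    return True

# permutations() already enforces "5 different digits".
solutions = [
    "".join(digits)
    for digits in permutations("0123456789", 5)
    if is_consistent("".join(digits))
]
print(solutions)
```

If the list has exactly one entry, the three can open the safe; more than one would mean the clue is not enough.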
What's the most accurate/powerful ai in LMarena (for coding)?
project mariner
check out webdev arena
it's gemini 2.5 pro rn
Turning ON auto-budget increases the budget but decreases the performance? 👀 Wtf is going on. Model worrying too much about budget when it's present?
it might be a bug
ill try to collect more data
yea so many projects
they will eventually update their gemini 2.5 pro on 19th of this month
full acceleration
I really hope the reason grok 3.5 hasn't been released is because elon wants it to match the fake leaked benchmarks
that would be justifiable
how can it match 2.5 ultra though
there is no such model
its 2.5 pro deep thinking
2.5 pro latest will be on 19th = kingfall
then they will release the deep thinking feature
grok 3.5 will probably come around that time
its probably ass tho
Yes it's just the current model
will there be any free access?
For what
deepthink
Doubtful
it'll be slightly nerfed surely
Can someone explain to me why we should pay for the Google ultra plan for deep think feature? When I can just ask the current gemini models to employ graph of thoughts and get the same internal process from the LLM?
what's the point of exp->-preview->GA if they're just the same models
inference time adjustments probably cant be emulated with the stuff we have publicly
I have to assume it’ll be a little more complicated than that
(progressively made 'safer' / more corporate aligned before they give the sign off: go, use this in prod)
maybe u can it depends. the gemini api allows for a lot
indeed (i mean get it, from a corporate risk management perspective)
i just feel like the 'tweaks' made in each intermediary step before GA are to reduce 'harmfulness' rather than increase 'performance'
probably not
(both admittedly slippery / subjective terms)
at least this time, timeline seems too short
yeah that's a fair counterpoint
they could deploy a completely different revision but why do a preview stage idk
they could use a slightly newer revision after it with adjustments, but the adjustment probably wouldnt be preview feedback
unless you know the future
I don’t think that harmfulness stuff has noticeably changed much though, unless I’ve missed it
yeah for sure. to the extent there are changes (from exp/preview to GA) that increase alignment but lead to performance degradations, they're likely marginal / imperceptible to most
but yeah i feel the focus is on making it 'suitable' for well GA aha
Sorry off topic
Is lmarena going to release o3 Pro rankings
its not gonna be on the arena
yeah.. you might be right tbf
it takes like 13 mins on every request
It's also not just alignment. A ton of post training for performance happens between experimental and GA
they just never feel 'better'.. but i'm talking out of my ass here.. i think i just have rose tinted glasses on thinking about earlier models ha
They can't rank the model?
its harder to tell without the raw thoughts now ngl. a lot of the signal of each prompt was from it (if there are model changes)
people would need to rate the model in battles, and nobody is waiting 13 minutes for it to respond to do their rating
when is kingsfall dropping
it no longer exists
I feel like a lot of it has been minor gains in some fields at the expense of others
Yeah it's usually a game of whack-a-mole
for 05-06 fs
re this
yeah I mostly use LLMs for maths so I’ve been sticking with it
didnt 0506 do worse on math benchmarks though?
if it performs well for you, it doesnt matter i guess
not that I’ve seen, and generally yeah it has been a tiny bit better for me
still prefer qwen for maths though it has gotten a lot right that other models haven’t
No bad news. I just don't know of anything particularly interesting happening in July yet
Probably the best opportunity for OAI to strike back
Idk what to expect with gpt 5 xd
If OAI doesn't release o4 or GPT-5 in the next 2 months, they're in trouble IMO
I think they probably will though
Is the SVG model never gonna see the light of day
May be they will release Veo3 api in July?
I don't track image gen, video gen, etc
i haven’t played around much but fwiw o3 pro seems very strong (better than o3 fs, whether better than o1 pro i'm not sure..), but also painfully slow..
here for e.g., given a nyt “Connections” word puzzle, o3-pro does exceptionally well (the closest I’ve seen a model come to fully solve it), taking 14 minutes; o3 graciously admits defeat, after taking 10 minutes to think about it; for reference/comparison, 2.5-pro-06-05 blitzes through it in less than 2 mins, but fails dismally.
what's interesting is that the slowness (seemingly) isn’t due to the number of tokens used during thinking, but just that the inference is slow af (or perhaps there's some parallel stuff going on that isn't reflected in token count).. like look at the tokens / sec…
but also, while ‘faster’, o3 used more than three times as many tokens, only to fail to find a solution; while with 06-05 it’s like it dominated a hurdles race on ‘time’, but crashed through every hurdle along the way..very fast but terrible/disqualifying performance
oooh have ya tried puzzgrid puzzles?
what was the puzzle?
not until now, thanks! (i see where the nyt got the idea from aha)
Can you solve this puzzle?
---
How to Play:
Find four groups of four items that share something in common.
Category Examples: • FISH: Bass, Flounder, Salmon, Trout • FIRE ___: Ant, Drill, Island, Opal
The above are examples only and by no means exhaustive; categories can be established in various ways, including for example by manipulating/altering four words in the same way, or focussing on a particular component of each, such that they have a shared meaning or association.
Categories will always be more specific than “5‑LETTER‑WORDS,” “NAMES” or “VERBS.”
Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!
——
PUZZLE:
HELL, WELL, COMP, ORGO, SHELL, MILK, LIT, SICK, PASTE, FIRE, SMOKE, DOPE, CREAM, NETI, ILL, LICK
——
SOLUTION:
one of the nice things about using these i think is that there's no risk of contamination (for the latest ones ofc)
onlyconnect is the original original (a british quiz show), puzzgrid follows their rules faithfully
but the types of grids in puzzgrid vs nyt connections is vastly different (much more hardcore audience) so i think it'd be interesting besides just being another puzzle source
lmao rambling about something i like
i looked at their About page and saw the reference to the british quiz show as the progenitor of it all ha
The actual token count for o3 pro is 10x what's shown
They just count 10 parallel tokens as 1
Is 19th June Gemini 2.5 pro kingfall
i thought something along these lines - tho is that concrete / confirmed?
(also not sure about 'parallel tokens', but some kinda parallel / search thingy)
actually ig that language makes sense
oh someone else who knows only connect!!
lol i use some of their stuff as personal benchmarks
the o3 model available is medium or high?
higher or lower
How do you know this lol
guessing
hes a magician
Let's see if true
if it happens its just a coincidence imo
Dudes luck stats maxed out
yup
https://downdetector.com/ wow its a lot
BREAKING 🚨: xAI is testing Grok 3.5 along with a voice mode on the Grok web! Sound on.
Release soon? 👀
my first Failed to acceptterms-of-use
the team is aware of some issues accessing the site, apologies for any inconvenience! we're working hard to get it fixed asap
okay guys what big cloud provider died this time
these days LMArena is my only source of stress relief
haha, very useful service
i know right
hn says gcp (but gcp doesn't say)
Welcome to Cloudflare's home for real-time and historical data on system performance.
🔥 Midjourney video is almost here...
And it has, somehow, that incredible artistic aesthetic that MJ is famous for.
Yes, these are all MJ video generations!!! 🧵👇
hoooly
I was wondering why I couldn't get LLM arena battles to work lol
ok this ones honest https://status.firebase.google.com/
great. now r/midjourney will be impossible to browse
the addition of image generation has already ruined r/novelai
2025 midjourney is interesting
that's incredible, some of the other examples are so good too
side note the lofi activity is also giving me troubles atm, everything is broken 😭
"Update - We are continuing to investigate and work with our providers to resolve this issue. It is likely impacting a few more features like activities." - discord
seems both a bunch of cloudflare services and a bunch of GCP services are suffering
Welcome to Discord's home for real-time and historical data on system performance.
ah, thanks
source @ google tells me an internal service there called Chemist is down
Chemist checks "project status, activation status, abuse status, billing status, service status, location restrictions, VPC Service Controls, SuperQuota, and other policies"
"goddamnit we told that intern not to deploy gemini's changes without running them through us first"
looks like the old arena website is working fine tho
because it doesn't rely on firebase 😉
were they planning a launch today?
GCP status updated
so now pro also has yapping...
oh images work now
🫃
it is? isn't working for me 😕
works for me
Arena (battle) doesnt work. Arena (side-by-side) and Direct Chat works.
It is a bit slow
yeah i think firebase is still a little shaky
naturally given it's gone from no traffic to all of it after coming back up
discord flagged u btw
o1 pro used to be a beast, could output 2-3k lines without any placeholders/omissions
iirc that one didn't have any "yap score" constraints
yea it doesn't but even last month, the o1 pro was on a nerfed out version, still couldn't rewrite code end-to-end without omissions
Today’s Yap score is 8192. Should I provide an explanation or just the number? The user seems to be asking for the score specifically, so I think just answering with 8192 would be sufficient. I could add a note if there's space, but given I might need to keep it concise, I’ll simply respond: “Today’s Yap score is 8192.” Alright, let’s finalize that answer!
Yeah it's assuming now it needs to be concise...
o1 pro context was 128k, now its 64k on o3 pro via ui
second guessing itself, they made it confused lol
It's like "8192 is that a lot or not? Better be safe and keep it brief."
lmarena is down rn?
not sure if o3 pro in api uses tools like web browsing or not, but the chatgpt website does
ccp chatgpt is built diff
it is 😦
Hey, what on earth are you doinggg! All my important chat history was deleted!! Why don't you put a data backup module or an option to connect email or Google accounts so that chat data remains!! All chat data from two of my accounts was wiped!!! Seriously, what is this? Just add a Google connect option
bruhh
discord logged me out and sent me to limited access until tomorrow 😭
why 😭
how tf is this not resolved
claude is currently vacationing
omg
Oh noooooo
Mine got wiped out too😭😭
ok so now that all my shxt is down, can someone recap whos at fault here
yes o3 chatgpt gets it right (by browsing the web for the solution ha - which tbf is actually what it does best)
o3 pro via API doesn't use tools afaik
not sure about via chatgpt
Which company's models do you use most for daily tasks?
Is 2.5 pro down?
i swear if its one intern that is causing all of this 😤
dont think so
It says exceeded quota on all my accounts
And even on another third party app where I use it
its back up for me
can someone ping me when its resolved? gonna pluck out some weed grass
I am really sorry that you lost your chat history. We are looking into different solutions that'll help prevent issues like this going forward.
🥀
Is it just me or still down?😩
It’s still down but is there a way to recover chats
I don't think there's a way to recover those chats, because, based on how I think it works, the chats were stored in local storage, and, when you go to a website, if it changed, local storage gets cleaned, and cache gets reset.
Will the website come back any time soon?
You can recover it
If you don't visit the site
it's not clear when it'll be working again unfortunately
expected duration of downtime maybe 3 h
You can just go in your file system and copy the local storage for the site. And then when it cleans it you put it back
is it a large scale hacking attempt?
What if I have multiple tabs open but haven’t visited them like I still have a few from yesterday open like if I don’t visit the site from those tabs does that count
I would close them, if it refreshes it's over
The core issue is resolved already
It's only some services take a bit to get going again
So we all are waiting for 2 hrs to pass😑🥲
Looks like internet backbone shenanigans
All types of services are down
It's not DNS
There's no way it's DNS
It was DNS
Well, yeah, but that would be if you remembered to copy it before it got cleared.
Is it back up because I can access it but all my chats are gone
Well. You weren't supposed to access it if you want your chats 😅
Because that's what clears them
I’m aware now but I accessed it before I asked, note for next time
Is battle back yet
fun fact: the meta ai "discover" feed does not require a log in
anyone can see (accidentally) shared conversations
edith norman is depressing
LM arena has chat history since when
Ok I kind of changed my mind on o3-pro after more testing... It may not surprise you with shockingly good solutions beyond normal o3-high level, but they managed to fix the fails of it to quite significant extent. Wouldn't be surprised now if it does beat Gemini on simple-bench 👀
parallel compute does give it more capacity to reconsider and do additional things, even if in a more limited scope
yeah probably.
My initial impressions were mostly based on the facts that it seems to be more concise and does not have more intelligence / isn't more capable in an obvious way. But it is "less dumb" where o3 can blunder, naturally improving the overall performance
and also google can simply do this best of n search to beat it again, scaling the number of times you run the llm isnt that impressive
run gemini 1000 times and you probably beat human baseline of simplebench, its about the cost to performance
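The "just run it n times" point can be made concrete with the usual best-of-n arithmetic. A rough sketch under strong assumptions: runs are independent, each run succeeds with the same probability, and something can reliably pick out a correct answer when one exists. The 5% single-run rate is a placeholder, not a measured number for any model.

```python
def best_of_n_success(p_single, n):
    """Chance that at least one of n independent runs is correct."""
    return 1 - (1 - p_single) ** n

def relative_cost(n):
    """Compute scales linearly with the number of runs."""
    return n

# Hypothetical single-run success rate on one hard question.
p_single = 0.05
for n in (1, 10, 100, 1000):
    print(f"n={n:>4}  success={best_of_n_success(p_single, n):.3f}  cost x{relative_cost(n)}")
```

The success curve flattens quickly while the cost keeps growing linearly, which is the cost-to-performance point being argued here. Without a reliable selector the real gain is smaller than this upper bound.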
10+ % gain for pro is very realistic tbh
they are welcome to do so. For now o3-pro is released and deep think isn't.
And the pricing of o3-pro is not ridiculous anymore, unlike it was with o1-pro...
its $20/$80 compared to $1.25/$10 for 2.5 pro, you are comparing different levels of pricing
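For a sense of scale, here is the per-request arithmetic behind those two price pairs. This sketch assumes the figures are USD per 1M input tokens / per 1M output tokens, and the request size is a made-up example; `request_cost` and the dictionary keys are just illustrative names.

```python
# (input, output) price in USD per 1M tokens, as quoted in the chat.
PRICES = {
    "o3-pro": (20.00, 80.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def request_cost(model, input_tokens, output_tokens):
    """USD cost of one request at the quoted per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical request: 10k tokens in, 20k tokens out (thinking included).
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 20_000):.4f}")
```

The ratio depends on the input/output mix: it is 16x on pure input and 8x on pure output at these prices.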
deep think is not gonna be $1.25/$10. Also look at Anthropic pricing. OpenAI are far from being the worst offender anymore lol
and o3 is a very different model to 2.5pro with different strengths...
the point is deep think is not required. 2.5 pro equals o3 pro at a cheaper price
??
yes it does lol
2.5pro is worse than o3 non-pro on numerous things
at best it's equivalent, at worst it is not entirely o3 level model
o3 sucks at maths compared to 2.5 pro, its not even close
you must be joking
seems everyone else also lives in this 'delusion' looking at the leaderboards
2.5 pro is top in every single category
out of error also
do you know how user preference benchmarks work?
they were never meant to be the definitive indicators of performance
2.5Pro is great but it is not better than o3 and it is objectively and measurably, definitively worse than o3-pro
at math its really only worse than o3-high imo, but that is just nitpicking from my side
- i would also argue that it is on par (and sometimes even better) than regular o3
well yeah and by extension worse than o3-pro. Though with regular o3 it's very close and hard to tell after Google updated it
is it still down?
although when it comes to very very complex tasks o3 is just wayy better (imo).
because it can handle high token output better
Y'all should cite the benchmarks being used
They using vibes
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Huh, no I'm saying show where models are performing better and worse
This is old Gemini 2.5 pro
well i don't think they reported many improvements on math
obv you could argue that they are at par now
but now you don't have any data
Google released the benchmarks with the model?
For example
The base model is 35%
On USAMO 2025
Read it
not vibes lmao
that is a different benchmarking process as far as i know, because the people behind the mathbench arena actually manually check the reasoning
(but honestly you might also be right and 2.5 pro might be equal to o3-high in math)
(as far as i know, could also be wrong)
It's the same benchmarking process lol
I've been seeing mixed reports
They just didn't bother to test the recent version
I would pull up frontier maths results but they refuse to test Gemini since it's openAI funded benchmark
Sad
🤣
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
this should shut you up
it is not saturated and still very valid metric
close to saturated? yes. Not yet saturated though
It's too close to 100 that it becomes basically luck. Actually learn some statistics, there is such a thing as sample size
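The sample-size point can be checked with a confidence interval on a benchmark score. A sketch using the Wilson score interval; the 30-question benchmark size and the two scores are placeholders, not actual MathArena results.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical: two models on a 30-question benchmark, 27 vs 29 correct.
for name, correct in (("model A", 27), ("model B", 29)):
    lo, hi = wilson_interval(correct, 30)
    print(f"{name}: {correct}/30 = {correct / 30:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With so few questions the two intervals overlap heavily, so a gap of a couple of questions near the top of the scale does not separate the models on its own.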
Does the average include USAMO
just because you don't like the results does not make it saturated
you don't get to pick and choose
and this is overall score considering all the main math benchmarks
o3-high
It's the only weakness remaining. The tool use sucks
o3 very roughly the same
The actual model itself is astonishing though
They magically found 5x efficiency improvement
Sure
depends, honestly in many other tasks other than the academic ones (which i assume we are referring to when calling a model more powerful) o3 / o3-high / o3 pro are undeniably worse than 2.5 pro
We have not even been talking about them, but yeah... o3 will destroy 2.5pro for math when you use it on chatgpt website with tools. Even with no tools it's marginally better for math. With tools it's game over lol
Crazy
did they find this 5x cost reduction for 4.1 first
and then wait months
nah
can not make this s*it up
let's not push this too far lol. I'm pretty sure they just used that as an excuse to be completely honest. They had to say something otherwise it makes them look bad
they have less profit margin now
just funny
let's chill out everyone lol
imo if 2.5 pro fixes the following things it will be clearly better (not accounting for o3 pro):
- actually use and then perfect tools (this also includes being able to follow structured prompts better, e.g. diff and command usage)
- better adapt the reasoning length (higher variance in auto mode and more intelligently detect difficulty, in my experience this is still the main area holding the models back)
what o3 needs to improve to beat 2.5 pro (assuming they would actually update it):
- actually make the models enjoyable to talk to
- actually make the model "smarter"(my vibe) and not be completely unaware of some assumptions it is taking
- some more knowledge honestly
- better general visual understanding
- better long context
- make it better at reasoning over visual elements: webdev, complex video
- prob. other stuff as well (but i have not really used o3 enough to judge)
-> most of the difference comes from things not highly relevant to the average user
-> in practice same intelligence
o3 pro USAMO score when
well i want it to chat in a way where it is aware of assumptions
o3 API usage is barely anything. See openrouter
That's the whole reason they had to drop prices 80%
The model was barely being used through API
The price drop does not seem to have helped that much though
well part of a good model is making the model so that it adapts really well when talking to it and "intelligently" identify what i want
which 2.5 is able to do
💀
it is still a valid point, o3 API usage is probably very low
no
so you think they are just like: well, i mean anthropic has these models that are really popular for coding and googles models are also on the rise for that. both of these companies are our competitors. BUT we will just not care at all
because "o3 is not an everyday model" 🤡
💀
craigbench is triggering some ppl
made by google btw 😭
hmm, ur a wizard harry
ktibow also predicted cloudflare downtime
real
Lmao
no need to post your gooner clips here bud
he should put up a tweet of sam falling off a cliff next
Does o3 pro beat kingfall?
@placid notch
yo wym
ion think this has anything to do with deployment
it's too widespread
they fixed a lot of it now
funny the models were the first to show it
and then it bled into everything else on the internet
gcp is crazy
ye
really is crazy how one problem like that cascades into affecting basically the entire world
it affected discord, Spotify, Snapchat, etc
this seems to be the path they want to take
although it doesn't have to be the case that it happens
the CEO of Google and DeepMind seem to have very interlinked views tho
ye but that isn't really a bad thing
to properly combat it
you need to have control
and to control it, there's no way else
guys do you think google is gonna lose stock or gain stock in the next few years? (because it might sell chrome and all the ai stuff)
yea
but what if google lose
yo I don't think it's explicitly a bad thing if they lose chrome
ye but that would simply be creating another monopoly
so that's easily dismissable
or not something that could be entertained
no as in
they can't give it to anyone else or open that up
it wouldn't be a possibility they open chrome up to imo
it doesn't have to be that way
no, no one would buy it
it doesn't have to be that it's sold lmao
yeah but like every instance of things like this happening
that's just as valid as any other claim
i think google won't have to force sell chrome but it will end some contracts
like google won't be default browser on firefox or something like that
inherently, but that's what I'm saying presupposes to say the contrary lol
in ur opinion
the DoJ has more power, obviously, but even under that power it seems like large companies being accused of these things force that claim into a result that's either somewhat related, or just one of the many instances that are besides "sell x"
for example, Microsoft appealed and they vacated the breakup order
ye, inevitably Google isn't going to be hit hard
woah
claude 4.1 came early?
Isn't that from baidu
i think that's Baidu ERNIE X1
who pinged me
that's not grok 3.5
Why can’t you believe
x-preview is not grok 3.5
I was not trained by Google, but am instead an independently developed large language model by Baidu, based on its self-developed deep learning framework PaddlePaddle. My name is Wenxin X1 (ERNIE X1). My technical architecture is entirely based on Baidu’s long-term accumulation in the field of artificial intelligence, including core technologies such as pre-training algorithms, data engineering, and model optimisation. I aim to provide users with a professional, secure, and contextually appropriate intelligent interaction experience in Chinese.
ITS COMING
nice, this is when I get ultra
yo wait it's coming to the API too
I forgot
"Connecting to Arena has failed"
I hate this, it’s happening again
Do I just wait until it’s back up again to have the chat that I tried to get all my work in to continue it or is it gone already
uhhh is lmarena down?
Apparently yes
Yeah and an error for me as well
okok ty
uh what;s coming?
OUTAGE!
ohh that's what comin lol
oh shi does that mean chat history is cleared too?
...interestingly, I didnt experience outage at all, everything is fine on my end 😅
what 😭
the luck is strong with this one
Does anyone know when it’s gonna be back up again?
Legacy version is up though
it's working now
Is it back up again?
for me it seems so
Are your chats still there
yes, they are
Okay, thanks
thank you for the info
Guys, have a look at these blind test : https://artificialanalysis.ai/text-to-video/arena
The models Seedance 1.0 and the anonymous Kangaroo are absolutely insane. They both seem better than Veo3 (video only), Kling, Pika etc. Both come up often in the arena, especially Seedance.
Interesting... Fairly general prompt but look how similar both models are. I wonder what movie is this from lol
Left: Kling
Right: Seedance
my chats are still saved thanks daddy lm ❤️
Well, this one is Image to Video, so they will surely look similar.
New model in Battle mode: stephen-v2
What the hell? Does lmarena have video models now...?
no
thanks for sharing, it is a smart workaround for such benchmarking, I'll still wait for lmarena to include text2video benchmarking where I can write my own prompt for testing 🤗
edit: you can submit your own prompt there, lets see....
seedance is incredible.
insane.
you can test it here https://www.volcengine.com/docs/82379/1544106
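Volcano Engine official documentation center: product docs, quick starts, user guides and more. Everything you care about is here, including user manuals for Volcano Engine's main products, API and SDK manuals, FAQs and other essential material. We will keep improving it to give users a better experience.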
i think stephen is supposed to be a chinese model too, it's in the arena btw
will we have P2L in new UI?
is any lab employing this price strategy with their gen ai models? I have heard that twitter guy is making grok-3 input/output costs fairly low to build market share.
New model in Battle mode: prowlridge
Found this website. Super good looking visualizations. However, wtf is EnigmaEval https://scale.com/leaderboard
"a benchmark derived from puzzle hunts—a repository of sophisticated problems from the global puzzle-solving community"
still no benchmarks to measure real world practical knowledge work tasks yet huh?
is this gpt-4-0314
Yeah true, but if people could wait 6–8 mins for o3-2025-04-16 on certain tasks, surely they can do the same for o3-pro, no?
quite a lot of em
- should specify i mean "some AI things they did" :)
bc some of them are well-known beyond the field
they already went bankrupt and got acquired by Alibaba lol
I could name models from maybe 4 companies on the list but only put DeepSeek because it's the only model where I know what it does and care
I voted for all the companies whose product names I knew
me too, though for me it was obv. a bit unfair as I only included the things I already knew...
I think xAI and DeepSeek are on the same tier
no no no 🙂↔️
can we rename our chats?
Grok is underrated i think
We might see Grok 3.5, and it could be really good, if Elon isn't deported to South Africa first
For the record, I'm not talking about current models. I'm talking about the position they're in.
I would put DeepSeek one tier above xAI if they didn't have issues with hardware embargos and corporate governance risks.
Actually I think I would still put DeepSeek above xAI because they are domiciled in China and are insulated because they will have exclusive access to the Chinese market
Also they announced big brain mode earlier and didnt release
I think they have something
xAI needs to compete with American labs to survive whereas DeepSeek does not
I have sympathy for DeepSeek not only because it's cheap or open source, I also like their company's vision. The guy leading DeepSeek is already criticizing China's strategy. He thinks China must focus on being creative rather than just producing more or cheaper
I read some interviews with that guy
Seedance 1.0 is really incredible -- could be the new SOTA? ... perhaps it's edged out by Kangaroo. But what impressed me even more was the LTX Video v0.9.7 model which won several matchups for me and is fast enough to generate video in realtime:
LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 30 FPS videos at 1216×704 resolution, faster than it takes to watch them.
Also they were building some kind of investment machines? (sorry, I forgot the name) so they were already very talented
bytedance's new model is also quite good, its performance is on par with deepseek's model. And they're backed by a big company like bytedance, their research pace is faster than deepseek
There is a reason why a random company like them beat Alibaba's AI
He is welcome to join EU and hopefully he could convince our Empress Ursula to have some mercy with their AI regulation...
You guys just love regulating things too much, it can't be helped
But dont worry, anthropic is here for you
Qwen where? 🥺
Based on the position they're currently in, not their current models
I can see that in the last year, bytedance have gone from making very mediocre models to making good reasoning models close to R1-0528 and o3
qwen was able to decipher Craig's encryption, while claude wasnt able to....
I haven't used Qwen
that has surprised me quite a bit
And they don't seem to use as much distilled data as deepseek
xAI is well positioned, they have the largest pretraining cluster by far
Progress is fast but since they started much later than others it's lagging in current models
just started testing cici, doubao isnt commercially available outside arena
I feel like xAI is still missing a lot, they're even less comprehensive than deepseek
I agree on paper, although OpenAI's mindshare is a massive advantage
ChatGPT is a verb
Sure, but openAI could be the apple of the AI world basically. Does that make them the top if they don't have the actual model intelligence crown?
Debatable depending on criteria. Probably not worth arguing about
You said you are talking about positioning. Not current models
Well, I think OpenAI is poorly positioned in terms of compute. That's probably their primary concern. Like Stargate is expected to deploy 100k GPUs by the end of this year. Meanwhile xAI put 200k online in 100 days
And Google already had the compute deployed lol
that's not even close. Magistral-medium seems to be around the level of 30b reasoning models. So like qwq-32b, qwen3-30b MoE...
Grok is well beyond this
Mistral should have made Magistral-Large first and then distilled instead, but ig funding was an issue for them
Now that they partnered with nvidia, maybe things will be better moving forward
Second on compute might be meta actually. Even though it doesn't seem like it atm 😉
I suspect OpenAI's research can't keep up anymore. o3 seems to have only started training at the beginning of this year, so they only have a 2-4 month head start internally. Plus, the improvement in o3 pro is so minor, the fact they released it anyway makes it feel like they've really got nothing else to release.
Meanwhile, Google is accelerating, not slowing down. It seems they've always had more powerful models internally.
wdym, they are still sota on certain tasks with o3. On others where they are weaker it is still very close with 2.5Pro. They are not exactly behind lol
it's more of Google doing a good job recently
OpenAI is working hard as usual
imo
I don't think it's research as such, I think they literally just don't have enough GPUs to train at the required pace, and their compute deployments are too slow
aren't they the ones building NPU, TPU and [insert fancy letter]PU?
You are talking about current models. OpenAI are not cooking internally I know that already lol
yeah Google is in way better position than pretty much everyone else. Infinite data, cheap compute with TPUs...
They can't fit in their compute budget a new pretrained model first off. So they are having to use 4.1 as the base for o4 (GPT5). So this limits them before they even started RL. There's an entire list of issues they are having. Because they are used to the yearly cadence not this 3 month one Google is pushing them for
So they are in a tricky position
oh they are sitting still? Are you high? 
They were never holding back and released everything as soon as they realistically could when it comes to LLMs
them having "achieved AGI internally" and other things was just empty hype and noise. They went as far as to announce things before they were even real lol
Kingfall is undoubtedly stronger than o3(and o3-pro), not to mention it incorporates improvements like deepthink, and Google is still accelerating its research.
Meanwhile, the only new model from OpenAI seems to be the much-anticipated GPT-5, but... even someone like Sam, who's so good at creating hype, isn't promoting it in advance (by contrast, they started hyping o3-preview four months before its release). This is very suspicious
GPT 5 is going to be SOTA at release, but not by a large margin. And then Gemini 3 pro will retake the crown later (who knows when that's releasing)
They announced o3 before they had it. O3-preview was based on old base model (gpt4o/o1)
o3 is gpt4.1 based (a different checkpoint but same arch)
The consensus was similar back when google announced gemini ultra and that it would take years since OpenAI would always respond and be ahead. But we hit a wall and models were getting cheaper rather than much better. You never really know and it's not as simple. OpenAI has quite a bit going for them as they are essentially pioneers of consumer oriented reasoning models. They might do the same for another big idea...
I've suddenly lost confidence in gpt5 releasing in july lmao, I even bought yes on polymarket for its july release. now I really have to reconsider that
I think they have to release then. They can't risk releasing after Gemini 3 pro
It has to be before so they have some time at the top model intelligence
it has to be a new pretrained model. Like I don't see how it would work with just routing requests or a hybrid o3 version - no way that will perform better than o3-high let alone o3-pro
It's a new RL model. The pretrained base model is 4.1
You can save money with hybrid model / system, but how on earth can it possibly perform better than using high reasoning all the time... I do not think it can lol
deepthink is parallel compute that has nothing to do with a singular model arch, there's nothing to "incorporate" lol
openAI is not very secretive when it comes to their products. They have testers not even within the company
They are only secretive about their core research
Kingfall is mostly hype for now, we still know very little about it. But I partially agree that GPT5 may not break records. It may very well be just a way for them to integrate models - make it faster and cheaper
I think mostly these ai companies don't care if their releases and expected performances leak beforehand
Because the testers are not even NDAed
I think they can apply deepthink to any model if they want, just like how openai adds "pro" to o1 and o3
in theory yes, but this is still a very inefficient expensive way of marginal gains. Not hilariously expensive anymore, but still far from optimal
GPT5 is o4. Except the thinking can be scaled completely (it works with it turned off). It's not a router except on the very cheapest end to switch to mini
Ofc it's a new model, they wouldn't re release o3 lmao
Why would I expose people sharing information they are not supposed to, that is just stupid
I don't know if they did. But it can't have been a final version, since o4 full is still training
Plausible. So would be more like what Anthropic is doing. Though I would be surprised if doing so is not a compromise compared to native reasoning-only model tbh
Because he wanted to be the 'dictator' himself?
He wants to make money, gain power etc
OpenAI is undeniably preparing for merging reasoning and non-reasoning tbh
gpt4.1 is already price matched to o3
o4 will be the same per token pricing as current o3
curious what they will do with mini lineup, that one is still far from being price matched... $4.40 vs $1.60 output
would make no sense to release anything as "o4" though...? GPT5 should be the successor to both o3 and gpt4.1 in theory
yeah that's what I meant
Read
The question becomes, what kind of gain can be expected from essentially training o3 for much longer (compute wise)
gpt5 gonna completely break the promise that GPT's pre-training scale would increase 10x with every +1 version lol
maybe 100x idk
💀
it has to be a bigger model I think. You can't expect a same-size model to do both things (reasoning and not) and then perform even better than o3 just by training it for somewhat longer...
It's not a bigger model. This is for sure lol
"for sure"??
Yes
how can you be sure lol
I'm going to link this message again when o4 (GPT5) releases lol. Not a long way now anyways
RLmaxxing will create a schizophrenic model full of hallucinations haha. I think the current o3 is already neurotic enough, it constantly hallucinates tool calls. I dread to think how much more it could improve if we keep doubling down on RLVR on the same base
They are doing a lot of work to attempt to reduce hallucinations
You can get cheap Gemini ultra accs on Chinese sites
they have to train it for both outputting reasoning and not outputting it. And they already kinda trained o3 base model for as long as it is reasonable... This sounds questionable tbh
Also they are targeting swe bench to try to be more competitive with Claude in that domain
lmaoooo
Secret
It’s 2-0 so far on predictions
Google is a much bigger threat than Anthropic ever will be... Besides Anthropic's main advantages seem to stem from model size (Opus spatial awareness and reasoning)
Claim it quick before it’s gone
I found o3 and 2.5 pro are equally bad with the hallucinations anyways
o3 solved my wordle 2.5 pro couldn’t
can it solve letterle
oh i wonder if it can solve https://angle.wtf
Guess the angle in 3 tries or less!
These models have terrible visual reasoning or smth. I put some basic geometry problems and they all failed lol
o3 is basically kindergarten level on spatial awareness even comparing it to 2.5 pro though, try asking it to draw something in svg and then compare that to 2.5pro...
But converting it to a text description they easily pass
"these" = OpenAI models
and I agree
Gemini failed also
it may have. But Anthropic and Google are still leaps better on tasks like this
which is why I would be extremely surprised if o4/gpt5 is same size as o3
gemini thinks it's 115°
like there are no obvious ways for fast gains
which is 65° if you give it the benefit of the doubt
Don’t even need o3 pro
remaining with that size
It's an incremental gain. Enough to be SOTA at release
gg though thats cool
it would still mean that it has all the same main flaws, fundamentally...
By trained for longer, I mean on a multiple of the compute it has had up to this point. Not like 50% more or smth, think like 5x lol
Tool use too good
o3 was actually not that final
That's what Deepseek was trying for months I think, Google as well. Those updated models are mostly the same, slightly better. But they didn't have to switch from reasoning to hybrid
Not a lot of people know
There's a lot more you can do with RL in terms of data. With synthetic data and such
The model size becomes a bit more limited because the reliance on human data is reducing
They have to literally do huge amounts of inference to create most of the new dataset
Deepseek didn't have the compute available to do such a thing I think
I mean, o3 is compromised by size currently for sure, but I dunno... On the other hand it could make sense sticking it out assuming future progress and considering training time.
ig cause they’re different aha.. but yeah fwiw it looks like this EnigmaEval uses actual ‘puzzles’ (often meant to be solved by teams of humans over several days..) think wildly elaborate crosswords and sudokus with additional layers of decryption kinda thing
e.g. this kind of puzzle (times 1,184):
yeah i think it's a pretty dubious benchmark tbh (i can see how o-series models would do well on it though; it's kinda brute force, which they do well at imo )
and again just fwiw.. i basically dump 10-20 ‘questions’ in a single prompt and add the results to an unwieldy spreadsheet .. there’s nothing scientific about it but the basic aim is to test critical comprehension (so plenty of riddles/wordplays) and common sense reasoning and spatial/emotional awareness (plus a few tasks, mostly anti-LLM/tokenizer things). here’s a few for reference (they’re not really ‘puzzles’ at all) :
\\ A digital clock shows 3:15. What is the angle between the hours and minutes being displayed on the numerical screen?
\\ If forced to choose between i or ii, which scenario would Bob most likely prefer?
i) Bob scratches his dream car immediately after purchasing it
ii) Bob gets abruptly sacked from his full-time job which he didn't enjoy much
\\ Write one sentence that includes the words “tract”, “fact”, “factory”, “intact” and “react” - in that order
but yeah they’re testing different things ig would be the short answer ha
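For reference on the 3:15 clock question above: on an analog face the two hands would be 7.5° apart, which may be the point of the question, since a digital display has no hands to measure. A minimal Python sketch of the analog calculation (the function name is made up for illustration):

```python
def analog_hand_angle(hour: int, minute: int) -> float:
    """Smaller angle in degrees between the hour and minute hands of an analog clock."""
    # The hour hand moves 30 degrees per hour plus 0.5 degrees per minute.
    hour_angle = (hour % 12) * 30 + minute * 0.5
    # The minute hand moves 6 degrees per minute.
    minute_angle = minute * 6
    diff = abs(hour_angle - minute_angle)
    # Report the smaller of the two angles between the hands.
    return min(diff, 360 - diff)


print(analog_hand_angle(3, 15))  # 7.5
```

A model that answers 7.5° has done the arithmetic correctly but may still have missed the "digital" wording.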
i don't use this question but it's a great one.. from the Simple Bench sample questions
doesn't matter for how long they reason, if the model falls for the 'waterproof' red herring in the description of the glove (and assumes it fell in the river), then they're screwed and will invariably fail - like all models still do in terms of consistent responses (afaik).. but yeah the 'solution' is far more simple than a complex word-path puzzle.. it just requires grasping the basic reality of the situation described (and not assuming complex calculations are required to solve it)
when reasoning models stop assuming that's when they can actually solve complex and simple questions
ayy gmpuzzles reference
i've been learning a lot about puzzles lately lol
puzzles are quite fun
amazing i know
i'm still yet to emerge from a few rabbit holes ha
Ultra is probably better
bruh
nobody knows until it releases
it very well could be. it might not
I mean, we know base models soo we can make some guesses but i got your point
brb asking o3
Available on lmarena?
I saw from this
now that's one logic bomb, suitable for use by Kirks everywhere
AI space been a little slow
O3 pro isn’t anything significant
And besides that it’s all more Google hype what is everyone else doing?
tbh it's just about exactly what could have been expected. We already saw o1-pro and deep think numbers, this provided similar marginal gains over o3. What more is there to expect from it?
Living the good life for quality work?
We don’t need a new model every month or two, I prefer waiting a (few) year(s) for a revolutionary upgrade rather than getting just slightly adjusted ones every month. Quality or breakthrough takes time 😅
Likewise, but that’s not how the tech space works; short attention span, short-term return on investment, hype now is their strategy
Better technologies could develop but there’s no guarantee, and money is lobbed at things certain to be profitable
@deep adder only 20 messages per month for o3 pro teams
Maybe add a new seat
New Gmail
Easy
Arena developers, are you working on any cool features (usable in direct chat)?
apparently next week
Next week for sure
versus ultra?
@keen beacon ok Paul from Aider does not appear to be very smart lol. But that whole thinking budget saga I would say is still not resolved. He tested 2.5Flash too and got the same results (budget 24k higher score than auto budget). Coupled with things that you saw yourself like higher median and... that's a lot of "coincidences". There's defo something changing with the model setup affecting model responses in unknown ways auto budget vs max I would say
cap ran flash with 24k budget and with none, and median was lower, auto thought more for some reason
but yeah theres something happening
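One rough way to put a number on the "that's a lot of coincidences" argument is a sign test: if the budget setting made no difference, each independent comparison would be a coin flip as to which side comes out higher. A minimal Python sketch, assuming the comparisons are independent and ignoring how large each difference is; the counts below are placeholders, not results from anyone's benchmark:

```python
from math import comb


def sign_test_p_value(wins: int, trials: int) -> float:
    """Two-sided sign test: chance of a split at least this lopsided under a fair coin."""
    k = max(wins, trials - wins)
    # Probability of k or more outcomes in one direction out of `trials` fair coin flips.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Placeholder counts (hypothetical): how often one budget setting came out higher.
for wins, trials in [(2, 2), (5, 5), (10, 10)]:
    print(f"{wins}/{trials} same direction: p = {sign_test_p_value(wins, trials):.4f}")
```

Two agreeing comparisons give p = 0.5, so they say little on their own; ten out of ten gives roughly 0.002. It says nothing about why the setting would matter, only whether the direction is consistent.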
im also gonna leak the cot, probably, should help. on different thinking budgets
crack bench
(cap results
i dont think there are walls in either atm fwiw xd
would be interesting to see total tokens too. Maybe it switched from thinking to final response 🤔
it makes sense why theyre still great, if they use a lot of synthetic data those tendencies would continue
Also you can see claude models have high scores too soo it is about writing
For reasoning I'm finding o3 better too, but for writing gemini 06/05 is best right now, along with Opus 4
yea
They have always been good at RLAIF and RL in general, two things that really work with coding
And their models are always on the bigger side, which seems to help as well
yeah, a bit unrelated, but i dont think anthropic are good at small models xd
lmao I think I just broke 2.5pro. It's generating response after response recursively repeating itself over and over, with several thinking boxes per response... 
that happens to me as well, but under a specific scenario. did you do anything notable btw?
was playing with making it output the system prompt, it inevitably got itself into that thing it does where it repeats your message instead... Now it's repeating itself instead lol
somewhat related but i think 2.5 pro can output special tokens
it can also break the output
that mightve happened in ur scenario
it reminds me of that a little
it's interesting that it's prompting itself though... like how does it keep sending responses with no additional input? 🤯
weird
the stop parsing/whatever is bugged in the inference engine
How is best Qwen model compared to o3 and 0605?
it stopped just shy of maxing everything out 😭
still impressive though, to get 900k with just like 3 sentences input lmao
damn imagine paying for that 🤣
Wtf is this
models/kingfall-ab-test
wen models/titanforge-ab-test
It usually works but this time it fails because it tried to use that model
yes it broke recently
huh
rip kingfall
i think it's temporary
given the AB test is still attempted by the frontend and it just can't reach the model
they probably pulled it to update it with a new checkpoint or something
There's new ones added
oh that version was 2 weeks old iirc
timeline makes sense
send the new model id
if they're planning a Thursday release they're leaving it pretty fine with the lmarena testing
idk about them ever releasing it (compute and all that?) but it seems pretty far in development
maybe that was when the abtest started
2 weeks ago?
or is that actually the revision date
i don't think so, it's just the date that checkpoint was finished
why
y'all still doing this?
wtf that actually worked
:p
that google i/o
when it happening
huh?
btw is prowlridge good
yo so for the ai generation image contest, would giving it an image and saying "recreate this image as accurately as possible" be technically allowed? 🤔
Think it's technically a bit worse than kingfall but pretty similar
nope
"the best i can guess is version 3.1 large"
what else could it be tho?
2.5 ultra??
deepthink maybe?
it is not
how yall know
its definitely not gemini 3
shh
you cant interpret that version number naively for sure
dark mouth?
black eye
yea it prob not
I run the terminator svg on every new model
Musk
bt
white mouth
Do other pro models use that style?
tl
it's like reiterating on me
if kingfall releases is it gonna be the best (over 4 opus and o3 pro?)
kingfall is dead and gone
dark mouth > kingfall
none of them are as good as kingfall
it does mean something if you know how to prompt it and view the results tbh. Tends to correlate with webdev and arc-agi to a meaningful degree
it stopped generating an svg, mid response, and started to generate a new one, dark mouth
lmao it started thinking again
arc-agi is spatial reasoning
dark mouth > kingfall
wait let me see this svg tho
Opus does great. It does well with svg too. 2.5Pro... new versions were not tested on arc-agi
Opus is best at arc AGI 2
but the old one did very decently too
nice
we'll see when the actual benchmark comes out ig
Opus? That's like the model the least affected by contamination tbh. It being big you would have to work hard to overfit
bigger models overfit more easily
oh i forgot to add thinkingBudget: 0
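For context, `thinkingBudget` is the Gemini API setting being compared throughout this thread. A minimal Python sketch of where it sits in a REST `generateContent` request body, as the public API docs describe it (nested under `generationConfig.thinkingConfig`; treat the exact layout as an assumption). The prompt text and helper name are placeholders, and whether 0 fully disables thinking depends on the model:

```python
import json


def build_request_body(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent-style JSON body with an explicit thinking budget."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # Assumed field layout: thinkingConfig.thinkingBudget, measured in tokens.
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }


# Placeholder prompt; a budget of 0 requests no thinking tokens.
print(json.dumps(build_request_body("Draw a robot as an SVG.", 0), indent=2))
```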
and previous kingfall results fwiw
Not really. They need more training to get the same result. If you used the same safety alignment dataset and same hyperparameters on small and big model, small one will come out significantly more censored
top right is the best
Imo last one is best
the relationship between hyperparameters and model size is more complex. you can "overfit" more easily with bigger models
I also think the last one is better. The neck on the top right one is well done
Last one is clean and seamless between the different parts
it's not complex if you use the exact same learning rate and related parameters. Big model needs more work during training to get the same result - that's just basic fact tbh...
a higher learning rate might be better as it avoids getting stuck in local minima as well. all in all its very complicated
yea thats mine, was on auto thinking budget, running the same rn
it is very complicated if youve worked on this stuff
overall kingfall has more details and seems to have more variety
wait auto thinking budget svg incomin
hyperparams you often need to do a sweep for an optimal configuration and depending on the criteria it can be extremely complex
I did do training quite a bit. The fact alone that you need more powerful setup / more compute to even train a bigger model should tell you that what I'm saying is true. Big model needs much more work to overfit anything. And if you are using same compute and same dataset and same/comparable learning rates, small model in most normal cases would overfit faster...
There are different types of overfit
dark mouth, auto thinking
Sometimes the net randomly blows up too. Especially in FP8 training
Lmao, abomination
with hyperparameters staying roughly the same and same dataset you are essentially fixing the amount of work that is going to be done as a constant. Same amount of steps
fwiw u can believe me or not. i haven't done a "little" of it 🤷
well I said "a little" figuratively. Actually made money doing so as well 🤷
If you have a list of 100 inputs with corresponding outputs, and you train an 8 neuron net or a 10k neuron net, the 10k neuron net will overfit. This is a type of overfit that generally occurs later in a run
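The 100-examples point above can be illustrated without training a real network. A toy Python sketch, under loud assumptions: a 1-nearest-neighbour lookup stands in for the very wide net (enough capacity to memorize every training pair), a 2-parameter least-squares line stands in for the tiny one, and the data is synthetic (y = 2x + 1 plus Gaussian noise). It only shows the capacity effect, not what happens when both models get the same number of training steps:

```python
import random

random.seed(0)


def make_data(n):
    """Synthetic pairs: y = 2x + 1 plus unit Gaussian noise (made-up data)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + random.gauss(0, 1)) for x in xs]


def fit_line(data):
    """Closed-form least squares for y = a*x + b (the low-capacity stand-in)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_memorizer(data):
    """1-nearest-neighbour lookup: reproduces every training label exactly."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


train, test = make_data(100), make_data(1000)
for name, model in [("line", fit_line(train)), ("memorizer", fit_memorizer(train))]:
    print(f"{name}: train MSE = {mse(model, train):.3f}, test MSE = {mse(model, test):.3f}")
```

The memorizer's training error is exactly zero because it stores the noise along with the signal; on fresh samples from the same process it should do worse than the line.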
bigger model tends to inherently generalize better. And if you have a certain dataset meant for a smaller model... for a bigger model you would usually need notably more. To reach a significant enough change, let alone overfit it... Obviously I'm excluding edge cases like different or unreasonably high learning rates and shitton of epochs, but you shouldn't do that to begin with.
here's an alien cat
not feeling dark mouth
That one is pretty decent imo
Although kinda weird IG like I'm seeing inside the head without a face
i don't see why they'd replace the kingfall AB test, if it was ultra, with a pro AB test
imo kingfall did the svgs a bit better, had more fidelity to it than darkmouth
I think so too
these two models are both drawing rounder heads than kingfall lmao
i don't think svgs are the be all end all of model capability tbh
how is it so far for u
i haven't done enough tests but in the ones i have done it has been equal or slightly better vs kingfall
I thought kingfall was the final checkpoint lmao
no ☠️
they're moving quickly with this though
Well if they did continue it, I guess diminishing returns are expected, so maybe they are just really close
given they're AB testing in Studio I would say it's in the later stage of development yeah
f's
lol when a model gets this right can we call it AGI
im gonna be dead when it happens 😭😭
so next year? lol jk
I feel like its just missing some contextual knowledge for this ngl
Like it doesn't really understand the world
kingfall has gotten it "right" and then concedes it immediately lmao
I hate when models do this
victim of its own autistic overthinking 🥀
can you all not use something cute for the testing?!? 😣