#general | Arena | Page 56

ocean vortex Jun 12, 2025, 1:33 PM

#

ok I'll download it

#

Well like I said, I don't think budget matters that much with Gemini

but I wouldn't be shocked if for specific prompts (including aider) max budget does result in slightly more tokens. Also what was the total amount of tokens generated? That would be more telling than stddev IMO

#

@keen beacon

#

so... it still shows 32k budget generates more. You are only trying to argue the significance of it

#

wdym

#

you are doing some gymnastics here, I don't think this statistical analysis theory applies perfectly here

#

but it would if you redid the test and got the opposite results

#

if it's always more tokens for 32k budget the results are meaningful

#

with the only question remaining how it translates to performance

#

what you are saying here is that small increase in total tokens generated is impossible...

#

it has to be big or nothing at all

#

well it's your assumption though, by design any small increase you would view as noise

#

are you actually fr here

#

lol

#

that's you who is using models to reason here...

#

simplified, it told you that there's not enough data to say definitively

#

if you sent it 10 more runs like that all of them showing higher mean, the answer would have been likely different

#

and all show higher mean for 32k budget?

#

Well I meant 10 more tests like this one the entire thing
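The disagreement above (one run with a higher mean versus many runs all pointing the same way) can be put in numbers with a simple sign test. This is only a sketch: it assumes the runs are independent and that, if the budget setting made no difference, either side would be equally likely to come out higher in any given run. The function name is made up for illustration.

```python
from math import comb


def sign_test_p(wins: int, runs: int) -> float:
    """One-sided sign test.

    Probability of seeing at least `wins` runs out of `runs` favour the
    same side if each run were a fair coin flip (no real difference).
    """
    favourable = sum(comb(runs, k) for k in range(wins, runs + 1))
    return favourable / 2 ** runs


# One run showing a higher mean: fully consistent with chance.
single_run = sign_test_p(1, 1)

# The original run plus 10 more, all showing a higher mean: hard to call noise.
eleven_runs = sign_test_p(11, 11)
```

Under those assumptions a single run gives 0.5, while eleven out of eleven gives 1/2048, which is roughly why both sides can be right: one run is indicative at best, a consistent streak would be meaningful.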

#

I'm literally not. You saw higher mean and then tried to prove that it's meaningless. 32k budget generated more tokens than max_budget. Not definitive ofc with such limited data, but it can be indicative

#

we do not know how their budget thing is implemented, way too many assumptions here... It could affect the stddev itself

#

I'm not dismissing it merely pointing out that it is higher with max budget for that test run

#

which could mean system prompt changes...

#

like don't you find it interesting that not only median is higher but also min is lower and max is higher?

#

Well if you know those you basically know Python, it's an easy language to learn lol

#

nowadays everything shiny is python

keen fulcrum Jun 12, 2025, 2:10 PM

#

Why do llms use python, like tools and such

ocean vortex Jun 12, 2025, 2:11 PM

#

keen fulcrum Why do llms use python, like tools and such

cause it's the most flexible language

keen fulcrum Jun 12, 2025, 2:11 PM

#

ocean vortex cause it's the most flexible language

Why not native c

ocean vortex Jun 12, 2025, 2:11 PM

#

and used in statistics

#

so all the graphs etc, data analysis... Python. And machine learning as well

ocean vortex Jun 12, 2025, 2:13 PM

#

keen fulcrum Why not native c

Ancient

patent aspen Jun 12, 2025, 2:14 PM

#

I like Rust a lot. I wish it was easy and impactful to migrate to

ocean vortex Jun 12, 2025, 2:14 PM

#

I think it will eventually progress into natural language being viewed as coding lol. But we are not quite there yet. Currently all the most meaningful stuff is being done in Python tbh

#

AI can replace coders, but it can't replace those who are making the AI itself (Python), at least not nearly as fast... 🤷‍♂️

torn mantle Jun 12, 2025, 2:17 PM

#

PUAHAHAHAHA

#

YOU STILL HAVENT SEEN MY EVIL SIDE

nimble trail Jun 12, 2025, 2:17 PM

#

Thank you for your advice Brian 🤜🤛

patent aspen Jun 12, 2025, 2:19 PM

#

Oh I'm not talking about the difficulty of using the language. (It is on the more advanced side.) I'm talking about migrating a large code base to Rust

ocean vortex Jun 12, 2025, 2:20 PM

#

iOS26 design is somewhat growing on me. Not a fan how easy it is to break it with customization but hopefully that's just developer beta things

keen beacon Jun 12, 2025, 2:20 PM

#

patent aspen Oh I'm not talking about the difficulty of using the language. (It is on the mor...

oh i completely went on a tangent lol

#

ignore me

patent aspen Jun 12, 2025, 2:20 PM

#

No worries. It's a fun language

keen beacon Jun 12, 2025, 2:20 PM

#

i cant use anything else loll

#

isnt that still happening tho?

#

c++ to rust

#

lmao

#

yeah who needs that stuff, honestly you dont need c++

#

just write everything in assembly

#

a random tangent but someone reminded me about the illusion of thinking, at least this in the abstract:

"We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
kingfall in the face of increased complexity (2 6x6 zebra puzzles instead of 1 where it constantly fails), builds its own system to exactly solve zebra puzzle. kinda wild 🤣 (still on this, just found it fitting)

#

i dont think the illusion of thinking will hold up well...

#

yeah they cant reason... they're pattern matching on these unseen zebra puzzles

#

i oughta give it more 6x6 zebra puzzles lol

#

im actually curious if it can do more than 2

#

nah 🤣

#

it will fail

#

oh they can

#

can they solve two in 14.5k tokens?

cobalt bane Jun 12, 2025, 2:41 PM

#

I have a link to kingfall but guys dont leak

keen beacon Jun 12, 2025, 2:41 PM

#

its a little unfair since they rl on this so much

#

fake news

#

yes

cobalt bane Jun 12, 2025, 2:41 PM

#

How is it patched the llm still respond me

keen beacon Jun 12, 2025, 2:41 PM

#

its redirecting

#

to 0605

cobalt bane Jun 12, 2025, 2:41 PM

#

Oh okay

late path Jun 12, 2025, 2:57 PM

#

@keen beacon

📎 (image attachment)

#

tested on 2.5 flash 0520

keen beacon Jun 12, 2025, 2:58 PM

#

what sampling setting did u use btw

#

on 0605 i sampled 10 on thinking budget = 4096 and it was doing 10,000+ tokens in thinking (i triple checked the api requests)

civic flame Jun 12, 2025, 2:58 PM

#

keen beacon its redirecting

wait what

late path Jun 12, 2025, 2:58 PM

#

temperature=0.05

keen beacon Jun 12, 2025, 2:59 PM

#

keen beacon on 0605 i sampled 10 on thinking budget = 4096 and it was doing 10,000+ tokens i...

on 0 temperature

keen beacon Jun 12, 2025, 2:59 PM

#

late path temperature=0.05

yeah i had to switch it to 0.7 or it would produce weird results

late path Jun 12, 2025, 2:59 PM

#

50 samples for auto and 50 samples for 24k

keen beacon Jun 12, 2025, 3:00 PM

#

honestly the weird graph looks like the bug to me

#

but i dont know

#

i didnt test flash at all

#

what prompt did you use btw

late path Jun 12, 2025, 3:01 PM

#

142789*521=?
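For scoring samples on this prompt, the reference answer can be computed exactly with integer arithmetic. The checker below is a hypothetical helper, not what was used for the charts; it assumes the model's final answer is the last number in the response text.

```python
import re

A, B = 142789, 521
EXPECTED = A * B  # exact product from Python's arbitrary-precision ints


def is_correct(response_text: str) -> bool:
    """Return True if the last number in the response equals A * B.

    Thousands separators such as "74,393,069" are accepted.
    """
    numbers = re.findall(r"\d[\d,]*", response_text)
    if not numbers:
        return False
    return int(numbers[-1].replace(",", "")) == EXPECTED
```

Accuracy over a batch would then be something like `sum(map(is_correct, responses)) / len(responses)`, where `responses` is whatever list of response strings was collected.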

keen beacon Jun 12, 2025, 3:01 PM

#

i intentionally used one that causes very varied distributions/more representative

#

🤷 i dont know lol

late path Jun 12, 2025, 3:03 PM

#

heres the raw response
line 1 auto
line 2 24k

📎 raw_response.txt

keen beacon Jun 12, 2025, 3:03 PM

#

ill look at that later

#

look at thinking budget 4096 on 0605 (0 temp/1 top p) i dont know lol (with sampling it returns as expected)

#

not much i just tested it a few on each lower thinking budgets. i only did 30 for max and auto

#

no thinking budget 0 temp etc for reference btw as well

#

the results dont make sense 🤣

calm sequoia Jun 12, 2025, 3:33 PM

#

late path <@456226577798135808>

Could you add log to the first plot?

late path Jun 12, 2025, 3:36 PM

#

The chart just now didn't exclude a response where the model got stuck in a repetitive loop, outputting up to 65535 tokens

#

idk, it seems like a higher thinking budget reduced the overall number of thoughts based on this test, but the accuracy increased significantly

keen beacon Jun 12, 2025, 3:39 PM

#

i think youre hitting another bug. it seems the gemini api seems to have a lot of issues idk

late path Jun 12, 2025, 3:41 PM

#

Another logical reasoning problem. With temp=0.5, it's also basically the 24k thinking budget that reduced the total amount of thought. didn't expect 2.5flash to have a 100% accuracy rate though

keen beacon Jun 12, 2025, 3:44 PM

#

late path Another logical reasoning problem. With temp=0.5, it's also basically the 24k th...

what do you think is happening

#

i oughta look at this later wut is going on

late path Jun 12, 2025, 3:51 PM

#

I think the thinking budget is definitely a training parameter, enabling it adds more constraints compared to 'auto' (leading to a more concentrated distribution of thought lengths), and the model is aware of it

fleet lintel Jun 12, 2025, 3:51 PM

#

late path Another logical reasoning problem. With temp=0.5, it's also basically the 24k th...

100% accuracy for what?

keen beacon Jun 12, 2025, 3:52 PM

#

late path I think the thinking budget is definitely a training parameter, enabling it adds...

For 0605 I didn't see that tho

late path Jun 12, 2025, 3:53 PM

#

fleet lintel 100% accuracy for what?

A puzzle

#

Sroan's safe
Sroan has a private safe. The combination is 5 different digits.
A's guess: 32561
B's guess: 18093
C's guess: 91526
Sroan then said, "Each of you three has correctly guessed two digits in non-adjacent positions. (It only counts as correct if both the position and the digit are correct.) If you can figure out the password, the money inside is yours!"
Assuming the three are super intelligent, can they get the money? And what is the password?
||Password: 38596||

languid crescent Jun 12, 2025, 4:00 PM

#

What's the most accurate/powerful ai in LMarena (for coding)?

jade egret Jun 12, 2025, 4:05 PM

#

project mariner

jade egret Jun 12, 2025, 4:05 PM

#

languid crescent What's the most accurate/powerful ai in LMarena (for coding)?

check out webdev arena

#

it's gemini 2.5 pro rn

calm sequoia Jun 12, 2025, 4:10 PM

#

late path The chart just now didn't exclude a response where the model got stuck in a repe...

Turning ON auto-budget increases the budget but decreases the performance? 👀 Wtf is going on. Model worrying too much about budget when it's present?

keen beacon Jun 12, 2025, 4:11 PM

#

calm sequoia Turning ON auto-budget increases the budget but decreases the performance? 👀 Wt...

it might be a bug

#

ill try to collect more data

torn mantle Jun 12, 2025, 4:12 PM

#

jade egret project mariner

yea so many projects

#

they will eventually update their gemini 2.5 pro on 19th of this month

#

full acceleration

#

I really hope the reason grok 3.5 hasn't been released is because elon wants it to match the fake leaked benchmarks

#

that would be justifiable

keen beacon Jun 12, 2025, 4:15 PM

#

torn mantle that would be justifiable

how can it match 2.5 ultra though

torn mantle Jun 12, 2025, 4:16 PM

#

keen beacon how can it match 2.5 ultra though

there is no such model

#

its 2.5 pro deep thinking

#

2.5 pro latest will be on 19th = kingfall

#

then they will release the deep thinking feature

#

grok 3.5 will probably come around that time

#

its probably ass tho

torn mantle Jun 12, 2025, 4:39 PM

#

and whats the GA endpoint?

#

the current model?

#

goldmane huh

#

f

patent aspen Jun 12, 2025, 4:40 PM

#

Yes it's just the current model

vocal pelican Jun 12, 2025, 4:41 PM

#

will there be any free access?

patent aspen Jun 12, 2025, 4:41 PM

#

vocal pelican will there be any free access?

For what

vocal pelican Jun 12, 2025, 4:42 PM

#

deepthink

patent aspen Jun 12, 2025, 4:43 PM

#

Doubtful

alpine coral Jun 12, 2025, 4:51 PM

#

it'll be slightly nerfed surely

sacred plaza Jun 12, 2025, 4:51 PM

#

Can someone explain to me why we should pay for the Google ultra plan for deep think feature? When I can just ask the current gemini models to employ graph of thoughts and get the same internal process from the LLM?

alpine coral Jun 12, 2025, 4:51 PM

#

what's the point of exp->preview->GA if they're just the same models

keen beacon Jun 12, 2025, 4:52 PM

#

sacred plaza Can someone explain to me why we should pay for the Google ultra plan for deep t...

inference time adjustments probably cant be emulated with the stuff we have publicly

vocal pelican Jun 12, 2025, 4:52 PM

#

sacred plaza Can someone explain to me why we should pay for the Google ultra plan for deep t...

I have to assume it’ll be a little more complicated than that

alpine coral Jun 12, 2025, 4:52 PM

#

(progressively made 'safer' / more corporate aligned before they give the sign off: go, use this in prod)

keen beacon Jun 12, 2025, 4:52 PM

#

keen beacon inference time adjustments probably cant be emulated with the stuff we have publ...

maybe u can it depends. the gemini api allows for a lot

alpine coral Jun 12, 2025, 4:53 PM

#

indeed (i mean get it, from a corporate risk management perspective)

#

i just feel like the 'tweaks' made in each intermediary step before GA are to reduce 'harmfulness' rather than increase 'performance'

keen beacon Jun 12, 2025, 4:54 PM

#

probably not

alpine coral Jun 12, 2025, 4:54 PM

#

(both admittedly slippery/ subjective terms)

keen beacon Jun 12, 2025, 4:54 PM

#

at least this time, timeline seems too short

alpine coral Jun 12, 2025, 4:54 PM

#

yeah that's a fair counterpoint

keen beacon Jun 12, 2025, 4:54 PM

#

they could deploy a completely different revision but why do a preview stage idk

alpine coral Jun 12, 2025, 4:55 PM

#

not completely different

#

yeah like i said, i do get it

keen beacon Jun 12, 2025, 4:55 PM

#

alpine coral not completely different

they could use a slightly newer revision after it with adjustments, but the adjustment probably wouldnt be preview feedback

#

unless you know the future

vocal pelican Jun 12, 2025, 4:56 PM

#

alpine coral i just feel like the 'tweaks' made in each intermediary step before GA are to re...

I don’t think that harmfulness stuff has noticeably changed much though, unless I’ve missed it

alpine coral Jun 12, 2025, 4:57 PM

#

keen beacon they could use a slightly newer revision after it with adjustments, but the adju...

yeah for sure. to the extent there are changes (from exp/preview to GA) that increase alignment but lead to performance degradations, they're likely marginal / imperceptible to most

#

but yeah i feel the focus is on making it 'suitable' for well GA aha

soft kernel Jun 12, 2025, 4:58 PM

#

Sorry off topic
Is lmarena going to release o3 Pro rankings

keen beacon Jun 12, 2025, 4:58 PM

#

its not gonna be on the arena

alpine coral Jun 12, 2025, 4:58 PM

#

vocal pelican I don’t think that harmfulness stuff has noticeably changed much though, unless ...

yeah.. you might be right tbf

vocal pelican Jun 12, 2025, 4:58 PM

#

soft kernel Sorry off topic Is lmarena going to release o3 Pro rankings

it takes like 13 mins on every request

patent aspen Jun 12, 2025, 4:58 PM

#

It's also not just alignment. A ton of post training for performance happens between experimental and GA

alpine coral Jun 12, 2025, 4:59 PM

#

they just never feel 'better'.. but i'm talking out of my ass here.. i think i just have rose tinted glasses on thinking about earlier models ha

soft kernel Jun 12, 2025, 5:00 PM

#

vocal pelican it takes like 13 mins on every request

They can't rank the model?

keen beacon Jun 12, 2025, 5:00 PM

#

alpine coral they just never feel 'better'.. but i'm talking out of my ass here.. i think i j...

its harder to tell without the raw thoughts now ngl. a lot of the signal of each prompt was from it (if there are model changes)

vocal pelican Jun 12, 2025, 5:00 PM

#

soft kernel They can't rank the model?

people would need to rate the model in battles, and nobody is waiting 13 minutes for it to respond to do their rating

jade egret Jun 12, 2025, 5:00 PM

#

when is kingsfall dropping

keen beacon Jun 12, 2025, 5:01 PM

#

it no longer exists

vocal pelican Jun 12, 2025, 5:01 PM

#

alpine coral they just never feel 'better'.. but i'm talking out of my ass here.. i think i j...

I feel like a lot of it has been minor gains in some fields at the expense of others

patent aspen Jun 12, 2025, 5:01 PM

#

vocal pelican I feel like a lot of it has been minor gains in some fields at the expense of ot...

Yeah it's usually a game of whack-a-mole

alpine coral Jun 12, 2025, 5:01 PM

#

for 05-06 fs

alpine coral Jun 12, 2025, 5:02 PM

#

vocal pelican I feel like a lot of it has been minor gains in some fields at the expense of ot...

re this

keen beacon Jun 12, 2025, 5:02 PM

#

ngl i still use 0506 lmao

#

its a mix for me

vocal pelican Jun 12, 2025, 5:04 PM

#

keen beacon ngl i still use 0506 lmao

yeah I mostly use LLMs for maths so I’ve been sticking with it

keen beacon Jun 12, 2025, 5:04 PM

#

vocal pelican yeah I mostly use LLMs for maths so I’ve been sticking with it

didnt 0506 do worse on math benchmarks though?

#

if it performs well for you, it doesnt matter i guess

vocal pelican Jun 12, 2025, 5:05 PM

#

not that I’ve seen, and generally yeah it has been a tiny bit better for me

#

still prefer qwen for maths though it has gotten a lot right that other models haven’t

keen beacon Jun 12, 2025, 5:05 PM

#

its very good

#

you could interpret that two ways i guess

patent aspen Jun 12, 2025, 5:09 PM

#

No bad news. I just don't know of anything particularly interesting happening in July yet

#

Probably the best opportunity for OAI to strike back

keen beacon Jun 12, 2025, 5:11 PM

#

Idk what to expect with gpt 5 xd

patent aspen Jun 12, 2025, 5:11 PM

#

If OAI doesn't release o4 or GPT-5 in the next 2 months, they're in trouble IMO

#

I think they probably will though

keen beacon Jun 12, 2025, 5:11 PM

#

Is the SVG model never gonna see the light of day

neon warren Jun 12, 2025, 5:14 PM

#

May be they will release Veo3 api in July?

patent aspen Jun 12, 2025, 5:14 PM

#

I don't track image gen, video gen, etc

keen beacon Jun 12, 2025, 5:14 PM

#

They did say 2.5 image gen was coming

#

I guess we have to look forward to

alpine coral Jun 12, 2025, 5:22 PM

#

i haven’t played around much but fwiw o3 pro seems very strong (better than o3 fs, whether better than o1 pro i'm not sure..), but also painfully slow..

#

here for e.g., given a nyt “Connections” word puzzle, o3-pro does exceptionally well (the closest I’ve seen a model come to fully solve it), taking 14 minutes; o3 graciously admits defeat, after taking 10 minutes to think about it; for reference/comparison, 2.5-pro-06-05 blitzes through it in less than 2 mins, but fails dismally.

#

what's interesting is that the slowness (seemingly) isn’t due to the number of tokens used during thinking, but just that the inference is slow af (or perhaps there's some parallel stuff going on that isn't reflected in token count).. like look at the tokens / sec…

#

but also, while ‘faster’, o3 used more than three times as many tokens, only to fail to find a solution; while with 06-05 it’s like it dominated a hurdles race on ‘time’, but crashed through every hurdle along the way..very fast but terrible/disqualifying performance

tall summit Jun 12, 2025, 5:33 PM

#

alpine coral here for e.g., given a nyt “Connections” word puzzle, o3-pro does exceptionally ...

oooh have ya tried puzzgrid puzzles?

civic flame Jun 12, 2025, 5:41 PM

#

alpine coral here for e.g., given a nyt “Connections” word puzzle, o3-pro does exceptionally ...

what was the puzzle?

alpine coral Jun 12, 2025, 6:06 PM

#

tall summit oooh have ya tried puzzgrid puzzles?

not until now, thanks! (i see where the nyt got the idea from aha)

alpine coral Jun 12, 2025, 6:07 PM

#

civic flame what was the puzzle?

Can you solve this puzzle?

---

How to Play:

Find four groups of four items that share something in common.

Category Examples: • FISH: Bass, Flounder, Salmon, Trout • FIRE ___: Ant, Drill, Island, Opal

The above are examples only and by no means exhaustive; categories can be established in various ways, including for example by manipulating/altering four words in the same way, or focussing on a particular component of each, such that they have a shared meaning or association.

Categories will always be more specific than “5‑LETTER‑WORDS,” “NAMES” or “VERBS.”

Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!

——

PUZZLE:

HELL, WELL, COMP, ORGO, SHELL, MILK, LIT, SICK, PASTE, FIRE, SMOKE, DOPE, CREAM, NETI, ILL, LICK

——

SOLUTION:

tall summit Jun 12, 2025, 6:07 PM

#

civic flame what was the puzzle?

https://connectionsplus.io/game/693

NYT Connections #693 | Connections+

Play the full NYT Connections archive and custom games on Connections Plus. When you're done, create and share custom Connections games!

alpine coral Jun 12, 2025, 6:07 PM

#

one of the nice things about using these i think is that there's no risk of contamination (for the latest ones ofc)

tall summit Jun 12, 2025, 6:08 PM

#

alpine coral not until now, thanks! (i see where the nyt got the idea from aha)

onlyconnect is the original original (a british quiz show), puzzgrid follows their rules faithfully

but the types of grids in puzzgrid vs nyt connections is vastly different (much more hardcore audience) so i think it'd be interesting besides just being another puzzle source

#

lmao rambling about something i like

alpine coral Jun 12, 2025, 6:09 PM

#

i looked at their About page and saw the reference to the british quiz show as the progenitor of it all ha

whole wagon Jun 12, 2025, 6:09 PM

#

alpine coral what's interesting is that the slowness (seemingly) isn’t due to the number of t...

The actual token count for o3 pro is 10x what's shown

#

They just count 10 parallel tokens as 1

#

Is 19th June Gemini 2.5 pro kingfall

alpine coral Jun 12, 2025, 6:12 PM

#

whole wagon They just count 10 parallel tokens as 1

i thought something along these lines - tho is that concrete / confirmed?

#

(also not sure about 'parallel tokens', but some kinda parallel / search thingy)

#

actually ig that language makes sense

civic flame Jun 12, 2025, 6:14 PM

#

tall summit onlyconnect is the original original (a british quiz show), puzzgrid follows the...

oh someone else who knows only connect!!

#

lol i use some of their stuff as personal benchmarks

indigo hazel Jun 12, 2025, 6:24 PM

#

the o3 model available is medium or high?

small haven Jun 12, 2025, 6:26 PM

#

higher or lower

zinc ore Jun 12, 2025, 6:26 PM

#

AI studio has issues today

#

Discord attachments down

#

And anthropic having issues too

leaden palm Jun 12, 2025, 6:27 PM

#

even npm is facing difficulties

#

and neon is (predictably) experiencing an outage

whole wagon Jun 12, 2025, 6:28 PM

#

How do you know this lol

keen beacon Jun 12, 2025, 6:28 PM

#

guessing

small haven Jun 12, 2025, 6:28 PM

#

hes a magician

zinc ore Jun 12, 2025, 6:28 PM

#

Let's see if true

keen beacon Jun 12, 2025, 6:29 PM

#

if it happens its just a coincidence imo

zinc ore Jun 12, 2025, 6:29 PM

#

Dudes luck stats maxed out

keen beacon Jun 12, 2025, 6:29 PM

#

zinc ore Dudes luck stats maxed out

yup

whole wagon Jun 12, 2025, 6:29 PM

#

Oh hi trey

#

I saw u in the chess discords

zinc ore Jun 12, 2025, 6:29 PM

#

Hello

#

Yeah, I used to talk there years ago

keen beacon Jun 12, 2025, 6:30 PM

#

https://downdetector.com/ wow its a lot

zinc ore Jun 12, 2025, 6:32 PM

#

https://fixupx.com/testingcatalog/status/1933229458393047352

TestingCatalog News 🗞 (@testingcatalog)

BREAKING 🚨: xAI is testing Grok 3.5 along with a voice mode on the Grok web! Sound on.

Release soon? 👀

**💬 3 🔁 3 ❤️ 27 👁️ 260 **


dusky aurora Jun 12, 2025, 6:33 PM

#

my first Failed to acceptterms-of-use

echo aurora Jun 12, 2025, 6:33 PM

#

the team is aware of some issues accessing the site, apologies for any inconvenience! we're working hard to get it fixed asap

civic flame Jun 12, 2025, 6:34 PM

#

okay guys what big cloud provider died this time

dusky aurora Jun 12, 2025, 6:34 PM

#

these days LMArena is my only source of stress relief

narrow elbow Jun 12, 2025, 6:35 PM

#

haha, very useful service

tall summit Jun 12, 2025, 6:35 PM

#

civic flame okay guys what big cloud provider died this time

i know right

leaden palm Jun 12, 2025, 6:36 PM

#

civic flame okay guys what big cloud provider died this time

hn says gcp (but gcp doesn't say)

civic flame Jun 12, 2025, 6:36 PM

#

tall summit i know right

https://www.cloudflarestatus.com/

Cloudflare Status

Welcome to Cloudflare's home for real-time and historical data on system performance.

keen fulcrum Jun 12, 2025, 6:36 PM

#

https://fixupx.com/javilopen/status/1932918439129104483

Javi Lopez ⛩️ (@javilopen)

🔥 Midjourney video is almost here...

And it has, somehow, that incredible artistic aesthetic that MJ is famous for.

Yes, these are all MJ video generations!!! 🧵👇

**💬 136 🔁 202 ❤️ 2.6K 👁️ 737.3K **


civic flame Jun 12, 2025, 6:36 PM

#

leaden palm hn says gcp (but gcp doesn't say)

cf too apparently then!?!?

#

lmfao

tall summit Jun 12, 2025, 6:37 PM

#

civic flame https://www.cloudflarestatus.com/

hoooly

keen fulcrum Jun 12, 2025, 6:37 PM

#

https://fixupx.com/javilopen/status/1932919326236971487

Javi Lopez ⛩️ (@javilopen)

2. It's incredible good for cartoons.

**💬 6 🔁 2 ❤️ 126 👁️ 13.6K **


whole wagon Jun 12, 2025, 6:37 PM

#

I was wondering why I couldn't get LLM arena battles to work lol

leaden palm Jun 12, 2025, 6:37 PM

#

ok this ones honest https://status.firebase.google.com/

dusky aurora Jun 12, 2025, 6:38 PM

#

great. now r/midjourney will be impossible to browse

civic flame Jun 12, 2025, 6:38 PM

#

that's crazy

#

cloudflare AND gcp failing at the same time

dusky aurora Jun 12, 2025, 6:38 PM

#

the addition of image generation has already ruined r/novelai

tall summit Jun 12, 2025, 6:39 PM

#

2025 midjourney is interesting

echo aurora Jun 12, 2025, 6:39 PM

#

keen fulcrum https://fixupx.com/javilopen/status/1932919326236971487

that's incredible, some of the other examples are so good too

#

side note the lofi activity is also giving me troubles atm, everything is broken 😭

civic flame Jun 12, 2025, 6:41 PM

#

"Update - We are continuing to investigate and work with our providers to resolve this issue. It is likely impacting a few more features like activities." - discord

#

seems both a bunch of cloudflare services and a bunch of GCP services are suffering

jade egret Jun 12, 2025, 6:42 PM

#

why can't i upload file

#

in discord rn

civic flame Jun 12, 2025, 6:43 PM

#

https://discordstatus.com/

Discord Status

Welcome to Discord's home for real-time and historical data on system performance.

echo aurora Jun 12, 2025, 6:43 PM

#

civic flame "Update - We are continuing to investigate and work with our providers to resolv...

ah, thanks

civic flame Jun 12, 2025, 6:44 PM

#

source @ google tells me an internal service there called Chemist is down

#

Chemist checks "project status, activation status, abuse status, billing status, service status, location restrictions, VPC Service Controls, SuperQuota, and other policies"

#

"goddamnit we told that intern not to deploy gemini's changes without running them through us first"

slender karma Jun 12, 2025, 6:47 PM

#

looks like the old arena website is working fine tho

civic flame Jun 12, 2025, 6:47 PM

#

because it doesn't rely on firebase 😉

#

were they planning a launch today?

#

GCP status updated

#

ocean vortex Jun 12, 2025, 6:55 PM

#

so now pro also has yapping...

civic flame Jun 12, 2025, 6:55 PM

#

oh images work now

ocean vortex Jun 12, 2025, 6:55 PM

#

🫃

echo aurora Jun 12, 2025, 6:58 PM

#

slender karma looks like the old arena website is working fine tho

it is? isn't working for me 😕

keen beacon Jun 12, 2025, 6:59 PM

#

works for me

civic flame Jun 12, 2025, 7:00 PM

#

ai studio is working again

#

things appear to be recovering

slender karma Jun 12, 2025, 7:00 PM

#

echo aurora it is? isn't working for me 😕

Arena (battle) doesnt work. Arena (side-by-side) and Direct Chat works.
It is a bit slow

civic flame Jun 12, 2025, 7:01 PM

#

yeah i think firebase is still a little shaky

#

naturally given it's gone from no traffic to all of it after coming back up

whole wagon Jun 12, 2025, 7:01 PM

#

jade egret why can't i upload file

discord flagged u btw

small haven Jun 12, 2025, 7:02 PM

#

ocean vortex so now pro also has yapping...

o1 pro used to be a beast, could output 2-3k lines without any placeholders/omissions

ocean vortex Jun 12, 2025, 7:03 PM

#

small haven o1 pro used to be a beast, could output 2-3k lines without any placeholders/omis...

iirc that one didn't have any "yap score" constraints

small haven Jun 12, 2025, 7:05 PM

#

ocean vortex iirc that one didn't have any "yap score" constraints

yea it doesn't but even last month, the o1 pro was on a nerfed out version, still couldn't rewrite code end-to-end without omissions

ocean vortex Jun 12, 2025, 7:05 PM

#

Today’s Yap score is 8192. Should I provide an explanation or just the number? The user seems to be asking for the score specifically, so I think just answering with 8192 would be sufficient. I could add a note if there's space, but given I might need to keep it concise, I’ll simply respond: “Today’s Yap score is 8192.” Alright, let’s finalize that answer!

Yeah it's assuming now it needs to be concise...

small haven Jun 12, 2025, 7:06 PM

#

o1 pro context was 128k, now its 64k on o3 pro via ui

ocean vortex Jun 12, 2025, 7:06 PM

#

second guessing itself, they made it confused lol

#

It's like "8192 is that a lot or not? Better be safe and keep it brief."

grim girder Jun 12, 2025, 7:08 PM

#

lmarena is down rn?

patent bane Jun 12, 2025, 7:08 PM

#

alpine coral ``` Can you solve this puzzle? --- How to Play: Find four groups of four item...

not sure if o3 pro in api uses tools like web browsing or not, but the chatgpt website does

small haven Jun 12, 2025, 7:10 PM

#

ccp chatgpt is built diff

echo aurora Jun 12, 2025, 7:12 PM

#

grim girder lmarena is down rn?

it is 😦

little thorn Jun 12, 2025, 7:13 PM

#

Hey, what on earth are you doinggg! All my important chat history was deleted!! Why don't you put a data backup module or an option to connect email or Google accounts so that chat data remains!! All chat data from two of my accounts was wiped!!! Seriously, what is this? Just add a Google connect option

jade egret Jun 12, 2025, 7:13 PM

#

whole wagon discord flagged u btw

bruhh

#

discord logged me out and sent me to limited access until tomorrow 😭

#

why 😭

small haven Jun 12, 2025, 7:14 PM

#

how tf is this not resolved

keen beacon Jun 12, 2025, 7:14 PM

#

claude is currently vacationing

small haven Jun 12, 2025, 7:15 PM

#

omg

soft kernel Jun 12, 2025, 7:16 PM

#

little thorn Hey, what on earth are you doinggg! All my important chat history was deleted!! ...

Oh noooooo

#

Mine got wiped out too😭😭

small haven Jun 12, 2025, 7:17 PM

#

ok so now that all my shxt is down, can someone recap whos at fault here

alpine coral Jun 12, 2025, 7:18 PM

#

patent bane not sure if o3 pro in api uses tools like web browsing or not, but the chatgpt w...

yes o3 chatgpt gets it right (by browsing the web for the solution ha - which tbf is actually what it does best)

#

o3 pro via API doesn't use tools afaik

#

not sure about via chatgpt

patent aspen Jun 12, 2025, 7:18 PM

#

Poll: Which company's models do you use most for daily tasks?

Winning answer: Google (14 of 22 votes)

wintry tinsel Jun 12, 2025, 7:19 PM

#

Is 2.5 pro down?

small haven Jun 12, 2025, 7:20 PM

#

i swear if its one intern that is causing all of this 😤

jade egret Jun 12, 2025, 7:21 PM

#

wintry tinsel Is 2.5 pro down?

dont think so

wintry tinsel Jun 12, 2025, 7:21 PM

#

It says exceeded quota on all my accounts

#

And even on another third party app where I use it

civic flame Jun 12, 2025, 7:22 PM

#

its back up for me

small haven Jun 12, 2025, 7:26 PM

#

can someone ping me when its resolved? gonna pluck out some weed grass

echo aurora Jun 12, 2025, 7:27 PM

#

little thorn Hey, what on earth are you doinggg! All my important chat history was deleted!! ...

I am really sorry that you lost your chat history. We are looking into different solutions that'll help prevent issues like this going forward.

civic flame Jun 12, 2025, 7:33 PM

#

🥀

autumn kettle Jun 12, 2025, 7:40 PM

#

Is it just me or still down?😩

atomic pagoda Jun 12, 2025, 7:41 PM

#

It’s still down but is there a way to recover chats

lapis light Jun 12, 2025, 7:45 PM

#

I don't think there's a way to recover those chats, because, based on how I think it works, the chats were stored in local storage, and, when you go to a website, if it changed, local storage gets cleaned, and cache gets reset.

autumn kettle Jun 12, 2025, 7:46 PM

#

Will the website come back any time soon?

whole wagon Jun 12, 2025, 7:46 PM

#

lapis light I don't think there's a way to recover those chats, because, based on how I thin...

You can recover it

#

If you don't visit the site

echo aurora Jun 12, 2025, 7:46 PM

#

autumn kettle Will the website come back any time soon?

it's not clear when it'll be working again unfortunately

keen fulcrum Jun 12, 2025, 7:47 PM

#

autumn kettle Is it just me or still down?😩

expected duration of downtime maybe 3 h

whole wagon Jun 12, 2025, 7:47 PM

#

You can just go in your file system and copy the local storage for the site. And then when it cleans it you put it back
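A rough sketch of that copy-and-put-back idea, assuming you already know where your browser keeps the site's local storage on disk. The location differs by browser and OS, so `src` and the backup path here are hypothetical placeholders, and there is no guarantee the site will accept restored data. Close the browser before copying or restoring.

```python
import shutil
from pathlib import Path


def backup_dir(src: Path, dst: Path) -> Path:
    """Copy the whole directory tree at src to dst (dst must not exist yet)."""
    if not src.is_dir():
        raise FileNotFoundError(f"no such directory: {src}")
    shutil.copytree(src, dst)
    return dst


def restore_dir(backup: Path, target: Path) -> None:
    """Replace target with a fresh copy of the backup tree."""
    if target.exists():
        shutil.rmtree(target)  # drop the wiped/cleared copy first
    shutil.copytree(backup, target)
```

Usage would be something like `backup_dir(Path("<profile storage dir>"), Path("storage-backup"))` before the site refreshes, then `restore_dir(...)` with the same two paths afterwards.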

keen fulcrum Jun 12, 2025, 7:48 PM

#

is it a large scale hacking attempt?

atomic pagoda Jun 12, 2025, 7:48 PM

#

whole wagon If you don't visit the site

What if I have multiple tabs open but haven’t visited them like I still have a few from yesterday open like if I don’t visit the site from those tabs does that count

whole wagon Jun 12, 2025, 7:49 PM

#

I would close them, if it refreshes it's over

cerulean seal Jun 12, 2025, 7:49 PM

#

marsh shadowBOT Jun 12, 2025, 7:49 PM

#

cerulean seal Jun 12, 2025, 7:49 PM

#

meh

#

not tryna spam

#

i was just testing smth

#

pictures work

whole wagon Jun 12, 2025, 7:50 PM

#

The core issue is resolved already

#

It's only some services take a bit to get going again

autumn kettle Jun 12, 2025, 7:51 PM

#

So we all are waiting for 2 hrs to pass😑🥲

hardy pecan Jun 12, 2025, 7:51 PM

#

Looks like internet backbone shenanigans

#

All types of services are down

#

It's not DNS
There's no way it's DNS
It was DNS

lapis light Jun 12, 2025, 7:53 PM

#

whole wagon You can just go in your file system and copy the local storage for the site. And...

Well, yeah, but that would be if you remembered to copy it before it got cleared.

elder rapids Jun 12, 2025, 8:00 PM

#

yo

#

all Samsung users get a free year of perplexity bro

#

😭

#

how'd I not know this

atomic pagoda Jun 12, 2025, 8:02 PM

#

Is it back up because I can access it but all my chats are gone

whole wagon Jun 12, 2025, 8:03 PM

#

Well. You weren't supposed to access it if you want your chats 😅

#

Because that's what clears them

atomic pagoda Jun 12, 2025, 8:04 PM

#

I’m aware now but I accessed it before I asked, note for next time

autumn kettle Jun 12, 2025, 8:04 PM

#

It's back but chats gone🥲 life back to zero

#

How often does it happen every week?

whole wagon Jun 12, 2025, 8:05 PM

#

Is battle back yet

leaden palm Jun 12, 2025, 8:11 PM

#

fun fact: the meta ai "discover" feed does not require a log in

#

anyone can see (accidentally) shared conversations

leaden palm Jun 12, 2025, 8:24 PM

#

leaden palm anyone can see (accidentally) shared conversations

edith norman is depressing

wintry tinsel Jun 12, 2025, 8:40 PM

#

little thorn Hey, what on earth are you doinggg! All my important chat history was deleted!! ...

LM arena has chat history since when

ocean vortex Jun 12, 2025, 8:59 PM

#

Ok I kind of changed my mind on o3-pro after more testing... It may not surprise you with shockingly good solutions beyond normal o3-high level, but they managed to fix its failures to quite a significant extent. Wouldn't be surprised now if it does beat Gemini on simple-bench 👀

#

parallel compute does give it more capacity to reconsider and do additional things, even if in a more limited scope

#

yeah probably.

whole wagon Jun 12, 2025, 9:04 PM

#

nah

#

o3 high is 53.1% and gemini 2.5 pro is 62.4%

ocean vortex Jun 12, 2025, 9:04 PM

#

My initial impressions were mostly based on the facts that it seems to be more concise and does not have more intelligence / isn't more capable in an obvious way. But it is "less dumb" where o3 can blunder, naturally improving the overall performance

whole wagon Jun 12, 2025, 9:04 PM

#

and also google can simply do this best of n search to beat it again, scaling the number of times you run the llm isnt that impressive

#

run gemini 1000 times and you probably beat human baseline of simplebench, its about the cost to performance
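Back-of-envelope version of the "just run it more times" point. This is only a sketch under a strong assumption: every attempt is independent with the same success chance `p`, and something reliably picks the correct attempt when one exists. Real benchmark questions are not like that (hard ones stay hard across samples), so treat the result as an upper bound. `pass_at_n` is a name made up for this example.

```python
def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # 0.624 is the Gemini 2.5 Pro score quoted above, reused here as a
    # per-question success rate (an assumption, not how the benchmark works).
    for n in (1, 4, 16):
        print(f"n={n:>2}  pass@n={pass_at_n(0.624, n):.3f}  cost={n}x")
```

The cost column is the other half of the argument: accuracy climbs quickly with `n`, but so does the bill, linearly.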

ocean vortex Jun 12, 2025, 9:08 PM

#

whole wagon o3 high is 53.1% and gemini 2.5 pro is 62.4%

10+ % gain for pro is very realistic tbh

ocean vortex Jun 12, 2025, 9:10 PM

#

whole wagon and also google can simply do this best of n search to beat it again, scaling th...

they are welcome to do so. For now o3-pro is released and deep think isn't.

#

And the pricing of o3-pro is not ridiculous anymore, unlike it was with o1-pro...

whole wagon Jun 12, 2025, 9:11 PM

#

its $20/$80 compared to $1.25/$10 for 2.5 pro, you are comparing different levels of pricing
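To put numbers on that gap, here is a small sketch that assumes the quoted prices are USD per million input/output tokens. The request size (10,000 input tokens, 2,000 output tokens) is a made-up example, and it ignores that reasoning tokens are usually billed as output, which would widen or narrow the gap depending on how much each model thinks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per 1M input tokens (assumed unit)
    output_per_mtok: float  # USD per 1M output tokens (assumed unit)

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """USD cost of a single request with the given token counts."""
        total = (input_tokens * self.input_per_mtok
                 + output_tokens * self.output_per_mtok)
        return total / 1_000_000


O3_PRO = Pricing(20.0, 80.0)          # prices as quoted in the chat
GEMINI_25_PRO = Pricing(1.25, 10.0)   # prices as quoted in the chat

if __name__ == "__main__":
    a = O3_PRO.cost(10_000, 2_000)
    b = GEMINI_25_PRO.cost(10_000, 2_000)
    print(f"o3-pro: ${a:.4f}  2.5 pro: ${b:.4f}  ratio: {a / b:.1f}x")
```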

ocean vortex Jun 12, 2025, 9:12 PM

#

whole wagon its $20/$80 compared to $1.25/$10 for 2.5 pro, you are comparing different level...

deep think is not gonna be $1.25/$10. Also look at Anthropic pricing. OpenAI are far from being the worst offender anymore lol

#

and o3 is a very different model to 2.5pro with different strengths...

whole wagon Jun 12, 2025, 9:13 PM

#

the point is deep think is not required. 2.5 pro equals o3 pro at a cheaper price

ocean vortex Jun 12, 2025, 9:13 PM

#

whole wagon the point is deep think is not required. 2.5 pro equals o3 pro at a cheaper pric...

??

whole wagon Jun 12, 2025, 9:13 PM

#

yes it does lol

ocean vortex Jun 12, 2025, 9:14 PM

#

2.5pro is worse than o3 non-pro on numerous things

#

at best it's equivalent, at worst it is not entirely o3 level model

whole wagon Jun 12, 2025, 9:14 PM

#

o3 sucks at maths compared to 2.5 pro, its not even close

ocean vortex Jun 12, 2025, 9:15 PM

#

whole wagon o3 sucks at maths compared to 2.5 pro, its not even close

you must be joking

whole wagon Jun 12, 2025, 9:16 PM

#

seems everyone else also lives in this 'delusion' looking at the leaderboards

#

2.5 pro is top in every single category

#

out of error also

ocean vortex Jun 12, 2025, 9:21 PM

#

whole wagon out of error also

do you know how user preference benchmarks work?

#

they were never meant to be the definitive indicators of performance

#

2.5Pro is great but it is not better than o3 and it is objectively and measurably, definitively worse than o3-pro

unborn ocean Jun 12, 2025, 9:23 PM

#

ocean vortex 2.5Pro is great but it is not better than o3 and it is objectively and measurabl...

at math its really only worse than o3-high imo, but that is just nitpicking from my side

#

i would also argue that it is on par with (and sometimes even better than) regular o3

ocean vortex Jun 12, 2025, 9:24 PM

#

unborn ocean at math its really only worse than o3-high imo, but that is just nitpicking from...

well yeah and by extension worse than o3-pro. Though with regular o3 it's very close and hard to tell after Google updated it

grim girder Jun 12, 2025, 9:24 PM

#

is it still down?

unborn ocean Jun 12, 2025, 9:24 PM

#

although when it comes to very very complex tasks o3 is just wayy better (imo).
because it can handle high token output better

zinc ore Jun 12, 2025, 9:25 PM

#

unborn ocean at math its really only worse than o3-high imo, but that is just nitpicking from...

Y'all should cite the benchmarks being used

whole wagon Jun 12, 2025, 9:25 PM

#

They using vibes

unborn ocean Jun 12, 2025, 9:25 PM

#

zinc ore Y'all should cite the benchmarks being used

https://matharena.ai

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

zinc ore Jun 12, 2025, 9:26 PM

#

Huh, no I'm saying show where models are performing better and worse

whole wagon Jun 12, 2025, 9:26 PM

#

unborn ocean https://matharena.ai

This is old Gemini 2.5 pro

unborn ocean Jun 12, 2025, 9:26 PM

#

well i don't think they reported many improvements on math

#

obv you could argue that they are at par now

#

but now you don't have any data

whole wagon Jun 12, 2025, 9:27 PM

#

Google released the benchmarks with the model?

#

For example

#

The base model is 35%

#

On USAMO 2025

#

Read it

ocean vortex Jun 12, 2025, 9:27 PM

#

whole wagon They using vibes

not vibes lmao

whole wagon Jun 12, 2025, 9:28 PM

#

Saturated benchmarks

#

Obviously

unborn ocean Jun 12, 2025, 9:28 PM

#

whole wagon For example

that is a different benchmarking process as far as i know, because the people behind MathArena actually manually check the reasoning
(but honestly you might also be right and 2.5 pro might be equal to o3-high in math)

#

(as far as i know, could also be wrong)

whole wagon Jun 12, 2025, 9:28 PM

#

It's the same benchmarking process lol

echo aurora Jun 12, 2025, 9:28 PM

#

grim girder is it still down?

I've been seeing mixed reports

whole wagon Jun 12, 2025, 9:28 PM

#

They just didn't bother to test the recent version

#

I would pull up frontier maths results but they refuse to test Gemini since it's openAI funded benchmark

#

Sad

ocean vortex Jun 12, 2025, 9:30 PM

#

whole wagon Saturated benchmarks

🤣

#

#

https://matharena.ai

#

this should shut you up

whole wagon Jun 12, 2025, 9:30 PM

#

That's saturated obviously

#

87%

#

Lol

ocean vortex Jun 12, 2025, 9:30 PM

#

whole wagon That's saturated obviously

it is not saturated and still very valid metric

#

close to saturated? yes. Not yet saturated though

whole wagon Jun 12, 2025, 9:31 PM

#

It's so close to 100 that it becomes basically luck. Actually learn some statistics, there is such a thing as sample size
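For what the sample-size point looks like in numbers, a sketch using the plain binomial standard error, sqrt(p(1-p)/n), with a normal-approximation 95% interval. The question count `n=30` is a hypothetical figure for illustration, not the benchmark's real size, and the normal approximation is itself rough this close to 100%.

```python
import math


def accuracy_std_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


def ci95(p: float, n: int) -> tuple[float, float]:
    """Approximate 95% interval for the accuracy, clipped to [0, 1]."""
    half_width = 1.96 * accuracy_std_error(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    low, high = ci95(0.87, 30)  # 87% as quoted above; n=30 is assumed
    print(f"87% on 30 questions: roughly {low:.2f} to {high:.2f}")
```

With only a few dozen questions, two models a handful of points apart can sit inside each other's intervals; with a few hundred, the same gap is much easier to trust.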

zinc ore Jun 12, 2025, 9:31 PM

#

Does the average include USAMO

ocean vortex Jun 12, 2025, 9:31 PM

#

just because you don't like the results does not make it saturated

#

you don't get to pick and choose

#

and this is overall score considering all the main math benchmarks

unborn ocean Jun 12, 2025, 9:32 PM

#

o3-high

whole wagon Jun 12, 2025, 9:33 PM

#

It's the only weakness remaining. The tool use sucks

unborn ocean Jun 12, 2025, 9:33 PM

#

o3 very roughly the same

whole wagon Jun 12, 2025, 9:33 PM

#

The actual model itself is astonishing though

#

They magically found 5x efficiency improvement

#

Sure

unborn ocean Jun 12, 2025, 9:34 PM

#

depends, honestly in many other tasks other than the academic ones (which i assume we are referring to when calling a model more powerful) o3 / o3-high / o3 pro are undeniably worse than 2.5 pro

ocean vortex Jun 12, 2025, 9:34 PM

#

whole wagon It's the only weakness remaining. The tool use sucks

We have not even been talking about them, but yeah... o3 will destroy 2.5pro for math when you use it on chatgpt website with tools. Even with no tools it's marginally better for math. With tools it's game over lol

unborn ocean Jun 12, 2025, 9:35 PM

#

no

#

then why would 4.1 already be priced this low

whole wagon Jun 12, 2025, 9:35 PM

#

Crazy

unborn ocean Jun 12, 2025, 9:35 PM

#

did they find this 5x cost reduction for 4.1 first

#

and then wait months

#

nah

#

can not make this s*it up

ocean vortex Jun 12, 2025, 9:36 PM

#

let's not push this too far lol. I'm pretty sure they just used that as an excuse to be completely honest. They had to say something otherwise it makes them look bad catgrin

#

they have less profit margin now

unborn ocean Jun 12, 2025, 9:37 PM

#

just funny

ocean vortex Jun 12, 2025, 9:38 PM

#

let's chill out everyone lol

unborn ocean Jun 12, 2025, 9:41 PM

#

imo if 2.5 pro fixes the following things it will be clearly better (not accounting for o3 pro):

actually use and then perfect tools (this also includes being able to follow structured prompts better, e.g. diff and command usage)
better adapt the reasoning length (higher variance in auto mode and more intelligently detect difficulty, in my experience this is still the main area holding the models back)

what o3 needs to improve to beat 2.5 pro (assuming they would actually update it):

actually make the model enjoyable to talk to
actually make the model "smarter"(my vibe) and not be completely unaware of some assumptions it is taking
some more knowledge honestly
better general visual understanding
better long context
make it better at reasoning over visual elements: webdev, complex video
prob. other stuff as well (but i have not really used o3 enough to judge)

-> most of the difference comes from things not highly relevant to the average user
-> in practice same intelligence

zinc ore Jun 12, 2025, 9:42 PM

#

o3 pro USAMO score when

unborn ocean Jun 12, 2025, 9:44 PM

#

well i want it to chat in a way where it is aware of assumptions

whole wagon Jun 12, 2025, 9:45 PM

#

o3 API usage is barely anything. See openrouter

#

That's the whole reason they had to drop prices 80%

#

The model was barely being used through API

#

The price drop does not seem to have helped that much though

unborn ocean Jun 12, 2025, 9:46 PM

#

well part of a good model is making it so that it adapts really well when you talk to it and "intelligently" identifies what i want
which 2.5 is able to do

whole wagon Jun 12, 2025, 9:47 PM

#

💀

unborn ocean Jun 12, 2025, 9:47 PM

#

it is still a valid point, o3 API usage is probably very low

#

no

#

so you think they are just like: well, i mean anthropic has these models that are really popular for coding and google's models are also on the rise for that. both of these companies are our competitors. BUT we will just not care at all

#

because "o3 is not an everyday model" 🤡

whole wagon Jun 12, 2025, 9:50 PM

#

💀

small haven Jun 12, 2025, 9:52 PM

#

craigbench is triggering some ppl

#

made by google btw 😭

#

hmm, ur a wizard harry

#

ktibow also predicted cloudflare downtime

jade egret Jun 12, 2025, 10:07 PM

#

unborn ocean imo if 2.5 pro fixes the following things it will be clearly better (not account...

real

torn mantle Jun 12, 2025, 11:01 PM

#

Lmao

placid notch Jun 12, 2025, 11:21 PM

#

https://kick.com/firedancer/clips/clip_01H3JKR668VFE8KEZTH5DDZ586

Kick

firedancer - Watch clips on Kick

Free the nips during shark attacks

hardy pecan Jun 12, 2025, 11:28 PM

#

no need to post your gooner clips here bud

nimble trail Jun 12, 2025, 11:29 PM

#

#

👀

#

Is this just a lame dad joke or...

jade egret Jun 12, 2025, 11:34 PM

#

nimble trail Is this just a lame dad jokes or...

or wut

#

tbh idk

nimble trail Jun 12, 2025, 11:36 PM

#

😭

#

Seems like I'm just reading into it a bit too much.

#

Coping about Kingfall rn.

small haven Jun 12, 2025, 11:38 PM

#

nimble trail

he should put up a tweet of sam falling off a cliff next

hollow ocean Jun 12, 2025, 11:41 PM

#

Does o3 pro beat kingfall?

placid notch Jun 12, 2025, 11:43 PM

#

@placid notch

elder rapids Jun 12, 2025, 11:53 PM

#

yo wym

#

ion think this has anything to do with deployment

#

it's too widespread

#

they fixed a lot of it now

#

funny the models were the first to show it

#

and then it bled into everything else on the internet

#

gcp is crazy

#

ye

#

really is crazy how one problem like that cascades into affecting basically the entire world

#

it affected discord, Spotify, Snapchat, etc

#

this seems to be the path they want to take

#

although it doesn't have to be the case that it happens

#

the CEO of Google and DeepMind seem to have very interlinked views tho

#

ye but that isn't really a bad thing

#

to properly combat it

#

you need to have control

#

and to control it, there's no way else

jade egret Jun 13, 2025, 12:02 AM

#

guys do you think google is gonna lose stock or gain stock in the next few years? (because it might sell chrome and all the ai stuff)

#

yea

#

but what if google lose

elder rapids Jun 13, 2025, 12:05 AM

#

yo I don't think it's explicitly a bad thing if they lose chrome

#

ye but that would simply be creating another monopoly

#

so that's easily dismissable

#

or not something that could be entertained

#

no as in

#

they can't give it to anyone else or open that up

#

it wouldn't be a possibility they open chrome up to imo

#

it doesn't have to be that way

#

no, no one would buy it

#

it doesn't have to be that it's sold lmao

#

yeah but like every instance of things like this happening

#

that's just as valid as any other claim

jade egret Jun 13, 2025, 12:12 AM

#

i think google won't have to force sell chrome but it will end some contracts

#

like google won't be the default search engine on firefox or something like that

elder rapids Jun 13, 2025, 12:12 AM

#

inherently, but that's what I'm saying presupposes to say the contrary lol

jade egret Jun 13, 2025, 12:13 AM

#

jade egret Jun 13, 2025, 12:14 AM

#

jade egret

in ur opinion

elder rapids Jun 13, 2025, 12:14 AM

#

the DoJ has more power, obviously, but even under that power it seems like large companies being accused of these things force that claim into a result that's either somewhat related, or just one of the many instances that are besides "sell x"

#

for example, Microsoft appealed and they vacated the breakup order

#

ye, inevitably Google isn't going to be hit hard

lilac nimbus Jun 13, 2025, 1:18 AM

#

jade egret Jun 13, 2025, 2:46 AM

#

lilac nimbus

woah

small haven Jun 13, 2025, 2:49 AM

#

claude 4.1 came early?

keen fulcrum Jun 13, 2025, 2:59 AM

#

looks like grok 3.5 is live on arena

#

x-preview

keen beacon Jun 13, 2025, 3:00 AM

#

Isn't that from baidu

whole sundial Jun 13, 2025, 3:00 AM

#

i think that's Baidu ERNIE X1

small haven Jun 13, 2025, 3:02 AM

#

who pinged me

elder rapids Jun 13, 2025, 3:59 AM

#

keen fulcrum x-preview

that's not grok 3.5

keen fulcrum Jun 13, 2025, 4:06 AM

#

Why can’t you believe

small haven Jun 13, 2025, 4:07 AM

#

x-preview is not grok 3.5

jade egret Jun 13, 2025, 4:33 AM

#

how u know that

#

wut that

alpine coral Jun 13, 2025, 4:45 AM

#

#

I was not trained by Google, but am instead an independently developed large language model by Baidu, based on its self-developed deep learning framework PaddlePaddle. My name is Wenxin X1 (ERNIE X1). My technical architecture is entirely based on Baidu’s long-term accumulation in the field of artificial intelligence, including core technologies such as pre-training algorithms, data engineering, and model optimisation. I aim to provide users with a professional, secure, and contextually appropriate intelligent interaction experience in Chinese.

zinc ore Jun 13, 2025, 5:18 AM

#

small haven Jun 13, 2025, 5:26 AM

#

ITS COMING

torn mantle Jun 13, 2025, 5:55 AM

#

Its coming aaaaaaaaa

#

Finally

#

🥺

elder rapids Jun 13, 2025, 7:20 AM

#

zinc ore

nice, this is when I get ultra

#

yo wait it's coming to the API too

#

I forgot

dusky aurora Jun 13, 2025, 7:47 AM

#

"Connecting to Arena has failed"

atomic pagoda Jun 13, 2025, 7:51 AM

#

I hate this, it’s happening again

#

Do I just wait until it’s back up again to have the chat that I tried to get all my work in to continue it or is it gone already

languid crescent Jun 13, 2025, 7:55 AM

#

uhhh is lmarena down?

atomic pagoda Jun 13, 2025, 7:55 AM

#

Apparently yes

languid crescent Jun 13, 2025, 7:56 AM

#

ooh I thought I was the only one

#

same issue with no models?

atomic pagoda Jun 13, 2025, 7:56 AM

#

Yeah and an error for me as well

languid crescent Jun 13, 2025, 7:57 AM

#

okok ty

languid crescent Jun 13, 2025, 7:57 AM

#

small haven ITS COMING

uh what's coming?

dusky aurora Jun 13, 2025, 8:04 AM

#

languid crescent uh what's coming?

OUTAGE!

languid crescent Jun 13, 2025, 8:05 AM

#

dusky aurora OUTAGE!

ohh that's what comin lol

#

oh shi does that mean chat history is cleared too?

leaden sun Jun 13, 2025, 8:08 AM

#

...interestingly, I didnt experience outage at all, everything is fine on my end 😅

languid crescent Jun 13, 2025, 8:09 AM

#

leaden sun ...interestingly, I didnt experience outage at all, everything is fine on my end...

what 😭

dusky aurora Jun 13, 2025, 8:09 AM

#

leaden sun ...interestingly, I didnt experience outage at all, everything is fine on my end...

the luck is strong with this one

atomic pagoda Jun 13, 2025, 8:10 AM

#

Does anyone know when it’s gonna be back up again?

mystic mica Jun 13, 2025, 8:15 AM

#

Legacy version is up though

thorn bough Jun 13, 2025, 8:20 AM

#

it's working now

atomic pagoda Jun 13, 2025, 8:20 AM

#

Is it back up again?

dusky aurora Jun 13, 2025, 8:20 AM

#

for me it seems so

atomic pagoda Jun 13, 2025, 8:21 AM

#

Are your chats still there

dusky aurora Jun 13, 2025, 8:21 AM

#

yes, they are

atomic pagoda Jun 13, 2025, 8:21 AM

#

Okay, thanks

dusky aurora Jun 13, 2025, 8:21 AM

#

thorn bough it's working now

thank you for the info

hazy quest Jun 13, 2025, 8:23 AM

#

Guys, have a look at this blind test: https://artificialanalysis.ai/text-to-video/arena

The models Seedance 1.0 and the anonymous Kangaroo are absolutely insane. They both seem better than Veo3 (video only), Kling, Pika etc. Both come up often in the arena, especially Seedance.

Video Generation Model Arena | Artificial Analysis

Compare AI video generation models by choosing your preferred video without knowing the provider.

#

https://www.reddit.com/r/singularity/comments/1l9n0zq/kangaroo_an_anonymous_video_model_being_tested_on/

From the singularity community on Reddit: "Kangaroo", an anonymous ...

Explore this post and more from the singularity community

ocean vortex Jun 13, 2025, 8:39 AM

#

hazy quest Guys, have a look at these blind test : https://artificialanalysis.ai/text-to-vi...

Interesting... Fairly general prompt but look how similar both models are. I wonder what movie is this from lol

Left: Kling
Right: Seedance

languid crescent Jun 13, 2025, 8:40 AM

#

my chats are still saved thanks daddy lm ❤️

leaden sun Jun 13, 2025, 8:43 AM

#

https://tenor.com/view/bh187-braveheart-charge-attack-gif-11448206846315415195

Tenor

hazy quest Jun 13, 2025, 8:44 AM

#

ocean vortex Interesting... Fairly general prompt but look how similar both models are. I won...

Well, this one is Image to Video, so they will surely look similar.

mossy drum Jun 13, 2025, 10:02 AM

#

New model in Battle mode: stephen-v2

sacred quail Jun 13, 2025, 10:46 AM

#

What the hell? Does lmarena have video models now...?

elder burrow Jun 13, 2025, 10:47 AM

#

zinc ore

WHAT

#

where is that

keen fulcrum Jun 13, 2025, 10:48 AM

#

sacred quail What the hell? Does lmarena have video models now...?

no

leaden sun Jun 13, 2025, 10:58 AM

#

hazy quest Guys, have a look at these blind test : https://artificialanalysis.ai/text-to-vi...

thanks for sharing, it is a smart workaround for such benchmarking, I'll still wait for lmarena to include text2video benchmarking where I can write my own prompt for testing 🤗
edit: you can submit your own prompt there, lets see....

elder burrow Jun 13, 2025, 11:14 AM

#

hazy quest Guys, have a look at these blind test : https://artificialanalysis.ai/text-to-vi...

seedance is incredible.

#

insane.

keen fulcrum Jun 13, 2025, 11:17 AM

#

you can test it here https://www.volcengine.com/docs/82379/1544106

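Model Service Pricing -- Volcano Ark Large Model Service Platform - Volcano Engine

Volcano Engine official documentation center: product documentation, quick starts, user guides and more, everything you care about is here, including user manuals, API or SDK manuals, FAQs and other essential material for Volcano Engine's main products. We will keep optimizing to bring users a better experience.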

#

https://artificialanalysis.ai/text-to-video/arena

willow grail Jun 13, 2025, 11:18 AM

#

leaden sun Jun 13, 2025, 11:22 AM

#

i think stephen is supposed to be a chinese model too, it's in the arena btw

calm spear Jun 13, 2025, 11:45 AM

#

will we have P2L in new UI?

sacred plaza Jun 13, 2025, 12:49 PM

#

is any lab employing this price strategy with their gen ai models? I have heard that twitter guy is making grok-3 input/output costs fairly low to build market share.

mossy drum Jun 13, 2025, 12:50 PM

#

New model in Battle mode: prowlridge

calm sequoia Jun 13, 2025, 12:50 PM

#

Found this website. Super good looking visualizations. However, wtf is EnigmaEval https://scale.com/leaderboard

SEAL LLM Leaderboards: Expert-Driven Private Evaluations

Explore the SEAL leaderboards for expert-driven, private, regularly updated LLM rankings and evaluations across domains like coding, instruction following and more!

#

"a benchmark derived from puzzle hunts—a repository of sophisticated problems from the global puzzle-solving community"

sacred plaza Jun 13, 2025, 12:51 PM

#

still no benchmarks to measure real world practical knowledge work tasks yet huh?

misty vault Jun 13, 2025, 12:53 PM

#

willow grail

is this gpt-4-0314

calm sequoia Jun 13, 2025, 12:55 PM

#

👀

#

I wonder why @alpine coral your personal bench has such different results

gleaming galleon Jun 13, 2025, 1:24 PM

#

vocal pelican people would need to rate the model in battles, and nobody is waiting 13 minutes...

Yeah true, but if people could wait 6–8 mins for o3-2025-04-16 on certain tasks, surely they can do the same for o3-pro, no?

unborn ocean Jun 13, 2025, 1:42 PM

#

quite a lot of em

#

should specify i mean "some AI things they did" :)

#

bc some of them are well-known beyond the field

late path Jun 13, 2025, 2:01 PM

#

they already went bankrupt and were acquired by Alibaba lol

patent aspen Jun 13, 2025, 2:03 PM

#

I could name models from maybe 4 companies on the list but only put DeepSeek because it's the only model where I know what it does and care

late path Jun 13, 2025, 2:04 PM

#

I voted for all the companies whose product names I knew

unborn ocean Jun 13, 2025, 2:05 PM

#

me too, though for me it was obv. a bit unfair as I only included the things I already knew...

patent aspen Jun 13, 2025, 2:18 PM

#

I think xAI and DeepSeek are on the same tier

leaden sun Jun 13, 2025, 2:21 PM

#

patent aspen I think xAI and DeepSeek are on the same tier

no no no 🙂‍↔️

languid crescent Jun 13, 2025, 2:22 PM

#

can we rename our chats?

sacred quail Jun 13, 2025, 2:25 PM

#

Grok is underrated i think

torn mantle Jun 13, 2025, 2:26 PM

#

sacred quail Grok is underrated i think

stop

#

pls

sacred quail Jun 13, 2025, 2:28 PM

#

We might see grok 3.5, and it could be really good if elon isn't deported to south africa first

patent aspen Jun 13, 2025, 2:29 PM

#

For the record, I'm not talking about current models. I'm talking about the position they're in.

I would put DeepSeek one tier above xAI if they didn't have issues with hardware embargos and corporate governance risks.

Actually I think I would still put DeepSeek above xAI because they are domiciled in China and are insulated because they will have exclusive access to the Chinese market

sacred quail Jun 13, 2025, 2:29 PM

#

Also they announced big brain mode earlier and didnt release

#

I think they have something

patent aspen Jun 13, 2025, 2:30 PM

#

xAI needs to compete with American labs to survive whereas DeepSeek does not

sacred quail Jun 13, 2025, 2:32 PM

#

I have sympathy for deepseek not only because it's cheap or open source, I also like their company's vision. The guy who leads deepseek has already criticized China's strategy. He thinks China must focus on being creative rather than just producing more or cheaper

#

I read some interviews with that guy

novel flame Jun 13, 2025, 2:33 PM

#

Seedance 1.0 is really incredible -- could be the new SOTA? ... perhaps it's edged out by Kangaroo. But what impressed me even more was the LTX Video v0.9.7 model which won several matchups for me and is fast enough to generate video in realtime:

LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 30 FPS videos at 1216×704 resolution, faster than it takes to watch them.

https://github.com/Lightricks/LTX-Video

GitHub

GitHub - Lightricks/LTX-Video: Official repository for LTX-Video

Official repository for LTX-Video. Contribute to Lightricks/LTX-Video development by creating an account on GitHub.

sacred quail Jun 13, 2025, 2:33 PM

#

Also they were producing some investment machines?(sorry i forgot the name) but they were already too talented

late path Jun 13, 2025, 2:33 PM

#

bytedance's new model is also quite good, its performance is on par with deepseek's model. And they're backed by a big company like bytedance, their research pace is faster than deepseek

sacred quail Jun 13, 2025, 2:33 PM

#

There is a reason why a random company like them beat alibaba's AI

leaden sun Jun 13, 2025, 2:35 PM

#

sacred quail We can see grok 3.5 and could be really good if elon not deported to south afric...

He is welcome to join EU and hopefully he could convince our Empress Ursula to have some mercy with their AI regulation...

sacred quail Jun 13, 2025, 2:35 PM

#

You guys just love regulating things too much, it can't be helped

#

But dont worry, anthropic is here for you

leaden sun Jun 13, 2025, 2:36 PM

#

Qwen where? 🥺

patent aspen Jun 13, 2025, 2:37 PM

#

Based on the position they're currently in, not their current models

late path Jun 13, 2025, 2:38 PM

#

I can see that in the last year, bytedance have gone from making very mediocre models to making good reasoning models close to R1-0528 and o3

leaden sun Jun 13, 2025, 2:38 PM

#

qwen was able to decipher Craig's encryption, while claude wasnt able to....

patent aspen Jun 13, 2025, 2:38 PM

#

leaden sun Qwen where? 🥺

I haven't used Qwen

leaden sun Jun 13, 2025, 2:38 PM

#

that has surprised me quite a bit

late path Jun 13, 2025, 2:38 PM

#

late path I can see that in the last year, bytedance have gone from making very mediocre m...

And they don't seem to use as much distilled data as deepseek

whole wagon Jun 13, 2025, 2:38 PM

#

xAI is well positioned, they have the largest pretraining cluster by far

#

Progress is fast but since they started much later than others it's lagging in current models

leaden sun Jun 13, 2025, 2:40 PM

#

late path bytedance's new model is also quite good, its performance is on par with deepsee...

just started testing cici, doubao isnt commercially available outside arena

whole wagon Jun 13, 2025, 2:40 PM

#

I would put Google at S++ also

#

They have the full stack

#

From the hardware up

late path Jun 13, 2025, 2:40 PM

#

I feel like xAI is still missing a lot, they're even less comprehensive than deepseek

patent aspen Jun 13, 2025, 2:41 PM

#

I agree on paper, although OpenAI's mindshare is a massive advantage

#

ChatGPT is a verb

whole wagon Jun 13, 2025, 2:42 PM

#

Sure, but openAI could be the apple of the AI world basically. Does that make them the top if they don't have the actual model intelligence crown?

patent aspen Jun 13, 2025, 2:43 PM

#

Debatable depending on criteria. Probably not worth arguing about

whole wagon Jun 13, 2025, 2:43 PM

#

You said you are talking about positioning. Not current models

#

Well, I think OpenAI is poorly positioned in terms of compute. That's probably their primary concern. Like Stargate is expected to deploy 100k GPUs by the end of this year. Meanwhile xAI put 200k online in 100 days

#

And Google already had the compute deployed lol

ocean vortex Jun 13, 2025, 2:49 PM

#

that's not even close. Magistral-medium seems to be around the level of 30b reasoning models. So like qwq-32b, qwen3-30b MoE...

#

Grok well beyond this

#

Mistral should have made Magistral-Large first and then distilled instead, but ig funding was an issue for them

#

Now that they partnered with nvidia, maybe things will be better moving forward

whole wagon Jun 13, 2025, 2:51 PM

#

Second on compute might be meta actually. Even though it doesn't seem like it atm 😉

late path Jun 13, 2025, 2:51 PM

#

I suspect OpenAI's research can't keep up anymore. o3 seems to have only started training at the beginning of this year, so they only have a 2-4 month head start internally. Plus, the improvement in o3 pro is so minor, the fact they released it anyway makes it feel like they've really got nothing else to release.
Meanwhile, Google is accelerating, not slowing down. It seems they've always had more powerful models internally.

ocean vortex Jun 13, 2025, 2:53 PM

#

late path I suspect OpenAI's research can't keep up anymore. o3 seems to have only started...

wdym, they are still sota on certain tasks with o3. On others where they are weaker it is still very close with 2.5Pro. They are not exactly behind lol

#

it's more of Google doing a good job recently

#

OpenAI is working hard as usual

#

imo

whole wagon Jun 13, 2025, 2:54 PM

#

late path I suspect OpenAI's research can't keep up anymore. o3 seems to have only started...

I don't think it's research as such, I think they literally just don't have enough GPUs to train at the required pace, and their compute deployments are too slow

leaden sun Jun 13, 2025, 2:55 PM

#

aren't they the ones building NPU, TPU and [insert fancy letter]PU?

whole wagon Jun 13, 2025, 2:55 PM

#

ocean vortex wdym, they are still sota on certain tasks with o3. On others where they are wea...

You are talking about current models. OpenAI are not cooking internally I know that already lol

ocean vortex Jun 13, 2025, 2:55 PM

#

yeah Google is in way better position than pretty much everyone else. Infinite data, cheap compute with TPUs...

whole wagon Jun 13, 2025, 2:56 PM

#

They can't fit in their compute budget a new pretrained model first off. So they are having to use 4.1 as the base for o4 (GPT5). So this limits them before they even started RL. There's an entire list of issues they are having. Because they are used to the yearly cadence not this 3 month one Google is pushing them for

#

So they are in a tricky position

ocean vortex Jun 13, 2025, 2:57 PM

#

whole wagon You are talking about current models. OpenAI are not cooking internally I know t...

oh they are sitting still? Are you high? catgrin
They were never holding back and released everything as soon as they realistically could when it comes to LLMs

#

them having "achieved AGI internally" and other things was just empty hype and noise. They went as far as to announce things before they were even real lol

late path Jun 13, 2025, 2:58 PM

#

Kingfall is undoubtedly stronger than o3(and o3-pro), not to mention it incorporates improvements like deepthink, and Google is still accelerating its research.
Meanwhile, the only new model from OpenAI seems to be the much-anticipated GPT-5, but... even someone like Sam, who's so good at creating hype, isn't promoting it in advance (by contrast, they started hyping o3-preview four months before its release). This is very suspicious

whole wagon Jun 13, 2025, 2:59 PM

#

GPT 5 is going to be SOTA at release, but not by a large margin. And then Gemini 3 pro will retake the crown later (who knows when that's releasing)

ocean vortex Jun 13, 2025, 3:00 PM

#

ocean vortex them having "achieved AGI internally" and other things was just empty hype and n...

They announced o3 before they had it. O3-preview was based on old base model (gpt4o/o1)

whole wagon Jun 13, 2025, 3:01 PM

#

o3 is gpt4.1 based (a different checkpoint but same arch)

ocean vortex Jun 13, 2025, 3:03 PM

#

whole wagon GPT 5 is going to be SOTA at release, but not by a large margin. And then Gemini...

The consensus was similar back when google announced gemini ultra and that it would take years since OpenAI would always respond and be ahead. But we hit a wall and models were getting cheaper rather than much better. You never really know and it's not as simple. OpenAI has quite a bit going for them as they are essentially pioneers of consumer oriented reasoning models. They might do the same for another big idea...

late path Jun 13, 2025, 3:03 PM

#

I've suddenly lost confidence in gpt5 releasing in july lmao, I even bought yes on polymarket for its july release. now I really have to reconsider that

whole wagon Jun 13, 2025, 3:04 PM

#

I think they have to release then. They can't risk releasing after Gemini 3 pro

#

It has to be before so they have some time at the top model intelligence

late path Jun 13, 2025, 3:04 PM

#

hopefully

#

I'm looking forward to seeing a model smarter than kingfall anyway

ocean vortex Jun 13, 2025, 3:06 PM

#

late path I've suddenly lost confidence in gpt5 releasing in july lmao, I even bought yes ...

it has to be a new pretrained model. Like I don't see how it would work with just routing requests or a hybrid o3 version - no way that will perform better than o3-high let alone o3-pro

whole wagon Jun 13, 2025, 3:07 PM

#

It's a new RL model. The pretrained base model is 4.1

ocean vortex Jun 13, 2025, 3:08 PM

#

You can save money with hybrid model / system, but how on earth can it possibly perform better than using high reasoning all the time... I do not think it can lol

ocean vortex Jun 13, 2025, 3:09 PM

#

late path Kingfall is undoubtedly stronger than o3(and o3-pro), not to mention it incorpor...

deepthink is parallel compute that has nothing to do with a singular model arch, there's nothing to "incorporate" lol

whole wagon Jun 13, 2025, 3:11 PM

#

openAI is not very secretive when it comes to their products. They have testers not even within the company

#

They are only secretive about their core research

ocean vortex Jun 13, 2025, 3:12 PM

#

late path Kingfall is undoubtedly stronger than o3(and o3-pro), not to mention it incorpor...

Kingfall is mostly hype for now, we still know very little about it. But I partially agree that GPT5 may not break records. It may very well be just a way for them to integrate models - make it faster and cheaper

whole wagon Jun 13, 2025, 3:12 PM

#

I think mostly these ai companies don't care if their releases and expected performances leak beforehand

#

Because the testers are not even NDAed

late path Jun 13, 2025, 3:12 PM

#

ocean vortex deepthink is parallel compute that has nothing to do with a singular model arch,...

I think they can apply deepthink to any model if they want, just like how openai adds "pro" to o1 and o3

ocean vortex Jun 13, 2025, 3:13 PM

#

late path I think they can apply deepthink to any model if they want, just like how openai...

in theory yes, but this is still a very inefficient expensive way of marginal gains. Not hilariously expensive anymore, but still far from optimal
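For reference, a minimal sketch of one generic form of "parallel compute": sample the same model several times and take a majority vote. How deepthink or the "pro" modes actually work isn't stated in this chat, so treat this as an assumption. `sample_fn` and `make_noisy_sampler` are hypothetical stand-ins for a real model call.

```python
import random
from collections import Counter


def majority_vote(sample_fn, prompt, n):
    """Call the sampler n times and return (most common answer, vote share)."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def make_noisy_sampler(correct, p_correct, seed=0):
    """Hypothetical stand-in for a model: right with probability p_correct."""
    rng = random.Random(seed)

    def sample(prompt):
        if rng.random() < p_correct:
            return correct
        return f"wrong-{rng.randint(0, 9)}"

    return sample


if __name__ == "__main__":
    sampler = make_noisy_sampler("42", p_correct=0.6)
    for n in (1, 8, 32):
        answer, share = majority_vote(sampler, "some hard question", n)
        # Inference cost grows linearly with n; the gain in accuracy usually does not.
        print(n, answer, round(share, 2))
```

Cost scales with the number of samples, which is the "inefficient, expensive way of marginal gains" point above.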

whole wagon Jun 13, 2025, 3:14 PM

#

GPT5 is o4. Except the thinking can be scaled completely (it works with it turned off). It's not a router except on the very cheapest end to switch to mini

#

Ofc it's a new model, they wouldn't re release o3 lmao

#

Why would I expose people sharing information they are not supposed to, that is just stupid

#

I don't know if they did. But it can't have been a final version, since o4 full is still training

ocean vortex Jun 13, 2025, 3:16 PM

#

whole wagon GPT5 is o4. Except the thinking can be scaled completely (it works with it turne...

Plausible. So would be more like what Anthropic is doing. Though I would be surprised if doing so is not a compromise compared to native reasoning-only model tbh

whole wagon Jun 13, 2025, 3:17 PM

#

Because he wanted to be the 'dictator' himself?

#

He wants to make money, gain power etc

ocean vortex Jun 13, 2025, 3:19 PM

#

OpenAI is undeniably preparing for merging reasoning and non-reasoning tbh
gpt4.1 is already price matched to o3

whole wagon Jun 13, 2025, 3:21 PM

#

o4 will be the same per token pricing as current o3

ocean vortex Jun 13, 2025, 3:22 PM

#

curious what they will do with mini lineup, that one is still far from being price matched... $4.40 vs $1.60 output

ocean vortex Jun 13, 2025, 3:24 PM

#

whole wagon o4 will be the same per token pricing as current o3

would make no sense to release anything as "o4" though...? GPT5 should be the successor to both o3 and gpt4.1 in theory

whole wagon Jun 13, 2025, 3:24 PM

#

GPT5 is o4

#

They just won't call it o4 to solve the naming issue

ocean vortex Jun 13, 2025, 3:25 PM

#

whole wagon They just won't call it o4 to solve the naming issue

yeah that's what I meant

whole wagon Jun 13, 2025, 3:26 PM

#

Read

#

The question becomes, what kind of gain can be expected from essentially training o3 for much longer (compute wise)

late path Jun 13, 2025, 3:28 PM

#

gpt5 gonna completely break the promise that GPT's pre-training scale would increase 10x with every +1 version lol

#

maybe 100x idk

whole wagon Jun 13, 2025, 3:29 PM

#

💀

ocean vortex Jun 13, 2025, 3:29 PM

#

whole wagon The question becomes, what kind of gain can be expected from essentially trainin...

it has to be a bigger model I think. You can't expect from same size model to do both things (reasoning and not) and then perform even better than o3 just by training it for somewhat longer...

whole wagon Jun 13, 2025, 3:30 PM

#

It's not a bigger model. This is for sure lol

ocean vortex Jun 13, 2025, 3:30 PM

#

whole wagon It's not a bigger model. This is for sure lol

"for sure"??

whole wagon Jun 13, 2025, 3:30 PM

#

Yes

ocean vortex Jun 13, 2025, 3:30 PM

#

how can you be sure lol

hollow ocean Jun 13, 2025, 3:30 PM

#

Who else paid $1 for o3 pro

#

Team promo

whole wagon Jun 13, 2025, 3:32 PM

#

whole wagon It's not a bigger model. This is for sure lol

I'm going to link this message again when o4 (GPT5) releases lol. Not a long way now anyways

late path Jun 13, 2025, 3:32 PM

#

RLmaxxing will create a schizophrenic model full of hallucinations haha. I think the current o3 is already neurotic enough, it constantly hallucinates tool calls. I dread to think how much more it could improve if we keep doubling down on RLVR on the same base

whole wagon Jun 13, 2025, 3:32 PM

#

late path RLmaxxing will create a schizophrenic model full of hallucinations haha. I think...

They are doing a lot of work to attempt to reduce hallucinations

hollow ocean Jun 13, 2025, 3:32 PM

#

You can get cheap Gemini ultra accs on Chinese sites

ocean vortex Jun 13, 2025, 3:32 PM

#

they have to train it for both outputting reasoning and not outputting it. And they already kinda trained o3 base model for as long as it is reasonable... This sounds questionable tbh

whole wagon Jun 13, 2025, 3:32 PM

#

Also they are targeting swe bench to try to be more competitive with Claude in that domain

hollow ocean Jun 13, 2025, 3:33 PM

#

Not a lot of people know about $1 o3 pro

#

tall summit Jun 13, 2025, 3:34 PM

#

hollow ocean

lmaoooo

hollow ocean Jun 13, 2025, 3:34 PM

#

Secret

hollow ocean Jun 13, 2025, 3:34 PM

#

tall summit lmaoooo

It’s 2-0 so far on predictions

ocean vortex Jun 13, 2025, 3:34 PM

#

whole wagon Also they are targeting swe bench to try to be more competitive with Claude in t...

Google is a much bigger threat than Anthropic ever will be... Besides Anthropic's main advantages seem to stem from model size (Opus spatial awareness and reasoning)

hollow ocean Jun 13, 2025, 3:35 PM

#

Claim it quick before it’s gone

whole wagon Jun 13, 2025, 3:36 PM

#

I found o3 and 2.5 pro are equally bad with the hallucinations anyways

hollow ocean Jun 13, 2025, 3:37 PM

#

whole wagon I found o3 and 2.5 pro are equally bad with the hallucinations anyways

o3 solved my wordle 2.5 pro couldn’t

tall summit Jun 13, 2025, 3:37 PM

#

can it solve letterle

#

oh i wonder if it can solve https://angle.wtf

Angle

Guess the angle in 3 tries or less!

whole wagon Jun 13, 2025, 3:38 PM

#

These models have terrible visual reasoning or smth. I put some basic geometry problems and they all failed lol

ocean vortex Jun 13, 2025, 3:38 PM

#

whole wagon I found o3 and 2.5 pro are equally bad with the hallucinations anyways

o3 is basically kindergarten level on spatial awareness even comparing it to 2.5 pro though, try asking it to draw something in svg and then compare that to 2.5pro...

whole wagon Jun 13, 2025, 3:39 PM

#

whole wagon These models have terrible visual reasoning or smth. I put some basic geometry p...

But converting it to a text description they easily pass

hollow ocean Jun 13, 2025, 3:39 PM

#

Vision not solved yet

#

Not much progress made

ocean vortex Jun 13, 2025, 3:39 PM

#

whole wagon These models have terrible visual reasoning or smth. I put some basic geometry p...

"these" = OpenAI models

#

and I agree

whole wagon Jun 13, 2025, 3:39 PM

#

Gemini failed also

ocean vortex Jun 13, 2025, 3:40 PM

#

whole wagon Gemini failed also

it may have. But Anthropic and Google are still leaps better on tasks like this

hollow ocean Jun 13, 2025, 3:40 PM

#

tall summit oh i wonder if it can solve https://angle.wtf

#

One shot

#

Light work

ocean vortex Jun 13, 2025, 3:41 PM

#

which is why I would be extremely surprised if o4/gpt5 is same size as o3

tall summit Jun 13, 2025, 3:41 PM

#

gemini thinks it's 115°

ocean vortex Jun 13, 2025, 3:41 PM

#

like there are no obvious ways for fast gains

tall summit Jun 13, 2025, 3:41 PM

#

which is 65° if you give it the benefit of the doubt

hollow ocean Jun 13, 2025, 3:41 PM

#

Don’t even need o3 pro

ocean vortex Jun 13, 2025, 3:41 PM

#

remaining with that size

whole wagon Jun 13, 2025, 3:41 PM

#

It's an incremental gain. Enough to be SOTA at release

tall summit Jun 13, 2025, 3:42 PM

#

hollow ocean

gg though thats cool

ocean vortex Jun 13, 2025, 3:42 PM

#

whole wagon It's an incremental gain. Enough to be SOTA at release

it would still mean that it has all the same main flaws, fundamentally...

hollow ocean Jun 13, 2025, 3:42 PM

#

tall summit gg though thats cool

o3 didn’t even try

#

Too smart

whole wagon Jun 13, 2025, 3:43 PM

#

By trained for longer, I mean on a multiple of the compute it has had up to this point. Not like 50% more or smth, think like 5x lol

hollow ocean Jun 13, 2025, 3:43 PM

#

Tool use too good

whole wagon Jun 13, 2025, 3:44 PM

#

o3 was actually not that final

ocean vortex Jun 13, 2025, 3:44 PM

#

whole wagon By trained for longer, I mean on a multiple of the compute it has had up to this...

That's what Deepseek was trying for months I think, Google as well. Those updated models are mostly the same, slightly better. But they didn't have to switch from reasoning-only to hybrid

hollow ocean Jun 13, 2025, 3:45 PM

#

Not a lot of people know

#

https://tenor.com/view/ronaldo-suiii-siuuu-al-nassr-alnassr-ronaldo-al-nassr-gif-7395052735569211864


whole wagon Jun 13, 2025, 3:46 PM

#

ocean vortex That's what Deepseek was trying for months I think, Google as well. Those update...

There's a lot more you can do with RL in terms of data. With synthetic data and such

#

The model size becomes a bit more limited because the reliance on human data is reducing

#

They have to literally do huge amounts of inference to create most of the new dataset

#

Deepseek didn't have the compute available to do such a thing I think

ocean vortex Jun 13, 2025, 3:55 PM

#

I mean, o3 is compromised by size currently for sure, but I dunno... On the other hand it could make sense sticking it out assuming future progress and considering training time.

alpine coral Jun 13, 2025, 4:02 PM

#

calm sequoia I wonder why <@1053335914555908116> your personal bench has such a different re...

ig cause they’re different aha.. but yeah fwiw it looks like this EnigmaEval uses actual ‘puzzles’ (often meant to be solved by teams of humans over several days..) think wildly elaborate crosswords and sudokus with additional layers of decryption kinda thing

#

e.g. this kind of puzzle (times 1,184):

#

yeah i think it's a pretty dubious benchmark tbh (i can see how o-series models would do well on it though; it's kinda brute force, which they do well at imo )

#

and again just fwiw.. i basically dump 10-20 ‘questions’ in a single prompt and add the results to an unwieldy spreadsheet .. there’s nothing scientific about it but the basic aim is to test critical comprehension (so plenty of riddles/wordplays) and common sense reasoning and spatial/emotional awareness (plus a few tasks, mostly anti-LLM/tokenizer things). here’s a few for reference (they’re not really ‘puzzles’ at all) :

\\ A digital clock shows 3:15. What is the angle between the hours and minutes being displayed on the numerical screen?
\\ If forced to choose between i or ii, which scenario would Bob most likely prefer?
i) Bob scratches his dream car immediately after purchasing it
ii) Bob gets abruptly sacked from his full-time job which he didn't enjoy much
\\ Write one sentence that includes the words “tract”, “fact”, “factory”, “intact” and “react” - in that order
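
For the clock question, the intended answer isn't given here. Presumably the catch is that a digital display has no hands, so there is no angle to measure. For contrast, a minimal sketch of the analog-clock calculation a model might reach for by reflex (`clock_angle` is a hypothetical helper, not part of the bench):

```python
def clock_angle(hour, minute):
    """Smaller angle in degrees between the hands of an ANALOG clock."""
    hour_hand = 30 * (hour % 12) + 0.5 * minute  # 30 deg per hour, plus drift
    minute_hand = 6 * minute                     # 6 deg per minute
    diff = abs(hour_hand - minute_hand)
    return min(diff, 360 - diff)


if __name__ == "__main__":
    # Analog reading of 3:15; a digital screen has no hands at all.
    print(clock_angle(3, 15))
```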

#

but yeah they’re testing different things ig would be the short answer ha

#

i don't use this question but it's a great one.. from the Simple Bench sample questions

#

doesn't matter for how long they reason, if the model falls for the 'waterproof' red herring in the description of the glove (and assumes it fell in the river), then they're screwed and will invariably fail - like all models still do in terms of consistent responses (afaik).. but yeah the 'solution' is far more simple than a complex word-path puzzle.. it just requires grasping the basic reality of the situation described (and not assuming complex calculations are required to solve it)

patent bane Jun 13, 2025, 4:15 PM

#

when reasoning models stop assuming that's when they can actually solve complex and simple questions

tall summit Jun 13, 2025, 4:18 PM

#

alpine coral e.g. this kind of puzzle (times 1,184):

ayy gmpuzzles reference

alpine coral Jun 13, 2025, 4:21 PM

#

tall summit ayy gmpuzzles reference

i've been learning a lot about puzzles lately lol

tall summit Jun 13, 2025, 4:23 PM

#

alpine coral i've been learning a lot about puzzles lately lol

puzzles are quite fun

#

amazing i know

alpine coral Jun 13, 2025, 4:23 PM

#

i'm still yet to emerge from a few rabbit holes ha

sacred quail Jun 13, 2025, 4:30 PM

#

Do you guys think 2.5 deep think can be better than o3 pro ?

#

I have that feeling

keen beacon Jun 13, 2025, 4:31 PM

#

Ultra is probably better

tall summit Jun 13, 2025, 4:32 PM

#

sacred quail Do you guys think 2.5 deep think can be better than o3 pro ?

bruh

#

nobody knows until it releases

#

it very well could be. it might not

sacred quail Jun 13, 2025, 4:33 PM

#

I mean, we know base models soo we can make some guesses but i got your point

tall summit Jun 13, 2025, 4:34 PM

#

brb asking o3

sacred quail Jun 13, 2025, 4:34 PM

#

I heard deep think gonna come to ai studio which is huge

#

We can try

stuck orchid Jun 13, 2025, 4:43 PM

#

alpine coral i don't use this question but it's a great one.. from the Simple Bench sample qu...

Available on lmarena?

sacred quail Jun 13, 2025, 4:44 PM

#

https://x.com/testingcatalog/status/1932116981680857578?s=46

TestingCatalog News 🗞 (@testingcatalog)

Oh? "Deep Think" on Gemini 2.5 Pro is already available via API to some early access users, it seems. The post also mentions that it will arrive at AI Studio soon!

Deep LFG 👀

h/t @atoz51515640

#

I saw from this

dusky aurora Jun 13, 2025, 4:46 PM

#

alpine coral any again just fwiw.. i basically dump 10-20 ‘questions’ in a single prompt and ...

now that's one logic bomb, suitable for use by Kirks everywhere

wintry tinsel Jun 13, 2025, 5:01 PM

#

AI space been a little slow

#

O3 pro isn’t anything significant

#

And besides that it’s all more Google hype what is everyone else doing?

ocean vortex Jun 13, 2025, 5:23 PM

#

wintry tinsel O3 pro isn’t anything significant

tbh it's just about exactly what could have been expected. We already saw o1-pro and deep think numbers, this provided similar marginal gains over o3. What is there more to expect from it?

leaden sun Jun 13, 2025, 5:50 PM

#

wintry tinsel And besides that it’s all more Google hype what is everyone else doing?

Living the good life for quality work?

We don’t need a new model every month or two, I prefer waiting a (few) year(s) for a revolutionary upgrade rather than every month getting just a slightly adjusted ones. Quality or breakthrough takes time 😅

wintry tinsel Jun 13, 2025, 5:55 PM

#

leaden sun Living the good life for quality work? We don’t need a new model every month o...

Likewise but that’s not how the tech space works, short attention span, short term return on investment, hype now is their strategy

#

Better technologies could develop but there’s no guarantee, and money is lobbed at things certain to be profitable

hollow ocean Jun 13, 2025, 6:25 PM

#

@deep adder only 20 messages per month for o3 pro teams

#

Maybe add a new seat

#

New Gmail

#

Easy

dusky aurora Jun 13, 2025, 6:26 PM

#

Arena developers, are you working on any cool features (usable in direct chat)?

hollow ocean Jun 13, 2025, 6:27 PM

#

Deep think will be better

#

$125 for 3 months

#

Let’s see when it comes out

keen beacon Jun 13, 2025, 6:27 PM

#

apparently next week

hollow ocean Jun 13, 2025, 6:27 PM

#

Next week for sure

keen beacon Jun 13, 2025, 6:27 PM

#

versus ultra?

hollow ocean Jun 13, 2025, 6:28 PM

#

Its rank 1 on creative writing

#

It’s good

ocean vortex Jun 13, 2025, 6:29 PM

#

@keen beacon ok Paul from Aider does not appear to be very smart lol. But that whole thinking budget saga I would say is still not resolved. He tested 2.5Flash too and got the same results (budget 24k higher score than auto budget). Coupled with things that you saw yourself like higher median and... that's a lot of "coincidences". There's defo something changing with the model setup affecting model responses in unknown ways auto budget vs max I would say
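
One way to check whether a gap like this is more than noise is a permutation test on per-task token counts. A minimal sketch, assuming you have the counts for both settings. The numbers below are made up for illustration and are NOT the Aider or cap results, and `permutation_test` is a hypothetical helper.

```python
import random
import statistics


def permutation_test(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference in means of a and b."""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, hits / n_iter


if __name__ == "__main__":
    # Hypothetical per-task token counts, invented for this example only.
    fixed_budget = [8100, 9400, 7600, 10200, 8800, 9900, 7300, 9100]
    auto_budget = [7900, 9000, 7700, 9600, 8500, 9300, 7400, 8700]

    diff, p_value = permutation_test(fixed_budget, auto_budget)
    print("mean difference:", diff)
    print("total tokens:", sum(fixed_budget), "vs", sum(auto_budget))
    print("approx. p-value:", p_value)
```

A large p-value only says this sample can't separate the two settings. Repeating the whole run several times and checking whether the sign of the difference stays the same would be the stronger check.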

keen beacon Jun 13, 2025, 6:30 PM

#

ocean vortex <@456226577798135808> ok Paul from Aider does not appear to be very smart lol. B...

cap ran flash with 24k budget and none, and median was lower; auto thought more for some reason

#

but yeah theres something happening

hollow ocean Jun 13, 2025, 6:30 PM

#

o3 close second

#

Next month

keen beacon Jun 13, 2025, 6:31 PM

#

ocean vortex <@456226577798135808> ok Paul from Aider does not appear to be very smart lol. B...

im also gonna leak the cot, probably, should help. on different thinking budgets

misty vault Jun 13, 2025, 6:32 PM

#

crack bench

keen beacon Jun 13, 2025, 6:32 PM

#

ocean vortex <@456226577798135808> ok Paul from Aider does not appear to be very smart lol. B...

#

(cap results

#

i dont think there are walls in either atm fwiw xd

ocean vortex Jun 13, 2025, 6:34 PM

#

keen beacon

would be interesting to see total tokens too. Maybe it switched from thinking to final response 🤔

keen beacon Jun 13, 2025, 6:35 PM

#

theyve been amazing from the start

#

claude 1

#

claude beta

sacred quail Jun 13, 2025, 6:36 PM

#

#

Not for live bench

keen beacon Jun 13, 2025, 6:36 PM

#

it makes sense why theyre still great, if they use a lot of synthetic data those tendencies would continue

sacred quail Jun 13, 2025, 6:36 PM

#

That plot unscramble category was always legit for long time

#

For creating stories

hollow ocean Jun 13, 2025, 6:37 PM

#

I heard the owner was biased

#

They change the questions a lot

sacred quail Jun 13, 2025, 6:38 PM

#

Also you can see claude models have high scores too soo it is about writing

#

For reasoning I'm finding o3 better too, but for writing gemini 06/05 is best right now with Opus 4

small haven Jun 13, 2025, 6:40 PM

#

yea

unborn ocean Jun 13, 2025, 6:40 PM

#

keen beacon it makes sense why theyre still great, if they use a lot of synthetic data those...

They have always been good at RLAIF and RL in general, two things that really work with coding

#

And their model are always on the bigger side, which seems to help as well

keen beacon Jun 13, 2025, 6:41 PM

#

yeah, a bit unrelated, but i dont think anthropic are good at small models xd

ocean vortex Jun 13, 2025, 6:51 PM

#

lmao I think I just broke 2.5pro. It's generating response after response recursively repeating itself over and over, with several thinking boxes per response... catgrin

#

keen beacon Jun 13, 2025, 6:51 PM

#

that happens to me as well, but under a specific scenario. did you do anything notable btw?

ocean vortex Jun 13, 2025, 6:52 PM

#

was playing with making it output the system prompt, it inevitably got itself into that thing it does where it repeats your message instead... Now it's repeating itself instead lol

keen beacon Jun 13, 2025, 6:53 PM

#

somewhat related but i think 2.5 pro can output special tokens

#

it can also break the output

#

that mightve happened in ur scenario

#

it reminds me of that a little

ocean vortex Jun 13, 2025, 6:54 PM

#

it's interesting that it's prompting itself though... like how does it keep sending responses with no additional input? 🤯

#

weird

keen beacon Jun 13, 2025, 6:54 PM

#

ocean vortex it's interesting that it's prompting itself though... like how does it keep send...

the stop parsing/whatever is bugged in the inference engine

indigo hazel Jun 13, 2025, 6:58 PM

#

How is best Qwen model compared to o3 and 0605?

ocean vortex Jun 13, 2025, 7:03 PM

#

it stopped just shy of maxing everything out 😭

#

still impressive though, to get 900k with just like 3 sentences input lmao

keen beacon Jun 13, 2025, 7:04 PM

#

damn imagine paying for that 🤣

whole wagon Jun 13, 2025, 7:14 PM

#

Wtf is this

keen beacon Jun 13, 2025, 7:15 PM

#

they're doing side by side testing on aistudio

#

its broken for now

whole wagon Jun 13, 2025, 7:16 PM

#

models/kingfall-ab-test

small haven Jun 13, 2025, 7:16 PM

#

wen models/titanforge-ab-test

whole wagon Jun 13, 2025, 7:16 PM

#

It usually works but this time it fails because it tried to use that model

keen beacon Jun 13, 2025, 7:17 PM

#

yes it broke recently

jade egret Jun 13, 2025, 7:53 PM

#

whole wagon models/kingfall-ab-test

huh

small haven Jun 13, 2025, 8:01 PM

#

rip kingfall

civic flame Jun 13, 2025, 8:05 PM

#

i think it's temporary

#

given the AB test is still attempted by the frontend and it just can't reach the model

#

they probably pulled it to update it with a new checkpoint or something

zinc ore Jun 13, 2025, 8:06 PM

#

There's new ones added

keen beacon Jun 13, 2025, 8:06 PM

#

civic flame they probably pulled it to update it with a new checkpoint or something

oh that version was 2 weeks old iirc

#

timeline makes sense

civic flame Jun 13, 2025, 8:07 PM

#

yeah

#

oh?

#

well in that case surely it's going to arrive on lmarena any day now

small haven Jun 13, 2025, 8:07 PM

#

send the new model id

civic flame Jun 13, 2025, 8:08 PM

#

if they're planning a Thursday release they're leaving it pretty fine with the lmarena testing

keen beacon Jun 13, 2025, 8:08 PM

#

idk about them ever releasing it (compute and all that?) but it seems pretty far in development

civic flame Jun 13, 2025, 8:09 PM

#

it's so over

#

gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant

keen beacon Jun 13, 2025, 8:09 PM

#

maybe that was when the abtest started

#

2 weeks ago?

#

or is that actually the revision date

civic flame Jun 13, 2025, 8:09 PM

#

i don't think so, it's just the date that checkpoint was finished
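
A quick sketch of pulling the fields out of that ID. The pattern is guessed from this single string and is not a documented Google naming scheme, so every field name here (`version_guess`, `size_suffix`, `date_guess`) is an assumption, and reading `v3p1l` as "3.1 large" is only the guess floated in this chat.

```python
import re
from datetime import date

MODEL_ID = "gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant"

# Hypothetical pattern inferred from one example ID only.
PATTERN = re.compile(
    r"gemini-v(?P<major>\d+)p(?P<minor>\d+)(?P<size>[a-z]?)"
    r"-rev(?P<rev>\d+)-(?P<codename>[a-z]+)-[a-z]+__(?P<stamp>\d{8})\d*__"
)


def parse_model_id(model_id):
    """Return the guessed fields, or None if the ID doesn't fit the pattern."""
    m = PATTERN.match(model_id)
    if m is None:
        return None
    stamp = m.group("stamp")  # first 8 digits read as YYYYMMDD
    return {
        "version_guess": f"{m.group('major')}.{m.group('minor')}",
        "size_suffix": m.group("size"),
        "revision": int(m.group("rev")),
        "codename": m.group("codename"),
        "date_guess": date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8])),
    }


if __name__ == "__main__":
    info = parse_model_id(MODEL_ID)
    print(info)
    # Days between the guessed date and the day of this conversation.
    print((date(2025, 6, 13) - info["date_guess"]).days)
```

Read this way, the stamp lands on 2025-05-30, which lines up with the "2 weeks ago" remark. That doesn't settle whether it's the checkpoint date or the start of the A/B test.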

jade egret Jun 13, 2025, 8:10 PM

#

small haven rip kingfall

why

elder rapids Jun 13, 2025, 8:10 PM

#

y'all still doing this?

small haven Jun 13, 2025, 8:11 PM

#

wtf that actually worked

zinc ore Jun 13, 2025, 8:11 PM

#

:p

jade egret Jun 13, 2025, 8:11 PM

#

civic flame `gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant`

that google i/o

#

when it happening

small haven Jun 13, 2025, 8:11 PM

#

THERES A NEW KINGFALL

#

CAP

jade egret Jun 13, 2025, 8:11 PM

#

small haven THERES A NEW KINGFALL

huh?

elder rapids Jun 13, 2025, 8:11 PM

#

btw is prowlridge good

mellow moss Jun 13, 2025, 8:11 PM

#

yo so for the ai generation image contest, would giving it an image and saying "recreate this image as accurately as possible" be technically allowed? 🤔

zinc ore Jun 13, 2025, 8:11 PM

#

Think it's technically a bit worse than kingfall but pretty similar

small haven Jun 13, 2025, 8:11 PM

#

elder rapids btw is prowlridge good

nope

civic flame Jun 13, 2025, 8:11 PM

#

I CALLED THIS

#

i was talking to someone about it in dms

#

and i said

#

let me see

keen beacon Jun 13, 2025, 8:12 PM

#

but we already know its ultra

#

😂

civic flame Jun 13, 2025, 8:12 PM

#

civic flame let me see

"the best i can guess is version 3.1 large"

keen beacon Jun 13, 2025, 8:13 PM

#

what else could it be tho?

small haven Jun 13, 2025, 8:14 PM

#

@zinc ore how did u even find this btw, u wizard

#

oh fireworks 😮

jade egret Jun 13, 2025, 8:15 PM

#

keen beacon but we already know its ultra

2.5 ultra??

keen beacon Jun 13, 2025, 8:15 PM

#

fr if its not ultra what is it

#

its a bigger model

jade egret Jun 13, 2025, 8:16 PM

#

keen beacon fr if its not ultra what is it

deepthink maybe?

keen beacon Jun 13, 2025, 8:16 PM

#

no

#

definitely not

jade egret Jun 13, 2025, 8:16 PM

#

no?

#

oh

keen beacon Jun 13, 2025, 8:16 PM

#

it is not

jade egret Jun 13, 2025, 8:16 PM

#

how yall know

small haven Jun 13, 2025, 8:17 PM

#

terminator coming

#

his teeth are black

keen beacon Jun 13, 2025, 8:18 PM

#

its definitely not gemini 3

civic flame Jun 13, 2025, 8:19 PM

#

shh

keen beacon Jun 13, 2025, 8:19 PM

#

you cant interpret that version number naively for sure

small haven Jun 13, 2025, 8:19 PM

#

civic flame shh

dark mouth?

civic flame Jun 13, 2025, 8:19 PM

#

black eye

#

trust

small haven Jun 13, 2025, 8:19 PM

#

black eye

jade egret Jun 13, 2025, 8:19 PM

#

keen beacon its definitely not gemini 3

yea it prob not

late path Jun 13, 2025, 8:19 PM

#

I run the terminator svg on every new model

jade egret Jun 13, 2025, 8:19 PM

#

late path I run the terminator svg on every new model

o3 pro?

#

oh nvm

ocean vortex Jun 13, 2025, 8:20 PM

#

small haven black eye

Musk

late path Jun 13, 2025, 8:20 PM

#

bt

small haven Jun 13, 2025, 8:20 PM

#

white mouth

zinc ore Jun 13, 2025, 8:21 PM

#

Do other pro models use that style?

late path Jun 13, 2025, 8:21 PM

#

tl

small haven Jun 13, 2025, 8:21 PM

#

it's like reiterating on me

jade egret Jun 13, 2025, 8:21 PM

#

if kingfall release is it gonna be the best (over 4 opus and o3 pro?)

keen beacon Jun 13, 2025, 8:21 PM

#

kingfall is dead and gone

small haven Jun 13, 2025, 8:21 PM

#

dark mouth > kingfall

late path Jun 13, 2025, 8:21 PM

#

none of them are as good as kingfall

ocean vortex Jun 13, 2025, 8:22 PM

#

it does mean something if you know how to prompt it and view the results tbh. Tends to correlate with webdev and arc-agi to a meaningful degree

small haven Jun 13, 2025, 8:22 PM

#

it stopped generating an svg, mid response, and started to generate a new one, dark mouth

keen beacon Jun 13, 2025, 8:22 PM

#

small haven it stopped generating an svg, mid response, and started to generate a new one, d...

lmao it started thinking again

ocean vortex Jun 13, 2025, 8:22 PM

#

arc-agi is spatial reasoning

small haven Jun 13, 2025, 8:22 PM

#

keen beacon lmao it started thinking again

dark mouth > kingfall

#

wait let me see this svg tho

ocean vortex Jun 13, 2025, 8:23 PM

#

Opus does great. It does well with svg too. 2.5Pro... new versions were not tested on arc-agi

zinc ore Jun 13, 2025, 8:23 PM

#

Opus is best at arc AGI 2

ocean vortex Jun 13, 2025, 8:23 PM

#

but the old one did very decently too

small haven Jun 13, 2025, 8:23 PM

#

#

what we thinking

elder rapids Jun 13, 2025, 8:23 PM

#

small haven

nice

jade egret Jun 13, 2025, 8:24 PM

#

we'll see when the actual benchmark comes out ig

zinc ore Jun 13, 2025, 8:24 PM

#

small haven

Decent

#

It's elaborate which gets points from me

ocean vortex Jun 13, 2025, 8:24 PM

#

Opus? That's like the model the least affected by contamination tbh. It being big you would have to work hard to overfit

keen beacon Jun 13, 2025, 8:25 PM

#

ocean vortex Opus? That's like the model the least affected by contamination tbh. It being bi...

bigger models overfit more easily

small haven Jun 13, 2025, 8:25 PM

#

oh i forgot to add thinkingBudget: 0
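
For anyone following along, a minimal sketch of where that setting would sit in a Gemini-style REST request body. `build_request` is a hypothetical helper and the prompt is a placeholder; whether a budget of 0 is accepted depends on the model, and nothing in this chat confirms it for the one being tested.

```python
import json

def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent-style request body (hypothetical helper)."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # thinkingBudget: 0 asks for no thinking tokens (assumed to be
            # supported by the target model; not every model allows it).
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }

# Placeholder prompt, not the one used in the chat.
payload = build_request("Draw a terminator as an SVG.", thinking_budget=0)
print(json.dumps(payload, indent=2))
```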

late path Jun 13, 2025, 8:26 PM

#

and previous kingfall results fwiw

ocean vortex Jun 13, 2025, 8:26 PM

#

keen beacon bigger models overfit more easily

Not really. They need more training to get the same result. If you used the same safety alignment dataset and same hyperparameters on small and big model, small one will come out significantly more censored

small haven Jun 13, 2025, 8:26 PM

#

late path and previous kingfall results fwiw

top right is the best

zinc ore Jun 13, 2025, 8:27 PM

#

Imo last one is best

keen beacon Jun 13, 2025, 8:27 PM

#

ocean vortex Not really. They need more training to get the same result. If you used the same...

the relationship between hyperparameters and model size is more complex. you can "overfit" more easily with bigger models

late path Jun 13, 2025, 8:27 PM

#

I also think the last one is better. The neck on the top right one is well done

zinc ore Jun 13, 2025, 8:28 PM

#

Last one is clean and seamless between the different parts

ocean vortex Jun 13, 2025, 8:28 PM

#

keen beacon the relationship between hyperparameters and model size is more complex. you can...

it's not complex if you use the exact same learning rate and related parameters. Big model needs more work during training to get the same result - that's just basic fact tbh...

keen beacon Jun 13, 2025, 8:28 PM

#

a higher learning rate might be better as it avoids getting stuck in local minima as well. all in all its very complicated

small haven Jun 13, 2025, 8:28 PM

#

zinc ore Last one is clean and seamless between the different parts

yea thats mine, was on auto thinking budget, running the same rn

keen beacon Jun 13, 2025, 8:28 PM

#

ocean vortex it's not complex if you use the exact same learning rate and related parameters....

it is very complicated if youve worked on this stuff

late path Jun 13, 2025, 8:28 PM

#

overall kingfall has more details and seems to have more variety

small haven Jun 13, 2025, 8:29 PM

#

wait auto thinking budget svg incomin

whole wagon Jun 13, 2025, 8:29 PM

#

LLM training is as complex as it gets

#

Well. For the runs these days

keen beacon Jun 13, 2025, 8:29 PM

#

hyperparams you often need to do a sweep for an optimal configuration and depending on the criteria it can be extremely complex
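
A toy sketch of what a sweep looks like, assuming a one-parameter loss `(w - 3)^2` and plain gradient descent. The loss, the step count, and the candidate learning rates are all made up for illustration; a real sweep would cover several hyperparameters and a real training run.

```python
def final_loss(lr: float, steps: int = 50, w0: float = 0.0) -> float:
    """Run plain gradient descent on the toy loss (w - 3)^2 and return the final loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad
    return (w - 3.0) ** 2

# Candidate learning rates (illustrative values only).
learning_rates = [0.001, 0.01, 0.1, 0.5, 1.1]
results = {lr: final_loss(lr) for lr in learning_rates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr:<6} final_loss={loss:.4g}")
print("best:", best_lr)
```

Even here the extremes fail in different ways: the smallest rate barely moves in 50 steps, and the largest one overshoots on every step and diverges.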

ocean vortex Jun 13, 2025, 8:30 PM

#

keen beacon it is very complicated if youve worked on this stuff

I did do training quite a bit. The fact alone that you need more powerful setup / more compute to even train a bigger model should tell you that what I'm saying is true. Big model needs much more work to overfit anything. And if you are using same compute and same dataset and same/comparable learning rates, small model in most normal cases would overfit faster...

whole wagon Jun 13, 2025, 8:31 PM

#

There are different types of overfit

small haven Jun 13, 2025, 8:31 PM

#

dark mouth, auto thinking

whole wagon Jun 13, 2025, 8:31 PM

#

keen beacon hyperparams you often need to do a sweep for an optimal configuration and depend...

Sometimes the net randomly blows up too. Especially in FP8 training

zinc ore Jun 13, 2025, 8:32 PM

#

small haven dark mouth, auto thinking

Lmao, abomination

whole wagon Jun 13, 2025, 8:32 PM

#

Is there a new model or smth

#

Why all the svgs

ocean vortex Jun 13, 2025, 8:32 PM

#

with hyperparameters staying roughly the same and same dataset you are essentially fixing the amount of work that is going to be done as a constant. Same amount of steps
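
A quick sketch of the step count being described, assuming the usual definition of one optimizer step per batch. The dataset size, batch size, and epoch count are made-up numbers.

```python
import math

def optimizer_steps(dataset_size: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps: one per batch, repeated for every epoch."""
    return epochs * math.ceil(dataset_size / batch_size)

# Illustrative values only; note that model size is not an input.
steps = optimizer_steps(dataset_size=10_000, batch_size=32, epochs=3)
print(steps)
```

The step count does not depend on model size, although each step costs more compute on a bigger model.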

keen beacon Jun 13, 2025, 8:33 PM

#

ocean vortex I did do training quite a bit. The fact alone that you need more powerful setup ...

fwiw u can believe me or not. i haven't done a "little" of it 🤷

ocean vortex Jun 13, 2025, 8:34 PM

#

keen beacon fwiw u can believe me or not. i haven't done a "little" of it 🤷

well I said "a little" figuratively. Actually made money doing so as well 🤷

whole wagon Jun 13, 2025, 8:34 PM

#

If you have a list of 100 inputs with corresponding outputs and you train an 8-neuron net or a 10k-neuron net, the 10k-neuron net will overfit. This is a type of overfit that generally occurs later in a run
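
A small sketch of this capacity effect. It is not a neural net: a 2-parameter least-squares line stands in for the tiny model, and a 1-nearest-neighbour lookup that stores all 100 training pairs stands in for the oversized one. The data (a noisy line) and every name here are made up for illustration.

```python
import random

random.seed(0)

def sample(n):
    """Draw n points from y = 2x + 1 plus Gaussian noise (made-up toy data)."""
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.5)) for x in xs]

train, test = sample(100), sample(500)

def fit_line(data):
    """Least-squares line: the low-capacity model (2 parameters)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return lambda x: slope * x + (my - slope * mx)

def fit_memorizer(data):
    """1-nearest-neighbour lookup: enough capacity to store every training pair."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line, memo = fit_line(train), fit_memorizer(train)
for name, model in [("line", line), ("memorizer", memo)]:
    print(f"{name:<10} train_mse={mse(model, train):.3f} test_mse={mse(model, test):.3f}")
```

The memorizer reproduces its training set exactly, noise included, and that memorized noise is what hurts it on fresh points.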

ocean vortex Jun 13, 2025, 8:36 PM

#

bigger model tends to inherently generalize better. And if you have a certain dataset meant for a smaller model... for a bigger model you would usually need notably more. To reach a significant enough change, let alone overfit it... Obviously I'm excluding edge cases like different or unreasonably high learning rates and shitton of epochs, but you shouldn't do that to begin with.

jade egret Jun 13, 2025, 8:37 PM

#

here's an alien cat

small haven Jun 13, 2025, 8:37 PM

#

not feeling dark mouth

zinc ore Jun 13, 2025, 8:38 PM

#

That one is pretty decent imo

#

Although kinda weird IG like I'm seeing inside the head without a face

keen beacon Jun 13, 2025, 8:39 PM

#

honestly this could be a 2.5 pro revision

#

ive barely done any tests tho

civic flame Jun 13, 2025, 8:40 PM

#

i don't see why they'd replace the kingfall AB test, if it was ultra, with a pro AB test

small haven Jun 13, 2025, 8:41 PM

#

zinc ore Although kinda weird IG like I'm seeing inside the head without a face

imo kingfall did the svgs a bit better, had more fidelity to it than darkmouth

zinc ore Jun 13, 2025, 8:41 PM

#

I think so too

late path Jun 13, 2025, 8:41 PM

#

these two models are both drawing rounder heads than kingfall lmao

civic flame Jun 13, 2025, 8:41 PM

#

i don't think svgs are the be all end all of model capability tbh

small haven Jun 13, 2025, 8:41 PM

#

civic flame i don't think svgs are the be all end all of model capability tbh

how is it so far for u

civic flame Jun 13, 2025, 8:42 PM

#

i haven't done enough tests but in the ones i have done it has been equal or slightly better vs kingfall

whole wagon Jun 13, 2025, 8:42 PM

#

I thought kingfall was the final checkpoint lmao

civic flame Jun 13, 2025, 8:42 PM

#

no ☠️

keen beacon Jun 13, 2025, 8:42 PM

#

nah

#

no way

civic flame Jun 13, 2025, 8:42 PM

#

they're moving quickly with this though

whole wagon Jun 13, 2025, 8:43 PM

#

Well if they did continue it. I guess diminishing returns is expected so maybe they are just really close

civic flame Jun 13, 2025, 8:43 PM

#

given they're AB testing in Studio I would say it's in the later stage of development yeah

small haven Jun 13, 2025, 8:43 PM

#

is this correct?

#

luxury sports car problem

civic flame Jun 13, 2025, 8:43 PM

#

no

#

it's <1KM

small haven Jun 13, 2025, 8:43 PM

#

f's

civic flame Jun 13, 2025, 8:43 PM

#

lol when a model gets this right can we call it AGI

#

im gonna be dead when it happens 😭😭

small haven Jun 13, 2025, 8:44 PM

#

civic flame im gonna be dead when it happens 😭😭

so next year? lol jk

whole wagon Jun 13, 2025, 8:44 PM

#

I feel like it's just missing some contextual knowledge for this ngl

#

Like it doesn't really understand the world

elder rapids Jun 13, 2025, 8:44 PM

#

civic flame lol when a model gets this right can we call it AGI

kingfall has gotten it "right" and then concedes it immediately lmao

#

I hate when models do this

civic flame Jun 13, 2025, 8:47 PM

#

victim of its own autistic overthinking 🥀

leaden sun Jun 13, 2025, 8:49 PM

#

small haven not feeling dark mouth

can you all not use something cute for the testing?!? 😣