#general


ocean vortex
#

ok I'll download it

#

Well like I said, I don't think budget matters that much with Gemini

but I wouldn't be shocked if for specific prompts (including aider) max budget does result in slightly more tokens. Also what was the total amount of tokens generated? That would be more telling than stddev IMO

#

@keen beacon

#

so... it still shows 32k budget generates more. You are only trying to argue the significance of it

#

wdym

#

you are doing some gymnastics here, I don't think this statistical analysis theory applies perfectly here

#

but it would if you redid the test and got the opposite results

#

if it's always more tokens for 32k budget the results are meaningful

#

with the only question remaining how it translates to performance

#

what you are saying here is that small increase in total tokens generated is impossible...

#

it has to be big or nothing at all

#

well it's your assumption though, by design any small increase you would view as noise

#

are you actually fr here

#

lol

#

that's you who is using models to reason here...

#

simplified, it told you that there's not enough data to say definitively

#

if you sent it 10 more runs like that all of them showing higher mean, the answer would have been likely different

#

and all show higher mean for 32k budget?

#

Well I meant 10 more tests like this one the entire thing

#

I'm literally not. You saw higher mean and then tried to prove that it's meaningless. 32k budget generated more tokens than max_budget. Not definitive ofc with such limited data, but it can be indicative
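For anyone who wants to check whether a gap in mean token counts like this is bigger than run-to-run noise, here is a minimal sketch of a permutation test. The token counts are made up for illustration and are NOT the numbers from the run discussed above; the function name `permutation_p_value` is also just a placeholder.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in mean token counts.

    Shuffles the pooled samples and counts how often a random split
    produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # +1 correction so the estimate is never exactly zero
    return (hits + 1) / (n_resamples + 1)


# Hypothetical token counts per run (illustrative only).
tokens_32k = [5200, 4800, 6100, 5500, 5900, 5300]
tokens_max = [5000, 4700, 5800, 5600, 5400, 5100]

p = permutation_p_value(tokens_32k, tokens_max)
print(f"mean diff = {mean(tokens_32k) - mean(tokens_max):.0f} tokens, p ~ {p:.3f}")
```

A large p-value only says the data cannot rule out noise; it does not show the budgets behave the same, which is roughly the point both sides are circling.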

#

we do not know how their budget thing is implemented, way too many assumptions here... It could affect the stddev itself

#

I'm not dismissing it merely pointing out that it is higher with max budget for that test run

#

which could mean system prompt changes...

#

like don't you find it interesting that not only median is higher but also min is lower and max is higher?

#

Well if you know those you basically know Python, it's an easy language to learn lol

#

nowadays everything shiny is python

keen fulcrum
#

Why do llms use python, like tools and such

ocean vortex
keen fulcrum
ocean vortex
#

and used in statistics

#

so all the graphs etc, data analysis... Python. And machine learning as well

ocean vortex
patent aspen
#

I like Rust a lot. I wish it was easy and impactful to migrate to

ocean vortex
#

I think it will eventually progress into natural language being viewed as coding lol. But we are not quite there yet. Currently all the most meaningful stuff is being done in Python tbh

#

AI can replace coders, but it can't replace those who are making the AI itself (Python), at least not nearly as fast... 🤷‍♂️

torn mantle
#

PUAHAHAHAHA

#

YOU STILL HAVENT SEEN MY EVIL SIDE

nimble trail
#

Thank you for your advice Brian 🤜🤛

patent aspen
#

Oh I'm not talking about the difficulty of using the language. (It is on the more advanced side.) I'm talking about migrating a large code base to Rust

ocean vortex
#

iOS26 design is somewhat growing on me. Not a fan how easy it is to break it with customization but hopefully that's just developer beta things

keen beacon
#

ignore me

patent aspen
#

No worries. It's a fun language

keen beacon
#

i cant use anything else loll

#

isnt that still happening tho?

#

c++ to rust

#

lmao

#

yeah who needs that stuff, honestly you dont need c++

#

just write everything in assembly

#

a random tangent but someone reminded me about the illusion of thinking, at least this in the abstract:

"We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."
kingfall, in the face of increased complexity (two 6x6 zebra puzzles instead of one, where it constantly fails), builds its own system to solve the zebra puzzles exactly. kinda wild 🤣 (still on this, just found it fitting)

#

i dont think the illusion of thinking will hold up well...

#

yeah they cant reason... they're pattern matching on these unseen zebra puzzles

#

i oughta give it more 6x6 zebra puzzles lol

#

im actually curious if it can do more than 2

#

nah 🤣

#

it will fail

#

oh they can

#

can they solve two in 14.5k tokens?

cobalt bane
#

I have a link to kingfall but guys dont leak

keen beacon
#

its a little unfair since they rl on this so much

#

fake news

#

yes

cobalt bane
#

How is it patched? The llm still responds to me

keen beacon
#

its redirecting

#

to 0605

cobalt bane
#

Oh okay

late path
#

@keen beacon

#

tested on 2.5 flash 0520

keen beacon
#

what sampling setting did u use btw

#

on 0605 i sampled 10 on thinking budget = 4096 and it was doing 10,000+ tokens in thinking (i triple checked the api requests)

civic flame
late path
#

temperature=0.05

keen beacon
late path
#

50 samples for auto and 50 samples for 24k

keen beacon
#

honestly the weird graph looks like the bug to me

#

but i dont know

#

i didnt test flash at all

#

what prompt did you use btw

late path
#

142789*521=?

keen beacon
#

i intentionally used one that causes very varied distributions/more representative

#

🤷 i dont know lol

late path
keen beacon
#

ill look at that later

#

look at thinking budget 4096 on 0605 (0 temp/1 top p) i dont know lol (with sampling it returns as expected)

#

not much i just tested it a few on each lower thinking budgets. i only did 30 for max and auto

#

no thinking budget 0 temp etc for reference btw as well

#

the results dont make sense 🤣

calm sequoia
late path
#

The chart just now didn't exclude a response where the model got stuck in a repetitive loop, outputting up to 65535 tokens
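A small sketch of how the runaway responses could be dropped before charting. It assumes that any response hitting the 65535-token cap mentioned above is a repetition loop, which is a simplification; the function name `summarize` and the sample numbers are hypothetical.

```python
from statistics import mean, median

MAX_OUTPUT_TOKENS = 65535  # cap mentioned above; assumed to mark a stuck loop


def summarize(token_counts, cap=MAX_OUTPUT_TOKENS):
    """Summarize token counts after excluding responses that hit the cap."""
    kept = [t for t in token_counts if t < cap]
    if not kept:
        raise ValueError("every response hit the cap; nothing left to summarize")
    return {
        "n_total": len(token_counts),
        "n_dropped": len(token_counts) - len(kept),
        "mean": mean(kept),
        "median": median(kept),
    }


# Hypothetical sample: two normal responses and one that looped to the cap.
print(summarize([1000, 2000, 65535]))
```

Reporting `n_dropped` next to the chart keeps the exclusion visible instead of silently changing the distribution.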

#

idk, it seems like a higher thinking budget reduced the overall number of thoughts based on this test, but the accuracy increased significantly

keen beacon
#

i think youre hitting another bug. it seems the gemini api seems to have a lot of issues idk

late path
#

Another logical reasoning problem. With temp=0.5, it's also basically the 24k thinking budget that reduced the total amount of thought. didn't expect 2.5flash to have a 100% accuracy rate though

keen beacon
#

i oughta look at this later wut is going on

late path
#

I think the thinking budget is definitely a training parameter, enabling it adds more constraints compared to 'auto' (leading to a more concentrated distribution of thought lengths), and the model is aware of it

keen beacon
late path
#

Sroan's safe
Sroan has a private safe. The combination is 5 different digits.
A's guess: 32561
B's guess: 18093
C's guess: 91526
Sroan then said, "Each of you three has correctly guessed two digits in non-adjacent
positions. (It only counts as correct if both the position and the digit are
correct.) If you can figure out the password, the money inside is yours!"
Assuming the three are super intelligent, can they get the money? And what
is the password?
||Password: 38596||
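The spoilered answer can be checked by brute force. This sketch reads the clue as "exactly two correct digits per guess, and those two positions are not next to each other", which is an interpretation of the wording rather than something the puzzle spells out.

```python
from itertools import permutations

GUESSES = ["32561", "18093", "91526"]


def matching_positions(candidate, guess):
    """Indices where the candidate and the guess have the same digit."""
    return [i for i, (c, g) in enumerate(zip(candidate, guess)) if c == g]


def is_consistent(candidate, guess):
    """Exactly two correct digits, in non-adjacent positions."""
    hits = matching_positions(candidate, guess)
    return len(hits) == 2 and hits[1] - hits[0] > 1


def solve(guesses=GUESSES):
    """Try every 5-digit code with distinct digits."""
    return [
        "".join(code)
        for code in permutations("0123456789", 5)
        if all(is_consistent(code, g) for g in guesses)
    ]


print(solve())  # → ['38596']
```

The same result follows by hand: the three guesses need six correct positions in total, which is only possible if the middle digit is the 5 shared by A and C, and the distinct-digit rule then rules out the alternative 98591.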

languid crescent
#

What's the most accurate/powerful ai in LMarena (for coding)?

jade egret
#

project mariner

jade egret
#

it gemini 2.5 pro rn

calm sequoia
keen beacon
#

ill try to collect more data

torn mantle
#

they will eventually update their gemini 2.5 pro on 19th of this month

#

full acceleration

#

I really hope the reason grok 3.5 hasn't been released is because elon wants it to match the fake leaked benchmarks

#

that would be justifiable

keen beacon
torn mantle
#

its 2.5 pro deep thinking

#

2.5 pro latest will be on 19th = kingfall

#

then they will release the deep thinking feature

#

grok 3.5 will probably come around that time

#

its probably ass tho

torn mantle
#

and whats the GA endpoint?

#

the current model?

#

goldmane huh

#

f

patent aspen
#

Yes it's just the current model

vocal pelican
#

will there be any free access?

patent aspen
vocal pelican
#

deepthink

patent aspen
#

Doubtful

alpine coral
#

it'll be slightly nerfed surely

sacred plaza
#

Can someone explain to me why we should pay for the Google ultra plan for deep think feature? When I can just ask the current gemini models to employ graph of thoughts and get the same internal process from the LLM?

alpine coral
#

what's the point of exp->-preview->GA if they're just the same models

keen beacon
vocal pelican
alpine coral
#

(progressively made 'safer' / more corporate aligned before they give the sign off: go, use this in prod)

keen beacon
alpine coral
#

indeed (i mean get it, from a corporate risk management perspective)

#

i just feel like the 'tweaks' made in each intermediary step before GA are to reduce 'harmfulness' rather than increase 'performance'

keen beacon
#

probably not

alpine coral
#

(both admittedly slippery / subjective terms)

keen beacon
#

at least this time, timeline seems too short

alpine coral
#

yeah that's a fair counterpoint

keen beacon
#

they could deploy a completely different revision but why do a preview stage idk

alpine coral
#

not completely different

#

yeah like i said, i do get it

keen beacon
#

unless you know the future

vocal pelican
alpine coral
#

but yeah i feel the focus is on making it 'suitable' for well GA aha

soft kernel
#

Sorry off topic
Is lmarena going to release o3 Pro rankings

keen beacon
#

its not gonna be on the arena

vocal pelican
patent aspen
#

It's also not just alignment. A ton of post training for performance happens between experimental and GA

alpine coral
#

they just never feel 'better'.. but i'm talking out of my ass here.. i think i just have rose tinted glasses on thinking about earlier models ha

soft kernel
keen beacon
vocal pelican
jade egret
#

when is kingfall dropping

keen beacon
#

it no longer exists

vocal pelican
patent aspen
alpine coral
#

for 05-06 fs

keen beacon
#

ngl i still use 0506 lmao

#

its a mix for me

vocal pelican
keen beacon
#

if it performs well for you, it doesnt matter i guess

vocal pelican
#

not that I’ve seen, and generally yeah it has been a tiny bit better for me

#

still prefer qwen for maths though it has gotten a lot right that other models haven’t

keen beacon
#

its very good

#

you could interpret that two ways i guess

patent aspen
#

No bad news. I just don't know of anything particularly interesting happening in July yet

#

Probably the best opportunity for OAI to strike back

keen beacon
#

Idk what to expect with gpt 5 xd

patent aspen
#

If OAI doesn't release o4 or GPT-5 in the next 2 months, they're in trouble IMO

#

I think they probably will though

keen beacon
#

Is the SVG model never gonna see the light of day

neon warren
#

May be they will release Veo3 api in July?

patent aspen
#

I don't track image gen, video gen, etc

keen beacon
#

They did say 2.5 image gen was coming

#

I guess we have to look forward to

alpine coral
#

i haven’t played around much but fwiw o3 pro seems very strong (better than o3 fs, whether better than o1 pro i'm not sure..), but also painfully slow..

#

here for e.g., given a nyt “Connections” word puzzle, o3-pro does exceptionally well (the closest I’ve seen a model come to fully solve it), taking 14 minutes; o3 graciously admits defeat, after taking 10 minutes to think about it; for reference/comparison, 2.5-pro-06-05 blitzes through it in less than 2 mins, but fails dismally.

#

what's interesting is that the slowness (seemingly) isn’t due to the number of tokens used during thinking, but just that the inference is slow af (or perhaps there's some parallel stuff going on that isn't reflected in token count).. like look at the tokens / sec…

#

but also, while ‘faster’, o3 used more than three times as many tokens, only to fail to find a solution; while with 06-05 it’s like it dominated a hurdles race on ‘time’, but crashed through every hurdle along the way..very fast but terrible/disqualifying performance

tall summit
alpine coral
alpine coral
# civic flame what was the puzzle?
Can you solve this puzzle?

---

How to Play:

Find four groups of four items that share something in common.

Category Examples: • FISH: Bass, Flounder, Salmon, Trout • FIRE ___: Ant, Drill, Island, Opal

The above are examples only and by no means exhaustive; categories can be established in various ways, including for example by manipulating/altering four words in the same way, or focussing on a particular component of each, such that they have a shared meaning or association.

Categories will always be more specific than “5‑LETTER‑WORDS,” “NAMES” or “VERBS.”

Each puzzle has exactly one solution. Watch out for words that seem to belong to multiple categories!

——

PUZZLE:

HELL, WELL, COMP, ORGO, SHELL, MILK, LIT, SICK, PASTE, FIRE, SMOKE, DOPE, CREAM, NETI, ILL, LICK

——

SOLUTION:
tall summit
alpine coral
#

one of the nice things about using these i think is that's there no risk of contamination (for the latest ones ofc)

tall summit
#

lmao rambling about something i like

alpine coral
#

i looked at their About page and saw the reference to the british quiz show as the progenitor of it all ha

whole wagon
#

They just count 10 parallel tokens as 1

#

Is 19th June Gemini 2.5 pro kingfall

alpine coral
#

(also not sure about 'parallel tokens', but some kinda parallel / search thingy)

#

actually ig that language makes sense

civic flame
#

lol i use some of their stuff as personal benchmarks

indigo hazel
#

the o3 model available is medium or high?

small haven
#

higher or lower

zinc ore
#

AI studio has issues today

#

Discord attachments down

#

And anthropic having issues too

leaden palm
#

even npm is facing difficulties

#

and neon is (predictably) experiencing an outage

whole wagon
#

How do you know this lol

keen beacon
#

guessing

small haven
#

hes a magician

zinc ore
#

Let's see if true

keen beacon
#

if it happens its just a coincidence imo

zinc ore
#

Dudes luck stats maxed out

keen beacon
whole wagon
#

Oh hi trey

#

I saw u in the chess discords

zinc ore
#

Hello

#

Yeah, I used to talk there years ago

keen beacon
zinc ore
dusky aurora
#

my first Failed to acceptterms-of-use

echo aurora
#

the team is aware of some issues accessing the site, apologies for any inconvenience! we're working hard to get it fixed asap

civic flame
#

okay guys what big cloud provider died this time

dusky aurora
#

these days LMArena is my only source of stress relief

narrow elbow
#

haha, very useful service

tall summit
leaden palm
civic flame
keen fulcrum
civic flame
#

lmfao

tall summit
keen fulcrum
whole wagon
#

I was wondering why I couldn't get LLM arena battles to work lol

leaden palm
dusky aurora
#

great. now r/midjourney will be impossible to browse

civic flame
#

that's crazy

#

cloudflare AND gcp failing at the same time

dusky aurora
#

the addition of image generation has already ruined r/novelai

tall summit
#

2025 midjourney is interesting

echo aurora
#

side note the lofi activity is also giving me troubles atm, everything is broken 😭

civic flame
#

"Update - We are continuing to investigate and work with our providers to resolve this issue. It is likely impacting a few more features like activities." - discord

#

seems both a bunch of cloudflare services and a bunch of GCP services are suffering

jade egret
#

why can't i upload file

#

in discord rn

civic flame
civic flame
#

source @ google tells me an internal service there called Chemist is down

#

Chemist checks "project status, activation status, abuse status, billing status, service status, location restrictions, VPC Service Controls, SuperQuota, and other policies"

#

"goddamnit we told that intern not to deploy gemini's changes without running them through us first"

slender karma
#

looks like the old arena website is working fine tho

civic flame
#

because it doesn't rely on firebase 😉

#

were they planning a launch today?

#

GCP status updated

ocean vortex
#

so now pro also has yapping...

civic flame
#

oh images work now

ocean vortex
#

🫃

echo aurora
keen beacon
#

works for me

civic flame
#

ai studio is working again

#

things appear to be recovering

slender karma
civic flame
#

yeah i think firebase is still a little shaky

#

naturally given it's gone from no traffic to all of it after coming back up

whole wagon
small haven
ocean vortex
small haven
ocean vortex
#

Today’s Yap score is 8192. Should I provide an explanation or just the number? The user seems to be asking for the score specifically, so I think just answering with 8192 would be sufficient. I could add a note if there's space, but given I might need to keep it concise, I’ll simply respond: “Today’s Yap score is 8192.” Alright, let’s finalize that answer!

Yeah it's assuming now it needs to be concise...

small haven
#

o1 pro context was 128k, now its 64k on o3 pro via ui

ocean vortex
#

second guessing itself, they made it confused lol

#

It's like "8192 is that a lot or not? Better be safe and keep it brief."

grim girder
#

lmarena is down rn?

patent bane
small haven
#

ccp chatgpt is built diff

echo aurora
little thorn
#

Hey, what on earth are you doinggg! All my important chat history was deleted!! Why don't you put a data backup module or an option to connect email or Google accounts so that chat data remains!! All chat data from two of my accounts was wiped!!! Seriously, what is this? Just add a Google connect option

jade egret
#

discord logged me out and sent me to limited access until tomorrow 😭

#

why 😭

small haven
#

how tf is this not resolved

keen beacon
#

claude is currently vacationing

small haven
#

omg

soft kernel
#

Mine got wiped out too😭😭

small haven
#

ok so now that all my shxt is down, can someone recap whos at fault here

alpine coral
#

o3 pro via API doesn't use tools afaik

#

not sure about via chatgpt

patent aspen
#
Poll: Which company's models do you use most for daily tasks?

Winning answer: Google (14 of 22 votes)

wintry tinsel
#

Is 2.5 pro down?

small haven
#

i swear if its one intern that is causing all of this 😤

jade egret
wintry tinsel
#

It says exceeded quota on all my accounts

#

And even on another third party app where I use it

civic flame
#

its back up for me

small haven
#

can someone ping me when its resolved? gonna pluck out some weed grass

echo aurora
civic flame
autumn kettle
#

Is it just me or still down?😩

atomic pagoda
#

It’s still down but is there a way to recover chats

lapis light
#

I don't think there's a way to recover those chats, because, based on how I think it works, the chats were stored in local storage, and, when you go to a website, if it changed, local storage gets cleaned, and cache gets reset.

autumn kettle
#

Will the website come back any time soon?

whole wagon
#

If you don't visit the site

echo aurora
keen fulcrum
whole wagon
#

You can just go in your file system and copy the local storage for the site. And then when it cleans it you put it back

keen fulcrum
#

is it a large scale hacking attempt?

atomic pagoda
# whole wagon If you don't visit the site

What if I have multiple tabs open but haven’t visited them like I still have a few from yesterday open like if I don’t visit the site from those tabs does that count

whole wagon
#

I would close them, if it refreshes it's over

cerulean seal
marsh shadowBOT
cerulean seal
#

meh

#

not tryna spam

#

i was just testing smth

#

pictures work

whole wagon
#

The core issue is resolved already

#

It's only that some services take a bit to get going again

autumn kettle
#

So we all are waiting for 2 hrs to pass😑🥲

hardy pecan
#

Looks like internet backbone shenanigans

#

All types of services are down

#

It's not DNS
There's no way it's DNS
It was DNS

lapis light
elder rapids
#

yo

#

all Samsung users get a free year of perplexity bro

#

😭

#

how'd I not know this

atomic pagoda
#

Is it back up because I can access it but all my chats are gone

whole wagon
#

Well. You weren't supposed to access it if you want your chats 😅

#

Because that's what clears them

atomic pagoda
#

I’m aware now but I accessed it before I asked, note for next time

autumn kettle
#

It's back but chats gone🥲 life back to zero

#

How often does it happen every week?

whole wagon
#

Is battle back yet

leaden palm
#

fun fact: the meta ai "discover" feed does not require a log in

#

anyone can see (accidentally) shared conversations

leaden palm
wintry tinsel
ocean vortex
#

Ok I kind of changed my mind on o3-pro after more testing... It may not surprise you with shockingly good solutions beyond normal o3-high level, but they managed to fix the fails of it to quite significant extent. Wouldn't be surprised now if it does beat Gemini on simple-bench 👀

#

parallel compute does give it more capacity to reconsider and do additional things, even if in a more limited scope

#

yeah probably.

whole wagon
#

nah

#

o3 high is 53.1% and gemini 2.5 pro is 62.4%

ocean vortex
#

My initial impressions were mostly based on the facts that it seems to be more concise and does not have more intelligence / isn't more capable in an obvious way. But it is "less dumb" where o3 can blunder, naturally improving the overall performance

whole wagon
#

and also google can simply do this best of n search to beat it again, scaling the number of times you run the llm isnt that impressive

#

run gemini 1000 times and you probably beat human baseline of simplebench, its about the cost to performance
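A rough sketch of why repeated sampling helps, under strong assumptions that are not from the chat: each attempt on a given question succeeds independently with the same probability `p`, and something reliably picks out a correct attempt when one exists. Real best-of-n setups have neither guarantee, and a benchmark average like 62.4% is not a per-question `p`, so treat this as an upper-bound intuition only.

```python
def pass_at_n(p, n):
    """Chance that at least one of n independent attempts succeeds.

    p: assumed per-attempt success probability for one question
    n: number of attempts (cost scales roughly linearly with n)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-question success rates.
for p in (0.05, 0.3, 0.6):
    row = ", ".join(f"n={n}: {pass_at_n(p, n):.3f}" for n in (1, 10, 100))
    print(f"p={p}: {row}")
```

The gain flattens quickly once `n` is large while the bill keeps growing with `n`, which is the cost-to-performance trade-off being argued about.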

ocean vortex
ocean vortex
#

And the pricing of o3-pro is not ridiculous anymore, unlike it was with o1-pro...

whole wagon
#

its $20/$80 compared to $1.25/$10 for 2.5 pro, you are comparing different levels of pricing
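To put those prices side by side, here is a minimal cost sketch. It assumes the quoted figures are USD per one million input / output tokens (the message does not state the unit), and the request size is invented for illustration.

```python
# (input price, output price), assumed to be USD per 1M tokens
PRICES = {
    "o3-pro": (20.00, 80.00),
    "gemini-2.5-pro": (1.25, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the quoted prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: 10k tokens in, 5k tokens out (thinking tokens included).
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 5_000):.4f}")
```

The ratio depends on the input/output mix: it is 16x on input tokens alone and 8x on output tokens alone at these prices.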

ocean vortex
#

and o3 is a very different model to 2.5pro with different strengths...

whole wagon
#

the point is deep think is not required. 2.5 pro equals o3 pro at a cheaper price

whole wagon
#

yes it does lol

ocean vortex
#

2.5pro is worse than o3 non-pro on numerous things

#

at best it's equivalent, at worst it is not entirely o3 level model

whole wagon
#

o3 sucks at maths compared to 2.5 pro, its not even close

ocean vortex
whole wagon
#

seems everyone else also lives in this 'delusion' looking at the leaderboards

#

2.5 pro is top in every single category

#

out of error also

ocean vortex
#

they were never meant to be the definitive indicators of performance

#

2.5Pro is great but it is not better than o3 and it is objectively and measurably, definitively worse than o3-pro

unborn ocean
#
  • i would also argue that it is on par (and sometimes even better) than regular o3
ocean vortex
grim girder
#

is it still down?

unborn ocean
#

although when it comes to very very complex tasks o3 is just wayy better (imo).
because it can handle high token output better

zinc ore
whole wagon
#

They using vibes

unborn ocean
zinc ore
#

Huh, no I'm saying show where models are performing better and worse

whole wagon
unborn ocean
#

well i don't think they reported many improvements on math

#

obv you could argue that they are at par now

#

but now you don't have any data

whole wagon
#

Google released the benchmarks with the model?

#

For example

#

The base model is 35%

#

On USAMO 2025

#

Read it

ocean vortex
whole wagon
#

Saturated benchmarks

#

Obviously

unborn ocean
# whole wagon For example

that is a different benchmarking process as far as i know, because the people behind the mathbench arena actually manually check the reasoning
(but honestly you might also be right and 2.5 pro might be equal to o3-high in math)

#

(as far as i know, could also be wrong)

whole wagon
#

It's the same benchmarking process lol

echo aurora
whole wagon
#

They just didn't bother to test the recent version

#

I would pull up frontier maths results but they refuse to test Gemini since it's openAI funded benchmark

#

Sad

ocean vortex
#

this should shut you up

whole wagon
#

That's saturated obviously

#

87%

#

Lol

ocean vortex
#

close to saturated? yes. Not yet saturated though

whole wagon
#

It's so close to 100 that it becomes basically luck. Actually learn some statistics, there is such a thing as sample size
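The sample-size point can be made concrete with a Wilson score interval. The number of problems behind the 87% figure is not given in the chat, so the 26-out-of-30 example below is purely hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: 26 of 30 problems solved (about 87%).
low, high = wilson_interval(26, 30)
print(f"26/30 -> {low:.3f} to {high:.3f}")
```

With only 30 problems the interval runs from roughly 0.70 to 0.95, so two models a few points apart near the top cannot be separated; with a few hundred problems the same score would pin things down much more tightly.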

zinc ore
#

Does the average include USAMO

ocean vortex
#

just because you don't like the results does not make it saturated

#

you don't get to pick and choose

#

and this is overall score considering all the main math benchmarks

unborn ocean
#

o3-high

whole wagon
#

It's the only weakness remaining. The tool use sucks

unborn ocean
#

o3 very roughly the same

whole wagon
#

The actual model itself is astonishing though

#

They magically found 5x efficiency improvement

#

Sure

unborn ocean
#

depends, honestly in many other tasks other than the academic ones (which i assume we are referring to when calling a model more powerful) o3 / o3-high / o3 pro are undeniably worse than 2.5 pro

ocean vortex
unborn ocean
#

no

#

then why would 4.1 already be priced this low

whole wagon
#

Crazy

unborn ocean
#

did they find this 5x cost reduction for 4.1 first

#

and then wait months

#

nah

#

can not make this s*it up

ocean vortex
#

let's not push this too far lol. I'm pretty sure they just used that as an excuse to be completely honest. They had to say something otherwise it makes them look bad

#

they have less profit margin now

unborn ocean
#

just funny

ocean vortex
#

let's chill out everyone lol

unborn ocean
#

imo if 2.5 pro fixes the following things it will be clearly better (not accounting for o3 pro):

  • actually use and then perfect tools (this also includes being able to follow structured prompts better, e.g. diff and command usage)
  • better adapt the reasoning length (higher variance in auto mode and more intelligently detect difficulty, in my experience this is still the main area holding the models back)

what o3 needs to improve to beat 2.5 pro (assuming they would actually update it):

  • actually make the model enjoyable to talk to
  • actually make the model "smarter" (my vibe) and not be completely unaware of some assumptions it is making
  • some more knowledge honestly
  • better general visual understanding
  • better long context
  • make it better at reasoning over visual elements: webdev, complex video
  • prob. other stuff as well (but i have not really used o3 enough to judge)

-> most of the difference comes from things not highly relevant to the average user
-> in practice same intelligence

zinc ore
#

o3 pro USAMO score when

unborn ocean
#

well i want it to chat in a way where it is aware of assumptions

whole wagon
#

o3 API usage is barely anything. See openrouter

#

That's the whole reason they had to drop prices 80%

#

The model was barely being used through API

#

The price drop does not seem to have helped that much though

unborn ocean
#

well part of a good model is making the model so that it adapts really well when talking to it and "intelligently" identify what i want
which 2.5 is able to do

whole wagon
#

💀

unborn ocean
#

it is still a valid point, o3 API usage is probably very low

#

no

#

so you think they are just like: well, i mean anthropic has these models that are really popular for coding and googles models are also on the rise for that. both of these companies are our competitors. BUT we will just not care at all

#

because "o3 is not an everyday model" 🤡

whole wagon
#

💀

small haven
#

craigbench is triggering some ppl

#

made by google btw 😭

#

hmm, ur a wizard harry

#

ktibow also predicted cloudflare downtime

torn mantle
#

Lmao

placid notch
hardy pecan
#

no need to post your gooner clips here bud

nimble trail
#

👀

#

Is this just a lame dad joke or...

jade egret
#

tbh idk

nimble trail
#

😭

#

Seems like I'm just reading into it a bit too much.

#

Coping about Kingfall rn.

small haven
hollow ocean
#

Does o3 pro beat kingfall?

placid notch
#

@placid notch

elder rapids
#

yo wym

#

ion think this has anything to do with deployment

#

it's too widespread

#

they fixed a lot of it now

#

funny the models were the first to show it

#

and then it bled into everything else on the internet

#

gcp is crazy

#

ye

#

really is crazy how one problem like that cascades into affecting basically the entire world

#

it affected discord, Spotify, Snapchat, etc

#

this seems to be the path they want to take

#

although it doesn't have to be the case that it happens

#

the CEO of Google and DeepMind seem to have very interlinked views tho

#

ye but that isn't really a bad thing

#

to properly combat it

#

you need to have control

#

and to control it, there's no way else

jade egret
#

guys do you think google is gonna lose stock or gain stock in the next few years? (because it might sell chrome and all the ai stuff)

#

yea

#

but what if google lose

elder rapids
#

yo I don't think it's explicitly a bad thing if they lose chrome

#

ye but that would simply be creating another monopoly

#

so that's easily dismissable

#

or not something that could be entertained

#

no as in

#

they can't give it to anyone else or open that up

#

it wouldn't be a possibility they open chrome up to imo

#

it doesn't have to be that way

#

no, no one would buy it

#

it doesn't have to be that it's sold lmao

#

yeah but like every instance of things like this happening

#

that's just as valid as any other claim

jade egret
#

i think google won't have to force sell chrome but it will end some contracts

#

like google won't be default browser on firefox or something like that

elder rapids
#

inherently, but that's what I'm saying presupposes to say the contrary lol

jade egret
jade egret
elder rapids
#

the DoJ has more power, obviously, but even under that power it seems like large companies being accused of these things force that claim into a result that's either somewhat related, or just one of the many instances that are besides "sell x"

#

for example, Microsoft appealed and they vacated the breakup order

#

ye, inevitably Google isn't going to be hit hard

lilac nimbus
jade egret
small haven
#

claude 4.1 came early?

keen fulcrum
#

looks like grok 3.5 is live on arena

#

x-preview

keen beacon
#

Isn't that from baidu

whole sundial
#

i think that's Baidu ERNIE X1

small haven
#

who pinged me

elder rapids
keen fulcrum
#

Why can’t you believe

small haven
#

x-preview is not grok 3.5

jade egret
#

how u know that

#

wut that

alpine coral
#

I was not trained by Google, but am instead an independently developed large language model by Baidu, based on its self-developed deep learning framework PaddlePaddle. My name is Wenxin X1 (ERNIE X1). My technical architecture is entirely based on Baidu’s long-term accumulation in the field of artificial intelligence, including core technologies such as pre-training algorithms, data engineering, and model optimisation. I aim to provide users with a professional, secure, and contextually appropriate intelligent interaction experience in Chinese.

zinc ore
small haven
#

ITS COMING

torn mantle
#

Its coming aaaaaaaaa

#

Finally

#

🥺

elder rapids
#

yo wait it's coming to the API too

#

I forgot

dusky aurora
#

"Connecting to Arena has failed"

atomic pagoda
#

I hate this, it’s happening again

#

Do I just wait until it’s back up again to have the chat that I tried to get all my work in to continue it or is it gone already

languid crescent
#

uhhh is lmarena down?

atomic pagoda
#

Apparently yes

languid crescent
#

ooh I thought I was the only one

#

same issue with no models?

atomic pagoda
#

Yeah and an error for me as well

languid crescent
#

okok ty

languid crescent
dusky aurora
languid crescent
#

oh shi does that mean chat history is cleared too?

leaden sun
#

...interestingly, I didnt experience outage at all, everything is fine on my end 😅

dusky aurora
atomic pagoda
#

Does anyone know when it’s gonna be back up again?

mystic mica
#

Legacy version is up though

thorn bough
#

it's working now

atomic pagoda
#

Is it back up again?

dusky aurora
#

for me it seems so

atomic pagoda
#

Are your chats still there

dusky aurora
#

yes, they are

atomic pagoda
#

Okay, thanks

dusky aurora
hazy quest
ocean vortex
languid crescent
#

my chats are still saved thanks daddy lm ❤️

hazy quest
mossy drum
#

New model in Battle mode: stephen-v2

sacred quail
#

What the hell? Does lmarena have video models now...?

elder burrow
#

where is that

leaden sun
elder burrow
#

insane.

keen fulcrum
willow grail
leaden sun
#

i think stephen is supposed to be a chinese model too, it's in the arena btw

calm spear
#

will we have P2L in new UI?

sacred plaza
#

is any lab employing this price strategy with their gen ai models? I have heard that twitter guy is making grok-3 input/output costs fairly low to build market share.

mossy drum
#

New model in Battle mode: prowlridge

calm sequoia
#

"a benchmark derived from puzzle hunts—a repository of sophisticated problems from the global puzzle-solving community"

sacred plaza
#

still no benchmarks to measure real world practical knowledge work tasks yet huh?

misty vault
calm sequoia
#

I wonder why @alpine coral your personal bench has such different results

gleaming galleon
unborn ocean
#

quite a lot of em

#
  • should specify i mean "some AI things they did" :)
#

bc some of them are well-known beyond the field

late path
#

they already went bankrupt and got acquired by Alibaba lol

patent aspen
#

I could name models from maybe 4 companies on the list but only put DeepSeek because it's the only model where I know what it does and care

late path
#

I voted for all the companies whose product names I knew

unborn ocean
#

me too, though for me it was obv. a bit unfair as I only included the things I already knew...

patent aspen
#

I think xAI and DeepSeek are on the same tier

leaden sun
languid crescent
#

can we rename our chats?

sacred quail
#

Grok is underrated i think

torn mantle
#

pls

sacred quail
#

We might see grok 3.5, and it could be really good if elon isn't deported to south africa first

patent aspen
#

For the record, I'm not talking about current models. I'm talking about the position they're in.

I would put DeepSeek one tier above xAI if they didn't have issues with hardware embargos and corporate governance risks.

Actually I think I would still put DeepSeek above xAI because they are domiciled in China and are insulated because they will have exclusive access to the Chinese market

sacred quail
#

Also they announced big brain mode earlier and didnt release

#

I think they have something

patent aspen
#

xAI needs to compete with American labs to survive whereas DeepSeek does not

sacred quail
#

I have sympathy for deepseek not only because it's cheap or open source, I also like their company's vision. The guy who leads deepseek has already criticized China's strategy. He thinks China must focus on being creative rather than just producing more or cheaper

#

I read some interviews with that guy

novel flame
#

Seedance 1.0 is really incredible -- could be the new SOTA? ... perhaps it's edged out by Kangaroo. But what impressed me even more was the LTX Video v0.9.7 model which won several matchups for me and is fast enough to generate video in realtime:

LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. It can generate 30 FPS videos at 1216×704 resolution, faster than it takes to watch them.

https://github.com/Lightricks/LTX-Video


sacred quail
#

Also they were producing some investment machines?(sorry i forgot the name) but they were already too talented

late path
#

bytedance's new model is also quite good, its performance is on par with deepseek's model. And they're backed by a big company like bytedance, their research pace is faster than deepseek

sacred quail
#

There is a reason why a random company like them beat alibaba's AI

leaden sun
sacred quail
#

You guys just love regulating things too much, it can't be helped

#

But dont worry, anthropic is here for you

leaden sun
#

Qwen where? 🥺

patent aspen
#

Based on the position they're currently in, not their current models

late path
#

I can see that in the last year, bytedance has gone from making very mediocre models to making good reasoning models close to R1-0528 and o3

leaden sun
#

qwen was able to decipher Craig's encryption, while claude wasnt able to....

patent aspen
leaden sun
#

that has surprised me quite a bit

late path
whole wagon
#

xAI is well positioned, they have the largest pretraining cluster by far

#

Progress is fast but since they started much later than others it's lagging in current models

leaden sun
whole wagon
#

I would put Google at S++ also

#

They have the full stack

#

From the hardware up

late path
#

I feel like xAI is still missing a lot, they're even less comprehensive than deepseek

patent aspen
#

I agree on paper, although OpenAI's mindshare is a massive advantage

#

ChatGPT is a verb

whole wagon
#

Sure, but openAI could be the apple of the AI world basically. Does that make them the top if they don't have the actual model intelligence crown?

patent aspen
#

Debatable depending on criteria. Probably not worth arguing about

whole wagon
#

You said you are talking about positioning. Not current models

#

Well, I think OpenAI is poorly positioned in terms of compute. That's probably their primary concern. Like Stargate is expected to deploy 100k GPUs by the end of this year. Meanwhile xAI put 200k online in 100 days

#

And Google already had the compute deployed lol

ocean vortex
#

that's not even close. Magistral-medium seems to be around the level of 30b reasoning models. So like qwq-32b, qwen3-30b MoE...

#

Grok well beyond this

#

Mistral should have made Magistral-Large first and then distilled instead, but ig funding was an issue for them

#

Now that they partnered with nvidia, maybe things will be better moving forward

whole wagon
#

Second on compute might be meta actually. Even though it doesn't seem like it atm 😉

late path
#

I suspect OpenAI's research can't keep up anymore. o3 seems to have only started training at the beginning of this year, so they only have a 2-4 month head start internally. Plus, the improvement in o3 pro is so minor, the fact they released it anyway makes it feel like they've really got nothing else to release.
Meanwhile, Google is accelerating, not slowing down. It seems they've always had more powerful models internally.

ocean vortex
#

it's more of Google doing a good job recently

#

OpenAI is working hard as usual

#

imo

whole wagon
leaden sun
#

aren't they the ones building NPU, TPU and [insert fancy letter]PU?

whole wagon
ocean vortex
#

yeah Google is in way better position than pretty much everyone else. Infinite data, cheap compute with TPUs...

whole wagon
#

First off, they can't fit a new pretrained model into their compute budget. So they are having to use 4.1 as the base for o4 (GPT5). That limits them before they even start RL. There's an entire list of issues they are having, because they are used to a yearly cadence, not this 3 month one Google is pushing them into

#

So they are in a tricky position

ocean vortex
#

them having "achieved AGI internally" and other things was just empty hype and noise. They went as far as to announce things before they were even real lol

late path
#

Kingfall is undoubtedly stronger than o3(and o3-pro), not to mention it incorporates improvements like deepthink, and Google is still accelerating its research.
Meanwhile, the only new model from OpenAI seems to be the much-anticipated GPT-5, but... even someone like Sam, who's so good at creating hype, isn't promoting it in advance (by contrast, they started hyping o3-preview four months before its release). This is very suspicious

whole wagon
#

GPT 5 is going to be SOTA at release, but not by a large margin. And then Gemini 3 pro will retake the crown later (who knows when that's releasing)

ocean vortex
whole wagon
#

o3 is gpt4.1 based (a different checkpoint but same arch)

ocean vortex
# whole wagon GPT 5 is going to be SOTA at release, but not by a large margin. And then Gemini...

The consensus was similar back when google announced gemini ultra and that it would take years since OpenAI would always respond and be ahead. But we hit a wall and models were getting cheaper rather than much better. You never really know and it's not as simple. OpenAI has quite a bit going for them as they are essentially pioneers of consumer oriented reasoning models. They might do the same for another big idea...

late path
#

I've suddenly lost confidence in gpt5 releasing in july lmao, I even bought yes on polymarket for its july release. now I really have to reconsider that

whole wagon
#

I think they have to release then. They can't risk releasing after Gemini 3 pro

#

It has to be before so they have some time at the top model intelligence

late path
#

hopefully

#

I'm looking forward to seeing a model smarter than kingfall anyway

ocean vortex
whole wagon
#

It's a new RL model. The pretrained base model is 4.1

ocean vortex
#

You can save money with hybrid model / system, but how on earth can it possibly perform better than using high reasoning all the time... I do not think it can lol

ocean vortex
whole wagon
#

openAI is not very secretive when it comes to their products. They have testers not even within the company

#

They are only secretive about their core research

ocean vortex
whole wagon
#

I think mostly these ai companies don't care if their releases and expected performances leak beforehand

#

Because the testers are not even NDAed

late path
ocean vortex
whole wagon
#

GPT5 is o4. Except the thinking can be scaled completely (it works with it turned off). It's not a router except on the very cheapest end to switch to mini

#

Ofc it's a new model, they wouldn't re release o3 lmao

#

Why would I expose people sharing information they are not supposed to, that is just stupid

#

I don't know if they did. But it can't have been a final version, since o4 full is still training

ocean vortex
whole wagon
#

Because he wanted to be the 'dictator' himself?

#

He wants to make money, gain power etc

ocean vortex
#

OpenAI is undeniably preparing for merging reasoning and non-reasoning tbh
gpt4.1 is already price matched to o3

whole wagon
#

o4 will be the same per token pricing as current o3

ocean vortex
#

curious what they will do with mini lineup, that one is still far from being price matched... $4.40 vs $1.60 output
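A quick sketch of the gap being described, assuming both figures are USD per 1M output tokens (the unit is not stated above, and which model carries which price is left as in the message):

```python
# Assumption: both prices are USD per 1,000,000 output tokens.
HIGHER_PRICE = 4.40  # the "$4.40" figure from the message
LOWER_PRICE = 1.60   # the "$1.60" figure from the message

def output_cost(tokens, usd_per_million):
    """Cost in USD of generating `tokens` output tokens at the given rate."""
    return tokens / 1_000_000 * usd_per_million

# How far apart the two mini prices are, as a multiple.
price_ratio = HIGHER_PRICE / LOWER_PRICE

# Example workload: 50M output tokens (an arbitrary illustrative number).
workload = 50_000_000
extra_cost = output_cost(workload, HIGHER_PRICE) - output_cost(workload, LOWER_PRICE)

print(f"ratio: {price_ratio:.2f}x, extra cost on 50M tokens: ${extra_cost:.2f}")
```

So "price matched" here would mean closing a gap of nearly 3x, not a few cents.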

ocean vortex
whole wagon
#

GPT5 is o4

#

They just won't call it o4 to solve the naming issue

ocean vortex
whole wagon
#

Read

#

The question becomes, what kind of gain can be expected from essentially training o3 for much longer (compute wise)

late path
#

gpt5 gonna completely break the promise that gpt's pre-training scale would increase 10x with every +1 version lol

#

maybe 100x idk

whole wagon
#

💀

ocean vortex
whole wagon
#

It's not a bigger model. This is for sure lol

ocean vortex
whole wagon
#

Yes

ocean vortex
#

how can you be sure lol

hollow ocean
#

Who else paid $1 for o3 pro

#

Team promo

whole wagon
late path
#

RLmaxxing will create a schizophrenic model full of hallucinations haha. I think the current o3 is already neurotic enough, it constantly hallucinates tool calls. I dread to think how much more it could improve if we keep doubling down on RLVR on the same base

whole wagon
hollow ocean
#

You can get cheap Gemini ultra accs on Chinese sites

ocean vortex
#

they have to train it for both outputting reasoning and not outputting it. And they already kinda trained o3 base model for as long as it is reasonable... This sounds questionable tbh

whole wagon
#

Also they are targeting swe bench to try to be more competitive with Claude in that domain

hollow ocean
#

Not a lot of people know about $1 o3 pro

tall summit
hollow ocean
#

Secret

hollow ocean
ocean vortex
hollow ocean
#

Claim it quick before it’s gone

whole wagon
#

I found o3 and 2.5 pro are equally bad with the hallucinations anyways

hollow ocean
tall summit
#

can it solve letterle

whole wagon
#

These models have terrible visual reasoning or smth. I put some basic geometry problems and they all failed lol

ocean vortex
whole wagon
hollow ocean
#

Vision not solved yet

#

Not much progress made

ocean vortex
#

and I agree

whole wagon
#

Gemini failed also

ocean vortex
hollow ocean
#

One shot

#

Light work

ocean vortex
#

which is why I would be extremely surprised if o4/gpt5 is same size as o3

tall summit
#

gemini thinks it's 115°

ocean vortex
#

like there are no obvious ways for fast gains

tall summit
#

which is 65° if you give it the benefit of the doubt

hollow ocean
#

Don’t even need o3 pro

ocean vortex
#

remaining with that size

whole wagon
#

It's an incremental gain. Enough to be SOTA at release

tall summit
ocean vortex
hollow ocean
#

Too smart

whole wagon
#

By trained for longer, I mean on a multiple of the compute it has had up to this point. Not like 50% more or smth, think like 5x lol

hollow ocean
#

Tool use too good

whole wagon
#

o3 was actually not that final

ocean vortex
whole wagon
#

The model size becomes a bit more limited because the reliance on human data is reducing

#

They have to literally do huge amounts of inference to create most of the new dataset

#

Deepseek didn't have the compute available to do such a thing I think

ocean vortex
#

I mean, o3 is compromised by size currently for sure, but I dunno... On the other hand it could make sense sticking it out assuming future progress and considering training time.

alpine coral
#

e.g. this kind of puzzle (times 1,184):

#

yeah i think it's a pretty dubious benchmark tbh (i can see how o-series models would do well on it though; it's kinda brute force, which they do well at imo )

#

and again just fwiw.. i basically dump 10-20 ‘questions’ in a single prompt and add the results to an unwieldy spreadsheet .. there’s nothing scientific about it but the basic aim is to test critical comprehension (so plenty of riddles/wordplays) and common sense reasoning and spatial/emotional awareness (plus a few tasks, mostly anti-LLM/tokenizer things). here’s a few for reference (they’re not really ‘puzzles’ at all) :

\\ A digital clock shows 3:15. What is the angle between the hours and minutes being displayed on the numerical screen?
\\ If forced to choose between i or ii, which scenario would Bob most likely prefer?
i) Bob scratches his dream car immediately after purchasing it
ii) Bob gets abruptly sacked from his full-time job which he didn't enjoy much
\\ Write one sentence that includes the words “tract”, “fact”, “factory”, “intact” and “react” - in that order
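
For reference on the first question: reading it as a trick (a digital display has no hands, so there is no angle to measure) is an assumption, not something stated above. The analog-clock computation a model might reach for instead can be sketched as follows; `clock_hand_angle` is a hypothetical helper name.

```python
def clock_hand_angle(hour, minute):
    """Smaller angle in degrees between the hands of an ANALOG clock."""
    # Hour hand: 30 degrees per hour plus 0.5 degrees per elapsed minute.
    hour_deg = (hour % 12) * 30 + minute * 0.5
    # Minute hand: 6 degrees per minute.
    minute_deg = minute * 6
    diff = abs(hour_deg - minute_deg) % 360
    return min(diff, 360 - diff)

# The analog answer for 3:15; the question itself asks about a digital clock.
print(clock_hand_angle(3, 15))
```

A model that returns this number has done the arithmetic correctly but may still have missed the wording of the question.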

#

but yeah they’re testing different things ig would be the short answer ha

#

i don't use this question but it's a great one.. from the Simple Bench sample questions

#

doesn't matter how long they reason: if the model falls for the 'waterproof' red herring in the description of the glove (and assumes it fell in the river), then it's screwed and will invariably fail - like all models still do in terms of consistent responses (afaik).. but yeah the 'solution' is far simpler than a complex word-path puzzle.. it just requires grasping the basic reality of the situation described (and not assuming complex calculations are required to solve it)

patent bane
#

when reasoning models stop assuming that's when they can actually solve complex and simple questions

tall summit
alpine coral
tall summit
#

amazing i know

alpine coral
#

i'm still yet to emerge from a few rabbit holes ha

sacred quail
#

Do you guys think 2.5 deep think can be better than o3 pro ?

#

I have that feeling

keen beacon
#

Ultra is probably better

tall summit
#

nobody knows until it releases

#

it very well could be. it might not

sacred quail
#

I mean, we know base models soo we can make some guesses but i got your point

tall summit
#

brb asking o3

sacred quail
#

I heard deep think gonna come to ai studio which is huge

#

We can try

sacred quail
#

I saw from this

dusky aurora
wintry tinsel
#

AI space been a little slow

#

O3 pro isn’t anything significant

#

And besides that it’s all more Google hype what is everyone else doing?

ocean vortex
leaden sun
wintry tinsel
#

Better technologies could develop but there’s no guarantee, and money is lobbed at things certain to be profitable

hollow ocean
#

@deep adder only 20 messages per month for o3 pro teams

#

Maybe add a new seat

#

New Gmail

#

Easy

dusky aurora
#

Arena developers, are you working on any cool features (usable in direct chat)?

hollow ocean
#

Deep think will be better

#

$125 for 3 months

#

Let’s see when it comes out

keen beacon
#

apparently next week

hollow ocean
#

Next week for sure

keen beacon
#

versus ultra?

hollow ocean
#

Its rank 1 on creative writing

#

It’s good

ocean vortex
#

@keen beacon ok Paul from Aider does not appear to be very smart lol. But that whole thinking budget saga I would say is still not resolved. He tested 2.5Flash too and got the same results (budget 24k higher score than auto budget). Coupled with things that you saw yourself like higher median and... that's a lot of "coincidences". There's defo something changing with the model setup affecting model responses in unknown ways auto budget vs max I would say
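
One rough way to put a number on "a lot of coincidences" is a sign test: if the budget setting made no difference, each paired comparison would be a coin flip, and the chance of every comparison pointing the same way shrinks quickly. A minimal sketch, assuming the comparisons are independent (which may not hold for runs sharing prompts or seeds); the function name is made up for illustration.

```python
from math import comb

def sign_test_p_value(wins, trials):
    """One-sided p-value: probability of at least `wins` out of `trials`
    paired comparisons favoring one setting if each were a fair coin flip."""
    favorable = sum(comb(trials, k) for k in range(wins, trials + 1))
    return favorable / 2 ** trials

# Two comparisons agreeing is weak evidence; ten agreeing is much stronger.
for n in (2, 5, 10):
    print(n, sign_test_p_value(n, n))
```

This only says whether the direction is consistent, not how large the effect is or what causes it.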

keen beacon
#

but yeah theres something happening

hollow ocean
#

o3 close second

#

Next month

keen beacon
misty vault
#

crack bench

keen beacon
#

(cap results

#

i dont think there are walls in either atm fwiw xd

ocean vortex
# keen beacon

would be interesting to see total tokens too. Maybe it switched from thinking to final response 🤔

keen beacon
#

theyve been amazing from the start

#

claude 1

#

claude beta

sacred quail
#

Not for live bench

keen beacon
#

it makes sense why theyre still great, if they use a lot of synthetic data those tendencies would continue

sacred quail
#

That plot unscramble category was always legit for long time

#

For creating stories

hollow ocean
#

I heard the owner was biased

#

They change the questions a lot

sacred quail
#

Also you can see claude models have high scores too soo it is about writing

#

For reasoning I'm finding O3 better too, but for writing gemini 06/05 is best right now along with Opus 4

small haven
#

yea

unborn ocean
#

And their models are always on the bigger side, which seems to help as well

keen beacon
#

yeah, a bit unrelated, but i dont think anthropic are good at small models xd

ocean vortex
#

lmao I think I just broke 2.5pro. It's generating response after response recursively repeating itself over and over, with several thinking boxes per response... catgrin

keen beacon
#

that happens to me as well, but under a specific scenario. did you do anything notable btw?

ocean vortex
#

was playing with making it output the system prompt, it inevitably got itself into that thing it does where it repeats your message instead... Now it's repeating itself instead lol

keen beacon
#

somewhat related but i think 2.5 pro can output special tokens

#

it can also break the output

#

that mightve happened in ur scenario

#

it reminds me of that a little

ocean vortex
#

it's interesting that it's prompting itself though... like how does it keep sending responses with no additional input? 🤯

#

weird

keen beacon
indigo hazel
#

How is best Qwen model compared to o3 and 0605?

ocean vortex
#

it stopped just shy of maxing everything out 😭

#

still impressive though, to get 900k with just like 3 sentences input lmao

keen beacon
#

damn imagine paying for that 🤣

whole wagon
#

Wtf is this

keen beacon
#

they're doing side by side testing on aistudio

#

its broken for now

whole wagon
#

models/kingfall-ab-test

small haven
#

wen models/titanforge-ab-test

whole wagon
#

It usually works but this time it fails because it tried to use that model

keen beacon
#

yes it broke recently

jade egret
small haven
#

rip kingfall

civic flame
#

i think it's temporary

#

given the AB test is still attempted by the frontend and it just can't reach the model

#

they probably pulled it to update it with a new checkpoint or something

zinc ore
#

There's new ones added

keen beacon
#

timeline makes sense

civic flame
#

yeah

#

oh?

#

well in that case surely it's going to arrive on lmarena any day now

small haven
#

send the new model id

civic flame
#

if they're planning a Thursday release they're leaving it pretty fine with the lmarena testing

keen beacon
#

idk about them ever releasing it (compute and all that?) but it seems pretty far in development

civic flame
#

it's so over

#

gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant

keen beacon
#

maybe that was when the abtest started

#

2 weeks ago?

#

or is that actually the revision date

civic flame
#

i don't think so, it's just the date that checkpoint was finished

jade egret
elder rapids
#

y'all still doing this?

small haven
#

wtf that actually worked

zinc ore
#

:p

jade egret
#

when it happening

small haven
#

THERES A NEW KINGFALL

#

CAP

jade egret
elder rapids
#

btw is prowlridge good

mellow moss
#

yo so for the ai generation image contest, would giving it an image and saying "recreate this image as accurately as possible" be technically allowed? 🤔

zinc ore
#

Think it's technically a bit worse than kingfall but pretty similar

small haven
civic flame
#

I CALLED THIS

#

i was talking to someone about it in dms

#

and i said

#

let me see

keen beacon
#

but we already know its ultra

#

😂

civic flame
keen beacon
#

what else could it be tho?

small haven
#

@zinc ore how did u even find this btw, u wizard

#

oh fireworks 😮

jade egret
keen beacon
#

fr if its not ultra what is it

#

its a bigger model

jade egret
keen beacon
#

no

#

definitely not

jade egret
#

no?

#

oh

keen beacon
#

it is not

jade egret
#

how yall know

small haven
#

terminator coming

#

his teeth are black

keen beacon
#

its definitely not gemini 3

civic flame
#

shh

keen beacon
#

you cant interpret that version number naively for sure

small haven
civic flame
#

black eye

#

trust

small haven
#

black eye

jade egret
late path
#

I run the terminator svg on every new model

jade egret
#

oh nvm

ocean vortex
late path
small haven
#

white mouth

zinc ore
#

Do other pro models use that style?

late path
small haven
#

it's like reiterating on me

jade egret
#

if kingfall releases, is it gonna be the best (over 4 opus and o3 pro?)

keen beacon
#

kingfall is dead and gone

small haven
#

dark mouth > kingfall

late path
#

none of them are as good as kingfall

ocean vortex
#

it does mean something if you know how to prompt it and view the results tbh. Tends to correlate with webdev and arc-agi to a meaningful degree

small haven
#

it stopped generating an svg, mid response, and started to generate a new one, dark mouth

keen beacon
ocean vortex
#

arc-agi is spatial reasoning

small haven
#

wait let me see this svg tho

ocean vortex
#

Opus does great. It does well with svg too. 2.5Pro... new versions were not tested on arc-agi

zinc ore
#

Opus is best at arc AGI 2

ocean vortex
#

but the old one did very decently too

small haven
#

what we thinking

elder rapids
jade egret
#

we'll see when the actual benchmark comes out ig

zinc ore
#

It's elaborate which gets points from me

ocean vortex
#

Opus? That's like the model the least affected by contamination tbh. It being big you would have to work hard to overfit

keen beacon
small haven
#

oh i forgot to add thinkingBudget: 0
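
For anyone following along, a sketch of where that setting sits in a Gemini REST-style request body, as far as the public docs describe it (`generationConfig.thinkingConfig.thinkingBudget`). The helper name is hypothetical, no request is actually sent, and whether a given model accepts a budget of 0 is not confirmed here.

```python
import json

def build_generate_request(prompt, thinking_budget=None):
    """Build a generateContent-style JSON body. No network call is made."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    if thinking_budget is not None:
        # Assumption: the budget lives under generationConfig.thinkingConfig.
        body["generationConfig"] = {
            "thinkingConfig": {"thinkingBudget": thinking_budget}
        }
    return body

# Same prompt with thinking disabled vs. left at the model default.
no_thinking = build_generate_request("Draw a terminator as an SVG.", thinking_budget=0)
auto_thinking = build_generate_request("Draw a terminator as an SVG.")

print(json.dumps(no_thinking, indent=2))
```

Leaving the field out entirely is what makes the "auto" comparison fair: the only difference between the two bodies is the budget.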

late path
#

and previous kingfall results fwiw

ocean vortex
# keen beacon bigger models overfit more easily

Not really. They need more training to get the same result. If you used the same safety alignment dataset and same hyperparameters on small and big model, small one will come out significantly more censored

small haven
zinc ore
#

Imo last one is best

keen beacon
late path
#

I also think the last one is better. The neck on the top right one is well done

zinc ore
#

Last one is clean and seamless between the different parts

ocean vortex
keen beacon
#

a higher learning rate might be better as it avoids getting stuck in local minima as well. all in all its very complicated

small haven
keen beacon
late path
#

overall kingfall has more details and seems to have more variety

small haven
#

wait auto thinking budget svg incomin

whole wagon
#

LLM training is as complex as it gets

#

Well. For the runs these days

keen beacon
#

hyperparams you often need to do a sweep for an optimal configuration and depending on the criteria it can be extremely complex

ocean vortex
# keen beacon it is very complicated if youve worked on this stuff

I did do training quite a bit. The fact alone that you need more powerful setup / more compute to even train a bigger model should tell you that what I'm saying is true. Big model needs much more work to overfit anything. And if you are using same compute and same dataset and same/comparable learning rates, small model in most normal cases would overfit faster...

whole wagon
#

There are different types of overfit

small haven
#

dark mouth, auto thinking

whole wagon
zinc ore
whole wagon
#

Is there a new model or smth

#

Why all the svgs

ocean vortex
#

with hyperparameters staying roughly the same and same dataset you are essentially fixing the amount of work that is going to be done as a constant. Same amount of steps

keen beacon
ocean vortex
whole wagon
#

Say you have a list of 100 inputs with corresponding outputs, and you train an 8 neuron net or a 10k neuron net. The 10k neuron net will overfit. This is a type of overfit that generally occurs later in a run
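
A toy sketch of the memorization side of this argument, not of real LLM training. It hand-wires a one-hidden-layer ReLU net (no gradient descent, which is a simplification) through 100 noisy points: with about as many units as points it reproduces the noise exactly, while an 8-unit net built the same way cannot. The data, noise level, and helper names are all made up for illustration.

```python
import math
import random

def relu(z):
    return z if z > 0.0 else 0.0

def build_interpolating_net(xs, ys):
    """Wire f(x) = bias + sum(c * relu(x - knot)) through sorted, distinct points.
    Uses len(xs) - 1 hidden units; returns (bias, knots, coeffs)."""
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    # Each unit contributes the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, len(slopes))]
    return ys[0], xs[:-1], coeffs

def predict(net, x):
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))

def mse(net, xs, ys):
    return sum((predict(net, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def clean(x):
    return math.sin(2 * math.pi * x)

random.seed(0)
n = 100
train_x = [i / (n - 1) for i in range(n)]
train_y = [clean(x) + random.gauss(0.0, 0.5) for x in train_x]  # noisy labels

wide = build_interpolating_net(train_x, train_y)  # 99 hidden units
idx = [round(j * (n - 1) / 8) for j in range(9)]  # 9 anchor points -> 8 units
narrow = build_interpolating_net([train_x[i] for i in idx], [train_y[i] for i in idx])

# Held-out points halfway between training inputs, scored against the clean curve.
held_x = [(train_x[i] + train_x[i + 1]) / 2 for i in range(n - 1)]
held_y = [clean(x) for x in held_x]

wide_train = mse(wide, train_x, train_y)
narrow_train = mse(narrow, train_x, train_y)
wide_held = mse(wide, held_x, held_y)
print(f"wide train={wide_train:.2e} held-out={wide_held:.3f} | narrow train={narrow_train:.3f}")
```

The wide net's near-zero training error next to a clearly nonzero held-out error is the "memorized the noise" signature; it says nothing about how quickly a gradient-trained model of either size would get there, which is the part being disputed above.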

ocean vortex
#

bigger model tends to inherently generalize better. And if you have a certain dataset meant for a smaller model... for a bigger model you would usually need notably more. To reach a significant enough change, let alone overfit it... Obviously I'm excluding edge cases like different or unreasonably high learning rates and shitton of epochs, but you shouldn't do that to begin with.

jade egret
#

here's an alien cat

small haven
#

not feeling dark mouth

zinc ore
#

That one is pretty decent imo

#

Although kinda weird IG like I'm seeing inside the head without a face

keen beacon
#

honestly this could be a 2.5 pro revision

#

ive barely done any tests tho

civic flame
#

i don't see why they'd replace the kingfall AB test, if it was ultra, with a pro AB test

small haven
zinc ore
#

I think so too

late path
#

these two models are both drawing rounder heads than kingfall lmao

civic flame
#

i don't think svgs are the be all end all of model capability tbh

small haven
civic flame
#

i haven't done enough tests but in the ones i have done it has been equal or slightly better vs kingfall

whole wagon
#

I thought kingfall was the final checkpoint lmao

civic flame
#

no ☠️

keen beacon
#

nah

#

no way

civic flame
#

they're moving quickly with this though

whole wagon
#

Well if they did continue it. I guess diminishing returns is expected so maybe they are just really close

civic flame
#

given they're AB testing in Studio I would say it's in the later stage of development yeah

small haven
#

is this correct?

#

luxury sports car problem

civic flame
#

no

#

it's <1KM

small haven
#

f's

civic flame
#

lol when a model gets this right can we call it AGI

#

im gonna be dead when it happens 😭😭

small haven
whole wagon
#

I feel like its just missing some contextual knowledge for this ngl

#

Like it doesn't really understand the world

elder rapids
#

I hate when models do this

civic flame
#

victim of its own autistic overthinking 🥀

leaden sun