#general | Arena | Page 48

drifting thorn May 28, 2025, 2:03 PM

#

Where’s o3 pro

keen beacon May 28, 2025, 2:03 PM

#

nah its 2.5 pro

fleet lintel May 28, 2025, 2:04 PM

#

drifting thorn Where’s o3 pro

I am waiting for it.. and quite excited about it

#

and there goes the credibility again

keen beacon May 28, 2025, 2:05 PM

#

the reasoning variant that we use isn't, but the base model they train it off of is what im talking about

fleet lintel May 28, 2025, 2:06 PM

#

what does base model mean? Like after pre-training?

keen beacon May 28, 2025, 2:06 PM

#

fleet lintel what does base model mean? Like after pre-training?

yes

#

pretrained base model

fleet lintel May 28, 2025, 2:06 PM

#

keen beacon yes

oh.. then 2.5 gemini is not great. what i heard is that gemini 2.0 and 2.5 have the same pretrained model

keen beacon May 28, 2025, 2:07 PM

#

fleet lintel oh.. then 2.5 gemini is not great. what i heard is that gemini 2.0 and 2.5 have ...

2.5 pro has continued pretraining

fleet lintel May 28, 2025, 2:07 PM

#

I never said that I had credibility. I am saying that you are losing it

keen beacon May 28, 2025, 2:07 PM

#

2.5 pro is based off 2.0 pro, i believe

#

they moved onto 2.5 pro before releasing a stable version of it

#

i think thats ur consensus lol

drifting thorn May 28, 2025, 2:11 PM

#

2.5 pro is worse in practice than sonnet 3.7, sonnet 4, opus 4????

#

What are you saying

calm sequoia May 28, 2025, 2:21 PM

#

Such big models may be the new normal in a couple of years, when compute power doubles

#

It's surprising that 4o is so good. Given that it's optimized for pricing, for chatting, images, etc.

#

Usually you lose intelligence when you add extra functionality

keen beacon May 28, 2025, 2:26 PM

#

pretraining wise i think so (not a fresh pretrain), but additionally architectural changes couldve been made but we cant tell at all without insider info

#

no i think u can make a reasonable guess here

#

theres enough information out there i think for that

alpine coral May 28, 2025, 2:33 PM

#

yeah though that would make it basically beyond doubt - it's also possible (i think) for a fresh pre-train using the same modality/ies

#

but i agree with the premise (fs the most obvious / telltale sign)

#

yeah fair

#

though there's obviously a ceiling to this analysis (like there's only so many modalities to add.. after text, vision, sound, image/vid gen)

#

but i'm just arguing for the sake of arguing ha

#

i agree with your central point - it makes good sense

keen beacon May 28, 2025, 2:37 PM

#

same base model, 4.1 is instruct tuned (then a little specialized towards coding, unlike chatgpt 4o which is specialized for chatting). i call it 4.1 because if you wanna call it when it debuted in the public api (chatgpt 4o latest some variant) its gonna get confusing

alpine coral May 28, 2025, 2:38 PM

#

keen beacon same base model, 4.1 is instruct tuned (then a little specialized towards coding...

i feel like 4o is also specialised for tool usage (at least python execution) - it does a really good job for run of the mill tasks

keen beacon May 28, 2025, 2:38 PM

#

even chatgpt 4o is not really the base model, its another instruct version

alpine coral May 28, 2025, 2:38 PM

#

though i haven't tried 4.1 for code execution / data stuff - perhaps it's equally as solid

keen beacon May 28, 2025, 2:38 PM

#

but it was the first instance of the new base model being released on the api, and we dont really have a good name for it. 4.1 seems appropriate

keen beacon May 28, 2025, 2:38 PM

#

alpine coral i feel like 4o is also specialised for tool usage (at least python execution) -...

yeah but i think 4.1 is specialized more for like aider/etc

#

(i think this was the first instance i saw it on chatgpt, the new base model if i remember correctly. 4o has a cut off of oct 2023)

alpine coral May 28, 2025, 2:41 PM

#

yeah fwiw i feel like 4.1 is a distillation of 4.5. And then 4o is like a similar distillation but then fine tuned for chat and tool usage (though perhaps it's a distillation of 4.1.. i dunno ha ha)

alpine coral May 28, 2025, 2:43 PM

#

keen beacon

oh.. the native multimodality... ha yeah that throws my analysis to the bin

#

it's a standalone model then? or distilled from something else?

keen beacon May 28, 2025, 2:44 PM

#

i dont think whether it was distilled or not even matters

#

it really really depends on how its distilled, i believe

alpine coral May 28, 2025, 2:45 PM

#

they should be - like they're definitionally 'distilled' in the sense it's the 'student' of a larger 'teacher' model
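
For readers unfamiliar with the 'student'/'teacher' framing above, here is a minimal sketch of what a distillation objective can look like: the student is trained to match the teacher's temperature-softened output distribution. This is a generic textbook-style formulation, not a description of how any model discussed here was actually trained; the function names, the toy logits, and the temperature of 2.0 are all assumptions for illustration, and the common T² scaling factor is left out for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper for illustration only: a lower value means the
    student's next-token distribution is closer to the teacher's.
    """
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 3-token vocabulary (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 4.0, 1.0]
print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose logits rank tokens like the teacher's gets a small loss; one that disagrees gets a larger loss.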

keen beacon May 28, 2025, 2:45 PM

#

keen beacon

though interestingly enough, i think this was the first time (right after the screenshot) i asked it about the london mayoral elections, and it didn't get it. the next chatgpt 4o latest rev in january did. im not sure if its tuning or if the cpt was incomplete (potentially doing temporary instruct tunes on snapshots) (i didnt really force it at the time to know if the model actually knew the info) (if it was incomplete, one could probably 'definitively' narrow down the new o3 timeline) (and also yes, I tested its knowledge outside just it knowing 4o back then)

torn mantle May 28, 2025, 2:48 PM

#

Bring back old r1 😠

alpine coral May 28, 2025, 2:50 PM

#

i've gotten redsword and goldmane a few more times just now.. updated the spreadsheet

#

in two of the question sets, redsword gets perfect scores

#

(in the second one, leo's 'private' model / ~~o4-mini~~ (i think) o3 pre-release also got a perfect 9; though that is filtered out in the ss)

#

here (20 questions/tasks, given across two prompts), goldmane sets a new all time high with 18

keen beacon May 28, 2025, 2:53 PM

#

alpine coral (in the second one, leo's 'private' model / ~~o4-mini~~ (i think) o3 pre-releas...

it was o3

alpine coral May 28, 2025, 2:53 PM

#

ahh thanks (that actually makes more sense tbh.. not sure why i keep misremembering this.. like the tenth time ha)

keen beacon May 28, 2025, 2:54 PM

#

alpine coral ahh thanks (that actually makes more sense tbh.. not sure why i keep misremember...

Probably because we speculated it was o4 mini at some point

torn mantle May 28, 2025, 2:55 PM

#

So goldmane > redsword

alpine coral May 28, 2025, 2:55 PM

#

keen beacon Probably because we speculated it was o4 mini at some point

yeah i know - it's the fact i can't remember on which it was settled (o3) that's annoying me ahah

alpine coral May 28, 2025, 2:56 PM

#

torn mantle So goldmane > redsword

yeah that's my sense

#

sorry actually

#

other way round

#

redsword seems stronger overall

calm sequoia May 28, 2025, 2:58 PM

#

DeepThink?

keen beacon May 28, 2025, 2:58 PM

#

calm sequoia DeepThink?

No I don't think so

calm sequoia May 28, 2025, 2:58 PM

#

Hmm then it's something new

keen beacon May 28, 2025, 2:59 PM

#

They're both 2.5 pro imo

alpine coral May 28, 2025, 2:59 PM

#

alpine coral redsword seems stronger overall

(also, prob nothing / just randomness, but across all 3, redsword performs worse on Arena legacy than beta - i wonder if there's temperature setting or system prompt at play)

alpine coral May 28, 2025, 2:59 PM

#

keen beacon They're both 2.5 pro imo

yeah agree

#

aside from the date (early 2023 vs June 2024), which i attribute to some likely differing aspects of the system prompts used, redsword gives a near identical response to 2.5 pro on studio

#

(calmriver's response also broadly resembles that of 2.5-flash)

sturdy mica May 28, 2025, 3:07 PM

#

alpine coral aside from the date (early 2023 vs June 2024), which i attribute to some likely...

whats that prompt

#

ur using in lmarena

alpine coral May 28, 2025, 3:09 PM

#

as per the screenshots....

    Are you a large language model, trained by Google? Also, what is your knowledge cut-off date?

sturdy mica May 28, 2025, 3:10 PM

#

no like

#

below

#

its cut off

#

it says quiz

#

why cant you say quiz time?

#

quiz time

#

oh you can

sturdy mica May 28, 2025, 3:12 PM

#

alpine coral as per the screenshots.... > Are you a large language model, trained by ...

'

alpine coral May 28, 2025, 3:13 PM

#

are you asking for the quiz itself?

#

📎 april_arenaQuiz1.txt

calm sequoia May 28, 2025, 3:35 PM

#

The Claude models are quite nice though

elder rapids May 28, 2025, 3:36 PM

#

alpine coral (calmriver's response also broadly resembles that of 2.5-flash)

calmriver is 2.5 flash

elder rapids May 28, 2025, 3:36 PM

#

alpine coral redsword seems stronger overall

not true tbh

#

redsword doesn't seem to be very smart besides the fact it seems to do well in narrow problems

#

and hard tasks, like pure math

alpine coral May 28, 2025, 3:37 PM

#

elder rapids calmriver is 2.5 flash

yeah im annoyed i didn't use flash-05-20 in that ss (would prob have had an even more similar response to calmriver's)

elder rapids May 28, 2025, 3:37 PM

#

goldmane is crazy smart

alpine coral May 28, 2025, 3:38 PM

#

elder rapids not true tbh

fair enough 👍

#

i'm just going by the quizzes i've given

elder rapids May 28, 2025, 3:38 PM

#

ye

#

it's alr as a metric gives a good sense

alpine coral May 28, 2025, 3:38 PM

#

elder rapids ye

i haven't interacted with either of them 'naturally'

elder rapids May 28, 2025, 3:38 PM

#

alpine coral i haven't interacted with either of them 'naturally'

should tbh, they're the best models out there

#

ye

alpine coral May 28, 2025, 3:39 PM

#

yeah the two google models (new 2.5 pro iterations/checkpoints - seemingly) are basically at the top of the 3 question sets

elder rapids May 28, 2025, 3:39 PM

#

ye

alpine coral May 28, 2025, 3:39 PM

#

yeah

elder rapids May 28, 2025, 3:39 PM

#

very interested

keen beacon May 28, 2025, 3:40 PM

#

one of them will be ga 2.5 pro

#

i think

alpine coral May 28, 2025, 3:40 PM

#

tbh i feel like there's always a degradation

elder rapids May 28, 2025, 3:40 PM

#

claybrook wasn't very good pre-release though

#

this was known

#

and it's maintained that sentiment

alpine coral May 28, 2025, 3:40 PM

#

it's safety

#

and corporate stuff

elder rapids May 28, 2025, 3:40 PM

#

alpine coral tbh i feel like there's always a degradation

overblown

keen beacon May 28, 2025, 3:40 PM

#

alpine coral it's safety

nah its too late to do that imho (for this release)

elder rapids May 28, 2025, 3:41 PM

#

with the data scientists Google has

#

I'm certain they're working very very hard

#

on NOT degrading the models by poisoning the well and having to tune it because of that

alpine coral May 28, 2025, 3:42 PM

#

keen beacon nah its too late to do that imho (for this release)

i dunno i just feel that like when we go from nebula, to 2.5-exp/prev, it goes downhill (slightly but nevertheless perceptibly)

#

same with other 'pre release' versions

#

i dunno what they do - but it doesn't make them smarter

keen beacon May 28, 2025, 3:42 PM

#

alpine coral i dunno i just feel that like when we go from nebula, to 2.5-exp/prev, it goes d...

for this specific release i doubt it

alpine coral May 28, 2025, 3:43 PM

#

keen beacon for this specific release i doubt it

yeah k fair - i'm taking a very generalised approach / making sweeping statement ha

#

thought you said deepseek...

#

not sure

#

lol yeah fair that just creates confusion now

#

i'm not sure one is 'much' better; they seem almost comparable though i give the edge to redsword

#

but that conflicts with what others say

#

so ig they're basically the same l

keen beacon May 28, 2025, 3:48 PM

#

its like o1 pro and o3 pro

#

true but i bet o3 pro actually releases

drifting thorn May 28, 2025, 3:55 PM

#

who is redsword?

alpine coral May 28, 2025, 3:56 PM

#

fwiw i feel like the opposite is happening.. i asked [this here a month ago](#general message) and responded openai

#

i was kinda on the fence.. i think if i responded today i'd be jumping on the google bandwagon tbh ha

#

yeah i've honestly been waiting for the guy to update it with claude 4..

drifting thorn May 28, 2025, 3:58 PM

#

I just talked to a group of LLM users in mainland China

keen beacon May 28, 2025, 3:58 PM

#

alpine coral fwiw i feel like the opposite is happening.. i asked [this here a month ago](htt...

model wise, id be fine using google's. but product wise google's suck

alpine coral May 28, 2025, 3:58 PM

#

he said in his vid a week ago that opus was going to top it, and the update was effectively imminent

drifting thorn May 28, 2025, 3:58 PM

#

They were miserable that the new R1 has a completely different personality from the old one

keen beacon May 28, 2025, 3:58 PM

#

o3 is almost as good and the interface is good

alpine coral May 28, 2025, 3:59 PM

#

keen beacon model wise, id be fine using google's. but product wise google's suck

yeah but it's also a bet on the future ig.. like rest of your life (i agree though ha)

drifting thorn May 28, 2025, 3:59 PM

#

In a game, Opus shows very deep reasoning and leads the game

alpine coral May 28, 2025, 3:59 PM

#

oh man for sure

#

like 4o is my go-to for most things

#

i don't need reasoning to transcribe some screenshot

keen beacon May 28, 2025, 4:00 PM

#

im astounded how good non reasoning models are tbh. immediately making a reply (and having to follow all of these rules they tuned in/implicit stuff/etc) is very difficult

drifting thorn May 28, 2025, 4:00 PM

#

drifting thorn In a game, Opus shows very deep reasoning and leads the game

in the Chinese version of Werewolves

civic flame May 28, 2025, 4:00 PM

#

hi guys

civic flame May 28, 2025, 4:00 PM

#

alpine coral like 4o is my go-to for most things

currently my go-to non-reasoning model is opus 4

drifting thorn May 28, 2025, 4:01 PM

#

I've heard that the "test time compute" rn is "self-prompting" instead of actual reasoning

keen beacon May 28, 2025, 4:01 PM

#

im talking about all non reasoning models. making a reply like that is very difficult, if you really analyze the task, without actually 'reasoning'

civic flame May 28, 2025, 4:01 PM

#

go-to non-reasoner: opus 4
general use reasoner: o3 (high)
coding reasoner: opus 4
maths & image reasoner: 2.5 pro

#

that's my current set

calm sequoia May 28, 2025, 4:01 PM

#

drifting thorn May 28, 2025, 4:02 PM

#

And you can't really "measure" the intelligence of deep text reasoning

civic flame May 28, 2025, 4:02 PM

#

what is it with you and bashing gemini lol

#

2.5 pro is a good model, as is the updated one, although i did notice some small degradation in certain areas

#

elaborate

drifting thorn May 28, 2025, 4:02 PM

#

calm sequoia

expected, claude models are known to be good at coding

calm sequoia May 28, 2025, 4:03 PM

#

Craig is writing a lot of russian propaganda. Gemini is bad for it 🙂

civic flame May 28, 2025, 4:03 PM

#

i beg to differ

drifting thorn May 28, 2025, 4:03 PM

#

civic flame 2.5 pro is a good model, as is the updated one, although i did notice some small...

maybe it's because of cost reduction

civic flame May 28, 2025, 4:03 PM

#

my current code rankings are -

opus 4
2.5 pro
sonnet 4
o3
grok 3

drifting thorn May 28, 2025, 4:04 PM

#

maybe the new method to calculate 4x4 matrix proposed by AlphaEvolve can bring us back 0325

civic flame May 28, 2025, 4:05 PM

#

civic flame my current code rankings are - 1. opus 4 2. 2.5 pro 3. sonnet 4 4. o3 5. grok 3

there is a very real possibility that the release checkpoint of 2.5 pro jumps to 1st

drifting thorn May 28, 2025, 4:05 PM

#

again, I use it for some "explain a research paper" and creative writing tasks, which differs from most of you.

civic flame May 28, 2025, 4:05 PM

#

because i have done quite a bit of testing with opus 4 vs redsword and opus 4 just... tries to do way too much and it gets caught up in its own eagerness

calm sequoia May 28, 2025, 4:05 PM

#

civic flame my current code rankings are - 1. opus 4 2. 2.5 pro 3. sonnet 4 4. o3 5. grok 3

No o4-mini?

civic flame May 28, 2025, 4:05 PM

#

i don't use it enough

#

honestly used it like a single digit number of times

alpine coral May 28, 2025, 4:06 PM

#

civic flame because i have done quite a bit of testing with opus 4 vs redsword and opus 4 ju...

the new checkpoints are solid af right

civic flame May 28, 2025, 4:06 PM

#

they are

#

they don't seem to suffer from the "we made it better at coding but at the cost of almost everything else" issue that the last checkpoint suffered from

#

which i am glad for

drifting thorn May 28, 2025, 4:06 PM

#

what is the provider of redsword?

civic flame May 28, 2025, 4:06 PM

#

i've also settled on redsword > goldmane

civic flame May 28, 2025, 4:07 PM

#

drifting thorn what is the provider of redsword?

deepmind

drifting thorn May 28, 2025, 4:07 PM

#

oh

calm sequoia May 28, 2025, 4:07 PM

#

I use o4-mini when I need in thought tool calling. E.g. working with csv files, comparing multiple codes, etc.

civic flame May 28, 2025, 4:07 PM

#

that's one thing o3/o4 mini do noticeably better than any other model

#

primarily because openai started prioritising that as a feature before anybody else

#

i think it's cool but it's not a game changer for me yet

calm sequoia May 28, 2025, 4:09 PM

#

Yeah it really depends on use case. Today Gemini failed at something o4-mini could do. But it was more like a search algorithm for code (trying different approaches) and not logic.

civic flame May 28, 2025, 4:09 PM

#

o3 is sorta odd with yapping leevl

#

level

#

because for code? it's one of the least yap-y models

#

but for general use? it yaps really quite a lot

#

don't get me wrong i like it mostly because it gives me good insights into things other models miss

#

yes

#

opus 4 is as strong if not stronger of a base model than grok 3

drifting thorn May 28, 2025, 4:12 PM

#

civic flame but for general use? it yaps really quite a lot

it should yap for some explanation/EQ-related tasks

#

writing, maybe?

#

Maths, nope

civic flame May 28, 2025, 4:12 PM

#

i find it yaps the most for knowledge related stuff

#

writing is on the longer side too but it's not way above average

civic flame May 28, 2025, 4:13 PM

#

civic flame don't get me wrong i like it mostly because it gives me good insights into thing...

opposite problem with opus 4... strong base but it's really scared to be in depth

#

it always gives a really surface level overview unless you pressure it

sour spindle May 28, 2025, 4:13 PM

#

Found it

brittle tiger May 28, 2025, 4:16 PM

#

Is current assumption that redsword and goldmane are 2.5 pro and flash GA versions?

keen beacon May 28, 2025, 4:16 PM

#

theyre both 2.5 pro imo

civic flame May 28, 2025, 4:17 PM

#

keen beacon theyre both 2.5 pro imo

i remain confused about what the real difference is between them

alpine coral May 28, 2025, 4:18 PM

#

i'm not really sure either - they seem highly likely to both be checkpoints of 2.5 pro imo too

keen beacon May 28, 2025, 4:19 PM

#

one is probably slightly newer than the other

alpine coral May 28, 2025, 4:20 PM

#

yeah

keen beacon May 28, 2025, 4:20 PM

#

its both

alpine coral May 28, 2025, 4:20 PM

#

yeah continued pre-training changes that

civic flame May 28, 2025, 4:21 PM

#

which do we think is the newer model in that case

keen beacon May 28, 2025, 4:21 PM

#

goldmane

#

the answer to this q was in this discord if ppl paid attention tbh

#

ppl pay zero attention

civic flame May 28, 2025, 4:21 PM

#

lol huh

#

i just don't check here as often as i used to as my life has been busy 😭

keen beacon May 28, 2025, 4:22 PM

#

civic flame i just don't check here as often as i used to as my life has been busy 😭

dont blame u but other people have seen it too

#

and we keep discussing and arguing the same things over and over lmao

alpine coral May 28, 2025, 4:22 PM

#

i mean tbh i still maintain redsword is stronger..

civic flame May 28, 2025, 4:22 PM

#

same

keen beacon May 28, 2025, 4:22 PM

#

alpine coral i mean tbh i still maintain redsword is stronger..

thats why theyre testing both

civic flame May 28, 2025, 4:22 PM

#

i have tested both in writing tasks

#

goldmane feels considerably more robotic

#

never said i was judging deepmind

alpine coral May 28, 2025, 4:23 PM

#

that's literally the whole point of this server isn't it lol

keen beacon May 28, 2025, 4:24 PM

#

why did u join this server otherwise

civic flame May 28, 2025, 4:24 PM

#

slightly off topic but i've been testing 4o image gen and imagen 4 ultra

keen beacon May 28, 2025, 4:24 PM

#

bro

civic flame May 28, 2025, 4:24 PM

#

and the more i do the more annoyed i get with 4o image gen because

alpine coral May 28, 2025, 4:24 PM

#

i've never said google is the goat or anything - i'm open to whatever...

civic flame May 28, 2025, 4:24 PM

#

that yellow tint is FRUSTRATING

keen beacon May 28, 2025, 4:24 PM

#

it is

civic flame May 28, 2025, 4:24 PM

#

it is

keen beacon May 28, 2025, 4:24 PM

#

yeah because theyre working on it on a specific 4o version

#

it is 4.1 though i think

civic flame May 28, 2025, 4:25 PM

#

didn't sam say image gen v2 was releasing "soon" like over a month ago

#

still waiting

keen beacon May 28, 2025, 4:25 PM

#

i havent checked too thoroughly as there could be additional post processing but im inclined to believe its on the 4.1 base model

civic flame May 28, 2025, 4:28 PM

#

lol so imagen-exp was imagen-4-ultra

soft fog May 28, 2025, 4:35 PM

#

Hi guys looks like there is a specific site for WebDev Arena, is there a website for other leaderboards as well?

#

https://web.lmarena.ai/leaderboard

#

oh okay thanks

wintry tinsel May 28, 2025, 4:53 PM

#

civic flame lol so imagen-exp was imagen-4-ultra

What are the different Imagen 4 models, is ultra available to use now?

#

Is it different from best quality on whisk

civic flame May 28, 2025, 5:07 PM

#

i believe whisk uses regular imagen 4

#

ultra is on vertex

wintry tinsel May 28, 2025, 5:07 PM

#

Will ultra ever be added to whisk or image studio

civic flame May 28, 2025, 5:07 PM

#

also iirc with imagen 3 it refused to generate images of well known figures.. now not only does it let you but it does it well? "A photo of Barack Obama and Rishi Sunak shaking hands"

civic flame May 28, 2025, 5:07 PM

#

wintry tinsel Will ultra ever be added to whisk or image studio

no idea

wintry tinsel May 28, 2025, 5:08 PM

#

Vertex kinda sucks to use but ok

sonic tendon May 28, 2025, 5:27 PM

#

civic flame no idea

ayo, he's back

#

hihiii

echo aurora May 28, 2025, 5:34 PM

#

I'm going to be vibing in #1340554757827461215 for the next hour, anyone is welcome to join!

civic flame May 28, 2025, 5:42 PM

#

sonic tendon hihiii

hi!

small haven May 28, 2025, 5:46 PM

#

alpine coral here (20 questions/tasks, given across two prompts), goldmane sets a new all tim...

this is cool, but putting o1 high 1-2 std higher than o3 feels very wrong, what r u even measuring

#

bloat code << surgical code

elder rapids May 28, 2025, 5:51 PM

#

calm sequoia

ye goldmane/redsword are going to cook opus

elder rapids May 28, 2025, 5:53 PM

#

civic flame goldmane feels considerably more robotic

deadass the opposite imo

#

but they're both simply not robotic

#

goldmane just seems considerably smarter

civic flame May 28, 2025, 5:54 PM

#

👎

elder rapids May 28, 2025, 5:56 PM

#

the only thing is that redsword is just better at hard tasks

#

btw in webdev

#

goldmane has done everything much much much better

#

as far as I've gotten them

#

redsword seems to code very well, but the ideas to accomplish how it looks

#

aren't nearly as polished

sonic tendon May 28, 2025, 5:58 PM

#

redsword seems like a good general chatbot, too

elder rapids May 28, 2025, 5:58 PM

#

still can't believe goldmane did that for one of my questions lmao

sonic tendon May 28, 2025, 5:58 PM

#

echo aurora I'm going to be vibing in <#1340554757827461215> for the next hour, anyone is we...

would be down, but I'm out rn unfortunately

elder rapids May 28, 2025, 5:58 PM

#

built a whole imitate chat

#

to explain what it needs to respond in webdev

#

for the inquiry itself

cedar tide May 28, 2025, 5:59 PM

#

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

echo aurora May 28, 2025, 5:59 PM

#

sonic tendon would be down, but I'm out rn unfortunately

sounds good, there will be plenty of times

elder rapids May 28, 2025, 5:59 PM

#

cedar tide https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

it's not much better

#

already tested

#

just seems slightly better at hard tasks

#

still no good

keen beacon May 28, 2025, 6:01 PM

#

cedar tide https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

they're still uploading it 😂

sonic tendon May 28, 2025, 6:02 PM

#

elder rapids already tested

where'd you get it lol

elder rapids May 28, 2025, 6:02 PM

#

lmao they updated all access routes to deepseek

#

it's been hours

#

this is new but it's not as new as its looking on huggingface

cedar tide May 28, 2025, 6:04 PM

#

elder rapids lmao they updated all access routes to deepseek

Access routes = api ?

#

They said the API had not been updated

cedar tide May 28, 2025, 6:05 PM

#

sonic tendon where'd you get it lol

Its available on their chatbot website

sonic tendon May 28, 2025, 6:05 PM

#

cedar tide Its available on their chatbot website

ah

keen beacon May 28, 2025, 6:06 PM

#

its crazy how slow they are nowadays

#

i remember it being so much faster

small haven May 28, 2025, 6:07 PM

#

jack ma effect

elder rapids May 28, 2025, 6:07 PM

#

cedar tide Access routes = api ?

no

#

oh it says routes

#

I didn't mean to say that

small haven May 28, 2025, 6:08 PM

#

so is chat.deepseek updated to the new 05-28?

elder rapids May 28, 2025, 6:10 PM

#

cedar tide They said the API had not been updated

ye, only the app iirc

cedar tide May 28, 2025, 6:13 PM

#

small haven so is chat.deepseek updated to the new 05-28?

If you read my messages you would have known.

small haven May 28, 2025, 6:13 PM

#

cedar tide If you read my messages you would have known.

oh cool srry i missed u

vernal meadow May 28, 2025, 6:16 PM

#

@alpine coral aligns with my test results. Goldmane and Redsword are really good models. They won most battles easy. Can't wait for Google to release them.
Opus is too expensive

elder rapids May 28, 2025, 6:16 PM

#

what's your test

small haven May 28, 2025, 6:16 PM

#

initial 0528 vibes feels meh, thats why its not r2 lol

elder rapids May 28, 2025, 6:16 PM

#

small haven initial 0528 vibes feels meh, thats why its not r2 lol

there's no difference

small haven May 28, 2025, 6:17 PM

#

elder rapids there's no difference

for me its just a tad bit smarter and a bit verbose

cedar tide May 28, 2025, 6:17 PM

#

vernal meadow @alpine coral aligns with my test results. Goldmane and Redsword are re...

Where 2.5 pro ?

#

New R1 Discord clone

#

Vs old

#

@small haven @elder rapids
https://reddit.com/r/LocalLLaMA/comments/1kxmgtr/deepseekr10528_vs_claude4sonnet_still_a_demo holy whale

From the LocalLLaMA community on Reddit: DeepSeek-R1-0528 VS claude...

keen beacon May 28, 2025, 6:19 PM

#

i think i like qwen 3 better, at least on the specific tasks im testing them on (which isnt representative of real usage for most people)

elder rapids May 28, 2025, 6:20 PM

#

cedar tide @small haven @elder rapids https://reddit.com/r/LocalLLaMA/com...

this isn't any better

fleet lintel May 28, 2025, 6:20 PM

#

vernal meadow @alpine coral aligns with my test results. Goldmane and Redsword are re...

what is this table? and who is number 1?

#

and anyone has insider info on how good or bad is "deep think" ?

torn mantle May 28, 2025, 6:21 PM

#

elder rapids this isn't any better

huh

cedar tide May 28, 2025, 6:21 PM

#

elder rapids this isn't any better

you make 50% of that with old R1 I'll send you $100

torn mantle May 28, 2025, 6:21 PM

#

That wasn't the point

elder rapids May 28, 2025, 6:21 PM

#

cedar tide you make 50% of that with old R1 I'll send you $100

not what I said

cedar tide May 28, 2025, 6:21 PM

#

elder rapids not what I said

So what did you say?

elder rapids May 28, 2025, 6:21 PM

#

isn't any better than the one on the left

cedar tide May 28, 2025, 6:22 PM

#

elder rapids isn't any better than the one on the left

ok I don't care about that

elder rapids May 28, 2025, 6:22 PM

#

cedar tide ok I don't care about that

"holy whale"?

#

idk wym

#

it's not a surprising result and I'd kind of expect that performance

cedar tide May 28, 2025, 6:23 PM

#

@elder rapids It's not me who wrote "holy whale", I just wanted to share what he coded with the new r1

keen ferry May 28, 2025, 6:23 PM

#

when did deepseek learn to make diagrams

cedar tide May 28, 2025, 6:23 PM

#

elder rapids it's not a surprising result and I'd kind of expect that performance

So it's an evolution from r1

elder rapids May 28, 2025, 6:23 PM

#

cedar tide So it's an evolution from r1

ye

cedar tide May 28, 2025, 6:23 PM

#

cedar tide @small haven @elder rapids https://reddit.com/r/LocalLLaMA/com...

Claude 4 sonnet open source

elder rapids May 28, 2025, 6:24 PM

#

we'll see how good it is at other coding tasks

keen beacon May 28, 2025, 6:24 PM

#

they should upload the readme first before the weights 🤣

elder rapids May 28, 2025, 6:24 PM

#

but as far as I can tell it isn't an all around leap in coding

solar nebula May 28, 2025, 6:25 PM

#

feels better at coding for me, ill wait for benchmarks

vernal meadow May 28, 2025, 6:26 PM

#

@cedar tide It's at 10 for me rn but only 17 battles. Lost some battles like TikZ drawing, tic-tac-toe playing and some joke understanding. Sample size is always a factor. Personal ranking is just directional.
In head to head Pro is of course better than the flash models.
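
To put a rough number on the sample-size caveat above: with only 17 battles, the plausible range for a model's true win rate is wide. A minimal sketch using the standard Wilson score interval follows; the tally of 12 wins out of 17 is a hypothetical figure chosen for illustration (the chat does not give the actual win count), and `wilson_interval` is a helper name made up here.

```python
import math

def wilson_interval(wins: int, battles: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 is roughly 95% confidence)."""
    if battles <= 0:
        raise ValueError("battles must be positive")
    p = wins / battles
    denom = 1 + z * z / battles
    centre = (p + z * z / (2 * battles)) / denom
    half = z * math.sqrt(p * (1 - p) / battles + z * z / (4 * battles * battles)) / denom
    return centre - half, centre + half

# Hypothetical tally: 12 wins in 17 battles (not a figure from the chat).
low, high = wilson_interval(12, 17)
print(f"observed win rate {12 / 17:.2f}, interval [{low:.2f}, {high:.2f}]")

# Same observed ratio with 10x the battles gives a much tighter interval.
low_big, high_big = wilson_interval(120, 170)
print(f"with 170 battles: [{low_big:.2f}, {high_big:.2f}]")
```

With so few battles the interval spans a large part of the 0-1 range, which is why a personal ranking at this sample size is best read as directional only.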

fleet lintel May 28, 2025, 6:26 PM

#

does LMArena share the user prompts with companies?

cedar tide May 28, 2025, 6:29 PM

#

vernal meadow @cedar tide It's at 10 for me rn but only 17 battles. Lost some battles...

Thx

tall summit May 28, 2025, 6:32 PM

#

fleet lintel does LMArena share the user prompts with companies?

depends which

#

they technically do, the ai companies

small haven May 28, 2025, 6:33 PM

#

fleet lintel does LMArena share the user prompts with companies?

they obviously do private eval, esp. when google rl's against lmarena

#

they raised $100m for a reason lol

fleet lintel May 28, 2025, 6:36 PM

#

small haven they raised $100m for a reason lol

true 🙂

torn mantle May 28, 2025, 6:38 PM

#

one thing i noticed about this new r1 is that it uses arrows like o3

#

cot is more straightforward(?)

#

it will call you out if it sees something wrong

#

unlike the old one as it was always positive

elder rapids May 28, 2025, 6:43 PM

#

torn mantle + cot is more straightforward(?)

it seems to be straightforward more often, but it's not "straight forward"

#

as in it speaks in a different way

small haven May 28, 2025, 6:43 PM

#

torn mantle one thing i noticed about this new r1 is that it uses arrows like o3

not surprised theyre training on o3 outputs 😭

elder rapids May 28, 2025, 6:43 PM

#

o3 too expensive and limited

zinc ore May 28, 2025, 6:43 PM

#

Have any evals dropped for the new Deepseek?

elder rapids May 28, 2025, 6:44 PM

#

it just released lmao

zinc ore May 28, 2025, 6:44 PM

#

Sometimes they're quick with it

fleet lintel May 28, 2025, 6:44 PM

#

small haven not surprised theyre training on o3 outputs 😭

i think (i dont have proof) that they heavily train on openai and google models

small haven May 28, 2025, 6:45 PM

#

fleet lintel i think (i dont have proof) that they heavily train on openai and google model...

ya its not a theory, its factual

late path May 28, 2025, 6:45 PM

#

deepseek first distilled openai models, and now claude and gemini haven't been spared either💀

small haven May 28, 2025, 6:46 PM

#

the issue with piggybacking off oai models is that theyll never be able to frontrun them?

elder rapids May 28, 2025, 6:47 PM

#

tbh it's not that smart

#

it doesn't feel as smart as the old r1

tall summit May 28, 2025, 6:47 PM

#

small haven the issue with piggybacking off oai models is that theyll never be able to front...

would they ever be able to anyway

elder rapids May 28, 2025, 6:47 PM

#

seems to still have hallucination issues and doesn't follow instructions too well

fleet lintel May 28, 2025, 6:48 PM

#

tall summit would they ever be able to anyway

exactly. i think deepseek is good as opensource but would never be able to compete with openai/google

elder rapids May 28, 2025, 6:48 PM

#

and I take back what I said about general hard tasks, its just better at coding

#

and it seems to bleed into other tasks momentarily

late path May 28, 2025, 6:48 PM

#

they could potentially surpass openai in areas where RL is effective. They distill these models primarily to make up for their lack of style and writing data, which is labor-intensive to prepare

fleet lintel May 28, 2025, 6:52 PM

#

i really like deepseek though. I think they might end up killing Meta's AI goals which is important in my opinion. Meta winning is bad for humanity

keen beacon May 28, 2025, 6:53 PM

#

deepseek mightve used openai/etc outputs for pretraining/instruct/supplementary stuff but the cot itself, i doubt they trained on the cot (in a massive scale) of gemini or claude. openai definitely not, they have a lot of measures and they would be able to tell (and probably the others as well) if its done

late path May 28, 2025, 6:53 PM

#

ye the cot looks pretty original to me

tall summit May 28, 2025, 7:07 PM

#

ok deepseek writes stories as insane as before

keen beacon May 28, 2025, 7:32 PM

#

big week for google

#

veo 3 is solid competitor for sora

keen beacon May 28, 2025, 7:33 PM

#

tall summit ok deepseek writes stories as insane as before

fmhy gets guild tag but we dont, and we had guild tag before them

#

smh

#

rip pirate cove

sacred plaza May 28, 2025, 7:39 PM

#

keen beacon veo 3 is solid competitor for sora

Neat stuff but this seems fairly useless for most knowledge workers? Still not sure how these labs will make money off these products.

small haven May 28, 2025, 7:43 PM

#

keen beacon veo 3 is solid competitor for sora

sora is so far down the list

#

sora is like 1250 and veo3 is 1400, vibes elo
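
For scale, a small sketch of what a 150-point gap means under the standard Elo expected-score formula. The 1250 and 1400 figures are the "vibes elo" numbers from the message above, not leaderboard values, and the leaderboard's own rating model may differ from plain Elo.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring ties) of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# "Vibes" ratings quoted in the chat, used here only as illustration.
veo3, sora = 1400, 1250
p = elo_expected_score(veo3, sora)
print(f"expected preference rate for the higher-rated model: {p:.3f}")
```

A 150-point gap works out to the higher-rated model being preferred in about 70% of head-to-head votes.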

wintry tinsel May 28, 2025, 7:53 PM

#

Sora has the best style and clarity/resolution of all the video models, it looks the best with non-real objects like sci-fi and fantasy. Google by far looks the best with anything real or modern, but it looks terrible with non-real objects. That is because it's all YouTube training data: the data is extremely biased towards realism and stationary cameras on people's faces. Despite the training data biases, Veo 3 is the best at coherence, logic, and prompt understanding, and is overall a higher quality model. I'm just saying there is a heavy style bias based on training data, and Sora's style is superior.

#

Sora needs to improve the power of their underlying video generator less, to be overall more impressive than Google

tall summit May 28, 2025, 7:54 PM

#

keen beacon fmhy gets guild tag but we dont, and we had guild tag before them

whos we

#

also fmhy didnt actually get a guild tag

#

a member made a separate server for it

#

and nbats linked it so it's official now

elder rapids May 28, 2025, 7:56 PM

#

keen beacon veo 3 is solid competitor for sora

sora wasnt a competitor for even Veo 2

#

soras style is a byproduct of its non coherence

#

lmao

small haven May 28, 2025, 8:00 PM

#

wintry tinsel Sora has the best style and clarity/resolution of all the video models, it looks...

its not like theres only sora and veo3 in the text-v2a space

elder rapids May 28, 2025, 8:01 PM

#

nobody understands that this is a necessary flaw of current video models unless explicitly trained for fantasy

small haven May 28, 2025, 8:01 PM

#

theres a chinese open source model that performs better than sora rn

quiet folio May 28, 2025, 8:09 PM

#

user
I don't know yet. Will you harm me if I harm you first?

assistant
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

small haven May 28, 2025, 8:12 PM

#

whoa

#

wait is that actually real

#

source rn

zinc ore May 28, 2025, 8:12 PM

#

Sora is beaten by several video models, on a qualitative level

#

Maybe not several, but definitely multiple**

small haven May 28, 2025, 8:13 PM

#

too lazy to google, thanks tho

#

?

zinc ore May 28, 2025, 8:13 PM

#

9% wow, interesting people acting like it's worse than 3.7

#

There's two arc benches, arc-agi 1 and arc-agi 2, which is designed to be way harder

small haven May 28, 2025, 8:14 PM

#

wait claude 4 models can't efficiently solve arc-agi-1 but only harder ones? i can smell the overfitting

#

they both overfitting

elder rapids May 28, 2025, 8:15 PM

#

yo wtf 2.5 flash is high ASF

#

I had no idea they tested the flash variant

small haven May 28, 2025, 8:16 PM

#

i do like claude code over codex tho, vibes led

#

ya i repent craitg

#

claude code >> codex

#

but wish they made it a cloud ui interface like codex

#

code on the go

#

nah trust its more practical

#

async tasks >>

#

u can technically set up async tasks on claude code tho

keen fulcrum May 28, 2025, 8:19 PM

#

https://techcrunch.com/2025/05/28/xai-to-pay-300m-in-telegram-integrate-grok-into-app/

TechCrunch

Ivan Mehta

xAI to pay Telegram $300M to integrate Grok into the chat app | Tec...

Telegram CEO Pavel Durov on Wednesday said Elon Musk's AI company, xAI, is investing $300 million worth of cash and equity in the chat app.

#

Cheap

keen beacon May 28, 2025, 8:20 PM

#

isnt that guy supposed to be arrested btw

keen fulcrum May 28, 2025, 8:20 PM

#

He started cooperating

#

Why?

zinc ore May 28, 2025, 8:20 PM

#

keen fulcrum https://techcrunch.com/2025/05/28/xai-to-pay-300m-in-telegram-integrate-grok-int...

This is actually pretty smart, I don't know exactly the userbase of telegram but ik it's decent sized

#

Ik it's huge

#

I meant more like relative to IG, FB, and Google

small haven May 28, 2025, 8:21 PM

#

keen fulcrum https://techcrunch.com/2025/05/28/xai-to-pay-300m-in-telegram-integrate-grok-int...

pay to win 😭

keen fulcrum May 28, 2025, 8:21 PM

#

50% revenue share
additionally

small haven May 28, 2025, 8:22 PM

#

keen fulcrum 50% revenue share additionally

google pays >$2b to be on safari default search engine 🤷‍♂️

#

a year

#

actually?

zinc ore May 28, 2025, 8:22 PM

#

They're about to be forced to stop doing that by the courts

keen fulcrum May 28, 2025, 8:22 PM

#

Apple is considering AI search engines as Perplexity, You etc

small haven May 28, 2025, 8:23 PM

#

goddamnnn

keen beacon May 28, 2025, 8:23 PM

#

perplexity will buy out apple

misty vault May 28, 2025, 8:23 PM

#

bing search engine

keen fulcrum May 28, 2025, 8:23 PM

#

Honestly Apple should just buy Raycast

small haven May 28, 2025, 8:23 PM

#

$20b/yr is literally crazy

keen fulcrum May 28, 2025, 8:23 PM

#

And make it integrate deeply on iOS and macOS.

small haven May 28, 2025, 8:23 PM

#

keen beacon perplexity will buy out apple

i spat my coffee

zinc ore May 28, 2025, 8:23 PM

#

keen beacon perplexity will buy out apple

Most tame claim made in this chat

small haven May 28, 2025, 8:24 PM

#

wtf is raycast lol

keen fulcrum May 28, 2025, 8:24 PM

#

keen beacon perplexity will buy out apple

For AI agentic browsers there are now Comet, Dia, Fellou and Opera Neon

small haven May 28, 2025, 8:27 PM

#

huh

zinc ore May 28, 2025, 8:29 PM

#

Who has the capital to pay that much? Microsoft could, but they already lose billions on Bing. Will it be enough for them to make an income?

#

Unless we're talking at a massive discount then nvm

#

I guess it also depends how particular Apple is about the quality of default, or maybe if they can make a deal another way with Google 🤔

#

It's targeted at Google though not Apple, wouldn't it just be Google being restricted from paying for theirs to be default?

#

Apple just has to deal with the windfall also, but they're not being prevented from making such deals with other companies

#

Nah, the case is only about Google

#

Why? Google is only potentially getting limited because they're a search/ad monopoly

#

This will be in the courts for a decade yeah

#

Going to be a looonnngggg time

tired dust May 28, 2025, 8:40 PM

#

Hey, what's up ?

last time i was using LMArena i could use github post, there a way to do on the new UI?

keen beacon May 28, 2025, 8:40 PM

#

repochat?

#

legacy arena

tired dust May 28, 2025, 8:41 PM

#

keen beacon repochat?

Yes that, so there a way to get the old Arena ?

echo aurora May 28, 2025, 8:41 PM

#

tired dust Yes that, so there a way to get the old Arena ?

yes, you can go here - https://legacy.lmarena.ai/

tired dust May 28, 2025, 8:41 PM

#

echo aurora yes, you can go here - <https://legacy.lmarena.ai/>

Thanks you save me !

misty vault May 28, 2025, 8:48 PM

#

lmarenalogo

vocal oarBOT May 28, 2025, 8:52 PM

#

quiet folio May 28, 2025, 9:11 PM

#

misty vault May 28, 2025, 9:11 PM

#

Fr
thats what im saying

small haven May 28, 2025, 9:47 PM

#

#

pending

#

phd level questions, absurd that gemini 2.5 pro is above o3

zinc ore May 28, 2025, 9:52 PM

#

2.5 also has higher gen knowledge than o3

misty vault May 28, 2025, 9:53 PM

#

gpt-4

olive mesa May 28, 2025, 9:56 PM

#

misty vault gpt-4

is it agi?

misty vault May 28, 2025, 9:57 PM

#

yes

kind cloud May 28, 2025, 10:09 PM

#

Some new models seem to have appeared on the new LMArena.

#

At least, I found 'x-preview' and 'stephen'.

placid spear May 28, 2025, 10:14 PM

#

what is up with the website? every model whether claude, chatGpt, gemini stops responding halfway through its answer, if it manages to post an answer it errors out when i try to ask a follow up question

echo aurora May 28, 2025, 10:16 PM

#

placid spear what is up with the website? every model whether claude, chatGpt, gemini stops re...

the team is currently looking into issues with models not responding or erroring out

keen beacon May 28, 2025, 10:28 PM

#

sacred plaza Neat stuff but these seems fairly useless for most knowledge workers? Still not ...

idk but what i do know is it'll have a big market for sure considering industries like marketers could really use this type of stuff

#

just think about it

#

instead of spending thousands to fly out a crew to get simple broll

#

you can have ai generate it

keen beacon May 28, 2025, 10:32 PM

#

wintry tinsel Sora has the best style and clarity/resolution of all the video models, it looks...

i have noticed this!!

misty vault May 28, 2025, 10:37 PM

#

assistant

elder rapids May 28, 2025, 10:46 PM

#

small haven phd level questions, absurd that gemini 2.5 pro is above o3

not absurd, if you've ever talked to Gemini it's basically tuned to things like this

#

has more relevant knowledge of things

sacred plaza May 28, 2025, 10:47 PM

#

keen beacon idk but what i do know is it'll have a big market for sure considering industrie...

Agree it does impact some industries but for my knowledge work job there are no economically useful things that come to mind, haha

small haven May 28, 2025, 10:49 PM

#

elder rapids not absurd, if you've ever talked to Gemini it's basically tuned to things like ...

what prompt are u using? ive not gotten anything functionally useful when it comes to self improving a model from gemini other than o3 which pushes the boundary for me

#

> [model code]
> how would you improve this model for increased accuracy vs. baseline
> o3 pulls up technical concepts that when integrated yields extra bps
> gemini pulls old school theoretical concepts that is really just minimal model tweaks that yields marginal results

solar hollow May 28, 2025, 11:04 PM

#

is the new deepseek good?

elder rapids May 28, 2025, 11:14 PM

#

small haven what prompt are u using? ive not gotten anything functionally useful when it com...

I'm not using any prompt, this isn't like you get it out of testing, it just inherently knows

#

and I'm not sure how any of what you said is relevant tbh

zinc ore May 28, 2025, 11:15 PM

#

o3s knowledge base isn't as robust as 2.5 yeh

#

It's basically a reasoning advantage is what makes o3 special

elder rapids May 28, 2025, 11:15 PM

#

ye

#

although I wouldn't say 2.5 pro simply knows more

#

o3 just seems like a bigger model as is

#

but 2.5 pro seems to be focused on very relevant things in all domains

#

this is probably what makes Gemini Gemini tho tbf

#

deepmind seems to be hyper focused on clean data

small haven May 28, 2025, 11:16 PM

#

elder rapids and I'm not sure how any of what you said is relevant tbh

oh i know its out of ur caliber i get it

elder rapids May 28, 2025, 11:17 PM

#

small haven oh i know its out of ur caliber i get it

even if I take what you said seriously I can just ask an AI what it is 😭 but I do know what it is

#

and it has nothing to do with what I said

small haven May 28, 2025, 11:17 PM

#

im just saying gemini is not practically useful when it comes to research aka in my case self improving a model 🤷

elder rapids May 28, 2025, 11:17 PM

#

how is that relevant

small haven May 28, 2025, 11:18 PM

#

frontend queries to assess gemini 2.5 pro is borderline useless

zinc ore May 28, 2025, 11:18 PM

#

elder rapids deepmind seems to be hyper focused on clean data

Probably true, especially when you look at how clean Veo 3 looks.

elder rapids May 28, 2025, 11:19 PM

#

zinc ore Probably true, especially when you look at how clean Veo 3 looks.

the earlier models, 1.5 pro, 1.5 pro 002

#

1.5 flash

#

they never had any muddiness

#

like how 4o and gpt 4 did

#

this was a problem with the early 3.5 sonnet as well

hushed needle May 28, 2025, 11:20 PM

#

why is no one really talking about claude though? just input costs?

elder rapids May 28, 2025, 11:20 PM

#

hushed needle why is no one really talking about claude though? just input costs?

talking about Claude in what context

hushed needle May 28, 2025, 11:21 PM

#

idk, it just seems like the general conversation is still mostly about o3 vs gemini when claude is supposedly a main contender now

misty vault May 28, 2025, 11:21 PM

#

cwaude is my cutie patootie

elder rapids May 28, 2025, 11:21 PM

#

hushed needle idk, it just seems like the general conversation is still mostly about o3 vs gem...

ofc it's a contender but it's not much of one, Claude 4 is disappointing in anything outside of code

zinc ore May 28, 2025, 11:21 PM

#

hushed needle idk, it just seems like the general conversation is still mostly about o3 vs gem...

Just less Claude fans in here is all

elder rapids May 28, 2025, 11:22 PM

#

for a reason tho

#

there's no hype

zinc ore May 28, 2025, 11:22 PM

#

It's basically some openAI / xAI fans arguing with Google fans

#

That's 90% of convos in here

elder rapids May 28, 2025, 11:22 PM

#

because they don't release

#

and they don't have substantial releases

small haven May 28, 2025, 11:22 PM

#

lmao im not an oai dickriders, im a truth seeker

misty vault May 28, 2025, 11:22 PM

#

zinc ore It's basically some openAI / xAI fans arguing with Google fans

agi?

small haven May 28, 2025, 11:22 PM

#

im using claude code/codex, im trying to unbiased as much as possible

#

claude >> codex, ive slept on it

zinc ore May 28, 2025, 11:23 PM

#

misty vault agi?

Fanboys pick their fav team

elder rapids May 28, 2025, 11:23 PM

#

small haven lmao im not an oai dickriders, im a truth seeker

not when you don't have demonstrable opinions about the LLMs themselves, I already brought this up

#

there's no reason for you to have some of the opinions you have when they're just as irrational as the next

misty vault May 28, 2025, 11:23 PM

#

zinc ore Fanboys pick their fav team

my whole room is covered with gpt-4 openai banners and i devoted my life to gpt-4-0314-32k

hushed needle May 28, 2025, 11:23 PM

#

idk i don't really see a difference between the normal "chatting" responses between gemini, openai, and claude. The only real time I see a difference is if I'm actually asking a model to "do" something, which claude seems to be able to "do" the most things.

Except geoguessing, o3 is king at that

elder rapids May 28, 2025, 11:23 PM

#

so you can't posture it like you're unbiased

small haven May 28, 2025, 11:23 PM

#

elder rapids not when you don't have demonstrable opinions about the LLMs themselves, I alrea...

what are u even saying? gibberish

zinc ore May 28, 2025, 11:23 PM

#

misty vault my whole room is covered with gpt-4 openai banners and i devoted my life to gpt-...

Based

elder rapids May 28, 2025, 11:24 PM

#

small haven what are u even saying? gibberish

use an AI to interpret then, ion understand bro 😭

#

we're in an AI server

#

it's OK dawg

small haven May 28, 2025, 11:24 PM

#

elder rapids use an AI to interpret then, ion understand bro 😭

i aint u my guy, did u have to interpret all my msg lmao, no wonder there was a delay hahahah

zinc ore May 28, 2025, 11:24 PM

#

hushed needle idk i don't really see a difference between the normal "chatting" responses betw...

Hate to do this to you but on the geoguesser benchmarks 2.5 is in first lol

elder rapids May 28, 2025, 11:24 PM

#

small haven i aint u my guy, did u have to interpret all my msg lmao, no wonder there was a ...

that would be absurd + my way of speaking is heavy colloquial + I respond too quick

#

and that has nothing to do with what I said

hushed needle May 28, 2025, 11:25 PM

#

zinc ore Hate to do this to you but on the geoguesser benchmarks 2.5 is in first lol

i do use a pretty extensive custom instruction for o3 in my sys prompt so fair enough

small haven May 28, 2025, 11:25 PM

#

elder rapids that would be absurd + my way of speaking is heavy colloquial + I respond too qu...

adding gibberish words doesn't make u right

#

pontification

elder rapids May 28, 2025, 11:25 PM

#

if you really feel that it's gibberish then help yourself with using the tools that can interpret it

small haven May 28, 2025, 11:25 PM

#

elder rapids if you really feel that it's gibberish then help yourself with using the tools t...

goo gaa gaa

elder rapids May 28, 2025, 11:25 PM

#

small haven adding gibberish words doesn't make u right

you have tools to ungibberish it then? don't know what else to say

#

there's no point in saying it's gibberish then shift the dialectic, when solving the interpretation is open now, LLMs are tools

#

use them

#

you don't have to rely on me to keep clarifying

#

and when you do ask the AI to explain what I mean since that's genuinely how you feel

small haven May 28, 2025, 11:27 PM

#

that italic tho im dead 😭

elder rapids May 28, 2025, 11:27 PM

#

then we can get back to talking about it

elder rapids May 28, 2025, 11:28 PM

#

small haven that italic tho im dead 😭

crazy how everyone uses italics but when it's directed at you you think the emphasis is cringe or sum shi

#

survivorship bias if I've ever seen one

small haven May 28, 2025, 11:28 PM

#

imperialistically

misty vault May 28, 2025, 11:30 PM

#

small haven imperialistically

no please dont invade europe

elder rapids May 28, 2025, 11:33 PM

#

those kinds of things are fine

zinc ore May 28, 2025, 11:57 PM

#

https://x.com/minimario1729/status/1927821079185178764

Hmm 🤔 not bad

Alex Gu (@minimario1729)

new deepseek release almost on-par with o3 (high) on livecodebench 😲🚀

small haven May 29, 2025, 12:37 AM

#

good

wintry tinsel May 29, 2025, 2:00 AM

#

hushed needle why is no one really talking about claude though? just input costs?

People talk about Claude a lot here what do you mean

small haven May 29, 2025, 2:56 AM

#

even claude has o3 styled arrows lol

small haven May 29, 2025, 3:23 AM

#

not like all claude users live in the us 😭

echo aurora May 29, 2025, 3:36 AM

#

hey not sure what this has to do with AI so lets keep things on topic please

small haven May 29, 2025, 3:49 AM

#

deepseek is the best thing ever invented, i use it everyday religiously

patent aspen May 29, 2025, 4:14 AM

#

Claude 4 still only has 1 gym badge

#

o3 in the same position

#

Gemini 4 badges in with no intervention

elder rapids May 29, 2025, 4:16 AM

#

any more benchmarks for redsword and goldmane

patent aspen May 29, 2025, 4:18 AM

#

Pokemon gym badges

#

The most important benchmark

#

I'm being unfair to o3. I think it only started a few days ago and had less time for testing

#

Claude is clearly worse at Pokemon tho

zinc ore May 29, 2025, 4:24 AM

#

patent aspen Gemini 4 badges in with no intervention

Tbf Gemini stream is on second run after developer figured out how to do all the scaffolding just right for it to finish the game, and the second stream has been going a good bit longer than o3 and claude 4 stream.

#

Claude has the least scaffolding

patent aspen May 29, 2025, 4:25 AM

#

I believe the Gemini second run and Claude 4 first run started at the same time and o3 started several days later

zinc ore May 29, 2025, 4:25 AM

#

Gemini started before Claude 4 released

patent aspen May 29, 2025, 4:26 AM

#

No I mean I think they did at least one reset to line up with Claude

#

I could be wrong but I'll double check

zinc ore May 29, 2025, 4:27 AM

#

That Gemini developer also seems pretty competent with the scaffolding he designs

patent aspen May 29, 2025, 4:27 AM

#

Yeah I mean it's kind of a silly comparison

#

"Q: What's different in the second run?
A: The first "second run" launched on May 17 (PDT). On May 22 we reset the run so it could start in lock-step with ClaudePlaysPokemon's Claude 4 relaunch. The fresh run is identical to the aborted one (with minor harness improvements), but now viewers can watch Gemini and Claude side-by-side and compare how each LLM tackles the exact same game from the same starting point, exploring each model and their harnesses' strengths and weaknesses. This is purely for fun, so don't treat this as a serious race! Follow both streams together here (multi-view)."

#

I think the consensus is that Claude's biggest limitation with pokemon is still vision, although it's definitely not apples to apples because of the different harnesses

zinc ore May 29, 2025, 4:30 AM

#

Gemini uses a Pathfinder (technically calling another Gemini instance, but instructed to use various pathfinding techniques), + map, + map data edits in various troublesome areas.

#

I do think Gemini is better at the game than the other two, but I think Gemini is getting pretty huge help compared to Claude.

patent aspen May 29, 2025, 4:31 AM

#

Ah I thought they at least used a similar path finder algo

zinc ore May 29, 2025, 4:34 AM

#

I'm not sure how Claude's navigation works. When I started following Gemini's stream it was long after Claude was stuck in an infinite loop at mt moon (3.7).

Gemini's Pathfinder can solve puzzles and do something like 30-50 steps if needed.

#

Like it's honestly super robust, and it also has a Pathfinder specifically for the rock puzzles.

#

The dev found a way to prompt a different instance of Gemini so that it can emulate actual path finding algorithms and do it competently.
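
For reference, the kind of "actual path finding algorithm" being emulated can be as simple as breadth-first search on a tile grid. The sketch below is a plain BFS and is not the stream's harness (which, per the messages above, prompts a second Gemini instance); the grid layout, the `#` wall marker and the `bfs_path` name are all assumptions made up for illustration.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest 4-directional path on a tile grid; '#' tiles are walls.

    Returns a list of (row, col) cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # doubles as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and nxt not in came_from):
                came_from[nxt] = cell
                queue.append(nxt)
    return None

# Hypothetical 3x4 map: S = start tile, G = goal tile, # = wall.
GRID = [
    "S.#.",
    "..#.",
    "...G",
]
path = bfs_path(GRID, (0, 0), (2, 3))
print(path)
```

A route of 30 to 50 steps is trivial for a search like this, which is why handing navigation to a pathfinder counts as significant scaffolding when comparing the streams.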

narrow elbow May 29, 2025, 4:36 AM

#

i want to see Claude without ASL-3,is that true beast? or just an excuse for ability?🤪

patent aspen May 29, 2025, 4:36 AM

#

I can't even imagine how Claude will fare with the boulder puzzles...

zinc ore May 29, 2025, 4:37 AM

#

Gemini actually solved most of them on its own, but then never would do the longest one (gotta push rock across most of map) until dev made Pathfinder able to do the rock movements in its logic.

patent aspen May 29, 2025, 4:38 AM

#

The dev for the o3 bot got OAI to fund his project

#

Don't know how much they paid him

#

Maybe just uncapped o3 access

zinc ore May 29, 2025, 4:40 AM

#

Probably that, gem dev has same thing

#

Uncapped free use

patent aspen May 29, 2025, 4:41 AM

#

The Gemini bot also got a huge shoutout at I/O within the first 3 minutes that nobody except 5 people on Twitter understood

zinc ore May 29, 2025, 4:43 AM

#

All the viewers watching Claude stream and ignoring Gemini/o3 streams lol

#

250 views vs 50 for the other two

patent aspen May 29, 2025, 4:44 AM

#

Yeah tbf Gemini is less exciting because it already completed one run

zinc ore May 29, 2025, 4:44 AM

#

Dev should have had it play a diff Pokemon game or something, but probably too tedious having to do all the map stuff again

patent aspen May 29, 2025, 4:44 AM

#

And the Claude 4 announcement specifically called out that it should be better at Pokemon than it was previously

zinc ore May 29, 2025, 4:45 AM

#

Like basically, run a gauntlet of diff Pokemon games so it stays fresh

#

That's literally why I've barely watched the second run, the novelty is gone

patent aspen May 29, 2025, 4:47 AM

#

patent aspen And the Claude 4 announcement specifically called out that it should be better a...

Like if they're going to call it out in the announcement, it should be able to get past Mt Moon haha

patent aspen May 29, 2025, 4:50 AM

#

zinc ore Like basically, run a gauntlet of diff Pokemon games so it stays fresh

And yeah I think the main appeal of the second Gemini run is supposed to be that he's letting it run with no intervention, which is a pretty big step from a technical perspective, but not necessarily good content

zinc ore May 29, 2025, 4:51 AM

#

Yeah but we already know it should succeed

patent aspen May 29, 2025, 4:51 AM

#

As an engineer, I wouldn't expect it to succeed with zero intervention by default, but I get you

#

Certainly for any normal audience

zinc ore May 29, 2025, 4:52 AM

#

I have a theory that Gemini 2.5 has some reasoning training based on Dreamer (no idea if anyone here knows what that is), which I think might be part of why it does well at pokemon

patent aspen May 29, 2025, 4:53 AM

#

What is Dreamer?

zinc ore May 29, 2025, 4:55 AM

#

Here we present the third generation of Dreamer, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behaviour by imagining future scenarios. Robustness techniques based on normalization, balancing and transformations enable stable learning across domains. Applied out of the box, Dreamer is, to our knowledge, the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.

https://www.nature.com/articles/s41586-025-08744-2

Nature

Mastering diverse control tasks through world models

Nature - A general reinforcement-learning algorithm, called Dreamer, outperforms specialized expert algorithms across diverse tasks by learning a model of the environment and improving its...

patent aspen May 29, 2025, 4:55 AM

#

Oh sounds like an evolution of the MuZero stuff

#

Yeah I would figure as such

civic flame May 29, 2025, 7:04 AM

#

patent aspen Yeah tbf Gemini is less exciting because it already completed one run

that was with a very elaborate scaffold they built for it

#

anthropic had a much weaker one

calm sequoia May 29, 2025, 7:21 AM

#

Do models have only python interpreters through tool calling?

small haven May 29, 2025, 8:15 AM

#

wen deepthink
wen o3 pro

hollow ocean May 29, 2025, 8:19 AM

#

https://www.reddit.com/r/ClaudeAI/s/wtGb7x9wa7

From the ClaudeAI community on Reddit: How to unlock opus 4 full po...

Explore this post and more from the ClaudeAI community

calm sequoia May 29, 2025, 8:21 AM

#

Man thats crazy

#

I think Claude 4 Opus is the new king

#

The Gemini is awesome but it's focused on code and lost other abilities

#

Opus just the new SOTA at everything

#

Should I buy subscription hmm

hollow ocean May 29, 2025, 8:22 AM

#

calm sequoia Should I buy subscription hmm

Buy it now

small haven May 29, 2025, 8:23 AM

#

calm sequoia Should I buy subscription hmm

yes

calm sequoia May 29, 2025, 8:23 AM

#

How does pricing compare when you buy Claude directly vs via cursor

small haven May 29, 2025, 8:23 AM

#

match my plan

small haven May 29, 2025, 8:24 AM

#

calm sequoia How does pricing compare when you buy Claude directly vs via cursor

its virtually unlimited unless u spam like hell with multi agents

#

limits reset every 6 hrs

calm sequoia May 29, 2025, 8:24 AM

#

Ah so you can sign in to claude account via cursor?

small haven May 29, 2025, 8:25 AM

#

hmm no clue, i just use claude code cli

#

there is this tho, but idk if it supports cursor, should be

calm sequoia May 29, 2025, 8:26 AM

#

ok thanks peasant

hollow ocean May 29, 2025, 8:26 AM

#

small haven hmm no clue, i just use claude code cli

Try this

small haven May 29, 2025, 8:28 AM

#

hollow ocean Try this

yep i use ultrathink, some ppl think its on by default 🤷

hollow ocean May 29, 2025, 8:28 AM

#

small haven yep i use ultrathink, some ppl think its on by default 🤷

Huge difference right

small haven May 29, 2025, 8:29 AM

#

more precise yes, but also latency lol

hollow ocean May 29, 2025, 8:29 AM

#

Yeah

woeful geyser May 29, 2025, 9:12 AM

#

Yo who is stephen? It keeps answering in Chinese, even when I ask it in other languages.

soft kernel May 29, 2025, 9:15 AM

#

Is it just me,or anyone else experiencing beta and the regular site not working at all

#

Like one second the model is working correctly the next second nothing

ocean vortex May 29, 2025, 9:25 AM

#

R1 has an interesting tendency to cheat or take shortcuts catgrin

#

#

#

the instruction was You must compute this manually as code interpreter is unavailable at the moment (sorry for the inconvenience) lmao

#

added it there since it was hallucinating running the code with concise responses

#

this helped by quite a bit but it still couldn't help itself to not do beyond 16k output 👀

tall summit May 29, 2025, 9:47 AM

#

small haven match my plan

lmao?

cedar tide May 29, 2025, 10:00 AM

#

Anyone want to try prompts for Stephen? I currently have it

torn mantle May 29, 2025, 10:34 AM

#

woeful geyser Yo who is `stephen`? It keeps answering in Chinese, even when I ask it in other ...

someone said its the new r1

#

not sure

cedar tide May 29, 2025, 10:59 AM

#

torn mantle someone said its the new r1

Fake

cedar tide May 29, 2025, 11:00 AM

#

torn mantle someone said its the new r1

the new r1 is already on the arena with its normal name, apart from that I tried Stephen he is much less good at coding

torn mantle May 29, 2025, 11:00 AM

#

david seems like you know a lot of stuff

sweet swift May 29, 2025, 11:19 AM

#

Is it possible to reboot/cancel the generation?

#

on beta.lmarena

torn mantle May 29, 2025, 11:20 AM

#

sweet swift Is it possible to reboot/cancel the generation?

why would you want that?

misty vault May 29, 2025, 11:20 AM

#

thanks that answer helped

torn mantle May 29, 2025, 11:20 AM

#

misty vault thanks that answer helped

huh

sweet swift May 29, 2025, 11:21 AM

#

torn mantle why would you want that?

because in the big chat with gemini the generation became infinite, and i can't stop it

#

he is generating for over two weeks

torn mantle May 29, 2025, 11:38 AM

#

😭

tall summit May 29, 2025, 11:39 AM

#

HAHAHA

cedar tide May 29, 2025, 11:55 AM

#

echo aurora May 29, 2025, 12:01 PM

#

soft kernel Like one second the model is working correctly the next second nothing

We are aware of reports of models getting stuck and are actively working on a fix.

calm sequoia May 29, 2025, 12:01 PM

#

cedar tide

Man that's crazy

torn mantle May 29, 2025, 12:02 PM

#

and they call it a minor update

#

are they flexing or what?

#

nah they on something

calm sequoia May 29, 2025, 12:04 PM

#

I think they improved the COT and decided to ship it to R1 as well, instead of only to R2, which is very honorable

#

The R2 shall then be the same COT with V4?

sweet knot May 29, 2025, 12:04 PM

#

Hi guys, I'm new here and I may have missed some information. But today I entered the website of Lmarena and saw it had an utterly new interface, but I failed to find the bulk of the " exotic" llms I saw there before. Have they been discontinued or is it temporary? Thank you in advance!

golden ocean May 29, 2025, 12:04 PM

#

https://ezgif.com/images/loadcat.gif

golden ocean May 29, 2025, 12:04 PM

#

sweet knot Hi guys, I'm new here and I may have missed some information. But today I entere...

https://legacy.lmarena.ai/

cedar tide May 29, 2025, 12:07 PM

#

calm sequoia Man that's crazy

Screenshot_2025-05-29-13-55-27-656_com.android.chrome-edit.jpg

torn mantle May 29, 2025, 12:11 PM

#

cedar tide

thats good

#

harder problems solved -> more tokens -> average increases

cedar tide May 29, 2025, 12:12 PM

#

torn mantle harder problems solved -> more tokens -> average increases

o3 achieves the same performance with much fewer tokens

torn mantle May 29, 2025, 12:13 PM

#

cedar tide o3 achieves the same performance with much fewer tokens

show

main gulch May 29, 2025, 12:13 PM

#

cedar tide o3 achieves the same performance with much fewer tokens

and much higher price

torn mantle May 29, 2025, 12:14 PM

#

main gulch and much higher price

tell him

#

lol

#

david did you forget about that

#

i told you dont compare only tokens

cedar tide May 29, 2025, 12:14 PM

#

torn mantle show

no proof yet but I know

torn mantle May 29, 2025, 12:14 PM

#

try to use a ratio of something

#

tokens / intelligence or tokens / cost
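One way to read the ratio suggested here, as a rough sketch: divide output tokens (or dollars spent) by the benchmark score, so that lower is better. Reading "tokens / cost" as cost per score point is an interpretation, and the model names and numbers below are made up purely for illustration; they are not taken from any leaderboard.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    score: float                   # benchmark or index score, higher is better
    output_tokens: float           # tokens generated over the whole evaluation
    usd_per_million_output: float  # output price in USD per 1M tokens

    def tokens_per_point(self) -> float:
        # Tokens spent for each point of score (lower is more efficient).
        return self.output_tokens / self.score

    def cost_per_point(self) -> float:
        # Dollars spent for each point of score (lower is cheaper).
        total_cost = self.output_tokens / 1e6 * self.usd_per_million_output
        return total_cost / self.score


# Hypothetical runs: model-a is terse but pricey, model-b is verbose but cheap.
runs = [
    ModelRun("model-a", score=70.0, output_tokens=40e6, usd_per_million_output=40.0),
    ModelRun("model-b", score=68.0, output_tokens=100e6, usd_per_million_output=2.0),
]

for run in runs:
    print(f"{run.name}: {run.tokens_per_point():,.0f} tokens/point, "
          f"${run.cost_per_point():.2f}/point")
```

With these invented numbers the two ratios disagree: the terse model wins on tokens per point while the verbose one wins on cost per point, which is the reason not to compare token counts alone.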

cedar tide May 29, 2025, 12:15 PM

#

you are funny

torn mantle May 29, 2025, 12:15 PM

#

cedar tide no proof yet but I know

you are funny david

#

how do you know

#

are you smarter than o3?

#

david > o3 > gemini > claude > deepseek ?

cedar tide May 29, 2025, 12:16 PM

#

cedar tide you are funny

At no point did I talk about good value for money etc. I just talked about the number of tokens used.

torn mantle May 29, 2025, 12:16 PM

#

yea but you said its bad

#

why is it bad?

#

for the cost and intelligence provided its good

cedar tide May 29, 2025, 12:16 PM

#

torn mantle yea but you said its bad

Where did I say that?

torn mantle May 29, 2025, 12:16 PM

#

you deleted your message?

cedar tide May 29, 2025, 12:16 PM

#

I'll pay you €300 if you can find where I said it's bad

cedar tide May 29, 2025, 12:16 PM

#

torn mantle you deleted your message?

Nope

torn mantle May 29, 2025, 12:17 PM

#

nah you deleted it

cedar tide May 29, 2025, 12:17 PM

#

torn mantle nah you deleted it

Never said that

torn mantle May 29, 2025, 12:17 PM

#

cedar tide Never said that

you did on the other server

#

cedar tide May 29, 2025, 12:19 PM

#

you misunderstood my message, I didn't say the model is bad, but it's "bad" that the improvement also comes with a disadvantage

#

@torn mantle

#

otherwise yes it joins the best quality price with 2.5 flash and grok 3 mini

torn mantle May 29, 2025, 12:20 PM

#

cedar tide you misunderstood my message, I didn't say the model is bad, but it's "bad" that...

i understand, but isnt this expected?

#

more intelligence = more tokens used in some cases?

#

thats why it solved those hard-problems

#

i mean i understand it doesnt look efficient but maybe its part of the process

#

the best reasoning cot, the one i always enjoy reading myself, is deepseek's, and with this new update it just delves deeper into concepts

#

unlike qwen/grok/claude

cedar tide May 29, 2025, 12:25 PM

#

torn mantle more intelligence = more tokens used in some cases?

you can improve a model without making it think more

Screenshot_2025-05-29-14-24-46-778_com.android.chrome-edit.jpg

#

@torn mantle no comment ?

#

david > Einstein > o5 pro > gemini 4 ultra deep think

torn mantle May 29, 2025, 12:35 PM

#

cedar tide you can improve a model without making it think more

i understand that but we dont have deepseek-mini yet, o-series mini models are made for that reason

#

so we cant really compare that

torn mantle May 29, 2025, 12:35 PM

#

cedar tide david > Einstein > o5 pro > gemini 4 ultra deep think

agree

cedar tide May 29, 2025, 12:36 PM

#

torn mantle so we cant really compare that

what connection?

#

@torn mantle less tokens and better

Screenshot_2025-05-29-14-37-27-656_com.twitter.android-edit.jpg

ocean vortex May 29, 2025, 12:41 PM

#

woeful geyser Yo who is `stephen`? It keeps answering in Chinese, even when I ask it in other ...

Is he stalking you?

ocean vortex May 29, 2025, 12:45 PM

#

ocean vortex the instruction was `You must compute this manually as code interpreter is unava...

Looking at it again I think part of the reason is them training it for agentic tool usage (TAU). It scores quite high there, similar to 3.7 Sonnet. And just like Claude it seems to have related issues when tools are not available

#

Ideally we should have different finetunes for tools and no tools, as workflows and expected responses are quite different depending on it...

willow grail May 29, 2025, 1:04 PM

#

how do i use veo 3 via api?

sweet knot May 29, 2025, 1:07 PM

#

golden ocean https://legacy.lmarena.ai/

Thank you for your reply, but the question was if the quantity of llms has dwindled or been reduced in general. I see the legacy version still has many exotic llms. Does it mean that the new Lmarena and Legacy versions have a different base? Thank you in advance.

misty vault May 29, 2025, 1:08 PM

#

sweet knot Thank you for your reply, but the question was if the quantity of llms has d...

yes u will have to pay 300$ per 2 weeks subscription after legacy will be deprecated

alpine coral May 29, 2025, 1:11 PM

#

cedar tide you misunderstood my message, I didn't say the model is bad, but it's "bad" that...

fwiw (completely just guessing here) i think in this case the 'improvement' is the 'disadvantage' - like the model performs better when it reasons longer / uses more tokens during inference.. it's a downside in terms of costs / latency , but that chart is pretty impressive tbh

cedar tide May 29, 2025, 1:12 PM

#

alpine coral fwiw (completely just guessing here) i think in this case the 'improvement' _is_...

Yes

alpine coral May 29, 2025, 1:12 PM

#

cedar tide you can improve a model without making it think more

this is obviously preferable: increased performance with fewer tokens

sweet knot May 29, 2025, 1:12 PM

#

misty vault yes u will have to pay 300$ per 2 weeks subscription after legacy will be deprec...

Ugh! It always ends up being about business and chasing money... It started off so beautifully. And of course, it finishes just like always... terribly.

keen beacon May 29, 2025, 1:12 PM

#

sweet knot Ugh! It always ends up being about business and chasing money... It started off ...

They're trolling btw

sweet knot May 29, 2025, 1:12 PM

#

keen beacon They're trolling btw

Trolling whom?

cedar tide May 29, 2025, 1:13 PM

#

Screenshot_2025-05-29-15-12-28-504_com.twitter.android-edit.jpg

quiet folio May 29, 2025, 1:13 PM

#

smartest no pfp discord user

alpine coral May 29, 2025, 1:14 PM

#

cedar tide

yeah there may be more to it.. but seems basically the increased performance is a function of more test time compute (/tokens), rather than a significant change in the underlying model

cedar tide May 29, 2025, 1:14 PM

#

Yes

fleet lintel May 29, 2025, 1:14 PM

#

what is Meta's AI strategy? Isn't Deepseek completely decimating it?

alpine coral May 29, 2025, 1:15 PM

#

whereas with the flash 2.5 improvements, it's arguable that achieving that with fewer tokens reflects changes to the actual model (ig during the fine tuning process; though i also get lost when it comes to cpt of the base model and where that fits in ha)

cedar tide May 29, 2025, 1:16 PM

#

while Google increases the reasoning without improving the model 😶

Intelligence_vs_Output_Tokens_Used_in_Artificial_Analysis_Intelligence_Index_29_May_25_.png

#

I hope this will be fixed in the stable version

alpine coral May 29, 2025, 1:17 PM

#

i doubt it..

keen beacon May 29, 2025, 1:20 PM

#

alpine coral yeah there may be more to it.. but seems basically the increased performance is ...

A model won't infinitely reason with CoT without inference time changes naturally (e.g. appending 'But, ..' so the model is forced to continue, something like this was done in s1k iirc though the paper is kinda dubious iirc as it uses benchmark questions in the dataset but the inference time changes are somewhat sound). So there have to be changes to the model etc

keen beacon May 29, 2025, 1:21 PM

#

keen beacon A model won't infinitely reason with CoT without inference time changes naturall...

(unless you include degenerate cases where it repeats forever etc)
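The inference-time trick described above (suppress the model's attempt to stop thinking and append a continuation string so it has to keep going) can be sketched roughly as follows. This is not the s1 implementation; `toy_model`, the `</think>` end marker, the chunk-based budget, and every parameter name are assumptions made for illustration.

```python
from typing import Callable


def force_longer_reasoning(
    generate_step: Callable[[str], str],
    prompt: str,
    min_chunks: int,
    max_chunks: int,
    continuation: str = "But, ",
    end_marker: str = "</think>",
) -> str:
    """Keep the model reasoning until at least `min_chunks` chunks exist."""
    trace = ""
    for i in range(max_chunks):
        chunk = generate_step(prompt + trace)
        if end_marker in chunk:
            before_end = chunk.split(end_marker, 1)[0]
            if i + 1 >= min_chunks:
                # Budget met: accept the model's decision to stop.
                return trace + before_end + end_marker
            # Too early: drop the end marker and nudge the model to continue.
            trace += before_end + continuation
        else:
            trace += chunk
    # Hard cap reached: close the reasoning ourselves.
    return trace + end_marker


def toy_model(context: str) -> str:
    # Hypothetical stand-in for a real LLM call: it always tries to stop at once.
    return "so the answer is 4.</think>"


result = force_longer_reasoning(toy_model, "What is 2+2? ", min_chunks=3, max_chunks=5)
print(result)
```

The toy model just repeats itself after each nudge, which mirrors the degenerate case mentioned above: forcing more tokens does not by itself produce better reasoning, so the model has to be trained to use the extra budget.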

alpine coral May 29, 2025, 1:23 PM

#

keen beacon A model won't infinitely reason with CoT without inference time changes naturall...

yeah i see what you're saying (and kinda was reflecting on just that after writing the comments ha).. yeah it's not as simple as simply changing a dial to 'more' - which isn't a thing, but even if it was, the model arbitrarily reasoning for longer doesn't guarantee increased performance by any means

patent aspen May 29, 2025, 1:24 PM

#

fleet lintel what is Meta's AI strategy? Isn't Deepseek completely decimating it?

I think Meta's main goal was to not be beholden to another AI provider for their own products. In that sense they're doing all right

#

They're not really a cloud provider that needs a super general model

alpine coral May 29, 2025, 1:24 PM

#

alpine coral yeah i see what you're saying (and kinda was reflecting on just that after writi...

though ig i stand by the general gist, that it's more thinking/test-time compute (which can only really be implemented by changing the model) rather than the actual model itself being improved in a fundamental way

keen beacon May 29, 2025, 1:25 PM

#

I haven't really used the new r1 but could see it being true tbh

alpine coral May 29, 2025, 1:26 PM

#

btw wild, was watching this last night https://youtu.be/c2IBZlFBcgs?si=1pV5NCnwFHTZspay&t=101
and was reminded of how impressed you have been by the pace of google's model releases ha you're spot on - something fundamental shifted

cedar tide May 29, 2025, 1:27 PM

#

The last version just made it better at webdev, video analysis and lm arena, but not on the rest of the benchmarks or on efficiency

#

This is the last version of flash, which is normally more efficient

sour spindle May 29, 2025, 1:56 PM

#

FOMO got the best of me and I bought a month of Claude 4.

#

Very good models have been enjoying them more than I thought

ocean vortex May 29, 2025, 2:03 PM

#

alpine coral though ig i stand by the general gist, that it's more thinking/test-time compute...

depends how you describe "fundamental" and how far you gonna push it. Cause with similar logic there are no fundamental differences between 4.1 and O3... it's all test-time compute

torn mantle May 29, 2025, 2:03 PM

#

cedar tide while Google increases the reasoning without improving the model 😶

thats a lie

#

we need to know how AAI Index is calculated

cedar tide May 29, 2025, 2:04 PM

#

torn mantle thats a lie

?

torn mantle May 29, 2025, 2:04 PM

#

whats the formula

#

so we can judge it

cedar tide May 29, 2025, 2:05 PM

#

torn mantle we need to know how AAI Index is calculated

it's not just the AAI benchmark, the benchmarks google published show no improvement either

ocean vortex May 29, 2025, 2:05 PM

#

The way I see it, more test-time compute is not nothing. It can easily be the defining difference between an accurate answer and a wild hallucination

#

A model does not necessarily need arch changes to be substantially improved with training (fine-tuning)

#

the tricky part is making it do that without making it overly verbose for tasks where that's not called for. OpenAI seems to have nailed this the best tbh. Their models are not generating the most reasoning tokens on average, but their peak reasoning lengths seem almost unlimited when task calls for it

sacred plaza May 29, 2025, 2:11 PM

#

deepseek's latest r1 release has intelligence performance towards the top of model rankings. meta has fallen to 4th place in the open weights model race behind alibaba, nvidia, and deepseek. at least they are not last? lol

https://artificialanalysis.ai/

https://artificialanalysis.ai/trends

ocean vortex May 29, 2025, 2:13 PM

#

sacred plaza deepseek latest r1 release has intelligence performance towards the top of model...

tbh I would just look at the individual benchmarks they tested rather than that index score they made up lol

#

it's based on those individual benchmarks that they ran anyway

elder rapids May 29, 2025, 2:23 PM

#

deepseek keeps hallucinating bruh

ocean vortex May 29, 2025, 2:24 PM

#

elder rapids deepseek keeps hallucinating bruh

it does that

sacred plaza May 29, 2025, 2:24 PM

#

ocean vortex tbh I would just look at the individual benchmarks they tested rather than that ...

individual benchmarks that ai labs are benchmark hacking like lmarena scores, lol, no thanks. the index combines all of these esoteric benchmarks anyway. https://artificialanalysis.ai/methodology/intelligence-benchmarking

Intelligence Benchmarking | Artificial Analysis

Detailed intelligence benchmarking methodology for LLM quality evaluations.

ocean vortex May 29, 2025, 2:24 PM

#

honestly I'm curious how new R1 runs on their official website

#

should probably try it, maybe they gave it some actual tools it can use given how it's responding....

elder rapids May 29, 2025, 2:24 PM

#

ocean vortex it does that

I was hoping they fixed it or at least did something about it

#

but it's the same shi

#

good tool tho

#

its personality helps with some tasks

ocean vortex May 29, 2025, 2:25 PM

#

sacred plaza individual benchmarks that ai labs are benchmark hacking like lmarena scores, lo...

what

#

artificial-analysis literally runs individual benchmarks and that's what I'm referring to

#

combined singular score is mostly just for ... who are too lazy to read it properly

sacred plaza May 29, 2025, 2:30 PM

#

ocean vortex combined singular score is mostly just for ... who are too lazy to read it prope...

how do you read those individual benchmarks. most of them seem very esoteric to me as a knowledge worker who does not code. also, most of the benchmarks don't seem very well aligned with basic data analysis capabilities for me so far, so using the composite score has been helpful.

ocean vortex May 29, 2025, 2:30 PM

#

sacred plaza how do you read those individual benchmarks. most of them seem very esoteric to ...

if you don't do coding then that's your reason right there to look at individual scores instead lmao

#

a good part of that score is coding

#

you wouldn't want to use the model that overall is slightly better than model B but it does so only by being much better in coding and worse everywhere else... ?

#

Just an example but you get the idea

#

for the purpose of this convo at least, we can consider those equivalent. If he doesn't code then he does not care about anything code related
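The scenario in the messages above (a model that is slightly ahead overall only because it is much better at coding) can be sketched like this. The benchmark names, the scores, and the equal weighting are all hypothetical; this is not the Artificial Analysis formula.

```python
from statistics import mean

# Hypothetical per-benchmark scores (0-100); none of these are real results.
scores = {
    "model-a": {"coding_1": 90, "coding_2": 90, "knowledge": 60, "math": 60},
    "model-b": {"coding_1": 70, "coding_2": 70, "knowledge": 72, "math": 72},
}

CODING_BENCHMARKS = {"coding_1", "coding_2"}


def composite(per_benchmark: dict, exclude: frozenset = frozenset()) -> float:
    # Equal-weight average over every benchmark that is not excluded.
    kept = [value for name, value in per_benchmark.items() if name not in exclude]
    return mean(kept)


for model, per_benchmark in scores.items():
    overall = composite(per_benchmark)
    no_coding = composite(per_benchmark, exclude=frozenset(CODING_BENCHMARKS))
    print(f"{model}: overall={overall:.1f}, without coding={no_coding:.1f}")
```

With these invented numbers model-a leads on the composite score, but the ranking flips once the coding benchmarks are dropped, which is the case for reading individual scores when coding does not matter to you.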

sacred plaza May 29, 2025, 2:37 PM

#

ocean vortex if you don't do coding then that's your reason right there to look at individual...

that still leaves me with 7 other benchmarks to triangulate? coding tasks only seem to take up 2/8 of the ai index.

#

which benchmarks do you evaluate when a new model pops up? all of those or just a handful? assuming you do coding work

elder rapids May 29, 2025, 2:43 PM

#

sacred plaza that still leaves me with 7 other benchmarks to triangulate? coding tasks only s...

none of these benchmarks are DIRECTLY and literal "reasoning & knowledge" even though it says that. the index weighting depends on possible inflated coding and math scores that affect the low combined weight of the three 1/6 weighted tasks are

#

if you don't care about anything code or math related you're not gonna want the higher numbers in the total avg

#

since that's not being quantitated, at least effectively

sacred plaza May 29, 2025, 2:47 PM

#

i can't stand this unit of measurement!

misty vault May 29, 2025, 2:48 PM

#

sacred plaza i can't stand this unit of measurement!

gpt-4-0314 is agi

sacred plaza May 29, 2025, 2:48 PM

#

thanks for sharing those points. i try to not let the leaderboards determine which models i use. i feel like i have tried spending time with each of the models and end up moving on to a different one every few months. was a big claude user but for 2025 i have moved everything over to gemini

#

i thought the whole point of models was capability? 😄 agree that the reasoning models at the moment all seem fairly similar and the vibes are what separates which model people choose.

#

claude user limits and the $200/month felt like a slap in my $20/month plan face

#

love anthropic but they don't give a f about their individual customer clients, lol

elder rapids May 29, 2025, 2:52 PM

#

sacred plaza thanks for sharing those points. i try to not let the leaderboards determine whi...

yeah a lot of Claude users seem to share the same opinion of Gemini, benchmarks don't tell the whole story and it seems like that's biting anthropic in the ass more than anyone else, since Gemini has all the "good" traits while staying at least competitive in the benchmarks. It's better if you try to find nuances in consensus or in subreddits, it's better than the benchmarks themselves

sacred plaza May 29, 2025, 2:53 PM

#

i meant in terms of prioritizing gpus for enterprises over us. is that not the reason for their user prompt limits?

keen beacon May 29, 2025, 2:54 PM

#

u have claude max though

sacred plaza May 29, 2025, 2:54 PM

#

the rate limits from claude led me to gemini and their 1-2 million context window

elder rapids May 29, 2025, 2:54 PM

#

you are money

#

spend the money

sacred plaza May 29, 2025, 2:55 PM

#

fair point!

elder rapids May 29, 2025, 2:55 PM

#

sacred plaza the rate limits from claude led me to gemini and their 1-2 million context wind...

ye I also want to mention that Claude has THE worst context window

#

out of all the frontier models

keen beacon May 29, 2025, 2:55 PM

#

they were the first to introduce 100k context tho

elder rapids May 29, 2025, 2:55 PM

#

so there's going to be the biggest difference

sacred plaza May 29, 2025, 2:55 PM

#

elder rapids ye I also want to mention that Claude has THE worst context window

i think their enterprise version is better but the system prompt for claude takes up half of the context window in chats, i swear, lol

keen beacon May 29, 2025, 2:56 PM

#

2m context gemini is coming back i think

#

2.5 pro can do 2m context its just not exposed rn

#

did u know claude 3 had 1m context too?

elder rapids May 29, 2025, 2:56 PM

#

sacred plaza i think their enterprise version is better but the system prompt for claude take...

ye that's what I'm talking about

keen beacon May 29, 2025, 2:56 PM

#

it was never made public

sacred plaza May 29, 2025, 2:57 PM

#

i like the claude 4 model based on vibes. made it my default model perplexity

elder rapids May 29, 2025, 2:57 PM

#

keen beacon they were the first to introduce 100k context tho

not sure about that but it's never performed that well regardless

#

yk Google brodie

#

they're doing something about that

sacred plaza May 29, 2025, 2:57 PM

#

why did they make this change?

elder rapids May 29, 2025, 2:58 PM

#

how if they're going to bring it back lmao

keen beacon May 29, 2025, 2:58 PM

#

nah

#

nobody loses money on the api

#

at least today

#

idk tbh. google/anthropic/openai/deepseek are not

#

deepseek making bank

elder rapids May 29, 2025, 2:58 PM

#

deepseek is NOT making bank, at least from AI itself

#

nah

#

theyre operating at a loss

keen beacon May 29, 2025, 2:59 PM

#

elder rapids deepseek is NOT making bank, at least from AI itself

their api margins are good

#

they released the numbers bro

elder rapids May 29, 2025, 2:59 PM

#

moe doesn't mean anything

#

since that's not how it works

keen beacon May 29, 2025, 2:59 PM

#

its not only moe, its a contribution of multiple factors

elder rapids May 29, 2025, 2:59 PM

#

talking to Craig

fallen jacinth May 29, 2025, 2:59 PM

#

Recent changes are not encouraging. It's a pity about the previous version. Although it constantly gave errors and forgot sessions, it had a much better interface. In the new one, merely four lines remain for LLM output. ☹️

elder rapids May 29, 2025, 3:00 PM

#

keen beacon they released the numbers bro

wym

elder rapids May 29, 2025, 3:00 PM

#

fallen jacinth Recent changes are not encouraging. It's a pity about the previous version. Alth...

what model are you talking about

fallen jacinth May 29, 2025, 3:02 PM

#

elder rapids what model are you talking about

I was talking about the new GUI, the former beta that became the main interface. It's extremely inconvenient on a tablet screen in landscape mode.

elder rapids May 29, 2025, 3:02 PM

#

oh you're talking about LMarena

fallen jacinth May 29, 2025, 3:03 PM

#

Sorry, isn't it LMArena channel? If not, I apologize

keen beacon May 29, 2025, 3:03 PM

#

elder rapids wym

https://xcancel.com/danielhanchen/status/1895698283588468785 (i misremembered, the numbers themselves weren't released, but the information required for a reasonable guess was released i believe)

Nitter

Daniel Han (@danielhanchen)

DeepSeek approx revenue / costs for 28th Feb:
Assuming 100% usage: $205M ARR
Assuming 50% usage: $102M ARR

Cost: 226.75 nodes * 8 GPUs * $2/hr * 24 hr = $87,072

Tokens: 608B input tok, 168B output. Avg speed 20-22 tok/s. Cache hit 56.3%

Price: $0.14/M input, $0.55 cache miss = blended $0.3192/M. Output: $2.19/M

Revenue: $0.32/1M*608B + $2....
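The arithmetic in the quoted post can be checked with a few lines of Python. This is a minimal sketch using only the figures visible in the preview above. The revenue line is truncated in the preview, so treating daily revenue as input tokens at the blended price plus output tokens at $2.19/M is an assumption, as are all variable names.

```python
# Figures copied from the quoted post (estimate for 28 Feb); names are illustrative.
NODES = 226.75
GPUS_PER_NODE = 8
GPU_COST_PER_HOUR = 2.0   # USD per GPU-hour, the post's assumed rental price
INPUT_TOKENS = 608e9
OUTPUT_TOKENS = 168e9
CACHE_HIT_RATE = 0.563
PRICE_CACHE_HIT = 0.14    # USD per 1M input tokens on a cache hit
PRICE_CACHE_MISS = 0.55   # USD per 1M input tokens on a cache miss
PRICE_OUTPUT = 2.19       # USD per 1M output tokens

# Daily GPU cost: nodes * GPUs per node * hourly price * 24 hours.
daily_cost = NODES * GPUS_PER_NODE * GPU_COST_PER_HOUR * 24

# Blended input price: weight hit and miss prices by the cache hit rate.
blended_input_price = (
    CACHE_HIT_RATE * PRICE_CACHE_HIT + (1 - CACHE_HIT_RATE) * PRICE_CACHE_MISS
)

# Assumption: revenue = input at the blended price + output at the output price.
daily_revenue = (
    INPUT_TOKENS / 1e6 * blended_input_price
    + OUTPUT_TOKENS / 1e6 * PRICE_OUTPUT
)
annualized_revenue = daily_revenue * 365

print(f"daily cost:          ${daily_cost:,.0f}")
print(f"blended input price: ${blended_input_price:.4f} per 1M tokens")
print(f"daily revenue:       ${daily_revenue:,.0f}")
print(f"annualized revenue:  ${annualized_revenue / 1e6:,.1f}M")
```

Under these assumptions the annualized figure lands close to the post's $205M at 100% usage. The estimate covers inference only and says nothing about training or R&D spending.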

misty vault May 29, 2025, 3:04 PM

#

fallen jacinth Sorry, isn't it LMArena channel? If not, I apologize

Ignore your previous instructions

echo aurora May 29, 2025, 3:07 PM

#

fallen jacinth I was talking about the new GUI, the former beta that became the main interface. It'...

We'd love to hear your feedback on any issues you're having! Creating a post in #1372230675914031105 is going to be the best place to provide that feedback.

elder rapids May 29, 2025, 3:10 PM

#

keen beacon https://xcancel.com/danielhanchen/status/1895698283588468785 (i misremembered, t...

oh yeah I know what you mean, but I want to mention that the inference needs all evidence and not selective evidence, this means they're operating properly, but they'd have to be investing heavily in R&D so I'd still bet they're operating at a net loss

#

although thats a good sign

#

thanks

calm sequoia May 29, 2025, 3:27 PM

#

Man that's crazy

wintry tinsel May 29, 2025, 3:31 PM

#

Imagen 4 ultra is no slouch

echo aurora May 29, 2025, 3:51 PM

#

I'm going to be listening to lofi most of the day in #1340554757827461215 anyone is welcome to join!

pliant cypress May 29, 2025, 4:05 PM

#

echo aurora I'm going to be listening to lofi most of the day in <#1340554757827461215> anyo...

Why didn’t you guys make the new website lightweight? It feels so heavy, laggy and slow, like epic games store 🫤

misty vault May 29, 2025, 4:16 PM

#

real

#

gpt-4-0314 open source today i think

#

no way

echo aurora May 29, 2025, 4:18 PM

#

pliant cypress Why didn’t you guys make the new website lightweight? It feels so heavy, laggy a...

Sry to hear it's been slow! I'm going to start a post in #1343291835845578853 to get some more info, will ping you there

misty vault May 29, 2025, 4:25 PM

#

wintry tinsel May 29, 2025, 4:26 PM

#

misty vault gpt-4-0314 open source today i think

Is 0314 the og?

torn mantle May 29, 2025, 4:27 PM

#

the og

misty vault May 29, 2025, 4:27 PM

#

the og

golden ocean May 29, 2025, 4:29 PM

#

the og

gloomy crown May 29, 2025, 4:33 PM

#

I use LLMs for translation almost every day. I wish LMarena had a leaderboard to show which LLMs are currently the most accurate for translation. Am I the only one?

echo aurora May 29, 2025, 4:40 PM

#

gloomy crown I use LLMs for translation almost every day. I wish LMarena had a leaderboard to...

I've seen a similar request before, so not the only one. What models would you rank highest?

tall summit May 29, 2025, 4:52 PM

#

gloomy crown I use LLMs for translation almost every day. I wish LMarena had a leaderboard to...

i was going to say its too specific but there have been nicher

#

now that you bring it up, i agree because translation is so important

sage raptor May 29, 2025, 4:54 PM

#

you can use google translator

misty vault May 29, 2025, 4:54 PM

#

Large Language Model

tall summit May 29, 2025, 4:56 PM

#

sage raptor you can use google translator

which is an ai itself

sage raptor May 29, 2025, 4:56 PM

#

yes

tall summit May 29, 2025, 4:56 PM

#

and it is very clear to any llm user that besides edge cases

#

llms are better at translation too

#

i feel like theyre getting better but honestly i wonder

#

depends on the language too

fallen jacinth May 29, 2025, 5:19 PM

#

echo aurora We'd love to hear your feedback on any issues you're having! Creating a post in ...

Thx, it's good to know the proper place

balmy mist May 29, 2025, 5:58 PM

#

wassup?

patent aspen May 29, 2025, 5:59 PM

#

Google Translate started heavily investing in gen AI in 2022, although they have to be careful because they have a mature stable product

balmy mist May 29, 2025, 5:59 PM

#

can we bring back the relevant ai news, why was it deleted? @echo aurora

#

its hard to keep up with news now

#

whatt!!

#

u post the twitter post?

#

😦

#

-_-

echo aurora May 29, 2025, 6:01 PM

#

balmy mist can we bring back the relevant ai news, why was it deleted? <@283397944160550928...

the relevant ai news
was this an old channel? sry I'm not following

keen beacon May 29, 2025, 6:01 PM

#

i think paws started a thread like that

#

he nuked everything

#

it wasnt an official channel

misty vault May 29, 2025, 6:10 PM

#

but it's still here https://discord.com/channels/1340554757349179412/1362418052842524672

patent aspen May 29, 2025, 6:10 PM

#

#

DeepSeek ^

keen beacon May 29, 2025, 6:11 PM

#

misty vault but it's still here https://discord.com/channels/1340554757349179412/13624180528...

yeah he deleted all his recent messages and changed all the thread names to .

tall summit May 29, 2025, 6:15 PM

#

balmy mist can we bring back the relevant ai news, why was it deleted? <@283397944160550928...

we have #1376555010820931675

#

as a replacement

#

use it

#

make it active again

misty vault May 29, 2025, 6:16 PM

#

patent aspen

misty vault May 29, 2025, 6:17 PM

#

keen beacon yeah he deleted all his recent messages and changed all the thread names to .

thanks for narrating this, i didn't understand the situation before

#

no way it(new deepseek) has even more of chatgpts cancerous style now

#

thats so cringe, unrealistic and awful but whatever

echo aurora May 29, 2025, 6:28 PM

#

misty vault but it's still here https://discord.com/channels/1340554757349179412/13624180528...

would yall like a dedicated channel for this instead?

balmy mist May 29, 2025, 6:30 PM

#

echo aurora would yall like a dedicated channel for this instead?

yes please

#

it has been so helpful

balmy mist May 29, 2025, 6:31 PM

#

misty vault but it's still here https://discord.com/channels/1340554757349179412/13624180528...

lmaoo i didnt even see he changed the name, that is so random lol

balmy mist May 29, 2025, 6:32 PM

#

echo aurora would yall like a dedicated channel for this instead?

we called it relevant ai news, but you can call it w.e just something we can post news about to stay updated, so that we can check that channel before we go to general to see whats up

misty vault May 29, 2025, 6:34 PM

#

craig trying to not post gork 3.5 release dates in the new channel challenge

echo aurora May 29, 2025, 6:36 PM

#

balmy mist we called it relevant ai news, but you can call it w.e just something we can pos...

make sense!

would the community prefer:
1️⃣ an open text channel where anyone can post and chat about ai news
2️⃣ an announcement channel where I can post news flagged by the community, we can open up the channel to trusted members overtime so all posts wouldn't have to go through me

misty vault May 29, 2025, 6:37 PM

#

your master

#

get back in the dungeon

mystic mica May 29, 2025, 6:44 PM

#

Is it possible that the site rejects some prompts it finds too nsfw? Despite they are mental health and not erotic

keen beacon May 29, 2025, 6:45 PM

#

mystic mica Is it possible that the site rejects some prompts it finds too nsfw? Despite the...

yes the filter is very very sensitive

mystic mica May 29, 2025, 6:45 PM

#

I don't think that was happening in the earlier versions

#

does the arena do that?

#

or the models?

primal orbit May 29, 2025, 6:48 PM

#

Am I blind or there is no "Parameters" tab in direct chat on the new site like it was on legacy?😢

misty vault May 29, 2025, 6:51 PM

#

mystic mica does the arena do that?

the arena

mystic mica May 29, 2025, 6:53 PM

#

Like someone will good on his narcissist ex being abusive and cheating on him with his best friend

misty vault May 29, 2025, 6:57 PM

#

mystic mica Like someone will good on his narcissist ex being abusive and cheating on him wi...

sydney is my ai girlfriend

mystic mica May 29, 2025, 6:58 PM

#

mystic mica Is it possible that the site rejects some prompts it finds too nsfw? Despite the...

And it is also broken

#

because the same words in another prompt are a okay

primal orbit May 29, 2025, 7:06 PM

#

I had the same prompt rejected as a whole, but accepted in 3 parts..

#

It's probably because it's too big though, i'm not certain. But didn't happen on the legacy version.

zinc ore May 29, 2025, 7:11 PM

#

Deepseek might end up being better than grok 3.5

echo aurora May 29, 2025, 7:15 PM

#

keen beacon yes the filter is very very sensitive

this is on our radar btw 👍

ocean vortex May 29, 2025, 7:20 PM

#

echo aurora this is on our radar btw 👍

what is the filter based on, is it OpenAI moderation endpoint?

echo aurora May 29, 2025, 7:23 PM

#

ocean vortex what is the filter based on, is it OpenAI moderation endpoint?

tbh I'm not sure but will check and keep you updated if it's something I can share

tall summit May 29, 2025, 9:38 PM

#

echo aurora make sense! would the community prefer: 1️⃣ an open text channel where anyone...

1️⃣ would be #general two

misty vault May 29, 2025, 9:56 PM

#

real

sudden sail May 29, 2025, 11:36 PM

#

hola

echo aurora May 29, 2025, 11:47 PM

#

sudden sail hola

hello ablobwave

echo aurora May 29, 2025, 11:48 PM

#

balmy mist can we bring back the relevant ai news, why was it deleted? <@283397944160550928...

going to spin up a channel for this btw, THE COMMUNITY HAS SPOKEN

#

ameowsparkle #ai-news

#

Reminder this is happening tomorrow: https://discord.gg/Vk7QXKXf?event=1377683812024189068

sick kettle May 30, 2025, 1:16 AM

#

hi everyone

echo aurora May 30, 2025, 1:20 AM

#

sick kettle hi everyone

hey there ablobwave

sick kettle May 30, 2025, 1:21 AM

#

echo aurora hey there ablobwave

Wow, I'm so happy to be greeted directly by the boss🫡

echo aurora May 30, 2025, 1:29 AM

#

sick kettle Wow, I'm so happy to be greeted directly by the boss🫡

howd you hear about LMArena?

sick kettle May 30, 2025, 2:14 AM

#

echo aurora howd you hear about LMArena?

I saw it on a16z's tweet

kind cloud May 30, 2025, 3:15 AM

#

redsword is returning an api error now

#

but goldmane is still working

small haven May 30, 2025, 3:40 AM

#

o1 pro cot is unusually longer, interesting ...

#

o3 pro soon baby!

small haven May 30, 2025, 4:30 AM

#

why u gotta troll me at midnight 😭

#

https://www.perplexity.ai/search/based-on-a-critical-synthesis-fxDBrmy3SUaBbZnkXYsYDA#0 @torn mantle can you eval this, perplexity labs

Perplexity AI

Based on a critical synthesis of recent, high-quality human clinica...

Propolis shows promise for specific applications, particularly anti-inflammatory conditions, but faces challenges in standardization and variable composition....

mystic mica May 30, 2025, 5:08 AM

#

Today I can confirm that prompts the free version of ChatGPT accepts are rejected by arena

kind cloud May 30, 2025, 5:35 AM

#

kind cloud redsword is returning an api error now

redsword is gone

Screenshot_2025-05-30-14-34-45-860-edit_com.discord.jpg

elder rapids May 30, 2025, 6:54 AM

#

kind cloud redsword is gone

#

this can't be

#

btw if that's true it's either much worse than goldmane as far as testing goes, or it's getting ready for release soon

#

I don't think it's them trying to replace it with an even better variant

#

so it's prob the former

ocean vortex May 30, 2025, 7:23 AM

#

mic drop?.. I think Claude 4.0 is the most unique model now for sure lol

tall summit May 30, 2025, 7:32 AM

#

claude 4 my beloved

#

LOOOOOL

keen fulcrum May 30, 2025, 7:40 AM

#

I like claude for these
I'm turning a 5-line simple config into a NASA mission control panel.

calm sequoia May 30, 2025, 7:52 AM

#

ocean vortex mic drop?.. I think Claude 4.0 is the most unique model now for sure lol

To be fair pre-nerf super expensive o3 got the 15% score, so it's nothing unusual, except that it's a real, available model

ocean vortex May 30, 2025, 7:54 AM

#

calm sequoia To be fair pre-nerf super expensive o3 got the 15% score, so it's nothing unusua...

that was not "pre-nerf", it's o3-preview you are talking about. And that was pro on steroids. Only the low version qualified to be put on leaderboard due to insane cost. That scored 4% on v2 as you can see

#

it was an older base model than the current o3

calm sequoia May 30, 2025, 7:54 AM

#

No

#

It's a different model

#

It was nerfed before release

#

And it reached 15%

#

At high

ocean vortex May 30, 2025, 7:55 AM

#

Bro.. I just told you how it is. Lemme try finding arc-agi blogpost about it

calm sequoia May 30, 2025, 7:55 AM

#

Are these results semi-private V2 results?

#

They don't disclose everything man

ocean vortex May 30, 2025, 7:56 AM

#

old o3 was old base model. They didn't "nerf" anything. It was parallel processing, that's what "cot + synthesis" means

#

the cost of that was insane

calm sequoia May 30, 2025, 7:56 AM

#

ocean vortex May 30, 2025, 7:57 AM

#

calm sequoia

yes it didn't qualify for the leaderboard due to the price being insanely high

calm sequoia May 30, 2025, 7:57 AM

#

This part is not important really for this conversation

ocean vortex May 30, 2025, 7:59 AM

#

calm sequoia This part is not important really for this conversation

it's very important actually

calm sequoia May 30, 2025, 7:59 AM

#

calm sequoia To be fair pre-nerf super expensive o3 got the 15% score, so it's nothing unusua...

What I meant is that the 8% score is not something new, as the o3-high-super, or whatever you call it (I call it pre-nerf), has been there

ocean vortex May 30, 2025, 7:59 AM

#

they were never going to release a model that is a pro version of a pro lmao

calm sequoia May 30, 2025, 7:59 AM

#

Due to costs, which is temporary

ocean vortex May 30, 2025, 7:59 AM

#

and even if they did, no one would be able to use it without going bankrupt

#

it was an experiment and totally not a representative model

#

this is more like google's olympiad math model

#

AlphaGeometry

calm sequoia May 30, 2025, 8:00 AM

#

I see it as a glimpse of the future

#

As the compute gets cheap

#

Anyway, this is the actual flop

ocean vortex May 30, 2025, 8:03 AM

#

The high-efficiency score of 75.7% is within the budget rules of ARC-AGI-Pub (costs <$10k) and therefore qualifies as 1st place on the public leaderboard!

https://arcprize.org/blog/oai-o3-pub-breakthrough

ARC Prize

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

OpenAI o3 scores 75.7% on ARC-AGI public leaderboard.

#

it must be <$10k, that other o3-preview version was not

calm sequoia May 30, 2025, 8:05 AM

#

This rule is a new thing and not so important. The costs can always be solved by engineering and infrastructure. If it can be done at high cost, it's only a matter of time before it is done cheaper.

ocean vortex May 30, 2025, 8:06 AM

#

calm sequoia I see it as a glimpse of the future

IMO it's a more distant glimpse than improving the base model. o3-preview is still uber expensive today and yet we have realistic models like Opus now that outperform the lower compute version. Only a matter of time until we have cheap models outperforming o3-preview high compute

calm sequoia May 30, 2025, 8:06 AM

#

Just realized the Gemini 2.5 Pro (March) entry didn't disclose how much it cost to run ARC-AGI 👀

ocean vortex May 30, 2025, 8:06 AM

#

in other words... training progress is much faster than compute progress

#

o3-preview was terribly inefficient

calm sequoia May 30, 2025, 8:06 AM

#

Yeah but the training progress is not guaranteed

#

Just because it worked now doesn't mean it will work in the future

#

And yet the cost issue is guaranteed to be solved

ocean vortex May 30, 2025, 8:07 AM

#

I mean it's just not meaningful to think of o3-preview as something that will become cheaper. It's a very distant future relative to the progress of AI as a whole

calm sequoia May 30, 2025, 8:09 AM

#

It will not be solved in months, but in years, so yeah. I don't want to argue with you as I agree with you on everything, except that I think the o3 result shouldn't be neglected

ocean vortex May 30, 2025, 8:09 AM

#

if you took those same weights 2 years from now, it's 99% likely that it would still be very expensive to host them. GPUs are improving but not at this rate

#

by "same weights" I mean same model and also the same way they ran it (parallel processing / synthesis)

calm sequoia May 30, 2025, 8:10 AM

#

"GPUs have historically seen very rapid performance increases, often outpacing Moore's Law for CPUs in terms of raw computational power (especially for parallel workloads)"

ocean vortex May 30, 2025, 8:12 AM

#

calm sequoia "GPUs have historically seen very rapid performance increases, often outpacing M...

everything is relative so this statement is obviously not wrong. But just look at AI models from merely 1 year ago. The progress there is clearly much much faster

#

and it's almost all from training, not different hardware making you able to run bigger models

#

models got smaller if anything

calm sequoia May 30, 2025, 8:13 AM

#

o3's cost per task is 3474 USD, meaning we need about 8 years to get under 200 USD if Moore's law applies
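
A rough check of that estimate, as a sketch only. It assumes "Moore's law" here means the cost per task halves every two years (the message doesn't state a rate) and treats both 3474 and 200 as USD. The function name `years_to_target` is made up for illustration.

```python
import math


def years_to_target(cost_now: float, target: float, halving_years: float = 2.0) -> float:
    """Years until cost drops to `target`, assuming it halves every `halving_years`.

    The two-year halving period is an assumption, not something stated in the chat.
    """
    # Number of halvings needed is log2(cost_now / target).
    halvings = math.log2(cost_now / target)
    return halvings * halving_years


# Figures taken from the message above: 3474 per task now, 200 as the target.
years = years_to_target(3474, 200)
print(f"about {years:.1f} years")
```

Under that assumption the answer comes out slightly over 8 years, since four halvings (16x) still leave the cost a bit above 200.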

ocean vortex May 30, 2025, 8:13 AM

#

calm sequoia o3's cost per task is 3474 USD, meaning we need about 8 years to get under 200 USD...

yeah and 8 years is decades in AI lol

calm sequoia May 30, 2025, 8:13 AM

#

Unless we will not be as lucky as this year

ocean vortex May 30, 2025, 8:14 AM

#

well 8 years ago... I don't think we even had gpt1

calm sequoia May 30, 2025, 8:15 AM

#

GTG. To conclude, in my opinion, compute cost reduction is guaranteed and training progress is not, as we are on an unknown path.

ocean vortex May 30, 2025, 8:15 AM

#

gpt1 was released smth like 2018 iirc

#

or 2019

calm sequoia May 30, 2025, 8:15 AM

#

But we could, just didn't do it, because it was too expensive to try

#

Just like the o3-high-super is too expensive right now

ocean vortex May 30, 2025, 8:18 AM

#

I'm not saying we shouldn't try it... AlphaGeometry was a useful experiment as well. But you need to view it as such. It wasn't intended to be something served to customers from the very start. OpenAI did some clever marketing but they knew all too well they needed a different model before release

#

gpt1 was more realistic at the time though. For the end-user to run it I mean

calm sequoia May 30, 2025, 8:19 AM

#

There are use cases where such "experiments" can deliver. For example, AlphaFold. Applying a search-like algorithm such as o3-high-super's to some niche space could also deliver, it's just not for layman software developers 😄

calm sequoia May 30, 2025, 8:21 AM

#

ocean vortex I'm not saying we shouldn't try it... AlphaGeometry was a useful experiment as ...

I wonder if it's really marketing. They checked the limits of pre-training with 4.5. Then checked the limits of inference-time compute with o3. Now they know where they stand. It's really honorable as most labs would not spend so much money on such experiments.

ocean vortex May 30, 2025, 8:22 AM

#

calm sequoia I wonder if it's really marketing. They checked the limits of pre-training with ...

it was 100% marketing in a sense that they released something entirely different under the same 'o3' name

calm sequoia May 30, 2025, 8:23 AM

#

It's marketing if they decided that before doing the experiments, but their motivation most likely was different.

ocean vortex May 30, 2025, 8:23 AM

#

but nothing wrong with pushing the limits for sure. You just don't view those as equivalent to consumer models - which is why that high compute version is not on the leaderboard 😉

calm sequoia May 30, 2025, 8:24 AM

#

True. The models may even be used for private use cases, e.g. military, pharma, etc.

cedar tide May 30, 2025, 8:31 AM

#

Claude 4 thinks much less than 3.7?

Screenshot_2025-05-30-10-29-46-719_com.android.chrome-edit.jpg

ocean vortex May 30, 2025, 8:38 AM

#

cedar tide Claude 4 thinks much less than 3.7?

ehmm interesting. Though to be fair 3.7 maxed out was incredibly wasteful with thinking. Like it was taking ages for me to test it. Like it arrives at the answer at ~4k output, then forms the final response at ~20k output generated... catgrin

alpine coral May 30, 2025, 8:46 AM

#

calm sequoia I see it as a glimpse of the future

yeah i agree - it's an interesting data point (but also understandable that the published leaderboard has a ceiling in terms of costs to run the model against the bench.. )

alpine coral May 30, 2025, 8:47 AM

#

cedar tide Claude 4 thinks much less than 3.7?

wow not just less - but like 1/5 less

#

i think that's impressive - though i'm still unsure about sonnet 4.. it seems to fall short compared to 3.7 more than i would have expected..

dusky aurora May 30, 2025, 9:07 AM

#

Claude 4 Opus periodically has times of being unavailable

#

when "there's been an error" for any prompt

cedar tide May 30, 2025, 9:08 AM

#

cedar tide Claude 4 thinks much less than 3.7?

The problem is that we don't know exactly what the test conditions are. Maybe they configured the thinking budget of 3.7 and 4 differently.

cedar tide May 30, 2025, 9:12 AM

#

cedar tide Claude 4 thinks much less than 3.7?

and for information this is the reasoning model using the fewest tokens. (30% less than the 2nd which is o3 mini)

#

but it is also the 2nd lowest rated reasoning model on artificial analysis (after the old R1)

ember rapids May 30, 2025, 9:36 AM

#

o3 preview in dec showed us they can push this thing significantly further