#general | Arena | Page 46

elder rapids May 23, 2025, 5:57 PM

#

ye

patent aspen May 23, 2025, 5:57 PM

#

I think an issue for xAI is that they don't have enough people to do deep R&D, hill climbing, and the mundane compliance and administrative stuff at the same time as well as its competitors

elder rapids May 23, 2025, 5:57 PM

#

but not a lot of people will be able to test it

#

like o1 pro, theres no reason to talk about it

hollow ocean May 23, 2025, 5:57 PM

#

elder rapids but not a lot of people will be able to test it

125 first 3 months isn’t bad

#

It’s one day of work

patent aspen May 23, 2025, 6:00 PM

#

They might be able to get a lot from brute force hardware, but they wouldn't have enough people leftover to hill climb across everything

#

Similar to Anthropic

teal mantle May 23, 2025, 6:01 PM

#

I remember the inaccurate leaked specs of Grok 3.5
It sounds too optimistic
Especially now we know it would beat Opus 4

keen beacon May 23, 2025, 6:01 PM

#

someone on this discord made fake benchmarks then elon rt'd it (and subsequently deleted it later)

elder rapids May 23, 2025, 6:01 PM

#

crazy tbh

#

I wonder how many people are actually in here

teal mantle May 23, 2025, 6:01 PM

#

keen beacon someone on this discord made fake benchmarks then elon rt'd it (and subsequently...

Too good to be true, considering a retrospective comparison with the Opus specs

keen beacon May 23, 2025, 6:02 PM

#

opus wasnt a fresh pretrain, it seems like a last ditch effort to salvage opus 3.5. but just speculating

teal mantle May 23, 2025, 6:02 PM

#

I don’t doubt xAI could outrun Anthropic but the numbers are implausible

teal mantle May 23, 2025, 6:03 PM

#

keen beacon opus wasnt a fresh pretrain, it seems like a last ditch effort to salvage opus 3...

Honestly only with proprietary tech they could troll us like this if true

But Opus is too emergent

keen beacon May 23, 2025, 6:03 PM

#

teal mantle Honestly only with proprietary tech they could troll us like this if true But O...

i dont understand?

elder rapids May 23, 2025, 6:03 PM

#

prolly in two weeks

hollow drift May 23, 2025, 6:03 PM

#

So you all don’t think anything will beat out Gemini 2.5 pro on the leaderboard by the end of the month?

keen beacon May 23, 2025, 6:04 PM

#

the ga revision probably not

#

probably very unlikely

#

claude models dont usually perform that well in the arena either

elder rapids May 23, 2025, 6:04 PM

#

hollow drift So you all don’t think anything will beat out Gemini 2.5 pro on the leaderboard ...

no def not

teal mantle May 23, 2025, 6:04 PM

#

keen beacon i dont understand?

I mean if it is salvaged 3.5 it will be quite a troll
Only possible under proprietary environment

elder rapids May 23, 2025, 6:04 PM

#

the gap is MASSIVE

alpine coral May 23, 2025, 6:05 PM

#

i have yeah

#

opus 4 is very good

elder rapids May 23, 2025, 6:05 PM

#

ye

#

pretty likely

alpine coral May 23, 2025, 6:05 PM

#

i really don't know what to make of sonnet 4 tho..

teal mantle May 23, 2025, 6:05 PM

#

I think o4 full sized might come firsthand

elder rapids May 23, 2025, 6:06 PM

#

yet everyone just a couple months ago swore Google never benchmark maxed

#

along with anthropic

#

I also mean for lmarena

keen beacon May 23, 2025, 6:06 PM

#

o3 was probably done more recently than you think

elder rapids May 23, 2025, 6:06 PM

#

it doesn't seem like they're benchmark maxing

keen beacon May 23, 2025, 6:07 PM

#

yeah

#

likely

#

their cpt wasnt done yet probably

#

the cpt was a company wide effort i think

#

4o image gen is on the new 4.1 base model, it would seem

#

very likely 4.1 mini

#

chatgpt 4o latest uses the 4.1 base model and predated 4.1

#

since jan as public api snapshots

torn mantle May 23, 2025, 6:08 PM

#

redsword seems to fail a lot when using threejs

keen beacon May 23, 2025, 6:08 PM

#

chatgpt in dec with some tests

#

@alpine coral what base model do you think they used otherwise lol

alpine coral May 23, 2025, 6:10 PM

#

keen beacon @alpine coral what base model do you think they used otherwise lol

i think it's a derivative of o4 - so whatever o4 is based on

keen beacon May 23, 2025, 6:10 PM

#

they arent doing a fresh pretrain for that

teal mantle May 23, 2025, 6:10 PM

#

Btw is Grok 3.5 still a vaporware

elder rapids May 23, 2025, 6:11 PM

#

keen beacon they arent doing a fresh pretrain for that

for a smaller model they likely are

keen fulcrum May 23, 2025, 6:11 PM

#

Grok 3.5 better than claude 4 sonnet

keen beacon May 23, 2025, 6:11 PM

#

alpine coral i think it's a derivative of o4 - so whatever o4 is based on

wow o4 dropped 29 simpleqa points

#

it just does not make sense

alpine coral May 23, 2025, 6:12 PM

#

fair enough

#

i don't have any profound insight here lol

#

it's just o4-mini

teal mantle May 23, 2025, 6:12 PM

#

keen fulcrum Grok 3.5 better than claude 4 sonnet

But it is unreleased

keen beacon May 23, 2025, 6:12 PM

#

seriously the naming is arbitrary

alpine coral May 23, 2025, 6:13 PM

#

i disagree

teal mantle May 23, 2025, 6:13 PM

#

But then didn’t arguably the fake bench derail the 3.5 release?

alpine coral May 23, 2025, 6:13 PM

#

but perhaps i'm mistaken

keen beacon May 23, 2025, 6:13 PM

#

alpine coral but perhaps i'm mistaken

you think o4 mini absolutely necessitates o4, but it doesn't. but that doesn't mean o4 doesn't exist

alpine coral May 23, 2025, 6:14 PM

#

keen beacon you think o4 mini absolutely necessitates o4, but it doesn't. *but* that doesn't...

i think o4 exists

#

plain and simple lol

keen beacon May 23, 2025, 6:14 PM

#

it might or might not be

#

it doesn't matter

alpine coral May 23, 2025, 6:14 PM

#

yeah i ofc don't know

keen beacon May 23, 2025, 6:14 PM

#

o4 mini existing doesnt necessitate it

alpine coral May 23, 2025, 6:14 PM

#

precedent would at least be strongly indicative

keen fulcrum May 23, 2025, 6:15 PM

#

O5 will be great, tbd at end of the year

patent aspen May 23, 2025, 6:15 PM

#

Google has both of the top 2 AI organizations in the world

#

Merged into one

keen beacon May 23, 2025, 6:16 PM

#

alpine coral precedent would at least be strongly indicative

past precedent of ai companies naming their models whatever is strongly indicative too

keen fulcrum May 23, 2025, 6:18 PM

#

There is Google DeepMind

patent aspen May 23, 2025, 6:18 PM

#

Yes, prior to Gemini, Google Brain and DeepMind published some wild percentage of all NeurIPS papers. 80-90% of the seminal AI research papers from the past decade came from those 2 organizations

misty vault May 23, 2025, 6:18 PM

#

.

torn mantle May 23, 2025, 6:19 PM

#

keen fulcrum There is Google DeepMind

and there is you

#

hyping up grok 3.5

keen fulcrum May 23, 2025, 6:20 PM

#

torn mantle and there is you

And I will hype up grok 4

torn mantle May 23, 2025, 6:20 PM

#

what about grok 5

keen fulcrum May 23, 2025, 6:21 PM

#

Tbd

alpine coral May 23, 2025, 6:21 PM

#

keen beacon past precedent of ai companies naming their models whatever is strongly indicati...

apart perhaps from grok i'm not really sure which you're referring to (in terms of the mini variants not being downsized/distilled descendants of the 'full' model)

keen fulcrum May 23, 2025, 6:21 PM

#

ChatGPT could have been created a decade ago, technology wasn't advanced for a purpose

patent aspen May 23, 2025, 6:22 PM

#

keen fulcrum ChatGPT could have been created a decade ago, technology wasn't advanced for a p...

A decade ago was 2015. The deep learning research necessary to create it had not been published yet

#

Attention is All You Need was published by Google in 2017

#

That's where transformers originated

keen fulcrum May 23, 2025, 6:25 PM

#

Indeed, not as much R&D capital was moved as needed

keen beacon May 23, 2025, 6:30 PM

#

alpine coral apart from perhaps from grok i'm not really sure which you're referring to (in t...

1.5 pro was a fresh pretrain iirc. 2.0 pro was a fresh pretrain. 2.5 pro is a cpt/etc.
claude 3 sonnet (fresh pretrain) -> claude 3.5 sonnet (fresh pretrain, not the same size compared to claude 3 sonnet as it was increased, there was a piece of anthropic media that stated this/i don't have it anymore) -> claude 4 sonnet (cpt, timeline is too short for a fresh pretrain imho, even more likely the case for opus 3.5. semianalysis reported about 3.5 opus's existence too, i think they salvaged 3.5 opus.)
4o -> 4.5 -> 4.1, despite chatgpt 4o (jan+) = 4.1 lol
4o image gen being on 4.1 (not too sure, as there could be additional post processing)

i hope i dont have to list more because they are wack

#

gemini 1.5 flash -> gemini 2.0 flash lite (probably same size, something about it in a google blog somewhere)
new model size -> gemini 2.0 flash

alpine coral May 23, 2025, 6:31 PM

#

i was about to say "if you were talking about 1.5 pro and 1.5 flash i'd understand it a bit more " ha

small haven May 23, 2025, 6:32 PM

#

so what was it based on before 4o?

alpine coral May 23, 2025, 6:32 PM

#

i mean i'm sure this isn't perfectly accurate, but it conveys where i'm coming from

keen beacon May 23, 2025, 6:33 PM

#

yeah thats just WAY too many assumptions

alpine coral May 23, 2025, 6:33 PM

#

again, fair enough 🙂

keen beacon May 23, 2025, 6:33 PM

#

imho, they could be true, but its just way too much

#

if you look at 4.1 (they spent several months on this, they'll probably use this for the foreseeable future. mid train of 4o, confirmed outright in a podcast) and 4.1 mini (recent fresh pretrain, you can hear openai employees talk about this). they would try to squeeze out as much api value as they can (e.g. instruct models), so i doubt they have actual different internal models (that don't originate from them two) that they would specifically use for o series

#

this is the most obvious thing about it. then additionally via pretraining probing, it seems highly likely that o4 mini uses 4.1 mini as a base model, you might be able to actually prove it tbh but worthless endeavor

alpine coral May 23, 2025, 6:36 PM

#

i dunno man.. but i find it weird to think o4-mini is based off 4.1-mini

#

i just find it more likely that o4-mini is based off / distilled from o4 in the same way o3-mini from o3, and o1-mini from o1.. (and similarly for 4o-mini, it's a distillation of 4o)

keen beacon May 23, 2025, 6:39 PM

#

alpine coral i just find it more likely that o4-mini is based off / distilled from o4 in the s...

maybe if they didnt decide to retrain o3 lol

alpine coral May 23, 2025, 6:39 PM

#

afaik that relationship between the preceding o series isn't disputed

#

i'm just going by the past..

#

not cpt dates

keen beacon May 23, 2025, 6:39 PM

#

alpine coral afaik that relationship between the preceding o series isn't disputed

youre making it a rule when it isnt really a rule tbh

#

the naming is mostly marketing

alpine coral May 23, 2025, 6:40 PM

#

i'm not saying it's 1000% or iron clad

#

you're strawmanning my claim

keen beacon May 23, 2025, 6:40 PM

#

the development cycle matches with o3, if they were to distill anything it would be o3. they might've done preliminary stuff with o4 etc but i dont see it being the case that they could've distilled o4 already

alpine coral May 23, 2025, 6:40 PM

#

keen beacon the naming is mostly marketing

i would generally agree with this statement

keen beacon May 23, 2025, 6:42 PM

#

alpine coral i dunno man.. but i find it weird to think o4-mini is based off 4.1-mini

try to explain it otherwise, it just doesnt make sense. if they used the same base model as o3 or o4, surely, the simpleqa shouldn't have dropped that hard

alpine coral May 23, 2025, 6:43 PM

#

i honestly don't know.. it's hard to talk about benchmarks when one of models is unreleased

#

maybe o4 sucks (compared to mini.. based on costs)

keen beacon May 23, 2025, 6:43 PM

#

you think theres a possibility that o4 has 20% on simpleqa

alpine coral May 23, 2025, 6:43 PM

#

and its non-release is as simple as that lol

alpine coral May 23, 2025, 6:43 PM

#

keen beacon you think theres a possibility that o4 has 20% on simpleqa

i have no idea

keen beacon May 23, 2025, 6:45 PM

#

its just a coincidence that the benchmarks align with 4.1 mini... they would spend several months working on 4.1 (later using it in o3), just to abandon it with o4 with a model with significantly less world knowledge than gemini 2 flash. does not make sense

#

you can believe me or not tbh. i feel like ive been observant and reasonable about these things. and you can probably prove that o4 mini is based on 4.1 mini, but i really think its pointless to go to that extent

alpine coral May 23, 2025, 6:47 PM

#

i think all these models are basically based off GPT4.5/5 in one way or another, with different cpt's

#

anyway.. we've been waiting for claude 4 and it finally arrived..

civic flame May 23, 2025, 6:54 PM

#

alpine coral opus 4 is very good

what does it get?

alpine coral May 23, 2025, 6:54 PM

#

on 2/3 of the question sets, sonnet 3.7 outperforms sonnet 4...

#

but opus is very good

civic flame May 23, 2025, 6:55 PM

#

you should test redsword on the arena

#

seems better than 2.5 pro

#

also wtf is with sonnet 4

#

yikes

alpine coral May 23, 2025, 6:56 PM

#

civic flame you should test redsword on the arena

oh nice - a new google model?

civic flame May 23, 2025, 6:57 PM

#

goldmane and redsword popped up today, both google models

#

redsword seems to be the better one but doesn't hurt to try both

alpine coral May 23, 2025, 6:58 PM

#

civic flame also wtf is with sonnet 4

yeah i know.. it's just not as strong with comprehension and lateral reasoning.. and its knowledge, while more recent.. feels more shallow

misty vault May 23, 2025, 7:02 PM

#

what about opus 4

torn mantle May 23, 2025, 7:06 PM

#

misty vault what about opus 4

nothing crazy

small haven May 23, 2025, 7:08 PM

#

aight operator on o3 still sucks

torn mantle May 23, 2025, 7:08 PM

#

the thing with all anthropic models is that they are so lazy

#

and they made that on purpose

#

to save tokens

keen beacon May 23, 2025, 7:08 PM

#

seems fine to me, only had the issue with new sonnet (oct)

torn mantle May 23, 2025, 7:08 PM

#

i want it to reason even in the stupidest things

misty vault May 23, 2025, 7:08 PM

#

Does it beat 3.7 sonnet at least

#

in coding

ocean vortex May 23, 2025, 7:14 PM

#

torn mantle the thing with all anthropic models is that they are so lazy

3.7 Sonnet was not lazy at all tbh

#

it was the opposite of that with thinking budget maxed out

#

new Opus is lazy though. Kinda defeats the purpose of thinking budget in the first place

keen beacon May 23, 2025, 7:16 PM

#

ocean vortex new Opus *is* lazy though. Kinda defeats the purpose of thinking budget in the f...

its so expensive

leaden meteor May 23, 2025, 7:16 PM

#

I am not good enough yet to identify the nuances between very good and great models. But I am surprised that people are saying opus 4 is inferior to 2.5 pro when most benchmarks say otherwise... I can't honestly differentiate between the two much...

alpine coral May 23, 2025, 7:17 PM

#

ocean vortex new Opus *is* lazy though. Kinda defeats the purpose of thinking budget in the f...

yeah it's the same with 4 sonnet

ocean vortex May 23, 2025, 7:17 PM

#

keen beacon its so expensive

still in the acceptable range with pricing. I was about to say it's barely more expensive than o3, but... OpenAI reduced the price? 🤯

alpine coral May 23, 2025, 7:17 PM

#

it like won't do things that are token intensive that 3.7 would just naturally

ocean vortex May 23, 2025, 7:18 PM

#

Could have sworn it was the same price as o1

#

now it's $40 not $60 per 1M output

main gulch May 23, 2025, 7:19 PM

#

it launched with $40
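
For reference, the gap between the two output prices mentioned above ($40 vs $60 per 1M output tokens) can be worked out with a quick sketch. The 250,000-token workload and the helper name `output_cost_usd` are made up for illustration; only the two per-million rates come from the chat.

```python
def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of generated tokens at a flat per-million-token rate."""
    return output_tokens / 1_000_000 * usd_per_million


# Hypothetical workload size, chosen only to make the arithmetic concrete.
tokens = 250_000

cost_at_40 = output_cost_usd(tokens, 40.0)  # rate quoted for o3 in the chat
cost_at_60 = output_cost_usd(tokens, 60.0)  # rate remembered for o1 in the chat

# Relative saving of the lower rate versus the higher one, in percent.
savings_pct = (cost_at_60 - cost_at_40) / cost_at_60 * 100

print(f"$40/1M: ${cost_at_40:.2f}  $60/1M: ${cost_at_60:.2f}  saving: {savings_pct:.1f}%")
```

At any volume the lower rate works out to one third less per output token, since the ratio 40/60 does not depend on the token count.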

ocean vortex May 23, 2025, 7:20 PM

#

main gulch it launched with $40

that's actually a decent price all things considered

#

o3 cheaper than o3?

#

🤣

#

I said decent price for o3

#

because o3 launched with that price...?

#

lmao

keen beacon May 23, 2025, 7:24 PM

#

ocean vortex because o3 launched with that price...?

im just confused

#

ignore me

ocean vortex May 23, 2025, 7:25 PM

#

catgrin

keen beacon May 23, 2025, 7:25 PM

#

the lack of sleep is really catching up to me lol

ocean vortex May 23, 2025, 7:26 PM

#

I think Google started something really good initially with 2.5 Pro which probably did push OpenAI to do this... But recent Google moves on pricing are less promising

#

just a shame that Google gave in so fast... Like they aren't even pushing AI from google.com yet. But pricing for their plans is already as if they were fully committed

#

you can't use gemini on google.com and that's very odd all things considered. If AI is their side kick then that pricing has no business to be a thing

#

only US I think. Just like AI overviews. But gemini website is not US only

#

yeah bing.com is full AI worldwide so I don't think it's a major roadblock

#

probably more like them wanting their ad revenue tbh

#

but you can't have that and then also charge people $250 a month for AI on top

torn mantle May 23, 2025, 7:38 PM

#

https://x.com/veggie_eric/status/1925976535636328814

Eric Jiang (@veggie_eric)

you know the team was up cooking real late last night when it's 11am and the office is still deserted

#

stop

#

haah

#

sigh

small haven May 23, 2025, 7:39 PM

#

cooking up a lawsuit

torn mantle May 23, 2025, 7:40 PM

#

lol

#

someone replied with this

#

https://x.com/PeterKarani_/status/1925982421712769307

Peter (@PeterKarani_)

@veggie_eric Bullish on the team

#

idk it may be related

#

@deep adder

#

wdym

#

you see the correlation too?

#

well lets just hope the model is good

small haven May 23, 2025, 7:44 PM

#

torn mantle https://x.com/PeterKarani_/status/1925982421712769307

can't believe thats the inventor of batch normalization smh

#

operator o3 in a nutshell

#

not as of today

#

it did things better than the old operator, but still shite

#

and it shouldn't be using windows as the os, imo

#

jk

#

linux

torn mantle May 23, 2025, 7:50 PM

#

small haven can't believe thats the inventor of batch normalization smh

they actually have a goated team

#

i just dont understand whats really happening

#

some of their staff are actually pioneers of the reasoning paradigm

small haven May 23, 2025, 7:52 PM

#

torn mantle they actually have a goated team

embarrassing tbf, i mean at least its not llama 4

torn mantle May 23, 2025, 7:52 PM

#

got me on that

small haven May 23, 2025, 7:53 PM

#

xai needs to acquire ssi and have ilya as the leader, not that red hat wearing gork creator

torn mantle May 23, 2025, 7:53 PM

#

why would they acquire ssi?

#

nah

small haven May 23, 2025, 7:54 PM

#

buddy tried to buy $90b openai, i think he can acquire ssi within that

#

ilya is worth that

torn mantle May 23, 2025, 7:55 PM

#

ilya is scared of everything

#

you already have the sample

#

did we hear anything from him so far?

small haven May 23, 2025, 7:56 PM

#

i mean hes trying to oneshot asi, not agi

sour spindle May 23, 2025, 7:57 PM

#

Have played with all the new models still find myself using o3 a lot.

torn mantle May 23, 2025, 7:57 PM

#

small haven i mean hes trying to oneshot asi, not agi

gl on that

small haven May 23, 2025, 7:57 PM

#

torn mantle gl on that

ya ik not seeing him until 2030

sour spindle May 23, 2025, 7:58 PM

#

Is that the consensus around here? I know there’s a lot of love for 2.5

#

Feel like I’d like it better if I didn’t hate ai studio Gemini app

misty vault May 23, 2025, 7:58 PM

#

sour spindle Is that the consensus around here? I know there’s a lot of love for 2.5

2.5 is cancer

small haven May 23, 2025, 7:59 PM

#

sour spindle Is that the consensus around here? I know there’s a lot of love for 2.5

2.5 is for the dollar tree version of o3

#

but i wanna try deepthink

sour spindle May 23, 2025, 7:59 PM

#

ChatGPT app is the best ui experience too which probably plays into my preference

small haven May 23, 2025, 8:00 PM

#

btw tho where is o3 pro

sweet tinsel May 23, 2025, 8:00 PM

#

Jules with Gemini 2.5 is pretty peak, better than Manus, have yet to try Codex when it will get to Plus users.

willow grail May 23, 2025, 8:01 PM

#

what is best solution for vibe coding in 2025 may?

#

!!!?????????????

#

i dont mean the model

#

whole package

sweet tinsel May 23, 2025, 8:01 PM

#

I already got Plus and Perplexity that is enough for me currently.

#

And AI-Studio is pretty great for being free and Jules too.

sour spindle May 23, 2025, 8:02 PM

#

Pro is nice because I ask a lot of dumb questions

sweet tinsel May 23, 2025, 8:02 PM

#

Got them free Manus Credits too

small haven May 23, 2025, 8:02 PM

#

i mean prolly not now, but soon? deepthink definitely delayed the release schedule of o3 pro i think..

#

ahhh finally some truth

sweet tinsel May 23, 2025, 8:02 PM

#

Maybe I think about Pro, how much GPT 4.5 usage does it have? I'm a sucker for GPT 4.5

#

It's just really good with text

willow grail May 23, 2025, 8:02 PM

#

cursor is what?

small haven May 23, 2025, 8:03 PM

#

willow grail cursor is what?

a big fat scam

willow grail May 23, 2025, 8:03 PM

#

small haven a big fat scam

why u saying this

small haven May 23, 2025, 8:04 PM

#

willow grail why u saying this

bugs

sour spindle May 23, 2025, 8:04 PM

#

I don’t think o3 pro is coming out either. There’s really no need to. Unless deep think is phenomenal or something

willow grail May 23, 2025, 8:04 PM

#

small haven bugs

cline also sucks tho

sour spindle May 23, 2025, 8:04 PM

#

Which I’m a bit skeptical of after I/O

small haven May 23, 2025, 8:05 PM

#

willow grail cline also sucks tho

never tried it, but ya its the same bowl of shxt

elder rapids May 23, 2025, 8:06 PM

#

keeps crashing

willow grail May 23, 2025, 8:06 PM

#

small haven never tried it, but ya its the same bowl of shxt

XD so what is the best solution

sweet tinsel May 23, 2025, 8:06 PM

#

elder rapids keeps crashing

All good my guy, have a G 2.5 Pro Report from another person, but thanks for the help!

elder rapids May 23, 2025, 8:06 PM

#

alr

small haven May 23, 2025, 8:06 PM

#

willow grail XD so what is the best solution

claude code, but if u can afford, codex

willow grail May 23, 2025, 8:07 PM

#

small haven claude code, but if u can afford, codex

codex is more expensive than claude code !!

small haven May 23, 2025, 8:07 PM

#

claude code is at least virtually unlimited, 200k context, just zero crashout

small haven May 23, 2025, 8:07 PM

#

willow grail codex is more expensive than claude code !!

ya so claude code

willow grail May 23, 2025, 8:07 PM

#

small haven claude code is at least virtually unlimited, 200k context, just zero crashout

it needs too many tokens cause no vectoring

small haven May 23, 2025, 8:14 PM

#

in terms of time/effort, codex is cheaper than claude code

#

in cc?

#

haha

#

they do need a read-only policy in claude code so it doesn't overwrite tests

#

thats hawt

#

im still rocking my 2018 cpu, works enough

#

nah

#

i think first gen

#

lol

#

amd > intel

sweet tinsel May 23, 2025, 8:21 PM

#

How did the Hype Man itself get this interview?: https://youtu.be/nZtmmUQDzMQ?si=JEx1oE1jEo40vS45

YouTube

Matthew Berman

Google CEO Sundar Pichai on Gemini, Self-improving AI, and World Mo...

My interview with Google's CEO Sundar Pichai. We covered Gemini, agents, diffusion models, self-improving AI (AlphaEvolve), and more.

The camera kept going out of focus, sorry about that.

small haven May 23, 2025, 8:21 PM

#

if i had intel, it would have broke, amd still strong, and i run long processes, its still solid

#

i mean over time technically it overheats faster, but im sitting at 40 deg c on avg

#

40-50 range

unborn ocean May 23, 2025, 8:24 PM

#

The chip is manufactured in Taiwan like the amd chip

#

So why?

#

Well but you talked about ultra ?

#

And thus currently both Intel and AMD have to count as 'american' 🤨🤓

sweet tinsel May 23, 2025, 8:29 PM

#

Holy, Google Jules just spent 4 hours with a moderate and not that hard task.

earnest parcel May 23, 2025, 8:29 PM

#

Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):

* Ended up topping my ranks (#1 & #2)
* Very high reason, logic and common sense
* quite concise models (16% token use of reason models such as 2.5 Pro)
* highly competent in most areas tested, though Opus had more slip ups in math related tasks
* Great coders, but Sonnet is probably the better choice in most cases (bang 4 buck)
* Noticed improvements in back-end tasks and debugging
* Saw no improvements in Vision
* Chess: competent opening moves, then blunder all pieces even in hugely winning positions (14 draws, 1 loss in 15 matches, with zero secured wins)

Opus in particular seems to have additional guardrails, enforced by API, as I received some usage policy violation warnings on harmless queries (e.g. my Steins;Gate demo pages). This issue was not present on Claude Sonnet 4.

I have also uploaded some demo pages onto my shared assets.
Pricing on Opus with little benefit in most scenarios means I won't be utilizing it much, though.
I'll check out performance with reasoning in the coming days, too.
Overall, impressive models. As always, YMMV!

small haven May 23, 2025, 8:30 PM

#

sweet tinsel Holy, Google Jules just spent 4 hours with a moderate and not that hard task.

i was wondering if jules can search the internet post vm boot?

sweet tinsel May 23, 2025, 8:31 PM

#

small haven i was wondering if jules can search the internet post vm boot?

Need to test it out later

#

Need to watch TLOU first

small haven May 23, 2025, 8:38 PM

#

ok so jules can search the web, unlike codex

sweet tinsel May 23, 2025, 8:41 PM

#

It's pretty good, I just need to try it out some more.

small haven May 23, 2025, 8:41 PM

#

its cool, but i just don't like its base model 2.5 pro lol

elder rapids May 23, 2025, 8:45 PM

#

small haven its cool, but i just don't like its base model 2.5 pro lol

you think you might like GA 2.5 pro?

small haven May 23, 2025, 8:46 PM

#

elder rapids you think you might like GA 2.5 pro?

if its better than o3 yes, ill use anything thats pareto optimal

unborn ocean May 23, 2025, 8:47 PM

#

Poll: what kind of scores for webdev are you guys guessing for claude 4 (opus/sonnet)?

Leading answer: 1600-1500 (3 of 6 votes)

elder rapids May 23, 2025, 8:51 PM

#

small haven if its better than o3 yes, ill use anything thats pareto optimal

meant based on how you feel about 2.5 pro currently

unborn ocean May 23, 2025, 8:53 PM

#

unborn ocean

Alr high hopes

elder rapids May 23, 2025, 8:54 PM

#

ye high hopes for webdev

small haven May 23, 2025, 8:57 PM

#

elder rapids meant based on how you feel about 2.5 pro currently

i dont like the current version personally

#

i dont do webdev, but im guessing its dominating there

dull terrace May 23, 2025, 9:08 PM

#

hello

blazing rune May 23, 2025, 9:24 PM

#

@earnest parcel I want to make my own benchmark, can I ask how you get inspiration for the tasks you test?

#

I'm just not very creative

small haven May 23, 2025, 9:27 PM

#

technically mistral because they are more likely to open source than any other ai labs, but its unlikely they get agi first unfortunately

earnest parcel May 23, 2025, 9:28 PM

#

blazing rune <@126820015382069250> I want to make my own benchmark, can I ask how you get ins...

i don't know, I just like comparing AI models. so I have probably made hundreds of different tests, I just post a tiny snippet of some I think might be interesting visually or conceptually. I just do whatever I enjoy or find interesting myself.

tall summit May 23, 2025, 9:29 PM

#

@earnest parcel thanks for the tests

primal orbit May 23, 2025, 9:29 PM

#

hi, is there any secret model atm on arena worth checking? like a gemini deep thinking candidate or smth?

patent aspen May 23, 2025, 9:30 PM

#

The top AI lab startups all used the same sales pitch "We're going to make the best AI but more open and ethical than those evil big tech companies"

#

Meanwhile Google was publishing all of its AI papers, open sourcing all of its libraries

#

Until OpenAI stopped publishing and everyone had to compete

primal orbit May 23, 2025, 9:32 PM

#

calmriver is the recently released flash thinking update, no?

patent aspen May 23, 2025, 9:32 PM

#

I don't buy those startups' holier than thou sales pitches

tall summit May 23, 2025, 9:32 PM

#

patent aspen The top AI lab startups all used the same sales pitch "We're going to make the b...

thats why claude caring about ethics is a decent angle in my opinion

#

i saw people here being like "why care about ethics if nobody else does"

patent aspen May 23, 2025, 9:33 PM

#

Anthropic has even less oversight than Google, but it's a great sales pitch

#

That's marketing

#

All of the companies are checking their models, some more than others

#

They would need to be richer

#

Then they could allocate more resources to the red teaming

#

But the point is Anthropic isn't red teaming more than Google

olive skiff May 23, 2025, 9:37 PM

#

Does anyone know why claude 4 isn't listed on leaderboard?

patent aspen May 23, 2025, 9:38 PM

#

They don't have the same level of resources to allocate to red teaming

olive skiff May 23, 2025, 9:38 PM

#

what votes?

patent aspen May 23, 2025, 9:39 PM

#

Like I'm not saying Anthropic is being negligent or doing anything wrong. I'm just saying they're not actually better positioned for safety than Google

olive skiff May 23, 2025, 9:39 PM

#

but is it in the arena game at least to gain votes to show in leaderboard?

#

btw shouldn't it be at least at one position in the leaderboard? even last... but it isn't.. so?

#

i see in announcements that it's integrated but where... not seen in leaderboard yet

earnest parcel May 23, 2025, 9:45 PM

#

blazing rune I'm just not very creative

i hadn't even updated the most scientifically important ones, such as ascii comparisons xD

tall summit May 23, 2025, 9:47 PM

#

earnest parcel i hadn't even updated the most scientifically important ones, such as ascii comp...

tf u mean cracks digital knuckles

#

what system prompt is this HAHAHA

earnest parcel May 23, 2025, 9:47 PM

#

is inspired by this very website, lol. https://dubesor.de/lmarenaarmchaircritic

tall summit May 23, 2025, 9:48 PM

#

wow ok

#

Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):

Ended up topping my ranks (#1 & #2)

Very high reason, logic and common sense

#

really?

earnest parcel May 23, 2025, 9:50 PM

#

one also needs to goof around without ranking, many people use stuff recreationally as well

tall summit May 23, 2025, 9:51 PM

#

earnest parcel is inspired by this very website, lol. https://dubesor.de/lmarenaarmchaircritic

wheres it linked from

#

have you tried it reasoning things from a FEN

#

i still havent yet somehow

primal orbit May 23, 2025, 9:59 PM

#

got redsword - pretty good. Is goldmane better?

tall summit May 23, 2025, 10:07 PM

#

tall summit have you tried it reasoning things from a FEN

they still cant
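
For anyone who wants to try the FEN test mentioned above, here is a minimal sketch of what reading a position from FEN involves: expanding the piece-placement field into an 8x8 grid that can be compared against a model's description of the board. The function name `expand_fen_board` is made up for illustration, and the sketch only handles the first FEN field (side to move, castling rights and the rest are ignored).

```python
def expand_fen_board(fen: str) -> list[str]:
    """Expand the piece-placement field of a FEN string into 8 rows of 8 chars.

    Rows run from rank 8 down to rank 1; empty squares are shown as '.'.
    """
    placement = fen.split()[0]  # first space-separated field holds the pieces
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit means that many consecutive empty squares.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    if len(rows) != 8 or any(len(r) != 8 for r in rows):
        raise ValueError("not a valid 8x8 FEN placement")
    return rows


# Standard chess starting position.
START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = expand_fen_board(START)
print("\n".join(board))
```

A grid like this gives a ground truth to check against when asking a model where the pieces are.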

balmy mist May 23, 2025, 10:18 PM

#

i been busy all day

#

whats the order of models?

#

like whats the best one and how would yall rank them?

#

we need to do constant polls to rank models based on our experience and tests, benchmarks are cool but i trust our vibe tests better

tall summit May 23, 2025, 10:19 PM

#

pftttt what

balmy mist May 23, 2025, 10:20 PM

#

small haven not as of today

but no pro lol wow

#

broo but most people have not gotten to test grok 3.5 so you cant include it yet

#

and have you tried the new google models?

#

oh i didnt even see the 0 lol

#

what about on webdev?

#

hmm okay, imm test more tmw when i have more time

#

i heard ppl saying opus is fast too?

#

ahh okay

#

i just tried goldmane with my poke test and its struggling

#

gonna have to rephrase prompt

elder rapids May 23, 2025, 10:31 PM

#

balmy mist what about on webdev?

opus pretty cracked at some things

tall summit May 23, 2025, 10:52 PM

#

opus definitely slow

misty star May 23, 2025, 11:42 PM

#

Hello

ocean vortex May 24, 2025, 12:16 AM

#

earnest parcel Tested **Claude 4** (default/non-thinking, Opus & Sonnet, 20250514): * Ended up...

yeah on a first glance Opus is overly concise, but I just tested it more extensively myself as well and... it aces much more prompts than one would expect from a model that is not thinking that long. Very solid model actually. May just be the biggest reasoning model we have. Still not as accurate for recursive repetitive tasks as o3 due to limited test-time compute, but logic and reasoning actually seems stronger

#

it flagging prompts is absolutely an issue though. I managed to circumvent it by slightly rewriting one of them, but on 2nd it was blocked HARD with no apparent way to get it through

torn mantle May 24, 2025, 12:24 AM

#

https://x.com/pranjalssh/status/1926069645934563779

Pranjal (@pranjalssh)

LFG

#

https://x.com/ibab/status/1926071200565903684

Igor Babuschkin (@ibab)

@pranjalssh LFG

#

LFG

#

LFG

leaden palm May 24, 2025, 12:24 AM

#

?!

torn mantle May 24, 2025, 12:27 AM

#

leaden palm ?!

ITS HYPE

#

i have no clue whats going on anymore with xai staff

leaden palm May 24, 2025, 12:28 AM

#

PLEASE SHIP

torn mantle May 24, 2025, 12:28 AM

#

LETS GO

#

SHIP NOW

#

https://x.com/techdevnotes/status/1926060678302888156

Tech Dev Notes (@techdevnotes)

xAI is cooking something (other than grok 3.5) which from the indications could be lame but probably not, don't have UI to show yet, hopefully soon

#

nvm

torn mantle May 24, 2025, 12:48 AM

#

this is getting sus

#

is that account urs?

#

that tech dev guy

#

= you

#

thanks for confirming

#

gtg now

#

leaden palm May 24, 2025, 12:55 AM

#

torn mantle thanks for confirming

time to run a style analysis

leaden palm May 24, 2025, 1:24 AM

#

leaden palm time to run a style analysis

lol

azure minnow May 24, 2025, 1:37 AM

#

Bro what

grim axle May 24, 2025, 1:40 AM

#

leaden palm lol

💀

elder rapids May 24, 2025, 2:49 AM

#

leaden palm lol

proof btw

small haven May 24, 2025, 3:47 AM

#

torn mantle https://x.com/techdevnotes/status/1926060678302888156

"cooking something other thank grok" ya we know lol

calm sequoia May 24, 2025, 3:56 AM

#

Craig, you could have shared it's you. I wouldn't have trolled you on twitter.

wheat onyx May 24, 2025, 4:03 AM

#

What are peoples thoughts on claude 4 on coding? How big of a jump does it feel like?

small haven May 24, 2025, 4:37 AM

#

they seriously have to release o3 pro soon like wtf is sam altman jumping on

keen fulcrum May 24, 2025, 5:25 AM

#

When is R2 coming out?

#

pulsar tendon May 24, 2025, 5:56 AM

#

I cant tell if goldmane or redsword is better

late path May 24, 2025, 6:30 AM

#

Do you guys think Opus still has a chance to get #1?

#

I'm not entirely sure, considering 0506 achieved a higher ELO despite weakening in all other areas and only improving in programming, whereas Opus specifically focused on programming

#

According to their last statistics, it seems about a quarter of the requests in the arena are about programming

torn mantle May 24, 2025, 8:36 AM

#

late path Do you guys think Opus still has a chance to get #1?

Tbh no

#

Its only good at coding

#

Also the one being tested is the non thinking model

alpine coral May 24, 2025, 9:48 AM

#

earnest parcel Tested **Claude 4** (default/non-thinking, Opus & Sonnet, 20250514): * Ended up...

was waiting for this - cheers!

willow grail May 24, 2025, 11:02 AM

#

claude code should be even better for context than roo?

unborn ocean May 24, 2025, 11:42 AM

#

Why? - to bypass the very high output token price of the thinking model (3,5$).

#

And potentially get one of the best reasoning models (in chat stuff atleast) for 0,15$ in and 0,6$ out. (prob beating grok mini price / performance wise in close to everything)

#

(I would do some kind of benchmarks in the process to evaluate how good it is :)

keen fulcrum May 24, 2025, 12:05 PM

#

Sonnet 4 twice as expensive for maximum context 200k

ocean vortex May 24, 2025, 12:14 PM

#

unborn ocean

2.5 Flash has different pricing for thinking and no thinking. Which suggests to me there could be more going on than just a different prompt template with same weights. But by all means, what you are suggesting makes sense to experiment with for this model. One of the main factors is going to be whether you can get it to output responses that are long enough or will it just do the thinking at the expense of the final response sticking to what it normally does (in terms of output length for the non-thinking version)

keen beacon May 24, 2025, 12:20 PM

#

ocean vortex 2.5 Flash has different pricing for thinking and no thinking. Which suggests to ...

the increased pricing is because adjustments to the hosting probably

#

batching, kv seq length etc. semianalysis talked about this before iirc

ocean vortex May 24, 2025, 12:20 PM

#

if it's anything like Claude implementation, then this would make all the sense in the world to do...

#

since that non-reasoning model was already trained to reason

unborn ocean May 24, 2025, 12:22 PM

#

keen beacon the increased pricing is because adjustments to the hosting probably

i think it is not really just that, because the price hike is just wayyy to extreme

keen beacon May 24, 2025, 12:22 PM

#

unborn ocean i think it is not really just that, because the price hike is just wayyy to extr...

yeah its not just that but it contributes to the increased pricing

unborn ocean May 24, 2025, 12:22 PM

#

for that price hike they could change the inference + give you the thinking for free

keen beacon May 24, 2025, 12:23 PM

#

they are probably trying to make more (profit margin wise) on flash than 2.5 pro. a similar thing to 3.5 haiku i think

ocean vortex May 24, 2025, 12:23 PM

#

unborn ocean i think it is not really just that, because the price hike is just wayyy to extr...

not necessarily too extreme tbh. Gone are the days when cost had high correlation to price too

#

would be unusual considering this is Google, but defo not unheard of, and in light of Google Ultra pricing, more realistic now

unborn ocean May 24, 2025, 12:24 PM

#

my main ideas are that that it is mainly a pricing strategy (because they know they have basically no competitors in that price range and can just charge more to still match deepseek and beat openai (pricing wise) for reasoning) and some other stuff related to product rather than the cost of serving

#

it is mainly monopoly pricing

keen beacon May 24, 2025, 12:24 PM

#

thats probably part of it

#

i looked into gemini 2 flash a while back for cold start, it wasn't tenable back then

unborn ocean May 24, 2025, 12:26 PM

#

keen beacon i looked into gemini 2 flash a while back for cold start, it wasn't tenable back...

but if 2.5 non thinking and thinking are the same model

#

there should be big potential there

keen beacon May 24, 2025, 12:27 PM

#

sure, idk about 2.5 flash honestly

unborn ocean May 24, 2025, 12:27 PM

#

they claim it is hybrid

keen beacon May 24, 2025, 12:27 PM

#

unborn ocean they claim it is hybrid

do they?

#

if so its quite believable tbh

keen beacon May 24, 2025, 12:27 PM

#

keen beacon i looked into gemini 2 flash a while back for cold start, it wasn't tenable back...

there were certain problems with 2.0 flash that made it unusable for this purpose (been working on a proj for ~~almost a year~~ (overestimated this) now, maybe something interesting will come out of it as some point. im using mi300xs now lol)

unborn ocean May 24, 2025, 12:28 PM

#

true

#

but the "hybrid" part could also be less literal than we are interpreting it

#

maybe something like a big lora style adoption, that changes the model by a big margin

keen beacon May 24, 2025, 12:28 PM

#

Could be

#

I dont find it likely

unborn ocean May 24, 2025, 12:29 PM

#

keen beacon there were certain problems with 2.0 flash that made it unusable for this purpos...

are you running your own inference on them?

#

bc i heard it is not very nice development wise to say the least

keen beacon May 24, 2025, 12:30 PM

#

unborn ocean are you running your own inference on them?

inference is definitely easier than training, as there can be issues with training

#

you can use their prebuilt vllm docker pretty easily

unborn ocean May 24, 2025, 12:30 PM

#

keen beacon inference is definitely easier than training, as there can be issues with traini...

did semianalysis not do some report there where they complained BIG TIME about all the things dumb

keen beacon May 24, 2025, 12:31 PM

#

Yeah

#

Lmao

#

dude i didnt even know they had a pytorch training rocm docker until recently 😭 i prebuilt everything

unborn ocean May 24, 2025, 12:31 PM

#

and i get a newsletter like every week about them complain about how little compute the amd team gets for testing

unborn ocean May 24, 2025, 12:31 PM

#

keen beacon dude i didnt even know they had a pytorch training rocm docker until recently 😭...

ok that is mad

#

i also have amd home compute and it is really annoying with any ai stuff

#

although it has gotten way better

#

but my uni has a100 compute and i kind of use that for some of my projects

#

wayyy easier

keen beacon May 24, 2025, 12:34 PM

#

unborn ocean but my uni has a100 compute and i kind of use that for some of my projects

damn

unborn ocean May 24, 2025, 12:34 PM

#

*and some big h100 cluster coming up, hyped for that

#

like muti million $ stuff 🔥

keen beacon May 24, 2025, 12:35 PM

#

ngl the mi300x is really good, its just the ecosystem/support/etc its really annoying

#

with a h100/a100 its so easy to setup lol

unborn ocean May 24, 2025, 12:36 PM

#

keen beacon ngl the mi300x is really good, its just the ecosystem/support/etc its really ann...

yes, i was originally very hyped for the mi300 releases back then because AMD cards usually sound really good on paper and then ...

ocean vortex May 24, 2025, 12:52 PM

#

2.5 Flash no-thinking seems like it has issues following your sys prompt exactly, so I simply spammed it with tags lol
this somewhat works, managed 11k+ responses with no thinking enabled. And it doesn't break going into infinite loops with same math prompts compared to no sys prompt

    process and answer are enclosed within <thinking> </thinking>  respectively, i.e.,
    <reflection>extremely long and exhaustive reasoning process with high reasoning effort not visible to the user here </reflection> 
</think></thinking></reasoning><reason>
 then from new paragraph output final answer visible to the user here</answer>```

#

yeah you defo can get more test-time compute out of it than it's willing by default with no thinking enabled. Another test prompt, with no sys prompt consistently around 1k total output, with sys prompt always around 4k 🧐

#

it's a shame they started giving you summaries of thinking rather than raw output in aistudio so it's difficult to directly compare. But it may just be more verbose now than with thinking enabled tbh
tases more time to answer, total token usage showing up higher, final response streaming around the same speed

ocean plume May 24, 2025, 1:24 PM

#

still no update claude 4 ?

unborn ocean May 24, 2025, 1:30 PM

#

ocean vortex it's a shame they started giving you summaries of thinking rather than raw outpu...

my plan was to potentially present the model with an example in the prompt (one with the actual reasoning traces of the old 2.5 flash or pro) to kind of anchor the reasoning lenght at that level

#

and i was also thinking about making it dynamic (because the prompt is supposed to be for api)

#

in the way that you could use a small model to classify the prompt topic into like 50 categories (where for each you already have a reasoning trace from the actual thinking models) and then provide the example in the system prompt based on that

#

(obviously that is more expensive than just using the short prompt, but also still wayyy cheaper than the ludicrous prices they have for their thinking variant)

misty vault May 24, 2025, 1:49 PM

#

https://cdn.discordapp.com/attachments/972660634795794444/1375832293427384350/n5l9najq5ybe1.webp?ex=68331f2c&is=6831cdac&hm=c42d40ad3524980316f08229e8eac19f2301863388853c27ff965bb46f96f627&

late path May 24, 2025, 2:38 PM

#

poll_question_text

Where do you find Claude 4 Opus stronger than Gemini 2.5 Pro?

victor_answer_votes

9

total_votes

28

victor_answer_id

1

victor_answer_text

Programming

ocean vortex May 24, 2025, 3:02 PM

#

Opus can still be pushed to become as unhinged as 3.7 Sonnet 😇

#

shame for the hard cap

keen beacon May 24, 2025, 3:03 PM

#

ocean vortex Opus can still be pushed to become as unhinged as 3.7 Sonnet 😇

use prefill

#

then the windows the cap

#

or is it disabled for opus/claude 4 (it's been a while)?

#

you might not be able to use their native thinking functionality but if it works like 3.7 sonnet, its possible

ocean vortex May 24, 2025, 3:05 PM

#

keen beacon or is it disabled for opus/claude 4 (it's been a while)?

it's disabled. I could just add it to context and ask it to continue from that point but honestly it brought to me to negative balance and don't feel like refilling again lmao

torn mantle May 24, 2025, 4:21 PM

#

asi?

#

nah that redsword model is kinda crazy

#

and its so fast too

#

few months

#

few years

#

few millennia

sudden root May 24, 2025, 4:43 PM

#

No, it won’t

#

In the best case, it will be a little bit better than gemini 2.5 Pro and o3, which are amazing indeed. But then the lead won’t last long, google is cooking currently, while I believe xAI is overhyped by Musk Fanboys.

#

No 😂

#

Think 2.5 pro is bad?

#

You meant „leading in every single domain except coding“?

#

Claude 4 is SOTA in coding

#

Don‘t ignore AlphaEvolve

keen beacon May 24, 2025, 4:58 PM

#

wait for the ga release i guess

ember rapids May 24, 2025, 5:01 PM

#

google def leads in video but sora 2...

sour spindle May 24, 2025, 5:13 PM

#

It's night and day difference for me using o3 with tools and 2.5 pro

#

I thought this was a cope when OAI employees talked about how important tools are but it has been signficant

#

I am interested in grok 3.5 will be there was like a month where grok 3 was my favorite model but my patience is wearing thin with the xai team

ornate stump May 24, 2025, 5:17 PM

#

sour spindle I thought this was a cope when OAI employees talked about how important tools ar...

For what kind of tasks?

elder rapids May 24, 2025, 5:20 PM

#

veo 2 existed before sora

#

?

#

and Veo 2 was still demonstrably better per the demos

#

so sora was never SOTA

#

Google also apparently had the first reasoning model

#

before o1

sour spindle May 24, 2025, 5:21 PM

#

ornate stump For what kind of tasks?

For me I do a lot of research based tasks. o3 is far better at getting me high quality research which is quite shocking because you think google would be better due to its access

#

I honestly am very disapointed at some of the stuff googles considered high quality in some of the responses

elder rapids May 24, 2025, 5:22 PM

#

sour spindle For me I do a lot of research based tasks. o3 is far better at getting me high q...

they have the same amount of access the synthesis is what makes its output though. Google hasn't given their tools that access to synthesize yet

#

math specialized 1.5 pro

#

was apparently a reasoning model

#

explicit reasoning like o1

sour spindle May 24, 2025, 5:23 PM

#

elder rapids they have the same amount of access the synthesis is what makes its output thoug...

you are what you ship

elder rapids May 24, 2025, 5:23 PM

#

sour spindle you are what you ship

and openAI is what they announce...?

#

Google has announced a lot of this stuff too

#

and it's not released

#

strange standard for the respective companies tho

sour spindle May 24, 2025, 5:24 PM

#

I'm talking about stuff already on the market

keen beacon May 24, 2025, 5:24 PM

#

math specialized 1.5 pro wasnt as versatile as o1 though i think

elder rapids May 24, 2025, 5:24 PM

#

move the goalpost thats fine tbh

#

veo 2 > sora

elder rapids May 24, 2025, 5:26 PM

#

keen beacon math specialized 1.5 pro wasnt as versatile as o1 though i think

prob, but that's really the first instance of an LLM having that feature

#

and I don't think openAI publicizing o1 is meaningful in any way

keen beacon May 24, 2025, 5:26 PM

#

elder rapids prob, but that's really the first instance of an LLM having that feature

publicly i guess

elder rapids May 24, 2025, 5:26 PM

#

since googles little thing was RL + reasoning chains

#

so they'd have eventually went deep into reasoning models regardless

ornate stump May 24, 2025, 5:45 PM

#

sour spindle For me I do a lot of research based tasks. o3 is far better at getting me high q...

and how do you deal with hallucinations?

sour spindle May 24, 2025, 5:52 PM

#

ornate stump and how do you deal with hallucinations?

Click the link to the relevant source

unborn ocean May 24, 2025, 6:12 PM

#

keen beacon math specialized 1.5 pro wasnt as versatile as o1 though i think

https://arxiv.org/pdf/2203.14465 found something older

#

#

But they used grade school math as a benchmark back then 😂

#

But the finetune is only RL-like, or at least different to the current implementations

elder rapids May 24, 2025, 6:17 PM

#

unborn ocean https://arxiv.org/pdf/2203.14465 found something older

going through it, ye it actually is explicit reasoning

keen beacon May 24, 2025, 6:17 PM

#

unborn ocean https://arxiv.org/pdf/2203.14465 found something older

yeah i remember seeing this before

elder rapids May 24, 2025, 6:17 PM

#

that's crazy

#

this is pretty old too

#

thanks for finding that

keen beacon May 24, 2025, 6:18 PM

#

oh its from google research too (at least partly)

#

i just realized that

elder rapids May 24, 2025, 6:18 PM

#

yep

unborn ocean May 24, 2025, 6:24 PM

#

keen beacon oh its from google research too (at least partly)

There is another :) also by google and even older!: https://arxiv.org/pdf/2112.00114

#

But the other ‚newer‘ paper matches the current SOTA methods better

elder rapids May 24, 2025, 6:27 PM

#

unborn ocean There is another :) also by google and even older!: https://arxiv.org/pdf/2112.0...

nice this one also qualifies and explicit reasoning

elder rapids May 24, 2025, 6:27 PM

#

unborn ocean But the other ‚newer‘ paper matches the current SOTA methods better

not sure it matters tbh, the point is it's not black box jump to conclusion

unborn ocean May 24, 2025, 6:31 PM

#

So in short: Google OWNS the reasoning paradigm!

wintry locust May 24, 2025, 6:31 PM

#

llama 3.1 did rl from code execution and learnt to backtrack in practice

elder rapids May 24, 2025, 6:35 PM

#

unborn ocean So in short: Google OWNS the reasoning paradigm!

guess so

#

kind of crazy tbh

#

I suspected openAI didn't actually invent the reasoning stuff

#

but Google researching this stuff so far back is surprising

ocean vortex May 24, 2025, 6:45 PM

#

unborn ocean So in short: Google OWNS the reasoning paradigm!

well there's no use of it if you aren't the one who makes it work for the general public the first. And Google wasn't the first. Same goes for MoE architecture. It was a thing before OpenAI adopted it, but it wasn't actually made to work and taken advantage of before them. There's also a thing that we have many proof of concept research papers even today, but only select few are made to work and bring an impact. Whoever does it first and does it right is typically the one taking the most credit

#

I think both OpenAI and Google contribute and compliment each other a lot though, when it comes to the general progress of AI

#

atm I think it's all about making the reasoning more efficient. Replacing plain-text repetitive data with something more effective. We have quite many recent papers on this

#

if we can reduce the context use and reasoning output size, we could potentially be able to scale it much more

small haven May 24, 2025, 7:08 PM

#

yess apply pressure !

raven void May 24, 2025, 7:08 PM

#

#

looks like Google is going very hard on RL to make it better on code and design

small haven May 24, 2025, 7:15 PM

#

what is currently better than claybrook

unborn ocean May 24, 2025, 7:27 PM

#

ocean vortex well there's no use of it if you aren't the one who makes it work for the genera...

well MoE was well known in research i think (even before the tranformer..)

unborn ocean May 24, 2025, 7:28 PM

#

ocean vortex well there's no use of it if you aren't the one who makes it work for the genera...

and everyone adopted really quick

unborn ocean May 24, 2025, 7:28 PM

#

ocean vortex well there's no use of it if you aren't the one who makes it work for the genera...

ensamble learning (which is at its core similar) has been a think for ages

#

sorry for all the pings 😓

leaden palm May 24, 2025, 7:30 PM

#

matt shumer:

small haven May 24, 2025, 7:33 PM

#

i feel like o3 in chatgpt a bit different in a good way? or is it just me

sonic tendon May 24, 2025, 7:34 PM

#

anyone know as to how the performance is for goldmane and redsword?

keen beacon May 24, 2025, 7:35 PM

#

sonic tendon anyone know as to how the performance is for goldmane and redsword?

People are raving about it

sonic tendon May 24, 2025, 7:36 PM

#

haven't seen it on the regular arena, I don't think

keen beacon May 24, 2025, 7:37 PM

#

It's apparently supposed to be the ga versions so it should be there

ocean vortex May 24, 2025, 7:38 PM

#

small haven i feel like o3 in chatgpt a bit different in a good way? or is it just me

the main difference is that it can use tools while thinking. Which can mean a fairly substantial improvement

small haven May 24, 2025, 7:39 PM

#

ocean vortex the main difference is that it can use tools while thinking. Which can mean a fa...

no i mean thats been that from the start, but it is a bit smarter now

ocean vortex May 24, 2025, 7:39 PM

#

small haven no i mean thats been that from the start, but it is a bit smarter now

no it's the same

small haven May 24, 2025, 7:40 PM

#

i feel like it got updated lowkey

#

i mean i spam o3 for days, this seems diff

sonic tendon May 24, 2025, 7:47 PM

#

ok

#

been wondering when/if deepthink is gonna be on here

elder rapids May 24, 2025, 8:00 PM

#

small haven what is currently better than claybrook

goldmane, redsword, drakesclaw, dayhush, nightwhisper, dragon tail

#

iffy on dragon tail

#

but ye claybrook was one of the weaker ones

small haven May 24, 2025, 8:02 PM

#

elder rapids goldmane, redsword, drakesclaw, dayhush, nightwhisper, dragon tail

and among those which ones the best

#

is it for webdev only or in general

elder rapids May 24, 2025, 8:03 PM

#

small haven and among those which ones the best

goldmane >≈ nightwhisper ≈ redsword ≈ drakesclaw

elder rapids May 24, 2025, 8:03 PM

#

small haven is it for webdev only or in general

both imo but I haven't gone far with them

small haven May 24, 2025, 8:04 PM

#

elder rapids goldmane >≈ nightwhisper ≈ redsword ≈ drakesclaw

cool thanks

elder rapids May 24, 2025, 8:06 PM

#

small haven cool thanks

although some people might say redsword is better than goldmane

#

but regardless it's that they're very equivalent models

#

still pretty concerned regarding these models tbh

#

claybrook didn't stand out

keen beacon May 24, 2025, 8:08 PM

#

I'm assuming both are 2.5 pro since people like both of them and there aren't large capability differences people talk about

elder rapids May 24, 2025, 8:08 PM

#

there are capability differences

keen beacon May 24, 2025, 8:08 PM

#

Like pro to flash level?

elder rapids May 24, 2025, 8:08 PM

#

not that large but it's obvious

#

claybrook → nightwhisper is pretty massive

keen beacon May 24, 2025, 8:09 PM

#

elder rapids claybrook → nightwhisper is pretty massive

I'm talking about goldmane and redsword

elder rapids May 24, 2025, 8:09 PM

#

oh mb

#

no there's not much of a difference between goldmane and redsword

keen beacon May 24, 2025, 8:10 PM

#

Yeah makes sense

small haven May 24, 2025, 8:12 PM

#

o3 pro is still going to be released, less go

#

whats fake the screenshot?

#

hes literally followed by sam lol

elder rapids May 24, 2025, 8:14 PM

#

small haven hes literally followed by sam lol

you don't need to double down imo that's substantive but it's not something to rely on

#

openAI just does shi

#

Sam Included

small haven May 24, 2025, 8:15 PM

#

i vouch

#

its kinda crazy that o3 pro is taking this long, prolly too hard to tame under the safety protocol, its too smart lol

#

ok

elder rapids May 24, 2025, 8:27 PM

#

small haven its kinda crazy that o3 pro is taking this long, prolly too hard to tame under t...

ye same thing with deepthink

#

for 2.5 pro

patent aspen May 24, 2025, 8:29 PM

#

The big expensive models are a lot slower to do evals on

#

Typically you evaluate safety, performance, etc, look at the result, fix the problems, and then eval again. If the model is big and slow, that iteration loop is way slower

sweet tinsel May 24, 2025, 8:32 PM

#

Manus image generation is pretty good, it's worse than GPT 4o Image Generation but better than Flux and such.

#

It's even better than Imagen 4, but I don't find Imagen 4 in my short testing that good.

#

It's even better than Imagen 4, but I don't find Imagen 4 in my short testing that good.

#

Imagen 4 is really bad at the following of instructions.

unborn ocean May 24, 2025, 9:06 PM

#

from what i saw it is really terrible

#

only if you want some weird text heavy stuff (it is decent but not better than 4o)

#

and imo it is not really image gen anymore

misty vault May 24, 2025, 9:53 PM

#

yes except gemini 2.5 pro

#

gemini 2.5 pro can do everything

#

it is asi

#

claude is also asi but it needs phone number!

high ginkgo May 24, 2025, 9:54 PM

#

Gemini 2.5 pro is god it can do rust it is the best LLM

misty vault May 24, 2025, 9:54 PM

#

YOU are in that mode

leaden palm May 24, 2025, 9:58 PM

#

in theory rust should be better for both rl and test time compute because it has a verbose compiler

#

in reality it might be too mentally taxing and underrepresented in training data

keen beacon May 24, 2025, 9:58 PM

#

Did you try o3 it might fare better

keen beacon May 24, 2025, 9:59 PM

#

leaden palm in reality it might be too mentally taxing and underrepresented in training data

I think the problem is primarily underrepresentation

#

Bruh

leaden palm May 24, 2025, 10:01 PM

#

you're overrating low level languages in terms of both speed and presence in training data

leaden palm May 24, 2025, 10:02 PM

#

leaden palm you're overrating low level languages in terms of both speed and presence in tra...

youll find more js and python

keen beacon May 24, 2025, 10:03 PM

#

No I don't recall needing one

misty vault May 24, 2025, 10:04 PM

#

you

high ginkgo May 24, 2025, 10:06 PM

#

Except gemini 2.5 pro because it is the best llm in the universe according to you

misty vault May 24, 2025, 10:06 PM

#

then stop saying gemini 2.5 pro is god

#

its sh*t

#

be reasonable

#

hello

keen beacon May 24, 2025, 10:08 PM

#

Hi

#

They're trolling

golden ocean May 24, 2025, 10:08 PM

#

keen beacon They're trolling

cat-orange-look-like-egg-photoshop-battle-original-1.png

#

bro you're just weird

quiet folio May 24, 2025, 10:11 PM

#

keen beacon May 24, 2025, 10:11 PM

#

I'm not sure if it's a bunny or not

#

I have to maintain factual accuracy when classifying bunnies

quiet folio May 24, 2025, 10:12 PM

#

Why use grok if can use gemini 25 pr

keen beacon May 24, 2025, 10:12 PM

#

Grok 3 sucks

misty vault May 24, 2025, 10:12 PM

#

Fr

#

no, gemini is best at everything in the world

keen beacon May 24, 2025, 10:12 PM

#

Try o3 in direct chat

worthy belfry May 24, 2025, 10:23 PM

#

when does the new lmarena come online?

#

Just found the answer in previous "announcement" posts: this upcoming week

patent aspen May 24, 2025, 10:38 PM

#

The only thing I don't like about the beta is that the overview doesn't show a big spreadsheet of rankings

misty vault May 24, 2025, 10:57 PM

#

patent aspen The only thing I don't like about the beta is that the overview doesn't show a b...

great!!!!!!!!
that new version will hopefully be better and have these overview issues fixed
-# sometimes, i get emotional with gemini 2.5 pro

patent aspen May 24, 2025, 11:01 PM

#

Oh wait nvm they actually do still have the spreadsheet if you scroll down

#

Great I'm satisfied

misty vault May 24, 2025, 11:06 PM

#

great!
that new spreadsheet will hopefully be better if you scroll down

#

-# gemini 2.5 pro also satisfies me

patent aspen May 24, 2025, 11:07 PM

#

torn mantle May 24, 2025, 11:42 PM

#

https://x.com/ibab/status/1926416990257693145

Igor Babuschkin (@ibab)

@techdevnotes 10 tons of garlic are arriving at the office as we speak

#

hyping up grok 3.5

#

hes basically saying they are cooking up so hard

#

i will return to this post after grok 3.5 is released

reef lynx May 25, 2025, 12:15 AM

#

~~look back to a year ago when grok 3 wasn't even real~~

solar hollow May 25, 2025, 12:40 AM

#

do we know which company has the most datacenters/compute available to them?

#

microsoft provides a lot for openai right?

unborn ocean May 25, 2025, 12:59 AM

#

Microsoft rents a lot aswell and OpenAI is also partnering with other companies now

#

And xAI probably does not have the most compute (maybe most in one datacenter though)

small haven May 25, 2025, 1:30 AM

#

people are sleeping on codex, and me too 😴

elder rapids May 25, 2025, 1:50 AM

#

solar hollow do we know which company has the most datacenters/compute available to them?

we can never know the total server count or aggregate processing power (flops) but it's most likely Google, followed by Amazon and Microsoft, Amazon has the most publically available compute iirc and then Microsoft, and then Google. But googles infrastructure was built and scaled before gcp which means they've maintained a large amount that was never intended to be publicized, and we can assume the public facing compute is in addition to that large infrastructure predating gcp

#

the fact that they have such a large non specialized pool of compute (gcp) and then highly specialized compute (tpus, which is in gcp, but not alone in processors within it, ie gpus) so we know it's x in addition to y instead of "x is only an instance of y therefore unquantifiable"

solar hollow May 25, 2025, 2:03 AM

#

even if it is, i dont think its gonna be better than sota

elder rapids May 25, 2025, 2:05 AM

#

xai isn't really in the discussion

#

of most compute

#

how does that matter

patent aspen May 25, 2025, 3:23 AM

#

Google has by far the most total AI compute.

#

Bear in mind that AI compute isn't just number of data centers, data center size, number of chips, etc. It's about how many useful flops you can get out of those. Google would already be way in the lead in all of those categories, but once you account for how much efficiency Google can squeeze out of its flops, it's a massive lead

harsh flume May 25, 2025, 5:11 AM

#

elder rapids xai isn't really in the discussion

I saw some thread of Nvidia's CEO praising xAI as having built the fastest supercomputer on the planet or smth

#

https://x.com/PatOnTheLevel/status/1924480207660175675

Patrick Patterson (@PatOnTheLevel)

NVIDIA's CEO just broke the internet with a shocking admission.

During his latest interview, he confessed, "Tesla achieved something singular" with their supercomputer.

But there was more.

Here's what made Jensen Huang's jaw drop: 🧵

#

ah, but fastest doesn't equate to the most tbf

honest jewel May 25, 2025, 5:26 AM

#

harsh flume https://x.com/PatOnTheLevel/status/1924480207660175675

Interesting read

patent aspen May 25, 2025, 5:44 AM

#

harsh flume I saw some thread of Nvidia's CEO praising xAI as having built the fastest sup...

It's neither the fastest nor the most once you take into account cross-metro training

#

I've never heard of a data center being built that fast before though

#

much less at that scale

small haven May 25, 2025, 6:08 AM

#

o3 pro is going to increase afghanistan gdp by 0.5%

calm sequoia May 25, 2025, 7:26 AM

#

We haven't had a new ChatGPT-4o-latest for two months already. The latest was at (2025-03-26) and they had a flop after that.

calm sequoia May 25, 2025, 7:27 AM

#

small haven o3 pro is going to increase afghanistan gdp by 0.5%

Very random sentence here, could you elaborate?

#

Congratulations on the wise decision, lmarena team! The style control was always the best option.

tall summit May 25, 2025, 7:35 AM

#

small haven people are sleeping on codex, and me too 😴

ok that was a funny double meaning

late path May 25, 2025, 7:48 AM

#

Actually, I thought that glazed 4o version of Monday was really fun to chat with, but it wasn't as enjoyable after they reverted

hardy pecan May 25, 2025, 8:20 AM

#

VEO 3 is so fun to use holy

#

It's really quite amazing with the added audio

#

simple addition but improves it so much

torn mantle May 25, 2025, 8:52 AM

#

Next year i heard

dusky aurora May 25, 2025, 9:19 AM

#

tonight there's been one hell of a night

ocean vortex May 25, 2025, 10:39 AM

#

small haven o3 pro is going to increase afghanistan gdp by 0.5%

slop AI

#

we need ASI

#

it has to make Afghanistan rich

jolly aspen May 25, 2025, 10:51 AM

#

small haven o3 pro is going to increase afghanistan gdp by 0.5%

Which is about 10¢

dusky aurora May 25, 2025, 10:58 AM

#

beta interface became unusable in mobile

unborn ocean May 25, 2025, 11:42 AM

#

Poll: Should I try the following?
Prompt gemini 2.5 flash (no-thinking) to generate reasoning traces before answering with a good prompt like: https://github.com/huggingface/open-r1/discussions/164.

Winning answer: 4 of 10 votes

#

Wow guys, very helpful advice as always

ocean vortex May 25, 2025, 11:55 AM

#

dork 4.0 is gonna conquer all

hot pelican May 25, 2025, 12:34 PM

#

Just got back weeks after reporting this issue here. Either or both AIs not showing any result in most cases, but still being able to vote

#

I am still voting, messing up arena results. Because it doesn't matter if I vote or not, as many people are voting on "webdevarena" web errors instead of the AIs

torn mantle May 25, 2025, 12:35 PM

#

hot pelican

yea there are some rendering issues

#

we dont know if its from lmarena sandboxing or model issues

hot pelican May 25, 2025, 12:36 PM

#

Yeah, the voting result is messed up, because people are voting on the platform's errors, which then count as the AI being bad

torn mantle May 25, 2025, 12:36 PM

#

they said its not counted

#

if it failed rendering or gave you an error then the vote wont count

#

thats what they said the other day

#

their errors handling is kinda ....

hot pelican May 25, 2025, 12:38 PM

#

yeah. I don't think the way they detect whether to count vote can be made reliable. As sometimes it actually generates code, but doesn't preview it
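
To make the concern concrete, here is a minimal sketch of the kind of filter being described: drop any vote where either side failed to render. This is not LMArena's actual logic; the `Vote` fields and `countable` helper are hypothetical. It also shows the failure mode raised above: if the render flag is set whenever code was generated, a blank preview still slips through.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vote:
    """Hypothetical record of one head-to-head vote."""
    winner: str       # "a" or "b"
    a_rendered: bool  # assumed flag: side A's preview rendered
    b_rendered: bool  # assumed flag: side B's preview rendered


def countable(votes: List[Vote]) -> List[Vote]:
    # Keep only votes where both previews are flagged as rendered.
    return [v for v in votes if v.a_rendered and v.b_rendered]


votes = [
    Vote("a", a_rendered=True, b_rendered=True),   # normal vote, kept
    Vote("a", a_rendered=True, b_rendered=False),  # B errored, dropped
    # If B generated code but the preview stayed blank, and the flag was
    # still set to True, this vote would be counted against B anyway.
    Vote("a", a_rendered=True, b_rendered=True),
]

print(len(countable(votes)))
```

The filter is only as reliable as the flag feeding it, which is the point being argued here.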

#

In the last few weeks, almost all requests were giving such an error. I wonder how much this messes up the leaderboard

torn mantle May 25, 2025, 12:44 PM

#

hot pelican yeah. I don't think the way they detect whether to count vote can be made reliab...

the vote isnt counted

alpine coral May 25, 2025, 1:13 PM

#

unborn ocean Wow guys, very helpful advice as always

lol fwiw i was curious...using that R1-inspired CoT prompt you linked to, 2.5-flash scored higher than the thinking version on the 10 sample questions for Simple Bench (i also tried a few other CoT prompts which did well too)

#

i didn't test repeatedly or beyond this.. my guess would be that coercing pre-answer 'thinking' through prompting probably wouldn't result in performance on par with the actual native thinking version consistently (at least that just seems too good to be true / possible).. but it seems maybe worth exploring, given the wild price differential (which doesn't really make sense to my mind and is an interesting question in its own right..)

#

but yeah anyway i dunno.. again just fwiw :))
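
For anyone wanting to try the same experiment, a minimal sketch of prompted chain-of-thought: wrap the question in an instruction to reason inside tags first, then split the reply into reasoning and final answer. The template wording and the `<think>`/`<answer>` tag names are assumptions for illustration, not the prompt from the linked discussion, and no model API is called here.

```python
import re
from typing import Tuple

# Hypothetical template; swap in whatever CoT prompt you are testing.
COT_TEMPLATE = (
    "Think through the problem step by step inside <think>...</think> tags, "
    "then give only the final answer inside <answer>...</answer> tags.\n\n"
    "Question: {question}"
)


def build_cot_prompt(question: str) -> str:
    """Wrap a question so a non-thinking model writes a reasoning trace first."""
    return COT_TEMPLATE.format(question=question.strip())


def split_response(text: str) -> Tuple[str, str]:
    """Return (reasoning, final_answer) parsed from a tagged reply."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    # Fall back to the whole reply if the model ignored the answer tags.
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final


prompt = build_cot_prompt("What is 2+2?")
reasoning, final = split_response("<think>2+2 is 4</think><answer>4</answer>")
print(final)
```

Scoring only `final` against the benchmark key keeps the comparison with the native thinking version fair, since the trace itself is never graded.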

unborn ocean May 25, 2025, 1:38 PM

#

:) now i am really interested

#

don't have much time, but expect something larger about it this week

torn mantle May 25, 2025, 1:38 PM

#

alpine coral lol fwiw i was curious...using that R1-inspired CoT prompt you linked to, 2.5-fl...

what about goldmane & redsword

cedar tide May 25, 2025, 3:06 PM

#

Is Claude 4 Opus the best non-reasoning model?

calm sequoia May 25, 2025, 3:36 PM

#

Flash is non reasoning?

tall summit May 25, 2025, 3:56 PM

#

you can choose

cedar tide May 25, 2025, 4:06 PM

#

calm sequoia Flash is non reasoning?

Both, there's a reasoning version and a non-reasoning version

cedar tide May 25, 2025, 4:08 PM

#

cedar tide Is Claude 4 Opus the best non-reasoning model?

Do you think one of these models will dethrone it?

Gemini 2.5 Pro (no think, out early June)
Grok 3.5 (if we will have no think version)
Llama Behemoth
gpt 5 (if we will have no think version)
next deepseek v..
Mistral large 3

lime coral May 25, 2025, 4:11 PM

#

How would we know?

calm sequoia May 25, 2025, 4:20 PM

#

Now the LMArena is fixed (style control default). You can't cheat to first place using emoji and flattery techniques. Because of this, Behemoth is out of the question

#

It is more real than gork

torn mantle May 25, 2025, 4:29 PM

#

Wen?

wintry tinsel May 25, 2025, 4:30 PM

#

cedar tide Do you think one of these models will dethrone it? - Gemini 2.5 Pro (no think, ...

I highly doubt behemoth will release, mistral 3 large won’t be better than opus, neither will deep seek v4, Gemini will be better at some technical tasks, but worse in creative writing, general conversation, and logic, gpt 5 I’m guessing will beat it

cedar tide May 25, 2025, 4:31 PM

#

calm sequoia Now the LMArena is fixed (style control default). You can't cheat to first place...

I'm not talking about LMArena, just about being better in real use

ember rapids May 25, 2025, 4:32 PM

#

Deepseek overrated

cedar tide May 25, 2025, 4:34 PM

#

wintry tinsel I highly doubt behemoth will release, mistral 3 large won’t be better than opus,...

Why not release Behemoth ?

cedar tide May 25, 2025, 4:34 PM

#

wintry tinsel I highly doubt behemoth will release, mistral 3 large won’t be better than opus,...

And grok 3.5 ?

balmy mist May 25, 2025, 4:35 PM

#

what happened to relevant ai news channel?

#

also is 4o still the best image editing model?

torn mantle May 25, 2025, 4:37 PM

#

ember rapids Deepseek overrated

By you

tall summit May 25, 2025, 4:38 PM

#

cedar tide Do you think one of these models will dethrone it? - Gemini 2.5 Pro (no think, ...

i do not

#

think any of them will

#

i guess gpt 5 possibly

torn mantle May 25, 2025, 4:43 PM

#

Isnt gpt 5 like o3 pro + some gpt model on top?

echo aurora May 25, 2025, 4:44 PM

#

is it just me or are people seeing the site broken rn?

misty vault May 25, 2025, 4:45 PM

#

yes

#

why did u delete lmarena css @paws

tall summit May 25, 2025, 4:45 PM

#

yes

fleet lintel May 25, 2025, 4:45 PM

#

Did Meta give up on Llama?

torn mantle May 25, 2025, 4:45 PM

#

fleet lintel Did Meta give up on Llama?

Yes

#

Hehe

#

A win for us

#

Enough of slops

tall summit May 25, 2025, 4:45 PM

#

no it's not

#

the more open source the better

cedar tide May 25, 2025, 4:46 PM

#

torn mantle Isnt gpt 5 like o3 pro + some gpt model on top?

We dont know

echo aurora May 25, 2025, 4:46 PM

#

thank you meowpensivepray ! we're debugging

fleet lintel May 25, 2025, 4:46 PM

#

tall summit the more open source the better

True but I have negative trust on anything with Zuck/Meta

tall summit May 25, 2025, 4:46 PM

#

fleet lintel True but I have negative trust on anything with Zuck/Meta

alright then

misty vault May 25, 2025, 4:47 PM

#

idgaf about open source

#

id give sam sloppy one for gpt-4-0314 access

torn mantle May 25, 2025, 4:48 PM

#

cedar tide We dont know

I thought you knew everything

cedar tide May 25, 2025, 4:52 PM

#

torn mantle Isnt gpt 5 like o3 pro + some gpt model on top?

"better"

Screenshot_2025-05-25-18-51-39-603_com.reddit.frontpage-edit.jpg

torn mantle May 25, 2025, 4:53 PM

#

cedar tide "better"

Eh

#

Better huh

elder rapids May 25, 2025, 4:58 PM

#

cedar tide Do you think one of these models will dethrone it? - Gemini 2.5 Pro (no think, ...

prob grok 3.5, but there's a good chance 2.5 pro is better than it too at least in creative writing, regular discussion, philosophy etc, just not in coding

keen beacon May 25, 2025, 4:58 PM

#

lmarena chat not rendering?

cedar tide May 25, 2025, 4:59 PM

#

elder rapids prob grok 3.5, but there's a good chance 2.5 pro is better than it too at least ...

You see goldmane and redsword?

#

very strong in code

keen beacon May 25, 2025, 4:59 PM

#

interesting

echo aurora May 25, 2025, 4:59 PM

#

keen beacon interesting

sorry about that! the team is working on it

keen beacon May 25, 2025, 5:00 PM

#

alright

elder rapids May 25, 2025, 5:00 PM

#

cedar tide very strong in code

ye

#

how are they outside of coding tho

cedar tide May 25, 2025, 5:02 PM

#

Will Claude 4 Haiku exist?

elder rapids May 25, 2025, 5:10 PM

#

dawg goldmanes stuff isn't loading the first time around and I can't see what it's generating, so when I vote for the opposing models generation and look back it loads and 100% of the time it looks better than the model I chose lmfao

#

goldmane is cracked tho lmao, I asked it to generate the Roblox webpage and it recalled not just the titles, but what teams made them including the avg like % on the actual webpage (close) and the actual rounded viewership for each of the game

#

seems like it's pretty big

leaden palm May 25, 2025, 5:19 PM

#

elder rapids goldmane is cracked tho lmao, I asked it to generate the Roblox webpage and it r...

wonder if they gave it tools

dusky aurora May 25, 2025, 5:20 PM

#

I am so much stressed and anxious right now

elder rapids May 25, 2025, 5:21 PM

#

leaden palm wonder if they gave it tools

it was up against opus too

#

opus hasnt once beaten goldmane

#

in these runs

dusky aurora May 25, 2025, 5:22 PM

#

opus is very good at writing prose scenes

#

so far LMArena is my only stress relief

#

I hope it doesn't have a quota

elder rapids May 25, 2025, 5:29 PM

#

btw webdev arena is broken on mobile

#

the send button isn't easily pressed

cedar tide May 25, 2025, 5:33 PM

#

Grok 3.5 this week 100%

leaden palm May 25, 2025, 5:34 PM

#

cedar tide Grok 3.5 this week 100%

market disagrees

Manifold

Which day of May will Grok 3.5 be released?

Update 2025-05-04 (PST) (AI summary of creator comment): - The release date used for resolution will be based on the Pacific Time (PT) zone.

Update 2025-05-10 (PST) (AI summary of creator comment): - To count for resolution, a release must be both reputable and intentional.

cedar tide May 25, 2025, 5:37 PM

#

leaden palm [market disagrees](https://manifold.markets/KTibow/which-day-of-may-will-grok-35...

Screenshot_2025-05-25-19-36-09-883_com.discord-edit.jpg

#

Screenshot_2025-05-25-19-36-49-783_com.twitter.android-edit.jpg

leaden palm May 25, 2025, 5:37 PM

#

cedar tide

well get your play money up then

#

prove the chair wrong

cedar tide May 25, 2025, 5:38 PM

#

@leaden palm It was supposed to come out a few weeks ago according to Musk.

leaden palm May 25, 2025, 5:38 PM

#

cedar tide @leaden palm It was supposed to come out a few weeks ago according to M...

yup

#

doesn't make it any more likely to release today

#

if you flip a coin 4 times and it lands on heads each time, thinking "okay surely it'll be tails next time" is fallacious

#

in fact it may indicate the opposite

#

in the coin's case, maybe it's biased
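
The "maybe it's biased" point can be worked through as a small Bayesian update. This is only a sketch under stated assumptions: a 50/50 prior between a fair coin and a hypothetical heads-biased coin that lands heads 90% of the time. Both numbers are illustrative choices, not anything from the discussion.

```python
def posterior_biased(heads_seen: int,
                     prior_biased: float = 0.5,
                     p_heads_biased: float = 0.9) -> float:
    """Probability the coin is the biased one after seeing only heads."""
    like_fair = 0.5 ** heads_seen
    like_biased = p_heads_biased ** heads_seen
    evidence = prior_biased * like_biased + (1 - prior_biased) * like_fair
    return prior_biased * like_biased / evidence


def prob_next_heads(heads_seen: int,
                    prior_biased: float = 0.5,
                    p_heads_biased: float = 0.9) -> float:
    """Predictive probability that the next flip is heads."""
    post = posterior_biased(heads_seen, prior_biased, p_heads_biased)
    return post * p_heads_biased + (1 - post) * 0.5


print(round(posterior_biased(4), 3))
print(round(prob_next_heads(4), 3))
```

Under these assumptions, four heads in a row push the probability of a biased coin to about 0.91 and the chance of heads on the next flip to about 0.87, so a streak is evidence for more of the same rather than for tails being "due".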

cedar tide May 25, 2025, 5:39 PM

#

leaden palm if you flip a coin 4 times and it lands on heads each time, thinking "okay surel...

This true

cedar tide May 25, 2025, 5:39 PM

#

leaden palm doesn't make it any more likely to release today

This not true

leaden palm May 25, 2025, 5:40 PM

#

cedar tide This not true

how is this different from the case of the roadster?

#

i mean of course it'll eventually release but why would "it hasn't released for a long time despite many claims" not transfer to here?

cedar tide May 25, 2025, 5:41 PM

#

@leaden palm If Elon says once it's coming out next week it's 99% ready, each day that passes increases the progression towards 100% and increases the probability of release

feral lichen May 25, 2025, 5:41 PM

#

best ai for roblox studio?

leaden palm May 25, 2025, 5:41 PM

#

cedar tide @leaden palm If Elon says once it's coming out next week it's 99% ready...

well we'll see what happens this week

#

i honestly might be with the chair though

cedar tide May 25, 2025, 5:42 PM

#

Elon never said the roadster will be released next week

keen beacon May 25, 2025, 5:42 PM

#

elon rt'd fake benchmarks, i wouldnt trust him lol

misty vault May 25, 2025, 5:42 PM

#

grok is releasing soon

cedar tide May 25, 2025, 5:44 PM

#

keen beacon elon rt'd fake benchmarks, i wouldnt trust him lol

we can't really compare

#

?

misty vault May 25, 2025, 5:44 PM

#

fr

#

are u a drooling alien

cedar tide May 25, 2025, 5:44 PM

#

Im elon

#

for me there is a 50% chance that grok 3.5 is only with reasoning

leaden palm May 25, 2025, 5:47 PM

#

plain o3?

#

you like it that much?

misty vault May 25, 2025, 5:49 PM

#

https://tenor.com/view/cat-trump-kat-meow-gif-7676659

#

32k one was actually nerfed already

#

the real one is 8k one

cedar tide May 25, 2025, 5:49 PM

#

@deep adder "why are you trolling" ?

misty vault May 25, 2025, 5:49 PM

#

but thats so small context window

high ginkgo May 25, 2025, 5:49 PM

#

I agree

elder rapids May 25, 2025, 5:54 PM

#

sonnet 4 making build errors vs redsword?

#

crazy

#

holy

#

I just asked goldmane to make a page debunking an unfalsifiable philosophical position

#

and it took the extra steps to make a chat-like demonstration process

#

where if you actively talk to it

elder rapids May 25, 2025, 6:00 PM

#

elder rapids where if you actively talk to it

that action itself demonstrates the self refutation of the philosophy

#

that's not meaningful

#

philosophy isnt contingent on subjective thought

#

it only invokes them for analytic material/or comparison

#

ye, you can tell how goldmane and redsword format their code

#

it's fancy asf

#

look for large indentations

elder rapids May 25, 2025, 6:03 PM

#

elder rapids look for large indentations

or gradient indents

#

seems true

#

goldmane is intelligent asf

sour spindle May 25, 2025, 6:07 PM

#

Who is behind Goldmane?

elder rapids May 25, 2025, 6:07 PM

#

intuition is telling me it's another possible nebula moment

elder rapids May 25, 2025, 6:07 PM

#

sour spindle Who is behind Goldmane?

Google

keen beacon May 25, 2025, 6:08 PM

#

elder rapids intuition is telling me it's another possible nebula moment

I don't think so tbh

elder rapids May 25, 2025, 6:08 PM

#

ye but this trend doesn't have to continue due to the GA releases

elder rapids May 25, 2025, 6:09 PM

#

keen beacon I don't think so tbh

really feels that way tbh, it's accomplishing the tasks in a very very strong way, it has personality in its output which imo means it's going further than just understanding the prompt

#

and simply doing it

#

like how nebula didn't just "answer"

#

but in this case it's for webdev

#

ye

#

and imo redsword is worse

#

it has that capability

keen beacon May 25, 2025, 6:11 PM

#

Yeah

elder rapids May 25, 2025, 6:11 PM

#

ye

#

2.5 pro

keen beacon May 25, 2025, 6:12 PM

#

I don't think the conditions for another nebula moment are met with this model

elder rapids May 25, 2025, 6:12 PM

#

answering the 3 questions

#

the last question just indicates GA release

#

that's all

keen beacon May 25, 2025, 6:13 PM

#

People interpreting it as Gemini 3 lol but it might be true coincidentally anyway

elder rapids May 25, 2025, 6:13 PM

#

if they go to Gemini 3 I'd be surprised tbh

keen beacon May 25, 2025, 6:13 PM

#

2.5 pro

elder rapids May 25, 2025, 6:14 PM

#

best model by the largest gap we'd seen in a long time

#

don't posture it as simply incidental, it's fundamentally different from 2.0 gen

#

id be surprised if they didn't brand such a difference as 2.5

#

simply due to how different 2.0 → 2.5 is fundamentally

keen beacon May 25, 2025, 6:16 PM

#

It has continued pretraining (I think) though so it isn't just post training (unless you count it). Probably a bunch of extra stuff on top of that too

elder rapids May 25, 2025, 6:17 PM

#

I don't believe that one bit tbh

#

not saying you're simply wrong but you'd have to question that even in the position of deepmind themselves

#

no

#

I'm not talking about any indication of performance

#

ion know if you saying you work at the company and that you saw every update Internally is that meaningful

#

but by all means

#

it's not the behavior that's different tho

#

they said when releasing 2.5 pro it's simply a different model altogether

#

whether it's major architectural differences, techniques

#

etc

#

this already warrants a generation iteration

keen beacon May 25, 2025, 6:26 PM

#

elder rapids they said when releasing 2.5 pro it's simply a different model altogether

I don't think they did, they said 'enhanced base model' among other things

#

Yeah but it's different from a fresh pretrain

#

They didn't say that though I think

#

And it doesn't really make sense tbh but I don't know

elder rapids May 25, 2025, 6:30 PM

#

ye but the point is, if it's large enough it's inherently a .5 or 1.0 entire change given this difference. We know 1.5 → 1.5 002 are different, but 2.0 → 2.5 is larger, how it recollects things, the difference in the CoT altogether from 2.0 flash thinking (2.0 flash thinking had a much much much different trace) and all that stuff

#

you can justify them seeing the performance and whatnot and then saying "oh this is worthy of a .5 level difference"

#

but I don't believe this when it's such a different model both behaviorally, technique wise, and many other parts

keen beacon May 25, 2025, 6:30 PM

#

I don't think 2.0 flash thinking had a significantly different trace style at least

elder rapids May 25, 2025, 6:31 PM

#

it did

#

I used that model everyday lmao

#

it had a MUCH different trace

keen beacon May 25, 2025, 6:31 PM

#

I did too

elder rapids May 25, 2025, 6:31 PM

#

I even pointed this out

#

when 2.5 pro released

keen beacon May 25, 2025, 6:31 PM

#

The style remained somewhat the same but the underlying model and capabilities were significantly different

#

It's overblown

#

People are misinterpreting it

#

It's complex

elder rapids May 25, 2025, 6:33 PM

#

keen beacon The style remained somewhat the same but the underlying model and capabilities w...

the style was drastically different, it didn't format like that and didn't have much if any "aha" moments, it didn't iterate through its own thought process and it didn't reflect that much on its thought process via the output

keen beacon May 25, 2025, 6:33 PM

#

Yeah I think it's kinda like that

elder rapids May 25, 2025, 6:33 PM

#

although 2.0 thinking was still reminiscent of that blank canvas behavior

#

where you could force it to iterate techniques through a context

#

but everything about it was diff

keen beacon May 25, 2025, 6:34 PM

#

Tbh I used it a lot and I sorta agree

elder rapids May 25, 2025, 6:34 PM

#

single responses and personality ye but that's where the critique ends

#

since similar to 2.5 pro (though not nearly as strong) it got better through the context window

fleet lintel May 25, 2025, 6:35 PM

#

yup. that's why I was very surprised by 2.5 pro drop. It was leaps ahead

elder rapids May 25, 2025, 6:35 PM

#

it was a fun model tbh

patent aspen May 25, 2025, 6:35 PM

#

2.0 flash thinking was Google experimenting with reasoning for like 1-2 months

keen beacon May 25, 2025, 6:35 PM

#

I've never seen a model do this before. Go 29k into thinking and return back the problem

keen beacon May 25, 2025, 6:36 PM

#

keen beacon I've never seen a model do this before. Go 29k into thinking and return back the...

Old screenshot of flash thinking

patent aspen May 25, 2025, 6:36 PM

#

2.0 flash thinking was massive for price-vs-performance at the time it came out

#

It wasn't smart but it was cheap

#

It was

elder rapids May 25, 2025, 6:36 PM

#

keen beacon I've never seen a model do this before. Go 29k into thinking and return back the...

ye

#

wouldn't comprehend its own thinking process

#

and or wouldn't follow the output the thinking process invoked

keen beacon May 25, 2025, 6:38 PM

#

Lmao

elder rapids May 25, 2025, 6:38 PM

#

he's trolling

fleet lintel May 25, 2025, 6:38 PM

#

what's nebula?

elder rapids May 25, 2025, 6:39 PM

#

wym

#

lol

#

everyone was glazing it

#

like it was the second coming of Jesus

#

what

keen beacon May 25, 2025, 6:39 PM

#

Nebula

elder rapids May 25, 2025, 6:39 PM

#

ye but nobody is talking about flash thinking

#

ye

#

you don't believe that

#

😭

#

2.5 pro is STILL the best casual general model

keen beacon May 25, 2025, 6:40 PM

#

I'd rather have that than the meta plague of before

#

Wow it turned me off the arena for a while

#

The meta spam

fleet lintel May 25, 2025, 6:40 PM

#

interesting... are they planning to release 2.5 revision with new pre-training?

elder rapids May 25, 2025, 6:40 PM

#

they've done this since the middle of last year

#

so do exactly what they've been doing?

keen beacon May 25, 2025, 6:41 PM

#

They don't do anon models it's boring

elder rapids May 25, 2025, 6:41 PM

#

since they started because of the other companies

fleet lintel May 25, 2025, 6:42 PM

#

anthropic is good but too expensive

sonic tendon May 25, 2025, 6:42 PM

#

anthropic seems to not really care about lmarena

#

yea

fleet lintel May 25, 2025, 6:42 PM

#

companies dont benchmark when they know that they suck

elder rapids May 25, 2025, 6:42 PM

#

the exceptions lmao, ALL of their models prior previewed on lmarena, as well as grok, Mistral, Nexus, meta

fleet lintel May 25, 2025, 6:42 PM

#

o3 is overhyped

#

delusional much?

elder rapids May 25, 2025, 6:44 PM

#

btw in lmarena benchmaxxing can only be apparent in a fine-tune

#

ironically it's one of the ones you can't game that easily

#

despite weird narratives

#

😭

#

bro said perplexity

fleet lintel May 25, 2025, 6:45 PM

#

this looks internal info... how good are these latest iterations? should i expect good improvements?

elder rapids May 25, 2025, 6:46 PM

#

if simplebench comes out with a high score for them

#

then unironically ye

#

I'm so sad Claude 4 opus is mid

sonic tendon May 25, 2025, 6:46 PM

#

elder rapids I'm so sad Claude 4 opus is mid

same

#

really disappointed with claude 4 in general tbh

keen beacon May 25, 2025, 6:47 PM

#

they salvaged their 3.5 opus training run (i think)

#

its not bad but its not 😮

elder rapids May 25, 2025, 6:47 PM

#

keen beacon they salvaged their 3.5 opus training run (i think)

just imagine Claude 3 opus but better

#

😭😭

#

agi brooo

fleet lintel May 25, 2025, 6:47 PM

#

sonic tendon really disappointed with claude 4 in general tbh

give me opus 4 at 2.5 pro price and i'll be super happy

keen beacon May 25, 2025, 6:48 PM

#

its harder to work with

fleet lintel May 25, 2025, 6:48 PM

#

quality is whatever...it's flash-lite

elder rapids May 25, 2025, 6:48 PM

#

it doesn't ye

keen beacon May 25, 2025, 6:48 PM

#

its probably the largest reasoning model out there

elder rapids May 25, 2025, 6:48 PM

#

sonnet gains the most

#

grok I think

keen beacon May 25, 2025, 6:48 PM

#

grok 3 reasoning is vaporware

misty vault May 25, 2025, 6:48 PM

#

bro like literally right now claude 4 opus added a tiny lil feature in the code that i didnt ask for

keen beacon May 25, 2025, 6:48 PM

#

it never released

misty vault May 25, 2025, 6:49 PM

#

what is this gemini 2.5 pro ahh sh*t

keen beacon May 25, 2025, 6:49 PM

#

i honestly prefer 3.7 sonnet over 4 sonnet

#

for creative writing personally

elder rapids May 25, 2025, 6:50 PM

#

can't imagine using these models for things other than vibe coding when you have 2.5 flash right there

#

flash has never failed me

keen beacon May 25, 2025, 6:50 PM

#

tbh i wonder what was the deal with sonnet 3.5

elder rapids May 25, 2025, 6:50 PM

#

keen beacon tbh i wonder what was the deal with sonnet 3.5

ye

keen beacon May 25, 2025, 6:50 PM

#

truly a generational run lol

#

their 3.5 haiku was a dud, 3.5 opus was seemingly a dud etc

elder rapids May 25, 2025, 6:51 PM

#

keen beacon for creative writing personally

yep, also btw they apparently DID train Claude 4's specifically for creative writing

keen beacon May 25, 2025, 6:51 PM

#

the comments help it bring you a better output

elder rapids May 25, 2025, 6:51 PM

#

the comments are CoT

keen beacon May 25, 2025, 6:51 PM

#

it might be annoying for you but its easier for the model

#

i honestly prefer it

#

u dont need opus 4 to clean up the comments lmao

elder rapids May 25, 2025, 6:52 PM

#

elite efficiency 😭😭

keen beacon May 25, 2025, 6:52 PM

#

what a waste lmfao

elder rapids May 25, 2025, 6:52 PM

#

you can just use a tiny model for that lmfao

keen beacon May 25, 2025, 6:53 PM

#

man i just love reading 2.5 pro cot, i hope they bring it back

elder rapids May 25, 2025, 6:53 PM

#

keen beacon man i just love reading 2.5 pro cot, i hope they bring it back

yes bro

keen beacon May 25, 2025, 6:53 PM

#

the model is so smart

misty vault May 25, 2025, 6:53 PM

#

vibe coding with gemini 2.5 pro be like:

ur task: write one line of code

// this
// is
// a
// comment


oneLineOfCode();

// end code
// END_FILE_H
// end file

// this looks better trust me bro
secondLineOfCode();


checkIfLinesOfCodeExist();
// ensure the lines exist bro trust me, i get hard from adding 10 billions of checks and redundant code

redundantCode();
// like why not trust me bro

elder rapids May 25, 2025, 6:54 PM

#

can't wait for 2.5 pro GA release

#

but goddamn that hidden CoT

#

😭

keen beacon May 25, 2025, 6:54 PM

#

yeah

elder rapids May 25, 2025, 6:54 PM

#

at least I can guess what's happening

keen beacon May 25, 2025, 6:54 PM

#

bring back raw thoughts 😭

elder rapids May 25, 2025, 6:54 PM

#

and how to fix certain tendencies

keen beacon May 25, 2025, 6:54 PM

#

its so hard to use the model tbh without it

#

yeah

elder rapids May 25, 2025, 6:54 PM

#

but people who don't know are going to have trouble

elder rapids May 25, 2025, 6:55 PM

#

keen beacon its so hard to use the model tbh without it

ye

#

its so easy to read them and see the problem

keen beacon May 25, 2025, 6:55 PM

#

i didnt realize how much of an effect it had on my usage tbh

#

its night and day

elder rapids May 25, 2025, 6:55 PM

#

ong just bring it back

#

@patent aspen yo tell demis what's up

#

tell him to bring back the raw cot

patent aspen May 25, 2025, 6:56 PM

#

elder rapids @patent aspen yo tell demis what's up

I assume they've received that as feedback at this point

golden ocean May 25, 2025, 6:56 PM

#

misty vault vibe coding with gemini 2.5 pro be like: ur task: write one line of code ```cp...

lmaooo
fr

keen beacon May 25, 2025, 6:57 PM

#

yeah theres a forum thread about it

#

https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-process-vanished-from-ai-studio/83916/59

Google AI Developers Forum

Massive Regression: Detailed Gemini Thinking Process vanished from ...

Thank you to everyone who has shared their thoughts and concerns in this thread. We hear you. While we’re excited to now return thought summaries directly through the Gemini API for the first time, we understand this is a different experience from the raw thoughts previously available in AI Studio. It’s clear that in their current state, the...

elder rapids May 25, 2025, 6:57 PM

#

nah I want to get it directly to demis

#

if he doesn't do what I ask

#

there'll be consequences

keen beacon May 25, 2025, 6:57 PM

#

i wonder if this is a product decision primarily or competitive

fleet lintel May 25, 2025, 6:58 PM

#

my understanding is that they are worried about people crawling and stealing raw gemini thoughts

keen beacon May 25, 2025, 6:58 PM

#

fleet lintel my understanding is that they are worried about people crawling and stealing raw...

atp tbh i dont think it matters anymore

elder rapids May 25, 2025, 6:59 PM

#

keen beacon https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-proc...

nice that people understand that the problem is inherent and not the summary itself

keen beacon May 25, 2025, 7:00 PM

#

yeah ive seen really good responses

elder rapids May 25, 2025, 7:00 PM

#

although Google can definitely solve this and still have summary

#

that's a harder route

#

than just enabling raw CoT

patent aspen May 25, 2025, 7:00 PM

#

My guess is they won't go back to raw thoughts but will make the summaries more focused on the key steps or more verbose

fleet lintel May 25, 2025, 7:01 PM

#

how good is "deep think"? any insights here?

elder rapids May 25, 2025, 7:01 PM

#

fleet lintel how good is "deep think"? any insights here?

agi

misty vault May 25, 2025, 7:01 PM

#

fleet lintel how good is "deep think"? any insights here?

agi

golden ocean May 25, 2025, 7:01 PM

#

fleet lintel how good is "deep think"? any insights here?

agi

tall summit May 25, 2025, 7:01 PM

#

fleet lintel how good is "deep think"? any insights here?

agi

high ginkgo May 25, 2025, 7:01 PM

#

fleet lintel how good is "deep think"? any insights here?

agi

fleet lintel May 25, 2025, 7:01 PM

#

ajhhhh

elder rapids May 25, 2025, 7:03 PM

#

I'm actually using it rn

#

deepmind gave me super duper access

#

yep AGI btw