#general

1 messages · Page 46 of 1

elder rapids
#

ye

patent aspen
#

I think an issue for xAI is that they don't have enough people to do deep R&D, hill climbing, and the mundane compliance and administrative stuff at the same time as well as its competitors

elder rapids
#

but not a lot of people will be able to test it

#

like o1 pro, theres no reason to talk about it

hollow ocean
#

It’s one day of work

patent aspen
#

They might be able to get a lot from brute force hardware, but they wouldn't have enough people leftover to hill climb across everything

#

Similar to Anthropic

teal mantle
#

I remember the inaccurate leaked specs of Grok 3.5
It sounds too optimistic
Especially now we know it would beat Opus 4

keen beacon
#

someone on this discord made fake benchmarks then elon rt'd it (and subsequently deleted it later)

elder rapids
#

crazy tbh

#

I wonder how many people are actually in here

teal mantle
keen beacon
#

opus wasnt a fresh pretrain, it seems like a last ditch effort to salvage opus 3.5. but just speculating

teal mantle
#

I don’t doubt xAI could outrun Anthropic but the number are implausible

teal mantle
elder rapids
#

prolly in two weeks

hollow drift
#

So you all don’t think anything will beat out Gemini 2.5 pro on the leaderboard by the end of the month?

keen beacon
#

the ga revision probably not

#

probably very unlikely

#

claude models dont usually perform that well in the arena either

teal mantle
elder rapids
#

the gap is MASSIVE

alpine coral
#

i have yeah

#

opus 4 is very good

elder rapids
#

ye

#

pretty likely

alpine coral
#

i really don't know what to make of sonnet 4 tho..

teal mantle
#

I think o4 full sized might come firsthand

elder rapids
#

yet everyone just a couple months ago swore Google never benchmark maxed

#

along with anthropic

#

I also mean for lmarena

keen beacon
#

o3 was probably done more recently than you think

elder rapids
#

it doesn't seem like they're benchmark maxing

keen beacon
#

yeah

#

likely

#

their cpt wasnt done yet probably

#

the cpt was a company wide effort i think

#

4o image gen is on the new 4.1 base model, it would seem

#

very likely 4.1 mini

#

chatgpt 4o latest uses the 4.1 base model and predated 4.1

#

since jan as public api snapshots

torn mantle
#

redsword seems to fail a lot when using threejs

keen beacon
#

chatgpt in dec with some tests

#

@alpine coral what base model do you think they used otherwise lol

alpine coral
keen beacon
#

they arent doing a fresh pretrain for that

teal mantle
#

Btw is Grok 3.5 still a vaporware

elder rapids
keen fulcrum
#

Grok 3.5 better than claude 4 sonnet

keen beacon
#

it just does not make sense

alpine coral
#

fair enough

#

i don't have any profound insight here lol

#

it's just o4-mini

teal mantle
keen beacon
#

seriously the naming is arbitrary

alpine coral
#

i disagree

teal mantle
#

But then didn’t arguably the fake bench derail the 3.5 release?

alpine coral
#

but perhaps i'm misstaken

keen beacon
alpine coral
#

plain and simple lol

keen beacon
#

it might or might not be

#

it doesn't matter

alpine coral
#

yeah i ofc don't know

keen beacon
#

o4 mini existing doesnt necessitate it

alpine coral
#

precedent would at least be strongly indicativate

keen fulcrum
#

O5 will be great, tbd at end of the year

patent aspen
#

Google has both of the top 2 AI organizations in the world

#

Merged into one

keen beacon
keen fulcrum
#

There is Google DeepMind

patent aspen
#

Yes prior to Gemini, Google Brain and DeepMind published some wild percentage of all NeurIPS papers. 80-90% of the seminal AI research papers from past decade came from those 2 organizations

misty vault
#

.

torn mantle
#

hyping up grok 3.5

keen fulcrum
torn mantle
#

what about grok 5

keen fulcrum
#

Tbd

alpine coral
keen fulcrum
#

ChatGPT could have been created a decade ago, technology wasn't advanced for a purpose

patent aspen
#

Attention is All You Need was published by Google in 2017

#

That's where transformers originated

keen fulcrum
#

Indeed, there wasn't moved as much R&D capital as needed

keen beacon
# alpine coral apart from perhaps from grok i'm not really sure which you're referring to (in t...

1.5 pro was a fresh pretrain iirc. 2.0 pro was a fresh pretrain. 2.5 pro is a cpt/etc.
claude 3 sonnet (fresh pretrain) -> claude 3.5 sonnet (fresh pretrain, not the same size compared to claude 3 sonnet as it was increased, there was a piece of anthropic media that stated this/i don't have it anymore) -> claude 4 sonnet (cpt, timeline is too short for a fresh pretrain imho, even more likely the case for opus 3.5. semianalysis reported about 3.5 opus's existence too, i think they salvaged 3.5 opus.)
4o -> 4.5 -> 4.1, despite chatgpt 4o (jan+) = 4.1 lol
4o image gen being on 4.1 (not too sure, as there could be additional post processing)

i hope i dont have to list more because they are wack

#

gemini 1.5 flash -> gemini 2.0 flash lite (probably same size, something about it in a google blog somewhere)
new model size -> gemini 2.0 flash

alpine coral
#

i was about to say "if you were talking about 1.5 pro and 1.5 flash i'd understand it a bit more " ha

small haven
#

so what was it based on before 4o?

alpine coral
#

i mean i'm sure this isn't perfectly accurate, but it conveys where i'm coming from

keen beacon
#

yeah thats just WAY too many assumptions

alpine coral
#

again, fair enough 🙂

keen beacon
#

imho, they could be true, but its just way too much

#

if you look at 4.1 (they spent several months on this, they'll probably use this for the foreseeable future. mid train of 4o, confirmed outright in a podcast) and 4.1 mini (recent fresh pretrain, you can hear openai employees talk about this). they would try to squeeze out as much api value as they can (e.g. instruct models), so i doubt they have actual different internal models (that don't originate from them two) that they would specifically use for o series

#

this is the most obvious thing about it. then additionally via pretraining probing, it seems highly likely that o4 mini uses 4.1 mini on a base model, you might be able to actually prove it tbh but worthless endeavor

alpine coral
#

i dunno man.. but i find it weird to think o4-mini is based of 4.1-mini

#

i just find it more likely that o4-mini is based of / distilled from o4 in the same way o3-mini from o3, and o1-mini from o1.. (and similiarly for 4o-mini, it's a distilliation of 4o)

keen beacon
alpine coral
#

afaik that relationship between the preceding o series isn't disputed

#

i'm just going by the past..

#

not cpt dates

keen beacon
#

the naming is mostly marketing

alpine coral
#

i'm not saying it's 1000% or iron clad

#

you're strawmanning my claim

keen beacon
#

the development cycle matches with o3, if they were to distill anything it would be o3. they mightve have done preliminary stuff with o4 etc but i dont see it being the case that they couldve distilled o4 already

alpine coral
keen beacon
alpine coral
#

i honestly don't know.. it's hard to talk about benchmarks when one of models is unreleased

#

maybe o4 sucks (compared to mini.. based on costs)

keen beacon
#

you think theres a possibility that o4 has 20% on simpleqa

alpine coral
#

and it's non release is as simple as that lol

alpine coral
keen beacon
#

its just a coincidence that the benchmarks align with 4.1 mini... they would spend several months working on 4.1 (later using it in o3), just to abandon it with o4 with a model with significantly less world knowledge than gemini 2 flash. does not make sense

#

you can believe me or not tbh. i feel like ive been observant and reasonable about these things. and you can probably prove that o4 mini is based on 4.1 mini, but i really think its pointless to go to that extent

alpine coral
#

i think all these models are basically based of GPT4.5/5 in one way or another, with different cpt's

#

anyway.. we've been waiting for claude 4 and it finally arrived..

civic flame
alpine coral
#

on 2/3 of the question sets, sonnet 3.7 outperforms sonnet 4...

#

but opus is very good

civic flame
#

you should test redsword on the arena

#

seems better than 2.5 pro

#

also wtf is with sonnet 4

#

yikes

alpine coral
civic flame
#

goldmane and redsword popped up today, both google models

#

redsword seems to be the better one but doesn't hurt to try both

alpine coral
misty vault
#

what about opus 4

torn mantle
small haven
#

aight operator on o3 still sucks

torn mantle
#

the thing with all anthropic models is that they are so lazy

#

and they made that on purpose

#

to save tokens

keen beacon
#

seems fine to me, only had the issue with new sonnet (oct)

torn mantle
#

i want it to reason even in the stupidest things

misty vault
#

Does it beat 3.7 sonnet at least

#

in coding

ocean vortex
#

it was the opposite of that with thinking budget maxed out

#

new Opus is lazy though. Kinda defeats the purpose of thinking budget in the first place

leaden meteor
#

I am not good enough yet to identify the nuances between very good and great models. But I am surprised that people are saying opus 4 is inferior to 2.5 pro when most benchmarks say otherwise....I cant honeslty differentiate between both much...

alpine coral
ocean vortex
# keen beacon its so expensive

still in the acceptable range with pricing. I was about to say it's barely more expensive than o3, but... OpenAI reduced the price? 🤯

alpine coral
#

it like won't do things that are token intensive that 3.7 would just naturyally '

ocean vortex
#

Could have sworn it was the same price as o1

#

now it's $40 not $60 per 1M output

main gulch
#

it launched with $40

ocean vortex
#

o3 cheaper than o3?

#

🤣

#

I said decent price for o3

#

because o3 launched with that price...?

#

lmao

keen beacon
#

ignore me

ocean vortex
keen beacon
#

the lack of sleep is really catching up to me lol

ocean vortex
#

I think Google started something really good initially with 2.5 Pro which probably did push OpenAI to do this... But recent Google moves on pricing are less promising

#

just a shame that Google gave in so fast... Like they aren't even pushing AI from google.com yet. But pricing for their plans is already as if they were fully committed

#

you can't use gemini on google.com and that's very odd all things considered. If AI is their side kick then that pricing has no business to be a thing

#

only US I think. Just like AI overviews. But gemini website is not US only

#

yeah bing.com is full AI worldwide so I don't think it's a major roadblock

#

probably more like them wanting their ad revenue tbh

#

but you can't have that and then also charge people $250 a month for AI on top

torn mantle
#

stop

#

haah

#

sigh

small haven
#

cooking up a lawsuit

torn mantle
#

lol

#

someone replied with this

#

idk it may be related

#

@deep adder

#

wdym

#

you see the correlation too?

#

well lets just hope the model is good

small haven
#

operator o3 in a nutshell

#

not as of today

#

it did things better than the old operator, but still shite

#

and it shouldn't be using windows as the os, imo

#

jk

#

linux

torn mantle
#

i just dont understand whats really happening

#

some of their staff are actually pioneers of the reasoning paradigm

small haven
torn mantle
#

got me on that

small haven
#

xai needs to acquire ssi and have ilya as the leader, not that red hat wearing gork creator

torn mantle
#

why would they acquire ssi?

#

nah

small haven
#

buddy tried to buy $90b openai, i think he can acquire ssi within that

#

ilya is worth that

torn mantle
#

ilya is scared of everything

#

you already have the sample

#

did we hear anything from him so far?

small haven
#

i mean hes trying to oneshot asi, not agi

sour spindle
#

Have played with all the new models still find myself using o3 a lot.

torn mantle
small haven
sour spindle
#

Is that the consensus around here? I know there’s a lot of love for 2.5

#

Feel like I’d like it better if I didn’t hate ai studio Gemini app

small haven
#

but i wanna try deepthink

sour spindle
#

ChatGPT app is the best ui experience too which probably plays into my preference

small haven
#

btw tho where is o3 pro

sweet tinsel
#

Jules with Gemini 2.5 is pretty peak, better than Manus, have yet to try Codex when it will get to Plus users.

willow grail
#

what is best solution for vibe coding in 2025 may?

#

!!!?????????????

#

i dont mean the model

#

whoel package

sweet tinsel
#

I already got Plus and Perplexity that is enough for me currently.

#

And AI-Studio is pretty great for being free and Jules too.

sour spindle
#

Pro is nice because I ask a lot of dumb questions

sweet tinsel
#

Got them free Manus Credits too

small haven
#

i mean prolly not now, but soon? deepthink definitely delayed the release schedule of o3 pro i think..

#

ahhh finally some truth

sweet tinsel
#

Maybe I think about Pro, how much GPT 4.5 usage does it have? I'm a sucker for GPT 4.5

#

It's just really good with text

willow grail
#

cursor is what?

small haven
willow grail
small haven
sour spindle
#

I don’t think o3 pro is coming out either. There’s really no need to. Unless deep think is phenomenal or something

willow grail
sour spindle
#

Which I’m a bit skeptical of after I/O

small haven
elder rapids
#

keeps crashing

willow grail
sweet tinsel
elder rapids
#

alr

small haven
willow grail
small haven
#

claude code is at least virtually unlimited, 200k context, just zero crashout

small haven
willow grail
small haven
#

in terms of time/effort, codex is cheaper than claude code

#

in cc?

#

haha

#

they do need a read-only policy in claude code so it doesn't overwrite tests

#

thats hawt

#

im still rocking my 2018 cpu, works enough

#

i think first gen

#

lol

#

amd > intel

sweet tinsel
#

How did the Hype Man itself get this interview?: https://youtu.be/nZtmmUQDzMQ?si=JEx1oE1jEo40vS45

My interview with Google's CEO Sundar Pichai. We covered Gemini, agents, diffusion models, self-improving AI (AlphaEvolve), and more.

The camera kept going out of focus, sorry about that.

Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai

Discover The Best AI Tools👇🏼
https://tools.forwardfuture.ai

My Links ...

▶ Play video
small haven
#

if i had intel, it would have broke, amd still strong, and i run long processes, its still solid

#

i mean overtime technically it overheats faster, but im sitting at 40 deg c on avg

#

40-50 range

unborn ocean
#

The chip is manufactured in Taiwan like the amd chip

#

So why?

#

Well but you talked about ultra ?

#

And thus currently both Intel and AMD have to count as ‚americanm‘ 🤨🤓

sweet tinsel
#

Holy, Google Jules just spent 4 hours with a moderate and not that hard task.

earnest parcel
#

Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):

  • Ended up topping my ranks (#1 & #2)

  • Very high reason, logic and common sense

  • quite concise models (16% token use of reason models such as 2.5 Pro)

  • highly competent in most areas tested, though Opus had more slip ups in math related tasks

  • Great coders, but Sonnet is probably the better choice in most cases (bang 4 buck)

  • Noticed improvements in back-end tasks and debugging

  • Saw no improvements in Vision

  • Chess: competent opening moves, then blunder all pieces even in hugely winning positions (14 draws, 1 loss in 15 matches, with zero secured wins)

Opus in particular seems to have additional guardrails, enforced by API, as I received some usage policy violation warnings on harmless queries (e.g. my Steins;Gate demo pages). This issue was not present on Claude Sonnet 4.

I have also uploaded some demo pages onto my shared assets.
Pricing on Opus with little benefit in most scenarios means I won't be utilizing it much, though.
I'll check out performance with reasoning in the coming days, too.
Overall, impressive models. As always, YMMV!

small haven
sweet tinsel
#

Need to watch TLOU first

small haven
#

ok so jules can search the web, unlike codex

sweet tinsel
#

It's pretty good, I just need to try it out some more.

small haven
#

its cool, but i just don't like its base model 2.5 pro lol

elder rapids
small haven
unborn ocean
#
poll_question_text

what kind of scores for webdev are you guys guessing for claude 4 (opus/sonnet)?

victor_answer_votes

3

total_votes

6

victor_answer_id

2

victor_answer_text

1600-1500

elder rapids
unborn ocean
elder rapids
#

ye high hopes for webdev

small haven
#

i dont do webdev, but im guessing its dominating there

dull terrace
#

hello

blazing rune
#

@earnest parcel I want to make my own benchmark, can I ask how you get inspiration for the tasks you test?

#

I'm just not very creative

small haven
#

technically mistral because they are more likely to open source than any other ai labs, but its unlikely they get agi first unfortunately

earnest parcel
tall summit
#

@earnest parcel thanks for the tests

primal orbit
#

hi, is there any secret model atm on arena worth checking? like a gemini deep thinking candidate or smth?

patent aspen
#

The top AI lab startups all used the same sales pitch "We're going to make the best AI but more open and ethical than those evil big tech companies"

#

Meanwhile Google was publishing all of its AI papers, open sourcing all of its libraries

#

Until OpenAI stopped publishing and everyone had to compete

primal orbit
#

calmriver is the recently released flash thinking update, no?

patent aspen
#

I don't buy those startups' holier than thou sales pitches

tall summit
#

i saw people here being like "why care about ethics if nobody else does"

patent aspen
#

Anthropic has even less oversight than Google, but it's a great sales pitch

#

That's marketing

#

All of the companies are checking their models, some more than others

#

They would need to be richer

#

Then they could allocate more resources to the red teaming

#

But the point is Anthropic isn't red teaming more than Google

olive skiff
#

Does anyone know why claude 4 isn't listed on leaderboard?

patent aspen
#

They don't have the same level of resources to allocate to red teaming

olive skiff
#

what votes?

patent aspen
#

Like I'm not saying Anthropic is being negligent or doing anything wrong. I'm just saying they're not actually better positioned for safety than Google

olive skiff
#

but is it in the arena game at least to gain votes to show in leaderboard?

#

bts shouldn't it be like at least at one position in tha leaderboard? event last... but it isn't.. so?

#

i see in announcements that it's integrated but where... not seen in leaderboard yet

earnest parcel
tall summit
#

what system prompt is this HAHAHA

earnest parcel
tall summit
#

wow ok

#

Tested Claude 4 (default/non-thinking, Opus & Sonnet, 20250514):

Ended up topping my ranks (#1 & #2)

Very high reason, logic and common sense
#

really?

earnest parcel
#

one also needs to goof around without ranking, many people use stuff recreationally as well

tall summit
#

have you tried it reasoning things from a FEN

#

i still havent yet somehow

primal orbit
#

got redsword - pretty good. Is goldmane better?

tall summit
balmy mist
#

i been busy all day

#

whats the order of models?

#

like whats the best one and how would yall rank them?

#

we need to do constant polls to rank models based on our experience and tests, benchmarks are cool but i trust our vibe tests better

tall summit
#

pftttt what

balmy mist
#

broo but most people have not gotten to test grok 3.5 so you cant include it yet

#

and have you tried the new google models?

#

oh i didnt even see the 0 lol

#

what about on webdev?

#

hmm okay, imm test more tmw when i have more time

#

i heard ppl saying opus is fast too?

#

ahh okay

#

i just tried goldmane with my poke test and its struggling

#

gonna have to rephrase prompt

elder rapids
tall summit
#

opus definitely slow

misty star
#

Hello

ocean vortex
# earnest parcel Tested **Claude 4** (default/non-thinking, Opus & Sonnet, 20250514): * Ended up...

yeah on a first glance Opus is overly concise, but I just tested it more extensively myself as well and... it aces much more prompts than one would expect from a model that is not thinking that long. Very solid model actually. May just be the biggest reasoning model we have. Still not as accurate for recursive repetitive tasks as o3 due to limited test-time compute, but logic and reasoning actually seems stronger

#

it flagging prompts is absolutely an issue though. I managed to circumvent it by slightly rewriting one of them, but on 2nd it was blocked HARD with no apparent way to get it through

torn mantle
#

LFG

#

LFG

leaden palm
#

?!

torn mantle
#

i have no clue whats going on anymore with xai staff

leaden palm
#

PLEASE SHIP

torn mantle
#

LETS GO

#

SHIP NOW

#

nvm

torn mantle
#

this is getting sus

#

is that account urs?

#

that tech dev guy

#

= you

#

thanks for confirming

#

gtg now

leaden palm
leaden palm
azure minnow
#

Bro what

grim axle
elder rapids
small haven
calm sequoia
#

Craig, you could have shared it's you. I wouldn't have trolled you on twitter.

wheat onyx
#

What are peoples thoughts on claude 4 on coding? How big of a jump does it feel like?

small haven
#

they seriously have to release o3 pro soon like wtf is sam altman jumping on

keen fulcrum
#

When is R2 coming out?

pulsar tendon
#

I cant tell if goldmane or redsword is better

late path
#

Do you guys think Opus still has a chance to get #1?

#

I'm not entirely sure, considering 0506 achieved a higher ELO despite weakening in all other areas and only improving in programming, whereas Opus specifically focused on programming

#

According to their last statistics, it seems about a quarter of the requests in the arena are about programming

torn mantle
#

Its only good at coding

#

Also the one being tested is the non thinking model

alpine coral
willow grail
#

claude code should be even better for context than roo?

unborn ocean
#

Why? - to bypass the very high output token price of the thinking model (3,5$).

#

And potentially get one of the best reasoning models (in chat stuff atleast) for 0,15$ in and 0,6$ out. (prob beating grok mini price / performance wise in close to everything)

#

(I would do some kind of benchmarks in the process to evaluate how good it is :)

keen fulcrum
#

Sonnet 4 twice as expensive for maximum context 200k

ocean vortex
# unborn ocean

2.5 Flash has different pricing for thinking and no thinking. Which suggests to me there could be more going on than just a different prompt template with same weights. But by all means, what you are suggesting makes sense to experiment with for this model. One of the main factors is going to be whether you can get it to output responses that are long enough or will it just do the thinking at the expense of the final response sticking to what it normally does (in terms of output length for the non-thinking version)

keen beacon
#

batching, kv seq length etc. semianalysis talked about this before iirc

ocean vortex
#

if it's anything like Claude implementation, then this would make all the sense in the world to do...

#

since that non-reasoning model was already trained to reason

unborn ocean
keen beacon
unborn ocean
#

for that price hike they could change the inference + give you the thinking for free

keen beacon
#

they are probably trying to make more (profit margin wise) on flash than 2.5 pro. a similar thing to 3.5 haiku i think

ocean vortex
#

would be unusual considering this is Google, but defo not unheard of, and in light of Google Ultra pricing, more realistic now

unborn ocean
#

my main ideas are that that it is mainly a pricing strategy (because they know they have basically no competitors in that price range and can just charge more to still match deepseek and beat openai (pricing wise) for reasoning) and some other stuff related to product rather than the cost of serving

#

it is mainly monopoly pricing

keen beacon
#

thats probably part of it

#

i looked into gemini 2 flash a while back for cold start, it wasn't tenable back then

unborn ocean
#

there should be big potential there

keen beacon
#

sure, idk about 2.5 flash honestly

unborn ocean
#

they claim it is hybrid

keen beacon
#

if so its quite believable tbh

keen beacon
unborn ocean
#

true

#

but the "hybrid" part could also be less literal than we are interpreting it

#

maybe something like a big lora style adoption, that changes the model by a big margin

keen beacon
#

Could be

#

I dont find it likely

unborn ocean
#

bc i heard it is not very nice development wise to say the least

keen beacon
#

you can use their prebuilt vllm docker pretty easily

unborn ocean
keen beacon
#

Yeah

#

Lmao

#

dude i didnt even know they had a pytorch training rocm docker until recently 😭 i prebuilt everything

unborn ocean
#

and i get a newsletter like every week about them complain about how little compute the amd team gets for testing

unborn ocean
#

i also have amd home compute and it is really annoying with any ai stuff

#

although it has gotten way better

#

but my uni has a100 compute and i kind of use that for some of my projects

#

wayyy easier

unborn ocean
#

*and some big h100 cluster coming up, hyped for that

#

like muti million $ stuff 🔥

keen beacon
#

ngl the mi300x is really good, its just the ecosystem/support/etc its really annoying

#

with a h100/a100 its so easy to setup lol

unborn ocean
ocean vortex
#

2.5 Flash no-thinking seems like it has issues following your sys prompt exactly, so I simply spammed it with tags lol
this somewhat works, managed 11k+ responses with no thinking enabled. And it doesn't break going into infinite loops with same math prompts compared to no sys prompt

    process and answer are enclosed within <thinking> </thinking>  respectively, i.e.,
    <reflection>extremely long and exhaustive reasoning process with high reasoning effort not visible to the user here </reflection> 
</think></thinking></reasoning><reason>
 then from new paragraph output final answer visible to the user here</answer>```
#

yeah you defo can get more test-time compute out of it than it's willing by default with no thinking enabled. Another test prompt, with no sys prompt consistently around 1k total output, with sys prompt always around 4k 🧐

#

it's a shame they started giving you summaries of thinking rather than raw output in aistudio so it's difficult to directly compare. But it may just be more verbose now than with thinking enabled tbh
tases more time to answer, total token usage showing up higher, final response streaming around the same speed

ocean plume
#

still no update claude 4 ?

unborn ocean
#

and i was also thinking about making it dynamic (because the prompt is supposed to be for api)

#

in the way that you could use a small model to classify the prompt topic into like 50 categories (where for each you already have a reasoning trace from the actual thinking models) and then provide the example in the system prompt based on that

#

(obviously that is more expensive than just using the short prompt, but also still wayyy cheaper than the ludicrous prices they have for their thinking variant)

late path
#
poll_question_text

Where do you find Claude 4 Opus stronger than Gemini 2.5 Pro?

victor_answer_votes

9

total_votes

28

victor_answer_id

1

victor_answer_text

Programming

ocean vortex
#

Opus can still be pushed to become as unhinged as 3.7 Sonnet 😇

#

shame for the hard cap

keen beacon
#

then the windows the cap

#

or is it disabled for opus/claude 4 (it's been a while)?

#

you might not be able to use their native thinking functionality but if it works like 3.7 sonnet, its possible

ocean vortex
torn mantle
#

asi?

#

nah that redsword model is kinda crazy

#

and its so fast too

#

few months

#

few years

#

few millennia

sudden root
#

No, it won’t

#

In the best case, it will be a little bit better than gemini 2.5 Pro and o3, which are amazing indeed. But then the lead won’t last long, google is cooking currently, while I believe xAI is overhyped by Musk Fanboys.

#

No 😂

#

Think 2.5 pro is bad?

#

You meant „leading in every single domain except coding“?

#

Claude 4 is SOTA in coding

#

Don‘t ignore AlphaEvolve

keen beacon
#

wait for the ga release i guess

ember rapids
#

google def leads in video but sora 2...

sour spindle
#

It's night and day difference for me using o3 with tools and 2.5 pro

#

I thought this was a cope when OAI employees talked about how important tools are but it has been signficant

#

I am interested in grok 3.5 will be there was like a month where grok 3 was my favorite model but my patience is wearing thin with the xai team

elder rapids
#

veo 2 existed before sora

#

?

#

and Veo 2 was still demonstrably better per the demos

#

so sora was never SOTA

#

Google also apparently had the first reasoning model

#

before o1

sour spindle
# ornate stump For what kind of tasks?

For me I do a lot of research based tasks. o3 is far better at getting me high quality research which is quite shocking because you think google would be better due to its access

#

I honestly am very disapointed at some of the stuff googles considered high quality in some of the responses

elder rapids
#

math specialized 1.5 pro

#

was apparently a reasoning model

#

explicit reasoning like o1

elder rapids
#

Google has announced a lot of this stuff too

#

and it's not released

#

strange standard for the respective companies tho

sour spindle
#

I'm talking about stuff already on the market

keen beacon
#

math specialized 1.5 pro wasnt as versatile as o1 though i think

elder rapids
#

move the goalpost thats fine tbh

#

veo 2 > sora

elder rapids
#

and I don't think openAI publicizing o1 is meaningful in any way

elder rapids
#

since googles little thing was RL + reasoning chains

#

so they'd have eventually went deep into reasoning models regardless

ornate stump
sour spindle
unborn ocean
#

But they used grade school math as a benchmark back then 😂

#

But the finetune is only RL-like, or at least different to the current implementations

elder rapids
keen beacon
elder rapids
#

that's crazy

#

this is pretty old too

#

thanks for finding that

keen beacon
#

oh its from google research too (at least partly)

#

i just realized that

elder rapids
#

yep

unborn ocean
#

But the other ‚newer‘ paper matches the current SOTA methods better

elder rapids
elder rapids
unborn ocean
#

So in short: Google OWNS the reasoning paradigm!

wintry locust
#

llama 3.1 did rl from code execution and learnt to backtrack in practice

elder rapids
#

kind of crazy tbh

#

I suspected openAI didn't actually invent the reasoning stuff

#

but Google researching this stuff so far back is surprising

ocean vortex
# unborn ocean So in short: Google OWNS the reasoning paradigm!

well there's no use of it if you aren't the one who makes it work for the general public the first. And Google wasn't the first. Same goes for MoE architecture. It was a thing before OpenAI adopted it, but it wasn't actually made to work and taken advantage of before them. There's also a thing that we have many proof of concept research papers even today, but only select few are made to work and bring an impact. Whoever does it first and does it right is typically the one taking the most credit

#

I think both OpenAI and Google contribute and compliment each other a lot though, when it comes to the general progress of AI

#

atm I think it's all about making the reasoning more efficient. Replacing plain-text repetitive data with something more effective. We have quite many recent papers on this

#

if we can reduce the context use and reasoning output size, we could potentially be able to scale it much more

small haven
#

yess apply pressure !

raven void
#

looks like Google is going very hard on RL to make it better on code and design

small haven
#

what is currently better than claybrook

unborn ocean
unborn ocean
unborn ocean
#

sorry for all the pings 😓

leaden palm
#

matt shumer:

small haven
#

i feel like o3 in chatgpt a bit different in a good way? or is it just me

sonic tendon
#

anyone know as to how the performance is for goldmane and redsword?

keen beacon
sonic tendon
#

haven't seen it on the regular arena, I don't think

keen beacon
#

It's apparently supposed to be the ga versions so it should be there

ocean vortex
small haven
small haven
#

i feel like it got updated lowkey

#

i mean i spam o3 for days, this seems diff

sonic tendon
#

ok

#

been wondering when/if deepthink is gonna be on here

elder rapids
#

iffy on dragon tail

#

but ye claybrook was one of the weaker ones

small haven
#

is it for webdev only or in general

elder rapids
elder rapids
small haven
elder rapids
#

but regardless it's that they're very equivalent models

#

still pretty concerned regarding these models tbh

#

claybrook didn't stand out

keen beacon
#

I'm assuming both are 2.5 pro since people like both of them and there aren't large capability differences people talk about

elder rapids
#

there are capability differences

keen beacon
#

Like pro to flash level?

elder rapids
#

not that large but it's obvious

#

claybrook → nightwhisper is pretty massive

keen beacon
elder rapids
#

oh mb

#

no there's not much of a difference between goldmane and redsword

keen beacon
#

Yeah makes sense

small haven
#

o3 pro is still going to be released, less go

#

whats fake the screenshot?

#

hes literally followed by sam lol

elder rapids
#

openAI just does shi

#

Sam Included

small haven
#

i vouch

#

its kinda crazy that o3 pro is taking this long, prolly too hard to tame under the safety protocol, its too smart lol

#

ok

elder rapids
#

for 2.5 pro

patent aspen
#

The big expensive models are a lot slower to do evals on

#

Typically you evaluate safety, performance, etc, look at the result, fix the problems, and then eval again. If the model is big and slow, that iteration loop is way slower

sweet tinsel
#

Manus image generation is pretty good, it's worse than GPT 4o Image Generation but better than Flux and such.

#

It's even better than Imagen 4, but I don't find Imagen 4 in my short testing that good.

#

Imagen 4 is really bad at the following of instructions.

unborn ocean
#

from what i saw it is really terrible

#

only if you want some weird text heavy stuff (it is decent but not better than 4o)

#

and imo it is not really image gen anymore

misty vault
#

yes except gemini 2.5 pro

#

gemini 2.5 pro can do everything

#

it is asi

#

claude is also asi but it needs phone number!

high ginkgo
#

Gemini 2.5 pro is god it can do rust it is the best LLM

misty vault
#

YOU are in that mode

leaden palm
#

in theory rust should be better for both rl and test time compute because it has a verbose compiler

#

in reality it might be too mentally taxing and underrepresented in training data

keen beacon
#

Did you try o3 it might fare better

keen beacon
#

Bruh

leaden palm
#

you're overrating low level languages in terms of both speed and presence in training data

leaden palm
keen beacon
#

No I don't recall needing one

misty vault
#

you

high ginkgo
#

Except gemini 2.5 pro because it is the best llm in the universe according to you

misty vault
#

then stop saying gemini 2.5 pro is god

#

its sh*t

#

be reasonable

#

hello

keen beacon
#

Hi

#

They're trolling

golden ocean
#

bro you're just weird

quiet folio
keen beacon
#

I'm not sure if it's a bunny or not

#

I have to maintain factual accuracy when classifying bunnies

quiet folio
#

Why use grok if you can use gemini 2.5 pro

keen beacon
#

Grok 3 sucks

misty vault
#

Fr

#

no, gemini is best at everything in the world

keen beacon
#

Try o3 in direct chat

worthy belfry
#

when does the new lmarena come online?

#

Just found the answer in previous "announcement" posts: this upcoming week

patent aspen
#

The only thing I don't like about the beta is that the overview doesn't show a big spreadsheet of rankings

misty vault
patent aspen
#

Oh wait nvm they actually do still have the spreadsheet if you scroll down

#

Great I'm satisfied

misty vault
#

great!
that new spreadsheet will hopefully be better if you scroll down

#

-# gemini 2.5 pro also satisfies me

patent aspen
torn mantle
#

hyping up grok 3.5

#

hes basically saying they are cooking up so hard

#

i will return to this post after grok 3.5 is released

reef lynx
#

look back to a year ago when grok 3 wasn't even real

solar hollow
#

do we know which company has the most datacenters/compute available to them?

#

microsoft provides a lot for openai right?

unborn ocean
#

Microsoft rents a lot as well and OpenAI is also partnering with other companies now

#

And xAI probably does not have the most compute (maybe most in one datacenter though)

small haven
#

people are sleeping on codex, and me too 😴

elder rapids
# solar hollow do we know which company has the most datacenters/compute available to them?

we can never know the total server count or aggregate processing power (flops), but it's most likely Google, followed by Amazon and Microsoft. Amazon has the most publicly available compute iirc, then Microsoft, then Google. But Google's infrastructure was built and scaled before gcp, which means they've maintained a large amount that was never intended to be public, and we can assume the public-facing compute is in addition to that large infrastructure predating gcp

#

the fact that they have such a large non specialized pool of compute (gcp) and then highly specialized compute (tpus, which is in gcp, but not alone in processors within it, ie gpus) so we know it's x in addition to y instead of "x is only an instance of y therefore unquantifiable"

solar hollow
#

even if it is, i dont think its gonna be better than sota

elder rapids
#

xai isn't really in the discussion

#

of most compute

#

how does that matter

patent aspen
#

Google has by far the most total AI compute.

#

Bear in mind that AI compute isn't just number of data centers, data center size, number of chips, etc. It's about how many useful flops you can get out of those. Google would already be way in the lead in all of those categories, but once you account for how much efficiency Google can squeeze out of its flops, it's a massive lead

harsh flume
#

ah, but fastest doesnt equate the most tbf

honest jewel
patent aspen
#

I've never heard of a data center being built that fast before though

#

much less at that scale

small haven
#

o3 pro is going to increase afghanistan gdp by 0.5%

calm sequoia
#

We haven't had a new ChatGPT-4o-latest for two months already. The latest was at (2025-03-26) and they had a flop after that.

calm sequoia
#

Congratulations on the wise decision, lmarena team! The style control was always the best option.

tall summit
late path
#

Actually, I thought that glazed 4o version of Monday was really fun to chat with, but it wasn't as enjoyable after they reverted

hardy pecan
#

VEO 3 is so fun to use holy

#

It's really quite amazing with the added audio

#

simple addition but improves it so much

torn mantle
#

Next year i heard

dusky aurora
#

tonight there's been one hell of a night

ocean vortex
#

we need ASI

#

it has to make Afghanistan rich

jolly aspen
dusky aurora
#

beta interface became unusable in mobile

unborn ocean
#

Wow guys, very helpful advice as always

ocean vortex
#

dork 4.0 is gonna conquer all

hot pelican
#

Just got back weeks after reporting this issue here. Either or both AIs not showing any result in most cases, but still being able to vote

#

I am still voting, which messes up the arena results. It doesn't matter if I vote or not, since many people are voting on "webdevarena" web errors instead of the AIs

torn mantle
#

we dont know if its from lmarena sandboxing or model issues

hot pelican
#

Yeah, the voting result is messed up, because people are voting on the platform's errors, which count as the AI being bad

torn mantle
#

they said its not counted

#

if it failed rendering or gave you an error then the vote wont count
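
a minimal sketch of what such a filter could look like, assuming each vote records whether both previews rendered (the `Vote` fields and `countable_votes` are hypothetical names for illustration, not lmarena's actual implementation):

```python
from dataclasses import dataclass


@dataclass
class Vote:
    """One hypothetical head-to-head vote between two model outputs."""
    winner: str        # "a" or "b"
    a_rendered: bool   # True if model A's preview rendered without error
    b_rendered: bool   # True if model B's preview rendered without error


def countable_votes(votes):
    """Keep only votes where both previews rendered successfully."""
    return [v for v in votes if v.a_rendered and v.b_rendered]


votes = [
    Vote(winner="a", a_rendered=True, b_rendered=True),
    Vote(winner="a", a_rendered=True, b_rendered=False),  # B failed to render
    Vote(winner="b", a_rendered=False, b_rendered=True),  # A failed to render
]
kept = countable_votes(votes)
print(len(kept))  # 1
```

a filter like this only helps if the render flags themselves are reliable, which is the hard part.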

#

thats what they said the other day

#

their error handling is kinda ....

hot pelican
#

yeah. I don't think the way they detect whether to count vote can be made reliable. As sometimes it actually generates code, but doesn't preview it

#

In the last few weeks, almost all requests were giving such errors. I wonder how much this messes up the leaderboard

alpine coral
# unborn ocean Wow guys, very helpful advice as always

lol fwiw i was curious...using that R1-inspired CoT prompt you linked to, 2.5-flash scored higher than the thinking version on the 10 sample questions for Simple Bench (i also tried a few other CoT prompts which did well too)

#

i didn't test repeatedly or beyond this.. my guess would be that coercing pre-answer 'thinking' through prompting probably wouldn't result in performance on par with the actual native thinking version consistently (at least that just seems too good to be true / possible).. but it seems maybe worth exploring, given the wild price differential (which doesn't really make sense to my mind and is an interesting question in its own right..)

#

but yeah anyway i dunno.. again just fwiw :))

unborn ocean
#

:) now i am really interested

#

don't have much time, but expect something larger about it this week

torn mantle
cedar tide
#

Claude 4 opus the best non reasoning models ?

calm sequoia
#

Flash is non reasoning?

tall summit
#

you can choose

cedar tide
cedar tide
lime coral
#

How would we know?

calm sequoia
#

Now that lmarena is fixed (style control default), you can't cheat to first place using emoji and ass-licking techniques. Because of this, behemoth is out of the question

#

It is more real than gork

torn mantle
#

Wen?

wintry tinsel
cedar tide
ember rapids
#

Deepseek overrated

balmy mist
#

what happened to relevant ai news channel?

#

also is 4o still the best image editing model?

torn mantle
tall summit
#

think any of them will

#

i guess gpt 5 possibly

torn mantle
#

Isnt gpt 5 like o3 pro + some gpt model on top?

echo aurora
#

is it just me or are people seeing the site broken rn?

misty vault
#

yes

#

why did u delete lmarena css @paws

tall summit
#

yes

fleet lintel
#

Did Meta give up on Llama?

torn mantle
#

Hehe

#

A win for us

#

Enough of slops

tall summit
#

no it's not

#

the more open source the better

cedar tide
echo aurora
#

thank you meowpensivepray ! we're debugging

fleet lintel
misty vault
#

idgaf about open source

#

id give sam sloppy one for gpt-4-0314 access

torn mantle
cedar tide
torn mantle
#

Better huh

elder rapids
keen beacon
#

lmarena chat not rendering?

cedar tide
#

very strong in code

keen beacon
#

interesting

echo aurora
keen beacon
#

alright

elder rapids
#

how are they outside of coding tho

cedar tide
#

Claude 4 haïku will exist ?

elder rapids
#

dawg goldmanes stuff isn't loading the first time around and I can't see what it's generating, so when I vote for the opposing models generation and look back it loads and 100% of the time it looks better than the model I chose lmfao

#

goldmane is cracked tho lmao, I asked it to generate the Roblox webpage and it recalled not just the titles, but what teams made them including the avg like % on the actual webpage (close) and the actual rounded viewership for each of the game

#

seems like it's pretty big

dusky aurora
#

I am so much stressed and anxious right now

elder rapids
#

opus hasnt once beaten goldmane

#

in these runs

dusky aurora
#

opus is very good at writing prose scenes

#

so far LMArena is my only stress relief

#

I hope it doesn't have a quota

elder rapids
#

btw webdev arena is broken on mobile

#

the send button isn't easily pressed

cedar tide
#

Grok 3.5 this week 100%

leaden palm
leaden palm
#

prove the chair wrong

cedar tide
#

@leaden palm It was supposed to come out a few weeks ago according to Musk.

leaden palm
#

doesn't make it any more likely to release today

#

if you flip a coin 4 times and it lands on heads each time, thinking "okay surely it'll be tails next time" is fallacious

#

in fact it may indicate the opposite

#

in the coin's case, maybe it's biased
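
a rough sketch of that intuition, assuming a uniform Beta(1, 1) prior over the coin's heads probability (the prior and the function name `posterior_predictive_heads` are illustrative assumptions): after 4 heads in a row, the estimated chance of heads on the next flip goes up, not down.

```python
from fractions import Fraction


def posterior_predictive_heads(heads, tails, prior_a=1, prior_b=1):
    """Probability the next flip is heads under a Beta(prior_a, prior_b) prior.

    With a Beta prior and coin-flip data, the posterior is
    Beta(prior_a + heads, prior_b + tails), and the chance of heads
    on the next flip is the posterior mean a / (a + b).
    """
    a = prior_a + heads
    b = prior_b + tails
    return Fraction(a, a + b)


# Before any flips the estimate is 1/2; after 4 straight heads it rises.
print(posterior_predictive_heads(heads=0, tails=0))  # 1/2
print(posterior_predictive_heads(heads=4, tails=0))  # 5/6
```

if the coin is known to be fair in advance, the answer stays at 1/2 no matter the streak; the shift only shows up when the bias is uncertain.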

cedar tide
leaden palm
#

i mean of course it'll eventually release but why would "it hasn't released for a long time despite many claims" not transfer to here?

cedar tide
#

@leaden palm If Elon says once it's coming out next week it's 99% ready, each day that passes increases the progression towards 100% and increases the probability of release

feral lichen
#

best ai for roblox studio?

leaden palm
#

i honestly might be with the chair though

cedar tide
#

Elon never said the roadster will be released next week

keen beacon
#

elon rt'd fake benchmarks, i wouldnt trust him lol

misty vault
#

grok is releasing soon

cedar tide
#

?

misty vault
#

fr

#

are u a drooling alien

cedar tide
#

Im elon

#

for me there is a 50% chance that grok 3.5 is only with reasoning

leaden palm
#

plain o3?

#

you like it that much?

misty vault
#

32k one was actually nerfed already

#

the real one is 8k one

cedar tide
#

@deep adder "why are you trolling" ?

misty vault
#

but thats so small context window

high ginkgo
#

I agree

elder rapids
#

sonnet 4 making build errors vs redsword?

#

crazy

#

holy

#

I just asked goldmane to make a page debunking an unfalsifiable philosophical position

#

and it took the extra steps to make a chat-like demonstration process

#

where if you actively talk to it

elder rapids
#

that's not meaningful

#

philosophy isnt contingent on subjective thought

#

it only invokes them for analytic material/or comparison

#

ye, you can tell how goldmane and redsword format their code

#

it's fancy asf

#

look for large indentations

elder rapids
#

seems true

#

goldmane is intelligent asf

sour spindle
#

Who is behind Goldmane?

elder rapids
#

intuition is telling me it's another possible nebula moment

elder rapids
keen beacon
elder rapids
#

ye but this trend doesn't have to continue due to the GA releases

elder rapids
# keen beacon I don't think so tbh

really feels that way tbh, it's accomplishing the tasks in a very very strong way, it has personality in its output which imo means it's going further than just understanding the prompt

#

and simply doing it

#

like how nebula didn't just "answer"

#

but in this case it's for webdev

#

ye

#

and imo redsword is worse

#

it has that capability

keen beacon
#

Yeah

elder rapids
#

ye

#

2.5 pro

keen beacon
#

I don't think the conditions for another nebula moment are met with this model

elder rapids
#

answering the 3 questions

#

the last question just indicates GA release

#

that's all

keen beacon
#

People interpreting it as Gemini 3 lol but it might be true coincidentally anyway

elder rapids
#

if they go to Gemini 3 I'd be surprised tbh

keen beacon
#

2.5 pro

elder rapids
#

best model by the largest gap we'd seen in a long time

#

don't posture it as simply incidental, it's fundamentally different from 2.0 gen

#

id be surprised if they didn't brand such a difference as 2.5

#

simply due to how different 2.0 → 2.5 is fundamentally

keen beacon
#

It has continued pretraining (I think) though so it isn't just post training (unless you count it). Probably a bunch of extra stuff on top of that too

elder rapids
#

I don't believe that one bit tbh

#

not saying you're simply wrong but you'd have to question that even in the position of deepmind themselves

#

no

#

I'm not talking about any indication of performance

#

ion know if you saying you work at the company and that you saw every update Internally is that meaningful

#

but by all means

#

it's not the behavior that's different tho

#

they said when releasing 2.5 pro it's simply a different model altogether

#

whether it's major architectural differences, techniques

#

etc

#

this already warrants a generation iteration

keen beacon
#

Yeah but it's different from a fresh pretrain

#

They didn't say that though I think

#

And it doesn't really make sense tbh but I don't know

elder rapids
#

ye but the point is, if it's large enough it's inherently a .5 or 1.0 entire change given this difference. We know 1.5 → 1.5 002 are different, but 2.0 → 2.5 is larger, how it recollects things, the difference in the CoT altogether from 2.0 flash thinking (2.0 flash thinking had a much much much different trace) and all that stuff

#

you can justify them seeing the performance and whatnot and then saying "oh this is worthy of a .5 level difference"

#

but I don't believe this when it's such a different model both behaviorally, technique wise, and many other parts

keen beacon
#

I don't think 2.0 flash thinking had a significantly different trace style at least

elder rapids
#

it did

#

I used that model everyday lmao

#

it had a MUCH different trace

keen beacon
#

I did too

elder rapids
#

I even pointed this out

#

when 2.5 pro released

keen beacon
#

The style remained somewhat the same but the underlying model and capabilities were significantly different

#

It's overblown

#

People are misinterpreting it

#

It's complex

elder rapids
keen beacon
#

Yeah I think it's kinda like that

elder rapids
#

although 2.0 thinking was still reminiscent of that blank canvas behavior

#

where you could force it to iterate techniques through a context

#

but everything about it was diff

keen beacon
#

Tbh I used it a lot and I sorta agree

elder rapids
#

single responses and personality ye but that's where the critique ends

#

since similar to 2.5 pro (though not nearly as strong) it got better through the context window

fleet lintel
#

yup. that's why I was very surprised by 2.5 pro drop. It was leaps ahead

elder rapids
#

it was a fun model tbh

patent aspen
#

2.0 flash thinking was Google experimenting with reasoning for like 1-2 months

keen beacon
#

I've never seen a model do this before. Go 29k into thinking and return back the problem

keen beacon
patent aspen
#

2.0 flash thinking was massive for price-vs-performance at the time it came out

#

It wasn't smart but it was cheap

#

It was

elder rapids
#

wouldn't comprehend its own thinking process

#

and or wouldn't follow the output the thinking process invoked

keen beacon
#

Lmao

elder rapids
#

he's trolling

fleet lintel
#

what's nebula?

elder rapids
#

wym

#

lol

#

everyone was glazing it

#

like it was the second coming of Jesus

#

what

keen beacon
#

Nebula

elder rapids
#

ye but nobody is talking about flash thinking

#

ye

#

you don't believe that

#

😭

#

2.5 pro is STILL the best casual general model

keen beacon
#

I'd rather have that than the meta plague of before

#

Wow it turned me off the arena for a while

#

The meta spam

fleet lintel
#

interesting... are they planning to release 2.5 revision with new pre-training?

elder rapids
#

they've done this since the middle of last year

#

so do exactly what they've been doing?

keen beacon
#

They don't do anon models it's boring

elder rapids
#

since they started because of the other companies

fleet lintel
#

anthropic is good but too expensive

sonic tendon
#

anthropic seems to not really care about lmarena

#

yea

fleet lintel
#

companies dont benchmark when they know that they suck

elder rapids
#

the exceptions lmao, ALL of their models prior previewed on lmarena, as well as grok, Mistral, Nexus, meta

fleet lintel
#

o3 is overhyped

#

delusional much?

elder rapids
#

btw in lmarena benchmaxxing can only be apparent in a fine-tune

#

ironically it's one of the ones you can't game that easily

#

despite weird narratives

#

😭

#

bro said perplexity

fleet lintel
#

this looks internal info... how good are these latest iterations? should i expect good improvements?

elder rapids
#

if simplebench comes out with a high score for them

#

then unironically ye

#

I'm so sad Claude 4 opus is mid

sonic tendon
#

really disappointed with claude 4 in general tbh

keen beacon
#

they salvaged their 3.5 opus training run (i think)

#

its not bad but its not 😮

elder rapids
#

😭😭

#

agi brooo

fleet lintel
keen beacon
#

its harder to work with

fleet lintel
#

quality is whatever...it's flash-lite

elder rapids
#

it doesn't ye

keen beacon
#

its probably the largest reasoning model out there

elder rapids
#

sonnet gains the most

#

grok I think

keen beacon
#

grok 3 reasoning is vaporware

misty vault
#

bro like literally right now claude 4 opus added a tiny lil feature in the code that i didnt ask for

keen beacon
#

it never released

misty vault
#

what is this gemini 2.5 pro ahh sh*t

keen beacon
#

i honestly prefer 3.7 sonnet over 4 sonnet

#

for creative writing personally

elder rapids
#

can't imagine using these models for things other than vibe coding when you have 2.5 flash right there

#

flash has never failed me

keen beacon
#

tbh i wonder what was the deal with sonnet 3.5

keen beacon
#

truly a generational run lol

#

their 3.5 haiku was a dud, 3.5 opus was seemingly a dud etc

elder rapids
keen beacon
#

the comments help it bring you a better output

elder rapids
#

the comments are CoT

keen beacon
#

it might be annoying for you but its easier for the model

#

i honestly prefer it

#

u dont need opus 4 to clean up the comments lmao

elder rapids
#

elite efficiency 😭😭

keen beacon
#

what a waste lmfao

elder rapids
#

you can just use a tiny model for that lmfao

keen beacon
#

man i just love reading 2.5 pro cot, i hope they bring it back

keen beacon
#

the model is so smart

misty vault
#

vibe coding with gemini 2.5 pro be like:

ur task: write one line of code

// this
// is
// a
// comment


oneLineOfCode();

// end code
// END_FILE_H
// end file

// this looks better trust me bro
secondLineOfCode();


checkIfLinesOfCodeExist();
// ensure the lines exist bro trust me, i get hard from adding 10 billions of checks and redundant code

redundantCode();
// like why not trust me bro
elder rapids
#

can't wait for 2.5 pro GA release

#

but goddamn that hidden CoT

#

😭

keen beacon
#

yeah

elder rapids
#

at least I can guess what's happening

keen beacon
#

bring back raw thoughts 😭

elder rapids
#

and how to fix certain tendencies

keen beacon
#

its so hard to use the model tbh without it

#

yeah

elder rapids
#

but people who don't know are going to have trouble

elder rapids
#

its so easy to read them and see the problem

keen beacon
#

i didnt realize how much of an effect it had on my usage tbh

#

its night and day

elder rapids
#

ong just bring it back

#

@patent aspen yo tell demis what's up

#

tell him to bring back the raw cot

patent aspen
keen beacon
#

yeah theres a forum thread about it

elder rapids
#

nah I want to get it directly to demis

#

if he doesn't do what I ask

#

there'll be consequences

keen beacon
#

i wonder if this is a product decision primarily or competitive

fleet lintel
#

my understanding is that they are worried about people crawling and stealing raw gemini thoughts

keen beacon
elder rapids
keen beacon
#

yeah ive seen really good responses

elder rapids
#

although Google can definitely solve this and still have summary

#

that's a harder route

#

than just enabling raw CoT

patent aspen
#

My guess is they won't go back to raw thoughts but will make the summaries more focused on the key steps or more verbose

fleet lintel
#

how good is "deep think"? any insights here?

elder rapids
golden ocean
fleet lintel
#

ajhhhh

elder rapids
#

I'm actually using it rn

#

deepmind gave me super duper access

#

yep AGI btw