#general


elder rapids
#

no as in

dull terrace
#

Craig answer the question

elder rapids
#

reiterate it

dull terrace
#

please

#

We do not have all day

#

I would like to see your benchmarks

#

to please

#

No no

#

you stated it

#

Tell us why you said it

#

from what perspective

#

If one model is benchmaxxed

#

why wouldn't the others be??

#

Answer that now?

#

You think google is the only one that benchmaxxed?

#

Ok good we're getting somewhere jimmy

#

That is contradictory to your previous statement.

#

You said this

#

No worry my dear sir

#

It's ok little jimmy take your time

#

Thank you jimmy

#

Now I would like for you to explain one thing

#

for me.

#

How is gpt4o still up while being generations away from the other llms.

#

Specifically in the oai line of models

misty vault
dull terrace
#

Thx i take free sponsors

#

Ok

#

Good point

#

A counter argument

#

even if they are constantly updated

#

why would not the newer model be better,

#

Either way

#

A new gen should be better than an old gen

#

Whats going on rn

#

@misty vault A playstation 2

#

is better than a playstation 1

#

right

#

chatgpt 4o

#

gpt 4.1.

#

Answer that then

#

LMArena measures model capability and human preference
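
To make the "human preference" part concrete: arena-style leaderboards rank models from pairwise votes. Below is a minimal sketch of an Elo-style update, not the leaderboard's actual method, which may differ. The starting rating of 1000, the K-factor of 32, the model names, and the votes are all illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise vote in place; the winner gains what the loser drops."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Because the update only sees who won each vote, an older checkpoint that voters happen to like can stay high on the board regardless of its generation.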

#

No worry I can read

misty vault
dull terrace
#

Let's use your logic for a second

#

If gpt 4o is the newest one

#

How come previous versions of it still are in the top ten

torn mantle
#

i take that back

dull terrace
#

or top 15

torn mantle
#

maybe opus 4

#

is the real deal

dull terrace
#

Does that make sense

torn mantle
#

actually

#

let me finish craig

#

just maybe

dull terrace
#

@torn mantle What happened to change your mind

torn mantle
dull terrace
#

.....

#

So gpt2 is better than gpt4o

#

It's the point of a new generation

#

Whats going on rn

echo aurora
#

loving the debate but remember:

✅ Treat others with Respect.

dull terrace
#

Because bias and potentially fishy stuff

#

Either way

#

it wouldnt make sense

misty vault
#

I'm kidding @echo aurora I apologize, I love you (in a friendly way)

dull terrace
#

You did three times lol

#

but ok

mossy drum
#

New model in Beta Text2Image: gemini-2.0-flash-preview-image-generation

dull terrace
#

You just lost.....

#

Holy cope

#

But good debate

#

maybe in 2 years or so you do really good i see the potential in you gl

misty vault
#

Does that mean he is agi

dull terrace
#

and consciousness are 2 different things.

golden ocean
#

gpt-4 is asi and chatgpt 4o is artificial stupidity but gpt-4 is older model

dull terrace
#

@misty vault So who won in your opinion?

misty vault
#

I will ask sydney

tropic nimbus
#

0 substance over the last 50 messages whats happening here

#

are there any new anonymous models lately? anything interesting?

unborn ocean
#

this is claude's assessment (based on the discussion) of who won:
Craig Federighi's estimated cognitive profile:
[...]
Estimated range: 115-125 IQ - Above average to superior range, with strong logical reasoning and debate skills.
Odin's estimated cognitive profile:
[...]
Estimated range: 95-105 IQ - Average range, with notable weaknesses in logical reasoning and debate methodology.
Key difference: Craig demonstrates systematic thinking and logical consistency, while Odin relies more on assertion and deflection when challenged on specifics.

#

@deep adder new claude thinks you are highly intelligent 🤣

wintry tinsel
#

Sim theory has free opus today incredible

#

I’m spamming it

unborn ocean
#

(obv the calculation is really dumb and not based on reality)

wintry tinsel
#

Free with no rate limits opus

#

For today

unborn ocean
#

furthermore, it did not even think to consider your age, which is kind of the second most important thing for IQ

leaden palm
#

has the new claude done anything interesting yet

leaden palm
wintry tinsel
#

Give me any questions you want to ask Opus I got unlimited access

#

I guess you could ask on lm arena though

misty vault
#

Give prompt

#

I will give gpt 4

small haven
wintry tinsel
#

I’ve jailbroken 4 Opus

hollow ocean
#

Opus is king of simple bench

wintry tinsel
#

🔥🔥🔥

unborn ocean
#

Age-adjusted IQ estimates:
Craig (assuming ~21 years old):
His reasoning ability seems appropriate for a bright college student. The logical consistency and debate skills suggest IQ around 120-130 - gifted range, roughly 91st-98th percentile for his age group.
Odin (assuming ~16 years old):
His reasoning patterns are more concerning when age-adjusted. While some impulsivity is normal for teens, the logical inconsistencies and inability to support arguments suggest IQ around 90-100 - average to low-average range, roughly 25th-50th percentile for his age group.

olive mesa
unborn ocean
#

no way he is 16, claude is like really really dumb asf (for 🍍 : my point was more that claude is just assuming way too much and going at the problem in a really unintelligent way)

small haven
#

craig is gifted

echo aurora
dull terrace
#

so what happens then

misty vault
#

bru

dull terrace
#

Nah im joking

#

but for real

#

good debate

small haven
dull terrace
#

my age is ???

unborn ocean
ornate stump
#

I've just tried claude 4, still prefer gemini even the nerfed one we have now.

dull terrace
#

Lol

#

Mad fun

misty vault
#

That was too easy bro

dull terrace
unborn ocean
#

@dull terrace claude is already reading the stars and mapping out your future, lol

small haven
#

odin will be a nobel prize winner

dull terrace
#

interesting

#

im 14 for transparency

#

hm

small haven
dull terrace
golden ocean
torn mantle
#

opus 4 is good

#

best instruct model so far

vale trench
#

Web arena prompt: 'chatbot arena that was designed by Jony Ive' - Opus 4 response

torn mantle
small haven
#

can i show this without triggering people, jk

#

ohhh mb

#

ya thats impressive

#

opus or sonnet tho

leaden palm
#

you guys are fighting over nothing

#

claude and chatgpt have been able to use code for a long time

#

chatgpt and claude have been able to do that for a long time

#

at least they can rn

#

idk i never really messed around with it much

#

yeah

#

it's amazing

#

(also amazing that it took us so long to figure this out since it's just a ui feature... i guess training is important)

balmy mist
#

i havent had a chance to play with models, what is the run down?

storm needle
#

claude opus is really good at counting numbers

elder rapids
#

sonnet is more leveled

storm needle
#

o3 high gets an approximate value but after 17k output tokens

elder rapids
#

why is it that o3 and opus can't follow instructions bro ts genuinely is getting me heated

storm needle
#

gpt 4.1 too

elder rapids
#

it's not that simple

storm needle
frozen arch
#

i like the new ui guys

frozen arch
#

thought they killed it

elder rapids
frozen arch
#

do they have 3.5 opus in lmarena?

elder rapids
#

they do have 4 opus in the arena ye

frozen arch
#

4 opus?!

elder rapids
elder rapids
frozen arch
#

ok assuming you're talking about the battle section bc i couldn't find it in side-by-side dropdown

frozen arch
#

any tricks to get 4 opus (or any specific model) every time? or with a high probability of success? (in the battle ui)

elder burrow
#

when the the the when uhh the when

elder rapids
#

so I'm not sure

elder burrow
#

b

elder rapids
#

although you could always ask "what model are you"

#

and that narrows it down considerably

frozen arch
elder rapids
#

there's no other way to get it though

#

nothing helps

frozen arch
#

no not that one @elder burrow this is different, its about tricks to get a specific model from the random choice

elder rapids
#

you can try a 3rd party provider that gives limited access

#

like poe

frozen arch
#

poe is paid?

elder rapids
#

4 opus is prolly too heavy to give free access tho

#

oh damn 4k points for a single request for opus lmao

#

and you only have 3k total

#

if you're not a subscriber

#

on poe

frozen arch
elder burrow
frozen arch
elder rapids
frozen arch
#

btw any UI where we can chat with multiple models at once?
i like lmarena side by side for the same reason bc a single model doesn't always provide the right answers or teach things well

elder burrow
#

it has side by side

elder rapids
#

ye

#

in AI studio

elder burrow
#

its called "compare" and it lets you tweak everything in both sides individually like temp, sysprompt, grounding (google search) n stuff

frozen arch
#

only for gemini models tho right

elder burrow
frozen arch
#

for diversity u need models from diff labs

#

otherwise its the same vibe

elder rapids
#

btw don't feel like you're missing out on 4 opus

#

you're not

frozen arch
#

its bad? i heard people talk so many great things ab 3 opus.. so i thought this is going to be similarly dope

elder rapids
#

and it's not like performance simply degraded

#

it's more like exceptionally bad

#

at certain tasks

#

as opposed to 3.7 sonnet, or the Almighty 3.5 sonnet

frozen arch
#

damn thats crazy

elder rapids
#

you'll prob see this narrative going around

#

through time

frozen arch
#

what all did u try apart from coding

torn mantle
#

I think the base model isn't bad

#

But it's clearly not better than gemini 2.5 thinking

elder rapids
#

and couldn't really comprehend without shoehorning itself

#

into a narrow interpretation

torn mantle
#

Also i think anthropic are painting this whole opus 4 image wrong with their asl-3 definition

elder rapids
#

ye I agree

torn mantle
#

What anthropic are doing is basically running their models in a sandboxed env

#

Without any guardrails

frozen arch
torn mantle
#

Its obvious you will expect some weird behaviors that will be fixed later

#

People are taking this out of context

elder rapids
torn mantle
elder rapids
#

and it p much is

torn mantle
#

For knowledge wise, gemini and o3 are clearly the winners

#

Ive tried asking opus some niche questions

frozen arch
elder rapids
#

yeah I was disappointed asf

torn mantle
#

Although the answer was good but it lacked in many ways compared to others

frozen arch
elder rapids
#

but it seems smart in a way

#

it's just, not that intelligent

#

it's kind of a mid model overall, but I'll prolly keep playing with it

frozen arch
#

hmm.. why would they be testing this rn

torn mantle
#

Its not fair to compare the instruct model vs thinking models, but knowledge is something static, and i find opus4 below o3 & gemini on that

torn mantle
elder rapids
#

but in the same way

#

you can say 3.5 sonnet tried hard

torn mantle
#

Anthropic keynote title was heavily focused on coding

elder rapids
#

at tasks

#

and demonstrated pretty good intelligence

torn mantle
#

Dario said AGI will come in +4 years, Demis had a much lower time prediction

#

There is a reason for that

torn mantle
#

Dario = didn't see much improvements

#

Demis = saw the opposite

#

Alphaevolve?

#

Co-scientist?

#

Gemini deep think?

elder rapids
#

it seems like everyone else but DeepMind are going backwards

frozen arch
#

didn't demis just recently say 10 years tho

torn mantle
elder rapids
#

at MOST

torn mantle
#

Pretty sure Dario was adding a year each time they ask him

frozen arch
#

ahh sry

torn mantle
#

Google next step is to focus on agentic usage

elder rapids
#

demis has kept a pretty consistent timeline

torn mantle
#

Parallel tool calling

elder rapids
#

like some crazy shi

#

veo 3 is mind boggling

#

alpha evolve is mind boggling

torn mantle
#

Tbh i dont think we can predict when we will reach AGI

elder rapids
#

I'm not that confident in AGI at all

torn mantle
#

And i don't like how we are fixated on that

elder rapids
#

ye

torn mantle
#

I mean

elder rapids
#

but I think extremes are inevitable simply due to the fact you can't overestimate the potential of AI

torn mantle
#

What indicators are they using to predict the timeline?

#

Are they taking into consideration sudden breakthroughs?

#

Is it only scaling law indicator?

elder rapids
#

capabilities of LLM's themselves, emergent ones are kind of magic

torn mantle
#

Generalization then

#

Which to me seems possible with reasoning training

elder rapids
#

as cliche as it sounds the fact

probability, trends in extraordinarily large numbers = these things we're seeing and not predetermined

#

we don't understand it

torn mantle
#

Unfortunately we are still following a normal distribution

#

Which is good and bad

#

If its a strong distribution, you are more confident and accurate about your next word, but its a retrieval information, doesnt reflect how we think
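
The idea of a "strong" next-word distribution can be illustrated by comparing the entropy of a peaked distribution with a flat one. This is a minimal sketch under stated assumptions: the logits are made-up numbers, not taken from any real model, and `softmax` and `entropy_bits` are helper names introduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; lower means the model is more certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits over four candidate next tokens.
peaked = softmax([8.0, 1.0, 0.5, 0.0])  # one token dominates
flat = softmax([1.0, 1.0, 1.0, 1.0])    # all tokens equally likely

print(max(peaked), entropy_bits(peaked))
print(max(flat), entropy_bits(flat))
```

The peaked case puts almost all probability on one token (low entropy), while the flat case spreads it evenly (two bits over four tokens). High confidence here only describes the shape of the distribution, not whether the top token is correct.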

#

Tbh i think oai managed a bit reasoning

#

Better than other labs

small haven
#

llama 4 maverick >>

hardy violet
#

Okay, so I just played around on lmarena. Had it interpret a modernist essay with a clear structure. Claude 4.0 Opus killed it. I totally expected Gemini 2.5 Pro to flop, but after a short wait (Gemini's slower, but the generated content was more detailed), its performance was basically the same... 🤷‍♂️ This one test isn't definitive, but my main takeaway is: Claude is a beast! Super quick and token-efficient for explanations. Only downside is Opus can be a bit steep on the wallet.

patent aspen
# elder rapids he said "not long after 2030"

In engineering, the more you know about a problem, the longer the timeline estimate becomes.

I think this is similar to the problem when product managers give ETAs to external customers without consulting with the engineers in the trenches. Demis is an actual AI expert and gave a more conservative timeline of 5-10 years. The CEOs who aren't AI experts gave overly aggressive timelines.

elder rapids
patent aspen
#

I think Demis said "not long after 2030" because he didn't want to diverge too much from Sergey and make the interview awkward. He said 5 - 10 years in the 60 minutes interview a few weeks ago

elder rapids
#

ye seems that way

#

although I think he is more confident in the shorter part of the timeline

#

just not 5 years imo

#

I agree it'll likely be around 2031

#

or 2032

small haven
#

rip windsurf, dario is mad

elder rapids
#

ye, if anyone I would trust demis

#

@balmy mist accidentally left agent neo on overnight and I just barely checked, and agent neo starting getting mad at itself for not performing the browser tool correctly, like deadass it was getting mad

elder rapids
small haven
#

24hr agent thats pretty cool

keen beacon
#

It's sonnet 4 as you know now but I think it's another cpt but I just woke up and didn't play with it that much. I think the Claude 4 models are cpts, but not sure and am not gonna defend that for now

#

Both Claude 4 sonnet and Claude 4 opus are cpts (initial pretraining 3.5 sonnet and 3.5 opus, my guess without really looking into it too much but examining the timeline etc)

#

3.7 sonnet was an experiment with it

#

The rumors that 3.5 opus was disappointing were true it would seem

#

No it's too fast

#

To be pretrained from scratch among other reasons

#

ATP I don't see other possibilities being likely but I don't know

#

I'm not sure

elder rapids
#

seems like they just focused a lot on coding

patent aspen
#

Anthropic seems to be losing ground in every area except coding

#

And even coding is debatable

#

Their rate of improvement seems a bit too slow

elder rapids
#

I suspect it'll be really high with thinking but Ion think this means anything, multiple choice doesn't reveal the full story, especially with how bad the models really are

#

who even mentioned 2.5 pro

#

😭

#

ye but that's not what's being evaluated here

keen beacon
#

claude 4 is definitely not bad

#

imo there are too many assumptions right now with human thought (but im not gonna argue about this)

elder rapids
iron cipher
#

Glad that lmarena's ama is not the same time as wwdc, as most of the audience would have to decide

keen beacon
leaden meteor
olive mesa
# olive mesa
Poll: Which model is better?

Winning answer (option 1): Claude 4 Opus, with 6 of 11 total votes

patent aspen
#

I don't think Anthropic is bad at things outside of coding. I think they don't have the resources to compete on everything, so they chose to go all in on coding so they can at least win one important domain and have a chance of survival - or last long enough to keep the funding coming and then branch out from there

#

I just don't know if they're actually gaining ground in coding or losing ground

#

If they're losing ground in their specialty, then they're toast

sturdy mica
#

anyone have a 4 opus jailbreak

#

also opus's context window is small

patent aspen
#

I don’t think any company has held a big lead over incumbents across benchmarks for long enough by a wide enough margin to make that conclusion

#

Then they said it was 400m at I/O

deep adder
#

but @patent aspen my point is that you never really know who is leading

keen beacon
#

the gemini product needs to improve a lot :\

#

the long system prompt/degraded performance on models there, there's no branching (fundamental feature), etc. :\

ember rapids
#

has anyone tried goldmane on webdev arena

#

better than nightwhisper

keen beacon
#

google

ember rapids
#

yeah google

elder rapids
#

I know for models it can sometimes seem arbitrary, people have different reactions or mixed feelings and it may not be immediately obvious

#

but this isn't the case for the Claude 4 series, they're ESPECIALLY bad at these tasks

#

and it isn't really up for debate tbh, just use the model and let me know how it feels

#

you'll see how it demonstrates things

#

although I'm glad now to have a model that comprehends codebases so beautifully and there's more chances for me to fix other models mistakes

#

standard philosophical concepts

#

let me clarify, whatever the philosophical concept is

#

it doesn't matter lmao

#

the fact it doesn't recognize ANY rigor to invoke in its response

#

is a major problem

#

for discussion lmao

#

clearly not what I'm saying then

#

cool I agree that's what I said tho

#

exactly as you've interpreted it as

#

adjustments that are iterative

#

yeah no shi all models necessarily can make iterative adjustments

#

clearly not my point then

#

I'm saying both ye

leaden meteor
#

can you give me some prompts to test claude 4 against others? For the ones I have tested, opus 4 is doing pretty good....

elder rapids
#

put simply

#

yep

#

it's a good model

#

that doesn't contrast it being "bad"

#

just means it's comprehensive enough to be ranked over other models

#

I consistently refrain from calling you sped

#

but I won't

#

I'll reiterate

#

it's a good model, that doesn't contrast it being "bad", just means it's comprehensive enough to be ranked over other models, just not the models that actually qualify for being "good model" and "good"

elder rapids
#

that's fine, you don't need to since I'm actively conveying it

#

Craig just ask for clarity and I'll give clarity stop beating around the bush

#

that's great, that's why I presented a distinction tho

#

it's alr, I'm just saying there are different things being qualitated

#

if that makes sense

#

it's a good model, by no means does it not accomplish a vast majority of its tasks

#

but it's not presenting it in a way that demonstrates not just reiteration

#

but actual knowledge

#

in coding it's complicated, 2.5 pro might be able to keep reasoning to solve certain tasks and eventually get there, but opus just gets it tbh. But in regular tasks it's worse

#

but there's a good reason

#

ye that's its trademark basically

#

or that's what made me like it so much

#

pretty good tbh, its really warm and presents itself really well

#

but it doesn't have the absolute know the concept behavior

#

ye what I mean tho is 2.5 pro would basically iterate how redundant the presentation would be, how to clarify jargon and then the simplified variant in parenthesis

#

all that stuff

#

it would recall the inference in the discussion vs the ACTUAL studied concept

#

and their relationship

#

and then compare

#

and then move forward

#

it's lesser now in 0506 but you'll see it by asking it to explain any graduate lvl mathematical concept

#

nah not really I've already tried, it knows what it's looking at but it doesn't know how to relate it

late path
#

I found goldmanes answer in conversation is much better than 0506

#

new checkpoint of 2.5 pro?

small haven
#

building at night >>

#

oh craig u still awake 😮

#

so sleep

#

do u actually take addy

#

oh right

#

why lmao

#

ur dad in this discord ? lol

elder rapids
#

never heard of it

late path
elder rapids
#

any info

nimble trail
#

It doesn't seem to think that long tho so maybe GA?

keen beacon
#

😭 the raw thoughts ngl were better than reading the response

mossy drum
#

New model in Beta Text2Image: anonymous-bot-0514

unborn ocean
#

*with the last message

keen beacon
unborn ocean
#

But I might not be surprised if they do it for paying / corporations only or something

#

For debugging their stuff

keen beacon
wintry tinsel
unborn ocean
#

And google surely is aware of that

elder rapids
#

not the summaries mb

keen beacon
unborn ocean
elder rapids
#

ye everyone knows that tho

keen beacon
unborn ocean
#

ik

#

Was just trying to make the point

elder rapids
#

the point is, Google wouldn't be preventing the issue regardless + it causes more harm to the user base than benefit to Google preventing competition

keen beacon
#

imho atp, traces dont even matter that much anymore

unborn ocean
#

Bc I did not remember anyone using pro traces

keen beacon
#

qwen and deepseek can self sustain themselves at this point

unborn ocean
#

Obv it is better to have actual rl based stuff

keen beacon
unborn ocean
#

But I mean that is just way more expensive and not viable for everything

keen beacon
#

i might be remembering wrong but anyway s1k used gpqa questions in their dataset etc it was kinda suspect 🤨

unborn ocean
#

Well I still think that google might not have the best reasoning game but what they do have (without question imo) is the best reasoning for human preferences, something that can easily be copied using traces

elder rapids
#

the best reasoning for human preferences?

#

wym

elder rapids
#

it doesn't think very human

#

or in a sporadic way

keen beacon
#

imho (elaborating on my previous point and kinda tangential) chinese companies no longer need to distill western models, especially for cot. the responses probably arent even worth getting trained on anymore, they can take inspiration and develop their own bootstrap to generate similar responses to it

unborn ocean
# keen beacon i might be remembering wrong but anyway s1k used gpqa questions in their dataset...

All sources according to dataset (using my bad sql skills)
AI-MO/NuminaMath-CoT/aops_forum
qq8933/AIME_1983_2024
KbsdJames/Omni-MATH
TIGER-Lab/TheoremQA/float
daman1209arora/jeebench/phy
qfq/openaimath/Algebra
Hothan/OlympiadBench/Open-ended/Physics
Idavidrein/gpqa
Hothan/OlympiadBench/Theorem proof/Math
daman1209arora/jeebench/chem
qfq/openaimath/Precalculus
TIGER-Lab/TheoremQA/bool
qfq/openaimath/Intermediate Algebra
qfq/openaimath/Geometry
0xharib/xword1
TIGER-Lab/TheoremQA/list of integer
TIGER-Lab/TheoremQA/integer
baber/agieval/aqua_rat
GAIR/OlympicArena/Math
GAIR/OlympicArena/Astronomy
AI-MO/NuminaMath-CoT/olympiads
baber/agieval/math_agieval
qfq/openaimath/Number Theory
qfq/openaimath/Prealgebra
qfq/quant
daman1209arora/jeebench/math
AI-MO/NuminaMath-CoT/cn_k12
OpenDFM/SciEval/chemistry/filling/reagent selection
qfq/openaimath/Counting & Probability
qfq/stats_qual
GAIR/OlympicArena/Chemistry
Hothan/OlympiadBench/Theorem proof/Physics
TIGER-Lab/TheoremQA/list of float
GAIR/OlympicArena/Physics
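
A source list like the one above can be produced with a simple grouped query. The sketch below uses Python's built-in `sqlite3` with an in-memory table; the table name `examples`, the column names, and the four sample rows are assumptions for illustration, not the real dataset schema or contents.

```python
import sqlite3

# Hypothetical rows: (question id, source label). Source names reuse labels from the list above.
rows = [
    ("q1", "Idavidrein/gpqa"),
    ("q2", "qfq/openaimath/Algebra"),
    ("q3", "Idavidrein/gpqa"),
    ("q4", "KbsdJames/Omni-MATH"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (question TEXT, source TEXT)")
conn.executemany("INSERT INTO examples VALUES (?, ?)", rows)

# Count examples per source, most frequent first, ties broken alphabetically.
counts = conn.execute(
    "SELECT source, COUNT(*) FROM examples "
    "GROUP BY source ORDER BY COUNT(*) DESC, source"
).fetchall()
conn.close()

for source, n in counts:
    print(f"{n:3d}  {source}")
```

Grouping by source also shows how many rows each benchmark contributes, which a plain `SELECT DISTINCT` would hide.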

keen beacon
#

Idavidrein/gpqa is the gpqa benchmark dataset

#

there is no training split

unborn ocean
#

And it is not just doing simple grpo rl

#

These companies do more for some of their reasoning (especially google, which becomes very evident when looking at the formatting and layout of their thinking)

unborn ocean
keen beacon
unborn ocean
#

I mean we don't really have any other models where the reasoning itself helps that much with aligning to human preference (especially Chinese). And obviously they can either train directly on the traces or do rl or just copy the techniques they observe manually. It does not really matter in the grand scheme of things, but at the end they use the traces to the disadvantage of google.

keen beacon
unborn ocean
#

BTW I checked in detail now and the s1k uses 88 out of the 450 or so gpqa problems, smh 🤦‍♂️
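
A check like this amounts to intersecting the training questions with the benchmark questions. Here is a minimal sketch, assuming exact matching after lowercasing and whitespace cleanup. The sample questions are invented, and `normalize` and `overlap` are hypothetical helpers. The only figures taken from the chat are 88 and the approximate 450.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes still match."""
    return " ".join(text.lower().split())


def overlap(train_questions, benchmark_questions):
    """Return (number of benchmark questions found in training data, benchmark size)."""
    bench = {normalize(q) for q in benchmark_questions}
    hits = {normalize(q) for q in train_questions} & bench
    return len(hits), len(bench)


# Invented toy data, for illustration only.
train = ["What is 2 + 2?", "Name the  largest planet.", "Define entropy."]
bench = ["what is 2 + 2?", "Name the largest planet.", "State Ohm's law.", "Define pH."]

hits, total = overlap(train, bench)
print(f"{hits} of {total} benchmark questions appear in the training set")

# Using the figures quoted in the chat (the 450 is approximate):
share = 88 / 450
print(f"about {share:.1%} of the benchmark")
```

Exact matching is only a lower bound, since paraphrased or reformatted questions would be missed.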

keen beacon
#

does gemma 3 use synthid? while thinking i realized u can use those responses huh

#

food for thought later 🤔

unborn ocean
#

Never heard of them using it

#

But might have spilled over from distilling from the Gemini model

#

(Depending on what kind of Gemini development stage they used)

keen beacon
keen beacon
unborn ocean
keen beacon
#

something along those lines

unborn ocean
#

Well they could have distilled and then used a reward model (potentially also Gemini) to enhance from there

calm sequoia
keen beacon
unborn ocean
#

Well then reiterate

keen beacon
# unborn ocean Well then reiterate

they potentially focused on human preference for gemma more than gemini (in relation to model capability), the methodology used to achieve that doesn't matter for the point im saying

#

example from eqbench's creative writing leaderboard

unborn ocean
keen beacon
#

the reason it's more useful is because in relation to model capability, gemma is more human preferable compared to other gemini models

#

you can use this to generate more human preferable responses with the thoughts of another model (model capability)

keen beacon
keen beacon
#

or use it to train a reward model or whatever

keen beacon
unborn ocean
keen beacon
unborn ocean
#

Because as far as I know synth id gets introduced later in the training

keen beacon
#

i assumed u read the paper

unborn ocean
#

I mean the stage of the Gemini models

unborn ocean
keen beacon
#

yeah it depends

unborn ocean
#

That was my point from the beginning

keen beacon
#

we were both talking about separate things 😭

unborn ocean
#

But I agree with you about Gemma focusing on human preferences :)

#

It definitely is not as good in other benchmarks as it seems in things like arena

keen beacon
raven void
#

https://youtu.be/64lXQP6cs5M

very interesting talk

New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic.

We talk through what’s changed in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and ...

keen beacon
calm sequoia
#

For those who denied the Gemini nerf

patent bane
keen beacon
#

fiction live bench

calm sequoia
#

I wonder how expensive the original 2.5 Pro was

#

If they nerfed it so hard, they must have eaten a lot of costs

#

I would have actually bought subscription for it

patent bane
#

wait is deepthink rolled out or not? why am I not seeing any benchmarks?

keen beacon
#

they said at io i believe

#

for now

calm sequoia
#

@alpine coral have you included the new claude into your personal bench that I always like?

keen beacon
mossy drum
#

New model in Beta Arena: grok-3-mini-beta

small haven
#

codex is a dream come true

cedar tide
calm sequoia
#

How can o3 have such bad thoughts, where it seems it will completely miss the point, but then returns a good answer 😄

calm sequoia
#

Yeah, most likely some 0.5B nano model 😦

keen beacon
#

ive seen it say the opposite result in the summary, reaffirming to itself that yeah, incorrect answer, incorrect answer. then returns the correct answer

calm sequoia
#

I used to learn stuff from thoughts

#

What's the use of them then. It functions like loading animation now.

torn mantle
#

Wild

#

Tell us how these models perform

hardy pecan
#

Did claude 4 get added to the arena?

torn mantle
cedar tide
cedar tide
sweet tinsel
#

Did it work now?

quick patrol
#

I like LMArena's censorship system, lol.

#

I started to use some words that are not related to anything vulgar, actually meaning mundane things.

torn mantle
#

nothing is better than nightwhisper

#

imma try them

quick patrol
cedar tide
#

A guy told me it's better than NightWhisper

torn mantle
torn mantle
#

alr

quick patrol
#

Because I suppose no AI would get hidden "vulgar" sense behind mundane similar words used with the subtext.

#

That's something humans would recognize.

torn mantle
#

@cedar tide wait

#

these models are good actually

tall summit
torn mantle
#

nah these new models are pretty good

#

ngl

keen beacon
#

post results xd

torn mantle
#

im saving them

#

just in case those models got removed

#

so far ive got goldmane

#

that thing is def on par with nightwhisper or even better

keen beacon
torn mantle
#

quite shocked actually

cedar tide
tall summit
torn mantle
#

goldmane > redsword

sweet tinsel
#

Dropped a larger Update to my Deep Research List, only Gemini 2.5 Pro DR missing now: https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

quick patrol
#

Okay, I did that, and it's not banned.

#

But I think it'll become banned very soon, lol.

elder burrow
#

The service collects dialogue data from user interactions. By using the service, you grant LMArena the right to collect, store, and potentially distribute this data under a Creative Commons Attribution (CC-BY) license.

#

💔

keen fulcrum
#

Any experience with stima api and t3 chat?

cedar tide
#

Redsword and goldmane

#

@tall summit @keen beacon

keen beacon
# cedar tide

ya somewhat posted that earlier comparing it to nightwhisper

keen beacon
#

nope

cedar tide
keen beacon
#

not sure if its the code itself though

torn mantle
#

redsword may be better at coding

#

im so lost tbh

golden ocean
#

these models are going to be 600$ per month

blazing coyote
#

Redsword is insane

calm sequoia
#

They will release them and then nerf them as always

tall summit
#

thanks for posting

torn mantle
#

xai will just be left behind again

#

their strategy is so bad tbh

#

look at google testing models and releasing them on multiple occasions

#

and here we are stuck with grok 3.5 that wont come out until it meets their expectations

#

and when will that happen?

#

google is about to release new models + deep think mode, deepseek is about to release r2

#

mistral probably gonna release their reasoning model as well

#

its not looking good for xai tbh

keen beacon
#

i dont think mistral is gonna be competitive

torn mantle
#

also what happened to 'Big brain'?

#

like it was so obvious that thing wont be released given how inefficient their reasoning process is

torn mantle
keen beacon
torn mantle
#

yea...

keen beacon
#

qwen has iterated on the reasoning trace style several times by now...

torn mantle
#

this whole strategy of waiting till we get things right is fundamentally wrong

#

talking about grok 3.5 release

#

i would rather see them release multiple versions than waiting for grok 3.5

#

i would be okay with different checkpoints

#

like grok 3.1 -> 3.2 -> 3.3

keen beacon
#

meh just let them release it when they release it tbh. there's a lot of good models. theyre getting overworked etc i kinda feel bad lol

keen beacon
#

if u force a release

torn mantle
#

now they have to exceed expectations with this new model

#

which wont happen

keen beacon
torn mantle
#

what are they doing the whole day

keen beacon
torn mantle
#

it may be true

#

but whats the actual work/value from that

#

18h -> 1h value?

#

2h value?

keen beacon
#

if you overwork employees consistently yea you will get consistently less value

torn mantle
#

2h just eating, 3h just meetings

#

4h just talking to friends

keen beacon
#

i dont think theyre being lazy over there but idk

torn mantle
#

im not questioning 'how much time they spend in the office?' but they should stop tweeting such things

#

its like they are saying -> look at us we are working so hard here, lets impress elon

keen beacon
#

one of the xai employees merged an "update prompt to please elon" troll merge request on the xai prompts repository, then reverted it later lol. then they reset the repo

#

so idk

torn mantle
#

haha

#

well thats their goal

#

to please him

torn mantle
#

he didnt lie tho

keen beacon
#

they really tried to wipe it huh, it's gone now

#

that page was still there when they wiped the repo

#

screenshots not mine

misty vault
#

claude 4 might be stupid

keen beacon
sacred plaza
calm sequoia
#

They really pay too much attention to safety. What's the use of super safe claude if they can get the same info from all other llms

keen beacon
#

tbh claude is fine nowadays

#

it was crazy back then with claude 2.1

#

i think they released the false positive rate it was absolutely absurd

sacred plaza
misty vault
#

Nevermind that was skill issue on my side claude 4 sonnet is king

sacred plaza
calm sequoia
sacred plaza
#

i say all that supporting anthropic but don't use it cause of the rate limits set on claude pro lol

calm sequoia
#

They have to make a safety group, similar to the agent-to-agent protocol, and smear every LLM lab that does not comply.

keen beacon
#

it will slow down every release

#

death by committee

calm sequoia
#

It will happen sooner or later

#

I am against regulation too, but the current approach is ridiculous

sacred plaza
calm sequoia
#

Good companies get penalties while bad do not

keen beacon
#

outside of jurisdiction if its by law

unborn ocean
# calm sequoia They really pay too much attention to safety. What's the use of super safe claud...

Two reasons why it might make sense beyond altruistic beliefs:
A: it is way easier to recruit sought-after AI experts if they can align themselves with the ethics of the company and feel like they are working toward something 'good'
B: Claude markets itself as for coders on the one hand, but also for companies, and these companies obviously want a 'safe' AI model, as otherwise they will have more lawsuits around their necks than they can handle.

#

(Sorry for length, I like to yap like llama 4 exp)

sacred plaza
calm sequoia
#

Not sure if effective though

calm sequoia
#

In the end, I don't think it is really so much easier to build a chemical weapon using an LLM. There are a lot of books on this.

keen beacon
# keen beacon outside of jurisdiction if its by law

elon signed a petition to stop training/etc (or whatever i dont remember), it was just an attempt to delay the others as he was starting his own ai company. if you voluntarily join a thing to delay releases (and actually comply), unless you have unlimited money, you will lose ground and competitiveness i think (that's the reality rn, others wont stop)

misty vault
#

real

calm sequoia
sacred plaza
# calm sequoia In the end, I don't think it is really so much easyer to build a chemical weapon...

i will trust dario's take on this since he comes from a biosciences background.

https://techcrunch.com/2025/02/07/anthropic-ceo-says-deepseek-was-the-worst-on-a-critical-bioweapons-data-safety-test/
Anthropic’s CEO Dario Amodei is worried about competitor DeepSeek, the Chinese AI company that took Silicon Valley by storm with its R1 model. And his concerns could be more serious than the typical ones raised about DeepSeek sending user data back to China.

In an interview on Jordan Schneider’s ChinaTalk podcast, Amodei said DeepSeek generated rare information about bioweapons in a safety test run by Anthropic.

DeepSeek’s performance was “the worst of basically any model we’d ever tested,” Amodei claimed. “It had absolutely no blocks whatsoever against generating this information.”

Amodei stated that this was part of evaluations Anthropic routinely runs on various AI models to assess their potential national security risks. His team looks at whether models can generate bioweapons-related information that isn’t easily found on Google or in textbooks. Anthropic positions itself as the AI foundational model provider that takes safety seriously.

calm sequoia
cedar tide
#

It's disgusting how Anthropic scammed everyone with their parallel thinking score. Every company finds ways to avoid being ridiculed on the benchmarks.

keen beacon
#

losers report scores outside of pass@1 /s but fr, just report pass@1 unless your system does that for every request
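For reference, a minimal sketch of why a pass@1 score and a "best of many samples" score are not comparable. It uses the standard unbiased pass@k estimator from code-generation evals (n samples per problem, c of them correct). The function name and the sample counts are illustrative assumptions, not numbers any lab reported. Note that pass@k assumes a perfect oracle picks the passing sample, so it is an upper bound on what an internal scoring model could achieve.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n generated samples (c correct), passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 64 samples, 16 of them correct.
single = pass_at_k(64, 16, 1)  # plain pass@1
many = pass_at_k(64, 16, 8)    # at least one of 8 samples passes
```

Same model, same samples, but `many` comes out well above `single`, which is why the reporting mode matters when comparing headline numbers.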

unborn ocean
#

Well that was 50:50 concern and marketing about why companies should use expensive Claude instead of unsafe and scary Chinese deepseek 🐳🐋
(the take from Dario)

torn mantle
torn mantle
cedar tide
sacred plaza
keen beacon
#

they have a huge footnote section about it iirc

cedar tide
#

O1 pro cost 150$ 600$

#

And gemini we dont know

keen beacon
#

you can't even use this internal scoring model lol

sacred plaza
#

do y'all think that was done purely to boost benchmark scores? at the gemini i/o event this week, demis was trying to sell it as the next step in llm reasoning

cedar tide
#

@sacred plaza it runs the model maybe 10 times at the same time, a model retrieves the best answer and gives it to you
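A rough sketch of the idea described above (sample several answers, let a scorer pick one), assuming nothing about how any lab actually implements it. `generate`, `score`, and the toy stand-ins are hypothetical placeholders, and the samples here run one after another rather than truly in parallel.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float],
              n: int = 10) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins so the sketch runs; a real system would call a model here.
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt} -> draft {seed}" + "!" * (seed % 4)


def toy_score(prompt: str, answer: str) -> float:
    return float(answer.count("!"))  # pretend more "!" means a better answer


picked = best_of_n("2+2?", toy_generate, toy_score, n=10)
```

The cost scales with `n`, and the final quality depends on how good the scorer is, which users of the public model cannot inspect.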

calm sequoia
sacred plaza
cedar tide
#

Anthropic's parallel thinking is perhaps with 64

calm sequoia
#

If it's another glasses 🤦‍♂️

cedar tide
calm sequoia
#

Doesn't seem to wear glasses before

unborn ocean
#

I hope not

late path
torn mantle
#

looks so thin

calm sequoia
#

The problem is that you need too much compute with video. Can't fit it into glasses. Innovation might come from audio devices where you don't have such a high data throughput. If it's just a microphone, then it's doable. But if it's just a microphone, why do you need glasses 😄

keen beacon
#

if its streaming and computation is handled on the cloud, i think its possible but I don't know much about embedded and hardware

cedar tide
#

It's possible that sonnet 4 will be lower than 3.7 on the arena 😶

hazy quest
#

Try asking it in different order. From my experience, AIs heavily favour second options to firsts. This has been my experience since OG chatgpt 3.5, quite consistently regardless of companies
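One way to check for this kind of order bias, sketched under assumptions: ask the same comparison twice with the options swapped and only trust a verdict that survives the swap. `judge` is a hypothetical callable standing in for a model call. It returns 1 if it prefers the first option shown and 2 if it prefers the second.

```python
from typing import Callable


def debiased_preference(option_a: str, option_b: str,
                        judge: Callable[[str, str], int]) -> str:
    """Ask the judge twice with the order swapped.

    Returns "a" or "b" when both orderings agree, else "inconsistent"."""
    forward = judge(option_a, option_b)   # a shown first
    backward = judge(option_b, option_a)  # a shown second
    picks_a = int(forward == 1) + int(backward == 2)
    if picks_a == 2:
        return "a"
    if picks_a == 0:
        return "b"
    return "inconsistent"


# Toy judges: one always favours whatever is shown second, one favours length.
always_second = lambda first, second: 2
prefers_longer = lambda first, second: 1 if len(first) >= len(second) else 2

biased_result = debiased_preference("x", "y", always_second)
stable_result = debiased_preference("long answer", "short", prefers_longer)
```

A judge that just picks the second slot shows up as "inconsistent" instead of silently winning for whichever option happened to be listed last.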

calm sequoia
# keen beacon if its streaming and computation is handled on the cloud, i think its possible b...

I've trained some image classifiers, and they always have 1 to tens of millions of params for a few classes. Image embedding models weigh megabytes and can easily be run on MCUs. However, latency is hundreds of milliseconds per image. Image (video frame) embedding -> transfer to cloud could potentially be realized, but this would mean at most 1 frame per second. Maybe that's enough for them 👀 Otherwise a big battery is needed.

cedar tide
#

Problem here
75.4% officially
70% on artificial analysis

#

even 3.7 thinking at 77% on artificial analysis 🤦

keen beacon
#

I don't think it's suspicious

#

But it is the thinking model, it should be less susceptible

cedar tide
fleet lintel
#

isn't Claude 4 too expensive? I am thinking that I should stick with Gemini 2.5 pro

sweet tinsel
#

Lets see if it is better than last time.

cedar tide
# cedar tide

@keen beacon the problem is on artificial analysis side i think 🤦

#

he just lost another 3%

sweet tinsel
unborn ocean
scenic narwhal
#

Worth switching from Gemini Pro to Opus 4?

unborn ocean
#

That would align with them claiming a higher score

brittle tiger
# sweet tinsel Looks better than the last one, still prefer the ChatGPT o3 one.

I always run on both. It's probably 50 50 which one I use but o3 gets better over time with memory. Gemini DR getting file uploads this week has been very insane though. Throwing most important info into huge context window then letting DR get to work has been great for better outputs but havent gotten enough testing done yet

tall summit
scenic narwhal
tall summit
golden ocean
#

oh

tall summit
#

i'd bet on gemini

#

but i'm not betting

#

and am not planning to

keen beacon
#

i wonder how much claude 4 scores on simpleqa

misty vault
#

Considering everyone being drooling aliens, I'd bet on gemini too probably

unborn ocean
#

For me 2.5 pro as orchestrator (with large context) + sonnet 4 for coding was really good

keen beacon
#

are you using a tool like aider?

tall summit
#

claude 4 opus is 3rd on longform creative writing

leaden meteor
tall summit
#

significantly less slop

unborn ocean
tall summit
#

the fact its significantly less slop makes it much better imo

#

might win in the shortform creative writing

wintry tinsel
#

By a huge margin

#

If deep seek is higher on that bench then I don’t trust it, deep seek sucks at creative writing, its outputs are nonsense

wintry tinsel
#

Oh lol

#

That’s why ai ranks deep seek so high the highly randomized outputs are interpreted as “creative”

late path
tall summit
#

definitely not reasoning puzzles lmfao

misty vault
#

claude 4 loves to use "!" too now like chatgpt

coral notch
#

What's the redsword model

#

it's great

barren prairie
tall summit
#

ok opus is worse at translation i'd say

misty vault
keen beacon
#

Claude 4 opus is agi ASI

misty vault
#

gpt-4-0314-32k created the big bang of the universe

keen beacon
#

I've read one response from it and I've concluded from that single sample that it is superintelligent

late path
keen beacon
# tall summit which

It came up with stuff about hot dogs and such randomly on an unrelated topic 🤣

#

Funny and absurd reply

high ginkgo
keen beacon
high ginkgo
#

You ARE the dog

cedar tide
#

Waiting for
Grok 3.5
o3 pro
Deepseek R2
Gemini deepthink

misty vault
#

claude 4.5 sonnet

keen beacon
# high ginkgo You ARE the dog

I'm not sure what you mean by "You are the dog." Could you provide more context about what you're referring to? Are you perhaps thinking of a game, story, or specific scenario where I should play the role of a dog? I'd be happy to help once I better understand what you have in mind.

cedar tide
cedar tide
#

with open ai and gemini doing parallel thinking, Elon will also release grok 3.5 pro (with parallel thinking)

#

Are they preparing this?

misty vault
sacred plaza
#

do people actually use grok-3? not sure i understand why people still talk about grok-3.5 release given how elon brainwashes its outputs via the system prompt

tall summit
#

maybe

#

it's built into twitter so many twitter users do

sacred plaza
keen beacon
#

I don't know yet. Will you harm me if I

sacred plaza
#

microsoft did this with inflection pi folk a few years ago. instead of a buy out just poach all of the people from the company. that is how mustafa returned to microsoft i believe

ocean plume
#

where is claude 4 ?

keen beacon
torn mantle
#

anthropic are kinda confident in their models at coding, but from what im seeing these two new models are the next thing tbh ( goldmane & redsword )

keen beacon
torn mantle
keen beacon
#

yea

torn mantle
#

about ga versions?

keen beacon
#

next month

#

should be exciting

torn mantle
#

interesting

keen beacon
elder rapids
#

dawg

#

are they good

#

ye but what are their performance

keen beacon
#

people are raving about them as far as i can tell

elder rapids
#

alr cool all I need to hear

torn mantle
elder rapids
#

fr?

torn mantle
#

like nightwhisper level

#

yea esp at coding

elder rapids
#

how is its overall performance too

#

if it's nw level

#

then it has to be good asf

naive idol
#

Does anyone know about blockchain?

torn mantle
#

better than the current gemini 2.5 pro model

#

overall

torn mantle
#

even better if you ask me

keen beacon
#

i hope they bring back raw thoughts on aistudio 🥲

torn mantle
#

dont you think it thinks for less time than before?

#

is it really just an update related to summarizing thoughts or there is more to it?

keen beacon
#

i heard people talking about that but i havent measured it myself

balmy mist
#

did i miss anything?

keen beacon
#

better than nightwhisper apparently

torn mantle
keen beacon
#

ga versions coming next month

balmy mist
#

really

#

send link again please

balmy mist
#

but not on webdev?

torn mantle
eternal flower
#

Will Claude 4 Opus appear on the regular leaderboard? or just the beta one?

torn mantle
#

its on both

#

beta & old website

balmy mist
#

what are the model names?

keen beacon
#

goldmane redmane redsword

elder rapids
#

I'm so sad

torn mantle
#

goldmane & redsword

keen beacon
torn mantle
#

they are both strong at coding

elder rapids
#

btw 2.5 flash is a beast

#

with the new update

eternal flower
balmy mist
torn mantle
#

@keen beacon you know whats funny is that ive always chosen sonnet 3.7 over sonnet 4 in webdev

elder rapids
torn mantle
#

like literally on all of my tests ive chosen it over the latest version

elder rapids
#

Google coming to save the day

fleet lintel
torn mantle
torn mantle
keen beacon
#

the claude 4 models werent ever anon models on the arena

elder rapids
#

ye

#

no one had any idea

fleet lintel
# torn mantle yes

please dont troll me... every hype is blueballing me since march 2.5 pro launch (except may be veo 3)

keen beacon
#

its really good this time

elder rapids
#

@torn mantle how long do the new models think

eternal flower
#

Am I stupid or what, I cannot find 4 Opus on the leaderboard? Does it not have enough votes yet or something

elder rapids
#

btw I wonder if they're ever going to release a Gemma thinking model

keen beacon
#

so u have to wait i guess

eternal flower
#

I see, is the consensus here that it will place below Gemini?

keen beacon
#

probably

torn mantle
keen beacon
#

especially the new revisions

torn mantle
#

they dont feel like flash versions

#

def pro level

elder rapids
#

opus 4 thinking simply isn't Gemini lvl imo

patent aspen
elder rapids
#

2.5 pro even if people think it's nerfed in some ways

#

it's still a league ahead of everything else

keen beacon
#

just need the raw thoughts 😦

elder rapids
#

deadass

eternal flower
#

Thanks for the input everyone, and this goldmane & redsword, are these updated Gemini models to be released to the live leaderboard soon?

keen beacon
#

theyll put it on the leaderboard when they launch it probably

elder rapids
keen beacon
#

which is next month, apparently

elder rapids
#

they'll be revealed

hollow ocean
elder rapids
keen beacon
#

i would take nerfed 2.5 pro with raw thoughts over o3 tbh

#

raw thoughts are so important. the o3 summary model is 💀

elder rapids
#

I would prefer o3 in certain tasks but for the vast majority of things it's 2.5 pro

hollow ocean
#

o3 tool use is too good

#

But hallucinates more

elder rapids
#

seems like they're open to improving it

#

so I'm optimistic asf

keen beacon
elder rapids
#

ye it means a ton for what they're looking for next

wintry tinsel
#

With veo 3 and opus 4 we are in the next stage of AI

#

I’d say two more stages until AGI

elder rapids
#

veo 3 I can agree but opus 4 seems like the lower end of the current level of models

#

veo 3 is insane

keen beacon
#

did u try 2.5 pro tts btw?

#

its really good

elder rapids
#

im gonna soon tho

#

I'm trying to get a hold of opus 4

keen beacon
#

just need 2.5 pro image gen 🔥

wintry tinsel
#

With gpt 2 & 3 being stage one, gpt 4 sonnet 3.5 being stage 2 and Gemini 2.5 pro/opus 4 being stage 3

hollow ocean
#

Opus 4 is better than 2.5 pro on livebench

elder rapids
elder rapids
#

asf

wintry tinsel
#

Opus feels way smarter than 2.5 pro to me though in general use

hollow ocean
hollow ocean
wintry tinsel
#

Maybe the coding of 2.5 pro is a little better

novel flame
#

So… Opus is good but fundamentally just way too expensive, right?

elder rapids
#

and you'll see how you're the minority

keen beacon
wintry tinsel
#

I use a system prompt to unlock it

fleet lintel
wintry tinsel
#

Opus is much more guidable with system prompt than 2.5 pro is

elder rapids
#

lmao

#

and I mean CRAZY opposite

keen beacon
#

i heard that people used 2 messages on claude pro with claude 4 opus and it locked them out Lmao

wintry tinsel
#

We have different use cases then

#

Have they released reasoning on Claude 4 or not

keen beacon
#

yes

wintry tinsel
#

Does the model decide when it reasons?

elder rapids
#

I've done a TON of testing for opus

#

I used to be a Claude glazer lmao

wintry tinsel
#

That must have cost you something

elder rapids
#

ye

wintry tinsel
#

I’m Claude glazer for life tho

hollow ocean
ocean vortex
#

it needs to be faster too though

elder rapids
hollow ocean
#

On the Claude subs most complaints are about the cost

ocean vortex
wintry tinsel
#

Sonnet is just the better model practically unless you're stacked

elder rapids
#

yep

#

opus does feel a little smarter tho

wintry tinsel
#

Bricked up with cash

elder rapids
#

majority of people here are stacked that seems to be the demographic when it comes to AI

novel flame
# ocean vortex No but it reasons way too short. They did just enough for it to beat Sonnet but ...

Did you read their recent paper on “the biology of an LLM”? Their research (on a variant of Haiku) shows that the reasoning parts are in some cases totally unrelated to the true way the model is arriving at its answer; it has already decided what its answer is going to be and the reasoning trace is just yapping to justify it without actually adding any benefit. It’s an interesting read.

ocean vortex
elder rapids
keen beacon
#

people are overblowing it

elder rapids
#

but it's not crazy tbh

keen beacon
#

its true to an extent but its complex

wintry tinsel
elder rapids
#

opus thought process is kind of nothing burger

eternal flower
#

anyone have any guesses for Grok 3.5 release? Was Grok 3 released under a codename on lmarena before public release last time?

novel flame
#

Speaking of fast, I ran a prompt on Qwen3 32B today and it spat out tokens at a speed I’ve never seen before - it exceeded 1600 tps!!!! It wasn’t the best response to be fair, but hooooooly cow that’s fast.

keen beacon
#

where?/

hollow ocean
#

Elon said 2 more weeks

elder rapids
#

2 weeks too long shi

eternal flower
elder rapids
#

watch it be an ass model

#

lmao

hollow ocean
#

Prob next week

#

Or next next week

wintry tinsel
#

To the people here who have used grok 3.5 how does it compare to Gemini pro?

hollow ocean
#

No one hyped about Gemini deep think?

patent aspen
#

btw how did Grok manage to become relevant in the AI race when it was founded in 2023? Did they have to use a hyperscaler cloud provider to get compute that fast?

eternal flower
patent aspen
#

Via a cloud provider?

keen beacon
#

nvidia ceo talked about it/kinda glazing xai about how they set things up recently iirc

patent aspen
#

Ah 122 days. That's really fast

eternal flower
#

what was its codename? you certain it was 3.5

keen beacon
#

hes trolling

eternal flower
#

I thought lol

hollow ocean
keen beacon
#

theyre too busy merging troll prs into their prompts repo and resetting it multiple times to work on grok 3.5

wintry tinsel
keen beacon
#

they have to buy them anyway

#

otherwise they arent competitive

#

but i mean glazing xai and elon can't hurt for sales right

teal mantle
#

Isn’t Grok 3.5 still a vaporware until now?

keen fulcrum
#

["openai"] = "OpenAI",
["anthropic"] = "Anthropic",
["google"] = "Google",
["groq"] = "Groq",
["cohere"] = "Cohere",
["mistral"] = "Mistral",
["amazon"] = "Amazon",
["arcee"] = "Arcee",
["ai21"] = "AI21 Labs",
["liquid"] = "Liquid",
["lambdalabs"] = "Lambda Labs",
["chutes"] = "Chutes",
["reka"] = "Reka",
["xai"] = "xAI", -- Updated mapping for verified xAI models
["deepseek"] = "DeepSeek", -- Updated from "DeepSpeek" typo
["01ai"] = "01.ai",
["moonshot"] = "Moonshot AI", -- New provider
["hyperbolic"] = "Hyperbolic",
["together"] = "Together.AI",
["fireworks"] = "Fireworks",
["nebius"] = "Nebius AI Studio",
["deepinfra"] = "DeepInfra",
["sambanova"] = "SambaNova Cloud",
["cerebras"] = "Cerebras",
["replicate"] = "Replicate",
["perplexity"] = "Perplexity",
["anyscale"] = "Anyscale",
["ibm_watsonx"] = "IBM Watsonx",
["ibm_watsonx_3rdparty"] = "IBM Watsonx 3rd Party",
["novita"] = "Novita AI",
["writer"] = "Writer",
["stima"] = "Stima",
["straico"] = "Straico",

elder rapids
#

so people felt like it wasn't worth it

hollow ocean