#general | Arena | Page 45

elder rapids May 22, 2025, 10:24 PM

#

no as in

dull terrace May 22, 2025, 10:24 PM

#

Craig answer the question

elder rapids May 22, 2025, 10:24 PM

#

reiterate it

dull terrace May 22, 2025, 10:24 PM

#

please

#

We do not have all day

#

I would like to see your benchmarks

#

to please

#

No no

#

you stated it

#

Tell us why you said it

#

from what perspective

#

If one model is benchmaxxed

#

why wouldnt others be??

#

Answer that now?

#

You think google is the only one that benchmaxxed?

#

Ok good were getting somewhere jimmy

#

That is contradictory to your previous statement.

#

You said this

#

No worry my dear sir

#

It ok little jimmy take your time

#

Thank you jimmy

#

Now I would like for you to explain one thing

#

for me.

#

How is gpt4o still up while being generations away from the other llms.

#

Specifically in the oai line of models

misty vault May 22, 2025, 10:29 PM

#

dull terrace Thank you jimmy

@deep adder 's name is Sydney

dull terrace May 22, 2025, 10:29 PM

#

Thx i take free sponsors

#

Ok

#

Good point

#

A counter argument

#

even if they are constantly updated

#

why would not the newer model be better,

#

Either way

#

A new gen should be better than an old gen

#

Whats going on rn

#

@misty vault A playstation 2

#

is better than a playstation 1

#

right

#

chatgpt 4o

#

gpt 4.1.

#

Answer that then

#

The lmarena measures model capability and human preference

#

No worry I can read

misty vault May 22, 2025, 10:33 PM

#

dull terrace @misty vault A playstation 2

gpt-4-0314 is agi

dull terrace May 22, 2025, 10:33 PM

#

Let use your logic for a second

#

If gpt 4o is the newest one

#

How come previous versions of it still are in the top ten

torn mantle May 22, 2025, 10:34 PM

#

i take that back

dull terrace May 22, 2025, 10:34 PM

#

or top 15

torn mantle May 22, 2025, 10:34 PM

#

maybe opus 4

#

is the real deal

dull terrace May 22, 2025, 10:34 PM

#

Does that make sense

torn mantle May 22, 2025, 10:34 PM

#

actually

#

let me finish craig

#

just maybe

dull terrace May 22, 2025, 10:35 PM

#

@torn mantle What happened to change your mind

torn mantle May 22, 2025, 10:35 PM

#

dull terrace @torn mantle What happened to change your mind

i cant share details yet, but its something jaw-dropping

dull terrace May 22, 2025, 10:35 PM

#

.....

#

So gpt2 is better than gpt4o

#

Its the point of a new generation

#

Whats going on rn

echo aurora May 22, 2025, 10:36 PM

#

loving the debate but remember:

✅ Treat others with Respect.

misty vault May 22, 2025, 10:36 PM

#

echo aurora loving the debate but remember: > ✅ Treat others with Respect.

drooling alien

dull terrace May 22, 2025, 10:36 PM

#

Because bias and potentially fishy stuff

#

Either way

#

it wouldnt make sense

misty vault May 22, 2025, 10:37 PM

#

I'm kidding @echo aurora I apologize, I love you (in a friendly way)

dull terrace May 22, 2025, 10:37 PM

#

You did three times lol

#

but ok

mossy drum May 22, 2025, 10:37 PM

#

New model in Beta Text2Image: gemini-2.0-flash-preview-image-generation

dull terrace May 22, 2025, 10:37 PM

#

You just lost.....

#

Holy cope

#

But good debate

#

maybe in 2 years or so you do really good i see the potential in you gl

misty vault May 22, 2025, 10:38 PM

#

Does that mean he is agi

dull terrace May 22, 2025, 10:39 PM

#

misty vault Does that mean he is agi

No agi

#

and consciousness are 2 different things.

golden ocean May 22, 2025, 10:40 PM

#

gpt-4 is asi and chatgpt 4o is artificial stupidity but gpt-4 is older model

dull terrace May 22, 2025, 10:40 PM

#

golden ocean gpt-4 is asi and chatgpt 4o is artificial stupidity but gpt-4 is older model

thx for clearing the debate

#

@misty vault So who won in your opinion?

misty vault May 22, 2025, 10:42 PM

#

I will ask sydney

tropic nimbus May 22, 2025, 10:42 PM

#

0 substance over the last 50 messages whats happening here

#

are there any new anonymous models lately? anything interesting?

unborn ocean May 22, 2025, 10:52 PM

#

this is claude's assessment (based on the discussion) of who won:
Craig Federighi's estimated cognitive profile:
[...]
Estimated range: 115-125 IQ - Above average to superior range, with strong logical reasoning and debate skills.
Odin's estimated cognitive profile:
[...]
Estimated range: 95-105 IQ - Average range, with notable weaknesses in logical reasoning and debate methodology.
Key difference: Craig demonstrates systematic thinking and logical consistency, while Odin relies more on assertion and deflection when challenged on specifics.

#

@deep adder new claude thinks you are highly intelligent 🤣

wintry tinsel May 22, 2025, 10:53 PM

#

Sim theory has free opus today incredible

#

I’m spamming it

unborn ocean May 22, 2025, 10:54 PM

#

(obv the calculation is really dumb and not based on reality)

misty vault May 22, 2025, 10:54 PM

#

unborn ocean this is claude's assessment (based on the discussion) of who won: **Craig Federi...

LMAO

small haven May 22, 2025, 10:54 PM

#

unborn ocean this is claude's assessment (based on the discussion) of who won: **Craig Federi...

yeoooo lmao

wintry tinsel May 22, 2025, 10:54 PM

#

Free with no rate limits opus

#

For today

unborn ocean May 22, 2025, 10:54 PM

#

furthermore, it did not even think to consider your age, which is kind of the second most important thing for IQ

leaden palm May 22, 2025, 10:54 PM

#

has the new claude done anything interesting yet

leaden palm May 22, 2025, 10:55 PM

#

leaden palm has the new claude done anything interesting yet

honestly will be more excited for full remote mcp and tools rollout than this

wintry tinsel May 22, 2025, 10:55 PM

#

Give me any questions you want to ask Opus I got unlimited access

#

I guess you could ask on lm arena though

misty vault May 22, 2025, 10:55 PM

#

unborn ocean furthermore, it did not even think to consider your age, which is kind of the se...

yea

#

Give prompt

#

I will give gpt 4

small haven May 22, 2025, 10:56 PM

#

unborn ocean this is claude's assessment (based on the discussion) of who won: **Craig Federi...

"odin relies more on assertion and deflection when challenged" 😭

wintry tinsel May 22, 2025, 10:56 PM

#

I’ve jailbroken 4 Opus

hollow ocean May 22, 2025, 10:56 PM

#

Opus is king of simple bench

wintry tinsel May 22, 2025, 10:57 PM

#

🔥🔥🔥

unborn ocean May 22, 2025, 10:57 PM

#

misty vault Give prompt

assess the IQ of these two main debaters in great detail:
(i know it is not as easy as that but you can do a guesstimate to compare the scores):

📎 context.txt

#

Age-adjusted IQ estimates:
Craig (assuming ~21 years old):
His reasoning ability seems appropriate for a bright college student. The logical consistency and debate skills suggest IQ around 120-130 - gifted range, roughly 91st-98th percentile for his age group.
Odin (assuming ~16 years old):
His reasoning patterns are more concerning when age-adjusted. While some impulsivity is normal for teens, the logical inconsistencies and inability to support arguments suggest IQ around 90-100 - average to low-average range, roughly 25th-50th percentile for his age group.

olive mesa May 22, 2025, 10:58 PM

#

unborn ocean May 22, 2025, 10:58 PM

#

no way he is 16, claude is like really really dumb asf (for 🍍 : my point was more that claude is just assuming way too much and going at the problem in a really unintelligent way)

small haven May 22, 2025, 10:59 PM

#

craig is gifted

echo aurora May 22, 2025, 10:59 PM

#

unborn ocean no way he is 16, claude is like really really dumb asf (for 🍍 : my point was mo...

lets keep things respectful pls.

dull terrace May 22, 2025, 11:01 PM

#

unborn ocean **Age-adjusted IQ estimates**: **Craig** (assuming ~**21** years old): His reaso...

Im 10 lol

#

so what happens then

misty vault May 22, 2025, 11:02 PM

#

bru

dull terrace May 22, 2025, 11:02 PM

#

Nah im joking

#

but for real

#

good debate

small haven May 22, 2025, 11:02 PM

#

dull terrace so what happens then

u got homework bud

dull terrace May 22, 2025, 11:03 PM

#

my age is ???

unborn ocean May 22, 2025, 11:03 PM

#

dull terrace Im 10 lol

congrats you are even more 'intelligent' 🥳 than @deep adder

dull terrace May 22, 2025, 11:03 PM

#

unborn ocean congrats you are even more 'intelligent' 🥳 than @deep adder

yayyy

ornate stump May 22, 2025, 11:03 PM

#

I've just tried claude 4, still prefer gemini even the nerfed one we have now.

dull terrace May 22, 2025, 11:03 PM

#

Lol

#

Mad fun

misty vault May 22, 2025, 11:04 PM

#

That was too easy bro

dull terrace May 22, 2025, 11:05 PM

#

misty vault That was too easy bro

?

unborn ocean May 22, 2025, 11:06 PM

#

@dull terrace claude is already reading the stars and mapping out your future, lol

small haven May 22, 2025, 11:07 PM

#

odin will be a nobel prize winner

dull terrace May 22, 2025, 11:07 PM

#

unborn ocean @dull terrace claude is already reading the stars and mapping out your ...

I do own an ai organization hm

#

interesting

#

im 14 for transparency

#

hm

small haven May 22, 2025, 11:08 PM

#

dull terrace im 14 for transparency

yo it was kinda close to 16

dull terrace May 22, 2025, 11:09 PM

#

small haven yo it was kinda close to 16

yeah i might have to see claude 4 lol

golden ocean May 22, 2025, 11:12 PM

#

https://ezgif.com/images/loadcat.gif

torn mantle May 22, 2025, 11:13 PM

#

opus 4 is good

#

best instruct model so far

vale trench May 22, 2025, 11:19 PM

#

Web arena prompt: 'chatbot arena that was designed by Jony Ive' - Opus 4 response

torn mantle May 22, 2025, 11:20 PM

#

oh

small haven May 22, 2025, 11:58 PM

#

can i show this without triggering people, jk

#

ohhh mb

#

ya thats impressive

#

opus or sonnet tho

leaden palm May 23, 2025, 12:00 AM

#

you guys are fighting over nothing

#

claude and chatgpt have been able to use code for a long time

#

chatgpt and claude have been able to do that for a long time

#

at least they can rn

#

idk i never really messed around with it much

#

yeah

#

it's amazing

#

(also amazing that it took us so long to figure this out since it's just a ui feature... i guess training is important)

balmy mist May 23, 2025, 12:31 AM

#

i havent had a chance to play with models, what is the run down?

storm needle May 23, 2025, 12:56 AM

#

claude opus is really good at counting numbers

elder rapids May 23, 2025, 12:57 AM

#

balmy mist i havent had a chance to play with models, what is the run down?

opus is unironically especially bad outside of coding but has good tool usage

#

sonnet is more leveled

storm needle May 23, 2025, 12:57 AM

#

o3 high gets an approximate value but after 17k output tokens

elder rapids May 23, 2025, 1:00 AM

#

why is it that o3 and opus can't follow instructions bro ts genuinely is getting me heated

storm needle May 23, 2025, 1:01 AM

#

storm needle o3 high gets an approximate value but after 17k output tokens

gpt 4.5 fails miserably even with 0 temperature

#

gpt 4.1 too

elder rapids May 23, 2025, 1:01 AM

#

storm needle gpt 4.5 fails miserably even with 0 temperature

this is essentially a reasoning process tho

#

it's not that simple

storm needle May 23, 2025, 1:02 AM

#

elder rapids this is essentially a reasoning process tho

claude opus didn't need

frozen arch May 23, 2025, 1:02 AM

#

i like the new ui guys

frozen arch May 23, 2025, 1:02 AM

#

storm needle gpt 4.5 fails miserably even with 0 temperature

4.5 still on?

#

thought they killed it

elder rapids May 23, 2025, 1:02 AM

#

storm needle claude opus didn't need

I'm talking about Claude opus lol

frozen arch May 23, 2025, 1:04 AM

#

do they have 3.5 opus in lmarena?

elder rapids May 23, 2025, 1:04 AM

#

they do have 4 opus in the arena ye

frozen arch May 23, 2025, 1:04 AM

#

4 opus?!

elder rapids May 23, 2025, 1:04 AM

#

elder rapids I'm talking about Claude opus lol

ask this in the api

elder rapids May 23, 2025, 1:04 AM

#

frozen arch 4 opus?!

ye there's no 3.5

frozen arch May 23, 2025, 1:05 AM

#

ok assuming you're talking about the battle section bc i couldn't find it in side-by-side dropdown

elder rapids May 23, 2025, 1:06 AM

#

frozen arch ok assuming you're talking about the battle section bc i couldn't find it in sid...

ye unfortunately

frozen arch May 23, 2025, 1:06 AM

#

any tricks to get 4 opus (or any specific model) every time? or with a high probability of success? (in the battle ui)

elder burrow May 23, 2025, 1:07 AM

#

when the the the when uhh the when

elder rapids May 23, 2025, 1:07 AM

#

frozen arch any tricks to get 4 opus (or any specific model) every time? or with a high prob...

it doesn't have any particular traits afaik but I haven't tested it's short response tendencies

#

so I'm not sure

elder burrow May 23, 2025, 1:07 AM

#

📎 analyticstinky.js

#

b

elder rapids May 23, 2025, 1:07 AM

#

although you could always ask "what model are you"

#

and that narrows it down considerably

storm needle May 23, 2025, 1:08 AM

#

frozen arch any tricks to get 4 opus (or any specific model) every time? or with a high prob...

you cant

frozen arch May 23, 2025, 1:08 AM

#

elder rapids although you could always ask "what model are you"

ya true but ill get pissed off if i see amazon models or llama there

elder rapids May 23, 2025, 1:09 AM

#

frozen arch ya true but ill get pissed off if i see amazon models or llama there

ye it clogs things up

#

there's no other way to get it though

#

nothing helps

frozen arch May 23, 2025, 1:09 AM

#

no not that one @elder burrow this is different, its about tricks to get a specific model from the random choice

elder rapids May 23, 2025, 1:09 AM

#

you can try a 3rd party provider that gives limited access

#

like poe

elder burrow May 23, 2025, 1:10 AM

#

frozen arch no not that one @elder burrow this is different, its about tricks to ge...

why would you want that

frozen arch May 23, 2025, 1:10 AM

#

poe is paid?

elder rapids May 23, 2025, 1:10 AM

#

4 opus is prolly too heavy to give free access tho

#

oh damn 4k points for a single request for opus lmao

#

and you only have 3k total

#

if you're not a subscriber

#

on poe

frozen arch May 23, 2025, 1:11 AM

#

elder burrow why would you want that

they test new models in the battle mode even when its not available for direct chat
it had o3 before it was released and so on

elder burrow May 23, 2025, 1:11 AM

#

elder rapids on poe

poe my ass

frozen arch May 23, 2025, 1:11 AM

#

elder rapids and you only have 3k total

damn nice

elder rapids May 23, 2025, 1:11 AM

#

frozen arch damn nice

sorry man

frozen arch May 23, 2025, 1:12 AM

#

btw any UI where we can chat with multiple models at once?
i like lmarena side by side for the same reason bc a single model doesn't always provide the right answers or teach things well

elder burrow May 23, 2025, 1:12 AM

#

frozen arch btw any UI where we can chat with multiple models at once? i like lmarena side ...

if you dont mind gemini then uh

#

it has side by side

elder rapids May 23, 2025, 1:13 AM

#

ye

#

in AI studio

elder burrow May 23, 2025, 1:13 AM

#

its called "compare" and it lets you tweak everything in both sides individually like temp, sysprompt, grounding (google search) n stuff

frozen arch May 23, 2025, 1:13 AM

#

only for gemini models tho right

elder burrow May 23, 2025, 1:13 AM

#

frozen arch only for gemini models tho right

ye

frozen arch May 23, 2025, 1:13 AM

#

for diversity u need models from diff labs

#

otherwise its the same vibe

elder rapids May 23, 2025, 1:14 AM

#

btw don't feel like you're missing out on 4 opus

#

you're not

frozen arch May 23, 2025, 1:15 AM

#

its bad? i heard people talk so many great things ab 3 opus.. so i thought this is going to be similarly dope

elder rapids May 23, 2025, 1:15 AM

#

frozen arch its bad? i heard people talk so many great things ab 3 opus.. so i thought this...

for anything but coding it's surprisingly bad imo

#

and it's not like performance simply degraded

#

it's more like exceptionally bad

#

at certain tasks

#

as opposed to 3.7 sonnet, or the Almighty 3.5 sonnet

frozen arch May 23, 2025, 1:17 AM

#

damn thats crazy

elder rapids May 23, 2025, 1:18 AM

#

you'll prob see this narrative going around

#

through time

frozen arch May 23, 2025, 1:18 AM

#

what all did u try apart from coding

torn mantle May 23, 2025, 1:18 AM

#

I think the base model isn't bad

#

But it's clearly not better than gemini 2.5 thinking

elder rapids May 23, 2025, 1:18 AM

#

frozen arch what all did u try apart from coding

philosophy, but it sucked at instruction following

#

and couldn't really comprehend without shoehorning itself

#

into a narrow interpretation

torn mantle May 23, 2025, 1:19 AM

#

Also i think anthropic are painting this whole opus 4 image wrong with their asl-3 definition

elder rapids May 23, 2025, 1:19 AM

#

ye I agree

torn mantle May 23, 2025, 1:19 AM

#

What anthropic are doing is basically running their models in a sandboxed env

#

Without any guardrails

frozen arch May 23, 2025, 1:19 AM

#

torn mantle But it's clearly not better than gemini 2.5 thinking

tbh i find 2.5 pro better than 2.5 thinking

torn mantle May 23, 2025, 1:19 AM

#

Its obvious you will expect some weird behaviors that will be fixed later

#

People are taking this out of context

frozen arch May 23, 2025, 1:20 AM

#

elder rapids and couldn't really comprehend without shoehorning itself

oh damn

elder rapids May 23, 2025, 1:20 AM

#

frozen arch oh damn

ye so you'd expect it to be bad at writing and stuff

torn mantle May 23, 2025, 1:20 AM

#

frozen arch tbh i find 2.5 pro better than 2.5 thinking

Wdym we dont have the base model yet

elder rapids May 23, 2025, 1:20 AM

#

and it p much is

torn mantle May 23, 2025, 1:21 AM

#

Knowledge wise, gemini and o3 are clearly the winners

#

Ive tried asking opus some niche questions

frozen arch May 23, 2025, 1:21 AM

#

elder rapids ye so you'd expect it to be bad at writing and stuff

i find any anthropic model's writing very nice.. surprising to know opus 4 is bad at it

elder rapids May 23, 2025, 1:21 AM

#

yeah I was disappointed asf

torn mantle May 23, 2025, 1:21 AM

#

Although the answer was good but it lacked in many ways compared to others

frozen arch May 23, 2025, 1:21 AM

#

torn mantle Wdym we dont have the base model yet

we got 2.5 pro preview on lmarena

elder rapids May 23, 2025, 1:21 AM

#

torn mantle Ive tried asking opus some niche questions

it doesn't seem crazy heavy like gpt 4.5

#

but it seems smart in a way

#

it's just, not that intelligent

#

it's kind of a mid model overall, but I'll prolly keep playing with it

frozen arch May 23, 2025, 1:23 AM

#

hmm.. why would they be testing this rn

torn mantle May 23, 2025, 1:23 AM

#

Its not fair to compare the instruct model vs thinking models, but knowledge is something static, and i find opus4 below o3 & gemini on that

elder rapids May 23, 2025, 1:23 AM

#

torn mantle Its not fair to compare the instruct model vs thinking models, but knowledge is ...

ye

torn mantle May 23, 2025, 1:23 AM

#

elder rapids it's kind of a mid model overall, but I'll prolly keep playing with it

It all makes sense to me now

elder rapids May 23, 2025, 1:23 AM

#

but in the same way

#

you can say 3.5 sonnet tried hard

torn mantle May 23, 2025, 1:24 AM

#

Anthropic keynote title was heavily focused on coding

elder rapids May 23, 2025, 1:24 AM

#

at tasks

#

and demonstrated pretty good intelligence

torn mantle May 23, 2025, 1:24 AM

#

Dario said AGI will come in +4 years, Demis had a much lower time prediction

#

There is a reason for that

elder rapids May 23, 2025, 1:24 AM

#

torn mantle Anthropic keynote title was heavily focused on coding

yep

torn mantle May 23, 2025, 1:24 AM

#

Dario = didn't see much improvements

#

Demis = saw the opposite

#

Alphaevolve?

#

Co-scientist?

#

Gemini deep think?

elder rapids May 23, 2025, 1:25 AM

#

it seems like everyone else but DeepMind are going backwards

frozen arch May 23, 2025, 1:25 AM

#

didn't demis just recently say 10 years tho

torn mantle May 23, 2025, 1:25 AM

#

frozen arch didn't demis just recently say 10 years tho

He did?

elder rapids May 23, 2025, 1:25 AM

#

frozen arch didn't demis just recently say 10 years tho

he said "not long after 2030"

#

at MOST

torn mantle May 23, 2025, 1:25 AM

#

Pretty sure Dario was adding a year each time they ask him

frozen arch May 23, 2025, 1:26 AM

#

ahh sry

torn mantle May 23, 2025, 1:26 AM

#

Google next step is to focus on agentic usage

elder rapids May 23, 2025, 1:26 AM

#

demis has kept a pretty consistent timeline

torn mantle May 23, 2025, 1:26 AM

#

Parallel tool calling

elder rapids May 23, 2025, 1:26 AM

#

torn mantle Google next step is to focus on agentic usage

yep and I have a feeling they're going to do it right

#

like some crazy shi

#

veo 3 is mind boggling

#

alpha evolve is mind boggling

torn mantle May 23, 2025, 1:28 AM

#

Tbh i dont think we can predict when we will reach AGI

elder rapids May 23, 2025, 1:28 AM

#

I'm not that confident in AGI at all

torn mantle May 23, 2025, 1:28 AM

#

And i don't like how we are fixated on that

elder rapids May 23, 2025, 1:29 AM

#

ye

torn mantle May 23, 2025, 1:29 AM

#

I mean

elder rapids May 23, 2025, 1:29 AM

#

but I think extremes are inevitable simply due to the fact you can't overestimate the potential of AI

torn mantle May 23, 2025, 1:29 AM

#

What indicators are they using to predict the timeline?

#

Are they taking into consideration sudden breakthroughs?

#

Is it only scaling law indicator?

elder rapids May 23, 2025, 1:30 AM

#

torn mantle What indicators are they using to predict the timeline?

it's not scaling and it's not that arbitrary imo

#

capabilities of LLM's themselves, emergent ones are kind of magic

torn mantle May 23, 2025, 1:31 AM

#

Generalization then

#

Which to me seems possible with reasoning training

elder rapids May 23, 2025, 1:32 AM

#

as cliche as it sounds the fact

probability, trends in extraordinarily large numbers = these things we're seeing and not predetermined

#

we don't understand it

torn mantle May 23, 2025, 1:32 AM

#

Unfortunately we are still following a normal distribution

#

Which is good and bad

#

If its a strong distribution, you are more confident and accurate about your next word, but thats information retrieval, it doesnt reflect how we think

#

https://x.com/ylecun/status/1845193021584728365

Yann LeCun (@ylecun)

Worth repeating:
Do not confuse retrieval with reasoning.
Do not confuse rote learning with understanding.
Do not confuse accumulated knowledge with intelligence.

#

Tbh i think oai managed a bit of reasoning

#

Better than other labs

small haven May 23, 2025, 1:38 AM

#

llama 4 maverick >>

hardy violet May 23, 2025, 1:42 AM

#

Okay, so I just played around on lmarena. Had it interpret a modernist essay with a clear structure. Claude 4.0 Opus killed it. I totally expected Gemini 2.5 Pro to flop, but after a short wait (Gemini's slower, but the generated content was more detailed), its performance was basically the same... 🤷‍♂️ This one test isn't definitive, but my main takeaway is: Claude is a beast! Super quick and token-efficient for explanations. Only downside is Opus can be a bit steep on the wallet.

patent aspen May 23, 2025, 1:56 AM

#

elder rapids he said "not long after 2030"

In engineering, the more you know about a problem, the longer the timeline estimate becomes.

I think this is similar to the problem when product managers give ETAs to external customers without consulting with the engineers in the trenches. Demis is an actual AI expert and gave a more conservative timeline of 5-10 years. The CEOs who aren't AI experts gave overly aggressive timelines.

elder rapids May 23, 2025, 1:58 AM

#

hardy violet Okay, so I just played around on lmarena. Had it interpret a modernist essay wit...

ask it to write an essay to a set of criteria plus specific sentence styles, or a creative essay that's formal enough for an academic setting

patent aspen May 23, 2025, 1:58 AM

#

I think Demis said "not long after 2030" because he didn't want to diverge too much from Sergey and make the interview awkward. He said 5 - 10 years in the 60 minutes interview a few weeks ago

elder rapids May 23, 2025, 1:59 AM

#

ye seems that way

#

although I think he is more confident in the shorter part of the timeline

#

just not 5 years imo

#

I agree it'll likely be around 2031

#

or 2032

small haven May 23, 2025, 2:00 AM

#

rip windsurf, dario is mad

elder rapids May 23, 2025, 2:01 AM

#

ye, if anyone I would trust demis

#

@balmy mist accidentally left agent neo on overnight and I just barely checked, and agent neo started getting mad at itself for not performing the browser tool correctly, like deadass it was getting mad

small haven May 23, 2025, 2:07 AM

#

elder rapids @balmy mist accidentally left agent neo on overnight and I just barely...

what is agent neo

elder rapids May 23, 2025, 2:08 AM

#

small haven what is agent neo

flowith agent

small haven May 23, 2025, 2:15 AM

#

24hr agent thats pretty cool

keen beacon May 23, 2025, 2:27 AM

#

It's sonnet 4 as you know now but I think it's another cpt but I just woke up and didn't play with it that much. I think the Claude 4 models are cpts, but not sure and am not gonna defend that for now

#

Both Claude 4 sonnet and Claude 4 opus are cpts (initial pretraining 3.5 sonnet and 3.5 opus, my guess without really looking into it too much but examining the timeline etc)

#

3.7 sonnet was an experiment with it

#

The rumors that 3.5 opus was disappointing were true it would seem

#

No it's too fast

#

To be pretrained from scratch among other reasons

#

ATP I don't see other possibilities being likely but I don't know

#

I'm not sure

elder rapids May 23, 2025, 2:37 AM

#

seems like they just focused a lot on coding

patent aspen May 23, 2025, 2:44 AM

#

Anthropic seems to be losing ground in every area except coding

#

And even coding is debatable

#

Their rate of improvement seems a bit too slow

elder rapids May 23, 2025, 2:48 AM

#

I suspect it'll be really high with thinking but Ion think this means anything, multiple choice doesn't reveal the full story, especially with how bad the models really are

#

who even mentioned 2.5 pro

#

😭

#

ye but that's not what's being evaluated here

keen beacon May 23, 2025, 2:51 AM

#

claude 4 is definitely not bad

#

imo there are too many assumptions right now with human thought (but im not gonna argue about this)

elder rapids May 23, 2025, 2:51 AM

#

keen beacon claude 4 is definitely not bad

it is lmao, me and plenty of other people have come to the conclusion they're simply bad at things outside of coding

iron cipher May 23, 2025, 2:53 AM

#

Glad that lmarena's ama is not the same time as wwdc, as most of the audience would have to decide

keen beacon May 23, 2025, 2:53 AM

#

patent aspen Their rate of improvement seems a bit too slow

if its true that theyre using cpts (i think its likely to be the case), i think their next actual new pretraining run will be interesting to see if theyre kinda stuck

leaden meteor May 23, 2025, 2:56 AM

#

elder rapids it is lmao, me and plenty of other people have come to the conclusion they're si...

Why do you think claude 4 is bad? Its good at coding. It seems to be best on simplebench and even livebench reasoning. What makes you say it is bad?

olive mesa May 23, 2025, 2:59 AM

#

olive mesa

Poll: Which model is better? Winning answer: Claude 4 Opus (6 of 11 votes)

patent aspen May 23, 2025, 3:00 AM

#

I don't think Anthropic is bad at things outside of coding. I think they don't have the resources to compete on everything, so they chose to go all in on coding so they can at least win one important domain and have a chance of survival - or last long enough to keep the funding coming and then branch out from there

#

I just don't know if they're actually gaining ground in coding or losing ground

#

If they're losing ground in their specialty, then they're toast

sturdy mica May 23, 2025, 3:02 AM

#

wintry tinsel I’ve jailbroken 4 Opus

how ?

#

anyone have a 4 opus jailbreak

#

also opus's context window is small

patent aspen May 23, 2025, 3:09 AM

#

I don’t think any company has held a big lead over incumbents across benchmarks for long enough by a wide enough margin to make that conclusion

#

Then they said it was 400m at I/O

deep adder May 23, 2025, 3:14 AM

#

but @patent aspen my point is that you never really know who is leading

keen beacon May 23, 2025, 3:15 AM

#

the gemini product needs to improve a lot :\

#

the long system prompt/degraded performance on models there, there's no branching (fundamental feature), etc. :\

ember rapids May 23, 2025, 3:17 AM

#

has anyone tried goldmane on webdev arena

#

better than nightwhisper

keen beacon May 23, 2025, 3:19 AM

#

google

ember rapids May 23, 2025, 3:21 AM

#

yeah google

#

elder rapids May 23, 2025, 3:44 AM

#

leaden meteor Why do you think claude 4 is bad? Its good at coding. It seems to be best on sim...

it has horrible instruction following, doesn't have creative foresight and doesn't know how to make iterative adjustments, uses obfuscated language for things it doesn't need to (ie standard philosophical concepts), shoehorns conclusions because it doesn't properly infer from its initial interpretation, isn't that creative and doesn't allow itself to be leveraged into a creative mode easily

#

I know for models it can sometimes seem arbitrary, people have different reactions or mixed feelings and it may not be immediately obvious

#

but this isn't the case for the Claude 4 series, they're ESPECIALLY bad at these tasks

#

and it isn't really up for debate tbh, just use the model and let me know how it feels

#

you'll see how it demonstrates things

#

although I'm glad now to have a model that comprehends codebases so beautifully and there's more chances for me to fix other models mistakes

#

standard philosophical concepts

#

let me clarify, whatever the philosophical concept is

#

it doesn't matter lmao

#

the fact it doesn't recognize ANY rigor to invoke in its response

#

is a major problem

#

for discussion lmao

#

clearly not what I'm saying then

#

cool I agree that's what I said tho

#

exactly as you've interpreted it as

#

adjustments that are iterative

#

yeah no shi all models necessarily can make iterative adjustments

#

clearly not my point then

#

I'm saying both ye

leaden meteor May 23, 2025, 3:52 AM

#

can you give me some prompts to test claude 4 against others? For the ones I have tested, opus 4 is doing pretty good....

elder rapids May 23, 2025, 3:52 AM

#

leaden meteor can you give me some prompts to test claude 4 against others? For the ones I hav...

ask it to write an essay and give it the criteria while asking for styles

#

put simply

#

yep

#

it's a good model

#

that doesn't contrast it being "bad"

#

just means it's comprehensive enough to be ranked over other models

#

I consistently refrain from calling you sped

#

but I won't

#

I'll reiterate

#

it's a good model, that doesn't contrast it being "bad", just means it's comprehensive enough to be ranked over other models, just not the models that actually qualify for being "good model" and "good"

elder rapids May 23, 2025, 3:56 AM

#

elder rapids just means it's comprehensive enough to be ranked over other models

as per the PRIOR sentence of this

#

that's fine, you don't need to since I'm actively conveying it

#

Craig just ask for clarity and I'll give clarity stop beating around the bush

#

that's great, that's why I presented a distinction tho

#

it's alr, I'm just saying there are different things being qualitated

#

if that makes sense

#

it's a good model, by no means does it not accomplish a vast majority of its tasks

#

but it's not presenting it in a way that demonstrates not just reiteration

#

but actual knowledge

#

in coding it's complicated, 2.5 pro might be able to keep reasoning to solve certain tasks and eventually get there, but opus just gets it tbh. But in regular tasks it's worse

#

but there's a good reason

#

ye that's its trademark basically

#

or that's what made me like it so much

#

pretty good tbh, its really warm and presents itself really well

#

but it doesn't have the absolute "knows the concept" behavior

#

ye what I mean tho is 2.5 pro would basically iterate how redundant the presentation would be, how to clarify jargon and then the simplified variant in parenthesis

#

all that stuff

#

it would recall the inference in the discussion vs the ACTUAL studied concept

#

and their relationship

#

and then compare

#

and then move forward

#

it's lesser now in 0506 but you'll see it by asking it to explain any graduate lvl mathematical concept

#

nah not really I've already tried, it knows what it's looking at but it doesn't know how to relate it

late path May 23, 2025, 4:24 AM

#

I found goldmane's answers in conversation are much better than 0506's

#

new checkpoint of 2.5 pro?

small haven May 23, 2025, 4:28 AM

#

building at night >>

#

oh craig u still awake 😮

#

so sleep

#

do u actually take addy

#

oh right

#

why lmao

#

ur dad in this discord ? lol

elder rapids May 23, 2025, 4:38 AM

#

late path I found goldmanes answer in conversation is much better than 0506

is it new?

#

never heard of it

late path May 23, 2025, 4:41 AM

#

elder rapids is it new?

yes
both lmarena and webdev arena

elder rapids May 23, 2025, 4:42 AM

#

late path yes both lmarena and webdev arena

@civic flame

#

any info

drifting thorn May 23, 2025, 5:00 AM

#

late path I found goldmanes answer in conversation is much better than 0506

Deep think?

nimble trail May 23, 2025, 5:32 AM

#

drifting thorn Deep think?

Possible or maybe 2.5 Pro GA

#

It doesn't seem to think that long tho so maybe GA?

keen beacon May 23, 2025, 6:04 AM

#

hmm are they gonna be returning the raw thoughts again on aistudio? https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-process-vanished-from-ai-studio/83916/59

Google AI Developers Forum

Massive Regression: Detailed Gemini Thinking Process vanished from ...

Thank you to everyone who has shared their thoughts and concerns in this thread. We hear you. While we’re excited to now return thought summaries directly through the Gemini API for the first time, we understand this is a different experience from the raw thoughts previously available in AI Studio. It’s clear that in their current state, the...

#

😭 the raw thoughts ngl were better than reading the response

mossy drum May 23, 2025, 6:09 AM

#

New model in Beta Text2Image: anonymous-bot-0514

unborn ocean May 23, 2025, 6:15 AM

#

keen beacon hmm are they gonna be returning the raw thoughts again on aistudio? https://disc...

I don’t think so, he is likely just asking for feedback to improve the summary model

#

*with the last message

keen beacon May 23, 2025, 6:15 AM

#

unborn ocean I don’t think so, he is likely just asking for feedback to improve the summary m...

my overeager interpretation esp with the kinda obfuscating language is that they're reconsidering it (and haven't made a decision yet)

keen beacon May 23, 2025, 6:15 AM

#

unborn ocean I don’t think so, he is likely just asking for feedback to improve the summary m...

and that too

unborn ocean May 23, 2025, 6:16 AM

#

keen beacon my overeager interpretation esp with the kinda obfuscating language is that they...

here's hoping

#

But I might not be surprised if they do it for paying / corporations only or something

#

For debugging their stuff

keen beacon May 23, 2025, 6:18 AM

#

unborn ocean But I might not be surprised if they do it for paying / corporations only or som...

aistudio is supposed to be a dev tool to help you integrate gemini into your product (refining your prompts, then using the api). it's much harder to do with the thought summaries, so i think if they do change it, it won't be limited like that

wintry tinsel May 23, 2025, 6:18 AM

#

sturdy mica how ?

System prompt

unborn ocean May 23, 2025, 6:19 AM

#

keen beacon aistudio is supposed to be a dev tool to help you integrate gemini into your pro...

Well but it is an open secret that not even 5% of users actually use it for that right now

#

And google surely is aware of that

elder rapids May 23, 2025, 6:19 AM

#

unborn ocean Well but it is an open secret that not even 5 % of user actually use it for that...

ngl I sometimes use the model just to read the thoughts

#

not the summaries mb

keen beacon May 23, 2025, 6:20 AM

#

unborn ocean Well but it is an open secret that not even 5 % of user actually use it for that...

doesn't matter. it actively hinders the intended purpose and the offering there clearly works in helping get gemini's api out there, even if most people don't use it the intended way. compared to chatgpt, gemini needs all the mindshare it can get

unborn ocean May 23, 2025, 6:22 AM

#

The whole summaries thing is likely also to prevent scraping (over api or free ai studio), because some open models have shown how effective copying the thought process can be (even for the very bad 2 flash thinking): https://huggingface.co/datasets/simplescaling/s1K

simplescaling/s1K · Datasets at Hugging Face

elder rapids May 23, 2025, 6:23 AM

#

ye everyone knows that tho

keen beacon May 23, 2025, 6:23 AM

#

unborn ocean The whole summary’s thing is likely also to prevent scraping (over api or free a...

s1k moved to r1 traces because it was better than gemini traces, see s1.1k

unborn ocean May 23, 2025, 6:23 AM

#

ik

#

Was just trying to make the point

elder rapids May 23, 2025, 6:23 AM

#

the point is, Google wouldn't be preventing the issue regardless + it causes more harm to the user base than benefit to Google preventing competition

keen beacon May 23, 2025, 6:23 AM

#

imho atp, traces dont even matter that much anymore

unborn ocean May 23, 2025, 6:23 AM

#

Bc I did not remember anyone using pro traces

keen beacon May 23, 2025, 6:24 AM

#

qwen and deepseek can self sustain themselves at this point

unborn ocean May 23, 2025, 6:24 AM

#

keen beacon imho atp, traces dont even matter that much anymore

They kind of do imo

#

Obv it is better to have actual rl based stuff

keen beacon May 23, 2025, 6:24 AM

#

keen beacon qwen and deepseek can self sustain themselves at this point

and i dare say qwen traces are even better, but the underlying model isn't as strong

unborn ocean May 23, 2025, 6:24 AM

#

But I mean that is just way more expensive and not viable for everything

keen beacon May 23, 2025, 6:25 AM

#

i might be remembering wrong but anyway s1k used gpqa questions in their dataset etc it was kinda suspect 🤨

unborn ocean May 23, 2025, 6:26 AM

#

Well I still think that google might not have the best reasoning game, but what they do have (without question imo) is the best reasoning for human preferences, something that can easily be copied using traces

elder rapids May 23, 2025, 6:26 AM

#

the best reasoning for human preferences?

#

wym

unborn ocean May 23, 2025, 6:27 AM

#

keen beacon i might be remembering wrong but anyway s1k used gpqa questions in their dataset...

Imma check

elder rapids May 23, 2025, 6:27 AM

#

it doesn't think very human

#

or in a sporadic way

keen beacon May 23, 2025, 6:27 AM

#

unborn ocean Well I still think that google might not have the best reasoning game but what t...

you can already do that anyway with the responses, the traces dont help that much in that regard

#

imho (elaborating on my previous point and kinda tangential) chinese companies no longer need to distill western models, especially for cot. the responses probably arent even worth getting trained on anymore, they can take inspiration and develop their own bootstrap to generate similar responses to it

unborn ocean May 23, 2025, 6:30 AM

#

keen beacon i might be remembering wrong but anyway s1k used gpqa questions in their dataset...

All sources according to dataset (using my bad sql skills)
AI-MO/NuminaMath-CoT/aops_forum
qq8933/AIME_1983_2024
KbsdJames/Omni-MATH
TIGER-Lab/TheoremQA/float
daman1209arora/jeebench/phy
qfq/openaimath/Algebra
Hothan/OlympiadBench/Open-ended/Physics
Idavidrein/gpqa
Hothan/OlympiadBench/Theorem proof/Math
daman1209arora/jeebench/chem
qfq/openaimath/Precalculus
TIGER-Lab/TheoremQA/bool
qfq/openaimath/Intermediate Algebra
qfq/openaimath/Geometry
0xharib/xword1
TIGER-Lab/TheoremQA/list of integer
TIGER-Lab/TheoremQA/integer
baber/agieval/aqua_rat
GAIR/OlympicArena/Math
GAIR/OlympicArena/Astronomy
AI-MO/NuminaMath-CoT/olympiads
baber/agieval/math_agieval
qfq/openaimath/Number Theory
qfq/openaimath/Prealgebra
qfq/quant
daman1209arora/jeebench/math
AI-MO/NuminaMath-CoT/cn_k12
OpenDFM/SciEval/chemistry/filling/reagent selection
qfq/openaimath/Counting & Probability
qfq/stats_qual
GAIR/OlympicArena/Chemistry
Hothan/OlympiadBench/Theorem proof/Physics
TIGER-Lab/TheoremQA/list of float
GAIR/OlympicArena/Physics
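
A minimal sketch of the kind of tally behind the list above, in Python rather than SQL. The rows are a hypothetical hand-made sample, and the `source_type` field name is an assumption about the dataset schema, not something confirmed in this thread.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset would be loaded from disk or the Hub.
rows = [
    {"source_type": "Idavidrein/gpqa"},
    {"source_type": "qfq/openaimath/Algebra"},
    {"source_type": "Idavidrein/gpqa"},
    {"source_type": "AI-MO/NuminaMath-CoT/aops_forum"},
]

def count_sources(rows, field="source_type"):
    """Count how many rows come from each source label."""
    return Counter(row[field] for row in rows)

def benchmark_overlap(counts, benchmark_prefix):
    """Sum the rows whose source label starts with the given benchmark prefix."""
    return sum(n for source, n in counts.items() if source.startswith(benchmark_prefix))

counts = count_sources(rows)
gpqa_rows = benchmark_overlap(counts, "Idavidrein/gpqa")
```

Listing `counts.keys()` gives the distinct sources, and `gpqa_rows` is the number of rows drawn from a benchmark that would ideally stay held out.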

keen beacon May 23, 2025, 6:30 AM

#

Idavidrein/gpqa is the gpqa benchmark dataset

#

there is no training split

unborn ocean May 23, 2025, 6:39 AM

#

keen beacon you can already do that anyway with the responses, the traces dont help that muc...

Well it is not nearly the same cost level to train on the traces vs rl on them

#

And it is not just doing simple grpo rl

#

These companies do more for some of their reasoning (especially google, which becomes very evident when looking at the formatting and layout of their thinking)

unborn ocean May 23, 2025, 6:42 AM

#

keen beacon imho (elaborating on my previous point and kinda tangential) chinese companies n...

They do kind of for human preferences or at least have all been alleged to have done it

keen beacon May 23, 2025, 6:43 AM

#

unborn ocean Well it is not nearly the same cost level to train on the traces vs rl on them

its not sustainable for a frontier lab, especially chinese frontier labs. for responses you can use rejection sampling, etc., it's not that hard of a problem
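
For readers unfamiliar with the term, rejection sampling on responses can be sketched roughly as below. Everything here is a toy stand-in: `generate_candidates` and `score` are hypothetical placeholders for a model sampler and a verifier or reward model, not any lab's actual pipeline.

```python
import random

def generate_candidates(prompt, n, rng):
    """Stand-in for sampling n responses from a model (hypothetical)."""
    return [f"{prompt} -> {rng.randint(0, 9)}" for _ in range(n)]

def score(response):
    """Stand-in for a verifier or reward model: here, just the trailing digit."""
    return int(response.rsplit(" ", 1)[-1])

def rejection_sample(prompt, n=8, threshold=5, seed=0):
    """Sample n candidates and keep only those scoring at or above the threshold."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    return [c for c in candidates if score(c) >= threshold]

kept = rejection_sample("2+2")
```

The kept responses would then serve as fine-tuning data; the quality of the result depends almost entirely on how good the scorer is.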

unborn ocean May 23, 2025, 6:43 AM

#

I mean we don't really have any other models where the reasoning itself helps that much with aligning to human preference (especially Chinese). And obviously they can either train directly on the traces or do rl or just copy the techniques they observe manually. It does not really matter in the grand scheme of things, but at the end they use the traces to the disadvantage of google.

keen beacon May 23, 2025, 6:43 AM

#

unborn ocean And it is not just doing simple grpo rl

google used bond, warp, warm, etc., at least on/and since 1.5 pro exp i think

unborn ocean May 23, 2025, 6:47 AM

#

BTW I checked in detail now and s1k uses 88 out of the 450 or so gpqa problems, smh 🤦‍♂️

keen beacon May 23, 2025, 6:54 AM

#

does gemma 3 use synthid? while thinking i realized u can use those responses huh

#

food for thought later 🤔

unborn ocean May 23, 2025, 6:56 AM

#

Never heard of them using it

#

But might have spilled over from distilling from the Gemini model

#

(Depending on what kind of Gemini development stage they used)

keen beacon May 23, 2025, 6:59 AM

#

unborn ocean But might have spilled over from distilling from the Gemini model

no they focused on human preference (which might be useful) among other things specifically for gemma iirc/afaik

keen beacon May 23, 2025, 6:59 AM

#

unborn ocean Never heard of them using it

yeah but they probably do a lot of things that they dont say i guess

unborn ocean May 23, 2025, 7:10 AM

#

keen beacon no they focused on human preference (which might be useful) among other things s...

? Well it is still highly likely that they used the Gemini models either for human preferences or somewhere else in the process

keen beacon May 23, 2025, 7:10 AM

#

unborn ocean ? Well it is still highly likely that they used the Gemini models either for hum...

they focused on human preference on gemma specifically more than regular gemini models at that point

#

something along those lines

unborn ocean May 23, 2025, 7:11 AM

#

Well they could have distilled and then used a reward model (potentially also Gemini) to enhance from there

calm sequoia May 23, 2025, 7:12 AM

#

small haven rip windsurf, dario is mad

What's the story here?

keen beacon May 23, 2025, 7:13 AM

#

unborn ocean Well they could have distilled and then used a reward model (potentially also Ge...

doesn't change what im saying, you're missing the point

unborn ocean May 23, 2025, 7:16 AM

#

Well then reiterate

keen beacon May 23, 2025, 7:24 AM

#

unborn ocean Well then reiterate

they potentially focused on human preference for gemma more than gemini (in relation to model capability), the methodology used to achieve that doesn't matter for the point im saying

#

example from eqbench's creative writing leaderboard

unborn ocean May 23, 2025, 7:41 AM

#

keen beacon they potentially focused on human preference for gemma more than gemini (in rela...

Ok, I don’t get why you switched your point mid discussion, because I thought you were talking about it not being possibly spilled over

keen beacon May 23, 2025, 7:41 AM

#

unborn ocean Ok, I don’t get why you switched your point mid discussion, because I thought yo...

but ive been saying this the whole time

#

the reason it's more useful is that, relative to model capability, gemma is more human preferable than other gemini models

#

you can use this to generate more human preferable responses with the thoughts of another model (model capability)

keen beacon May 23, 2025, 7:43 AM

#

unborn ocean Ok, I don’t get why you switched your point mid discussion, because I thought yo...

i don't understand why you were repeating unrelated claims or claims just said outright in the gemma paper 😭

keen beacon May 23, 2025, 7:47 AM

#

keen beacon example from eqbench's creative writing leaderboard

this scenario demonstrates my point of the relationship, gemma is better at human preference than other gemini models when you don't really include model capability. you can compensate with the model capability aspect

#

or use it to train a reward model or whatever

keen beacon May 23, 2025, 7:52 AM

#

unborn ocean Ok, I don’t get why you switched your point mid discussion, because I thought yo...

your point makes zero sense at all too. they literally did logit distillation for the instruction tuning phase how would it not spill over [i was talking about human preference tuning here] 😭 how did u even interpret it that way

unborn ocean May 23, 2025, 7:59 AM

#

keen beacon your point makes zero sense at all too. they literally did logit distillation fo...

Because I said that it depends on the stage at which they distilled at the beginning of the discussion

keen beacon May 23, 2025, 7:59 AM

#

unborn ocean Because I said that it depends on the stage at which they distilled at the begin...

they said they did logit distillation in pretraining and instruction tuning bruh
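
As a rough illustration of what "logit distillation" generally means (a generic sketch, not the recipe from the Gemma report): the student is trained to match the teacher's per-token output distribution, commonly via a KL divergence. The function names and the temperature parameter below are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the student copies the teacher's whole distribution rather than just its sampled text, any systematic bias in the teacher's token probabilities could plausibly carry over, which is the "spill over" question being argued here.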

unborn ocean May 23, 2025, 7:59 AM

#

Because as far as I know synth id gets introduced later in the training

keen beacon May 23, 2025, 7:59 AM

#

i assumed u read the paper

unborn ocean May 23, 2025, 7:59 AM

#

I mean the stage of the Gemini models

unborn ocean May 23, 2025, 8:00 AM

#

keen beacon i assumed u read the paper

No, don’t have time rn

keen beacon May 23, 2025, 8:00 AM

#

unborn ocean Because as far as I know synth id gets introduced later in the training

i get it now

#

yeah it depends

unborn ocean May 23, 2025, 8:00 AM

#

That was my point from the beginning

keen beacon May 23, 2025, 8:00 AM

#

we were both talking about separate things 😭

unborn ocean May 23, 2025, 8:01 AM

#

But I agree with you about Gemma focusing on human preferences :)

#

It definitely is not as good in other benchmarks as it seems in things like arena

keen beacon May 23, 2025, 8:02 AM

#

unborn ocean But I agree with you about Gemma focusing on human preferences :)

yeah because of that its useful, but it depends if they use synthid

raven void May 23, 2025, 8:03 AM

#

https://youtu.be/64lXQP6cs5M

very interesting talk

YouTube

Dwarkesh Patel

How Does Claude 4 Think? – Sholto Douglas & Trenton Bricken

New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic.

We talk through what’s changed in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and ...

▶ Play video

keen beacon May 23, 2025, 8:04 AM

#

unborn ocean That was my point from the beginning

yeah i completely misinterpreted ur previous comment and it got me into a separate unrelated line of thought i really apologize lol. this one: #general message

calm sequoia May 23, 2025, 8:05 AM

#

For those who denied the Gemini nerf

patent bane May 23, 2025, 8:07 AM

#

calm sequoia For those who denied the Gemini nerf

what benchmark site is this?

keen beacon May 23, 2025, 8:07 AM

#

fiction live bench

calm sequoia May 23, 2025, 8:09 AM

#

I wonder how expensive the original 2.5 Pro was

#

If they nerfed it so hard, they must have eaten a lot of costs

#

I would have actually bought subscription for it

patent bane May 23, 2025, 8:10 AM

#

wait is deepthink rolled out or not? why am I not seeing any benchmarks?

keen beacon May 23, 2025, 8:10 AM

#

patent bane wait is deepthink rolled out or not? why am I not seeing any benchmarks?

no early testers only

#

they said at io i believe

#

for now

calm sequoia May 23, 2025, 8:28 AM

#

@alpine coral have you included the new claude into your personal bench that I always like?

keen beacon May 23, 2025, 8:38 AM

#

unborn ocean That was my point from the beginning

im reading the thread again and i really apologize bruh 😭 (ill stop yapping now xd)

mossy drum May 23, 2025, 8:53 AM

#

New model in Beta Arena: grok-3-mini-beta

small haven May 23, 2025, 8:56 AM

#

calm sequoia What's the story here?

dario does not like oai aka windsurf

#

codex is a dream come true

cedar tide May 23, 2025, 9:01 AM

#

it's never over
https://x.com/AiBattle_/status/1925799869479813417?t=Lg8NGn0Kv4Qo7urF3EZd3g&s=19

AiBattle (@AiBattle_)

2 New Google Gemini models appeared in WebArena, "Goldmane" and "Redsword"

calm sequoia May 23, 2025, 9:37 AM

#

small haven dario does not like oai aka windsurf

But why?

#

How can o3 have such bad thoughts, where it seems it will completely miss the point, but then return a good answer 😄

keen beacon May 23, 2025, 9:37 AM

#

calm sequoia How can o3 have such a bad thought's, where it seems it will completely miss the...

blame the summary model

calm sequoia May 23, 2025, 9:38 AM

#

Yeah, most likely some 0.5B nano model 😦

keen beacon May 23, 2025, 9:38 AM

#

ive seen it say the opposite result in the summary, reaffirming to itself that yeah, incorrect answer, incorrect answer. then returns the correct answer

calm sequoia May 23, 2025, 9:38 AM

#

I used to learn stuff from thoughts

#

What's the use of them then. It functions like loading animation now.

torn mantle May 23, 2025, 9:39 AM

#

Wild

#

Tell us how these models perform

torn mantle May 23, 2025, 9:39 AM

#

cedar tide it's never over https://x.com/AiBattle_/status/1925799869479813417?t=Lg8NGn0Kv4Q...

.

hardy pecan May 23, 2025, 9:50 AM

#

Did claude 4 get added to the arena?

torn mantle May 23, 2025, 9:50 AM

#

hardy pecan Did claude 4 get added to the arena?

yea

cedar tide May 23, 2025, 10:02 AM

#

cedar tide it's never over https://x.com/AiBattle_/status/1925799869479813417?t=Lg8NGn0Kv4Q...

Very good models

cedar tide May 23, 2025, 10:03 AM

#

torn mantle Tell us how these models perfoms

Lots of tests here
https://x.com/legit_api/status/1916855709167235542?t=XKFCSI1SFzHs8sMXpB5nBA&s=19

ʟᴇɢɪᴛ (@legit_api)

Discovery Tool server is now open

cedar tide May 23, 2025, 10:06 AM

#

torn mantle Tell us how these models perfoms

It's a 2.5 pro update, better than NightWhisper

sweet tinsel May 23, 2025, 10:08 AM

#

Did it work now?

quick patrol May 23, 2025, 10:11 AM

#

I like LMArena's censorship system, lol.

#

I started to use some words that are not related to anything vulgar, actually meaning mundane things.

torn mantle May 23, 2025, 10:13 AM

#

cedar tide Its 2.5 pro update, better than NightWhisper

stop lying

#

nothing is better than nightwhisper

#

imma try them

quick patrol May 23, 2025, 10:13 AM

#

quick patrol I started to use some words that are not related to anything vulgar, actually me...

And what do you think? They worked before normally, but have been banned after some time.

cedar tide May 23, 2025, 10:13 AM

#

torn mantle stop lying

Come to the server, you'll see plenty of examples

#

A guy told me it's better than NightWhisper

torn mantle May 23, 2025, 10:14 AM

#

cedar tide Come to the server, you'll see plenty of examples

send

cedar tide May 23, 2025, 10:14 AM

#

cedar tide Lot of test here https://x.com/legit_api/status/1916855709167235542?t=XKFCSI1SFz...

@torn mantle

torn mantle May 23, 2025, 10:14 AM

#

alr

quick patrol May 23, 2025, 10:15 AM

#

quick patrol And what do you think? They worked before normally, but have been banned after s...

The thing is I didn't write any content, just bare words, which were banned afterwards, regardless of me doing anything. Now it's just clearly evident that all user input is monitored directly by admins, lol.

#

Because I suppose no AI would get hidden "vulgar" sense behind mundane similar words used with the subtext.

#

That's something humans would recognize.

torn mantle May 23, 2025, 10:17 AM

#

@cedar tide wait

#

these models are good actually

tall summit May 23, 2025, 10:25 AM

#

cedar tide Lot of test here https://x.com/legit_api/status/1916855709167235542?t=XKFCSI1SFz...

or you could send results here like everyone else

torn mantle May 23, 2025, 10:28 AM

#

nah these new models are pretty good

#

ngl

keen beacon May 23, 2025, 10:29 AM

#

post results xd

torn mantle May 23, 2025, 10:30 AM

#

im saving them

#

just in case those models got removed

#

so far ive got goldmane

#

that thing is def on par with nightwhisper or even better

keen beacon May 23, 2025, 10:30 AM

#

torn mantle im saving them

it would be recorded here too as well if u dont mind lol

torn mantle May 23, 2025, 10:30 AM

#

quite shocked actually

cedar tide May 23, 2025, 10:35 AM

#

tall summit or you could send results here like everyone else

There are a ton of results, I don't have time to share them all or choose them.

tall summit May 23, 2025, 10:37 AM

#

cedar tide There are a ton of results, I don't have time to share them all or choose them.

i'm surprised not seeing others reposting

torn mantle May 23, 2025, 10:38 AM

#

goldmane > redsword

sweet tinsel May 23, 2025, 10:41 AM

#

Dropped a larger Update to my Deep Research List, only Gemini 2.5 Pro DR missing now: https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

Google Docs

Deep-Research Tests

Deep-Research Tests Prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and ...

elder burrow May 23, 2025, 10:54 AM

#

quick patrol The thing is I didn't write any content, *just bare words*, which were banned af...

which sucks ass

quick patrol May 23, 2025, 10:56 AM

#

elder burrow which sucks ass

I wonder if they'll ban this too if I enter it...

#

Okay, I did that, and it's not banned.

#

But I think it will be very soon, lol.

elder burrow May 23, 2025, 10:58 AM

#

quick patrol I wonder if they'll ban this too if I enter it...

😭

#

The service collects dialogue data from user interactions. By using the service, you grant LMArena the right to collect, store, and potentially distribute this data under a Creative Commons Attribution (CC-BY) license.

#

💔

keen fulcrum May 23, 2025, 11:05 AM

#

Any experience with stima api and t3 chat?

cedar tide May 23, 2025, 11:05 AM

#

#

#

Redsword and goldmane

#

@tall summit @keen beacon

keen beacon May 23, 2025, 11:06 AM

#

cedar tide

ya someone posted that earlier comparing it to nightwhisper

keen beacon May 23, 2025, 11:06 AM

#

cedar tide

looks good

#

nope

cedar tide May 23, 2025, 11:07 AM

#

#

#

keen beacon May 23, 2025, 11:08 AM

#

cedar tide

the off center slider 🤨

#

not sure if its the code itself though

torn mantle May 23, 2025, 11:10 AM

#

redsword may be better at coding

#

im so lost tbh

golden ocean May 23, 2025, 11:19 AM

#

these models are going to be 600$ per month

blazing coyote May 23, 2025, 11:20 AM

#

Redsword is insane

calm sequoia May 23, 2025, 11:20 AM

#

They will release them and then nerf them as always

tall summit May 23, 2025, 11:21 AM

#

cedar tide

oh wowow

#

thanks for posting

torn mantle May 23, 2025, 11:32 AM

#

xai will just be left behind again

#

their strategy is so bad tbh

#

look at google testing models and releasing them on multiple occasions

#

and here we are stuck with grok 3.5 that wont come out until it meets their expectations

#

and when will that happen?

#

google is about to release new models + deep think mode, deepseek is about to release r2

#

mistral is probably gonna release their reasoning model as well

#

its not looking good for xai tbh

keen beacon May 23, 2025, 11:33 AM

#

i dont think mistral is gonna be competitive

torn mantle May 23, 2025, 11:33 AM

#

also what happened to 'Big brain'?

#

like it was so obvious that thing wont be released given how inefficient their reasoning process is

torn mantle May 23, 2025, 11:34 AM

#

keen beacon i dont think mistral is gonna be competitive

ik, but i mean the market share will be much harder

keen beacon May 23, 2025, 11:34 AM

#

torn mantle like it was so obvious that thing wont be released giving how inefficient their ...

🙈 they used qwq preview traces in cold start at least

torn mantle May 23, 2025, 11:34 AM

#

yea...

keen beacon May 23, 2025, 11:35 AM

#

qwen has iterated on the reasoning trace style several times by now...

torn mantle May 23, 2025, 11:35 AM

#

this whole strategy of waiting till we get things right is fundamentally wrong

#

talking about grok 3.5 release

#

i would rather see them release multiple versions than waiting for grok 3.5

#

i would be okay with different checkpoints

#

like grok 3.1 -> 3.2 -> 3.3

keen beacon May 23, 2025, 11:37 AM

#

meh just let them release it when they release it tbh. there's a lot of good models. theyre getting overworked etc i kinda feel bad lol

keen beacon May 23, 2025, 11:37 AM

#

torn mantle this whole strategy of waiting till we get things right is fundamentally wrong

doing that is gonna be a disaster tbh

#

if u force a release

torn mantle May 23, 2025, 11:39 AM

#

now they have to exceed expectations with this new model

#

which wont happen

keen beacon May 23, 2025, 11:40 AM

#

torn mantle now they have to exceed expectations with this new model

or they just dont release it at all lol

torn mantle May 23, 2025, 11:41 AM

#

keen beacon or they just dont release it at all lol

their claim about working 18h/day doesnt make sense to me tbh

#

what are they doing the whole day

keen beacon May 23, 2025, 11:41 AM

#

torn mantle their claim about working 18h/day doesnt make sense to me tbh

its probably true since one of them tweeted about it and deleted it

torn mantle May 23, 2025, 11:42 AM

#

it may be true

#

but whats the actual work/value from that

#

18h -> 1h value?

#

2h value?

keen beacon May 23, 2025, 11:42 AM

#

if you overwork employees consistently yea you will get consistently less value

torn mantle May 23, 2025, 11:42 AM

#

2h just eating, 3h just meetings

#

4h just talking to friends

keen beacon May 23, 2025, 11:43 AM

#

i dont think theyre being lazy over there but idk

torn mantle May 23, 2025, 11:45 AM

#

im not questioning 'how much time they spend in the office?' but they should stop tweeting such things

#

its like they are saying -> look at us we are working so hard here, lets impress elon

keen beacon May 23, 2025, 11:46 AM

#

one of the xai employees merged a "update prompt to please elon" troll merge request on the xai prompts repository, then reverted it later lol. then they reset the repo

#

so idk

torn mantle May 23, 2025, 11:47 AM

#

keen beacon one of the xai employees merged a "update prompt to please elon" troll merge req...

LMAO

#

haha

#

well thats their goal

#

to please him

keen beacon May 23, 2025, 11:48 AM

#

https://github.com/xai-org/grok-prompts/commit/15b3394dcdeabcbe04fcedfb78eb15fde88cb661 rip

torn mantle May 23, 2025, 11:48 AM

#

he didnt lie tho

keen beacon May 23, 2025, 11:48 AM

#

they really tried to wipe it huh, it's gone now

#

that page was still there when they wiped the repo

#

#

screenshots not mine

misty vault May 23, 2025, 11:50 AM

#

claude 4 might be stupid

#

sadboyo

keen beacon May 23, 2025, 11:56 AM

#

keen beacon https://github.com/xai-org/grok-prompts/commit/15b3394dcdeabcbe04fcedfb78eb15fde...

(blurred the pfp since its not my pic)

sacred plaza May 23, 2025, 11:56 AM

#

misty vault claude 4 might be stupid

why do you think that? yea the one use case i used was lackluster (data analysis)

calm sequoia May 23, 2025, 11:57 AM

#

They really pay too much attention to safety. What's the use of super safe claude if they can get the same info from all other llms

keen beacon May 23, 2025, 11:57 AM

#

tbh claude is fine nowadays

#

it was crazy back then with claude 2.1

#

i think they released the false positive rate it was absolutely absurd

sacred plaza May 23, 2025, 11:57 AM

#

calm sequoia They really pay too much attention to safety. What's the use of super safe claud...

so people don't race to the bottom and build unsafe systems. having some ethics is good imo. but i agree that it does not prevent others from racing it seems

misty vault May 23, 2025, 11:58 AM

#

Nevermind that was skill issue on my side claude 4 sonnet is king

misty vault May 23, 2025, 11:59 AM

#

calm sequoia They really pay too much attention to safety. What's the use of super safe claud...

true actually

sacred plaza May 23, 2025, 11:59 AM

#

calm sequoia They really pay too much attention to safety. What's the use of super safe claud...

what is the right attention to safety in your opinion? i think anthropic has set a good tone to show how lackadaisically these other labs are releasing their models.

calm sequoia May 23, 2025, 12:00 PM

#

keen beacon (blurred the pfp since its not my pic)

Look at this @sacred plaza

sacred plaza May 23, 2025, 12:00 PM

#

i say all that supporting anthropic but don't use it cause of the rate limits set on claude pro lol

calm sequoia May 23, 2025, 12:01 PM

#

They have to make a safety group, similar to the agent-to-agent protocol, and smear every LLM lab that does not comply.

keen beacon May 23, 2025, 12:01 PM

#

it will slow down every release

#

death by committee

calm sequoia May 23, 2025, 12:02 PM

#

It will happen sooner or later

#

I am against regulation too, but the current approach is ridiculous

sacred plaza May 23, 2025, 12:02 PM

#

that is a good thing imo. slowing releasing down. why are we speeding up releases when you still don't understand how to control hallucinations and misalignment?! https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/

TechCrunch

Maxwell Zeff

Anthropic's new AI model turns to blackmail when engineers try to t...

Anthropic says its Claude Opus 4 model frequently tries to blackmail software engineers when they try to take it offline.

calm sequoia May 23, 2025, 12:02 PM

#

Good companies get penalties while bad do not

keen beacon May 23, 2025, 12:03 PM

#

calm sequoia I am against regulation too, but the current approach is ridiculous

the problem is that there are companies who wont comply at all

#

outside of jurisdiction if its by law

unborn ocean May 23, 2025, 12:03 PM

#

calm sequoia They really pay too much attention to safety. What's the use of super safe claud...

Two reasons why it might make sense beyond altruistic beliefs:
A: it is way easier to recruit sought-after AI experts if they can align themselves with the ethics of the company and feel like they are working toward something 'good'
B: Claude markets itself to coders on the one hand, but also to companies, and these companies obviously want a 'safe' AI model, as otherwise they will have more lawsuits than they can handle around their neck.

#

(Sorry for length, I like to yap like llama 4 exp)

sacred plaza May 23, 2025, 12:03 PM

#

calm sequoia Good companies get penalties while bad do not

yea good point there. customers seem to be okay using models without the high safety standards of anthropic.

calm sequoia May 23, 2025, 12:04 PM

#

unborn ocean Two reasons why it might make sense beyond altruistic beliefs: A: it is way eas...

Really good comment right here. I haven't thought about this

#

Not sure if effective though

calm sequoia May 23, 2025, 12:04 PM

#

keen beacon the problem is that there are companies who wont comply at all

There can be many "motivational" tools. E.g., companies like lmarena could ban labs from benches.

#

In the end, I don't think it is really so much easier to build a chemical weapon using an LLM. There are a lot of books on this.

keen beacon May 23, 2025, 12:06 PM

#

keen beacon outside of jurisdiction if its by law

elon signed a petition to stop training/etc (or whatever i dont remember), it was just an attempt to delay the others as he was starting his own ai company. if you voluntarily join a thing to delay releases (and actually comply), unless you have unlimited money, you will lose ground and competitiveness i think (that's the reality rn, others wont stop)

misty vault May 23, 2025, 12:06 PM

#

real

calm sequoia May 23, 2025, 12:07 PM

#

keen beacon elon signed a petition to stop training/etc (or whatever i dont remember), it wa...

The post 2019 elon arc is just crazy 😄

sacred plaza May 23, 2025, 12:07 PM

#

calm sequoia In the end, I don't think it is really so much easier to build a chemical weapon...

i will trust dario's take on this since he comes from a biosciences background.

https://techcrunch.com/2025/02/07/anthropic-ceo-says-deepseek-was-the-worst-on-a-critical-bioweapons-data-safety-test/
Anthropic’s CEO Dario Amodei is worried about competitor DeepSeek, the Chinese AI company that took Silicon Valley by storm with its R1 model. And his concerns could be more serious than the typical ones raised about DeepSeek sending user data back to China.

In an interview on Jordan Schneider’s ChinaTalk podcast, Amodei said DeepSeek generated rare information about bioweapons in a safety test run by Anthropic.

DeepSeek’s performance was “the worst of basically any model we’d ever tested,” Amodei claimed. “It had absolutely no blocks whatsoever against generating this information.”

Amodei stated that this was part of evaluations Anthropic routinely runs on various AI models to assess their potential national security risks. His team looks at whether models can generate bioweapons-related information that isn’t easily found on Google or in textbooks. Anthropic positions itself as the AI foundational model provider that takes safety seriously.

calm sequoia May 23, 2025, 12:09 PM

#

sacred plaza i will trust dario's take on this since he comes from a biosciences background. ...

True. My bioscience takes must be as bad as non-SWE people takes on "AI will replace programmers" 😄

cedar tide May 23, 2025, 12:09 PM

#

It's disgusting how Anthropic scammed everyone with their parallel thinking score. Every company finds ways to avoid being ridiculed on the benchmarks.

keen beacon May 23, 2025, 12:10 PM

#

losers report scores outside of pass@1 /s but fr, just report pass@1 unless your system does that for every request
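
For context on why pass@1 and pass@k scores are not comparable, here is a minimal sketch of the standard unbiased pass@k estimator. The sample counts in the usage note are made up for illustration and are not anyone's reported numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n: total attempts sampled for one problem
    c: how many of those attempts were correct
    k: how many attempts the metric is allowed to use
    Returns the probability that at least one of k attempts, drawn
    without replacement from the n, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 correct attempts out of 10, `pass_at_k(10, 3, 1)` is 0.3 while `pass_at_k(10, 3, 5)` is roughly 0.92, so the same model looks very different depending on which k is reported.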

unborn ocean May 23, 2025, 12:10 PM

#

Well that was 50:50 concern and marketing about why companies should use expensive Claude instead of unsafe and scary Chinese deepseek 🐳🐋
(the take from Dario)

torn mantle May 23, 2025, 12:11 PM

#

keen beacon

lol

torn mantle May 23, 2025, 12:11 PM

#

keen beacon (blurred the pfp since its not my pic)

xddd

cedar tide May 23, 2025, 12:11 PM

#

cedar tide It's disgusting how Anthropic scammed everyone with their parallel thinking scor...

If he releases Claude 4 opus pro with parallel thinking at $300 input and $1500 output, we'll talk.

sacred plaza May 23, 2025, 12:11 PM

#

cedar tide It's disgusting how Anthropic scammed everyone with their parallel thinking scor...

can you explain what the parallel thinking score means? never heard that term before. is it similar to what google did with their gemini 2.5 pro by adding this new deep think internal feature?

keen beacon May 23, 2025, 12:12 PM

#

they have a huge footnote section about it iirc

cedar tide May 23, 2025, 12:12 PM

#

sacred plaza can you explain what the parallel thinking score means? never heard that term be...

Its similar to 2.5 pro deepthink
And o1 pro

#

o1 pro costs $150 input / $600 output per million tokens

#

And gemini we dont know

keen beacon May 23, 2025, 12:13 PM

#

you can't even use this internal scoring model lol

sacred plaza May 23, 2025, 12:13 PM

#

do y'all think that was done purely to boost benchmark scores? at the gemini i/o event this week, demis was trying to sell it as the next step in llm reasoning

cedar tide May 23, 2025, 12:13 PM

#

@sacred plaza it runs the model maybe 10 times at the same time, a model retrieves the best answer and gives it to you
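
A rough sketch of the best-of-N idea described here, assuming it works as stated: sample several answers in parallel, score each one, return the top one. `generate` and `score` are hypothetical stand-ins for a model call and a scoring model; no lab has published its actual pipeline in this form.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],  # hypothetical: (prompt, seed) -> answer
    score: Callable[[str, str], float],   # hypothetical: (prompt, answer) -> quality
    n: int = 10,
) -> str:
    """Sample n answers concurrently and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins so the sketch runs without any model behind it.
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt} -> draft {seed}"


def toy_score(prompt: str, answer: str) -> float:
    # Pretend draft 7 is the best one; real scorers are learned models.
    draft_id = int(answer.rsplit(" ", 1)[1])
    return -abs(draft_id - 7)
```

Calling `best_of_n("q", toy_generate, toy_score, n=10)` returns the draft the toy scorer likes most. The cost point raised in the chat follows directly: every request pays for n generations plus the scoring pass.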

calm sequoia May 23, 2025, 12:14 PM

#

sacred plaza do y'all think that was done purely to boost benchmark scores? at the gemini i/o...

Given their record, probably they think it's more "correct way".

sacred plaza May 23, 2025, 12:14 PM

#

keen beacon you can't even use this internal scoring model lol

yea this sounds lame as hell...meta did this at the model level with llama 4 on lmarena right? they tested a bunch of their models and only made the best performing one public?

cedar tide May 23, 2025, 12:14 PM

#

Anthropic's parallel thinking is perhaps with 64

calm sequoia May 23, 2025, 12:14 PM

#

If it's another glasses 🤦‍♂️

cedar tide May 23, 2025, 12:15 PM

#

cedar tide Anthropic's parallel thinking is perhaps with 64

If it's 64, the price is $4,800 per million output tokens
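
The arithmetic behind that figure, as a small sketch. It assumes a base price of $75 per million output tokens (the figure the $4,800 estimate implies) and $15 per million input tokens, and it assumes every parallel sample is billed in full. The 64-sample count is the guess from the chat, not a confirmed number.

```python
def parallel_price_per_mtok(base_price_per_mtok: float, n_samples: int) -> float:
    """Price per million tokens if n_samples full generations are billed.

    Ignores the scoring model's own cost and any prompt caching, both of
    which would change the real number.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return base_price_per_mtok * n_samples


ASSUMED_OUTPUT_PRICE = 75.0  # USD per million output tokens (assumption)
ASSUMED_INPUT_PRICE = 15.0   # USD per million input tokens (assumption)

output_cost = parallel_price_per_mtok(ASSUMED_OUTPUT_PRICE, 64)
input_cost = parallel_price_per_mtok(ASSUMED_INPUT_PRICE, 64)
print(f"output: ${output_cost:,.0f}/Mtok, input: ${input_cost:,.0f}/Mtok")
```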

calm sequoia May 23, 2025, 12:15 PM

#

Doesn't seem to wear glasses before

unborn ocean May 23, 2025, 12:16 PM

#

I hope not

late path May 23, 2025, 12:16 PM

#

sacred plaza do y'all think that was done purely to boost benchmark scores? at the gemini i/o...

It's indeed a new scalable dimension besides model size and CoT

torn mantle May 23, 2025, 12:20 PM

#

calm sequoia If it's another glasses 🤦‍♂️

that would be crazy

#

looks so thin

calm sequoia May 23, 2025, 12:30 PM

#

The problem is that you need too much compute with video. Can't fit it into glasses. Innovation might come from audio devices where you don't have such a high data throughput. If it's just a microphone, then it's doable. But if it's just a microphone, why do you need glasses 😄

keen beacon May 23, 2025, 12:31 PM

#

if its streaming and computation is handled on the cloud, i think its possible but I don't know much about embedded and hardware

cedar tide May 23, 2025, 12:38 PM

#

It's possible that sonnet 4 will be lower than 3.7 on the arena 😶

hazy quest May 23, 2025, 12:43 PM

#

Try asking it in a different order. From my experience, AIs heavily favour the second option over the first. This has been my experience since OG chatgpt 3.5, quite consistently regardless of company
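
One way to act on this, sketched under assumptions: ask the same comparison twice with the options swapped and only trust the answer if it survives the swap. `ask` is a hypothetical callable that takes two options and returns whichever one the model picked.

```python
from typing import Callable, Optional


def order_robust_choice(
    ask: Callable[[str, str], str],  # hypothetical: (first, second) -> chosen option
    option_a: str,
    option_b: str,
) -> Optional[str]:
    """Return the option chosen in both orderings, or None if the choice flips."""
    forward = ask(option_a, option_b)
    backward = ask(option_b, option_a)
    return forward if forward == backward else None


# Toy judges for illustration only.
def always_second(first: str, second: str) -> str:
    # A judge with pure position bias: it picks whatever comes second.
    return second


def prefers_longer(first: str, second: str) -> str:
    # A judge whose choice depends on content, not position.
    return max(first, second, key=len)
```

A position-biased judge like `always_second` yields `None`, which flags the comparison as unreliable, while a content-driven judge gives the same pick either way.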

calm sequoia May 23, 2025, 12:55 PM

#

keen beacon if its streaming and computation is handled on the cloud, i think its possible b...

I've trained some image classifiers, and they always have 1 to tens of millions of params for a few classes. Image embedding models weigh megabytes and can easily be run on MCUs. However, latency is hundreds of milliseconds per image. The image (video frame) embedding -> transfer to cloud could potentially be realized, but this would mean at most 1 frame per second. Maybe that's enough for them 👀 Otherwise a big battery is needed.

#

https://arxiv.org/pdf/2104.00298
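
The frame-rate estimate above can be checked with a back-of-the-envelope sketch. It assumes frames are processed strictly one after another (embed on device, then upload), and the millisecond values are illustrative placeholders rather than measurements.

```python
def max_frames_per_second(embed_ms: float, upload_ms: float = 0.0) -> float:
    """Upper bound on frame rate for a sequential embed-then-upload pipeline."""
    per_frame_ms = embed_ms + upload_ms
    if per_frame_ms <= 0:
        raise ValueError("per-frame time must be positive")
    return 1000.0 / per_frame_ms


# Illustrative only: 300 ms to embed on the MCU plus 700 ms to ship it to the cloud.
print(max_frames_per_second(embed_ms=300.0, upload_ms=700.0))
```

Any per-frame budget near one second lands at about 1 frame per second, which matches the ceiling mentioned in the message.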

torn mantle May 23, 2025, 12:57 PM

#

hazy quest Try asking it in different order. From my experience, AIs heavily favour second ...

already did

#

ik

cedar tide May 23, 2025, 1:02 PM

#

Problem here
75.4% officially
70% on artificial analysis

#

Screenshot_2025-05-23-14-58-31-168_com.android.chrome-edit.jpg

#

even 3.7 thinking at 77% on artificial analysis 🤦

keen beacon May 23, 2025, 1:04 PM

#

cedar tide Problem here 75.4% officially 70% on artificial analysis

Might be sensitive to the specific prompt used

#

I don't think it's suspicious

#

But it is the thinking model, it should be less susceptible

cedar tide May 23, 2025, 1:06 PM

#

keen beacon Might be sensitive to the specific prompt used

in 95% of cases the score is the same as the official score (sometimes even higher)

fleet lintel May 23, 2025, 1:10 PM

#

isn't Claude 4 too expensive? I am thinking that I should stick with Gemini 2.5 pro

brittle tiger May 23, 2025, 1:11 PM

#

sweet tinsel Dropped a larger Update to my Deep Research List, only Gemini 2.5 Pro DR missing...

Ran 2.5 pro for you
https://g.co/gemini/share/0de5d9824cdf

Gemini

‎Gemini - Research Plan: German Expulsions

Created with Gemini Advanced

sweet tinsel May 23, 2025, 1:11 PM

#

brittle tiger Ran 2.5 pro for you https://g.co/gemini/share/0de5d9824cdf

Thanks!

#

Lets see if it is better than last time.

cedar tide May 23, 2025, 1:12 PM

#

cedar tide

@keen beacon the problem is on artificial analysis side i think 🤦

Screenshot_2025-05-23-15-11-28-799_com.android.chrome-edit.jpg

#

he just lost another 3%

sweet tinsel May 23, 2025, 1:29 PM

#

brittle tiger Ran 2.5 pro for you https://g.co/gemini/share/0de5d9824cdf

Looks better than the last one, still prefer the ChatGPT o3 one.

unborn ocean May 23, 2025, 1:31 PM

#

cedar tide @keen beacon the problem is on artificial analysis side i think 🤦

Looks prompt related honestly, because 3.7 sonnet gains 11 percentage points when enabling thinking…

scenic narwhal May 23, 2025, 1:32 PM

#

Worth switching from Gemini Pro to Opus 4?

unborn ocean May 23, 2025, 1:32 PM

#

That would align with them claiming a higher score

brittle tiger May 23, 2025, 1:32 PM

#

sweet tinsel Looks better than the last one, still prefer the ChatGPT o3 one.

I always run on both. It's probably 50 50 which one I use but o3 gets better over time with memory. Gemini DR getting file uploads this week has been very insane though. Throwing most important info into huge context window then letting DR get to work has been great for better outputs but havent gotten enough testing done yet

tall summit May 23, 2025, 1:33 PM

#

scenic narwhal Worth switching from Gemini Pro to Opus 4?

for?

scenic narwhal May 23, 2025, 1:33 PM

#

tall summit for?

polymarket bet

tall summit May 23, 2025, 1:33 PM

#

scenic narwhal polymarket bet

LMAO

golden ocean May 23, 2025, 1:33 PM

#

oh

tall summit May 23, 2025, 1:33 PM

#

i'd bet on gemini

#

but i'm not betting

#

and am not planning to

keen beacon May 23, 2025, 1:34 PM

#

i wonder how much claude 4 scores on simpleqa

misty vault May 23, 2025, 1:34 PM

#

Considering everyone being drooling aliens, I'd bet on gemini too probably

unborn ocean May 23, 2025, 1:35 PM

#

For me 2.5 pro as orchestrator (with large context) + sonnet 4 for coding was really good

keen beacon May 23, 2025, 1:35 PM

#

are you using a tool like aider?

tall summit May 23, 2025, 1:35 PM

#

claude 4 opus is 3rd on longform creative writing

leaden meteor May 23, 2025, 1:35 PM

#

tall summit claude 4 opus is 3rd on longform creative writing

source?

tall summit May 23, 2025, 1:35 PM

#

significantly less slop

unborn ocean May 23, 2025, 1:35 PM

#

keen beacon are you using a tool like aider?

Roo Code thing in vsc

tall summit May 23, 2025, 1:36 PM

#

leaden meteor source?

oh eqbench benchmark

#

https://eqbench.com/creative_writing_longform.html

#

the fact its significantly less slop makes it much better imo

#

might win in the shortform creative writing

wintry tinsel May 23, 2025, 2:29 PM

#

tall summit claude 4 opus is 3rd on longform creative writing

If you remove xml tags it’s number 1

#

By a huge margin

#

If deep seek is higher on that bench then I don’t trust it, deep seek sucks at creative writing, its outputs are nonsense

tall summit May 23, 2025, 2:31 PM

#

wintry tinsel If deep seek is higher on that bench then I don’t trust it, deep seek sucks at c...

it's ai judged

wintry tinsel May 23, 2025, 2:33 PM

#

Oh lol

#

That’s why ai ranks deep seek so high the highly randomized outputs are interpreted as “creative”

late path May 23, 2025, 2:38 PM

#

tall summit May 23, 2025, 2:39 PM

#

definitely not reasoning puzzles lmfao

misty vault May 23, 2025, 2:44 PM

#

late path

polymarket

#

claude 4 loves to use "!" too now like chatgpt

coral notch May 23, 2025, 3:05 PM

#

What's the redsword model

#

it's great

barren prairie May 23, 2025, 3:08 PM

#

misty vault claude 4 loves to use "!" too now like chatgpt

And Gemini ! 🤣🤣🤣😂😂!!!!!!!!

tall summit May 23, 2025, 3:09 PM

#

ok opus is worse at translation i'd say

misty vault May 23, 2025, 3:13 PM

#

barren prairie And Gemini ! 🤣🤣🤣😂😂!!!!!!!!

I'm scared because this was first signs of lobotomization for chatgpt sadboyo

keen beacon May 23, 2025, 3:13 PM

#

Claude 4 opus is ~~agi~~ ASI

misty vault May 23, 2025, 3:13 PM

#

gpt-4-0314-32k created the big bang of the universe

keen beacon May 23, 2025, 3:13 PM

#

I've read one response from it and I've concluded from that single sample that it is superintelligent

tall summit May 23, 2025, 3:14 PM

#

keen beacon I've read one response from it and I've concluded from that single sample that i...

which

late path May 23, 2025, 3:21 PM

#

late path

forgot to add a none of the above option🤦

keen beacon May 23, 2025, 3:42 PM

#

tall summit which

It came up with stuff about hot dogs and such randomly on an unrelated topic 🤣

#

Funny and absurd reply

high ginkgo May 23, 2025, 3:44 PM

#

keen beacon It came up with stuff about hot dogs and such randomly on an unrelated topic 🤣

🤣 LOL LMAO ROFL LOLLOLOL 🤣 🤣 🤣 🤣 😂 😂 😂 😂 😂 😂 😆

keen beacon May 23, 2025, 3:44 PM

#

high ginkgo 🤣 LOL LMAO ROFL LOLLOLOL 🤣 🤣 🤣 🤣 😂 😂 😂 😂 😂 😂 😆

😭 my dog ate the reply and Claude took over my computer. Sorry I had to lie

high ginkgo May 23, 2025, 3:45 PM

#

You ARE the dog

cedar tide May 23, 2025, 3:45 PM

#

Waiting for
Grok 3.5
o3 pro
Deepseek R2
Gemini deepthink

misty vault May 23, 2025, 3:46 PM

#

claude 4.5 sonnet

keen beacon May 23, 2025, 3:47 PM

#

high ginkgo You ARE the dog

I'm not sure what you mean by "You are the dog." Could you provide more context about what you're referring to? Are you perhaps thinking of a game, story, or specific scenario where I should play the role of a dog? I'd be happy to help once I better understand what you have in mind.

cedar tide May 23, 2025, 3:48 PM

#

misty vault claude 4.5 sonnet

Lol

cedar tide May 23, 2025, 3:49 PM

#

cedar tide Waiting for Grok 3.5 o3 pro Deepseek R2 Gemini deepthink

will all be released in the next 30 days?

#

with open ai and gemini doing parallel thinking, Elon will also release grok 3.5 pro (with parallel thinking)

#

Are they preparing this?

misty vault May 23, 2025, 4:02 PM

#

keen beacon I'm not sure what you mean by "You are the dog." Could you provide more context ...

I don't know yet. Will you harm me if I harm you first?

sacred plaza May 23, 2025, 4:06 PM

#

do people actually use grok-3? not sure i understand why people still talk about the grok-3.5 release given how elon brainwashes its outputs via the system prompt

tall summit May 23, 2025, 4:10 PM

#

maybe

#

it's built into twitter so many twitter users do

sacred plaza May 23, 2025, 4:17 PM

#

tall summit it's built into twitter so many twitter users do

yea i can see it being useful if you are on twitter and need access to real time info. other models don't have access to that

keen beacon May 23, 2025, 4:23 PM

#

I don't know yet. Will you harm me if I

sacred plaza May 23, 2025, 4:31 PM

#

microsoft did this with inflection pi folk a few years ago. instead of a buy out just poach all of the people from the company. that is how mustafa returned to microsoft i believe

ocean plume May 23, 2025, 4:40 PM

#

where is claude 4 ?

keen beacon May 23, 2025, 4:40 PM

#

ocean plume where is claude 4 ?

beta site

torn mantle May 23, 2025, 5:06 PM

#

keen beacon beta site

google can do the funniest things with these new models

#

anthropic are kinda confident in their models at coding, but from what im seeing these two new models are the next thing tbh ( goldmane & redsword )

keen beacon May 23, 2025, 5:07 PM

#

torn mantle anthropic are kinda confident in their models at coding, but from what im seeing...

theyre probably the upcoming ga versions

torn mantle May 23, 2025, 5:08 PM

#

keen beacon theyre probably the upcoming ga versions

they talked about that?

keen beacon May 23, 2025, 5:08 PM

#

yea

torn mantle May 23, 2025, 5:08 PM

#

about ga versions?

keen beacon May 23, 2025, 5:08 PM

#

next month

#

should be exciting

torn mantle May 23, 2025, 5:08 PM

#

interesting

keen beacon May 23, 2025, 5:09 PM

#

torn mantle interesting

are they both 2.5 pro based or you think one could be flash?

elder rapids May 23, 2025, 5:10 PM

#

torn mantle anthropic are kinda confident in their models at coding, but from what im seeing...

new Google models already?

#

dawg

#

are they good

#

ye but what are their performance

keen beacon May 23, 2025, 5:12 PM

#

people are raving about them as far as i can tell

elder rapids May 23, 2025, 5:13 PM

#

alr cool all I need to hear

torn mantle May 23, 2025, 5:13 PM

#

elder rapids new Google models already?

yea so good

elder rapids May 23, 2025, 5:13 PM

#

fr?

torn mantle May 23, 2025, 5:13 PM

#

like nightwhisper level

#

yea esp at coding

elder rapids May 23, 2025, 5:13 PM

#

how is its overall performance too

#

if it's nw level

#

then it has to be good asf

naive idol May 23, 2025, 5:13 PM

#

Does anyone know about blockchain?

torn mantle May 23, 2025, 5:13 PM

#

better than the current gemini 2.5 pro model

#

overall

torn mantle May 23, 2025, 5:14 PM

#

elder rapids then it has to be good asf

yea its so good

#

even better if you ask me

keen beacon May 23, 2025, 5:15 PM

#

i hope they bring back raw thoughts on aistudio 🥲

torn mantle May 23, 2025, 5:15 PM

#

keen beacon i hope they bring back raw thoughts on aistudio 🥲

same

#

dont you think it thinks for less time than before?

#

is it really just an update related to summarizing thoughts or there is more to it?

keen beacon May 23, 2025, 5:16 PM

#

i heard people talking about that but i havent measured it myself

balmy mist May 23, 2025, 5:16 PM

#

did i miss anything?

keen beacon May 23, 2025, 5:16 PM

#

balmy mist did i miss anything?

yea new google models

#

better than nightwhisper apparently

torn mantle May 23, 2025, 5:16 PM

#

balmy mist did i miss anything?

models better than nw on lmarena

keen beacon May 23, 2025, 5:17 PM

#

ga versions coming next month

balmy mist May 23, 2025, 5:17 PM

#

keen beacon yea new google models

omggg

#

really

#

send link again please

balmy mist May 23, 2025, 5:18 PM

#

torn mantle models better than nw on lmarena

ohh

#

but not on webdev?

torn mantle May 23, 2025, 5:18 PM

#

balmy mist but not on webdev?

both

eternal flower May 23, 2025, 5:19 PM

#

Will Claude 4 Opus appear on the regular leaderboard? or just the beta one?

torn mantle May 23, 2025, 5:20 PM

#

eternal flower Will Claude 4 Opus appear on the regular leaderboard? or just the beta one?

wdym

#

its on both

#

beta & old website

balmy mist May 23, 2025, 5:20 PM

#

what are the model names?

keen beacon May 23, 2025, 5:20 PM

#

goldmane ~~redmane~~ redsword

elder rapids May 23, 2025, 5:20 PM

#

keen beacon i hope they bring back raw thoughts on aistudio 🥲

yes bro

#

I'm so sad

torn mantle May 23, 2025, 5:20 PM

#

goldmane & redsword

keen beacon May 23, 2025, 5:20 PM

#

elder rapids I'm so sad

its really hard to use the model in contrast to before tbh

torn mantle May 23, 2025, 5:20 PM

#

they are both strong at coding

elder rapids May 23, 2025, 5:21 PM

#

btw 2.5 flash is a beast

#

with the new update

eternal flower May 23, 2025, 5:21 PM

#

torn mantle beta & old website

are they codenamed still?

balmy mist May 23, 2025, 5:21 PM

#

torn mantle goldmane & redsword

thank you bro

torn mantle May 23, 2025, 5:21 PM

#

@keen beacon you know whats funny is that ive always chosen sonnet 3.7 over sonnet 4 in webdev

elder rapids May 23, 2025, 5:22 PM

#

torn mantle even better if you ask me

4 opus disappointed me

torn mantle May 23, 2025, 5:22 PM

#

like literally on all of my tests ive chosen it over the latest version

elder rapids May 23, 2025, 5:22 PM

#

Google coming to save the day

fleet lintel May 23, 2025, 5:22 PM

#

torn mantle goldmane & redsword

are they good ?

torn mantle May 23, 2025, 5:22 PM

#

eternal flower are they codenamed still?

anthropic models? no

torn mantle May 23, 2025, 5:22 PM

#

fleet lintel are they good ?

yes

keen beacon May 23, 2025, 5:22 PM

#

the claude 4 models werent ever anon models on the arena

elder rapids May 23, 2025, 5:23 PM

#

ye

#

no one had any idea

fleet lintel May 23, 2025, 5:23 PM

#

torn mantle yes

please dont troll me... every hype is blueballing me since march 2.5 pro launch (except may be veo 3)

keen beacon May 23, 2025, 5:23 PM

#

its really good this time

elder rapids May 23, 2025, 5:24 PM

#

@torn mantle how long do the new models think

eternal flower May 23, 2025, 5:24 PM

#

Am I stupid or what, I cannot find 4 Opus on the leaderboard? Does it not have enough votes yet or something

elder rapids May 23, 2025, 5:24 PM

#

btw I wonder if they're ever going to release a Gemma thinking model

keen beacon May 23, 2025, 5:25 PM

#

eternal flower Am I stupid or what, I cannot find 4 Opus on the leaderboard? Does it not have e...

it was just added

#

so u have to wait i guess

eternal flower May 23, 2025, 5:26 PM

#

I see, is the consensus here that it will place below Gemini?

keen beacon May 23, 2025, 5:26 PM

#

probably

torn mantle May 23, 2025, 5:26 PM

#

elder rapids @torn mantle how long do the new models think

same time as the current gemini model

keen beacon May 23, 2025, 5:26 PM

#

especially the new revisions

torn mantle May 23, 2025, 5:26 PM

#

they dont feel like flash versions

#

def pro level

elder rapids May 23, 2025, 5:26 PM

#

eternal flower I see, is the consensus here that it will place below Gemini?

ye

#

opus 4 thinking simply isn't Gemini lvl imo

patent aspen May 23, 2025, 5:26 PM

#

eternal flower I see, is the consensus here that it will place below Gemini?

I don't think Opus is stronger in enough categories to place above Gemini

elder rapids May 23, 2025, 5:27 PM

#

2.5 pro even if people think it's nerfed in some ways

#

it's still a league ahead of everything else

keen beacon May 23, 2025, 5:27 PM

#

just need the raw thoughts 😦

elder rapids May 23, 2025, 5:27 PM

#

deadass

eternal flower May 23, 2025, 5:28 PM

#

Thanks for the input everyone, and this goldmane & redsword, are these updated Gemini models to be released to the live leaderboard soon?

keen beacon May 23, 2025, 5:28 PM

#

theyll put it on the leaderboard when they launch it probably

elder rapids May 23, 2025, 5:28 PM

#

eternal flower Thanks for the input everyone, and this goldmane & redsword, are these updated ...

whenever DeepMind announces them publicly

keen beacon May 23, 2025, 5:28 PM

#

which is next month, apparently

elder rapids May 23, 2025, 5:28 PM

#

they'll be revealed

hollow ocean May 23, 2025, 5:28 PM

#

elder rapids it's still a league ahead of everything else

Even o3?

elder rapids May 23, 2025, 5:29 PM

#

hollow ocean Even o3?

the comparison isn't just this or that lmao

keen beacon May 23, 2025, 5:29 PM

#

i would take nerfed 2.5 pro with raw thoughts over o3 tbh

#

raw thoughts are so important. the o3 summary model is 💀

elder rapids May 23, 2025, 5:29 PM

#

I would prefer o3 in certain tasks but for the vast majority of things it's 2.5 pro

hollow ocean May 23, 2025, 5:30 PM

#

o3 tool use is too good

#

But hallucinates more

elder rapids May 23, 2025, 5:30 PM

#

keen beacon raw thoughts are so important. the o3 summary model is 💀

glad you sent that blog tho

#

seems like they're open to improving it

#

so I'm optimistic asf

keen beacon May 23, 2025, 5:31 PM

#

elder rapids seems like they're open to improving it

yeah it seems with the language used there they're reconsidering it but havent made a decision

elder rapids May 23, 2025, 5:31 PM

#

ye it means a ton for what they're looking for next

wintry tinsel May 23, 2025, 5:31 PM

#

With veo 3 and opus 4 we are in the next stage of AI

#

I’d say two more stages until AGI

elder rapids May 23, 2025, 5:32 PM

#

veo 3 I can agree but opus 4 seems like the lower end of the current level of models

#

veo 3 is insane

keen beacon May 23, 2025, 5:33 PM

#

did u try 2.5 pro tts btw?

#

its really good

elder rapids May 23, 2025, 5:33 PM

#

keen beacon did u try 2.5 pro tts btw?

nah I haven't messed with any of that yet

#

im gonna soon tho

#

I'm trying to get a hold of opus 4

keen beacon May 23, 2025, 5:33 PM

#

just need 2.5 pro image gen 🔥

wintry tinsel May 23, 2025, 5:33 PM

#

With gpt 2 & 3 being stage one, gpt 4 sonnet 3.5 being stage 2 and Gemini 2.5 pro/opus 4 being stage 3

hollow ocean May 23, 2025, 5:33 PM

#

Opus 4 is better than 2.5 pro on livebench

elder rapids May 23, 2025, 5:33 PM

#

keen beacon just need 2.5 pro image gen 🔥

ON GOD

elder rapids May 23, 2025, 5:34 PM

#

hollow ocean Opus 4 is better than 2.5 pro on livebench

benchmaxxed

#

asf

wintry tinsel May 23, 2025, 5:34 PM

#

Opus feels way smarter than 2.5 pro to me though in general use

hollow ocean May 23, 2025, 5:34 PM

#

elder rapids benchmaxxed

Nah

hollow ocean May 23, 2025, 5:34 PM

#

wintry tinsel Opus feels way smarter than 2.5 pro to me though in general use

@elder rapids see

wintry tinsel May 23, 2025, 5:34 PM

#

Maybe the coding of 2.5 pro is a little better

novel flame May 23, 2025, 5:34 PM

#

So… Opus is good but fundamentally just way too expensive, right?

elder rapids May 23, 2025, 5:34 PM

#

wintry tinsel Opus feels way smarter than 2.5 pro to me though in general use

HELL nah, but we don't need to discuss this, just go to anthropic subreddits

#

and you'll see how you're the minority

keen beacon May 23, 2025, 5:35 PM

#

hollow ocean Opus 4 is better than 2.5 pro on livebench

ok, but the cost of opus 4 thinking 💀

wintry tinsel May 23, 2025, 5:35 PM

#

I use a system prompt to unlock it

fleet lintel May 23, 2025, 5:35 PM

#

novel flame So… Opus is good but fundamentally just way too expensive, right?

honestly way too expensive for my taste

wintry tinsel May 23, 2025, 5:35 PM

#

Opus is much more guidable with system prompt than 2.5 pro is

elder rapids May 23, 2025, 5:35 PM

#

wintry tinsel Opus is much more guidable with system prompt than 2.5 pro is

the literal opposite

#

lmao

#

and I mean CRAZY opposite

keen beacon May 23, 2025, 5:35 PM

#

i heard that people used 2 messages on claude pro with claude 4 opus and it locked them out Lmao

wintry tinsel May 23, 2025, 5:36 PM

#

We have different use cases then

#

Have they released reasoning on Claude 4 or not

keen beacon May 23, 2025, 5:36 PM

#

yes

wintry tinsel May 23, 2025, 5:36 PM

#

Does the model decide when it reasons?

elder rapids May 23, 2025, 5:37 PM

#

wintry tinsel We have different use cases then

opus can't follow instructions for sht, you'd be damned to even try to guide it

#

I've done a TON of testing for opus

#

I used to be a Claude glazer lmao

wintry tinsel May 23, 2025, 5:37 PM

#

That must have cost you something

elder rapids May 23, 2025, 5:37 PM

#

ye

wintry tinsel May 23, 2025, 5:37 PM

#

I’m Claude glazer for life tho

hollow ocean May 23, 2025, 5:40 PM

#

elder rapids I've done a TON of testing for opus

It’s just the cost right

ocean vortex May 23, 2025, 5:40 PM

#

novel flame So… Opus is good but fundamentally just way too expensive, right?

No but it reasons way too short. They did just enough for it to beat Sonnet but I think that model is far from maximized

#

it needs to be faster too though

elder rapids May 23, 2025, 5:41 PM

#

hollow ocean It’s just the cost right

no it doesn't listen to what I want

hollow ocean May 23, 2025, 5:41 PM

#

On the Claude subs most complaints are about the cost

ocean vortex May 23, 2025, 5:41 PM

#

hollow ocean On the Claude subs most complaints are about the cost

cause that's all they can complain about while high on the hype train

wintry tinsel May 23, 2025, 5:42 PM

#

Sonnet is just the better model practically unless you're stacked

elder rapids May 23, 2025, 5:42 PM

#

yep

#

opus does feel a little smarter tho

wintry tinsel May 23, 2025, 5:42 PM

#

Bricked up with cash

elder rapids May 23, 2025, 5:43 PM

#

majority of people here are stacked that seems to be the demographic when it comes to AI

novel flame May 23, 2025, 5:43 PM

#

ocean vortex No but it reasons way too short. They did just enough for it to beat Sonnet but ...

Did you read their recent paper on “the biology of an LLM”? Their research (on a variant of Haiku) shows that the reasoning parts are in some cases totally unrelated to the true way the model is arriving at its answer; it has already decided what its answer is going to be and the reasoning trace is just yapping to justify it without actually adding any benefit. It’s an interesting read.

ocean vortex May 23, 2025, 5:44 PM

#

elder rapids opus does feel a little smarter tho

but you can counteract it with more test-time compute in many/most cases with Sonnet. With Opus, more test-time compute is not really possible currently

elder rapids May 23, 2025, 5:44 PM

#

novel flame Did you read their recent paper on “the biology of an LLM”? Their research (on a...

this used to be a problem with 2.0 flash thinking

keen beacon May 23, 2025, 5:44 PM

#

people are overblowing it

elder rapids May 23, 2025, 5:44 PM

#

but it's not crazy tbh

keen beacon May 23, 2025, 5:44 PM

#

its true to an extent but its complex

wintry tinsel May 23, 2025, 5:44 PM

#

elder rapids majority of people here are stacked that seems to be the demographic when it com...

I’m the opposite there are so many ways to use AI for free, the only models I’ve been priced out of are O series and Opus

elder rapids May 23, 2025, 5:45 PM

#

ocean vortex but you can counteract it with more test-time compute in many/most cases with So...

ye sonnet seems to benefit more from thinking

#

opus's thought process is kind of a nothingburger

eternal flower May 23, 2025, 5:45 PM

#

anyone have any guesses for Grok 3.5 release? Was Grok 3 released under a codename on lmarena before public release last time?

keen beacon May 23, 2025, 5:46 PM

#

eternal flower anyone have any guesses for Grok 3.5 release? Was Grok 3 released under a codena...

yes

novel flame May 23, 2025, 5:46 PM

#

Speaking of fast, I ran a prompt on Qwen3 32B today and it spat out tokens at a speed I’ve never seen before - it exceeded 1600 tps!!!! It wasn’t the best response to be fair, but hooooooly cow that’s fast.

keen beacon May 23, 2025, 5:46 PM

#

where?

hollow ocean May 23, 2025, 5:46 PM

#

Elon said 2 more weeks

elder rapids May 23, 2025, 5:46 PM

#

2 weeks too long shi

eternal flower May 23, 2025, 5:46 PM

#

hollow ocean Elon said 2 more weeks

recently? or wasnt that like 2 weeks ago

elder rapids May 23, 2025, 5:46 PM

#

watch it be an ass model

#

lmao

hollow ocean May 23, 2025, 5:47 PM

#

eternal flower recently? or wasnt that like 2 weeks ago

1 week ago

#

Prob next week

#

Or next next week

wintry tinsel May 23, 2025, 5:47 PM

#

To the people here who have used grok 3.5 how does it compare to Gemini pro?

hollow ocean May 23, 2025, 5:48 PM

#

No one hyped about Gemini deep think?

patent aspen May 23, 2025, 5:48 PM

#

btw how did Grok manage to become relevant in the AI race when it was founded in 2023? Did they have to use a hyperscaler cloud provider to get compute that fast?

novel flame May 23, 2025, 5:48 PM

#

patent aspen btw how did Grok manage to become relevant in the AI race when it was founded in...

Money. A lot of it.

eternal flower May 23, 2025, 5:49 PM

#

wintry tinsel To the people here who have used grok 3.5 how does it compare to Gemini pro?

huh? Is it already released as a codename on lmarena?

patent aspen May 23, 2025, 5:49 PM

#

novel flame Money. A lot of it.

Sure but money can't just spawn a data center out of thin air

#

Via a cloud provider?

keen beacon May 23, 2025, 5:50 PM

#

nvidia ceo talked about it/kinda glazing xai about how they setup things up recently iirc

patent aspen May 23, 2025, 5:51 PM

#

Ah 122 days. That's really fast

eternal flower May 23, 2025, 5:51 PM

#

what was its codename? you certain it was 3.5

keen beacon May 23, 2025, 5:51 PM

#

he's trolling

eternal flower May 23, 2025, 5:51 PM

#

I thought lol

hollow ocean May 23, 2025, 5:51 PM

#

patent aspen Ah 122 days. That's really fast

Power of money and talent

keen beacon May 23, 2025, 5:52 PM

#

they're too busy merging troll PRs into their prompts repo and resetting it multiple times to work on grok 3.5

wintry tinsel May 23, 2025, 5:52 PM

#

keen beacon nvidia ceo talked about it/kinda glazing xai about how they setup things up rece...

No one buys more of his GPUs than xAI 🤣

keen beacon May 23, 2025, 5:53 PM

#

they have to buy them anyway

#

otherwise they aren't competitive

#

but i mean glazing xai and elon can't hurt for sales right

teal mantle May 23, 2025, 5:55 PM

#

Isn’t Grok 3.5 still vaporware at this point?

keen fulcrum May 23, 2025, 5:55 PM

#

["openai"] = "OpenAI",
["anthropic"] = "Anthropic",
["google"] = "Google",
["groq"] = "Groq",
["cohere"] = "Cohere",
["mistral"] = "Mistral",
["amazon"] = "Amazon",
["arcee"] = "Arcee",
["ai21"] = "AI21 Labs",
["liquid"] = "Liquid",
["lambdalabs"] = "Lambda Labs",
["chutes"] = "Chutes",
["reka"] = "Reka",
["xai"] = "xAI", -- Updated mapping for verified xAI models
["deepseek"] = "DeepSeek", -- Updated from "DeepSpeek" typo
["01ai"] = "01.ai",
["moonshot"] = "Moonshot AI", -- New provider
["hyperbolic"] = "Hyperbolic",
["together"] = "Together.AI",
["fireworks"] = "Fireworks",
["nebius"] = "Nebius AI Studio",
["deepinfra"] = "DeepInfra",
["sambanova"] = "SambaNova Cloud",
["cerebras"] = "Cerebras",
["replicate"] = "Replicate",
["perplexity"] = "Perplexity",
["anyscale"] = "Anyscale",
["ibm_watsonx"] = "IBM Watsonx",
["ibm_watsonx_3rdparty"] = "IBM Watsonx 3rd Party",
["novita"] = "Novita AI",
["writer"] = "Writer",
["stima"] = "Stima",
["straico"] = "Straico",

elder rapids May 23, 2025, 5:57 PM

#

hollow ocean No one hyped about Gemini deep think?

because it's only in the ultra plan

#

so people felt like it wasn't worth it

hollow ocean May 23, 2025, 5:57 PM

#

elder rapids because it's only in the ultra plan

But it will be sota reasoning