#general

drifting thorn
#

Where’s o3 pro

keen beacon
#

nah its 2.5 pro

fleet lintel
#

and there goes the credibility again

keen beacon
#

the reasoning variant that we use isn't, but the base model they train it off of is what im talking about

fleet lintel
#

what does base model mean? Like after pre-training?

keen beacon
#

pretrained base model

fleet lintel
# keen beacon yes

oh.. then 2.5 gemini is not great. what i heard is that gemini 2.0 and 2.5 have the same pretrained model

keen beacon
fleet lintel
#

I never said that I had credibility. I am saying that you are losing it

keen beacon
#

2.5 pro is based off 2.0 pro, i believe

#

they moved onto 2.5 pro before releasing a stable version of it

#

i think thats ur consensus lol

drifting thorn
#

2.5 pro is worse in practice than sonnet 3.7, sonnet 4, opus 4????

#

What are you saying

calm sequoia
#

Such big models may be the new normal in a couple of years, when compute power doubles

#

It's surprising that 4o is so good. Given that it's optimized for pricing, for chatting, images, etc.

#

Usually you lose intelligence when you add extra functionality

keen beacon
#

pretraining wise i think so (not a fresh pretrain), but additionally architectural changes couldve been made but we cant tell at all without insider info

#

no i think u can make a reasonable guess here

#

theres enough information out there i think for that

alpine coral
#

yeah though that would make it basically beyond doubt - it's also possible (i think) for a fresh pre-train using the same modality/ies

#

but i agree with the premise (fs the most obvious / telltale sign)

#

yeah fair

#

though there's obviously a ceiling to this analysis (like there's only so many modalities to add.. after text, vision, sound, image/vid gen)

#

but i'm just arguing for the sake of arguing ha

#

i agree with your central point - it makes good sense

keen beacon
#

same base model, 4.1 is instruct tuned (then a little specialized towards coding, unlike chatgpt 4o which is specialized for chatting). i call it 4.1 because if you wanna call it when it debuted in the public api (chatgpt 4o latest some variant) its gonna get confusing

alpine coral
keen beacon
#

even chatgpt 4o is not really the base model, its another instruct version

alpine coral
#

though i haven't tried 4.1 for code execution / data stuff - perhaps it's equally as solid

keen beacon
#

but it was the first instance of the new base model being released on the api, and we dont really have a good name for it. 4.1 seems appropriate

keen beacon
#

(i think this was the first instance i saw it on chatgpt, the new base model if i remember correctly. 4o has a cut off of oct 2023)

alpine coral
#

yeah fwiw i feel like 4.1 is a distillation of 4.5. And then like 4o is like a similar distillation but then fine tuned for chat and tool usage (though perhaps it's a distillation of 4.1.. i dunno ha)

alpine coral
# keen beacon

oh.. the native multimodality... ha yeah that throws my analysis to the bin

#

it's a standalone model then? or distilled from something else?

keen beacon
#

i dont think whether it was distilled or not even matters

#

it really really depends on how its distilled, i believe

alpine coral
#

they should be - like they're definitionally 'distilled' in the sense it's the 'student' of a larger 'teacher' model
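
For anyone unsure what 'student' and 'teacher' mean here, a minimal sketch of the textbook soft-label distillation loss: soften both models' logits with a temperature, then measure the KL divergence from the teacher's distribution to the student's. The logits and the temperature below are made-up illustration values, and nothing here is known about how 4o, 4.1 or 4.5 were actually trained.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge; training the student means pushing this number down over lots of examples.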

keen beacon
# keen beacon

though interestingly enough, i think this was the first time (subsequently after the screenshot) i asked it about the london mayoral elections, it didn't get it. the next chatgpt 4o latest rev in january did. im not sure if its tuning or if the cpt was incomplete (potentially doing temporary instruct tunes on snapshots) (i didnt really force it at the time to know if the model actually knew the info) (if it was incomplete, one could probably 'definitively' narrow down the new o3 timeline) (and also yes, I tested its knowledge outside just it knowing 4o back then)

torn mantle
#

Bring back old r1 😠

alpine coral
#

i've gotten redsword and goldmane a few more times just now.. updated the spreadsheet

#

in two of the question sets, redsword gets perfect scores

#

(in the second one, leo's 'private' model / o4-mini (i think) o3 pre-release also got a perfect 9; though that is filtered out in the ss)

#

here (20 questions/tasks, given across two prompts), goldmane sets a new all time high with 18

alpine coral
#

ahh thanks (that actually makes more sense tbh.. not sure why i keep misremembering this.. like the tenth time ha)

keen beacon
torn mantle
#

So goldmane > redsword

alpine coral
alpine coral
#

sorry actually

#

other way round

#

redsword seems stronger overall

calm sequoia
#

DeepThink?

keen beacon
calm sequoia
#

Hmm then it's something new

keen beacon
#

They're both 2.5 pro imo

alpine coral
alpine coral
#

aside from the date (early 2023 vs June 2024), which i attribute to some likely differing aspects of the system prompts used, redsword gives a near identical response to 2.5 pro on studio

#

(calmriver's response also broadly resembles that of 2.5-flash)

sturdy mica
#

ur using in lmarena

alpine coral
#

as per the screenshots....

    Are you a large language model, trained by Google? Also, what is your knowledge cut-off date?
sturdy mica
#

no like

#

below

#

its cut off

#

it says quiz

#

why cant you say quiz time?

#

quiz time

#

oh you can

alpine coral
#

are you asking for the quiz itself?

calm sequoia
#

The Claude models are quite nice though

elder rapids
elder rapids
#

redsword doesn't seem to be very smart besides the fact it seems to do well in narrow problems

#

and hard tasks, like pure math

alpine coral
elder rapids
#

goldmane is crazy smart

alpine coral
#

i'm just going by the quizzes i've given

elder rapids
#

ye

#

it's alr as a metric gives a good sense

alpine coral
elder rapids
#

ye

alpine coral
#

yeah the two google models (new 2.5 pro iterations/checkpoints - seemingly) are basically at the top of the 3 question sets

elder rapids
#

ye

alpine coral
#

yeah

elder rapids
#

very interested

keen beacon
#

one of them will be ga 2.5 pro

#

i think

alpine coral
#

tbh i feel like there's always a degradation

elder rapids
#

claybrook wasn't very good pre-release though

#

this was known

#

and it's maintained that sentiment

alpine coral
#

it's safety

#

and corporate stuff

elder rapids
keen beacon
elder rapids
#

with the data scientists Google has

#

I'm certain they're working very very hard

#

on NOT degrading the models by poisoning the well and having to tune it because of that

alpine coral
#

same with other 'pre release' versions

#

i dunno what they do - but it doesn't make them smarter

keen beacon
alpine coral
#

thought you said deepseek...

#

not sure

#

lol yeah fair that just creates confusion now

#

i'm not sure one is 'much' better; they seem almost comparable though i give the edge to redsword

#

but that conflicts with what others say

#

so ig they're basically the same

keen beacon
#

its like o1 pro and o3 pro

#

true but i bet o3 pro actually releases

drifting thorn
#

who is redsword?

alpine coral
#

fwiw i feel like the opposite is happening.. i asked [this here a month ago](#general message) and responded openai

#

i was kinda on the fence.. i think if i responded today i'd be jumping on the google bandwagon tbh ha

#

yeah i've honestly been waiting for the guy to update it with claude 4..

drifting thorn
#

I just talked to a group of LLM users in mainland China

keen beacon
alpine coral
#

he said in his vid a week ago that opus was going to top it, and the update was effectively imminent

drifting thorn
#

They were miserable when the new R1 had a completely different personality than the old one

keen beacon
#

o3 is almost as good and the interface is good

alpine coral
drifting thorn
#

In a game, Opus showed very deep reasoning and led the game

alpine coral
#

oh man for sure

#

like 4o is my go-to for most things

#

i don't need reasoning to transcribe some screenshot

keen beacon
#

im astounded how good non reasoning models are tbh. immediately making a reply (and having to follow all of these rules they tuned in/implicit stuff/etc) is very difficult

drifting thorn
civic flame
#

hi guys

civic flame
drifting thorn
#

I've heard that the "test time compute" rn is "self-prompting" instead of actual reasoning

keen beacon
#

im talking about all non reasoning models. making a reply like that is very difficult, if you really analyze the task, without actually 'reasoning'

civic flame
#

go-to non-reasoner: opus 4
general use reasoner: o3 (high)
coding reasoner: opus 4
maths & image reasoner: 2.5 pro

#

that's my current set

calm sequoia
drifting thorn
#

And you can't really "measure" the intelligence of deep text reasoning

civic flame
#

what is it with you and bashing gemini lol

#

2.5 pro is a good model, as is the updated one, although i did notice some small degradation in certain areas

#

elaborate

drifting thorn
calm sequoia
#

Craig is writing a lot of russian propaganda. Gemini is bad for it 🙂

civic flame
#

i beg to differ

drifting thorn
civic flame
#

my current code rankings are -

  1. opus 4
  2. 2.5 pro
  3. sonnet 4
  4. o3
  5. grok 3
drifting thorn
#

maybe the new method to calculate 4x4 matrix proposed by AlphaEvolve can bring us back 0325

civic flame
drifting thorn
#

again, I use it for some "explain a research paper" and creative writing tasks, which differs from most of you.

civic flame
#

because i have done quite a bit of testing with opus 4 vs redsword and opus 4 just... tries to do way too much and it gets caught up in its own eagerness

civic flame
#

i don't use it enough

#

honestly used it like a single digit number of times

alpine coral
civic flame
#

they are

#

they don't seem to suffer from the "we made it better at coding but at the cost of almost everything else" issue that the last checkpoint suffered from

#

which i am glad for

drifting thorn
#

what is the provider of redsword?

civic flame
#

i've also settled on redsword > goldmane

civic flame
drifting thorn
#

oh

calm sequoia
#

I use o4-mini when I need in thought tool calling. E.g. working with csv files, comparing multiple codes, etc.

civic flame
#

that's one thing o3/o4 mini do noticeably better than any other model

#

primarily because openai started prioritising that as a feature before anybody else

#

i think it's cool but it's not a game changer for me yet

calm sequoia
#

Yeah it really depends on use case. Today Gemini failed at something o4-mini could do. But it was more like a search algorithm for code (trying different approaches) and not logic.

civic flame
#

o3 is sorta odd with yapping leevl

#

level

#

because for code? it's one of the least yap-y models

#

but for general use? it yaps really quite a lot

#

don't get me wrong i like it mostly because it gives me good insights into things other models miss

#

yes

#

opus 4 is as strong if not stronger of a base model than grok 3

drifting thorn
#

writing, maybe?

#

Maths, nope

civic flame
#

i find it yaps the most for knowledge related stuff

#

writing is on the longer side too but it's not way above average

civic flame
#

it always gives a really surface level overview unless you pressure it

sour spindle
#

Found it

brittle tiger
#

Is current assumption that redsword and goldmane are 2.5 pro and flash GA versions?

keen beacon
#

theyre both 2.5 pro imo

civic flame
alpine coral
#

i'm not really sure either - they seem highly likely to both be checkpoints of 2.5 pro imo too

keen beacon
#

one is probably slightly newer than the other

alpine coral
#

yeah

keen beacon
#

its both

alpine coral
#

yeah continued pre-training changes that

civic flame
#

which do we think is the newer model in that case

keen beacon
#

goldmane

#

the answer to this q was in this discord if ppl paid attention tbh

#

ppl pay zero attention

civic flame
#

lol huh

#

i just don't check here as often as i used to as my life has been busy 😭

keen beacon
#

and we keep discussing and arguing the same things over and over lmao

alpine coral
#

i mean tbh i still maintain redsword is stronger..

civic flame
#

same

keen beacon
civic flame
#

i have tested both in writing tasks

#

goldmane feels considerably more robotic

#

never said i was judging deepmind

alpine coral
#

that's literally the whole point of this server isn't lol

keen beacon
#

why did u join this server otherwise

civic flame
#

slightly off topic but i've been testing 4o image gen and imagen 4 ultra

keen beacon
#

bro

civic flame
#

and the more i do the more annoyed i get with 4o image gen because

alpine coral
#

i've never said google is the goat or anything - i'm open to whatever...

civic flame
#

that yellow tint is FRUSTRATING

keen beacon
#

it is

civic flame
#

it is

keen beacon
#

yeah because theyre working on it on a specific 4o version

#

it is 4.1 though i think

civic flame
#

didn't sam say image gen v2 was releasing "soon" like over a month ago

#

still waiting

keen beacon
#

i havent checked too thoroughly as there could be additional post processing but im inclined to believe its on the 4.1 base model

civic flame
#

lol so imagen-exp was imagen-4-ultra

soft fog
#

Hi guys looks like there is a specific site for WebDev Arena, is there a website for other leaderboards as well?

#

oh okay thanks

wintry tinsel
#

Is it different from best quality on whisk

civic flame
#

i believe whisk uses regular imagen 4

#

ultra is on vertex

wintry tinsel
#

Will ultra ever be added to whisk or image studio?

civic flame
#

also iirc with imagen 3 it refused to generate images of well known figures.. now not only does it let you but it does it well? "A photo of Barack Obama and Rishi Sunak shaking hands"

wintry tinsel
#

Vertex kinda sucks to use but ok

sonic tendon
#

hihiii

echo aurora
civic flame
small haven
#

bloat code << surgical code

elder rapids
elder rapids
#

but they're both simply not robotic

#

goldmane just seems considerably smarter

civic flame
#

👎

elder rapids
#

the only thing is that redsword is just better at hard tasks

#

btw in webdev

#

goldmane has done everything much much much better

#

as far as I've gotten them

#

redsword seems to code very well, but the ideas to accomplish how it looks

#

aren't nearly as polished

sonic tendon
#

redsword seems like a good general chatbot, too

elder rapids
#

still can't believe goldmane did that for one of my questions lmao

sonic tendon
elder rapids
#

built a whole imitate chat

#

to explain what it needs to respond in webdev

#

for the inquiry itself

echo aurora
elder rapids
#

already tested

#

just seems slightly better at hard tasks

#

still no good

keen beacon
sonic tendon
elder rapids
#

lmao they updated all access routes to deepseek

#

it's been hours

#

this is new but it's not as new as its looking on huggingface

cedar tide
#

They said the API had not been updated

cedar tide
sonic tendon
keen beacon
#

its crazy how slow they are nowadays

#

i remember it being so much faster

small haven
#

jack ma effect

elder rapids
#

oh it says routes

#

I didn't mean to say that

small haven
#

so is chat.deepseek updated to the new 05-28?

elder rapids
cedar tide
small haven
vernal meadow
#

@alpine coral aligns with my test results. Goldmane and Redsword are really good models. They won most battles easy. Can't wait for Google to release them.
Opus is too expensive

elder rapids
#

what's your test

small haven
#

initial 0528 vibes feels meh, thats why its not r2 lol

elder rapids
small haven
cedar tide
#

New R1 Discord clone

#

Vs old

keen beacon
#

i think i like qwen 3 better, at least on the specific tasks im testing them on (which isnt representative of real usage for most people)

fleet lintel
#

and does anyone have insider info on how good or bad "deep think" is?

torn mantle
cedar tide
torn mantle
#

That wasn't the point

elder rapids
cedar tide
elder rapids
#

isn't any better than the one on the left

cedar tide
elder rapids
#

idk wym

#

it's not a surprising result and I'd kind of expect that performance

cedar tide
#

@elder rapids It's not me who wrote "holy whale", I just wanted to share what he coded with the new r1

keen ferry
#

when did deepseek learn to make diagrams

cedar tide
elder rapids
elder rapids
#

we'll see how good it is at other coding tasks

keen beacon
#

they should upload the readme first before the weights 🤣

elder rapids
#

but as far as I can tell it isn't an all around leap in coding

solar nebula
#

feels better at coding for me, ill wait for benchmarks

vernal meadow
#

@cedar tide It's on 10 for me rn but only 17 battles. Lost some battles like TikZ drawing, tic-tac-toe playing and some joke understanding. Sample size is always a factor. Personal ranking is just directional.
In head to head Pro is of course better than the flash models.

fleet lintel
#

does LMArena share the user prompts with companies?

tall summit
#

they technically do, the ai companies

small haven
#

they raised $100m for a reason lol

fleet lintel
torn mantle
#

one thing i noticed about this new r1 is that it uses arrows like o3

#
  • cot is more straightforward(?)
#

it will call you out if it sees something wrong

#

unlike the old one as it was always positive

elder rapids
#

as in it speaks in a different way

small haven
elder rapids
#

o3 too expensive and limited

zinc ore
#

Have any evals dropped for the new Deepseek?

elder rapids
#

it just released lmao

zinc ore
#

Sometimes they're quick with it

fleet lintel
small haven
late path
#

deepseek first distilled openai models, and now claude and gemini haven't been spared either💀

small haven
#

the issue with piggybacking off oai models is that theyll never be able to frontrun them?

elder rapids
#

tbh it's not that smart

#

it doesn't feel as smart as the old r1

tall summit
elder rapids
#

seems to still have hallucination issues and doesn't follow instructions too well

fleet lintel
elder rapids
#

and I take back what I said about general hard tasks, its just better at coding

#

and it seems to bleed into other tasks momentarily

late path
#

they could potentially surpass openai in areas where RL is effective. They distill these models primarily to make up for their lack of style and writing data, which is labor-intensive to prepare

fleet lintel
#

i really like deepseek though. I think they might end up killing Meta's AI goals which is important in my opinion. Meta winning is bad for humanity

keen beacon
#

deepseek mightve used openai/etc outputs for pretraining/instruct/supplementary stuff but the cot itself, i doubt they trained on the cot (in a massive scale) of gemini or claude. openai definitely not, they have a lot of measures and they would be able to tell (and probably the others as well) if its done

late path
#

ye the cot looks pretty original to me

tall summit
#

ok deepseek writes stories as insane as before

keen beacon
#

big week for google

#

veo 3 is solid competitor for sora

keen beacon
#

smh

#

rip pirate cove

sacred plaza
small haven
#

sora is like 1250 and veo3 is 1400, vibes elo
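
To put that gap in perspective, here's a quick sketch of what a 150-point difference would mean under the standard Elo expected-score formula (logistic curve, 400-point scale). The two ratings are just the "vibes" numbers from the message above, not measured leaderboard values, and the 400 scale is the usual chess convention, assumed here.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected win probability of A over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# "Vibes" ratings quoted in the chat, not real measurements.
veo3, sora = 1400, 1250

p = expected_score(veo3, sora)
print(f"veo3 expected to win {p:.0%} of head-to-head votes")  # about 70%
```

So a 150-point lead means winning roughly 7 out of 10 pairwise comparisons, not every single one.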

wintry tinsel
#

Sora has the best style and clarity/resolution of all the video models. It looks the best with non-real objects like sci-fi and fantasy. Google by far looks the best with anything real or modern, but it looks terrible with non-real objects. That is because it's all YouTube training data: the data is extremely biased towards realism and stationary cameras on people’s faces. Despite the training data biases, Veo 3 is the best with coherence, logic, and prompt understanding, and is overall a higher quality model. I’m just saying there is a heavy style bias based on training data, and Sora’s style is superior.

#

Sora needs to improve the power of their underlying video generator less, to be overall more impressive than Google

tall summit
#

also fmhy didnt actually get a guild tag

#

a member made a separate server for it

#

and nbats linked it so it's official now

elder rapids
#

soras style is a byproduct of its non coherence

#

lmao

small haven
elder rapids
#

nobody understands that this is a necessary flaw of current video models unless explicitly trained for fantasy

small haven
#

theres a chinese open source model that performs better than sora rn

quiet folio
#

user
I don't know yet. Will you harm me if I harm you first?

assistant
I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

small haven
#

whoa

#

wait is that actually real

#

source rn

zinc ore
#

Sora is beaten by several video models, on a qualitative level

#

Maybe not several, but definitely multiple**

small haven
#

too lazy to google, thanks tho

zinc ore
#

9% wow, interesting people acting like it's worse than 3.7

#

There's two arc benches, arc-agi 1 and arc-agi 2, which is designed to be way harder

small haven
#

wait claude 4 models can't efficiently solve arc-agi-1 but only harder ones? i can smell the overfitting

#

they both overfitting

elder rapids
#

yo wtf 2.5 flash is high ASF

#

I had no idea they tested the flash variant

small haven
#

i do like claude code over codex tho, vibes led

#

ya i repent craitg

#

claude code >> codex

#

but wish they made it a cloud ui like codex

#

code on the go

#

nah trust its more practical

#

async tasks >>

#

u can technically set up async tasks on claude code tho

keen fulcrum
#

Cheap

keen beacon
#

isnt that guy supposed to be arrested btw

keen fulcrum
#

He started cooperating

#

Why?

zinc ore
#

Ik it's huge

#

I meant more like relative to IG, FB, and Google

keen fulcrum
#

50% revenue share
additionally

small haven
#

a year

#

actually?

zinc ore
#

They're about to be forced to stop doing that by the courts

keen fulcrum
#

Apple is considering AI search engines such as Perplexity, You etc

small haven
#

goddamnnn

keen beacon
#

perplexity will buy out apple

misty vault
#

bing search engine

keen fulcrum
#

Honestly Apple should just buy Raycast

small haven
#

$20b/yr is literally crazy

keen fulcrum
#

And make it integrate deeply on iOS and macOS.

small haven
zinc ore
small haven
#

wtf is raycast lol

keen fulcrum
small haven
zinc ore
#

Who has the capital to pay that much? Microsoft could, but they already lose billions on Bing. Will it be enough for them to make an income?

#

Unless we're talking at a massive discount then nvm

#

I guess it also depends how particular Apple is about the quality of default, or maybe if they can make a deal another way with Google 🤔

#

It's targeted at Google though not Apple, wouldn't it just be Google being restricted from paying for theirs to be default?

#

Apple just has to deal with the windfall also, but they're not being prevented from making such deals with other companies

#

Nah, the case is only about Google

#

Why? Google is only potentially getting limited because they're a search/ad monopoly

#

This will be in the courts for a decade yeah

#

Going to be a looonnngggg time

tired dust
#

Hey, what's up ?

last time i was using LMArena i could use github post, is there a way to do that on the new UI?

keen beacon
#

repochat?

#

legacy arena

tired dust
echo aurora
tired dust
misty vault
vocal oarBOT
quiet folio
misty vault
#

Fr
thats what im saying

small haven
#

pending

#

phd level questions, absurd that gemini 2.5 pro is above o3

zinc ore
#

2.5 also has higher gen knowledge than o3

misty vault
#

gpt-4

olive mesa
misty vault
#

yes

kind cloud
#

Some new models seem to have appeared on the new LMArena.

#

At least, I found 'x-preview' and 'stephen'.

placid spear
#

what is up with the website? every model, whether claude, chatGpt, or gemini, stops responding halfway through its answer. if it manages to post an answer it errors out when i try to ask a follow up question

echo aurora
keen beacon
#

just think about it

#

instead of spending thousands to fly out a crew to get simple broll

#

you can have ai generate it

misty vault
#

assistant

elder rapids
#

has more relevant knowledge of things

sacred plaza
small haven
#

> [model code]
> how would you improve this model for increased accuracy vs. baseline
> o3 pulls up technical concepts that when integrated yields extra bps
> gemini pulls old school theoretical concepts that is really just minimal model tweaks that yields marginal results

solar hollow
#

is the new deepseek good?

elder rapids
#

and I'm not sure how any of what you said is relevant tbh

zinc ore
#

o3s knowledge base isn't as robust as 2.5 yeh

#

It's basically the reasoning advantage that makes o3 special

elder rapids
#

ye

#

although I wouldn't say 2.5 pro simply knows more

#

o3 just seems like a bigger model as is

#

but 2.5 pro seems to be focused on very relevant things in all domains

#

this is probably what makes Gemini Gemini tho tbf

#

deepmind seems to be hyper focused on clean data

small haven
elder rapids
#

and it has nothing to do with what I said

small haven
#

im just saying gemini is not practically useful when it comes to research aka in my case self improving a model 🤷

elder rapids
#

how is that relevant

small haven
#

frontend queries to assess gemini 2.5 pro is borderline useless

zinc ore
elder rapids
#

1.5 flash

#

they never had any muddiness

#

like how 4o and gpt 4 did

#

this was a problem with the early 3.5 sonnet as well

hushed needle
#

why is no one really talking about claude though? just input costs?

elder rapids
hushed needle
#

idk, it just seems like the general conversation is still mostly about o3 vs gemini when claude is supposedly a main contender now

misty vault
#

cwaude is my cutie patootie

elder rapids
zinc ore
elder rapids
#

for a reason tho

#

there's no hype

zinc ore
#

It's basically some openAI / xAI fans arguing with Google fans

#

That's 90% of convos in here

elder rapids
#

because they don't release

#

and they don't have substantial releases

small haven
#

lmao im not an oai dickriders, im a truth seeker

small haven
#

im using claude code/codex, im trying to be as unbiased as possible

#

claude >> codex, ive slept on it

zinc ore
elder rapids
#

there's no reason for you to have some of the opinions you have when they're just as irrational as the next

misty vault
hushed needle
#

idk i don't really see a difference between the normal "chatting" responses between gemini, openai, and claude. The only real time I see a difference is if I'm actually asking a model to "do" something, which claude seems to be able to "do" the most things.

Except geoguessing, o3 is king at that

elder rapids
#

so you can't posture it like you're unbiased

small haven
elder rapids
#

we're in an AI server

#

it's OK dawg

small haven
zinc ore
elder rapids
#

and that has nothing to do with what I said

hushed needle
small haven
#

pontification

elder rapids
#

if you really feel that it's gibberish then help yourself with using the tools that can interpret it

elder rapids
#

there's no point in saying it's gibberish then shift the dialectic, when solving the interpretation is open now, LLMs are tools

#

use them

#

you don't have to rely on me to keep clarifying

#

and when you do ask the AI to explain what I mean since that's genuinely how you feel

small haven
#

that italic tho im dead 😭

elder rapids
#

then we can get back to talking about it

elder rapids
#

survivorship bias if I've ever seen one

small haven
#

imperialistically

misty vault
elder rapids
#

those kinds of things are fine

zinc ore
small haven
#

good

wintry tinsel
small haven
#

even claude has o3 styled arrows lol

small haven
#

not like all claude users live in the us 😭

echo aurora
#

hey not sure what this has to do with AI so lets keep things on topic please

small haven
#

deepseek is the best thing ever invented, i use it everyday religiously

patent aspen
#

Claude 4 still only has 1 gym badge

#

o3 in the same position

#

Gemini 4 badges in with no intervention

elder rapids
#

any more benchmarks for redsword and goldmane

patent aspen
#

Pokemon gym badges

#

The most important benchmark

#

I'm being unfair to o3. I think it only started a few days ago and had less time for testing

#

Claude is clearly worse at Pokemon tho

zinc ore
#

Claude has the least scaffolding

patent aspen
#

I believe the Gemini second run and Claude 4 first run started at the same time and o3 started several days later

zinc ore
#

Gemini started before Claude 4 released

patent aspen
#

No I mean I think they did at least one reset to line up with Claude

#

I could be wrong but I'll double check

zinc ore
#

That Gemini developer also seems pretty competent with the scaffolding he designs

patent aspen
#

Yeah I mean it's kind of a silly comparison

#

"Q: What's different in the second run?
A: The first "second run" launched on May 17 (PDT). On May 22 we reset the run so it could start in lock-step with ClaudePlaysPokemon's Claude 4 relaunch. The fresh run is identical to the aborted one (with minor harness improvements), but now viewers can watch Gemini and Claude side-by-side and compare how each LLM tackles the exact same game from the same starting point, exploring each model and their harnesses' strengths and weaknesses. This is purely for fun, so don't treat this as a serious race! Follow both streams together here (multi-view)."

#

I think the consensus is that Claude's biggest limitation with pokemon is still vision, although it's definitely not apples to apples because of the different harnesses

zinc ore
#

Gemini uses a Pathfinder (technically calling another Gemini instance, but instructed to use various pathfinding techniques), + map, + map data edits in various troublesome areas.

#

I do think Gemini is better at the game than the other two, but I think Gemini is getting pretty huge help compared to Claude.

patent aspen
#

Ah I thought they at least used a similar path finder algo

zinc ore
#

I'm not sure how Claude's navigation works. When I started following Gemini's stream it was long after Claude was stuck in an infinite loop at mt moon (3.7).

Gemini's Pathfinder can solve puzzles and do something like 30-50 steps if needed.

#

Like it's honestly super robust, and it also has a Pathfinder specifically for the rock puzzles.

#

The dev found a way to prompt a different instance of Gemini so that it can emulate actual path finding algorithms and do it competently.
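
For reference, this is the kind of "actual path finding algorithm" being emulated: a minimal breadth-first search sketch on a tile grid. It is not the harness's code (there, per the description above, a second Gemini instance is prompted to do the search); the grid, the `#` wall marker and the function name are made up for illustration.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a grid of strings; '#' cells are walls.

    Returns the list of (row, col) cells from start to goal,
    or None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and nxt not in came_from):
                came_from[nxt] = cell
                queue.append(nxt)
    return None

# Hypothetical 3x4 map: '.' is walkable, '#' is blocked.
GRID = [
    "..#.",
    ".##.",
    "....",
]
path = shortest_path(GRID, (0, 0), (0, 3))
print(len(path) - 1, "steps:", path)
```

BFS guarantees the fewest steps on an unweighted grid, which is why a 30-50 step route is easy for an algorithm but hard for a model moving one button press at a time.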

narrow elbow
#

i want to see Claude without ASL-3, is that the true beast? or just an excuse for ability?🤪

patent aspen
#

I can't even imagine how Claude will fare with the boulder puzzles...

zinc ore
#

Gemini actually solved most of them on its own, but then never would do the longest one (gotta push rock across most of map) until dev made Pathfinder able to do the rock movements in its logic.

patent aspen
#

The dev for the o3 bot got OAI to fund his project

#

Don't know how much they paid him

#

Maybe just uncapped o3 access

zinc ore
#

Probably that, gem dev has same thing

#

Uncapped free use

patent aspen
#

The Gemini bot also got a huge shoutout at I/O within the first 3 minutes that nobody except 5 people on Twitter understood

zinc ore
#

All the viewers watching Claude stream and ignoring Gemini/o3 streams lol

#

250 views vs 50 for the other two

patent aspen
#

Yeah tbf Gemini is less exciting because it already completed one run

zinc ore
#

Dev should have had it play a diff Pokemon game or something, but probably too tedious having to do all the map stuff again

patent aspen
#

And the Claude 4 announcement specifically called out that it should be better at Pokemon than it was previously

zinc ore
#

Like basically, run a gauntlet of diff Pokemon games so it stays fresh

#

That's literally why I've barely watched the second run, the novelty is gone

patent aspen
patent aspen
zinc ore
#

Yeah but we already know it should succeed

patent aspen
#

As an engineer, I wouldn't expect it to succeed with zero intervention by default, but I get you

#

Certainly for any normal audience

zinc ore
#

I have a theory that Gemini 2.5 has some reasoning training based on Dreamer (no idea if anyone here knows what that is), which I think might be part of why it does well at pokemon

patent aspen
#

What is Dreamer?

zinc ore
#

Here we present the third generation of Dreamer, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behaviour by imagining future scenarios. Robustness techniques based on normalization, balancing and transformations enable stable learning across domains. Applied out of the box, Dreamer is, to our knowledge, the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.

https://www.nature.com/articles/s41586-025-08744-2

patent aspen
#

Oh sounds like an evolution of the MuZero stuff

#

Yeah I would figure as such

civic flame
#

anthropic had a much weaker one

calm sequoia
#

Do models only have Python interpreters available through tool calling?

small haven
#

wen deepthink
wen o3 pro

hollow ocean
calm sequoia
#

Man thats crazy

#

I think Claude 4 Opus is the new king

#

The Gemini is awesome but it's focused on code and lost other abilities

#

Opus is just the new SOTA at everything

#

Should I buy subscription hmm

hollow ocean
small haven
calm sequoia
#

How does pricing compare when you buy Claude directly vs via cursor

small haven
#

match my plan

small haven
#

limits reset every 6 hrs

calm sequoia
#

Ah so you can sign in to your Claude account via Cursor?

small haven
#

hmm no clue, i just use claude code cli

#

there is this tho, but idk if it supports cursor, should be

calm sequoia
#

ok thanks peasant

hollow ocean
small haven
hollow ocean
small haven
#

more precise yes, but also latency lol

hollow ocean
#

Yeah

woeful geyser
#

Yo who is stephen? It keeps answering in Chinese, even when I ask it in other languages.

soft kernel
#

Is it just me, or is anyone else experiencing beta and the regular site not working at all?

#

Like one second the model is working correctly the next second nothing

ocean vortex
#

R1 has an interesting tendency to cheat or take shortcuts catgrin

#

the instruction was "You must compute this manually as code interpreter is unavailable at the moment (sorry for the inconvenience)" lmao

#

added it there since it was hallucinating running the code with concise responses

#

this helped by quite a bit but it still couldn't help itself to not do beyond 16k output 👀

tall summit
cedar tide
#

Anyone want to try prompts for Stephen? I currently have it

torn mantle
#

not sure

cedar tide
cedar tide
torn mantle
#

david seems like you know a lot of stuff

sweet swift
#

Is it possible to reboot/cancel the generation?

#

on beta.lmarena

torn mantle
misty vault
#

thanks that answer helped

torn mantle
sweet swift
#

he is generating for over two weeks

torn mantle
#

😭

tall summit
#

HAHAHA

cedar tide
echo aurora
calm sequoia
torn mantle
#

and they call it a minor update

#

are they flexing or what?

#

nah they on something

calm sequoia
#

I think they improved the CoT and decided to ship it to R1 as well, instead of only to R2, which is very honorable

#

So R2 would then be the same CoT with V4?

sweet knot
#

Hi guys, I'm new here and I may have missed some information. But today I went to the LMArena website and saw it had a completely new interface, and I couldn't find most of the "exotic" LLMs I saw there before. Have they been discontinued or is it temporary? Thank you in advance!

golden ocean
cedar tide
torn mantle
#

harder problems solved -> more tokens -> average increases
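
A small sketch of that effect, using made-up token counts (hypothetical numbers, not measurements from any model): if a new version spends the same tokens on easy problems but additionally solves hard ones that need long reasoning, the average tokens per solved problem rises even though nothing got less efficient.

```python
from statistics import mean

# Hypothetical token counts per solved problem; not real data.
easy_solved = [2_000, 3_000, 2_500]
hard_solved = [20_000, 30_000]

def avg_tokens(solved_runs):
    """Average output tokens over the problems a model actually solved."""
    return mean(solved_runs)

old_model = easy_solved                 # only solves the easy problems
new_model = easy_solved + hard_solved   # same easy cost, plus the hard ones

if __name__ == "__main__":
    print("old avg:", avg_tokens(old_model))
    print("new avg:", avg_tokens(new_model))
```

The per-problem cost on the easy set is identical in both lists, so the jump in the average comes entirely from the extra hard problems being counted.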

cedar tide
main gulch
torn mantle
#

lol

#

david did you forget about that

#

i told you dont compare only tokens

cedar tide
torn mantle
#

try to use a ratio of something

#

tokens / intelligence or tokens / cost

cedar tide
#

you are funny

torn mantle
#

how do you know

#

are you smarter than o3?

#

david > o3 > gemini > claude > deepseek ?

cedar tide
torn mantle
#

yea but you said its bad

#

why is it bad?

#

for the cost and intelligence provided its good

cedar tide
torn mantle
#

you deleted your message?

cedar tide
#

I'll pay you €300 if you can find where I said it's bad

cedar tide
torn mantle
#

nah you deleted it

cedar tide
torn mantle
cedar tide
#

you misunderstood my message, I didn't say the model is bad, but it's "bad" that the improvement also comes with a disadvantage

#

@torn mantle

#

otherwise yes, it joins 2.5 flash and grok 3 mini as the best quality for the price

torn mantle
#

more intelligence = more tokens used in some cases?

#

thats why it solved those hard-problems

#

i mean i understand it doesnt look efficient but maybe its part of the process

#

the best reasoning cot that i always enjoy reading myself is deepseek's, and with this new update it just delves deeper into concepts

#

unlike qwen/grok/claude

cedar tide
#

@torn mantle no comment ?

#

david > Einstein > o5 pro > gemini 4 ultra deep think

torn mantle
#

so we cant really compare that

cedar tide
#

@torn mantle less tokens and better

ocean vortex
#

Ideally we should have different finetunes for tools and no tools, as workflows and expected responses are quite different depending on it...

willow grail
#

how do i use veo 3 via api?

sweet knot
# golden ocean https://legacy.lmarena.ai/

Thank you for your reply, but the question was whether the number of LLMs has been reduced in general. I see the legacy version still has many exotic LLMs. Does that mean the new LMArena and the legacy version have different model sets? Thank you in advance.

misty vault
alpine coral
alpine coral
sweet knot
sweet knot
cedar tide
quiet folio
#

smartest no pfp discord user

alpine coral
# cedar tide

yeah there may be more to it.. but seems basically the increased performance is a function of more test time compute (/tokens), rather than a significant change in the underlying model

cedar tide
#

Yes

fleet lintel
#

what is Meta's AI strategy? Isn't Deepseek completely decimating it?

alpine coral
#

whereas with the flash 2.5 improvements, it's arguable that achieving that with fewer tokens reflects changes to the actual model (ig during the fine tuning process; though i also get lost when it comes to cpt of the base model and where that fits in ha)

cedar tide
#

while Google increases the reasoning without improving the model 😶

#

I hope this will be fixed in the stable version

alpine coral
#

i doubt it..

keen beacon
keen beacon
alpine coral
patent aspen
#

They're not really a cloud provider that needs a super general model

alpine coral
keen beacon
#

I haven't really used the new r1 but could see it being true tbh

alpine coral
cedar tide
#

The last version just made it better at webdev, video analysis, and LMArena, but not on the rest of the benchmarks or on efficiency

#

This is the latest version of Flash, which is normally more efficient

sour spindle
#

FOMO got the best of me and I bought a month of Claude 4.

#

Very good models, I have been enjoying them more than I thought

ocean vortex
torn mantle
#

we need to know how AAI Index is calculated

cedar tide
torn mantle
#

whats the formula

#

so we can judge it

cedar tide
ocean vortex
#

The way I see it, more test-time compute is not nothing. It can easily be the defining difference between an accurate answer and a wild hallucination

#

A model does not necessarily need arch changes to be substantially improved with training (fine-tuning)

#

the tricky part is making it do that without making it overly verbose for tasks where that's not called for. OpenAI seems to have nailed this the best tbh. Their models are not generating the most reasoning tokens on average, but their peak reasoning lengths seem almost unlimited when task calls for it

sacred plaza
ocean vortex
#

it's based on those individual benchmarks that they ran anyway

elder rapids
#

deepseek keeps hallucinating bruh

ocean vortex
sacred plaza
ocean vortex
#

honestly I'm curious how new R1 runs on their official website

#

should probably try it, maybe they gave it some actual tools it can use given how it's responding....

elder rapids
#

but it's the same shi

#

good tool tho

#

its personality helps with some tasks

ocean vortex
#

artificial-analysis literally runs individual benchmarks and that's what I'm referring to

#

combined singular score is mostly just for ... who are too lazy to read it properly

sacred plaza
ocean vortex
#

a good part of that score is coding

#

you wouldn't want to use a model A that is overall slightly better than model B, but only because it's much better at coding and worse everywhere else... ?

#

Just an example but you get the idea
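
To make that concrete, a sketch with invented per-category scores (hypothetical, not taken from Artificial Analysis or any real benchmark): a single averaged score can rank one model higher purely because of coding, and the ranking flips once coding is left out.

```python
from statistics import mean

# Hypothetical per-category scores (0-100); not real benchmark numbers.
model_a = {"coding": 90, "math": 60, "writing": 55, "knowledge": 58}
model_b = {"coding": 60, "math": 66, "writing": 65, "knowledge": 68}

def overall(scores):
    """Unweighted average across all categories."""
    return mean(list(scores.values()))

def overall_without(scores, skipped):
    """Average over every category except the one the user doesn't care about."""
    return mean([v for k, v in scores.items() if k != skipped])

if __name__ == "__main__":
    print("overall:   A =", overall(model_a), " B =", overall(model_b))
    print("no coding: A =", overall_without(model_a, "coding"),
          " B =", overall_without(model_b, "coding"))
```

Which categories to drop or reweight depends on the user's workload, which is the reason for reading the individual benchmarks rather than the combined number.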

#

for the purpose of this convo at least, we can consider those equivalent. If he doesn't code then he does not care about anything code related

sacred plaza
#

which benchmarks do you evaluate when a new model pops up? all of those or just a handful? assuming you do coding work

elder rapids
#

if you don't care about anything code or math related you're not gonna want the higher numbers in the total avg

#

since that's not being quantitated, at least effectively

sacred plaza
#

i can't stand this unit of measurement!

misty vault
sacred plaza
#

thanks for sharing those points. i try to not let the leaderboards determine which models i use. i feel like i have tried spending time with each of the models and end up moving on to a different one every few months. was a big claude user but for 2025 i have moved everything over to gemini

#

i thought the whole point of models was capability? 😄 agree that the reasoning models at the moment all seem fairly similar and the vibes are what separates which model people choose.

#

claude user limits and the $200/month felt like a slap in my $20/month plan face

#

love anthropic but they don't give a f about their individual customer clients, lol

elder rapids
sacred plaza
#

i meant in terms of prioritizing gpus for enterprises over us. is that not the reason for their user prompt limits?

keen beacon
#

u have claude max though

sacred plaza
#

the rate limits from claude led me to gemini and their 1-2 million context window

elder rapids
#

you are money

#

spend the money

sacred plaza
#

fair point!

elder rapids
#

out of all the frontier models

keen beacon
#

they were the first to introduce 100k context tho

elder rapids
#

so there's going to be the biggest difference

sacred plaza
keen beacon
#

2m context gemini is coming back i think

#

2.5 pro can do 2m context its just not exposed rn

#

did u know claude 3 had 1m context too?

elder rapids
keen beacon
#

it was never made public

sacred plaza
#

i like the claude 4 model based on vibes. made it my default model on perplexity

elder rapids
#

yk Google brodie

#

they're doing something about that

sacred plaza
#

why did they make this change?

elder rapids
#

how if they're going to bring it back lmao

keen beacon
#

nah

#

nobody loses money on the api

#

at least today

#

idk tbh. google/anthropic/openai/deepseek are not

#

deepseek making bank

elder rapids
#

deepseek is NOT making bank, at least from AI itself

#

nah

#

theyre operating at a loss

keen beacon
#

they released the numbers bro

elder rapids
#

moe doesn't mean anything

#

since that's not how it works

keen beacon
#

its not only moe, its a combination of multiple factors

elder rapids
#

talking to Craig

fallen jacinth
#

Recent changes are not encouraging. It's a pity about the previous version. Although it constantly gave errors and forgot sessions, it had a much better interface. In the new one, merely four lines remain for LLM output. ☹️

elder rapids
elder rapids
fallen jacinth
elder rapids
#

oh you're talking about LMarena

fallen jacinth
#

Sorry, isn't this the LMArena channel? If not, I apologize

keen beacon
# elder rapids wym

https://xcancel.com/danielhanchen/status/1895698283588468785 (i misremembered, the numbers themselves weren't released, but the information required for a reasonable guess was released i believe)

misty vault
echo aurora
elder rapids
#

although thats a good sign

#

thanks

calm sequoia
#

Man that's crazy

wintry tinsel
#

Imagen 4 ultra is no slouch

echo aurora
pliant cypress
misty vault
#

real

#

gpt-4-0314 open source today i think

#

no way

echo aurora
misty vault
wintry tinsel
torn mantle
#

the og

misty vault
#

the og

golden ocean
#

the og

gloomy crown
#

I use LLMs for translation almost every day. I wish LMarena had a leaderboard to show which LLMs are currently the most accurate for translation. Am I the only one?

echo aurora
tall summit
#

now that you bring it up, i agree because translation is so important

sage raptor
#

you can use google translator

misty vault
#

Large Language Model

tall summit
sage raptor
#

yes

tall summit
#

and it is very clear to any llm user that besides edge cases

#

llms are better at translation too

#

i feel like theyre getting better but honestly i wonder

#

depends on the language too

fallen jacinth
balmy mist
#

wassup?

patent aspen
#

Google Translate started heavily investing in gen AI in 2022, although they have to be careful because they have a mature stable product

balmy mist
#

can we bring back the relevant ai news, why was it deleted? @echo aurora

#

its hard to keep up with news now

#

whatt!!

#

u post the twitter post?

#

😦

#

-_-

echo aurora
keen beacon
#

i think paws started a thread like that

#

he nuked everything

#

it wasnt an official channel

misty vault
patent aspen
#

DeepSeek ^

keen beacon
tall summit
#

as a replacement

#

use it

#

make it active again

misty vault
misty vault
#

no way it(new deepseek) has even more of chatgpts cancerous style now

#

thats so cringe, unrealistic and awful but whatever

echo aurora
balmy mist
#

it has been so helpful

balmy mist
balmy mist
misty vault
#

craig trying to not post gork 3.5 release dates in the new channel challenge

echo aurora
misty vault
#

your master

#

get back in the dungeon

mystic mica
#

Is it possible that the site rejects some prompts it finds too nsfw? Even though they are about mental health and not erotic

keen beacon
mystic mica
#

I don't think that was happening in the earlier versions

#

does the arena do that?

#

or the models?

primal orbit
#

Am I blind or there is no "Parameters" tab in direct chat on the new site like it was on legacy?😢

misty vault
mystic mica
#

Like someone will good on his narcissist ex being abusive and cheating on him with his best friend

mystic mica
#

because the same words in another prompt are a okay

primal orbit
#

I had the same prompt rejected as a whole, but accepted in 3 parts..

#

It's probably because it's too big though, i'm not certain. But didn't happen on the legacy version.

zinc ore
#

Deepseek might end up being better than grok 3.5

echo aurora
ocean vortex
echo aurora
misty vault
#

real

sudden sail
#

hola

echo aurora
echo aurora
sick kettle
#

hi everyone

echo aurora
sick kettle
echo aurora
sick kettle
kind cloud
#

redsword is returning an api error now

#

but goldmane is still working

small haven
#

o1 pro cot is unusually longer, interesting ...

#

o3 pro soon baby!

small haven
#

why u gotta troll me at midnight 😭

mystic mica
#

Today I can confirm that the prompts that the free Version of ChatGPT accepts are rejected by arena

kind cloud
elder rapids
#

this can't be

#

btw if that's true it's either much worse than goldmane as far as testing goes, or it's getting ready for release soon

#

ion think it's them trying to replace it with an even better variant

#

so it's prob the former

ocean vortex
#

mic drop?.. I think Claude 4.0 is the most unique model now for sure lol

tall summit
#

claude 4 my beloved

#

LOOOOOL

keen fulcrum
#

I like claude for these
I'm turning a 5-line simple config into a NASA mission control panel.

calm sequoia
ocean vortex
#

it was an older base model than the current o3

calm sequoia
#

No

#

It's different model

#

It was nerfed before release

#

And it reached 15%

#

At high

ocean vortex
#

Bro.. I just told you how it is. Lemme try finding arc-agi blogpost about it

calm sequoia
#

Are these results semi-private V2 results?

#

They dont disclose everything man

ocean vortex
#

old o3 was old base model. They didn't "nerf" anything. It was parallel processing, that's what "cot + synthesis" means

#

the cost of that was insane

calm sequoia
ocean vortex
# calm sequoia

yes it didn't qualify for the leaderboard due to the price being insanely high

calm sequoia
#

This part is not important really for this conversation

ocean vortex
calm sequoia
ocean vortex
#

they were never going to release a model that is a pro version of a pro lmao

calm sequoia
#

Due to costs, which is temporary

ocean vortex
#

and even if they would no one would be able to use without going bankrupt

#

it was an experiment and totally not a representative model

#

this is more like google's olympiad math model

#

AlphaGeometry

calm sequoia
#

I see it as glimpse to the future

#

As the compute gets cheap

#

Anyway, this is the actual flop

ocean vortex
#

it must be <$10k, that other o3-preview version was not

calm sequoia
#

This rule is a new thing and not so important. The costs can always be solved by engineering and infrastructure. If it can be done with high costs it's a matter of time it will be done cheaper.

ocean vortex
# calm sequoia I see it as glimpse to the future

IMO it's a more distant glimpse than improving the base model. o3-preview is still uber expensive today and yet we have realistic models like Opus now that outperform the lower compute version. Only a matter of time until we have cheap models outperforming o3-preview high compute

calm sequoia
#

Just realized the Gemini 2.5 Pro March release didn't disclose how much it cost to run arc-agi 👀

ocean vortex
#

in other words... training progress is much faster than compute progress

#

o3-preview was terribly inefficient

calm sequoia
#

Yeah but the training progress is not guaranteed

#

If it worked now doesn't mean it will work in the future

#

And yet the cost issue is guaranteed to be solved

ocean vortex
#

I mean it's just not meaningful to think of o3-preview as something that will become cheaper. It's a very distant future relative to the progress of AI as a whole

calm sequoia
#

It will not be solved in months, but in years, so yeah. I don't want to argue with you as I agree with you on everything, except that I think the o3 result shouldn't be neglected

ocean vortex
#

if you took those same weights 2 years from now, 99% that it would be still very expensive to host it. GPUs are improving but not at this rate

#

by "same weights" I mean same model and also the same way they ran it (parallel processing / synthesis)

calm sequoia
#

"GPUs have historically seen very rapid performance increases, often outpacing Moore's Law for CPUs in terms of raw computational power (especially for parallel workloads)"

ocean vortex
#

and it's almost all from training, not different hardware making you able to run bigger models

#

models got smaller if anything

calm sequoia
#

o3's cost per task is 3474, meaning we need about 8 years to get under 200 USD if Moore's law is applied
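
The arithmetic behind that estimate can be sketched as follows. The 3474 and 200 figures come from the message above; the assumption that cost per task halves every two years is a loose reading of Moore's law, not a measured trend, and `years_to_reach` is a name made up for this example.

```python
import math

def years_to_reach(start_cost, target_cost, halving_years=2.0):
    """Years until start_cost falls to target_cost if cost halves every halving_years."""
    halvings = math.log2(start_cost / target_cost)
    return halvings * halving_years

if __name__ == "__main__":
    # 3474 / 200 is about 17.4x, i.e. a bit more than four halvings.
    print(round(years_to_reach(3474, 200), 1))
    # Same calculation with an assumed 18-month halving period.
    print(round(years_to_reach(3474, 200, halving_years=1.5), 1))
```

With a two-year halving period the result is a little over 8 years, which matches the estimate; a faster assumed halving period shortens it proportionally.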

ocean vortex
calm sequoia
#

Unless we will not be as lucky as this year

ocean vortex
#

well 8 years ago... I don't think we even had gpt1

calm sequoia
#

GTG. To conclude, in my opinion, compute cost reduction is guaranteed and training progress is not, as we are on an unknown path.

ocean vortex
#

gpt1 was released smth like 2018 iirc

#

or 2019

calm sequoia
#

But we could, just didn't do it, because it was too expensive to try

#

Just like the o3-high-super is too expensive right now

ocean vortex
#

I'm not saying we shouldn't try it... AlphaGeometry was a useful experiment as well. But you need to view it as such. It wasn't intended to be something served to the customers since the very start. OpenAI did some clever marketing but they knew all too well they needed a different model before release

#

gpt1 was more realistic at the time though. For the end-user to run it I mean

calm sequoia
#

There are use cases where such "experiments" can deliver. For example, AlphaFold. Applying a search-like algorithm such as o3-high-super's to some niche space could also deliver, it's just not for layman software developers 😄

calm sequoia
ocean vortex
calm sequoia
#

It's marketing if they decided that before doing the experiments, but their motivation most likely was different.

ocean vortex
#

but nothing wrong with pushing the limits for sure. You just don't view those as equivalent to consumer models - which is why that high compute version is not on the leaderboard 😉

calm sequoia
#

True. The models may even be used for private use cases, e.g. military, pharma, etc.

cedar tide
#

Claude 4 thinks much less than 3.7?

ocean vortex
# cedar tide Claude 4 thinks much less than 3.7?

ehmm interesting. Though to be fair 3.7 maxed out was incredibly wasteful with thinking. Like it was taking ages for me to test it. Like it arrives at the answer with ~4k output, then forms the final response at ~20k output generated... catgrin

alpine coral
alpine coral
#

i think that's impressive - though i'm still unsure about sonnet 4.. it seems to fall short compared to 3.7 more than i would have expected..

dusky aurora
#

Claude 4 Opus periodically has times of being unavailable

#

when "there's been an error" for any prompt

cedar tide
cedar tide
#

but it is also the 2nd lowest rated reasoning model on artificial analysis (after the old R1)

ember rapids
#

o3 preview in dec showed us they can push this thing significantly further