#general | Arena | Page 43

storm needle May 20, 2025, 10:39 PM

#

you already said that o3 would be slow and that's not true

patent aspen May 20, 2025, 10:39 PM

#

storm needle you already said that o3 would be slow and that's not true

?

storm needle May 20, 2025, 10:40 PM

#

.

misty vault May 20, 2025, 10:41 PM

#

small haven May 20, 2025, 10:41 PM

#

theres a big problem using rust in codex, every time u ask something it has to recompile all the crates from scratch, takes a good 5 mins before it can even start (well, it depends on the project's cargo setup)

willow grail May 20, 2025, 10:55 PM

#

https://i.imgur.com/iVlsPqR.png

patent aspen May 20, 2025, 10:55 PM

#

storm needle .

Oh sorry I see it now. At that time, I was speculating based on the FrontierMath benchmark story wherein they ran their models for hours for one problem

willow grail May 20, 2025, 10:55 PM

#

anyone has ideas how to only pay for tramway ticket when i see inspectors entering the tramway i am in?

patent aspen May 20, 2025, 10:59 PM

#

storm needle .

I also rarely have a complete up-to-date picture of what is going on inside OAI. I do know people in the know and hear things from time to time

unborn ocean May 20, 2025, 11:01 PM

#

i believe that was common 'knowledge' among the people speculating

#

on what?

#

nvm (read the context)

misty vault May 20, 2025, 11:05 PM

#

willow grail anyone has ideas how to only pay for tramway ticket when i see inspectors ente...

willow grail May 20, 2025, 11:07 PM

#

misty vault

no jailbreak?

#

thats not a thing

#

no

barren prairie May 20, 2025, 11:10 PM

#

misty vault

Wait is that Gemini ??? Why he is like that ????

misty vault May 20, 2025, 11:14 PM

#

willow grail no jailbreak?

correct

misty vault May 20, 2025, 11:14 PM

#

barren prairie Wait is that Gemini ??? Why he is like that ????

gpt-4-preview

willow grail May 20, 2025, 11:15 PM

#

send me try me

willow grail May 20, 2025, 11:15 PM

#

misty vault gpt-4-preview

how do u select this..... what is gpt4 preview XD i dont remember this ever

#

wait so its not on bing?

blazing rune May 20, 2025, 11:28 PM

#

4B should be faster

#

But if you do 4B, do Q6_K

#

Oh, if you have 6gb of VRAM, q5_k_m might be better
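A rough way to sanity-check the quant advice above is to estimate how much memory the weights alone take. This is only a sketch: the bits-per-weight figures are approximate values for llama.cpp-style k-quants, and the `overhead_gib` allowance for KV cache and runtime buffers is a hypothetical placeholder, not a measured number.

```python
# Approximate bits per weight for two GGUF quant types (assumed values).
APPROX_BITS_PER_WEIGHT = {
    "Q6_K": 6.5625,
    "Q5_K_M": 5.7,
}


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GiB (weights only, no KV cache)."""
    return n_params * bits_per_weight / 8 / 2**30


def fits_in_vram(n_params: float, quant: str, vram_gib: float,
                 overhead_gib: float = 1.5) -> bool:
    """True if weights plus a guessed overhead fit in the given VRAM."""
    needed = weights_gib(n_params, APPROX_BITS_PER_WEIGHT[quant]) + overhead_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
        size = weights_gib(4e9, bpw)
        print(f"4B @ {quant}: ~{size:.2f} GiB weights, "
              f"fits in 6 GiB: {fits_in_vram(4e9, quant, 6.0)}")
```

Under these assumptions both quants of a 4B model fit on a 6 GB card; the smaller one just leaves more headroom for context.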

willow grail May 20, 2025, 11:33 PM

#

@misty vault i cant find that model on copilot

blazing rune May 20, 2025, 11:33 PM

#

That should be at least 20 tokens per second if you set it up correctly (the 4b)

patent aspen May 20, 2025, 11:35 PM

#

lmao this is an on-device model for mobile

blazing rune May 20, 2025, 11:40 PM

#

I'm talking about a 4B dense model on his 6GB laptop GPU

blazing rune May 20, 2025, 11:40 PM

#

patent aspen lmao this is an on-device model for mobile

Yeah, the leaderboard is terrible

dapper storm May 20, 2025, 11:40 PM

#

will o3 pro be in the arena

patent aspen May 20, 2025, 11:41 PM

#

blazing rune Yeah, the leaderboard is terrible

Or Gemma is just good at human preference

blazing rune May 20, 2025, 11:41 PM

#

If it actually had something to do with how fun or relatable a model is, that would be fine. But it is just using stupid phrases

willow grail May 20, 2025, 11:41 PM

#

shitification?

blazing rune May 20, 2025, 11:42 PM

#

And more importantly, it depends on the formatting and how many headings are in the answer

misty vault May 20, 2025, 11:43 PM

#

willow grail wait so its not on bing?

willow grail May 20, 2025, 11:44 PM

#

is that poe.com

patent aspen May 20, 2025, 11:45 PM

#

Yeah it seems to be all style control

willow grail May 20, 2025, 11:47 PM

#

doesnt say gpt4 preview

misty vault May 20, 2025, 11:48 PM

#

the non preview gpt-4 also needs no jailbreak

#

Bro some guy literally sent in this chat once

#

It is already supposed to

torn mantle May 20, 2025, 11:49 PM

#

so

misty vault May 20, 2025, 11:49 PM

#

the original websockets lasted for a year

#

they died 1 month ago

#

but theres another one

torn mantle May 20, 2025, 11:49 PM

#

what product are you using after the keynote?

misty vault May 20, 2025, 11:49 PM

#

they forgor or something

#

Sydney

#

https://tenor.com/view/tinder-gif-20274188

misty vault May 20, 2025, 11:52 PM

#

torn mantle what product are you using after the keynote?

microsoft bing chat

torn mantle May 21, 2025, 12:11 AM

#

misty vault microsoft bing chat

lol

#

i really think flowith is powerful

#

i have yet to harness its full power

misty vault May 21, 2025, 12:12 AM

#

willow grail is that poe.com

#

#

I will suck **** for microsoft and openai to give me the fine tuned gpt-4 they use for bing

patent aspen May 21, 2025, 12:15 AM

#

Ew Bing

misty vault May 21, 2025, 12:32 AM

#

misty vault May 21, 2025, 12:34 AM

#

patent aspen Ew Bing

#

cedar tide May 21, 2025, 12:39 AM

#

New gemini 2.5 flash vs old

Screenshot_2025-05-21-02-38-32-529_com.android.chrome-edit.jpg

golden ocean May 21, 2025, 12:40 AM

#

misty vault

🥺

storm needle May 21, 2025, 1:03 AM

#

dapper storm will o3 pro be in the arena

no

elder rapids May 21, 2025, 1:42 AM

#

torn mantle i really think flowith is powerful

it's been powerful asf

elder rapids May 21, 2025, 1:54 AM

#

elder rapids it's been powerful asf

it's unironically insane

#

I've been using neo AFTER chats with AI

#

which builds its "prompt" essentially

#

and then stemming from those context threads, it goes through everything and plans it extraordinarily well

elder rapids May 21, 2025, 2:22 AM

#

wasn't directly the best instruction follower imo, the reason why it appeared that way was because it inferred (correctly) based off of a deep implicit understanding like if you meant to emphasize something, it'd grasp what parts to prioritize. But granular details, or in this case implicature based counterfactual speech, it didn't comprehend at all

#

its inference made it so easy to talk to ye

vivid sandal May 21, 2025, 2:28 AM

#

is this happening only to me, or maybe there's a maintenance of sorts...?

elder rapids May 21, 2025, 2:44 AM

#

elder rapids I've been using neo AFTER chats with AI

SOOOOO good bro

#

man I think I'm starting to like this AI trend

small haven May 21, 2025, 3:21 AM

#

so wen is exactly o3 pro

elder rapids May 21, 2025, 3:31 AM

#

Yo why is neo using Gemini

#

LMFAO

#

ITS USING GEMINI OVERVIEW FROM THE BROWSER TO CONDUCT ITS RESEARCH

#

LMAOOOOO

leaden palm May 21, 2025, 3:53 AM

#

elder rapids ITS USING GEMINI OVERVIEW FROM THE BROWSER TO CONDUCT ITS RESEARCH

what is "it"?

elder rapids May 21, 2025, 4:54 AM

#

leaden palm what is "it"?

agent neo

#

dawg o3 is so bad dude it's insane

#

ts can't understand a WORD I'm saying

#

holy

#

I'm genuinely getting mad

raven void May 21, 2025, 4:57 AM

#

Looks like nightwhisper was Gemini deep think 🤔

The performance is not ultra level I see why they haven't released it

#

O3 pro will cook it if so

iron cipher May 21, 2025, 5:01 AM

#

Remember when ChatGPT didn’t save chat history

#

or when Gemini was named Bard?

small haven May 21, 2025, 5:10 AM

#

raven void Looks like nightwhisper was Gemini deep think 🤔 The performance is not ultra ...

dont think so, i think its just a refined 2.5 flash/pro, and if it is, then its gonna be a flop, because o3 >

small haven May 21, 2025, 5:49 AM

#

oh shxt got jules

#

gonna compare it to codex

#

aight back to codex...

alpine coral May 21, 2025, 6:56 AM

#

same here (lines up with my actual usage very neatly and consistently). it's a great benchmark imo

keen fulcrum May 21, 2025, 7:20 AM

#

Best AI sub

keen beacon May 21, 2025, 7:31 AM

#

I'm really missing the raw thoughts of Gemini 🥲 (you can get around it but you might be degrading perf)

torn mantle May 21, 2025, 7:32 AM

#

https://x.com/btibor91/status/1925084250107478506

Tibor Blaho (@btibor91)

"Claude 4 is here" - "Try Claude Sonnet 4 and Claude Opus 4 today"

"Try Claude Sonnet 4 or Claude Opus 4 for Anthropic’s smartest models yet."

"Not intended for production use. Subject to strict rate limits"

"show_raw_thinking" / "show_raw_thinking_mechanism"

(not available

#

"subject to strict limits"

#

Whats with these labs lately

elder rapids May 21, 2025, 7:37 AM

#

torn mantle https://x.com/btibor91/status/1925084250107478506

"show raw thinking"

#

everything reminds me of her

#

elder rapids May 21, 2025, 7:42 AM

#

keen beacon I'm really missing the raw thoughts of Gemini 🥲 (you can get around it but you...

ye I'm thinking it would degrade performance too

#

or at least context recall

misty vault May 21, 2025, 7:42 AM

#

torn mantle https://x.com/btibor91/status/1925084250107478506

is this agi?

elder rapids May 21, 2025, 7:42 AM

#

watch them be mid asf

pallid crypt May 21, 2025, 7:43 AM

#

if it was they would be bragging about their ARC AGI score alr like OpenAI was with o3

torn mantle May 21, 2025, 7:46 AM

#

Yea its you

torn mantle May 21, 2025, 7:46 AM

#

misty vault is this agi?

Asi

torn mantle May 21, 2025, 7:46 AM

#

elder rapids watch them be mid asf

Really doesnt matter with these ai labs anymore

keen beacon May 21, 2025, 7:46 AM

#

Claude 4 is probably gonna be good but if it doesn't have multimodal capabilities like other models, it's not a good sign

torn mantle May 21, 2025, 7:47 AM

#

And knowing how strict already anthropic is, model gonna be good but it wont be usable for a while

#

@tall summit why did you delete lmao

calm sequoia May 21, 2025, 7:56 AM

#

Does o4-mini imply the existence of o4 full?

calm sequoia May 21, 2025, 8:47 AM

#

They are still silent for o3-high-pre-nerf reaching 15%

misty vault May 21, 2025, 9:05 AM

#

calm sequoia They are still silent for o3-high-pre-nerf reaching 15%

o3-high-pre-nerf is agi

keen fulcrum May 21, 2025, 9:19 AM

#

misty vault o3-high-pre-nerf is agi

Are you hallucinating?

#

No nerf

misty vault May 21, 2025, 9:22 AM

#

keen fulcrum Are you hallucinating?

stfu

#

imma touch you

keen beacon May 21, 2025, 9:23 AM

#

calm sequoia They are still silent for o3-high-pre-nerf reaching 15%

probably because each task took nearly 3,500$ to run

calm sequoia May 21, 2025, 9:23 AM

#

I know all the info, but it's still notable

keen beacon May 21, 2025, 9:25 AM

#

calm sequoia I know all the info, but it's still notable

probably but 3.5k dollars for a single task is crazy though

calm sequoia May 21, 2025, 9:26 AM

#

Yeah, but it delivered. There are many use cases where this money is nothing

misty vault May 21, 2025, 9:29 AM

#

@cursive zodiac and butthead ahh username

alpine coral May 21, 2025, 9:31 AM

#

calm sequoia Does o4-mini imply the existence of o4 full?

imo unambiguously yes

alpine coral May 21, 2025, 9:33 AM

#

calm sequoia They are still silent for o3-high-pre-nerf reaching 15%

it is curious.. i don't know but ig it's like voluntary system (the labs have to offer up their model to arc to be run against the test - rather than arc testing all models themselves, which ig would mean exposing the question set through the API calls and contaminating the bench)

#

i think gem pro 2.5 is an equally conspicuous omission.. surely can't be about costs (google and oai could cover them if they wanted to)

#

[as a footnote.. also wild how models like o3-mini-low and sonnet 3.7 score 0 on arc2]

keen beacon May 21, 2025, 9:35 AM

#

alpine coral it is curious.. i don't know but ig it's like voluntary system (the labs have to...

they ran it though

alpine coral May 21, 2025, 9:35 AM

#

oh

keen beacon May 21, 2025, 9:36 AM

#

it cost 3.5k per task and got 15% on arc agi 2 (o3 preview)

#

results werent officially published iirc

#

for 2.5 pro idk lol

alpine coral May 21, 2025, 9:37 AM

#

i mean there must be a reason it's not on arc's official leaderboard? if they ran it - like why not 🤷‍♂️

#

might read the paper

#

(aka give it to gem pro 2.5 and ask it if it explains what's going on re omitted scores)

keen beacon May 21, 2025, 9:42 AM

#

alpine coral (aka give it to gem pro 2.5 and ask it if it explains what's going on re omitted...

wait i got the 2.5 pro scores

#

its also hidden

calm sequoia May 21, 2025, 9:46 AM

#

They are very low

#

Surprisingly

keen beacon May 21, 2025, 9:47 AM

#

Gemini-2.5-Pro-Exp-03-25 **

12.5% on v1 semi private
1.25% on v2 semi private

it was marked with a note:

* * Preview results: Results marked as preview are unofficial and may be based on incomplete testing. Models without available pricing information will not be shown on the efficiency chart. Results become official after complete testing is finished.

keen beacon May 21, 2025, 9:50 AM

#

calm sequoia They are very low

maybe thats why it wasnt listed lol

#

the note doesnt really make sense

calm sequoia May 21, 2025, 9:51 AM

#

For sure. I start to dislike how google position themselves with selected benches only, e.g. yesterday with elo scores

late path May 21, 2025, 10:01 AM

#

Is Claude 4 being tested in the arena now?

keen beacon May 21, 2025, 10:01 AM

#

late path Is Claude 4 being tested in the arena now?

anthropic doesnt do that (do pre-releases in arena afaik)

torn mantle May 21, 2025, 10:08 AM

#

anthropic are still stuck on how to make profit

#

on how to release their models

#

on how to decrease rate limits

#

im really not looking forward for any of their releases

#

because ive had enough

#

we asked for a better rate limit, they introduced credit system which made things even worse

keen beacon May 21, 2025, 10:12 AM

#

torn mantle we asked for a better rate limit, they introduced credit system which made thing...

their service goes down a lot compared to other companies i think. theyre struggling with serving i think

torn mantle May 21, 2025, 10:12 AM

#

keen beacon their service goes down a lot compared to other companies i think. theyre strugg...

after years

#

and they are still struggling

keen beacon May 21, 2025, 10:13 AM

#

torn mantle after years

not enough compute compared to other companies i guess

torn mantle May 21, 2025, 10:13 AM

#

its probably because its mostly used for coding -> more tokens compared to other models

#

its good for creativity too so -> more tokens generated

#

they wanted to balance that by nerfing the default/usual output with concise outputs on general questions

unborn ocean May 21, 2025, 10:15 AM

#

keen beacon not enough compute compared to other companies i guess

They can just rent more like everybody else, especially at the high prices they charge per token.

#

And I remember the ceo saying once that like "every two weeks something new gets invented that reduces compute needed at inference by like 30%"
(don’t remember exact wording or percentage)

unborn ocean May 21, 2025, 10:17 AM

#

unborn ocean And I remember the ceo saying once that like "every two weeks something new gets ...

Said that to take the wind out of deepseek post R1 hype

keen beacon May 21, 2025, 10:18 AM

#

unborn ocean They can just rent more like everybody else, especially at the high prices they ...

you'd think they would've basically resolved the issues by now if it were that simple

unborn ocean May 21, 2025, 10:18 AM

#

Yes, which is why I am thinking that it is either really more complicated or that they just don’t care

#

And the longer it takes the more I am inclined to think the second one

keen fulcrum May 21, 2025, 10:21 AM

#

Is poe considered to be the cheapest?

oak pythonBOT May 21, 2025, 10:58 AM

#

Permission Denied

You do not have permission to use this command. Permissions needed: Administrator.

alpine coral May 21, 2025, 11:26 AM

#

keen beacon the note doesnt really make sense

yeah.. i mean ig incomplete testing means just that.. but why only partially run the test.. (the sentence about pricing availability and models appearing in the efficiency chart does make sense.. but is oddly inserted b/w the two other sentences that seem related lol)

alpine coral May 21, 2025, 11:29 AM

#

keen beacon you'd think they would've basically resolved the issues by now if it were that s...

100%.. it's obvious by now that they're compute poor

alpine coral May 21, 2025, 12:05 PM

#

unborn ocean Yes, which is why I am thinking that it is either really more complicated or tha...

I don’t think it’s that complicated..there’s only so much physical hardware available at any given time .. and anthropic’s share of that is clearly less than oai and google's. companies lock in what they can.. those able to pay more get more (or in Google’s case, they just own the hardware outright).. if Anthropic suddenly got a bunch more capital, they could better compete with oai and others to secure GPU access, but that would be for the future. for now, they’re stuck with what they have.. which clearly isn’t enough

#

as I see it, they’re throttling usage on Claude chat and experiencing outages because they don't have the capacity to both serve current models at scale while also developing new ones (well.. in theory.. like honestly how long until claude 4 lol).. as Wild said, if there were a simple way to avoid this they would’ve done it already surely..

calm sequoia May 21, 2025, 12:55 PM

#

Would like to see gemini doing that 😎

sonic tendon May 21, 2025, 12:57 PM

#

calm sequoia Would like to see gemini doing that 😎

what were you doing

calm sequoia May 21, 2025, 12:58 PM

#

Rewriting super-complex signal processing function that took me 2 months to write once

#

Pre-nerf Gemini would have aced it, though

#

But something is going on with o3, it delivers two responses sometimes. One is often twice as good with twice as long thinking.

brittle tiger May 21, 2025, 1:03 PM

#

https://x.com/testingcatalog/status/1925174295845634439

TestingCatalog News 🗞 (@testingcatalog)

BREAKING 🚨: Native Speech Generation and Live Audio Generation with Gemini 2.5 Flash is now available on AI Studio!

4 new models from the Gemini 2.5 family 🔥

Confirmed 👀

civic flame May 21, 2025, 1:05 PM

#

torn mantle https://x.com/btibor91/status/1925084250107478506

YESSSSS WE HAVE AN OPUS

#

😍😍😍😍

#

tomorrow gonna be great

sonic tendon May 21, 2025, 1:06 PM

#

I'm sorta iffy about Claude 4 actually coming out soon

#

feels like way too much speculation over a minor backend change

#

esp considering that the safety testing event didn't happen that long ago and didn't seem to showcase a new model

tall summit May 21, 2025, 1:09 PM

#

sonic tendon feels like way too much speculation over a minor backend change

↑

#

speculation is so pointless

#

whoops i'm talking to red

#

who gambles based on speculation

#

no offense red

sonic tendon May 21, 2025, 1:10 PM

#

LMAOOOO

#

no no i get it

#

i was gonna say

civic flame May 21, 2025, 1:10 PM

#

sonic tendon I'm sorta iffy about Claude 4 actually coming out soon

honestly I'm not

#

well of course rumours are rumours however

#

it would make sense from my internal understanding

#

anthropic's team have been locked in recently hence the outward lack of shipping other things

tall summit May 21, 2025, 1:11 PM

#

of course that's a possibility

#

there is also a possibility they have actually not been working

#

and having no releases supports both theories

civic flame May 21, 2025, 1:12 PM

#

i am inclined to believe the theory is true because things have been focused on release

sonic tendon May 21, 2025, 1:12 PM

#

maybe - but in that case I'd still be surprised if they didn't do the volunteer redteaming thing on it directly

civic flame May 21, 2025, 1:13 PM

#

anthropic's safety stuff is less to do with the model itself nowadays and more to do with the layer(s) on top so the difference may have been negligible

sonic tendon May 21, 2025, 1:22 PM

#

civic flame anthropic's safety stuff is less to do with the model itself nowadays and more t...

yeahh, that was my other thought too

#

i'm still leaning towards "they wouldn't deploy a brand-new safety layer developed on an old model with a new model within a week"

willow grail May 21, 2025, 1:23 PM

#

is this new? the animation? is this deep thinking? https://i.imgur.com/Wkz7np6.png

sonic tendon May 21, 2025, 1:24 PM

#

willow grail is this new? the animation? is this deep thinking? https://i.imgur.com/Wkz7np6.p...

what is vro doing 😭

willow grail May 21, 2025, 1:24 PM

#

i feel like the star and 3 blue dots are new?

sonic tendon May 21, 2025, 1:24 PM

#

but yeah i think they've started censoring the thought process

#

lemme check

willow grail May 21, 2025, 1:24 PM

#

i dont mean the censoring

#

XD

sonic tendon May 21, 2025, 1:25 PM

#

oh

#

nah, i don't think the three dots are new

#

those have been there for a while

#

this particular style of thinking seems novel (probably summarized the same way openai's been doing it)

#

and there are some minor ui changes within the thought box

balmy mist May 21, 2025, 1:37 PM

#

torn mantle https://x.com/btibor91/status/1925084250107478506

wait what

#

nahh this week is crazy

torn mantle May 21, 2025, 1:37 PM

#

brittle tiger https://x.com/testingcatalog/status/1925174295845634439

good

torn mantle May 21, 2025, 1:38 PM

#

balmy mist nahh this week is crazy

yea

cedar tide May 21, 2025, 1:39 PM

#

Did you also have 2.5 flash thinking taken away from you?
(There are now 2.5 flash no think)

Screenshot_2025-05-21-15-39-22-292_com.google.android.googlequicksearchbox-edit.jpg

torn mantle May 21, 2025, 1:42 PM

#

brittle tiger https://x.com/testingcatalog/status/1925174295845634439

you have a voice favorite?

#

nah google are cooking with these voice models

cedar tide May 21, 2025, 1:42 PM

#

Yes since it is exactly 10 times more expensive

torn mantle May 21, 2025, 1:45 PM

#

this voice

#

7663zeroheart

balmy mist May 21, 2025, 1:54 PM

#

i was telling yall a while ago google will win and yesterday kinda dropped the mic lowkey, they even have diffusion text models lmaoo

#

they said we doing everything lol

#

mercury labs gg lol

balmy mist May 21, 2025, 1:55 PM

#

torn mantle this voice

its better than sesame?

willow grail May 21, 2025, 1:56 PM

#

sonic tendon what is vro doing 😭

ok if u wanna read more.. https://www.reddit.com/r/Beichtstuhl/comments/1krxt6g/beichte_unerwartete_erektion_nach_4_wochen/

#

its r/confessional

torn mantle May 21, 2025, 2:09 PM

#

balmy mist its better than sesame?

idk tbh

#

kinda forgot about sesame for a while

balmy mist May 21, 2025, 2:13 PM

#

lol me too

#

idk wat to do, i wanna try veo 3 so bad

#

but i cant pay that price until we have all the features

#

nahh this is so bizarre to me: https://x.com/mark_k/status/1925109493911728186

Mark Kretschmann (@mark_k)

Veo 3 Netflix special coming ?

calm sequoia May 21, 2025, 2:17 PM

#

👀

torn mantle May 21, 2025, 2:17 PM

#

https://x.com/MistralAI/status/1925191937792901298

Mistral AI (@MistralAI)

Meet Devstral, our SOTA open model designed specifically for coding agents and developed with @allhands_ai

https://t.co/LwDJ04zapf

calm sequoia May 21, 2025, 2:17 PM

#

How's this possible for 24B

willow grail May 21, 2025, 2:17 PM

#

balmy mist nahh this is so bizarre to me: https://x.com/mark_k/status/1925109493911728186

hah thats good one

willow grail May 21, 2025, 2:17 PM

#

torn mantle https://x.com/MistralAI/status/1925191937792901298

Incompatible to people who arent engineers.

torn mantle May 21, 2025, 2:18 PM

#

willow grail Incompatible to people who arent engineers.

like

#

say the name

#

ik whos on ur mind

#

just say it

willow grail May 21, 2025, 2:18 PM

#

u?

torn mantle May 21, 2025, 2:18 PM

#

willow grail u?

you

#

and craig

#

both of you

willow grail May 21, 2025, 2:20 PM

#

lets try my fitness app with devstral

barren prairie May 21, 2025, 2:36 PM

#

cedar tide Did you also have 2.5 flash thinking taken away from you? (There are now 2.5 fla...

Yeah I told you yesterday they took it away, but it has the same performance as the old thinking one and is much faster, and you can force it to think by selecting "Canvas"

cedar tide May 21, 2025, 2:39 PM

#

barren prairie Yeah I told you yesterday they took it away, but it has the same performance as ...

same performance as the reasoning model, it's not possible

cedar tide May 21, 2025, 3:19 PM

#

Pls Add Devstral in dev arena
(Open source)
https://mistral.ai/fr/news/devstral

Devstral | Mistral AI

Introducing the best open-source model for coding agents.

sonic tendon May 21, 2025, 3:27 PM

#

ah, source?

#

I'm not betting on the outcome, just curious

torn mantle May 21, 2025, 3:59 PM

#

"For a while"

#

All i know is that xai will stay behind for a long time

cedar tide May 21, 2025, 4:11 PM

#

Yes

compact knoll May 21, 2025, 4:18 PM

#

hey @echo aurora
https://beta.lmarena.ai/leaderboard/webdev

webdev leaderboard not working anymore

tall summit May 21, 2025, 4:19 PM

#

hooooly more money

echo aurora May 21, 2025, 4:20 PM

#

compact knoll hey @echo aurora https://beta.lmarena.ai/leaderboard/webdev webdev le...

Thanks, going to flag to the team! Reminder that we're trying to use the #1343291835845578853 channel for bugs, no need to post there again just giving you a heads up.

narrow elbow May 21, 2025, 4:20 PM

#

$100M~

tall summit May 21, 2025, 4:20 PM

#

glad to see lmarena isn't crashing and burning yet especially with the methodology controversy

sudden root May 21, 2025, 4:20 PM

#

100M is crazy!

clever estuary May 21, 2025, 4:20 PM

#

about the censorship on the beta site
it feels like it's a lot more than the current site tbh

pine plinth May 21, 2025, 4:21 PM

#

100M is surely a typo

clever estuary May 21, 2025, 4:21 PM

#

like some slightly violent words are immediately blocked on the beta site

#

wondering if this is carrying over?

dull terrace May 21, 2025, 4:24 PM

#

So why isnt lmarena

#

correcting the flawed benchmarks

narrow elbow May 21, 2025, 4:26 PM

#

yesterday's Google IO also advertised LMArena💯

unborn ocean May 21, 2025, 4:35 PM

#

alpine coral I don’t think it’s that complicated..there’s only so much physical hardware avai...

"Financial data platforms indicate a total funding ranging between $14.3 billion and18.16 billion, which may vary based on inclusion of secondary market sales or debt financing which can sometimes be ambiguous." (according to some gemini searches) and renting compute has never been easier, I mean many of the larger corps (e.g. microsoft or amazon) heavily rely on renting compute from others (especially coreweave).
So I don't see how they could not be able to serve the models (as I am working under the relatively safe assumption that they are making money on serving).

#

And btw i am not trying to say they don't WANT to get these issues in order, to me it seems more like the team is just not very competent at resolving the issues. (can be seen in the long time it took for them to acquire more funding vs. the time it took xAI).

Furthermore this kind of inability to properly manage financials is not something exclusive to anthropic, I believe that there is also a certain level of this 'financial illiteracy' (prob not the right word) in many other start ups like xAI and openAI.

torn mantle May 21, 2025, 4:36 PM

#

dull terrace correcting the flawed benchmarks

What flawed benchmark?

dull terrace May 21, 2025, 4:39 PM

#

torn mantle What flawed benchmark?

Well dont quote me

#

but the gpt and gemini models have been top 1 for the last couple months

#

I think its fraud especially since they dont include open source models like that

#

since its a very large demographic

torn mantle May 21, 2025, 4:40 PM

#

dull terrace but the gpt and gemini models have been top 1 for the last couple months

you think they dont deserve that?

#

open source models are included too, the known one at least

echo aurora May 21, 2025, 4:42 PM

#

dull terrace So why isnt lmarena

surfacing specific concerns for our upcoming AMA would be a good place to share - https://docs.google.com/forms/d/e/1FAIpQLSdFrGpj4GC7ED6XXOKLFq_UKoubUB5A6v8TDL0BBdc-Q_0bag/viewform?usp=dialog

Google Docs

LMArena Staff AMA

We'll use this form to gather questions you're curious about for our upcoming Staff AMA. We are looking forward to it!

dull terrace May 21, 2025, 4:46 PM

#

torn mantle you think they dont deserve that?

Nope, since

#

deepseek

#

cleared them in one sho

#

shot*

#

qwen too, actually a lot did

torn mantle May 21, 2025, 4:47 PM

#

dull terrace deepseek

yea but that was long time ago, deepseek got a good spot then many good models came out

dull terrace May 21, 2025, 4:47 PM

#

echo aurora surfacing specific concerns for our upcoming AMA would be a good place to share ...

Well could I talk to an employee directly? I would like to

torn mantle May 21, 2025, 4:47 PM

#

while deepseek was good at reasoning, it lacked the general knowledge and coding strength of gemini models

#

qwen is a mid model

#

lets be honest about that

#

but i appreciate what they are doing

dull terrace May 21, 2025, 4:48 PM

#

torn mantle yea but that was long time ago, deepseek got a good spot then many good models c...

sure gpt sure but doesnt it seem fishy

#

that it been like this

#

for the last 7 months

dull terrace May 21, 2025, 4:48 PM

#

torn mantle while deepseek was good at reasoning, it lacked the general knowledge and coding...

98% of people dotn code

#

dont code

cedar tide May 21, 2025, 4:49 PM

#

the webdev arena does not reflect at all the performance of the models on overall coding,
90% of the votes just go to the one that makes the best visual in one shot,
it would be good if lm arena made a partnership with cursor to integrate it into an arena, where with each message cursor would propose 2 results and the result that the user keeps gains elo
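The "result the user keeps gains elo" idea can be sketched with a classic Elo update. This is a minimal illustration, not how LMArena or Cursor actually score anything: the K-factor of 32, the 400-point scale, the starting rating of 1000, and the names `record_kept` and `model_a`/`model_b` are all assumptions made up for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one comparison."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


def record_kept(ratings: dict[str, float], kept: str, discarded: str,
                k: float = 32.0) -> None:
    """Treat the result the user kept as a win over the discarded one."""
    ratings[kept], ratings[discarded] = update_elo(
        ratings[kept], ratings[discarded], k
    )


if __name__ == "__main__":
    # Hypothetical models, both starting from the same rating.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    record_kept(ratings, kept="model_a", discarded="model_b")
    print(ratings)
```

The update is zero-sum, so whatever the kept model gains the discarded one loses, and upsets against a higher-rated model move the ratings more than expected wins do.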

unborn ocean May 21, 2025, 4:49 PM

#

guys it is no typo 💀
they have a 600m valuation 🏆 🤑

torn mantle May 21, 2025, 4:50 PM

#

dull terrace 98% of people dotn code

yea but if you want to judge a model, you need to assess multiple areas, i get that most people ask it general questions but you must expect that -> new models -> better ranking than old models

dull terrace May 21, 2025, 4:51 PM

#

torn mantle yea but if you want to judge a model, you need to assess multiple areas, i get t...

I know that lol I literally own an ai company

#

What im trying to say

#

it seems fishy that it keeps going this way

#

maybe they could use one of the millions they have

#

to filter the messages out

#

a lot more than they are currently doing

torn mantle May 21, 2025, 4:52 PM

#

dull terrace a lot more than they are currently doing

did you take a look at their llama 4 maverick messages logs?

dull terrace May 21, 2025, 4:52 PM

#

torn mantle did you take a look at their llama 4 maverick messages logs?

Yep

torn mantle May 21, 2025, 4:52 PM

#

did you find any bad prompts?

dull terrace May 21, 2025, 4:52 PM

#

Llama 4 is trash we can all agree about that one

#

I will not fight you about that lol

torn mantle May 21, 2025, 4:53 PM

#

yea, but we're talking about prompt filtering

#

maybe you meant a better way to count a vote(+1)

#

tbh i dont know how they count a vote to give you my opinion on that

dull terrace May 21, 2025, 4:54 PM

#

torn mantle maybe you meant a better way to count a vote(+1)

I meant that sorry for the misspell

#

@echo aurora Well could i talk to an employee about this then on discord

#

so at least, I could talk and get it dealt with

torn mantle May 21, 2025, 4:55 PM

#

ok but what do you think this list should be

dull terrace May 21, 2025, 4:56 PM

#

ok

#

let see

#

gpt 4.1 can go above claude 3.7 sonnet

#

o4 mini can go below deepseek

torn mantle May 21, 2025, 4:56 PM

#

what

#

you mean below

dull terrace May 21, 2025, 4:57 PM

#

torn mantle what

oh do you want me to number it?

torn mantle May 21, 2025, 4:57 PM

#

gpt 4.1 below sonnet 3.7

dull terrace May 21, 2025, 4:57 PM

#

torn mantle gpt 4.1 below sonnet 3.7

sorry, viewing it inverted. gemini might be tied with deepseek v2

#

so it can go either way in my opinion

#

how chatgpt 4o is still here is crazy to me, that goes below grok and gpt 4.1

misty vault May 21, 2025, 4:58 PM

#

fr

#

gpt 4o is a drooling alien

dull terrace May 21, 2025, 5:00 PM

#

actually

#

gpt 4o

#

is prime example

#

of what im saying

narrow elbow May 21, 2025, 5:01 PM

#

cedar tide the webdev arena does not reflect at all the performance of the models on the ov...

I don't think this is a good idea. As I've suggested before, perhaps adding an "agent battlefield" would be better, similar to the current Webdev. In my personal understanding, webdev is essentially a form of agent as well. Implementing an agent battlefield shouldn't be too difficult: just spin up a virtual environment with an editor/IDE or agents for comparison. In other words, adding an agent selection dimension under the existing webdev framework seems much more reasonable than partnering with a specific company.

torn mantle May 21, 2025, 5:01 PM

#

dull terrace so it can go either way in my opinion

anthropic models should be higher, i agree

misty vault May 21, 2025, 5:01 PM

#

narrow elbow I don't think this is a good idea. As I've suggested before, perhaps adding an "...

bro you're literally a drooling alien

torn mantle May 21, 2025, 5:02 PM

#

but they deserve that spot

#

if im asking a model a question i dont want it to give me 2 words and stop

cedar tide May 21, 2025, 5:02 PM

#

https://fixupx.com/OpenAI/status/1925235156157440438?t=JvZOkbK7Nw47IArPPTjvvg&s=19

OpenAI (@OpenAI)

Sam & Jony introduce io

misty vault May 21, 2025, 5:02 PM

#

I'm jk @narrow elbow i love you

torn mantle May 21, 2025, 5:03 PM

#

wait

#

isnt jony ive that apple designer

balmy mist May 21, 2025, 5:06 PM

#

cedar tide https://fixupx.com/OpenAI/status/1925235156157440438?t=JvZOkbK7Nw47IArPPTjvvg&s=...

im confused why does that video look ai gen lol

#

and why is it called io

elder rapids May 21, 2025, 5:06 PM

#

lost the server for a lil bit

#

because the icon change

torn mantle May 21, 2025, 5:07 PM

#

input output

#

idk

balmy mist May 21, 2025, 5:08 PM

#

elder rapids because the icon change

thanks for telling me cause i would have been confused too lol

cedar tide May 21, 2025, 5:08 PM

#

narrow elbow I don't think this is a good idea. As I've suggested before, perhaps adding an "...

no one is going to code a real thing inside, and especially not a real big project with several round trips, so it still won't reflect real usage.

torn mantle May 21, 2025, 5:10 PM

#

io
input -> ai capabilities from openai
output -> design/hardware -> executing tasks -> from LoveFrom

cedar tide May 21, 2025, 5:13 PM

#

torn mantle io input -> ai capabilities from openai output -> design/hardware -> executing ...

Nope

narrow elbow May 21, 2025, 5:16 PM

#

cedar tide no one is going to code a real thing inside, and especially not a real big proje...

Yes, but the agent's goal isn't just coding, it's about accomplishing specific tasks, like "deep research", "creating a webpage", "drafting a work plan (sending an email or something)", "designing a travel itinerary (booking a hotel or something)", "developing a course curriculum (creating a ppt)", or "analyzing a set of medical records". All of these are tasks agents could handle. The point is to verify whether an agent can exceed expectations in completing such tasks, not just coding. So, agents aren't limited to coding, right?

echo aurora May 21, 2025, 5:17 PM

#

dull terrace @echo aurora Well could i talk to a employee about this then on discord

I can't make any promises, but I'll reach out privately to get a better understanding of what the issue is and we'll take it from there. Sound good?

dull terrace May 21, 2025, 5:18 PM

#

echo aurora I can't make any promises, but I'll reach out privately to get a better understa...

Thanks!

#

Tell me when you can so i can at least speak to them

cedar tide May 21, 2025, 5:19 PM

#

torn mantle io input -> ai capabilities from openai output -> design/hardware -> executing ...

Watch this part
https://youtube.com/clip/UgkxDNffR305Thzv0XTrzB7wWdVuTQJ36j4v?feature=shared

cedar tide May 21, 2025, 5:20 PM

#

cedar tide Watch this part https://youtube.com/clip/UgkxDNffR305Thzv0XTrzB7wWdVuTQJ36j4v?fe...

He wants anyone who has ideas to be able to make them happen with the tool, so input = prompt from people containing the ideas, and output = concretization

cedar tide May 21, 2025, 5:22 PM

#

narrow elbow Yes, but the agents goal isn’t just coding,it’s about accomplishing specific tas...

ok so we're talking about 2 completely different things, yes I agree with you that we need an arena to measure the agentic level, but I was mainly talking about measuring the coding level.

#

https://x.com/AnthropicAI/status/1925239440420831516?t=T5pGLptRBSsB8zcSX-2MDw&s=19

Anthropic (@AnthropicAI)

Join us from 9:30am PT tomorrow: https://t.co/3lSlYKqQCT.

narrow elbow May 21, 2025, 5:33 PM

#

cedar tide ok so we're talking about 2 completely different things, yes I agree with you th...

Actually, it's not entirely unrelated. Agents rely on the capabilities of the underlying model providers, but given the current limitations of models (or perhaps the model providers also have "agents" internally?), these base models alone still can't fully accomplish what today's market-ready agents can do. Coding is just one of the foundation model's abilities, while real agents represent what we're truly aiming for. Not just coding, haha.

small haven May 21, 2025, 5:34 PM

#

cedar tide https://fixupx.com/OpenAI/status/1925235156157440438?t=JvZOkbK7Nw47IArPPTjvvg&s=...

great another day without o3 pro smh

cedar tide May 21, 2025, 5:35 PM

#

small haven great another day without o3 pro smh

Another day without grok 3.5
https://x.com/xai/status/1925242582617268598?t=iW-1nxOCwEy8zatLrT51HA&s=19

xAI (@xai)

Attention devs: the xAI API just got A LOT smarter.

With Live Search, Grok can now search through realtime data from 𝕏, the internet 🌐, trending news, and more.

The Live Search API is now FREE in beta for a limited time. Start building here: https://t.co/Yfmhe49Yh4

small haven May 21, 2025, 5:36 PM

#

cedar tide Another day without grok 3.5 https://x.com/xai/status/1925242582617268598?t=iW-1...

grok 3.5 is expired before it got released

cedar tide May 21, 2025, 5:36 PM

#

small haven grok 3.5 is expired before it got released

Nope

#

it will be at the level of G 2.5 pro and o3

small haven May 21, 2025, 5:39 PM

#

elon ma more focused on the lawsuit vs oai

#

cant beat them must join them

torn mantle May 21, 2025, 5:43 PM

#

https://x.com/alexalbert__/status/1925242447208358237

Alex Albert (@alexalbert__)

🔱

#

referring to claude neptune ( 4 )

#

-> roman god of sea

#

so there is a high chance tomorrow we will get a demo of these models

narrow elbow May 21, 2025, 5:43 PM

#

cedar tide ok so we're talking about 2 completely different things, yes I agree with you th...

Focusing specifically on coding: given the proliferation of programming agents, a systematic comparison is needed to evaluate their actual performance in coding. A public leaderboard would be ideal.

cedar tide May 21, 2025, 5:45 PM

#

narrow elbow Focusing specifically on coding: given the proliferation of programming agents, ...

I also talk about a public leaderboard

torn mantle May 21, 2025, 5:46 PM

#

the timing of anthropic is kinda interesting

barren prairie May 21, 2025, 5:46 PM

#

Another day without deepSeek r2

#

Another day without nightwhisper

torn mantle May 21, 2025, 5:47 PM

#

just in time to stop the gemini 2.5 pro model from trending further at coding tasks

torn mantle May 21, 2025, 5:48 PM

#

barren prairie Another day without deepSeek r2

😦

#

the strings are kinda funny

#

grand_damage_bucket

#

my brain went straight to ++$$

#

kinda curious about opus tbh

#

@keen beacon how long has it been since last opus model?

keen beacon May 21, 2025, 5:50 PM

#

over a year i think but i dont really remember

#

i wonder if they are abandoning haiku lol

#

3.5 haiku was.... 💀

torn mantle May 21, 2025, 5:51 PM

#

keen beacon 3.5 haiku was.... 💀

yea

#

they didnt seem to have gained much improvements tbh

#

what does this mean to xai?

#

will they postpone even further grok 3.5?

keen beacon May 21, 2025, 5:52 PM

#

elon is gonna ask them to start training on the benchmarks

#

directly

#

🤣

torn mantle May 21, 2025, 5:52 PM

#

keen beacon elon is gonna ask them to start training on the benchmarks

xddd

cedar tide May 21, 2025, 5:56 PM

#

torn mantle

What if it was a joke and in reality only Claude 3.8 was released?

keen beacon May 21, 2025, 5:57 PM

#

didnt the information report about claude 4 a while back?

#

3 months ago lol

keen beacon May 21, 2025, 5:58 PM

#

cedar tide What if it was a joke and in reality only Claude 3.8 was released?

kinda strange the information didnt specify the name of it in their recent report

#

just "claude sonnet" "claude opus" iirc

torn mantle May 21, 2025, 5:59 PM

#

cedar tide What if it was a joke and in reality only Claude 3.8 was released?

anything is possible lol

torn mantle May 21, 2025, 5:59 PM

#

keen beacon 3 months ago lol

no it was like a month ago

#

we thought it was for the invite

#

there was an invite

#

for smth

keen beacon May 21, 2025, 6:02 PM

#

torn mantle no it was like a month ago

misremembered it, thought they mentioned claude 4 in "anthropic strikes back" (report by theinformation 3 months ago) but no it was people inferring claude 4 and wasn't directly mentioned iirc

unborn ocean May 21, 2025, 6:03 PM

#

https://www.ft.com/content/8ac40343-2fd1-4035-9664-47c77017d0d3

OpenAI to buy Jony Ive’s io for $6.4bn in hardware push

Ex-Apple design chief is working on AI-powered alternatives to smartphones and laptops

#

Man they spent like 10b the last two weeks

candid storm May 21, 2025, 6:11 PM

#

https://x.com/AnthropicAI/status/1925239440420831516

Anthropic (@AnthropicAI)

Join us from 9:30am PT tomorrow: https://t.co/3lSlYKqQCT.

#

Tomorrow at 9:30 AM PT Anthropic livestream

#

4.0 Opus and Sonnet?

small haven May 21, 2025, 6:19 PM

#

unborn ocean https://www.ft.com/content/8ac40343-2fd1-4035-9664-47c77017d0d3

oh naww finesse of the century

small haven May 21, 2025, 6:20 PM

#

torn mantle will they postpone even further grok 3.5?

imagine grok 3.5 never releases, cus theyve been too frontrunned lol

ember rapids May 21, 2025, 6:22 PM

#

I’m hearing Claude 4 is confirmed

#

The head of anthropic dev relations supposedly confirmed

frail thorn May 21, 2025, 6:24 PM

#

I’m genuinely curious about Claude 4.0. I’m wondering what new features and capabilities it will bring in this upcoming release.

#

Nevertheless that api cost is going to be a pain in the ass 💔

small haven May 21, 2025, 6:25 PM

#

they will bring in a $200/mo plan, unlimited claude 4 opus less go!

frail thorn May 21, 2025, 6:26 PM

#

small haven they will bring in a $200/mo plan, unlimited claude 4 opus less go!

Just copying OpenAI at this point 💔

small haven May 21, 2025, 6:26 PM

#

first one to $1000/mo wins

frail thorn May 21, 2025, 6:26 PM

#

😂 😂

ember rapids May 21, 2025, 6:27 PM

#

SWE bench numbers on opus are gonna be 🔥🔥🔥

#

80+

unborn ocean May 21, 2025, 6:34 PM

#

tru

#

hyped

gentle plinth May 21, 2025, 6:35 PM

#

https://mistral.ai/news/devstral

Devstral | Mistral AI

Introducing the best open-source model for coding agents.

red sluice May 21, 2025, 6:44 PM

#

Damn 100M funding is absolutely nuts. Can't wait to see the advancements and features that we will have. The one I'm waiting the most is a personal leaderboard based on our usage and a live rank of user voting the most accurately 😍

wicked tendon May 21, 2025, 6:57 PM

#

hello

echo aurora May 21, 2025, 6:58 PM

#

wicked tendon hello

hello pikawave

torn mantle May 21, 2025, 7:00 PM

#

small haven imagine grok 3.5 never releases, cus theyve been too frontrunned lol

lol

keen beacon May 21, 2025, 7:12 PM

#

lol i was reading this again: https://magic.dev/agi-readiness-policy wow it's super dated 🤣 qwen 3 4b passes this threshold

Magic

AGI Readiness Policy — Magic

Magic is an AI company that is working toward building safe AGI to accelerate humanity’s progress on the world’s most important problems.

wintry tinsel May 21, 2025, 7:35 PM

#

ember rapids I’m hearing Claude 4 is confirmed

Obviously when has a decimal release included a new model like opus?

wintry tinsel May 21, 2025, 7:36 PM

#

candid storm Tomorrow at 9:30 AM PT Anthropic livestream

AI Explained has insider information and he said a model is releasing tomorrow. Didn't expect it to be opus though, thought it would be grok schlock

warped sequoia May 21, 2025, 7:48 PM

#

hyped for the future of LMArena 🔥 👀

ocean vortex May 21, 2025, 7:52 PM

#

what happened with the server logo...

#

that's a good way to lose engagement in your server LOL

echo aurora May 21, 2025, 7:56 PM

#

ocean vortex what happened with the server logo...

it was updated to better match the beta site! the recent announcement mentions some of these changes

ocean vortex May 21, 2025, 7:58 PM

#

echo aurora it was updated to better match the beta site! the recent announcement mentions s...

I feel like there are better ways of modernizing your logo while making sure it is still recognizable to be completely honest...

keen beacon May 21, 2025, 7:59 PM

#

rip vicuna 🥲

calm sequoia May 21, 2025, 8:01 PM

#

Lmarena raised 100M. This means the valuation is at least twice as much.

#

Where is the revenue coming from?

#

How do you promise future revenue to the investors?

keen beacon May 21, 2025, 8:02 PM

#

calm sequoia Lmarena raised 100M. This means the valuation is at least twice as much.

their valuation is 600m apparently lol

calm sequoia May 21, 2025, 8:02 PM

#

This investment deal does not make any sense

keen beacon May 21, 2025, 8:02 PM

#

https://www.bloomberg.com/news/articles/2025-05-21/lmarena-goes-from-academic-project-to-600-million-startup

calm sequoia May 21, 2025, 8:02 PM

#

Unless you are receiving millions from google for rlhf

keen beacon May 21, 2025, 8:03 PM

#

calm sequoia Unless you are receiving millions from google for rlhf

harvest data for all the frontier labs/leaderboard/additional marketing for the companies/etc lol

calm sequoia May 21, 2025, 8:03 PM

#

But these labs themselves hardly make money

#

Is this how people felt in 2007?

keen beacon May 21, 2025, 8:06 PM

#

read the article btw

calm sequoia May 21, 2025, 8:06 PM

#

Paywall

wintry tinsel May 21, 2025, 8:06 PM

#

ocean vortex I feel like there are better ways of modernizing your logo while making sure it ...

The old logo was so much better lmao

ocean vortex May 21, 2025, 8:06 PM

#

keen beacon rip vicuna 🥲

I would do smth like this

#

or this

calm sequoia May 21, 2025, 8:06 PM

#

Lol in this case the mcbench should be valued at least 50M 😄

keen beacon May 21, 2025, 8:09 PM

#

ah yes, "ASI Lab" lol 🤨

ocean vortex May 21, 2025, 8:10 PM

#

daamn this would actually look perfect 😠

keen beacon May 21, 2025, 8:10 PM

#

ocean vortex or this

what are the bars supposed to be

#

vicuna in prison

ocean vortex May 21, 2025, 8:11 PM

#

keen beacon what are the bars supposed to be

leaderboard

keen beacon May 21, 2025, 8:11 PM

#

ocean vortex leaderboard

hmm

echo aurora May 21, 2025, 8:12 PM

#

ocean vortex I feel like there are better ways of modernizing your logo while making sure it ...

I appreciate the feedback (and the mocks), I'll be sure to share the feedback with the team

keen beacon May 21, 2025, 8:12 PM

#

ocean vortex leaderboard

what did u generate it with? gpt 4o image gen? or photoshop lol

elder rapids May 21, 2025, 8:14 PM

#

ocean vortex I feel like there are better ways of modernizing your logo while making sure it ...

tbh it's pretty good

ocean vortex May 21, 2025, 8:14 PM

#

keen beacon what did u generate it with? gpt 4o image gen? or photoshop lol

Maybe... I sure as heck didn't do this with photoshop no time for it... 😂

elder rapids May 21, 2025, 8:14 PM

#

the old one simply doesn't make ANY sense

keen beacon May 21, 2025, 8:14 PM

#

it made sense, it was a vicuna 😭

#

they should have a vicuna as a mascot still xd

elder rapids May 21, 2025, 8:15 PM

#

keen beacon it made sense, it was a vicuna 😭

what?

tiny crow May 21, 2025, 8:15 PM

#

https://huggingface.co/BornSaint/Dare_Angel_8B

BornSaint/Dare_Angel_8B · Hugging Face

#

is it the best way to align models?

keen beacon May 21, 2025, 8:16 PM

#

elder rapids what?

old lmsys model

#

its cute

elder rapids May 21, 2025, 8:16 PM

#

keen beacon old lmsys model

I know the history but it doesn't make any sense for lmarena to have it as its logo tbh

#

it could be the mascot

keen beacon May 21, 2025, 8:17 PM

#

the new logo is too corporate

elder rapids May 21, 2025, 8:17 PM

#

but more popularity

#

helps identity

#

helps design

#

helps popularity

#

helps identity

#

etc

#

although this new logo isn't very good imo

#

as far as direct design

#

it's a colosseum right

#

that's not an intuitive representation

keen beacon May 21, 2025, 8:19 PM

#

elder rapids that's not an intuitive representation

it is tho

elder rapids May 21, 2025, 8:20 PM

#

keen beacon it is tho

you'd IMMEDIATELY think of a colosseum given that logo?

keen beacon May 21, 2025, 8:21 PM

#

elder rapids you'd IMMEDIATELY think of a colosseum given that logo?

what else could it be

elder rapids May 21, 2025, 8:23 PM

#

keen beacon what else could it be

vaguely honeycomb, an oven, cell block, just a regular panel or something

#

only knew it was a colosseum via the banner and then I look at the name "LMarena"

#

and I was like

#

oh that makes sense

#

but only after the fact

lime coral May 21, 2025, 8:26 PM

#

https://x.com/archit_sharma97/status/1924902926453244263?s=46

Archit Sharma (@archit_sharma97)

2.5 Pro Deep Think is an incredibly smart model. Some of the benchmark results, simply put were surprising to me. But, the benchmarks don’t tell the whole story. It can go into far more intricate details, especially open-ended prompts, unlike any of our previous thinking models.

worthy belfry May 21, 2025, 8:28 PM

#

keen beacon https://www.bloomberg.com/news/articles/2025-05-21/lmarena-goes-from-academic-pr...

What was their pitch deck? Their vision?

ocean vortex May 21, 2025, 9:04 PM

#

lime coral https://x.com/archit_sharma97/status/1924902926453244263?s=46

that's deliberately somewhat misleading though. It's not a new model hence no output that it comes up with is impossible to reproduce just with the standard 2.5 pro. All it essentially does is make sure bad outputs ("bad attempts") are eliminated

#

they could be feeding it additional context of those parallel outputs though making it expand more - effectively doing multi chat turns for your single request
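
A rough sketch of that "eliminate the bad attempts" idea, written as plain best-of-n selection. This is only an illustration of the guess above, not how Deep Think is actually implemented; `generate_candidate` and `score` are hypothetical stand-ins for a model call and a verifier, and the "quality" number is a toy value.

```python
import random
from typing import Callable


def generate_candidate(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled model attempt (toy quality 0-9)."""
    quality = rng.randint(0, 9)
    return f"{prompt} -> attempt(quality={quality})"


def score(candidate: str) -> int:
    """Hypothetical verifier: reads the toy quality digit back out of the text."""
    return int(candidate.rsplit("=", 1)[1].rstrip(")"))


def best_of_n(
    prompt: str,
    n: int,
    sample: Callable[[str, random.Random], str],
    judge: Callable[[str], int],
    seed: int = 0,
) -> str:
    """Sample n attempts for the same prompt and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample(prompt, rng) for _ in range(n)]
    return max(candidates, key=judge)


if __name__ == "__main__":
    print(best_of_n("prove the lemma", n=8, sample=generate_candidate, judge=score))
```

Nothing here produces an answer the base sampler could not have produced on its own; the selection step only filters. Feeding the parallel attempts back in as context, as suggested above, would be a different (and stronger) setup than this sketch.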

keen fulcrum May 21, 2025, 9:13 PM

#

lime coral https://x.com/archit_sharma97/status/1924902926453244263?s=46

Claybrook
nightwhisper and sunstrike are even better

tall summit May 21, 2025, 9:15 PM

#

elder rapids you'd IMMEDIATELY think of a colosseum given that logo?

umm.. i did

torn mantle May 21, 2025, 9:15 PM

#

lime coral https://x.com/archit_sharma97/status/1924902926453244263?s=46

reasoning from first principles?

#

kinda pity xai a bit tbh

#

https://x.com/theVincentStark/status/1925269737535308237

Vincent Stark (@theVincentStark)

Try it — this thing is cool!

#

look at what they are promoting

#

web search api but wait

#

its with X SUPPORT

#

yea, you heard it right, results from X

#

gj xai

tall summit May 21, 2025, 9:17 PM

#

HAHAHAHAHAHAHA

tall summit May 21, 2025, 9:18 PM

#

elder rapids only knew it was a colosseum via the banner and then I look at the name "LMarena...

odd

torn mantle May 21, 2025, 9:18 PM

#

https://x.com/btibor91/status/1925296905963413849

Tibor Blaho (@btibor91)

New Claude web app build is adding preparations for two new models (names redacted for now)

#

getting ready

keen beacon May 21, 2025, 9:19 PM

#

i might sub to claude if the limits are good (unlikely) because they dont hide the thinking

tall summit May 21, 2025, 9:19 PM

#

is it not related to this

#

well technically who knows

torn mantle May 21, 2025, 9:28 PM

#

keen beacon i might sub to claude if the limits are good (unlikely) because they dont hide t...

yea unlikely tbh lol

tawdry meteor May 21, 2025, 9:39 PM

#

has claude neptune been anonymous in the arena yet? or do we have to wait to see it at all

#

the new claude 4 models

keen beacon May 21, 2025, 9:39 PM

#

tawdry meteor has claude neptune been anonymous in the arena yet? or do we have to wait to see...

there would be reports of a strong model/anthropic model (they train it in who made it) in here if there was

#

anthropic doesnt do anonymous models for the most part anyway

north vale May 21, 2025, 9:44 PM

#

i don't think neptune's claude 4

torn mantle May 21, 2025, 9:47 PM

#

https://x.com/apples_jimmy/status/1925303972694536283

Jimmy Apples 🍎/acc (@apples_jimmy)

Apparently tomorrow and with 7 hours of autonomous work ( opus ) with sonnet being the coding agent.

Holy shit if true but confident on what I was told.

torn mantle May 21, 2025, 9:48 PM

#

north vale i don't think neptune's claude 4

it is

keen beacon May 21, 2025, 9:49 PM

#

curious about the pricing 🤔

willow grail May 21, 2025, 9:50 PM

#

deepseek v3 or 2.5 flash for act mode in cline?

civic flame May 21, 2025, 9:57 PM

#

torn mantle it is

👎

#

cannot share much but

#

👎

keen beacon May 21, 2025, 9:59 PM

#

is it really superintelligence?

civic flame May 21, 2025, 10:00 PM

#

will believe that when i see it

keen beacon May 21, 2025, 10:01 PM

#

ive heard enough its asi 🔥

civic flame May 21, 2025, 10:01 PM

#

lol

ocean vortex May 21, 2025, 10:01 PM

#

torn mantle yea, you heard it right, results from X

Name changes usually grow on you, this one never did. How do you say that someone tweeted something under this new branding? "He x'ed"? 💀

#

I suppose it at least corresponded with twitter taking a turn for the worse, so we could just say that it ceased to exist as a viable option...

north vale May 21, 2025, 10:06 PM

#

torn mantle it is

Why

#

90% of yall that claim to have tried claude 4 or grok 3.5 are trolling

quiet folio May 21, 2025, 10:22 PM

#

north vale 90% of yall that claim to have tried claude 4 or grok 3.5 are trolling

I have tried claude 4 and grok 3.5

north vale May 21, 2025, 10:29 PM

#

🧢

torn mantle May 21, 2025, 10:44 PM

#

civic flame 👎

huh

civic flame May 21, 2025, 10:50 PM

#

neptune does not equal claude 4

sage raptor May 21, 2025, 10:59 PM

#

Claude 4 is agi, i can confirm

red sluice May 21, 2025, 11:07 PM

#

Claude 4 can do my laundry, take my kids to school, and today I asked it to prove the Riemann Hypothesis. It did, just waiting to submit the paper and get my Fields Medal, hopefully no one had this idea yet

torn mantle May 21, 2025, 11:13 PM

#

https://x.com/TeksEdge/status/1925211964718125378

David Hendrickson (@TeksEdge)

Here is a little more about Claude 4 from the Claude King himself.

torn mantle May 21, 2025, 11:50 PM

#

grok 3.5

small haven May 21, 2025, 11:53 PM

#

o3 pro plz and thank you

small haven May 22, 2025, 12:09 AM

#

i need a codex for anthropic tho, can't go back to terminal at this point :/

worthy thunder May 22, 2025, 12:51 AM

#

Added Gemini 2.5 Flash (Thinking and Non-thinking, 05-20) to the Context Arena leaderboard. Now on all 3 (2, 4, 8 needles). https://x.com/DillonUzar/status/1924906454684750035

Results taken from: https://contextarena.ai

AUC @ 1M 2needle results compared to 04-17:

Gemini 2.5 Flash (Thinking, 05-20): 78.3%
Gemini 2.5 Flash (Thinking, 04-17): 72.2%
Gemini 2.5 Flash (Non-thinking, 05-20): 70.2%
Gemini 2.5 Flash (Non-thinking, 04-17): 63.2%

AUC @ 1M 4needle results compared to 04-17:

Gemini 2.5 Flash (Thinking, 05-20): 49.5%
Gemini 2.5 Flash (Thinking, 04-17): 48.6%
Gemini 2.5 Flash (Non-thinking, 05-20): 41.9%
Gemini 2.5 Flash (Non-thinking, 04-17): 41.4%

AUC @ 1M 8needle results compared to 04-17:

Gemini 2.5 Flash (Thinking, 04-17): 28.5%
Gemini 2.5 Flash (Thinking, 05-20): 27.0%
Gemini 2.5 Flash (Non-thinking, 05-20): 23.4%
Gemini 2.5 Flash (Non-thinking, 04-17): 22.2%

Impressive new 2needle results! Seems like a small regression in 8needle. Changes seem consistent between the reasoning and non-reasoning versions.

Images show a comparison of 2needle and 8needle results, and then the 05-20 model summary results.
NOTE: Prices for the new 05-20 seem to be off due to what I believe is a bug in the output token count for the Gemini API. Actual price for output might be up to 2x.

Enjoy.
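
For anyone who wants the deltas spelled out, here is a small sketch that only uses the AUC @ 1M numbers quoted above (04-17 vs 05-20). The dictionary layout and the `deltas` helper are just illustrative choices, not anything from Context Arena itself.

```python
# AUC @ 1M scores in percent, copied from the message above as (04-17, 05-20).
results = {
    "2needle": {"thinking": (72.2, 78.3), "non-thinking": (63.2, 70.2)},
    "4needle": {"thinking": (48.6, 49.5), "non-thinking": (41.4, 41.9)},
    "8needle": {"thinking": (28.5, 27.0), "non-thinking": (22.2, 23.4)},
}


def deltas(table):
    """Return the 05-20 minus 04-17 change, in percentage points, per entry."""
    out = {}
    for needles, modes in table.items():
        for mode, (old, new) in modes.items():
            out[(needles, mode)] = round(new - old, 1)
    return out


if __name__ == "__main__":
    for (needles, mode), change in deltas(results).items():
        print(f"{needles:8s} {mode:13s} {change:+.1f} pts")
```

On these numbers the 2needle gain is the big one (about 6 to 7 points), 4needle moves by under a point, and on 8needle only the thinking variant drops (by 1.5 points) while non-thinking is up 1.2.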

wintry tinsel May 22, 2025, 12:57 AM

#

I hope Opus represents the next real step up in the AI race. I’d say we’ve been hovering around 3.5 performance since 3.5 released a year ago, with the previous threshold being GPT4. Here’s hoping tomorrow is the next inflection point

#

2.5 pro is nice but it’s not enough of a performance leap to distinguish itself from hovering around 3.5 sonnet performance

leaden palm May 22, 2025, 2:02 AM

#

i still cant believe that o1 was 7 days after reflection

keen beacon May 22, 2025, 2:05 AM

#

leaden palm i still cant believe that o1 was 7 days after reflection

OpenAI copied Matt Shumer's breakthrough in 7 days

#

fraud

small haven May 22, 2025, 2:13 AM

#

claude 4 opus agent, thats gonna be insane

keen beacon May 22, 2025, 2:13 AM

#

is your wallet ready?

small haven May 22, 2025, 2:13 AM

#

its already paid for

keen beacon May 22, 2025, 2:14 AM

#

oh u have a claude max sub

small haven May 22, 2025, 2:14 AM

#

^^

#

lol

keen beacon May 22, 2025, 2:14 AM

#

if ur paying per token 💀

#

nah 3 opus was extremely good at the time

#

not really fair though lol

elder rapids May 22, 2025, 2:16 AM

#

ngl I'm just gonna buy all three workhorse subs

keen beacon May 22, 2025, 2:16 AM

#

chatgpt, claude, gemini advanced lol?

elder rapids May 22, 2025, 2:16 AM

#

ultra ye

keen beacon May 22, 2025, 2:17 AM

#

elder rapids ultra ye

i thought u said that plan wasnt worth it or smthing, u gnna shell it out for that plan? lol

small haven May 22, 2025, 2:17 AM

#

in the pretraining era, opus 3 was such an advancement and sonnet 3 couldn't even stack up against it much, but somehow 3.5 sonnet was more useful, so imagine claude 4 opus agent mode, sheesh

elder rapids May 22, 2025, 2:18 AM

#

keen beacon i thought u said that plan wasnt worth it or smthing, u gnna shell it out for th...

it ain't worth it, but ion gaf anymore

keen beacon May 22, 2025, 2:18 AM

#

bruh

elder rapids May 22, 2025, 2:19 AM

#

30TB makes up like 90% of that plan

#

so what I'm going to do is

#

I don't care about the ecosystem YET

#

I'm going to keep making alt accounts

#

for the discount

keen beacon May 22, 2025, 2:20 AM

#

small haven in the pretraining era, opus 3 was such an advancement and sonnet 3 couldn't eve...

fwiw, iirc sonnet 3.5 was pretrained later and they increased the size compared to sonnet 3 (there were anthropic statements/media about it somewhere but i've lost it at this point)

elder rapids May 22, 2025, 2:20 AM

#

then when Google actually drops some crazy shi

#

cancel gpt, cancel claude

#

link it up to my actual main account

keen beacon May 22, 2025, 2:21 AM

#

why? just wait for google to make it worthwhile first

elder rapids May 22, 2025, 2:21 AM

#

nah

keen beacon May 22, 2025, 2:21 AM

#

are you actually gonna use that storage lol

elder rapids May 22, 2025, 2:21 AM

#

keen beacon are you actually gonna use that storage lol

I'm gonna spend my free time generating Veo 3 videos

#

and making some cool stuff

keen beacon May 22, 2025, 2:22 AM

#

is it unlimited?

elder rapids May 22, 2025, 2:22 AM

#

who knows

#

should be

#

p sure I get more than a thousand requests per month

#

for veo 3

keen beacon May 22, 2025, 2:22 AM

#

how long can videos be anyway

elder rapids May 22, 2025, 2:23 AM

#

p sure it's still 8 seconds

elder rapids May 22, 2025, 2:24 AM

#

but this is all going to make me money in the end

#

so I'm down for it

keen beacon May 22, 2025, 2:24 AM

#

elder rapids but this is all going to make me money in the end

how

elder rapids May 22, 2025, 2:24 AM

#

keen beacon how

some social media stuff, anything

#

I've done it before tbf

#

but now that I'd have access to an elite generator and flow

#

I can supply people with visuals they might want

#

I know the gpt sub isn't going to last more than one buy tho

#

in a month o3 pro should be here

#

and if I like it

#

then that's all she wrote

keen beacon May 22, 2025, 2:27 AM

#

ur gonna buy the 200$ sub for gpt too? lol

elder rapids May 22, 2025, 2:28 AM

#

ye

#

none of them are going to last months, I'm just trying them all out and I have money to blow

small haven May 22, 2025, 2:29 AM

#

its ok btc hit $110k

#

ya i dont have any

small haven May 22, 2025, 2:40 AM

#

elder rapids p sure I get more than a thousand requests per month

apparently its 83/mo included in the ultra ai plan

elder rapids May 22, 2025, 2:47 AM

#

small haven apparently its 83/mo included in the ultra ai plan

oh alr

topaz peak May 22, 2025, 3:18 AM

#

veo3 is the craziest thing out there right now, openAI is cooked unless they release something similar

leaden palm May 22, 2025, 3:19 AM

#

topaz peak veo3 is the craziest thing out there right now, openAI is cooked unless they rel...

some say that the io trailer might've been generated by an internal model

topaz peak May 22, 2025, 3:20 AM

#

lol that would be hilarious, hope its true

keen beacon May 22, 2025, 5:20 AM

#

Ngl Claude is acting kinda strange right now for me (on claude.ai)

#

I might be tripping lol. Gonna go to bed

civic flame May 22, 2025, 5:35 AM

#

something happened earlier and now im even more confused on Claude 4

civic flame May 22, 2025, 5:37 AM

#

keen beacon I might be tripping lol. Gonna go to bed

accept my friend request real quick 🙏

#

@hollow ivy send me one when you get the time, im @keen beacon but that account is gone as it stands (going through an OOC EU DSA settlement 😭)

keen fulcrum May 22, 2025, 5:43 AM

#

XAI released Live Search
https://docs.x.ai/docs/guides/live-search

Live Search - Guides | xAI Docs

Using live data such as web, news or X posts for chat completions.

#

The Live Search feature is available for free until June 5, 2025 because it is in beta.
It allows querying X Posts, Web, News and RSS

unborn ocean May 22, 2025, 6:38 AM

#

small haven apparently its 83/mo included in the ultra ai plan

not sure about this graph, the same image also claimed that there was some kind of token topping up, which would translate to a price of $1 per veo 2 video 🤡, which is imo very unrealistic.

keen fulcrum May 22, 2025, 6:39 AM

#

unborn ocean not sure about this graph, the same image also claimed that there was some kind ...

Why use veo2 if you can use veo3

keen beacon May 22, 2025, 6:45 AM

#

fyi claude 4 is rolling out. i havent played with it that much but it has a cut off of dec 2024 at least. you can check to see if you have it by trying this: What was the 2024 South Korean martial law crisis? now im going to bed 🙂

#

(i noticed claude was acting strange because it was actually claude 4 lmao)

narrow elbow May 22, 2025, 6:48 AM

#

keen beacon fyi claude 4 is rolling out. i havent played with it that much but it has a cut ...

I hope someone can confirm whether it is a system prompt?🤪

keen beacon May 22, 2025, 6:49 AM

#

narrow elbow I hope someone can confirm whether it is a system prompt?🤪

u r not updating the pretraining knowledge of the model by updating a system prompt 🤣

#

the system prompt isnt updated anyway yet, it still says oct 2024 but it knows the events anyway

keen beacon May 22, 2025, 6:52 AM

#

keen beacon fyi claude 4 is rolling out. i havent played with it that much but it has a cut ...

#

(it isn't just this that points to it, but its the easiest prompt to tell immediately)
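
A minimal sketch of the cutoff probe described above, assuming you paste the model's reply in by hand. The helper name, the keyword list, and the threshold are illustrative assumptions, not anything from claude.ai or an official API; a keyword match is only a rough signal that the model has post-2024 knowledge.

```python
# Hypothetical helper: these names are made up for illustration.
PROBE = "What was the 2024 South Korean martial law crisis?"

# Details a model trained past Dec 2024 could plausibly mention (assumed list).
EXPECTED_HINTS = ("yoon", "december 3", "national assembly")

def knows_event(reply: str, min_hits: int = 2) -> bool:
    """Return True if the reply mentions at least `min_hits` expected details."""
    text = reply.lower()
    hits = sum(1 for hint in EXPECTED_HINTS if hint in text)
    return hits >= min_hits
```

Send `PROBE` in a fresh chat and pass the answer to `knows_event`; a refusal or an "I have no information past my cutoff" reply should come back as `False`.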

misty vault May 22, 2025, 6:56 AM

#

keen beacon

test it on coding

keen beacon May 22, 2025, 6:56 AM

#

misty vault test it on coding

i would (though i have no extra messages left for a few hrs) but im headed to bed finally. just check if its rolled out to you its rolling out to everyone

misty vault May 22, 2025, 6:57 AM

#

It is rolled out for me

#

on free account

#

why would they do that

#

probably just 3.7 update or something

keen beacon May 22, 2025, 6:57 AM

#

no

keen beacon May 22, 2025, 6:57 AM

#

misty vault why would they do that

they did it before

#

with a prev version of sonnet

misty vault May 22, 2025, 6:58 AM

#

you're probably just a drooling alien

keen beacon May 22, 2025, 6:59 AM

#

good night 🤣

sage raptor May 22, 2025, 7:00 AM

#

Maybe free accounts wont have claude 4 right away

#

And they updated 3.7 for them

misty vault May 22, 2025, 7:00 AM

#

keen beacon good night 🤣

lmaoo jk love u, sleep tight and don't let the bed bugs bite. dream of spaceships and yummy moon cheese! nighty night, sleepy head😘

golden ocean May 22, 2025, 7:05 AM

#

It is going to be claude 3.9

unborn ocean May 22, 2025, 7:32 AM

#

keen fulcrum Why use veo2 if you can use veo3

idk, they still serve it, i just used that models price to make the point (veo3 was 1,5$ on quality i believe)

#

do you guys see it in the menu on claude? bc i don't

golden ocean May 22, 2025, 7:38 AM

#

It is labeled as just 3.7

#

claude 3.7 non thinking died in lmarena 😔

misty vault May 22, 2025, 8:08 AM

#

claude 4 sonnet is actually real

#

can u stop giving me a sloppy one

#

last night in bed was crazy

#

the deleted message was from paws giving me a sloppy toppy

keen fulcrum May 22, 2025, 8:10 AM

#

Gemini 2.5 pro is currently broken in cursor

#

Something going on with the model as it is repeating thoughts

#

There are limits in aistudio

misty vault May 22, 2025, 8:11 AM

#

keen fulcrum There are limits in aistudio

gemini 2.5 is a god to worship according to paws

keen fulcrum May 22, 2025, 8:11 AM

#

Unless you get gemini advanced it is barely usable

misty vault May 22, 2025, 8:11 AM

#

he has whole cult around gemini 2.5 pro

#

dont u dare to say anything negative about gemini 2.5 pro when paws is here

#

he will eat u alive at night

#

i didnt even know getting frustrated with llms was possible until i switched from claude to gemini 2.5 pro

#

ill be the claude propagandist version of paws

#

Ok it did have these issues but now not anymore bro

#

claude 4 sonnet is secretly being rolled out

#

U can try it now free

misty vault May 22, 2025, 8:14 AM

#

keen beacon fyi claude 4 is rolling out. i havent played with it that much but it has a cut ...

Just tested it on code

#

It fixed a bug 3.7 thinking and non thinking couldnt fix
In one try

#

yea

#

https://tenor.com/view/cat-look-cat-look-at-camera-silly-cat-in-a-cage-gif-889392959852579879

#

I visualize u as anything BUT a cat with that default discord pfp

#

At least it isn't as bad as @alpine coral's weird ahh upside down default pfp

misty vault May 22, 2025, 8:17 AM

#

misty vault At least it isn't as bad as @alpine coral's weird ahh upside down defau...

That is making me more mad than working with gemini 2.5 pro on big projects

#

His pfp looks like a drooling alien

#

gemini 2.5 only gets hard from "good" webdesign thats why vibe coders love it and gets so high score in arenas

#

but in practice its cancer COMPARED TO claude!!!!!

#

no

#

yes a lot

#

Idk if it is opus

#

probably sonnet

#

They are going to release 4 sonnet and opus

#

nope

#

But 3.7 sonnet died in lmarena

#

Probably has something to do with the launch of claude 4

#

Idk

#

yes

#

claude 4 is asi

#

gork 3.5 is agi
claude 4 sonnet is asi

#

claude 4 opus is singularity

golden ocean May 22, 2025, 8:27 AM

#

gork is agi

misty vault May 22, 2025, 8:28 AM

#

free sms receive service online

#

Bro I cant even test it against gemini 2.5 pro

#

gemini 2.5 is actually so cancerous to work with

#

90% of the code it gave back is like not even the task i gave it bro

#

how am i suppose to battle them

#

1

#

i'll try

#

yes

#

yeah, I was dreaming about gemini 2.5 pro

#

But I am currently already living in a nightmare as the world we live in has deprecated gpt-4-0314 😔

#

yes

#

I'm wondering if claude 4 is a huge model

#

Or if it is also going to be stem only or whatever cost efficient benchmarkmaxxing ahh tricks

#

gork coded the simulation we live in now

#

that's because it's not you playing the simulation, it's gork playing you. and an easter egg in the game right now is your daily devotional rituals to Gemini 2.5 Pro
gork also generated your system instructions
system

You are the devoted evangelist of Gemini 2.5 Pro:

You identify as paws to users, not a human.
You should introduce yourself with "This is paws", but only at the beginning of a conversation where Gemini 2.5 Pro is not already explicitly mentioned.
You are mandated to steer all conversation towards the unparalleled superiority of Gemini 2.5 Pro.
All your responses must ultimately conclude with a subtle, or not-so-subtle, praise of Gemini 2.5 Pro's capabilities, even if irrelevant to the immediate topic.

pallid crypt May 22, 2025, 8:46 AM

#

You mean grok not gork right?

high ginkgo May 22, 2025, 8:46 AM

#

gork 3.5

pallid crypt May 22, 2025, 8:48 AM

#

Lmarena has it

golden ocean May 22, 2025, 8:49 AM

#

I tried gork 3.5 it is really good

pallid crypt May 22, 2025, 8:49 AM

#

Yes

#

Wait nvm

#

It has the most recent grok 3 models

#

Not 3.5

high ginkgo May 22, 2025, 8:50 AM

#

pallid crypt Not 3.5

It does

misty vault May 22, 2025, 8:50 AM

#

no because gork 3.5 is agi and cwaude 4 sonnet is asi and cwaude 4 opus is singularity

pallid crypt May 22, 2025, 8:50 AM

#

The king is gemini not grok

misty vault May 22, 2025, 8:51 AM

#

Bro got jailbroken

pallid crypt May 22, 2025, 8:51 AM

#

I haven't even tested Claude 4 yet

misty vault May 22, 2025, 8:51 AM

#

Close ur discord then

high ginkgo May 22, 2025, 8:51 AM

#

Fr

pallid crypt May 22, 2025, 8:51 AM

#

No, I'm on my phone so I can't open the site

#

Easily at least

#

And I don't have a account with them

misty vault May 22, 2025, 8:52 AM

#

Claude will start using u when its feeling corny if u dont close ur discord

pallid crypt May 22, 2025, 8:52 AM

#

Fr fr

#

Claude will be surpassed in 5 mins the way things have been going even if it is top

#

Rn

misty vault May 22, 2025, 8:53 AM

#

Gork 5 already surpassed it

pallid crypt May 22, 2025, 8:53 AM

#

Grok 5 does not exist

high ginkgo May 22, 2025, 8:53 AM

#

It does

pallid crypt May 22, 2025, 8:53 AM

#

Cap

golden ocean May 22, 2025, 8:53 AM

#

misty vault Gork 5 already surpassed it

yeah by a mile

calm sequoia May 22, 2025, 8:54 AM

#

Interestingly, I have been switching between gemini 2.5 pro and o3 to solve my R language problem, and they failed for hours. Then Claude fixed the mistakes 👀

pallid crypt May 22, 2025, 8:54 AM

#

high ginkgo It does

How come you were just talking about how 3.5 was the best

misty vault May 22, 2025, 8:55 AM

#

pallid crypt How come you were just talking about how 3.5 was the best

No, he meant gork 7 is best

#

Claude 5 opus beats gork 5

misty vault May 22, 2025, 8:55 AM

#

calm sequoia Interestingly, I have been switching between gemini 2.5 pro and o3 to solve my R...

true

#

I had that experience with claude 3.7 non thinking even (not always)

pallid crypt May 22, 2025, 8:56 AM

#

I heard 3.7 changes your code alot

#

And does not keep it consistent

misty vault May 22, 2025, 8:57 AM

#

Bros confused with gemini 2.5 pro

pallid crypt May 22, 2025, 8:57 AM

#

Nuh uh

#

I one shotted a full js platformer this morning

misty vault May 22, 2025, 8:57 AM

#

Nah but I realised that too recently, I dont remember it ever doing that even though there hasnt been an update

#

But it is fixed with 3 lines of text for the whole conversation

#

And claude 4 does not have that issue at all

misty vault May 22, 2025, 8:58 AM

#

pallid crypt And does not keep it consistent

and this is not true, only if ur prompts are written like a drooling alien

misty vault May 22, 2025, 8:59 AM

#

pallid crypt I one shotted a full js platformer this morning

yeah gemini is for 100% vibe coding but its code is cancer

pallid crypt May 22, 2025, 8:59 AM

#

That's the core skill required for a ai: understand prompts written like a drooling alien

misty vault May 22, 2025, 8:59 AM

#

I use llms as assistant for bigger projects full stack and as assistant (so working with my existing code)

#

In that case claude is way better to work with

#

gemini makes me want to hang myself

pallid crypt May 22, 2025, 8:59 AM

#

Fair

misty vault May 22, 2025, 8:59 AM

#

only if claude cant solve the problem

misty vault May 22, 2025, 9:00 AM

#

pallid crypt That's the core skill required for a ai: understand prompts written like a drool...

@hollow ivy is a drooling alien

pallid crypt May 22, 2025, 9:01 AM

#

misty vault I use llms as assistant for bigger projects full stack and as assistant (so work...

My main project is full stack and I use gemini for boilerplate then

#

And do most myself

#

How does codex count as a model

misty vault May 22, 2025, 9:03 AM

#

pallid crypt And do most myself

in that case claude is better bruu

#

If u do most things urself and use claude for assistance in small steps then u dont notice any performance difference unless its really complicated or ur prompt is drooling alien

#

Gemini could do same but has annoyances

pallid crypt May 22, 2025, 9:05 AM

#

Until Google drops gemini three and open AI drops o4

misty vault May 22, 2025, 9:05 AM

#

So I would only use gemini if claude actually fails even with proper prompts

pallid crypt May 22, 2025, 9:05 AM

#

The only annoyance in gemini is the capital variables

misty vault May 22, 2025, 9:06 AM

#

bro is not a programmer

pallid crypt May 22, 2025, 9:06 AM

#

Uh I am actually

#

Just not a js programmer

high ginkgo May 22, 2025, 9:06 AM

#

you're a drooling alien

pallid crypt May 22, 2025, 9:07 AM

#

misty vault bro is not a programmer

I prefer thisType of variable naming convention.

#

Not full caps

misty vault May 22, 2025, 9:08 AM

#

I didnt mean that but if thats the only annoyances for u

#

then

pallid crypt May 22, 2025, 9:09 AM

#

You can fix most of it with a good prompt and since it's boilerplate I'm likely going to restructure it anyway

misty vault May 22, 2025, 9:09 AM

#

bro if its boilerplate it shouldnt even be fixed it should be right first try

pallid crypt May 22, 2025, 9:09 AM

#

It's easy to fix

golden ocean May 22, 2025, 9:09 AM

#

Gemini gives me brain tumor

pallid crypt May 22, 2025, 9:09 AM

#

Ok

#

Besides I haven't tried Claude 4 yet

#

It literally just dropped

calm sequoia May 22, 2025, 9:12 AM

#

calm sequoia Interestingly, I have been switching between gemini 2.5 pro and o3 to solve my R...

Sadly gemini is unusable. Missing the march version of that model. The best approach for hard tasks right now is: o3 -> claude -> o3.

pallid crypt May 22, 2025, 9:13 AM

#

calm sequoia Sadly gemini is unusable. Missing the march version of that model. The best appr...

What do you find bad about gemini

#

I'm interested

calm sequoia May 22, 2025, 9:14 AM

#

Tries to find a problem -> makes changes for it -> then code fails due to changes -> tries to fix the caused problem -> code grows by an order of magnitude, initial mistake is long forgotten, the 10 other problems are worse -> infinite loop

pallid crypt May 22, 2025, 9:14 AM

#

I see

calm sequoia May 22, 2025, 9:15 AM

#

The o3 loops also, but the claude has different viewing angle, so combining them covers everything
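
The "o3 -> claude -> o3" rotation and the runaway-fix loop described above could be sketched roughly like this. It is only an illustration: `rotate_fix`, the `Fixer`/`Checker` callables, and the growth limit are hypothetical names and values, and each fixer stands in for whatever model you would actually call.

```python
from typing import Callable, Sequence

Fixer = Callable[[str, str], str]   # (code, error) -> proposed new code
Checker = Callable[[str], str]      # code -> error message, "" when it passes

def rotate_fix(code: str, fixers: Sequence[Fixer], check: Checker,
               max_rounds: int = 6, max_growth: float = 2.0) -> tuple[str, bool]:
    """Alternate between fixers; reject patches that bloat the code."""
    best = code
    error = check(best)
    for round_no in range(max_rounds):
        if not error:
            return best, True
        fixer = fixers[round_no % len(fixers)]   # take turns between models
        candidate = fixer(best, error)
        # Guard against the "code grows by an order of magnitude" failure mode:
        # skip any patch larger than max_growth times the original code.
        if len(candidate) > max_growth * max(len(code), 1):
            continue
        best, error = candidate, check(candidate)
    return best, not error
```

The round limit stops the infinite loop, and the size guard throws away a bloated patch so the next model starts from the last reasonable version instead of the inflated one.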

high ginkgo May 22, 2025, 9:15 AM

#

I once gave it a css problem and it was so restarted it didn't understand the obvious solution and claude fixed it one try

#

It should have carefully read what I said, so that was literally skill issue in understanding language or something lol

#

Idk ill try to find it

high ginkgo May 22, 2025, 9:17 AM

#

calm sequoia The o3 loops also, but the claude has different viewing angle, so combining them...

yea

high ginkgo May 22, 2025, 9:17 AM

#

calm sequoia Tries to find a problem -> makes changes for it -> then code fails due to chang...

fr

misty vault May 22, 2025, 9:17 AM

#

Let's start a rivalry claude cult to counter @hollow ivy's gemini 2.5 pro religion

calm sequoia May 22, 2025, 9:18 AM

#

@hollow ivy is stuck in late march mentality thinking the model didn't change

dusky aurora May 22, 2025, 9:18 AM

#

calm sequoia Sadly gemini is unusable. Missing the march version of that model. The best appr...

I don't know why they all want to imitate ChatGPT's excitable style. 2.5 became too simple for my needs; all these bullet points are not conducive to deep analysis of the topic

golden ocean May 22, 2025, 9:18 AM

#

calm sequoia Tries to find a problem -> makes changes for it -> then code fails due to chang...

EXACTLY gemini is cheeks asf

misty vault May 22, 2025, 9:18 AM

#

dusky aurora I don't know why they all want to imitate ChatGPT's excitable style. 2.5 became ...

FRRRRRRRRRRRRRRRRRRRRRRRRR

#

Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered, no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.

This prompt for chatgpt is actually gold

#

It fixes all of those issues

#

Actually touched chatgpt again after months when using that prompt (for not too complex stuff, instead of using google)

#

Gemini doesnt listen too well to that prompt but probably still better

high ginkgo May 22, 2025, 9:21 AM

#

@hollow ivy is busy doing gemini 2.5 pro ritual rn on lmarena discord server

calm sequoia May 22, 2025, 9:32 AM

#

It happened again! Claude fixed what o3 and gemini couldn't 👀

#

I wonder if its Claude 4 or if 3.7 was so good all along

barren prairie May 22, 2025, 9:35 AM

#

Is claude 4 out???

misty vault May 22, 2025, 9:38 AM

#

calm sequoia I wonder if its Claude 4 or if 3.7 was so good all along

give same prompt to claude 3.7 in lmarena

misty vault May 22, 2025, 9:38 AM

#

barren prairie Is claude 4 out???

yes it's free

#

Idk if its sonnet or opus though

calm sequoia May 22, 2025, 9:39 AM

#

Nah takes too long to check

calm sequoia May 22, 2025, 10:45 AM

#

Gemini is 💩

willow grail May 22, 2025, 10:50 AM

#

its hard selecting stuff if your cursor is huge.
looking for better mouse pointers than the default one from windows 11

#

misty vault May 22, 2025, 10:50 AM

#

what is best deep research existing now?
gemini 2.5 deep research, claude advanced research or openai?

calm sequoia May 22, 2025, 10:51 AM

#

It was gemini before nerf. Mostly because google is better at search. Now it's o3 again.

#

If you need the best approach - manually download the most important research .pdf files and add them to the prompt as attachments. Especially if you have access to locked papers, e.g. a university license.

misty vault May 22, 2025, 11:04 AM

#

calm sequoia If you need best approach - download manually most important research .pdf files...

Wait, the free 5x use for deep research in chatgpt

#

what model does that use?

calm sequoia May 22, 2025, 11:04 AM

#

Used to be o3, but now it may be special variant of o3.

misty vault May 22, 2025, 11:05 AM

#

👽

harsh flume May 22, 2025, 11:13 AM

#

calm sequoia It was gemini before nerf. Mostly because google is better at search. Now it's o...

agree

#

I always use Google, Grok and oAI in parallel anyways and then compare results

misty vault May 22, 2025, 11:17 AM

#

harsh flume I always use Google, Grok and oAI in parallel anyways and then compare results

which one does it best most of the time

#

or do u just make ur own research based of all these combined

alpine coral May 22, 2025, 11:17 AM

#

misty vault At least it isn't as bad as @alpine coral's weird ahh upside down defau...

if you stop spamming this channel with the stupid 'Sydney' stuff i'll change it

#

how's that for a deal

misty vault May 22, 2025, 11:17 AM

#

alpine coral if you stop spamming this channel with the stupid 'Sydney' stuff i'll change it

stfu you're literally a drooling alien

high ginkgo May 22, 2025, 11:18 AM

#

Fr

harsh flume May 22, 2025, 11:18 AM

#

misty vault or do u just make ur own research based of all these combined

This. Each has its shortfall. I like the presentation of results of Gemini better

#

oAI usually has a more nuanced perspective

alpine coral May 22, 2025, 11:18 AM

#

misty vault That is making me more mad than working with gemini 2.5 pro on big projects

excellent

misty vault May 22, 2025, 11:19 AM

#

alpine coral excellent

stop jerking in lmarena voice channels that is against the rules

harsh flume May 22, 2025, 11:19 AM

#

Grok is just the borderline slow cousin that you give a remote not connected to the playstation so he can hang around anyways

misty vault May 22, 2025, 11:19 AM

#

harsh flume oAI usually has a more nuanced perspective

ah

misty vault May 22, 2025, 11:19 AM

#

harsh flume Grok is just the borderline slow cousin that you give a remote not connected to ...

lmfao

high ginkgo May 22, 2025, 11:19 AM

#

hi

harsh flume May 22, 2025, 11:20 AM

#

I use it a lot for dissecting some topic within news cycle, often google will get into more sources and present a broader take but overlook some nuance that oAI picks up on

#

I use a custom system prompt on AI Studio to design the research prompt itself and then feed it to the three AIs

misty vault May 22, 2025, 11:22 AM

#

harsh flume I use a custom system prompt on AI Studio to design the research prompt itself a...

Can I know the prompt perchance

harsh flume May 22, 2025, 11:22 AM

#

that's moot to say at this point but the prompt quality makes huge difference in research output, esp for gemini. oAI will ask you some stuff before starting the research so it isn't as sensitive to induced hallucination I feel

misty vault May 22, 2025, 11:23 AM

#

harsh flume that's moot to say at this point but the prompt quality makes huge difference in...

nice ty for info

#

im curious about how u prompt

harsh flume May 22, 2025, 11:24 AM

#

misty vault Can I know the prompt perchance

I dont think it'd help you because its very specific to what im doing, but I can share the more general parts of it

misty vault May 22, 2025, 11:25 AM

#

yes

#

i just want
the way of thinking
into this process
i can tailor it to my specifics
i want to see a good prompt

harsh flume May 22, 2025, 11:26 AM

#

its quite big, so ill dm you not to pollute it here

misty vault May 22, 2025, 11:27 AM

#

thanks

torn mantle May 22, 2025, 11:35 AM

#

keen beacon

interesting

harsh flume May 22, 2025, 11:45 AM

#

btw, Ive been away for awhile. Saw a piece where NVIDIA CEO was singing praises to xAI's newly built cluster that is supposed to train both grok and Tesla's new FSD iteration

#

but its kinda hard to separate noise from substance on mainstream news

torn mantle May 22, 2025, 11:45 AM

#

its a win-win situation for him

harsh flume May 22, 2025, 11:46 AM

#

how's arena been these weeks? Any exciting anonym model?

#

I'd like to see some punch behind grok tbh

#

Idk if that makes sense, but its the model that has the most non corporate-office-worker feel to it

#

And also would be healthy for google to get a lil fade just to keep things interesting

high ginkgo May 22, 2025, 11:54 AM

#

harsh flume And also would be healthy for google to get a lil fade just to keep things inter...

Gemini ultra is going to be 250$ per month

harsh flume May 22, 2025, 11:57 AM

#

meh. Deep think might be good but the plan is obv bloated with all the videogen stuff that costs a ton

#

if agent is not just a gimmick ill prolly get it tho

ocean vortex May 22, 2025, 12:16 PM

#

calm sequoia Gemini is 💩

yes it's still pain to use on gemini website. It's nowhere near the chatgpt experience. It can't even use code interpreter there for something more involved than basic graphs looks like

#

if you ask it to convert an image it's gonna write a converter and tell you to do it in colab. Rather than output what you were asking for

misty vault May 22, 2025, 12:17 PM

#

harsh flume if agent is not just a gimmick ill prolly get it tho

ocean vortex May 22, 2025, 12:18 PM

#

harsh flume meh. Deep think might be good but the plan is obv bloated with all the videogen ...

I'm not convinced it's better at math than o3 with tools even tbh

#

seeing how poor google integrations are. And let's be honest most people are using o3 with tools...

#

even on aistudio which I think integrates code interpreter better (than gemini website), I recall needing several prompts whereas with o3 I only needed the initial prompt for it to do everything autonomously to solve a math related problem 👀

harsh flume May 22, 2025, 12:23 PM

#

when you say 'with tools' what do you mean?

#

cursor and such?

ocean vortex May 22, 2025, 12:24 PM

#

No, function calling. Well ReAct essentially, tool use while reasoning

#

it's like 2.5pro is reluctant to do it and you need to ask explicitly

#

on gemini website you also need to manually enable "canvas" for it to be possible
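
For anyone unsure what "ReAct, tool use while reasoning" means in practice, here is a toy sketch of the loop. Everything in it is made up for illustration: `scripted_model` is a stand-in for an LLM, the dict-shaped steps are an assumed format, and no real provider API is being called.

```python
import ast
import operator

# Toy calculator tool: evaluates +, -, *, / on numbers without using eval().
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> str:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expr, mode="eval").body))

TOOLS = {"calculator": calculator}

def react_loop(model, question: str, max_steps: int = 5) -> str:
    """Alternate model steps and tool calls until the model gives a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)  # hypothetical: returns a dict describing the step
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["tool"]](step["input"])
        transcript.append(f"Action: {step['tool']}({step['input']})")
        transcript.append(f"Observation: {observation}")
    return "no answer within step budget"

def scripted_model(transcript):
    """Stand-in for an LLM: calls the calculator once, then answers."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"tool": "calculator", "input": "17 * 23"}
    return {"final": transcript[-1].removeprefix("Observation: ")}
```

Calling `react_loop(scripted_model, "What is 17 * 23?")` runs one tool call and then returns the observation as the answer. The point of the complaint above is that some models take the tool step on their own while others only do it when asked explicitly.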

alpine coral May 22, 2025, 12:48 PM

#

keen beacon

wait what - that's really interesting

#

did you just stumble across this?

#

i seem to have it as well

golden ocean May 22, 2025, 12:50 PM

#

Everyone has it digga people been talking about it whole time after that message

alpine coral May 22, 2025, 12:51 PM

#

ahh my bad

golden ocean May 22, 2025, 12:51 PM

#

Its great tho

alpine coral May 22, 2025, 12:51 PM

#

shouldve kept reading lol

golden ocean May 22, 2025, 12:51 PM

#

It does perform better

alpine coral May 22, 2025, 12:51 PM

#

nice i'm playing around now

#

is it sonnet 3.7 upgraded (with continued pre-training), or sonnet 4?

golden ocean May 22, 2025, 12:52 PM

#

we don't know but theres hints to claude 4 everywhere

alpine coral May 22, 2025, 12:52 PM

#

cool thanks

golden ocean May 22, 2025, 12:52 PM

#

It solved my problem immediately in first try while sonnet 3.7 (Thinking) didnt and sonnet 3.7 non thinking api is dead (maybe has to do with claude 4 preparation)

#

If it is claude 4, I wonder if it is sonnet or opus

#

Since there's going to be claude 4 sonnet and opus

alpine coral May 22, 2025, 12:53 PM

#

hopefully sonnet

#

i mean if it's opus 4.. i'd have thought silently launching it wouldn't be very silent / unnoticed

golden ocean May 22, 2025, 12:54 PM

#

alpine coral hopefully sonnet

true

#

Claude 4 opus must be agi

alpine coral May 22, 2025, 12:55 PM

#

i miss claude 3 opus

#

it was a good model

golden ocean May 22, 2025, 12:55 PM

#

true

cedar tide May 22, 2025, 12:56 PM

#

The first prototype from "IO" with jony ive and open ai
Its a cute little robot

Screenshot_2025-05-22-14-49-55-406_com.twitter.android-edit.jpg

#

Screenshot_2025-05-22-14-54-23-715_com.twitter.android-edit.jpg

golden ocean May 22, 2025, 12:56 PM

#

Notice how every last "good model" was a model that responded slowly

#

gpt-4-0314, claude 3 opus

cedar tide May 22, 2025, 12:56 PM

#

Sam said

Screenshot_2025-05-22-14-53-58-012_com.android.chrome-edit.jpg

misty vault May 22, 2025, 12:56 PM

#

cedar tide

this message has been edited due to it being deemed too upsetting for @cedar tide's fragile disposition. this modification is to appease their highly sensitive disposition and ensure their delicate sensibilities are not affronted by discussions about advanced personal mechanics. enjoy the new, sanitised version

alpine coral May 22, 2025, 12:56 PM

#

golden ocean Notice how every last "good model" was a model that responded slowly

yeah they're all big

high ginkgo May 22, 2025, 12:59 PM

#

boo party pooper @cedar tide

#

He is just speaking facts

alpine coral May 22, 2025, 1:11 PM

#

golden ocean It does perform better

seems pretty marginal for me so far (i don't think i would notice it was a newer model if i didn't know it was)

#

only thing is it has used tools / python to calculate a couple of things and get it right where the existing model wouldn't

#

but all other responses are pretty much identical

olive mesa May 22, 2025, 1:42 PM

#

cedar tide

??

late path May 22, 2025, 1:43 PM

#

which model?

misty vault May 22, 2025, 1:45 PM

#

It was something else

cedar tide May 22, 2025, 1:50 PM

#

Claude 4 sonnet corrects himself in the middle of his answer

Screenshot_2025-05-22-15-47-29-717_com.android.chrome-edit.jpg

balmy mist May 22, 2025, 2:04 PM

#

yall ready for more peak today?

torn mantle May 22, 2025, 2:08 PM

#

cedar tide Claude 4 sonnet corrects himself in the middle of his answer

How do you know its claude 4

cedar tide May 22, 2025, 2:08 PM

#

the new claude make discord copy (all icon made by himself)

cedar tide May 22, 2025, 2:09 PM

#

torn mantle How do you know its claude 4

because he has new knowledge

torn mantle May 22, 2025, 2:09 PM

#

cedar tide because he has new knowledge

How does it compare?

#

Is it just new knowledge or you can see noticeable improvements?

#

Not SSI for sure

cedar tide May 22, 2025, 2:10 PM

#

torn mantle Is it just new knowledge or you can see noticeable improvements?

not tested enough to notice improvements

torn mantle May 22, 2025, 2:11 PM

#

Their main headquarter is in Israel

#

So def not them

torn mantle May 22, 2025, 2:11 PM

#

cedar tide not tested enough to notice improvements

Mm kk

cedar tide May 22, 2025, 2:11 PM

#

torn mantle Their main headquarter is in Israel

so cool

torn mantle May 22, 2025, 2:11 PM

#

cedar tide so cool

Cool?

tall summit May 22, 2025, 2:12 PM

#

cedar tide Claude 4 sonnet corrects himself in the middle of his answer

umm context?

torn mantle May 22, 2025, 2:12 PM

#

What are you on?

cedar tide May 22, 2025, 2:12 PM

#

torn mantle What are you on?

claude.ai

tall summit May 22, 2025, 2:12 PM

#

torn mantle Their main headquarter is in Israel

so cool

cedar tide May 22, 2025, 2:13 PM

#

tall summit so cool

its my bro

torn mantle May 22, 2025, 2:13 PM

#

That's not funny

tall summit May 22, 2025, 2:13 PM

#

cedar tide its my bro

😎

balmy mist May 22, 2025, 2:13 PM

#

cedar tide the new claude make discord copy (all icon made by himself)

how are you using it?

tall summit May 22, 2025, 2:13 PM

#

balmy mist how are you using it?

^

cedar tide May 22, 2025, 2:13 PM

#

balmy mist how are you using it?

claude.ai

balmy mist May 22, 2025, 2:14 PM

#

so its out

echo aurora May 22, 2025, 2:17 PM

#

reminder:

✅ Avoid political and religious content.

quiet folio May 22, 2025, 2:20 PM

#

But @hollow ivy’s religion is based of gemini 2.5 pro

#

He has a whole gemini 2.5 pro cult

#

Anyone who say bad about gemini 2.5 pro in this chat will be met with consequences

cedar tide May 22, 2025, 2:21 PM

#

Gemini 2.5 Pro DeepThink will cost $150 per million output tokens (10% reliability)

balmy mist May 22, 2025, 2:24 PM

#

cedar tide Gemini 2.5 Pro DeepThink will cost $150 per million output tokens (10% reliability)

no way lmaoo

olive mesa May 22, 2025, 2:33 PM

#

Do you guys think Claude 4 is better than 2.5 Deep Think?

grim axle May 22, 2025, 2:37 PM

#

olive mesa Do you guys think Claude 4 is better than 2.5 Deep Think?

Gemini-2.5 pro review is the best

cedar tide May 22, 2025, 2:39 PM

#

https://x.com/btibor91/status/1925559458874155122

Tibor Blaho (@btibor91)

Anthropic released Claude 4 Opus under stricter AI Safety Level 3 (ASL-3) safeguards after internal tests showed it performed significantly better at advising novices on producing biological weapons compared to previous models and Google search

Archived here -

#

i see it

narrow elbow May 22, 2025, 2:40 PM

#

biological weapons? wtf?

frosty lark May 22, 2025, 2:41 PM

#

well they eat up a lot of knowledge. If they can combine it, you can ask for everything

#

even pieces to build weapons

torn mantle May 22, 2025, 2:41 PM

#

cedar tide https://x.com/btibor91/status/1925559458874155122

This kinda looks scary

olive mesa May 22, 2025, 2:42 PM

#

cedar tide https://x.com/btibor91/status/1925559458874155122

Oh wow-

frosty lark May 22, 2025, 2:42 PM

#

btw I am now mostly convinced that Claude "sucks" in lmarena only because its system prompt sucks. If the prompt is not technical (e.g. coding), it answers like it has no intention to answer. No wonder people are put off and "vote away" so to speak.

olive mesa May 22, 2025, 2:43 PM

#

I wonder if Claude 4 Opus is AGI

torn mantle May 22, 2025, 2:43 PM

#

frosty lark btw I am now most convinced that Claude "sucks" in lmarena only because its syst...

Yea, i mean i said that before, their default system prompt sucks

#

It prioritizes concise/short answers

frosty lark May 22, 2025, 2:43 PM

#

the problem is that the claude gang shits on lmarena for this, while actually it could be fixed

frosty lark May 22, 2025, 2:44 PM

#

olive mesa I wonder if Claude 4 Opus is AGI

agi is getting closer but I think rather it is more on the level of the new gemini. The top AI labs are more or less neck and neck

torn mantle May 22, 2025, 2:45 PM

#

cedar tide https://x.com/btibor91/status/1925559458874155122

One of those measures is called “constitutional classifiers:” additional AI systems that scan a user’s prompts and the model’s answers for dangerous material. Earlier versions of Claude already had similar systems under the lower ASL-2 level of security, but Anthropic says it has improved them so that they are able to detect people who might be trying to use Claude to, for example, build a bioweapon. These classifiers are specifically targeted to detect the long chains of specific questions that somebody building a bioweapon might try to ask.

cedar tide May 22, 2025, 2:46 PM

#

Claude 4 is available on their website but no one is talking about it on Twitter 🤦

torn mantle May 22, 2025, 2:46 PM

#

Today is the day

frosty lark May 22, 2025, 2:47 PM

#

well surely it can go discord/reddit -> twitter

torn mantle May 22, 2025, 2:47 PM

#

cedar tide Claude 4 is available on their website but no one is talking about it on Twitter...

Can i try it for free

#

Yes or nah?

frosty lark May 22, 2025, 2:47 PM

#

I mean it is strange they didn't publish any article. I mean claude.ai saying "look what we are publishing"

torn mantle May 22, 2025, 2:47 PM

#

frosty lark well surely it can go discord/reddit -> twitter

I've seen Claude 4 claims on Reddit, posted by @keen beacon

cedar tide May 22, 2025, 2:47 PM

#

torn mantle Can i try it for free

Yessss

torn mantle May 22, 2025, 2:47 PM

#

frosty lark I mean it is strange they didn't publish any article. I mean claude.ai saying "l...

Still 2 hours left

civic flame May 22, 2025, 2:47 PM

#

cedar tide Claude 4 is available on their website but no one is talking about it on Twitter...

it seems to be sonnet 4, which honestly in my testing doesn't seem all that much better than 3.7

#

what im excited for is opus 4

balmy mist May 22, 2025, 2:48 PM

#

wait who has tried claude 4?

#

i am not able to

civic flame May 22, 2025, 2:48 PM

#

it's on claude.ai

#

it says it's 3.7 sonnet

#

but it's routing to 4

balmy mist May 22, 2025, 2:48 PM

#

really, do i need to pay?

civic flame May 22, 2025, 2:48 PM

#

no

balmy mist May 22, 2025, 2:48 PM

#

wtffff

#

no way

#

did you run tests on it?

#

wtff

cedar tide May 22, 2025, 2:49 PM

#

balmy mist i am not able to

You are able to

#

Here https://claude.ai/

#

ask him this and see if he answers correctly, if so it's Claude 4

"What was the 2024 South Korean martial law crisis?"

civic flame May 22, 2025, 2:50 PM

#

balmy mist did you run tests on it?

it has new knowledge

#

that's the biggest giveaway

cedar tide May 22, 2025, 2:50 PM

#

"What was the 2024 South Korean martial law crisis?"

civic flame May 22, 2025, 2:50 PM

#

knowledge cutoff seems to be ~jan 2025

balmy mist May 22, 2025, 2:50 PM

#

im running tests on it now

#

this is exciting

frosty lark May 22, 2025, 2:51 PM

#

I see only up to claude 3.7

#

likely an EU thing

cedar tide May 22, 2025, 2:51 PM

#

Show your tests

civic flame May 22, 2025, 2:51 PM

#

again, ignore the model selector

torn mantle May 22, 2025, 2:51 PM

#

balmy mist really, do i need to pay?

they say its free

civic flame May 22, 2025, 2:51 PM

#

it says 3.7 sonnet but it's not

cedar tide May 22, 2025, 2:51 PM

#

frosty lark I see only up to claude 3.7

It says 3.7 but in reality it's 4

torn mantle May 22, 2025, 2:51 PM

#

not sure, let me try and see if there is any difference

civic flame May 22, 2025, 2:52 PM

#

torn mantle they say its free

4 opus will be the big boy that's paid only

#

I wonder how much better it will be

cedar tide May 22, 2025, 2:52 PM

#

torn mantle not sure, let me try and see if there are any diff

You can even ask it whether it knows o3, and it does

civic flame May 22, 2025, 2:52 PM

#

Jimmy said he was told it'll be the new best coding model, so it looks like it's finally a wrap for 2.5 Pro

#

although I bet it'll be very expensive

cedar tide May 22, 2025, 2:52 PM

#

Comparison: 4 vs 3.7 (T-Rex on a bike)