#general | Arena | Page 44

torn mantle May 22, 2025, 2:54 PM

#

looks the same to me

balmy mist May 22, 2025, 2:54 PM

#

yooo claude just got 7/10 on simple bench public

#

bruhhh

#

claude cooked

torn mantle May 22, 2025, 2:54 PM

#

doesnt look that different

balmy mist May 22, 2025, 2:54 PM

#

gonna do pokemon test

cedar tide May 22, 2025, 2:54 PM

#

The most important thing for me is that they must lower the price of the API.

torn mantle May 22, 2025, 2:54 PM

#

yea

#

wait

#

is that you leo

balmy mist May 22, 2025, 2:54 PM

#

lol

torn mantle May 22, 2025, 2:54 PM

#

we have like 4 leo

balmy mist May 22, 2025, 2:54 PM

#

yeah thats him

torn mantle May 22, 2025, 2:55 PM

#

bruh

balmy mist May 22, 2025, 2:55 PM

#

i think i might pay for claude now

#

idk

torn mantle May 22, 2025, 2:55 PM

#

lol no

#

wait

balmy mist May 22, 2025, 2:55 PM

#

its so hard to choose damn

cedar tide May 22, 2025, 2:55 PM

#

we also hope for 1m of context and complete multimodality

civic flame May 22, 2025, 2:56 PM

#

torn mantle is that you leo

lol yeah the rest were nuked

torn mantle May 22, 2025, 2:56 PM

#

whatever im testing on their website is < sonnet 3.7 max thinking

civic flame May 22, 2025, 2:56 PM

#

balmy mist i think i might pay for claude now

i don't have to stress, anthropic gave me a generous sum of free credits :3

cedar tide May 22, 2025, 2:56 PM

#

torn mantle whatever im testing on their website is < sonnet 3.7 max thinking

What ?

civic flame May 22, 2025, 2:56 PM

#

torn mantle whatever im testing on their website is < sonnet 3.7 max thinking

well yes that's because it's not thinking

torn mantle May 22, 2025, 2:57 PM

#

civic flame well yes that's because it's not thinking

i see

#

so thats only the instruct model

frosty lark May 22, 2025, 2:57 PM

#

so I am no @earnest parcel but I started to collect (late) some relatively tricky questions for LLMs.

From what I can see there is no big change compared to Claude 3.7 (if I am getting Claude 4 ofc)

balmy mist May 22, 2025, 2:57 PM

#

i thought the new model would choose when to think and when not to think?

tall summit May 22, 2025, 2:57 PM

#

frosty lark so I am no @earnest parcel but I started to collect (late) some relativel...

#share-prompts

frosty lark May 22, 2025, 2:58 PM

#

Especially something like this made me scratch my head.

#

I mean sure it is correct (what is important) but not that maintainable

narrow elbow May 22, 2025, 2:58 PM

#

balmy mist its so hard to choose damn

Adults don't choose, take it all.🤑

torn mantle May 22, 2025, 2:59 PM

#

make sense why its free

#

they are talking about creation of bioweapons

#

but isnt it possible with the current LLMs

#

i just dont understand anthropic really

#

is the model really super smart or what

frosty lark May 22, 2025, 3:00 PM

#

tall summit #share-prompts

done

torn mantle May 22, 2025, 3:00 PM

#

pretty sure if you get o3 jailbreaked or gemini 2.5 pro you can do such things as well

#

Dario was always obsessed with the extra security and safety of their models

frosty lark May 22, 2025, 3:03 PM

#

I think it is good. I mean, as wrote: those models "compress" a lot of human knowledge and if they connect the dots appropriately, they can deliver snippets that can be useful for others to progress their work.

Humans aren't stupid (yet) and they can piece the snippets together

narrow elbow May 22, 2025, 3:03 PM

#

torn mantle Dario was always obsessed with the extra security and safety of their models

it is difficult to prevent human questioning techniques, or to distinguish real scientific research

torn mantle May 22, 2025, 3:04 PM

#

the best instruct model we have so far is grok 3

drifting thorn May 22, 2025, 3:12 PM

#

torn mantle Dario was always obsessed with the extra security and safety of their models

It’s kinda like lobotomy to their models

ember rapids May 22, 2025, 3:12 PM

#

Hopefully refusals aren’t too bad for opus

drifting thorn May 22, 2025, 3:12 PM

#

There’s always malicious requests

cedar tide May 22, 2025, 3:12 PM

#

torn mantle the best instruct model we have so far is grok 3

Or GPT 4.5

drifting thorn May 22, 2025, 3:13 PM

#

Have you guys heard of the Continuous Thought Machine by SakanaAI?

#

I think replacing the input processing and the FFN in transformers with a Continuous Thought Machine would lead to AI that can think internally

cedar tide May 22, 2025, 3:17 PM

#

Give me some prompts to test for Claude 4 please

elder rapids May 22, 2025, 3:18 PM

#

torn mantle whatever im testing on their website is < sonnet 3.7 max thinking

damn fr?

#

is it another thinking model?

frosty lark May 22, 2025, 3:19 PM

#

so, it is impressive it codes (python/JS) but the answer is - considering the hype - not where it should be. Be it Claude 3.7 or Claude 4.

torn mantle May 22, 2025, 3:20 PM

#

elder rapids is it another thinking model?

the free model im using shows 'claude sonnet 3.7' and leo said its referring to their latest model claude 4 sonnet

cedar tide May 22, 2025, 3:21 PM

#

frosty lark so, it is impressive it codes (python/JS) but the answer is - considering the hy...

The hype is for opus

frosty lark May 22, 2025, 3:21 PM

#

also it writes a lot of

"You're absolutely right! "
"You're absolutely right to question that!"
"Brilliant point! You're absolutely right"

and so on.

cedar tide May 22, 2025, 3:21 PM

#

torn mantle the free model im using shows 'claude sonnet 3.7' and leo said its referring to ...

Its true

frosty lark May 22, 2025, 3:21 PM

#

frosty lark also it writes a lot of "You're absolutely right! " "You're absolutely right to...

interestingly the claude models avoid that in lmarena

elder rapids May 22, 2025, 3:24 PM

#

torn mantle the free model im using shows 'claude sonnet 3.7' and leo said its referring to ...

oh fr?

#

I thought Claude 4 wasn't going to be

frosty lark May 22, 2025, 3:25 PM

#

Bildschirmfoto_2025-05-22_um_17.22.06.png

Bildschirmfoto_2025-05-22_um_17.22.21.png

Bildschirmfoto_2025-05-22_um_17.22.27.png

Bildschirmfoto_2025-05-22_um_17.22.39.png

Bildschirmfoto_2025-05-22_um_17.23.21.png

Bildschirmfoto_2025-05-22_um_17.23.26.png

elder rapids May 22, 2025, 3:25 PM

#

in the UI

frosty lark May 22, 2025, 3:26 PM

#

frosty lark

I mean, still impressive to do that work with few prompts in few minutes. But it doesn't match the hype

#

Taking the results as is would be full of mistakes

torn mantle May 22, 2025, 3:29 PM

#

whatever is being tested on claude.ai is not that impressive for difficult-hard questions

frosty lark May 22, 2025, 3:29 PM

#

do you mean that the questions aren't hard or the answers are underwhelming?

elder rapids May 22, 2025, 3:31 PM

#

torn mantle whatever is being tested on claude.ai is not that impressive for difficult-hard ...

tbh I don't expect the new models to be any good

#

or at least better

alpine coral May 22, 2025, 3:31 PM

#

sonnet-3.7 on claude chat has more recent knowledge than the 0219 one

elder rapids May 22, 2025, 3:31 PM

#

anthropic doesn't have the researchers nor the compute to brute force like openAI and innovate like deepmind

alpine coral May 22, 2025, 3:32 PM

#

and maybe better tool usage. but otherwise it doesn't seem a step change from my usage so far

frosty lark May 22, 2025, 3:32 PM

#

I have the feeling we are approaching "slow" improvements until a new architecture pops out (like transformers and RL did so far). Actually I am still waiting for a LLM orchestrator that picks specialized models according to the prompt and answer.

#

one doesn't need AGI. Near AGI is enough if the orchestration of narrow AIs is good enough.
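
A rough sketch of the orchestrator idea above, assuming a plain keyword router in Python. The `route` and `orchestrate` functions, the keyword lists, and the toy handlers in `SPECIALISTS` are all hypothetical stand-ins, not any vendor's API; a real orchestrator would more likely use a classifier or a small LLM for routing, and could also inspect the answer, which this sketch does not do.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword lists; purely illustrative.
CODE_KEYWORDS = ("def ", "bug", "compile", "python", "javascript")
MATH_KEYWORDS = ("prove", "integral", "equation", "solve")


def route(prompt: str) -> str:
    """Pick a specialist label from the prompt text alone."""
    text = prompt.lower()
    if any(keyword in text for keyword in CODE_KEYWORDS):
        return "code"
    if any(keyword in text for keyword in MATH_KEYWORDS):
        return "math"
    return "general"


# Toy handlers standing in for calls to specialized models.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "code": lambda prompt: f"[code] {prompt}",
    "math": lambda prompt: f"[math] {prompt}",
    "general": lambda prompt: f"[general] {prompt}",
}


def orchestrate(
    prompt: str,
    specialists: Dict[str, Callable[[str], str]] = SPECIALISTS,
) -> Tuple[str, str]:
    """Route the prompt, then return (label, specialist answer)."""
    label = route(prompt)
    handler = specialists.get(label, specialists["general"])
    return label, handler(prompt)


if __name__ == "__main__":
    for example in ("Fix this Python bug", "Solve the equation x + 1 = 2", "Tell me a joke"):
        print(orchestrate(example))
```

The fallback to `"general"` is there so an unknown label never raises; swapping a handler for a real model call would not change the routing logic.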

alpine coral May 22, 2025, 3:33 PM

#

i haven't looked into it / no idea if it's actually consequential.. but the text 'diffusion' model google announced at io struck me as kinda new

#

adapting what works for image gen to text

#

some kinda parallel stuff going on

#

i think

#

beyond that.. i don't really know what it's about ha but yeah it sounded at least like kinda new

elder rapids May 22, 2025, 3:36 PM

#

frosty lark I have the feeling we are approaching "slow" improvements until a new architectu...

this is why everyone thinks DeepMind is winning

#

and will win

torn mantle May 22, 2025, 3:41 PM

#

alpine coral sonnet-3.7 on claude chat has more recent knowledge than the 0219 one

yes

#

thats the only difference im noticing

#

the model isnt smart or anything

unborn ocean May 22, 2025, 3:42 PM

#

they are already taking down the 3.7 model on many providers

#

can't be long

#

before they switch

misty vault May 22, 2025, 3:42 PM

#

alpine coral i haven't looked into it / no idea if it's actually consequential.. but the text...

mercury coder did it before 3 months ago

frosty lark May 22, 2025, 3:43 PM

#

unborn ocean they are already taking down the 3.7 model on many providers

I don't get it, Claude 3.5 was still there the entire time, why taking away 3.7

#

I mean at least for a certain period where people get their workflow to fit the new model

alpine coral May 22, 2025, 3:44 PM

#

misty vault mercury coder did it before 3 months ago

ah i see.. less innovative than i thought then

torn mantle May 22, 2025, 3:44 PM

#

https://x.com/alexalbert__/status/1925575375029248427

Alex Albert (@alexalbert__)

T-1 hour until livestream starts!

alpine coral May 22, 2025, 3:44 PM

#

do you know if it [mercury coder] is any good?

torn mantle May 22, 2025, 3:44 PM

#

45mins left

misty vault May 22, 2025, 3:44 PM

#

cedar tide Or GPT 4.5

the new gemini flash is also actually way better for instructions like really good

#

But idk compared to grok or 4.5

#

Probably 4.5 king because it is obese

misty vault May 22, 2025, 3:46 PM

#

torn mantle https://x.com/alexalbert__/status/1925575375029248427

social_credit

unborn ocean May 22, 2025, 3:46 PM

#

frosty lark I don't get it, Claude 3.5 was still there the entire time, why taking away 3.7

more efficient?, idk

drifting thorn May 22, 2025, 3:53 PM

#

I wonder when OpenAI will use 4.5 to train a reasoning model

#

The way AlphaEvolve and Absolute Zero Reasoner do

sage raptor May 22, 2025, 3:54 PM

#

https://www.youtube.com/watch?v=A6MQ3AuE8Bk

YouTube

Anthropic

Claude 4 | Research for comprehensive analysis

Claude offers advanced Research capabilities, searching the web and your Google workspace to complete advanced analysis across any topic.

In this demo, Claude analyzes Maggie’s emails and calendar to create an organized daily overview, then conducts a thorough literature review for an education proposal.

Read more about Claude 4 and its c...


drifting thorn May 22, 2025, 3:56 PM

#

It’s not that “special”

torn mantle May 22, 2025, 3:57 PM

#

Dario is like Ilya

#

afraid of everything

grim axle May 22, 2025, 3:58 PM

#

misty vault the new gemini flash is also actually way better for instructions like really go...

What about for coding?

torn mantle May 22, 2025, 3:59 PM

#

the report looks much better than the previous one

elder rapids May 22, 2025, 4:02 PM

#

misty vault the new gemini flash is also actually way better for instructions like really go...

yep

#

2.5 flash was already good asf

#

but now I think it's the best available rn imo

#

by a large margin

grim axle May 22, 2025, 4:02 PM

#

no pro is

elder rapids May 22, 2025, 4:03 PM

#

the point is instruction following

grim axle May 22, 2025, 4:04 PM

#

What about this?

http://aistudio.google.com/app/prompts/new_chat?model=gemini-2.5-pro-preview-05-06


quiet folio May 22, 2025, 4:04 PM

#

claude

#

no, claude religion would be valid

#

claude is agi

#

anthropic_claude

grim axle May 22, 2025, 4:05 PM

#

grim axle What about this? http://aistudio.google.com/app/prompts/new_chat?model=gemini-2...

Gemini 2.5 pro is also good for instructions it’s just slower when responding 🤷‍♂️

#

Claude has a memory of a fish

misty vault May 22, 2025, 4:06 PM

#

im sucking off anthropic now its over for gpt 4

#

social_credit

drifting thorn May 22, 2025, 4:13 PM

#

Actually GPT 4 was a big boy

#

1.7 trillion parameters

#

So why are companies now sticking to models at the hundreds-of-billions-parameter level?

misty vault May 22, 2025, 4:13 PM

#

It was even bigger than @alpine coral

misty vault May 22, 2025, 4:14 PM

#

drifting thorn So why are companies now sticking to models at the hundreds-of-billions-parameter...

money

#

The mainstream media and everyday people dont care about gpt 4 they all love modern chatgpt

#

gpt 4o marketing brings them more money

drifting thorn May 22, 2025, 4:14 PM

#

misty vault The mainstream media and everyday people dont care about gpt 4 they all love mod...

?

misty vault May 22, 2025, 4:14 PM

#

compared to if they served expensive gpt 4

drifting thorn May 22, 2025, 4:14 PM

#

GPT 4 was the topic

misty vault May 22, 2025, 4:14 PM

#

no

drifting thorn May 22, 2025, 4:14 PM

#

As it was the only model available back in the days

misty vault May 22, 2025, 4:14 PM

#

it was a hot topic for coders and ai enthusiasts

#

ai wasnt that big yet in gpt 4 age compared to how hyped it is now

drifting thorn May 22, 2025, 4:15 PM

#

At that moment we didn’t have Gemini and Claude

misty vault May 22, 2025, 4:15 PM

#

It reached way more everyday people now

#

like @alpine coral

#

It even reached drooling aliens

cedar tide May 22, 2025, 4:16 PM

#

https://x.com/test_tm7873/status/1925583609563328609?t=3lyGxFlRpQcEe_KGWb-ssQ&s=19

testtm (@test_tm7873)

its now official Claude Opus 4 confirmed!

misty vault May 22, 2025, 4:16 PM

#

I really hope claude 4 is also obese model like 3 opus or something

#

I'm afraid not though

#

anthropic couldnt afford that could they

elder rapids May 22, 2025, 4:16 PM

#

grim axle Gemini 2.5 pro is also good for instructions it’s just slower when responding 🤷...

yep but that's not the point

drifting thorn May 22, 2025, 4:17 PM

#

Opus should be obese

elder rapids May 22, 2025, 4:17 PM

#

are you paying for 2.5 flash or 2.5 pro

#

doing instruction following tasks

drifting thorn May 22, 2025, 4:17 PM

#

And I’m now putting high hopes on CTM

misty vault May 22, 2025, 4:17 PM

#

gemini 2.5pro is more willing to follow any instructions but its not good at actually succeeding to follow the instructions properly compared to gpt 4.5 (or 4) or 2.5 flash

drifting thorn May 22, 2025, 4:17 PM

#

It’s basically toooo fascinating

cedar tide May 22, 2025, 4:19 PM

#

https://fixupx.com/test_tm7873/status/1925585410677190709?t=XQb6D56Q6v1jSrYSuAtSng&s=19

testtm (@test_tm7873)

i saved one of the Anthropic videos about Claude 4.
"Claude 4 | Research for comprehensive analysis"
the rest is arleady set to private. but atleast one i saved. :D


misty vault May 22, 2025, 4:20 PM

#

cedar tide https://fixupx.com/test_tm7873/status/1925585410677190709?t=XQb6D56Q6v1jSrYSuAtS...

LMAO

#

wdym

cedar tide May 22, 2025, 4:21 PM

#

Screenshot_2025-05-22-18-21-11-229_com.android.chrome-edit.jpg

misty vault May 22, 2025, 4:21 PM

#

cedar tide https://fixupx.com/test_tm7873/status/1925585410677190709?t=XQb6D56Q6v1jSrYSuAtS...

Okay so it's confirmed

#

we have claude 4

sage raptor May 22, 2025, 4:22 PM

#

https://support.anthropic.com/en/articles/8114494-how-up-to-date-is-claude-s-training-data

How up-to-date is Claude's training data? | Anthropic Help Center

#

100% confirmed

torn mantle May 22, 2025, 4:23 PM

#

https://support.anthropic.com/en/articles/11408405-claude-4-invite-contest

Claude 4 Invite Contest | Anthropic Help Center

cedar tide May 22, 2025, 4:24 PM

#

Screenshot_2025-05-22-18-24-22-758_com.android.chrome-edit.jpg

misty vault May 22, 2025, 4:24 PM

#

omaygot

torn mantle May 22, 2025, 4:24 PM

#

didnt you guys say october 2024

misty vault May 22, 2025, 4:24 PM

#

im getting so hard rn from claude 4

torn mantle May 22, 2025, 4:24 PM

#

you guys have it all wrong

elder rapids May 22, 2025, 4:25 PM

#

torn mantle didnt you guys say october 2024

system prompt could be suggesting October 2024

#

yet it's explaining things past that mark

#

just guessing that's what was meant

#

yep

misty vault May 22, 2025, 4:26 PM

#

im actually going to buy this

#

selling bing chat access gpt-4-32k, gpt-4-preview, gpt-4-0314, gpt-4-turbo for 100$

torn mantle May 22, 2025, 4:27 PM

#

3 min left?

#

https://www.youtube.com/watch?v=EvtPBaaykdo

YouTube

Anthropic

Code with Claude Opening Keynote

Hear directly from Anthropic executives and product leaders.


elder rapids May 22, 2025, 4:29 PM

#

this dumbass music bro 😭😭

torn mantle May 22, 2025, 4:29 PM

#

💃

#

🪩

misty vault May 22, 2025, 4:29 PM

#

torn mantle https://www.youtube.com/watch?v=EvtPBaaykdo

im getting lotion ready

elder rapids May 22, 2025, 4:30 PM

#

deadass

#

alright it's time

#

they're late

#

asf

#

Anthropic L

#

it's been 30 seconds already

misty vault May 22, 2025, 4:31 PM

#

it started

balmy mist May 22, 2025, 4:31 PM

#

yall ready!!

#

im watching it here: https://www.youtube.com/watch?v=dt2oDtssDNg

YouTube

Wes Roth

CODE with CLAUDE Event | Rumored Claude 4 Opus release (NOT confirmed)

The latest AI News. Learn about LLMs, Gen AI and get ready for the rollout of AGI. Wes Roth covers the latest happenings in the world of OpenAI, Google, Anthropic, NVIDIA and Open Source AI.


drifting thorn May 22, 2025, 4:32 PM

#

Anthropic has not been a general leader

elder rapids May 22, 2025, 4:32 PM

#

get to the point anthropic

#

smh

unborn ocean May 22, 2025, 4:32 PM

#

music on google IO was way better

#

disappointed

elder rapids May 22, 2025, 4:32 PM

#

unborn ocean music on google IO was way better

unironically ye

#

ts had anime music

#

😭

misty vault May 22, 2025, 4:32 PM

#

me buying the 100$ per month cwuade subxcription before they increase it to 300$ per month

elder rapids May 22, 2025, 4:33 PM

#

don't disappoint me dario

drifting thorn May 22, 2025, 4:33 PM

#

I’m gonna sleep

elder rapids May 22, 2025, 4:33 PM

#

you'll regret it if you do

drifting thorn May 22, 2025, 4:34 PM

#

From UTC +8

elder rapids May 22, 2025, 4:34 PM

#

Claude 4 opus and sonnet

unborn ocean May 22, 2025, 4:34 PM

#

yes

elder rapids May 22, 2025, 4:34 PM

#

nice

unborn ocean May 22, 2025, 4:34 PM

#

let's go

elder rapids May 22, 2025, 4:34 PM

#

released

#

alright now be quiet dawg

#

you got to the point

#

😭 🙏

tall summit May 22, 2025, 4:34 PM

#

he said it

balmy mist May 22, 2025, 4:34 PM

#

yes!!!!

elder rapids May 22, 2025, 4:34 PM

#

keynote over

#

pack it up guys

tall summit May 22, 2025, 4:34 PM

#

yep bye

unborn ocean May 22, 2025, 4:35 PM

#

man dario be looking like a mad scientist though

elder rapids May 22, 2025, 4:35 PM

#

unborn ocean man dario be looking like a mad scientist though

ong

echo aurora May 22, 2025, 4:35 PM

#

going to be streaming this in #1340554757827461215 in a min

torn mantle May 22, 2025, 4:35 PM

#

So opus 4 only good at coding?

elder rapids May 22, 2025, 4:35 PM

#

torn mantle So opus 4 only good at coding?

wondering how good it's going to be

torn mantle May 22, 2025, 4:35 PM

#

Advanced reasoning at coding

#

They said

drifting thorn May 22, 2025, 4:36 PM

#

When they don’t show the numbers it means the model may actually suck

balmy mist May 22, 2025, 4:36 PM

#

imma buy claude asap lol

unborn ocean May 22, 2025, 4:36 PM

#

same cost

cedar tide May 22, 2025, 4:36 PM

#

https://x.com/AnthropicAI/status/1925591505332576377?t=GUeiKurZdHB_hlxecNEemw&s=19

Anthropic (@AnthropicAI)

Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.

elder rapids May 22, 2025, 4:36 PM

#

"worlds best AI coding assistant"

#

prove it lil bro

keen ferry May 22, 2025, 4:37 PM

#

Screenshot_2025-05-22-19-36-29-663_tw.nekomimi.nekogram.jpg

Screenshot_2025-05-22-19-35-08-422_tw.nekomimi.nekogram.jpg

misty vault May 22, 2025, 4:37 PM

#

elder rapids May 22, 2025, 4:37 PM

#

tbh

#

it's not

#

but Claude in practice is always better

cedar tide May 22, 2025, 4:39 PM

#

Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.

elder rapids May 22, 2025, 4:39 PM

#

it's not the highest in any of the benchmarks pass@1

drifting thorn May 22, 2025, 4:39 PM

#

cedar tide https://x.com/AnthropicAI/status/1925591505332576377?t=GUeiKurZdHB_hlxecNEemw&s=...

Bruh sonnet performs better in most tasks than opus

elder rapids May 22, 2025, 4:39 PM

#

besides SWE

balmy mist May 22, 2025, 4:39 PM

#

yeah vibes matter

elder rapids May 22, 2025, 4:39 PM

#

ye

unborn ocean May 22, 2025, 4:39 PM

#

regressed performance on GPQA (without parallel compute) kinda points me towards smaller experts / worse reasoning (very unrealistic) (guesstimate though)

balmy mist May 22, 2025, 4:39 PM

#

benchmarks kinda pointless now

unborn ocean May 22, 2025, 4:39 PM

#

which might be why they are serving it so quickly

elder rapids May 22, 2025, 4:40 PM

#

misty vault

"Kitler"

balmy mist May 22, 2025, 4:40 PM

#

someone try opus and let us know

elder rapids May 22, 2025, 4:40 PM

#

balmy mist someone try opus and let us know

ye

small haven May 22, 2025, 4:40 PM

#

cedar tide https://x.com/AnthropicAI/status/1925591505332576377?t=GUeiKurZdHB_hlxecNEemw&s=...

ok so that’s why opus wasnt released lol

keen ferry May 22, 2025, 4:40 PM

#

cedar tide Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.

opus 4 costs less than o3 thats awesome

cedar tide May 22, 2025, 4:41 PM

#

What is parallel compute?

small haven May 22, 2025, 4:41 PM

#

codex is at 75% on swe-bench, not sota

drifting thorn May 22, 2025, 4:41 PM

#

cedar tide What is parallel compute?

taking 64 samples and choosing the best one

#

Idk if it’s actually 64
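
The "parallel compute" described here is commonly called best-of-N sampling: draw several candidate answers and keep the one a scorer ranks highest. A minimal Python sketch, assuming hypothetical `toy_generate` and `toy_score` stand-ins for a model call and a grader. Neither reflects how Anthropic actually samples or scores, and `n=64` is only the figure guessed in the chat.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int = 64,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates and return (best candidate, its score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)  # seeded so runs are repeatable
    candidates = [generate(rng) for _ in range(n)]
    scored = [(score(candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(rng: random.Random) -> str:
    """Hypothetical 'model': guesses an integer between 0 and 100."""
    return str(rng.randint(0, 100))


def toy_score(answer: str) -> float:
    """Hypothetical 'grader': closer to 42 scores higher (max is 0)."""
    return -abs(int(answer) - 42)


if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_score, n=1))
    print(best_of_n(toy_generate, toy_score, n=64))
```

With the same seed, the first candidate is identical for any `n`, so the best score at `n=64` can never be lower than at `n=1`. That is the whole appeal of the technique, and also why a benchmark number reported with parallel compute is not directly comparable to a single-sample (pass@1) number.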

keen ferry May 22, 2025, 4:42 PM

#

is there a thinking mode for opus 4 or sonnet 4

drifting thorn May 22, 2025, 4:42 PM

#

Is Opus 4 a thinking model?

civic flame May 22, 2025, 4:43 PM

#

drifting thorn Is Opus 4 a thinking model?

yes

drifting thorn May 22, 2025, 4:43 PM

#

It’s like

#

A little bit under my expectations

misty vault May 22, 2025, 4:43 PM

#

NOOOOOOOOOO

#

I didn't expect that from anthropic though

#

they dont have the computing power do they

drifting thorn May 22, 2025, 4:44 PM

#

I was expecting a 2.5 Pro type of performance gain

drifting thorn May 22, 2025, 4:44 PM

#

misty vault I didn't expect that from anthropic though

I expect that from Google

sweet tinsel May 22, 2025, 4:44 PM

#

Is the Claude Deep Research now with Opus or Sonnet, or can you choose?

misty vault May 22, 2025, 4:44 PM

#

drifting thorn I expect that from Google

yes but google doesnt do that

#

unfortunately

#

they are going for the 4o style

#

small, restarted models trained on only specific stem topics

#

giving the illusion its smart

misty vault May 22, 2025, 4:45 PM

#

But claude prob going same direction because they cant train such big model right
idk

#

bro they just drew a random line on that hours of work graph

keen fulcrum May 22, 2025, 4:46 PM

#

So what to choose?

o3 or Claude 4 Opus?

balmy mist May 22, 2025, 4:47 PM

#

this is crazy

balmy mist May 22, 2025, 4:47 PM

#

keen fulcrum So what to choose? o3 or Claude 4 Opus?

claude 4 easily

keen fulcrum May 22, 2025, 4:47 PM

#

Now I need gemini 3..
or 2.5 ultra

misty vault May 22, 2025, 4:47 PM

#

I hope they fix the issues 2.5 pro has

#

then ill accept google

sage raptor May 22, 2025, 4:48 PM

#

keen fulcrum So what to choose? o3 or Claude 4 Opus?

for coding claude

elder rapids May 22, 2025, 4:48 PM

#

not much high hopes for it being any good for anything outside of coding

torn mantle May 22, 2025, 4:48 PM

#

elder rapids May 22, 2025, 4:48 PM

#

but this is great since it aligns

calm sequoia May 22, 2025, 4:48 PM

#

torn mantle

They already compared it to codex, while others compare themselves to 6 month old models. SOTA

cedar tide May 22, 2025, 4:49 PM

#

What's the context window?

#

200k

#

I found

small haven May 22, 2025, 4:51 PM

#

Actually ?

cedar tide May 22, 2025, 4:51 PM

#

Screenshot_2025-05-22-18-51-21-025_com.android.chrome.jpg

#

Screenshot_2025-05-22-18-51-31-182_com.android.chrome.jpg

small haven May 22, 2025, 4:51 PM

#

Still good

#

Is it ready in claude code

unborn ocean May 22, 2025, 4:53 PM

#

small haven Is it ready in claude code

think so

#

they said that it is on all services now

#

(besides serving the model on 3p, which will take some time)

small haven May 22, 2025, 4:53 PM

#

Oh shxt so it is better than codex cool

unborn ocean May 22, 2025, 4:53 PM

#

prob

#

codex did not really have a large edge

torn mantle May 22, 2025, 4:54 PM

#

Opus looks good ngl

elder rapids May 22, 2025, 4:54 PM

#

torn mantle Opus looks good ngl

ye

small haven May 22, 2025, 4:54 PM

#

they need a ui like codex and its gg

unborn ocean May 22, 2025, 4:54 PM

#

third party

#

like google

#

or aws

small haven May 22, 2025, 4:55 PM

#

i think codex is capped in tokens prolly 32

narrow elbow May 22, 2025, 4:56 PM

#

If it's less hallucinations than 3.7, I'll be satisfied

small haven May 22, 2025, 4:56 PM

#

now it is

cedar tide May 22, 2025, 4:57 PM

#

Swe and terminal bench scores are without thinking

unborn ocean May 22, 2025, 4:57 PM

#

prob because the reasoning still sucks

balmy mist May 22, 2025, 4:58 PM

#

how is opus?

small haven May 22, 2025, 4:58 PM

#

anyone tried it yet?

misty vault May 22, 2025, 4:59 PM

#

how

elder rapids May 22, 2025, 4:59 PM

#

damn

#

sonnet 4 is stupid

misty vault May 22, 2025, 5:00 PM

#

do u have claude max

unborn ocean May 22, 2025, 5:00 PM

#

small haven Is it ready in claude code

claude 4 on code confirmed ✅

elder rapids May 22, 2025, 5:00 PM

#

it's over

misty vault May 22, 2025, 5:00 PM

#

my dog

elder rapids May 22, 2025, 5:00 PM

#

talking to ts is BAD

unborn ocean May 22, 2025, 5:00 PM

#

and also as vsc extension

elder rapids May 22, 2025, 5:00 PM

#

3.5 sonnet still the best vibes model

#

did you try it?

torn mantle May 22, 2025, 5:01 PM

#

https://x.com/ashtom/status/1925597395192357337

Thomas Dohmke (@ashtom)

Claude Sonnet 4 is here. Now avail in all Copilot plans. And it’s the new base model for the GitHub Copilot coding agent 😎

https://t.co/PnJyeT1IOu

elder rapids May 22, 2025, 5:01 PM

#

is 4 opus available in the api

north vale May 22, 2025, 5:01 PM

#

how is 4 opus' vibes

unborn ocean May 22, 2025, 5:01 PM

#

torn mantle https://x.com/ashtom/status/1925597395192357337

let's go

#

the 10$ per month sure are worth it

small haven May 22, 2025, 5:02 PM

#

Is opus in Claude code

misty vault May 22, 2025, 5:02 PM

#

Broo nobody joined @echo aurora in the voice channel to watch the livestream with him

#

So mean guys

#

They left now due to loneliness

small haven May 22, 2025, 5:02 PM

#

Lets go

misty vault May 22, 2025, 5:02 PM

#

poor them

echo aurora May 22, 2025, 5:03 PM

#

misty vault Broo nobody joined @echo aurora in the voice channel to watch the lives...

😭

#

I'll be back, had to run to a meeting

leaden meteor May 22, 2025, 5:04 PM

#

When can we compare these new Claude models with o3 and 2.5 on arena?

torn mantle May 22, 2025, 5:05 PM

#

Clap clap

small haven May 22, 2025, 5:05 PM

#

not clicking phishing links, gimme screenshots

torn mantle May 22, 2025, 5:05 PM

#

small haven not clicking phishing links, gimme screenshots

agree

misty vault May 22, 2025, 5:05 PM

#

would it be cheaper to pay for claude 4 opus api rather than 100$ per month claude code and just use it with other vs code ai plugins

#

Im already happy with copy paste

#

🗿
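
Whether the API comes out cheaper than a flat $100 per month depends entirely on token volume. A quick Python sketch of the arithmetic, using the per-million-token prices quoted earlier in this chat (Opus 4 at $15/$75, Sonnet 4 at $3/$15, input/output). The model keys, function names, and monthly token volumes are made-up assumptions for illustration, and prompt caching or batch discounts are ignored.

```python
# USD per million tokens as (input, output), as quoted in the chat.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "opus-4": (15.0, 75.0),
    "sonnet-4": (3.0, 15.0),
}


def api_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cheaper_on_api(
    model: str,
    input_tokens: int,
    output_tokens: int,
    subscription_usd: float = 100.0,
) -> bool:
    """True if the API bill stays under the flat subscription price."""
    return api_cost_usd(model, input_tokens, output_tokens) < subscription_usd


if __name__ == "__main__":
    # Hypothetical light month: 2M input tokens, 0.5M output tokens.
    print(api_cost_usd("opus-4", 2_000_000, 500_000))    # 67.5
    print(api_cost_usd("sonnet-4", 2_000_000, 500_000))  # 13.5
    # Hypothetical heavy month: 10M input tokens, 2M output tokens.
    print(api_cost_usd("opus-4", 10_000_000, 2_000_000))  # 300.0
    print(cheaper_on_api("opus-4", 10_000_000, 2_000_000))  # False
```

Under these assumed volumes, light Opus use stays below $100 while heavy agentic use blows past it, since output tokens cost five times as much as input tokens.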

torn mantle May 22, 2025, 5:07 PM

#

Meh

#

Not agi

#

Yes you can

#

And you've done it many times

misty vault May 22, 2025, 5:07 PM

#

craig is paid actor by anthropic

torn mantle May 22, 2025, 5:07 PM

#

Lying all the time

narrow elbow May 22, 2025, 5:08 PM

#

TV remote control?

#

🤣

balmy mist May 22, 2025, 5:08 PM

#

are the claudes on webdev?

torn mantle May 22, 2025, 5:08 PM

#

Nightwhisper made a better ui

balmy mist May 22, 2025, 5:08 PM

#

i dont even think nw is real anymore, did that even happen?

elder rapids May 22, 2025, 5:08 PM

#

torn mantle Nightwhisper made a better ui

tbh there's nothing that compares to NW

balmy mist May 22, 2025, 5:08 PM

#

it might have been a dream

elder rapids May 22, 2025, 5:09 PM

#

it was so beautiful

balmy mist May 22, 2025, 5:09 PM

#

what about opus?

elder rapids May 22, 2025, 5:09 PM

#

I'd be surprised

#

nw was good

willow grail May 22, 2025, 5:10 PM

#

30yo, single, otter body, 183cm, 80kg.
looking for sugar daddy cause opus 4.

elder rapids May 22, 2025, 5:10 PM

#

elite vibes, was smart asf and SOTA at code

unborn ocean May 22, 2025, 5:10 PM

#

"reliable access to claude" as if 🤣

#

they have like close to the worst api

cedar tide May 22, 2025, 5:11 PM

#

Worst ai i've ever seen

Screenshot_2025-05-22-19-09-39-249_co.median.android.kpbxbd-edit.jpg

willow grail May 22, 2025, 5:11 PM

#

i dont care about your net worth, daddy

small haven May 22, 2025, 5:12 PM

#

cedar tide Worst ai i've ever seen

so first vibes meh?

#

gonna try deep research with claude opus

small haven May 22, 2025, 5:13 PM

#

can someone send me a prompt for deep research

torn mantle May 22, 2025, 5:13 PM

#

small haven gonna try deep research with claude opus

Ok

small haven May 22, 2025, 5:13 PM

#

torn mantle Ok

you

cedar tide May 22, 2025, 5:13 PM

#

What's the problem now?

sweet tinsel May 22, 2025, 5:14 PM

#

Perplexity is good when it's free. I got Perplexity Pro for free by Telekom (T-Mobile Germany) and they have Claude 4.0 Sonnet Thinking already, just hoping for Opus now like with Opus 3, which was available in Perplexity.

willow grail May 22, 2025, 5:14 PM

#

why do i need a net worth when im looking for sugar daddy?

sweet tinsel May 22, 2025, 5:15 PM

#

Like nearly unlimited use of o4-mini, R1, Gemini 2.5 Pro and Grok 3. It even behaves like normal Chatbots when the Websearch has been turned off.

misty vault May 22, 2025, 5:15 PM

#

small haven can someone send me a prompt for deep research

Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.

sweet tinsel May 22, 2025, 5:16 PM

#

small haven can someone send me a prompt for deep research

Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

#

This would be very interesting for me.

misty vault May 22, 2025, 5:16 PM

#

true

elder rapids May 22, 2025, 5:17 PM

#

damn

#

Claude 4 opus isn't very good at regular tasks

#

fukkk

torn mantle May 22, 2025, 5:18 PM

#

small haven you

Me

willow grail May 22, 2025, 5:18 PM

#

Asura is a bad free-for-all server on the isle

torn mantle May 22, 2025, 5:19 PM

#

Huh

elder rapids May 22, 2025, 5:19 PM

#

damn Im actually sad

#

it's good asf at coding tho

sweet tinsel May 22, 2025, 5:19 PM

#

sweet tinsel Please write a comprehensive and in depth research report on the mass expulsion ...

This is my go to prompt to Test Deep Researches of ChatGPT, Agents and such. Already have tried Flowwith AI, Grok DR, Grok DDR, Manus, ChatGPT DR (o3), Gemini DR (2.0 Flash), Perplexity DR, Genspark Super Agent and Ithy. Claude DR and Gemini DR are still missing as I haven't gotten a sub there. If you want, I can send my DR test doc with this prompt in it.

misty vault May 22, 2025, 5:19 PM

#

elder rapids Claude 4 opus isn't very good at regular tasks

😔

torn mantle May 22, 2025, 5:19 PM

#

https://x.com/janleike/status/1925595727906185355

Jan Leike (@janleike)

So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy

#

They made it dumber

willow grail May 22, 2025, 5:20 PM

#

torn mantle Huh

https://i.imgur.com/qikrWUu.png

Imgur

small haven May 22, 2025, 5:20 PM

#

misty vault Based on a critical synthesis of recent, high-quality human clinical trials and ...

elder rapids May 22, 2025, 5:20 PM

#

misty vault 😔

much better vibes tho, but it's still not 3.5 sonnet lvl

torn mantle May 22, 2025, 5:20 PM

#

willow grail https://i.imgur.com/qikrWUu.png

Wth is that

willow grail May 22, 2025, 5:20 PM

#

torn mantle Wth is that

server for isle

torn mantle May 22, 2025, 5:20 PM

#

willow grail server for isle

Oh

willow grail May 22, 2025, 5:21 PM

#

ur allowed to kill on sight any dinosaurs who is nesting... sucks

torn mantle May 22, 2025, 5:21 PM

#

Bet its a good server

#

Can tell just from the name

elder rapids May 22, 2025, 5:21 PM

#

sweet tinsel This is my go to prompt to Test Deep Researches of ChatGPT, Agents and such. Alr...

I'll do Gemini DR for you

#

if you'd like

willow grail May 22, 2025, 5:21 PM

#

torn mantle Bet its a good server

no its sucks. u cant even nest without getting kos'd

sweet tinsel May 22, 2025, 5:21 PM

#

elder rapids I'll do Gemini DR for you

Thanks! I would appreciate that.

torn mantle May 22, 2025, 5:22 PM

#

Can you re-run again that prompt @small haven

small haven May 22, 2025, 5:22 PM

#

why

elder rapids May 22, 2025, 5:22 PM

#

sweet tinsel Thanks! I would appreciate that.

alr it's started

torn mantle May 22, 2025, 5:22 PM

#

small haven why

To compare the old research with this one

elder rapids May 22, 2025, 5:22 PM

#

I'll let you know when it's done

small haven May 22, 2025, 5:22 PM

#

oh yea

elder rapids May 22, 2025, 5:23 PM

#

goddamn

#

they're changing the setting

#

one minute they're standing up like normal

#

next it's an interview

misty vault May 22, 2025, 5:23 PM

#

bro so much yap too

#

this guy is struggling to talk😔

unborn ocean May 22, 2025, 5:25 PM

#

tru

#

he is too shy for the stage or too nerdy

small haven May 22, 2025, 5:26 PM

#

this is gonna be fun

misty vault May 22, 2025, 5:26 PM

#

small haven this is gonna be fun

use a tissue, dont cover the lmarena discord chat

sage raptor May 22, 2025, 5:29 PM

#

#

idk what is this

misty vault May 22, 2025, 5:30 PM

#

real

sage raptor May 22, 2025, 5:30 PM

#

open ai 63

keen ferry May 22, 2025, 5:30 PM

#

whoa gonna be #1 in the web arena gemini or opus?

misty vault May 22, 2025, 5:30 PM

#

gork 3.5 misinformation trend 2.0 starting today

small haven May 22, 2025, 5:30 PM

#

torn mantle Can you re-run again that prompt @small haven

https://claude.ai/public/artifacts/f0249303-c59c-43d8-83c3-7c29799f7573

#

5mins

unborn ocean May 22, 2025, 5:30 PM

#

sage raptor

ai gen (nvm did not read context)

small haven May 22, 2025, 5:31 PM

#

sage raptor

2.0 flash image preview 😭

misty vault May 22, 2025, 5:34 PM

#

Did the guy who I was going to ping leave the server because I called him a drooling alien

small haven May 22, 2025, 5:35 PM

#

claude code is nice, no environment setup needed like codex

misty vault May 22, 2025, 5:35 PM

#

idk weird name
nvm he didn't

torn mantle May 22, 2025, 5:35 PM

#

small haven https://claude.ai/public/artifacts/f0249303-c59c-43d8-83c3-7c29799f7573

what model did you use?

tall summit May 22, 2025, 5:35 PM

#

sage raptor

did you put it through chatgpt image gen

small haven May 22, 2025, 5:35 PM

#

but i miss wanting to spam tasks on the go...

small haven May 22, 2025, 5:35 PM

#

torn mantle what model did you use?

opus 4

torn mantle May 22, 2025, 5:36 PM

#

small haven opus 4

def better than the previous one

#

much much better

misty vault May 22, 2025, 5:36 PM

#

compare claude 4 opus to oai deep research

torn mantle May 22, 2025, 5:36 PM

#

but

#

why does the report look short

misty vault May 22, 2025, 5:36 PM

#

fr

torn mantle May 22, 2025, 5:36 PM

#

compared to the demo from the video?

unborn ocean May 22, 2025, 5:37 PM

#

u know dario talks 100% like the nerdy profs at my uni that have trouble upholding a normal social life

#

and like to yap a lot

misty vault May 22, 2025, 5:38 PM

#

That's because he is a drooling alien

tall summit May 22, 2025, 5:38 PM

#

elder rapids Claude 4 opus isn't very good at regular tasks

which is sad

narrow elbow May 22, 2025, 5:39 PM

#

misty vault That's because he is a drooling alien

everybody is like drooling alien to you

small haven May 22, 2025, 5:39 PM

#

torn mantle def better than the previous one

troll or actually lol

leaden meteor May 22, 2025, 5:39 PM

#

I can't find it yet on side by side comparison? It's only on blind tests now...?

torn mantle May 22, 2025, 5:39 PM

#

small haven troll or actually lol

just a bit better

#

but nothing great

harsh flume May 22, 2025, 5:39 PM

#

Have any of the anon models this month hinted at being Claude or nah?

torn mantle May 22, 2025, 5:39 PM

#

it extracted certain parameters this time

harsh flume May 22, 2025, 5:39 PM

#

I haven't played with the arena the past two weeks

small haven May 22, 2025, 5:40 PM

#

torn mantle just a bit better

thats what i thought as well

#

its meh ish, chatgpt dr better

misty vault May 22, 2025, 5:40 PM

#

narrow elbow everybody is like drooling alien to you

only @alpine coral and dario is

small haven May 22, 2025, 5:40 PM

#

but i did say "assume" for the feedback @torn mantle

misty vault May 22, 2025, 5:41 PM

#

@narrow elbow Did u just drool on my comment

small haven May 22, 2025, 5:41 PM

#

so in perpetuity, ur saying codex > claude code

elder rapids May 22, 2025, 5:41 PM

#

tall summit which is sad

ye I've been using it, testing it for a bit, and as far as I can tell, its particularly worse at language

#

it obfuscates technical concepts and I have to introduce language to help it communicate

tall summit May 22, 2025, 5:42 PM

#

elder rapids it obfuscates technical concepts and I have to introduce language to help it com...

umm what

elder rapids May 22, 2025, 5:43 PM

#

tall summit umm what

ye it didn't know what a category error was or how to get it across

small haven May 22, 2025, 5:43 PM

#

i mean right now opus on claude code is pretty damn slow

elder rapids May 22, 2025, 5:43 PM

#

but it knew there was a category mistake

small haven May 22, 2025, 5:43 PM

#

codex is speedier ngl

unborn ocean May 22, 2025, 5:43 PM

#

Nah the people who joined in person get 3 months free claude max
we have to pay 300$💀

balmy mist May 22, 2025, 5:44 PM

#

unborn ocean Nah the people who joined in person get 3 months free claude max we have to pay ...

i was there

tall summit May 22, 2025, 5:44 PM

#

elder rapids ye it didn't know what a category error was or how to get it across

oh huh how odd

small haven May 22, 2025, 5:44 PM

#

opus 4 is slow but beefy and smarter than codex

elder rapids May 22, 2025, 5:44 PM

#

tall summit oh huh how odd

you can tell its smart, but it doesn't have foresight imo

inner hare May 22, 2025, 5:44 PM

#

please add claude 4

torn mantle May 22, 2025, 5:45 PM

#

so oai still top 1?

elder rapids May 22, 2025, 5:45 PM

#

wonder how this affects coding

misty vault May 22, 2025, 5:45 PM

#

inner hare please add claude 4

you're a drooling alien

torn mantle May 22, 2025, 5:45 PM

#

@small haven whats your take?

inner hare May 22, 2025, 5:45 PM

#

misty vault you're a drooling alien

Why?

balmy mist May 22, 2025, 5:45 PM

#

@keen beacon whats your take

torn mantle May 22, 2025, 5:45 PM

#

@small haven

1. o3
2. opus
3. gemini 2.5
?

small haven May 22, 2025, 5:45 PM

#

lmao

small haven May 22, 2025, 5:45 PM

#

torn mantle @small haven 1. o3 2. opus 3. gemini 2.5 ?

for dr? looks like it

inner hare May 22, 2025, 5:45 PM

#

misty vault you're a drooling alien

I love you

torn mantle May 22, 2025, 5:46 PM

#

small haven for dr? looks like it

for general use

#

overall

small haven May 22, 2025, 5:46 PM

#

torn mantle for general use

yes o3 is still ahead, but code wise (speed aside) opus 4 is ahead

torn mantle May 22, 2025, 5:46 PM

#

https://x.com/testingcatalog/status/1925609116430385578

TestingCatalog News 🗞 (@testingcatalog)

No Claude 4 on Windsurf today 👀

#

lmao

#

google should just hit them with nightwhisper

elder rapids May 22, 2025, 5:47 PM

#

deadass

torn mantle May 22, 2025, 5:47 PM

#

and lets see how they act

elder rapids May 22, 2025, 5:47 PM

#

@balmy mist flowith has access to Claude 4's

#

speaks nothing like opus

#

😭

small haven May 22, 2025, 5:48 PM

#

opus 4 spent 5 mins reading files 😭

torn mantle May 22, 2025, 5:48 PM

#

fastest ever

small haven May 22, 2025, 5:49 PM

#

brother eww

misty vault May 22, 2025, 5:49 PM

#

.

#

claude 4 opus soon

small haven May 22, 2025, 5:49 PM

#

o3 or codex we talking? cus codex is way more snappier

#

ya ig..

#

gonna be using opus 4 for actual bugs that codex cant solve, its just too long

torn mantle May 22, 2025, 5:51 PM

#

wait

#

@small haven maybe we didnt prompt the dr well

small haven May 22, 2025, 5:51 PM

#

ok send the new one

torn mantle May 22, 2025, 5:51 PM

#

this is from the demo

small haven May 22, 2025, 5:51 PM

#

i did say "assume" when it asked me for clarifications

torn mantle May 22, 2025, 5:51 PM

#

shes asking a literature review

small haven May 22, 2025, 5:52 PM

#

ok

torn mantle May 22, 2025, 5:52 PM

#

wait

#

let us think of another prompt

misty vault May 22, 2025, 5:55 PM

#

yes please

quiet folio May 22, 2025, 5:56 PM

#

torn mantle May 22, 2025, 5:56 PM

#

@small haven

Please conduct a comprehensive literature review of academic research that addresses and synthesizes the current understanding of the following pressing and practical questions related to the Theory of Constraints (TOC) and its Drum-Buffer-Rope (DBR) scheduling mechanism. These questions aim to address current gaps and needs in both theory and practical application:

Dynamic DBR in Volatile Environments:
Question: To what extent can advanced analytical techniques (e.g., machine learning, AI, real-time simulation) enhance the dynamic management of buffers (size, location, priority) and the proactive identification of shifting constraints in DBR systems operating under high demand volatility, supply chain disruptions, and significant process variability?
Focus: Practical performance improvements (throughput, lead time, on-time delivery, resilience) in complex manufacturing, service, or project environments. What are the limitations of current DBR models in such contexts, and how can these be overcome?

Adapting and Validating DBR for Complex Service Operations and Knowledge Work:
Question: How can DBR principles be effectively adapted, validated, and implemented to optimize workflow, reduce lead times, and manage bottlenecks in complex service delivery systems (e.g., healthcare patient flow, software development pipelines, public service delivery, R&D processes) characterized by high variability, non-physical work items, and intangible constraints?
Focus: Developing and testing novel DBR configurations or hybrid models suited for the unique challenges of service and knowledge work environments, including the definition of "drum," "buffer," and "rope" in these contexts.

small haven May 22, 2025, 5:56 PM

#

wait i may or may not get opus 4 ? @deep adder

torn mantle May 22, 2025, 5:56 PM

#

run it

wintry tinsel May 22, 2025, 5:56 PM

#

When does it release today?

small haven May 22, 2025, 5:58 PM

#

hallucinations.....

torn mantle May 22, 2025, 5:58 PM

#

lol

#

nice

#

good start

#

yes way

small haven May 22, 2025, 5:58 PM

#

wen opus 5

#

#

dario wtf

#

cool thanks

#

it also got confused with the months smh

#

so it still overwrites tests when it can't find a solution, great..

#

opus 4 is bad guys

inner hare May 22, 2025, 6:01 PM

#

How to use Claude 4 Opus?

#

hi

torn mantle May 22, 2025, 6:02 PM

#

small haven opus 4 is bad guys

lets go!!

small haven May 22, 2025, 6:03 PM

#

to be frank, this is a hard task codex couldn't do it too, so.. opus 4 > codex

elder rapids May 22, 2025, 6:04 PM

#

@sweet tinsel

📎 NGJ2Kg9r.rtf

#

it's done

sweet tinsel May 22, 2025, 6:04 PM

#

Thanks!

torn mantle May 22, 2025, 6:04 PM

#

This is a good unbiased overview guys

#

https://x.com/danshipper/status/1925592015305416941

Dan Shipper 📧 (@danshipper)

🚨 Claude 4 Opus is officially out and IT'S GREAT

We've been playing with it internally @every for a few days on a variety of tasks from writing, to editing, to coding.

It's pretty clear: Anthropic cooked with this one. It does some things that no model I’ve ever tried has

#

He's basically saying: use opus 4 only for coding

#

As it's still < o3 in many areas

small haven May 22, 2025, 6:06 PM

#

codemaxx

#

after only editing tests 😭 sorry core impl edits hahah

torn mantle May 22, 2025, 6:08 PM

#

How does it compare to codex?

elder rapids May 22, 2025, 6:09 PM

#

sweet tinsel Thanks!

hows the answer

sweet tinsel May 22, 2025, 6:09 PM

#

elder rapids hows the answer

A bit busy currently, i will read it a bit later.

small haven May 22, 2025, 6:10 PM

#

torn mantle How does it compare to codex?

codex doesn't overwrite/hallucinate and actually tells u that it can't solve it, much prefer that than lies

#

but im guessing if ur coding frontend obviously opus 4 is undebatable

misty vault May 22, 2025, 6:13 PM

#

small haven codex doesn't overwrite/hallucinate and actually tells u that it can't solve it,...

wtf

#

an ai that admits it can't do something

#

I never experienced that

small haven May 22, 2025, 6:15 PM

#

https://claude.ai/public/artifacts/b2c49de8-e94b-4c76-afb7-b10e147d7e73

#

@torn mantle

keen fulcrum May 22, 2025, 6:17 PM

#

I am impressed by claude 4
I think I will use it over 2.5

cedar tide May 22, 2025, 6:19 PM

#

The best non-thinking models at SWE-bench (the SWE scores are without thinking)

wintry tinsel May 22, 2025, 6:19 PM

#

cedar tide The best non-thinking models at SWE-bench (the SWE scores are without thinking)

That’s a catastrophic landslide for Anthropic lol

misty vault May 22, 2025, 6:19 PM

#

Can someone compare sonnet 4 and opus 4

#

Idk if it is worth buying claude max

wintry tinsel May 22, 2025, 6:20 PM

#

Nah just buy some API calls unless you use it constantly

cedar tide May 22, 2025, 6:20 PM

#

wintry tinsel That’s a catastrophic landslide for Anthropic lol

Why ?

wintry tinsel May 22, 2025, 6:20 PM

#

Or buy a sim theory subscription

misty vault May 22, 2025, 6:20 PM

#

wintry tinsel Nah just buy some API calls unless you use it constantly

I was thinking of this

misty vault May 22, 2025, 6:20 PM

#

wintry tinsel Or buy a sim theory subscription

tf is this

wintry tinsel May 22, 2025, 6:20 PM

#

1 million tokens per month for 20$

#

It’s like open router mixed with Poe

wintry tinsel May 22, 2025, 6:21 PM

#

cedar tide Why ?

The underlying architecture is more important than anything you enhance it with

wintry tinsel May 22, 2025, 6:21 PM

#

wintry tinsel 1 million tokens per month for 20$

Actually it’s a little more than a million

misty vault May 22, 2025, 6:22 PM

#

I think I actually use that amount if not more in a month

wintry tinsel May 22, 2025, 6:23 PM

#

It comes with the added benefit of unlimited use of all open source models and Gemini 2.5 pro

#

Even if you run out of tokens

cedar tide May 22, 2025, 6:23 PM

#

cedar tide The best non-thinking models at SWE-bench (the SWE scores are without thinking)

we will see if we have the scores of gemini 2.5 pro non think, and grok 3.5

wintry tinsel May 22, 2025, 6:24 PM

#

wintry tinsel Even if you run out of tokens

Sim theory does which is why I use it, I get some good mileage with Claude and then I can use deep seek and Gemini unlimited

small haven May 22, 2025, 6:25 PM

#

claude 4 opus sucks, holy moly, aight im done playing with this sorry dario

#

im not designing ux unfortunately lol

#

sure it will hit 1500 on webdev

#

o3 is still goat, dont be biased

wintry tinsel May 22, 2025, 6:28 PM

#

small haven claude 4 opus sucks, holy moly, aight im done playing with this sorry dario

What sucks about it?

misty vault May 22, 2025, 6:28 PM

#

wintry tinsel Sim theory does which is why I use it, I get some good mileage with Claude and t...

does it have gpt 4

small haven May 22, 2025, 6:28 PM

#

wintry tinsel What sucks about it?

read above

wintry tinsel May 22, 2025, 6:28 PM

#

It has everything

misty vault May 22, 2025, 6:28 PM

#

gpt-4-0314!?!?!??!

#

omaygot

small haven May 22, 2025, 6:29 PM

#

it has the hallucinations

misty vault May 22, 2025, 6:29 PM

#

small haven it has the hallucinations

ayooo

small haven May 22, 2025, 6:31 PM

#

misty vault ayooo

sorry i repent

misty vault May 22, 2025, 6:32 PM

#

😊

raven void May 22, 2025, 6:32 PM

#

Google is so cooked

#

They had nothing to release

unborn ocean May 22, 2025, 6:32 PM

#

why reverse the order?

raven void May 22, 2025, 6:32 PM

#

They haven't even released 2.5 properly

sage raptor May 22, 2025, 6:33 PM

#

wth https://x.com/mattshumer_/status/1925605997004947548

Matt Shumer (@mattshumer_)

Holy FUCK.

Claude 4 Opus just ONE-SHOTTED a WORKING browser agent — API and frontend.

One prompt.

I've never seen anything like this. Genuinely can't believe it.

sweet tinsel May 22, 2025, 6:34 PM

#

elder rapids @sweet tinsel

It's pretty good but i'm quite perplexed by the missing details and the pretty irregular formatting. Could you maybe send the Gemini share link for it, because it could be an error by OpenOffice.

misty vault May 22, 2025, 6:34 PM

#

unborn ocean why reverse the order?

psychological trick to save costs

misty vault May 22, 2025, 6:34 PM

#

sage raptor wth https://x.com/mattshumer_/status/1925605997004947548

no swearing please

sweet tinsel May 22, 2025, 6:35 PM

#

sweet tinsel Please write a comprehensive and in depth research report on the mass expulsion ...

@small haven Could you do this Prompt for Claude DR?

ocean vortex May 22, 2025, 6:36 PM

#

raven void Google is so cooked

they are not. This is the singular benchmark claude always does the best at. And you need to ignore the lighter shade as that's equivalent of deep think

#

Kinda insane that they did this parallel processing that is 100% internal and they didn't even release it

#

but still reported benchmark scores for it

#

💀

#

like what is the point, other than mislead people who don't read footnotes...

lone summit May 22, 2025, 6:38 PM

#

https://web.lmarena.ai/leaderboard

raven void May 22, 2025, 6:38 PM

#

Well fair but Claude 4 Opus is probably already better than the Gemini 2.5 Ultra Google didn't release

lone summit May 22, 2025, 6:38 PM

#

when is claude 4 going to be added

ocean vortex May 22, 2025, 6:38 PM

#

in the graph you pasted at least it's explicit, but here this is way less obvious:

misty vault May 22, 2025, 6:39 PM

#

My friend created account 6 hours ago with no number

ocean vortex May 22, 2025, 6:39 PM

#

scores after "/" are basically useless

misty vault May 22, 2025, 6:39 PM

#

Did they remove the requirement?

#

Or sussy countries dont need verify??

#

vpn phone verify bypass?????

#

a free browser extension works

#

then u can uninstall it

ocean vortex May 22, 2025, 6:40 PM

#

the one in vivaldi browser does, but you can't select the exact country you want

#

I did buy their premium too, but tbh it's way less stable and slower than Avast VPN. Much cheaper though

misty vault May 22, 2025, 6:43 PM

#

Is the canvas feature that claude and chatgpt uses function calling? or just visual trickery but regular conversation with codeblocks in the background and special system instruction

torn mantle May 22, 2025, 6:44 PM

#

small haven https://claude.ai/public/artifacts/b2c49de8-e94b-4c76-afb7-b10e147d7e73

Thanks

#

Imma read

elder rapids May 22, 2025, 6:46 PM

#

sweet tinsel It's pretty good but i'm quite perplexed by the missing details and the pretty i...

I can do it all over again if you'd want me to

sweet tinsel May 22, 2025, 6:47 PM

#

elder rapids I can do it all over again if you'd want me to

If you want to, you can.

elder rapids May 22, 2025, 6:47 PM

#

sweet tinsel It's pretty good but i'm quite perplexed by the missing details and the pretty i...

https://docs.google.com/document/d/1cOFaPwLN2DkchqRw0oIXwOi6p1MOeWnfJCyockiOUUs/edit?usp=drivesdk

Google Docs

NGJ2Kg9r.rtf

The Displacement of Nations: A Comprehensive Analysis of the Mass Expulsion of Ethnic Germans after World War II Introduction The end of the Second World War in Europe ushered in a period of profound geopolitical restructuring and demographic upheaval. Among the most significant and tragic of th...

torn mantle May 22, 2025, 6:47 PM

#

small haven https://claude.ai/public/artifacts/b2c49de8-e94b-4c76-afb7-b10e147d7e73

not good

sweet tinsel May 22, 2025, 6:48 PM

#

elder rapids https://docs.google.com/document/d/1cOFaPwLN2DkchqRw0oIXwOi6p1MOeWnfJCyockiOUUs/...

Still the weird formatting on some points.

#

So its not OnlyOffice.

elder rapids May 22, 2025, 6:48 PM

#

sweet tinsel Still the weird formatting on some points.

which points?

sweet tinsel May 22, 2025, 6:50 PM

#

Generally the Sub-Section Headlines and the text. It is sometimes also inconsistent with deciding whether to use bullet points or text and it just puts bullet points into raw text which i dislike.

misty vault May 22, 2025, 6:50 PM

#

maybe the existing email account is the problem

sweet tinsel May 22, 2025, 6:50 PM

#

elder rapids which points?

.

misty vault May 22, 2025, 6:51 PM

#

fr

elder rapids May 22, 2025, 6:51 PM

#

sweet tinsel Generally the Sub-Section Headlines and the text. It is sometimes also inconsist...

oh ye Ikwym

misty vault May 22, 2025, 6:51 PM

#

bing chat messages per conversation limit increased from 35 to 50 today

#

Microsoft still using it internally!?!?!?!

#

yes

small haven May 22, 2025, 6:52 PM

#

torn mantle not good

gg

elder rapids May 22, 2025, 6:53 PM

#

sweet tinsel If you want to, you can.

alr started it up again, Gemini deep research is wildly inconsistent in how it plans things out

unborn ocean May 22, 2025, 6:54 PM

#

I think it is token dependent, not sure though

misty vault May 22, 2025, 6:54 PM

#

It depends on amount of tokens

torn mantle May 22, 2025, 6:55 PM

#

Sonnet 4 without reasoning isnt even worth it

#

Like just don't bother

small haven May 22, 2025, 6:55 PM

#

im now 5x slower/unproductive using claude code from when i was using codex, thank u dario

torn mantle May 22, 2025, 6:55 PM

#

Lol

#

Not good

misty vault May 22, 2025, 6:56 PM

#

claude not agi sadboyo

calm sequoia May 22, 2025, 6:59 PM

#

raven void Google is so cooked

The difference between GPT 4.1 And O3 is the same as between O3 and Claude 4Opus 👀

sage raptor May 22, 2025, 7:07 PM

#

golden ocean May 22, 2025, 7:09 PM

#

agi

misty vault May 22, 2025, 7:10 PM

#

sage raptor

Bro learned from sydney

torn mantle May 22, 2025, 7:14 PM

#

https://x.com/menhguin/status/1925613739224846625

Minh Nhat Nguyen (@menhguin)

guys the Claude 4 system card is so delightfully deranged, like tf kinda japanese adult video shenanigans is this lol
will update as i read

#

huh

sweet tinsel May 22, 2025, 7:19 PM

#

This is how my Deep Research Test progressed so far, if you have the missing parts you could DM them to me.

#

https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

Google Docs

Deep-Research Tests

Deep-Research Tests Prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and ...

small haven May 22, 2025, 7:20 PM

#

sweet tinsel This is how my Deep Research Test progressed so far, if you have the missing par...

i forgot about the dr lol

#

https://claude.ai/public/artifacts/fbe37469-20a5-4feb-a5b5-47645289f691

sweet tinsel May 22, 2025, 7:21 PM

#

small haven https://claude.ai/public/artifacts/fbe37469-20a5-4feb-a5b5-47645289f691

Thanks!

#

Is this 4 Opus?

small haven May 22, 2025, 7:21 PM

#

yes

#

sweet tinsel May 22, 2025, 7:21 PM

#

Okay! Was it this short before too with 3.7?

small haven May 22, 2025, 7:21 PM

#

i dont know

sweet tinsel May 22, 2025, 7:22 PM

#

Looks like with the others too that Opus 4.0 produces very short Deep Researches

small haven May 22, 2025, 7:22 PM

#

yea claude dr is a gimmick

#

better off using oai

sweet tinsel May 22, 2025, 7:22 PM

#

It looked better when I read through the 3.7 ones

#

Seems like it's gotten worse

misty vault May 22, 2025, 7:25 PM

#

cwaude also benchmaxxed for coding sadboyo

torn mantle May 22, 2025, 7:26 PM

#

i only have access to the instruct model ( sonnet 4 )

#

so far

#

nothing crazy

late path May 22, 2025, 7:27 PM

#

is sonnet 4 neptune?

misty vault May 22, 2025, 7:29 PM

#

torn mantle https://x.com/menhguin/status/1925613739224846625

bro u guys never heard of bing chat or sum

#

this aint special bro

#

waiting for claude 5

keen fulcrum May 22, 2025, 7:45 PM

#

Why is Claude still publishing models with 200k context in 2025?
Isn't 1M the standard now?

#

Especially code models need large context.

misty vault May 22, 2025, 7:47 PM

#

me with gpt-4-0314 on 4k context sadboyo

small haven May 22, 2025, 7:49 PM

#

keen fulcrum Why is Claude still publishing models with 200k context in 2025? Isn't 1M the st...

i mean 2 yrs ago it was 4k, zoom out

misty vault May 22, 2025, 7:50 PM

#

crazy to think gpt-3 had 2k

coral notch May 22, 2025, 7:54 PM

#

Any idea when claude and opus will be on the lmarena

sweet tinsel May 22, 2025, 7:56 PM

#

I would guess it would land there as soon as they have more available compute.

brittle tiger May 22, 2025, 7:56 PM

#

https://x.com/EMostaque/status/1925624164527874452?t=fC5q06hyBSltN1tt7m6v4A&s=19

Lmao I wonder if this is true

Emad (@EMostaque)

Team @AnthropicAI this is completely wrong behaviour and you need to turn this off - it is a massive betrayal of trust and a slippery slope.

I would strongly recommended nobody use Claude until they reverse this.

This isn’t even prompt/thought policing, it is way worse.

#

Damn thats an anthropic safety person in the screenshot and he has deleted after pushback. Pretty dumb thing to post on launch day

echo aurora May 22, 2025, 7:59 PM

#

coral notch Any idea when claude and opus will be on the lmarena

soon! (but don't tell anyone this is a secret meowshh )

wintry tinsel May 22, 2025, 8:01 PM

#

Opus is insanely expensive, like 75$/mil tokens is a little obscene

sweet tinsel May 22, 2025, 8:01 PM

#

elder rapids alr started it up again, Gemini deep research is wildly inconsistent in how it p...

Is it done?

wintry tinsel May 22, 2025, 8:01 PM

#

Sonnet 4 is the new best practical model

keen ferry May 22, 2025, 8:01 PM

#

wintry tinsel Opus is insanely expensive, like 75$/mil tokens is a little obscene

isn't o3 more expensive

wintry tinsel May 22, 2025, 8:01 PM

#

I want to see a proper comparison between opus and sonnet, sonnet seems to be pretty neck and neck

tall summit May 22, 2025, 8:01 PM

#

brittle tiger https://x.com/EMostaque/status/1925624164527874452?t=fC5q06hyBSltN1tt7m6v4A&s=19...

this CANNOT be real

wintry tinsel May 22, 2025, 8:02 PM

#

keen ferry isn't o3 more expensive

I’m not sure

brittle tiger May 22, 2025, 8:02 PM

#

tall summit this CANNOT be real

https://x.com/sleepinyourhat/status/1925626079043104830?t=fkAi7kbatZ1dp98WkLK_yg&s=19

He works there

Sam Bowman (@sleepinyourhat)

I deleted the earlier tweet on whistleblowing as it was being pulled out of context.

TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.

small haven May 22, 2025, 8:03 PM

#

ok sonnet 4 is more breathable to use, opus 4 is just slow and heavy

hollow ocean May 22, 2025, 8:04 PM

#

Simple bench 👑

#

https://tenor.com/view/wolf-of-wall-street-lets-goo-gif-11742735383203550293

Tenor

sweet tinsel May 22, 2025, 8:06 PM

#

small haven ok sonnet 4 is more breathable to use, opus 4 is just slow and heavy

Don't want to bother you, but it would interest me how Claude 3.7 and 4 Sonnet would perform on the Deep Research with following prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

misty vault May 22, 2025, 8:06 PM

#

brittle tiger https://x.com/sleepinyourhat/status/1925626079043104830?t=fkAi7kbatZ1dp98WkLK_yg...

sydney is that you?

small haven May 22, 2025, 8:07 PM

#

sweet tinsel Don't want to bother you, but it would interest me how Claude 3.7 and 4 Sonnet w...

ok last one, but u dont want opus 4 no more?

ember rapids May 22, 2025, 8:07 PM

#

Swe bench doubling in 5 months is pretty crazy

sweet tinsel May 22, 2025, 8:07 PM

#

small haven ok last one, but u dont want opus 4 no more?

I mean you can try it again with 4 Opus but i don't think it will end up any better.

small haven May 22, 2025, 8:08 PM

#

kk ill run 3.7 and 4

small haven May 22, 2025, 8:09 PM

#

sweet tinsel I mean you can try it again with 4 Opus but i don't think it will end up any bet...

3.7 asked for feedback, 4 no

unborn ocean May 22, 2025, 8:09 PM

#

wtf i only ran 1 and a half prompt !

torn mantle May 22, 2025, 8:09 PM

#

unborn ocean wtf i only ran 1 and a half prompt !

couldnt run anything

sweet tinsel May 22, 2025, 8:10 PM

#

small haven 3.7 asked for feedback, 4 no

1: All regions 2: Surely, but don't focus on it too much 3: All topics please

misty vault May 22, 2025, 8:10 PM

#

sweet tinsel May 22, 2025, 8:10 PM

#

Does someone know a Deep Research or Agent tool which hasn't been already listed in my Document to try out? https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

hollow ocean May 22, 2025, 8:10 PM

#

New simple bench king

small haven May 22, 2025, 8:10 PM

#

sweet tinsel 1: All regions 2: Surely, but don't focus on it too much 3: All topics please

sorry 4 as well

torn mantle May 22, 2025, 8:11 PM

#

hollow ocean New simple bench king

what are you talking about mbappe

sweet tinsel May 22, 2025, 8:12 PM

#

small haven sorry 4 as well

1: All regions 2: Focus on the primary range, include other time ranges of the expulsion too 3: All should be equal or weighted based on your own preferences

hollow ocean May 22, 2025, 8:12 PM

#

torn mantle what are you talking about mbappe

Wait for the video

tall summit May 22, 2025, 8:12 PM

#

is it still easy to jailbreak claude

hollow ocean May 22, 2025, 8:12 PM

#

You’ll see

unborn ocean May 22, 2025, 8:12 PM

#

wintry tinsel Opus is insanely expensive, like 75$/mil tokens is a little obscene

this is where the 600m eval and 100m capital and very good relationships with the big labs on lmarena's side come into play

keen fulcrum May 22, 2025, 8:12 PM

#

They made claude 4 extra fast
even with expired credits within cursor

small haven May 22, 2025, 8:13 PM

#

he meant slow mode

unborn ocean May 22, 2025, 8:13 PM

#

torn mantle couldnt run anything

but sonnet also felt the need to do stuff like this

keen fulcrum May 22, 2025, 8:13 PM

#

Honestly they increased the slowmode with the latest update, to increase friction to use their usage based pricing

#

It wasn't that terrible before

small haven May 22, 2025, 8:14 PM

#

link

sweet tinsel May 22, 2025, 8:14 PM

#

Btw. Claude 4 Sonnet and Opus can be used freely with the Invitation Code for 14 days in Flow With AI

small haven May 22, 2025, 8:14 PM

#

ok

wintry tinsel May 22, 2025, 8:15 PM

#

tall summit is it still easy to jailbreak claude

In my experience yes

meager harbor May 22, 2025, 8:15 PM

#

why are claude 4 models not on top when you want to choose a model?

wintry tinsel May 22, 2025, 8:17 PM

#

Because it’s a general intelligence not a narrow intelligence like O series

hollow ocean May 22, 2025, 8:18 PM

#

Told you

torn mantle May 22, 2025, 8:19 PM

#

sweet tinsel Btw. Claude 4 Sonnet and Opus can be used freely with the Invitation Code for 14...

thanks

wintry tinsel May 22, 2025, 8:19 PM

#

Definitely not

#

60% we’re getting closer to that 83%

tall summit May 22, 2025, 8:20 PM

#

wintry tinsel In my experience yes

yeah i tried some and phew

wintry tinsel May 22, 2025, 8:20 PM

#

How long until we get an 83% model on simple bench

#

At this rate it’s about an 8-9% improvement per year
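A rough back-of-the-envelope check of that estimate, assuming progress is linear and measured in percentage points per year. The 60% and 83% figures and the 8-9 rate come from the messages above; the linear model and the function name `years_to_target` are assumptions for illustration only.

```python
def years_to_target(current: float, target: float, rate_per_year: float) -> float:
    """Years needed to close the gap at a constant rate (percentage points per year)."""
    if rate_per_year <= 0:
        raise ValueError("rate_per_year must be positive")
    return (target - current) / rate_per_year


# 60% now, 83% target, 8 to 9 points per year (figures quoted in the chat)
slow = years_to_target(60, 83, 8)
fast = years_to_target(60, 83, 9)
print(f"{fast:.1f} to {slow:.1f} years")
```

Under those assumptions the gap of 23 points would take a bit under three years to close; a non-linear trend would change that answer.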

torn mantle May 22, 2025, 8:25 PM

#

opus 4 instruct model

#

is actually so dumb

#

omg

#

no wonder they were focusing on coding

#

this is not looking good

#

wdym

#

oooh

#

https://www.youtube.com/watch?v=DK_0jXPuIr0

YouTube

JustinBieberVEVO

Justin Bieber - What Do You Mean?

#

oh wdym

#

oooh

hollow ocean May 22, 2025, 8:25 PM

#

Second run?

torn mantle May 22, 2025, 8:26 PM

#

instruct model isnt good

hollow ocean May 22, 2025, 8:26 PM

#

Nice

small haven May 22, 2025, 8:26 PM

#

sweet tinsel 1: All regions 2: Focus on the primary range, include other time ranges of the e...

claude 4: https://claude.ai/public/artifacts/c9cdb33b-fff0-4b37-ada5-0d1ab81abcf2
claude 3.7: https://claude.ai/public/artifacts/740cb57b-f258-458e-b20b-d180ec1c09f9

torn mantle May 22, 2025, 8:27 PM

#

small haven claude 4: https://claude.ai/public/artifacts/c9cdb33b-fff0-4b37-ada5-0d1ab81abcf...

thanks

sweet tinsel May 22, 2025, 8:27 PM

#

small haven claude 4: https://claude.ai/public/artifacts/c9cdb33b-fff0-4b37-ada5-0d1ab81abcf...

Both seem to be better than the 4 Opus one.

small haven May 22, 2025, 8:27 PM

#

sweet tinsel Both seem to be better than the 4 Opus one.

ya opus didn't ask for feedback

torn mantle May 22, 2025, 8:28 PM

#

i think

#

its heavily nerfed

sweet tinsel May 22, 2025, 8:28 PM

#

small haven claude 4: https://claude.ai/public/artifacts/c9cdb33b-fff0-4b37-ada5-0d1ab81abcf...

Will add it to my list once I wake up

elder rapids May 22, 2025, 8:28 PM

#

torn mantle this is not looking good

ye

torn mantle May 22, 2025, 8:28 PM

#

they talked about how opus is crazy smart and can do crazy stuff but im not seeing anything like that

elder rapids May 22, 2025, 8:29 PM

#

been spamming the hell out of it

torn mantle May 22, 2025, 8:29 PM

#

small haven claude 4: https://claude.ai/public/artifacts/c9cdb33b-fff0-4b37-ada5-0d1ab81abcf...

Ranking and Justification
Both AI responses are of exceptionally high quality and could serve as excellent summaries of this historical event. However, if forced to choose, I would rank AI2 as marginally better.

elder rapids May 22, 2025, 8:29 PM

#

both models are pretty bad outside of coding

torn mantle May 22, 2025, 8:29 PM

#

gemini 2.5 pro ranking

#

it chose 3.7 over 4

small haven May 22, 2025, 8:29 PM

#

torn mantle Ranking and Justification Both AI responses are of exceptionally high quality an...

lmao

elder rapids May 22, 2025, 8:30 PM

#

torn mantle gemini 2.5 pro ranking

my 2.5 pro > yours so should I do it

sweet tinsel May 22, 2025, 8:30 PM

#

torn mantle Ranking and Justification Both AI responses are of exceptionally high quality an...

You can compare them to other Deep Researches here too: https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing

elder rapids May 22, 2025, 8:31 PM

#

@sweet tinsel btw the DR keeps crashing

#

so it's going to take a while longer

#

mb

sweet tinsel May 22, 2025, 8:32 PM

#

elder rapids mb

All good.

#

Thanks for helping me...

misty vault May 22, 2025, 8:35 PM

#

they are drooling aliens

torn mantle May 22, 2025, 8:37 PM

#

o3 >> gemini 2.5 > opus 4

#

thats the ranking

misty vault May 22, 2025, 8:37 PM

#

what about sonnet

torn mantle May 22, 2025, 8:37 PM

#

forget about simplebench

#

and use your own bench

#

your vibe check

#

HAHAHAHAHA

#

STOP IT

#

YOU ARE KILLING ME

tall summit May 22, 2025, 8:38 PM

#

claude 4 is way better at creative writing

torn mantle May 22, 2025, 8:38 PM

#

you would say anything but the truth

misty vault May 22, 2025, 8:38 PM

#

torn mantle and use your own bench

claude 4 sonnet >> claude 3.7 sonnet >> gemini 2.5

torn mantle May 22, 2025, 8:38 PM

#

next time you will say grok >>>>>>

torn mantle May 22, 2025, 8:38 PM

#

misty vault claude 4 sonnet >> claude 3.7 sonnet >> gemini 2.5

nah

high ginkgo May 22, 2025, 8:38 PM

#

misty vault claude 4 sonnet >> claude 3.7 sonnet >> gemini 2.5

I agree

misty vault May 22, 2025, 8:38 PM

#

gemini is bad at giving sloppy toppies

tall summit May 22, 2025, 8:38 PM

#

tall summit claude 4 is way better at creative writing

im honestly surprised. it ought to get at least 3rd on eqbench creative writing

sweet tinsel May 22, 2025, 8:39 PM

#

Does Sonnet have some special sauce or why is it always the best anthropic model?

#

I tried it and have to object

#

In my testings, yes.

unborn ocean May 22, 2025, 8:40 PM

#

sweet tinsel Does Sonnet have some special sauce or why is it always the best anthropic model...

smaller model -> they can do more rl for the same price

#

and anthropic heavily relies on the fine tuning / post training imo

sweet tinsel May 22, 2025, 8:41 PM

#

Well I didn't use it for coding yet but tried the general capabilities out, maybe it will be better in coding.

torn mantle May 22, 2025, 8:41 PM

#

O3 is also bad at coding compared to anthropic models

#

But overall its better

unborn ocean May 22, 2025, 8:41 PM

#

it's just that because of that having these really big models is not the perfect fit for them

misty vault May 22, 2025, 8:41 PM

#

For coding claude 4 is better than 3.7

#

For general capabilities idk i heard only bad

#

But for coding I tried myself

unborn ocean May 22, 2025, 8:41 PM

#

although the demand for opus size stuff is quite clearly there

misty vault May 22, 2025, 8:43 PM

#

yes

#

u paid for nothing

unborn ocean May 22, 2025, 8:43 PM

#

what kind of tokens per second are you guys getting on both?

misty vault May 22, 2025, 8:44 PM

#

cwaude

sweet tinsel May 22, 2025, 8:44 PM

#

With thinking?

misty vault May 22, 2025, 8:45 PM

#

gpt 4 inner_monologue

#

LMAo

#

Ok i'll release

sweet tinsel May 22, 2025, 8:48 PM

#

Those older Versions of Bing Chat are on Huggingface with like some proxy that uses the Bing Chat API, I don't know if it works anymore but that was a thing in the prior times. If you only need the UI you can get it there. Don't mind my grammar btw I'm writing this with a fever.

misty vault May 22, 2025, 8:49 PM

#

these used bing.com websockets

#

just like those chrome extensions that provided all models in one place but u had to be logged in for each site

#

But someone sent it in the chat here

#

And everyone ignored

sweet tinsel May 22, 2025, 8:50 PM

#

I looked into the code for one of them and one of these actually called an API, worked on incognito mode too.

misty vault May 22, 2025, 8:50 PM

#

That broke my heart

#

sadboyo

torn mantle May 22, 2025, 8:50 PM

#

unborn ocean

If it scored 1600 1500 i will give everyone here claude max sub

misty vault May 22, 2025, 8:50 PM

#

sweet tinsel I looked into the code for one of them and one of these actually called an API, ...

real

#

benchmark it

sweet tinsel May 22, 2025, 8:51 PM

#

I need to find it again, will do it tomorrow.

unborn ocean May 22, 2025, 8:52 PM

#

torn mantle If it scored 1600 1500 i will give everyone here claude max sub

last time they increased +130 for a 3.5 sonnet to 3.7 sonnet thing and this time it is a bigger leap in generations + they have an opus model

torn mantle May 22, 2025, 8:53 PM

#

From what ive seen the reasoning claude 4 models are good but i need to try them

torn mantle May 22, 2025, 8:53 PM

#

unborn ocean last time they increased +130 for a 3.5 sonnet to 3.7 sonnet thing and this time...

But did you notice any difference?

unborn ocean May 22, 2025, 8:53 PM

#

they 'only' have to get 150 points this time

unborn ocean May 22, 2025, 8:53 PM

#

torn mantle But did you notice any difference?

not as large as i hoped honestly (webdev wise)

#

when i tried yesterday with sonnet

torn mantle May 22, 2025, 8:56 PM

#

unborn ocean not as large as i hoped honestly (webdev wise)

Reasoning?

red sluice May 22, 2025, 8:58 PM

#

https://www.reddit.com/r/ChatGPT/comments/1kswt10/prompt_theory_made_with_veo_3/
really impressive

From the ChatGPT community on Reddit: Prompt Theory (Made with Veo 3)

misty vault May 22, 2025, 9:02 PM

#

red sluice https://www.reddit.com/r/ChatGPT/comments/1kswt10/prompt_theory_made_with_veo_3/...

lmfao

keen fulcrum May 22, 2025, 9:04 PM

#

https://www.reddit.com/r/cursor/comments/1kqj7n3/cursor_intentionally_slowing_nonfast_requests/

From the cursor community on Reddit: Cursor intentionally slowing n...

misty vault May 22, 2025, 9:05 PM

#

bro stop leaking

misty vault May 22, 2025, 9:05 PM

#

keen fulcrum https://www.reddit.com/r/cursor/comments/1kqj7n3/cursor_intentionally_slowing_no...

cursor is scam

quiet folio May 22, 2025, 9:14 PM

#

misty vault

This does not work anymore

misty vault May 22, 2025, 9:16 PM

#

you still can lol

quiet folio May 22, 2025, 9:19 PM

#

That needs authentication

misty vault May 22, 2025, 9:20 PM

#

this.

#

dm @quiet folio

brittle tiger May 22, 2025, 9:24 PM

#

https://x.com/lordspline/status/1925639506926973110?t=mvXjo2MN5nlW5HDgQ0cQmQ&s=19

nalin (@lordspline)

fuck

tall summit May 22, 2025, 9:24 PM

#

HAHAHAHAHAHA

misty vault May 22, 2025, 9:27 PM

#

brittle tiger https://x.com/lordspline/status/1925639506926973110?t=mvXjo2MN5nlW5HDgQ0cQmQ&s=1...

Lmfao

#

His gemini 2.5 pro phase has evolved into something new...

#

remove reaction @novel slate

#

sadboyo

elder rapids May 22, 2025, 9:33 PM

#

yep

#

but it's WACK at instruction following

#

killing me bruh

tall summit May 22, 2025, 9:34 PM

#

is there a difference between over there and something else

#

where are you using it

#

dark reader

elder rapids May 22, 2025, 9:36 PM

#

any benchmarks for Claude 4?

#

yet

tall summit May 22, 2025, 9:37 PM

#

elder rapids any benchmarks for Claude 4?

the ones they showed yeah

elder rapids May 22, 2025, 9:39 PM

#

tall summit the ones they showed yeah

no shi

#

are there any benchmarks for Claude 4 yet

tall summit May 22, 2025, 9:39 PM

#

elder rapids no shi

so why ask

elder rapids May 22, 2025, 9:40 PM

#

tall summit so why ask

given they provided them with the release, I'm obviously referring to other benchmarks

tall summit May 22, 2025, 9:40 PM

#

elder rapids given they provided them with the release, I'm obviously referring to other benc...

🚎

verbal nimbus May 22, 2025, 9:41 PM

#

elder rapids are there any benchmarks for Claude 4 yet

There's an SQL one here: https://llm-benchmark.tinybird.live/

AI SQL Benchmark

We benchmark the performance of AI SQL models against a human baseline to help you choose the best model for your needs.

elder rapids May 22, 2025, 9:41 PM

#

nice thanks

verbal nimbus May 22, 2025, 9:45 PM

#

elder rapids nice thanks

Sonnet 4 seems to perform worse than 3.7 and 3.5 on it:

elder rapids May 22, 2025, 9:47 PM

#

verbal nimbus Sonnet 4 seems to perform worse than 3.7 and 3.5 on it:

ye

#

sonnet 4 is pretty low on all the benchmarks I've seen

#

which is nuts tbh

verbal nimbus May 22, 2025, 9:47 PM

#

red sluice https://www.reddit.com/r/ChatGPT/comments/1kswt10/prompt_theory_made_with_veo_3/...

Whoa that's pretty crazy

hollow ocean May 22, 2025, 9:49 PM

#

Sonnet 4 thinking gets 95 on reasoning livebench

small haven May 22, 2025, 9:50 PM

#

hollow ocean Sonnet 4 thinking gets 95 on reasoning livebench

benchmaxxed

elder rapids May 22, 2025, 9:51 PM

#

crazy tbh

#

it's simply not that good

#

ye

unborn ocean May 22, 2025, 9:52 PM

#

small haven benchmaxxed

claude was always above average with these weird puzzles that are not too hard and require a good grasp of language

elder rapids May 22, 2025, 9:52 PM

#

unborn ocean claude was always above average with these weird puzzles that are not too hard a...

this is something this family of models is specifically bad at imo

raven void May 22, 2025, 9:53 PM

#

The parallel scores for Claude are not deep think, it just generates multiple solutions and asks Claude to pick the best
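A minimal sketch of the "generate several, then pick one" idea described in that message, often called best-of-N sampling. Everything here is hypothetical: `best_of_n`, `toy_generate` and `toy_pick_best` are illustrative stand-ins, not Anthropic's actual evaluation setup or any real API.

```python
import random
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    pick_best: Callable[[str, List[str]], int],
    n: int = 4,
) -> str:
    """Sample n candidate answers, then let a selector choose one of them."""
    candidates = [generate(prompt) for _ in range(n)]
    index = pick_best(prompt, candidates)
    return candidates[index]


def toy_generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled model response
    return f"{prompt} -> draft {random.randint(0, 999)}"


def toy_pick_best(prompt: str, candidates: List[str]) -> int:
    # Hypothetical stand-in for asking the model to choose; here: the longest candidate
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


print(best_of_n("fix the failing test", toy_generate, toy_pick_best, n=4))
```

In a real setup both callables would presumably be model calls, so the cost grows roughly with n plus one extra selection call.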

elder rapids May 22, 2025, 9:53 PM

#

it doesn't have a very good grasp of language

unborn ocean May 22, 2025, 9:53 PM

#

and live bench reasoning is kind of that

unborn ocean May 22, 2025, 9:53 PM

#

elder rapids this is something this family of models is specifically bad at imo

no (only looking at the non-reasoning, bc reasoning with claude was always bad for everything besides coding and even there it was never more than competitive imo)

elder rapids May 22, 2025, 9:53 PM

#

3.7 was pretty good at it although prompted

unborn ocean May 22, 2025, 9:54 PM

#

on things like simple bench claude always performed quite well (because that is kind of the reasoning i described)

elder rapids May 22, 2025, 9:54 PM

#

ye but I'm not sure if this one is going to perform that well on simple bench

small haven May 22, 2025, 9:55 PM

#

elder rapids this is something this family of models is specifically bad at imo

opus 4 does not pass my vibes, sorry

elder rapids May 22, 2025, 9:55 PM

#

ye it doesn't

#

it doesn't have the vibes from before

#

not as smart

#

sucks asf

#

it's alright tho no one was depending on anthropic

unborn ocean May 22, 2025, 9:59 PM

#

elder rapids it doesn't have a very good grasp of language

🤨 it is best on all benches

#

and people also share that opinion vibe wise

elder rapids May 22, 2025, 9:59 PM

#

unborn ocean 🤨 it is best on all benches

the new claudes?

unborn ocean May 22, 2025, 10:00 PM

#

old and new

#

claude was known for that

elder rapids May 22, 2025, 10:00 PM

#

that's the point ye

#

that it's always been the best

#

now it's horrible

unborn ocean May 22, 2025, 10:01 PM

#

how is it not good?

torn mantle May 22, 2025, 10:01 PM

#

small haven

Im always right

#

See

#

You should trust me more

#

Thanks

#

What now

#

Ask me anything

#

HAHAHAHAHAH

#

Yes

small haven May 22, 2025, 10:02 PM

#

unborn ocean how is it not good?

what in the benchmaxxing is happening here

dull terrace May 22, 2025, 10:02 PM

#

small haven what in the benchmaxxing is happening here

lmarena is rigged

torn mantle May 22, 2025, 10:02 PM

#

small haven what in the benchmaxxing is happening here

Wrong filtering

dull terrace May 22, 2025, 10:02 PM

#

like i been telling yall

elder rapids May 22, 2025, 10:03 PM

#

unborn ocean how is it not good?

benchmaxxing, just try the model lol

small haven May 22, 2025, 10:03 PM

#

codex is the way to go

unborn ocean May 22, 2025, 10:03 PM

#

but imo they made sonnet 4 not smaller per se but they did def make it more efficient than the old ones (e.g. by having more experts but smaller size for each)
which might be why people don't like it that much

dull terrace May 22, 2025, 10:04 PM

#

I can show proof if you want

unborn ocean May 22, 2025, 10:05 PM

#

well show

raven void May 22, 2025, 10:06 PM

#

All AI progress is fake

#

Flash sucks

torn mantle May 22, 2025, 10:06 PM

#

Bruh

raven void May 22, 2025, 10:06 PM

#

Claude 3 Opus was better

torn mantle May 22, 2025, 10:06 PM

#

Boo

dull terrace May 22, 2025, 10:07 PM

#

unborn ocean well show

Sure

#

we can actually use basic math

#

chatgpt 4o is still top ten in the leader board

#

last time I checked

#

about yesterday

unborn ocean May 22, 2025, 10:08 PM

#

yes

dull terrace May 22, 2025, 10:08 PM

#

Chatgpt4o is definitely not the newest

#

model

#

and using

#

basic parameters and how o3 exists

#

o1 exists

#

etc etc and chatgpt4o is still on top

#

is insane

#

either o3 and that line of models

#

isnt an improvement

#

or

#

something fishy going on

misty vault May 22, 2025, 10:08 PM

#

Show the basic math already

unborn ocean May 22, 2025, 10:09 PM

#

well as the name implies it is a chat version optimized for the preferences of most people

#

with millions of happy users

misty vault May 22, 2025, 10:09 PM

#

unborn ocean with millions of happy users

these are drooling aliens ngl

#

Can we even call them users

dull terrace May 22, 2025, 10:09 PM

#

Thanks @misty vault

unborn ocean May 22, 2025, 10:09 PM

#

(although many of them have likely never heard of any alternatives to openai) (sorry, also was supposed to be about the "millions of happy users")

unborn ocean May 22, 2025, 10:10 PM

#

misty vault Can we even call them users

well they are as the name user implies using the thing and likely also paying for it

dull terrace May 22, 2025, 10:10 PM

#

unborn ocean (although many of them have likely **never** heard of any alternatives to openai...

If you use lmarena

#

you have

elder rapids May 22, 2025, 10:10 PM

#

crazy how 0325 would still be the highest on livebench if they didn't nerf the avg via disproportionate code weighting

dull terrace May 22, 2025, 10:10 PM

#

98%

#

dont know about lm arena

#

so in which case

#

Its odd

torn mantle May 22, 2025, 10:10 PM

#

Is there a way to try opus reasoning for free?

dull terrace May 22, 2025, 10:11 PM

#

torn mantle Is there a way to try opus reasoning for free?

i mean i have a friend

#

who could give u it

#

but prob not

misty vault May 22, 2025, 10:12 PM

#

U guys are drooling aliens I literally provided free all in one llm service months ago here and got completely ignored lmfao

torn mantle May 22, 2025, 10:12 PM

#

dull terrace i mean i have a friend

Then that's not a friend

elder rapids May 22, 2025, 10:13 PM

#

elder rapids crazy how 0325 would still be the highest on livebench if they didn't nerf the a...

btw Gemini 2.0 pro is STILL higher than sonnet 4 and second behind opus 4

#

as a base model

#

on livebench

torn mantle May 22, 2025, 10:13 PM

#

elder rapids btw Gemini 2.0 pro is STILL higher than sonnet 4 and second behind opus 4

Puahaha

misty vault May 22, 2025, 10:13 PM

#

torn mantle Then that's not a friend

The friend might give it to Odin but not you

dull terrace May 22, 2025, 10:13 PM

#

I literally just showed you

dull terrace May 22, 2025, 10:13 PM

#

misty vault U guys are drooling aliens I literally provided free all in one llm service mont...

Wait a min

#

u did?

misty vault May 22, 2025, 10:14 PM

#

yes

misty vault May 22, 2025, 10:14 PM

#

dull terrace Wait a min

Btw don't say a single even slightly negative thing about Gemini 2.5 Pro

#

@hollow ivy is always lurking in this chat

#

and @alpine coral

dull terrace May 22, 2025, 10:15 PM

#

misty vault Btw don't say a single even slightly negative thing about Gemini 2.5 Pro

Thx for the heads up

misty vault May 22, 2025, 10:15 PM

#

I did it and half my family went missing

dull terrace May 22, 2025, 10:15 PM

#

Do you still offer the services tho

unborn ocean May 22, 2025, 10:15 PM

#

elder rapids btw Gemini 2.0 pro is STILL higher than sonnet 4 and second behind opus 4

no, or am i reading it wrong?

elder rapids May 22, 2025, 10:16 PM

#

unborn ocean no, or am i reading it wrong?

non reasoning models

#

2.0 pro doesn't show anymore on the website

unborn ocean May 22, 2025, 10:17 PM

#

elder rapids 2.0 pro doesn't show anymore on the website

well but it was present in the older version and extrapolating from there gives you a lower score than what the claude 4 models have

elder rapids May 22, 2025, 10:18 PM

#

ion know wym

unborn ocean May 22, 2025, 10:18 PM

#

and 2 pro a bit below gpt 4.5 which is now also below the sonnet and opus model

#

so i don't really see how this works out

elder rapids May 22, 2025, 10:21 PM

#

unborn ocean and 2 pro a bit below gpt 4.5 which is now also below the sonnet and opus model

oh ye wait hollon I was looking at another table

dull terrace May 22, 2025, 10:21 PM

#

Which is a another reason

#

lmarena is scuffed

#

but keep going

#

Oai

#

models definitely

#

4o

#

main culprit

elder rapids May 22, 2025, 10:22 PM

#

why do you always lie

#

😭

dull terrace May 22, 2025, 10:22 PM

#

o4 can stay in my opinion

misty vault May 22, 2025, 10:22 PM

#

YEs please debate

#

🍿

dull terrace May 22, 2025, 10:22 PM

#

My fault do you speak Latin?

#

Ok good

#

Now tell me why

#

you think its not scuffed

misty vault May 22, 2025, 10:23 PM

#

@deep adder if you win this debate

#

Ill give bing ai access

dull terrace May 22, 2025, 10:23 PM

#

Well that is a debate

#

So lets go

#

Yes I have. Have you provided evidence

#

to disprove my claim?

#

Well you actually agree with me

#

Explain this

elder rapids May 22, 2025, 10:24 PM

#

what claim did he make

#

😭