#general
doesnt look that different
gonna do pokemon test
The most important thing for me is that they must lower the price of the API.
lol
we have like 4 leo
yeah thats him
bruh
its so hard to choose damn
we also hope for 1m of context and complete multimodality
lol yeah the rest were nuked
whatever im testing on their website is < sonnet 3.7 max thinking
i don't have to stress, anthropic gave me a generous sum of free credits :3
well yes that's because it's not thinking
i see
so thats only the instruct model
so I am no @earnest parcel but I started to collect (late) some relatively tricky questions for LLMs.
From what I can see there is no big change compared to Claude 3.7 (if I am getting Claude 4 ofc)
i thought the new model would choose when to think and when not to think?
Especially something like this let me scratch my head.
I mean sure it is correct (what is important) but not that maintainable
Adults don't choose, take it all.🤑
make sense why its free
they are talking about creation of bioweapons
but isnt it possible with the current LLMs
i just dont understand anthropic really
is the model really super smart or what
done
pretty sure if you get o3 jailbreaked or gemini 2.5 pro you can do such things as well
Dario was always obsessed with the extra security and safety of their models
I think it is good. I mean, as wrote: those models "compress" a lot of human knowledge and if they connect the dots appropriately, they can deliver snippets that can be useful for others to progress their work.
Humans aren't stupid (yet) and they can piece the snippets together
it is difficult to prevent human questioning techniques, or to distinguish real scientific research
the best instruct model we have so far is grok 3
It’s kinda like lobotomy to their models
Hopefully refusals aren’t too bad for opus
There’s always malicious requests
Or GPT 4.5
Have you guys heard of the Continuous Thought Machine by SakanaAI?
I think replacing the input processing and the FFN in transformers to Continuous Thought Machine would lead to AI that can think internally
Give me some prompts to test for Claude 4 please
damn fr?
is it another thinking model?
so, it is impressive it codes (python/JS) but the answer is - considering the hype - not where it should be. Be it Claude 3.7 or Claude 4.
the free model im using shows 'claude sonnet 3.7' and leo said its referring to their latest model claude 4 sonnet
The hype is for opus
also it writes a lot of
"You're absolutely right! "
"You're absolutely right to question that!"
"Brilliant point! You're absolutely right"
and so on.
Its true
interestingly the claude models avoid that in lmarena
oh fr?
I thought Claude 4 wasn't going to be
in the UI
I mean, still impressive to do that work with few prompts in few minutes. But it doesn't match the hype
Taking the results as is would be full of mistakes
whatever is being tested on claude.ai is not that impressive for difficult-hard questions
do you mean that the questions aren't hard or the answers are underwhelming?
tbh I don't expect the new models to be any good
or at least better
sonnet-3.7 on claude chat has more recent knowledge than the 0219 one
anthropic doesn't have the researchers nor the compute to brute force like openAI and innovate like deepmind
and maybe better tool usage. but otherwise it doesn't seem a step change from my usage so far
I have the feeling we are approaching "slow" improvements until a new architecture pops out (like transformers and RL did so far). Actually I am still waiting for a LLM orchestrator that picks specialized models according to the prompt and answer.
one doesn't need AGI. Near AGI is enough if the orchestration of narrow AIs is good enough.
i haven't looked into it / no idea if it's actually consequential.. but the text 'diffusion' model google announced at io struck me as kinda new
adapting what works for image gen to text
some kinda parallel stuff going on
i think
beyond that.. i don't really know what it's about ha but yeah it sounded at least like kinda new
this is why everyone thinks DeepMind is winning
and will win
yes
thats the only difference im noticing
the model isnt smart or anything
they are already taking down the 3.7 model on many providers
can't be long
before they switch
mercury coder did it before 3 months ago
I don't get it, Claude 3.5 was still there the entire time, why taking away 3.7
I mean at least for a certain period where people get their workflow to fit the new model
ah i see.. less innovative than i thought then
do you know if it [mercury coder] is any good?
45mins left
the new gemini flash is also actually way better for instructions like really good
But idk compared to gork or 4.5
Probably 4.5 king because it is obese
more efficient?, idk
I wonder the time when OpenAI uses 4.5 to train a reasoning model
Using the way like AlphaEvolve and Absolute Zero Reasoners do
Claude offers advanced Research capabilities, searching the web and your Google workspace to complete advanced analysis across any topic.
In this demo, Claude analyzes Maggie’s emails and calendar to create an organized daily overview, then conducts a thorough literature review for an education proposal.
Read more about Claude 4 and its c...
It’s not that “special”
What about for coding?
the report looks much better than the previous one
yep
2.5 flash was already good asf
but now I think it's the best available rn imo
by a large margin
no pro is
the point is instruction following
Gemini 2.5 pro is also good for instructions it’s just slower when responding 🤷♂️
Claude has a memory of a fish
Actually GPT 4 was a big boy
1.7 trillion parameter
So why do the companies now stick to models at the hundreds-of-billions parameter level?
It was even bigger than @alpine coral
money
The mainstream media and everyday people dont care about gpt 4 they all love modern chatgpt
gpt 4o marketing brings them more money
?
compared to if they served expensive gpt 4
GPT 4 was the topic
no
As it was the only model available back in the days
it was a hot topic for coders and ai enthusiasts
ai wasnt that big yet in gpt 4 age compared to how hyped it is now
At that moment we didn’t have Gemini and Claude
It reached way more everyday people now
like @alpine coral
It even reached drooling aliens
I really hope claude 4 is also obese model like 3 opus or something
I'm afraid not though
anthropic couldnt afford that could they
yep but that's not the point
Opus should be obese
And I’m now putting high hopes on CTM
gemini 2.5pro is more willing to follow any instructions but its not good at actually succeeding to follow the instructions properly compared to gpt 4.5 (or 4) or 2.5 flash
It’s basically toooo fascinating
i saved one of the Anthropic videos about Claude 4.
"Claude 4 | Research for comprehensive analysis"
the rest is already set to private. but at least one i saved. :D
LMAO
wdym
Okay so it's confirmed
we have claude 4
omaygot
didnt you guys say october 2024
im getting so hard rn from claude 4
you guys have it all wrong
system prompt could be suggesting October 2024
yet it's explaining things past that mark
just guessing that's what was meant
yep
im actually going to buy this
selling bing chat access gpt-4-32k, gpt-4-preview, gpt-4-0314, gpt-4-turbo for 100$
3 min left?
Hear directly from Anthropic executives and product leaders.
this dumbass music bro 😭😭
im getting lotion ready
deadass
alright it's time
they're late
asf
Anthropic L
it's been 30 seconds already
it started
yall ready!!
im watching it here: https://www.youtube.com/watch?v=dt2oDtssDNg
Anthropic has not been a general leader
unironically ye
ts had anime music
😭
me buying the 100$ per month cwuade subxcription before they increase it to 300$ per month
don't disappoint me dario
I’m gonna sleep
you'll regret it if you do
From UTC +8
Claude 4 opus and sonnet
yes
nice
let's go
he said it
yes!!!!
yep bye
man dario be looking like a mad scientist though
ong
going to be streaming this in #1340554757827461215 in a min
So opus 4 only good at coding?
wondering how good it's going to be
When they don’t show the numbers it means the model may actually suck
imma buy claude asap lol
same cost
Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.
it's not the highest in any of the benchmarks pass@1
Bruh sonnet performs better in most tasks than opus
besides SWE
yeah vibes matter
ye
regressed performance on GPQA (without parallel compute) kinda points me towards smaller experts / worse reasoning (very unrealistic) (guesstimate though)
benchmarks kinda pointless now
which might be why they are serving it so quickly
"Kitler"
someone try opus and let us know
ye
ok so that’s why opus wasnt released lol
opus 4 costs less than o3 thats awesome
What is parallel compute?
codex is at 75% swe benched not sota
taking 64 samples and choose the best one
Idk if it’s actually 64
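For anyone wondering what "parallel compute" roughly means here: as described above, it is best-of-N sampling, i.e. draw several candidate answers and keep the one a scorer ranks highest. A minimal Python sketch, assuming a toy generator and a toy scorer. `best_of_n`, `toy_generate` and `toy_score` are made-up names for illustration, N=64 is just the number mentioned above (not a confirmed figure), and none of this is Anthropic's actual setup:

```python
import random
from typing import Callable


def best_of_n(generate: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int = 64,
              seed: int = 0) -> str:
    """Draw n candidate answers and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins: a real setup would call a model to generate
# answers and use a grader or another model call to pick the best one.
def toy_generate(rng: random.Random) -> str:
    return str(rng.randint(0, 100))


def toy_score(answer: str) -> float:
    # Closer to 42 is "better" in this toy example.
    return -abs(int(answer) - 42)


single = best_of_n(toy_generate, toy_score, n=1)
best = best_of_n(toy_generate, toy_score, n=64)
print(single, best)
```

With the same seed, the 64-sample run always scores at least as well as the single sample, which is why a best-of-N number is not comparable to a pass@1 number.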
is there a thinking mode for opus 4 or sonnet 4
Is Opus 4 a thinking model?
yes
I didn't expect that from anthropic though
they dont have the computing power do they
I was expecting a 2.5 Pro type of performance gain
I expect that from Google
Is the Claude Deep Research now with Opus or Sonnet, or can you choose?
yes but google doesnt do that
unfortunately
they are going for the 4o style
small, restarted models trained on only specific stem topics
giving the illusion its smart
But claude prob going same direction because they cant train such big model right
idk
bro they just drew a random line on that hours of work graph
So what to choose?
o3 or Claude 4 Opus?
this is crazy
claude 4 easily
Now I need gemini 3..
or 2.5 ultra
for coding claude
not much high hopes for it being any good for anything outside of coding
but this is great since it aligns
They already compared it to codex, while others compare themselves to 6 month old models. SOTA
Actually ?
think so
they said that it is on all services now
(besides serving the model on 3p, which will take some time)
Oh shxt so it is better than codex cool
Opus looks good ngl
ye
they need a ui like codex and its gg
i think codex is capped in tokens prolly 32
If it's less hallucinations than 3.7, I'll be satisfied
now it is
The SWE-bench and Terminal-bench scores are without thinking
prob because the reasoning still sucks
how is opus?
anyone tried it yet?
how
do u have claude max
claude 4 on code confirmed ✅
it's over
my dog
talking to ts is BAD
and also as vsc extension
Claude Sonnet 4 is here. Now avail in all Copilot plans. And it’s the new base model for the GitHub Copilot coding agent 😎
is 4 opus available in the api
how is 4 opus' vibes
let's go
the 10$ per month sure are worth it
Is opus in Claude code
Broo nobody joined @echo aurora in the voice channel to watch the livestream with him
So mean guys
They left now due to loneliness
Lets go
poor them
😭
I'll be back, had to run to a meeting
When can we compare these new Claude models with o3 and 2.5 on arena?
Clap clap
not clicking phishing links, gimme screenshots
agree
would it be cheaper to pay for claude 4 opus api rather than 100$ per month claude code and just use it with other vs code ai plugins
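Rough way to answer that, as a sketch only: take the Opus 4 prices quoted earlier in the chat ($15 input / $75 output per million tokens) and see how many tokens fit under the $100 monthly fee. The function names and the token counts in the example are assumptions for illustration, and this ignores things like caching discounts or rate limits:

```python
OPUS_INPUT_USD_PER_MTOK = 15.0   # price quoted earlier in the chat
OPUS_OUTPUT_USD_PER_MTOK = 75.0  # price quoted earlier in the chat


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost for a given number of input and output tokens."""
    return (input_tokens / 1e6 * OPUS_INPUT_USD_PER_MTOK
            + output_tokens / 1e6 * OPUS_OUTPUT_USD_PER_MTOK)


def breakeven_output_tokens(monthly_fee_usd: float, input_tokens: int) -> float:
    """Output tokens you can buy before the API costs more than the flat fee."""
    remaining = monthly_fee_usd - input_tokens / 1e6 * OPUS_INPUT_USD_PER_MTOK
    return max(0.0, remaining / OPUS_OUTPUT_USD_PER_MTOK * 1e6)


# Hypothetical month: 5M input tokens (reading files) and 300k output tokens.
print(api_cost_usd(5_000_000, 300_000))
print(breakeven_output_tokens(100.0, 5_000_000))
```

If your monthly usage lands below the break-even, the API is cheaper; agentic coding that rereads big files burns input tokens fast, so heavy users likely come out ahead on the flat plan.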
Im already happy with copy paste
🗿
craig is paid actor by anthropic
Lying all the time
is the claudes on webdev?
Nightwhisper made a better ui
i dont even think nw is real anymore, did that even happen?
tbh there's nothing that compares to NW
it might have been a dream
it was so beautiful
what about opus?
30yo, single, otter body, 183cm, 80kg.
looking for sugar daddy cause opus 4.
elite vibes, was smart asf and successfully at code SOTA
Badest ai i see ever
i dont care about your net worth, daddy
so first vibes meh?
gonna try deep research with claude opus
can someone send me a prompt for deep research
Ok
you
What's the problem now?
Perplexity is good when it's free. I got Perplexity Pro for free by Telekom (T-Mobile Germany) and they have Claude 4.0 Sonnet Thinking already, just hoping for Opus now like with Opus 3, which was available in Perplexity.
why do i need a net worth when im looking for sugar daddy?
Like nearly unlimited use of o4-mini, R1, Gemini 2.5 Pro and Grok 3. It even behaves like normal Chatbots when the Websearch has been turned off.
Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.
Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.
This would be very interesting for me.
true
Me
Asura is a bad free-for-all server on the isle
Huh
This is my go to prompt to Test Deep Researches of ChatGPT, Agents and such. Already have tried Flowwith AI, Grok DR, Grok DDR, Manus, ChatGPT DR (o3), Gemini DR (2.0 Flash), Perplexity DR, Genspark Super Agent and Ithy. Claude DR and Gemini DR are still missing as i havent gotten a Sub there. When you want i can send my DR Test with this prompt in.
😔
So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy
They made it dumber
much better vibes tho, but it's still not 3.5 sonnet lvl
Wth is that
server for isle
Oh
ur allowed to kill on sight any dinosaurs who is nesting... sucks
I'll do Gemini DR for you
if you'd like
no its sucks. u cant even nest without getting kos'd
Thanks! I would appreciate that.
Can you re-run again that prompt @small haven
why
alr it's started
To compare the old research with this one
I'll let you know when it's done
oh yea
goddamn
they're changing the setting
one minute they're standing up like normal
next it's an interview
this is gonna be fun
use a tissue, dont cover the lmarena discord chat
real
open ai 63
whoa gonna be #1 in the web arena gemini or opus?
gork 3.5 misinformation trend 2.0 starting today
ai gen (nvm did not read context)
2.0 flash image preview 😭
Did the guy who I was going to ping leave the server because I called him a drooling alien
claude code is nice, no environment setup needed like codex
idk weird name
nvm he didn't
what model did you use?
did you put it through chatgpt image gen
but i miss wanting to spam tasks on the go...
opus 4
compare claude 4 opus to oai deep research
fr
compared to the demo from the video?
u know dario talks 100% like the nerdy profs at my uni that have trouble upholding a normal social life
and like to yap a lot
That's because he is a drooling alien
which is sad
everybody is like drooling alien to you
troll or actually lol
I can't find it yet on side by side comparison? It's only on blind tests now...?
Have any of the anon models this month hinted at being Claude or nah?
it extracted certain parameters this time
I haven't played with the arena the past two weeks
thats what i thought as well
its meh ish, chatgpt dr better
only @alpine coral and dario is
but i did say "assume" for the feedback @torn mantle
@narrow elbow Did u just drool on my comment
so in perpetuity, ur saying codex > claude code
ye I've been using it, testing it for a bit, and as far as I can tell, its particularly worse at language
it obfuscates technical concepts and I have to introduce language to help it communicate
umm what
ye it didn't know what a category error was or how to get it across
i mean right now opus on claude code is pretty damn slow
but it knew there was a category mistake
codex is speedier ngl
Nah the people who joined in person get 3 months free claude max
we have to pay 300$💀
i was there
oh huh how odd
opus 4 is slow but beefy and smarter than codex
you can tell its smart, but it doesn't have foresight imo
please add claude 4
so oai still top 1?
wonder how this affects coding
you're a drooling alien
@small haven whats your take?
Why?
@keen beacon whats your take
@small haven
1. o3
2. opus
3. gemini 2.5
?
lmao
for dr? looks like it
I love you
yes o3 is still ahead, but code wise (speed aside) opus 4 is ahead
deadass
and lets see how they act
opus 4 spent 5 mins reading files 😭
fastest ever
brother eww
o3 or codex we talking? cus codex is way more snappier
ya ig..
gonna be using opus 4 for actual bugs that codex cant solve, its just too long
ok send the new one
i did say "assume" when it asked me for clarifications
shes asking a literature review
ok
hyes please
@small haven
Please conduct a comprehensive literature review of academic research that addresses and synthesizes the current understanding of the following pressing and practical questions related to the Theory of Constraints (TOC) and its DBR (DBR) scheduling mechanism. These questions aim to address current gaps and needs in both theory and practical application:
Dynamic DBR in Volatile Environments:
Question: To what extent can advanced analytical techniques (e.g., machine learning, AI, real-time simulation) enhance the dynamic management of buffers (size, location, priority) and the proactive identification of shifting constraints in DBR systems operating under high demand volatility, supply chain disruptions, and significant process variability?
Focus: Practical performance improvements (throughput, lead time, on-time delivery, resilience) in complex manufacturing, service, or project environments. What are the limitations of current DBR models in such contexts, and how can these be overcome?
Adapting and Validating DBR for Complex Service Operations and Knowledge Work:
Question: How can DBR principles be effectively adapted, validated, and implemented to optimize workflow, reduce lead times, and manage bottlenecks in complex service delivery systems (e.g., healthcare patient flow, software development pipelines, public service delivery, R&D processes) characterized by high variability, non-physical work items, and intangible constraints?
Focus: Developing and testing novel DBR configurations or hybrid models suited for the unique challenges of service and knowledge work environments, including the definition of "drum," "buffer," and "rope" in these contexts.
wait i may or may not get opus 4 ? @deep adder
run it
When does it release today?
hallucinations.....
wen opus 5
dario wtf
cool thanks
it also got confused with the months smh
so it still overwrites tests when it can't find a solution, great..
opus 4 is bad guys
lets go!!
to be frank, this is a hard task codex couldn't do it too, so.. opus 4 > codex
Thanks!
This is a good unbiased overview guys
Hes bascially saying : use opus 4 only for coding
As it's still < o3 in many areas
How does it compare to codex?
hows the answer
A bit busy currently, i will read it a bit later.
codex doesn't overwrite/hallucinate and actually tells u that it can't solve it, much prefer that than lies
but im guessing if ur coding frontend obviously opus 4 is undebatable
wtf
an ai that admits it can't do something
I never experienced that
I am impressed by claude 4
I think I will use it over 2.5
The best non-thinking models on SWE-bench (the SWE score is without thinking)
That’s a catastrophic landslide for Anthropic lol
Nah just buy some API calls unless you use it constantly
Why ?
Or buy a sim theory subscription
I was thinking of this
tf is this
The underlying architecture is more important than anything you enhance it with
Actually it’s a little more than a million
I think I actually use that amount if not more in a month
It comes with the added benefit of unlimited use of all open source models and Gemini 2.5 pro
Even if you run out of tokens
we will see if we have the scores of gemini 2.5 pro non think, and grok 3.5
Sim theory does which is why I use it, I get some good mileage with Claude and than I can use deep seek and Gemini unlimited
claude 4 opus sucks, holy moly, aight im done playing with this sorry dario
im not designing ux unfortunately lol
sure it will hit 1500 on webdev
o3 is still goat, dont be biased
What sucks about it?
does it have gpt 4
read above
It has everything
it has the hallucinations
ayooo
sorry i repent
😊
why reverse the order?
They haven't even released 2.5 properly
It's pretty good but i'm quite perplexed by the missing details and the pretty irregular formatting. Could you maybe send the Gemini share link for it, because it could be an error by OpenOffice.
psychological trick to save costs
no swearing please
@small haven Could you do this Prompt for Claude DR?
they are not. This is the singular benchmark claude always does the best at. And you need to ignore the lighter shade as that's equivalent of deep think
Kinda insane that they did this parallel processing that is 100% internal and they didn't even release it
but still reported benchmark scores for it
💀
like what is the point, other than mislead people who don't read footnotes...
Well fair but Claude 4 Opus is probably already better than the Gemini 2.5 Ultra Google didn't release
when is claude 4 going to be added
in the graph you pasted at least it's explicit, but here this is way less obvious:
My friend created account 6 hours ago with no number
scores after "/" are basically useless
Did they remove the requirement?
Or sussy countries dont need verify??
vpn phone verify bypass?????
a free browser extension works
then u can uninstall it
the one in vivaldi browser does, but you can't select the exact country you want
I did buy their premium too, but tbh it's way less stable and slower than Avast VPN. Much cheaper though
Is the canvas feature that claude and chatgpt uses function calling? or just visual trickery but regular conversation with codeblocks in the background and special system instruction
Thanks
Imma read
I can do it all over again if you'd want me to
If you want to, you can.
The Displacement of Nations: A Comprehensive Analysis of the Mass Expulsion of Ethnic Germans after World War II Introduction The end of the Second World War in Europe ushered in a period of profound geopolitical restructuring and demographic upheaval. Among the most significant and tragic of th...
not good
Still the weird formatting on some points.
So its not OnlyOffice.
which points?
Generally the Sub-Section Headlines and the text. It is sometimes also inconsistent with deciding whether to use bullet points or text and it just puts bullet points into raw text which i dislike.
maybe the existing email account is the problem
.
fr
oh ye Ikwym
bing chat messages per conversation limit increased from 35 to 50 today
Microsoft still using it internally?!?!?!
yes
gg
alr started it up again, Gemini deep research is wildly inconsistent in how it plans things out
I think it is token dependent, not sure though
It depends on amount of tokens
im now 5x slower/unproductive using claude code from when i was using codex, thank u dario
claude not agi 
The difference between GPT 4.1 And O3 is the same as between O3 and Claude 4Opus 👀
agi
Bro learned from sydney
huh
This is how my Deep Research Test progressed so far, if you have the missing parts you could DM them to me.
Deep-Research Tests Prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and ...
i forgot about the dr lol
Thanks!
Is this 4 Opus?
Okay! Was it this short before too with 3.7?
i dont know
Looks like with the others too that Opus 4.0 produces very short Deep Researches
cwaude also benchmaxxed for coding 
is sonnet 4 neptune?
bro u guys never heard of bing chat or sum
this aint special bro
waiting for claude 5
Why is Claude still publishing models with 200k context in 2025?
Isn't 1M the standard now?
Especially code models need large context.
me with gpt-4-0314 on 4k context 
i mean 2 yrs ago it was 4k, zoom out
crazy to think gpt-3 had 2k
Any idea when claude and opus will be on the lmarena
I would guess it would land there as soon as they have more available compute.
https://x.com/EMostaque/status/1925624164527874452?t=fC5q06hyBSltN1tt7m6v4A&s=19
Lmao I wonder if this is true
Damn thats an anthropic safety person in the screenshot and he has deleted after pushback. Pretty dumb thing to post on launch day
soon! (but don't tell anyone this is a secret)
Opus is insanely expensive, like 75$/mil tokens is a little obscene
Is it done?
Sonnet 4 is the new best practical model
isn't o3 more expensive
I want to see a proper comparison between opus and sonnet, sonnet seems to be pretty neck and neck
this CANNOT be real
I’m not sure
https://x.com/sleepinyourhat/status/1925626079043104830?t=fkAi7kbatZ1dp98WkLK_yg&s=19
He works there
I deleted the earlier tweet on whistleblowing as it was being pulled out of context.
TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.
ok sonnet 4 is more breathable to use, opus 4 just a slow and heavy
Don't want to bother you, but it would interest me how Claude 3.7 and 4 Sonnet would perform on the Deep Research with following prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.
sydney is that you?
ok last one, but u dont want opus 4 no more?
SWE-bench doubling in 5 months is pretty crazy
I mean you can try it again with 4 Opus but i don't think it will end up any better.
kk ill run 3.7 and 4
3.7 asked for feedback, 4 no
wtf i only ran 1 and a half prompt !
couldnt run anything
1: All regions 2: Surely, but don't focus on it too much 3: All topics please
Does someone know a Deep Research or Agent tool which hasn't been already listed in my Document to try out? https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing
New simple bench king
sorry 4 as well
what are you talking about mbappe
1: All regions 2: Focus on the primary range, include other time ranges of the expulsion too 3: All should be equal or weighted based on your own preferences
Wait for the video
is it still easy to jailbreak claude
You’ll see
this is where the 600m eval and 100m capital and very good relationships with the big labs on lmarena's side comes in to play
They made claude 4 extra fast
even with expired credits within cursor
he meant slow mode
but sonnet also felt the need to do stuff like this
Honestly they increased the slowmode with the latest update, to increase friction to use their usage based pricing
It wasn't that terrible before
link
Btw. Claude 4 Sonnet and Opus can be used freely with the Invitation Code for 14 days in Flow With AI
ok
In my experience yes
why are claude 4 models not on top when you want to choose a model?
Because it’s a general intelligence not a narrow intelligence like O series
Told you
thanks
yeah i tried some and phew
How long until we get an 83% model on simple bench
At this rate it’s about an 8-9% improvement per year
opus 4 instruct model
is actually so dumb
omg
no wonder they were focusing on coding
this is not looking good
wdym
oooh
oh wdym
oooh
Second run?
instruct model isnt good
Nice
thanks
Both seem to be better than the 4 Opus one.
ya opus didn't ask for feedback
Will add it to my list once I wake up
ye
they talked about how opus is crazy smart and can do crazy stuff but im not seeing anything like that
been spamming the hell out of it
Ranking and Justification
Both AI responses are of exceptionally high quality and could serve as excellent summaries of this historical event. However, if forced to choose, I would rank AI2 as marginally better.
both models are pretty bad outside of coding
lmao
my 2.5 pro > yours so should I do it
You can compare them to other Deep Researches here too: https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing
they are drooling aliens
what about sonnet
forget about simplebench
and use your own bench
your vibe check
HAHAHAHAHA
STOP IT
YOU ARE KILLING ME
claude 4 is way better at creative writing
you would say anything but the truth
claude 4 sonnet >> claude 3.7 sonnet >> gemini 2.5
next time you will say grok >>>>>>
nah
I agree
gemini is bad at giving sloppy toppies
im honestly surprised. it ought to get at least 3rd on eqbench creative writing
Does Sonnet have some special sauce or why is it always the best anthropic model?
I tried it and have to object
In my testings, yes.
smaller model -> they can do more rl for the same price
and anthropic heavily relies on the fine tuning / post training imo
Well I didn't use it for coding yet but tried the general capabilities out, maybe it will be better in coding.
it's just that because of that having these really big models is not the perfect fit for them
For coding claude 4 is better than 3.7
For general capabilities idk i heard only bad
But for coding I tried myself
although the demand for opus size stuff is quite clearly there
what kind of tokens per second are you guys getting on both?
cwaude
With thinking?
Those older Versions of Bing Chat are on Huggingface with like some proxy that uses the Bing Chat API, I don't know if it works anymore but that was a thing in the prior times. If you only need the UI you can get it there. Don't mind my grammar btw I'm writing this with a fever.
these used bing.com websockets
just like those chrome extensions that provided all models in one place but u had to be logged in for each site
But someone sent it in the chat here
And everyone ignored
I looked into the code for one of them and one of these actually called an API, worked on incognito mode too.
If it scored 1600 1500 i will give everyone here claude max sub
real
benchmark it
I need to find it again, will do it tomorrow.
last time they increased +130 for a 3.5 sonnet to 3.7 sonnet thing and this time it is a bigger leap in generations + they have an opus model
From what ive seen the reasoning claude 4 models are good but i need to try them
But did you notice any difference?
they 'only' have to get 150 points this time
not as large as i hoped honestly (webdev wise)
when i tried yesterday with sonnet
Reasoning?
lmfao
bro stop leaking
cursor is scam
This does not work anymore
you still can lol
That needs authentication
HAHAHAHAHAHA
Lmfao
His gemini 2.5 pro phase has evolved into something new...
remove reaction @novel slate

is there a difference between over there and something else
where are you using it
dark reader
the ones they showed yeah
no shi
are there any benchmarks for Claude 4 yet
so why ask
given they provided them with the release, I'm obviously referring to other benchmarks
🚎
There's an SQL one here: https://llm-benchmark.tinybird.live/
We benchmark the performance of AI SQL models against a human baseline to help you choose the best model for your needs.
nice thanks
Sonnet 4 seems to perform worse than 3.7 and 3.5 on it:
ye
sonnet 4 is pretty low on all the benchmarks I've seen
which is nuts tbh
Whoa that's pretty crazy
Sonnet 4 thinking gets 95 on reasoning livebench
benchmaxxed
claude was always above average with these weird puzzles that are not too hard and require a good grasp of language
this is something this family of models is specifically bad at imo
The parallel scores for Claude aren't deep think, it's just generating multiple solutions and asking Claude to pick the best
it doesn't have a very good grasp of language
and live bench reasoning is kind of that
no (only looking at the non-reasoning, bc reasoning with claude was always bad for everything besides coding and even there it was never more than competitive imo)
3.7 was pretty good at it although prompted
on things like simple bench claude always performed quite well (because that is kind of the reasoning i described)
ye but I'm not sure if this one is going to perform that well on simple bench
opus 4 does not pass my vibes, sorry
ye it doesn't
it doesn't have the vibes from before
not as smart
sucks asf
it's alright tho no one was depending on anthropic
🤨 it is best on all benches
and people also share that opinion vibe wise
the new claudes?
how is it not good?
Im always right
See
You should trust me more
Thanks
What now
Ask me anything
HAHAHAHAHAH
Yes
what in the benchmaxxing is happening here
lmarena is rigged
Wrong filtering
like i been telling yall
benchmaxxing, just try the model lol
codex is the way to go
but imo they made sonnet 4 not smaller per se but they did def make it more efficient than the old ones (e.g. by having more experts but smaller size for each)
which might be why people don't like it that much
I can show proof if you want
well show
Bruh
Claude 3 Opus was better
Boo
Sure
we can actually use basic math
chatgpt 4o is still top ten in the leader board
last time I checked
about yesterday
yes
Chatgpt4o is definitely not the newest
model
and using
basic parameters and how o3 exists
o1 exists
etc etc and chatgpt4o is still on top
is insane
either o3 and that line of models
isnt an improvement
or
something fishy going on
Show the basic math already
well as the name implies it is a chat version optimized for the preferences of most people
with millions of happy users
these are drooling aliens ngl
Can we even call them users
Thanks @misty vault
(although many of them have likely never heard of any alternatives to openai) (sorry, also was supposed to be about the "millions of happy users")
well they are as the name user implies using the thing and likely also paying for it
If you use lmarena
you have
crazy how 0325 would still be the highest on livebench if they didn't nerf the avg via disproportionate code weighting
Is there a way to try opus reasoning for free?
i mean i have a friend
who could give u it
but prob not
U guys are drooling aliens I literally provided free all in one llm service months ago here and got completely ignored lmfao
Then that's not a friend
btw Gemini 2.0 pro is STILL higher than sonnet 4 and second behind opus 4
as a base model
on livebench
Puahaha
The friend might give it to Odin but not you
I literally just showed you
Wait a min
u did?
yes
Btw don't say a single even slightly negative thing about Gemini 2.5 Pro
@hollow ivy is always lurking in this chat
and @alpine coral
Thx for the heads up
I did it and half my family went missing
Do you still offer the services tho
no, or am i reading it wrong?
non reasoning models
2.0 pro doesn't show anymore on the website
well but it was present in the older version and extrapolating from there gives you a lower score than what the claude 4 models have
ion know wym
and 2 pro a bit below gpt 4.5 which is now also below the sonnet and opus model
so i don't really see how this works out
oh ye wait hold on I was looking at another table
Which is another reason
lmarena is scuffed
but keep going
Oai
models definitely
4o
main culprit
o4 can stay in my opinion

