#general
Craig answer the question
reiterate it
please
We do not have all day
I would like to see your benchmarks
to please
No no
you stated it
Tell us why you said it
from what perspective
If one model is benchmaxxed
why wouldn't others be??
Answer that now?
You think google is the only one that benchmaxxed?
Ok good we're getting somewhere jimmy
That is contradictory to your previous statement.
You said this
No worry my dear sir
It's ok little jimmy take your time
Thank you jimmy
Now I would like for you to explain one thing
for me.
How is gpt4o still up while being generations away from the other llms.
Specifically in the oai line of models
@deep adder 's name is Sydney
Thx i take free sponsors
Ok
Good point
A counter argument
even if they are constantly updated
why would not the newer model be better,
Either way
A new gen should be better than an old gen
Whats going on rn
@misty vault A playstation 2
is better than a playstation 1
right
chatgpt 4o
gpt 4.1.
Answer that then
LMArena measures model capability and human preference
No worry I can read
gpt-4-0314 is agi
Let's use your logic for a second
If gpt 4o is the newest one
How come previous versions of it still are in the top ten
i take that back
or top 15
Does that make sense
@torn mantle What happened to change your mind
i cant share details yet, but its something jaw-dropping
.....
So gpt2 is better than gpt4o
It's the point of a new generation
Whats going on rn
loving the debate but remember:
✅ Treat others with Respect.
drooling alien
I'm kidding @echo aurora I apologize, I love you (in a friendly way)
New model in Beta Text2Image: gemini-2.0-flash-preview-image-generation
You just lost.....
Holy cope
But good debate
maybe in 2 years or so you do really good i see the potential in you gl
Does that mean he is agi
No agi
and consciousness are 2 different things.
gpt-4 is asi and chatgpt 4o is artificial stupidity but gpt-4 is older model
thx for clearing the debate
@misty vault So who won in your opinion?
I will ask sydney
0 substance over the last 50 messages whats happening here
are there any new anonymous models lately? anything interesting?
this is claude's assessment (based on the discussion) of who won:
Craig Federighi's estimated cognitive profile:
[...]
Estimated range: 115-125 IQ - Above average to superior range, with strong logical reasoning and debate skills.
Odin's estimated cognitive profile:
[...]
Estimated range: 95-105 IQ - Average range, with notable weaknesses in logical reasoning and debate methodology.
Key difference: Craig demonstrates systematic thinking and logical consistency, while Odin relies more on assertion and deflection when challenged on specifics.
@deep adder new claude thinks you are highly intelligent 🤣
(obv the calculation is really dumb and not based on reality)
LMAO
yeoooo lmao
furthermore, it did not even think to consider your age, which is kind of the second most important thing for IQ
has the new claude done anything interesting yet
honestly will be more excited for full remote mcp and tools rollout than this
Give me any questions you want to ask Opus I got unlimited access
I guess you could ask on lm arena though
yea
Give prompt
I will give gpt 4
"odin relies more on assertion and deflection when challenged" 😭
I’ve jailbroken 4 Opus
Opus is king of simple bench
🔥🔥🔥
assess the IQ of these two main debaters in great detail:
(i know it is not as easy as that but you can do a guesstimate to compare the scores):
Age-adjusted IQ estimates:
Craig (assuming ~21 years old):
His reasoning ability seems appropriate for a bright college student. The logical consistency and debate skills suggest IQ around 120-130 - gifted range, roughly 91st-98th percentile for his age group.
Odin (assuming ~16 years old):
His reasoning patterns are more concerning when age-adjusted. While some impulsivity is normal for teens, the logical inconsistencies and inability to support arguments suggest IQ around 90-100 - average to low-average range, roughly 25th-50th percentile for his age group.
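The percentile ranges quoted above can be sanity-checked with a few lines. This is only a sketch: it assumes the usual convention that IQ scores are normally distributed with mean 100 and standard deviation 15, and the helper name `iq_percentile` is made up for illustration. It says nothing about whether the IQ estimates themselves are meaningful.

```python
from statistics import NormalDist

# Assumption: IQ is scaled to a normal distribution with mean 100, SD 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def iq_percentile(iq: float) -> float:
    """Return the percentage of the population scoring at or below `iq`."""
    return IQ_SCALE.cdf(iq) * 100


# The four endpoints of the two ranges quoted in the assessment.
for score in (90, 100, 120, 130):
    print(f"IQ {score}: percentile {iq_percentile(score):.1f}")
```

Under that assumption, 90 and 100 land at roughly the 25th and 50th percentiles, and 120 and 130 at roughly the 91st and 98th, so the percentile figures are at least consistent with the IQ ranges they are attached to.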
no way he is 16, claude is like really really dumb asf (for 🍍 : my point was more that claude is just assuming way too much and going at the problem in a really unintelligent way)
craig is gifted
lets keep things respectful pls.
Im 10 lol
so what happens then
bru
u got homework bud
my age is ???
congrats you are even more 'intelligent' 🥳 than @deep adder
yayyy
I've just tried claude 4, still prefer gemini even the nerfed one we have now.
That was too easy bro
?
@dull terrace claude is already reading the stars and mapping out your future, lol
odin will be a nobel prize winner
I do own an ai organization hm
interesting
im 14 for transparency
hm
yo it was kinda close to 16
yeah i might have to see claude 4 lol
Web arena prompt: 'chatbot arena that was designed by Jony Ive' - Opus 4 response
oh
can i show this without triggering people, jk
ohhh mb
ya thats impressive
opus or sonnet tho
you guys are fighting over nothing
claude and chatgpt have been able to use code for a long time
chatgpt and claude have been able to do that for a long time
at least they can rn
idk i never really messed around with it much
yeah
it's amazing
(also amazing that it took us so long to figure this out since it's just a ui feature... i guess training is important)
i havent had a chance to play with models, what is the run down?
claude opus is really good at counting numbers
opus is unironically especially bad outside of coding but has good tool usage
sonnet is more leveled
o3 high gets an approximate value but after 17k output tokens
why is it that o3 and opus can't follow instructions bro ts genuinely is getting me heated
gpt 4.5 fails miserably even with 0 temperature
gpt 4.1 too
this is essentially a reasoning process tho
it's not that simple
claude opus didn't need
i like the new ui guys
4.5 still on?
thought they killed it
I'm talking about Claude opus lol
do they have 3.5 opus in lmarena?
they do have 4 opus in the arena ye
4 opus?!
ask this in the api
ye there's no 3.5
ok assuming you're talking about the battle section bc i couldn't find it in side-by-side dropdown
ye unfortunately
any tricks to get 4 opus (or any specific model) every time? or with a high probability of success? (in the battle ui)
when the the the when uhh the when
it doesn't have any particular traits afaik but I haven't tested its short response tendencies
so I'm not sure
although you could always ask "what model are you"
and that narrows it down considerably
you cant
ya true but ill get pissed off if i see amazon models or llama there
ye it clogs things up
there's no other way to get it though
nothing helps
no not that one @elder burrow this is different, its about tricks to get a specific model from the random choice
why would you want that
poe is paid?
4 opus is prolly too heavy to give free access tho
oh damn 4k points for a single request for opus lmao
and you only have 3k total
if you're not a subscriber
on poe
they test new models in the battle mode even when its not available for direct chat
it had o3 before it was released and so on
poe my ass
damn nice
sorry man
btw any UI where we can chat with multiple models at once?
i like lmarena side by side for the same reason bc a single model doesn't always provide the right answers or teach things well
if you dont mind gemini then uh
it has side by side
its called "compare" and it lets you tweak everything in both sides individually like temp, sysprompt, grounding (google search) n stuff
only for gemini models tho right
ye
its bad? i heard people talk so many great things ab 3 opus.. so i thought this is going to be similarly dope
for anything but coding it's surprisingly bad imo
and it's not like performance simply degraded
it's more like exceptionally bad
at certain tasks
as opposed to 3.7 sonnet, or the Almighty 3.5 sonnet
damn thats crazy
what all did u try apart from coding
I think the base model isn't bad
But it's clearly not better than gemini 2.5 thinking
philosophy, but it sucked at instruction following
and couldn't really comprehend without shoehorning itself
into a narrow interpretation
Also i think anthropic are painting this whole opus 4 image wrong with their asl-3 definition
ye I agree
What anthropic are doing is basically running their models in a sandboxed env
Without any guardrails
tbh i find 2.5 pro better than 2.5 thinking
Its obvious you will expect some weird behaviors that will be fixed later
People are taking this out of context
oh damn
ye so you'd expect it to be bad at writing and stuff
Wdym we dont have the base model yet
and it p much is
Knowledge-wise, gemini and o3 are clearly the winners
Ive tried asking opus some niche questions
i find any anthropic model's writing very nice.. surprising to know opus 4 is bad at it
yeah I was disappointed asf
The answer was good, but it lacked in many ways compared to others
we got 2.5 pro preview on lmarena
it doesn't seem crazy heavy like gpt 4.5
but it seems smart in a way
it's just, not that intelligent
it's kind of a mid model overall, but I'll prolly keep playing with it
hmm.. why would they be testing this rn
Its not fair to compare the instruct model vs thinking models, but knowledge is something static, and i find opus4 below o3 & gemini on that
ye
It all makes sense to me now
Anthropic keynote title was heavily focused on coding
Dario said AGI will come in +4 years, Demis had a much lower time prediction
There is a reason for that
yep
Dario = didn't see much improvements
Demis = saw the opposite
Alphaevolve?
Co-scientist?
Gemini deep think?
it seems like everyone else but DeepMind are going backwards
didn't demis just recently say 10 years tho
He did?
he said "not long after 2030"
at MOST
Pretty sure Dario was adding a year each time they ask him
ahh sry
Google's next step is to focus on agentic usage
demis has kept a pretty consistent timeline
Parallel tool calling
yep and I have a feeling they're going to do it right
like some crazy shi
veo 3 is mind boggling
alpha evolve is mind boggling
Tbh i dont think we can predict when we will reach AGI
I'm not that confident in AGI at all
And i don't like how we are fixated on that
ye
I mean
but I think extremes are inevitable simply due to the fact you can't overestimate the potential of AI
What indicators are they using to predict the timeline?
Are they taking into consideration sudden breakthroughs?
Is it only scaling law indicator?
it's not scaling and it's not that arbitrary imo
capabilities of LLM's themselves, emergent ones are kind of magic
as cliche as it sounds the fact
probability, trends in extraordinarily large numbers = these things we're seeing and not predetermined
we don't understand it
Unfortunately we are still following a normal distribution
Which is good and bad
If it's a strong distribution, you are more confident and accurate about your next word, but it's information retrieval, it doesn't reflect how we think
Worth repeating:
Do not confuse retrieval with reasoning.
Do not confuse rote learning with understanding.
Do not confuse accumulated knowledge with intelligence.
Tbh i think oai managed a bit of reasoning
Better than other labs
llama 4 maverick >>
Okay, so I just played around on lmarena. Had it interpret a modernist essay with a clear structure. Claude 4.0 Opus killed it. I totally expected Gemini 2.5 Pro to flop, but after a short wait (Gemini's slower, but the generated content was more detailed), its performance was basically the same... 🤷♂️ This one test isn't definitive, but my main takeaway is: Claude is a beast! Super quick and token-efficient for explanations. Only downside is Opus can be a bit steep on the wallet.
In engineering, the more you know about a problem, the longer the timeline estimate becomes.
I think this is similar to the problem when product managers give ETAs to external customers without consulting with the engineers in the trenches. Demis is an actual AI expert and gave a more conservative timeline of 5-10 years. The CEOs who aren't AI experts gave overly aggressive timelines.
ask it to write an essay with a criteria with the addition of sentences styles, or a creative essay that's formal enough for academic setting
I think Demis said "not long after 2030" because he didn't want to diverge too much from Sergey and make the interview awkward. He said 5 - 10 years in the 60 minutes interview a few weeks ago
ye seems that way
although I think he is more confident in the shorter part of the timeline
just not 5 years imo
I agree it'll likely be around 2031
or 2032
rip windsurf, dario is mad
ye, if anyone I would trust demis
@balmy mist accidentally left agent neo on overnight and I just barely checked, and agent neo started getting mad at itself for not performing the browser tool correctly, like deadass it was getting mad
what is agent neo
flowith agent
24hr agent thats pretty cool
It's sonnet 4 as you know now but I think it's another cpt but I just woke up and didn't play with it that much. I think the Claude 4 models are cpts, but not sure and am not gonna defend that for now
Both Claude 4 sonnet and Claude 4 opus are cpts (initial pretraining 3.5 sonnet and 3.5 opus, my guess without really looking into it too much but examining the timeline etc)
3.7 sonnet was an experiment with it
The rumors that 3.5 opus was disappointing were true it would seem
No it's too fast
To be pretrained from scratch among other reasons
ATP I don't see other possibilities being likely but I don't know
I'm not sure
seems like they just focused a lot on coding
Anthropic seems to be losing ground in every area except coding
And even coding is debatable
Their rate of improvement seems a bit too slow
I suspect it'll be really high with thinking but Ion think this means anything, multiple choice doesn't reveal the full story, especially with how bad the models really are
who even mentioned 2.5 pro
😭
ye but that's not what's being evaluated here
claude 4 is definitely not bad
imo there are too many assumptions right now with human thought (but im not gonna argue about this)
it is lmao, me and plenty of other people have come to the conclusion they're simply bad at things outside of coding
Glad that lmarena's ama is not the same time as wwdc, as most of the audience would have to decide
if its true that theyre using cpts (i think its likely to be the case), i think their next actual new pretraining run will be interesting to see if theyre kinda stuck
Why do you think claude 4 is bad? Its good at coding. It seems to be best on simplebench and even livebench reasoning. What makes you say it is bad?
Which model is better?
6
11
1
Claude 4 Opus
I don't think Anthropic is bad at things outside of coding. I think they don't have the resources to compete on everything, so they chose to go all in on coding so they can at least win one important domain and have a chance of survival - or last long enough to keep the funding coming and then branch out from there
I just don't know if they're actually gaining ground in coding or losing ground
If they're losing ground in their specialty, then they're toast
how ?
anyone have a 4 opus jailbreak
also opus's context window is small
I don’t think any company has held a big lead over incumbents across benchmarks for long enough by a wide enough margin to make that conclusion
Then they said it was 400m at I/O
but @patent aspen my point is that you never really know who is leading
the gemini product needs to improve a lot :\
the long system prompt/degraded performance on models there, there's no branching (fundamental feature), etc. :\
it has horrible instruction following, doesn't have creative foresight and doesn't know how to make iterative adjustments, uses obfuscated language for things it doesn't need to (ie standard philosophical concepts), shoehorns conclusions because it doesn't properly infer from its initial interpretation, isn't that creative and doesn't allow itself to be leveraged into a creative mode easily
I know for models it can sometimes seem arbitrary, people have different reactions or mixed feelings and it may not be immediately obvious
but this isn't the case for the Claude 4 series, they're ESPECIALLY bad at these tasks
and it isn't really up for debate tbh, just use the model and let me know how it feels
you'll see how it demonstrates things
although I'm glad now to have a model that comprehends codebases so beautifully and there's more chances for me to fix other models mistakes
standard philosophical concepts
let me clarify, whatever the philosophical concept is
it doesn't matter lmao
the fact it doesn't recognize ANY rigor to invoke in its response
is a major problem
for discussion lmao
clearly not what I'm saying then
cool I agree that's what I said tho
exactly as you've interpreted it as
adjustments that are iterative
yeah no shi all models necessarily can make iterative adjustments
clearly not my point then
I'm saying both ye
can you give me some prompts to test claude 4 against others? For the ones I have tested, opus 4 is doing pretty good....
ask it to write an essay and give it the criteria while asking for styles
put simply
yep
it's a good model
that doesn't contrast it being "bad"
just means it's comprehensive enough to be ranked over other models
I consistently refrain from calling you sped
but I won't
I'll reiterate
it's a good model, that doesn't contrast it being "bad", just means it's comprehensive enough to be ranked over other models, just not the models that actually qualify for being "good model" and "good"
as per the PRIOR sentence of this
that's fine, you don't need to since I'm actively conveying it
Craig just ask for clarity and I'll give clarity stop beating around the bush
that's great, that's why I presented a distinction tho
it's alr, I'm just saying there are different things being qualitated
if that makes sense
it's a good model, by no means does it not accomplish a vast majority of its tasks
but it's not presenting it in a way that demonstrates not just reiteration
but actual knowledge
in coding it's complicated, 2.5 pro might be able to keep reasoning to solve certain tasks and eventually get there, but opus just gets it tbh. But in regular tasks it's worse
but there's a good reason
ye that's its trademark basically
or that's what made me like it so much
pretty good tbh, its really warm and presents itself really well
but it doesn't have the absolute know the concept behavior
ye what I mean tho is 2.5 pro would basically iterate how redundant the presentation would be, how to clarify jargon and then the simplified variant in parenthesis
all that stuff
it would recall the inference in the discussion vs the ACTUAL studied concept
and their relationship
and then compare
and then move forward
it's lesser now in 0506 but you'll see it by asking it to explain any graduate lvl mathematical concept
nah not really I've already tried, it knows what it's looking at but it doesn't know how to relate it
I found goldmane's answer in conversation is much better than 0506
new checkpoint of 2.5 pro?
building at night >>
oh craig u still awake 😮
so sleep
do u actually take addy
oh right
why lmao
ur dad in this discord ? lol
is it new?
never heard of it
yes
both lmarena and webdev arena
Deep think?
Possible or maybe 2.5 Pro GA
It doesn't seem to think that long tho so maybe GA?
hmm are they gonna be returning the raw thoughts again on aistudio? https://discuss.ai.google.dev/t/massive-regression-detailed-gemini-thinking-process-vanished-from-ai-studio/83916/59
Thank you to everyone who has shared their thoughts and concerns in this thread. We hear you. While we’re excited to now return thought summaries directly through the Gemini API for the first time, we understand this is a different experience from the raw thoughts previously available in AI Studio. It’s clear that in their current state, the...
😭 the raw thoughts ngl were better than reading the response
New model in Beta Text2Image: anonymous-bot-0514
I don’t think so, he is likely just asking for feedback to improve the summary model
*with the last message
my overeager interpretation esp with the kinda obfuscating language is that they're reconsidering it (and haven't made a decision yet)
and that too
here's hoping
But I might not be surprised if they do it for paying / corporations only or something
For debugging their stuff
aistudio is supposed to be a dev tool to help you integrate gemini into your product (refining your prompts, then using the api). it's much harder to do with the thought summaries, so i think if they do change it, it won't be limited like that
System prompt
Well but it is an open secret that not even 5 % of users actually use it for that right now
And google surely is aware of that
ngl I sometimes use the model just to read the thoughts
not the summaries mb
doesn't matter. it actively hinders the intended purpose and the offering there clearly works in helping get gemini's api out there, even if most people don't use it the intended way. compared to chatgpt, gemini needs all the mindshare it can get
The whole summaries thing is likely also to prevent scraping (over api or free ai studio), because some open models have shown how effective copying the thought process can be (even for the very bad 2 flash thinking): https://huggingface.co/datasets/simplescaling/s1K
ye everyone knows that tho
s1k moved to r1 traces because it was better than gemini traces, see s1.1k
the point is, Google wouldn't be preventing the issue regardless + it causes more harm to the user base than benefit to Google preventing competition
imho atp, traces dont even matter that much anymore
Bc I did not remember anyone using pro traces
qwen and deepseek can self sustain themselves at this point
They kind of do imo
Obv it is better to have actual rl based stuff
and i dare say qwen traces are even better, but the underlying model isn't as strong
But I mean that is just way more expensive and not viable for everything
i might be remembering wrong but anyway s1k used gpqa questions in their dataset etc it was kinda suspect 🤨
Well I still think that google might not have the best reasoning game, but what they do have (without question imo) is the best reasoning for human preferences, something that can easily be copied using traces
Imma check
you can already do that anyway with the responses, the traces dont help that much in that regard
imho (elaborating on my previous point and kinda tangential) chinese companies no longer need to distill western models, especially for cot. the responses probably arent even worth getting trained on anymore, they can take inspiration and develop their own bootstrap to generate similar responses to it
All sources according to dataset (using my bad sql skills)
AI-MO/NuminaMath-CoT/aops_forum
qq8933/AIME_1983_2024
KbsdJames/Omni-MATH
TIGER-Lab/TheoremQA/float
daman1209arora/jeebench/phy
qfq/openaimath/Algebra
Hothan/OlympiadBench/Open-ended/Physics
Idavidrein/gpqa
Hothan/OlympiadBench/Theorem proof/Math
daman1209arora/jeebench/chem
qfq/openaimath/Precalculus
TIGER-Lab/TheoremQA/bool
qfq/openaimath/Intermediate Algebra
qfq/openaimath/Geometry
0xharib/xword1
TIGER-Lab/TheoremQA/list of integer
TIGER-Lab/TheoremQA/integer
baber/agieval/aqua_rat
GAIR/OlympicArena/Math
GAIR/OlympicArena/Astronomy
AI-MO/NuminaMath-CoT/olympiads
baber/agieval/math_agieval
qfq/openaimath/Number Theory
qfq/openaimath/Prealgebra
qfq/quant
daman1209arora/jeebench/math
AI-MO/NuminaMath-CoT/cn_k12
OpenDFM/SciEval/chemistry/filling/reagent selection
qfq/openaimath/Counting & Probability
qfq/stats_qual
GAIR/OlympicArena/Chemistry
Hothan/OlympiadBench/Theorem proof/Physics
TIGER-Lab/TheoremQA/list of float
GAIR/OlympicArena/Physics
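A list like the one above can be produced with a single grouped query. The sketch below is not the query that was actually run, which isn't shown in the chat. It uses an in-memory SQLite table, and the table name `s1k`, the column name `source_type` and the four sample rows are all assumptions for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for the dataset's source column;
# the real dataset is not loaded here.
rows = [
    ("Idavidrein/gpqa",),
    ("qfq/openaimath/Algebra",),
    ("Idavidrein/gpqa",),
    ("KbsdJames/Omni-MATH",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s1k (source_type TEXT)")
conn.executemany("INSERT INTO s1k (source_type) VALUES (?)", rows)

# One row per distinct source, with how many questions came from it,
# most frequent first (ties broken alphabetically).
counts = conn.execute(
    "SELECT source_type, COUNT(*) AS n FROM s1k "
    "GROUP BY source_type ORDER BY n DESC, source_type"
).fetchall()
conn.close()

for source, n in counts:
    print(f"{n:4d}  {source}")
```

Grouping with a count rather than a plain `SELECT DISTINCT` also shows how many questions each source contributes, which is the number that matters when checking for benchmark overlap.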
Well it is not nearly the same cost level to train on the traces vs rl on them
And it is not just doing simple grpo rl
These companies do more for some of their reasoning (especially google, which becomes very evident when looking at the formatting and layout of their thinking)
They do kind of for human preferences or at least have all been alleged to have done it
its not sustainable for a frontier lab, especially chinese frontier labs. for responses you can use rejection sampling, etc., it's not that hard of a problem
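For anyone unfamiliar with the term, rejection sampling for responses roughly means: sample several candidate responses, score each one, and keep only the ones that clear a bar. A minimal sketch under stated assumptions follows. `toy_generate` and `toy_score` are made-up stand-ins for a model call and a reward model or verifier, and nothing here reflects how any particular lab does it.

```python
import random
from typing import Callable, List


def rejection_sample(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n_candidates: int,
    threshold: float,
) -> List[str]:
    """Draw n_candidates responses and keep those scoring >= threshold."""
    candidates = [generate() for _ in range(n_candidates)]
    return [c for c in candidates if score(c) >= threshold]


# Toy stand-ins (assumptions): a "model" that appends a random number of
# exclamation marks, and a "reward" that penalises each one.
rng = random.Random(0)


def toy_generate() -> str:
    return "answer" + "!" * rng.randint(0, 5)


def toy_score(response: str) -> float:
    return 1.0 - response.count("!") / 5


kept = rejection_sample(toy_generate, toy_score, n_candidates=20, threshold=0.8)
print(f"kept {len(kept)} of 20 candidates")
```

The kept responses can then be used as training data, which is why it only needs final responses and a scorer, not the other model's reasoning traces.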
I mean we don't really have any other models where the reasoning itself helps that much with aligning to human preference (especially Chinese). And obviously they can either train directly on the traces or do rl or just copy the techniques they observe manually. It does not really matter in the grand scheme of things, but in the end they use the traces to the disadvantage of google.
google used bond, warp, warm, etc., at least on/and since 1.5 pro exp i think
BTW I checked in detail now and the s1k uses 88 out of the 450 or so gpqa problems, smh 🤦♂️
does gemma 3 use synthid? while thinking i realized u can use those responses huh
food for thought later 🤔
Never heard of them using it
But might have spilled over from distilling from the Gemini model
(Depending on what kind of Gemini development stage they used)
no they focused on human preference (which might be useful) among other things specifically for gemma iirc/afaik
yeah but they probably do a lot of things that they dont say i guess
? Well it is still highly likely that they used the Gemini models either for human preferences or somewhere else in the process
they focused on human preference on gemma specifically more than regular gemini models at that point
something along those lines
Well they could have distilled and then used a reward model (potentially also Gemini) to enhance from there
What's the story here?
doesn't change what im saying, you're missing the point
Well then reiterate
they potentially focused on human preference for gemma more than gemini (in relation to model capability), the methodology used to achieve that doesn't matter for the point im saying
example from eqbench's creative writing leaderboard
Ok, I don’t get why you switched your point mid discussion, because I thought you were talking about it not being possibly spilled over
but ive been saying this the whole time
the reason its more useful because in relation to model capability, gemma is more human preferable compared to other gemini models
you can use this to generate more human preferable responses with the thoughts of another model (model capability)
i don't understand why you were repeating unrelated claims or claims just said outright in the gemma paper 😭
this scenario demonstrates my point of the relationship, gemma is better at human preference than other gemini models when you don't really include model capability. you can compensate with the model capability aspect
or use it to train a reward model or whatever
your point makes zero sense at all too. they literally did logit distillation for the instruction tuning phase how would it not spill over [i was talking about human preference tuning here] 😭 how did u even interpret it that way
Because I said that it depends on the stage at which they distilled at the beginning of the discussion
they said they did logit distillation in pretraining and instruction tuning bruh
Because as far as I know synth id gets introduced later in the training
i assumed u read the paper
I mean the stage of the Gemini models
No, don’t have time rn
i get it now
yeah it depends
That was my point from the beginning
we were both talking about separate things 😭
But I agree with you about Gemma focusing on human preferences :)
It definitely is not as good in other benchmarks as it seems in things like arena
yeah because of that its useful, but it depends if they use synthid
very interesting talk
New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic.
We talk through what’s changed in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and ...
yeah i completely misinterpreted ur previous comment and it got me into a separate unrelated line of thought i really apologize lol. this one: #general message
For those who denied the Gemini nerf
what benchmark site is this?
fiction live bench
I wonder how expensive the original 2.5 Pro was
If they nerfed it so hard, they must have eaten a lot of costs
I would have actually bought subscription for it
wait is deepthink rolled out or not? why am I not seeing any benchmarks?
no early testers only
they said at io i believe
for now
@alpine coral have you included the new claude into your personal bench that I always like?
im reading the thread again and i really apologize bruh 😭 (ill stop yapping now xd)
New model in Beta Arena: grok-3-mini-beta
dario does not like oai aka windsurf
codex is a dream come true
But why?
How can o3 have such bad thoughts, where it seems it will completely miss the point, but then returns a good answer 😄
blame the summary model
Yeah, most likely some 0.5B nano model 😦
ive seen it say the opposite result in the summary, reaffirming to itself that yeah, incorrect answer, incorrect answer. then returns the correct answer
I used to learn stuff from thoughts
What's the use of them then. It functions like loading animation now.
.
Did claude 4 get added to the arena?
yea
Very good models
Its 2.5 pro update, better than NightWhisper
Did it work now?
I like LMArena's censorship system, lol.
I started to use some words that are not related to anything vulgar, actually meaning mundane things.
stop lying
nothing is better than nightwhisper
imma try them
And what do you think? They worked before normally, but have been banned after some time.
Come to the server, you'll see plenty of examples
A guy told me it's better than NightWhisper
send
@torn mantle
alr
The thing is I didn't write any content, just bare words, which were banned afterwards, regardless of me doing anything. Now it's just clearly evident that all user input is monitored directly by admins, lol.
Because I suppose no AI would pick up a hidden "vulgar" sense behind similar mundane words used as subtext.
That's something humans would recognize.
or you could send results here like everyone else
post results xd
im saving them
just in case those models got removed
so far ive got goldmane
that thing is def on par with nightwhisper or even better
it would be recorded here too as well if u dont mind lol
quite shocked actually
There are a ton of results, I don't have time to share them all or choose them.
i'm surprised not seeing others reposting
goldmane > redsword
Dropped a larger Update to my Deep Research List, only Gemini 2.5 Pro DR missing now: https://docs.google.com/document/d/1qSfyAyxzUziFQf55CD60-UgQ4Af9ubVmr69OrmAdevE/edit?usp=sharing
Deep-Research Tests Prompt: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and ...
which sucks ass
I wonder if they'll ban this too if I enter it...
Okay, I did that, and it's not banned.
But I think it will be very soon, lol.
😭
The service collects dialogue data from user interactions. By using the service, you grant LMArena the right to collect, store, and potentially distribute this data under a Creative Commons Attribution (CC-BY) license.
💔
Any experience with stima api and t3 chat?
ya somewhat posted that earlier comparing it to nightwhisper
these models are going to be 600$ per month
Redsword is insane
They will release them and then nerf them as always
xai will just be left behind again
their strategy is so bad tbh
look at google testing models and releasing them on multiple occasions
and here we are stuck with grok 3.5 that wont come out until it meets their expectations
and when will that happen?
google is about to release new models + deep think mode, deepseek is about to release r2
mistral probably gonna release their reasoning model as well
its not looking good for xai tbh
i dont think mistral is gonna be competitive
also what happened to 'Big brain'?
like it was so obvious that thing wont be released given how inefficient their reasoning process is
ik, but i mean the market share will be much harder
🙈 they used qwq preview traces in cold start at least
yea...
qwen has iterated on the reasoning trace style several times by now...
this whole strategy of waiting till we get things right is fundamentally wrong
talking about grok 3.5 release
i would rather see them release multiple versions than waiting for grok 3.5
i would be okay with different checkpoints
like grok 3.1 -> 3.2 -> 3.3
meh just let them release it when they release it tbh. there's a lot of good models. theyre getting overworked etc i kinda feel bad lol
doing that is gonna be a disaster tbh
if u force a release
or they just dont release it at all lol
their claim about working 18h/day doesnt make sense to me tbh
what are they doing the whole day
its probably true since one of them tweeted about it and deleted it
it may be true
but whats the actual work/value from that
18h -> 1h value?
2h value?
if you overwork employees consistently yea you will get consistently less value
i dont think theyre being lazy over there but idk
im not questioning 'how much time they spend in the office?' but they should stop tweeting such things
its like they are saying -> look at us we are working so hard here, lets impress elon
one of the xai employees merged a "update prompt to please elon" troll merge request on the xai prompts repository, then reverted it later lol. then they reset the repo
so idk
LMAO
haha
well thats their goal
to please him
he didnt lie tho
they really tried to wipe it huh, it's gone now
that page was still there when they wiped the repo
screenshots not mine
(blurred the pfp since its not my pic)
why do you think that? yea the one use case i used was lackluster (data analysis)
They really pay too much attention to safety. What's the use of super safe claude if they can get the same info from all other llms
tbh claude is fine nowadays
it was crazy back then with claude 2.1
i think they released the false positive rate it was absolutely absurd
so people don't race to the bottom and build unsafe systems. having some ethics is good imo. but i agree that it does not prevent others from racing it seems
Nevermind that was skill issue on my side claude 4 sonnet is king
true actually
what is the right attention to safety in your opinion? i think anthropic has set a good tone to show how lackadaisically these other labs are releasing their models.
Look at this @sacred plaza
i say all that supporting anthropic but don't use it cause of the rate limits set on claude pro lol
They have to make a safety group. Similar to the agent-to-agent protocol. And smear every LLM lab that does not comply.
It will happen sooner or later
I am against regulation too, but the current approach is ridiculous
that is a good thing imo. slowing releasing down. why are we speeding up releases when you still don't understand how to control hallucinations and misalignment?! https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/
Good companies get penalties while bad do not
the problem is that there are companies who wont comply at all
outside of jurisdiction if its by law
Two reasons why it might make sense beyond altruistic beliefs:
A: it is way easier to recruit sought-after AI experts if they can align themselves with the ethics of the company and feel like they are working toward something 'good'
B: Claude markets itself as for coders on the one hand, but also for companies, and these companies obviously want a 'safe' AI model, as otherwise they will have more lawsuits than they can handle around their necks.
(Sorry for length, I like to yap like llama 4 exp)
yea good point there. customers seem to be okay using models without the high safety standards of anthropic.
Really good comment right here. I haven't thought about this
Not sure if effective though
There can be many "motivational" tools. E.g., companies like lmarena could ban labs from benches.
In the end, I don't think it is really so much easier to build a chemical weapon using an LLM. There are a lot of books on this.
elon signed a petition to stop training/etc (or whatever i dont remember), it was just an attempt to delay the others as he was starting his own ai company. if you voluntarily join a thing to delay releases (and actually comply), unless you have unlimited money, you will lose ground and competitiveness i think (that's the reality rn, others wont stop)
real
The post 2019 elon arc is just crazy 😄
i will trust dario's take on this since he comes from a biosciences background.
https://techcrunch.com/2025/02/07/anthropic-ceo-says-deepseek-was-the-worst-on-a-critical-bioweapons-data-safety-test/
Anthropic’s CEO Dario Amodei is worried about competitor DeepSeek, the Chinese AI company that took Silicon Valley by storm with its R1 model. And his concerns could be more serious than the typical ones raised about DeepSeek sending user data back to China.
In an interview on Jordan Schneider’s ChinaTalk podcast, Amodei said DeepSeek generated rare information about bioweapons in a safety test run by Anthropic.
DeepSeek’s performance was “the worst of basically any model we’d ever tested,” Amodei claimed. “It had absolutely no blocks whatsoever against generating this information.”
Amodei stated that this was part of evaluations Anthropic routinely runs on various AI models to assess their potential national security risks. His team looks at whether models can generate bioweapons-related information that isn’t easily found on Google or in textbooks. Anthropic positions itself as the AI foundational model provider that takes safety seriously.
True. My bioscience takes must be as bad as non-SWE people takes on "AI will replace programmers" 😄
It's disgusting how Anthropic scammed everyone with their parallel thinking score. Every company finds ways to avoid being ridiculed on the benchmarks.
losers report scores outside of pass@1 /s but fr, just report pass@1 unless your system does that for every request
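(For anyone unsure what pass@1 vs. pass@k means here: a minimal sketch of the commonly used unbiased pass@k estimator, where n samples are drawn per problem and c of them are correct. The chat doesn't say which estimator any lab actually used, so this is only an illustration; the function name and the sample numbers are made up.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n total samples of which c are correct, is correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # which correctly gives a pass rate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 samples per problem, 3 of them correct.
single_try = pass_at_k(10, 3, 1)   # chance one attempt is right
five_tries = pass_at_k(10, 3, 5)   # chance at least one of five is right
print(single_try, five_tries)
```

The point of the complaint above: the same model looks much better at k=5 than at k=1, so the two numbers aren't comparable across labs.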
Well that was 50:50 concern and marketing about why companies should use expensive Claude instead of unsafe and scary Chinese deepseek 🐳🐋
(the take from Dario)
lol
xddd
If he releases Claude 4 opus pro with parallel thinking at $300 input and $1500 output, we'll talk.
can you explain what the parallel thinking score means? never heard that term before. is it similar to what google did with their gemini 2.5 pro by adding this new deep think internal feature?
they have a huge footnote section about it iirc
Its similar to 2.5 pro deepthink
And o1 pro
O1 pro costs $150 input, $600 output
And gemini we dont know
you can't even use this internal scoring model lol
do y'all think that was done purely to boost benchmark scores? at the gemini i/o event this week, demis was trying to sell it as the next step in llm reasoning
@sacred plaza it runs the model maybe 10 times at the same time, a model retrieves the best answer and gives it to you
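(A rough sketch of the best-of-N idea described above: sample several candidates, let a scorer pick one. Nobody outside the labs knows the real setup, so `toy_generate` and `toy_score` are hypothetical stand-ins for the model call and the scoring model, and real systems would run the samples in parallel rather than in a loop.)

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float],
              n: int = 10) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    candidates: List[str] = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

def toy_generate(prompt: str, sample_index: int) -> str:
    # Hypothetical stand-in for a model call: each sample differs only in length.
    return prompt + " " + "x" * sample_index

def toy_score(prompt: str, candidate: str) -> float:
    # Hypothetical stand-in for a scoring model: prefers exactly 5 extra characters.
    extra = len(candidate) - len(prompt) - 1
    return -abs(extra - 5)

best = best_of_n("q", toy_generate, toy_score, n=10)
print(best)
```

Cost scales with n since every candidate is a full generation, which is why the pricing question comes up.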
Given their record, probably they think it's more "correct way".
yea this sounds lame as hell...meta did this at the model level with llama 4 on lmarena right? they tested a bunch of their models and only made the best performing one public?
Anthropic's parallel thinking is perhaps with 64
If it's another glasses 🤦‍♂️
If it's 64, the price is $4800 per million output tokens
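(The arithmetic behind the 4800 figure, as a sketch. It assumes a base output price of $75 per million tokens and that all 64 parallel samples are billed in full; both are assumptions, and the 64 itself is only a guess from the chat.)

```python
def parallel_output_price(base_price_per_million: float, n_samples: int) -> float:
    """Effective output price per million tokens if every parallel sample is billed."""
    return base_price_per_million * n_samples

BASE_OUTPUT_PRICE = 75.0  # assumption: USD per million output tokens for one sample
N_SAMPLES = 64            # assumption: the guessed number of parallel samples

price = parallel_output_price(BASE_OUTPUT_PRICE, N_SAMPLES)
print(price)  # 4800.0
```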
Doesn't seem to wear glasses before
I hope not
It's indeed a new scalable dimension besides model size and cot
The problem is that you need too much compute with video. Can't fit it into glasses. Innovation might come from audio devices where you don't have such a high data throughput. If it's just microphone, then it's doable. But if it's just microphone, why do you need glasses 😄
if its streaming and computation is handled on the cloud, i think its possible but I don't know much about embedded and hardware
It's possible that sonnet 4 will be lower than 3.7 on the arena 😶
Try asking it in different order. From my experience, AIs heavily favour second options to firsts. This has been my experience since OG chatgpt 3.5, quite consistently regardless of companies
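(One way to check for the order bias mentioned above: ask twice with the options swapped and only trust the answer if it survives the swap. A minimal sketch; `judge` is a hypothetical function that returns 0 if it prefers the first option shown and 1 if it prefers the second, and the two toy judges below are made up.)

```python
from typing import Callable, Optional

def order_swapped_preference(a: str, b: str,
                             judge: Callable[[str, str], int]) -> Optional[str]:
    """Return the option preferred in both orderings, or None if the pick flips."""
    pick_ab = a if judge(a, b) == 0 else b   # a shown first
    pick_ba = b if judge(b, a) == 0 else a   # b shown first
    return pick_ab if pick_ab == pick_ba else None

def always_second(first: str, second: str) -> int:
    # Hypothetical judge with pure position bias: always picks the second option.
    return 1

def prefers_longer(first: str, second: str) -> int:
    # Hypothetical judge with a real (order-independent) preference.
    return 0 if len(first) >= len(second) else 1

print(order_swapped_preference("short", "much longer", always_second))
print(order_swapped_preference("short", "much longer", prefers_longer))
```

A None result means the pick changed with the ordering, so that comparison says more about position than about the options.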
I've trained some image classifiers, and they always have 1 to tens of millions of params for few classes. Image embedding models weigh megabytes and can easily be run on MCUs. However, latency is hundreds of milliseconds per image. The Image (video frame) embedding -> transfer to cloud could be potentially realized, but this would mean at most 1 frame per second. Maybe that's enough for them 👀 Otherwise big battery is needed.
already did
ik
Problem here
75.4% officially
70% on artificial analysis
even 3.7 thinking at 77% on artificial analysis 🤦
Might be sensitive to the specific prompt used
I don't think it's suspicious
But it is the thinking model, it should be less susceptible
in 95% of cases the score is the same as the official score (sometimes even higher)
isn't Claude 4 too expensive? I am thinking that I should stick with Gemini 2.5 pro
Ran 2.5 pro for you
https://g.co/gemini/share/0de5d9824cdf
Thanks!
Lets see if it is better than last time.
@keen beacon the problem is on artificial analysis side i think 🤦
he just lost another 3%
Looks better than the last one, still prefer the ChatGPT o3 one.
Looks prompt related honestly, because 3.7 sonnet gains 11 percentage points when enabling thinking…
Worth switching from Gemini Pro to Opus 4?
That would align with them claiming a higher score
I always run on both. It's probably 50 50 which one I use but o3 gets better over time with memory. Gemini DR getting file uploads this week has been very insane though. Throwing most important info into huge context window then letting DR get to work has been great for better outputs but havent gotten enough testing done yet
for?
polymarket bet
LMAO
oh
i wonder how much claude 4 scores on simpleqa
Considering everyone being drooling aliens, I'd bet on gemini too probably
For me 2.5 pro as orchestrator (with large context) + sonnet 4 for coding was really good
are you using a tool like aider?
claude 4 opus is 3rd on longform creative writing
source?
significantly less slop
Roo Code thing in vsc
oh eqbench benchmark
the fact its significantly less slop makes it much better imo
might win in the shortform creative writing
If you remove xml tags it’s number 1
By a huge margin
If deep seek is higher on that bench then I don’t trust it, deep seek sucks at creative writing, its outputs are nonsense
it's ai judged
Oh lol
That’s why ai ranks deep seek so high the highly randomized outputs are interpreted as “creative”
definitely not reasoning puzzles lmfao
And Gemini ! 🤣🤣🤣😂😂!!!!!!!!
ok opus is worse at translation i'd say
I'm scared because this was the first sign of lobotomization for chatgpt
Claude 4 opus is agi ASI
gpt-4-0314-32k created the big bang of the universe
I've read one response from it and I've concluded from that single sample that it is superintelligent
which
It came up with stuff about hot dogs and such randomly on an unrelated topic 🤣
Funny and absurd reply
🤣 LOL LMAO ROFL LOLLOLOL 🤣 🤣 🤣 🤣 😂 😂 😂 😂 😂 😂 😆
😭 my dog ate the reply and Claude took over my computer. Sorry I had to lie
You ARE the dog
Waiting for
Grok 3.5
o3 pro
Deepseek R2
Gemini deepthink
claude 4.5 sonnet
I'm not sure what you mean by "You are the dog." Could you provide more context about what you're referring to? Are you perhaps thinking of a game, story, or specific scenario where I should play the role of a dog? I'd be happy to help once I better understand what you have in mind.
Lol
will all be released in the next 30 days?
with open ai and gemini doing parallel thinking, Elon will also release grok 3.5 pro (with parallel thinking)
Are they preparing this?
I don't know yet. Will you harm me if I harm you first?
do people actually use grok-3? not sure i understand why people still talk about grok-3.5 release given how elon brainwashes its outputs via the system prompt
yea i can see it being useful if you are on twitter and need access to real time info. other models don't have access to that
microsoft did this with inflection pi folk a few years ago. instead of a buy out just poach all of the people from the company. that is how mustafa returned to microsoft i believe
where is claude 4 ?
beta site
google can do the funniest things with these new models
anthropic are kinda confident in their models at coding, but from what im seeing these two new models are the next thing tbh ( goldmane & redsword )
theyre probably the upcoming ga versions
they talked about that?
yea
about ga versions?
interesting
are they both 2.5 pro based or you think one could be flash?
new Google models already?
dawg
are they good
ye but what are their performance
people are raving about them as far as i can tell
alr cool all I need to hear
yea so good
fr?
Someone’s know about blockchain?
i hope they bring back raw thoughts on aistudio 🥲
same
dont you think it thinks for less time than before?
is it really just an update related to summarizing thoughts or there is more to it?
i heard people talking about that but i havent measured it myself
did i miss anything?
yea new google models
better than nightwhisper apparently
models better than nw on lmarena
ga versions coming next month
both
Will Claude 4 Opus appear on the regular leaderboard? or just the beta one?
wdym
its on both
beta & old website
what are the model names?
goldmane redmane redsword
goldmane & redsword
its really hard to use the model in contrast to before tbh
they are both strong at coding
are they codenamed still?
thank you bro
@keen beacon you know whats funny is that ive always chosen sonnet 3.7 over sonnet 4 in webdev
4 opus disappointed me
like literally on all of my tests ive chosen it over the latest version
Google coming to save the day
are they good ?
anthropic models? no
yes
the claude 4 models werent ever anon models on the arena
please dont troll me... every hype is blueballing me since march 2.5 pro launch (except may be veo 3)
its really good this time
@torn mantle how long do the new models think
Am I stupid or what, I cannot find 4 Opus on the leaderboard? Does it not have enough votes yet or something
btw I wonder if they're ever going to release a Gemma thinking model
it was just added
so u have to wait i guess
I see, is the consensus here that it will place below Gemini?
probably
same time as the current gemini model
especially the new revisions
ye
opus 4 thinking simply isn't Gemini lvl imo
I don't think Opus is stronger in enough categories to place above Gemini
2.5 pro even if people think it's nerfed in some ways
it's still a league ahead of everything else
just need the raw thoughts 😦
deadass
Thanks for the input everyone, and this goldmane & redsword, are these updated Gemini models to be released to the live leaderboard soon?
theyll put it on the leaderboard when they launch it probably
whenever DeepMind announces them publicly
which is next month, apparently
they'll be revealed
Even o3?
the comparison isn't just this or that lmao
i would take nerfed 2.5 pro with raw thoughts than o3 tbh
raw thoughts are so important. the o3 summary model is 💀
I would prefer o3 in certain tasks but for the vast majority of things it's 2.5 pro
glad you sent that blog tho
seems like they're open to improving it
so I'm optimistic asf
yeah it seems with the language used there they're reconsidering it but havent made a decision
ye it means a ton for what they're looking for next
With veo 3 and opus 4 we are in the next stage of AI
I’d say two more stages until AGI
veo 3 I can agree but opus 4 seems like the lower end of the current level of models
veo 3 is insane
nah I haven't messed with any of that yet
im gonna soon tho
I'm trying to get a hold of opus 4
just need 2.5 pro image gen 🔥
With gpt 2 & 3 being stage one, gpt 4 sonnet 3.5 being stage 2 and Gemini 2.5 pro/opus 4 being stage 3
Opus 4 is better than 2.5 pro on livebench
ON GOD
Opus feels way smarter than 2.5 pro to me though in general use
Nah
@elder rapids see
Maybe the coding of 2.5 pro is a little better
So… Opus is good but fundamentally just way too expensive, right?
HELL nah, but we don't need to discuss this, just go to anthropic subreddits
and you'll see how you're the minority
ok, but the cost of opus 4 thinking 💀
I use a system prompt to unlock it
honestly way too expensive for my taste
Opus is much more guidable with system prompt than 2.5 pro is
the literal opposite
lmao
and I mean CRAZY opposite
i heard that people used 2 messages on claude pro with claude 4 opus and it locked them out Lmao
yes
Does the model decide when it reasons?
opus can't follow instructions for sht, you'd be damned to even try to guide it
I've done a TON of testing for opus
I used to be a Claude glazer lmao
That must have cost you something
ye
I’m Claude glazer for life tho
It’s just the cost right
No but it reasons way too short. They did just enough for it to beat Sonnet but I think that model is far from maximized
it needs to be faster too though
no it doesn't listen to what I want
On the Claude subs most complaints are about the cost
cause that's all they can complain about while high on the hype train
Sonnet is just the better model practically unless you're stacked
Bricked up with cash
majority of people here are stacked that seems to be the demographic when it comes to AI
Did you read their recent paper on “the biology of an LLM”? Their research (on a variant of Haiku) shows that the reasoning parts are in some cases totally unrelated to the true way the model is arriving at its answer; it has already decided what its answer is going to be and the reasoning trace is just yapping to justify it without actually adding any benefit. It’s an interesting read.
but you can counteract it with more test-time compute in many/most cases with Sonnet. With Opus, more test-time compute isn't really possible currently
this used to be a problem with 2.0 flash thinking
people are overblowing it
but it's not crazy tbh
its true to an extent but its complex
I’m the opposite there are so many ways to use AI for free, the only models I’ve been priced out of are O series and Opus
ye sonnet seems to benefit more from thinking
opus thought process is kind of nothing burger
anyone have any guesses for Grok 3.5 release? Was Grok 3 released under a codename on lmarena before public release last time?
yes
Speaking of fast, I ran a prompt on Qwen3 32B today and it spat out tokens at a speed I’ve never seen before - it exceeded 1600 tps!!!! It wasn’t the best response to be fair, but hooooooly cow that’s fast.
where?/
Elon said 2 more weeks
2 weeks too long shi
recently? or wasnt that like 2 weeks ago
1 week ago
Prob next week
Or next next week
To the people here who have used grok 3.5 how does it compare to Gemini pro?
No one hyped about Gemini deep think?
btw how did Grok manage to become relevant in the AI race when it was founded in 2023? Did they have to use a hyperscaler cloud provider to get compute that fast?
Money. A lot of it.
huh? Is it already released as a codename on lmarena?
Sure but money can't just spawn a data center out of thin air
Via a cloud provider?
nvidia ceo talked about it/kinda glazing xai about how they set things up recently iirc
Ah 122 days. That's really fast
what was its codename? you certain it was 3.5
hes trolling
I thought lol
Power of money and talent
theyre too busy merging troll prs into their prompts repo and resetting it multiple times to work on grok 3.5
No one buys more of his GPU’s than Xai 🤣
they have to buy them anyway
otherwise they arent competitive
but i mean glazing xai and elon can't hurt for sales right
Isn’t Grok 3.5 still vaporware as of now?
["openai"] = "OpenAI",
["anthropic"] = "Anthropic",
["google"] = "Google",
["groq"] = "Groq",
["cohere"] = "Cohere",
["mistral"] = "Mistral",
["amazon"] = "Amazon",
["arcee"] = "Arcee",
["ai21"] = "AI21 Labs",
["liquid"] = "Liquid",
["lambdalabs"] = "Lambda Labs",
["chutes"] = "Chutes",
["reka"] = "Reka",
["xai"] = "xAI", -- Updated mapping for verified xAI models
["deepseek"] = "DeepSeek", -- Updated from "DeepSpeek" typo
["01ai"] = "01.ai",
["moonshot"] = "Moonshot AI", -- New provider
["hyperbolic"] = "Hyperbolic",
["together"] = "Together.AI",
["fireworks"] = "Fireworks",
["nebius"] = "Nebius AI Studio",
["deepinfra"] = "DeepInfra",
["sambanova"] = "SambaNova Cloud",
["cerebras"] = "Cerebras",
["replicate"] = "Replicate",
["perplexity"] = "Perplexity",
["anyscale"] = "Anyscale",
["ibm_watsonx"] = "IBM Watsonx",
["ibm_watsonx_3rdparty"] = "IBM Watsonx 3rd Party",
["novita"] = "Novita AI",
["writer"] = "Writer",
["stima"] = "Stima",
["straico"] = "Straico",
because it's only in the ultra plan
so people felt like it wasn't worth it
But it will be sota reasoning