#general
itll be better than o3
wow thats breaking news
it has "pro" in the name
omg i have goosebumps now
theres two "o"s in the name
(i wonder how many people actually read the raw cot when asking it to code, it does a lot of cot within comments. the final output even with the comments is pretty stripped of it, but enough to understand where the tendency seems to come from)
o3 pro should be interesting.. i have not much hope from grok3.5 or deepseekr2 but very hopeful about o3 pro
I dont want my own comments that actually make sense to get removed though
Fr
no I actually just escaped maximum security prison for being able to see deleted messages
that's literally what he just said
oh nice independent scrolling
i don't think that is the most difficult to code but alright
i deleted it because treesitter might not be the best way to do it
there might be an easier way to do it (dependent on the editor) in a generic fashion with semantic understanding
it's a small team and trust me when I say they work hard! we look forward to expanding the team to deliver new features more quickly.
grok is holding back until 4.0 Dork. They will unleash ASI with it
redsword and goldmane are only available on webdev arena?
O5 is better 🫣
breaking news
nous research still beats it
Hermes 3 contains advanced long-term context retention and multi-turn conversation capability, complex roleplaying and internal monologue abilities, and enhanced agentic function-calling. Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fin...
Simulators by Nous Research.
what is this? never saw it before
looks like some frontend for another already existing A.I , ie a scam
open source ai research company
beautiful website ngl, but i am not convinced
Ok that is actually a valid reason. I thought u were silly and embarrassed, u still should be though because of your default pfp
so whats the consensus on claude 4 opus, shite or vibes
there was some of that too tbh (but it wasnt the primary reason in this instance) 🤣 i tend to irrationally delete comments
excited for the future
Is this any good?
very well regarded researchers
i don't personally use them much, but they have very popular finetunes of llama, mistral models
bro claude 4 opus kinda stupid ngl
(mainly focusing on tool use and conversational stuff)
But now they are also working on training their own models trained over a crypto-like (prob not the right word) compute network.
If you are interested in research, i highly recommend checking out some of their work.
they've used qwen?
its been a while but i didnt recall them using qwen
shite, no vibes
and it's not going to be webdev monster for that long anymore
no, sorry
mixed it up right there with mistral for some reason
it seems they avoid qwen
i guess llama and some mistral models are more aligned with their ideas?
otherwise one could use athene for high quality qwen finetune
i guess
but they are a public company (so they probably just used the best model available for their model)
and it might also be that qwen is just not good at conversational stuff or roleplaying
i dont think so. i think there've been a lot of popular community tunes (specifically for a specific type of rp) on qwen models but i dont really pay attention to it that much
maybe if u dont tune directly off the base model (noushermes tunes on the base model, so can't be that)
same vibes here, im trying to love it, but its hard
no qwen finetunes, but at least they have a merch shop 🤣 https://shop.nousresearch.com/collections/products
nice priorities
https://psyche.network/runs/consilience-40b-1/0 this is also kind of cool
might actually donate some compute
yeah it is
beyond the decentralized aspect, it might be interesting 20t is a solid amount of pretraining tokens
tru 20t is actually like close to what qwen used, right?
For qwen 2.5, yeah
"In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens."
so it is like actually close to SOTA and more than qwen 2.5 and llama 3 i think
Qwen 2.5 is 18 trillion
Some of the models 19 trillion
It's not close to qwen 3 at all tbh
Qwen 3 is 36 trillion
I still won't be expecting much, nous research haven't been pretraining their models like qwen. It might be like 20t of slop but I don't know lmao
There's more after that
ik, just the first quote i found
and i am not sure if the 20t is complete from nous research
or just s1
more than half the tokens that a multi billion dollar chinese mega corporation uses
is kind of a lot for a small research collective
if they reach it tho
jup kind of ambitious
its gonna take years at the current rate
about 1111 days total
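A quick sanity check on the numbers in this thread, as a minimal sketch. The token counts and the 1111-day figure are the ones quoted above; the tokens-per-day rate is only back-derived from them and is not a measured throughput.

```python
# Token counts as quoted in the chat, in trillions of tokens.
QUOTED_TRILLIONS = {
    "consilience_target": 20.0,
    "qwen2.5": 18.0,
    "qwen3": 36.0,
}
ESTIMATED_DAYS = 1111  # the "about 1111 days total" estimate from the chat


def share_of(target_trillions: float, reference_trillions: float) -> float:
    """How large the target token budget is relative to a reference budget."""
    return target_trillions / reference_trillions


def implied_tokens_per_day(total_trillions: float, days: float) -> float:
    """Training rate that would be needed to hit the total in the given days."""
    return total_trillions * 1e12 / days


vs_qwen3 = share_of(QUOTED_TRILLIONS["consilience_target"], QUOTED_TRILLIONS["qwen3"])
vs_qwen25 = share_of(QUOTED_TRILLIONS["consilience_target"], QUOTED_TRILLIONS["qwen2.5"])
rate = implied_tokens_per_day(QUOTED_TRILLIONS["consilience_target"], ESTIMATED_DAYS)

print(f"20T vs Qwen 3 (36T): {vs_qwen3:.2f}")      # a bit more than half
print(f"20T vs Qwen 2.5 (18T): {vs_qwen25:.2f}")   # slightly more than Qwen 2.5
print(f"implied rate: {rate:.2e} tokens/day")
```

So "more than half of Qwen 3" and "more than Qwen 2.5" both hold on the quoted figures, and the 1111-day estimate corresponds to roughly 18 billion tokens per day.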
ye I've been trying to get it to work on things and prompt it in its favor
but it doesn't go very far beyond acknowledging it and very slightly adjusting
are u using claude code
source,
I like Calmriver, but I wish I knew WTF it really is?
what goldmane?
I think so. Much less Markdown slop.
Anyone seen any rumors what happened to R2?
has anyone subscribed to github coding agent? how was your experience?
what did i miss?
the great leader took it away for personal use
Something is cooking with the GPT 4o
It just answered a long prompt in milliseconds
As soon as I pressed the "Enter" button
that sounds like the opposite of cooking
I have ptsd from this
Fr
Is it normal that a gemma 3 gguf model of the same size as a perfectly working llama model seems like it requires much more memory
Most other gguf models use roughly the same amount of ram when their sizes are similar... yet gemma 3 seems to work differently
because gemma 3 is agi
The answer is perfect
"what is 9 + 10?"
It's strange because even small models can't do it in milliseconds
Maybe it's that their servers were not overloaded
there, u said it. small models. artificial stupidity. goodbye
-# react with clown if u think gpt 4o is restarted
What’s good frens
real
Just admit gpt 4o is cancer bro
"You are too poor" sucking off gpt 4o and gemini 2.5 pro says enough
yeah
ok
well
you can ask in lmarena
if it's just a simple prompt
and..
idk how much long context is but it might be fine
have you tried it
he's tricking u man
???
LMAO
Claude 4 opus is struggling with one bug
I gave gemini 2.5 a try
And instead of failing to fix the bug it just destroyed the entire app
With an additional 100 lines of comments
And Claude 4 opus thinks for like 5 sentences and that's not even thinking, it's literally just repeating the task I gave it
I can see R is on the middle side of the gaussian IQ chart
average gemini 2.5 pro experience
yes its useless so far
It provided me with exact same results for js issue from non thinking and thinking and both failed
For real
For real
AGI is cancelled.
So does that imply that gpt 4.5 and claude 4 opus are on par
wdym
we don't care about the text it's just unsloth
I think it's extremely unlikely for any company to catch up to the level of 2.5pro now. OpenAI and Anthropic have tried their best, but o3 only surpassed 2.5pro in specific areas, and opus still feels like a previous generation model.
Okay, I get it, the text is not accurate because it is not on par with gpt 4.5 and claude 4 opus at the same time. Then is it on par with at least one of them? Like 4.5 or claude 4 opus or just overhype and it ends up worse than both? I guess we don't know, but it looks like they are back, so let's hope they will be the next sota model, go deepseek! 🥵
fake
I think it real but who knows how well it actually performs
Fr
we don't care until it releases
OpenAI must be cooking something. They were about 18 months ahead of Google about 18 months back (Gemini 1.0 launch). And they have huge talent and enough money to burn. I dont think they can squander away all that lead in such a small time. I think something big must be coming from them
Well, sorry for sharing dubious information, after talking to the person behind the rumors it seems fake.

no they dont
not true. megacorps like xai, openai and anthropic have agi and asi internally
and most importantly gpt-4-0314 
that was obv an april fool
what do you think DavidSZD?
I don't think Open AI is ahead
This would depend on the model and whatnot. But what he is referring to here is not a finished product. More like experimental model that was not tuned yet or safety aligned. It's not only the latter though, meaning the product an user gonna see could be better than what he has his hands on now.
Looking at his resume he doesn't look like a very technical person either tbh. All his roles were product manager. So not a ML Engineer and I doubt he's in the loop on the training or differences in all the models
Hey everyone, I wrote an article on reasoning. I'd really appreciate it if you could give it a quick read and share your feedback. : )
https://x.com/LuozhuZhang/status/1926955069083107728
Yes
The easiest way to make comparisons is just to look at the pace of improvement of released models
Thanks I couldn't think of that myself
3.5 today? I am already prepared to be disappointed.
ah, understandable
This is really due to bad OpenAI naming but still a funny fact: Google went from bard to 2.5 pro in between the release dates of GPT 4 and 4.1
lmaoo
baseless claim 💯
bro you're more robot than he is for thinking he is serious
LMARena rage baiting or trolling is too easy
i was being semi-ironic
I smiled until I read 💯
his girlfriend is what
@sonic tendon can relate
it's roughly 50-50 within may now
well
that was mostly me
low liquidity, shouldn't have bet as much as I did
should I make another relevant AI news thread?
relevant AI news (version 2)
thank you
ooooh you made the poll
time to vote
wow 2k mana free per alt you make and refer
it was underperforming considering current models but it was still great at the time. We didn't have any real alternatives for raw thinking output when this was just released
thats 20$ worth ??
2.5pro wouldn't exist or wouldn't be nearly as good if they hadn't made flash-thinking earlier as well
I hope that the "request models" category is not just there to look good but that they will add the models that the community requests
new Deepseek maybe today
too many rationalists on this platform (manifold)
NO
?
this is not a joke, it's a real thing. Well a real rumor at least lol
it is a rumor
I've been working (together with Javier Gomez-Serrano) with a group at Google Deepmind to explore potential mathematical applications of their tool "AlphaEvolve", a successor of their earlier tool "Funsearch" that was publicly announced today: deepmind.google/discover/blog/… . Very roughly speaking, this is a tool that can attempt to extremize functions F(x) with x ranging over a high dimensional parameter space Omega, that can outperform more traditional optimization algorithms when the parameter space is very high dimensional and the function F (and its extremizers) have non-obvious structural features.
Some of the preliminary problems we have tried this on, including problems involving harmonic analysis inequalities, additive combinatorics, and packing, were already mentioned in the announcement; we are now gradually moving on to more challenging problems where the parameter space has a sparser set of good solutions. The work is still ongoing, but I hope to be able to re…
Dork ASI confirmed
Probably only gonna be available to SuperDork subscribers though
Is there any potential grok-esque model in the arena rn?
What temp does lmarena use for the models?
it varies by model. I think this isn't just a random example and roughly aligns but this wasn't updated for ages:
https://github.com/lmarena/p2l/blob/main/route/example_config.yaml
TLDR? (more like TSDU, too stupid didn't understand)
terence tao (one of the world's foremost mathematicians) has been working with Google Deepmind regarding AlphaEvolve, the AI model which recently famously made genuine mathematical discoveries
Terence Tao is legit! Alphaevolve may not generate much news but could play huge role in advancing humanity
Do y'all think there are employees from different big companies watching this chat to see people's opinions on their models 
hello
i know claude 4 just released but when does new AI usually show up on the leaderboards?
yea thats a good question im curious too
I think it’s obvious folks at these companies take the lmarena leaderboard very seriously
There’s a simplicity to it that makes it very persuasive to users
nope
if it's o3 on chatgpt website then there is no competition at all lol. It's way more effective at using python than other models to compute/verify and that's in addition to it already being very strong offline
which is nice
Grok is 💩
"UAE gives all 11M citizens free ChatGPT Plus".
Very interesting
this is actually real
drooling aliens from Perseus Constellation are watching your chat 🤪
LMfAo
uae is slopmaxxing
can july come any sooner, me wants deepthink
shh deepthink gets release -> o3 pro releases in tandem
ayo?
this actually looks legit given how badly the unsloth guy is trying to cover
they were basing it off this: https://x.com/YouJiacheng/status/1926885863952159102 (apparently, they said this in the unsloth discord)
there was a second I saw DeepSeek-V3-0526 in changelog, and then it disappeared.
i don't think its happening btw
We know
thanks I didn't see it earlier in this chat
np since no one else posted it earlier
yeah at this point it's too late in the day
only the unsloth article which is based on this information
and there was also this tweet
someone associated with Deepseek replying I presume but I didn't have time to verify it tbh
Opus is a bit of a weird model btw. Really quite unusual how they couldn't showcase anything other than swe essentially. But it does hold up when you test it and looks unique and quite capable 🤷
Someone actually did
In my eyes I almost wrote it off completely after I saw their benchmark manipulation with parallel processing lol
but it actually seems good
but the new piece of info is that the unsloth article was based on it
nope
@keen beacon yep
mike posted it 6 hours later
saying it was based on that
This message and then the x link implied to me that unsloth based it off that before wild posted it not gonna lie
it implied but they admitted to it after the fact
well
I think its obvious that unsloth based it off that
they had access to qwen 3 early
who
unsloth
cares
hmm, wasn't 0324 released on the 25th
u r the one continuing it tho
who cares about gooning with gpt 4o
Anthropic is weird in how they are extremely ethical but at the same time they aren't.
just selectively ethical I suppose.
awh
yeah I'm giving up on speculation for today
they seem to imply no access to insider info outside of that
i was wrong about Claude, tbf, so take my opinion with a grain of salt
but I think it could be real
Deepseek is very very strict with "confidential info" lately yeah. And it's China we are talking about so consequences are different lmao
will do
iirc unsloth had early access to qwen3
grok 3.5
you could be right, tho
seems possible as well
unsloth had early access to qwen 3 (which might imply they might have insider info about deepseek) but mike from unsloth said they just based it on that specific tweet
I guess the chief question is "did unsloth actually base the article on speculation or were they just trying to cover their ass"
i lean slightly towards the latter, but tomorrow will tell
mildly sus but plausibly deniable
"is now the best performing open-source model in the world" is quite implausible as a copy-paste
true, i get that they would want a template for release
but the highly specific information is really sus
ik people dissed on speculation earlier, but, honestly, I find this pretty fun
and they are not even really denying much, they are actually saying a release is very likely right now
deepseek cannot hide from me
I mean speculation about models and model releases are like our 'specialty' in this chat
so...
yeahhh
any idea when claude 4 will be released in lmarena? it's been several days since release
i love hallucinations
not exactly sure, someone nuked it
I made #1376555010820931675
Sonnet 4 sux so much it is sad. bad logic, bad math and always refusing to answer for stupid reasons.
It might be a agentic coding god but certainly not a good chat model
I would be not surprised if it is even lower than 3.7 in lmarena.
Claude finally made it out of Mt Moon
true, but o3 is so expensive
flow
hi guys, this website is free?
isnt claude 4 opus here
yes
yes
yeah i found it in the beta but not the old site
how do they do that
still its not even on the leaderboards
they dont
claude 4 opus is amazing at coding
pirating ai?
they get sponsored
wont this get taken down soon
by all the companies
OH!
so
what happens if the season ends
will the ai go away
i dont wanna pay $100 a month for claude bruh
if the organization decides to remove lmarenas access to a certain model then yes
for older models
ph
thats so cool
i wish they can add deep research
i thought the website was pirating AI
i didnt know that was competition
me when gpt-4-0314
hi
what is this
@misty vault bruhh i just saw that the opus 4 stops at a random part of coding
so it cant be abused
noo
just say "continue"
fr?
thats old
yes
that easy?
yes
ill tell it to stop in batches and continue
the website crashed for me cuz i got like 2,000 lines of code
no that was claybrook i think
for some reason it crashed
o
average beta ui experience
there are much better 2.5 pro models in the arena right now
i feel like LMArena is so abusable
is it good?
i dont like how features are limited
have u tried the i/o edition 2.5 pro lol
i might just buy the $10 monthly for opus 4
Idk they secured it pretty well
and the gemini canvas crashed too
yea
yeah u can evaluate it yourself
for coding claude 4 opus
20$ a month..
Jules is free
jules isnt good
dork 4.0 is agi
acctually?
yes ask @deep adder
When will this joke die?
@deep adder
Yes
bro acctualy fr?
dork*
gpt-4-preview-0314 was agi
bro
It's the gen Z / gen alpha slang
And the jokes
Makes me think this is a younger server
?
I wasn't saying it was bad
32
That's fine
Thanks
Young people check more frequently, I bet there’s a lot of older folks on here who won’t see the poll because they don’t constantly check
how are people voting for apple wtf
apple insiders
craig im trying to like claude code, but wtf ..
smh
aint they owned by amazon
or partially
i mean to buy anthropic off, minimum $100b
dont think theyll accept fair value
ya minimum $100b
private equity?
$61.5b is based on the funding round theyve raised, its not really a stock market, just a gauge
series e1 theyre very late stage
and u didnt know about their funding round smh
i mean the only thing they have is siri
isnt anthropic bound to amazon somehow though?
Anthropic has major deals with Amazon and Google and partners with other big tech companies. The deals with Amazon and Google couldn't be exclusive because of antitrust oversight from the FTC. For this reason it's also highly unlikely that Apple could acquire Anthropic in the current regulatory environment.
If you're a big tech company, you're not supposed to acquire or make exclusive agreements with nascent companies that have the potential to become a significant competitor in the future
Big tech companies have tried to get around that with special deals that are like pseudo-acquisitions, but even those deals have faced heavy scrutiny from regulators
hard to say for sure, but people suspect that it (dragonclaw) and redsword are the non-preview versions of gemini 2.5 pro and flash
dragonclaw is probably an old 2.5 pro checkpoint, it's no longer in the arena now
drakeclaw was a pretty strange model tbh
it was pretty smart
but it was like a strong model gone wrong
like there was something off about it
it didn't know how to spell lmfao
insane syntactic errors
It might be the rl
O3 does some really strange stuff which remind me of that
2.5 pro doesn't seem to be plagued by those problems, at least the released versions (in a very visible way)
when talking to o3
it seems like o3's CoT is a pseudo tree
even though it's single
it kept telling me multiple revisions through an A B C process
and forgetting which one it was assigning to the context
i dont really understand what ur trying to say
ngl
aren't we in an AI server
just use chatgpt
others dont seem to get it sometimes from what ive seen. and i doubt the models would do that well without your context
don't seem to get what?
what ur trying to say
im talking about past conversations
benefit of the doubt it isn't inherently loaded and you take it as is without accusing it of sophistry
so if I'm invoking 3rd party Interpretation (which inherently would lack ALL context besides the claims) that should speak volumes in what I'm trying to say regardless
just take it as is, ion know what else to say
what are the latest un-released good models on LMArena?
goldmane and redsword
I think goldmane is gemini.. what about redsword?
both are gemini and i believe to be 2.5 pro variations
one of them will be ga 2.5 pro i think (best one will be chosen)
small incremental improvements over current 2.5 pro or decently big improvements?
people love it
people say its better than nightwhisper based on the posts in this channel
oh... that would be amazing
i dont think they did more continued pretraining [and etc] (probably nothing big like that until gemini 3 i guess) but might be good to check. i cba to do so rn tho, the models arent far from release i guess
rumor has it that the current gemini 2.5 is actually gemini 3 internally lol
i doubt that. pretraining knowledge/cut off and timelines/etc dont really make sense for that to be the case, but idk
Best general purpose LLM in 2025 yet
7
20
3
Gemini 2.5 Pro 03-25
Goldmane October 2024
It's interesting that it answers differently if you ask for the date in different language
Redsword June 1, 2024
if youre asking for the knowledge cut off directly, its likely to be a hallucination
(if its not trained in or provided in the system prompt)
but it is interesting nonetheless
Yeah I guess it takes too much time to check. At least they don't know what happened in 2025
Lol GPT 4.1
The original 2.5 PRO is always off the charts when pricing is included.
Also MCbench update
something is very wrong with this graph. Opus below Sonnet? Deepseek V3.1 lower than 4.1-nano?? lmao
I think goldmane will be better than 0325
We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined.
Ok so they used saturated outdated benchmarks 
Can't really be saturated if the average is <50%. Do you have a full list?
Agreed. MC bench:
I mean they probably cherry picked the hardest prompts, but still. Those results they got tell me something is not quite right with their approach
What's the link or source pls
Don't have experience with claude 4, but if you remove the 4.1 Nano it seems good.
lmao theres deepseek prover on mcbench?
the thing to keep in mind also, if you gonna use old benchmarks like that, contamination is likely gonna be a bigger problem for labs that were doing this for a long time. Than for relatively new players who got to it after people moved on to other less saturated (by default) metrics.
or just older models vs new, depending how much they changed their datasets
Thx
And this ?
Can't find it anymore. Ask Dom
deepseek beats o3 and chatgpt-latest lmao
look at qwen
I don't see that leaderboard, it's you who posted it. I have no idea where you got it from lol
this doesn't seem to be included in their paper
this is their eval: https://huggingface.co/datasets/HCAI/metabench/viewer
Looks solid on the first glance, but once again... Some of those questions were in datasets a long time ago I think
metabench seems to be more interesting than expected
Interesting, but this is not meta-bench...?
some scores are oddly referenced, like they referenced gpt4o AIME25 against Claude parallel processing one with majority voting? hm
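To see why a single-attempt score and a "parallel processing with majority voting" score aren't comparable, here is a minimal sketch. The sampled answers are made up for illustration, and this is not how any lab's actual harness works; it just shows that the two metrics can disagree on the same set of samples.

```python
from collections import Counter


def pass_at_1(samples: list[str], answer: str) -> float:
    """Fraction of individual samples that are correct.

    This is the expected accuracy if the model only gets one attempt.
    """
    return sum(s == answer for s in samples) / len(samples)


def majority_vote_correct(samples: list[str], answer: str) -> bool:
    """True if the most common sampled answer is the correct one."""
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == answer


# Hypothetical: five parallel attempts at one question whose answer is "42".
samples = ["42", "42", "17", "42", "9"]

print("pass@1:", pass_at_1(samples, "42"))
print("majority vote correct:", majority_vote_correct(samples, "42"))
```

Here the question counts as fully solved under majority voting even though only 3 of the 5 individual attempts were right, so putting a voted score next to another model's single-attempt score flatters the voted one.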
em.. wtf lol
I suppose it makes sense considering it's so concise with reasoning, but this is completely opposite to Anthropic's table... Overfitted on newer AIME25? 
its artificialanalysis's benchmark harness
yeah but it checks out most of the time with official numbers +/- small discrepancies
this is a HUGE discrepancy
yup their claude 4 measurements are like that
messed up
claude 4 sonnet non thinking has a higher gpqa diamond than claude 4 sonnet thinking there iirc
well it was like that
GPQA is fine for Opus though, like it's barely behind o3 there in their testing
they remeasured
yeah theres just something wrong with it
theyve been trying to fix it i think
#general message <-- claude 4 sonnet thinking having a lower score than claude 4 sonnet before they remeasured
Opus is giving off slight gpt4.5 vibes of it being outperformed by smaller models in places regardless tbh, although it's a more capable model now. I think they could do more RL training on it
Not worth it imo
can't see how it wouldn't lead to higher scores. I think GPQA could be improved just with a sys prompt. That benchmark seems to favor longer outputs a lot. And considering the size there should be more gains than long outputs from smaller one
Rl would lead to higher scores but it's just probably hard to work with a model that size
They should focus on making another 3.5 sonnet
there's no obvious way forward with that though I don't think. Especially since they seem decided on hybrid reasoning. They could only like redo 3.7 or 4.0 which would be a boring release
unless like train Sonnet non-reasoning on final Opus/Sonnet-thinking outputs. But at that point I would just do both things - this and further RL training on Opus. To see which option is the more promising etc
Wdym, it's from their table. Or did you have other benchmark in mind?
In a tweet he calls it "Unified-Bench 1.9", metabench paper also mentions other benchmarks than the ones showed in that google drive link, but yeah the chart says "metabanch". Dunno it's confusing lmao
oh wait. They probably just used the same method but picked their own different benchmarks to get subsets from. Yeah I was just looking at this wrong lol
this just seems to be a compilation of benchmarks
the meta bench thing is coincidental
but what is the point of calling it metabench then? I would think at least some of it is similar 
i would bet that guy didnt even know that paper existed
yeah it would appear so they didn't independently test anything, how is the average calculated then..?
"Hallucinations when summarizing" -- this should have been inverted at the very least but they are weighting the scores in some interesting ways. Cause even when I treated this metric as a very bad score the average for o3 came up still higher at 61.8%
wow... Claude 4 launch is disappointing af
Maybe not disappointing per se (reasoning still looks very solid as well as context awareness etc, probably the closest model now to the feel of gpt4.5 with better performance), but yeah there are things where it seems to underperform for sure
would be interesting to get SimpleQA score of it
New model in Beta Arena: qwen3-235b-a22b-no-thinking
Hello @everyone
Hire a Generative AI Engineer | Unlock the Power of Intelligent Automation and Creativity
Are you looking to leverage the latest in generative AI to drive innovation, enhance productivity, and create smarter workflows?
I’m a skilled Generative AI Engineer with hands-on experience in designing and implementing AI-powered solutions that deliver real value. From custom chatbot development and workflow automation to AI-assisted content generation and LLM integration, I offer both technical depth and strategic insight.
Who I Work With
Startups seeking rapid prototyping and intelligent systems
Small to medium businesses looking to automate and scale
Agencies enhancing service offerings with AI capabilities
Creatives and marketers integrating AI into daily workflows
Let’s Work Together
I am currently available for freelance projects, part-time contracts, and consulting engagements. Whether you need a full build or expert guidance, I can help you integrate AI into your business with clarity and efficiency.
there's also X-preview
which responds in Chinese unprompted and self identifies as from Baidu
(the subsequent responses were in English.. but nothing to write home about.. didn't perform or feel like grok)
its from baidu
ernie x1 ( reasoning model )
yea
sorry i didnt read the baidu part
yeah he literally used that score of hallucinations rate for an average even though lower = better lmao
lol
Add notes to the excel or something
But the fact that self-reported benchmarks are included makes it already unreliable
Though I don't believe OpenAI would cheat here
yeah im kinda similar.. at least i think oai (and most of the other major players) are more likely to selectively publish / cherry pick evals or do sneaky things involving asterisks* than outright lie
yeah
though u cant replicate the claude 4 benchmarks
the parallel scores
it uses an internal scoring model LMAO
yeah wow lol
i saw previous comments alluding to this and agree: would've expected better from anthropic..
i mean they're meant to be a transparency / alignment / safety company...
yet they're releasing sneaky / non-reproducible evals..
that said.. and just fwiw.. i do think opus 4 is a genuinely strong / top tier model
just report pass@1 and thats it tbh
it would have been ok. If he hadn't treated inverted scores as the same and missing ones as 0% (that's why Claude is so low)
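The two problems called out here (lower-is-better metrics averaged as if higher were better, and missing scores counted as 0%) can be sketched like this. The benchmark names and numbers are invented, and flipping a percentage with `100 - score` is an assumption about how you'd want to normalize; it is not the spreadsheet author's method.

```python
from typing import Optional

# Hypothetical metric directions: True means higher is better.
HIGHER_IS_BETTER = {
    "math": True,
    "coding": True,
    "hallucination_rate": False,  # lower is better
}


def naive_average(scores: dict[str, Optional[float]]) -> float:
    """The flawed approach: missing scores become 0, direction is ignored."""
    values = [0.0 if v is None else v for v in scores.values()]
    return sum(values) / len(values)


def fair_average(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Skip missing scores and flip lower-is-better percentages first."""
    values = [
        v if HIGHER_IS_BETTER[name] else 100.0 - v
        for name, v in scores.items()
        if v is not None
    ]
    return sum(values) / len(values) if values else None


# Made-up model: strong at math, coding score never reported, rarely hallucinates.
model = {"math": 80.0, "coding": None, "hallucination_rate": 10.0}

print("naive:", naive_average(model))
print("fair:", fair_average(model))
```

With these toy numbers the naive average is 30.0 while the direction-aware average over reported scores is 85.0, which is the kind of gap that can sink a model with a few unreported benchmarks.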
if it was a total flop i feel like anthropic would be kinda screwed from here.. like they're struggling to keep up
[but i think sonnet 4 is perhaps a flop... and that doesn't bode well for Anthropic at all imo]
i dont think either were pretrained from scratch theres that i think
yeah i agree
maybe their actual new pretraining run will be good (although there might've been architectural changes along the cpt, but we can't tell)
yeah they've just lost time - and afaik remain constrained by resources / compute
so it will be a challenge
both technically / financially combined with their lagging position relative to competitors (esp oai and google)
yeah I feel the same way. But their metrics are a bit concerning. Like there are things you can clearly see it's their best model yet (like HLE), but they kinda failed to show consistent gains across the board
so it becomes difficult to directly compare it against competition
yeah the limited release of evals compared to previous model releases kinda says something in itself
AIME25 score is good, but then AIME24 seemingly isn't..
The limits with Anthropic models are maddening
@alpine coral did you try goldmane/redsword btw?
i was just trying to get them in the arena actually! but haven't had any luck
i got calmwater - kinda surprisingly (i thought it was associated with a now-released version of 2.5)
it performs well
ahh it's calmriver
its still on the arena?
yeah
yeah
it's almost certainly got thinking enabled
New model in Beta Arena: glm-4-air-250414
Air is a trendy name right now. "Our slimmest model yet"
just got redsword (using beta arena)
very, very impressive / strong
like really good (not a step change.. but it looks - based on two quizzes.. given in a single exchange - like a genuinely stronger pro 2.5)
[actually i dunno.. maybe a step change... pretty damn good]
Have you tested goldmane
no haven't gotten it yet
opus 4 still good at coding?!
wait, are you doing all of these manually?
yeah.. i mean for models in the arena there aren't really other options ha
though some of the scores are from API/official chat
idk, i figured it wouldn't be too difficult to reverse-engineer the battle api
could be wrong
nice
aha yeah possibly - though i'd like to think not
but yeah i couldn't do it even if it was somehow possible
nobody has an idea, they are so close to eo
goldmane vs redsword
eo?
sometimes one performs better than the other
each other
ah
:3
webdev arena system prompt, for anyone curious
too lazy to remove the //s
oh, nvm, someone posted it a bit ago
That looks promising!
yep! kinda got that nebula feel about it tbh ha like yeah been a while since something like
though sample of 1.. shouldn't get too far ahead myself ha
Should I get Gemini Pro or ChatGPT Plus for university Computer Science assistance and research?
I'll ask the AI to summarize chapters within the digital books that they've provided us, amongst many other things.
i got goldmane (on beta chat)
it did slightly worse on the two question sets above (12, 6 respectively - still v strong but not at the very top like redsword)
but interestingly.. i gave it an additional question set after those two - which it smoked
finally we'll get anon models on the nice UI
LMArena will stay open and accessible to everyone
So does that mean the ranking algo will stay open? Or just that we'll be able to see the rankings
@echo aurora How does one use the search models on the new UI?
Am I on the actual Gemini Pro?
I've been added to a friend's family.
Why does it say (preview)?
aha yup.. it's nice
i don't think the old site is even accessible any more
the new site has a huge censorship issue
We're committed to keeping our ranking methodology open and transparent
any slightly problematic words like kill would trigger the word filter
Which is the best at math?
Gemini 2.5 pro (I/O edition)
can you guys add temperature and system prompt config options in direct chat? it was in the legacy arena 🫤
Can anyone help?
thanks for the flag, noted! I'm going to make a post in #1343291835845578853
it's possible! for now adding the feedback in #1372230675914031105 would be ideal
iirc i suggested it ~a month ago when it was still a beta
I thought I would push this question back before it is lost forever
Has anyone here evaluated OpenAI Codex versus Google Jules yet?
Gemini. It's better for knowledge, obscure facts, etc
The GA version will launch in June. The preview version is still good
It's hard to say conclusively; maybe Gemini is the better option today, but things move so quickly that nobody knows what the right answer will be in three months, let alone several semesters from now.
Plus Gemini Pro is free for students if you use an edu email
Oh? Niiiice
Comes with NotebookLM, etc
@patent aspen you so 100% either work at google / deepmind or your room looks like this 🤣
rude
Sry to say the current site doesn't have ability to filter by search models atm
On pc you can press Ctrl+F on the website to filter by model name
this ai image ok
thank you for keeping the legacy site available
obv
idk, i just want him to actually tell me if he works there
or if he is just genuinely a fan
or nothing of the two
Are the anon models still only on the legacy UI? Can't seem to get them on lmarena.ai?
Well got some now.
Get a lot more anon models in the legacy UI.
Is it better to let Gemini Pro remember my data in the long run? Will it be more intelligent, or will it be more bloated, slow, and dumber?
By "data" I mean, the information it gathers over the course of multiple chats.
the new arena is so nice
Nice update! Can't wait for the Q&A
glad to hear it! be sure to submit questions if you haven't already.
Does anyone know?
Where can I access this benchmark? thank you
Why did you severely limit the context window in direct chat with the AI model? After interruption, when I write the text again, it gives an error
The Sonnet 4 and Opus 4 models freeze at the moment when they are "thinking" and do not even reach the output of the text
i wish a smaller model was used to filter out bad/inappropriate prompts instead of simple word/string/regex filters
how small is "smaller", ya think?
if you're working on a project and you want it to learn about exactly how you have everything set up then yes
but in any other context its going to be more bloated yeah, I'd start from new conversations after a few prompts
pretty sure it's a text classifier, probably OpenAI's moderation endpoint
you can also ask ai to create a prompt which summarises your whole project setup in detail, this works great
it's free to use and you can customize it how you want
New model in Beta Image arena: bagel (style of image quite resembles gpt-image-1)
though I do wonder about their data privacy policy... could be meaningful if lmarena are sending all inputs/outputs to OpenAI for moderation LOL
like 🤔
mm possible
day 41 without o3 pro
you are still counting
yes until the day comes duh
o3 still says june 12 like yesterday, omg its so accurate
bro
they really added the new ui
its so incomplete
at least add max output tokens
sliders for temperature are so helpful
and they're gone
bro
legacy ui sucks tho
and it is not gonna have new features anymore
they really need to add the temperature and max output token controls though
no
then the arena can be abusable.
telling them what
the old one already has it
how could it be used for abuse
the technology already exists, we just need to add it to the new ui
gradio built this in a cave! with a box of scraps!
LOOL
@misty vault telling them what
wtf
Reminder:
No NSFW
gone for 1 hour
I was out walking my dog 
it will come, just a question of when
i wouldnt be surprised if it comes out with deepthink release
wasn't deep think already released?
to those who kindly ask after donating $250
no, it gets released in late june/early july
only select users have it for safety testing
They could have done what Anthropic did and never released it at all, just gave the benchmark scores for it. So I guess it could be worse
i just hope when deepthink is released, its not going to be heavily limited like veo 3
8 videos and ur done for the week, gg lol
Shouldn't be, heck, I kinda expect it to be cheaper than o3 still
really don't have much time right now, but i built up some internal benches about the CoT prompt:
it seems to have an INSANE effect on performance for 2.5 flash, pushing its performance well above all the other models (that also have the same prompt)
should be pretty obvious from that they are actually just using the same model (for reasoning and normal)
(And qwq / llama maverick lost <5% because of rate limits)
your new cat pfp is lookin a little sleepy
actually outperforms the actual thinking model (by a small margin, that is negligible)
0 features tho tbf
and a lot of bugs currently
censorship weights, mobile bugs
Which company do you think will achieve A.G.I first?
oai >> google, still, 74% is due to recency bias
o4 internal model should easily dominate deepthink, there is still a gap, but narrowing
If you cover the entire AI space, Google is easily ahead. 2.5 pro is still right on o3's heels, although some argue it's overall better.
oh yea google is breadth heavy, but not in depth
They're also pivoting to world models and it'll be interesting to see what kind of performance improvements that brings
Haven't seen anything from openAI from that angle (streams of experience).
Google basically appears to match openAI in the LLM space, while being ahead everywhere else, while also showing off what they think is the next frontier of AI improvement (world models).
So I think it's natural that people have the perception they'll win. They also have the most compute, built transformers, and don't have an Nvidia tax/bottleneck but use specialized hardware and control their entire vertical stack.
Like, when you're forced to look at the entire picture holistically, Google starts to look like an increasingly promising bet in the space.
Today I learned they have more compute than Microsoft and Amazon combined.
it's not recency bias lmao, if you unironically think oai > Google in regards to AI then it's your contrarian mindset, not the reality
if Google wanted to make an o3 or o4 model they can and probably do have one internally
there's no reason to serve such an intense model
it goes against basically everything they've been building in regards to efficiency and profit
but even outside of that, Trey said it all tbh
openAI simply isn't in the position to do that
it's a fundamental problem, not mechanical feasibility
How old are you?
< 24
try redsword on this one
he already did
he said redsword performed better than goldmane
if google have an "o3" internally already, then why is it not being served while oai is hosting it at scale rn? im trying to not be biased here, but o3 is just in such a different league, beyond gemini 2.5 pro as of now. the efficiency/profit threshold release is bs, bc they have veo 3 and its certainly not cheap to serve, heck just look at their $250/mo plan. google has more money/data, yes, but it can only get u so far, just look at meta. i may be wrong at the end of the day, time will tell
oh can you link it
I already said why they're not serving it lmao. And it's not in a different league it underperforms in a lot of things compared to 2.5 pro.
veo 3 is an entirely different thing lmao, it inherently requires more compute and diversifies their AI resume, it's necessary
everything you're saying is as improbable as saying anthropic will be the one to AGI, you're simply choosing to say it's openAI, when everything points to Google having legitimate reasons to be both the strongest lab + the lab with the best research.
Meta isn't a good comparison, they have neither the infrastructure, the data scientists, the ML researchers, the scientific foundations, etc
when Google has the opposite, they have THE best
not just "one of the best"
crazy how you say some of the most nothing burger shi ever
logically speaking people have more incentive to work for Google
saying openAI doesn't even make any sense lmao
everybody wants to work for Google
that's the holy Grail dawg
smarter in the sense we as an AI community define "smartness" sure, but better isn't the case
? the opposite is true rn
no it's not lol
im not saying veo 3 is the same as the llm, im saying they don't abide by a price/efficiency release schedule, thats bs
I'm saying that's completely irrelevant lol
veo 3 doesn't exemplify ANY price efficiency schedule in regards to AI
it isn't more cutting edge you're lying out of your ass
startup feel is bs
has nothing to do with how it operates and incentives
no it LITERALLY is
you can't justify that even a little bit
publicly traded means nothing
its easy to say that when llama 4 performed poorly (recency bias), they certainly do have the infrastructure, else they couldn't host all their sites at scale, and they do have proper ml/data scientists or the algo wouldn't be as addictive
you literally have no idea how that works, that's COMPLETELY irrelevant to employee
crazy how that's something I study, but with that, this is irrelevant to researchers financially
is politics allowed (like saying is trump affecting AI?) <@&1349916362595635286>
trump affecting AI isn't political
well
Avoid political and religious content. As a space that's inclusive to many different worldviews we ask to avoid topics related to politics and religion in order to maintain an inclusive space. It is okay to have discussion related to new policy or laws as long as it's related to AI.
it would be silly to ban all trump discussions
well is trump affecting AI?
of course
you're shifting the goalpost
also appreciate your time
well, how much is, i believe, what's being discussed
trump affected everything
llama 4 performed poorly because they don't have the ability to compete, simple. And what I mean by infrastructure isn't compute lmao, I mean the readiness for AI development
AI is going to get affected via collateral damage
prove that
yes you can lmao
this isn't unfalsifiable
you made the claim
what
let's get our Gemini 2.5 pros to Duke it out
deadass
what's the claim
brobro
this is irrelevant btw, employee incentive is in discussion
that's legit irrelevant
none, DeepMind is already what initiated this entire thing
legit doesn't matter
no like, in no case
does it matter
in any way
there's no parts of a company that are legitimately stagnant if they're not unstable
that's a nothingburger
and not how it works
Google is too large to be stagnant
and still, Google is basically the only one truly "innovating"
they have the most distribution already
and even operating under the premise "employee incentive", Google pays twice as much
the bonus and RSU's are important, OpenAI's private equity options are speculative and meaningless, literally contradicts "incentive"
google provides annual cash bonuses + liquid GOOG RSU's that vest over 4 years
Above argument is kinda funny considering openAI has lost a bunch of their main researchers over the past year
One of the more bleeding companies in the talent space
ye
The lead researcher on sora went to google
Co-lead technically
Ilya has his own company but is using Google TPUs
FB also lost a lot of their top researchers. Basically anthropic got a bunch of the openAI talent, or they went and formed their own companies.
Google got Noam back (huge deal tbh probably bigger than anything openAI has gotten).
Claude would be way more performant imo if they had similar compute that openAI has. Arguably they'd be in the lead, but I think it would end up being between them and Google if that were the case.
ur just talking about micro events that dont even matter to oai long term
mind u oai has 5k employees, deepmind has 2k if u wanna talk macros
none of them are micro events, but even granting that assertion openAI doesn't need to be affected, other labs (like deepmind) just need to maintain their lead in AI as it's always been
AI isn't just LLMs
and that's LITERALLY the only thing OpenAI has
openAI doesn't have an alpha zero, openAI doesn't have an alphaevolve
openAI doesn't have the data, openAI doesn't have basically everything
cherry picking micro events doesn't make it a good argument
I just granted your assertion lmao that's dismissing the claim altogether
what are you cherry picking
unrelated things
The entire argument is cherry picking lol, we aren't having a particularly exhaustive conversation
Like 99% of convos in here is vague allusions as to why one company is better
Does that include Google brain, or is that the number pre-merger?
ye, but the discussion called for specific employee incentive
which doesn't necessarily invoke a deeper discussion, if one at all
source?
exactly
might as well say oai has 10k employees, im spicy
rooting google to get agi first (even if they do get it) is crazy to me, ppl sometimes
smart people actually do have a brain
and just because google has their own gpu equivalent hardware, doesn't mean anything, actually it should just mean research friction, more time integrating/debugging than actual research
Insane arguments at 5 am
Ngl I'm getting increasingly convinced by the google propaganda in this channel
i mean..
people are recency loving creatures
tiktok brain
if oai releases o4 next week, i know for a fact everyone changes their perspective lol
No way lmao openai is so bad
Dog lmao
"I know for a fact" like
Ting tong countries probably gonna rock this again
Nah, they gonna fall behind. But I hope I'm wrong and Deepseek goes bang bang on the competition
?
ngl this is probably true for most people in the ai space based on past releases 🤣
why am i not getting pinged, r u guys scared lol
ur typing tho should i ping u again?
You're saying "you know for a fact" is just your personal assumptions about how performant it will be
ok so o4 mini high virtually matches in elo with o3 in codeforces, u think the jump from o4 mini high to o4 is going to be marginal?
Certainly could be, yes. None of us know how good it'll end up being.
news flash o4 is top 50 in codeforces
We also don't know what the competition might have or drop when it releases
Yeah IDC about openAIs claims, proof is in the pudding, they gotta show it first
lol
in retrospect i can say the same thing about google, at face value they havent released anything substantial to actually compete against oai (when we talk about agi -- not gimmicky videos)
"better than most PHDs across most fields" I think is another claim they made for current o3
now that lm arena is shadcn themed instead of gradio themed should i update lmb too 🤔
Gemini diffusion is amazing!
New amazon model "folsom-exp-v1.5"
New model in Arena: stephen
As far as I can tell, that's the new deepseek R1
You sure?
Unless they name this model R1.5 or something like that
Is it on the new arena?
he says his name is R1 or is he from deepseek?
from language style
does the arena work now? I get only errors
This model is open source
https://huggingface.co/THUDM/GLM-4-32B-0414
claude 4 opus is so easy to jailbreak lol
new deepseek r1 making discord clone
wow
From which company? Baidu?
"Air" is generally used by apple products
glm, its from zhipu
there is glm 4 plus on the leaderboard
vs old r1 (via openrouter so without their system prompt)
Wow.. huge diff
https://fixupx.com/opera/status/1927645192254861746
opera did this with veo 3
Meet Opera Neon, a browser for the agentic web
Opera Neon can browse with you or for you, take action & help you get things done.
Our playground to redefine what a browser can be.
🧩 Invite only. Sign up now: opr.as/f4190e
wow, new R1
not bad
but why did they call it a minor upgrade
they also call the new v3 a minor upgrade
ive heard the reasoning of the new r1 is much better
@civic flame whats ur take
hmm
yea its different, to see if it's better we need to compare the results
send prompt
Last week I met a lot of people who use 4o for coding. I thought they were midwits, but maybe the lmarena leaderboard is right
It seems that we slept on GPT 4.5
And maybe 4.5 is undertrained
Claude 4 Opus, on the other hand, is fully trained
maybe tbh
And I would like to see a race between GPT 4.5, Claude 4 Opus and Llama 4 Behemoth
i recall reading the original simpleqa paper, they dont score that well
claude models on simpleqa
I mean what if 4.5 is constantly upgraded just like 4o
probably wont ever happen
its too large
the pricing though
lmao
his reasoning has not shortened at all, it is very long
Where's 3.5???