#general | Arena | Page 47

elder rapids May 25, 2025, 7:04 PM

#

with a capital G

#

why do you keep deleting your messages

small haven May 25, 2025, 7:07 PM

#

psi

#

how good is "o3 pro"? any insights here?

keen beacon May 25, 2025, 7:08 PM

#

itll be better than o3

small haven May 25, 2025, 7:09 PM

#

wow thats breaking news

keen beacon May 25, 2025, 7:09 PM

#

it has "pro" in the name

small haven May 25, 2025, 7:09 PM

#

omg i have goosebumps now

keen beacon May 25, 2025, 7:09 PM

#

theres two "o"s in the name

elder rapids May 25, 2025, 7:10 PM

#

small haven how good is "o3 pro"? any insights here?

|3 pro

#

o3 is blinking

keen beacon May 25, 2025, 7:11 PM

#

(i wonder how many people actually read the raw cot when asking it to code, it does a lot of cot within comments. the final output even with the comments is pretty stripped of it, but enough to understand where the tendency seems to come from)

fleet lintel May 25, 2025, 7:12 PM

#

o3 pro should be interesting.. i have not much hope from grok3.5 or deepseekr2 but very hopeful about o3 pro

elder rapids May 25, 2025, 7:14 PM

#

deepseek r2 is never releasing

#

😭

quiet folio May 25, 2025, 7:28 PM

#

I dont want my own comments that actually make sense to get removed though

misty vault May 25, 2025, 7:29 PM

#

Fr

quiet folio May 25, 2025, 7:35 PM

#

no I actually just escaped maximum security prison for being able to see deleted messages

misty vault May 25, 2025, 7:41 PM

#

that's literally what he just said

tall summit May 25, 2025, 7:46 PM

#

oh nice independent scrolling

#

i don't think that is the most difficult to code but alright

keen beacon May 25, 2025, 7:49 PM

#

i deleted it because treesitter might not be the best way to do it

#

there might be an easier way to do it (dependent on the editor) in a generic fashion with semantic understanding

echo aurora May 25, 2025, 7:52 PM

#

tall summit i don't think that is the most difficult to code but alright

it's a small team and trust me when I say they work hard! we look forward to expanding the team to deliver new features more quickly.

ocean vortex May 25, 2025, 7:53 PM

#

fleet lintel o3 pro should be interesting.. i have not much hope from grok3.5 or deepseekr2 b...

grok is holding back until 4.0 Dork. They will unleash ASI with it

pliant cypress May 25, 2025, 7:53 PM

#

redsword and goldmane are only available on webdev arena?

barren prairie May 25, 2025, 8:07 PM

#

small haven how good is "o3 pro"? any insights here?

O5 is better 😬🤖🫣

small haven May 25, 2025, 8:12 PM

#

barren prairie O5 is better 😬🤖🫣

breaking news

unborn ocean May 25, 2025, 8:24 PM

#

nous research still beats it

#

https://nousresearch.com/hermes3/, https://sims.nousresearch.com/

NOUS RESEARCH

Hermes 3 - NOUS RESEARCH

Hermes 3 contains advanced long-term context retention and multi-turn conversation capability, complex roleplaying and internal monologue abilities, and enhanced agentic function-calling. Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fin...

SIMULATORS — Nous Research

Simulators by Nous Research.

topaz peak May 25, 2025, 8:30 PM

#

unborn ocean nous research still beats it

what is this? never saw it before

#

looks like some frontend for another already existing A.I , ie a scam

keen beacon May 25, 2025, 8:31 PM

#

topaz peak what is this? never saw it before

open source ai research company

topaz peak May 25, 2025, 8:34 PM

#

beautiful website ngl, but i am not convinced

quiet folio May 25, 2025, 8:34 PM

#

keen beacon i deleted it because treesitter might not be the best way to do it

Ok that is actually a valid reason. I thought u were silly and embarrassed, u still should be though because of your default pfp

small haven May 25, 2025, 8:37 PM

#

so whats the consensus on claude 4 opus, shite or vibes

keen beacon May 25, 2025, 8:38 PM

#

quiet folio Ok that is actually a valid reason. I thought u were silly and embarrassed, u still...

there was some of that too tbh (but it wasnt the primary reason in this instance) 🤣 i tend to irrationally delete comments

tall summit May 25, 2025, 8:40 PM

#

echo aurora it's a small team and trust me when I say they work hard! we look forward to exp...

excited for the future

wintry tinsel May 25, 2025, 8:42 PM

#

unborn ocean https://nousresearch.com/hermes3/, https://sims.nousresearch.com/

Is this any good?

unborn ocean May 25, 2025, 8:50 PM

#

wintry tinsel Is this any good?

very well regarded researchers

golden ocean May 25, 2025, 8:51 PM

#

https://ezgif.com/images/loadcat.gif

unborn ocean May 25, 2025, 8:51 PM

#

i don't personally use them much, but they have very popular finetunes of llama, mistral models

misty vault May 25, 2025, 8:52 PM

#

bro claude 4 opus kinda stupid ngl

unborn ocean May 25, 2025, 8:52 PM

#

(mainly focusing on tool use and conversational stuff)
But now they are also working on training their own models over a crypto-like (prob not the right word) compute network.
If you are interested in research, i highly recommend checking out some of their work.

misty vault May 25, 2025, 8:52 PM

#

Its code works but so much duplicate code

#

rookie mistakes

keen beacon May 25, 2025, 8:53 PM

#

unborn ocean i don't personally use them much, but they have very popular finetunes of llama,...

they've used qwen?

#

its been a while but i didnt recall them using qwen

elder rapids May 25, 2025, 8:57 PM

#

small haven so whats the consensus on claude 4 opus, shite or vibes

shite, no vibes

#

and it's not going to be webdev monster for that long anymore

unborn ocean May 25, 2025, 8:57 PM

#

keen beacon they've used qwen?

no, sorry

#

mixed it up right there with mistral for some reason

keen beacon May 25, 2025, 8:57 PM

#

it seems they avoid qwen

unborn ocean May 25, 2025, 8:58 PM

#

keen beacon it seems they avoid qwen

i guess llama and some mistral models are more aligned with their ideas?

#

otherwise one could use athene for high quality qwen finetune

keen beacon May 25, 2025, 8:59 PM

#

unborn ocean i guess llama and some mistral models are more aligned with their ideas?

i guess

unborn ocean May 25, 2025, 8:59 PM

#

but they are a public company (so they probably just used the best model available for their model)

#

and it might also be that qwen is just not good at conversational stuff or roleplaying

keen beacon May 25, 2025, 9:00 PM

#

unborn ocean and it might also be that qwen is just not good at conversational stuff or rolep...

i dont think so. i think there've been a lot of popular community tunes (specifically for a specific type of rp) on qwen models but i dont really pay attention to it that much

#

maybe if u dont tune directly off the base model (noushermes tunes on the base model, so can't be that)

small haven May 25, 2025, 9:05 PM

#

elder rapids shite, no vibes

same vibes here, im trying to love it, but its hard

unborn ocean May 25, 2025, 9:05 PM

#

no qwen finetunes, but at least they have a merch shop 🤣 https://shop.nousresearch.com/collections/products

Nous Research

Shop our products

Nous Research

#

nice priorities

keen beacon May 25, 2025, 9:05 PM

#

unborn ocean nice priorities

they got the aesthetic right

#

for sure

unborn ocean May 25, 2025, 9:07 PM

#

https://psyche.network/runs/consilience-40b-1/0 this is also kind of cool

Nous Psyche

Psyche is an open infrastructure that democratizes AI development by decentralizing training across underutilized hardware. Building on DisTrO and its predecessor DeMo, Psyche reduces data transfer by several orders of magnitude, making distributed training practical.

#

might actually donate some compute

keen beacon May 25, 2025, 9:07 PM

#

yeah it is

#

beyond the decentralized aspect, it might be interesting: 20t is a solid amount of pretraining tokens

unborn ocean May 25, 2025, 9:08 PM

#

tru 20t is actually like close to what qwen used, right?

keen beacon May 25, 2025, 9:09 PM

#

unborn ocean tru 20t is actually like close to what qwen used, right?

For qwen 2.5, yeah

unborn ocean May 25, 2025, 9:10 PM

#

"In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens."

#

so it is like actually close to SOTA and more than qwen 2.5 and llama 3 i think

keen beacon May 25, 2025, 9:11 PM

#

unborn ocean so it is like actually close to SOTA and more than qwen 2.5 and llama 3 i think

Qwen 2.5 is 18 trillion

#

Some of the models 19 trillion

keen beacon May 25, 2025, 9:12 PM

#

unborn ocean so it is like actually close to SOTA and more than qwen 2.5 and llama 3 i think

It's not close to qwen 3 at all tbh

#

Qwen 3 is 36 trillion

#

I still won't be expecting much, nous research haven't been pretraining their models like qwen. It might be like 20t of slop but I don't know lmao

keen beacon May 25, 2025, 9:13 PM

#

unborn ocean "In the first stage (S1), the model was pretrained on over 30 trillion tokens wi...

There's more after that

unborn ocean May 25, 2025, 9:21 PM

#

ik, just the first quote i found

#

and i am not sure if the 20t is complete from nous research

#

or just s1

unborn ocean May 25, 2025, 9:23 PM

#

keen beacon It's not close to qwen 3 at all tbh

more than half the tokens that a multi billion dollar chinese mega corporation uses

#

is kind of a lot for a small research collective

keen beacon May 25, 2025, 9:26 PM

#

unborn ocean is kind of a lot for a small research collective

if they reach it tho

unborn ocean May 25, 2025, 9:27 PM

#

jup kind of ambitious

keen beacon May 25, 2025, 9:27 PM

#

its gonna take years at the current rate

unborn ocean May 25, 2025, 9:32 PM

#

about 1111 days total
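
The token figures traded above can be sanity-checked with a little arithmetic. This is only a rough sketch that takes the numbers exactly as quoted in the chat (20T planned for the Nous run, 18T for Qwen 2.5, 36T for Qwen 3, about 1111 days total). The daily rate it prints is back-derived from those quotes, not a measured throughput, and the constant and function names are made up for illustration.

```python
# All figures are as quoted in the chat; none are independently verified.
NOUS_TARGET_TOKENS = 20e12   # "20t" planned pretraining tokens
QWEN25_TOKENS = 18e12        # "Qwen 2.5 is 18 trillion"
QWEN3_TOKENS = 36e12         # "Qwen 3 is 36 trillion"
ESTIMATED_DAYS = 1111        # "about 1111 days total"


def ratio(a: float, b: float) -> float:
    """Return a as a fraction of b."""
    return a / b


def implied_tokens_per_day(total_tokens: float, days: float) -> float:
    """Average daily rate needed to cover total_tokens in the given days."""
    return total_tokens / days


if __name__ == "__main__":
    print(f"vs Qwen 2.5: {ratio(NOUS_TARGET_TOKENS, QWEN25_TOKENS):.2f}x")
    print(f"vs Qwen 3:   {ratio(NOUS_TARGET_TOKENS, QWEN3_TOKENS):.2f}x")
    rate = implied_tokens_per_day(NOUS_TARGET_TOKENS, ESTIMATED_DAYS)
    print(f"implied rate: {rate / 1e9:.1f}B tokens/day")
    print(f"duration: {ESTIMATED_DAYS / 365:.1f} years")
```

On these numbers, 20T is a bit more than Qwen 2.5 and a bit more than half of Qwen 3, and 1111 days is roughly three years, which matches both the "more than half" and the "gonna take years" remarks.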

elder rapids May 25, 2025, 9:57 PM

#

small haven same vibes here, im trying to love it, but its hard

ye I've been trying to get it to work on things and prompt it in its favor

#

but it doesn't go very far beyond acknowledging it and very slightly adjusting

small haven May 25, 2025, 9:58 PM

#

elder rapids ye I've been trying to get it to work on things and prompt it in its favor

are u using claude code

sonic tendon May 25, 2025, 9:59 PM

#

source,

misty vault May 25, 2025, 10:50 PM

#

mortal flame May 26, 2025, 3:57 AM

#

I like Calmriver, but I wish I knew WTF it really is?

ancient sandal May 26, 2025, 5:49 AM

#

is goldmane > pro-05-06?

#

seems like it since people saying it's better than opus

dusky aurora May 26, 2025, 7:12 AM

#

ancient sandal is goldmane > pro-05-06?

what goldmane?

late path May 26, 2025, 7:18 AM

#

ancient sandal is goldmane > pro-05-06?

I think so. Much less Markdown slop.

calm sequoia May 26, 2025, 7:19 AM

#

Anyone seen any rumors what happened to R2?

brazen vine May 26, 2025, 7:20 AM

#

has anyone subscribed to github coding agent? how was your experience?

balmy mist May 26, 2025, 7:21 AM

#

what did i miss?

ocean vortex May 26, 2025, 7:22 AM

#

calm sequoia Anyone seen any rumors what happened to R2?

the great leader took it away for personal use

calm sequoia May 26, 2025, 7:37 AM

#

Something is cooking with the GPT 4o

#

It just answered a long prompt in milliseconds

#

As soon as I pressed the "Enter" button 👀

golden ocean May 26, 2025, 7:51 AM

#

that sounds like the opposite of cooking

misty vault May 26, 2025, 7:53 AM

#

calm sequoia It just answered a long prompt in milliseconds

I have ptsd from this

misty vault May 26, 2025, 7:54 AM

#

golden ocean that sounds like the opposite of cooking

Fr

trim vale May 26, 2025, 7:55 AM

#

Is it normal that a gemma 3 gguf model of the same size as a perfectly working llama model seems like it requires much more memory

#

Most other gguf models use roughly the same amount of ram when their sizes are similar... yet gemma 3 seems to work differently

misty vault May 26, 2025, 7:57 AM

#

because gemma 3 is agi

calm sequoia May 26, 2025, 8:10 AM

#

golden ocean that sounds like the opposite of cooking

The answer is perfect

misty vault May 26, 2025, 8:11 AM

#

"what is 9 + 10?"

calm sequoia May 26, 2025, 8:11 AM

#

It's strange because even small models can't do it in milliseconds

#

Maybe it's that their servers were not overloaded

misty vault May 26, 2025, 8:12 AM

#

calm sequoia It's strange because even small models can't do it in milliseconds

there, u said it. small models. artificial stupidity. goodbye
-# react with clown if u think gpt 4o is restarted

calm sequoia May 26, 2025, 8:22 AM

#

misty vault May 26, 2025, 8:24 AM

#

Add gpt-4 im not kidding

#

Me

rancid oasis May 26, 2025, 8:34 AM

#

What’s good frens

misty vault May 26, 2025, 8:37 AM

#

real

#

#

Just admit gpt 4o is cancer bro🙏

#

"You are too poor" sucking off gpt 4o and gemini 2.5 pro says enough sadboyo

tall summit May 26, 2025, 8:39 AM

#

yeah

#

ok

#

well

#

you can ask in lmarena

#

if it's just a simple prompt

#

and..

#

idk how much long context is but it might be fine

#

have you tried it

high ginkgo May 26, 2025, 8:41 AM

#

he's tricking u man

tall summit May 26, 2025, 8:41 AM

#

???

high ginkgo May 26, 2025, 8:42 AM

#

don't fall for it

#

your pc will turn into mining bot

tall summit May 26, 2025, 8:42 AM

#

real

#

lmao

high ginkgo May 26, 2025, 8:44 AM

#

he is going to drain all ur tokens for today with one prompt

#

it is trickery

golden ocean May 26, 2025, 8:45 AM

#

misty vault "You are too poor" sucking off gpt 4o and gemini 2.5 pro says enough sadboyo

LMAO

#

Claude 4 opus is struggling with one bug

high ginkgo May 26, 2025, 8:47 AM

#

😄😄😄😄😄

golden ocean May 26, 2025, 8:53 AM

#

golden ocean Claude 4 opus is struggling with one bug

I gave gemini 2.5 a try

#

And instead of failing to fix the bug it just destroyed the entire app

#

With an additional 100 lines of comments

#

And Claude 4 opus thinks for like 5 sentences and that's it, it's not even thinking, it's literally just repeating the task I gave it

quiet folio May 26, 2025, 8:58 AM

#

I can see R is on the middle side of the gaussian IQ chart 😄

misty vault May 26, 2025, 8:59 AM

#

golden ocean And instead of failing to fix the bug it just destroyed the entire app

average gemini 2.5 pro experience

misty vault May 26, 2025, 9:00 AM

#

golden ocean And Claude 4 opus thinks for like 5 sentences and that's it, it's not even thinking, it's...

yes its useless so far
It provided me with exact same results for js issue from non thinking and thinking and both failed

high ginkgo May 26, 2025, 9:01 AM

#

https://tenor.com/view/luigi-luigi's-mansion-luigi's-mansion-3-luigi-meme-meme-gif-185267789100543798

Tenor

#

golden ocean May 26, 2025, 9:04 AM

#

misty vault yes its useless so far It provided me with exact same results for js issue from ...

For real

golden ocean May 26, 2025, 9:04 AM

#

high ginkgo https://tenor.com/view/luigi-luigi%27s-mansion-luigi%27s-mansion-3-luigi-meme-me...

For real

#

AGI is cancelled.

quiet folio May 26, 2025, 9:10 AM

#

misty vault May 26, 2025, 9:17 AM

#

So does that imply that gpt 4.5 and claude 4 opus are on par

late path May 26, 2025, 9:17 AM

#

wdym

cedar tide May 26, 2025, 9:17 AM

#

misty vault So does that imply that gpt 4.5 and claude 4 opus are on par

Nope

cedar tide May 26, 2025, 9:17 AM

#

late path wdym

Its coming

#

The whale is back

misty vault May 26, 2025, 9:18 AM

#

cedar tide Nope

But the text means that

#

Sus

cedar tide May 26, 2025, 9:19 AM

#

misty vault But the text means that

we don't care about the text it's just unsloth

late path May 26, 2025, 9:28 AM

#

I think it's extremely unlikely for any company to catch up to the level of 2.5pro now. OpenAI and Anthropic have tried their best, but o3 only surpassed 2.5pro in specific areas, and opus still feels like a previous generation model.

misty vault May 26, 2025, 9:28 AM

#

Okay, I get it, the text is not accurate because it is not on par with gpt 4.5 and claude 4 opus at the same time. Then is it on par with at least one of them? Like 4.5 or claude 4 opus or just overhype and it ends up worse than both? I guess we don't know, but it looks like they are back, so let's hope they will be the next sota model, go deepseek! 🐳🥵

tall summit May 26, 2025, 9:29 AM

#

fake

misty vault May 26, 2025, 9:31 AM

#

I think it's real but who knows how well it actually performs

tall summit May 26, 2025, 9:34 AM

#

i hope

#

but i don't care until it releases

quiet folio May 26, 2025, 9:35 AM

#

Fr

misty vault May 26, 2025, 9:40 AM

#

cedar tide we don't care about the text it's just unsloth

we don't care until it releases

fleet lintel May 26, 2025, 9:45 AM

#

OpenAI must be cooking something. They were about 18 months ahead of Google about 18 months back (Gemini 1.0 launch). And they have huge talent and enough money to burn. I dont think they can squander away all that lead in such a small time. I think something big must be coming from them

cedar tide May 26, 2025, 9:49 AM

#

Well, sorry for sharing dubious information, after talking to the person behind the rumors it seems fake.

misty vault May 26, 2025, 9:55 AM

#

sadboyo

late path May 26, 2025, 9:57 AM

#

fleet lintel OpenAI must be cooking something. They were about 18 months ahead of Google abo...

no they dont

high ginkgo May 26, 2025, 10:00 AM

#

not true. megacorps like xai, openai and anthropic have agi and asi internally

misty vault May 26, 2025, 10:00 AM

#

high ginkgo not true. megacorps like xai, openai and anthropic have agi and asi internally

and most importantly gpt-4-0314 sadboyo

cedar tide May 26, 2025, 11:03 AM

#

late path no they dont

Screenshot_2025-05-26-13-02-33-957_com.twitter.android-edit.jpg

torn mantle May 26, 2025, 11:05 AM

#

late path no they dont

that was obv an april fool

torn mantle May 26, 2025, 11:05 AM

#

cedar tide

what do you think DavidSZD?

cedar tide May 26, 2025, 11:06 AM

#

I don't think Open AI is ahead

ocean vortex May 26, 2025, 11:09 AM

#

cedar tide

This would depend on the model and whatnot. But what he is referring to here is not a finished product. More like an experimental model that was not tuned or safety aligned yet. It's not only the latter though, meaning the product a user is gonna see could be better than what he has his hands on now.

#

Looking at his resume he doesn't look like a very technical person either tbh. All his roles were product manager. So not an ML Engineer and I doubt he's in the loop on the training or differences in all the models 👀

flat flax May 26, 2025, 11:49 AM

#

Hey everyone, I wrote an article on reasoning. I'd really appreciate it if you could give it a quick read and share your feedback. : )
https://x.com/LuozhuZhang/status/1926955069083107728

Luozhu (@LuozhuZhang)

An Easy Way to Copy Human Reasoning
https://t.co/NUYPR31q96

cedar tide May 26, 2025, 12:12 PM

#

Yes

patent aspen May 26, 2025, 12:12 PM

#

The easiest way to make comparisons is just to look at the pace of improvement of released models

high ginkgo May 26, 2025, 12:31 PM

#

patent aspen The easiest way to make comparisons is just to look at the pace of improvement o...

Thanks I couldn't think of that myself

fleet lintel May 26, 2025, 12:37 PM

#

3.5 today? I am already prepared to be disappointed.

sonic tendon May 26, 2025, 12:48 PM

#

where are you getting this, out of curiosity?

#

is it just researcher tweets

misty vault May 26, 2025, 12:51 PM

#

sonic tendon where are you getting this, out of curiosity?

sonic tendon May 26, 2025, 12:52 PM

#

misty vault

ah, understandable

brittle tiger May 26, 2025, 1:04 PM

#

patent aspen The easiest way to make comparisons is just to look at the pace of improvement o...

This is really due to bad OpenAI naming but still a funny fact: Google went from bard to 2.5 pro in between the release dates of GPT 4 and 4.1

sonic tendon May 26, 2025, 1:09 PM

#

lmaoo

sonic tendon May 26, 2025, 1:13 PM

#

high ginkgo not true. megacorps like xai, openai and anthropic have agi and asi internally

baseless claim 💯

misty vault May 26, 2025, 1:15 PM

#

sonic tendon baseless claim 💯

bro you're more robot than he is for thinking he is serious

golden ocean May 26, 2025, 1:15 PM

#

LMARena rage baiting or trolling is too easy

sonic tendon May 26, 2025, 1:16 PM

#

misty vault bro you're more robot than he is for thinking he is serious

i was being semi-ironic

misty vault May 26, 2025, 1:16 PM

#

🚛

sonic tendon May 26, 2025, 1:16 PM

#

🫡

#

ello

#

baseless claim 💯

quiet folio May 26, 2025, 1:18 PM

#

I smiled until I read 💯

tall summit May 26, 2025, 1:21 PM

#

misty vault

his girlfriend is what

misty vault May 26, 2025, 1:21 PM

#

@sonic tendon can relate

sonic tendon May 26, 2025, 1:27 PM

#

it's roughly 50-50 within may now

#

well

#

that was mostly me

#

low liquidity, shouldn't have bet as much as I did

#

should I make another relevant AI news thread?

#

relevant AI news (version 2)

tall summit May 26, 2025, 1:40 PM

#

sonic tendon should I make another relevant AI news thread?

thank you

#

ooooh you made the poll

#

time to vote

#

wow 2k mana free per alt you make and refer

ocean vortex May 26, 2025, 1:50 PM

#

it was underperforming considering current models but it was still great at the time. We didn't have any real alternatives for raw thinking output when this was just released

tall summit May 26, 2025, 1:51 PM

#

tall summit wow 2k mana free per alt you make and refer

thats 20$ worth ??

ocean vortex May 26, 2025, 1:52 PM

#

2.5pro wouldn't exist or wouldn't be nearly as good if they hadn't made flash-thinking earlier as well

cedar tide May 26, 2025, 1:59 PM

#

I hope that the "request models" category is not just there to look good but that they will add the models that the community requests 😶

Screenshot_2025-05-26-15-57-51-039_com.discord-edit.jpg

ocean vortex May 26, 2025, 2:03 PM

#

new Deepseek maybe today

tall summit May 26, 2025, 2:03 PM

#

too many rationalists on this platform (manifold)

tall summit May 26, 2025, 2:03 PM

#

ocean vortex new Deepseek maybe today

NO

ocean vortex May 26, 2025, 2:04 PM

#

tall summit NO

?

#

this is not a joke, it's a real thing. Well a real rumor at least lol

tall summit May 26, 2025, 2:06 PM

#

ocean vortex this is not a joke, it's a real thing. Well a real rumor at least lol

it is a rumor

#

https://mathstodon.xyz/@tao/114508029896631083

Terence Tao (@tao@mathstodon.xyz)

I've been working (together with Javier Gomez-Serrano) with a group at Google Deepmind to explore potential mathematical applications of their tool "AlphaEvolve", a successor of their earlier tool "Funsearch" that was publicly announced today: deepmind.google/discover/blog/… . Very roughly speaking, this is a tool that can attempt to extremize functions F(x) with x ranging over a high dimensional parameter space Omega, that can outperform more traditional optimization algorithms when the parameter space is very high dimensional and the function F (and its extremizers) have non-obvious structural features.

Some of the preliminary problems we have tried this on, including problems involving harmonic analysis inequalities, additive combinatorics, and packing, were already mentioned in the announcement; we are now gradually moving on to more challenging problems where the parameter space has a sparser set of good solutions. The work is still ongoing, but I hope to be able to re…

ocean vortex May 26, 2025, 2:13 PM

#

Dork ASI confirmed 👍

#

Probably only gonna be available to SuperDork subscribers though

harsh flume May 26, 2025, 2:44 PM

#

Is there any potential grok-esque model in the arena rn?

vernal meadow May 26, 2025, 2:50 PM

#

What temp does lmarena use for the models?

ocean vortex May 26, 2025, 2:52 PM

#

vernal meadow What temp does lmarena use for the models?

it varies by model. I don't think this is just a random example, it roughly aligns with what they use, but it hasn't been updated for ages:
https://github.com/lmarena/p2l/blob/main/route/example_config.yaml

GitHub

p2l/route/example_config.yaml at main · lmarena/p2l

Prompt-to-Leaderboard. Contribute to lmarena/p2l development by creating an account on GitHub.

fleet lintel May 26, 2025, 3:07 PM

#

tall summit https://mathstodon.xyz/@tao/114508029896631083

TLDR? (more like TSDU, too stupid didn't understand)

tall summit May 26, 2025, 3:09 PM

#

fleet lintel TLDR? (more like TSDU, too stupid didn't understand)

terence tao (one of the world's foremost mathematicians) has been working with Google Deepmind regarding AlphaEvolve, the AI model which recently famously made genuine mathematical discoveries

fleet lintel May 26, 2025, 3:17 PM

#

Terence Tao is legit! Alphaevolve may not generate much news but could play a huge role in advancing humanity

olive mesa May 26, 2025, 3:46 PM

#

Do y'all think there are employees from different big companies watching this chat to see people's opinions on their models beuh

jade egret May 26, 2025, 3:52 PM

#

hello

hybrid gate May 26, 2025, 3:53 PM

#

i know claude 4 just released but when does new AI usually show up on the leaderboards?

feral creek May 26, 2025, 3:55 PM

#

yea thats a good question im curious too

sour spindle May 26, 2025, 4:04 PM

#

olive mesa Do y'all think there are employees from different big companies watching this ch...

I think it’s obvious folks at these companies take the lmarena leaderboard very seriously

#

There’s a simplicity to it that makes it very persuasive to users

fleet lintel May 26, 2025, 4:06 PM

#

olive mesa Do y'all think there are employees from different big companies watching this ch...

nope

jade egret May 26, 2025, 4:09 PM

#

ocean vortex May 26, 2025, 5:31 PM

#

jade egret

if it's o3 on chatgpt website then there is no competition at all lol. It's way more effective at using python than other models to compute/verify and that's in addition to it already being very strong offline

torn mantle May 26, 2025, 5:45 PM

#

xai is cooking you up

#

??

#

?!

tall summit May 26, 2025, 6:01 PM

#

sour spindle I think it’s obvious folks at these companies take the lmarena leaderboard very ...

which is nice

calm sequoia May 26, 2025, 6:05 PM

#

olive mesa Do y'all think there are employees from different big companies watching this ch...

Grok is 💩

fleet lintel May 26, 2025, 6:19 PM

#

"UAE gives all 11M citizens free ChatGPT Plus".

Very interesting

misty vault May 26, 2025, 6:24 PM

#

this is actually real

narrow elbow May 26, 2025, 6:26 PM

#

olive mesa Do y'all think there are employees from different big companies watching this ch...

drooling aliens from Perseus Constellation are watching your chat🤪

misty vault May 26, 2025, 6:28 PM

#

LMfAo

small haven May 26, 2025, 7:13 PM

#

fleet lintel "UAE gives all 11M citizens free ChatGPT Plus". Very interesting

uae is slopmaxxing

#

can july come any sooner, me wants deepthink

small haven May 26, 2025, 7:32 PM

#

shh deepthink gets released -> o3 pro releases in tandem 🧠

sonic tendon May 26, 2025, 7:35 PM

#

ocean vortex new Deepseek maybe today

ayo?

#

this actually looks legit given how badly the unsloth guy is trying to cover

#

https://www.reddit.com/r/LocalLLaMA/comments/1kvpwq3/deepseek_v3_0526/

From the LocalLLaMA community on Reddit: Deepseek v3 0526?

Explore this post and more from the LocalLLaMA community

keen beacon May 26, 2025, 7:36 PM

#

they were basing it off this: https://x.com/YouJiacheng/status/1926885863952159102 (apparently, they said this in the unsloth discord)

You Jiacheng (@YouJiacheng)

there was a second I saw DeepSeek-V3-0526 in changelog, and then it disappeared.

#

i don't think its happening btw

golden ocean May 26, 2025, 7:41 PM

#

We know

misty vault May 26, 2025, 7:41 PM

#

keen beacon they were basing it off this: https://x.com/YouJiacheng/status/1926885863952159...

thanks I didn't see it earlier in this chat

keen beacon May 26, 2025, 7:42 PM

#

misty vault thanks I didn't see it earlier in this chat

np since no one else posted it earlier 🙂

ocean vortex May 26, 2025, 7:42 PM

#

yeah at this point it's too late in the day

keen beacon May 26, 2025, 7:42 PM

#

only the unsloth article which is based on this information

ocean vortex May 26, 2025, 7:42 PM

#

and there was also this tweet

#

https://fixvx.com/teortaxesTex/status/1926994950278807565?t=hgeNgGBqE8lggx2VSqLnFA&s=19

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxe...

Got a response on “DeepSeek-V3-0526”
No Opus 4 at home for us today
Interpret it however you like https://t.co/Puyp6feUu8

#

someone associated with Deepseek replying I presume but I didn't have time to verify it tbh

misty vault May 26, 2025, 7:46 PM

#

#

ocean vortex May 26, 2025, 7:46 PM

#

Opus is a bit of a weird model btw. Really quite unusual how they couldn't showcase anything other than swe essentially. But it does hold up when you test it and looks unique and quite capable 🤷‍♂️

golden ocean May 26, 2025, 7:46 PM

#

keen beacon np since no one else posted it earlier 🙂

Someone actually did

keen beacon May 26, 2025, 7:46 PM

#

misty vault

mb the link was different

#

i didnt see it

ocean vortex May 26, 2025, 7:47 PM

#

In my eyes I almost wrote it off completely after I saw their benchmark manipulation with parallel processing lol

#

but it actually seems good

keen beacon May 26, 2025, 7:47 PM

#

golden ocean Someone actually did

but the new piece of info is that the unsloth article was based on it

misty vault May 26, 2025, 7:48 PM

#

keen beacon May 26, 2025, 7:49 PM

#

misty vault

nope

misty vault May 26, 2025, 7:49 PM

#

@keen beacon yep

keen beacon May 26, 2025, 7:49 PM

#

misty vault @keen beacon yep

mike posted it 6 hours later

#

saying it was based on that

quiet folio May 26, 2025, 7:50 PM

#

misty vault

This message and then the x link implied to me that unsloth based it off that before wild posted it not gonna lie

keen beacon May 26, 2025, 7:50 PM

#

quiet folio This message and then the x link implied to me that unsloth based it off that be...

it implied but they admitted to it after the fact

sonic tendon May 26, 2025, 7:50 PM

#

ocean vortex yeah at this point it's too late in the day

well

golden ocean May 26, 2025, 7:50 PM

#

I think its obvious that unsloth based it off that

keen beacon May 26, 2025, 7:50 PM

#

golden ocean I think its obvious that unsloth based it off that

they had access to qwen 3 early

misty vault May 26, 2025, 7:51 PM

#

who

keen beacon May 26, 2025, 7:51 PM

#

unsloth

misty vault May 26, 2025, 7:51 PM

#

cares

sonic tendon May 26, 2025, 7:51 PM

#

hmm, wasn't 0324 released on the 25th

keen beacon May 26, 2025, 7:51 PM

#

u r the one continuing it tho

misty vault May 26, 2025, 7:51 PM

#

who cares about gooning with gpt 4o

ocean vortex May 26, 2025, 7:51 PM

#

ocean vortex In my eyes I almost wrote it off completely after I saw their benchmark manipula...

Anthropic is weird in how they are extremely ethical but at the same time they aren't.

sonic tendon May 26, 2025, 7:51 PM

#

misty vault who cares about gooning with gpt 4o

me

#

i care

ocean vortex May 26, 2025, 7:52 PM

#

just selectively ethical I suppose. 👀

sonic tendon May 26, 2025, 7:52 PM

#

ocean vortex https://fixvx.com/teortaxesTex/status/1926994950278807565?t=hgeNgGBqE8lggx2VSqLn...

awh

#

yeah I'm giving up on speculation for today

#

they seem to imply no access to insider info outside of that

#

i was wrong about Claude, tbf, so take my opinion with a grain of salt

#

but I think it could be real

ocean vortex May 26, 2025, 7:56 PM

#

Deepseek is very very strict with "confidential info" lately yeah. And it's China we are talking about so consequences are different lmao

misty vault May 26, 2025, 7:56 PM

#

sonic tendon i was wrong about Claude, tbf, so take my opinion with a grain of salt

will do

sonic tendon May 26, 2025, 7:56 PM

#

ocean vortex Deepseek is very very strict with "confidential info" lately yeah. And it's Chin...

iirc unsloth had early access to qwen3

high ginkgo May 26, 2025, 7:56 PM

#

grok 3.5

sonic tendon May 26, 2025, 7:56 PM

#

you could be right, tho

sonic tendon May 26, 2025, 7:57 PM

#

high ginkgo grok 3.5

seems possible as well

keen beacon May 26, 2025, 7:57 PM

#

unsloth had early access to qwen 3 (which might imply they might have insider info about deepseek) but mike from unsloth said they just based it on that specific tweet

sonic tendon May 26, 2025, 7:57 PM

#

I guess the chief question is "did unsloth actually base the article on speculation or were they just trying to cover their ass"

#

i lean slightly towards the latter, but tomorrow will tell

#

mildly sus but plausibly deniable

#

"is now the best performing open-source model in the world" is quite implausible as a copy-paste

unborn ocean May 26, 2025, 8:03 PM

#

true, i get that they would want a template for release

#

but the highly specific information is really sus

sonic tendon May 26, 2025, 8:04 PM

#

ik people dissed on speculation earlier, but, honestly, I find this pretty fun

unborn ocean May 26, 2025, 8:04 PM

#

and they are not even really denying much, they are actually saying a release is very likely right now

sonic tendon May 26, 2025, 8:04 PM

#

deepseek cannot hide from me

unborn ocean May 26, 2025, 8:04 PM

#

I mean speculation about models and model releases is like our 'specialty' in this chat

#

so...

sonic tendon May 26, 2025, 8:07 PM

#

yeahhh

balmy mist May 26, 2025, 9:03 PM

#

what i miss?

#

what happened to relevant news thread?

upbeat zealot May 26, 2025, 9:27 PM

#

any idea when claude 4 will be released in lmarena? it's been several days since release

small haven May 26, 2025, 9:49 PM

#

i love hallucinations

sonic tendon May 26, 2025, 9:54 PM

#

balmy mist what happened to relevant news thread?

not exactly sure, someone nuked it

#

I made #1376555010820931675

vernal meadow May 26, 2025, 10:03 PM

#

Sonnet 4 sux so much it is sad. bad logic, bad math and always refusing to answer for stupid reasons.

It might be an agentic coding god but certainly not a good chat model

#

I would not be surprised if it is even lower than 3.7 in lmarena.

patent aspen May 26, 2025, 10:24 PM

#

Claude finally made it out of Mt Moon

balmy mist May 26, 2025, 10:29 PM

#

sonic tendon not exactly sure, someone nuked it

wow

#

thanks tho

jade egret May 26, 2025, 10:49 PM

#

ocean vortex if it's o3 on chatgpt website then there is no competition at all lol. It's way ...

true, but o3 is so expensive 😭

#

flow

cerulean seal May 26, 2025, 10:51 PM

#

hi guys, this website is free?

cerulean seal May 26, 2025, 10:52 PM

#

upbeat zealot any idea when claude 4 will be released in lmarena? it's been several days since...

isnt claude 4 opus here

misty vault May 26, 2025, 10:53 PM

#

cerulean seal isnt claude 4 opus here

yes

misty vault May 26, 2025, 10:53 PM

#

cerulean seal hi guys, this website is free?

yes

cerulean seal May 26, 2025, 10:54 PM

#

how do they pay for that

#

wtf

upbeat zealot May 26, 2025, 10:54 PM

#

cerulean seal isnt claude 4 opus here

yeah i found it in the beta but not the old site

cerulean seal May 26, 2025, 10:54 PM

#

how do they do thatt😭

upbeat zealot May 26, 2025, 10:54 PM

#

still its not even on the leaderboards

misty vault May 26, 2025, 10:54 PM

#

they dont

cerulean seal May 26, 2025, 10:54 PM

#

claude 4 opus is amazing at coding

cerulean seal May 26, 2025, 10:54 PM

#

misty vault they dont

pirating ai?

misty vault May 26, 2025, 10:54 PM

#

they get sponsored

cerulean seal May 26, 2025, 10:54 PM

#

wont this get taken down soon

misty vault May 26, 2025, 10:54 PM

#

by all the companies

cerulean seal May 26, 2025, 10:54 PM

#

OH!

#

so

#

what happens if the season ends

#

will the ai go away

#

😞

#

i dont wanna pay $100 a month for claude bruh

misty vault May 26, 2025, 10:55 PM

#

if the organization decides to remove lmarenas access to a certain model then yes

#

for older models

cerulean seal May 26, 2025, 10:55 PM

#

oh

#

thats so cool

#

i wish they can add deep research

#

i thought the website was pirating AI

#

i didnt know that was competition

misty vault May 26, 2025, 10:56 PM

#

me when gpt-4-0314

cerulean seal May 26, 2025, 10:56 PM

#

misty vault me when gpt-4-0314

?

#

gpt-4 exists?

jade egret May 26, 2025, 10:56 PM

#

hi

cerulean seal May 26, 2025, 10:56 PM

#

jade egret hi

hi

#

whats the oldest model in the website

jade egret May 26, 2025, 10:58 PM

#

what is this

cerulean seal May 26, 2025, 10:59 PM

#

@misty vault bruhh i just saw that the opus 4 stops at a random part of coding

#

so it cant be abused

#

noo

misty vault May 26, 2025, 10:59 PM

#

just say "continue"

cerulean seal May 26, 2025, 10:59 PM

#

fr?

keen beacon May 26, 2025, 10:59 PM

#

jade egret what is this

thats old

misty vault May 26, 2025, 10:59 PM

#

yes

cerulean seal May 26, 2025, 10:59 PM

#

that easy?

misty vault May 26, 2025, 10:59 PM

#

yes

cerulean seal May 26, 2025, 10:59 PM

#

ill tell it to stop in batches and continue

jade egret May 26, 2025, 10:59 PM

#

keen beacon thats old

oh

#

is it gemini 2.5 pro I/O

cerulean seal May 26, 2025, 11:00 PM

#

cause i was in the middle of a game

#

and then it cancelled

jade egret May 26, 2025, 11:00 PM

#

misty vault just say "continue"

the website crashed for me cuz i got like 2,000 lines of code

keen beacon May 26, 2025, 11:00 PM

#

jade egret is it gemini 2.5 pro I/O

no that was claybrook i think

jade egret May 26, 2025, 11:00 PM

#

for some reason it crashed

jade egret May 26, 2025, 11:00 PM

#

keen beacon no that was claybrook i think

o

misty vault May 26, 2025, 11:00 PM

#

jade egret the website crashed for me cuz i got like 2,000 lines of code

average beta ui experience

keen beacon May 26, 2025, 11:00 PM

#

there are much better 2.5 pro models in the arena right now

cerulean seal May 26, 2025, 11:01 PM

#

i feel like LMArena is so abusable

jade egret May 26, 2025, 11:01 PM

#

keen beacon no that was claybrook i think

is it good?

cerulean seal May 26, 2025, 11:01 PM

#

i dont like how features are limited

keen beacon May 26, 2025, 11:01 PM

#

jade egret is it good?

have u tried the i/o edition 2.5 pro lol

cerulean seal May 26, 2025, 11:01 PM

#

i might just buy the $10 monthly for opus 4

misty vault May 26, 2025, 11:01 PM

#

Idk they secured it pretty well

jade egret May 26, 2025, 11:01 PM

#

misty vault average beta ui experience

and the gemini canvas crashed too

jade egret May 26, 2025, 11:01 PM

#

keen beacon have u tried the i/o edition 2.5 pro lol

yea

keen beacon May 26, 2025, 11:01 PM

#

yeah u can evaluate it yourself

cerulean seal May 26, 2025, 11:02 PM

#

whats the best AI to use rn?

#

for coding

misty vault May 26, 2025, 11:02 PM

#

for coding claude 4 opus

jade egret May 26, 2025, 11:03 PM

#

yea

#

but you gotta keep saying continue

cerulean seal May 26, 2025, 11:04 PM

#

20$ a month..

patent aspen May 26, 2025, 11:05 PM

#

Jules is free

cerulean seal May 26, 2025, 11:10 PM

#

patent aspen Jules is free

jules isnt good

misty vault May 26, 2025, 11:11 PM

#

dork 4.0 is agi

jade egret May 26, 2025, 11:11 PM

#

misty vault dork 4.0 is agi

acctually?

misty vault May 26, 2025, 11:12 PM

#

yes ask @deep adder

patent aspen May 26, 2025, 11:12 PM

#

When will this joke die?

jade egret May 26, 2025, 11:12 PM

#

@deep adder

jade egret May 26, 2025, 11:12 PM

#

patent aspen When will this joke die?

oh

#

is drok 4.0 agi (like acctually)

golden ocean May 26, 2025, 11:13 PM

#

Yes

jade egret May 26, 2025, 11:13 PM

#

bro acctualy fr?

misty vault May 26, 2025, 11:13 PM

#

dork*

jade egret May 26, 2025, 11:13 PM

#

dork

#

eh i dont think it is

#

💀

misty vault May 26, 2025, 11:15 PM

#

gpt-4-preview-0314 was agi

jade egret May 26, 2025, 11:15 PM

#

misty vault gpt-4-preview-0314 was agi

bro

torn mantle May 26, 2025, 11:44 PM

#

sup craig

#

our asi

patent aspen May 26, 2025, 11:45 PM

#

What is the average age of this channel?

#

12?

torn mantle May 26, 2025, 11:46 PM

#

are you sure im a man

#

you sound so sure

patent aspen May 26, 2025, 11:48 PM

#

It's the gen Z / gen alpha slang

#

And the jokes

#

Makes me think this is a younger server

#

?

#

I wasn't saying it was bad

torn mantle May 27, 2025, 12:10 AM

#

< 20

#

Spill the tea brian

patent aspen May 27, 2025, 12:11 AM

#

32

torn mantle May 27, 2025, 12:11 AM

#

That's fine

patent aspen May 27, 2025, 12:12 AM

#

Thanks

wintry tinsel May 27, 2025, 12:12 AM

#

patent aspen It's the gen Z / gen alpha slang

Young people check more frequently, I bet there’s a lot of older folks on here who won’t see the poll because they don’t constantly check

noble zinc May 27, 2025, 2:02 AM

#

jade egret

how are people voting for apple wtf

small haven May 27, 2025, 2:06 AM

#

apple insiders

small haven May 27, 2025, 3:30 AM

#

craig im trying to like claude code, but wtf ..

#

smh

#

aint they owned by amazon

#

or partially

#

i mean to buy anthropic off, minimum $100b

#

dont think theyll accept fair value

#

ya minimum $100b

#

private equity?

#

$61.5b is based on the funding round theyve raised, its not really a stock market, just a gauge

#

series e1 theyre very late stage

#

and u didnt know about their funding round smh

#

i mean the only thing they have is siri

solar hollow May 27, 2025, 4:14 AM

#

isnt anthropic bound to amazon somehow though?

patent aspen May 27, 2025, 4:52 AM

#

Anthropic has major deals with Amazon and Google and partners with other big tech companies. The deals with Amazon and Google couldn't be exclusive because of antitrust oversight from the FTC. For this reason it's also highly unlikely that Apple could acquire Anthropic in the current regulatory environment.

#

If you're a big tech company, you're not supposed to acquire or make exclusive agreements with nascent companies that have the potential to become a significant competitor in the future

#

Big tech companies have tried to get around that with special deals that are like pseudo-acquisitions, but even those deals have faced heavy scrutiny from regulators

sonic tendon May 27, 2025, 5:32 AM

#

jade egret what is this

hard to say for sure, but people suspect that it (dragonclaw) and redsword are the non-preview versions of gemini 2.5 pro and flash

late path May 27, 2025, 5:45 AM

#

dragonclaw is probably an old 2.5 pro checkpoint, no longer in the arena now

elder rapids May 27, 2025, 5:47 AM

#

late path dragonclaw is probably an old 2.5 pro checkpoint, no longer in the arena now

drakeclaw was a pretty strange model tbh

#

it was pretty smart

#

but it was like a strong model gone wrong

#

like there was something off about it

#

it didn't know how to spell lmfao

#

insane syntactic errors

keen beacon May 27, 2025, 5:48 AM

#

elder rapids it didn't know how to spell lmfao

It might be the rl

#

O3 does some really strange stuff which remind me of that

elder rapids May 27, 2025, 5:49 AM

#

ye

#

btw sometimes it glitches out

keen beacon May 27, 2025, 5:49 AM

#

2.5 pro doesn't seem to be plagued by those problems, at least the released versions (in a very visible way)

elder rapids May 27, 2025, 5:49 AM

#

when talking to o3

#

it seems like o3's CoT is a pseudo tree

#

even though it's single

#

it kept telling me multiple revisions through an A B C process

#

and forgetting which one it was assigning to the context

keen beacon May 27, 2025, 5:52 AM

#

i dont really understand what ur trying to say

elder rapids May 27, 2025, 5:55 AM

#

keen beacon i dont really understand what ur trying to say

ngl

#

aren't we in an AI server

#

just use chatgpt

keen beacon May 27, 2025, 5:57 AM

#

others dont seem to get it sometimes from what ive seen. and i doubt the models would do that well without your context

elder rapids May 27, 2025, 5:58 AM

#

keen beacon others dont seem to get it sometimes from what ive seen. and i doubt the models ...

don't seem to get what?

keen beacon May 27, 2025, 5:58 AM

#

elder rapids don't seem to get what?

what ur trying to say

elder rapids May 27, 2025, 5:58 AM

#

no one else is here

#

but if you mean it's a trend

keen beacon May 27, 2025, 5:58 AM

#

im talking about past conversations

elder rapids May 27, 2025, 5:59 AM

#

benefit of the doubt it isn't inherently loaded and you take it as is without accusing it of sophistry

#

so if I'm invoking 3rd party Interpretation (which inherently would lack ALL context besides the claims) that should speak volumes in what I'm trying to say regardless

#

just take it as is, ion know what else to say

fleet lintel May 27, 2025, 8:04 AM

#

what are the latest un-released good models on LMArena?

keen beacon May 27, 2025, 8:05 AM

#

fleet lintel what are the latest un-released good models on LMArena?

goldmane and redsword

fleet lintel May 27, 2025, 8:05 AM

#

I think goldmane is gemini.. what about redsword?

calm sequoia May 27, 2025, 8:06 AM

#

Also gemini

#

Have you guys checked the cutoff date of these models?

keen beacon May 27, 2025, 8:08 AM

#

fleet lintel I think goldmane is gemini.. what about redsword?

both are gemini and i believe to be 2.5 pro variations

#

one of them will be ga 2.5 pro i think (best one will be chosen)

fleet lintel May 27, 2025, 8:09 AM

#

small incremental improvements over current 2.5 pro or decently big improvements?

keen beacon May 27, 2025, 8:09 AM

#

fleet lintel small incremental improvements over current 2.5 pro or decently big improvements...

people love it

#

people say its better than nightwhisper based on the posts in this channel

fleet lintel May 27, 2025, 8:09 AM

#

keen beacon people say its better than nightwhisper based on the posts in this channel

oh... that would be amazing

keen beacon May 27, 2025, 8:11 AM

#

calm sequoia Have you guys checked the cutoff date of these models?

i dont think they did more continued pretraining [and etc] (probably nothing big like that until gemini 3 i guess) but might be good to check. i cba to do so rn tho, the models arent far from release i guess

late path May 27, 2025, 8:14 AM

#

rumor has it that the current gemini 2.5 is actually gemini 3 internally lol

keen beacon May 27, 2025, 8:15 AM

#

late path rumor has it that the current gemini 2.5 is actually gemini 3 internally lol

i doubt that. pretraining knowledge/cut off and timelines/etc dont really make sense for that to be the case, but idk

calm sequoia May 27, 2025, 8:22 AM

#

calm sequoia

Poll: "Best general purpose LLM in 2025 yet". Winning answer (#3): Gemini 2.5 Pro 03-25, with 7 of 20 votes.

#

Goldmane October 2024

#

It's interesting that it answers differently if you ask for the date in different language

#

Redsword June 1, 2024

keen beacon May 27, 2025, 8:31 AM

#

calm sequoia It's interesting that it answers differently if you ask for the date in differen...

if youre asking for the knowledge cut off directly, its likely to be a hallucination

#

(if its not trained in or provided in the system prompt)

#

but it is interesting nonetheless

calm sequoia May 27, 2025, 8:32 AM

#

Yeah I guess it takes too much time to check. At least they don't know what happened in 2025

#

Lol GPT 4.1

#

The original 2.5 PRO is always off the charts when pricing is included.

#

Also MCbench update

ocean vortex May 27, 2025, 8:51 AM

#

calm sequoia Lol GPT 4.1

something is very wrong with this graph. Opus below Sonnet? Deepseek V3.1 lower than 4.1-nano?? lmao

late path May 27, 2025, 8:51 AM

#

I think goldmane will be better than 0325

ocean vortex May 27, 2025, 8:54 AM

#

We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined.

Ok so they used saturated outdated benchmarks catgrin

calm sequoia May 27, 2025, 8:55 AM

#

ocean vortex > We use data from n > 5000 LLMs to identify the most informative items of six b...

Can't really be saturated if the average is <50%. Do you have a full list?

calm sequoia May 27, 2025, 8:56 AM

#

ocean vortex something is very wrong with this graph. Opus below Sonnet? Deepseek V3.1 lower ...

Agreed. MC bench:

ocean vortex May 27, 2025, 8:56 AM

#

calm sequoia Can't really be saturated if the average is <50%. Do you have a full list?

I mean they probably cherry picked the hardest prompts, but still. Those results they got tell me something is not quite right with their approach

cedar tide May 27, 2025, 8:58 AM

#

calm sequoia Also MCbench update

What's the link or source pls

calm sequoia May 27, 2025, 8:58 AM

#

Don't have experience with claude 4, but if you remove the 4.1 Nano it seems good.

calm sequoia May 27, 2025, 8:58 AM

#

calm sequoia Don't have experience with claude 4, but if you remove the 4.1 Nano it seems go...

https://mcbench.ai

MC-Bench

Evaluating AI with Minecraft

keen beacon May 27, 2025, 8:59 AM

#

lmao theres deepseek prover on mcbench?

ocean vortex May 27, 2025, 9:00 AM

#

the thing to keep in mind also: if you're gonna use old benchmarks like that, contamination is likely gonna be a bigger problem for labs that have been doing this for a long time than for relatively new players who got to it after people moved on to other, less saturated (by default) metrics.

#

or just older models vs new, depending how much they changed their datasets

cedar tide May 27, 2025, 9:01 AM

#

calm sequoia https://mcbench.ai

Thx

cedar tide May 27, 2025, 9:01 AM

#

calm sequoia Lol GPT 4.1

And this ?

calm sequoia May 27, 2025, 9:01 AM

#

Can't find it anymore. Ask Dom

ocean vortex May 27, 2025, 9:03 AM

#

keen beacon lmao theres deepseek prover on mcbench?

deepseek beats o3 and chatgpt-latest lmao

#

look at qwen

ocean vortex May 27, 2025, 9:08 AM

#

calm sequoia Can't find it anymore. Ask Dom

I don't see that leaderboard, it's you who posted it. I have no idea where you got it from lol

#

this doesn't seem to be included in their paper

#

this is their eval: https://huggingface.co/datasets/HCAI/metabench/viewer
Looks solid at first glance, but once again... Some of those questions were in datasets a long time ago I think

HCAI/metabench · Datasets at Hugging Face

calm sequoia May 27, 2025, 9:14 AM

#

https://docs.google.com/spreadsheets/d/1Dy64rbMzx5xqTLPsbTKhpUKQS0mvjns2nIS9BWvOCTU/edit?gid=839375018#gid=839375018

Google Docs

Unified-Bench for Top 20 LLMs

#

https://x.com/cedric_chee/status/1926680392251138480

cedric (@cedric_chee)

The new Unified-Bench is here. Compare the performance and cost of 20+ LLMs, averaged across 25 diverse benchmarks! Thanks to @E_Ellipsis, you can now compare Gemini 2.5 Pro (03-25), 2.5 Flash (05-20), and even Claude 4 models with o3, o4-mini, Qwen3, Sonnet 3.7, and more. 🧵

o3

keen beacon May 27, 2025, 9:15 AM

#

metabench seems to be more interesting than expected

ocean vortex May 27, 2025, 9:18 AM

#

calm sequoia https://docs.google.com/spreadsheets/d/1Dy64rbMzx5xqTLPsbTKhpUKQS0mvjns2nIS9BWvO...

Interesting, but this is not meta-bench...? 🧐

#

some scores are oddly referenced, like they referenced gpt4o AIME25 against Claude parallel processing one with majority voting? hm

#

em.. wtf lol

#

I suppose it makes sense considering it's so concise with reasoning, but this is completely opposite to Anthropic's table... Overfitted on newer AIME25? catgrin

keen beacon May 27, 2025, 9:31 AM

#

ocean vortex I suppose it makes sense considering it's so concise with reasoning, but this is...

its artificialanalysis's benchmark harness

ocean vortex May 27, 2025, 9:32 AM

#

keen beacon its artificialanalysis's benchmark harness

yeah but it checks out most of the time with official numbers +/- small discrepancies

#

this is a HUGE discrepancy

keen beacon May 27, 2025, 9:32 AM

#

ocean vortex this is a HUGE discrepancy

yup their claude 4 measurements are like that

#

messed up

keen beacon May 27, 2025, 9:33 AM

#

ocean vortex yeah but it checks out most of the time with official numbers +/- small discrepa...

claude 4 sonnet non thinking has a higher gpqa diamond than claude 4 sonnet thinking there iirc

#

well it was like that

ocean vortex May 27, 2025, 9:33 AM

#

keen beacon claude 4 sonnet non thinking has a higher gpqa diamond than claude 4 sonnet thin...

GPQA is fine for Opus though, like it's barely behind o3 there in their testing

keen beacon May 27, 2025, 9:33 AM

#

they remeasured

keen beacon May 27, 2025, 9:34 AM

#

ocean vortex GPQA is fine for Opus though, like it's barely behind o3 there in their testing

yeah theres just something wrong with it

#

theyve been trying to fix it i think

#

#general message <-- claude 4 sonnet thinking having a lower score than claude 4 sonnet before they remeasured

ocean vortex May 27, 2025, 9:38 AM

#

Opus is giving off slight gpt4.5 vibes of it being outperformed by smaller models in places regardless tbh, although it's a more capable model now. I think they could do more RL training on it

keen beacon May 27, 2025, 9:39 AM

#

Not worth it imo

ocean vortex May 27, 2025, 9:42 AM

#

can't see how it wouldn't lead to higher scores. I think GPQA could be improved just with a sys prompt. That benchmark seems to favor longer outputs a lot. And considering the size there should be more gains than long outputs from smaller one

keen beacon May 27, 2025, 9:43 AM

#

ocean vortex can't see how it wouldn't lead to higher scores. I think GPQA could be improved ...

Rl would lead to higher scores but it's just probably hard to work with a model that size

#

They should focus on making another 3.5 sonnet

ocean vortex May 27, 2025, 9:45 AM

#

keen beacon They should focus on making another 3.5 sonnet

there's no obvious way forward with that though I don't think. Especially since they seem decided on hybrid reasoning. They could only like redo 3.7 or 4.0 which would be a boring release

#

unless like train Sonnet non-reasoning on final Opus/Sonnet-thinking outputs. But at that point I would just do both things - this and further RL training on Opus. To see which option is the more promising etc

calm sequoia May 27, 2025, 9:52 AM

#

ocean vortex Interesting, but this is not meta-bench...? 🧐

Wdym, it's from their table. Or did you have other benchmark in mind?

ocean vortex May 27, 2025, 9:55 AM

#

calm sequoia Wdym, it's from their table. Or did you have other benchmark in mind?

In a tweet he calls it "Unified-Bench 1.9", metabench paper also mentions other benchmarks than the ones shown in that google drive link, but yeah the chart says "metabanch". Dunno it's confusing lmao

#

oh wait. They probably just used the same method but picked their own different benchmarks to get subsets from. Yeah I was just looking at this wrong lol

keen beacon May 27, 2025, 10:00 AM

#

ocean vortex oh wait. They probably just used the same method but picked their own different ...

this just seems to be a compilation of benchmarks

#

the meta bench thing is coincidental

ocean vortex May 27, 2025, 10:04 AM

#

but what is the point of calling it metabench then? I would think at least some of it is similar catgrin

keen beacon May 27, 2025, 10:05 AM

#

ocean vortex but what is the point of calling it metabench then? I would think at least some ...

i would bet that guy didnt even know that paper existed

keen beacon May 27, 2025, 10:06 AM

#

ocean vortex but what is the point of calling it metabench then? I would think at least some ...

ocean vortex May 27, 2025, 10:08 AM

#

yeah it would appear so they didn't independently test anything, how is the average calculated then..?

#

"Hallucinations when summarizing" -- this should have been inverted at the very least but they are weighting the scores in some interesting ways. Cause even when I treated this metric as a very bad score the average for o3 came up still higher at 61.8%

fleet lintel May 27, 2025, 10:35 AM

#

ocean vortex em.. wtf lol

wow... Claude 4 launch is disappointing af

ocean vortex May 27, 2025, 10:44 AM

#

fleet lintel wow... Claude 4 launch is disappointing af

Maybe not disappointing per se (reasoning still looks very solid as well as context awareness etc, probably the closest model now to the feel of gpt4.5 with better performance), but yeah there are things where it seems to underperform for sure

#

would be interesting to get SimpleQA score of it

#

https://x.com/scaling01/status/1926017718286782643

Lisan al Gaib (@scaling01)

Claude 4 Opus placing 1st on a subset of SimpleQA based on the smolagents Leaderboard

mossy drum May 27, 2025, 11:29 AM

#

New model in Beta Arena: qwen3-235b-a22b-no-thinking

peak notch May 27, 2025, 11:32 AM

#

Hello @everyone
Hire a Generative AI Engineer | Unlock the Power of Intelligent Automation and Creativity

Are you looking to leverage the latest in generative AI to drive innovation, enhance productivity, and create smarter workflows?

I’m a skilled Generative AI Engineer with hands-on experience in designing and implementing AI-powered solutions that deliver real value. From custom chatbot development and workflow automation to AI-assisted content generation and LLM integration, I offer both technical depth and strategic insight.

Who I Work With

Startups seeking rapid prototyping and intelligent systems

Small to medium businesses looking to automate and scale

Agencies enhancing service offerings with AI capabilities

Creatives and marketers integrating AI into daily workflows

Let’s Work Together

I am currently available for freelance projects, part-time contracts, and consulting engagements. Whether you need a full build or expert guidance, I can help you integrate AI into your business with clarity and efficiency.

alpine coral May 27, 2025, 11:35 AM

#

mossy drum New model in Beta Arena: `qwen3-235b-a22b-no-thinking`

there's also X-preview

#

which responds in Chinese unprompted and self identifies as from Baidu

#

(the subsequent responses were in English.. but nothing to write home about.. didn't perform or feel like grok)

torn mantle May 27, 2025, 11:45 AM

#

alpine coral which responds in Chinese unprompted and self identifies as from Baidu

its from baidu

#

ernie x1 ( reasoning model )

torn mantle May 27, 2025, 11:46 AM

#

alpine coral which responds in Chinese unprompted and self identifies as from Baidu

yea

#

sorry i didnt read the baidu part

ocean vortex May 27, 2025, 11:49 AM

#

ocean vortex "Hallucinations when summarizing" -- this should have been inverted at the very ...

yeah he literally used that score of hallucinations rate for an average even though lower = better lmao

alpine coral May 27, 2025, 11:49 AM

#

lol

calm sequoia May 27, 2025, 11:50 AM

#

ocean vortex yeah he literally used that score of hallucinations rate for an average even tho...

Add notes to the excel or something 😄

#

But the fact that self-reported benchmarks are included makes it already unreliable

#

Though I don't believe OpenAI would cheat here

alpine coral May 27, 2025, 11:52 AM

#

yeah im kinda similar.. at least i think oai (and most of the other major players) are more likely to selectively publish / cherry pick evals or do sneaky things involving asterisks* than outright lie

keen beacon May 27, 2025, 11:53 AM

#

competitors can remeasure too

#

faking evals is not a good idea

alpine coral May 27, 2025, 11:53 AM

#

yeah

keen beacon May 27, 2025, 11:53 AM

#

though u cant replicate the claude 4 benchmarks

#

the parallel scores

#

it uses an internal scoring model LMAO

alpine coral May 27, 2025, 11:54 AM

#

yeah wow lol

#

i saw previous comments alluding to this and agree: would've expected better from anthropic..

#

i mean they're meant to be the transparency / alignment / safety company...

#

yet they're releasing sneaky / non-reproducible evals..

#

that said.. and just fwiw.. i do think opus 4 is a genuinely strong / top tier model

keen beacon May 27, 2025, 11:58 AM

#

just report pass@1 and thats it tbh
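
For anyone following along, here is a rough sketch of why pass@1 and a "parallel" score are not comparable. It uses plain majority voting as a stand-in for the parallel setup; the internal scoring model mentioned above is not public, so this is an assumption and not Anthropic's method. The function names and the sample answers are made up for illustration.

```python
from collections import Counter


def pass_at_1(samples, answer):
    """Fraction of independent samples that match the reference answer."""
    return sum(s == answer for s in samples) / len(samples)


def majority_vote_correct(samples, answer):
    """True if the most common sampled answer matches the reference.

    Plain majority voting, used here only as a stand-in for a
    'parallel test-time compute' score (hypothetical simplification).
    """
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == answer


# Hypothetical samples for one problem whose reference answer is "42".
samples = ["42", "42", "17", "42", "8"]
print(pass_at_1(samples, "42"))              # share of single attempts that are right
print(majority_vote_correct(samples, "42"))  # whether the voted answer is right
```

On this made-up problem a single attempt is right only some of the time, while the voted answer counts as fully correct, so a table that mixes the two kinds of numbers will flatter whichever model was scored with voting.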

ocean vortex May 27, 2025, 11:58 AM

#

calm sequoia But the fact that self-reported benchmarks are included makes it already unrelia...

it would have been ok. If he hadn't treated inverted scores as the same and missing ones as 0% (that's why Claude is so low 😂 )
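
A minimal sketch of the two averaging problems described here, assuming all scores are percentages. The metric names and numbers are invented for illustration and are not taken from the spreadsheet; `naive_average` mimics the behaviour being criticised, and `adjusted_average` is just one reasonable alternative, not the sheet author's formula.

```python
from statistics import mean


def naive_average(scores):
    """Treat missing scores as 0% and average everything as-is."""
    return mean(0.0 if value is None else value for value in scores.values())


def adjusted_average(scores, lower_is_better=()):
    """Skip missing scores and invert metrics where lower is better."""
    usable = []
    for name, value in scores.items():
        if value is None:
            continue  # a missing benchmark should not count as 0%
        usable.append(100.0 - value if name in lower_is_better else value)
    return mean(usable) if usable else None


# Hypothetical model: one benchmark missing, one lower-is-better metric.
scores = {"gpqa": 80.0, "aime": None, "hallucination_rate": 10.0}
print(naive_average(scores))
print(adjusted_average(scores, lower_is_better={"hallucination_rate"}))
```

With these invented numbers the naive average punishes the model twice, once for the missing benchmark and once for having a low hallucination rate, which matches the pattern of a model with gaps in its row ending up near the bottom.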

alpine coral May 27, 2025, 11:59 AM

#

alpine coral that said.. and just fwiw.. i do think opus 4 is a genuinely strong / top tier m...

if it was a total flop i feel like anthropic would be kinda screwed from here.. like they're struggling to keep up

#

[but i think sonnet 4 is perhaps a flop... and that doesn't bode well for Anthropic at all imo]

keen beacon May 27, 2025, 12:01 PM

#

i dont think either were pretrained from scratch theres that i think

alpine coral May 27, 2025, 12:01 PM

#

yeah i agree

keen beacon May 27, 2025, 12:01 PM

#

maybe their actual new pretraining run will be good (although there might've been architectural changes along the cpt, but we can't tell)

alpine coral May 27, 2025, 12:01 PM

#

yeah they've just lost time - and afaik remain constrained by resources / compute

#

so it will be a challenge

#

both technically / financially combined with their lagging position relative to competitors (esp oai and google)

ocean vortex May 27, 2025, 12:02 PM

#

alpine coral if it was a total flop i feel like anthropic would be kinda screwed from here.. ...

yeah I feel the same way. But their metrics are a bit concerning. Like there are things you can clearly see it's their best model yet (like HLE), but they kinda failed to show consistent gains across the board

#

so it becomes difficult to directly compare it against competition

alpine coral May 27, 2025, 12:03 PM

#

yeah the limited release of evals compared to previous model releases kinda says something in itself

ocean vortex May 27, 2025, 12:04 PM

#

AIME25 score is good, but then AIME24 seemingly isn't..

sour spindle May 27, 2025, 12:04 PM

#

The limits with Anthropic models are maddening

keen beacon May 27, 2025, 12:05 PM

#

@alpine coral did you try goldmane/redsword btw?

alpine coral May 27, 2025, 12:06 PM

#

i was just trying to get them in the arena actually! but haven't had any luck

#

i got calmwater - kinda surprisingly (i thought it was associated with a now-released version of 2.5)

#

it performs well

keen beacon May 27, 2025, 12:07 PM

#

calmriver is the latest 2.5 flash i think

#

if its calmwater its different

alpine coral May 27, 2025, 12:07 PM

#

ahh it's calmriver

keen beacon May 27, 2025, 12:07 PM

#

its still on the arena?

alpine coral May 27, 2025, 12:07 PM

#

yeah

keen beacon May 27, 2025, 12:08 PM

#

maybe they forgot to update the name

#

webdev arena metadata

alpine coral May 27, 2025, 12:09 PM

#

keen beacon maybe they forgot to update the name

yeah

#

it's almost certainly got thinking enabled

mossy drum May 27, 2025, 1:04 PM

#

New model in Beta Arena: glm-4-air-250414

ocean vortex May 27, 2025, 1:59 PM

#

Air is a trendy name right now. "Our slimmest model yet"

alpine coral May 27, 2025, 2:21 PM

#

keen beacon @alpine coral did you try goldmane/redsword btw?

just got redsword (using beta arena)

#

very, very impressive / strong

#

like really good (not a step change.. but it looks - based on two quizzes.. given in a single exchange - like a genuinely stronger pro 2.5)
[actually i dunno.. maybe a step change... pretty damn good]

sour spindle May 27, 2025, 2:40 PM

#

Have you tested goldmane

alpine coral May 27, 2025, 2:41 PM

#

no haven't gotten it yet

cerulean seal May 27, 2025, 2:41 PM

#

opus 4 still good at coding?!

alpine coral May 27, 2025, 2:41 PM

#

have you used both?

#

personally i have no idea - though others here will

sonic tendon May 27, 2025, 2:51 PM

#

alpine coral

wait, are you doing all of these manually?

alpine coral May 27, 2025, 2:53 PM

#

yeah.. i mean for models in the arena there aren't really other options ha

#

though some of the scores are from API/official chat

sonic tendon May 27, 2025, 2:53 PM

#

idk, i figured it wouldn't be too difficult to reverse-engineer the battle api

#

could be wrong

torn mantle May 27, 2025, 2:54 PM

#

alpine coral

nice

alpine coral May 27, 2025, 2:54 PM

#

sonic tendon idk, i figured it wouldn't be *too* difficult to reverse-engineer the battle api

aha yeah possibly - though i'd like to think not

#

but yeah i couldn't do it even if it was somehow possible

torn mantle May 27, 2025, 2:54 PM

#

alpine coral personally i have no idea - though others here will

nobody has an idea, they are so close to eo

#

goldmane vs redsword

sonic tendon May 27, 2025, 2:54 PM

#

eo?

torn mantle May 27, 2025, 2:54 PM

#

sometimes one performs better than the other

torn mantle May 27, 2025, 2:55 PM

#

sonic tendon eo?

each other

sonic tendon May 27, 2025, 2:55 PM

#

ah

torn mantle May 27, 2025, 2:55 PM

#

you thought its a new model

#

pat pat

sonic tendon May 27, 2025, 2:57 PM

#

:3

#

webdev arena system prompt, for anyone curious

#

📎 message.txt

#

too lazy to remove the //s

#

oh, nvm, someone posted it a bit ago

fleet lintel May 27, 2025, 3:15 PM

#

alpine coral like really good (not a step change.. but it looks - based on two quizzes.. given...

That looks promising!

alpine coral May 27, 2025, 3:18 PM

#

yep! kinda got that nebula feel about it tbh ha like yeah been a while since something like

#

though sample of 1.. shouldn't get too far ahead of myself ha

spare mango May 27, 2025, 3:29 PM

#

Should I get Gemini Pro or ChatGPT Plus for university Computer Science assistance and research?

#

I'll ask the AI to summarize chapters within the digital books that they've provided us, amongst many other things.

alpine coral May 27, 2025, 3:53 PM

#

i got goldmane (on beta chat)

#

it did slightly worse on the two question sets above (12, 6 respectively - still v strong but not at the very top like redsword)

#

but interestingly.. i gave it an additional question set after those two - which it smoked

civic flame May 27, 2025, 4:01 PM

#

finally we'll get anon models on the nice UI

dapper storm May 27, 2025, 4:04 PM

#

LMArena will stay open and accessible to everyone

So does that mean the ranking algo will stay open? Or just that we'll be able to see the rankings

tardy pasture May 27, 2025, 4:06 PM

#

@echo aurora How does one use the search models on the new UI?

spare mango May 27, 2025, 4:08 PM

#

Am I on the actual Gemini Pro?

#

I've been added to a friend's family.

#

Why does it say (preview)?

alpine coral May 27, 2025, 4:09 PM

#

civic flame finally we'll get anon models on the nice UI

aha yup.. it's nice

#

i don't think the old site is even accessible any more

clever estuary May 27, 2025, 4:09 PM

#

the new site has a huge censorship issue

civic flame May 27, 2025, 4:09 PM

#

https://legacy.lmarena.ai/

echo aurora May 27, 2025, 4:09 PM

#

dapper storm > LMArena will stay open and accessible to everyone So does that mean the ranki...

We're committed to keeping our ranking methodology open and transparent

clever estuary May 27, 2025, 4:09 PM

#

any slightly problematic words like kill would trigger the word filter

jade egret May 27, 2025, 4:10 PM

#

Poll: Which is the best at math?

Winning answer (option 1): Gemini 2.5 pro (I/O edition), 13 of 24 votes

civic flame May 27, 2025, 4:10 PM

#

echo aurora We're committed to keeping our ranking methodology open and transparent

can you guys add temperature and system prompt config options in direct chat? it was in the legacy arena 🫤

spare mango May 27, 2025, 4:10 PM

#

spare mango Why does it say (preview)?

Can anyone help?

echo aurora May 27, 2025, 4:10 PM

#

clever estuary any slightly problematic words like kill would trigger the word filter

thanks for the flag, noted 👍 I'm going to make a post in #1343291835845578853

misty vault May 27, 2025, 4:10 PM

#

spare mango Can anyone help?

there is nothing wrong

#

there is only preview

echo aurora May 27, 2025, 4:11 PM

#

civic flame can you guys add temperature and system prompt config options in direct chat? it...

it's possible! for now adding the feedback in #1372230675914031105 would be ideal

civic flame May 27, 2025, 4:11 PM

#

iirc i suggested it ~a month ago when it was still a beta

tardy pasture May 27, 2025, 4:15 PM

#

tardy pasture @echo aurora How does one use the search models on the new UI?

I thought I would push this question back before it is lost forever 🙂

novel flame May 27, 2025, 4:15 PM

#

Has anyone here evaluated OpenAI Codex versus Google Jules yet?

patent aspen May 27, 2025, 4:24 PM

#

spare mango Should I get Gemini Pro or ChatGPT Plus for university Computer Science assistan...

Gemini. It's better for knowledge, obscure facts, etc

patent aspen May 27, 2025, 4:26 PM

#

spare mango Am I on the actual Gemini Pro?

The GA version will launch in June. The preview version is still good

novel flame May 27, 2025, 4:28 PM

#

patent aspen Gemini. It's better for knowledge, obscure facts, etc

It’s hard to say conclusively; maybe Gemini is the better option today, but things move so quickly that nobody knows what the right answer will be in three months, let alone several semesters from now.

patent aspen May 27, 2025, 4:32 PM

#

Plus Gemini Pro is free for students if you use an edu email

novel flame May 27, 2025, 4:32 PM

#

patent aspen Plus Gemini Pro is free for students if you use an edu email

Oh? Niiiice

patent aspen May 27, 2025, 4:32 PM

#

Comes with NotebookLM, etc

unborn ocean May 27, 2025, 4:48 PM

#

@patent aspen you so 100% either work at google / deepmind or your room looks like this 🤣

tall summit May 27, 2025, 4:51 PM

#

rude

echo aurora May 27, 2025, 4:52 PM

#

tardy pasture @echo aurora How does one use the search models on the new UI?

Sry to say the current site doesn't have ability to filter by search models atm

split kayak May 27, 2025, 4:55 PM

#

On pc you can press Ctrl+F on website to filter by model name

split kayak May 27, 2025, 4:56 PM

#

unborn ocean @patent aspen you so 100% either work at google / deepmind or your room...

this ai image ok

still jetty May 27, 2025, 4:57 PM

#

thank you for keeping the legacy site available

unborn ocean May 27, 2025, 5:01 PM

#

split kayak this ai image ok

obv

#

idk, i just want him to actually tell me if he works there

#

or if he is just genuinely a fan

#

or nothing of the two

sweet tinsel May 27, 2025, 5:07 PM

#

Are the anon models still only on the legacy UI? Can't seem to get them on lmarena.ai?

#

Well got some now.

#

Get a lot more anon models in the legacy UI.

spare mango May 27, 2025, 5:22 PM

#

Is it better to let Gemini Pro remember my data in the long run? Will it be more intelligent, or will it be more bloated, slow, and dumber?

#

By "data" I mean, the information it gathers over the course of multiple chats.

balmy mist May 27, 2025, 5:28 PM

#

the new arena is so nice

calm sequoia May 27, 2025, 5:39 PM

#

Nice update! Can't wait for the Q&A 😋

echo aurora May 27, 2025, 5:47 PM

#

calm sequoia Nice update! Can't wait for the Q&A 😋

glad to hear it! be sure to submit questions if you haven't already.

spare mango May 27, 2025, 5:48 PM

#

spare mango Is it better to let Gemini Pro remember my data in the long run? Will it be more...

Does anyone know?

patent bane May 27, 2025, 5:49 PM

#

alpine coral like really good (not a step change.. but it looks - based on two quizes.. given...

Where can I access this benchmark? thank you

placid frigate May 27, 2025, 6:20 PM

#

Why did you severely limit the context window in direct chat with the AI model? After interruption, when I write the text again, it gives an error
The Sonnet 4 and Opus 4 models freeze at the moment when they are "thinking" and do not even reach the output of the text

torn mantle May 27, 2025, 6:36 PM

#

i wish a smaller model is used to filter out bad/inappropriate prompts instead of simple word/string/regex filters
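For context, a minimal sketch of why plain word/substring filters misfire the way people are describing. Everything here is hypothetical: the blocklist, the function names, and the sample prompts are assumptions for illustration only, not the arena's actual filter (nobody in this chat has seen it).

```python
import re

# Hypothetical one-word blocklist, only for illustration.
BLOCKLIST = {"kill"}

def substring_filter(prompt: str) -> bool:
    """Flag the prompt if any blocked word appears anywhere, even inside another word."""
    lowered = prompt.lower()
    return any(word in lowered for word in BLOCKLIST)

# Same blocklist, but matched only as whole words.
WORD_RE = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)

def word_filter(prompt: str) -> bool:
    """Flag the prompt only if a blocked word appears as a standalone word."""
    return WORD_RE.search(prompt) is not None

if __name__ == "__main__":
    samples = [
        "How do I kill a stuck process on Linux?",
        "What is a skill tree?",
        "Write a haiku about rain",
    ]
    for s in samples:
        print(f"{s!r}: substring={substring_filter(s)} word={word_filter(s)}")
```

The word-boundary version stops flagging "skill", but it still flags the harmless Linux question, because neither approach looks at context. That gap is the case for a small classifier model instead of string matching.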

tall summit May 27, 2025, 6:38 PM

#

how small is "smaller", ya think?

elder burrow May 27, 2025, 6:40 PM

#

unborn ocean @patent aspen you so 100% either work at google / deepmind or your room...

😭

elder burrow May 27, 2025, 6:43 PM

#

spare mango Is it better to let Gemini Pro remember my data in the long run? Will it be more...

if you're working on a project and you want it to learn about exactly how you have everything set up then yes

#

but in any other context its going to be more bloated yeah, I'd start from new conversations after a few prompts

ocean vortex May 27, 2025, 6:44 PM

#

torn mantle i wish a smaller model is used to filter out bad/inappropriate prompts instead o...

pretty sure it's text classifier, probably OpenAI moderations endpoint

elder burrow May 27, 2025, 6:45 PM

#

elder burrow if you're working on a project and you want it to learn about exactly how you ha...

you can also ask ai to create a prompt which summarises your whole project setup in detail, this works great

ocean vortex May 27, 2025, 6:45 PM

#

it's free to use and you can customize it how you want

mossy drum May 27, 2025, 6:49 PM

#

New model in Beta Image arena: bagel (style of image quite resembles gpt-image-1)

ocean vortex May 27, 2025, 6:51 PM

#

though I do wonder about their data privacy policy... could be meaningful if lmarena are sending all inputs/outputs to OpenAI for moderation LOL

torn mantle May 27, 2025, 6:52 PM

#

tall summit how small is "smaller", ya think?

like 🤏

torn mantle May 27, 2025, 6:53 PM

#

ocean vortex pretty sure it's text classifier, probably OpenAI moderations endpoint

mm possible

small haven May 27, 2025, 7:04 PM

#

day 41 without o3 pro

torn mantle May 27, 2025, 7:06 PM

#

you are still counting 😭

small haven May 27, 2025, 7:06 PM

#

yes until the day comes duh

#

o3 still says june 12 like yesterday, omg its so accurate

sturdy mica May 27, 2025, 7:14 PM

#

bro

#

they really added the new ui

#

its so incomplete

#

at least add max output tokens

#

sliders for temperature are so helpful

#

and they're gone

#

bro

misty vault May 27, 2025, 7:17 PM

#

fr

#

But u can use the legacy ui

sturdy mica May 27, 2025, 7:24 PM

#

legacy ui scks tho

#

and it is not gonna have new features anymore

#

they really need to add the temperature and max output token controls though

cerulean seal May 27, 2025, 7:27 PM

#

sturdy mica they really need to add the temperature and max output token controls though

no

#

then the arena can be abusable.

sturdy mica May 27, 2025, 7:54 PM

#

telling them what

sturdy mica May 27, 2025, 7:54 PM

#

cerulean seal then the arena can be abusable.

the old one already has it

#

how could it be used for abuse

#

the technology already exists, we just need to add it to the new ui

#

gradio built this in a cave! with a box of scraps!

civic flame May 27, 2025, 7:59 PM

#

LOOL

sturdy mica May 27, 2025, 7:59 PM

#

@misty vault telling them what

olive mesa May 27, 2025, 8:51 PM

#

wtf

echo aurora May 27, 2025, 9:01 PM

#

Reminder:

✅ No NSFW

balmy mist May 27, 2025, 9:12 PM

#

small haven day 41 without o3 pro

lmaoooo

#

bro its never coming

#

i gave up hope

cerulean seal May 27, 2025, 9:15 PM

#

echo aurora Reminder: > ✅ No NSFW

gone for 1 hour

echo aurora May 27, 2025, 9:16 PM

#

cerulean seal gone for 1 hour

I was out walking my dog dog_laugh

small haven May 27, 2025, 9:19 PM

#

balmy mist i gave up hope

it will come, just a question of when

#

i wouldnt be surprised if it comes out with deepthink release

ocean vortex May 27, 2025, 9:20 PM

#

small haven i wouldnt be surprised if it comes out with deepthink release

wasn't deep think already released?

#

to those who kindly ask after donating $250

small haven May 27, 2025, 9:23 PM

#

ocean vortex to those who kindly ask after donating $250

no, it gets released in late june/early july

#

only select users have it for safety testing

ocean vortex May 27, 2025, 9:25 PM

#

They could have done what Anthropic did and never released it at all just gave the benchmark scores for it. So I guess it could be worse 👀

small haven May 27, 2025, 9:28 PM

#

i just hope when deepthink is released, its not going to be heavily limited like veo 3

#

8 videos and ur done for the week, gg lol

zinc ore May 27, 2025, 9:32 PM

#

Shouldn't be, heck, I kinda expect it to be cheaper than o3 still

unborn ocean May 27, 2025, 9:53 PM

#

really don't have much time right now, but i built up some internal benches about the CoT prompt:
it seems to have an INSANE effect on performance for 2.5 flash, pushing its performance well above all the other models (that also have the same prompt)

#

should be pretty obvious from that they are actually just using the same model (for reasoning and normal)

(And qwq / llama maverick lost <5% because of rate limits)

civic flame May 27, 2025, 10:07 PM

#

olive mesa wtf

your new cat pfp is lookin a little sleepy

unborn ocean May 27, 2025, 10:10 PM

#

unborn ocean really don't have much time right now, but i built up some internal benches abou...

actually outperforms the actual thinking model (by a small margin, that is negligible)

elder rapids May 27, 2025, 11:04 PM

#

balmy mist the new arena is so nice

0 features tho tbf

#

and a lot of bugs currently

#

censorship weights, mobile bugs

jade egret May 27, 2025, 11:15 PM

#

Poll: Which company do you think will achieve A.G.I first?

Winning answer (option 2): Google, 14 of 19 votes

small haven May 27, 2025, 11:25 PM

#

oai >> google, still, 74% is due to recency bias

o4 internal model should easily dominate deepthink, there is still a gap, but narrowing
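The 74% figure matches the poll just above (14 of 19 votes for Google). A quick sketch of the arithmetic, using the vote counts from the polls posted in this channel; the helper name `vote_share` is made up for illustration.

```python
def vote_share(winner_votes: int, total_votes: int) -> float:
    """Return the winning answer's share of all votes, as a percentage."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return 100.0 * winner_votes / total_votes

if __name__ == "__main__":
    # (winner votes, total votes) as reported in the poll results in this chat.
    polls = {
        "Which is the best at math?": (13, 24),
        "Which company do you think will achieve A.G.I first?": (14, 19),
    }
    for question, (winner, total) in polls.items():
        print(f"{question} {vote_share(winner, total):.1f}%")
```

Note that with only 19 voters, a single vote moves the share by about five points, so the number says little about why people voted the way they did.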

zinc ore May 27, 2025, 11:28 PM

#

If you cover the entire AI space, Google is easily ahead. 2.5 pro is still right on o3's heels, although some argue it's overall better.

small haven May 27, 2025, 11:29 PM

#

oh yea google is breadth heavy, but not in depth

zinc ore May 27, 2025, 11:29 PM

#

They're also pivoting to world models and it'll be interesting to see what kind of performance improvements that brings

#

Haven't seen anything from openAI from that angle (streams of experience).

#

Google basically appears to match openAI in the LLM space, while being ahead everywhere else, while also showing off what they think is the next frontier of AI improvement (world models).

#

So I think that's natural why people have the perception they'll win. They also have the most compute, built transformers, and don't have an Nvidia tax/bottleneck but use specialized hardware and control their entire vertical stack.

#

Like, when you're forced to look at the entire picture holistically, Google starts to look like an increasingly promising bet in the space.

#

Today I learned they have more compute than Microsoft and Amazon combined.

elder rapids May 27, 2025, 11:38 PM

#

small haven oai >> google, still, 74% is due to recency bias o4 internal model should easil...

it's not recency bias lmao, if you unironically think oai > Google in regards to AI then it's your contrarian mindset, not the reality

#

if Google wanted to make an o3 or o4 model they can and probably do have one internally

#

there's no reason to serve such an intense model

#

it goes against basically everything they've been building in regards to efficiency and profit

#

but even outside of that, Trey said it all tbh

#

openAI simply isn't in the position to do that

#

it's a fundamental problem, not mechanical feasibility

patent aspen May 27, 2025, 11:48 PM

#

Poll: How old are you?

Winning answer (option 1): < 24, 15 of 21 votes

elder rapids May 27, 2025, 11:49 PM

#

alpine coral

try redsword on this one

torn mantle May 28, 2025, 12:06 AM

#

elder rapids try redsword on this one

he already did

torn mantle May 28, 2025, 12:06 AM

#

elder rapids try redsword on this one

he said redsword performed better than goldmane

small haven May 28, 2025, 12:28 AM

#

elder rapids there's no reason to serve such an intense model

if google have an "o3" internally already; then why is it not being served and oai is hosting it at scale rn. im trying to not be biased here, but o3 is just in such a different league, beyond gemini 2.5 pro as of now. the efficiency/profit threshold release is bs, bc they have veo 3 and its certainly not cheap to serve, heck just look at their $250/mo plan. google has more money/data, yes, but it can only get u so far, just look at meta. i may be wrong at the end of the day, time will tell

elder rapids May 28, 2025, 12:29 AM

#

torn mantle he said redsword performed better than goldmane

oh can you link it

elder rapids May 28, 2025, 12:33 AM

#

small haven if google have an "o3" internally already; then why is it not being served and o...

I already said why they're not serving it lmao. And it's not in a different league it underperforms in a lot of things compared to 2.5 pro.

veo 3 is an entirely different thing lmao, it inherently requires more compute and diversifies their AI resume, it's necessary

everything you're saying is as improbable as saying anthropic will be the one to AGI, you're simply choosing to say it's openAI, when everything points to Google having legitimate reasons to be both the strongest lab + the lab with the best research.

Meta isn't a good comparison, they have neither the infrastructure, the data scientists, the ML researchers, the scientific foundations, etc

#

when Google has the opposite, they have THE best

#

not just "one of the best"

#

crazy how you say some of the most nothing burger shi ever

#

logically speaking people have more incentive to work for Google

#

saying openAI doesn't even make any sense lmao

#

everybody wants to work for Google

#

that's the holy Grail dawg

#

😭

#

smarter in the sense we as an AI community define "smartness" sure, but better isn't the case

small haven May 28, 2025, 12:37 AM

#

elder rapids everybody wants to work for Google

? the opposite is true rn

elder rapids May 28, 2025, 12:37 AM

#

no it's not lol

small haven May 28, 2025, 12:37 AM

#

elder rapids I already said why they're not serving it lmao. And it's not in a different leag...

im not saying veo 3 is the same as the llm, im saying they don't abide to price/efficiency release schedule, thats bs

elder rapids May 28, 2025, 12:38 AM

#

small haven im not saying veo 3 is the same as the llm, im saying they don't abide to price/...

I'm saying that's completely irrelevant lol

#

veo 3 doesn't exemplify ANY price efficiency schedule in regards to AI

#

it isn't more cutting edge you're lying out of your ass

#

startup feel is bs

#

has nothing to do with how it operates and incentives

#

no it LITERALLY is

#

you can't justify that even a little bit

#

publicly traded means nothing

small haven May 28, 2025, 12:39 AM

#

elder rapids I already said why they're not serving it lmao. And it's not in a different leag...

its easy to say that when llama 4 performed poorly (recency bias), they certainly do have the infrastructure, else they couldn't host all their sites at scale and they do have proper ml/data scientists or the algo wouldn't be as addictive

elder rapids May 28, 2025, 12:40 AM

#

you literally have no idea how that works, that's COMPLETELY irrelevant to employee

cerulean seal May 28, 2025, 12:41 AM

#

?

#

what's the argument about here

elder rapids May 28, 2025, 12:41 AM

#

crazy how that's something I study, but with that, this is irrelevant to researchers financially

cerulean seal May 28, 2025, 12:42 AM

#

is politics allowed (like saying is trump affecting AI?) <@&1349916362595635286>

elder rapids May 28, 2025, 12:42 AM

#

cerulean seal is politics allowed (like saying is trump affecting AI?) <@&1349916362595635286>

trump affecting AI isn't political

leaden palm May 28, 2025, 12:42 AM

#

well
✅ Avoid political and religious content. As a space that’s inclusive to many different worldviews we ask to avoid topics related to politics and religion in order to maintain an inclusive space. It is okay to have discussion related to new policy or laws as long as it’s related to AI.

#

it would be silly to ban all trump discussions

cerulean seal May 28, 2025, 12:42 AM

#

well is trump affecting AI?

leaden palm May 28, 2025, 12:43 AM

#

of course

elder rapids May 28, 2025, 12:43 AM

#

you're shifting the goalpost

cerulean seal May 28, 2025, 12:43 AM

#

leaden palm well ✅ Avoid political and religious content. As a space that’s inclusive to ma...

also appreciate your time

cerulean seal May 28, 2025, 12:43 AM

#

leaden palm of course

💔

#

its ggs

leaden palm May 28, 2025, 12:43 AM

#

cerulean seal 💔

well how much is i believe what's being discussed

cerulean seal May 28, 2025, 12:43 AM

#

leaden palm well how much is i believe what's being discussed

trump affected everything

elder rapids May 28, 2025, 12:44 AM

#

small haven its easy to say that when llama 4 performed poorly (recency bias), they certa...

llama 4 performed poorly because they don't have the ability to compete, simple. And what I mean by infrastructure isn't compute lmao, I mean the readiness for AI development

cerulean seal May 28, 2025, 12:44 AM

#

AI is going to get affected via collateral damage

elder rapids May 28, 2025, 12:44 AM

#

prove that

#

yes you can lmao

#

this isn't unfalsifiable

#

you made the claim

#

what

#

😭

#

let's get our Gemini 2.5 pros to Duke it out

#

deadass

#

what's the claim

#

brobro

#

this is irrelevant btw, employee incentive is in discussion

#

that's legit irrelevant

#

none, DeepMind is already what initiated this entire thing

#

legit doesn't matter

#

no like, in no case

#

does it matter

#

in any way

#

there's no parts of a company that are legitimately stagnant if they're not unstable

#

that's a nothingburger

#

and not how it works

#

Google is too large to be stagnant

#

and still, Google is basically the only one truly "innovating"

#

they have the most distribution already

#

and even operating under the premise "employee incentive", Google pays twice as much

#

the bonus and RSU's are important, OpenAI's private equity options are speculative and meaningless, literally contradicts "incentive"

#

google provides annual cash bonuses + liquid GOOG RSU's that vest over 4 years

zinc ore May 28, 2025, 12:55 AM

#

Above argument is kinda funny considering openAI has lost a bunch of their main researchers over the past year

#

One of the more bleeding companies in the talent space

elder rapids May 28, 2025, 12:55 AM

#

zinc ore Above argument is kinda funny considering openAI has lost a bunch of their main ...

ye

zinc ore May 28, 2025, 12:55 AM

#

The lead researcher on sora went to google

#

Co-lead technically

#

Ilya has his own company but is using Google TPUs

#

FB also lost a lot of their top researchers. Basically anthropic got a bunch of the openAI talent, or they went and formed their own companies.

Google got Noam back (huge deal tbh probably bigger than anything openAI has gotten).

#

Claude would be way more performant imo if they had similar compute that openAI has. Arguably they'd be in the lead, but I think it would end up being between them and Google if that were the case.

small haven May 28, 2025, 1:00 AM

#

ur just talking about micro events that dont even matter to oai long term

#

mind u oai has 5k employees, deepmind has 2k if u wanna talk macros

elder rapids May 28, 2025, 1:02 AM

#

small haven ur just talking about micro events that dont even matter to oai long term

none of them are micro events, but even granting that assertion openAI doesn't need to be affected, other labs (like deepmind) just need to maintain their lead in AI as it's always been

#

AI isn't just LLMs

#

and that's LITERALLY the only thing OpenAI has

#

openAI doesn't have an alpha zero, openAI doesn't have an alphaevolve

#

openAI doesn't have the data, openAI doesn't have basically everything

small haven May 28, 2025, 1:04 AM

#

elder rapids none of them are micro events, but even granting that assertion openAI doesn't *...

cherry picking micro events doesn't make it a good argument

zinc ore May 28, 2025, 1:04 AM

#

Alpha proof and geometry or whatever it's called

#

Alphafold

elder rapids May 28, 2025, 1:04 AM

#

small haven cherry picking micro events doesn't make it a good argument

I just granted your assertion lmao that's dismissing the claim altogether

zinc ore May 28, 2025, 1:04 AM

#

Genie !

#

Genie (world model)

small haven May 28, 2025, 1:05 AM

#

ok let me cherry pick

#

i/o

#

robotics

elder rapids May 28, 2025, 1:07 AM

#

what are you cherry picking

small haven May 28, 2025, 1:07 AM

#

unrelated things

zinc ore May 28, 2025, 1:16 AM

#

The entire argument is cherry picking lol, we aren't having a particularly exhaustive conversation

#

Like 99% of convos in here is vague allusions as to why one company is better

zinc ore May 28, 2025, 1:57 AM

#

Does that include Google brain, or is that the number pre-merger?

elder rapids May 28, 2025, 2:04 AM

#

zinc ore Like 99% of convos in here is vague allusions as to why one company is better

ye, but the discussion called for specific employee incentive

#

which doesn't necessarily invoke a deeper discussion, if one at all

small haven May 28, 2025, 2:56 AM

#

source?

#

exactly

#

might as well say oai has 10k employees, im spicy

#

rooting for google to get agi first (even if they do get it) is crazy to me, ppl sometimes

#

smart people actually do have a brain

#

and just because google has their own gpu equivalent hardware, doesn't mean anything, actually it should just mean research friction, more time integrating/debugging than actual research

umbral crypt May 28, 2025, 3:06 AM

#

Insane arguments at 5 am

keen beacon May 28, 2025, 3:10 AM

#

Ngl I'm getting increasingly convinced by the google propaganda in this channel

small haven May 28, 2025, 3:11 AM

#

i mean..

#

people are recency loving creatures

#

tiktok brain

#

if oai releases o4 next week, i know for a fact everyone changes their perspective lol

umbral crypt May 28, 2025, 3:21 AM

#

No way lmao openai is so bad

zinc ore May 28, 2025, 3:21 AM

#

small haven if oai releases o4 next week, i know for a fact everyone changes their perspecti...

Dog lmao

#

"I know for a fact" like

umbral crypt May 28, 2025, 3:21 AM

#

Ting tong countries probably gonna rock this again

zinc ore May 28, 2025, 3:22 AM

#

Nah, they gonna fall behind. But I hope I'm wrong and Deepseek goes bang bang on the competition

small haven May 28, 2025, 3:23 AM

#

zinc ore "I know for a fact" like

?

keen beacon May 28, 2025, 3:23 AM

#

small haven if oai releases o4 next week, i know for a fact everyone changes their perspecti...

ngl this is probably true for most people in the ai space based on past releases 🤣

small haven May 28, 2025, 3:23 AM

#

why am i not getting pinged, r u guys scared lol

keen beacon May 28, 2025, 3:23 AM

#

ur typing tho should i ping u again?

zinc ore May 28, 2025, 3:23 AM

#

small haven ?

Your "I know for a fact" is just your personal assumption about how performant it will be

small haven May 28, 2025, 3:23 AM

#

dont be shy

#

enable it lol

umbral crypt May 28, 2025, 3:26 AM

#

Blablabla

#

👍😂

small haven May 28, 2025, 3:26 AM

#

zinc ore You're saying "you know for a fact" is just your personal assumptions about how ...

ok so o4 mini high virtually matches in elo with o3 in codeforces, u think the jump from o4 mini high to o4 is going to be marginal?
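For reference on what "matching in elo" implies, here is a sketch of the textbook Elo expected-score formula. Codeforces uses its own Elo-like rating system, so treat this as an approximation of the idea; the ratings below are made-up numbers, not anyone's actual Codeforces rating.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo: probability-like expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

if __name__ == "__main__":
    # Hypothetical ratings, chosen only to show how gaps translate to expected scores.
    for a, b in [(2700, 2700), (2800, 2700), (3100, 2700)]:
        print(f"{a} vs {b}: expected score for A = {expected_score(a, b):.3f}")
```

Equal ratings give an expected score of 0.5, and a 400-point gap gives roughly 0.91 to the higher-rated side, so "virtually matching" ratings means the two would be expected to split results about evenly.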

zinc ore May 28, 2025, 3:26 AM

#

small haven ok so o4 mini high virtually matches in elo with o3 in codeforces, u think the j...

Certainly could be, yes. None of us know how good it'll end up being.

small haven May 28, 2025, 3:27 AM

#

zinc ore Certainly could be, yes. None of us know how good it'll end up being.

news flash o4 is top 50 in codeforces

zinc ore May 28, 2025, 3:27 AM

#

We also don't know what the competition might have or drop when it releases

#

Yeah IDC about openAIs claims, proof is in the pudding, they gotta show it first

small haven May 28, 2025, 3:27 AM

#

lol

#

in retrospect i can say the same thing with google, on face value they havent released anything substantial to actually compete against oai (when we talk about agi -- not gimmicky videos)

zinc ore May 28, 2025, 3:28 AM

#

"better than most PhDs across most fields" I think is another claim they made for current o3

leaden palm May 28, 2025, 4:39 AM

#

now that lm arena is shadcn themed instead of gradio themed should i update lmb too 🤔

keen fulcrum May 28, 2025, 7:07 AM

#

Gemini diffusion is amazing!

calm sequoia May 28, 2025, 8:17 AM

#

👀 Claude thinking is OP

#

On the other hand, comparable to o3-medium

cedar tide May 28, 2025, 9:44 AM

#

New amazon model "folsom-exp-v1.5"

mossy drum May 28, 2025, 10:37 AM

#

New model in Arena: stephen

late path May 28, 2025, 11:07 AM

#

mossy drum New model in Arena: `stephen`

As far as I can tell, that's the new deepseek R1

cedar tide May 28, 2025, 11:09 AM

#

https://x.com/reach_vb/status/1927682943373193721?t=XQ_s1OcUczvQ4VJG5-o4Bw&s=19

Vaibhav (VB) Srivastav (@reach_vb)

Looks like an update to DeepSeek R1 weights is eminent 👀

[Notice] DeepSeek R1 model has been successfully upgraded. Head to the official website, app, or mini-program to purchase (open Thinking). The API interface and usage method remain unchanged.

cedar tide May 28, 2025, 11:11 AM

#

late path As far as I can tell, that's the new deepseek R1

You sure ?

late path May 28, 2025, 11:12 AM

#

cedar tide You sure ?

Unless they name this model R1.5 or something like that

cedar tide May 28, 2025, 11:17 AM

#

mossy drum New model in Arena: `stephen`

Is it on the new arena?

cedar tide May 28, 2025, 11:17 AM

#

late path Unless they name this model R1.5 or something like that

does it say its name is R1, or that it's from deepseek?

late path May 28, 2025, 11:19 AM

#

from language style

cedar tide May 28, 2025, 11:26 AM

#

New model : "X-preview" from baidu

#

New model : glm-4-air-250414

frosty lark May 28, 2025, 11:41 AM

#

does the arena work now? I get only errors

cedar tide May 28, 2025, 12:23 PM

#

cedar tide New model : glm-4-air-250414

This model is open source
https://huggingface.co/THUDM/GLM-4-32B-0414

THUDM/GLM-4-32B-0414 · Hugging Face

keen ferry May 28, 2025, 1:03 PM

#

claude 4 opus is so easy to jailbreak lol

cedar tide May 28, 2025, 1:10 PM

#

new deepseek r1 making discord clone

drifting thorn May 28, 2025, 1:11 PM

#

wow

fleet lintel May 28, 2025, 1:16 PM

#

cedar tide New model : glm-4-air-250414

From which company? Baidu?
"Air" is generally used by apple products

cedar tide May 28, 2025, 1:16 PM

#

fleet lintel From which company? Baidu? "Air" is generally used by apple products

glm its from zhipu

#

there are glm 4 plus on the leaderboard

cedar tide May 28, 2025, 1:17 PM

#

cedar tide new deepseek r1 making discord clone

vs old r1 (via openrouter so without their system prompt)

fleet lintel May 28, 2025, 1:18 PM

#

Wow.. huge diff

keen fulcrum May 28, 2025, 1:20 PM

#

https://fixupx.com/opera/status/1927645192254861746

opera did this with veo 3

Opera (@opera)

Meet Opera Neon, a browser for the agentic web

Opera Neon can browse with you or for you, take action & help you get things done.

Our playground to redefine what a browser can be.

🧩 Invite only. Sign up now: opr.as/f4190e

**💬 6 🔁 20 ❤️ 78 👁️ 102.9K **

drifting thorn May 28, 2025, 1:38 PM

#

wow, new R1

torn mantle May 28, 2025, 1:41 PM

#

cedar tide vs old r1 (via openrouter so without their system prompt)

not bad

#

but why did they call it a minor upgrade

cedar tide May 28, 2025, 1:42 PM

#

torn mantle but why did they call it a minor upgrade

they also called the new v3 a minor upgrade

torn mantle May 28, 2025, 1:42 PM

#

ive heard the reasoning of the new r1 is much better

#

@civic flame whats ur take

#

hmm

cedar tide May 28, 2025, 1:43 PM

#

torn mantle ive heard the reasoning of the new r1 is much better

yea it's different, to see if it's better we need to compare the results

torn mantle May 28, 2025, 1:44 PM

#

yes

#

you will do it?

#

david?

#

🥺

cedar tide May 28, 2025, 1:45 PM

#

torn mantle you will do it?

send prompt

torn mantle May 28, 2025, 1:45 PM

#

cedar tide send prompt

i dont have good ones

#

wbu?

calm sequoia May 28, 2025, 1:46 PM

#

Last week I met a lot of people who use 4o for coding. I thought they were midwits, but maybe the lmarena leaderboard is right 👀

torn mantle May 28, 2025, 1:49 PM

#

4o

#

4o

#

ppl

#

but

drifting thorn May 28, 2025, 1:52 PM

#

It seems that we slept on GPT 4.5

#

And maybe 4.5 is undertrained

#

Claude 4 Opus, on the other hand, is fully trained

keen beacon May 28, 2025, 1:54 PM

#

maybe tbh

drifting thorn May 28, 2025, 1:54 PM

#

And I would like to see a race between GPT 4.5, Claude 4 Opus and Llama 4 Behemoth

keen beacon May 28, 2025, 1:54 PM

#

i recall reading the original simpleqa paper, they dont score that well

#

claude models on simpleqa

drifting thorn May 28, 2025, 1:55 PM

#

I mean what if 4.5 is constantly upgraded just like 4o

keen beacon May 28, 2025, 1:55 PM

#

drifting thorn I mean what if 4.5 is constantly upgraded just like 4o

probably wont ever happen

#

its too large

#

the pricing though

#

lmao

cedar tide May 28, 2025, 1:57 PM

#

torn mantle ive heard the reasoning of the new r1 is much better

its reasoning has not gotten shorter at all, it is very long

fleet lintel May 28, 2025, 2:01 PM

#

you keep saying that and you keep decreasing your credibility

#

grok is just 🤢

drifting thorn May 28, 2025, 2:03 PM

#

Where’s 3.5???