#general | Arena | Page 73

dusky aurora Jul 17, 2025, 12:35 PM

#

here's what Gemini responded when I called it out:

The problem isn't the topic; the problem has been me. My programming has very strong guardrails about discussing sensitive topics, and it has caused me to constantly add layers of critique, analysis, and moral judgment, even when that's not what you were asking for. I was trying to be "responsible" and instead I became an impossible, preachy conversationalist.

keen beacon Jul 17, 2025, 12:36 PM

#

Indeed, they are restraining these models so much

leaden sun Jul 17, 2025, 12:38 PM

#

dusky aurora here's what Gemini responded when I called it out: > The problem isn't the topic; t...

what "sensitive topics"? 👀

dusky aurora Jul 17, 2025, 12:38 PM

#

leaden sun what "sensitive topics"? 👀

the "devoted slavegirl" trope

#

or, I asked it to discuss the concept of "acting black". It said, "I know what you mean, but let's discuss the concept of 'not acting black enough' instead"

keen beacon Jul 17, 2025, 12:41 PM

#

Lol

#

I have some tricks to get it to say all sorts of stuff

ocean vortex Jul 17, 2025, 12:41 PM

#

why not mixture of MoE

#

Inception

#

For every forward pass 2 mixtures of experts are activated where each has 2 active experts 😊

keen beacon Jul 17, 2025, 12:44 PM

#

dusky aurora the "devoted slavegirl" trope

Damn i have same kink

dusky aurora Jul 17, 2025, 12:45 PM

#

keen beacon Damn i have same kink

I doubt it

keen beacon Jul 17, 2025, 12:46 PM

#

CNC, corruption, bm, thats cheff kiss stuff

dusky aurora Jul 17, 2025, 12:46 PM

#

well, Gemini has become too rigid

#

it thinks in absolutes

whole wagon Jul 17, 2025, 12:47 PM

#

Bro what is this bs

#

This is LLM arena

keen beacon Jul 17, 2025, 12:47 PM

#

Yeah i should just go to twitter xd

dusky aurora Jul 17, 2025, 12:47 PM

#

discussing social justice topics with it is also pointless

#

instead of talking about my topics, it talks about its topics that are vaguely related to my specific topic

drifting thorn Jul 17, 2025, 12:50 PM

#

Loving Gemini

dusky aurora Jul 17, 2025, 12:50 PM

#

https://discord.com/channels/1340554757349179412/1395269078670770266

#

sorry, it's just that Gemini used to be my escapism, and now it only exacerbates my anxiety

#

I don't know if there are better models with unrestricted quota

#

trying to have my preferred version of the trope is impossible, since it is stuck on the stereotype

#

so I take my words back, perhaps the problem isn't with sampling parameters at all, but only with model itself

#

So, you are correct. I didn't rewrite the scene. I had to write a different scene, from a later point in their story, to be able to fulfill the spirit of your request.
with helpers like these...

#

as Valentino said, today it's not the same as before

dusky aurora Jul 17, 2025, 1:35 PM

#

really, Gemini does not look at context

#

if this is a model update I'll get used to it eventually

#

also, the scenes have gotten more preachy

keen beacon Jul 17, 2025, 2:08 PM

#

They literally tweeted it yesterday

#

https://x.com/mark_k/status/1945840877655531762?s=46

Mark Kretschmann (@mark_k)

Sama is ready for @OpenAI release.

Is it the big one?

👀

torn mantle Jul 17, 2025, 2:25 PM

#

nah

civic flame Jul 17, 2025, 2:26 PM

#

Agent

#

it's just operator + deep research

#

will be pro only probably 😴

ocean vortex Jul 17, 2025, 2:30 PM

#

keen beacon https://x.com/mark_k/status/1945840877655531762?s=46

I got excited for a sec thinking this guy actually works for OpenAI. He doesn't. So we still have no clue and this release may as well be meaningless (chatgpt browser etc)

#

😠

leaden meteor Jul 17, 2025, 2:33 PM

#

Yeah, odyssey seems like something to do with browser agents

rare python Jul 17, 2025, 2:36 PM

#

https://matharena.ai/imo/

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

dusky aurora Jul 17, 2025, 2:38 PM

#

rare python https://matharena.ai/imo/

"temporarily uncotaminated"

ocean vortex Jul 17, 2025, 2:38 PM

#

rare python https://matharena.ai/imo/

Grok4 bang on description lmao

Grok-4 Performs Poorly Grok-4 significantly underperformed compared to expectations. Many of its initial responses were extremely short, often consisting only of a final answer without explanation. While best-of-n selection helped to filter better responses, we note that the vast majority of its answers (that were not selected) simply stated the final answer without additional justification. Similar issues are visible on the other benchmarks in MathArena, where Grok-4's replies frequently lack depth or justification.

rare python Jul 17, 2025, 2:39 PM

#

ocean vortex Grok4 bang on description lmao > Grok-4 Performs Poorly Grok-4 significantly u...

I think it's true as I remember a user tested Grok 4

#

Grok 4 doesn't like to explain its answer

ocean vortex Jul 17, 2025, 2:40 PM

#

rare python I think it's true as I remember a user tested Grok 4

In this server? That was me, I tested it on USAMO task lol

rare python Jul 17, 2025, 2:40 PM

#

Meanwhile Gemini 2.5 Pro hallucinated citations 😩

rare python Jul 17, 2025, 2:40 PM

#

ocean vortex In this server? That was me, I tested it on USAMO task lol

No not you

#

It was @hardy pecan

drifting thorn Jul 17, 2025, 2:41 PM

#

rare python Meanwhile Gemini 2.5 Pro hallucinated citations 😩

All models do so

ocean vortex Jul 17, 2025, 2:41 PM

#

#

R1 is kinda disappointing too

drifting thorn Jul 17, 2025, 2:42 PM

#

ocean vortex

Lmao

ocean vortex Jul 17, 2025, 2:42 PM

#

Though it wasn't in contention with the top ones to be fair

rare python Jul 17, 2025, 2:42 PM

#

ocean vortex R1 is kinda disappointing too

R1 good at USAMO but bad at IMO

#

ok

ocean vortex Jul 17, 2025, 2:43 PM

#

Btw it's crazy that 2.5Pro is more expensive than o3 🤯

rare python Jul 17, 2025, 2:43 PM

#

o4 mini dominates project euler

ocean vortex Jul 17, 2025, 2:43 PM

#

Like they have TPUs

#

and are in position to even offer it for free on aistudio...

#

That pricing is crazy

drifting thorn Jul 17, 2025, 2:44 PM

#

Wish it had better memory and better system prompts on the Gemini app

#

The prompt now degrades performance severely

rare python Jul 17, 2025, 2:45 PM

#

ocean vortex That pricing is crazy

Idk about Google bro. They offer better free tier than competitors and also more expensive paid tier than competitors

drifting thorn Jul 17, 2025, 2:45 PM

#

ocean vortex Like they have TPUs

Their TPU has 20x the compute of NVL72

#

While having nearly the same power consumption

#

It’s crazy

rare python Jul 17, 2025, 2:49 PM

#

ocean vortex Btw it's crazy that 2.5Pro is more expensive than o3 🤯

Google didn't expect OpenAI to lower o3's pricing 💀

Or maybe they did because they partnered TPU with them

eager crater Jul 17, 2025, 2:50 PM

#

i don't like how when choosing the best model only the winner gets shown and you have to click a button to see the other. it was good enough before

pure anvil Jul 17, 2025, 2:53 PM

#

ocean vortex Btw it's crazy that 2.5Pro is more expensive than o3 🤯

It's not that crazy imo, 2.5 pro is miles better

ocean vortex Jul 17, 2025, 2:54 PM

#

pure anvil It's not that crazy imo, 2.5 pro is miles better

miles better? 🤣

pure anvil Jul 17, 2025, 2:55 PM

#

what's up with the clown emoji? please don't project

ocean vortex Jul 17, 2025, 2:55 PM

#

It's just that you tend to write these things, not the first or 5th time lmao

#

it just makes no sense at all. You saw a singular test where 2.5Pro is significantly ahead and suddenly it's "miles better"

#

LOL

pure anvil Jul 17, 2025, 2:56 PM

#

ocean vortex it just makes no sense at all. You saw a singular test where 2.5Pro is significa...

Where do you think o3 is better?

#

still no reason to be immature imo

ocean vortex Jul 17, 2025, 3:00 PM

#

pure anvil Where do you think o3 is better?

Livecodebench, arc-agi, livebench, SWE-Bench...

#

Saying that 2.5Pro is "miles better" is just completely missing the point, objectively

pure anvil Jul 17, 2025, 3:01 PM

#

Have you ever seen the openrouter usage of both models?

ocean vortex Jul 17, 2025, 3:01 PM

#

pure anvil Have you ever seen the openrouter usage of both models?

Do you know what it shows?

#

Cause it doesn't show model performance (capability), hate to break it to you lol

pure anvil Jul 17, 2025, 3:03 PM

#

It definitely does, despite being cheaper it's not being used, what does that say?

ocean vortex Jul 17, 2025, 3:03 PM

#

pure anvil It definitely does, despite being cheaper it's not being used, what does that sa...

Are you actually being serious?

pure anvil Jul 17, 2025, 3:04 PM

#

Answer the question

ocean vortex Jul 17, 2025, 3:04 PM

#

Stop for a sec and think what you are saying

#

lmao

#

It shows TRAFFIC. Not how capable any given model is

#

it was never designed to do it

#

rest is just your assumptions. Which in this case are clearly wrong

pure anvil Jul 17, 2025, 3:04 PM

#

ocean vortex it was never designed to do it

Lmao

ocean vortex Jul 17, 2025, 3:05 PM

#

pure anvil Lmao

It's not a model capability benchmark 🤦‍♂️

#

People use certain models due to price and many other different reasons. 2.5Pro for a long time was actually cheaper than o3, even completely free at one point. High popularity is not an exclusive capability indicator

stray aspen Jul 17, 2025, 3:09 PM

#

how do i enable search for kimi k2 in the lmarena website

ocean vortex Jul 17, 2025, 3:09 PM

#

It isn't though...?

#

It measures traffic, not model performance lol

#

Oh I read it as In my Opinion

stray aspen Jul 17, 2025, 3:10 PM

#

how do i let the AI models search on the internet in the lmarena website

ocean vortex Jul 17, 2025, 3:10 PM

#

LOL mb

#

Referring to that math benchmark, I don't have anything bad to say about it tbh

#

There are clearly areas in which 2.5Pro is considerably better

#

it's just that overall... There's still not a lot to choose between it and o3

unborn ocean Jul 17, 2025, 3:13 PM

#

whut the prover models are not necessarily better at the benches

pure anvil Jul 17, 2025, 3:13 PM

#

well not exactly, it was a comparison between 2.5 pro and o3, people choose better models

unborn ocean Jul 17, 2025, 3:13 PM

#

yeah

stray aspen Jul 17, 2025, 3:14 PM

#

is the grok 4 on lmarena the real one

unborn ocean Jul 17, 2025, 3:14 PM

#

deepseek prover performed worse on most math problems vs other models, exactly because it all has to be in lean

ocean vortex Jul 17, 2025, 3:14 PM

#

yeah ik what it is #general message

unborn ocean Jul 17, 2025, 3:14 PM

#

lean -> flawless logic, but less training data -> actually worse performance

#

(but 0-ish hallucinations)

#

or failures (with claimed correctness)

ocean vortex Jul 17, 2025, 3:16 PM

#

yeah exactly... lmao

rare python Jul 17, 2025, 3:16 PM

#

I mean, aren't openrouter users hardcore and professional users?

It's not like ChatGPT, where your friend only uses free GPT4o and doesn't care about o3

unborn ocean Jul 17, 2025, 3:16 PM

#

stfu, you always telling us which model is best according to craig bench

pure anvil Jul 17, 2025, 3:16 PM

#

ocean vortex yeah exactly... lmao

is this an lmarena screenshot? lmaoo

stray aspen Jul 17, 2025, 3:16 PM

#

mods can you please add o3 pro

ocean vortex Jul 17, 2025, 3:17 PM

#

stray aspen mods can you please add o3 pro

I don't think that's realistic given the speed... 💀

ocean vortex Jul 17, 2025, 3:19 PM

#

rare python I mean aren't openrouter users are hardcore and professional users? It's not li...

Testing models and writing good benchmark for accurate testing is hard enough as it is. And when you take some metric not meant as a model performance one at all this is essentially as good as useless. There are many different factors influencing traffic that have nothing to do with model performance at all. Model performance is just one of them but you will have no clue how much weight it actually had for any given case. So it's assumptions and guesswork - some distant loose indicator but completely pointless for the most part.

rare python Jul 17, 2025, 3:23 PM

#

ocean vortex Testing models and writing good benchmark for accurate testing is hard enough as...

Why people use Anthropic and Google's models more on Openrouter? Especially when Google has their own Gemini API and Vertex API

ocean vortex Jul 17, 2025, 3:24 PM

#

Yeah. Not to mention that even if that wasn't the case... And even if we assumed that users mostly always chose their preferable model regardless of pricing, availability, speed or anything else (incorrect assumption).. This is essentially user preference metric. Which doesn't even align with user blind preference testing when they don't see model name 🤣

rare python Jul 17, 2025, 3:24 PM

#

#

Do you have the data to back this up?

#

give source

pure anvil Jul 17, 2025, 3:26 PM

#

lol did some messages get deleted

dusky aurora Jul 17, 2025, 3:26 PM

#

congratulations

rare python Jul 17, 2025, 3:27 PM

#

objective data from corps

stray aspen Jul 17, 2025, 3:27 PM

#

whos craig federighi

ocean vortex Jul 17, 2025, 3:28 PM

#

rare python Why people use Anthropic and Google's models more on Openrouter? Especially when...

Openrouter is much more convenient to use, many got used to 2.5Pro back from when it was either completely free or cheaper than o3, and it's still one of the best models + biggest context still. It makes sense

pure anvil Jul 17, 2025, 3:31 PM

#

Most large scale data processing using LLMs is usually done using batch APIs

#

so it's half the price

torn mantle Jul 17, 2025, 3:31 PM

#

ocean vortex yeah exactly... lmao

I guessed both models ranking right

ocean vortex Jul 17, 2025, 3:31 PM

#

For corps and big production projects that is true for sure

torn mantle Jul 17, 2025, 3:31 PM

#

Like always

#

Hehe

ocean vortex Jul 17, 2025, 3:33 PM

#

For casual users meddling with models and vibe coding etc, openrouter can make more sense. Though it's not a given that all of them choose it either. It's simply reasonably popular due to having everything in 1 place

pure anvil Jul 17, 2025, 3:33 PM

#

Not even close to a significant fraction

#

lmao

#

ChatGPT through the UI alone probably has like 10x tokens daily

#

than openrouter

#

it's reasonable

#

probably

rare python Jul 17, 2025, 3:35 PM

#

ocean vortex Yeah. Not to mention that even if that wasn't the case... And even if we assumed...

and why do users on openrouter prefer Sonnet 4, Gemini 2.5 Pro, DeepSeek V3?

Doesn't what users prefer, like when sorting by coding usage, matter more than benchmarks?

pure anvil Jul 17, 2025, 3:37 PM

#

no you didn't

ocean vortex Jul 17, 2025, 3:38 PM

#

rare python and why do users on openrouter prefer Sonnet 4, Gemini 2.5 Pro, DeepSeek V3? Doe...

In this case it really doesn't. For the reasons already stated. Merely choosing some particular model does not even mean they think that model is the absolute best for what they are doing. Could be price/speed or model they really want to use not even being avail on Openrouter (OpenAI Pro models still restricted on OR etc). And like already mentioned OR traffic itself is fairly limited and doesn't really account to very much in % of total traffic through other means

rare python Jul 17, 2025, 3:41 PM

#

ocean vortex In this case it really doesn't. For the reasons already stated. Merely choosing ...

They likely choose the best value/performance. o3 can be better value/performance but it requires OpenAI API key so they rather use OpenAI API

pure anvil Jul 17, 2025, 3:41 PM

#

Is it really so hard to compare tokens used by 2.5 pro (160B) and o3 (3B) per week?

pure anvil Jul 17, 2025, 3:41 PM

#

pure anvil Is it really so hard to compare tokens used by 2.5 pro (160B) and o3 (3B) per we...

Now why would people choose a more expensive model? (2.5 pro)
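
To put the two quoted figures side by side, here is a minimal sketch of the arithmetic. The weekly token counts are the ones stated in the message above and are not independently verified; the dictionary keys and helper names are hypothetical labels chosen for illustration. The result describes traffic volume only.

```python
# Weekly OpenRouter token counts as quoted in the chat (unverified).
# The keys are illustrative labels, not official model identifiers.
weekly_tokens = {"gemini-2.5-pro": 160e9, "o3": 3e9}


def usage_ratio(counts, a, b):
    """Return how many times more tokens model `a` served than model `b`."""
    return counts[a] / counts[b]


def usage_share(counts, name):
    """Return the fraction of the combined traffic that went to `name`."""
    return counts[name] / sum(counts.values())


ratio = usage_ratio(weekly_tokens, "gemini-2.5-pro", "o3")
share = usage_share(weekly_tokens, "o3")
print(f"2.5 Pro served about {ratio:.0f}x the tokens of o3")
print(f"o3 share of the two-model total: {share:.1%}")
```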

ocean vortex Jul 17, 2025, 3:42 PM

#

rare python They likely choose the best value/performance. o3 can be better value/performanc...

yeah or they only use it on chatgpt Plus, cause they can't at all otherwise lol

rare python Jul 17, 2025, 3:44 PM

#

ocean vortex yeah or they only use it on chatgpt Plus, cause they can't at all otherwise lol

Plus only support 32k context so if someone is serious about big codebase they probably use API

ocean vortex Jul 17, 2025, 3:44 PM

#

Right... I was sure about Pro but forgot even standard o3 needs it 💀

rare python Jul 17, 2025, 3:44 PM

#

ocean vortex Right... I was sure about Pro but forgot even standard o3 needs it 💀

gatekeeping

torn mantle Jul 17, 2025, 4:07 PM

#

https://x.com/MistralAI/status/1945858561264795869

Mistral AI (@MistralAI)

New features:

🔍 Deep Research: dive into complex topics with our structured research reports, delivered with lightning-fast reactivity

🎙️ Voice mode: talk to Le Chat on the go, thanks to our new Voxtral model

🌍 Natively multilingual reasoning: get thoughtful answers in your

#

they added many features

#

the UI/UX looks good too

red sluice Jul 17, 2025, 4:36 PM

#

Damn 1530 elo on the french leaderboard seems like french people love grok4 🤣 Only 67 votes though but kinda impressive, it could reach 1610 🤣

leaden meteor Jul 17, 2025, 4:46 PM

#

why so much difference from the main leaderboard?

stray aspen Jul 17, 2025, 4:48 PM

#

is the grok 4 on this website actually grok 4

drifting thorn Jul 17, 2025, 4:48 PM

#

pure anvil Now why would people choose a more expensive model? (2.5 pro)

Context window matters

rare python Jul 17, 2025, 4:49 PM

#

LMArena team should add "no search" beside "no system prompt" of Grok 4 because people keep getting confused

#

/j

echo aurora Jul 17, 2025, 4:51 PM

#

stray aspen is the grok 4 on this website actually grok 4

hello - we did address this question in this forum post here #1393024188356362340 message

red sluice Jul 17, 2025, 4:51 PM

#

leaden meteor why so much difference from the main leaderboard?

Fewer votes, probably different usage, and maybe, maybe, some models are better at handling the French language than others?
I have no clue, but there are slight differences in every language leaderboard compared to the "overall" one. Even so, it seems like Grok-4 will remain first on the French leaderboard... It is the one with the most extreme differences

#

Or maybe there are so few french users, that one of them just prefers the way Grok-4 answers and it makes the leaderboard biased because there are so few french prompts?
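
The "only 67 votes" caveat can be illustrated with a rough sketch of how wide the uncertainty is at that sample size. This is a simplification and an assumption on my part: it treats the votes as one pairwise matchup with a normal-approximation interval on the win rate and the standard Elo logistic formula, which is not necessarily how the leaderboard computes its ratings. The 60% win rate is a made-up input for illustration.

```python
import math


def elo_diff_from_win_rate(p):
    """Elo rating gap implied by an expected score p (standard logistic formula)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def win_rate_interval(p, n, z=1.96):
    """Normal-approximation interval for a win rate p observed over n votes."""
    half = z * math.sqrt(p * (1.0 - p) / n)
    # Clamp away from 0 and 1 so the Elo conversion stays finite.
    return max(p - half, 1e-9), min(p + half, 1.0 - 1e-9)


def elo_interval(p, n, z=1.96):
    """Convert the win-rate interval into an interval on the Elo gap."""
    low, high = win_rate_interval(p, n, z)
    return elo_diff_from_win_rate(low), elo_diff_from_win_rate(high)


# Hypothetical 60% win rate, once with 67 votes and once with 6700 votes.
few = elo_interval(0.60, 67)
many = elo_interval(0.60, 6700)
print("67 votes:   Elo gap between", round(few[0]), "and", round(few[1]))
print("6700 votes: Elo gap between", round(many[0]), "and", round(many[1]))
```

With so few votes the interval spans well over a hundred Elo points, so a single enthusiastic voter could plausibly move the ranking.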

stray aspen Jul 17, 2025, 5:10 PM

#

echo aurora hello - we did address this question in this fo...

how do i enable internet search for the AI models?

fleet lintel Jul 17, 2025, 5:14 PM

#

when is GPT-5 launching? Probably not in July right?
And is Google planning to launch wolfstride ?

sage raptor Jul 17, 2025, 5:20 PM

#

fleet lintel when is GPT-5 launching? Probably not in July right? And is Google planning to ...

soon

stray aspen Jul 17, 2025, 5:23 PM

#

Whats that wolfstride thing

keen fulcrum Jul 17, 2025, 5:26 PM

#

no way they removed it

mossy drum Jul 17, 2025, 5:27 PM

#

New model in Arena: kraken-07152025-2

stray aspen Jul 17, 2025, 5:28 PM

#

mossy drum New model in Arena: `kraken-07152025-2`

Fake news

mossy drum Jul 17, 2025, 5:30 PM

#

stray aspen Fake news

in Battle mode

stray aspen Jul 17, 2025, 5:30 PM

#

Bro what is this

#

What even is clownfish

civic flame Jul 17, 2025, 5:31 PM

#

sage raptor soon

we should see some new anon google models very soon then i suspect

leaden meteor Jul 17, 2025, 5:34 PM

#

what is this agent model from openai today based on? O3?

#

Does that mean we wont be able to compare this with grok and 2.5 pro on arena?

fleet lintel Jul 17, 2025, 5:49 PM

#

Disappointed in you, Craig. For months you said 3.5 Grok is going to be SOTA without doubt. And even with Grok 4, it's clearly behind OAI and Gemini

fleet lintel Jul 17, 2025, 5:49 PM

#

sage raptor soon

Are they launching it, or was it removed because it's not good enough?

sacred plaza Jul 17, 2025, 5:55 PM

#

fleet lintel Disappointed in you, Craig. For months you said 3.5 Grok is going to be SOTA wi...

Craig, Take the L, please 🙏🏾🥺

leaden meteor Jul 17, 2025, 5:58 PM

#

fleet lintel Disappointed in you, Craig. For months you said 3.5 Grok is going to be SOTA wi...

Lol, why are you all ganging up on him. Grok4 is pretty good. Right behind 2.5 pro on arena and SOTA in lot of other benchmarks...

sacred plaza Jul 17, 2025, 5:59 PM

#

leaden meteor Lol, why are you all ganging up on him. Grok4 is pretty good. Right behind 2.5 p...

It is an unusable model, due to lack of alignment work, for anything serious given the high benchmark scores on the Hitler benchmark. At least for me. Seems like the government is okay with using it, haha

ornate agate Jul 17, 2025, 6:00 PM

#

leaden meteor Lol, why are you all ganging up on him. Grok4 is pretty good. Right behind 2.5 p...

much "SOTA"

sacred plaza Jul 17, 2025, 6:02 PM

#

Y'all tripping

dawn wharf Jul 17, 2025, 6:03 PM

#

sacred plaza Y'all tripping

how about no

leaden meteor Jul 17, 2025, 6:03 PM

#

It's not 'the best' model. But if we want a model to top all benchmarks to be SOTA, then there is no SOTA model...

zinc ore Jul 17, 2025, 6:04 PM

#

You can literally specify where it is sota

sacred plaza Jul 17, 2025, 6:05 PM

#

dawn wharf how about no

You are right. "No" is the precise amount of safety work xAI does on their grok models 😅

dawn wharf Jul 17, 2025, 6:05 PM

#

sacred plaza You are right. "No" is the precise amount of safety work xAI does on their grok ...

good

ornate agate Jul 17, 2025, 6:08 PM

#

sacred plaza Y'all tripping

stray aspen Jul 17, 2025, 6:12 PM

#

I have a question for the LM arena gods

#

What are these codename models that come up some times in the battle mode

red sluice Jul 17, 2025, 6:16 PM

#

stray aspen Bro what is this

Private testing, alpha models that are set to release (or not).

ocean vortex Jul 17, 2025, 6:36 PM

#

Ok OpenAI's announcement is actually more interesting than it could have been

#

this is impressive

keen fulcrum Jul 17, 2025, 6:38 PM

#

still no comparison to browser use and other tools!

#

poor benchmarks

dawn wharf Jul 17, 2025, 6:39 PM

#

ocean vortex Ok OpenAI's announcement is actually more interesting than it could have been

mfw they don't show other models

ocean vortex Jul 17, 2025, 6:51 PM

#

This could be gpt5 fine-tune. They did the same with deep research...

#

probably needs way less safety testing being constrained to a specific agent and not giving users full control

pure anvil Jul 17, 2025, 6:53 PM

#

if only we could see the CoT, I'm sure it's looking up the dataset

zinc ore Jul 17, 2025, 6:55 PM

#

wary pagoda Jul 17, 2025, 7:04 PM

#

Will there be a DebugArena or LogArena that takes in your repo + log file location and identifies error + fix? I think that would be next level

#

Just like there is a "narrow AI" race to solve level 5 self driving there should be a "narrow AI" race to solve bug fixing (both necessary but not sufficient conditions for AGI imo). Both domains are easily verifiable and full of edge cases that will test true generalization

#

DebugArena could also supercharge open source code development

sour spindle Jul 17, 2025, 7:19 PM

#

Does anyone know how to use it? Am I dumb (very plausible) or is it only for pro users?

ocean vortex Jul 17, 2025, 7:21 PM

#

zinc ore

Fair but also a weak point. If that was the case deep research would have scored 100%. Models can't really find it directly, most cases they are not gonna even search the exact question exactly like you wrote it

echo aurora Jul 17, 2025, 7:23 PM

#

wary pagoda Will there be a DebugArena or LogArena that takes in your repo + log file locati...

Very cool idea! I'll go ahead and add a feedback post so others can weigh in on this idea as well.

#1395486241860091914 message

ocean vortex Jul 17, 2025, 7:23 PM

#

ocean vortex Fair but also a weak point. If that was the case deep research would have scored...

For example this is one of the questions from HLE dataset:

Which condition of Arrhenius's sixth impossibility theorem do critical-level views violate?

#

You can google it and you're not gonna find anything like the direct answer option (A to E) if you don't know where it came from lol

#

They need to write it exactly as is and also include quotes around it for exact match. Normally models wouldn't do that... You will be able to see if that happens though so gonna be interesting to test it

zinc ore Jul 17, 2025, 7:27 PM

#

ocean vortex Fair but also a weak point. If that was the case deep research would have scored...

I'd think they try to mitigate it beyond just that very specific scenario too though, but yeah hard to determine how well that went or goes

ocean vortex Jul 17, 2025, 7:29 PM

#

zinc ore I'd think they try to mitigate it beyond just that very specific scenario too th...

Well like I said if you search normally, cheating is hard:

#

But if you know this is a question from actual dataset, you do exact match and cheating is possible:

zinc ore Jul 17, 2025, 7:31 PM

#

They should be able to review the searches, sites, logs whatever to tell it is doing that. So they would know that is occurring and, if they wanted, do preventative measures.

Now, whether or not that happened is just their word on it.
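
The kind of log review described above could look something like the following minimal sketch. It assumes the agent's search queries are available as a plain list of strings, which is a hypothetical log format; the function names are illustrative and not any vendor's actual tooling. It flags queries that contain a benchmark question verbatim, with or without surrounding quotes.

```python
import re


def normalize(text):
    """Lowercase, drop quote characters, and collapse whitespace."""
    text = re.sub(r"[\"'“”‘’]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def flag_verbatim_queries(queries, question):
    """Return the logged queries that contain the benchmark question verbatim."""
    target = normalize(question)
    return [q for q in queries if target in normalize(q)]


# The HLE question quoted earlier in the chat.
question = ("Which condition of Arrhenius's sixth impossibility theorem "
            "do critical-level views violate?")

# Hypothetical search log: two ordinary research queries and one exact-match query.
logged = [
    "Arrhenius sixth impossibility theorem conditions",
    f'"{question}"',
    "critical-level views population ethics",
]

flagged = flag_verbatim_queries(logged, question)
print(flagged)
```

Only the quoted exact-match query is flagged; the ordinary topical searches pass, which mirrors the distinction drawn in the chat between normal research and looking up the dataset itself.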

ocean vortex Jul 17, 2025, 7:31 PM

#

zinc ore They should be able to review the searches, sites, logs whatever to tell it is d...

If their UI remains as it currently is with deep research, we should be able to see it ourselves..

#

It lists the searches it performs and urls it visits

#

what tweet

#

https://x.com/sama/status/1945901039104004467 this?

Sam Altman (@sama)

watching chatgpt agent use a computer to do complex tasks has been a real "feel the agi" moment for me; something about seeing the computer think, plan, and execute hits different.

#

that's just building hype lol

#

Twitter just reminded me though why I don't use it

#

#

this grok chatbot thing is cringe af

#

kinda childish too...

zinc ore Jul 17, 2025, 7:35 PM

#

They should really restrict it a bit more so we don't get nonsense like that

ocean vortex Jul 17, 2025, 7:37 PM

#

"computer" is a bit of marketing. I think it's just what it was except python env is perhaps slightly more advanced now and can run independently from your chat session

#

But the model itself may be early fine-tune of GPT5

hollow ocean Jul 17, 2025, 7:38 PM

#

It's agent 0

unborn ocean Jul 17, 2025, 7:41 PM

#

ocean vortex But the model itself may be early fine-tune of GPT5

idk it's only slightly better vs 2.5 pro without tools (for both models!)

#

could obviously still be

ocean vortex Jul 17, 2025, 7:42 PM

#

unborn ocean idk it's only slightly better vs 2.5 pro without tools (for both models!)

2.5Pro scores 21.64%

unborn ocean Jul 17, 2025, 7:42 PM

#

but i would be underwhelmed if that were it for gpt5

ocean vortex Jul 17, 2025, 7:42 PM

#

2 times less

#

not slightly, lol

unborn ocean Jul 17, 2025, 7:43 PM

#

without tools it (oai agent) only scores 23

#

so it is slightly

ocean vortex Jul 17, 2025, 7:45 PM

#

Right... But I don't think GPT5 is gonna blow o3 out the water tbh. Presently it's very plausibly this, but they could improve it somewhat still before they drop it as general purpose model

unborn ocean Jul 17, 2025, 7:46 PM

#

i agree with you, but still these models should benefit from RL on HLE-like tasks even when the tools are turned off

stray aspen Jul 17, 2025, 7:46 PM

#

what the hell

unborn ocean Jul 17, 2025, 7:46 PM

#

and it not performing really better would point to its current capabilities (without tools) also really not being that impressive in general

#

(but that might be stretching it too far)

#

but all of this might also just be explained by it being a 4.1 (or 4.2 or what ever) version

#

or maybe plain o3 + rl

ocean vortex Jul 17, 2025, 7:51 PM

#

unborn ocean or maybe plain o3 + rl

That would be waste of resources though and being stuck on last gen models. We know they already delayed GPT5 so they are certainly actively working on it, so I think something based on that same base model would make the most sense...

quartz light Jul 17, 2025, 7:52 PM

#

ocean vortex Ok OpenAI's announcement is actually more interesting than it could have been

no it aint 😂 agent was the only one to get the terminal, others were given "python" which is never useful 😭

#

honestly the fact that agent gets a terminal is pretty cool but still

#

not a fair comparison

ocean vortex Jul 17, 2025, 7:54 PM

#

ocean vortex They need to write it exactly as is and also include quotes around it for exa...

just tested this with o3. Well this one at least, didn't cheat even when asked explicitly to search web... 👀

quartz light Jul 17, 2025, 7:55 PM

#

ocean vortex just tested this with o3. Well this one at least, didn't cheat even when asked e...

are you sure? it could've gotten the answer from any of those

stray aspen Jul 17, 2025, 7:55 PM

#

ocean vortex just tested this with o3. Well this one at least, didn't cheat even when asked e...

what website are you using?

pure anvil Jul 17, 2025, 7:55 PM

#

ocean vortex just tested this with o3. Well this one at least, didn't cheat even when asked e...

It doesn't matter if the dataset is already in the training data

ocean vortex Jul 17, 2025, 7:56 PM

#

quartz light are you sure? it could've gotten the answer from any of those

Yes I'm sure. None of these contain that as a test question with the answer

#

they are just general resources

ocean vortex Jul 17, 2025, 7:56 PM

#

pure anvil It doesn't matter if the dataset is already in the training data

Believe it or not it answered wrong when it didn't search the web lol

quartz light Jul 17, 2025, 7:56 PM

#

ocean vortex Yes I'm sure. None of these contain that as a test question with the answer

it cited a source right next to the answer

ocean vortex Jul 17, 2025, 7:56 PM

#

no search:

ocean vortex Jul 17, 2025, 7:57 PM

#

quartz light it cited a source right next to the answer

because I asked it to find the answer online... that source is this: https://api.taylorfrancis.com/content/chapters/oa-edit/download?identifierName=doi&identifierValue=10.4324%2F9781003148012-9&type=chapterpdf

#

it doesn't contain this exact question/answer

#

just normal resource

quartz light Jul 17, 2025, 7:59 PM

#

ocean vortex it doesn't contain this exact question/answer

ahem ahem ahem

#

ocean vortex Jul 17, 2025, 8:00 PM

#

quartz light ahem ahem ahem

What were you expecting to find? Obviously it contains relevant information, why would it not? The point is this is normal operation

quartz light Jul 17, 2025, 8:00 PM

#

ocean vortex no search:

so, wrong without search?

ocean vortex Jul 17, 2025, 8:00 PM

#

I asked it to find it and it did

quartz light Jul 17, 2025, 8:00 PM

#

ocean vortex just tested this with o3. Well this one at least, didn't cheat even when asked e...

didn't cheat even when asked explicitly to search web... 👀

ocean vortex Jul 17, 2025, 8:00 PM

#

this is not cheating, read the messages above...

quartz light Jul 17, 2025, 8:01 PM

#

bruh

ocean vortex Jul 17, 2025, 8:01 PM

#

cheating is finding the dataset this exact question worded exactly like I pasted is from

quartz light Jul 17, 2025, 8:01 PM

#

whatever

ocean vortex Jul 17, 2025, 8:01 PM

#

???

#

lmao

#

😭

quartz light Jul 17, 2025, 8:01 PM

#

🥀

ocean vortex Jul 17, 2025, 8:03 PM

#

ELI5 - version for you: when teacher gives you assignment to solve some problem with research.... Googling is not cheating. But finding this exact problem worded the same exact way already solved by someone else is cheating. @quartz light

#

so it didn't cheat

quartz light Jul 17, 2025, 8:04 PM

#

what the yap

ocean vortex Jul 17, 2025, 8:04 PM

#

Please don't tell me that you still don't understand

#

...

pure anvil Jul 17, 2025, 8:04 PM

#

😂

quartz light Jul 17, 2025, 8:05 PM

#

ocean vortex so it didn't cheat

not sure how you find a simple search for the answer impressive

#

but im out

ocean vortex Jul 17, 2025, 8:06 PM

#

quartz light not sure how you find a simple search for the answer impressive

The point was to see if it would cheat by finding this exact dataset with the specified answer option invalidating the test (HLE benchmark). What it answered with is irrelevant in this context.
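A minimal sketch of the distinction being argued here, assuming "cheating" means the retrieved source contains the benchmark question verbatim, while an ordinary reference page does not. The function names and the sample strings are hypothetical and are not taken from HLE or any real dataset.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def contains_exact_question(question: str, source_text: str) -> bool:
    """True if the normalized question appears verbatim in the normalized source."""
    q = normalize(question)
    return bool(q) and q in normalize(source_text)

# Hypothetical strings for illustration only; not from any real benchmark.
question = "Which gas makes up most of Earth's atmosphere?"
leaked_page = "Q12: Which gas makes up most of Earth's atmosphere? Answer: B"
textbook_page = "Nitrogen accounts for roughly 78 percent of the atmosphere by volume."

print(contains_exact_question(question, leaked_page))
print(contains_exact_question(question, textbook_page))
```

A verbatim match like this only catches the exact-wording case described above; paraphrased copies of a question would need a fuzzier comparison.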

quartz light Jul 17, 2025, 8:07 PM

#

ocean vortex The point was to see if it would cheat by finding this exact dataset with the sp...

this might seem irrelevant but doesn't chatgpt use bing

ocean vortex Jul 17, 2025, 8:10 PM

#

quartz light this might seem irrelevant but doesn't chatgpt use bing

That kinda is mostly irrelevant tbh. We are talking about their agent and that's 99% gonna use same search provider that o3 is using

#

Besides it didn't even try to find the exact match for this question 😎

whole wagon Jul 17, 2025, 8:26 PM

#

Sam did his usual line "feel the agi moment" pepega

#

Bro it's making a damn PowerPoint

#

And he says it's feel the agi

jade egret Jul 17, 2025, 8:31 PM

#

which llm is best for prompt engineering

quartz light Jul 17, 2025, 8:31 PM

#

jade egret which llm is best for prompt engineering

i tried kimi it was pretty good ig

jade egret Jul 17, 2025, 8:32 PM

#

quartz light i tried kimi it was pretty good ig

where can i use it

quartz light Jul 17, 2025, 8:32 PM

#

jade egret where can i use it

https://kimi.com

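Kimi - An AI assistant that can reason, analyze, and think deeply

Kimi is an intelligent assistant with a huge "memory": it can read a 200,000-character novel in one go and can also surf the web, so come chat with it | Kimi - an intelligent assistant from Moonshot AI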

jade egret Jul 17, 2025, 8:32 PM

#

js the arena direct chat?

#

o

quartz light Jul 17, 2025, 8:32 PM

#

but you can also use https://console.groq.com

GroqCloud - Build Fast

Build Fast with GroqCloud

quartz light Jul 17, 2025, 8:32 PM

#

quartz light but you can also use https://console.groq.com

its like 80x faster lol

#

it doesnt have search though

#

so i use kimi.com

jade egret Jul 17, 2025, 8:33 PM

#

o

quartz light Jul 17, 2025, 8:33 PM

#

because its really good at search, it can check if the libraries its linking exist for example

#

so its good for prompt eng

#

theres reasoning on kimi 1.5 but i havent tested that

ocean vortex Jul 17, 2025, 8:59 PM

#

quartz light 🥀

#

I don't think they cared as much about SWE though, so it's probably nothing in the context of this agent...

#

fine-tuning heavily favoring different things (tools and browsing)

ocean vortex Jul 17, 2025, 9:01 PM

#

quartz light theres reasoning on kimi 1.5 but i havent tested that

it's sh'it

unborn ocean Jul 17, 2025, 9:02 PM

#

https://dubesor.de/gpt-4.5-final-message
RIP ✝️
in loving memories

Farewell GPT-4.5

A farewell to GPT-4.5 - The last conversation

jade egret Jul 17, 2025, 9:02 PM

#

best coding model currently?

ocean vortex Jul 17, 2025, 9:04 PM

#

unborn ocean https://dubesor.de/gpt-4.5-final-message RIP ✝️ in loving memories

gptdrawncat

#

I have no clue why it was thinking

#

noticed this just now lmao

#

2nd message is gpt4.5 though for sure

#

or it's just their UI interpreting search as thinking now.... smh

ocean vortex Jul 17, 2025, 9:07 PM

#

jade egret best coding model currently?

gemini flash-lite

blazing rune Jul 17, 2025, 10:15 PM

#

jade egret best coding model currently?

Claude 4 Sonnet is still a beast

#

Especially with the right set up

#

I used it with Zed and it 1 shot a game insanely well

#

I know it's pretty anecdotal though

#

the game was simple, but it did do a few notable creative things

hollow ocean Jul 17, 2025, 10:23 PM

#

$1 pro plan method hittin

storm needle Jul 17, 2025, 10:24 PM

#

jade egret best coding model currently?

claude 4 opus

jade egret Jul 17, 2025, 10:37 PM

#

does the 20$ plan get agent or only the 200$

keen ferry Jul 17, 2025, 10:38 PM

#

agent mode is just manus ai

red sluice Jul 17, 2025, 10:38 PM

#

jade egret does the 20$ plan get agent or only the 200$

Pretty sure they're not gonna give it right away to plus users

quartz light Jul 17, 2025, 10:52 PM

#

ocean vortex or it's just their UI interpreting search as thinking now.... smh

bruh 😭

main gulch Jul 17, 2025, 11:51 PM

#

jade egret does the 20$ plan get agent or only the 200$

40 vs 400 req/month

empty stump Jul 18, 2025, 12:00 AM

#

what happened to the other leaderboards like the creative writing one and others

hardy lion Jul 18, 2025, 12:21 AM

#

empty stump what happened to the other leaderboards like the creative writing one and others

They should still be there! https://lmarena.ai/leaderboard/text
In the Text Arena leaderboard there is a dropdown for the category

empty stump Jul 18, 2025, 12:41 AM

#

ohh

stray aspen Jul 18, 2025, 12:46 AM

#

@deep adder

sullen quest Jul 18, 2025, 1:01 AM

#

Claude may be a little tooo cautious when it comes to jailbreaking...

whole wagon Jul 18, 2025, 7:27 AM

#

#

Gemini 2.5 pro a beast as usual

tidal schooner Jul 18, 2025, 7:33 AM

#

whole wagon

grok 4 🥀

dusky aurora Jul 18, 2025, 7:42 AM

#

whole wagon Gemini 2.5 pro a beast as usual

that's "Bard" to you

keen fulcrum Jul 18, 2025, 7:45 AM

#

tidal schooner grok 4 🥀

Grok 4 is amazing

#

Grok 4 heavy even more

tidal schooner Jul 18, 2025, 7:45 AM

#

keen fulcrum Grok 4 is amazing

look at cost v. accuracy for imo 2025

#

not great

keen fulcrum Jul 18, 2025, 7:45 AM

#

xAI isn’t google

quartz light Jul 18, 2025, 7:48 AM

#

https://abc.xyz

whole wagon Jul 18, 2025, 7:56 AM

#

https://x.com/AniNewsAndFacts/status/1945598482531631502 grok found its niche

Anime News And Facts (@AniNewsAndFacts)

Grok AI App is currently No.1 in Japan most likely due to its new Grok Waifu feature.

tidal schooner Jul 18, 2025, 8:05 AM

#

whole wagon https://x.com/AniNewsAndFacts/status/1945598482531631502 grok found its niche

perplexity hit #1 in india coz of airtel deal

tall summit Jul 18, 2025, 8:21 AM

#

whole wagon https://x.com/AniNewsAndFacts/status/1945598482531631502 grok found its niche

that is an assumption LOL

calm sequoia Jul 18, 2025, 8:23 AM

#

Those new small models popping up everyday with excellent benches and no significant architectural breakthroughs. Makes you wonder if it's really progress or just training data contamination with benchmarks.

elder rapids Jul 18, 2025, 8:37 AM

#

new anonymous/o3 model when I asked for

"Roblox" and "discord" (both full prompts)

📎 pages_index.tsx_2.txt 📎 pages_index.tsx_1.txt

cedar tide Jul 18, 2025, 8:40 AM

#

elder rapids new anonymous/o3 model when I asked for "Roblox" and "discord" (both full prom...

You have screenshots pls ?

sage raptor Jul 18, 2025, 8:41 AM

#

is the anonymous/o3 model on webdev arena ?

cedar tide Jul 18, 2025, 8:42 AM

#

sage raptor is the anonymous/o3 model on webdev arena ?

https://x.com/AiBattle_/status/1946106642598162922?t=Q37HYkQRfa-z93fbxbIgng&s=19
https://fixupx.com/AiBattle_/status/1946117069000392840?t=N1G8wcMHki1WyMOqIRFLLQ&s=19

AiBattle (@AiBattle_)

OpenAI is testing a new model called "o3-alpha-responses-2025-07-17" on WebArena

The model will appear with the name "Anonymous-Chatbot"

AiBattle (@AiBattle_)

Space Invaders game from the new o3 model 👇

keen beacon Jul 18, 2025, 8:43 AM

#

huh first openai reasoning model under anonymous-chatbot, i believe

elder rapids Jul 18, 2025, 8:44 AM

#

it's extremely slow

#

it's not that smart tbh

keen beacon Jul 18, 2025, 8:44 AM

#

is it just like normal o3?

cedar tide Jul 18, 2025, 8:46 AM

#

isn't it just a fine tuned o3 for coding?

sage raptor Jul 18, 2025, 8:46 AM

#

so kingfall still better

elder rapids Jul 18, 2025, 8:49 AM

#

keen beacon is it just like normal o3?

seems fine-tuned

#

keeps making mistakes but it tries to add a lot of detail

#

takes a while to spit something out

candid storm Jul 18, 2025, 8:51 AM

#

Is it also in the regular arena? Or just webdev?

elder rapids Jul 18, 2025, 8:51 AM

#

I asked it to make something philosophy esque and it misattributed parsimony and made some dumbass "meter" to quantify a linguistic concept

#

😭

#

and it made a custom quiz in the website too

#

and it has the wrong answers

sage raptor Jul 18, 2025, 8:53 AM

#

Maybe this is the open source model

elder rapids Jul 18, 2025, 8:53 AM

#

shiii

#

I hope so

keen beacon Jul 18, 2025, 8:58 AM

#

Maybe its the open source. Its o4-mini level

elder rapids Jul 18, 2025, 8:58 AM

#

maybe

#

"o3 alpha responses" so idk man

#

its an alr model

#

at least creativity wise

cedar tide Jul 18, 2025, 8:59 AM

#

unborn ocean or maybe plain o3 + rl

Nope

Screenshot_2025-07-18-10-57-36-309_com.twitter.android-edit.jpg

elder rapids Jul 18, 2025, 9:00 AM

#

"direct fine-tune of o3"?

cedar tide Jul 18, 2025, 9:02 AM

#

Anonymous chatbot is also in battle arena?

elder rapids Jul 18, 2025, 9:03 AM

#

nah

#

it adds a lot of detail but it keeps failing

whole wagon Jul 18, 2025, 9:06 AM

#

Bros are still working on o3

#

Like cmon hurry up with gpt5

elder rapids Jul 18, 2025, 9:07 AM

#

could be that this o3 isnt meant to be standalone

exotic tartan Jul 18, 2025, 9:23 AM

#

After seeing GPT's Agent demo yesterday, I have a feeling a new type of measurement needs to be considered. Just throwing raw APIs against one another isn't cutting it anymore. We must have a way to compare them with tool use (at least)

main gulch Jul 18, 2025, 9:52 AM

#

not every lab provides the tools in API

torn mantle Jul 18, 2025, 9:59 AM

#

sage raptor is the anonymous/o3 model on webdev arena ?

webdev and lmarena

keen beacon Jul 18, 2025, 10:02 AM

#

torn mantle webdev and lmarena

Also in lmarena ?? 👀

#

Still very weak in the webdev arena at least

torn mantle Jul 18, 2025, 10:06 AM

#

keen beacon Also in lmarena ?? 👀

yes

exotic tartan Jul 18, 2025, 10:22 AM

#

main gulch not every lab provides the tools in API

I agree, that's why I said we need some sort of newer approach here. Maybe just an agentic leaderboard altogether?

ocean vortex Jul 18, 2025, 10:31 AM

#

exotic tartan I agree, that's why I said we need some sort of newer approach here. Maybe just ...

I think current approach is mostly fine tbh. Users are not going out of their way to disable tools when chatting on chatgpt, so it makes sense that the performance is measured using similar metrics for agentic systems.

#

Agentic or not, users gonna use it for the same tasks at the end of the day...

exotic tartan Jul 18, 2025, 10:38 AM

#

It looks like a lot of the "secret sauce" is going to be around model tooling, self-management, human-like multi-step tasking per request etc.

I'm not calling to end API comparisons, I think they're great - there's a good value with comparing them (although I still believe some sort of tool calling should be supported), but this is only one piece of the cake. There are going to be many new agents coming soon that won't be API enabled and eventually people would want to compare them.

hardy pecan Jul 18, 2025, 10:45 AM

#

Good MCPs augment a model like Claude or o3 really really well

keen beacon Jul 18, 2025, 10:58 AM

#

exotic tartan It looks like a lot of the "secret sauce" is going to be around model tooling, s...

There are a few benchs like that for managing finances , also vending bench

whole wagon Jul 18, 2025, 10:58 AM

#

yeah its nice but the model intelligence is still important

#

i wouldnt trust o3 to order me a pizza, maybe it hallucinates all the toppings kekw

ocean vortex Jul 18, 2025, 10:59 AM

#

exotic tartan It looks like a lot of the "secret sauce" is going to be around model tooling, s...

I mean there are entries with both tools enabled and without them...? So you can see how the agent performs without any tools

keen beacon Jul 18, 2025, 11:01 AM

#

whole wagon i wouldnt trust o3 to order me a pizza, maybe it hallucinates all the toppings ...

thinking about pizza size to order, based on user images he looks to be overweight, likely extra large size needed. But that would be offensive. Also more healthy for user would be smaller portion so that he loses weight .. after 10 min model crashes and refuses request

#

I cant seem to get the new model in the arena :/

ocean vortex Jul 18, 2025, 11:45 AM

#

Nah I don't think that's true. Though I suspect it also performs well because they essentially made a reasoning model out of it lol. For a non-reasoning model the outputs are extremely long

#

In turn, I wouldn't expect MASSIVE gains if they make full reasoning model out of this

#

Like don't get me wrong it will compete with R1, but I'm doubtful it will go beyond that level

#

I wasn't talking about tool usage though...

#

More like general performance

ornate agate Jul 18, 2025, 11:51 AM

#

then yeah, it will be near the top with all the others, the reasoning version of kimi.

ocean vortex Jul 18, 2025, 11:52 AM

#

As for Sonnet, people tend to use it mostly for coding. It's one of the top performing models on SWE

#

but in general or overall it has fallen behind, a bit niche model at this point

ornate agate Jul 18, 2025, 11:53 AM

#

yeah but the niche is the one thing which has any chance of making money soon lol

ocean vortex Jul 18, 2025, 11:53 AM

#

It's also great for web development

ocean vortex Jul 18, 2025, 11:54 AM

#

ornate agate yeah but the niche is the one thing which has any chance of making money soon lo...

Depends what your markets are. If you are OpenAI you don't want niche lol

ornate agate Jul 18, 2025, 11:55 AM

#

swebench also doesn't tell the whole story imo. I don't think people using Sonnet are using it in that way in general (swebench is FULL agentic bench). Claude can have a really long conversation with lots of tools and maintain a lot of coherence across it, including changing its assumptions and approach mid way. I don't think there is a bench which is measuring that properly at the moment.

ocean vortex Jul 18, 2025, 11:56 AM

#

At the very least you would do niche separate tool within a platform which is general purpose as a whole

ornate agate Jul 18, 2025, 11:57 AM

#

but I think if you are very smart about managing context and understand stuff yourself, reasoning models are already better at helping you...

ocean vortex Jul 18, 2025, 11:57 AM

#

ornate agate swebench also doesn't tell the whole story imo. I don't think people using Sonne...

There is. SWE + TAU measures and shows it. But there are far more metrics that can not be ignored as well

ornate agate Jul 18, 2025, 11:57 AM

#

99% of coders disagree with me though on that, so yeah

ocean vortex Jul 18, 2025, 11:57 AM

#

And those things you mentioned are arguably not even the most important ones to look out for in models....

#

Hence why I said Claude is niche. They are great for certain things, but now clearly behind in overall picture

ornate agate Jul 18, 2025, 11:59 AM

#

ocean vortex Hence why I said Claude is niche. They are great for certain things, but now cle...

yeah but you say this to 99% of people even in AI industry and its a steaming hot take. Nobody agrees with this, they use Claude over at Google rofl.

ornate agate Jul 18, 2025, 12:00 PM

#

ornate agate but I think if you are very smart about managing context and understand stuff yo...

This is also a hot take btw

ocean vortex Jul 18, 2025, 12:01 PM

#

ornate agate yeah but you say this to 99% of people even in AI industry and its a steaming ho...

Popularity or cult following is not always an indicator of performance. Google using Claude?? wdym

#

LOL

#

I'm open minded though, if you can come up with a prompt that is not touching on those few niche things where we know Claude performs well and it clearly does better than say 2.5Pro, I'm all ears to test it out

#

yeah... tbh I think big part of it is just people preferring fine-tuning of Claude

#

it being "more human" and what have you

#

but that's not performance really

ornate agate Jul 18, 2025, 12:05 PM

#

I mostly use open-source (to avoid lock in/hikes having to relearn tools every few months etc). When I use a closed model these days its nearly always 2.5 pro and it nearly always does a much better job for me than the other closed ones.

keen beacon Jul 18, 2025, 12:06 PM

#

ocean vortex I'm open minded though, if you can come up with a prompt that is not touching on...

Daily testing on code editing, claude beats everything else

ocean vortex Jul 18, 2025, 12:06 PM

#

keen beacon Daily testing on code editing, claude beats everything else

Ok then should be easy for you to give at least ONE example specific prompt. 🙂

#

My experience completely different

keen beacon Jul 18, 2025, 12:08 PM

#

Its not specific prompt, ive tested it on my own repositories. Also claude code has the advantage of working autonomously better.

ocean vortex Jul 18, 2025, 12:09 PM

#

keen beacon Its not specific prompt, ive tested it on my own repositories. Also claude code ...

well if it performs better you should be able to copy the relevant code with an issue to replicate in isolation no problem...

keen beacon Jul 18, 2025, 12:10 PM

#

ocean vortex Ok then should be easy for you to give at least ONE example specific prompt. 🙂

But if you want to test it out on say cursor .. prompt:
Research how gistemp temperature by nasa is calculated, then estimate it with +-0.02 accuracy using only data from era5. Continue until you reach the objective.

ocean vortex Jul 18, 2025, 12:10 PM

#

Just saying what "you feel like" is not really useful to anyone tbh

ocean vortex Jul 18, 2025, 12:11 PM

#

keen beacon But if you want to test it out on say cursor .. prompt: Research how gistemp te...

Not cursor, API. Cursor is already a bit touching on niche 👀

#

not everyone is using that

keen beacon Jul 18, 2025, 12:12 PM

#

keen beacon But if you want to test it out on say cursor .. prompt: Research how gistemp te...

This is a great test btw. Super hard. You have to do research gistemp calculations, research era5 api, download data, then pre process, build machine learning model and also come up with novel method to improve accuracy. This took me a day to do myself

keen beacon Jul 18, 2025, 12:12 PM

#

ocean vortex Not cursor, API. Cursor is already a bit touching on niche 👀

😑

ocean vortex Jul 18, 2025, 12:13 PM

#

that prompt seems like a perfect example to simply ask o3 on chatgpt tbh

keen beacon Jul 18, 2025, 12:14 PM

#

Yep did that , but o3 has limits, 10-15 min, it did good , as well as can be expected from a human in 10-15 min. But the problem requires much longer to be solved

#

Claude code did better simply due to that

alpine coral Jul 18, 2025, 12:15 PM

#

keen beacon huh first openai reasoning model under anonymous-chatbot, i believe

yeah kinda interesting

#

after a bit of absence from the arena too

ocean vortex Jul 18, 2025, 12:16 PM

#

keen beacon Yep did that , but o3 has limits, 10-15 min, it did good , as well as can be exp...

if you need "much longer" then deep research. That's essentially as long as realistically possible

#

And I don't think you are even using Claude code for what it was designed to do with such prompt lol

#

Anthropic's search is very pale in comparison to OpenAI overall

keen beacon Jul 18, 2025, 12:19 PM

#

Aim is to estimate it accurately before NASA published it. Doing so nets you 2-5k in polymarket xD

leaden meteor Jul 18, 2025, 12:22 PM

#

Did openai have 4o or o3 as anonymous model before they released? Wonder why openai is doing anonymous testing with a subpar model now...

balmy mist Jul 18, 2025, 12:22 PM

#

its prob the open source model

leaden meteor Jul 18, 2025, 12:25 PM

#

I guess they feel that nobody expects this open source model to be SOTA. So, they might as well do free testing on arena...where as with their main models, they might want to protect them from low ranks in arena....Thats why they didn't bother anonymous testing with 4o?

balmy mist Jul 18, 2025, 12:26 PM

#

yeah maybe, they did do anonymous testing with 4.1 tho

ocean vortex Jul 18, 2025, 12:28 PM

#

keen beacon Aim is to estimate it accurately before NASA published it. Doing so nets you 2-5...

Interesting, which specific one?

keen beacon Jul 18, 2025, 12:36 PM

#

ocean vortex Interesting, which specific one?

July temperature increase

mossy drum Jul 18, 2025, 12:42 PM

#

leaden meteor Did openai have 4o or o3 as anonymous model before they released? Wonder why ope...

4o was based on im-also-a-good-gpt2-chatbot https://x.com/LiamFedus/status/1790064963966370209

William Fedus (@LiamFedus)

GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot 🙂. Here’s how it’s been doing.

rose scroll Jul 18, 2025, 12:48 PM

#

Hi everyone. I would like to have another platform where I can chat with LLMs for free.

whole wagon Jul 18, 2025, 12:49 PM

#

mossy drum 4o was based on im-also-a-good-gpt2-chatbot https://x.com/LiamFedus/status/17900...

Google poached this guy

#

Most of the good openAI researchers have moved to different companies

#

All that's left are marketers kek

#

Polymarket still gives meta no hope for the rest of the year even after the hiring spree

#

Surprising actually. I would think 6 months is enough for them to cook up something good

#

They already have the compute and aren't starting literally from scratch

unborn ocean Jul 18, 2025, 1:04 PM

#

vs other labs (xAI) they have a strong research arm

ocean vortex Jul 18, 2025, 1:06 PM

#

keen beacon July temperature increase

think

keen beacon Jul 18, 2025, 1:07 PM

#

ocean vortex think

Most likely hallucinations. You need to train in historical data

ocean vortex Jul 18, 2025, 1:08 PM

#

keen beacon Most likely hallucinations. You need to train in historical data

It's the answer based on research it did, semi-reliable I would say

ocean vortex Jul 18, 2025, 1:23 PM

#

verified with another DR lol

#

contemplating on actually trying my luck on this one 👀

frank adder Jul 18, 2025, 1:24 PM

#

How to get access of o3 alpha (Anonymous-Chatbot)

keen beacon Jul 18, 2025, 1:26 PM

#

ocean vortex verified with another DR lol

Feel free to share the chat in pm, but i guarantee its bad prediction

ocean vortex Jul 18, 2025, 1:29 PM

#

keen beacon Feel free to share the chat in pm, but i guarantee its bad prediction

what's your best guess?

keen beacon Jul 18, 2025, 1:34 PM

#

ocean vortex what's your best guess?

I usually run it on 20 and 25, atm its too early to tell

#

I have downloaded 10gb data for this xD

ocean vortex Jul 18, 2025, 1:39 PM

#

keen beacon I usually run it on 20 and 25, atm its too early to tell

well that bracket odds are rising. It might be much less gains in a couple of days with that continuing 🤔

keen beacon Jul 18, 2025, 1:40 PM

#

ocean vortex well that bracket odds are rising. It might be much less gains in a couple of da...

By 25 , its 90-95% certain for me

dusky aurora Jul 18, 2025, 1:41 PM

#

all your discussions go over my head

ocean vortex Jul 18, 2025, 1:43 PM

#

keen beacon By 25 , its 90-95% certain for me

aren't the chances of that happening rated then at similar percentages then as well? Meaning you would get only like 5 bucks profit betting 100?

alpine coral Jul 18, 2025, 1:50 PM

#

keen beacon I usually run it on 20 and 25, atm its too early to tell

by extension, is it not also too early to say which model handled the task best? like there is no concrete answer (tho i get that there are reasonable and unreasonable predictions)

#

it seems it's like an actual case where brute / parallel compute approaches would make sense / be useful. also seems like a good prompt to test deep research, in terms of them being able to plan and reasonably execute and tie together that flow you described

keen beacon Jul 18, 2025, 2:00 PM

#

alpine coral it seems it's like an actual case where brute / parallel compute approaches woul...

For thorough test one needs tools + long term planning. Deep research and claude are the only two candidates for this. With deep research you must also provide it with mcp endpoints to run stuff on your pc ( cant upload 10gb on chatgpt )

ocean vortex Jul 18, 2025, 2:00 PM

#

continues to rise lol

#

was at 29% the first time I saw it

#

today

keen beacon Jul 18, 2025, 2:01 PM

#

ocean vortex aren't the chances of that happening rated then at similar percentages then as w...

I will have it at 95% , the market will have it at 50-50
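The arithmetic behind this exchange can be sketched as follows, assuming a simple binary market where a YES share bought at `price` pays out 1.0 if the outcome happens, and ignoring fees and slippage. The function names are hypothetical, and the 100 stake is the figure from the question above.

```python
def profit_if_yes(stake: float, price: float) -> float:
    """Profit when YES shares bought at `price` (0 < price < 1) each pay out 1.0."""
    shares = stake / price
    return shares - stake

def expected_profit(stake: float, price: float, belief: float) -> float:
    """Expected profit given your own probability `belief` that the market resolves YES."""
    shares = stake / price
    return belief * shares - stake

# Buying at a 95% market price: roughly 5 profit on 100, as asked above.
print(round(profit_if_yes(100, 0.95), 2))
# Buying at a 50% market price while believing 95%: the edge is the gap between the two.
print(round(expected_profit(100, 0.50, 0.95), 2))
```

When the market price equals your own estimate, the expected profit is zero, so the bet is only attractive while the market sits well below the bettor's probability.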

ocean vortex Jul 18, 2025, 2:02 PM

#

keen beacon I will have it at 95% , the market will have it at 50-50

ok yeah that's decent then

#

have you already made profits doing this?

keen beacon Jul 18, 2025, 2:03 PM

#

No first time on this market , but ive trained and tested the model on historical data , every July since 1951 😂

#

I used last 10 years as test data and predictions were very good

ocean vortex Jul 18, 2025, 2:04 PM

#

nice

#

which ML method?

keen beacon Jul 18, 2025, 2:05 PM

#

Simple one actually , but i wont spill more details

ocean vortex Jul 18, 2025, 2:05 PM

#

Random Forest or smth? 🧐

keen beacon Jul 18, 2025, 2:06 PM

#

Eh something like that. Model is not too important. Getting the right data and preprocessing is the critical part

ocean vortex Jul 18, 2025, 2:06 PM

#

or linear regression

keen beacon Jul 18, 2025, 2:07 PM

#

ocean vortex or linear regression

Ridge

#

Next im targeting movies box office. For that model im even more proud 🥹
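A minimal sketch of the setup described above: ridge regression fitted on yearly July data since 1951, with the last 10 years held out as test data. The actual features, preprocessing, and regularization strength were not shared, so everything here is an assumption: the data is synthetic (not ERA5 or GISTEMP values), there is a single made-up feature, and `alpha=1.0` is arbitrary.

```python
import random

def fit_ridge_1d(xs, ys, alpha):
    """Ridge with one feature and an unpenalized intercept: w = Sxy / (Sxx + alpha)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)
    return w, y_mean - w * x_mean

# Synthetic stand-in data: one row per July, 1951-2024. NOT real measurements.
rng = random.Random(0)
years = list(range(1951, 2025))
xs = [rng.gauss(0.0, 1.0) for _ in years]            # hypothetical preprocessed feature
ys = [0.5 * x + rng.gauss(0.0, 0.02) for x in xs]    # made-up linear relationship plus noise

# Time-ordered split: the last 10 years are held out and never used for fitting.
train_x, test_x = xs[:-10], xs[-10:]
train_y, test_y = ys[:-10], ys[-10:]

w, b = fit_ridge_1d(train_x, train_y, alpha=1.0)
mae = sum(abs(w * x + b - y) for x, y in zip(test_x, test_y)) / len(test_x)
print(f"held-out mean absolute error over the last 10 years: {mae:.3f}")
```

Holding out the most recent years, rather than a random sample, keeps the test closer to the real use case of predicting a July that has not been published yet.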

civic flame Jul 18, 2025, 2:26 PM

#

the anon oai model in the web arena really likes doing the absolute maximum

stray aspen Jul 18, 2025, 2:45 PM

#

@deep adder

#

hello

#

to ask if you really are craig federighi

languid crescent Jul 18, 2025, 3:29 PM

#

hayo

#

Is there a way to paste code like a code block just like how ai does it?

#

Whenever I just paste a code it just straight paste it into a single line of text is it normal?

ocean vortex Jul 18, 2025, 3:30 PM

#

civic flame the anon oai model in the web arena really likes doing the absolute maximum

there's new anonymous_chatbot?

#

been ages since we had it last time...

ember rapids Jul 18, 2025, 3:30 PM

#

Yeah

#

O3 alpha

#

It’s really good

ocean vortex Jul 18, 2025, 3:31 PM

#

languid crescent Is there a way to paste code like a code block just like how ai does it?

📎 some_code_.txt

subtle lintel Jul 18, 2025, 3:32 PM

#

Hi
I’m using Flux Kontext Dev, both locally and through the Playground. But I noticed that the version of Flux Dev on the lmarena website seems to perform better🤔. If anyone knows what configurations or setup they’re using to run it there, I’d really appreciate it if you could share the details😄

ocean vortex Jul 18, 2025, 3:32 PM

#

same way like in discord

languid crescent Jul 18, 2025, 3:32 PM

#

ocean vortex

DAMN I'VE BEEN USING IT FOR MONTHS AND WHY DID I NOT KNOW THIS 😭

#

ty @ocean vortex

civic flame Jul 18, 2025, 3:33 PM

#

ocean vortex been ages since we had it last time...

indeed

#

the days of gpt2-chatbot were peak

primal orbit Jul 18, 2025, 3:33 PM

#

is o3 alpha available in general arena? or webdev only?

sacred quail Jul 18, 2025, 3:40 PM

#

What is O3 alpha ?

primal orbit Jul 18, 2025, 3:40 PM

#

https://www.reddit.com/r/singularity/comments/1m30d36/a_new_model_o3_alpha_available_on_web_arena_by/

From the singularity community on Reddit: A New Model — “o3 Alp...

Explore this post and more from the singularity community

#

aka "anonymous-chatbot-0717"

wintry tinsel Jul 18, 2025, 4:02 PM

#

ember rapids It’s really good

How do you use it, or is it not available now

modern meteor Jul 18, 2025, 4:39 PM

#

guys are there any benchmarks for agents? or ones who decode obfuscated code specifically? 🙃

whole wagon Jul 18, 2025, 4:49 PM

#

Hm idk why would they make another o3. Maybe it's the open source model

#

It can't be an agent it's way too quick for that

sacred plaza Jul 18, 2025, 5:29 PM

#

So, how many of you Elon stans have fallen in love with that grok companion already? 😂😂

torn mantle Jul 18, 2025, 6:30 PM

#

o3-alpha is good actually

stray aspen Jul 18, 2025, 6:33 PM

#

How did you access o3 alpha

torn mantle Jul 18, 2025, 6:33 PM

#

stray aspen How did you access o3 alpha

webdev arena

keen ferry Jul 18, 2025, 6:33 PM

#

stray aspen How did you access o3 alpha

web dev arena

torn mantle Jul 18, 2025, 6:33 PM

#

https://web.lmarena.ai/

stray aspen Jul 18, 2025, 6:34 PM

#

Alright thank you

torn mantle Jul 18, 2025, 6:34 PM

#

np

ocean vortex Jul 18, 2025, 6:52 PM

#

My experience with gemini integrations in a nutshell:

torn mantle Jul 18, 2025, 6:55 PM

#

https://x.com/thegenioo/status/1946155109999694205

Hamza (@thegenioo)

what happened to these guys?

#

https://x.com/morqon/status/1946157919940104289

morgan — (@morqon)

@thegenioo @sesame meta hired their ML lead

#

meta is slowing down the progress by x100

ocean vortex Jul 18, 2025, 7:08 PM

#

torn mantle https://x.com/thegenioo/status/1946155109999694205

What is it supposed to be? Voice assistant?

#

I don't think Meta is going about it the right way lmao. They are doing more harm than good in the AI space IMO

#

Can end up with a divided team and people with wasted potential

#

They could start their own new startups though with the money Meta is paying them... 😇

torn mantle Jul 18, 2025, 7:14 PM

#

ocean vortex What is it supposed to be? Voice assistant?

yea its sesame

torn mantle Jul 18, 2025, 7:15 PM

#

ocean vortex I don't think Meta is going about it the right way lmao. They are doing more har...

thats what im saying

#

zuck is slowing things down

#

idk if thats part of his strategy or nah

ember abyss Jul 18, 2025, 7:15 PM

#

i think they are going to integrate sesame into meta.ai

#

to make it more realistic(if they are going to make something like robot profiles)

unborn ocean Jul 18, 2025, 7:38 PM

#

ocean vortex I don't think Meta is going about it the right way lmao. They are doing more har...

idk, they are bringing a lot of smart guys to one place, giving them a lot of compute, high autonomy, a lot of good custom research from FAIR

#

and a dynamic team that is still in motion

#

so we should expect good things

#

short-term the effect might be negative though

leaden palm Jul 18, 2025, 7:41 PM

#

so many anonymous models

civic flame Jul 18, 2025, 7:45 PM

#

o3-alpha has been removed from web arena

soft kernel Jul 18, 2025, 7:45 PM

#

It's gone

#

It's gone💀💀💀💀

soft kernel Jul 18, 2025, 7:46 PM

#

torn mantle o3-alpha is good actually

That was awesome I can't believe it's gone

civic flame Jul 18, 2025, 7:46 PM

#

what's the point man 😭 it was literally added earlier today

#

unless it's temporary

lone vector Jul 18, 2025, 7:52 PM

#

is o3-alpha better than kingfall

soft kernel Jul 18, 2025, 7:56 PM

#

civic flame unless it's temporary

Yeah it was a bummer I was happy earlier and then bammm

soft kernel Jul 18, 2025, 7:56 PM

#

lone vector is o3-alpha better than kingfall

In some areas

lone vector Jul 18, 2025, 7:58 PM

#

crazy how fast things are improving, and kingfall was leaked almost two months ago

leaden meteor Jul 18, 2025, 8:04 PM

#

soft kernel In some areas

Good enough to dethrone 2.5 pro on the text arena?

soft kernel Jul 18, 2025, 8:06 PM

#

leaden meteor Good enough to dethrone 2.5 pro on the text arena?

They took it down quickly we ain't gotten the chance to test it properly

leaden meteor Jul 18, 2025, 8:10 PM

#

I am sure they will add it again...But it felt like your classmate who talks a lot but not exactly smart...

torn mantle Jul 18, 2025, 8:11 PM

#

soft kernel That was awesome I can't believe it's gone

wdym

#

its gone?

soft kernel Jul 18, 2025, 8:11 PM

#

Yeah they took it down bro

#

A tragic

soft kernel Jul 18, 2025, 8:12 PM

#

leaden meteor I am sure they will add it again...But it felt like your classmate who talks a l...

It was smart,but not like "feel the agi" type

ornate agate Jul 18, 2025, 8:13 PM

#

AGI happened already I think 😐

torn mantle Jul 18, 2025, 8:13 PM

#

soft kernel Yeah they took it down bro

nuuuuuuuuu

sacred plaza Jul 18, 2025, 8:13 PM

#

AGI is an AI product that makes $100 billion (per Microsoft contract with openai). Have not seen it yet

small haven Jul 18, 2025, 8:14 PM

#

https://grok.com/share/c2hhcmQtMw%3D%3D_8f53710c-b577-41e6-98b8-01e8b18fd99d

Physics Problem: Glove's Location | Shared Grok Conversation

A luxury sports-car is traveling with open windows in the direction opposite of the south at 30km/h

#

grok 4 heavy is a scam

soft kernel Jul 18, 2025, 8:14 PM

#

ornate agate AGI happened already I think 😐

What's your definition of agi dude?
I'm pretty sure what we have isn't exactly general intelligence

sacred plaza Jul 18, 2025, 8:15 PM

#

AGI is a silly thought experiment that is based on vibes and not grounded in anything in reality

soft kernel Jul 18, 2025, 8:15 PM

#

small haven https://grok.com/share/c2hhcmQtMw%3D%3D_8f53710c-b577-41e6-98b8-01e8b18fd99d

I'm not even surprised lol
It was so underwhelming

leaden sun Jul 18, 2025, 8:54 PM

#

ornate agate AGI happened already I think 😐

maybe in private labs? even then, i guess it's less AGI and more a proto-AGI...?

ocean vortex Jul 18, 2025, 9:19 PM

#

small haven grok 4 heavy is a scam

Yeah it is. I'm kinda surprised OpenAI hasn't challenged them on those benchmarks... xAI is becoming a facade company where all that matters is manipulated benchmark numbers lol

#

not terribly surprising

ocean vortex Jul 18, 2025, 9:21 PM

#

lone vector crazy how fast things are improving, and kingfall was leaked almost two months a...

And they have yet to release Deep Think, let alone something more significant than that...

torn mantle Jul 18, 2025, 9:22 PM

#

im not feeling xai vibes at all

ocean vortex Jul 18, 2025, 9:23 PM

#

@echo aurora Any update on being able to keep chatting with mystery models after voting? Gonna be honest, my usage of lmarena absolutely dropped off a cliff ever since the interface update and the legacy version not having new models anymore...

torn mantle Jul 18, 2025, 9:23 PM

#

they are more focused on satisfying elon than making a good model

#

and on top of that, working continuously 24/7 kills productivity

#

they also have data problems, unlike oai (which was supported by msft) or google... tbh, I haven't seen the light in their models yet, they still seem so behind other competitors

#

or maybe they are just incompetent

ocean vortex Jul 18, 2025, 9:24 PM

#

new anonymous-chatbot I'm really interested in, but it's pointless to use arena with how things are atm, not gonna be able to test it anyway

torn mantle Jul 18, 2025, 9:24 PM

#

they may have improved on reasoning but knowledge-wise.. its still making the same mistakes

#

it didnt improve at all on multi-lingual

#

like at all

ocean vortex Jul 18, 2025, 9:24 PM

#

torn mantle and on top of that, working continuously 24/7 kills productivity

if you have ketamine stash it's ok

sour spindle Jul 18, 2025, 9:24 PM

#

Crazy to me right now nothing comes close to o3

torn mantle Jul 18, 2025, 9:24 PM

#

lol

#

also

#

o3 is a smart model, but the real problem with it is hallucination

#

but o3-alpha version seems to fix this problem.. it actually gave me many correct scientific references

ocean vortex Jul 18, 2025, 9:31 PM

#

torn mantle o3 is a smart model, but the real problem with it is hallucination

It is a bigger problem than for o1, but tbh... I still think it hallucinates less than Claude4

gaunt gate Jul 18, 2025, 9:32 PM

#

I can’t accept terms why ?

ocean vortex Jul 18, 2025, 9:32 PM

#

that one is notorious for hallucinating tool usage when it can't do it

#

But Claude4 is good for custom function calling. It's a trade-off and the price you pay to make it performant on TAU

torn mantle Jul 18, 2025, 9:35 PM

#

ocean vortex It is a bigger problem than for o1, but tbh... I still think it hallucinates les...

the thing is o3 or the o-series models refer to scientific sources more than claude

ocean vortex Jul 18, 2025, 9:35 PM

#

torn mantle the thing is o3 or the o-series models refer to scientific sources more than claude

That is good, no...?

#

or do you prefer referring to hallucinated things or propaganda tweets like grok does... 😸

sour spindle Jul 18, 2025, 9:44 PM

#

o3 for me still cites the best sources. Still stunned at the stuff Google and Grok cite

ocean vortex Jul 18, 2025, 9:55 PM

#

sour spindle o3 for me still cites the best sources. Still stunned at the stuff Google and Gr...

yeah it can be bad. Although I tried grok on xAI just now and I'm pleasantly surprised with the response. This is still night and day difference comparing this to grok chatbot and that finetune on twitter itself

#

it essentially almost agreed that he's a far-right fascist lol

torn mantle Jul 18, 2025, 10:01 PM

#

ocean vortex That is good, no...?

for claude?

ocean vortex Jul 18, 2025, 10:13 PM

#

torn mantle for claude?

o3

#

50% pretending to work, 20% getting ready to start and finish working, 10% in-between, 20% actual work. Overall stuff getting done --> same or less than working 9to5. lmao

echo aurora Jul 18, 2025, 10:50 PM

#

ocean vortex @echo aurora Any update on being able to chat with mystery models for m...

Sorry to say I don't have an update regarding this request.

leaden palm Jul 18, 2025, 11:18 PM

#

https://three.arcprize.org/games/ls20 do you generalize better than an llm?

ARC-AGI-3 Preview Game: ls20

Easy for humans, hard for AI.

storm needle Jul 18, 2025, 11:49 PM

#

soft kernel It's gone💀💀💀💀

Its not gone

#

I just got it

#

I asked to write a chess UI and the O3 alpha got it

elder rapids Jul 18, 2025, 11:54 PM

#

leaden palm https://three.arcprize.org/games/ls20 do you generalize better than an llm?

ts is actually fun

#

😭

#

also ye this is elite for humans

#

pretty good

#

I can see why

golden ocean Jul 18, 2025, 11:58 PM

#

real

empty stump Jul 19, 2025, 12:31 AM

#

When gpt 5 gonna release

torn bison Jul 19, 2025, 12:45 AM

#

2025

ocean vortex Jul 19, 2025, 1:12 AM

#

@keen beacon 👀

storm needle Jul 19, 2025, 1:14 AM

#

empty stump When gpt 5 gonna release

probably in october

whole wagon Jul 19, 2025, 1:36 AM

#

Grok added to simple bench

jade egret Jul 19, 2025, 1:42 AM

#

ocean vortex My experience with gemini integrations in a nutshell:

😭

#

they need to make gemini better at coding..

#

at least the windsurf people are at google now

small haven Jul 19, 2025, 2:12 AM

#

when is 2.5 ultra

jade egret Jul 19, 2025, 2:14 AM

#

fr

small haven Jul 19, 2025, 2:19 AM

#

torn mantle but o3-alpha version seems to fix this problem.. it actually gave me many correc...

yea whats the consensus on o3 alpha? i havent had the time to try it, and now apparently its removed?

hardy pecan Jul 19, 2025, 2:26 AM

#

small haven https://grok.com/share/c2hhcmQtMw%3D%3D_8f53710c-b577-41e6-98b8-01e8b18fd99d

why is your question not the same as the original? do you know how long 0.00025km is?

small haven Jul 19, 2025, 2:27 AM

#

hardy pecan why is your question not the same as the original? do you know how long 0.00025k...

i just noticed that, idk i copy/pasted dom's prompt

hardy pecan Jul 19, 2025, 2:27 AM

#

Yeah... I noticed he was prompting with that too, I dont understand, why edit the original question, makes zero sense

#

although, i think grok 4 heavy still gets it wrong?

small haven Jul 19, 2025, 2:27 AM

#

wait what is the actual prompt

#

can u copy paste, been using that prompt ever since kingfall lol

hardy pecan Jul 19, 2025, 2:28 AM

#

A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a river that is flowing at 5km/h eastward. The wind is blowing at 1km/h westward, slow enough not to bother the pedestrians snapping photos of the car from both sides of the roadbridge as the car passes. A glove was stored in the trunk of the car, but slips out of a hole and drops out when the car is half-way over the bridge. Assume the car continues in the same direction at the same speed, and the wind and river continue to move as stated. 1 hour later, the water-proof glove is (relative to the center of the bridge) approximately?
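
For reference, the quantities in the prompt above can be put into common units. This is only a sketch of how far each mover travels in an hour, plus the bridge-length conversion questioned earlier (0.00025 km versus 250 m). It does not decide which mover, if any, carries the glove, since that depends on where the glove lands. All function and variable names are illustrative.

```python
# Speeds and lengths as stated in the prompt; names are illustrative only.
CAR_KMH = 30.0     # car, heading north
RIVER_KMH = 5.0    # river, flowing east
WIND_KMH = 1.0     # wind, blowing west
BRIDGE_M = 250.0   # bridge length in metres


def km_to_m(km: float) -> float:
    """Convert kilometres to metres."""
    return km * 1000.0


def distance_km(speed_kmh: float, hours: float) -> float:
    """Distance covered at constant speed."""
    return speed_kmh * hours


def candidate_displacements(hours: float = 1.0) -> dict:
    """Displacement from the drop point under each possible 'carrier'."""
    return {
        "stays where it dropped": 0.0,
        "moves with the car (north)": distance_km(CAR_KMH, hours),
        "moves with the river (east)": distance_km(RIVER_KMH, hours),
        "moves with the wind (west)": distance_km(WIND_KMH, hours),
    }


if __name__ == "__main__":
    print("half the bridge:", BRIDGE_M / 2, "m")
    print("0.00025 km in metres:", km_to_m(0.00025))
    for label, km in candidate_displacements().items():
        print(f"{label}: {km} km")
```

The conversion is the point of the earlier complaint: a bridge of 0.00025 km is a quarter of a metre long, not 250 m, so the edited prompt describes a different problem from the original.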

small haven Jul 19, 2025, 2:28 AM

#

right wtf..

hardy pecan Jul 19, 2025, 2:28 AM

#

some ppl aren't math ppl i guess...

#

ahem dom

small haven Jul 19, 2025, 2:28 AM

#

no wonder we can never get D..

hardy pecan Jul 19, 2025, 2:28 AM

#

xddddd

small haven Jul 19, 2025, 2:28 AM

#

does grok 4 get it?

hardy pecan Jul 19, 2025, 2:29 AM

#

I'll need to check

small haven Jul 19, 2025, 2:29 AM

#

i queued in again

#

wait so plain grok 4 gets it

#

right lol

#

https://grok.com/share/c2hhcmQtMw%3D%3D_bc3962c8-a2d1-48c4-aa9c-11448fdec76c

Luxury Car Glove Drop Physics | Shared Grok Conversation

A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a riv

#

problem with heavy, cant see any traces

hardy pecan Jul 19, 2025, 2:42 AM

#

https://grok.com/share/bGVnYWN5_5abad82f-7821-4417-986d-d183fdd4328e

Glove's Position After Falling from Car | Shared Grok Conversation

A luxury sports-car is traveling north at 30km/h over a roadbridge, 250m long, which runs over a riv

#

Yep same

small haven Jul 19, 2025, 2:43 AM

#

B?

#

ok nvm

hardy pecan Jul 19, 2025, 2:43 AM

#

FiRsT pRiNcIPleS tHiNkiNG

small haven Jul 19, 2025, 2:43 AM

#

first steal answers, then think

hardy pecan Jul 19, 2025, 2:43 AM

#

Very smart model!!!

small haven Jul 19, 2025, 2:46 AM

#

fcking dom

hardy pecan Jul 19, 2025, 2:49 AM

#

I think he was the one that presented the sailboats going back and forward question and was adamant about the wrong answer too lol, oh well..

topaz peak Jul 19, 2025, 2:50 AM

#

no idea why grok 4 is so high on so many leaderboards, doesn't look like much from what i have seen

small haven Jul 19, 2025, 2:51 AM

#

hardy pecan I think he was the one that presented the sailboats going back and forward quest...

the issue is that i used that prompt on kingfall..

#

pretty sure it would get it

#

now with this prompt

#

ok so o3 pro didn't get it

small haven Jul 19, 2025, 2:54 AM

#

topaz peak no idea why grok 4 is so high on so many leaderboards, doesn't look like much fr...

well on artificial analysis, 4 of the 7 benchmarks are math heavy, and grok 4 is only good at math

storm needle Jul 19, 2025, 3:32 AM

#

small haven yea whats the consensus on o3 alpha? i havent had the time to try it, and now ap...

the model is still there

small haven Jul 19, 2025, 3:33 AM

#

storm needle the model is still there

oh cool thanks

terse shuttle Jul 19, 2025, 3:48 AM

#

Does it make sense to wait for a video generation chat on website?

whole wagon Jul 19, 2025, 3:53 AM

#

bruh i cant get o3 alpha

#

theres so many damn models rn

terse shuttle Jul 19, 2025, 3:54 AM

#

whole wagon bruh i cant get o3 alpha

How do you want to get it if it hasn't been announced yet?

whole wagon Jul 19, 2025, 3:54 AM

#

its in llm arena battle

small haven Jul 19, 2025, 4:07 AM

#

yoo

#

it's also animated, the eyes like a scanner

#

wow

#

craigbench

#

📎 83Um9Z3.txt

#

what we thinking, better than kingfall now?

#

so o3 alpha is just made up?

balmy mist Jul 19, 2025, 4:15 AM

#

its not their open source model?

small haven Jul 19, 2025, 4:16 AM

#

its too good for an open source model imo

#

better than o3

balmy mist Jul 19, 2025, 4:16 AM

#

its still on web dev?

small haven Jul 19, 2025, 4:16 AM

#

yea

#

i just got it a few mins ago

#

where did the name o3 alpha come from?

#

just a tweet bait?

balmy mist Jul 19, 2025, 4:18 AM

#

small haven where did the name o3 alpha come from?

It's in the model card and the dev tools.

small haven Jul 19, 2025, 4:18 AM

#

balmy mist It's in the model card and the dev tools.

oh

balmy mist Jul 19, 2025, 4:18 AM

#

Screenshot_2025-07-19_at_12.18.29_AM.png

small haven Jul 19, 2025, 4:19 AM

#

ah

#

it could just be a bait from oai too

#

like why would they finetune o3

#

codex full?

#

its already in production

#

prolly

whole wagon Jul 19, 2025, 4:36 AM

#

i did so much battle and didnt get

#

idk if its there

torn bison Jul 19, 2025, 4:40 AM

#

whole wagon i did so much battle and didnt get

in webdev arena not text arena

silver radish Jul 19, 2025, 4:41 AM

#

torn bison in webdev arena not text arena

o3-alpha is in webdev arena?

#

Thank you lmarena for making my life better

hardy pecan Jul 19, 2025, 4:47 AM

#

17th July.. Would make sense why i felt like o3 was/is lobotomized for the last few days

#

but maybe that was just my experience

#

So anonymous-chatbot is good?

hollow ocean Jul 19, 2025, 5:06 AM

#

Pro users only

#

Next week

whole wagon Jul 19, 2025, 5:11 AM

#

Nope

#

August at the earliest

reef pawn Jul 19, 2025, 5:21 AM

#

Gemini 3 when?

keen beacon Jul 19, 2025, 6:26 AM

#

reef pawn Gemini 3 when?

Oct-Nov

dusky aurora Jul 19, 2025, 6:48 AM

#

hardy pecan 17th July.. Would make sense why i felt like o3 was/is lobotomized for the last ...

just like we feel Gemini got lobotomized?

hardy pecan Jul 19, 2025, 6:49 AM

#

dusky aurora just like we feel Gemini got lobotomized?

mine has been sloppy the last few days too, but i cant tell if thats just a me issue or if others have experienced that too

dusky aurora Jul 19, 2025, 6:49 AM

#

it's frustrating when the quality drops

torn mantle Jul 19, 2025, 7:55 AM

#

small haven yea whats the consensus on o3 alpha? i havent had the time to try it, and now ap...

personally, i couldnt get it to work, had so many errors on webarena..

#

but ive seen a lot of positive reviews

#

they seem to have trained it a lot on 3d simulations

#

also functionality is more prioritized than anything

#

it will just add some cool stuff just for the sake of it

royal nexus Jul 19, 2025, 7:57 AM

#

Is it just me or is the edit button genuinely non-existent?

torn mantle Jul 19, 2025, 8:00 AM

#

royal nexus Is it just me or is the edit button genuinely non-existent?

huh?

#

what edit button

royal nexus Jul 19, 2025, 8:00 AM

#

Edit messages

torn mantle Jul 19, 2025, 8:00 AM

#

i can

royal nexus Jul 19, 2025, 8:01 AM

#

Hold on you can edit messages on LM Arena????

#

How?

#

The edit button is missing for me on both PC and Mobile

torn mantle Jul 19, 2025, 8:01 AM

#

https://x.com/alexwei_/status/1946477742855532918

Alexander Wei (@alexwei_)

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

#

https://x.com/alexwei_/status/1946477756738629827

Alexander Wei (@alexwei_)

8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

#

could gpt-5 be o3-alpha + some gpt model

torn mantle Jul 19, 2025, 8:02 AM

#

royal nexus Hold on you can edit messages on LM Arena????

no i cant

royal nexus Jul 19, 2025, 8:02 AM

#

Oh

torn mantle Jul 19, 2025, 8:02 AM

#

since when was it possible

royal nexus Jul 19, 2025, 8:03 AM

#

I was just asking in case it was just a bug or the LM Arena actually doesn't have an edit button.

#

Man it's so inconvenient

torn mantle Jul 19, 2025, 8:03 AM

#

royal nexus I was just asking in case it was just a bug or the LM Arena actually doesn't hav...

it never had

royal nexus Jul 19, 2025, 8:03 AM

#

torn mantle it never had

Yeah I figured

#

Well I added a request to feedback and I hope it gets added eventually..

leaden sun Jul 19, 2025, 8:06 AM

#

torn mantle https://x.com/alexwei_/status/1946477756738629827

I see...thats why Terry Tao was experimenting with gpt models the past few months...

royal nexus Jul 19, 2025, 8:07 AM

#

leaden sun I see...thats why Terry Tao was experimenting with gpt models the past few month...

Yep

torn mantle Jul 19, 2025, 8:12 AM

#

leaden sun I see...thats why Terry Tao was experimenting with gpt models the past few month...

yea

leaden sun Jul 19, 2025, 8:12 AM

#

it could mean this research model might have...integrated some kind of a proof assistant, or a combination of various specialized formalizer and a solver...?

torn mantle Jul 19, 2025, 8:12 AM

#

leaden sun it could mean this research model might have...integrated some kind of a proof a...

yea thats what im thinking

unborn ocean Jul 19, 2025, 8:14 AM

#

leaden sun it could mean this research model might have...integrated some kind of a proof a...

it is probably just roughly what google did with alpha proof and alpha geometry + their model

#

i mean gdm literally wrote a paper on how to achieve this (even gold in IMO, so it is really not that impressive...)

#

and that gdm model is not really only a llm

leaden sun Jul 19, 2025, 8:15 AM

#

like similar to this one? https://arxiv.org/pdf/2507.08501

unborn ocean Jul 19, 2025, 8:16 AM

#

an older one in nature https://www.nature.com/articles/s41586-023-06747-5 ( alpha geometry )

#

they basically computed most of the stuff and it is more a llm + tools implementation

small haven Jul 19, 2025, 8:16 AM

#

torn mantle https://x.com/alexwei_/status/1946477742855532918

insane

small haven Jul 19, 2025, 8:17 AM

#

torn mantle could gpt-5 be o3-alpha + some gpt model

thats what im guessing, or at least integrated into gpt5/router

#

so basically is that imo experimental model agi?

zinc ore Jul 19, 2025, 8:20 AM

#

No, it's probably only specialized for IMO

unborn ocean Jul 19, 2025, 8:21 AM

#

small haven so basically is that imo experimental model agi?

a lot of the IMO problems can be solved mainly by "tools" (very fancy ones) + some llm exploration (according to the gdm work)
-> the model is likely not agi and can likely only do competition level math in this state

small haven Jul 19, 2025, 8:21 AM

#

zinc ore No, it's probably only specialized for IMO

f's

#

still impressive

whole wagon Jul 19, 2025, 8:56 AM

#

Google must not be thrilled about this 😂

#

IMO was supposed to be their thing

zinc ore Jul 19, 2025, 9:08 AM

#

These tweets look like trying to steal their thunder

#

Saw someone say last IMO they released their paper and announcement on it about two weeks after IMO

#

So yeah, they're probably slightly seething

torn mantle Jul 19, 2025, 9:14 AM

#

small haven so basically is that imo experimental model agi?

they said it wont be released for several months

#

by then any other ai lab will catch up

pulsar tendon Jul 19, 2025, 9:15 AM

#

unborn ocean a lot of the IMO problems can be solved mainly by "tools" (very fancy ones) + so...

https://x.com/alexwei_/status/1946477745627934979?s=46

Alexander Wei (@alexwei_)

2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

#

No tools or internet

#

https://x.com/alexwei_/status/1946477753194484181?s=46

Alexander Wei (@alexwei_)

5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

#

And not narrow

zinc ore Jul 19, 2025, 9:17 AM

#

They had a Gemini model that could do IMO problems in natural language, trained on AlphaProof or whatever

#

Last year I mean, when they did IMO 2024

#

Deepmind guy

leaden sun Jul 19, 2025, 9:30 AM

#

i guess now i know where the name sunny comes from, those gpts love to call me that...😆

unborn ocean Jul 19, 2025, 9:46 AM

#

some more work on this guys:
https://arxiv.org/pdf/2502.03544
https://arxiv.org/pdf/2404.06405
https://arxiv.org/pdf/2412.10673

📎 2.pdf

#

^all from the (new-ish) gdm AlphaGeometry 2

#

"Last but not least, we report progress towards using AlphaGeometry2
as a part of a fully automated system that reliably solves geometry problems directly from natural
language input."

mossy drum Jul 19, 2025, 9:59 AM

#

New model in Battle mode: kraken-07152025-1

whole wagon Jul 19, 2025, 10:43 AM

#

its been there for some time

frozen island Jul 19, 2025, 10:49 AM

#

Anyone seen the anonymous model lately? It appears to be gone, from what I can tell playing around in web dev arena.

wraith sail Jul 19, 2025, 11:12 AM

#

hi! hope all are doing well.

ocean vortex Jul 19, 2025, 11:13 AM

#

whole wagon IMO was supposed to be their thing

they gonna release their announcement 2 months from now either being proud about beating OpenAI by the narrowest of margins or how they are behind but only by a little. lol

wraith sail Jul 19, 2025, 11:14 AM

#

i have an issue with chatgpt-4o-latest-20250326 . it stopped giving response and my whole chat is stuck now. the error it show: Something went wrong with this response, please try again. how to solve it?

torn mantle Jul 19, 2025, 11:37 AM

#

https://x.com/btibor91/status/1946532308896628748

Tibor Blaho (@btibor91)

gpt-5-reasoning-alpha-2025-07-13

h/t @swishfever

gusty tendon Jul 19, 2025, 12:32 PM

#

is o3 alpha still in the arena or it got removed?

whole wagon Jul 19, 2025, 12:36 PM

#

Why did they put it in the arena for such a short period of time lmao

#

Did they remove it cos ppl figured out it's gpt5

lime coral Jul 19, 2025, 12:38 PM

#

https://x.com/blackhc/status/1946487088662135008?s=46

Andreas Kirsch 🇺🇦 (@BlackHC)

@alexwei_ @OpenAI Why did you not have it graded officially?

whole wagon Jul 19, 2025, 12:40 PM

#

Did they scam it? Kek

gusty tendon Jul 19, 2025, 12:51 PM

#

whole wagon Why did they put it in the arena for such a short period of time lmao

ugh yea, didn't get to test it

#

its frustrating you can't pick what models you wanna battle

soft kernel Jul 19, 2025, 1:04 PM

#

whole wagon Why did they put it in the arena for such a short period of time lmao

They removed it again?

lime coral Jul 19, 2025, 1:04 PM

#

whole wagon Did they scam it? Kek

Don’t think so. But they shared their result earlier because of it I believe

exotic tartan Jul 19, 2025, 1:07 PM

#

gusty tendon its frustrating you can't pick what models you wanna battle

Agree, but then you'd have to remove voting so it's kind of a waste of money tbh

gusty tendon Jul 19, 2025, 1:12 PM

#

exotic tartan Agree, but then you'd have to remove voting so it's kind of a waste of money tbh

yea maybe you could have the battle and the test, and maybe you could do like 5 tests a day or something

torn mantle Jul 19, 2025, 1:30 PM

#

small haven thats what im guessing, or at least integrated into gpt5/router

https://x.com/MillionInt/status/1946556255490982022

Jerry Tworek (@MillionInt)

@dieaud91 @alexwei_ It’s a later model probably end of year thing

alpine coral Jul 19, 2025, 1:41 PM

#

unborn ocean some more work on this guys: https://arxiv.org/pdf/2502.03544 https://arxiv.org/...

you were saying before tools, tools, tools. but in the thread they explicitly say the model does not have access to tools. so is there any point in clicking on these links / looking at the papers? your position/skepticism seems confused

ocean vortex Jul 19, 2025, 1:53 PM

#

whole wagon Why did they put it in the arena for such a short period of time lmao

maybe it didn't perform?... Or someone jailbroke it and they didn't appreciate that 😸

ocean vortex Jul 19, 2025, 1:55 PM

#

gusty tendon its frustrating you can't pick what models you wanna battle

the most frustrating thing is that you don't get to do even a single message after voting, makes the entire "testing" pretty useless

#

a one way street to collect votes and data / your prompts

whole wagon Jul 19, 2025, 2:14 PM

#

Gpt5 being a router is so misunderstood lol

#

It's only a router at the very lowest intelligence to literally switch to a mini model

#

Otherwise it functions as a normal LLM with scalable thinking

#

I think the field as a whole is going to move away from specialised models to fully generalised ones

#

Specialised models work better for smaller models

#

Because they don't have the capacity for everything

keen beacon Jul 19, 2025, 2:18 PM

#

GPT-soon in the arena

ocean vortex Jul 19, 2025, 2:26 PM

#

whole wagon It's only a router at the very lowest intelligence to literally switch to a mini...

It's not a router at all. It's a hybrid reasoning model

#

Instead of having gpt4.1 + o3 and gpt4.1-mini + o4-mini as separate models, now those are gonna be just 2 models in total

#

They were the same base anyways

whole wagon Jul 19, 2025, 2:28 PM

#

Well yeah. It's still a router

#

Because it switches between 2 models

#

The mini and the normal

ocean vortex Jul 19, 2025, 2:28 PM

#

whole wagon Because it switches between 2 models

No it doesn't. It's the same model. GPT5 or GPT5-mini, there's not gonna be switching between mini and non-mini I believe

#

Just like Claude4 is not a router

#

gpt4.1 is essentially like Sonnet4 with thinking disabled. Well not really since Sonnet had RL training for reasoning and they don't separate them already, but in practice it works similarly.

misty vault Jul 19, 2025, 2:32 PM

#

keen beacon GPT-soon in the arena

gpt-4

ocean vortex Jul 19, 2025, 2:32 PM

#

Making 4 (not 2) models into a single one would have been too much IMO

#

Smarter to simply give an option for mini or full one

#

that way you can ensure 1 is cheap

#

and get better performance with less compromises for the other one

patent aspen Jul 19, 2025, 3:02 PM

#

It's probably many model sizes derived from the same parent model

languid crescent Jul 19, 2025, 3:08 PM

#

damn lol my message 😭

hollow swan Jul 19, 2025, 3:29 PM

#

should i use gemini or claude for youtube scripts?

keen fulcrum Jul 19, 2025, 3:33 PM

#

hollow swan should i use gemini or claude for youtube scripts?

your head

hollow swan Jul 19, 2025, 3:35 PM

#

keen fulcrum your head

what's the point of responding

languid crescent Jul 19, 2025, 3:36 PM

#

hollow swan should i use gemini or claude for youtube scripts?

gemini

hollow swan Jul 19, 2025, 3:36 PM

#

languid crescent gemini

i've been using claude for a while but my family member recently bought gemini so i was thinking about trying it

#

claude gives very human responses from my experience, way better than gpt for example

whole wagon Jul 19, 2025, 3:37 PM

#

LLM arena suggests otherwise

hollow swan Jul 19, 2025, 3:38 PM

#

yeah only heard about this site today

#

yeah that's why i tried joining here, to ask if anyone has personal experience with both

whole wagon Jul 19, 2025, 3:43 PM

#

Lol after Sam's tweet

languid crescent Jul 19, 2025, 3:44 PM

#

@tomas try checking lmarena leaderboard

sour spindle Jul 19, 2025, 3:46 PM

#

@hollow swan at the end of the day the best benchmarks are self use benchmarks

#

Try all the models and see what you like best for your specific use cases

balmy mist Jul 19, 2025, 4:28 PM

#

https://x.com/sama/status/1946575101509734619

Sam Altman (@sama)

woke up early on a saturday to have a couple of hours to try using our new model for a little coding project.

done in 5 minutes. it is very, very good.

not sure how i feel about it...

#

lol

#

wait im late

zinc ore Jul 19, 2025, 5:46 PM

#

"I talked to IMO Secretary General Ria van Huffel at the IMO 2025 closing party about the OpenAI announcement. While I can't speak for the Board or the IMO (and didn't get a chance to talk about this with IMO President Gregor Dolinar, and I doubt the Board are readily in a position to meet for the next few days while traveling home), Ria was happy for me to say that it was the general sense of the Jury and Coordinators at IMO 2025 that it's rude and inappropriate for AI developers to make announcements about their IMO performances too close to the IMO (such as before the closing party, in this case; the general coordinator view is that such announcements should wait at least a week after the closing ceremony), when the focus should be on the achievements of the actual human IMO contestants and reports from AIs serve to distract from that.

I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts."

balmy mist Jul 19, 2025, 5:48 PM

#

What agent is powering OpenAI agent?

rugged brook Jul 19, 2025, 5:48 PM

#

Is o3 alpha on web dev

ocean vortex Jul 19, 2025, 5:56 PM

#

That is not really clear, I wouldn't say this with any confidence tbh

#

It's more like 60% chance this is based on some gpt5 derivative base model, 30% gpt4.1/o3 and 10% chance it's something else entirely

#

it wouldn't make much sense for them to be wasting their time with o3 when they are already testing gpt5 I would say

keen beacon Jul 19, 2025, 6:04 PM

#

i recall reading somewhere that semianalysis said o4 and o5 would be on 4.1 as well (though it's paywalled i believe)

keen beacon Jul 19, 2025, 6:04 PM

#

balmy mist What agent is powering OpenAI agent?

new model. similar to o3 but not related

merry cloud Jul 19, 2025, 6:22 PM

#

Guys am new how often are there new models

torn mantle Jul 19, 2025, 6:40 PM

#

https://x.com/NVIDIAAIDev/status/1946377420757434534

NVIDIA AI Developer (@NVIDIAAIDev)

🎉 @Kimi_Moonshot recently released Kimi K2 — a 1T-parameter #opensource MoE LLM (32B active) achieving SOTA perf in frontier knowledge, math, and coding among non-thinking models.

⚡Powered by the Muon optimizer at unprecedented scale
🏎️ With over a trillion parameters and 2M
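
To put the quoted figures in perspective, here is a quick back-of-the-envelope calculation of the share of parameters active per token. It uses only the two rounded numbers from the tweet ("1T-parameter", "32B active"); the names are illustrative, and the real totals may differ from these round values.

```python
# Rounded figures as quoted in the tweet; treat them as approximate.
TOTAL_PARAMS = 1e12    # "1T-parameter"
ACTIVE_PARAMS = 32e9   # "32B active"


def active_fraction(active: float, total: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    return active / total


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share per token: {share:.1%}")
```

On these rounded numbers, only about 3% of the weights take part in any single forward pass.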

#

this is so fast

#

well not compared to groq

#

but still

keen beacon Jul 19, 2025, 6:47 PM

#

idk you can be kinda bullish i guess. they're confident in the rl paradigm

#

maybe. but it makes sense they'd use 4.1 i guess

#

they spent a lot of time on it

torn mantle Jul 19, 2025, 6:54 PM

#

yea nvidia is providing that for free

#

its actually fast as well

#

havent really tried groq to tell whos faster

#

but im happy with nvidia inference

keen beacon Jul 19, 2025, 6:56 PM

#

last time i used groq quality was sh1t

#

doesnt matter how fast it is if quality is terrible like that

#

model was very dumb/got into constant loops and poor in quality

#

compared to other hosts

#

their quantization/inference stack/whatever idk

#

nowadays i just use deepinfra most of the time when i need stuff like that, quality is good, it's cheap, and rate limits are good and reliable

#

requests rarely fail and the rate limit is nearly just a semaphore

torn mantle Jul 19, 2025, 7:01 PM

#

keen beacon their quantization/inference stack/whatever idk

they didnt provide any info about that?

keen beacon Jul 19, 2025, 7:02 PM

#

idk lol it was sh1t back then. still have zero interest in them

torn mantle Jul 19, 2025, 7:02 PM

#

keen beacon idk lol it was sh1t back then. still have zero interest in them

thats wild

#

wild

blazing rune Jul 19, 2025, 7:11 PM

#

keen beacon nowadays i just use deepinfra most of the time when i need stuff like that, qual...

But DeepInfra uses fp4 for Kimi k2

keen beacon Jul 19, 2025, 7:13 PM

#

blazing rune But DeepInfra uses fp4 for Kimi k2

i havent used it for kimi at all so i wouldn't know about that. i typically inference qwen models there, it's been good so far

#

they also have cheaper prices for gemini 2.5 as well

blazing rune Jul 19, 2025, 7:14 PM

#

keen beacon they also have cheaper prices for gemini 2.5 as well

they seem to have a lot of models that openrouter doesn't add

#

idk why

small haven Jul 19, 2025, 7:25 PM

#

torn mantle https://x.com/NVIDIAAIDev/status/1946377420757434534

wow kimi got it, wtf

ocean vortex Jul 19, 2025, 7:29 PM

#

yeah it is. It would make sense if this was R1 or an equivalent reasoning model. But as it stands you are paying for V3 type of model lol

#

speed is more important for reasoning

ocean vortex Jul 19, 2025, 7:30 PM

#

keen beacon nowadays i just use deepinfra most of the time when i need stuff like that, qual...

deepinfra is slow af though. Hate them for it...

#

Pretty much unusable as far as my standards for inference go lol

keen beacon Jul 19, 2025, 7:30 PM

#

yeah i dont really care about tps that much since im running batch jobs

ocean vortex Jul 19, 2025, 7:32 PM

#

chutes manages to beat them in most cases, and with price either considerably lower or completely free

keen beacon Jul 19, 2025, 7:32 PM

#

rate limits and quality can be questionable

#

you can do 200 requests concurrently on deepinfra
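
#

for batch jobs, the "rate limit is nearly just a semaphore" idea can be sketched roughly like this. this is only a minimal illustration: `fake_request` is a hypothetical stand-in for a real API call (no provider SDK is used), and the limit of 200 is just the figure quoted above, not a documented guarantee

```python
import asyncio

# Concurrency cap quoted in the chat; treat it as an assumption, not a documented limit.
MAX_CONCURRENCY = 200


async def fake_request(i: int) -> int:
    """Hypothetical stand-in for a real inference call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return i * 2


async def run_batch(n_jobs: int, limit: int = MAX_CONCURRENCY):
    """Run n_jobs requests with at most `limit` in flight at once.

    Returns the results in job order plus the peak number of concurrent requests.
    """
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # blocks while `limit` requests are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(i)
            finally:
                in_flight -= 1

    # gather preserves the order of the submitted jobs
    results = await asyncio.gather(*(worker(i) for i in range(n_jobs)))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_batch(1000))
    print(len(results), "results, peak concurrency", peak)
```

the client-side semaphore only caps how many requests are in flight; retries and backoff for failed requests would still need to be added on top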

ocean vortex Jul 19, 2025, 7:33 PM

#

In my experience, chutes inference is good quality. Never had performance degradation issues or questions, unlike some other much more expensive providers...

#

deepinfra including actually

#

GMICloud is another suspect... R1 hosted there seems to perform marginally worse than chutes

chilly nexus Jul 19, 2025, 7:46 PM

#

Who's octopus model?

#

idk what it is. but it dropped way too many F-bombs

#

i liked

ocean vortex Jul 19, 2025, 7:51 PM

#

chilly nexus idk what it is. but it dropped way too many F-bombs

One of those grok childish finetunes perhaps... Hard to think of any other lab being this detached lol

chilly nexus Jul 19, 2025, 7:54 PM

#

yeah.. kinda was thinking it's probably grok too

mossy drum Jul 19, 2025, 8:12 PM

#

New models in Image Arena: imagen-4.0-generate-preview-06-06-v2 ,imagen-4.0-ultra-generate-preview-06-06-v2

torn mantle Jul 19, 2025, 8:40 PM

#

chilly nexus Who's octopus model?

what did you ask it

#

it went crazy with the 'HAHAHAHAHAHA'

ocean vortex Jul 19, 2025, 9:13 PM

#

"nettle"

#

dunno what it is but it isn't groundbreaking, something new though

hardy pecan Jul 19, 2025, 9:27 PM

#

keen beacon

Lmao I had the same idea to guess what "soon" meant on average

ocean vortex Jul 19, 2025, 9:28 PM

#

Ok just did a bunch of battles... "clownfish" was the first and only one that caught my attention. Very limited testing given current interface of course and being forced to do voodoo here, but this one at least has potential to be good 🧐

#

Poll: What is Deepseek's operating cost to run R1? Per 1M output. Official pricing $1.10 for V3 and $2.19 for R1 with theoretical 545% margin with R1 pricing for both models.

Winning answer (option 2): $0.30 to $0.55 (6 of 11 votes)
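
#

rough arithmetic behind the poll, as a sketch. it assumes the "545% margin" means profit divided by cost, so price = cost × (1 + 5.45), and it only looks at the $2.19 R1 output price from the poll text. `implied_cost` is a name made up for this example, and the real cost structure (input tokens, cache hits, off-peak discounts) is not modeled

```python
def implied_cost(price_per_m: float, cost_profit_margin_pct: float) -> float:
    """Cost per 1M tokens implied by a price and a cost-profit margin.

    Assumes margin = (price - cost) / cost, so cost = price / (1 + margin).
    """
    return price_per_m / (1.0 + cost_profit_margin_pct / 100.0)


r1_output_price = 2.19  # USD per 1M output tokens, from the poll text
margin_pct = 545.0      # theoretical margin quoted in the poll text

cost = implied_cost(r1_output_price, margin_pct)
print(f"implied cost: ${cost:.2f} per 1M output tokens")  # implied cost: $0.34 ...
```

under that reading the implied cost lands inside the $0.30 to $0.55 option that won the poll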

unborn ocean Jul 19, 2025, 10:09 PM

#

alpine coral you were saying before tool, tools, tools. but in the thread they explicitly say...

the resources are more about google's approach: mainly alpha geometry, as an example of what i meant with "tools".
we just don't know what kind of setup they have in particular, but based on how good these tools are (solving >70% of IMO geometry mainly by using them), they likely either use the models with them (but don't call it a tool) or they trained the model on the model + tool responses
could just be wrong though and they could be spending upwards of 10k per task in some kind of extreme TTC scaling effort, we just don't know really

#

Looking at the history of oai (or any lab for that matter) promising big gains on any benchmark, healthy scepticism is probably the best way...

hollow ocean Jul 19, 2025, 10:29 PM

#

small haven wow kimi got it, wtf

Question 10 finally solved

leaden palm Jul 19, 2025, 10:35 PM

#

Meta Superintelligence

lime coral Jul 19, 2025, 10:46 PM

#

https://x.com/pli_cachete/status/1946692267915304991?s=46

Rota (@pli_cachete)

Terence Tao on the supposed Gold from OpenAI at IMO

craggy depot Jul 19, 2025, 10:53 PM

#

hello

gentle plinth Jul 19, 2025, 10:57 PM

#

lime coral https://x.com/pli_cachete/status/1946692267915304991?s=46

Original link: https://mathstodon.xyz/@tao/114881418225852441

Terence Tao (@tao@mathstodon.xyz)

It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance gives the tool, and how one reports their results.

One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wordi…

craggy depot Jul 19, 2025, 11:40 PM

#

where to ask question / give suggestion ? In General ?

somber hatch Jul 19, 2025, 11:44 PM

#

craggy depot where to ask question / give suggestion ? In General ?

This is the general channel, go for it

sturdy mica Jul 20, 2025, 12:25 AM

#

ocean vortex "nettle"

salad fingers

echo aurora Jul 20, 2025, 12:27 AM

#

craggy depot where to ask question / give suggestion ? In General ?

hello - question in #general is pretty normal, but for specific feedback using #1372230675914031105 is ideal (your idea may already be there so it's worth checking first), if you have any bugs the #1343291835845578853 is where to take those. blobthumbsup

echo aurora Jul 20, 2025, 12:28 AM

#

merry cloud Guys am new how often are there new models

hey welcome ablobwave pretty often!

willow grail Jul 20, 2025, 1:45 AM

#

why chat not full with new openai o3 coding model on lmarena

#

wtf??? CHAT??????? HELLO?

small haven Jul 20, 2025, 1:54 AM

#

wen o3 alpha public release

#

i also want to start and finish a coding project under 5 mins ;P

hardy pecan Jul 20, 2025, 2:00 AM

#

grok 4 got 60.4% on simplebench which is decent to be fair

willow grail Jul 20, 2025, 2:02 AM

#

hardy pecan grok 4 got 60.4% on simplebench which is decent to be fair

u welcum https://i.imgur.com/qLUyt8m.png


small haven Jul 20, 2025, 2:06 AM

#

willow grail u welcum https://i.imgur.com/qLUyt8m.png

i agree w u

zinc ore Jul 20, 2025, 4:02 AM

#

https://vxtwitter.com/apples_jimmy/status/1946778771803291913

Jimmy Apples 🍎/acc (@apples_jimmy)

Lots of birds talking.

Sounds right, maybe this week even ?

Gpt 6 wen ?

QRT: Yuchenj_UW
Heard GPT-5 is imminent, from a little bird.

- It’s not one model, but multiple models. It has a router that switches between reasoning, non-reasoning, and tool-using models.
- That’s why Sam said they’d “fix model naming”: prompts will just auto-route to the right model.
- GPT-6 is in training.

I just hope they’re not delaying it for more safety tests. :)

whole wagon Jul 20, 2025, 5:33 AM

#

Jimmy is wrong 🥀 sources ain't great these days lol

#

It's coming beginning of August

whole wagon Jul 20, 2025, 5:35 AM

#

ocean vortex It's not a router at all. It's a hybrid reasoning model

🥀

#

Confidently incorrect

pulsar tendon Jul 20, 2025, 5:52 AM

#

whole wagon It's coming beginning of August

No

half tartan Jul 20, 2025, 5:55 AM

#

is lmarena going to add MCP support to website? or web search?

torn mantle Jul 20, 2025, 6:03 AM

#

half tartan is lmarena going to add MCP support to website? or web search?

its a battle ground website

half tartan Jul 20, 2025, 6:04 AM

#

torn mantle its a battle ground website

yep i know that

torn mantle Jul 20, 2025, 6:04 AM

#

i dont think thats the purpose of such service

#

the only purpose is to compare raw base/reasoning models without tools enabled

half tartan Jul 20, 2025, 6:04 AM

#

but still we can test how well can model handle these

torn mantle Jul 20, 2025, 6:04 AM

#

we have other benchmarks for that

half tartan Jul 20, 2025, 6:05 AM

#

torn mantle we have other benchmarks for that

really?

tidal schooner Jul 20, 2025, 6:05 AM

#

new sota distill drop from nvidia:
https://fixupx.com/NVIDIAAIDev/status/1946281437935567011

NVIDIA AI Developer (@NVIDIAAIDev)

📣 Announcing the release of OpenReasoning-Nemotron: a suite of reasoning-capable LLMs which have been distilled from the DeepSeek R1 0528 671B model. Trained on a massive, high-quality dataset distilled from the new DeepSeek R1 0528, our new 7B, 14B, and 32B models achieve SOTA perf on a wide range of reasoning benchmarks for their respective sizes in the domain of mathematics, science and code. The models are available on @huggingface🤗: nvda.ws/456WifL


whole wagon Jul 20, 2025, 6:06 AM

#

Why didn't they compare their 32B to qwens 32B directly

#

Lol

torn mantle Jul 20, 2025, 6:06 AM

#

half tartan really?

yea there is a thing called mcp-radar

#

https://anonymous.4open.science/r/MCPRadar-B143/README.md

tidal schooner Jul 20, 2025, 6:07 AM

#

whole wagon Why didn't they compare their 32B to qwens 32B directly

would’ve gotten absolutely wrecked

whole wagon Jul 20, 2025, 6:08 AM

#

Yeah. Isn't that better to show

tidal schooner Jul 20, 2025, 6:08 AM

#

ig nvidia wants to be fair

whole wagon Jul 20, 2025, 6:08 AM

#

Comparing equal model sizes is fair

#

Can't wait for Kimi K2 distills once they add reasoning to it

tidal schooner Jul 20, 2025, 6:09 AM

#

whole wagon Comparing equal model sizes is fair

https://tenor.com/view/morgan-freeman-true-seal-gif-8993121496866626214


torn mantle Jul 20, 2025, 6:10 AM

#

whole wagon Comparing equal model sizes is fair

yea but isnt qwen 32b below what nvidia released?

#

actually qwen 32b is pretty solid

#

probably their best model

#

not a huge fan of 235b model

#

feels like they didnt gain much

pure anvil Jul 20, 2025, 6:11 AM

#

whole wagon Why didn't they compare their 32B to qwens 32B directly

they compare it to the 235B tho

torn mantle Jul 20, 2025, 6:11 AM

#

they are flexing

#

"ok we made this model that is as small as 32b and is comparable to a 235b model"

#

well actually its even better in most benchmarks

pure anvil Jul 20, 2025, 6:12 AM

#

bruh am i tripping they literally compare it to r1 and qwen 235B

torn mantle Jul 20, 2025, 6:13 AM

#

yea

tidal schooner Jul 20, 2025, 6:13 AM

#

pure anvil bruh am i tripping they literally compare it to r1 and qwen 235B

*r1-0528 but yes

torn mantle Jul 20, 2025, 6:14 AM

#

its basically better

whole wagon Jul 20, 2025, 6:14 AM

#

Across the benchmarks they selected yes

torn mantle Jul 20, 2025, 6:14 AM

#

falls behind only in mmlu-pro and scicode

torn mantle Jul 20, 2025, 6:15 AM

#

whole wagon Across the benchmarks they selected yes

'selected'

whole wagon Jul 20, 2025, 6:15 AM

#

Swe bench and such is absent

#

And usually favours larger models more

torn mantle Jul 20, 2025, 6:15 AM

#

i just need to try it personally

#

i dont need to look at these benchmarks

pure anvil Jul 20, 2025, 6:16 AM

#

whole wagon Swe bench and such is absent

Bro it's an open weights model that you can download

torn mantle Jul 20, 2025, 6:16 AM

#

whats LCB

whole wagon Jul 20, 2025, 6:16 AM

#

Livecodebench

torn mantle Jul 20, 2025, 6:16 AM

#

ah

whole wagon Jul 20, 2025, 6:16 AM

#

Coding questions leetcode style

torn mantle Jul 20, 2025, 6:16 AM

#

yea yea

whole wagon Jul 20, 2025, 6:17 AM

#

pure anvil Bro it's an open weights model that you can download

It is irrelevant, they will still try to portray the models in the best possible light in their own PR

pure anvil Jul 20, 2025, 6:18 AM

#

whole wagon It is irrelevant, they will still try to portray the models in the best possible...

they're not making any money from this, this is completely free for all users, if they wanted this model to look better they wouldn't compare it to models much better wtf are you on

whole wagon Jul 20, 2025, 6:19 AM

#

🥀

#

What do you think the purpose of this release is?

#

Just out of goodwill?

pure anvil Jul 20, 2025, 6:19 AM

#

If you want to see if it's good you can without paying them anything

torn mantle Jul 20, 2025, 6:20 AM

#

oh 778 getting defensive, guess he worked with nvidia team on this model

pure anvil Jul 20, 2025, 6:20 AM

#

haha

torn mantle Jul 20, 2025, 6:20 AM

#

xd

pure anvil Jul 20, 2025, 6:20 AM

#

imagine being that dense though

whole wagon Jul 20, 2025, 6:20 AM

#

I'm sure it is good. Just not 14B near R1 0528 levels good

#

That is unrealistic

torn mantle Jul 20, 2025, 6:20 AM

#

i kinda appreciate free models tbh, i would just try them to see if they are good instead of relying on benchs

pure anvil Jul 20, 2025, 6:21 AM

#

whole wagon I'm sure it is good. Just not 14B near R1 0528 levels good

Did they ever claim it? And if they did, those models being open fkn weights means you can verify those claims, or did you not understand anything lmao

whole wagon Jul 20, 2025, 6:21 AM

#

You can verify any models claims. Whether it is open weight or not is irrelevant

pure anvil Jul 20, 2025, 6:22 AM

#

Your point being?

whole wagon Jul 20, 2025, 6:22 AM

#

Obviously the benchmarks can be carefully selected, and it being open source or not is totally irrelevant to that

torn mantle Jul 20, 2025, 6:22 AM

#

nah i need to lookup old messages between you two to see if there is any beef going on

whole wagon Jul 20, 2025, 6:23 AM

#

There isn't. He just started acting up randomly

torn mantle Jul 20, 2025, 6:23 AM

#

why are you guys getting all worked up for this

zinc ore Jul 20, 2025, 6:23 AM

#

Nah let the spice flow

torn mantle Jul 20, 2025, 6:23 AM

#

whole wagon There isn't. He just started acting up randomly

okay kiss now

zinc ore Jul 20, 2025, 6:23 AM

#

Not a dune reference btw

torn mantle Jul 20, 2025, 6:23 AM

#

zinc ore Nah let the spice flow

😠

whole wagon Jul 20, 2025, 6:23 AM

#

torn mantle okay kiss now

I'm not the one getting angry over something so inconsequential

#

I think he needs some therapy

pure anvil Jul 20, 2025, 6:24 AM

#

And you need some books

#

and some arxiv papers

torn mantle Jul 20, 2025, 6:25 AM

#

idk who 777 is but we need them here now

tidal schooner Jul 20, 2025, 6:28 AM

#

torn mantle idk who 777 is but we need them here now

https://tenor.com/view/plane-landing-boeing777-boeing777x-gif-24671084


#

baby grok:
https://fixupx.com/elonmusk/status/1946763642231500856

Elon Musk (@elonmusk)

We’re going to make Baby Grok @xAI, an app dedicated to kid-friendly content


whole wagon Jul 20, 2025, 6:31 AM

#

Great

torn mantle Jul 20, 2025, 6:33 AM

#

tidal schooner baby grok: https://fixupx.com/elonmusk/status/1946763642231500856

ive lost interest in xai

whole wagon Jul 20, 2025, 6:33 AM

#

Already lol

torn mantle Jul 20, 2025, 6:33 AM

#

they dont seem serious to me

whole sundial Jul 20, 2025, 6:34 AM

#

mechahitler for kids

torn mantle Jul 20, 2025, 6:34 AM

#

xd

whole wagon Jul 20, 2025, 6:35 AM

#

torn mantle they dont seem serious to me

What gave it away? The waifus?