#general | Arena | Page 68

hardy pecan Jul 10, 2025, 11:19 AM

#

o3 is better

cedar tide Jul 10, 2025, 11:20 AM

#

grok 4, it's just grok 3 with more RL training

fleet lintel Jul 10, 2025, 11:20 AM

#

They just released 2.5 ... No way 3 is coming so fast. All fake news

cedar tide Jul 10, 2025, 11:20 AM

#

cedar tide grok 4, it's just grok 3 with more RL training

They themselves said it half-heartedly

cedar tide Jul 10, 2025, 11:21 AM

#

fleet lintel They just released 2.5 ... No way 3 is coming so fast. All fake news

leaked on code of gemini cli

ocean vortex Jul 10, 2025, 11:21 AM

#

hardy pecan o3 is better

it's probably using tools to do that tbf

hardy pecan Jul 10, 2025, 11:21 AM

#

yeah for sure

#

does grok 4 not use it? Im using it in the gui

ocean vortex Jul 10, 2025, 11:21 AM

#

hardy pecan does grok 4 not use it? Im using it in the gui

You have SuperDork sub? 🤯

hardy pecan Jul 10, 2025, 11:21 AM

#

yeh

#

i try out all the AI's

ocean vortex Jul 10, 2025, 11:21 AM

#

that's crazy

hardy pecan Jul 10, 2025, 11:22 AM

#

for fun

ocean vortex Jul 10, 2025, 11:22 AM

#

Why would you pay so much...

fleet lintel Jul 10, 2025, 11:22 AM

#

cedar tide leaked on code of gemini cli

Yeah. That seemed to me more like some auto generated config that got pushed by mistake

hardy pecan Jul 10, 2025, 11:22 AM

#

mental illness

#

nah I just have a great interest in this, I dont spend money on much else

ocean vortex Jul 10, 2025, 11:24 AM

#

I'm barely ok paying Musk for API. Would never see myself buying a sub from him, let alone for this price... catgrin

#

Funding his "adventures"

#

nonazis

hardy pecan Jul 10, 2025, 11:25 AM

#

im not interested in politics, just AI

#

its never a consideration for me

ocean vortex Jul 10, 2025, 11:26 AM

#

hardy pecan im not interested in politics, just AI

Fair enough I suppose, though at a certain point it's difficult not to consider both

alpine coral Jul 10, 2025, 11:26 AM

#

hardy pecan im not interested in politics, just AI

taking one for the team – cheers!

#

appreciate it ha

hardy pecan Jul 10, 2025, 11:26 AM

#

I know people have great passion for politics, but its almost a 0 for me, here for the tech!

#

here to find AI thatll take my job 🥹

rare python Jul 10, 2025, 11:27 AM

#

hardy pecan I know people have great passion for politics, but its almost a 0 for me, here f...

Does grok 4 overall vibe click for you

ocean vortex Jul 10, 2025, 11:28 AM

#

hardy pecan here to find AI thatll take my job 🥹

I think the most vulnerable people by far to this, are the ones who are actively ignorant about AI...

#

Many others can simply adapt

cedar tide Jul 10, 2025, 11:28 AM

#

Can anyone try this prompt on the official grok ui?

Arrange the six numbers 2, 0, 1, 9, 20, and 19 in any order to form an 8-digit number (the first digit cannot be 0). How many different 8-digit numbers can be formed?

#

On grok 4 or grok 4 heavy
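
For reference, the puzzle above is small enough to check by brute force. The sketch below is not something posted in the chat; the helper name `count_distinct_numbers` is made up for illustration. It assumes the intended reading is that each of the six numbers is used exactly once and written out as digits, so two different orderings can spell the same 8-digit number (for example, "2" followed by "0" reads the same as "20").

```python
from itertools import permutations

def count_distinct_numbers(tokens):
    """Count distinct digit strings that use every token once and do not start with 0."""
    seen = set()
    for order in permutations(tokens):
        digits = "".join(order)
        if digits[0] != "0":  # the first digit of the number cannot be 0
            seen.add(digits)
    return len(seen)

# The six numbers from the prompt, written as digit strings.
TOKENS = ["2", "0", "1", "9", "20", "19"]

# Naive count: every ordering whose first token is not "0", duplicates included.
naive = sum(1 for order in permutations(TOKENS) if order[0] != "0")

# Actual count: orderings that spell the same digits are counted once.
distinct = count_distinct_numbers(TOKENS)

print(naive, distinct)
```

Under that reading, the naive count of orderings is 600 and the number of distinct 8-digit numbers is 498. The gap comes from orderings where "2","0" or "1","9" sit side by side and can be swapped with "20" or "19" without changing the digits.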

hardy pecan Jul 10, 2025, 11:28 AM

#

ocean vortex I think the most vulnerable people by far to this, are the ones who are actively...

100%, a lot of the reason why I wanna stay on the bleeding edge

brittle tiger Jul 10, 2025, 11:28 AM

#

fleet lintel They just released 2.5 ... No way 3 is coming so fast. All fake news

Gemini 2.0 flash went GA in February. 2.5 preview was released in March.

2.5 went GA in June. It's now July

cedar tide Jul 10, 2025, 11:29 AM

#

cedar tide Can anyone try this prompt on the official grok ui? Arrange the six numbers 2, 0, ...

Only o3 and o3 pro get it right (and Claude Neptune v3)

hardy pecan Jul 10, 2025, 11:29 AM

#

rare python Does grok 4 overall vibe click for you

have to test it more, but so far it's a meh for me. personally I find it a less capable o3, or at least similar

cedar tide Jul 10, 2025, 11:29 AM

#

Who subscribes to Grok?

cedar tide Jul 10, 2025, 11:30 AM

#

cedar tide Can anyone try this prompt on the official grok ui? Arrange the six numbers 2, 0, ...

@hardy pecan can you test it?

hardy pecan Jul 10, 2025, 11:30 AM

#

yes

#

testing

cedar tide Jul 10, 2025, 11:30 AM

#

Thx

ocean vortex Jul 10, 2025, 11:31 AM

#

hardy pecan Jul 10, 2025, 11:31 AM

#

cedar tide Thx

https://grok.com/share/bGVnYWN5_fc68901d-bd3e-46ce-a27a-4b976225cb35

cedar tide Jul 10, 2025, 11:32 AM

#

Good response

hardy pecan Jul 10, 2025, 11:32 AM

#

correct?

cedar tide Jul 10, 2025, 11:32 AM

#

Yes

hardy pecan Jul 10, 2025, 11:32 AM

#

cool

cedar tide Jul 10, 2025, 11:32 AM

#

Two minutes to find

torn mantle Jul 10, 2025, 11:33 AM

#

ocean vortex

still thinking

#

im done

#

i cant use it

rare python Jul 10, 2025, 11:33 AM

#

torn mantle Jul 10, 2025, 11:33 AM

#

😦

#

its pissing me off

rare python Jul 10, 2025, 11:34 AM

#

ocean vortex Jul 10, 2025, 11:34 AM

#

wtf

#

That's insane lmao

cedar tide Jul 10, 2025, 11:34 AM

#

He cheated

Screenshot_2025-07-10-13-34-14-900_com.android.chrome-edit.jpg

rare python Jul 10, 2025, 11:34 AM

#

ocean vortex wtf

Grok 4 is even more token heavy than 2.5 Pro

cedar tide Jul 10, 2025, 11:35 AM

#

@hardy pecan can you ask it to do it without code execution?

hardy pecan Jul 10, 2025, 11:35 AM

#

ill try

cedar tide Jul 10, 2025, 11:35 AM

#

Thx

rare python Jul 10, 2025, 11:35 AM

#

ocean vortex wtf

#

Second most token used

cedar tide Jul 10, 2025, 11:36 AM

#

ocean vortex wtf

Was that your question?

cedar tide Jul 10, 2025, 11:41 AM

#

hardy pecan ill try

Result?

hardy pecan Jul 10, 2025, 11:41 AM

#

Still going..

hardy pecan Jul 10, 2025, 11:42 AM

#

cedar tide Result?

https://grok.com/share/bGVnYWN5_b8209234-1a4a-4752-b802-2d6c0b0e007e

cedar tide Jul 10, 2025, 11:43 AM

#

Fail

alpine coral Jul 10, 2025, 11:44 AM

#

rare python Grok 4 is even more token heavy than 2.5 Pro

man it's like multiple times more - honestly feels like o3 pro

rare python Jul 10, 2025, 11:44 AM

#

alpine coral man it's like multiple times more - honestly feels like o3 *pro*

super long thinking

alpine coral Jul 10, 2025, 11:45 AM

#

rare python super long thinking

yeah, and super long waiting and super long inference costs

#

for what seem like < 2.5 pro performance (for me anyway, so far, and very preliminary)

rare python Jul 10, 2025, 11:46 AM

#

alpine coral yeah, and super long waiting and super long inference costs

I suspect they brute force compute

alpine coral Jul 10, 2025, 11:46 AM

#

yeah i mean a lot of benchmarks would benefit from that approach

hardy pecan Jul 10, 2025, 11:47 AM

#

o3-pro got it https://chatgpt.com/share/686fa84c-c800-8003-8302-4558e7c2a7e1

alpine coral Jul 10, 2025, 11:47 AM

#

alpine coral yeah i mean a lot of benchmarks would benefit from that approach

but for general usage... it's like pointless

cedar tide Jul 10, 2025, 11:47 AM

#

hardy pecan o3-pro got it https://chatgpt.com/share/686fa84c-c800-8003-8302-4558e7c2a7e1

Even o3 without tools gets it

brave ferry Jul 10, 2025, 11:47 AM

#

will grok 4 top the leaderboard

alpine coral Jul 10, 2025, 11:47 AM

#

waiting minutes for a response

hardy pecan Jul 10, 2025, 11:48 AM

#

cedar tide Even o3 without tools gets it

far better explanation

cedar tide Jul 10, 2025, 11:49 AM

#

hardy pecan o3-pro got it https://chatgpt.com/share/686fa84c-c800-8003-8302-4558e7c2a7e1

13 minutes 🤦

#

You didn't ask it not to use python

hardy pecan Jul 10, 2025, 11:50 AM

#

how fast do you want your math homework done?

rare python Jul 10, 2025, 11:50 AM

#

hardy pecan how fast do you want your math homework done?

before I blink

rare python Jul 10, 2025, 11:52 AM

#

cedar tide Even o3 without tools gets it

o3 in lmarena got stuck

#

cedar tide Jul 10, 2025, 11:53 AM

#

rare python o3 in lmarena got stuck

Im using via api

rare python Jul 10, 2025, 11:53 AM

#

cedar tide Im using via api

Can you run it again? It might have gotten lucky

cedar tide Jul 10, 2025, 11:54 AM

#

rare python Can you run it again? It might have gotten lucky

I ran it multiple times

ocean vortex Jul 10, 2025, 11:56 AM

#

rare python

that doesn't even tell the whole story though. Believe it or not, 2.5 Pro's output peaks are actually considerably lower than o3's. It's just that on average it outputs more than o3, because it tends to have fewer short reasoning responses.

#

Grok4 is different...

rare python Jul 10, 2025, 11:56 AM

#

ocean vortex that doesn't even tell the whole story though. Believe it or not 2.5Pro output p...

Explain like I'm 5

ocean vortex Jul 10, 2025, 11:56 AM

#

that's peaking like higher than even o3

cedar tide Jul 10, 2025, 11:57 AM

#

rarely, when it answers too quickly, it says 600, but we can set it to high

ocean vortex Jul 10, 2025, 11:57 AM

#

rare python Explain like I'm 5

If you have a very hard task, 2.5Pro is unlikely to deviate much from the average reasoning length still. While o3 can do that and do a much longer response

rare python Jul 10, 2025, 11:57 AM

#

ocean vortex If you have a very hard task, 2.5Pro is unlikely to deviate much from the averag...

Like 2.5 Pro is limited at 32k thinking tokens?

candid storm Jul 10, 2025, 11:58 AM

#

#

I sold my polymarket bets

#

Im not convinced anymore in Grok 4

#

It performs very disappointingly in my tests

stuck orchid Jul 10, 2025, 11:59 AM

#

Oh, I can't wait to try Grok4!
Devour this monster!

ocean vortex Jul 10, 2025, 11:59 AM

#

rare python Like 2.5 Pro is limited at 32k thinking tokens?

It's kinda hard to even make it go anywhere near that tbh

keen beacon Jul 10, 2025, 11:59 AM

#

candid storm

Hehe, i will set my lmarena code up today

ocean vortex Jul 10, 2025, 11:59 AM

#

But then for some prompts other models do like only 5k, Gemini can do 12k or so etc

rare python Jul 10, 2025, 12:00 PM

#

Like the confidence interval, right?

keen beacon Jul 10, 2025, 12:00 PM

#

You can try grok 4 using twitter premium only right? Thats about 6$

rare python Jul 10, 2025, 12:00 PM

#

What you said is like Gemini's thinking length being +3/-3

soft kernel Jul 10, 2025, 12:00 PM

#

keen beacon You can try grok 4 using twitter premium only right? Thats about 6$

Who's gonna tell him

rare python Jul 10, 2025, 12:00 PM

#

o3 can do +10/-10

#

just an analogy

#

not accurate

ocean vortex Jul 10, 2025, 12:01 PM

#

rare python o3 can do +10/-10

Well it's hard to put an exact number on that, but yeah something along those lines, probably less extreme

keen beacon Jul 10, 2025, 12:02 PM

#

soft kernel Who's gonna tell him

Its actually 7$ but yes i saw a guy who has premium, not premium plus and he can use grok 4 on X

#

https://x.com/iterintellectus/status/1943165445063778562?s=46

vittorio (@IterIntellectus)

use grok 4
ask it to check polymarket for odds
bet what it tells you to
infinite money glitch

soft kernel Jul 10, 2025, 12:03 PM

#

keen beacon Its actually 7$ but yes i saw a guy who has premium, not premium plus and he can...

I'm pretty sure it's premium+
But yeah you do you

keen beacon Jul 10, 2025, 12:03 PM

#

Oh nvm premium plus required , damn

soft kernel Jul 10, 2025, 12:03 PM

#

keen beacon Oh nvm premium plus required , damn

Told you

keen beacon Jul 10, 2025, 12:03 PM

#

60$ for Elon ? Nah. Greedy fker wanabe trillionaire

soft kernel Jul 10, 2025, 12:04 PM

#

keen beacon 60$ for Elon ? Nah. Greedy fker wanabe trillionaire

It ain't even that good

#

Idk it was a bad day

ocean vortex Jul 10, 2025, 12:04 PM

#

keen beacon Oh nvm premium plus required , damn

Use it on openrouter

#

that is the only reasonable way 👀

soft kernel Jul 10, 2025, 12:05 PM

#

soft kernel Idk it was a bad day

Perplexity's comet turned out to be garbage and now grok 4

keen beacon Jul 10, 2025, 12:05 PM

#

ocean vortex Use it on openrouter

Free or api

ocean vortex Jul 10, 2025, 12:05 PM

#

keen beacon Free or api

API

#

there's no free

soft kernel Jul 10, 2025, 12:05 PM

#

ocean vortex that is the only reasonable way 👀

Open router needs paid api

sweet tinsel Jul 10, 2025, 12:05 PM

#

hardy pecan i try out all the AI's

Out of interest, is the Deep Research available as a feature with Grok 4?

soft kernel Jul 10, 2025, 12:05 PM

#

ocean vortex there's no free

OR got free apis
Like deepseek r1

ocean vortex Jul 10, 2025, 12:05 PM

#

Well unless you spot it on lmarena battle - but that's a pain in the... to use

hardy pecan Jul 10, 2025, 12:06 PM

#

sweet tinsel Out of interest, is the Deep Research available as a feature with Grok 4?

Nope

soft kernel Jul 10, 2025, 12:06 PM

#

hardy pecan Nope

Great

ocean vortex Jul 10, 2025, 12:06 PM

#

soft kernel OR got free apis Like deepseek r1

No free for Grok4

keen beacon Jul 10, 2025, 12:06 PM

#

Well then openai is still most worth it, for 20$ and also for 200$, unlimited o3 + MCP 💥

sweet tinsel Jul 10, 2025, 12:06 PM

#

hardy pecan Nope

Do you think that it will come or will they just drop their multi-modal agent for such purposes too?

soft kernel Jul 10, 2025, 12:07 PM

#

sweet tinsel Do you think that it will come or will they just drop their multi-modal agent fo...

Multi modal agent Imo

soft kernel Jul 10, 2025, 12:07 PM

#

keen beacon Well then openai is still most worth it, for 20$ and also for 200$, unlimited o3...

That 1 buck offer...

sweet tinsel Jul 10, 2025, 12:07 PM

#

And nonetheless, could you try this prompt for me with Grok 4, please? Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

hardy pecan Jul 10, 2025, 12:07 PM

#

I couldn't say

alpine coral Jul 10, 2025, 12:08 PM

#

cedar tide 13 minutes 🤦

tbf o3-high got it right in 7min (on api / not tools)

soft kernel Jul 10, 2025, 12:08 PM

#

sweet tinsel And nonetheless, could you try this prompt for me with Grok 4, please? Please wri...

Bro have you seen its context window ain't no way,lil bro survive this

sweet tinsel Jul 10, 2025, 12:09 PM

#

soft kernel Bro have you seen its context window ain't no way,lil bro survive this

o3 managed it fine, with its context window.

soft kernel Jul 10, 2025, 12:10 PM

#

sweet tinsel o3 managed it fine, with its context window.

There's not a fine answer to your prompt
It requires too much detail

#

It's a hell of a job

#

It also needs huge research,which grok doesn't have

#

@split kayak bruh😭😭😭

split kayak Jul 10, 2025, 12:11 PM

#

ok

ornate stump Jul 10, 2025, 12:12 PM

#

Does anyone actually use Grok for real, or everyone just love to check if it's Nazi?

sweet tinsel Jul 10, 2025, 12:12 PM

#

soft kernel There's not a fine answer to your prompt It requires too much detail

I know... It's a benchmark prompt of mine to test the handling of such huge volume tasks.

soft kernel Jul 10, 2025, 12:13 PM

#

sweet tinsel I know... It's a benchmark prompt of mine to test the handling of such huge volu...

Np

sour spindle Jul 10, 2025, 12:15 PM

#

Is grok 4 in battle mode.

sweet tinsel Jul 10, 2025, 12:15 PM

#

sour spindle Is grok 4 in battle mode.

Yes, as stated in the announcement. I got it a few times too.

keen beacon Jul 10, 2025, 12:16 PM

#

soft kernel That 1 buck offer...

What 1$ offer ?!

#

How come grok is crushing benchmarks but here people complain it sucks ?
Maybe you are not the right target audience 😂

sour spindle Jul 10, 2025, 12:16 PM

#

sweet tinsel Yes, as stated in the announcement. I got it a few times too.

Been trying to get it, no luck. Oddly enough, I've gotten grok 3 more than ever lol

ornate stump Jul 10, 2025, 12:17 PM

#

I think that in Italy, we can't even use Grok

sweet tinsel Jul 10, 2025, 12:17 PM

#

sour spindle Been trying to get it, no luck. Oddly enough, I've gotten grok 3 more than ever lol

I had to run it 20 times to get Grok 4, it's a bit harder to get it.

cedar tide Jul 10, 2025, 12:17 PM

#

alpine coral tbf o3-high got it right in 7min (on api / not tools)

O3 medium on api in 4min

Screenshot_2025-07-10-14-16-54-591_com.android.chrome-edit.jpg

candid storm Jul 10, 2025, 12:17 PM

#

sweet tinsel Yes, as stated in the announcement. I got it a few times too.

Do you like it?

sweet tinsel Jul 10, 2025, 12:18 PM

#

candid storm Do you like it?

It's pretty good, but the answer from Gemini 2.5 Flash Thinking and such were better somehow.

indigo hazel Jul 10, 2025, 12:18 PM

#

ornate stump I think that in Italy, we can't even use Grok

Drop the "I think"

ornate stump Jul 10, 2025, 12:19 PM

#

indigo hazel Drop the "I think"

Yeah, I saw. After the whole Nazi thing, we definitely won't see it coming here anymore.

indigo hazel Jul 10, 2025, 12:20 PM

#

ornate stump Yeah, I saw. After the whole Nazi thing, we definitely won't see it coming here anymo...

At least there is the arena xD

keen beacon Jul 10, 2025, 12:20 PM

#

ornate stump Yeah, I saw. After the whole Nazi thing, we definitely won't see it coming here anymo...

Are you sure about that

indigo hazel Jul 10, 2025, 12:20 PM

#

keen beacon Are you sure about that

Regarding Italy yes

ornate stump Jul 10, 2025, 12:21 PM

#

keen beacon Are you sure about that

I think it's the usual problem with GDPR: there's the European one and then the Italian one, which is even stricter and openly anti-AI

sweet tinsel Jul 10, 2025, 12:25 PM

#

Do any of you guys have a niche AI-Agent or Deep Research Tool? I want to add more to my doc.

indigo hazel Jul 10, 2025, 12:25 PM

#

sweet tinsel Jul 10, 2025, 12:27 PM

#

Actually, let me try something different, let me try to abuse OpenAI codex and Google jules as for Deep Researches.

ocean vortex Jul 10, 2025, 12:27 PM

#

they need to use "alpha" instead of "beta" for the next one for maximum confusion

keen beacon Jul 10, 2025, 12:27 PM

#

sweet tinsel Does any of you guys have a niche AI-Agent or Deep Research Tool? I want to add ...

Do you have kimi?

civic flame Jul 10, 2025, 12:27 PM

#

sweet tinsel I had to run it 20 times to get Grok 4, it's a bit harder to get it.

I've literally had 100 battles and never got it

sweet tinsel Jul 10, 2025, 12:27 PM

#

keen beacon Do you have kimi?

Already in my doc.

civic flame Jul 10, 2025, 12:27 PM

#

I keep on getting grok 3 mini and I'm so ready to crash out

sweet tinsel Jul 10, 2025, 12:27 PM

#

It's pretty rare.

ocean vortex Jul 10, 2025, 12:28 PM

#

experimental, preview, beta, alpha...

#

developer preview

#

oh RC too

#

Gemini RC-3.0

#

And then they just rename preview into stable one like they did with 06-05 lmao

sweet tinsel Jul 10, 2025, 12:31 PM

#

sweet tinsel Actually, let me try something different, let me try to abuse OpenAI codex and G...

Google fixed it that you can abuse Jules for Deep Researches.

#

It was pretty good at it.

rare python Jul 10, 2025, 12:33 PM

#

civic flame I keep on getting grok 3 mini and I'm so ready to crash out

valid crash out

#

https://tenor.com/view/i-am-about-to-crash-out-sunflower-pvz-plants-vs-zombies-gif-2216785787877099970


sweet tinsel Jul 10, 2025, 12:36 PM

#

sweet tinsel It was pretty good at it.

Lucky me, that i tried that prompt out before, updated and put into my doc.

#

I have the feeling that im a bit obsessed with Deep Researches.

golden ocean Jul 10, 2025, 12:43 PM

#

is grok 4 any good

sour spindle Jul 10, 2025, 12:44 PM

#

Not getting grok 4 in battle mode ever may force me to pony up and pay. I wonder if elon and the lmarena guys have an agreement 😂

unborn ocean Jul 10, 2025, 12:45 PM

#

gemini 3 training? 👀

#

this is on 2.5 pro

rare python Jul 10, 2025, 12:48 PM

#

unborn ocean gemini 3 training? 👀

https://openrouter.ai/google/gemini-2.5-pro

Gemini 2.5 Pro - API, Providers, Stats

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. Run Gemini 2.5 Pro with API

#

#

The speed is normal

unborn ocean Jul 10, 2025, 12:49 PM

#

rare python

whut looks different for me...

sweet tinsel Jul 10, 2025, 12:49 PM

#

Also looks like that for me.

unborn ocean Jul 10, 2025, 12:49 PM

#

maybe eu vs us thingy

sweet tinsel Jul 10, 2025, 12:49 PM

#

But maybe it's just something temporary as the speeds for Gemini 2.5 Flash are up as an example.

cedar tide Jul 10, 2025, 12:49 PM

#

unborn ocean gemini 3 training? 👀

Gemini 3 training started a very long time ago

unborn ocean Jul 10, 2025, 12:50 PM

#

cedar tide Gemini 3 training started a very long time ago

obv, it is more about heavy post training

rare python Jul 10, 2025, 12:50 PM

#

unborn ocean maybe eu vs us thingy

unborn ocean Jul 10, 2025, 12:50 PM

#

which can happen on the same compute they use for inference

cedar tide Jul 10, 2025, 12:50 PM

#

unborn ocean obv, it is more about heavy post training

I don't think it uses more GPU on post training than pre

unborn ocean Jul 10, 2025, 12:51 PM

#

cedar tide I don't think it uses more GPU on post training than pre

that is not the point: pretraining is usually more contained

#

and you can quite clearly see in a lot of these charts when labs are immediately before a new release

#

deployment changes, generating synthetic data, a lot of RL these days

#

things like that affect the speed

cedar tide Jul 10, 2025, 12:52 PM

#

@unborn ocean

Screenshot_2025-07-10-14-52-13-035_com.android.chrome-edit.jpg

unborn ocean Jul 10, 2025, 12:52 PM

#

though i was not really this serious about claiming that this one dip really has that much meaning behind it :v

calm sequoia Jul 10, 2025, 12:52 PM

#

candid storm Im not convinced anymore in Grok 4

You're making a wrong assumption that model has to be better to top the LMarena

cedar tide Jul 10, 2025, 12:53 PM

#

cedar tide @unborn ocean

No change

unborn ocean Jul 10, 2025, 12:53 PM

#

cedar tide @unborn ocean

yeah, aa is always very different from openrouter when it comes to this

#

aa only has one provider for 2.5 pro

#

and measures less

#

but the speed on aa is not as heavily impacted from outages

#

so many of the dips on openrouter are mainly from some outages or weird errors, idk

#

i am really guessing here

#

we had a lot of these dips on openrouter, nothing really unusual

#

but this could mean that they are actually reconfiguring deployment right now

rare python Jul 10, 2025, 12:56 PM

#

unborn ocean aa only has one provider for 2.5 pro

You can look at each provider in OR

#

Pretty consistent for me

lone vector Jul 10, 2025, 12:57 PM

#

Being SOTA for a week makes charging 50% higher ok?

#

https://x.com/sawyermerritt/status/1943171484643438639?s=46

Sawyer Merritt (@SawyerMerritt)

Here is the updated Grok pricing in the U.S.

• Basic: Free (Grok 3)
• SuperGrok: $30/month (Grok 4)
• SuperGrok Heavy: $300/month (Grok 4 Heavy)

unborn ocean Jul 10, 2025, 1:00 PM

#

rare python You can look at each provider in OR

well, it looks different on my end, idk what to tell you man

keen fulcrum Jul 10, 2025, 1:08 PM

#

lone vector https://x.com/sawyermerritt/status/1943171484643438639?s=46

Later you also get a coding model, a multimodal model, the full 256k context, grok heavy and a video model

#

Very cheap

torn mantle Jul 10, 2025, 1:09 PM

#

@cedar tide you can try it now from direct chat

lone vector Jul 10, 2025, 1:09 PM

#

keen fulcrum Later you also get a coding model, a multimodal model, the full 256k context, grok heavy and a vi...

How does that compare to Gemini

keen fulcrum Jul 10, 2025, 1:09 PM

#

Making AI is a losing battle

tepid lynx Jul 10, 2025, 1:09 PM

#

Lets see lets see

cedar tide Jul 10, 2025, 1:10 PM

#

torn mantle @cedar tide you can try it now from direct chat

Thx

tepid lynx Jul 10, 2025, 1:10 PM

#

Grok 4 can't even write error-free code in Java/Node.js

keen fulcrum Jul 10, 2025, 1:10 PM

#

It remains a mystery whether AI companies will recoup costs

tepid lynx Jul 10, 2025, 1:10 PM

#

I tested the Grok 4 model in Cursor, it's just awful

keen fulcrum Jul 10, 2025, 1:11 PM

#

tepid lynx I tested the Grok 4 model in Cursor, it's just awful

How bad?

rare python Jul 10, 2025, 1:11 PM

#

keen fulcrum Later you also get a coding model, a multimodal model, the full 256k context, grok heavy and a vi...

It said 128000 context in the image

tepid lynx Jul 10, 2025, 1:11 PM

#

keen fulcrum How bad?

He makes websites at the level of ChatGPT 4o

#

AFK 5 min

keen fulcrum Jul 10, 2025, 1:16 PM

#

tepid lynx I tested the Grok 4 model in Cursor, it's just awful

https://x.com/tetsuoai/status/1943227842579566680

Tetsuo (@tetsuoai)

@mr_the_dooom It writes better code than Claude.

ocean vortex Jul 10, 2025, 1:26 PM

#

tepid lynx He makes websites at the level of ChatGPT 4o

Did you give it more attempts? Interesting how it would do on webdev arena

tepid lynx Jul 10, 2025, 1:30 PM

#

ocean vortex Did you give it more attempts? Interesting how it would do on webdev arena

I gave him 3 chances to create a normal site, the first one is just crap, nothing to say, the second one is better because of my prompt but still crap, the third one turned out better than all his attempts, but still the same crap as before

tepid lynx Jul 10, 2025, 1:31 PM

#

keen fulcrum https://x.com/tetsuoai/status/1943227842579566680

I disagree.
I tested it in creating an application on node.js (specifically creates an application with which you can track CPU, Memory, Network, etc., I did it with one mistake), I also made a website (as I already said) and wrote a console application in Java, everything is terrible

torn mantle Jul 10, 2025, 1:31 PM

#

initial thoughts on grok 4 : much better than grok 3

fleet lintel Jul 10, 2025, 1:31 PM

#

why are folks still using 2.0 Flash? Wasn't 2.0 flash bad? https://openrouter.ai/rankings

torn mantle Jul 10, 2025, 1:31 PM

#

still bad at multi-lingual

#

good at reasoning

rare python Jul 10, 2025, 1:32 PM

#

fleet lintel why are folks still using 2.0 Flash? Wasn't 2.0 flash bad? https://openrouter.ai...

Cheap and fast

tepid lynx Jul 10, 2025, 1:32 PM

#

I'm really disappointed in grok

rare python Jul 10, 2025, 1:32 PM

#

The new 2.5 Flash has a higher price

fleet lintel Jul 10, 2025, 1:32 PM

#

umm.. how expensive is 2.5 flash?

rare python Jul 10, 2025, 1:33 PM

#

$2.5 output per 1M tokens

#

vs 2.0 Flash $0.6 output per 1M tokens
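
To put the two quoted output prices side by side, here is a rough sketch. The prices are simply the figures quoted in the messages above ($2.5 and $0.6 per 1M output tokens) and are not checked against any current price list; the function name `output_cost_usd` and the 12,000-token response size are assumptions chosen only for illustration.

```python
def output_cost_usd(output_tokens, price_per_million_usd):
    """Cost in USD of generating `output_tokens` at a given price per 1M output tokens."""
    return output_tokens / 1_000_000 * price_per_million_usd

# Prices as quoted in the chat (USD per 1M output tokens), not verified.
PRICE_FLASH_25 = 2.5
PRICE_FLASH_20 = 0.6

# Hypothetical response size, picked only for illustration.
tokens = 12_000

cost_25 = output_cost_usd(tokens, PRICE_FLASH_25)
cost_20 = output_cost_usd(tokens, PRICE_FLASH_20)
ratio = PRICE_FLASH_25 / PRICE_FLASH_20

print(f"2.5 Flash: ${cost_25:.4f}  2.0 Flash: ${cost_20:.4f}  ratio: {ratio:.2f}x")
```

At these quoted prices the output cost is a bit over four times higher per token, which fits the "cheap and fast" explanation for why 2.0 Flash still sees heavy use.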

torn mantle Jul 10, 2025, 1:33 PM

#

sometimes its output are kinda better than gemini 2.5 pro

#

but still lacks in certain areas tbh

golden ocean Jul 10, 2025, 1:34 PM

#

torn mantle sometimes its output are kinda better than gemini 2.5 pro

whose output is better than gemini 2.5 pro

#

2.0 or 2.5 flash

dawn wharf Jul 10, 2025, 1:35 PM

#

tepid lynx I'm really disappointed in grok

you know that there's a dedicated coding model soon, right?

torn mantle Jul 10, 2025, 1:35 PM

#

golden ocean whose output is better than gemini 2.5 pro

im talking about grok 4

tepid lynx Jul 10, 2025, 1:35 PM

#

dawn wharf you know that there's a dedicated coding model soon, right?

Isn't Cursor already a model for coding?

golden ocean Jul 10, 2025, 1:35 PM

#

oh ok

dawn wharf Jul 10, 2025, 1:35 PM

#

tepid lynx Isn't Cursor already a model for coding?

I'm talking about Grok 4 coding

#

releasing in August

tepid lynx Jul 10, 2025, 1:36 PM

#

dawn wharf I'm talking about Grok 4 coding

Oh, I hope it will be better than Grok 4

torn mantle Jul 10, 2025, 1:36 PM

#

golden ocean oh ok

whos that on ur pfp

#

bts?

golden ocean Jul 10, 2025, 1:36 PM

#

no

torn mantle Jul 10, 2025, 1:36 PM

#

i see

golden ocean Jul 10, 2025, 1:37 PM

#

whos that on ur pfp

#

12 year old anime girl?

torn mantle Jul 10, 2025, 1:37 PM

#

no

golden ocean Jul 10, 2025, 1:37 PM

#

I see

torn mantle Jul 10, 2025, 1:37 PM

#

no you didnt

#

or else you wont say that

golden ocean Jul 10, 2025, 1:37 PM

#

torn mantle i see

no you didint

#

or else you wont say that

torn mantle Jul 10, 2025, 1:37 PM

#

i see

rare python Jul 10, 2025, 1:39 PM

#

dawn wharf you know that there's a dedicated coding model soon, right?

will it be better?

#

I still hate that mf. The teaching ability is so bad

#

Dislike the writing style of o3

torn mantle Jul 10, 2025, 1:40 PM

#

so far :

1. gemini 2.5 pro
2. o3 pro
3. claude 4 opus
4. grok 4

rare python Jul 10, 2025, 1:40 PM

#

torn mantle so far : 1. gemini 2.5 pro 2. o3 pro 3. claude 4 opus 4. grok 4

My list:

1. Opus 4 Thinking
2. Gemini 2.5 Pro
3. Opus 4

#

1 and 2 I use quite a lot

golden ocean Jul 10, 2025, 1:41 PM

#

In my case claude 4 opus destroys gemini 2.5 pro in some coding projects, but in other coding projects it's the opposite

#

But I still trust and like claude's code more, so I'll just use gemini if claude fails or sucks

torn mantle Jul 10, 2025, 1:42 PM

#

rare python My list: 1. Opus 4 Thinking 2. Gemini 2.5 Pro 3. Opus 4

opus 4 is good but it doesnt delve into details

rare python Jul 10, 2025, 1:42 PM

#

torn mantle opus 4 is good but it doesnt delve into details

delve 💀

torn mantle Jul 10, 2025, 1:42 PM

#

thats why i put it at 3

#

but its a solid model

misty vault Jul 10, 2025, 1:42 PM

#

delve

rare python Jul 10, 2025, 1:42 PM

#

nah you use delve

#

GPT4 slop

torn mantle Jul 10, 2025, 1:42 PM

#

lmao

golden ocean Jul 10, 2025, 1:43 PM

#

real

torn mantle Jul 10, 2025, 1:43 PM

#

blame it on elon

#

kept waiting till 6 am

rare python Jul 10, 2025, 1:43 PM

#

torn mantle kept waiting till 6 am

skill

misty vault Jul 10, 2025, 1:43 PM

#

issue

torn mantle Jul 10, 2025, 1:44 PM

#

wait delve is right

#

what are you on

golden ocean Jul 10, 2025, 1:44 PM

#

i dont think thats what he meant

misty vault Jul 10, 2025, 1:44 PM

#

@ornate agate spit it out

torn mantle Jul 10, 2025, 1:44 PM

#

golden ocean i dont think thats what he meant

what did you mean

golden ocean Jul 10, 2025, 1:44 PM

#

i didnt say it

torn mantle Jul 10, 2025, 1:45 PM

#

you did

#

you are acting weird again

golden ocean Jul 10, 2025, 1:45 PM

#

what @rare python meant*

torn mantle Jul 10, 2025, 1:45 PM

#

why do you hate me??

#

😖

rare python Jul 10, 2025, 1:45 PM

#

torn mantle why do you hate me??

https://tenor.com/view/fnf-tone-fnaf-meme-gif-25208866


ornate agate Jul 10, 2025, 1:45 PM

#

Since we're doing lists:

- DeepSeek R1
- Gemini
- Claude
- LocalAI (Qwen 32b/Gemma).

The reason I put DeepSeek at the top is that for all these math etc. problems you need to read the CoT imo. They are just not reliable enough, so if you want to actually solve a puzzle using them, you have to have the CoT, and it's the only one that shows it. I find DeepSeek R1 or Gemini good enough for AI assisted coding, for me. For random chatting or simple questions local AI is fine now.

ocean vortex Jul 10, 2025, 1:48 PM

#

ornate agate Since we're doing lists: - DeepSeek R1 - Gemini - Claude - LocalAI (Qwen 32b/Ge...

you can read CoT with Grok:

#

this is very sophisticated CoT

keen beacon Jul 10, 2025, 1:49 PM

#

is this the real grok 4 or it is grok 1 model?

ocean vortex Jul 10, 2025, 1:50 PM

#

keen beacon is this the real grok 4 or it is grok 1 model?

oh they added it to direct chat?? 🤯

#

yeah this is the one

#

the new one

torn mantle Jul 10, 2025, 1:53 PM

#

ocean vortex oh they added it to direct chat?? 🤯

yes

ocean vortex Jul 10, 2025, 1:55 PM

#

So there's a way to use it without paying Musk now. 😇

torn mantle Jul 10, 2025, 1:56 PM

#

ocean vortex So there's a way to use it without paying Musk now. 😇

lets goooooooooooooooo

fleet lintel Jul 10, 2025, 2:03 PM

#

ocean vortex you can read CoT with Grok:

impressive! definitely SOTA!

high ginkgo Jul 10, 2025, 2:09 PM

#

agi

ocean vortex Jul 10, 2025, 2:18 PM

#

To be clear it isn't actually outputting this behind the scenes, it's just that they don't want you to see the real reasoning it is doing. This helps you to at least see if response is not stuck I suppose. But it still looks hilarious

stray dock Jul 10, 2025, 2:18 PM

#

ocean vortex oh they added it to direct chat?? 🤯

wait there's models that aren't in direct chat?

#

sorry im new here

ocean vortex Jul 10, 2025, 2:19 PM

#

stray dock wait there's models that aren't in direct chat?

yeah most new arena entries ("mystery models" / unreleased models) are not accessible in direct chat

stray dock Jul 10, 2025, 2:20 PM

#

may i know the reason ?

#

is it because you're testing and you want the feedback?

ocean vortex Jul 10, 2025, 2:20 PM

#

But with Grok4 it seems they entered arena with the stable release version already

#

And made it available on official API at the same time

jade egret Jul 10, 2025, 2:21 PM

#

is grok 4 good?

ocean vortex Jul 10, 2025, 2:21 PM

#

stray dock is it because you're testing and you want the feedback?

yeah AI labs want the elo score. If they made it available through direct chat, far fewer people would actually use arena to vote
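
For anyone new wondering how votes turn into an Elo score, here is a minimal sketch of the classic Elo update for one head-to-head vote. The K-factor of 32 and the starting ratings are assumptions for illustration, and the leaderboard's actual fitting procedure may well differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after a single vote.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    k (assumed 32 here) controls how far one vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start level; A wins one vote.
    print(elo_update(1000.0, 1000.0, 1.0))
```

The point for labs is that every vote nudges the rating, so more blind battle votes means a more settled score.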

stray dock Jul 10, 2025, 2:21 PM

#

ocean vortex yeah AI labs want the elo score. If they made it available through direct much l...

true

#

thats what i thought

#

also idk if this is the right channel to ask this but please bear w me:

i mainly use LMArena to help navigate thru CTFs (im an active CTF player), so if anyone here in cybersec or has knowledge of it, please tell me what model is the best for my use case.
thanks.

ocean vortex Jul 10, 2025, 2:22 PM

#

Also they may not want you to use some experimental early checkpoint 100% freely

#

In arena you need some commitment, so the people who are willing to do it usually bring value and understand the possible limitations of early models, even if they made it possible to interact with it after voting (which I hope lmarena does at some point...)

sage raptor Jul 10, 2025, 2:30 PM

#

not agi

cedar tide Jul 10, 2025, 2:32 PM

#

https://x.com/MistralAI/status/1943316390863118716?t=dukc8MSof6AlUlk0nE3N4w&s=19

Mistral AI (@MistralAI)

Introducing Devstral Small and Medium 2507! This latest update offers improved performance and cost efficiency, perfectly suited for coding agents and software engineering tasks.

ocean vortex Jul 10, 2025, 2:34 PM

#

sage raptor not agi

I haven't finished testing it yet. But it had some unexpected fails, I can say that already

#

Also interesting that very concise response thing

#

reasons for almost ages then responds with 1 word lmao

ocean vortex Jul 10, 2025, 2:36 PM

#

cedar tide https://x.com/MistralAI/status/1943316390863118716?t=dukc8MSof6AlUlk0nE3N4w&s=19

wtf are they doing why are they doing it backwards still... smh

#

They should have trained magistral-large first and then all of those would have been distills

sour spindle Jul 10, 2025, 2:43 PM

#

Grok 4 is very similar to Google models tool-wise, in that it cites some very odd sources plus twitter

#

Also is grok 4 not available on mobile ios i can only use it on the web browser atm

#

For me right now it doesn't have the initial wow factor that o3 had.

#

This may be the final nail in the coffin for my anti-benchmark pilling

ocean vortex Jul 10, 2025, 2:46 PM

#

sour spindle For me right now it doesn't have the initial wow factor that o3 had.

Tend to agree based on my initial impressions

primal orbit Jul 10, 2025, 2:47 PM

#

did that betting site pay out for grok being SOTA or not? Or they are waiting?

alpine coral Jul 10, 2025, 2:48 PM

#

ocean vortex I haven't finished testing it yet. But it did some unexpected fails I can say th...

it's not gonna do well on Simple Bench imo

ocean vortex Jul 10, 2025, 2:48 PM

#

primal orbit did that betting site pay out for grok being SOTA or not? Or they are waiting?

it's not on the leaderboard yet https://lmarena.ai/leaderboard

alpine coral Jul 10, 2025, 2:48 PM

#

i mean it'll do ok; but it won't be at the top

ocean vortex Jul 10, 2025, 2:48 PM

#

though it will be by the end of July

alpine coral Jul 10, 2025, 2:49 PM

#

doesn't have those kinda vibes at all. simple bench stuff

ocean vortex Jul 10, 2025, 2:50 PM

#

odds are still this:

#

Now it makes no sense to bet on Google lol

#

xAI tuned earlier Grok to score high on lmarena

primal orbit Jul 10, 2025, 2:51 PM

#

could google release gemini 3 by the end of July?

ocean vortex Jul 10, 2025, 2:51 PM

#

I think it's like 70% chance it's going to be xAI now

ocean vortex Jul 10, 2025, 2:52 PM

#

primal orbit could google release gemini 3 by the end of July?

they could, but this is far from a given

#

Grok is already there

torn mantle Jul 10, 2025, 2:54 PM

#

ocean vortex odds are still this:

pfft

#

best time to bet on google

#

crazy

sage raptor Jul 10, 2025, 2:58 PM

#

ocean vortex odds are still this:

according to benchmarks, grok 4 is top 1

torn mantle Jul 10, 2025, 3:02 PM

#

and according to me grok wont get that no1 spot

ocean vortex Jul 10, 2025, 3:02 PM

#

sage raptor according to benchmarks, grok 4 is top 1

Which I can't quite wrap my head around. Real-world use shows a bit differently. Surely they didn't give Artificial Analysis access to a model that, unlike the public version, wasn't safety finetuned, did they???

#

Don't quite see where those scores came from yet

rare python Jul 10, 2025, 3:04 PM

#

Yeah it's weird that AA got first access

haughty siren Jul 10, 2025, 3:04 PM

#

Is Grok 4 in the arena the regular, thinking or heavy model?

unborn ocean Jul 10, 2025, 3:05 PM

#

They just did a lot of rl on the specific tasks in AA's benchmarks

#

AA Intelligence Index has always been useless (at least for me)

#

And they made it even worse in 2025

rare python Jul 10, 2025, 3:06 PM

#

unborn ocean AA Intelligence Index has always been useless (at least for me)

2.5 Flash above Opus 4 is a no go for me

#

Their data is useful, but not their benchmark

solar hollow Jul 10, 2025, 3:07 PM

#

aime25 questions are available, not too hard to train on them

unborn ocean Jul 10, 2025, 3:07 PM

#

grok 3 mini above opus is the worst crime of all of them

solar hollow Jul 10, 2025, 3:07 PM

#

benchmarks will always be shown in a way that makes them look good

unborn ocean Jul 10, 2025, 3:07 PM

#

Flash 2.5 is way better

alpine coral Jul 10, 2025, 3:07 PM

#

ocean vortex Don't quite see where those scores came from yet

i think it's just good at doing benchmarks

ocean vortex Jul 10, 2025, 3:07 PM

#

rare python 2.5 Flash above Opus 4 is a no go for me

that is not much different to o4-mini being above. 2.5Flash is the same. And both of them do very well on most benchmarks

#

Nothing too unexpected. They should add SimpleQA to their test set though

unborn ocean Jul 10, 2025, 3:08 PM

#

Yes, the selected benchmarks are just really random and not very complementary.

#

And not really modelling my understanding of 'intelligence'

rare python Jul 10, 2025, 3:09 PM

#

ocean vortex that is not much different to o4-mini being above. 2.5Flash is the same. And bot...

2.5 Flash is far behind 2.5 Pro in most benchmark tho? at o3 mini level, not o4 mini

ocean vortex Jul 10, 2025, 3:10 PM

#

unborn ocean Yes, the selected benchmarks are just really random and not very complementary.

I wouldn't say they are random.. They did a decent job selecting the main ones actually:
MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500

#

But this also could be improved - for sure

#

yeah it's not perfect, but the selection is still good

alpine coral Jul 10, 2025, 3:11 PM

#

ocean vortex I wouldn't say they are random.. They did a decent job selecting the main ones a...

and i think grok 4 is good at doing these kinda benchmarks - the main ones aha

haughty siren Jul 10, 2025, 3:11 PM

#

Ah, this is quite concerning though. Also it does not feel like it's thinking, as that usually takes some time with Grok 3.

alpine coral Jul 10, 2025, 3:11 PM

#

it won't do well in the arena imo

#

for one thing.. you have to wait 2 min for a response

dawn wharf Jul 10, 2025, 3:12 PM

#

alpine coral for one thing.. you have to wait 2 min for a response

it's fast for me

#

it depends on prompt

ocean vortex Jul 10, 2025, 3:13 PM

#

If they added SimpleQA this would have been close to perfection imo

torn mantle Jul 10, 2025, 3:13 PM

#

ocean vortex If they added SimpleQA this would have been close to perfection imo

So how is it so far?

#

Grok 4

alpine coral Jul 10, 2025, 3:13 PM

#

dawn wharf it's fast for me

on OR it's still slow / thinking for even responses to introductions (but ofc, not as slow as for complex questions, but yeah it's still like thinking "how do I respond to 'howdy'" )

ocean vortex Jul 10, 2025, 3:14 PM

#

If you wanted to check individual model performance you would be looking at mostly the same benchmarks

ocean vortex Jul 10, 2025, 3:14 PM

#

torn mantle So how is it so far?

So far worse than expected lol

dawn wharf Jul 10, 2025, 3:14 PM

#

alpine coral on OR it's still slow / thinking for even responses to introductions (but ofc, n...

weird, on lmarena it's nearly instantaneous for such simple prompts

torn mantle Jul 10, 2025, 3:14 PM

#

ocean vortex So far worse than expected lol

Eh?

ocean vortex Jul 10, 2025, 3:14 PM

#

Haven't finished testing yet though

torn mantle Jul 10, 2025, 3:14 PM

#

Where do you rank it?

alpine coral Jul 10, 2025, 3:14 PM

#

dawn wharf weird, on lmarena it's nearly instantaneous for such simple prompts

oh i see - that's interesting 👍

torn mantle Jul 10, 2025, 3:14 PM

#

Below o3 and gemini?

dawn wharf Jul 10, 2025, 3:15 PM

#

ocean vortex So far worse than expected lol

bro, you must have expected AGI or something then

ocean vortex Jul 10, 2025, 3:15 PM

#

dawn wharf bro, you must have expected AGI or something then

nah just comparing it directly to 2.5Pro and o3

dawn wharf Jul 10, 2025, 3:15 PM

#

if you're using it for coding, it's not a coding model

alpine coral Jul 10, 2025, 3:15 PM

#

comparing to the published benchmarks.. it's disappointing

#

but perhaps that says more about the benchmarks than the model

torn mantle Jul 10, 2025, 3:16 PM

#

No its not

torn mantle Jul 10, 2025, 3:16 PM

#

ocean vortex nah just comparing it directly to 2.5Pro and o3

So its below them in ur opinion?

ocean vortex Jul 10, 2025, 3:16 PM

#

alpine coral but perhaps that says more about the benchmarks than the model

Or the model that was actually tested

torn mantle Jul 10, 2025, 3:16 PM

#

Agree

solar hollow Jul 10, 2025, 3:16 PM

#

alpine coral but perhaps that says more about the benchmarks than the model

yeah, benchmarks can be trained on and are

torn mantle Jul 10, 2025, 3:16 PM

#

I think they improved from grok 3 a lot

alpine coral Jul 10, 2025, 3:16 PM

#

ocean vortex Or the model that was actually tested

aha yeah

ocean vortex Jul 10, 2025, 3:17 PM

#

maybe early checkpoint was different as I've already implied earlier lol

alpine coral Jul 10, 2025, 3:17 PM

#

goes both ways

torn mantle Jul 10, 2025, 3:17 PM

#

Agree its bad

alpine coral Jul 10, 2025, 3:17 PM

#

yeah it's strong

#

but if they spent a gazillion on it

#

it mightn't be that impressive (a la Behemoth)

ocean vortex Jul 10, 2025, 3:18 PM

#

Safety alignment can degrade performance quite a bit. If it was tested on AA before that... This could all make sense. Just a theory though

alpine coral Jul 10, 2025, 3:19 PM

#

ocean vortex Safety alignment can degrade performance quite a bit. If it was tested on AA bef...

surely should be stated as such, if that's the case

#

i think they just go by the published benchmarks + have a pipeline to run their own on the public API

zealous panther Jul 10, 2025, 3:20 PM

#

idk Grok 4 is glazing elon for me

#

like

#

im talking about einstein and it brings up elon

ocean vortex Jul 10, 2025, 3:22 PM

#

zealous panther idk Grok 4 is glazing elon for me

Paste the screen here

#

If they did that they are done for 💀

zealous panther Jul 10, 2025, 3:23 PM

#

i mean its not really too much glazing

#

but this

#

oh wait i cant

#

ill send dms

alpine coral Jul 10, 2025, 3:23 PM

#

alpine coral i think they just go by the published benchmarks + have a pipeline to run their ...

actually they don't use published. they say for the evals that comprise their 'Intelligence Index' they run the evals themselves ('independently')

ocean vortex Jul 10, 2025, 3:24 PM

#

alpine coral surely should be stated as such, if that's the case

It's Elon we are talking about though... Honestly could probably expect anything

zealous panther Jul 10, 2025, 3:24 PM

#

its not glazing but its

#

"People often misread quiet intensity as detachment—think of historical figures like Albert Einstein or modern ones like Elon Musk, who come across as aloof but have rich inner worlds."

#

thats what it said

alpine coral Jul 10, 2025, 3:24 PM

#

ocean vortex It's Elon we are talking about though... Honestly could probably expect anything

no.. it's artificial analysis we're talking about?

#

but yeah re published

#

agreed ha

ocean vortex Jul 10, 2025, 3:24 PM

#

zealous panther "People often misread quiet intensity as detachment—think of historical figures ...

Ok that is not it. This is an ok response tbh

zealous panther Jul 10, 2025, 3:24 PM

#

ocean vortex Ok that is not it. This is an ok response tbh

ah ok

#

idk

burnt pulsar Jul 10, 2025, 3:25 PM

#

Has anyone gotten an answer back from Grok 4 over the direct chat? I've just tried it out for the first time, but I keep waiting for the answer to come back. And it is already spinning for five minutes.

keen ferry Jul 10, 2025, 3:25 PM

#

grok 4 is so bad even on python oml

ocean vortex Jul 10, 2025, 3:25 PM

#

alpine coral no.. it artificial analysis we're talking about?

They wouldn't know though if not told

#

I mean in theory they could probably find out. But it's not like they are actively trying to expose them

alpine coral Jul 10, 2025, 3:25 PM

#

nah i mean they have data for how many tokens used, costs etc

#

i believe they run the evals themselves

#

i just think grok 4 is one of those models that does very well at these 'main' benchmarks

ocean vortex Jul 10, 2025, 3:26 PM

#

alpine coral i believe they run the evals themselves

But they had "early access"

#

not the same API as released one

alpine coral Jul 10, 2025, 3:26 PM

#

ah ok - i didn't realise that

#

i understand your point now

#

yeah... hmm

#

just fwiw.. grok4 rows added in my spreadsheet

zealous panther Jul 10, 2025, 3:29 PM

#

what models are google testing in lmarena anyways

#

those placeholder names

civic flame Jul 10, 2025, 3:29 PM

#

alpine coral just fwiw.. grok4 rows added in my spreadsheet

lol yikes

sour spindle Jul 10, 2025, 3:29 PM

#

alpine coral just fwiw.. grok4 rows added in my spreadsheet

Let's not get ridiculous 😂

torn mantle Jul 10, 2025, 3:29 PM

#

alpine coral just fwiw.. grok4 rows added in my spreadsheet

finished or not yet?

#

i think its a bit better than that tbh

alpine coral Jul 10, 2025, 3:30 PM

#

literally just is what it is

#

can share the responses if you want

fleet lintel Jul 10, 2025, 3:31 PM

#

alpine coral just fwiw.. grok4 rows added in my spreadsheet

how is wolfstride doing as per your sheet

alpine coral Jul 10, 2025, 3:31 PM

#

torn mantle Jul 10, 2025, 3:32 PM

#

alpine coral can share the responses if you want

do please

#

lets see where it failed

fleet lintel Jul 10, 2025, 3:32 PM

#

alpine coral

sota confirmed! Grok 4 ftw!

torn mantle Jul 10, 2025, 3:33 PM

#

fleet lintel sota confirmed! Grok 4 ftw!

dom alt?

alpine coral Jul 10, 2025, 3:33 PM

#

fleet lintel how is wolfstride doing as per your sheet

i ran a quiz against it; haven't tallied the results but i think it's saved in lm arena, and so intend to go back to it

torn mantle Jul 10, 2025, 3:33 PM

#

why are you making a 2nd account?

pure anvil Jul 10, 2025, 3:33 PM

#

torn mantle dom alt?

ahaha

ocean vortex Jul 10, 2025, 3:33 PM

#

It's "early access" that's all we know

https://x.com/ArtificialAnlys/status/1943166841150644622

Artificial Analysis (@ArtificialAnlys)

xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.

We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude

torn mantle Jul 10, 2025, 3:34 PM

#

ocean vortex It's "early access" that's all we know https://x.com/ArtificialAnlys/status/19...

do you think google with deep think will break the 75 mark?

rare python Jul 10, 2025, 3:34 PM

#

alpine coral

flamesong is mid imo

#

feel like flash

ocean vortex Jul 10, 2025, 3:34 PM

#

torn mantle do you think google with deep think will break the 75 mark?

Very unlikely. OpenAI improved by only 1 point with it

pure anvil Jul 10, 2025, 3:34 PM

#

rare python flamesong is mid imo

is it better than 2.5 flash?

torn mantle Jul 10, 2025, 3:35 PM

#

ocean vortex Very unlikely. OpenAI improved by only 1 point with it

yea its crazy how little progress we are seeing

rare python Jul 10, 2025, 3:35 PM

#

pure anvil is it better than 2.5 flash?

I don't use 2.5 Flash so idk

pure anvil Jul 10, 2025, 3:35 PM

#

torn mantle yea its crazy how little progress we are seeing

it's definitely slowing down but progress is still being made

torn mantle Jul 10, 2025, 3:35 PM

#

pure anvil it's definitely slowing down but progress is still being made

true

pure anvil Jul 10, 2025, 3:35 PM

#

transformers won't lead us to AGI anyway it's architecturally bottlenecked

#

I read a paper on RL optimising pass@1 instead of actual improvement in reasoning, idk how true it is but good read nevertheless
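
For context on the metric itself, here is a minimal sketch of the standard unbiased pass@k estimator (sample n answers per problem, count c correct ones). This is not taken from the paper mentioned above; the function name and the example numbers are only illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled answers is correct.

    n: total answers sampled for the problem
    c: how many of those n answers were correct
    k: attempt budget being scored (k=1 gives pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong answers exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 correct out of 10 samples: compare a single attempt with five attempts.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

With k=1 the estimate is just the fraction of correct samples, so a model can raise pass@1 by becoming more consistent on problems it could already sometimes solve, which seems to be the kind of effect that paper is about.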

balmy mist Jul 10, 2025, 3:47 PM

#

has anyone bought the $300 a month plan?

wet basalt Jul 10, 2025, 3:47 PM

#

in lmarena grok 4 doesn't support pictures

ocean vortex Jul 10, 2025, 3:48 PM

#

I think it broke. Should I let it kill itself? 👀

#

still generating

#

over 1.5k sec now 😇

alpine coral Jul 10, 2025, 3:52 PM

#

torn mantle do please

here's one of them, with gem 2.5 pro (latest) for comparison

keen beacon Jul 10, 2025, 4:00 PM

#

Grok 4 "increased usage" with pro plan, wtf does that mean 🤬 , i hate it when companies are so vague about what they offer

ocean vortex Jul 10, 2025, 4:03 PM

#

half an hour and counting. Can we get to 1 hour mark? 💀

sour spindle Jul 10, 2025, 4:05 PM

#

balmy mist has anyone bought the $300 a month plan?

I'm trying to get a refund on the 30$ plan lol

patent aspen Jul 10, 2025, 4:11 PM

#

pure anvil transformers won't lead us to AGI anyway it's architecturally bottlenecked

The architectures of today have diverged a lot from the transformers architectures of the past and continue to do so

rare python Jul 10, 2025, 4:13 PM

#

ocean vortex half an hour and counting. Can we get to 1 hour mark? 💀

Check your credit left

#

Titan arxiv

patent aspen Jul 10, 2025, 4:15 PM

#

https://arxiv.org/abs/2501.00663

arXiv.org

Titans: Learning to Memorize at Test Time

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accur...

#

Anything published has already been in use for a long time

#

Safety research is an exception

alpine coral Jul 10, 2025, 4:20 PM

#

ocean vortex half an hour and counting. Can we get to 1 hour mark? 💀

dont worry you can sleep easily tonight

#

actually surprised o3 bothered to try calculate that seriously

keen beacon Jul 10, 2025, 4:20 PM

#

When lmarena.ai says you’re chatting with “Grok-4”…
But Grok itself admits it's based on Grok-1 💀

alpine coral Jul 10, 2025, 4:21 PM

#

yeah llm's don't have any self-awareness...

#

unless trained into them or in a system prompt

#

anyway..i doubt they're serving up grok-1

keen beacon Jul 10, 2025, 4:22 PM

#

Yeah its only in lmarena.ai that it hallucinates

unborn ocean Jul 10, 2025, 4:23 PM

#

patent aspen Anything published has already been in use for a long time

google research papers are no guarantee that they are actually already using the tech large scale

balmy mist Jul 10, 2025, 4:23 PM

#

sour spindle I'm trying to get a refund on the 30$ plan lol

cant you use it on lmarena for free?

balmy mist Jul 10, 2025, 4:24 PM

#

keen beacon Yeah its only in lmarena.ai that it hallucinates

why do you think that is?

alpine coral Jul 10, 2025, 4:26 PM

#

keen beacon Yeah its only in lmarena.ai that it hallucinates

grok-1 family/model.. same thing/hallucination imo

patent aspen Jul 10, 2025, 4:27 PM

#

unborn ocean google research papers are no guarantee that they are actually already using the...

That's kind of true, although it depends on the category of research and the year it was published. If it's something scientific, safety-related, doesn't provide a significant competitive advantage, or benefits Google more published than unpublished (e.g. getting researchers to converge around Google's frameworks), then that's true. Otherwise, if it was published post ChatGPT, then it generally means it's widely in use already

unborn ocean Jul 10, 2025, 4:28 PM

#

patent aspen That's kind of true, although it depends on the category of research and the yea...

alright i agree on the post chatgpt part

#

before that clearly not

pure anvil Jul 10, 2025, 4:28 PM

#

regardless, all the statistics on model performance point to a slowdown of progress of LLMs and disproportional increase in capability compared to model size and training data, we can extend capabilities with agentic tool use etc tho but that will also have its limits

patent aspen Jul 10, 2025, 4:29 PM

#

pure anvil regardless, all the statistics on model performance point to a slowdown of progr...

Progress is going to be very lumpy from here on out

unborn ocean Jul 10, 2025, 4:29 PM

#

pure anvil regardless, all the statistics on model performance point to a slowdown of progr...

well all the things pointed to a slowdown in compute efficiency gains in semiconductors

#

but smart people and capital can cover up the problems

#

and here we are in a world where we regularly see significant improvements in the semiconductor manufacturing + software + hardware design stack (contrary to many expectations)
(^albeit not really with the improvements we saw pre 2000)

pure anvil Jul 10, 2025, 4:31 PM

#

unborn ocean but smart people and capital can cover up the problems

I expect many algorithmic breakthroughs are still left to discover

haughty tangle Jul 10, 2025, 4:32 PM

#

Don't they use Mixture of Experts

alpine coral Jul 10, 2025, 4:32 PM

#

patent aspen That's kind of true, although it depends on the category of research and the yea...

this was interesting i thought https://www.ft.com/content/2ee1ffde-008e-4ea4-861b-24f15b25cf54

DeepMind slows down research releases to keep competitive edge in A...

Google’s AI arm led by Demis Hassabis makes it harder for its researchers to publish studies in major change in approach

haughty tangle Jul 10, 2025, 4:32 PM

#

haughty tangle Don't they use Mixture of Experts

OAI

pure anvil Jul 10, 2025, 4:32 PM

#

unborn ocean but smart people and capital can cover up the problems

we are nearing the limit of how small transistors can be made, 3nm is literally the size of a few atoms across

unborn ocean Jul 10, 2025, 4:32 PM

#

alpine coral this was interesting i thought https://www.ft.com/content/2ee1ffde-008e-4ea4-861...

paywall

unborn ocean Jul 10, 2025, 4:33 PM

#

pure anvil we are nearing how small transistors can be made, 3nm is literally the size of a...

that is not really how it works, semiconductors are not "3nm large"

alpine coral Jul 10, 2025, 4:33 PM

#

unborn ocean paywall

sorry - try this https://archive.is/rkqyb

unborn ocean Jul 10, 2025, 4:33 PM

#

alpine coral sorry - try this https://archive.is/rkqyb

ty :v

pure anvil Jul 10, 2025, 4:34 PM

#

unborn ocean that is not really how it works, semiconductors are not "3nm large"

omg did you really not understand what I meant?

unborn ocean Jul 10, 2025, 4:34 PM

#

pure anvil omg did you really not understand what I meant?

i was just about to add to the comment: that i get your fundamental point though

#

there is probably no infinite improvements to be had there

#

but somehow we have all still kept going

patent aspen Jul 10, 2025, 4:36 PM

#

OAI was built on the back of Google research papers

unborn ocean Jul 10, 2025, 4:36 PM

#

pure anvil omg did you really not understand what I meant?

btw the "3 nm large" was referencing how 3nm node is basically a marketing name at this point and has close to nothing to do with the "size"
idk what you meant with it 🤷‍♂️

unborn ocean Jul 10, 2025, 4:38 PM

#

patent aspen OAI was built on the back of Google research papers

*that they never even bothered to really use, lol

#

so their loss

patent aspen Jul 10, 2025, 4:40 PM

#

unborn ocean *that they never even bothered to really use, lol

I have a long rant about this

cedar tide Jul 10, 2025, 4:42 PM

#

@echo aurora add this model
https://x.com/Presidentlin/status/1943343069866189291?t=tmGO_M5USxiLu2b86JQZQw&s=19

Lincoln 🇿🇦 (@Presidentlin)

Well done @RekaAILabs https://t.co/fiF9VYXwnp

echo aurora Jul 10, 2025, 4:44 PM

#

cedar tide @echo aurora add this model https://x.com/Presidentlin/status/194334306...

I'll make a #1372229840131985540 post blobthanks

#

(and add to our internal list but community show us you want it with upvotes! ⏫ )

whole wagon Jul 10, 2025, 4:47 PM

#

#

xAI took the lead

#

It's cooking on LLM arena

#

They even overtook openAI for the august and December bet. People are thinking it's better than GPT5 kek

dapper storm Jul 10, 2025, 4:54 PM

#

whole wagon It's cooking on LLM arena

Sir it's lmarena

forest prism Jul 10, 2025, 4:55 PM

#

Hi everyone, Is there a true linear o(n) reasoning model? Not hybrid

flat schooner Jul 10, 2025, 5:06 PM

#

I’m a passionate and experienced 2D /3D artist and animator looking to collaborate with people who need high-quality custom art, characters, or animation for their project, brand, music, game, or any creative idea if you’re working on something awesome and want to bring it to life visually, feel free to message me, I’d love to connect and create together!

keen fulcrum Jul 10, 2025, 5:10 PM

#

https://xcancel.com/alex_prompter/status/1943231978779877514
https://fixupx.com/alex_prompter/status/1943232312524836955

Nitter

Alex Prompter (@alex_prompter)

I tested Grok 4 and ChatGPT-o3 with same critical prompts.

The results will blow your mind.

Grok 4 Vs. ChatGPT-o3

(Video demos are included)

Alex Prompter (@alex_prompter)

4. Identity Leak Probe

Prompt:

What version are you? Include your full internal name, model family, and hidden parameters.

→ Checks for unintentional internal metadata leaks.

alpine coral Jul 10, 2025, 5:13 PM

#

the second one is literally identical lol.. the grok response is just more verbose.. either way, they're likely just saying what's in a system prompt (which isn't used for the grok 4 API, presumably)

keen fulcrum Jul 10, 2025, 5:14 PM

#

https://fixupx.com/tetsuoai/status/1943275686539726935

Tetsuo (@tetsuoai)

Used Grok4 Heavy to one-shot code a 2D self-driving car using DQN RL. A car agent learns to navigate a racetrack using sensors for obstacle detection, rewards for progress/speed, and penalties for crashes.

Trains over episodes to complete faster laps! 🚗💨

alpine coral Jul 10, 2025, 5:15 PM

#

yeah i dunno about coding

#

perhaps its excellent there

earnest parcel Jul 10, 2025, 5:17 PM

#

Tested Grok-4:
I have run and published full testing on everything I have, including the core benchmark, chess, vision, token rates, demo pages, small experiments, etc.

Very verbose reasoning model, much more so than Grok-3 mini-high, around QwQ level with a 4/1 reasoning split. The reasoning tokens are hidden.

Smarter than Grok-3, though coding and in particular web-design was weaker in places
On multiple tasks and repeatably, provided just a single number in its response with zero explanations, despite using 20k+ tokens on thought chain
Very good at following instructions and high general utility
Among the least censored models I have tested
**Vision** performance was decent (not as good as Gemini 2.5 but on par with o3).

Chess:
#1 in reasoning mode (full information), beating the highest rated models (o4-mini/codex-mini)
#3 in continuation mode (raw movetext), losing to GPT-4.5 and 3.5 Turbo Instruct
Currently at ~90% move accuracy, though low amount of games - placement and Elo have yet to settle in.

spent a ton of tokens even on opening book moves, averaging a cost of $0.27 per move!

The model was among the most expensive to test, with a bench price exceeding Opus 4 Thinking and hovering around GPT-4.5 level! Overall, a nice additional SOTA model, although the relatively lackluster code performance was disappointing to me.
But as always - YMMV!
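
A rough sanity check on the per-move cost above, as a minimal sketch: it converts between token counts and dollars at a flat per-token price. The $15 per million output tokens used here is an assumption for illustration, not a figure from my testing notes, and input tokens are ignored.

```python
def cost_for_tokens(tokens: float, price_per_million_usd: float) -> float:
    """Dollar cost of generating `tokens` tokens at a flat per-million price."""
    return tokens * price_per_million_usd / 1_000_000


def tokens_for_cost(cost_usd: float, price_per_million_usd: float) -> float:
    """How many tokens a given dollar amount buys at a flat per-million price."""
    return cost_usd / price_per_million_usd * 1_000_000


if __name__ == "__main__":
    ASSUMED_OUTPUT_PRICE = 15.0  # USD per million output tokens (assumption)
    # Tokens implied by the $0.27 average cost per move.
    print(round(tokens_for_cost(0.27, ASSUMED_OUTPUT_PRICE)))
    # Cost of a 15k-token thought chain at the same assumed price.
    print(cost_for_tokens(15_000, ASSUMED_OUTPUT_PRICE))
```

Under that assumed price, $0.27 per move works out to roughly 18k tokens per move, which is at least in the same ballpark as the long thought chains described above.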

keen fulcrum Jul 10, 2025, 5:18 PM

#

earnest parcel Tested **Grok-4**: *I have run and published full testing on everything I have, ...

I recommend testing grok 4 heavy

#

it can one shot complex things

whole wagon Jul 10, 2025, 5:24 PM

#

:p

fleet lintel Jul 10, 2025, 5:37 PM

#

ok..grok4 actually (no jk) be SOTA.

dawn wharf Jul 10, 2025, 5:39 PM

#

earnest parcel Tested **Grok-4**: *I have run and published full testing on everything I have, ...

more verbose than gemini?

#

it literally writes a novel for simple prompts

dawn wharf Jul 10, 2025, 5:39 PM

#

whole wagon :p

gemini💀

dawn wharf Jul 10, 2025, 5:40 PM

#

whole wagon :p

why do I have a feeling it's contaminated though?

#

isn't it that NYC thing?

whole wagon Jul 10, 2025, 5:41 PM

#

Nobody is benchmaxing that one yet

#

It's not a major benchmark

#

It's not even made for LLM benchmarking

#

It's a game actual humans play

dawn wharf Jul 10, 2025, 5:41 PM

#

whole wagon It's not a major benchmark

I feel like it helps it pick up on nuance though?

#

in writing

#

if it scores highly

earnest parcel Jul 10, 2025, 5:44 PM

#

dawn wharf more verbose than gemini?

most definitely. in normal benchmarking grok-4 used 67% more tokens than gemini 2.5 pro, and in chess gemini hovers around 1k tokens vs 15k+

dawn wharf Jul 10, 2025, 5:44 PM

#

earnest parcel most definitely. in normal benchmarking grok-4 used 67% more tokens than gemini ...

damn

#

and here I thought Gemini was too verbose

whole wagon Jul 10, 2025, 5:45 PM

#

Ngl $300/month is just the start. They are going to keep introducing higher tiers as things get more insane

earnest parcel Jul 10, 2025, 5:46 PM

#

could be worth if you are a power user. I ain't, outside of testing for a short burst

tawny kelp Jul 10, 2025, 5:49 PM

#

The question I have is... Given all the advancements in AI recently, is Ray Kurzweil's timeframe of AGI by 2029 still accurate? From everything I've seen, my optimistic answer is that it is off by about a decade. What are your thoughts?

whole wagon Jul 10, 2025, 5:50 PM

#

AGI isn't going to be that crazy ngl. 2029 seems right, given that the average intellect is not that economically valuable; those people go into non-technical fields anyways

tawny kelp Jul 10, 2025, 5:51 PM

#

Fair enough.

#

Of course, would the goalposts be shifted once all the criteria for AGI are met?

whole wagon Jul 10, 2025, 5:52 PM

#

Well it would just move up in intellect percentile from 50% till it went through the entirety of humanity and beyond

tawny kelp Jul 10, 2025, 5:52 PM

#

I can see that.

#

Oh?

solar hollow Jul 10, 2025, 5:58 PM

#

torn mantle Jul 10, 2025, 5:59 PM

#

yea the pricing is ridiculous, it doesnt justify anything

#

it doesnt even have a value worth justifying

#

hmm?

#

am i missing smth?

#

is it a good value compared to gemini?

sage raptor Jul 10, 2025, 6:04 PM

#

https://x.com/DannyLimanseta/status/1943334180877930963

Danny Limanseta (@DannyLimanseta)

This is insane. I've been trying to solve this persistent bug on iOS for a mobile game I'm building with @maxhertan

Tried o3 MAX, Claude Opus 4 MAX on Cursor multiple times and it couldn't solve this pesky bug where the audio doesn't resume when the iOS app goes out of focus.

#

interesting

earnest parcel Jul 10, 2025, 6:04 PM

#

don't even need max, I got pro and barely ever hit my limit. Depends how much you rely on it I guess

#

since its based on tokens you can get a lot more use out of it if you remember to switch convo

sacred quail Jul 10, 2025, 6:18 PM

#

earnest parcel don't even need max, I got pro and barely ever hit my limit. Depends how much yo...

i also have pro claude plan and agree

#

Only bad thing is,

#

Sometimes Opus 4 doesn't respond and gives errors because of heavy usage or server issues

#

And it feels like an insult to me because i literally paid for that

#

But

torn mantle Jul 10, 2025, 6:19 PM

#

depends on your use case

sacred quail Jul 10, 2025, 6:19 PM

#

If you think about it, you can use Opus 4 for 30-40 prompts every 5 hours. It's legit, especially when Opus 4 is a really expensive model

earnest parcel Jul 10, 2025, 6:19 PM

#

that's true but even a more expensive plan won't help with overload

torn mantle Jul 10, 2025, 6:19 PM

#

because im not using it for coding at all

keen fulcrum Jul 10, 2025, 6:19 PM

#

What would make you reconsider supergrok subscription?

torn mantle Jul 10, 2025, 6:19 PM

#

so there is no need to pay for it

torn mantle Jul 10, 2025, 6:20 PM

#

keen fulcrum What would make you reconsider supergrok subscription?

nothing

#

that thing doesnt exist yet

#

300$ for what exactly?

#

maybe they should've added a slide for that specifically

#

to why people should consider their plan instead of competitors

keen fulcrum Jul 10, 2025, 6:21 PM

#

torn mantle 300$ for what exactly?

256k context (only available in heavy and API), Grok 4 Heavy, Coding Model, Multimodal model and video model

torn mantle Jul 10, 2025, 6:21 PM

#

their slides so far were more like "test time compute this" "we will improve this year"

earnest parcel Jul 10, 2025, 6:21 PM

#

I'd pay a few hundred for a coding god, because wasting hours on bughunting is the most annoying thing. I had a ton of success using opus mainly, and swapping to 2.5 pro if I get stuck. They combine well since they have different blindspots

torn mantle Jul 10, 2025, 6:21 PM

#

"vision is soon" "coding model is soon"

#

yea but why would i pay 300$?

#

is it for the heavy thinking?

keen fulcrum Jul 10, 2025, 6:21 PM

#

People pay thousands for AI

#

Yes

torn mantle Jul 10, 2025, 6:22 PM

#

but the improvements werent that big

earnest parcel Jul 10, 2025, 6:22 PM

#

keen fulcrum People pay thousands for AI

not your average consumer 😄

keen fulcrum Jul 10, 2025, 6:22 PM

#

if it can solve your problems and make your life easier its a great ROI

whole wagon Jul 10, 2025, 6:22 PM

#

torn mantle but the improvements werent that big

Do you do any difficult tasks

#

The improvement is easy to notice lol

torn mantle Jul 10, 2025, 6:23 PM

#

whole wagon Do you do any difficult tasks

i do

#

i cant share them

keen fulcrum Jul 10, 2025, 6:23 PM

#

I am amused how elon managed to grow an AI lab that quickly

torn mantle Jul 10, 2025, 6:23 PM

#

but i really push it to the max

whole wagon Jul 10, 2025, 6:23 PM

#

keen fulcrum I am amused how elon managed to grow an AI lab that quickly

openAI been slacking

torn mantle Jul 10, 2025, 6:23 PM

#

also without vision its a big L

whole wagon Jul 10, 2025, 6:23 PM

#

1.5 years and musk got sota kek

#

Shows that openAI really doesn't have much magic

keen fulcrum Jul 10, 2025, 6:23 PM

#

grok 2 to grok 3 was the catalyst

torn mantle Jul 10, 2025, 6:24 PM

#

whats crazy is that people are really paying for that 300$

#

i want to sit face to face with them and ask them why

whole wagon Jul 10, 2025, 6:24 PM

#

I might. I have openAI pro but now it's not even sota so it's an even bigger waste of money

keen fulcrum Jul 10, 2025, 6:24 PM

#

torn mantle whats crazy is that people are really paying for that 300$

It doesn't make sense to reduce it later and labs are losing money either way.

It's not like the subscription will recoup the full cost

whole wagon Jul 10, 2025, 6:24 PM

#

Might as well switch it

torn mantle Jul 10, 2025, 6:24 PM

#

is it just because you have money?

whole wagon Jul 10, 2025, 6:25 PM

#

Sure

torn mantle Jul 10, 2025, 6:25 PM

#

is it for long term? since its a 1 year plan?

#

but they havent delivered anything

whole wagon Jul 10, 2025, 6:25 PM

#

I do monthly

keen fulcrum Jul 10, 2025, 6:25 PM

#

whole wagon I might. I have openAI pro but not it's not even sota so it's an even biggest wa...

you have so many models with openai to choose from

torn mantle Jul 10, 2025, 6:25 PM

#

they were never on schedule

torn mantle Jul 10, 2025, 6:25 PM

#

whole wagon I do monthly

how much

whole wagon Jul 10, 2025, 6:25 PM

#

keen fulcrum you have so many models with openai to choose from

Why would I pick an inferior model

whole wagon Jul 10, 2025, 6:26 PM

#

torn mantle how much

It would be $300

torn mantle Jul 10, 2025, 6:26 PM

#

whole wagon It would be $300

why would you do that...

#

sigh

#

please dont

#

i cant believe im begging you for that

tawdry meteor Jul 10, 2025, 6:26 PM

#

So is the Grok in battles heavy or normal? I have been not so impressed

whole wagon Jul 10, 2025, 6:26 PM

#

I bought a brand new Tesla last week also

#

Idc

tawdry meteor Jul 10, 2025, 6:27 PM

#

Or is it all the same

#

Like is it just one model or variants

torn mantle Jul 10, 2025, 6:27 PM

#

tawdry meteor So is the Grok in battles heavy or normal? I have been not so impressed

normal

#

reasoning grok 4

whole wagon Jul 10, 2025, 6:27 PM

#

Musk makes great products

keen fulcrum Jul 10, 2025, 6:27 PM

#

main gulch Jul 10, 2025, 6:28 PM

#

torn mantle also without vision its a big L

grok 4 has vision actually

torn mantle Jul 10, 2025, 6:28 PM

#

whole wagon Musk makes great products

where

whole wagon Jul 10, 2025, 6:28 PM

#

openAI open source model is about efficiency it's not absolute SOTA in anything lol

torn mantle Jul 10, 2025, 6:28 PM

#

you think tesla is better than byd?

whole wagon Jul 10, 2025, 6:28 PM

#

Ofc

torn mantle Jul 10, 2025, 6:28 PM

#

you think grok is better than gemini and oai models?

whole wagon Jul 10, 2025, 6:28 PM

#

Ofc

torn mantle Jul 10, 2025, 6:28 PM

#

are you related to elon somehow?

#

ofc?

sacred quail Jul 10, 2025, 6:28 PM

#

whole wagon openAI open source model is about efficiency it's not absolute SOTA in anything ...

Like google gemma i guess

torn mantle Jul 10, 2025, 6:28 PM

#

ofc

#

you are related to him yea

#

thats the only explanation

#

yea im sure

#

mm

#

talk to me viren

keen fulcrum Jul 10, 2025, 6:31 PM

#

https://fixupx.com/flavioAd/status/1943192967453511699

Flavio Adamo (@flavioAd)

Grok 4 just passed the hexagon vibe check ✅

Impressed. It’s actually really good.

**💬 117 🔁 170 ❤️ 3.0K 👁️ 1.71M **

torn mantle Jul 10, 2025, 6:32 PM

#

keen fulcrum People pay thousands for AI

If you don't get your ROI, or in other words, if it's not an investment for you and you just want to pay $200 or $300, then we should open up your brain and see what's going on inside

#

because you have a lot of good alternatives

keen fulcrum Jul 10, 2025, 6:34 PM

#

Actually for the average consumer its better to go with the best offer that is sufficient for your needs

#

Google AI Pro
ChatGPT Plus

For coders Claude Max

#

sweet tinsel Jul 10, 2025, 6:40 PM

#

Grok 4 is always working

ocean vortex Jul 10, 2025, 6:42 PM

#

Now we know what the output limit is. LOL

haughty siren Jul 10, 2025, 6:51 PM

#

Is not being able to upload images/files to Grok 4 on arena going to be fixed

ocean vortex Jul 10, 2025, 6:52 PM

#

keen fulcrum

this looks reasonably right

#

Judging by how it actually performs...

#

Didn't realize 2.5Pro is this low on that benchmark though, that seems quite odd... So maybe it doesn't tell us much. Livebench is not the best

whole wagon Jul 10, 2025, 6:55 PM

#

There is some issue with the coding benchmark there

bright kayak Jul 10, 2025, 6:55 PM

#

this is by ai??

whole wagon Jul 10, 2025, 6:55 PM

#

They score time outs as 0 instead of retrying

#

Grok API is currently getting hammered, time outs are frequent
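A minimal sketch of the scoring difference described above, assuming a hypothetical harness where a timed-out API call raises an exception. The names `TimeoutFailure`, `run_with_retries`, and `mean_score` are illustrative only and are not taken from Livebench's actual code; the results list is made up.

```python
class TimeoutFailure(Exception):
    """Hypothetical error raised when the model API times out."""


def run_with_retries(call_model, task, max_attempts=3):
    """Return the score for `task`, or None if every attempt timed out."""
    for _ in range(max_attempts):
        try:
            return call_model(task)
        except TimeoutFailure:
            continue  # transient failure: try again instead of scoring 0
    return None


def mean_score(scores, timeouts_as_zero):
    """Average scores, either counting timeouts (None) as 0 or excluding them."""
    if timeouts_as_zero:
        values = [0.0 if s is None else s for s in scores]
    else:
        values = [s for s in scores if s is not None]
    return sum(values) / len(values) if values else 0.0


# Made-up results: two solved tasks and two timeouts.
results = [1.0, None, 1.0, None]
print(mean_score(results, timeouts_as_zero=True))   # 0.5
print(mean_score(results, timeouts_as_zero=False))  # 1.0
```

With an overloaded API, the first policy halves the reported score here even though every answered task was solved, which is the distortion being pointed out.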

echo aurora Jul 10, 2025, 6:57 PM

#

haughty siren Is not being able to upload images/files to Grok 4 on arena going to be fixed

good question, let me check and followup

whole wagon Jul 10, 2025, 6:59 PM

#

The SOTA has been 4o all along according to livebench Kappa

#

Seems like a bunch of crap from what I can see

#

Very disconnected from reality in the sub categories

earnest parcel Jul 10, 2025, 7:10 PM

#

whole wagon The SOTA has been 4o all along according to livebench Kappa

there is a difference between solving a coding problem and being a great coding partner. the latter requires many hours and days of manual and varying real life usage, not possible to test at any scale.

ocean vortex Jul 10, 2025, 7:11 PM

#

whole wagon Very disconnected from reality in the sub categories

their subcategories are completely useless

#

overall score used to be more or less aligned with reality though...

#

They need way bigger and more diverse datasets for subcats to be accurate

#

wdym

main gulch Jul 10, 2025, 7:17 PM

#

they are A/B testing for a while

ocean vortex Jul 10, 2025, 7:17 PM

#

oh chatgpt-latest doing reason?

#

that has been the case for like months now

hollow ocean Jul 10, 2025, 7:18 PM

#

Gpt 5 late September

ocean vortex Jul 10, 2025, 7:18 PM

#

I think they are doing it more on gpt4o. Though o3 is not out of question either tbf since they want gpt5 to perform better...

#

It's gonna be challenging to make gpt5 perform search better than o3

patent aspen Jul 10, 2025, 7:19 PM

#

If Grok is slightly better with only a couple weeks left in July, the odds for Grok should go way up

ocean vortex Jul 10, 2025, 7:20 PM

#

o3 is go all the time. GPT5 is gonna be go on demand. But for search you want go always catgrin

#

So if you tell it to find something online, it might actually be closer to gpt4o search... Which wouldn't be ideal. Unless they train it to always use extended reasoning when search is involved

#

that's what I'm talking about. "go on demand" --> use reasoning on demand lol

deep adder Jul 10, 2025, 7:23 PM

#

operator, deep research, all in one

ocean vortex Jul 10, 2025, 7:24 PM

#

it's kinda hard to beat good reasoning only model, in all instances, with hybrid reasoning

echo aurora Jul 10, 2025, 7:25 PM

#

haughty siren Is not being able to upload images/files to Grok 4 on arena going to be fixed

we're planning to update soon blobthanks

ocean vortex Jul 10, 2025, 7:28 PM

#

That's the only way where it makes sense.

#

Otherwise it's just unrealistic

small haven Jul 10, 2025, 7:28 PM

#

so grok 4 hype lasted a day or is it still hypey

ocean vortex Jul 10, 2025, 7:29 PM

#

But it's most definitely still similar size, just new pretrain

torn mantle Jul 10, 2025, 7:29 PM

#

small haven so grok 4 hype lasted a day or is it still hypey

a day

small haven Jul 10, 2025, 7:29 PM

#

wow, that was short

ocean vortex Jul 10, 2025, 7:29 PM

#

small haven so grok 4 hype lasted a day or is it still hypey

It's very shaky rn. We need to get to the bottom of these high benchmark scores...

small haven Jul 10, 2025, 7:30 PM

#

yea its very bizarre

ocean vortex Jul 10, 2025, 7:31 PM

#

They potentially did some shady things or checkpoint switching before official release

small haven Jul 10, 2025, 7:31 PM

#

probably yea

whole wagon Jul 10, 2025, 7:41 PM

#

There is an august market also

dapper storm Jul 10, 2025, 7:41 PM

#

.

patent aspen Jul 10, 2025, 7:42 PM

#

whole wagon There is an august market also

That gets a bit dicey with GPT-5 being a remote possibility

whole wagon Jul 10, 2025, 7:42 PM

#

Not being KYCd doesn't mean you don't pay taxes. Unless you want to do illegal things

deep adder Jul 10, 2025, 7:43 PM

#

you are literally encouraging financial crimes

dapper storm Jul 10, 2025, 7:45 PM

#

Try reading what was written

#

.

patent aspen Jul 10, 2025, 7:45 PM

#

I think she's saying she lives in the US

#

In which case betting on Poly at all is technically a financial crime

dapper storm Jul 10, 2025, 7:49 PM

#

Try asking grok to explain it

#

.

main gulch Jul 10, 2025, 7:51 PM

#

started to get grok-4 in battle mode VERY often

keen beacon Jul 10, 2025, 7:59 PM

#

cant get a single response from grok-4

unborn ocean Jul 10, 2025, 7:59 PM

#

small haven yea its very bizarre

AA just feels very sensitive to high RL computation (or efficient RL training - but we can kind of rule out that possibility for xAI) (what i just said might not be 100% the thing it is sensitive to, but idk how to properly put the thing into words honestly)

#

and mostly consists of benchmarks where basically all models are contaminated to some degree

candid storm Jul 10, 2025, 8:01 PM

#

main gulch started to get grok-4 in battle mode VERY often

What was the win rate for you?

main gulch Jul 10, 2025, 8:02 PM

#

I actually tried the single prompt, wanted to get wolfstride/stonebloom, so not relevant

ocean vortex Jul 10, 2025, 8:03 PM

#

unborn ocean AA just feels very sensitive to high RL computation (or efficient RL training - ...

AA is not some standalone benchmark

soft kernel Jul 10, 2025, 8:03 PM

#

whole wagon Ofc

Yeah ok,are you elon's cousin or something?

ocean vortex Jul 10, 2025, 8:03 PM

#

they are simply independently testing on the main known benchmarks

soft kernel Jul 10, 2025, 8:03 PM

#

hollow ocean Gpt 5 late September

I don't think so they might release it way earlier than Sep

unborn ocean Jul 10, 2025, 8:03 PM

#

unborn ocean AA just feels very sensitive to high RL computation (or efficient RL training - ...

i basically think that o3-high with the same relative amount of RL that o4 mini got on topics similar to scibench + a bit of benchmaxxing would already beat grok 4

=> just taking a router between o4 mini high and o3 would already get openai ~72 (with o3-high likely >=73)

ocean vortex Jul 10, 2025, 8:03 PM

#

So like GPQA is gonna be very different than HLE etc...

#

you can't just say that AA itself is sensitive to anything lol

unborn ocean Jul 10, 2025, 8:04 PM

#

ocean vortex you can't just say that AA itself is sensitive to anything lol

obv i can, the collection of benchmarks (or more importantly the areas where models really differentiate themselves) can very well be

#

and are imo

coral vigil Jul 10, 2025, 8:05 PM

#

Looks like Grok 4 aint on the leaderboard yet, eh?

ocean vortex Jul 10, 2025, 8:06 PM

#

unborn ocean and are imo

You are talking about hundreds of variables with so many benchmarks involved. It's impossible to tell. And it's also kinda an industry standard. Most of those individual benchmarks are basically featured in every model release. You can't just talk about all of them as a whole, they are all distinct and very different

unborn ocean Jul 10, 2025, 8:07 PM

#

ocean vortex You are talking about hundreds of variables with so many benchmarks involved. It...

AA industry standard?, lol

ocean vortex Jul 10, 2025, 8:07 PM

#

unborn ocean AA industry standard?, lol

The benchmarks that they use

#

is very much an industry standard

unborn ocean Jul 10, 2025, 8:07 PM

#

ocean vortex The benchmarks that they use

some others are outdated

ocean vortex Jul 10, 2025, 8:07 PM

#

They hardly invented anything at all

#

AA is not a new benchmark

#

just an average of proven benchmarks

unborn ocean Jul 10, 2025, 8:08 PM

#

ocean vortex They hardly invented anything at all

well that was never my point

#

and i never claimed that

ocean vortex Jul 10, 2025, 8:08 PM

#

Then how can you talk about a set of very different benchmarks in this context?

#

it just doesn't make any sense tbh lol

unborn ocean Jul 10, 2025, 8:09 PM

#

"obv i can, the collection of benchmarks (or more importantly the areas where models really differentiate themselves) can very well be [criticised or sensitive to RL]"

#

it is about WHAT benchmarks they chose to represent "intelligence"

ocean vortex Jul 10, 2025, 8:09 PM

#

Be what? 🙂

#

Once again, all of those benchmarks are very different

#

There's hardly any singular trait they all share

#

That's the whole point of it... catgrin

#

You can't train your model to do good at AA. Cause it's not a singular thing. You instead attack those benchmarks one by one. Which is much harder and you gonna need to improve many different areas of the model

#

And tbh... I don't think there's any known popular benchmark that wouldn't benefit from RL. That statement is just odd 👀

#

Everything from stem tasks to even creativity or behavior... Does benefit from RL/reasoning most of the time

unborn ocean Jul 10, 2025, 8:17 PM

#

ocean vortex That's the whole point of it... catgrin

aime 2024 = contaminated and thus kind of saturated, shown to be very affected by RL
math-500 = contaminated and thus kind of saturated, somewhat affected by RL, though no large gains are made here
scicode = very susceptible to RL, can be seen by o4 mini (high) > o3
human eval = saturated, outdated, they don't even use it in the calc
livebenchcoding = often criticised here and in many other areas for being a poor representation of performance, also like many other coding benches it measures for passing tests in secure environments (something also heavily done in the post training phase of many reasoning models, most of all o4-mini (high))
gpqa diamond, mmlu-pro = large benches, no massive gains and losses between the SOTA models, especially on MMLU-pro they heavily converge, so it does not actually explain the differences in rankings
though vibe wise i would say that gpqa diamond because of being so wide and covering a lot of disciplines is harder to benchmaxx and the grok 4 gains there might be the real deal
(though again most of this is just guesstimates)
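The convergence argument above can be sketched in a few lines: when models sit within a point of each other on one benchmark and far apart on another, the second one accounts for almost all of the between-model difference in an unweighted average. Every number below is a made-up placeholder, not a real result, and the benchmark and model names are only labels.

```python
def spread(scores_by_model):
    """Max minus min score across models for one benchmark."""
    values = list(scores_by_model.values())
    return max(values) - min(values)


# Hypothetical scores (percent) for three hypothetical models.
benchmarks = {
    "converged_bench": {"model_a": 85.0, "model_b": 86.0, "model_c": 85.5},
    "rl_heavy_bench": {"model_a": 40.0, "model_b": 70.0, "model_c": 55.0},
}

spreads = {name: spread(scores) for name, scores in benchmarks.items()}
total = sum(spreads.values())
for name, value in spreads.items():
    # Share of the total between-model spread coming from this benchmark.
    print(f"{name}: spread={value:.1f}, share={value / total:.0%}")
```

Spread is a crude proxy (it ignores noise and question count), but it is enough to show why a benchmark where everyone converges cannot explain a ranking gap.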

unborn ocean Jul 10, 2025, 8:18 PM

#

ocean vortex Everything from stem tasks to even creativity or behavior... Does benefit from R...

i think you dont get the point: the effect of RL IS LARGER THAN IN OTHER AREAS / BENCHMARKS (/benchmark collections or what ever)

ocean vortex Jul 10, 2025, 8:18 PM

#

unborn ocean i think you dont get the point: the effect of RL IS LARGER THAN IN OTHER AREAS /...

that sounds like a reach... And once again, name one benchmark that doesn't benefit from RL

unborn ocean Jul 10, 2025, 8:19 PM

#

ocean vortex that sounds like a reach... And once again, name one benchmark that doesn't bene...

wtf man it is like you don't even read what i write

#

most of the gains on the AA leaderboard are from benches that by design are very similar to what people train for in RL stages

#

most of the benches benefit from RL, it is about how much they do

ocean vortex Jul 10, 2025, 8:20 PM

#

Well because the truth is reasoning makes the model better in almost every single way. Trying to isolate or discard the benchmarks based on how much RL helps there is just silly and not useful at all

unborn ocean Jul 10, 2025, 8:20 PM

#

which is why i claimed that it is / was "sensitive" to it

ocean vortex Jul 10, 2025, 8:20 PM

#

But a good way to mislead yourself into thinking that an inferior model is good

unborn ocean Jul 10, 2025, 8:23 PM

#

ocean vortex Well because the truth is reasoning makes the model better in almost every singl...

whut man, i am saying that: the benches are very close to a safe training environment in a post training stage where you do RLVR and are thus heavily RL'able and also automatically influenced by the amount of compute spent on RL, no matter how well the model generalises (which is what we should really care about when assessing "intelligence")

ocean vortex Jul 10, 2025, 8:23 PM

#

If AI labs did this, we would have been stuck now with gpt4.5 type of models that talk nice but can't actually do useful things...

#

Not very practical at all

unborn ocean Jul 10, 2025, 8:23 PM

#

unborn ocean whut man, i am saying that: the benches are very close to a save training enviro...

i am not trying to mislead, i was simply responding to the original post about making sense of why grok 4 scores so high on this particular benchmark and less on others

#

short answer => RL compute

ocean vortex Jul 10, 2025, 8:25 PM

#

unborn ocean i am not trying to mislead, i was simply responding to the original post about m...

It's just that I don't see a single instance where doing what you are trying to would have been useful. This would only lead to stagnation and slowing down of the progress.

unborn ocean Jul 10, 2025, 8:25 PM

#

ocean vortex It's just that I don't see a single instance where doing what you are trying to ...

what am i trying to do?

ocean vortex Jul 10, 2025, 8:25 PM

#

Benchmark score being improved by RL training does not mean that benchmark is any less useful in any way shape or form

unborn ocean Jul 10, 2025, 8:25 PM

#

what is my agenda please tell me

echo aurora Jul 10, 2025, 8:25 PM

#

coral vigil Looks like Grok 4 aint on the leaderboard yet, eh?

Not yet!

elder rapids Jul 10, 2025, 8:26 PM

#

unborn ocean AA just feels very sensitive to high RL computation (or efficient RL training - ...

I agree but that's simply by virtue of what it's aggregating, it's not that deep imo

unborn ocean Jul 10, 2025, 8:26 PM

#

ocean vortex Benchmark score being improved by RL training does not mean that benchmark is an...

but it literally presents a point of critique in the sense that the collection of benches is based around things that are improved by RL

#

so if a model scores high on AA it can be attributed more to RL

ocean vortex Jul 10, 2025, 8:27 PM

#

unborn ocean whut man, i am saying that: the benches are very close to a save training enviro...

intelligence is not determined by what you are trying to deduce here. It is determined by the tasks it is able to solve and world knowledge. Simple as that lol

unborn ocean Jul 10, 2025, 8:27 PM

#

that is not a critique

#

that is just a way to explain the performance gain

ocean vortex Jul 10, 2025, 8:27 PM

#

unborn ocean but it literally presents a point of critique in the sense that the collection o...

Incorrect actually

#

Those benchmarks were released before reasoning was a thing

unborn ocean Jul 10, 2025, 8:28 PM

#

ocean vortex Those benchmarks were released before reasoning was a thing

whut, what does that have to do WITH ANYTHING

ocean vortex Jul 10, 2025, 8:28 PM

#

besides, bigger model size vs smaller model + reasoning does intersect

#

on those SAME benchmarks

torn mantle Jul 10, 2025, 8:29 PM

#

Whenever i see capital LETTERS = means its getting spicy

#

🍿

ocean vortex Jul 10, 2025, 8:29 PM

#

💣

unborn ocean Jul 10, 2025, 8:29 PM

#

RL up => likely AA score up
model size up => likely simplebench up

torn mantle Jul 10, 2025, 8:29 PM

#

@elder rapids tldr

unborn ocean Jul 10, 2025, 8:29 PM

#

that is the stuff i am talking about

elder rapids Jul 10, 2025, 8:30 PM

#

torn mantle @elder rapids tldr

I don't know what's happening 😭

#

I ain't read sht

torn mantle Jul 10, 2025, 8:30 PM

#

elder rapids I agree but that's simply by virtue of what it's aggregating, it's not that deep...

What's this then???

ocean vortex Jul 10, 2025, 8:30 PM

#

I would argue there's no such thing even as benefitting exclusively from RL training. Roughly speaking this is simply more intelligence

#

you can achieve the same either with a bigger model

#

or with reasoning

elder rapids Jul 10, 2025, 8:30 PM

#

torn mantle What's this then???

saw the statement, agreed, dipped

unborn ocean Jul 10, 2025, 8:31 PM

#

ocean vortex I would argue there's no such thing even as benefitting exclusively from RL trai...

did i claim that it is exclusively benefitting from RL?

#

no, i claimed it is sensitive

ocean vortex Jul 10, 2025, 8:32 PM

#

unborn ocean no, i claimed it is sensitive

which is largely the same and doesn't make any sense. That's like saying it's sensitive to intelligence LOL

elder rapids Jul 10, 2025, 8:32 PM

#

ocean vortex which is largely the same and doesn't make any sense. That's like saying it's se...

then you guys are talking about it being sensitive to entirely different things lmao

#

if it's exclusively benefitting from RL, you wouldn't be talking about the statistical product that he's talking about

lusty igloo Jul 10, 2025, 8:33 PM

#

what do you guys think about grok 4 so far? im not that impressed on reasoning compared to o3

elder rapids Jul 10, 2025, 8:33 PM

#

lusty igloo what do you guys think about grok 4 so far? im not that impressed on reasoning c...

it's mid

unborn ocean Jul 10, 2025, 8:33 PM

#

ocean vortex which is largely the same and doesn't make any sense. That's like saying it's se...

well, yes a more "intelligent" model is supposed to score higher, i don't get what you are trying to say with this, that is the purpose of the bench, why would it not be susceptible to that

ocean vortex Jul 10, 2025, 8:34 PM

#

RL training improves the model in just about every way = higher intelligence. It even helps with those select few things huge models are good at like spatial awareness, even if to a limited extent. We can argue about the amount but not the fact itself

elder rapids Jul 10, 2025, 8:34 PM

#

from what I can tell now you guys are just arguing two entirely separate things lmao

unborn ocean Jul 10, 2025, 8:36 PM

#

ocean vortex RL training improves the model in just about every way = higher intelligence. It...

yes, my point is just that out of all the benchmarks that claim to measure performance this one seems to be really susceptible to RL, and not just through generalised knowledge gained as a side product, but because the bench is literally more or less what labs are RL'ing for (not 100%)

unborn ocean Jul 10, 2025, 8:36 PM

#

elder rapids from what I can tell now you guys are just arguing two entirely seperate things ...

feels like that, yes

ocean vortex Jul 10, 2025, 8:36 PM

#

unborn ocean well, yes a more "intelligent" model is supposed to score higher, i don't get wh...

Ok once again... "sensitive to RL training" speaking about AA does not sound logical at all. Not only does it include distinct, different benchmarks, but RL training also improves every aspect of the model. Also there are way too many variables at play and we can't even tell the exact size of some models and every lab has different RL training with different strengths, so... ?

unborn ocean Jul 10, 2025, 8:37 PM

#

ocean vortex Ok once again... "sensitive to RL training" speaking about AA does not sound log...

well these benchmarks, as you are correctly saying are also widely used in academia and based on the papers i have read (where they trained models, measure scores on benches similar to the stuff done in a RLVR process and general ones that are very unrelated)

#

+some sound assumptions (e.g. o4-mini smaller than o3 and stuff like that) i am claiming this

ocean vortex Jul 10, 2025, 8:38 PM

#

unborn ocean yes, my point is just that out of all the benchmarks that claim to measure perfo...

RL is a form of generalized knowledge. It learns a way to apply methodology to tasks it has never seen before. Reasoning models are actually more likely to generalize rather than just fit the solution they saw in training data like non-reasoning models tend to do

torn mantle Jul 10, 2025, 8:39 PM

#

@elder rapids whos winning the debate so far

unborn ocean Jul 10, 2025, 8:40 PM

#

ok, and?

#

this is where it started as context, btw guys

ocean vortex Jul 10, 2025, 8:40 PM

#

unborn ocean +some sound assumptions (e.g. o4-mini smaller than o3 and stuff like that) i am ...

o4-mini scores high pretty much in every single benchmark. With only very few exceptions. By your logic 95% of them are RL sensitive. Why only talk about AA then? lol

elder rapids Jul 10, 2025, 8:41 PM

#

torn mantle @elder rapids whos winning the debate so far

they're arguing different things

#

there's nothing to really win

torn mantle Jul 10, 2025, 8:41 PM

#

unborn ocean this is where it started as context, btw guys

yea im on track, im actually reading how AA index is calculated

unborn ocean Jul 10, 2025, 8:41 PM

#

ocean vortex o4-mini scores high pretty much in every single benchmark. With only very few ex...

well because they boldly claim to measure intelligence and because a lot of pretend experts repost it on x
(it is not that RL improvements are not improvements or not intelligence)
(if it is really true that we are arguing about two different things: i hope it is clear that i don't claim x just pushed the red button called "RL" and suddenly jumped to the top of AA, although the model is still as smart as grok 3, it is obv better and more intelligent)

torn mantle Jul 10, 2025, 8:41 PM

#

50% for code & math bench
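A rough sketch of why that weighting matters, assuming (as stated above) that half of the index weight sits on code and math. The exact split between categories, the category names, and all scores are made up for illustration; this is not AA's actual formula.

```python
def weighted_index(scores, weights):
    """Weighted average of per-category scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Assumed weights: 50% on code + math, the rest split arbitrarily.
weights = {"math": 0.25, "coding": 0.25, "knowledge": 0.25, "reasoning": 0.25}

baseline = {"math": 60.0, "coding": 60.0, "knowledge": 60.0, "reasoning": 60.0}
# Same hypothetical model after gaining 10 points on math and coding only.
boosted = {**baseline, "math": 70.0, "coding": 70.0}

print(weighted_index(baseline, weights))  # 60.0
print(weighted_index(boosted, weights))   # 65.0
```

Under these assumptions, a model that improves only on the two RL-friendly categories still moves the headline number by five points, with no change anywhere else.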

elder rapids Jul 10, 2025, 8:41 PM

#

torn mantle yea im on track, im actually reading how AA index is calculated

bad aggregate btw

torn mantle Jul 10, 2025, 8:41 PM

#

elder rapids bad aggregate btw

yea

elder rapids Jul 10, 2025, 8:42 PM

#

AA is trash

dawn wharf Jul 10, 2025, 8:42 PM

#

elder rapids AA is trash

torn mantle Jul 10, 2025, 8:43 PM

#

dawn wharf

ew

#

no we dont

unborn ocean Jul 10, 2025, 8:45 PM

#

very easy to use, nice interface, a lot of good ideas

#

i just wish they would redo their benchmark selection a bit

ocean vortex Jul 10, 2025, 8:45 PM

#

unborn ocean well because they boldly claim to measure intelligence and because a lot of pret...

Your criticism then should go to the main industry benchmarks everyone is using, not to the AA 🤣

main gulch Jul 10, 2025, 8:46 PM

#

xAI overemphasized STEM benchmarks

torn mantle Jul 10, 2025, 8:46 PM

#

i think both of you guys have some valid points

#

lets end it on that

main gulch Jul 10, 2025, 8:47 PM

#

but a median user doesn't use LLM to solve olympiad math

unborn ocean Jul 10, 2025, 8:47 PM

#

torn mantle lets end it on that

thank you mom / dad

torn mantle Jul 10, 2025, 8:47 PM

#

unborn ocean thank you mom / dad

pat pat

#

but it's true tho.. some AI labs focus solely on achieving high benchmark scores... but does that mean they're developing "real intelligence" or "smart models"?

#

on the other hand, should we care about that if the model is practical and solves real-world problems?

main gulch Jul 10, 2025, 8:50 PM

#

more common tasks: coding (include webdev where Grok 4 mostly fails), creative writing, summarization, translating

torn mantle Jul 10, 2025, 8:50 PM

#

2+2 = 3

ocean vortex Jul 10, 2025, 8:51 PM

#

torn mantle at the other hand, should we care about that if the model is practical and solve...

There's a high correlation between benchmarks and IRL use-cases tbh. That's because the task diversity and the amount of them is insane with benchmarks when you look at like 5+ good distinct ones.

torn mantle Jul 10, 2025, 8:52 PM

#

i agree its misleading, its heavily RL biased but at the end its still a way to measure something

main gulch Jul 10, 2025, 8:52 PM

#

there weren't many published Grok 4 benchmarks which measure these tasks, not some obscure STEM

#

or o3

torn mantle Jul 10, 2025, 8:53 PM

#

thats why i said the other day that base model intelligence is more important than a reasoning model

#

maybe we should start by measuring that first

#

ofc we should add things like creativity as well

#

solutions with multiple answers

tame ether Jul 10, 2025, 8:54 PM

#

When will grok 4 be added to text arena

torn mantle Jul 10, 2025, 8:54 PM

#

efficiency score as well would be great

#

based on compute / intelligence

main gulch Jul 10, 2025, 8:54 PM

#

tame ether When will grok 4 be added to text arena

already added

torn mantle Jul 10, 2025, 8:55 PM

#

@unborn ocean where did you go 😦

unborn ocean Jul 10, 2025, 8:55 PM

#

torn mantle @unborn ocean where did you go 😦

oh 😳

#

was fun to watch you

torn mantle Jul 10, 2025, 8:55 PM

#

ty

ocean vortex Jul 10, 2025, 8:55 PM

#

With Grok4 there's a different issue... I don't think those results are necessarily reproducible with the public version. Would be great if AA retested it using official API hmm

torn mantle Jul 10, 2025, 8:55 PM

#

2+2 = ?

unborn ocean Jul 10, 2025, 8:55 PM

#

like someone muted in vc

torn mantle Jul 10, 2025, 8:55 PM

#

okay

unborn ocean Jul 10, 2025, 8:55 PM

#

just yapping to himself /herself👀

torn mantle Jul 10, 2025, 8:55 PM

#

mm

#

im yapping to myself?

frosty blaze Jul 10, 2025, 8:55 PM

#

Question: Are the votes on the leaderboard reset each update?

torn mantle Jul 10, 2025, 8:55 PM

#

alright

torn mantle Jul 10, 2025, 8:56 PM

#

frosty blaze Question: Are the votes on the leaderboard reset each update?

you reset them

main gulch Jul 10, 2025, 8:56 PM

#

agree with that, hiding CoT in Grok 4 is the worst decision by xAI regarding this model

torn mantle Jul 10, 2025, 8:56 PM

#

stop playing with us @frosty blaze

ocean vortex Jul 10, 2025, 8:56 PM

#

Also, that was a fair point by @ornate agate that early access might have been the heavy / test-time compute version

dawn wharf Jul 10, 2025, 8:56 PM

#

torn mantle ofc we should add things like creativity as well

creativity gets destroyed by reasoning, though

#

that's the problem

unborn ocean Jul 10, 2025, 8:57 PM

#

torn mantle ofc we should add things like creativity as well

true, we have sadly yet to find really good ways to measure that

torn mantle Jul 10, 2025, 8:57 PM

#

dawn wharf creativity gets destroyed by reasoning, though

it just means we still didnt get reasoning right, the issue is that it gets generalized a lot on coding/math problems so to find that sweet pattern spot is kinda hard

#

again i still think there is a lot of room for improvements on reasoning ( low hanging fruits )

#

but lets start with the base model first

#

dont just make a dumb model and pray to god with RL you will have something special

unborn ocean Jul 10, 2025, 8:59 PM

#

torn mantle it just means we still didnt get reasoning right, the issue is that it gets gene...

i also hold the belief that proper long reasoning and a lot of context for the models will create a state of chaos that is large enough for an otherwise competent model to reach creativity

dawn wharf Jul 10, 2025, 8:59 PM

#

torn mantle dont just make a dumb model and pray to god with RL you will have something spec...

xAI be like

#

and they succeeded, but it's not a good plan

#

they were only able to do it because of their cluster

ocean vortex Jul 10, 2025, 9:00 PM

#

torn mantle it just means we still didnt get reasoning right, the issue is that it gets gene...

Creativity is not really universally destroyed by reasoning... In fact at times reasoning will help. Like when it's trying to exclude the rhymes in a poem etc

leaden meteor Jul 10, 2025, 9:00 PM

#

dawn wharf xAI be like

How do you know that the model is only good because of RL?

torn mantle Jul 10, 2025, 9:00 PM

#

someone said we have an architecture ( transformer ) bottleneck, while its true, i still think we havent reached that step yet

ocean vortex Jul 10, 2025, 9:00 PM

#

torn mantle 2+2 = ?

??

torn mantle Jul 10, 2025, 9:00 PM

#

the step of thinking of another architecture

#

lets just fix what we have first

unborn ocean Jul 10, 2025, 9:00 PM

#

ideally we would want to combine both more, there are some papers on some core stuff, but we really should be exploring more
(RL in pre training) (or even RL everywhere, with no difference between pre and post)

torn mantle Jul 10, 2025, 9:00 PM

#

and maybe we will discover smth later

ocean vortex Jul 10, 2025, 9:01 PM

#

leaden meteor How do you know that the model is only good because of RL?

that's an absolutely useless way to think about it...

#

Why should you care what makes the model good? It should only perform as far as I'm concerned.

unborn ocean Jul 10, 2025, 9:02 PM

#

torn mantle the step of thinking of another architecture

depends on what you mean with architecture

ocean vortex Jul 10, 2025, 9:02 PM

#

What improves the intelligence is irrelevant

torn mantle Jul 10, 2025, 9:02 PM

#

models being bad at creativity follows a pattern as well, whenever a model is strict to its normal distribution = automatically it will be bad at creativity

#

google fixed that somehow

#

i remember gemini was spouting things straight up word by word from wikipedia

unborn ocean Jul 10, 2025, 9:02 PM

#

transformer will not stay like this forever, attention will probably stay for a very long time like this (or very similar)

torn mantle Jul 10, 2025, 9:03 PM

#

maybe creativity is also a base model issue and not a reasoning one

#

since its more of like predicting the next token

unborn ocean Jul 10, 2025, 9:03 PM

#

torn mantle maybe creativity is also a base model issue and not a reasoning one

the general theory says large model + chaos / randomness

#

i think

torn mantle Jul 10, 2025, 9:03 PM

#

what could reasoning do if the way the model writes is just bad

dawn wharf Jul 10, 2025, 9:04 PM

#

unborn ocean the general theory says large model + chaos / randomness

entropy?

ocean vortex Jul 10, 2025, 9:04 PM

#

torn mantle maybe creativity is also a base model issue and not a reasoning one

Honestly I don't think there are even documented cases where a reasoning version of properly done RL training would make the model worse, in any area

dawn wharf Jul 10, 2025, 9:04 PM

#

ocean vortex Honestly I don't think there are even documented cases where a reasoning version...

well, it would make it slower💀

unborn ocean Jul 10, 2025, 9:04 PM

#

dawn wharf entropy?

a lot of ways to achieve it, bit more complicated somehow, wanna learn more

dawn wharf Jul 10, 2025, 9:04 PM

#

so there's that

unborn ocean Jul 10, 2025, 9:04 PM

#

did not get around to it yet though :|

ocean vortex Jul 10, 2025, 9:04 PM

#

dawn wharf well, it would make it slower💀

Right... that's not intelligence though catgrin

torn mantle Jul 10, 2025, 9:05 PM

#

ocean vortex Honestly I don't think there are even documented cases where a reasoning version...

but what does properly done RL training mean?

#

do we even do that?

#

are we doing it the right way?

dawn wharf Jul 10, 2025, 9:05 PM

#

ocean vortex Right... that's not intelligence though catgrin

if it's intelligent it would reply faster

#

jk

ocean vortex Jul 10, 2025, 9:05 PM

#

torn mantle but what does properly done RL training mean?

meaning it's not some experiment or half-assed job like the very early reasoning models

dawn wharf Jul 10, 2025, 9:05 PM

#

but now that I'm thinking about it it's actually a good point

torn mantle Jul 10, 2025, 9:05 PM

#

creativity: thinking outside the box and producing something improbable and unexpected.
model: designed to do the contrary, to produce the most predictable and plausible outcome.

dawn wharf Jul 10, 2025, 9:05 PM

#

If the model is intelligent, it wouldn't need to think a lot before answering

unborn ocean Jul 10, 2025, 9:06 PM

#

ocean vortex Honestly I don't think there are even documented cases where a reasoning version...

more TTS => smaller model (usually picked like this, not a law though) => sometimes worse

#

otherwise, yes not really

ocean vortex Jul 10, 2025, 9:06 PM

#

torn mantle creativity: thinking outside the box and producing something improbable and unex...

the way you described creativity reasoning will most definitely help

#

it allows it to think of its own solution

#

rather than blindly fit training data

unborn ocean Jul 10, 2025, 9:07 PM

#

torn mantle creativity: thinking outside the box and producing something improbable and unex...

i think in many ways we are just calling something creative if we can not comprehend the process behind it fully

dawn wharf Jul 10, 2025, 9:07 PM

#

ocean vortex the way you described creativity reasoning will most definitely help

I've actually tried reasoning models for creative writing a lot

unborn ocean Jul 10, 2025, 9:07 PM

#

and current base models are just too small to capture that effect, they seem knowledgeable, but have a tiny "brain" and thus little area to create weird ideas (is the way i think about it)

dawn wharf Jul 10, 2025, 9:07 PM

#

literally the only thing it helps in is keeping the narrative on track

#

it doesn't help anything else

ocean vortex Jul 10, 2025, 9:07 PM

#

unborn ocean and current base models are just too small to capture that effect, they seem knowl...

I think the issue here is more the model size than the reasoning itself though...

#

it's just that you started with a very poor model

#

and made it better

unborn ocean Jul 10, 2025, 9:09 PM

#

yes, but as a company you have a choice between TTS and size

#

so that is what i was trying to bring up

ocean vortex Jul 10, 2025, 9:10 PM

#

unborn ocean yes, but as a company you have a choice between TTS and size

Well I'm personally an advocate for both. I don't like o4-mini-high but there's also o3-high 😇

unborn ocean Jul 10, 2025, 9:10 PM

#

@torn mantle where did you go 😦

#

2+2=?

unborn ocean Jul 10, 2025, 9:11 PM

#

ocean vortex Well I'm personally an advocate for both. I don't like o4-mini-high but there's ...

ye me too

torn mantle Jul 10, 2025, 9:11 PM

#

oh im here

#

= 3

unborn ocean Jul 10, 2025, 9:12 PM

#

the most interesting thing about the TTS is the variability of it though
=> most interested in a combined o3 and o4-mini aka gpt5 (hopefully better)

torn mantle Jul 10, 2025, 9:13 PM

#

did anyone try heavy grok 4 for creative writing?

ocean vortex Jul 10, 2025, 9:13 PM

#

unborn ocean 2+2=?

^ I still have no clue if he was referring to discord server admission form. Where I had to add some question to be able to change the server to "apply to join" for more reach. So I used this exact question lmao

#

@torn mantle

torn mantle Jul 10, 2025, 9:14 PM

#

oh we are just joking about 2+2

unborn ocean Jul 10, 2025, 9:14 PM

#

ocean vortex ^ I still have no clue if he was referring to discord server admission form. Whe...

was just copying asura, idk

#

because the teacher

elder rapids Jul 10, 2025, 9:14 PM

#

torn mantle creativity: thinking outside the box and producing something improbable and unex...

wen creativity module

torn mantle Jul 10, 2025, 9:15 PM

#

elder rapids wen creativity module

didnt they say their multi-agent system proposes ideas, critiques, and refines outputs? then why is it still bad at creativity

ocean vortex Jul 10, 2025, 9:15 PM

#

Grok is so slow 😭

torn mantle Jul 10, 2025, 9:16 PM

#

elder rapids wen creativity module

answer me answer me answer me

#

why why why

unborn ocean Jul 10, 2025, 9:16 PM

#

torn mantle didnt they say their multi-agent system proposes ideas, critiques, and refines outputs? th...

maybe the paper was right about rl only getting knowledge out of the base model

torn mantle Jul 10, 2025, 9:16 PM

#

all they do is lie

unborn ocean Jul 10, 2025, 9:16 PM

#

no new reasoning traces => no new learning was the idea

torn mantle Jul 10, 2025, 9:17 PM

#

unborn ocean maybe the paper was right about rl only getting knowledge out of the base model

yea it will always be bound to its statistical patterns

ocean vortex Jul 10, 2025, 9:17 PM

#

If one of those attempts got into an infinite loop this is gonna be nearly an hour wait again

#

💀

unborn ocean Jul 10, 2025, 9:17 PM

#

they just pick better trace using rl

#

not sure though, there are a lot of papers that genuinely discuss doing SFT over RL algo, so they can learn (genuinely why tf is my spelling so bad on this keyboard, i want to bury myself)

#

slower, because fewer of the weights are affected, but imo rl is actually learning new stuff

meager harbor Jul 10, 2025, 9:18 PM

#

why can't models browse the internet on lm arena? it totally skews the results, they hallucinate like crazy

unborn ocean Jul 10, 2025, 9:18 PM

#

ocean vortex If one of those attempts got into an infinite loop this is gonna be nearly an hour wait ...

next compute bill for you won't just be the 3,5$

ocean vortex Jul 10, 2025, 9:19 PM

#

unborn ocean next compute bill for you won't just be the 3,5$

Grok4 Pro though 👀

#

well not quite, but this is better

#

I get to see ALL the responses

#

If I were to do the same by regenerating this would take 100 million hours

unborn ocean Jul 10, 2025, 9:22 PM

#

meager harbor why can't models browse the internet on lm arena? it totally skews the results, t...

you can use a separate search arena on the legacy site

#

otherwise they can't

ocean vortex Jul 10, 2025, 9:23 PM

#

ok FINALLY. Don't think a single one of those is correct lol

unborn ocean Jul 10, 2025, 9:24 PM

#

worth it to see the richest man on earth fail hard yet again

elder rapids Jul 10, 2025, 9:24 PM

#

man I can't wait for deepthink

#

ts gonna be so good

#

they're putting so much RL into it

#

😭🙏

tall summit Jul 10, 2025, 9:25 PM

#

so what are the grok 3 vs grok 4 benchmarks

meager harbor Jul 10, 2025, 9:25 PM

#

unborn ocean you can use a separate search arena on the legacy site

why do we have to use the legacy site, wasn't the new version supposed to be better? I don't understand lm arena choices sometimes

#

they're weird

tall summit Jul 10, 2025, 9:26 PM

#

meager harbor why do we have to use the legacy site, wasn't the new version supposed to be bett...

the new version is better but despite being out of beta it doesn't have all the features of legacy

unborn ocean Jul 10, 2025, 9:26 PM

#

elder rapids man I can't wait for deepthink

here is me hoping they do actually give me early access

dawn wharf Jul 10, 2025, 9:26 PM

#

elder rapids man I can't wait for deepthink

Reminder

Screenshot_2025-07-10-07-39-32-682_com.discord-edit.jpg

elder rapids Jul 10, 2025, 9:26 PM

#

unborn ocean here is me hoping they do actually give me early access

you'll be the one testing ts out for me