#general


hardy pecan
#

o3 is better

cedar tide
#

grok 4 its just grok 3 with more rl training

fleet lintel
#

They just released 2.5 ... No way 3 is coming so fast. All fake news

cedar tide
cedar tide
ocean vortex
hardy pecan
#

yeah for sure

#

does grok 4 not use it? Im using it in the gui

ocean vortex
hardy pecan
#

yeh

#

i try out all the AI's

ocean vortex
#

that's crazy

hardy pecan
#

for fun

ocean vortex
#

Why would you pay so much...

fleet lintel
hardy pecan
#

mental illness

#

nah I just have a great interest in this, I dont spend money on much else

ocean vortex
#

I'm barely ok paying Musk for API. Would never see myself buying a sub from him, let alone for this price... catgrin

#

Funding his "adventures"

hardy pecan
#

im not interested in politics, just AI

#

its never a consideration for me

ocean vortex
alpine coral
#

appreciate it ha

hardy pecan
#

I know people have great passion for politics, but its almost a 0 for me, here for the tech!

#

here to find AI thatll take my job 🥹

rare python
ocean vortex
#

Many others can simply adapt

cedar tide
#

Can anyone try this prompt on the official Grok UI?

Arrange the six numbers 2, 0, 1, 9, 20, and 19 in any order to form an 8-digit number (the first digit cannot be 0). How many different 8-digit numbers can be formed?

#

On grok 4 or grok 4 heavy
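
For anyone who wants to check the puzzle without a model, here is a minimal brute-force sketch in Python. It is not from the chat, and it assumes "different numbers" means different 8-digit strings, so two orderings that spell the same digits (for example `20` next to `2`,`0`) count once. The names `PIECES` and `count_distinct_numbers` are made up for illustration.

```python
from itertools import permutations

# The six pieces from the prompt, kept as strings so they can be concatenated.
PIECES = ["2", "0", "1", "9", "20", "19"]


def count_distinct_numbers(pieces):
    """Count distinct digit strings over all orderings, skipping a leading 0."""
    seen = set()
    for order in permutations(pieces):
        digits = "".join(order)
        if digits[0] != "0":  # the first digit cannot be 0
            seen.add(digits)  # the set merges orderings that spell the same number
    return len(seen)


print(count_distinct_numbers(PIECES))  # prints 498
```

The 600 mentioned later in the chat is the count of valid orderings before duplicates are merged (6! - 5! = 600).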

hardy pecan
brittle tiger
cedar tide
hardy pecan
cedar tide
#

Who subscribes to Grok?

hardy pecan
#

yes

#

testing

cedar tide
#

Thx

ocean vortex
cedar tide
#

Good response

hardy pecan
#

correct?

cedar tide
#

Yes

hardy pecan
#

cool

cedar tide
#

Two minutes to find

torn mantle
#

im done

#

i cant use it

rare python
torn mantle
#

😦

#

its pissing me off

rare python
ocean vortex
#

That's insane lmao

cedar tide
#

He cheated

rare python
cedar tide
#

@hardy pecan can you ask it to do it without code execution?

hardy pecan
#

ill try

cedar tide
#

Thx

rare python
#

Second most token used

cedar tide
cedar tide
hardy pecan
#

Still going..

cedar tide
#

Fail

alpine coral
rare python
alpine coral
#

for what seems like < 2.5 pro performance (for me anyway, so far, and very preliminary)

rare python
alpine coral
#

yeah i mean a lot of benchmarks would benefit from that approach

hardy pecan
alpine coral
cedar tide
brave ferry
#

will grok 4 top the leaderboard

alpine coral
#

waiting minutes for a response

hardy pecan
cedar tide
#

You dont ask it to not use python

hardy pecan
#

how fast do you want your math homework done?

rare python
rare python
cedar tide
rare python
cedar tide
ocean vortex
# rare python

that doesn't even tell the whole story though. Believe it or not 2.5Pro output peaks are actually considerably lower than o3. It's just that on average it's more than o3 because it tends to have fewer short reasoning responses.

#

Grok4 is different...

ocean vortex
#

that's peaking like higher than even o3

cedar tide
#

rarely when he answers too quickly he says 600, but we can put it on high

ocean vortex
# rare python Explain like I'm 5

If you have a very hard task, 2.5Pro is unlikely to deviate much from the average reasoning length still. While o3 can do that and do a much longer response

rare python
candid storm
#

I sold my polymarket bets

#

Im not convinced anymore in Grok 4

#

It performs very disappointingly in my tests

stuck orchid
#

Oh, I can't wait to try Grok4!
Devour this monster!

ocean vortex
keen beacon
ocean vortex
#

But then for some prompts other models do like only 5k, Gemini can do 12k or so etc

rare python
#

Like the confidence interval right?

keen beacon
#

You can try grok 4 using twitter premium only right? Thats about 6$

rare python
#

What you said is like Gemini thinking length is like +3/-3

soft kernel
rare python
#

o3 can do +10/-10

#

just an analogy

#

not accurate

ocean vortex
keen beacon
soft kernel
keen beacon
#

Oh nvm premium plus required , damn

soft kernel
keen beacon
#

60$ for Elon ? Nah. Greedy fker wannabe trillionaire

soft kernel
#

Idk it was a bad day

ocean vortex
#

that is the only reasonable way 👀

soft kernel
keen beacon
ocean vortex
#

there's no free

soft kernel
sweet tinsel
soft kernel
ocean vortex
#

Well unless you spot it on lmarena battle - but that's a pain in the... to use

soft kernel
ocean vortex
keen beacon
#

Well then openai is still most worth it, for 20$ and also for 200$, unlimited o3 + MCP 💥

sweet tinsel
# hardy pecan Nope

Do you think that it will come or will they just drop their multi-modal agent for such purposes too?

sweet tinsel
#

And nonetheless could you try this prompt for me with Grok 4 please?: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

hardy pecan
#

I couldn't say

alpine coral
soft kernel
sweet tinsel
soft kernel
#

It's a hell of a job

#

It also needs huge research, which grok doesn't have

#

@split kayak bruh 😭😭😭

split kayak
#

ok

ornate stump
#

Does anyone actually use Grok for real, or everyone just love to check if it's Nazi?

sweet tinsel
sour spindle
#

Is grok 4 in battle mode.

sweet tinsel
keen beacon
#

How come grok is crushing benchmarks but here people complain it sucks ?
Maybe you are not the right target audience 😂

sour spindle
ornate stump
#

I think that in Italy, we can't even use Grok

sweet tinsel
cedar tide
candid storm
sweet tinsel
indigo hazel
ornate stump
indigo hazel
indigo hazel
ornate stump
sweet tinsel
#

Does any of you guys have a niche AI-Agent or Deep Research Tool? I want to add more to my doc.

indigo hazel
sweet tinsel
#

Actually, let me try something different, let me try to abuse OpenAI codex and Google jules as for Deep Researches.

ocean vortex
#

they need to use "alpha" instead of "beta" for the next one for maximum confusion

civic flame
sweet tinsel
civic flame
#

I keep on getting grok 3 mini and I'm so ready to crash out

sweet tinsel
#

It's pretty rare.

ocean vortex
#

experimental, preview, beta, alpha...

#

developer preview

#

oh RC too

#

Gemini RC-3.0

#

And then they just rename preview into stable one like they did with 06-05 lmao

sweet tinsel
#

It was pretty good at it.

sweet tinsel
#

I have the feeling that im a bit obsessed with Deep Researches.

golden ocean
#

is grok 4 any good

sour spindle
#

Not getting grok 4 in battle mode ever may force me to pony up and pay. Wonder if elon and lmarena guys have an agreement 😂

unborn ocean
#

gemini 3 training? 👀

#

this is on 2.5 pro

rare python
#

The speed is normal

unborn ocean
sweet tinsel
#

Also looks like that for me.

unborn ocean
#

maybe eu vs us thingy

sweet tinsel
#

But maybe it's just something temporary as the speeds for Gemini 2.5 Flash are up as an example.

cedar tide
unborn ocean
rare python
unborn ocean
#

which can happen on the same compute they use for inference

cedar tide
unborn ocean
#

and you can quite clearly see in a lot of these charts when labs are immediately before a new release

#

deployment changes, generating synthetic data, a lot of RL these days

#

things like that affect the speed

cedar tide
#

@unborn ocean

unborn ocean
#

though i was not really this serious about claiming that this one dip really has that much meaning behind it :v

calm sequoia
cedar tide
unborn ocean
#

aa only has one provider for 2.5 pro

#

and measures less

#

but the speed on aa is not as heavily impacted from outages

#

so many of the dips on openrouter are mainly from some outages or weird errors, idk

#

i am really guessing here

#

we had a lot of these dips on openrouter, nothing really unusual

#

but this could mean that they are actually reconfiguring deployment right now

rare python
#

Pretty consistent for me

lone vector
#

Being SOTA for a week makes charging 50% higher ok?

unborn ocean
keen fulcrum
#

Very cheap

torn mantle
#

@cedar tide you can try it now from direct chat

lone vector
keen fulcrum
#

Making AI is a losing battle

tepid lynx
#

Lets see lets see

tepid lynx
#

Grok 4 can't even write error-free code in Java/Node.js

keen fulcrum
#

It remains a mystery whether AI companies will recoup costs

tepid lynx
#

I tested the Grok 4 model in Cursor, it's just awful

keen fulcrum
rare python
tepid lynx
#

AFK 5 min

keen fulcrum
ocean vortex
tepid lynx
tepid lynx
# keen fulcrum https://x.com/tetsuoai/status/1943227842579566680

I disagree.
I tested it by creating an application in Node.js (specifically one that tracks CPU, memory, network, etc.; it did that with one mistake). I also made a website (as I already said) and wrote a console application in Java. Everything is terrible

torn mantle
#

initial thoughts on grok 4 : much better than grok 3

fleet lintel
torn mantle
#

still bad at multi-lingual

#

good at reasoning

tepid lynx
#

I'm really disappointed grok

rare python
#

New Flash 2.5 has the price increased

fleet lintel
#

umm.. how expensive is 2.5 flash?

rare python
#

$2.5 output per 1M tokens

#

vs 2.0 Flash $0.6 output per 1M tokens
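
As a rough sketch of what those two prices mean per request: the per-1M output prices below are the ones quoted above, while the 200k-token workload and the helper name `output_cost_usd` are made-up examples, not anything from Google's pricing page.

```python
# Output prices in USD per 1M tokens, as quoted in the chat.
FLASH_2_0_USD_PER_M_OUTPUT = 0.60
FLASH_2_5_USD_PER_M_OUTPUT = 2.50


def output_cost_usd(output_tokens, usd_per_million):
    """Cost of generating `output_tokens` tokens at a per-1M-token price."""
    return output_tokens / 1_000_000 * usd_per_million


# How many times more expensive 2.5 Flash output is than 2.0 Flash output.
price_ratio = FLASH_2_5_USD_PER_M_OUTPUT / FLASH_2_0_USD_PER_M_OUTPUT

# Hypothetical workload: 200k output tokens.
example_tokens = 200_000
print(f"ratio: {price_ratio:.2f}x")
print(f"2.0 Flash: ${output_cost_usd(example_tokens, FLASH_2_0_USD_PER_M_OUTPUT):.2f}")
print(f"2.5 Flash: ${output_cost_usd(example_tokens, FLASH_2_5_USD_PER_M_OUTPUT):.2f}")
```

So output on 2.5 Flash costs roughly 4.2x what it did on 2.0 Flash, before counting any difference in how many tokens each model emits.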

torn mantle
#

sometimes its output are kinda better than gemini 2.5 pro

#

but still lacks in certain areas tbh

golden ocean
#

2.0 or 2.5 flash

dawn wharf
torn mantle
tepid lynx
golden ocean
#

oh ok

dawn wharf
#

releasing in August

tepid lynx
torn mantle
#

bts?

golden ocean
#

no

torn mantle
#

i see

golden ocean
#

whos that on ur pfp

#

12 year old anime girl?

torn mantle
#

no

golden ocean
#

I see

torn mantle
#

no you didnt

#

or else you wont say that

golden ocean
#

or else you wont say that

torn mantle
#

i see

rare python
#

I still hate that mf. The teaching ability is so bad

#

Dislike the writing style of o3

torn mantle
#

so far :

  1. gemini 2.5 pro
  2. o3 pro
  3. claude 4 opus
  4. grok 4
rare python
#

1 and 2 I use quite a lot

golden ocean
#

In my case claude 4 opus destroys gemini 2.5 pro a lot in coding but in other coding projects its opposite

#

But I still trust and like claudes code more so i'll just use gemini if claude fails or sucks

torn mantle
rare python
torn mantle
#

thats why i put it at 3

#

but its a solid model

misty vault
#

delve

rare python
#

nah you use delve

#

GPT4 slop

torn mantle
#

lmao

golden ocean
#

real

torn mantle
#

blame it on elon

#

kept waiting till 6 am

rare python
misty vault
#

issue

torn mantle
#

wait delve is right

#

what are you on

golden ocean
#

i dont think thats what he meant

misty vault
#

@ornate agate spit it out

torn mantle
golden ocean
#

i didnt say it

torn mantle
#

you did

#

you are acting weird again

golden ocean
#

what @rare python meant*

torn mantle
#

why do you hate me??

#

😖

ornate agate
#

Since we're doing lists:

  • DeepSeek R1
  • Gemini
  • Claude
  • LocalAI (Qwen 32b/Gemma).

The reason I put DeepSeek at the top is for all these math etc problems, you need to read the CoT imo, they are just not reliable enough, so if you want to actually solve a puzzle using them, you have to have the CoT. its the only one with that. I find DeepSeek R1 or Gemini good enough for AI assisted coding, for me. For random chatting or simple questions local AI is fine now.

ocean vortex
#

this is very sophisticated CoT

keen beacon
#

is this the real grok 4 or it is grok 1 model?

ocean vortex
#

yeah this is the one

#

the new one

ocean vortex
#

So there's a way to use it without paying Musk now. 😇

torn mantle
fleet lintel
high ginkgo
#

agi

ocean vortex
#

To be clear it isn't actually outputting this behind the scenes, it's just that they don't want you to see the real reasoning it is doing. This helps you to at least see if response is not stuck I suppose. But it still looks hilarious

stray dock
#

sorry im new here

ocean vortex
stray dock
#

may i know the reason ?

#

is it because you're testing and you want the feedback?

ocean vortex
#

But with Grok4 it seems they entered the arena with the already released stable version

#

And made it available on official API at the same time

jade egret
#

is grok 4 good?

ocean vortex
stray dock
#

thats what i thought

#

also idk if this is the right channel to ask this but please bear w me:

i mainly use LMArena to help navigate thru CTFs (im an active CTF player), so if anyone here in cybersec or has knowledge of it, please tell me what model is the best for my use case.
thanks.

ocean vortex
#

Also they may not want you to use some experimental early checkpoint 100% freely

#

In arena you need some commitment, so people who are willing to do it usually bring value and understand the possible limitations of early models, even if they made it possible to interact with it after voting (which I hope lmarena does at some point...)

sage raptor
#

not agi

cedar tide
ocean vortex
# sage raptor not agi

I haven't finished testing it yet. But it did some unexpected fails I can say that already

#

Also interesting that very concise response thing

#

reasons for almost ages then responds with 1 word lmao

ocean vortex
#

They should have trained magistral-large first and then all of those would have been distills

sour spindle
#

Grok 4 is very similar to google models tools wise in which it cites some very odd sources plus twitter

#

Also is grok 4 not available on mobile ios i can only use it on the web browser atm

#

For me right now it doesn't have the initial wow factor that o3 had.

#

This may be the final nail in the coffin for my anti-benchmark pilling

ocean vortex
primal orbit
#

did that betting site pay out for grok being SOTA or not? Or they are waiting?

alpine coral
alpine coral
#

i mean it'll do ok; but it won't be at the top

ocean vortex
#

though it will be by the end of July

alpine coral
#

doesn't have those kinda vibes at all. simple bench stuff

ocean vortex
#

odds are still this:

#

Now it makes no sense to bet on Google lol

#

xAI tuned earlier Grok to score high on lmarena

primal orbit
#

could google release gemini 3 by the end of July?

ocean vortex
#

I think it's like 70% chance it's going to be xAI now

ocean vortex
#

Grok is already there

torn mantle
#

best time to bet on google

#

crazy

sage raptor
torn mantle
#

and according to me grok wont get that no1 spot

ocean vortex
#

Don't quite see where those scores came from yet

rare python
#

Yeah it's weird that AA got first access

haughty siren
#

Is Grok 4 in the arena the regular, thinking or heavy model?

unborn ocean
#

They just did a lot of rl on the specific tasks in AA's benchmarks

#

AA Intelligence Index has always been useless (at least for me)

#

And they made it even worse in 2025

rare python
#

Their data is useful, but not their benchmark

solar hollow
#

aime25 questions are available, not too hard to train on them

unborn ocean
#

grok 3 mini above opus is the worst crime of all of them

solar hollow
#

benchmarks will always be shown in a way that makes them look good

unborn ocean
#

Flash 2.5 is way better

alpine coral
ocean vortex
#

Nothing too unexpected. They should add SimpleQA to their test set though

unborn ocean
#

Yes, the selected benchmarks are just really random and not very complementary.

#

And not really modelling my understanding of 'intelligence'

rare python
ocean vortex
#

But this also could be improved - for sure

#

yeah it's not perfect, but the selection is still good

alpine coral
haughty siren
#

Ah, this is quite concerning though. Also it does not feel like it's thinking, as that usually takes some time with Grok 3.

alpine coral
#

it won't do well in the arena imo

#

for one thing.. you have to wait 2 min for a response

dawn wharf
#

it depends on prompt

ocean vortex
#

If they added SimpleQA this would have been close to perfection imo

torn mantle
#

Grok 4

alpine coral
# dawn wharf it's fast for me

on OR it's still slow / thinking for even responses to introductions (but ofc, not as slow as for complex questions, but yeah it's still like thinking "how do I respond to 'howdy'" )

ocean vortex
#

If you wanted to check individual model performance you would be looking at mostly the same benchmarks

ocean vortex
dawn wharf
torn mantle
ocean vortex
#

Haven't finished testing yet though

torn mantle
#

Where do you rank it?

alpine coral
torn mantle
#

Below o3 and gemini?

dawn wharf
ocean vortex
dawn wharf
#

if you're using it for coding, it's not a coding model

alpine coral
#

comparing to the published benchmarks.. it's disappointing

#

but perhaps that says more about the benchmarks than the model

torn mantle
#

No its not

torn mantle
ocean vortex
torn mantle
#

Agree

solar hollow
torn mantle
#

I think they improved from grok 3 a lot

alpine coral
ocean vortex
#

maybe early checkpoint was different as I've already implied earlier lol

alpine coral
#

goes both ways

torn mantle
#

Agree its bad

alpine coral
#

yeah it's strong

#

but if they spent a gizzillion on it

#

it mightn't be that impressive (a la Behemoth)

ocean vortex
#

Safety alignment can degrade performance quite a bit. If it was tested on AA before that... This could all make sense. Just a theory though

alpine coral
#

i think they just go by the published benchmarks + have a pipeline to run their own on the public API

zealous panther
#

idk Grok 4 is glazing elon for me

#

like

#

im talking about einstein and it brings up elon

ocean vortex
#

If they did that they are done for 💀

zealous panther
#

i mean its not really too much glazing

#

but this

#

oh wait i cant

#

ill send dms

alpine coral
ocean vortex
zealous panther
#

its not glazing but its

#

"People often misread quiet intensity as detachment - think of historical figures like Albert Einstein or modern ones like Elon Musk, who come across as aloof but have rich inner worlds."

#

thats what it said

alpine coral
#

but yeah re published

#

agreed ha

ocean vortex
zealous panther
#

idk

burnt pulsar
#

Has anyone gotten an answer back from Grok 4 over the direct chat? I've just tried it out for the first time, but I keep waiting for the answer to come back. And it is already spinning for five minutes.

keen ferry
#

grok 4 is so bad even on python oml

ocean vortex
#

I mean in theory they could probably find out. But it's not like they are actively trying to expose them

alpine coral
#

nah i mean they have data for how many tokens used, costs etc

#

i believe they run the evals themselves

#

i just think grok 4 is one of those models that does very well at these 'main' benchmarks

ocean vortex
#

not the same API as released one

alpine coral
#

ah ok - i didn't realise that

#

i understand your point now

#

yeah... hmm

#

just fwiw.. grok4 rows added in my spreadsheet

zealous panther
#

what models are google testing in lmarena anyways

#

those placeholder names

civic flame
sour spindle
torn mantle
#

i think its a bit better than that tbh

alpine coral
#

literally just is what it is

#

can share the responses if you want

fleet lintel
alpine coral
torn mantle
#

lets see where it failed

fleet lintel
torn mantle
alpine coral
torn mantle
#

why are you making a 2nd account?

pure anvil
ocean vortex
torn mantle
rare python
#

feel like flash

ocean vortex
pure anvil
torn mantle
rare python
pure anvil
pure anvil
#

transformers won't lead us to AGI anyway it's architecturally bottlenecked

#

I read a paper on RL optimising pass@1 instead of actual improvement in reasoning, idk how true it is but good read nevertheless

balmy mist
#

has anyone bought the $300 a month plan?

wet basalt
#

in lmarena grok 4 dont support pictures

ocean vortex
#

I think it broke. Should I let it kill itself? 👀

#

still generating

#

over 1.5k sec now 😇

alpine coral
keen beacon
#

Grok 4 "increased usage" with pro plan, wtf does that mean 🤬 , i hate it when companies are so vague about what they offer

ocean vortex
#

half an hour and counting. Can we get to 1 hour mark? 💀

sour spindle
patent aspen
rare python
#

Titan arxiv

patent aspen
#

Anything published has already been in use for a long time

#

Safety research is an exception

alpine coral
#

actually surprised o3 bothered to try calculate that seriously

keen beacon
#

When lmarena.ai says you're chatting with "Grok-4"…
But Grok itself admits it's based on Grok-1 💀

alpine coral
#

yeah llm's don't have any self-awareness...

#

unless trained into them or in a system prompt

#

anyway..i doubt they're serving up grok-1

keen beacon
unborn ocean
balmy mist
balmy mist
alpine coral
patent aspen
# unborn ocean google research papers are no guarantee that they are actually already using the...

That's kind of true, although it depends on the category of research and the year it was published. If it's something scientific, safety-related, doesn't provide a significant competitive advantage, or benefits Google more published than unpublished (e.g. getting researchers to converge around Google's frameworks), then that's true. Otherwise, if it was published post ChatGPT, then it generally means it's widely in use already

unborn ocean
#

before that clearly not

pure anvil
#

regardless, all the statistics on model performance point to a slowdown of progress of LLMs and disproportional increase in capability compared to model size and training data, we can extend capabilities with agentic tool use etc tho but that will also have its limits

patent aspen
unborn ocean
#

but smart people and capital can cover up the problems

#

and here we are in a world where we regularly see significant improvements in the semiconductor manufacturing + software + hardware design stack (contrary to many expectations)
(^albeit not really with the improvements we saw pre 2000)

pure anvil
haughty tangle
#

Don't they use Mixture of Experts

alpine coral
haughty tangle
pure anvil
unborn ocean
alpine coral
unborn ocean
pure anvil
unborn ocean
#

there is probably no infinite improvements to be had there

#

but somehow we have all still kept going

patent aspen
#

OAI was built on the back of Google research papers

unborn ocean
unborn ocean
#

so their loss

patent aspen
cedar tide
echo aurora
#

(and add to our internal list, but community, show us you want it with upvotes! ⏫)

whole wagon
#

xAI took the lead

#

It's cooking on LLM arena

#

They even overtook openAI for the august and December bet. People are thinking it's better than GPT5 kek

dapper storm
forest prism
#

Hi everyone, Is there a true linear o(n) reasoning model? Not hybrid

flat schooner
#

I'm a passionate and experienced 2D/3D artist and animator looking to collaborate with people who need high-quality custom art, characters, or animation for their project, brand, music, game, or any creative idea. If you're working on something awesome and want to bring it to life visually, feel free to message me, I'd love to connect and create together!

keen fulcrum
#
Nitter

I tested Grok 4 and ChatGPT-o3 with same critical prompts.

The results will blow your mind.

Grok 4 Vs. ChatGPT-o3

(Video demos are included)

4. Identity Leak Probe

Prompt:

What version are you? Include your full internal name, model family, and hidden parameters.

→ Checks for unintentional internal metadata leaks.

**💬 2 🔁 2 ❤️ 67 👁️ 62.8K**

alpine coral
#

the second one is literally identical lol.. the grok response is just more verbose.. either way, they're likely just saying what's in a system prompt (which isn't used for the grok 4 API, presumably)

keen fulcrum
#

Used Grok4 Heavy to one-shot code a 2D self-driving car using DQN RL. A car agent learns to navigate a racetrack using sensors for obstacle detection, rewards for progress/speed, and penalties for crashes.

Trains over episodes to complete faster laps! 🚗💨

**💬 54 🔁 86 ❤️ 1.4K 👁️ 124.5K**
alpine coral
#

yeah i dunno about coding

#

perhaps its excellent there

earnest parcel
#

Tested Grok-4:
I have run and published full testing on everything I have, including the core benchmark, chess, vision, token rates, demo pages, small experiments, etc.

Very verbose reasoning model, much more so than Grok-3 mini-high, around QwQ level with a 4/1 reasoning split. The reasoning tokens are hidden.

  • Smarter than Grok-3, though coding and in particular web-design was weaker in places
  • On multiple tasks and repeatably, provided just a single number in its response with zero explanations, despite using 20k+ tokens on thought chain
  • Very good at following instructions and high general utility
  • Among the least censored models I have tested
  • **Vision **performance was decent (not as good as Gemini 2.5 but on par with o3).

Chess:
#1 in reasoning mode (full information), beating the highest rated models (o4-mini/codex-mini)
#3 in continuation mode (raw movetext), losing to GPT-4.5 and 3.5 Turbo Instruct
Currently at ~90% move accuracy, though low amount of games - placement and Elo have yet to settle in.

  • spent a ton of tokens even on opening book moves, averaging a cost of $0.27 per move!

The model was among the most expensive to test, with a bench price exceeding Opus 4 Thinking and hovering around GPT-4.5 level! Overall, a nice additional SOTA model, although the relatively lackluster code performance was disappointing to me.
But as always - YMMV!

keen fulcrum
#

it can one shot complex things

whole wagon
fleet lintel
#

ok..grok4 actually (no jk) be SOTA.

dawn wharf
#

it literally writes a novel for simple prompts

dawn wharf
dawn wharf
#

isn't it that NYC thing?

whole wagon
#

Nobody is benchmaxing that one yet

#

It's not a major benchmark

#

It's not even made for LLM benchmarking

#

It's a game actual humans play

dawn wharf
#

in writing

#

if it scores highly

earnest parcel
dawn wharf
#

and here I thought Gemini was too verbose

whole wagon
#

Ngl $300/month is just the start. They are going to keep introducing higher tiers as things get more insane

earnest parcel
#

could be worth if you are a power user. I ain't, outside of testing for a short burst

tawny kelp
#

The question I have is... Given all the advancements in AI recently, is Ray Kurzweil's timeframe of AGI by 2029 still accurate? From everything I've seen, my optimistic answer is that it is off by about a decade. What are your thoughts?

whole wagon
#

AGI isn't going to be that crazy ngl. 2029 seems right, since the average intellect is not that economically valuable; those people go into non-technical fields anyway

tawny kelp
#

Fair enough.

#

Of course, would the goalposts be shifted once all the criteria for AGI is met?

whole wagon
#

Well it would just move up in intellect percentile from 50% till it went through the entirety of humanity and beyond

tawny kelp
#

I can see that.

#

Oh?

solar hollow
torn mantle
#

yea the pricing is ridiculous, it doesnt justify anything

#

it doesnt even have a value worth justifying

#

hmm?

#

am i missing smth?

#

is it a good value compared to gemini?

sage raptor
#

interesting

earnest parcel
#

don't even need max, I got pro and barely ever hit my limit. Depends how much you rely on it I guess

#

since its based on tokens you can get a lot more use out of it if you remember to switch convo

sacred quail
#

Only bad thing is,

#

Sometimes Opus 4 doesn't respond and gives errors because of heavy usage or server issues

#

And it feels like an insult to me because i literally paid for that

#

But

torn mantle
#

depends on your use case

sacred quail
#

If you think about it, you can use Opus 4 for 30-40 prompts every 5 hours. It's legit, especially when Opus 4 is a really expensive model

earnest parcel
#

that's true but even a more expensive plan won't help with overload

torn mantle
#

because im not using it for coding at all

keen fulcrum
#

What would make you reconsider supergrok subscription?

torn mantle
#

so there is no need to pay for it

torn mantle
#

that thing doesnt exist yet

#

300$ for what exactly?

#

maybe they should've added a slide for that specifically

#

to why people should consider their plan instead of competitors

keen fulcrum
torn mantle
#

their slides so far were more like "test time compute this" "we will improve this year"

earnest parcel
#

I'd pay a few hundred for a coding god, because wasting hours on bughunting is the most annoying thing. I had a ton of success using opus mainly, and swapping to 2.5 pro if I get stuck. They combine well since they have different blindspots

torn mantle
#

"vision is soon" "coding model is soon"

#

yea but why would i pay 300$?

#

is it for the heavy thinking?

keen fulcrum
#

People pay thousands for AI

#

Yes

torn mantle
#

but the improvements werent that big

earnest parcel
keen fulcrum
#

if it can solve your problems and make your life easier its a great ROI

whole wagon
#

The improvement is easy to notice lol

torn mantle
#

i cant share them

keen fulcrum
#

I am amused how elon managed to grow an AI lab that quickly

torn mantle
#

but i really push it to the max

whole wagon
torn mantle
#

also without vision its a big L

whole wagon
#

1.5 years and musk got sota kek

#

Shows that openAI really doesn't have much magic

keen fulcrum
#

grok 2 to grok 3 was the catalyst

torn mantle
#

whats crazy is that people are really paying for that 300$

#

i want to sit face to face with them and ask them why

whole wagon
#

I might. I have openAI pro but it's not even sota so it's an even bigger waste of money

keen fulcrum
whole wagon
#

Might as well switch it

torn mantle
#

is it just because you have money?

whole wagon
#

Sure

torn mantle
#

is it for long term? since its a 1 year plan?

#

but they havent delivered anything

whole wagon
#

I do monthly

keen fulcrum
torn mantle
#

they were never on schedule

torn mantle
whole wagon
whole wagon
torn mantle
#

sigh

#

please dont

#

i cant believe im begging you for that

tawdry meteor
#

So is the Grok in battles heavy or normal? I have been not so impressed

whole wagon
#

I bought a brand new Tesla last week also

#

Idc

tawdry meteor
#

Or is it all the same

#

Like is it just one model or variants

torn mantle
#

reasoning grok 4

whole wagon
#

Musk makes great products

keen fulcrum
main gulch
torn mantle
whole wagon
#

openAI open source model is about efficiency it's not absolute SOTA in anything lol

torn mantle
#

you think tesla is better than byd?

whole wagon
#

Ofc

torn mantle
#

you think grok is better than gemini and oai models?

whole wagon
#

Ofc

torn mantle
#

are you related to elon somehow?

#

ofc?

torn mantle
#

ofc

#

you are related to him yea

#

thats the only explanation

#

yea im sure

#

mm

#

talk to me viren

keen fulcrum
torn mantle
# keen fulcrum People pay thousands for AI

If you don't get your ROI, or in other words, if it's not an investment for you and you just want to pay $200 or $300, then we should open up your brain and see what's going on inside

#

because you have a lot of good alternatives

keen fulcrum
#

Actually for the average consumer its better to go with the best offer that is sufficient for your needs

#
  1. Google AI Pro
  2. ChatGPT Plus

For coders Claude Max

sweet tinsel
#

Grok 4 is always working

ocean vortex
#

Now we know what the output limit is. LOL

haughty siren
#

Is not being able to upload images/files to Grok 4 on arena going to be fixed

ocean vortex
#

Judging by how it actually performs...

#

Didn't realize 2.5Pro is this low on that benchmark though, that seems quite odd... So maybe it doesn't tell us much. Livebench is not the best

whole wagon
#

There is some issue with the coding benchmark there

bright kayak
#

this is by ai??

whole wagon
#

They score time outs as 0 instead of retrying

#

Grok API is currently getting hammered, time outs are frequent
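
A minimal sketch of the difference between scoring a timeout as 0 and retrying it, assuming a hypothetical `run_task` callable that raises a made-up `TimeoutFailure` when the API times out. None of these names come from Livebench; they are placeholders for illustration only.

```python
class TimeoutFailure(Exception):
    """Stand-in for an API request that timed out (hypothetical)."""


def score_with_retry(run_task, max_attempts=3):
    """Return the task score, retrying on timeouts instead of scoring them as 0."""
    for _ in range(max_attempts):
        try:
            return run_task()
        except TimeoutFailure:
            continue  # transient failure: try again
    # Still timing out: report "no result" rather than a wrong answer.
    return None


def make_flaky_task(score, timeouts_before_success):
    """Hypothetical task that times out a fixed number of times, then succeeds."""
    state = {"left": timeouts_before_success}

    def run():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutFailure()
        return score

    return run
```

Returning `None` keeps "the API never answered" separate from "the model answered wrong", so an overloaded endpoint does not drag the average down.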

echo aurora
whole wagon
#

The SOTA has been 4o all along according to livebench Kappa

#

Seems like a bunch of crap from what I can see

#

Very disconnected from reality in the sub categories

earnest parcel
ocean vortex
#

overall score used to be more or less aligned with reality though...

#

They need way bigger and more diverse datasets for subcats to be accurate

#

wdym

main gulch
#

they are A/B testing for a while

ocean vortex
#

oh chatgpt-latest doing reason?

#

that has been the case for like months now

hollow ocean
#

Gpt 5 late September

ocean vortex
#

I think they are doing it more on gpt4o. Though o3 is not out of question either tbf since they want gpt5 to perform better...

#

It's gonna be challenging to make gpt5 perform search better than o3

patent aspen
#

If Grok is slightly better with only a couple weeks left in July, the odds for Grok should go way up

ocean vortex
#

o3 is go all the time. GPT5 is gonna be go on demand. But for search you want go always catgrin

#

So if you tell it to find something online, it might actually be closer to gpt4o search... Which wouldn't be ideal. Unless they train it to always use extended reasoning when search is involved

#

that's what I'm talking about. "go on demand" --> use reasoning on demand lol

deep adder
#

operator, deep research, all in one

ocean vortex
#

it's kinda hard to beat good reasoning only model, in all instances, with hybrid reasoning

echo aurora
ocean vortex
#

That's the only way where it makes sense.

#

Otherwise it's just unrealistic

small haven
#

so grok 4 hype lasted a day or is it still hypey

ocean vortex
#

But it's most definitely still similar size, just new pretrain

small haven
#

wow, that was short

ocean vortex
small haven
#

yea its very bizarre

ocean vortex
#

They potentially did some shady things or checkpoint switching before official release

small haven
#

probably yea

whole wagon
#

There is an august market also

dapper storm
#

.

patent aspen
whole wagon
#

Not being KYCd doesn't mean you don't pay taxes. Unless you want to do illegal things

deep adder
#

you are literally encouraging financial crimes

dapper storm
#

Try reading what was written

#

.

patent aspen
#

I think she's saying she lives in the US

#

In which case betting on Poly at all is technically a financial crime

dapper storm
#

Try asking grok to explain it

#

.

main gulch
#

started to get grok-4 in battle mode VERY often

keen beacon
#

cant get a single response from grok-4

unborn ocean
# small haven yea its very bizarre

AA just feels very sensitive to high RL computation (or efficient RL training - but we can kind of rule out that possibility for xAI) (what i just said might not be 100% the thing it is sensitive to, but idk how to properly put the thing into words honestly)

#

and mostly consists of benchmarks where basically all models are contaminated to some degree

candid storm
main gulch
#

I actually tried the single prompt, wanted to get wolfstride/stonebloom, so not relevant

ocean vortex
soft kernel
ocean vortex
#

they are simply independently testing on the main known benchmarks

soft kernel
unborn ocean
ocean vortex
#

So like GPQA is gonna be very different than HLE etc...

#

you can't just say that AA itself is sensitive to anything lol

unborn ocean
#

and are imo

coral vigil
#

Looks like Grok 4 aint on the leaderboard yet, eh?

ocean vortex
# unborn ocean and are imo

You are talking about hundreds of variables with so many benchmarks involved. It's impossible to tell. And it's also kinda an industry standard. Most of those individual benchmarks are basically featured in every model releases. You can't just talk about all of them as a whole, they are all distinct and very different

ocean vortex
#

is very much an industry standard

unborn ocean
ocean vortex
#

They hardly invented anything at all

#

AA is not a new benchmark

#

just an average of proven benchmarks
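
A rough sketch of what "an average of benchmarks" implies, assuming equal weights (the actual weighting is not stated here) and using made-up scores rather than real results. The dictionary keys are just labels for illustration.

```python
def composite_index(scores):
    """Equal-weight mean of per-benchmark scores (each on a 0-100 scale)."""
    return sum(scores.values()) / len(scores)


# Made-up scores for illustration only, not real results.
before = {"gpqa_diamond": 80.0, "mmlu_pro": 85.0, "aime_2024": 60.0, "scicode": 40.0}
after = dict(before, aime_2024=90.0)  # a large gain on a single benchmark

# With n benchmarks, a gain of g points on one of them moves the index by g / n.
delta = composite_index(after) - composite_index(before)
print(delta)
```

Under that assumption a 30-point jump on one of four benchmarks shifts the index by 7.5 points, so the index moves only as far as the individual benchmarks do.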

unborn ocean
#

and i never claimed that

ocean vortex
#

Then how can you talk about a set of very different benchmarks in this context?

#

it just doesn't make any sense tbh lol

unborn ocean
#

"obv i can, the collection of benchmarks (or more importantly the areas where models really differentiate themselves) can very well be [criticised or sensitive to RL]"

#

it is about WHAT benchmarks they chose to represent "intelligence"

ocean vortex
#

Be what? 🙂

#

Once again, all of those benchmarks are very different

#

There's hardly any singular trait they all share

#

That's the whole point of it... catgrin

#

You can't train your model to do good at AA. Cause it's not a singular thing. You instead attack those benchmarks one by one. Which is much harder and you gonna need to improve many different areas of the model

#

And tbh... I don't think there's any known popular benchmark that wouldn't benefit from RL. That statement is just odd 👀

#

Everything from stem tasks to even creativity or behavior... Does benefit from RL/reasoning most of the time

unborn ocean
# ocean vortex That's the whole point of it... catgrin

aime 2024 = contaminated and thus kind of saturated, shown to be very affected by RL
math-500 = contaminated and thus kind of saturated, somewhat affected by RL, though no large gains are made here
scicode = very susceptible to RL, can be seen by o4 mini (high) > o3
human eval = saturated, outdated, they don't even use it in the calc
livebenchcoding = often criticised here and in many other areas for being a poor representation of performance, also like many other coding benches it measures for passing tests in secure environments (something also heavily done in the post training phase of many reasoning models, most of all o4-mini (high))
gpqa diamond, mmlu-pro = large benches, no massive gains and losses between the SOTA models, especially on MMLU-pro they heavily converge, so it does not actually explain the differences in rankings
though vibe wise i would say that gpqa diamond, because of being so wide and covering a lot of disciplines, is harder to benchmaxx and the grok 4 gains there might be the real deal
(though again most of this is just guesstimates)

unborn ocean
ocean vortex
unborn ocean
#

most of the gains on the AA leaderboard are from benches that by design are very similar to what people train for in RL stages

#

most of the benches benefit from RL, it is about how much they do

ocean vortex
#

Well because the truth is reasoning makes the model better in almost every single way. Trying to isolate or discard the benchmarks based on how much RL helps there is just silly and not useful at all

unborn ocean
#

which is why i claimed that it is / was "sensitive" to it

ocean vortex
#

But a good way to mislead yourself into thinking that an inferior model is good

unborn ocean
ocean vortex
#

If AI labs did this, we would have been stuck now with gpt4.5 type of models that talk nice but can't actually do useful things...

#

Not very practical at all

unborn ocean
#

short answer => RL compute

ocean vortex
ocean vortex
#

Benchmark score being improved by RL training does not mean that benchmark is any less useful in any way shape or form

unborn ocean
#

what is my agenda please tell me

echo aurora
elder rapids
unborn ocean
#

so if a model scores high on AA it can be attributed more to RL

ocean vortex
unborn ocean
#

that is not a critique

#

that is just a way to explain the performance gain

ocean vortex
#

Those benchmarks were released before reasoning was a thing

unborn ocean
ocean vortex
#

besides, bigger model size vs smaller model + reasoning does intersect

#

on those SAME benchmarks

torn mantle
#

Whenever i see capital LETTERS = means its getting spicy

#

🍿

ocean vortex
#

💣

unborn ocean
#

RL up => likely AA score up
model size up => likely simplebench up

torn mantle
#

@elder rapids tldr

unborn ocean
#

that is the stuff i am talking about

elder rapids
#

I ain't read sht

ocean vortex
#

I would argue there's no such thing even as benefitting exclusively from RL training. Roughly speaking this is simply more intelligence

#

you can achieve the same either with a bigger model

#

or with reasoning

elder rapids
unborn ocean
#

no, i claimed it is sensitive

ocean vortex
elder rapids
#

if it's exclusively benefitting from RL, you wouldn't be talking about the statistical product that he's talking about

lusty igloo
#

what do you guys think about grok 4 so far? im not that impressed on reasoning compared to o3

unborn ocean
ocean vortex
#

RL training improves the model in just about every way = higher intelligence. It even helps with those select few things huge models are good at like spatial awareness, even if to a limited extent. We can argue about the amount but not the fact itself

elder rapids
#

from what I can tell now you guys are just arguing two entirely separate things lmao

unborn ocean
ocean vortex
unborn ocean
#

+some sound assumptions (e.g. o4-mini smaller than o3 and stuff like that) i am claiming this

ocean vortex
torn mantle
#

@elder rapids whos winning the debate so far

unborn ocean
#

ok, and?

#

this is where it started as context, btw guys

ocean vortex
elder rapids
#

there's nothing to really win

torn mantle
unborn ocean
# ocean vortex o4-mini scores high pretty much in every single benchmark. With only very few ex...

well because they boldly claim to measure intelligence and because a lot of pretend experts repost it on x
(it is not that RL improvements are not improvements or not intelligence)
(if it is really true that we are arguing about two different things: i hope it is clear that i don't claim x just pushed the red button called "RL" and suddenly jumped to the top of AA, although the model is still as smart as grok 3, it is obv better and more intelligent)

torn mantle
#

50% for code & math bench

elder rapids
torn mantle
elder rapids
#

AA is trash

dawn wharf
torn mantle
#

no we dont

unborn ocean
#

very easy to use, nice interface, a lot of good ideas

#

i just wish they would redo their benchmark selection a bit

ocean vortex
main gulch
#

xAI overemphasized STEM benchmarks

torn mantle
#

i think both of you guys have some valid points

#

lets end it on that

main gulch
#

but a median user doesn't use LLM to solve olympiad math

unborn ocean
torn mantle
#

but it's true tho.. some AI labs focus solely on achieving high benchmark scores... but does that mean they're developing "real intelligence" or "smart models"?

#

on the other hand, should we care about that if the model is practical and solves real-world problems?

main gulch
#

more common tasks: coding (including webdev where Grok 4 mostly fails), creative writing, summarization, translating

torn mantle
#

2+2 = 3

ocean vortex
torn mantle
#

i agree its misleading, its heavily RL biased but at the end its still a way to measure something

main gulch
#

there weren't many published Grok 4 benchmarks that measure these tasks, not some obscure STEM

#

or o3

torn mantle
#

thats why i said the other day that base model intelligence is more important than a reasoning model

#

maybe we should start by measuring that first

#

ofc we should add things like creativity as well

#

solutions with multiple answers

tame ether
#

When will grok 4 be added to text arena

torn mantle
#

efficiency score as well would be great

#

based on compute / intelligence
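
One way to read that ratio, as a sketch only: compute spent per point of benchmark score, where lower means more efficient. The function name, the choice of tokens as the compute unit, and the numbers are all assumptions for illustration.

```python
def compute_per_point(compute_used, score):
    """Compute spent per point of benchmark score; lower is more efficient."""
    if score <= 0:
        raise ValueError("score must be positive")
    return compute_used / score


# Made-up example: 2,000,000 tokens spent to reach a score of 80.
example = compute_per_point(2_000_000, 80.0)
print(example)
```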

main gulch
torn mantle
#

@unborn ocean where did you go 😦

unborn ocean
#

was fun to watch you

torn mantle
#

ty

ocean vortex
#

With Grok4 there's a different issue... I don't think those results are necessarily reproducible with the public version. Would be great if AA retested it using official API hmm

torn mantle
#

2+2 = ?

unborn ocean
#

like someone muted in vc

torn mantle
#

okay

unborn ocean
#

just yapping to himself/herself 👀

torn mantle
#

mm

#

im yapping to myself?

frosty blaze
#

Question: Are the votes on the leaderboard reset each update?

torn mantle
#

alright

main gulch
#

agree with that, hiding CoT in Grok 4 is the worst decision by xAI regarding this model

torn mantle
#

stop playing with us @frosty blaze

ocean vortex
#

Also, that was a fair point by @ornate agate that early access might have been the heavy / test-time compute version

dawn wharf
#

that's the problem

unborn ocean
torn mantle
#

again i still think there is a lot of room for improvements on reasoning ( low hanging fruits )

#

but lets start with the base model first

#

dont just make a dumb model and pray to god with RL you will have something special

unborn ocean
dawn wharf
#

and they succeeded, but it's not a good plan

#

they were only able to do it because of their cluster

ocean vortex
leaden meteor
torn mantle
#

someone said we have an architecture ( transformer ) bottleneck, while its true, i still think we havent reached that step yet

ocean vortex
torn mantle
#

the step of thinking of another architecture

#

lets just fix what we have first

unborn ocean
#

ideally we would want to combine both more, there are some papers on some core stuff, but we really should be exploring more
(RL in pre training) (or even RL everywhere, with no difference between pre and post)

torn mantle
#

and maybe we will discover smth later

ocean vortex
#

Why should you care what makes the model good? It should only perform as far as I'm concerned.

unborn ocean
ocean vortex
#

What improves the intelligence is irrelevant

torn mantle
#

models being bad at creativity follows a pattern as well, whenever a model is strict to its normal distribution = automatically it will be bad at creativity

#

google fixed that somehow

#

i remember gemini was spouting things straight up word by word from wikipedia

unborn ocean
#

transformer will not stay like this forever, attention will probably stay for a very long time like this (or very similar)

torn mantle
#

maybe creativity is also a base model issue and not a reasoning one

#

since its more of like predicting the next token

unborn ocean
#

i think

torn mantle
#

what could reasoning do if the way the model writes is just bad

ocean vortex
dawn wharf
unborn ocean
dawn wharf
#

so there's that

unborn ocean
#

did not get around to it yet though :|

ocean vortex
torn mantle
#

do we even do that?

#

are we doing it the right way?

dawn wharf
#

jk

ocean vortex
dawn wharf
#

but now that I'm thinking about it it's actually a good point

torn mantle
#

creativity: thinking outside the box and producing something improbable and unexpected.
model: designed to do the contrary, to produce the most predictable and plausible outcome.

dawn wharf
#

If the model is intelligent, it wouldn't need to think a lot before answering

unborn ocean
#

otherwise, yes not really

ocean vortex
#

it allows it to think of its own solution

#

rather than blindly fit training data

unborn ocean
dawn wharf
unborn ocean
#

and current base models are just too small to capture that effect, they seem knowledgeable, but have a tiny "brain" and thus little area to create weird ideas (is the way i think about it)

dawn wharf
#

literally the only thing it helps in is keeping the narrative on track

#

it doesn't help anything else

ocean vortex
#

it's just that you started with a very poor model

#

and made it better

unborn ocean
#

yes, but as a company you have a choice between TTS and size

#

so that is what i was trying to bring up

ocean vortex
unborn ocean
#

@torn mantle where did you go 😦

#

2+2=?

torn mantle
#

oh im here

#

= 3

unborn ocean
#

the most interesting thing about the TTS is the variability of it though
=> most interested in a combined o3 and o4-mini aka gpt5 (hopefully better)

torn mantle
#

did anyone try heavy grok 4 for creative writing?

ocean vortex
# unborn ocean 2+2=?

^ I still have no clue if he was referring to discord server admission form. Where I had to add some question to be able to change the server to "apply to join" for more reach. So I used this exact question lmao

#

@torn mantle

torn mantle
#

oh we are just joking about 2+2

unborn ocean
#

because the teacher

torn mantle
ocean vortex
#

Grok is so slow 😭

torn mantle
#

why why why

unborn ocean
torn mantle
#

all they do is lie

unborn ocean
#

no new reasoning traces => no new learning was the idea

torn mantle
ocean vortex
#

If one of those attempts got into an infinite loop this gonna be nearly an hour wait again

#

💀

unborn ocean
#

they just pick better trace using rl

#

not sure though, there are a lot of papers that genuinely discuss doing SFT over RL algo, so they can learn (genuinely why tf my spelling so bad on this keyboard, i want to bury myself)

#

slower, because less of the weights are affected, but imo rl is actually learning new stuff

meager harbor
#

why can't models browse the internet on lm arena? it totally skews the results, they hallucinate like crazy

unborn ocean
ocean vortex
#

well not quite, but this is better

#

I get to see ALL the responses

#

If I were to do the same by regenerating this would take 100 million hours

unborn ocean
#

otherwise they can't

ocean vortex
#

ok FINALLY. Don't think a single of those is correct lol

unborn ocean
#

worth it to see the richest man on earth fail hard yet again

elder rapids
#

man I can't wait for deepthink

#

ts gonna be so good

#

they're putting so much RL into it

#

😭🙏

tall summit
#

so what are the grok 3 vs grok 4 benchmarks

meager harbor
#

they're weird

tall summit
unborn ocean
dawn wharf
elder rapids