#general


wind moth
#

questions to ask

#

they should just premiere

whole wagon
#

It sometimes randomly has nazi outbursts it's not just political questions

wind moth
#

a livestream

hollow ocean
#

guys did it start yet

wind moth
#

like what many companies do

empty stump
whole wagon
#

Like the other day it was flaming literally random Jewish ppl on X. Since they had Jewish surnames

zenith saffron
#

yea agree they should just go full hackathon mode and just replay things

#

not actually use it

whole wagon
#

Without any political prompt

whole sundial
#

supergrok heavy is not sold out anymore

#

(don't worry, i don't have $3,000 to spend. also, "SuperGrokPro"? I thought this was SuperGrok Heavy!)

#

they must have changed the name in the last second

whole wagon
#

Bets on it starting on the hour?

north vale
#

which hour of the night can it start at to minimize viewership

#

that's where my money is

whole sundial
#

meanwhile someone on the grok discord accidentally bought SuperGrok Heavy

whole wagon
#

You can just downgrade it and it'll refund the diff scaled by days
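
A rough sketch of what "refund the diff scaled by days" could look like. This is only an illustration of prorating, not xAI's actual billing logic; the $30 lower tier, the 30-day cycle, and the function name are assumptions made up for the example (only the $300 Heavy price appears in this chat).

```python
def prorated_refund(old_price: float, new_price: float,
                    days_left: int, days_in_cycle: int = 30) -> float:
    """Refund the price difference, scaled by the unused share of the billing cycle."""
    if not 0 <= days_left <= days_in_cycle:
        raise ValueError("days_left must be between 0 and days_in_cycle")
    return round((old_price - new_price) * days_left / days_in_cycle, 2)


# Hypothetical: downgrade from a $300 tier to a $30 tier with 27 of 30 days unused.
print(prorated_refund(300.0, 30.0, 27))
```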

whole sundial
wind moth
whole wagon
#

It did start on the hour

#

Ez

north vale
#

ggs

whole wagon
#

Maybe they really did want it to start at nein

jade egret
#

FINALLY

hallow pelican
#

it started eventually

empty stump
#

We were all wrong

hallow pelican
#

🥳

whole sundial
jade egret
#

EZ

#

FINALLY

whole wagon
#

That's a lot of viewers damn

zenith saffron
#

as predicted, one hour and one minute late

jade egret
#

how much r there?

#

21.6?

whole wagon
#

It's climbing rapidly

jade egret
#

or is that views

whole wagon
#

At 60k rn

jade egret
zenith saffron
#

now we're being baited

keen beacon
#

Loading screen for another hour

zenith saffron
#

the livestream began, but the real livestream has still yet to begin

#

😭

jade egret
#

oh this?

whole wagon
#

80k

echo aurora
#

we have cool music now tho

jade egret
whole wagon
#

100k

#

What the hell

leaden meteor
#

"worlds most powerful ai model" hmm....

jade egret
small haven
#

over/under elon musk happy or sad on stream

whole wagon
#

This is going to be the most watched livestream on X by far

#

120k

#

140k

echo aurora
elder rapids
whole wagon
#

I mean watched live ofc

elder rapids
#

yeah, live

jade egret
#

146k for me rn

hardy pecan
#

ooh

#

we on

jade egret
#

152 now

zinc ore
#

140k live will end up being millions of viewers

elder rapids
#

do you mean the most watched AI related thing

jade egret
#

yooo

zenith saffron
#

i wonder who's the voice

elder rapids
#

it's AI generated

torn mantle
#

Who's talking

elder rapids
#

it's deadass AI

empty stump
#

When is it gonna be on lmarena leaderboard

whole wagon
#

wild

zenith saffron
#

lol "AI is advancing faster than any human"

small haven
#

so this is going to better than kingfall?

jade egret
zenith saffron
#

oh wait what

elder rapids
jade egret
#

238k

whole wagon
#

college exams were solved long ago man we are already on phd questions

#

get to the good stuff

zinc ore
#

Elon is so low energy rn lol

whole sundial
#

"Grok 4 is smarter than almost all graduate students in all disciplines simultaneously."

hardy pecan
#

blud needs to sleep

whole wagon
hardy pecan
#

order of magnitude is big

jade egret
#

is that good

zenith saffron
#

is elon trying to emulate jensen

whole wagon
zenith saffron
#

compute used for RL step

hardy pecan
#

gwak 4, based

zenith saffron
#

they didn't confirm that explicitly

hardy pecan
#

oh nvm there is sneaky orange colours to cause confusion

elder rapids
#

?

#

they're highlighting the RL difference

whole wagon
#

400k viewers sheesh

jade egret
#
Poll: in the next 10 years, will google win the ai race or get defeated?
Winning answer: "win : )" (18 of 25 votes)

zenith saffron
#

yeah it's not explicit but yeah i see what you mean

#

it's likely it's basically the same base

whole wagon
#

yeah

zenith saffron
#

bro

#

"hebrew source text"

#

i'm dead

jade egret
#

444k for me viewers..

#

crazy...

whole wagon
#

ngl they are overhyping still. it aint AGI

#

ok he clarified lmao

zenith saffron
#

i wonder how it does in math research

#

rly want to see my math phd friend try this

zinc ore
#

Elon doesn't even believe himself

whole sundial
#

26.9%

whole sundial
#

no tool

whole wagon
#

550k viewers

#

its going to hit a million

whole sundial
#

41% with tool

elder rapids
#

wait

#

aren't the deep researches higher

whole sundial
#

"we put the tools in training"

elder rapids
#

lmao

jade egret
whole sundial
#

i think so?

jade egret
#

ooo

whole wagon
#

its extremely good HLE yes

#

double the SOTA

elder rapids
#

kinda inherent to the size tho, no?

#

what

jade egret
#

holy it double

elder rapids
#

can this dude talk faster

#

holy sht

#

😭

zinc ore
#

Elon is so out of it lol

whole wagon
#

he seems depressed

jade egret
jade egret
whole wagon
#

i guess since grok turned into a nazi

#

he is sad

elder rapids
#

did he just say "idk"

jade egret
#

705k is crazy.......

wind moth
#

he needs to stop yapping

whole wagon
#

are there more benchmarks or smth

#

or a demo

echo aurora
#

I'm sure there will be

jade egret
#

is it gonna be on arena?

elder rapids
#

ye

jade egret
elder rapids
#

interesting conclusion

#

"yeah.... yeah."

jade egret
#

i gtg...

#

im gonna come back soon tho

#

cya

echo aurora
#

🍊

jade egret
whole wagon
#

they trained it on puzzle type things i bet

zinc ore
#

Ik openAI and Google feeling pretty good rn

elder rapids
#

this is a pretty bad stream

#

lmao

#

ts buns

zenith saffron
#

"closing the loop on reality" makes me worried about paperclips

whole sundial
#

50.7% now

#

with test time compute

hardy pecan
#

text only subset

#

?

#

misleading!

zenith saffron
#

wait, text-only subset?

whole wagon
zenith saffron
#

wdym by your question?

whole sundial
#

maybe they didn't finish the multimodal part yet?

zenith saffron
whole sundial
#

oh, test-time compute is Grok 4 Heavy

zinc ore
#

Consc@1024

hardy pecan
#

lol

proper prawn
#

They seem to be hiding reasoning

whole wagon
#

simple problem

#

i could solve that even lol

#

bro is showing polymarket

zinc ore
#

"seeker of truth" "aligns with reality"

#

Maaannn

hardy pecan
#

what a dumb example,

#

gRoK wIlL pReDiCt tHe fuTuRe

somber hatch
#

test time compute is just using more compute when answering a question. Reasoning tokens are an example of TTC, also running multiple in parallel and voting or collaborating is TTC. It just uses GPUs a lot while answering the question, not training or updating the model at all
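
A minimal sketch of the "run multiple in parallel and vote" flavor of TTC described above. It is not how Grok 4 Heavy or o3 pro actually work (the stream did not say); `majority_vote` and the stub sampler are hypothetical names, and the stub just replays canned answers where a real setup would sample the model several times.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def majority_vote(sample_answer: Callable[[str], str],
                  question: str, n_samples: int) -> str:
    """Ask the same question n_samples times and return the most common answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler(canned: Sequence[str]) -> Callable[[str], str]:
    """Stand-in for a model call: cycles through canned answers (assumption)."""
    it = itertools.cycle(canned)
    return lambda question: next(it)


sampler = make_stub_sampler(["42", "41", "42", "42", "17"])
print(majority_vote(sampler, "What is 6 * 7?", 5))
```

No weights change anywhere in this loop; the only thing that grows with `n_samples` is inference compute.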

whole wagon
#

1M viewers

#

the odds plunged back down

#

53% to 41%

elder rapids
#

they tryna hypnotize us lmao

zinc ore
somber hatch
#

their graph was showing how adding more TTC increases the performance. OpenAI has shown similar charts when they released o1 and o3
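
A toy model of why such a curve goes up, under strong assumptions that are not from the stream: a single question, each sample independently correct with probability `p`, every sample either right or wrong, and a strict majority vote over an odd number of samples. Real model samples are correlated, so real curves flatten much sooner; `majority_accuracy` is a name invented for this sketch.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    # Sum the binomial probabilities of getting more than n/2 correct samples.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))
```

With `p` above 0.5 the value climbs toward 1 as `n` grows; with `p` below 0.5 voting makes things worse, which is one reason more samples alone is not a free win.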

storm needle
#

what are the chances of everyone there getting fired

elder rapids
#

they're not showing anything impressive

#

what's going on lol

#

they're showing what makes o3 cool

zenith saffron
#

"wen grok 5"

elder rapids
#

nice

#

😭

#

tool use

proper prawn
elder rapids
#

they're saying grok 4 is doing all this research to get the simulation correct

whole wagon
#

latex was wrong lel

elder rapids
keen beacon
#

I wonder if they switched off the qwq cold start lmfao

whole sundial
#

so grok 4 is currently able to look at images

proper prawn
#

Training on tool use could be big

whole wagon
#

its not that wild man

keen beacon
#

They have so many resources. They should make their own XD and actually innovate

elder rapids
#

no

zenith saffron
#

imo there's a lot they're not revealing

whole sundial
#

38.6% with tools, 44.4% heavy

empty stump
#

Is the gemini 2.5 hle score w/o tools

whole sundial
#

25.4% without tools

zenith saffron
elder rapids
zenith saffron
#

oh i'm just saying like

#

this is not even close to a technical paper

empty stump
#

Wonder grok 4 superheavy vs o3 pro vs gemini 2.5 pro deepthink

zenith saffron
#

just because they're still using RL doesn't mean there isn't any innovation lol

clever estuary
#

reich4 looks itneresting

whole wagon
#

bro what in a few weeks

zenith saffron
#

that's like saying GPT-3 was not innovative because it was just scaling the same method up

whole sundial
#

they are training version 7 with vision and image gen

#

grok 4 is version 6

zenith saffron
#

(whereas in this case we don't even know what methods they used)

zinc ore
#

Gemini no tools does better than grok 4 no tools on HLE

jade egret
#

back

whole sundial
proper prawn
#

61.9% in USAMO

jade egret
#

what did i miss

echo aurora
jade egret
zinc ore
#

Okay finally more benchmarks

elder rapids
#

they're not

#

grok 3 mini remember

#

lmao

#

ye

#

prob

keen beacon
#

For 10 minutes 😂

elder rapids
#

we'll see in practice

#

deadass

jade egret
#

but better models gonna come out very soon right

#

like usual?

keen beacon
#

Maybe not this month

zinc ore
#

I completely didn't retain how the reasoning works with grok heavy

jade egret
whole wagon
#

they just do the parallel thing

#

same as o3 pro

#

holy cringe

proper prawn
#

lol they know their audience

torn mantle
#

Ewwww

#

What the hell

hollow ocean
#

only $300 for heavy

#

not bad

whole wagon
#

bruh

jade egret
#

diet coke

hardy pecan
#

my stomach hurts

whole wagon
#

how are they not cringing

zinc ore
#

😐

whole wagon
#

aspartame ambrosia 😂

#

bro what

zenith saffron
#

this is so ex machina coded

torn mantle
#

Oh it failed

hardy pecan
#

that was weary weary cringe

elder rapids
#

interesting

torn mantle
#

Ye

elder rapids
#

nice voice tho

echo aurora
whole wagon
#

it didnt fail he didnt realise it retained the context

empty stump
#

now advertising openai

whole wagon
#

lol

torn mantle
#

Openai?

whole wagon
#

RIP

jade egret
#

did openai fail

empty stump
#

no

hardy pecan
#

why does it keep asking questions

#

just say the numbers!

whole wagon
#

openai got destroyed

#

lmao

#

thats hilarious

torn mantle
#

Lmao

#

That wasnt fair

#

No no

jade egret
#

openai die : (

torn mantle
#

He was faster

#

Nah

#

That's unfair

hardy pecan
#

do they have a hitler voice, based

empty stump
#

oh thats what they were showing the speed

elder rapids
#

oh nice api

whole sundial
#

real grok 4 arc agi 2 score: 15.9%

#

still very much in lead

whole wagon
#

👀

zinc ore
#

Yeah previous sota was 8%

whole wagon
#

sheesh

#

ARC-AGI-2 is getting solved in a year

jade egret
#

holy 2x

elder rapids
#

I'm ngl I can just dismiss arc agi 2 scores

#

this one is obviously a training thing

empty stump
#

I wonder about SWE bench score

torn mantle
#

Now we are getting to the real stuff

elder rapids
#

it's arc agi 2 score isn't proportional

hardy pecan
#

i thought it cant see images

#

how did it do arc

torn mantle
#

Those benchmarks look crazy ngl

torn mantle
torn mantle
hardy pecan
#

translated to text?

#

ok

torn mantle
whole wagon
#

1.5M viewers damn

whole sundial
#

grok 4 0709 is the current version

whole wagon
#

what the hell is vending bench

jade egret
elder rapids
#

nobody knows

empty stump
#

Running a small shop or something

zinc ore
#

Grok downloads about to skyrocket

torn mantle
#

Please add it on lmarena so i can try it

#

Can you add it?

jade egret
#

how long do yall think google and openai will take to catch up

whole wagon
#

how is bro gonna add it

jade egret
#

cuz gemini 3 and gpt 5 are coming

torn mantle
#

Arent you one of lmarena staff team?

elder rapids
hollow ocean
#

@deep adder deepthink already dead

jade egret
torn mantle
#

Okay~

jade egret
small haven
jade egret
#

rlly fast ig

tidal schooner
#

100% on aime25

#

insane

empty stump
#

how does it do in IMO

whole wagon
#

1.7M rn

zinc ore
#

It says 1.7m views

#

So probably sub 100k live

empty stump
#

so dumb they dont show live count

whole sundial
tidal schooner
#

i think

#

which is absurd

wind moth
#

thats usamo

#

not imo

whole wagon
frigid phoenix
#

Is grok that good? lol
Didnt watch the stream

zinc ore
#

Last 20 mins was way better than the beginning

empty stump
#

not trying it out until leaderboard results come out

proper prawn
#

Is grok 4 in arena?

empty stump
#

no

dawn wharf
#

Deepthink dead before even releasing

tidal schooner
empty stump
tidal schooner
#

don’t think they tested it for imo then

#

still extremely impressive

#

better than i could ever do

zenith saffron
hardy pecan
#

very lacking in the demo's, just benchmaxxxxing it seems

#

will need to try ourselves

tidal schooner
#

ironically

zenith saffron
#

i am lowkey a little worried

empty stump
zenith saffron
#

within the next 5 years

whole sundial
#

grok 4 rate limit: same as grok 3 free, 20 per 2 hours for supergrok subscribers (not available for free rn)
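
For a feel of what "20 per 2 hours" means in practice, here is a sketch of a sliding-window limiter. Whether xAI uses a sliding window, a fixed window, or something else is not stated anywhere in this chat; the class name and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests in any window_seconds-long span."""

    def __init__(self, max_requests: int = 20,
                 window_seconds: float = 2 * 60 * 60) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Forget requests that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            self._stamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
results = [limiter.allow(float(second)) for second in range(21)]
print(results.count(True), results[-1])
```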

empty stump
#

someone try out grok 4 and tell me how good it really is

zenith saffron
#

my p(doom) has been climbing

small haven
dawn wharf
dawn wharf
small haven
zinc ore
#

Grok 4 creating the shader (no errors).

QRT: emollick
o3-pro does by far the best so far at my benchmark (scroll quote tweet thread for others): "create a visually interesting shader that can run in twigl app make it like the ocean in a storm"

It did take 21 minutes for o3-pro to think (and another 19 to fix a small shader error) https://t.co/KqzmuHm5Zf

elder rapids
#

holy pumped compute

#

we already know the pricing

elder rapids
#

it's like 3$ input 15$ output

tidal schooner
#

huh

whole sundial
#

no

elder rapids
#

per million

#

wym?

#

it says 3 in 15 out
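
To put "3 in 15 out per million" in per-request terms, a quick back-of-envelope sketch. The two rates are the ones quoted above; the token counts are hypothetical, and the example assumes reasoning tokens are billed as output, which nobody here confirmed.

```python
INPUT_USD_PER_MILLION = 3.0    # quoted above: $3 per million input tokens
OUTPUT_USD_PER_MILLION = 15.0  # quoted above: $15 per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted per-million-token rates."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical request: 2,000 prompt tokens and 10,000 output tokens.
print(f"${request_cost(2_000, 10_000):.3f}")
```

At these rates a long-thinking model is dominated by the output side, so how much it "thinks" matters more than prompt length.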

#

link

#

yo what

#

are you joking

whole sundial
keen beacon
#

No parallel compute? Lol

#

Must be a mistake

elder rapids
#

wait where are we getting this from, I'm only seeing 3 in 15 out

#

if it's actually that expensive then all the hype for me is squashed

keen beacon
#

It's a typo

elder rapids
#

id thought so

keen beacon
#

The pricing below that is correct

zinc ore
elder rapids
#

ye see

keen beacon
#

Do they still give you reasoning 'summaries'?

#

Can u show if you don't mind I can't check rn

#

Oh I mean regular grok 4

#

Oh lol they actually summarize now

#

Thanks

keen fulcrum
#

Grok 4 achieves SOTA

keen beacon
#

Did they update the cut off or is it literally just rl on grok 3 lol

#

Yeah I didn't really watch it lmao

keen fulcrum
#

@echo aurora when can we expect grok 4 to be added to lmarena? ArtificialAnalysis received early access

keen beacon
#

Ultra should beat it though

keen fulcrum
#

It would be great if grok 4 heavy can be tested by us as well

dawn wharf
#

bro is flying

hollow ocean
#

grok 4 heavy is the truth

fleet pine
echo aurora
keen fulcrum
#

looks like its available

elder rapids
#

yeah

#

@leaden palm gave me a story it wrote

#

and I don't need any other examples

#

ts BUNS at writing

#

💔😒

empty stump
#

what llm is best at writing so far

elder rapids
#

think it might be

empty stump
#

So gpt 4.5 is not best at anything

elder rapids
#

but it's not that good by itself

dapper storm
#

Why do you think that?

tidal schooner
civic flame
#

lol ive been using this

#

it's honestly kinda meh..

tidal schooner
#

and it’s gone again?

elder burrow
#

i only trust scicode for coding

look at 4 opus on livecodebench..
its behind nemotron 💀

#

this is accurate

empty stump
#

grok 4 gone from arena bruh

elder burrow
empty stump
#

for a few mins

elder burrow
#

bruh

indigo hazel
whole wagon
keen fulcrum
empty stump
#

no idea

primal orbit
#

I managed to get 1 prompt to grok 4 in lmarena then it disappeared and chat switched to chatgpt

zinc ore
#

Yep

fleet lintel
#

I tried scrolling 100 comments to understand grok4 is SOTA or not... I am still not sure.

Is it SOTA or not?

elder burrow
#

💀

keen fulcrum
zinc ore
empty stump
#

i can do better on ms paint

elder burrow
old garden
#

what

#

a

elder burrow
#

2 Treys

old garden
#

lol

civic flame
# zinc ore

i've noticed currently frontend dev seems really poor. i am giving it the benefit of the doubt because i am only able to use it through X where it has a system prompt and all of that applied, but still..

zinc ore
#

Why u tag me twice

civic flame
elder burrow
#

hey at least its 20x cheaper than 4..

#

oh you meant regular 3

#

isnt mini better

zenith saffron
#

you think it's contaminated for HLE, USAMO, etc.?

fleet lintel
#

SuperGrok Heavy: $300/month (Multi Agent Version)

Dayum! Is it really that good?

elder burrow
#

wtf is stonebloom

fleet lintel
#

@deep adder time for you to buy $300/month version and tell us the truth

elder burrow
#

💀

#

@rare python can we have grok 4 back

zenith saffron
#

where's this from

#

what problem is this

whole wagon
#

Just tell it not to search

elder burrow
#

you can disable search in settings

#

:p

whole wagon
#

Tell it in the prompt

elder burrow
#

oh

#

lmfao

whole wagon
#

The API will not randomly call tools

#

Without it being enabled

#

There are benchmark numbers without tools

stuck orchid
#

Waiting for Grok 4 on LMArena to vote 👍

hardy pecan
#

seems to be bad at coding, cant edit my code without bugging

small haven
#

it thinks a lot

dapper storm
#

No I hope

stuck orchid
#

I think Grok 4 should be something like o3 in terms of reasoning and cognition.
o3 is lazy and also worse at coding than Gemini, especially in web coding, but according to my tests, o3 is the most powerful model for researchers.
Grok 4 should reason "from first principles."

abstract leaf
winged wing
#

same base model as grok 3?

abstract leaf
#

when will you guys update the leaderboard?

winged wing
#

really i totally missed that

abstract leaf
#

grok-4 should be around 1500.
-# my assumption.

winged wing
#

gd damn good thing i got out break even

stuck orchid
abstract leaf
#

yes

zenith saffron
#

did they actually?

#

i didn't literally hear them say that

winged wing
#

u manage to catch any of the swings on HLE?

elder rapids
#

the model is overall mediocre

#

@civic flame

indigo hazel
#

basically like a student who cheats during the test at school

elder rapids
#

have you tried that old

#

math prompt

#

with the answer of 3031

civic flame
#

but right now i've only tried via Grok on X

elder rapids
#

grok on X is different

civic flame
#

which has a big annoying system prompt and tools it won't let me disable

#

yes i know

elder rapids
#

alr

civic flame
#

so i'm waiting for lmarena to add grok 4

elder rapids
#

but from my testing

#

it's still buns

winged wing
#

It might have some sauce with the scaffolding. Idk I wouldnt be so quick to judge. Its def gonna get blown out of the water by the end of the month tho

elder rapids
#

ye ofc

#

not terrible at all

civic flame
#

lol we're still waiting on R2

elder rapids
#

yep, but nice benchmarks

civic flame
#

seems to also suffer from being too succinct when writing code

whole sundial
#

trained off of same base model as grok 3, they are just now training the new base model

civic flame
#

no wonder it's mid

blazing bison
#

Bro stop yapping, for god sake

zinc ore
#

They should have kept the 3.5 naming scheme

rare python
whole sundial
#

so i guess there will be a Grok 4.5 with image gen? Gemini 3/GPT-5 image gen will be better

blazing bison
#

Because you're poor, don't even have access to the model and is yapping around

#

Yes you're

civic flame
blazing bison
#

So stop yapping

hardy pecan
#

children, stop arguing

blazing bison
#

This is the mass user model, the true model is api only

small haven
#

what we thinking

civic flame
#

oh it's right but

#

are all your tools set to OFF

#

yes you can

#

one sec

small haven
hardy pecan
#

im thinking 0.00025km long is very small

blazing bison
#

If you use o3, sonnet, everything on their frontend is dumber than api

hardy pecan
#

the bridge is 25cm

small haven
#

check prompt

civic flame
civic flame
hardy pecan
elder rapids
blazing bison
#

It's not even secret, just search on X and you gonna see openai team talking about it. You're a yapping machine bro

elder rapids
#

doesn't catch anything im saying

#

lmao

#

it just got crushed in a debate with 2.5 pro too

thorny bane
#

what makes you think it's very contaminated?

small haven
#

oh yea terminator svg benchmark, i forgot

whole sundial
#

maybe grok would have even higher bench scores if it didn't use brave search, which has their own index so it kinda sucks. a sample search/deep research across gemini 2.5pro, grok deepsearch, duckduckgo, and perplexity shows that grok is the clear loser. brave's index is so bad that when i asked the question a few days ago, it finally found a relevant source while all previous attempts were just... bad. idk why they can't just license from duckduckgo lol

civic flame
#

when are they dropping the code model 💀

#

i forgot about it

#

when is it

whole sundial
#

august

civic flame
#

i didn't see the stream

#

bruh

small haven
civic flame
#

yeah this is disappointing

whole sundial
#

september for "multimodal agent" a.k.a. new image gen

indigo hazel
#

so i was waiting for the best model ever, but it's worse than o3 and 2.5 pro?

whole sundial
#

to be fair grok was the first major ai that i know of that had native image gen. all the others did it this spring. they had it last winter.

zenith saffron
#

it's just jimmy ba and tony wu the co-founders lol

indigo hazel
#

so what

zenith saffron
#

just researchers

whole sundial
#

i think grok 4 was rushed, it's all hype

elder rapids
#

yo

undone hull
#

Can't wait to see MechaHitler 4 on the charts

elder rapids
#

@civic flame it hallucinates search a ton

#

lmao

winged wing
#

ok lets see it

indigo hazel
wintry tinsel
whole sundial
#

wanted to try my "basic" knowledge question that only OpenAI models (o1, o3, GPT-4.5, maybe 4o/4.1?) get right. Claude Opus fails it. Gemini 2.5 Pro, stonebloom (Ultra) gets it wrong. Grok 3 got kinda close, but I don't think it will get it right if it uses same base model. Might get it right with reasoning.

civic flame
#

what's the question?

whole sundial
#

I'll wait for it to be on arena, thank you though

thorny bane
#

grok 5 will invent new physics

civic flame
#

have you tried it with wolfstride?

indigo hazel
elder rapids
#

if you find different results lmk

whole sundial
#

i will

elder rapids
#

but it's far from it

#

it's not dumb by any means

dawn wharf
#

do models that support web search use it on their own?

#

or they don't use it

elder rapids
#

this model has the same issue as grok 3 with Improving through a context

dawn wharf
#

direct chat

civic flame
#

lol ok

indigo hazel
#

so it got that high result in arc agi 2 by using tools?

small haven
civic flame
#

lol am i supposed to be seeing grok 4 in lmarena

#

it's not there

stuck orchid
#

Is Grok 4 similar in communication style to o3?

civic flame
#

not in direct chat and i haven't seen it in battle in the 10 rounds i've gone thru

elder rapids
#

same as grok 3

whole sundial
#

grok 4 from what i heard was available on the arena for a few minutes earlier

civic flame
#

cc @echo aurora

echo aurora
civic flame
#

i've gone through about 15 battle rounds now and not seen it 😔

#

either i'm unlucky or something's up

echo aurora
#

there are issues with putting it in Direct & side-by-side, but we're working on it

stuck orchid
whole sundial
#

it's really just grok 3 but the very smoky Colossus is pumping more power into it to make it reason better

echo aurora
small haven
#

grok 4 sucks

echo aurora
#

I haven't heard otherwise but will flag if more folks are saying they're not getting it

whole sundial
#

first try

#

it gets it right!

civic flame
#

on what platform

whole sundial
#

i'm an idiot for doubting Dork 4

#

this was the question lol. Grok 4's response is fully correct. First non-OpenAI model to get it correct!

civic flame
#

memphis in general is being screwed over by a bunch of ai-related developments that aren't adequately planned (imo)

stuck orchid
#

Why then did Elon say that Grok 4 could rewrite the entire current dataset for AI training?
I think Grok 4 has hidden potential. If it's not very good in other areas, it must be good for some specific type of task. We need to try to find that area.

whole sundial
#

grok 3

civic flame
stuck orchid
#

Esl

whole sundial
#

well i guess all of that pumped up energy did something to Dork 4. Not paying $300 a month to see what "Heavy" is like, though.

eager mica
fleet lintel
whole sundial
#

this benchmark is for offline models only, no cheating here!

eager mica
# eager mica Current pretraining datasets are crappola. Other AI companies are already thinki...
elder rapids
#

after even more testing I can say that it's not smart but it's knowledgeable ASF and can pinpoint things through a context well + good tool usage + brute forces puzzles really well

#

2.5 pro and o3 >>>

civic flame
fleet lintel
elder rapids
#

should tell you everything

#

lmao

whole sundial
#

more fails from Claude 4 Opus Thinking and Gemini 2.5 Pro (the Ultra anon models fail here)

fleet lintel
#

somehow these CEOs characters are reflected in these models... shady CEOs == shade models

whole sundial
#

the tools they use bump up the numbers, but grok 4 offline is better than grok 3 offline

elder rapids
#

yep

elder rapids
#

it ends there tho tbh

#

it got crushed in a debate with o3 too

#

that's crazy

#

one thing Ive noticed tho

blazing bison
dawn wharf
#

weren't you glazing elon before the announcement?

elder rapids
#

is that it's not as dogmatic

#

as other models

#

I have to ask it not to hold back and be efficient etc etc

blazing bison
elder rapids
#

even though it still gets crushed

#

it's something I have to remove

keen fulcrum
#

grok 4 is such a meme and a legend

haughty siren
#

Is the Grok 4 model in the arena with thinking and is it super heavy

keen fulcrum
blazing bison
#

in my math tests it actually crushed them

keen fulcrum
#

believe due to rate limit

winged wing
wind moth
#

hows grok 4 doing for those who have tested it

elder rapids
haughty siren
elder rapids
#

that's redundant overall

indigo hazel
whole sundial
#

aka the arena!

blazing bison
indigo hazel
blazing bison
pure anvil
blazing bison
#

Sonnet is bad at math

#

Claude in general is bad at math

pure anvil
#

it's an example

#

math is not correlative to general performance

blazing bison
elder burrow
#

in battle mode?

stuck orchid
# eager mica Current pretraining datasets are crappola. Other AI companies are already thinki...

Yes, it seems that's the issue. Models nowadays are often fixated on the specific text of the question they are asked. They might provide a good answer to task X, but get confused when the same task X appears as part of task Y.
Some tasks require broader reasoning, for instance, in agent modes, to even understand the environment they are operating in. This is likely due to training on unformatted web data.
However, the o3 model really stands out in this regard; while it can be somewhat 'lazy', it sometimes understands the task context better, though it also hallucinates quite significantly

whole sundial
elder burrow
#

scicode is the best

elder burrow
#

😭

blazing bison
#

I would use Grok 4 for math-related problems, but I wouldn't switch from ChatGPT to Grok. Sorry, Musk, but you need to do more

#

Maybe heavy grok is good, but $300?

indigo hazel
#

basically it's a cycle. like a student who uses ai to cheat, or like he uses calculator in maths tests, grok uses tools to respond correctly

dawn wharf
#

if it's truly smart, it shouldn't need tools

keen fulcrum
#
  • full context 256k + grok 4 heavy
blazing bison
pure anvil
elder burrow
#

have yall seen grok 3 mini reasoning high price to performance

pure anvil
#

2.5 pro + veo3

indigo hazel
elder burrow
keen fulcrum
#

claude max is cheap if you do code

whole sundial
#

coming soon: Dork 4.5Vo SuperHeavyDuty Ultra DeepThink - super duper early preview available in SuperDuperDork Pro Max+ for $1,000,000 a month and $10,000,000 a year

#

may get 35% on arc agi 2

sour spindle
#

Is grok 4 heavy only available on the super premium $300

elder burrow
#

btw yall seen the last few secs of grok stream? they revealed timeline, coding model in august

sour spindle
#

Don’t know if grok 4 then is worth it compared to o3

blazing bison
#

GPT-5 is still months away

elder burrow
#

not even available in api

sour spindle
#

Does seem like all the impressive benchmarks are grok 4 heavy

elder burrow
violet adder
#

@echo aurora Where did Grok 4 disappear to in Direct Chat mode?

elder burrow
#

mr moreknowing

#

oh damn

blazing bison
#

There's no point in discussing this, but GPT-5 is not a July thing

whole sundial
#

try the arena battle mode

elder burrow
#

YES

#

i thought its intentional

whole sundial
#

i'm surprised i got grok 4 on the first try in the battle mode

#

it didn't even reason, it just spat out the correct response

echo aurora
blazing bison
indigo hazel
patent bane
#

is grok 4 better than o3 or 2.5 pro?

blazing bison
patent bane
#

is it just a hype?

blazing bison
#

also no

pure anvil
#

Did yall see the grok 4 demo livestream? elon is ket'd out

whole sundial
#

if it was doing that, we would see Llama 4 Maverick: the sequel

blazing bison
whole sundial
#

the response began like 1 to 2 seconds after i hit enter, it did not even have time to search and reason

blazing bison
#

Llama will be fine now. They hired very experienced people

whole sundial
#

but tools are likely disabled on the arena

whole sundial
#

i was just saying that if xAI did use tools on the model in the arena, something like that would happen again

whole sundial
#

but i don't think they are using tools, it's just model output

#

there was a search arena on the old site, likely still usable

#

but this is just the standard arena on the new site

#

i wonder why the Step-1X edit and SeedEdit 3.0 models only appear in Arena mode for Image editing. I assume this is something the model owner set in place? but you can access Seedream 3.0 just fine, so idk

pure anvil
#

he's so jittery

whole sundial
#

and bagel, forgot about that one

rare python
#

is bagel SeedEdit 2.0?

#

It's on lmarena right now

whole sundial
#

bagel is a separate open source model they made based off of Qwen-2.5 VL 7B (I think)

hollow ocean
#

misclick $3000/yr

cedar tide
#

We want grok 4 on direct chat

echo aurora
winged wing
#

its not being ran in battle mode either, no?

echo aurora
lilac nimbus
#

Grok4 seems great I try

cedar tide
#

@echo aurora the site is all bugged

echo aurora
#

Yeah we're looking into it

#

Sorry everyone

echo aurora
indigo hazel
whole sundial
#

is it related to grok 4 or is it something that just... happened?

blazing bison
#

grok 4 hacked lmarena

#

💀

echo aurora
indigo hazel
rapid merlin
#

wasn't grok in a controversy like a tiny bit ago?

#

where it "shared its thoughts" on some twitter replies

blazing bison
#

the mecha thing

rapid merlin
#

yeah

keen fulcrum
winged wing
#

@echo aurora does basically every model provider ask for the 1 day heads up thingy?

keen fulcrum
#

It's now the second incident where grok ran crazy 🤪

rapid merlin
#

Solid cover!

winged wing
#

if the llm doesnt seek the truth that you think is correct, you make it. At least this is what elon thinks.

blazing bison
#

They said it could go crazy. It's an experiment

ornate stump
#

Just woke up and saw all the bench - I'm a little hyped, not gonna lie. But where's Grok 4? Is it just an announcement?

echo aurora
#

Site should be working again btw 👍

blazing bison
#

@ornate stump battle mode only

keen fulcrum
echo aurora
keen fulcrum
echo aurora
winged wing
echo aurora
whole sundial
#

now the site is just... slow

winged wing
#

mhm i see

whole sundial
#

and that thing happens where your chat history gets wiped and you have to accept the terms of use again... except you can't!

echo aurora
#

Site is struggling again

#

Apologies for the issues, I spoke too soon about it being fixed

whole wagon
#

All that grok 4 induced demand Kappa

cedar tide
#

Anyone have a webdev example from grok 4?

burnt pulsar
#

Is grok 4 heavy also planned to be available?

blazing bison
blazing bison
burnt pulsar
#

Understandable, it's the priciest of the new Grok 4 models.

main gulch
#

grok-4-heavy is not available even in the official API yet

sullen finch
#

why did they remove the ability to attach files in direct chat

burnt pulsar
#

That never worked when I tried to attach files. 🙂

mystic mica
#

I can't even make it past the Cloudflare verification

sullen finch
sullen finch
burnt pulsar
#

Maybe there is no disk space left? 😉

sullen finch
soft kernel
#

Why didn't they release an image gen?

burnt pulsar
whole sundial
burnt pulsar
#

Even more impressive if they achieved these kinds of improvements just with a lot of RL with grok-3 as base model....

hardy oriole
#

Grok 4 not generating answers on webdev arena? Got it two times in a row and only the opposite model generated code

verbal nimbus
#

Why is Gemini 2.5 Pro so dumb on Github Copilot

#

It can plan an entire 2 week schedule on AI Studio, but can't even swap two time slots within 2 hrs of each other on Copilot. I ask it to swap slots A and B and it forgets to re-add one of them.

#

Tried it in AIStudio and it runs fine

keen ferry
#

I just woke up, where can I see the grok 4 demonstration?

torn mantle
#

Is grok4 really on lmarena?

#

I couldn't get it once

blazing bison
torn mantle
blazing bison
#

I think only cursor with max mode offers full model context

soft kernel
languid crescent
#

Can't find grok 4 in direct chat?

civic flame
#

i've had enough

#

literally had 60 battles

#

it's come up 0 times

keen ferry
blazing bison
#

Sorry

torn mantle
keen ferry
#

what are the chances of getting it

opaque adder
#

I just bought grok 4 heavy give me some prompts

keen ferry
#

we need grok 4 on direct arena

keen ferry
opaque adder
#

300 usd

keen ferry
#

wtf

indigo hazel
opaque adder
#

no

#

it's 3000 usd per year

#

300 per month

indigo hazel
opaque adder
#

if you can read

#

i clearly said heavy?

rare python
#

lmarena is extremely laggy when both models are generating code

indigo hazel
opaque adder
#

how much does the monthly subscription cost I might get one

#

Runo - 09:56
I just bought grok 4 heavy give me some prompts

swen - 09:57
how much does the monthly subscription cost I might get one

indigo hazel
#

but both of them are per year.

indigo hazel
opaque adder
#

grok 4 heavy is 300 per month

#

are u braindead

indigo hazel
#

sorry

opaque adder
#

yeah you are

keen ferry
#

@echo aurora can we get grok 4 on direct chat it's just impossible to get it in battle

torn mantle
#

Don't make @opaque adder mad

opaque adder
#

asura

torn mantle
#

Runo

opaque adder
#

i see you here every time

#

i come into this discord

#

and i come here once per month

#

i never miss your username

#

i haven't checked your message count but i assume you have over 40k

civic flame
torn mantle
#

Your eyes playing tricks on you

civic flame
#

or they've deliberately made it ridiculously hard to get

opaque adder
#

ok that's surprising

civic flame
#

in which case, why even bother

torn mantle
#

5k???

opaque adder
#

grok 4 is not going into lmarena

torn mantle
#

That's a lot

keen ferry
civic flame
#

HOW HAVE I GOT IT 0 TIMES AFTER 80 BATTLES ☠️

keen ferry
civic flame
#

still better than me

stuck orchid
rare python
#

They reveal the name

keen ferry
stuck orchid
stuck orchid
#

Sometimes Deepseek-r1 says that it is Grok4

rare python
#

DeepSeek is the mastermind

keen ferry
rare python
#

It was made from multiple models

stuck orchid
rare python
#

Why did Gork drop?

tidal schooner
#

also some people said the demos were meh

indigo hazel
#

why is o3 so low if it's comparable with 2.5 pro?

tidal schooner
#

grok 4 seemed kinda rushed to me for some reason idk why

#

elon was stuttering half the time during the announcement

rare python
burnt pulsar
#

As it is a highly competitive landscape, he possibly wanted to get the attention in the summer days.

rare python
#

nervous laugh

tidal schooner
#

ngl bro was just yapping about the stuff in his tweet prior to the livestream

tidal schooner
#

r2 wen

rare python
tidal schooner
rare python
#

r1 0528 is a decent model

tidal schooner
rare python
#

The base for RL training

#

Or they will merge into one model

tidal schooner
#

would prob be slower for a lot of tasks tho tbf

fleet pine
#

has anyone tried deepseek r1 with 600+ billion parameters?

indigo hazel
keen beacon
#

Hey, do you guys know if Grok 4 is ever coming to direct chat?

cedar tide
#

@echo aurora grok 4 not working

keen ferry
cedar tide
#

Anyone seen a new mystery model that doesn't want to say its name and is good?

cedar tide
torn mantle
#

i want to try it

#

also just reading some grok 4 outputs i think it won't top lmarena

torn mantle
keen beacon
keen fulcrum
cedar tide
#

I don't have a real way to use it, I just increase my chances of getting it faster

#

@keen beacon

#

@torn mantle

sacred plaza
#

What will be higher: the number of companies that will overpay for a Nazi aligned AI model, or the number of trade deals Trump signs with other nations (3 so far)?

#

🤡🤡🤡🤡🤡🤡

keen fulcrum
#

Any model has hiccups of misinformation and wrong beliefs

#

It's common in the industry

cedar tide
#

Pikachu by grok 4

stuck orchid
#

Elon is focused on training the 256K token model and does not want to increase the context window yet, probably because he wants to first achieve something groundbreaking at this context window length before expanding it to a longer context

ocean vortex
#

So Dork4 AGI? PikaOMG

torn mantle
ocean vortex
#

even confirmed by AA

rare python
ocean vortex
#

the only way they could have cheated is if they are serving a different model publicly (with safety alignment and whatnot) than the one that was tested, but that's a reach...

#

probably even for Elon

torn mantle
#

bruh

#

my brain is fried

#

i messed up my sleep with the release

ocean vortex
alpine coral
#

it thinks for so fkn long

torn mantle
alpine coral
#

at least on OR

ocean vortex
alpine coral
#

weirdly underwhelming so far.. like compared to the screenshots of the evals here - was expecting way more

#

but only just started playing around

ocean vortex
torn mantle
#

it's not working for me

#

😦

rare python
alpine coral
#

yeah it's definitely solid - but i suspect it excels at single prompt, exam/eval-style questions.. i've tried a few questions (non-riddles) that require multiple steps of knowledge recall, which only 2.5 pro and o3 get right, and it fails kinda terribly

alpine coral
hazy quest
#

On the Wes Roth live testing, he asked how many times a basketball dropped from 100m would bounce if no air friction, and Grok 4 Heavy, after almost 5min of thinking, answered "infinitely many times", which is an absurdly wrong answer to a simple question

#

Worrying that it gets it that wrong

alpine coral
#

it kinda feels like o1/3-pro

#

how long it thinks

#

which can lead to both brilliant and ridiculous responses

torn mantle
#

is it telling me im poor?

keen beacon
hardy oriole
#

Grok has lost every single "battle" so far in webdev arena for me, probably 6 times total... There's something wrong with the model or idk what's going on lmao

It's producing worse results than even 2.5 flash

keen beacon
#

Damn

#

Google may also release gemini 3

#

Those guys can't stop cooking

torn mantle
keen beacon
#

OpenAI better step up their game

torn mantle
#

seems like they wanted to focus on other things and make a separate coding model

keen beacon
#

Yeah they have a separate coding model for that reason. Interesting choice.

hazy quest
#

The unhinged voice mode is pretty cool

#

But every company could do that, they just chose not to for now

keen beacon
#

I'm excited for the open source o3-mini like model next week

burnt pulsar
#

I'd love to see Gemini 3 getting rid of some character encoding issues that are very annoying in my coding tasks.

cedar tide
#

discord clone by grok 4

alpine coral
ocean vortex
hardy oriole
ocean vortex
#

Slightly more even

rare python
cedar tide
#

r1 0528 in comparison

keen ferry
keen ferry
#

why does everyone keep testing it in html and css and not like python or c++

hardy oriole
#

Grok 4

#

Wolfstride

#

Just using the pre made prompts in lmarena

cedar tide
hardy pecan
#

grok4's geoguessing abilities aren't so great