#general | Arena | Page 70

tepid lynx Jul 11, 2025, 4:08 PM

#

Google is currently leading in steps in this video, and Grok is considered far, far away

#

Deepseek made a competitor to 4o at one time, I think they will do no worse now

indigo hazel Jul 11, 2025, 4:10 PM

#

i published the link to the second video he published about his test for grok 4

jade egret Jul 11, 2025, 4:11 PM

#

tepid lynx Google is currently leading in steps in this video, and Grok is considered far, ...

:0

indigo hazel Jul 11, 2025, 4:12 PM

#

tepid lynx Google is currently leading in steps in this video, and Grok is considered far, ...

if you need, i can send the link for the videos where he tested also r1-0528, magistral medium etc in the same test

torn mantle Jul 11, 2025, 4:12 PM

#

thanks

#

lets laugh at them

indigo hazel Jul 11, 2025, 4:12 PM

#

torn mantle thanks

no problem. bro it's taking credits for a video i shared 😦

torn mantle Jul 11, 2025, 4:13 PM

#

indigo hazel no problem. bro it's taking credits for a video i shared 😦

nuuuuu

#

😦

indigo hazel Jul 11, 2025, 4:13 PM

#

torn mantle nuuuuu

i sent the second, the continue of that video if you are interested into

tepid lynx Jul 11, 2025, 4:13 PM

#

indigo hazel if you need, i can send the link for the videos where he tested also r1-0528, ma...

Even if he improved the answer in a few prompts, Grok 4 will remain a piece of sh t in history

torn mantle Jul 11, 2025, 4:13 PM

#

indigo hazel i sent the second, the continue of that video if you are interested into

yes

#

send

rare python Jul 11, 2025, 4:14 PM

#

torn mantle send

#

This is peak

torn mantle Jul 11, 2025, 4:14 PM

#

rare python

lol

indigo hazel Jul 11, 2025, 4:14 PM

#

torn mantle send

https://www.youtube.com/watch?v=kxRlN0legRY

YouTube

Discover AI

GROK 4 on Logic? MY TEST - PART 2

ELON says GROK 4 is not yet fully optimized for reasoning? NO PROBLEM - We'll FIX IT!

We'll optimize causal reasoning of GROK 4 right now, right here. 2nd part of my video where I test the causal reasoning performance of GROK 4.
Multiple runs, check the GROK 4 internal assumptions and boundary conditions imposed by the system itself and give it...

torn mantle Jul 11, 2025, 4:14 PM

#

yea sometimes its stuck in a loop generating base64

torn mantle Jul 11, 2025, 4:14 PM

#

indigo hazel https://www.youtube.com/watch?v=kxRlN0legRY

ty

rare python Jul 11, 2025, 4:14 PM

#

torn mantle yea sometimes its stuck in a loop generating base64

GA btw

#

kek

#

How has DeepMind not found this bug?

torn mantle Jul 11, 2025, 4:17 PM

#

they know

#

i guess its just hard to fix

#

yes

verbal nimbus Jul 11, 2025, 4:18 PM

#

torn mantle yea sometimes its stuck in a loop generating base64

I encountered that with Claude too, lol

gusty helm Jul 11, 2025, 4:21 PM

#

indigo hazel https://www.youtube.com/watch?v=kxRlN0legRY

I think this is relevant/easy to skip over it

indigo hazel Jul 11, 2025, 4:22 PM

#

gusty helm I think this is relevant/easy to skip over it

oh you are watching the first video, ok

gusty helm Jul 11, 2025, 4:22 PM

#

ah sry, tagged the wrong reply 😄

verbal nimbus Jul 11, 2025, 4:23 PM

#

Probably same fate as GPT-4.5

balmy mist Jul 11, 2025, 4:26 PM

#

should I buy grok heavy?

#

lmaoo

#

i knew you would say that

#

did you buy it?

#

we might need a subscription that has all top ai models like the max plan for all, similar to how we did with cable back in the day, there are like 5 different companies that have $100 plus plans

#

grok, claude, perplex, google, and openai

#

but that comet browser seems interesting by perplex

pure anvil Jul 11, 2025, 4:29 PM

#

balmy mist grok, claude, perplex, google, and openai

it would be called Soylent AI max

balmy mist Jul 11, 2025, 4:29 PM

#

i heard ppl saying o3 pro is better than grok heavy

balmy mist Jul 11, 2025, 4:29 PM

#

pure anvil it would be called Soylent AI max

lol i like htat

#

that*

#

craig if you can list which models are best for which what would you do? like lets do grok, gemini, claude and o3
ik claude is code
o3 maybe everyday use
gemini general? idk

#

we should actually make a internal poll or leaderboard for this in terms of vibes for these metrics instead of benchmarks

pure anvil Jul 11, 2025, 4:33 PM

#

Gemini is good for image input

balmy mist Jul 11, 2025, 4:35 PM

#

im similar but use o3 for mostly everything, and gemini for coding since its free, i dont use claude or grok

#

i dont have pro so i cant use 4.5 that much

#

i aint gonna lie, i think i like o3 the best

#

gemini is cool, but o3 just feels natural when i use it and its so much faster than it used to be

#

please dooooo

#

i cant imagine what they would charge for grok 4 heavy lol

steel quail Jul 11, 2025, 4:59 PM

#

lmarena actually broken

indigo hazel Jul 11, 2025, 5:00 PM

#

steel quail lmarena actually broken

that's the proof that it cant use tools? because otherwise it would use web search

zealous panther Jul 11, 2025, 5:08 PM

#

yep

steel quail Jul 11, 2025, 5:08 PM

#

lmarena is using a broken api

steel quail Jul 11, 2025, 5:09 PM

#

indigo hazel that's the proof that it cant use tools? because otherwise it would use web sear...

indigo hazel Jul 11, 2025, 5:09 PM

#

steel quail

based on my chat it said october 2023 lmao

steel quail Jul 11, 2025, 5:09 PM

#

api used is broken lol making it look not even grok4

#

looks a 2/3 mix

rare python Jul 11, 2025, 5:10 PM

#

https://tenor.com/view/baby-facepalm-raxalxshivar-anano-ananofacepalm-gif-12554045231795655384

Tenor

indigo hazel Jul 11, 2025, 5:10 PM

#

but they have to fix this thing of tools or grok wont be able to use tools forever in lmarena?

rare python Jul 11, 2025, 5:12 PM

#

gusty helm Jul 11, 2025, 5:14 PM

#

try same question on gemini

#

it thinks the same (joe biden and its 2023/2024)

#

it looks like an intentional limitation

rare python Jul 11, 2025, 5:16 PM

#

gusty helm try same question on gemini

It seems Gemini 2.5 Pro's knowledge density is around second half of 2024

#

You will get answer around May, June, July or even October

echo aurora Jul 11, 2025, 5:28 PM

#

steel quail lmarena actually broken

Hey wanted to point to this forum post where we've been discussing this #1393024188356362340 More importantly this message that has been forwarded already, but wanted to ensure you're seeing it.

solar hollow Jul 11, 2025, 5:58 PM

#

Poll: based on your testing so far, is Grok 4 the best model right now?

Winning answer (option 2): "no", 13 of 17 votes

whole wagon Jul 11, 2025, 6:01 PM

#

https://x.com/apples_jimmy/status/1943479993746530450
I have also been hearing the same

Jimmy Apples 🍎/acc (@apples_jimmy)

Hearing a few whispers now from birds that internal evals are having gpt5 a tad over grok 4 Heavy.

Evals only tell one side to a model however, curious to see if we get any major agentic or other improvements.

cedar tide Jul 11, 2025, 6:01 PM

#

Im waiting for Kimi k2 thinking & vision

whole wagon Jul 11, 2025, 6:02 PM

#

whole wagon https://x.com/apples_jimmy/status/1943479993746530450 I have also been hearing t...

which is astonishing, xAI is 1.5 years old

storm needle Jul 11, 2025, 6:03 PM

#

whole wagon https://x.com/apples_jimmy/status/1943479993746530450 I have also been hearing t...

this account is not trustworthy

whole wagon Jul 11, 2025, 6:03 PM

#

he is an openai fanboi

#

so wouldnt expect him to say this

#

the open source model is clear efficiency SOTA. I guess they struggled to scale it up for whatever reason

#

Or maybe it's all just compressing together

#

The performance of different ai companies seems to be getting closer

alpine coral Jul 11, 2025, 6:22 PM

#

did very poorly on the one question set i gave it

whole wagon Jul 11, 2025, 6:27 PM

#

Simple bench is so slow to add new models

keen fulcrum Jul 11, 2025, 6:27 PM

#

Poll: How will OpenAI new open source reasoning model perform?

Winning answer (option 3): "Better than o4-mini, worse than o3.", 10 of 11 votes

whole wagon Jul 11, 2025, 6:27 PM

#

By the time they add the new model all the hype already died

keen fulcrum Jul 11, 2025, 6:31 PM

#

whole wagon Simple bench is so slow to add new models

Just a hobby of him

whole wagon Jul 11, 2025, 6:34 PM

#

Well not really a hobby, he has entire YouTube channel and patreon lol

wind moth Jul 11, 2025, 6:37 PM

#

will grok 4 top lmarena?

whole wagon Jul 11, 2025, 6:41 PM

#

I doubt

#

Though that doesn't necessarily mean it isnt the SOTA lol

#

Because Google optimised for llm arena

#

Grok 4 didn't optimise for it at all they had no secret model

brittle tiger Jul 11, 2025, 6:43 PM

#

whole wagon Well not really a hobby, he has entire YouTube channel and patreon lol

Best of the wide audience AI YouTubers by far

torn mantle Jul 11, 2025, 6:48 PM

#

https://polymarket.com/event/which-company-has-best-ai-model-end-of-july

Polymarket

Which company has best AI model end of July?

Polymarket | This market will resolve according to the company which owns the model which has the highest arena score based off the Chatbot Arena LLM Leaderb...

#

lmao

#

actually its 71% google

sacred quail Jul 11, 2025, 6:49 PM

#

I remember google working on a new reasoning technique

#

It was "tree" something

torn mantle Jul 11, 2025, 6:49 PM

#

kimi k2 should be added too asap

#

it could easily get like top 6

#

top 8 maybe

sacred quail Jul 11, 2025, 6:49 PM

#

Why

torn mantle Jul 11, 2025, 6:49 PM

#

sacred quail It was "tree" something

its on deep think

#

its the equivalent of big brain in grok 4

#

aka heavy thinking

torn mantle Jul 11, 2025, 6:50 PM

#

sacred quail Why

because its a good model

sacred quail Jul 11, 2025, 6:50 PM

#

When i used kimi 1.5

#

It was nice but kinda mid

#

More abilities than deepseek but worse language and outputs

torn mantle Jul 11, 2025, 6:51 PM

#

yea kimi 1.5 was alright

#

but they improved a lot on k2

sacred quail Jul 11, 2025, 6:51 PM

#

hmmmmmm... Maybe i should check

unborn ocean Jul 11, 2025, 6:51 PM

#

K2 is way larger and better in general (vs 1.5)

torn mantle Jul 11, 2025, 6:52 PM

#

howso

steel quail Jul 11, 2025, 6:52 PM

#

whole wagon Because Google optimised for llm arena

this

torn mantle Jul 11, 2025, 6:52 PM

#

how should i read your name btw @unborn ocean

#

is it "not so"

#

or like a merged name

#

or it doesnt matter?

#

mm

unborn ocean Jul 11, 2025, 6:52 PM

#

Split I guess

torn mantle Jul 11, 2025, 6:53 PM

#

i see

unborn ocean Jul 11, 2025, 6:53 PM

#

But I mean it obviously doesn’t matter

unborn ocean Jul 11, 2025, 6:53 PM

#

torn mantle howso

Was that about what I said?

sacred quail Jul 11, 2025, 6:54 PM

#

@torn mantle is kimi k2 better than latest qwen ? Did you check

torn mantle Jul 11, 2025, 6:56 PM

#

unborn ocean Was that about what I said?

no

torn mantle Jul 11, 2025, 6:56 PM

#

sacred quail @torn mantle is kimi k2 better than latest qwen ? Did you check

idk it feels better

#

but its a non reasoning model

#

its kinda like grok 3

#

smart without being a reasoner

sacred quail Jul 11, 2025, 6:57 PM

#

i like good base models

potent snow Jul 11, 2025, 7:09 PM

#

is grok 4 image generator better than grok 3?

woeful viper Jul 11, 2025, 7:13 PM

#

Is there an issue with grok 4 in lmarena ? I can't send an image in direct chat, but grok 4 has capability to deal with images ?

haughty siren Jul 11, 2025, 7:14 PM

#

woeful viper Is there an issue with grok 4 in lmarena ? I can't send an image in direct chat,...

They said they are getting that fixed.

woeful viper Jul 11, 2025, 7:14 PM

#

haughty siren They said they are getting that fixed.

Oh thank you

haughty siren Jul 11, 2025, 7:14 PM

#

for lmarena*

torn mantle Jul 11, 2025, 7:30 PM

#

https://x.com/chetaslua/status/1943679519187149084

Chetaslua (@chetaslua)

🚨 Google Gemini leak reveals advanced #AgenticAI features:

🔹 Task Modes:

* AGENTIC_TASK: Autonomous agents plan & execute workflows.
* DUPLEX_TALK_TO_AGENT: Voice-call automation (e.g., bookings).
* DUPLEX_TRIGGER_AND_POLL: Automated polling for long tasks.

🔹 Immersive

keen beacon Jul 11, 2025, 7:58 PM

#

@echo aurora I just want to take a moment to give a huge heartfelt thank you to lmarena.ai and the incredible team behind it. The fact that you're making access to the latest paid models available for free is not just generous it's deeply humane. In a time when knowledge and creativity are increasingly locked behind paywalls, what you're doing is nothing short of empowering.
You're helping so many of us learn, create, and grow, without any barrier. It's rare to see something this generous. Much love and endless respect.

echo aurora Jul 11, 2025, 8:01 PM

#

keen beacon @echo aurora I just want to take a moment to give a huge heartfelt than...

Thank you so much for the kind words! This means a lot and has absolutely made my day! I'm going to share with the team right now. Thank you!! heartthrow

coral vigil Jul 11, 2025, 8:08 PM

#

Is there a way to view historical leaderboard data at LMArena?

echo aurora Jul 11, 2025, 8:14 PM

#

coral vigil Is there a way to view historical leaderboard data at LMArena?

yes - https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/main

lmarena-ai/chatbot-arena-leaderboard at main

coral vigil Jul 11, 2025, 8:14 PM

#

echo aurora yes - https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard/tree/ma...

wonderful, tysm

whole wagon Jul 11, 2025, 8:18 PM

#

Would be nice if it was possible in UI

echo aurora Jul 11, 2025, 8:25 PM

#

whole wagon Would be nice if it was possible in UI

Yeah we have been chatting about it a little bit in #1391779409362419902 feel free to share additional feedback!

#

Reminder we have a contest running right now!! #announcements message

We want more cows in space please!

unborn ocean Jul 11, 2025, 8:29 PM

#

https://huggingface.co/moonshotai/Kimi-K2-Instruct

#

very thorough benchmarking

#

but wild is right, can't compete with the large us labs

whole wagon Jul 11, 2025, 8:31 PM

#

They aren't

#

Grok 4 is 2T

unborn ocean Jul 11, 2025, 8:31 PM

#

does not have to be, can also be more compute or larger experts (kimi k2 has only 32b active params) / different configuration (though typically a high simpleqa = large model)

whole wagon Jul 11, 2025, 8:32 PM

#

It's 2.7 T if you count all feed-forward plus attention experts (every shard on disk).

It's 1.7 T if you count only the parameters that sit inside the experts (Skip KV-cache buffers, embeddings, and some routing weights).

#

That's the exact figures for you lol

#

Well I got it from a primary source but it's been talked about by a few xAI engineers like https://medium.com/predict/the-emergence-of-grok-4-a-deep-dive-into-xais-flagship-ai-model-eda5d500e4e7

Medium

The Emergence of Grok 4: A Deep Dive into xAI’s Flagship AI Model

The artificial intelligence landscape continues its rapid evolution, marked by significant advancements in large language models (LLMs).

#

They are relaxed about sharing higher level arch details

#

They only trained grok 3 on 12.8T tokens also

#

I assume it's not as sparse as the open source models

#

To take all that compute for 12.8T training

ocean vortex Jul 11, 2025, 8:39 PM

#

wtf is happening

#

@unborn ocean ?

hollow ocean Jul 11, 2025, 8:39 PM

#

mogged by grok 4 heavy

whole wagon Jul 11, 2025, 8:40 PM

#

Average meltdown over anyone talking about grok

ocean vortex Jul 11, 2025, 8:40 PM

#

whole wagon Well I got it from a primary source but it's been talked about by a few xAI engi...

grok4 is a sh'it model

whole wagon Jul 11, 2025, 8:40 PM

#

😂

hollow ocean Jul 11, 2025, 8:41 PM

#

count the Elon haters

unborn ocean Jul 11, 2025, 8:41 PM

#

yeah, that was just too much speculations, trust me bro and jumping from one thing to the other with one gazillion unseen assumptions...

#

got nothing to do with grok

#

thought it would be fun to make u aware of it

#

well we can argue about the educated part

#

the point is i do not and you do not either

#

which is why the best guess is none / a very conservative one, instead of gemini 2.5 pro = 6t, to maintain best possibilities of being right

#

people like to overestimate their own ability to predict -> they are worse than uneducated people (that thus don't overestimate, because they are aware of their own unfamiliarity with the topic)

#

that just pick the safe stuff

#

not on all things, but especially on things like these

#

yeah, i changed the comment for a reason dude

#

well, sometimes people have to hear it regardless

#

if i had to guess, i would not want to rule out something very close to or above 1t

#

just because of how sparse many of them have become

#

but the safe option is probably within that ballpark

ocean vortex Jul 11, 2025, 8:52 PM

#

Dork4 describing how to deport Musk

#

hollow ocean Jul 11, 2025, 8:52 PM

#

What other model has similar writing style or better than GPT 4.5

unborn ocean Jul 11, 2025, 8:53 PM

#

well google has a model larger than 2.5 pro

hollow ocean Jul 11, 2025, 8:53 PM

#

GPT 4.5 is just different

#

More natural

ocean vortex Jul 11, 2025, 8:54 PM

#

Nah he must go 😇

ocean vortex Jul 11, 2025, 8:55 PM

#

hollow ocean What other model has similar writing style or better than GPT 4.5

Opus4 the closest

hollow ocean Jul 11, 2025, 8:55 PM

#

ocean vortex Opus4 the closest

Yeah

unborn ocean Jul 11, 2025, 8:56 PM

#

10tn might be a bit large, scaling a model effectively to a size like that, idk

ocean vortex Jul 11, 2025, 8:56 PM

#

Dunno about that but they most certainly do not have one that performs better than 2.5Pro. They do have 1.0 Ultra which is bigger than Pro but it performs like crap. Not really better than 1.5Pro

unborn ocean Jul 11, 2025, 8:56 PM

#

especially the big model you want to distill from

#

they are likely not in a lower quantisation (because of distilling and stuff like that), so a 10t model would be insane

ocean vortex Jul 11, 2025, 8:57 PM

#

Dunno about "by and large", I think it was somewhat overhyped to be completely honest

#

they haven't finished post-training of 3.0 yet afaik

#

99% they are training it now

unborn ocean Jul 11, 2025, 8:59 PM

#

100%

ocean vortex Jul 11, 2025, 8:59 PM

#

-1

#

behemoth is a disaster

#

No lab should follow in those footsteps lmao

unborn ocean Jul 11, 2025, 9:00 PM

#

meta is trying to match the big labs compute wise and has been known to heavily brute force for the llama 4 training

#

so they are probably more a sign that the other model are a bit smaller

ocean vortex Jul 11, 2025, 9:01 PM

#

It's still "training". Because it's way too big to make sense

#

epic failure

unborn ocean Jul 11, 2025, 9:02 PM

#

the reports on it talk about them using it for distilling and basically scrambling to get something out of the compute invested, by training the smaller models etc.

ocean vortex Jul 11, 2025, 9:02 PM

#

Yeah but they did that way before there was any competition and way before anyone even knew how to do it

#

it's now 2025

#

not 2023

unborn ocean Jul 11, 2025, 9:03 PM

#

they should have been ready

#

and those two are fails

#

massive compute went in

#

scout got like 40t or 60t idk tokens training ( don't remember)

keen fulcrum Jul 11, 2025, 9:04 PM

#

Grok 4

ocean vortex Jul 11, 2025, 9:04 PM

#

Deepseek started training of their model I believe later than meta. The main difference is their model was not oversized to oblivion

#

so it didn't take ages to train

#

wdym. R1 is much smaller than Behemoth - fact

#

And they are training Behemoth for ages now

#

Deepseek released AND updated R1

#

in the meantime

unborn ocean Jul 11, 2025, 9:06 PM

#

ocean vortex Deepseek released AND updated R1

if deepseek had the meta compute 😳

ocean vortex Jul 11, 2025, 9:07 PM

#

Well that's what I mean. Meta failed so hard that we just HAVE TO talk about a model that is not even released lmao

#

In no way shape or form you denied what I said earlier. My point still stands lol

#

Behemoth is a disaster

#

At the end of the day, no one really cares about parameters. People only care about performance

dawn wharf Jul 11, 2025, 9:08 PM

#

ocean vortex Jul 11, 2025, 9:09 PM

#

And R1 already performs better than Behemoth ever could hope for, tbh

dawn wharf Jul 11, 2025, 9:09 PM

#

where's huawei's GPUs

hazy quest Jul 11, 2025, 9:10 PM

#

Hey guys, got again the "Which response do you prefer?" on AI Studio, within 5min of use. Last time I got it was this morning also very quickly, and the previous time before was a loooong while ago.

What about you, has it popped for you too today?

ocean vortex Jul 11, 2025, 9:10 PM

#

It is a valid comparison because it scales exponentially. For Meta it's taking longer DESPITE having infinitely more compute lol

unborn ocean Jul 11, 2025, 9:10 PM

#

they have more (not much, but a bit more than 10k)
the training only took like 2 weeks for v3 (obv just one run, they def did multiple test runs, tweaks etc.)

#

so idk what you are talking about

ocean vortex Jul 11, 2025, 9:12 PM

#

unborn ocean they have more (not much, but a bit more than 10k) the training only took like 2...

They for sure have 'undocumented' GPUs yeah, still not as much compute as Meta though tbf. But I think this clearly showed compute is not everything. You can have all the compute in the world and still waste it all on Behemoth type of model, get beaten by the smaller players

unborn ocean Jul 11, 2025, 9:13 PM

#

e.g. the gemini ultra vs 1.5 pro moment was also legendary

ocean vortex Jul 11, 2025, 9:14 PM

#

Do you really think they are gonna tell you officially about all the compute that they have? lmao

#

Also, this is China

unborn ocean Jul 11, 2025, 9:14 PM

#

brains > compute (sometimes)

ocean vortex Jul 11, 2025, 9:14 PM

#

we are talking about

young verge Jul 11, 2025, 9:15 PM

#

Hey, is grok 4 really sota? or did they game the benchmarks. When will we see it in the LMSYS arena leaderboard?

ocean vortex Jul 11, 2025, 9:15 PM

#

Never said a million, but believing some official figure from China is silly. I don't think it would have been possible to do what they did in such a short period of time. Then host the model and train the updated version at the same time, just with this, tbh.

ocean vortex Jul 11, 2025, 9:18 PM

#

young verge Hey, is grok 4 really sota? or did they game the benchmarks. When will we see it...

Thus far leaning strongly towards "game". I have probably never seen a model with such a big disconnect between reported benchmark numbers and the actual IRL performance and testing (same testing like I do with all the other models....)

unborn ocean Jul 11, 2025, 9:18 PM

#

well a lot of smart people are not just randomly in a room together, deepseek hires these young grad students from top chinese unis for a lot of cash
-> there is no way these people don't get the compute they need (not saying there is no constraint though)

young verge Jul 11, 2025, 9:19 PM

#

Post training on validation set is all you need

#

😄

ocean vortex Jul 11, 2025, 9:19 PM

#

Grok3 was good, Grok4?... Can only say negative things about it to be completely honest...

#

Not only it doesn't perform well in reasoning, but the whole fine-tuning is super weird. Reasons for 50k tokens which are all hidden, then gives you 1 worded response. Wtf?? 💀

young verge Jul 11, 2025, 9:21 PM

#

When I first saw the benchmarks and that they scaled post training thought they had a break through but guess not lol

unborn ocean Jul 11, 2025, 9:22 PM

#

ocean vortex Not only it doesn't perform well in reasoning, but the whole fine-tuning is supe...

i think they just lack the smart guys for post training and reasoning that the other labs (and now also meta) have

#

(or have yet to build their people)

#

but imagine researching new stuff in a 80h work week seems unreasonable for most (bc it is)

ocean vortex Jul 11, 2025, 9:24 PM

#

unborn ocean i think they just lack the smart guys for post training and reasoning that the ot...

That's a shame. Cause ironically Grok3 non-reasoning was almost revolutionary at the time for what it was. It essentially bridged the gap and did almost reasoning like responses. So you saw everything, and it was very flexible with response lengths. Also strangely the new one does fail some things the old one didn't...

unborn ocean Jul 11, 2025, 9:25 PM

#

nah, "we" (i am not from the us) still do that...

#

though i think that deepseek has a smart strat when it comes to this

unborn ocean Jul 11, 2025, 9:26 PM

#

ocean vortex That's a shame. Cause ironically Grok3 non-reasoning was almost revolutionary at...

jup, was also hoping for more, because i was mainly expecting they would do more cpt or a fresh-ish pretrain idk

ocean vortex Jul 11, 2025, 9:26 PM

#

New one is 99% of the response hidden, then the remaining 1% is the final output, which has a high chance of being incorrect (comparing with o3 or 2.5Pro which tend to not only give the correct answer but also provide thinking summary + detailed answer)

unborn ocean Jul 11, 2025, 9:28 PM

#

probably was not articulating it right: people still get hired for imo stuff (world wide), though in the ai world, anyone with a good uni degree behind em gets a good offer

#

the most typical area where a lot of people get hired like that is quant finance though

ocean vortex Jul 11, 2025, 9:30 PM

#

grok3 vs grok4 in a nutshell:

unborn ocean Jul 11, 2025, 9:30 PM

#

like 90% of the smaller math events i know of are sponsored by em

ocean vortex Jul 11, 2025, 9:30 PM

#

Like you can't even make this sht up lmao

#

in what universe is this an acceptable answer...

#

💀

unborn ocean Jul 11, 2025, 9:32 PM

#

well from what i hear, it is still definitely a route for many, though honestly the performance on stuff like that is just less relevant in a lot of these areas these days

#

and deepseek probably also does not care, they want young people with good ideas, imo is not the only way to measure that

#

especially when you are recruiting cs and engineering talent that might have never even attended a math club or the like (and does not need to)

ocean vortex Jul 11, 2025, 9:34 PM

#

Grok4 reminds me a bit of og gpt4 with no system prompt

#

extremely concise

#

except GPT4 performed very well at the time even despite that...

dawn wharf Jul 11, 2025, 9:34 PM

#

ocean vortex in what universe is this an acceptable answer...

is it correct?

unborn ocean Jul 11, 2025, 9:34 PM

#

bro turned into bing chat, lol

ocean vortex Jul 11, 2025, 9:34 PM

#

dawn wharf is it correct?

it's the opposite of correct

#

lol

#

Grok3 was correct though

dawn wharf Jul 11, 2025, 9:35 PM

#

ocean vortex grok3 vs grok4 in a nutshell:

oh didn't even read the prompt💀

unborn ocean Jul 11, 2025, 9:35 PM

#

@ornate agate, i don't really spend time in these communities anyways

#

a couple of my friends got in like that (might just be the uni degree in their cases)

ocean vortex Jul 11, 2025, 9:37 PM

#

Summaries are not perfect but that I would still classify as "acceptable". Hiding it completely however, is not... Especially when your final responses are so insanely concise

primal orbit Jul 11, 2025, 9:38 PM

#

https://i.snipboard.io/O8Dn7V.jpg

Gemini roasting me.

torn mantle Jul 11, 2025, 9:40 PM

#

ocean vortex grok3 vs grok4 in a nutshell:

kimi

#

k2

#

all models get it right

#

so far

ocean vortex Jul 11, 2025, 9:42 PM

#

torn mantle kimi

half pass. "This implies..." -- no it doesn't. smh

hollow ocean Jul 11, 2025, 9:43 PM

#

Grok 4 heavy mogs it

ocean vortex Jul 11, 2025, 9:43 PM

#

This would have been "fine", if it performed. But it doesn't seem to perform lol

ocean vortex Jul 11, 2025, 9:45 PM

#

hollow ocean Grok 4 heavy mogs it

That's potentially what ArtificialAnalysis was testing... Heavy/tools 💀

#

I really do not see how this public model can score that. Tempted to do some benchmark run myself from the ones they did but that's also going to be expensive af lol

hollow ocean Jul 11, 2025, 9:47 PM

#

ocean vortex I really do not see how this public model can score that. Tempted to do some ben...

You gotta try grok 4 heavy

#

It’s the real deal

leaden palm Jul 11, 2025, 9:48 PM

#

hollow ocean Grok 4 heavy mogs it

are we comparing grok 4 heavy to kimi k2

#

that's like comparing it to v3

ocean vortex Jul 11, 2025, 9:48 PM

#

hollow ocean You gotta try grok 4 heavy

Not interested in that. It's too expensive to be useful for 90% of people. If grok4 the normal one is sh'it, then those other versions are kinda irrelevant tbh... might as well just use o3/tools/pro

hollow ocean Jul 11, 2025, 9:48 PM

#

leaden palm are we comparing grok 4 heavy to kimi k2

🤷‍♂️

hollow ocean Jul 11, 2025, 9:49 PM

#

ocean vortex Not interested in that. It's too expensive to be useful for 90% of people. If gr...

Regular grok is good

ocean vortex Jul 11, 2025, 9:49 PM

#

hollow ocean Regular grok is good

Scroll up. I think I made it fairly clear why I think it's sh'it lol

#

That prompt was one of many it does fail. I just don't see it performing tbh

hollow ocean Jul 11, 2025, 9:50 PM

#

ocean vortex Scroll up. I think I made it fairly clear why I think it's sh'it lol

Give me the prompt

ocean vortex Jul 11, 2025, 9:51 PM

#

hollow ocean Give me the prompt

it's in the ss

leaden palm Jul 11, 2025, 9:51 PM

#

i also find grok 4 (the model you get on the api) uninteresting but from what i've seen, when you're using it from grok.com, it's decent

ocean vortex Jul 11, 2025, 9:52 PM

#

leaden palm i also find grok 4 (the model you get on the api) uninteresting but from what i'...

that's the same model + system prompt. Well and tools

#

o3 is good with none of that, so is 2.5Pro, so... 🤷‍♂️

leaden palm Jul 11, 2025, 9:52 PM

#

ocean vortex that's the same model + system prompt. Well and tools

it gets tools, you get thinking summaries

api has neither

ocean vortex Jul 11, 2025, 9:53 PM

#

leaden palm it gets tools, you get thinking summaries api has neither

I expect it to do good with no tools. And thinking summaries you do get with Google or OA API

leaden palm Jul 11, 2025, 9:53 PM

#

i can kinda see why it does that

#

like imagine being an ai assistant

#

you have no opinions

#

then someone asks you of your opinion

#

all you know is that you're grok by xai and you want to be consistent with what you've said before

#

so you search what grok says and what your creator says

main gulch Jul 11, 2025, 9:54 PM

#

do you mean by "serious" asking Grok their opinion about Israel/Palestine?

lone vector Jul 11, 2025, 9:55 PM

#

that was quick

hollow ocean Jul 11, 2025, 9:55 PM

#

No other model can do this tho

ocean vortex Jul 11, 2025, 9:55 PM

#

And really... if you add tools into the equation, testing things like math skills becomes completely irrelevant. You are testing the python env at this point rather than the model lol

ocean vortex Jul 11, 2025, 9:57 PM

#

lone vector that was quick

LMAO. I called it 70% way before they reacted 😇

#

#general message

ocean vortex Jul 11, 2025, 10:02 PM

#

lone vector that was quick

wait I read it wrong 🤯

#

only 20% for xAI, that is insane. This just made me bet on xAI even though I hate Grok4 lmfao

#

it still has a very reasonable chance to top lmarena

lone vector Jul 11, 2025, 10:04 PM

#

as long as nobody releases another model for 3 weeks

#

20% is pretty low though

ocean vortex Jul 11, 2025, 10:04 PM

#

chatgpt-latest almost has that spot... the bar is not very high. Just some system prompt I think and their usual response formatting which tended to do very decent on lmarena (with grok3)

dusty hazel Jul 11, 2025, 10:05 PM

#

main gulch do you mean by "serious" asking Grok their opinion about Israel/Palestine?

It says that Clarion is by SednaEarfit, and Divinus eartips are just Divinus eartips, it doesn't know that this brand has many models

It's not serious to ask it about anything really

#

Been the #1... For a week

ocean vortex Jul 11, 2025, 10:06 PM

#

ok I'm not sure this is what I had in mind, but I suppose this is better than blank system prompt 😂

#

It's crazy though that this is not the same (or the variation of) the system prompt used on grok website

#

One would think they have the highest chances with that

main gulch Jul 11, 2025, 10:29 PM

#

so wolfstride is Ultra?

sharp olive Jul 11, 2025, 10:30 PM

#

Is the Grok 4 version of Chatbot Arena the reasoning one?

main gulch Jul 11, 2025, 10:30 PM

#

Grok 4 is reasoning-only

elder rapids Jul 11, 2025, 10:30 PM

#

main gulch so wolfstride is Ultra?

why do people hate this model

#

in a lot of my testing it beats kingfall

main gulch Jul 11, 2025, 10:30 PM

#

they love kingfall

elder rapids Jul 11, 2025, 10:31 PM

#

pretty badly

sharp olive Jul 11, 2025, 10:31 PM

#

main gulch Grok 4 is reasoning-only

Like Gemini 2.5 Pro?

elder rapids Jul 11, 2025, 10:31 PM

#

stonebloom did pretty well universally and it was more concise

#

wolfstride needs to be tweaked tho

main gulch Jul 11, 2025, 10:31 PM

#

sharp olive Like Gemini 2.5 Pro?

yeah but 2.5 Pro has thinking budget unlike Grok

elder rapids Jul 11, 2025, 10:32 PM

#

depends if it even is its own model

#

pretty sure it would be

#

and thinking variance would be the differentiator

main gulch Jul 11, 2025, 10:32 PM

#

I think there will be many models under the same brand

elder rapids Jul 11, 2025, 10:32 PM

#

but still we don't know

#

btw didn't they say all tiers would have access to gpt 5

#

unlimited too

main gulch Jul 11, 2025, 10:34 PM

#

are you talking about the March or May checkpoint?

#

March was better than May

elder rapids Jul 11, 2025, 10:34 PM

#

what do you think the benchmarks are going to be

#

I'm ngl the ultra lineup is definitely the best I've used

#

out of all models

#

though 2.5 pro GA is similar

main gulch Jul 11, 2025, 10:35 PM

#

elder rapids what do you think the benchmarks are going to be

SOTA on HLE and SimpleQA, on par with Grok 4 in STEM

#

pricing is more interesting

elder rapids Jul 11, 2025, 10:36 PM

#

I agree

#

it'll probably be around 10 input

#

or 15 input

#

sum like 75 output

main gulch Jul 11, 2025, 10:37 PM

#

I think slightly cheaper than Opus 4

elder rapids Jul 11, 2025, 10:37 PM

#

yeah definitely

elder rapids Jul 11, 2025, 10:37 PM

#

main gulch I think slightly cheaper than Opus 4

than opus 4???

#

nah

#

it'll be more expensive

main gulch Jul 11, 2025, 10:38 PM

#

I always confuse SimpleQA with SimpleBench

elder rapids Jul 11, 2025, 10:38 PM

#

ye

main gulch Jul 11, 2025, 10:38 PM

#

but Ultra will be SOTA in both

misty vault Jul 11, 2025, 10:38 PM

#

ye

elder rapids Jul 11, 2025, 10:38 PM

#

wonder if theyre going to have major architecture changes to Gemini 3

main gulch Jul 11, 2025, 10:39 PM

#

will Gemini Diffusion go into GA

elder rapids Jul 11, 2025, 10:40 PM

#

that's what they're probably trying to figure out, I don't think it'd ever be the flagship model tho

main gulch Jul 11, 2025, 10:40 PM

#

or at least open preview with API

elder rapids Jul 11, 2025, 10:40 PM

#

i think it would be something the model calls to

main gulch Jul 11, 2025, 10:48 PM

#

Veo 4 should be a great release btw

#

and maybe more important for Google in PR terms than Gemini

elder rapids Jul 11, 2025, 10:49 PM

#

ye

#

veo 3 was the best thing to happen for Google in AI

#

imo Gemini 3 could be the largest leap they're trying to make

#

just simply based off the fact 2.5 was moved on with

#

and their claims of video understanding

#

and the ultra route they're taking

#

wonder how that'll affect its inherent spatial understanding in text

#

Gemini 4 might leave the realm of modern tokenizers at this rate

#

btw is there going to be an intermediate model between pro and flash?

main gulch Jul 11, 2025, 10:56 PM

#

elder rapids btw is there going to be an intermediate model between pro and flash?

there is no room for it in their new pricing

elder rapids Jul 11, 2025, 10:57 PM

#

I've been seeing people talk about some new "2.5 standard"

#

sounds like bs to me

main gulch Jul 11, 2025, 11:01 PM

#

elder rapids I've been seeing people talk about some new "2.5 standard"

this is hallucinated by AI in GitHub commit

sacred quail Jul 11, 2025, 11:03 PM

#

is grok 4 thinking too long or is it just server issue

#

on lmarena

main gulch Jul 11, 2025, 11:06 PM

#

sacred quail is grok 4 thinking too long or is it just server issue

the former, Grok 4 likes to overthink

hollow ocean Jul 11, 2025, 11:09 PM

#

https://www.youtube.com/watch?v=2USUfv7klr8

YouTube

Fireship

Grok 4 pushes humanity closer to AGI… but there’s a problem

Get 3 months of Sentry’s team plan free: https://sentry.io/fireship

Elon Musk has the 'trust me bro' benchmarks to prove that Grok 4 is the world's most powerful AI model. But just how well does it compare against competitors in real life scenarios? And is it still calling itself MechaHitler?

#Grok4 #Grok #elonmusk #coding #tech

💬 Chat w...

dawn wharf Jul 11, 2025, 11:09 PM

#

main gulch the former, Grok 4 likes to overthink

real

whole wagon Jul 11, 2025, 11:10 PM

#

https://www.theverge.com/openai/705999/google-windsurf-ceo-openai openAI takes yet another L

The Verge

OpenAI’s Windsurf deal is off — and Windsurf’s CEO is going t...

Key researchers are joining Google DeepMind, too.

#

“The talks between OpenAI to buy the startup for $3 billion ended in recent days after Windsurf’s team raised concerns over how the coding assistant would fit into the OpenAI and Microsoft agreement, which requires OpenAI to share its technology with Microsoft, according to two people familiar with the company’s discussions.”
- The Information

#

💀

wintry tinsel Jul 11, 2025, 11:29 PM

#

Grok 4 is such a benchmaxxer

#

What is it useful for?

#

Claude is better at coding and writing

#

It seems to be best at logic and math

whole wagon Jul 11, 2025, 11:36 PM

#

https://www.reuters.com/business/musks-xai-seeks-up-200-billion-valuation-next-fundraising-ft-reports-2025-07-11/

Reuters

Musk's xAI seeks up to $200 billion valuation in next funding round...

Elon Musk's xAI is preparing to raise more money from investors in a deal that could value the artificial-intelligence company between $170 billion and $200 billion, the Financial Times reported on Friday, citing people close to the discussions.

ocean vortex Jul 11, 2025, 11:56 PM

#

Dork fails creative writing 🧐

hollow ocean Jul 11, 2025, 11:59 PM

#

o3 pro best creative writer

sacred quail Jul 12, 2025, 12:08 AM

#

O3 is interesting choice

#

for creative writing

#

Too mechanical for me

torn mantle Jul 12, 2025, 12:13 AM

#

ocean vortex Dork fails creative writing 🧐

expected

#

nah i think it will be top 3

#

let me check the actual ranking

#

yea no3 for sure

dawn wharf Jul 12, 2025, 12:18 AM

#

ocean vortex Dork fails creative writing 🧐

what's the criteria

elder rapids Jul 12, 2025, 12:45 AM

#

ocean vortex Dork fails creative writing 🧐

how's it so low? wtf

#

what does "high" mean for grok 4

#

lmfao

torn mantle Jul 12, 2025, 12:48 AM

#

elder rapids how's it so low? wtf

trained to be truthful and accurate

elder rapids Jul 12, 2025, 12:48 AM

#

nah but I mean

#

it's not a bad model

#

so we can toss that idea aside

#

it just fails so much in practice

#

it's absurd

torn mantle Jul 12, 2025, 12:48 AM

#

no i mean part of that is the reason why its bad at creativity

elder rapids Jul 12, 2025, 12:48 AM

#

that's not the case

torn mantle Jul 12, 2025, 12:49 AM

#

it is the case

#

what are you talking about

elder rapids Jul 12, 2025, 12:49 AM

#

nah that's not how it works

torn mantle Jul 12, 2025, 12:49 AM

#

how does it work then

elder rapids Jul 12, 2025, 12:49 AM

#

not how claims work either

#

accuracy and technical things don't exclude creative ability

#

that's bs

#

seems like they rl'd too much for specific domains

#

rather than general abilities

torn mantle Jul 12, 2025, 12:53 AM

#

when a model is trained for maximum truthfulness then it will likely be optimized to select the most statistically probable and factually supported words = which leads to more predictable and less creative outputs

elder rapids Jul 12, 2025, 12:53 AM

#

that's not how it works

torn mantle Jul 12, 2025, 12:54 AM

#

thats how probability distribution works

#

and there is a tradeoff between truthfulness/accuracy & creativity, the question is how did they fix that

elder rapids Jul 12, 2025, 12:56 AM

#

torn mantle when a model is trained for maximum truthfulness then it will likely be optimize...

truthfulness is epistemic and can't be tied to statistical probability unless truthfulness is defined as pure statistical output (which isn't truth in the way we're defining it btw) and fact support is arbitrary. You're confusing temp and those things with its training

#

it would be impossible for these things to prevent creative ability, but the LACK of creative ability in the first place seems to be the reason for all this

#

aka they never rl'd it for any sort of creativity in the first place which is pretty bad for the model, so the question is how did they mess that up

torn mantle Jul 12, 2025, 1:00 AM

#

nah

dawn leaf Jul 12, 2025, 1:00 AM

#

elder rapids

Hmm...

torn mantle Jul 12, 2025, 1:00 AM

#

temp is just a sampling method applied on top of that already built distribution

#

training is what shapes that statistical distribution
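
A minimal sketch of the temperature point above, assuming made-up logits for three candidate tokens (the numbers and the function name `softmax_with_temperature` are illustrative, not taken from any real model). Training fixes the logits; temperature only rescales them at sampling time, so it sharpens or flattens the distribution without changing which token is ranked first.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits produced by a trained model for three tokens.
logits = [2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.5)     # sharper: top token dominates
neutral = softmax_with_temperature(logits, 1.0)  # the distribution as trained
hot = softmax_with_temperature(logits, 2.0)      # flatter: more variety

for name, probs in (("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in probs])
```

Lower temperature pushes probability toward the most likely token and higher temperature spreads it out, but the ranking set by training stays the same in all three cases.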

elder rapids Jul 12, 2025, 1:01 AM

#

I'm ngl you could just ask chatgpt for this

#

to explain what im saying

torn mantle Jul 12, 2025, 1:01 AM

#

you could use chatgpt to understand what im saying as well

#

you are missing the whole point here

elder rapids Jul 12, 2025, 1:02 AM

#

torn mantle you could use chatgpt to understand what im saying as well

no I do understand what you're saying, and I don't need chatgpt to break it down for me 😭

dawn leaf Jul 12, 2025, 1:02 AM

#

GPT 4.5 It's questionable to add.

elder rapids Jul 12, 2025, 1:02 AM

#

this is basic

#

you don't know basic

#

you need chatgpt

#

I do not

torn mantle Jul 12, 2025, 1:02 AM

#

i think you need gemini to dumb it down for you

elder rapids Jul 12, 2025, 1:02 AM

#

yo

#

you are not someone who works with AI

#

stop larping dawg

#

just ask chatgpt

#

😭

torn mantle Jul 12, 2025, 1:04 AM

#

just ask gemini

elder rapids Jul 12, 2025, 1:04 AM

#

to put what I'm saying to you in other words? 😭

#

you never said anything explicit to me for it to define

#

you're not correcting me, nor are you making a claim

#

you're just expressing a bad understanding of model training and alignment

torn mantle Jul 12, 2025, 1:06 AM

#

elder rapids you're just expressing a bad understanding of model training and alignment

thanks for using gemini

#

what was the prompt

elder rapids Jul 12, 2025, 1:06 AM

#

that's not even new Information

#

quit the ragebait

torn mantle Jul 12, 2025, 1:06 AM

#

eli5 what asura was saying?

elder rapids Jul 12, 2025, 1:07 AM

#

🫩

#

Craig level

torn mantle Jul 12, 2025, 1:08 AM

#

im jk

#

dont take it seriously

#

ignore him

elder rapids Jul 12, 2025, 1:17 AM

#

the level of troll

#

you say random shi

#

and that's what asura was doing

#

bugging asf

torn mantle Jul 12, 2025, 1:18 AM

#

https://x.com/sama/status/1943837550369812814

Sam Altman (@sama)

we planned to launch our open-weight model next week.

we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.

while we trust the community will build great things with this model, once weights are

#

idkd idkdkdi its 3 am

#

honestly i half-read what you said

#

but im still right tho

#

im always right

#

uhm

#

right

#

nod

#

agree

leaden palm Jul 12, 2025, 1:34 AM

#

torn mantle https://x.com/sama/status/1943837550369812814

he got ratiod

leaden palm Jul 12, 2025, 1:39 AM

#

leaden palm he got ratiod

kalo's point

#

they have openai merch and they have https://huggingface.co/ft-hf-o-c membership

ft-hf-o-c (yofo)

#

idk if they actually have the open model

whole wagon Jul 12, 2025, 1:44 AM

#

Openai is an actual meme

#

What additional safety? There are already strong open source models

#

And the world did not end

#

"we are not yet sure how long it will take us." not even giving an ETA

leaden palm Jul 12, 2025, 1:46 AM

#

this type of slop

hollow ocean Jul 12, 2025, 2:03 AM

#

o3 and o3 pro scores the highest on creative writing but no one uses them for writing 🤔

dawn wharf Jul 12, 2025, 2:07 AM

#

hollow ocean o3 and o3 pro scores the highest on creative writing but no one uses them for wr...

and you didn't hear anybody saying the benchmark is rigged🤔

hollow ocean Jul 12, 2025, 2:08 AM

#

dawn wharf and you didn't hear anybody saying the benchmark is rigged🤔

They use llms to rate it

dawn wharf Jul 12, 2025, 2:08 AM

#

hollow ocean They use llms to rate it

using ai to rate creative writing?💀

hollow ocean Jul 12, 2025, 2:08 AM

#

dawn wharf using ai to rate creative writing?💀

Yeah lemme show you

dawn wharf Jul 12, 2025, 2:09 AM

#

so basically it's a useless benchmark

hollow ocean Jul 12, 2025, 2:09 AM

#

dawn wharf so basically it's a useless benchmark

https://x.com/lechmazur/status/1943452576948580840?s=46&t=AH7sIlIv16Z3Kdb6j3cjfg

Lech Mazur (@LechMazur)

#

I’ve never heard of anyone say o3 or o3 pro is the best at creative writing

whole sundial Jul 12, 2025, 2:11 AM

#

kimi k2 wins "KyrieBench", first open source model to get it right (prev. o1/o3/gpt4.1 and grok 4)

#

i now agree that kimi k2 is the real reason why openai open source model was delayed

#

also first Chinese model to get it right

dawn wharf Jul 12, 2025, 2:12 AM

#

whole sundial i now agree that kimi k2 is the real reason why openai open source model was del...

no bro, It's for safety and shi

rare python Jul 12, 2025, 2:13 AM

#

yeah

dawn wharf Jul 12, 2025, 2:13 AM

#

because it can launch nukes or something idk

whole sundial Jul 12, 2025, 2:14 AM

#

i don't think the chinese are doing much safety testing outside of making sure it doesn't tell people what happened in 1989, definitely just a stupid OpenAI excuse to delay the model

#

until it is not relevant anymore

#

when i said "OpenAI", it should be "ClosedAI" lol

dawn wharf Jul 12, 2025, 2:16 AM

#

whole sundial i don't think the chinese are doing much safety testing outside of making sure i...

if K2 is actually the reason they delayed their model, then that means it's not good in the first place

whole sundial Jul 12, 2025, 2:16 AM

#

k2 reasoning and r2 will be better than their open source model

whole wagon Jul 12, 2025, 2:16 AM

#

I am confused. Why would K2 cause a delay

#

I don't think they were the open source SOTA anyways. Not in absolute performance

dawn wharf Jul 12, 2025, 2:17 AM

#

did they even say what the model's main goal is?

whole sundial Jul 12, 2025, 2:17 AM

#

model performs like gpt 4.1, add some reasoning and it could be better than o3

whole wagon Jul 12, 2025, 2:17 AM

#

The openAI model that is

#

The leaks are all like o4-mini level at best

#

I think there is another reason

#

That it's delayed

hollow ocean Jul 12, 2025, 2:21 AM

#

Yall still trust benchmarks or no

#

🤣

#

Infinite o3 pro glitch

#

Multiple emails

whole wagon Jul 12, 2025, 3:05 AM

#

😂

elder rapids Jul 12, 2025, 3:52 AM

#

leaden palm he got ratiod

this is stupid lmao

#

there's nothing about a small model like Sam Altman hinted at that needs to be compared to a 1T model

#

deadass

#

what's with the sensationalism getting worse here

elder rapids Jul 12, 2025, 3:59 AM

#

leaden palm kalo's point

this is literally just a horrible horrible take

#

😭😭

#

2 completely different caliber models

#

yeah o3 mini level

#

vs a 1T model

#

dumb comparison

#

you can't even begin to think that that model release somehow impacted theirs

#

he's right

#

once you release the weights

#

you can't unrelease them

#

it seems like nobody is trying to think about how impactful this could be to openAI's future

#

(VERY meaningful)

hollow ocean Jul 12, 2025, 4:05 AM

#

5 accounts with o3 pro and 4.5 @deep adder

#

Best method

#

One runs out move to the next account

#

Paying $1 is better

#

99% off

#

🤣

#

Google ultra accs are being sold for cheap

#

🤷‍♂️

#

Not buying them just saying it’s a thing

elder rapids Jul 12, 2025, 4:11 AM

#

the model has to be completely safe and smart enough (aka aligned) for it to cement itself, especially for its size. It doesn't have to be an insane model, or good model at all, but if it's not reliable just like their closed API then it's completely a waste. It's the first real example besides gpt-2 of a model they've developed that can be looked through. It also influences government decisions like open-weight performance to appeal to as a baseline from a partnered American company (and could influence laws etc etc), it also revives openAI open weight expectations and deflects legal claims against its closed off nature (and could reinforce its standard of "safety") it also re-cements openAI into the open source community and could create a very large pool of developers/researchers who choose to use the open source model

#

this is the biggest thing openAI could do tbh

leaden palm Jul 12, 2025, 4:19 AM

#

@deep adder @elder rapids on the "less than 100b params" and "small model like Sam Altman hinted at" point

*note the plural

#

ofc it could be relatively small compared to a 1t model

#

but "it's not a small model"

elder rapids Jul 12, 2025, 4:36 AM

#

leaden palm @deep adder @elder rapids on the "less than 100b params" and "...

who is he and how is what he said verifiable at all? and this all doesn't matter because the point isn't that you'd be comparing performance regardless, if it's not precisely a 1T model just like the Kimi model then it's simply wrong to say this is the cause, it could be 900B, it doesn't matter. I can also argue one's a reasoning model the other isn't so it's a completely refuted skepticism

#

I agree there could be nuance about its size though, the problem is Sam Altman has been all over Twitter clarifying hobbyist utility and it should be "ran on GPUs"

leaden palm Jul 12, 2025, 4:39 AM

#

i'm awaiting the openai open model but i find sam's activities odd

elder rapids Jul 12, 2025, 4:40 AM

#

they're in the same position as you with the skepticism lmao, big providers having access to the model isn't meaningful towards its alignment or sentiment towards delaying it a little bit. He also has no idea whether these big providers even have it

#

if they planned to release the model so soon obviously preparation would've been met, calling it off unexpectedly as he framed it would align with what I said

leaden palm Jul 12, 2025, 4:43 AM

#

perhaps i'm too trusting but i'd expect being well known and from a decently sized inference/dev org (lambda/nous) to indicate having connections and indicate that their claims are based in reality

elder rapids Jul 12, 2025, 4:44 AM

#

there's plenty of examples of people on Twitter in the AI space saying things and then getting corrected by the actual researchers

#

I forgot that dudes name

leaden palm Jul 12, 2025, 4:44 AM

#

elder rapids there's plenty of examples of people on Twitter in the AI space saying things an...

true...

elder rapids Jul 12, 2025, 4:46 AM

#

leaden palm true...

I'm ngl it's kind of scary too how much it would sway public opinion, especially in the AI subreddits until actual confirmation arrived

#

but yeah regardless, I'm not saying you can't trust him just off the fact of trends (even though it wouldn't be a bad inference) there's just a lot of things that implicitly support Sam's (or the openAI researchers...) position here over the skepticism people have about what Sam said

#

kinda actually pains me that nobody tries to look for practical validity in the AI Twitter space, like you can inductively infer something you know

#

instead of expressing the opposing claim outright

pulsar tendon Jul 12, 2025, 4:53 AM

#

leaden palm perhaps i'm too trusting but i'd expect being well known and from a decently siz...

tek is well known for nous and Yuchen is the ceo of a inference provider

#

Like I think their claims have validity

elder rapids Jul 12, 2025, 4:54 AM

#

it's not that they have no validity, it's just that however valid they are, none of them could know unless they were actually there and could frame it non speculatively like how they're doing

#

"literally never trust this guy" like bro what are we doing

#

it's not even that either, there's just more evidence for the non speculation than the pro speculation

whole sundial Jul 12, 2025, 5:02 AM

#

leaden palm i'm awaiting the openai open model but i find sam's activities odd

it's true, US labs make promises but don't keep them. Meanwhile a Chinese company (Baidu) made an explicit promise when they released ERNIE 4.5 in March that the model would be open sourced on June 30. They released everything related to ERNIE 4.5 on that date. Meanwhile xAI made a promise in February to open source Grok 2 "in the coming months". Grok 4 has come out and the weights for any xAI model other than the original Grok are nowhere to be found. OpenAI made a similar promise, but with a special model made for open source. Every time the model is close to release, they delay it for arbitrary reasons.

TL;DR: China keeps promises better than US labs.

elder rapids Jul 12, 2025, 5:04 AM

#

even tho I agree with him

whole sundial Jul 12, 2025, 5:05 AM

#

true, but China is better at playing catchup

elder rapids Jul 12, 2025, 5:05 AM

#

it's non evidence for anything other than "China companies cooler than America's!!"

#

and doesn't help the case here

whole sundial Jul 12, 2025, 5:10 AM

#

I'm not saying there are not US companies that commit to open source, Meta and Nvidia have done it, Google has released lesser versions of their flagship models, but China has just done it better. There is no other way I can say this. US focuses on commercialization and their models are popular in those environments but China just releases the models openly to get themselves in the door. Maybe they are trying to influence US opinions with their models, we don't know. But the point is China just does it better. I'm not saying Chinese models are better than US ones, but the difference is that the US ones are closed and many Chinese ones are open. And the ones from the US that are open are worse than China's.

echo aurora Jul 12, 2025, 5:10 AM

#

Lets keep things focussed on AI pls

whole sundial Jul 12, 2025, 5:10 AM

#

ok

#

oh are you talking about Craig?

#

I was just talking about AI

#

but there may come a day that Chinese models become better than US flagships. They are improving themselves; Kimi K2 is based on DeepSeek's arch but is still better than V3 due to its size. And yes, they do distill from US companies but they are doing it out of desperation, they want to catch up to the US. And now their reasoning models are close enough to US ones in performance that they can distill from their own. Yes, they are still trying to distill from US models because China can't create enough English synthetic data with their own models, but they won't need to do that anymore. And US companies like Meta are basing their models off of what Chinese companies have done because they need to catch up in the open source race. Llama 5 may very well beat China's models when it comes out, considering Meta poached some of AI's best minds. Yes, US models will keep on improving themselves and will still be very good models. But I think China is just accelerating the race and also helping to push the frontier. They just don't have models that beat the US yet because of GPU sanctions. US companies are taking China's work and using their resources to make it better.

#

yes, but they are doing research that US companies are applying to do so

#

lol people on twitter are saying that K2 is to blame for OpenAI's delay (I do believe them, because that model probably beats their open source models in things that just can't be done well with smaller models, like knowledge based benchmarks like SimpleQA)

#

which OpenAI invented btw

#

it's not like they are telling us, US companies don't like releasing papers anymore

whole wagon Jul 12, 2025, 5:29 AM

#

whole sundial lol people on twitter are saying that K2 is to blame for OpenAI's delay (I do be...

This is extremely illogical

#

K2 is not a reasoning model i would assume openAI can beat it with a reasoning model

#

Firstly

whole sundial Jul 12, 2025, 5:32 AM

#

it will soon become one though

#

they said they would make it reason and multimodal

whole wagon Jul 12, 2025, 5:37 AM

#

Who cares really. It's always going to get overtaken in a few months that's how things work

whole sundial Jul 12, 2025, 5:39 AM

#

true, it will be beaten by both US and other Chinese models

tidal schooner Jul 12, 2025, 5:49 AM

#

i wonder how k2.5 will fare tho

rare python Jul 12, 2025, 5:57 AM

#

tidal schooner i wonder how k2.5 will fare tho

2.5 what?

tidal schooner Jul 12, 2025, 5:58 AM

#

rare python 2.5 what?

k2.5

#

should clarify

rare python Jul 12, 2025, 5:58 AM

#

tidal schooner k2.5

ok, but will they make k2.5 or go to k3?

tidal schooner Jul 12, 2025, 5:59 AM

#

rare python ok, but will they make k2.5 or go to k3?

they had k1.5

rare python Jul 12, 2025, 6:00 AM

#

tidal schooner they had k1.5

Naming is random

#

They might see a leap big enough to call their next gen model k3

#

and hopefully k2.5 or k3 will be smaller

elder rapids Jul 12, 2025, 6:01 AM

#

whole wagon K2 is not a reasoning model i would assume openAI can beat it with a reasoning m...

ye

#

I think it's extremely braindead

#

a lot of what Kiri is saying is non evidence, closed AI and open source are indistinct in the context they're trying to make an argument for

#

US open source models being worse than China's (which isn't true, the only outlier is deepseek) isn't meaningful at all, especially when you consider total competition is in context relative to the respective countries

#

China having a different open source culture is obviously true, but it's not as meaningful as made out to be

harsh flume Jul 12, 2025, 6:07 AM

#

Anyone here tried deep research with Grok 4 Heavy?

elder rapids Jul 12, 2025, 6:07 AM

#

not sure why he mentioned Chinese models eventually being better than "US flagships" like the US isn't progressing at a much much faster rate than the Chinese models he's calling into question, and acting like China hasn't been benefiting the most from US open source, enabling their progress

harsh flume Jul 12, 2025, 6:07 AM

#

That's my only use case for Grok and i'm wondering if it performs better than std Grok 3/4

whole sundial Jul 12, 2025, 6:10 AM

#

ok that makes sense

#

I'm sure Chinese models will be right behind the US though

#

And I'm sure China is using US models for training or other purposes

#

they would still be behind if they didn't

elder rapids Jul 12, 2025, 6:14 AM

#

whole sundial I'm sure Chinese models will be right behind the US though

yeah definitely, it seems like they're the only real competitors, France hopping on open source early seemed to do wonders for their progress

#

countries like Japan seem to be more experimental

whole sundial Jul 12, 2025, 6:15 AM

#

by France you mean just Mistral

elder rapids Jul 12, 2025, 6:15 AM

#

yep

#

but that's enough

whole sundial Jul 12, 2025, 6:15 AM

#

and they have some closed source models, probably to get people to use their API, which is sad

#

but yes, mistral has done a lot for open source

#

7b is still one of the most famous and modified open source LLMs out there

#

Nemo is maintaining its legacy

harsh flume Jul 12, 2025, 6:16 AM

#

lol Grok 4 Heavy has no Deep Research tool 💀 😵 💀

#

Just torched 300$ apparently

elder rapids Jul 12, 2025, 6:17 AM

#

harsh flume Just torched 300$ apparently

damn

#

it's alr tho, they're adding a lot of stuff

#

so you might get a research agent eventually

elder rapids Jul 12, 2025, 6:17 AM

#

whole sundial but yes, mistral has done a lot for open source

ye

harsh flume Jul 12, 2025, 6:17 AM

#

I mean I cant help my curiosity and impulse bought to try some stuff

whole sundial Jul 12, 2025, 6:18 AM

#

their early MoE models likely inspired later MoEs like DeepSeek

harsh flume Jul 12, 2025, 6:18 AM

#

made a deep research styled prompt anyway and am running, hopefully it has research heuristics that are good tho not labeled as a selectable tool

whole sundial Jul 12, 2025, 6:18 AM

#

(like 8x22B)

elder rapids Jul 12, 2025, 6:18 AM

#

harsh flume made a deep research styled prompt anyway and am running, hopefully it has resea...

will be good

elder rapids Jul 12, 2025, 6:21 AM

#

whole sundial their early MoE models likely inspired later MoEs like DeepSeek

ye

#

I agree

#

at least in the open source

whole sundial Jul 12, 2025, 6:27 AM

#

Mistral models aren't great now though

#

I recently gave an easy prompt to Magistral Medium (the closed source commercial model!) and it went into an endless loop

tidal schooner Jul 12, 2025, 6:30 AM

#

whole sundial I recently gave an easy prompt to Magistral Medium (the closed source commercial...

magistral is just a token-spamming joke atp

harsh flume Jul 12, 2025, 6:45 AM

#

harsh flume made a deep research styled prompt anyway and am running, hopefully it has resea...

From what I gathered so far the progress bars are likely just a visual cue gimmick and don't really represent any linear function as to how much thinking its making

#

This one prompt ended in 4m

red sluice Jul 12, 2025, 6:53 AM

#

It's really too easy to spot an answer made by ChatGPT or Grok 4. It kinda kills the fun of the arena. The fact that LLMs show personality isn't bad as such, but it really makes the whole system rigged. If I want to penalize Grok 4 in order to earn some money out of Polymarket, I can literally do it, just let me play lmarena non stop for 20 hours, pretty sure I can secure 100 lose-on-purpose battles on LMarena. If I take precautions not to get caught, I use free residential IP VPN providers, and I can go literally uncaught.
Kinda sucks, no system is perfect, but I hate to know that it's doable and that it probably has been attempted before.

#

ChatGPT: Emoji at the end, structure.
Grok: Grokking.

harsh flume Jul 12, 2025, 6:57 AM

#

loool theres some Mechahitler-esque tone to its answer 🤣 (I made it cross-analyze deep research results from Gemini and o3 with its own findings)

tribal aspen Jul 12, 2025, 7:30 AM

#

Hi so how can I send images to the model on Direct Chat on LMArena?

red sluice Jul 12, 2025, 7:31 AM

#

tribal aspen Jul 12, 2025, 7:35 AM

#

red sluice

Oh never mind Im not getting it for Opus 4 and grok 4

indigo hazel Jul 12, 2025, 7:36 AM

#

grok 4 is still great even if it cant use tools? better o3 or grok?

lime coral Jul 12, 2025, 7:43 AM

#

why not just try?

quartz light Jul 12, 2025, 7:54 AM

#

R R A L L I O R P

whole wagon Jul 12, 2025, 7:56 AM

#

Sam kinda pathetic giving this safety excuse and underestimating the intelligence of everyone else on the planet

indigo hazel Jul 12, 2025, 7:56 AM

#

quartz light

what's this? k2?

whole wagon Jul 12, 2025, 7:56 AM

#

They have been testing this for a month, you dont just suddenly realise the model is 'unsafe' just before the release

#

That's not how it works it's obvious to everyone

#

Musk calling grok 4 agi repeatedly also lol

tidal schooner Jul 12, 2025, 7:59 AM

#

whole wagon Musk calling grok 4 agi repeatedly also lol

"phd in everything"

whole wagon Jul 12, 2025, 7:59 AM

#

#

This ain't how you determine AGI

tribal aspen Jul 12, 2025, 8:15 AM

#

whole wagon This ain't how you determine AGI

thats how hypers define it

#

also

#

will grok heavy be available on LMarena direct chat?

#

when api of that comes out?

whole wagon Jul 12, 2025, 8:17 AM

#

I assume not. For same reason o3 pro not there

#

Expensive and slow

tidal schooner Jul 12, 2025, 8:22 AM

#

tribal aspen will grok heavy be available on LMarena direct chat?

likely never

whole wagon Jul 12, 2025, 8:24 AM

#

When do we think grok 5 is coming

#

Before the end of the year is my guess

tidal schooner Jul 12, 2025, 8:26 AM

#

whole wagon When do we think grok 5 is coming

january/february tbh

#

december is stretching it tho

whole wagon Jul 12, 2025, 8:26 AM

#

Didn't musk say grok 5 was cooking already

tidal schooner Jul 12, 2025, 8:27 AM

#

whole wagon Didn't musk say grok 5 was cooking already

hmm...

#

it could arrive maybe a bit early then

#

grok 1 to grok 2 was about 9 months

#

grok 2 to grok 3 was about 6 months

#

grok 3 to grok 4 was about 5 months

#

keeps slashing it between releases

#

so earliest could theoretically be november but highly unlikely imo
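A quick sketch of the arithmetic behind that estimate, assuming Grok 4 counts as a July 2025 release (inferred from the date of this chat) and using the rough gaps quoted above. `add_months` is a helper made up for this illustration, and the candidate gaps for the next release are guesses, not data:

```python
from datetime import date

def add_months(start: date, months: int) -> date:
    """Return the first day of the month that is `months` after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

# Approximate gaps between releases, in months, as quoted in the chat.
gaps = {"1 -> 2": 9, "2 -> 3": 6, "3 -> 4": 5}

# Assumption: Grok 4 treated as a July 2025 release.
grok4_release = date(2025, 7, 1)

# If the gap shrinks again (4), stays flat (5), or grows (6-7).
for next_gap in (4, 5, 6, 7):
    eta = add_months(grok4_release, next_gap)
    print(f"gap of {next_gap} months -> {eta:%B %Y}")
```

A 4-month gap lands on November, a flat 5-month gap on December, and the 6 to 7 month range matches the january/february guess.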

#

3.5 seems more realistic

whole wagon Jul 12, 2025, 8:31 AM

#

There's 172 days till the end of the year

#

I think end of the year release makes sense, Gemini 3 is coming then too

tribal aspen Jul 12, 2025, 8:36 AM

#

tidal schooner likely never

so no o3 pro / gemini deepthink / grok 4 heavy ever?

tidal schooner Jul 12, 2025, 8:37 AM

#

tribal aspen so no o3 pro / gemini deepthink / grok 4 heavy ever?

all of them are too costly

tribal aspen Jul 12, 2025, 8:37 AM

#

tidal schooner all of them are too costly

I think opus 4 thinking is a lot more costly than whatever the price of deepthink will be

tidal schooner Jul 12, 2025, 8:38 AM

#

tribal aspen I think opus 4 thinking is a lot more costly than whatever the price of deepthin...

then make a suggestion lol

#

#1372230675914031105

#

haven't checked costs tho

tribal aspen Jul 12, 2025, 8:38 AM

#

tidal schooner haven't checked costs tho

cost of deepthink isnt yet out

tidal schooner Jul 12, 2025, 8:38 AM

#

shouldn't be as bad as the multi-agent framework grok 4 heavy uses tho but idk

tribal aspen Jul 12, 2025, 8:38 AM

#

but gemini costs arent too much

#

they are cheap models

tidal schooner Jul 12, 2025, 8:38 AM

#

tribal aspen cost of deepthink isnt yet out

maybe

tribal aspen Jul 12, 2025, 8:39 AM

#

tidal schooner maybe

also are the models nerfed?

#

in lmarena?

#

like opus thinking and grok 4 are still high class models

#

even with the data i dont think its too practical to provide these models

whole wagon Jul 12, 2025, 8:39 AM

#

Deepthink isnt going to be cheap for some reasons which cannot yet be stated lol

tribal aspen Jul 12, 2025, 8:40 AM

#

whole wagon Deepthink isnt going to be cheap for some reasons which cannot yet be stated lol

what if it will?

#

cuz gemini models are cheap

whole wagon Jul 12, 2025, 8:40 AM

#

It won't

#

They are cooking up something special there

#

Which is why the delay

tribal aspen Jul 12, 2025, 8:41 AM

#

whole wagon They are cooking up something special there

the delay was for another reason

#

they already had a model which was performing better than the upper tier models from other companies

#

I mean 2.5 pro was better than o3 pro and stuff

#

so there was no point of releasing another SOTA having their own model on top already

whole wagon Jul 12, 2025, 8:43 AM

#

Very wrong

tribal aspen Jul 12, 2025, 8:43 AM

#

whole wagon Very wrong

wdym

whole wagon Jul 12, 2025, 8:44 AM

#

There was no point releasing gpt 4 cos gpt 3 was already SOTA

#

Very very wrong

tribal aspen Jul 12, 2025, 8:50 AM

#

whole wagon There was no point releasing gpt 4 cos gpt 3 was already SOTA

that time they had no competition

alpine coral Jul 12, 2025, 9:20 AM

#

ocean vortex half pass. "This implies..." -- no it doesn't. smh

half pass but only if being charitable imo. it briefly acknowledges digital clocks don't have hands, then moves on assuming it must be an analogue clock; its final answer is 7.5 (which is a fail)

dusky aurora Jul 12, 2025, 9:20 AM

#

I really hope Gemini will have new updates this month

alpine coral Jul 12, 2025, 9:21 AM

#

kingfall (one among 15 questions it was answering) passes. like it's possible to acknowledge the possible analogue clock interpretation, but not give that as the final answer

alpine coral Jul 12, 2025, 9:25 AM

#

whole sundial model performs like gpt 4.1, add some reasoning and it could be better than o3

fwiw it feels quite a way behind 4.1 in the limited use i've had so far (i don't do coding / web dev / SVG prompts tho, which seems to be what a lot of people are mostly interested in )

fleet lintel Jul 12, 2025, 9:32 AM

#

whole wagon Deepthink isnt going to be cheap for some reasons which cannot yet be stated lol

How are you folks getting this info? I know people working in Gemini teams and even they don't know these things : 🤔

hazy quest Jul 12, 2025, 9:35 AM

#

alpine coral kingfall (one among 15 questions it was answering) passes. like it's possible to...

How are you still able to use/test kingfall?

frigid coral Jul 12, 2025, 9:37 AM

#

damn viren's here

#

must be a legit discord

alpine coral Jul 12, 2025, 9:44 AM

#

hazy quest How are you still able to use/test kingfall?

ahh.. it's an old screenshot (I was just scrolling through until I found one that included that question, and it just so happened to be kingfall (from one month ago) aha

torn mantle Jul 12, 2025, 10:50 AM

#

whole wagon

lol

#

@alpine coral can u benchmark kimi k2 pls

#

No but seriously did anyone notice the first principles Elon was talking about?

#

'It will provide answers that you won't find on the internet'

#

didnt he say that too

rare python Jul 12, 2025, 11:13 AM

#

torn mantle No but seriously did anyone notice the first principles Elon was talking about...

It reasons through Elon tweets

keen fulcrum Jul 12, 2025, 11:55 AM

#

torn mantle 'It will provide answers that you won't find on the internet'

Yes indeed

#

Others have too

#

Smartest AI I have used

unborn ocean Jul 12, 2025, 12:11 PM

#

forgot to add: for 4.1 / o3 + 8 is the current price

ocean vortex Jul 12, 2025, 12:31 PM

#

For reference, Deepseek R1 official API is $2.19. This sounds insanely low and OpenAI have more traffic, but also more efficient GPUs and better infra, so I still think 2-3.

#

Deepseek's cost is probably around $2 dead, and those 20 cents is their entire "profit" lol

keen beacon Jul 12, 2025, 12:37 PM

#

they have a 545% margin...

frigid phoenix Jul 12, 2025, 12:45 PM

#

Do we know when grok is going to be added to the leaderboard Hmm

ocean vortex Jul 12, 2025, 12:59 PM

#

frigid phoenix Do we know when grok is going to be added to the leaderboard Hmm

This month for sure, next week is my guess...

#

Since it's already collecting votes

#

this looks wild now though. OpenAI with no model released yet higher odds than xAI... lol

candid storm Jul 12, 2025, 1:01 PM

#

ocean vortex this looks wild now though. OpenAI with no model released yet higher odds than x...

Youre looking wrong

#

Youre looking at end of 2025

torn mantle Jul 12, 2025, 1:01 PM

#

ocean vortex this looks wild now though. OpenAI with no model released yet higher odds than x...

thats for end of 2025

frigid phoenix Jul 12, 2025, 1:01 PM

#

Would be smart if staff here waits for the last day of the month and puts grok in top 1

torn mantle Jul 12, 2025, 1:02 PM

#

i would still bet on google tbh

frigid phoenix Jul 12, 2025, 1:02 PM

#

While having PM shares on it

ocean vortex Jul 12, 2025, 1:02 PM

#

torn mantle thats for end of 2025

uhh right. Wrong link 💀

#

this is more realistic ok

torn mantle Jul 12, 2025, 1:02 PM

#

yea

ocean vortex Jul 12, 2025, 1:02 PM

#

Betting on Google still would be silly...

torn mantle Jul 12, 2025, 1:02 PM

#

why

#

i would still bet on them

#

even if the profit is low given the high chance

candid storm Jul 12, 2025, 1:03 PM

#

I would not bet on july personally rn

#

Too uncertain

ocean vortex Jul 12, 2025, 1:03 PM

#

torn mantle i would still bet on them

You get almost no return and there's reasonable risk xAI takes it and Google doesn't even respond by the end of July for it to be on the leaderboard...

torn mantle Jul 12, 2025, 1:04 PM

#

im not feeling grok 4

candid storm Jul 12, 2025, 1:04 PM

#

torn mantle im not feeling grok 4

Yeah same

#

But never bet against Elon!

ocean vortex Jul 12, 2025, 1:04 PM

#

torn mantle im not feeling grok 4

I'm not either. But it doesn't take a whole lot if you look at current leaderboard. 4o-latest is 3rd not far away from top spot

torn mantle Jul 12, 2025, 1:05 PM

#

candid storm But never bet against Elon!

idk tbh.. i may as well just bet against him all the time

ocean vortex Jul 12, 2025, 1:05 PM

#

If they tuned it for arena prompts I think chances are very reasonable

tribal aspen Jul 12, 2025, 1:05 PM

#

Why can't open ai codex be on LMarena?

#

direct chat

torn mantle Jul 12, 2025, 1:06 PM

#

ocean vortex I'm not either. But it doesn't take a whole lot if you look at current leaderboa...

i just feel like wolfstride/stonebloom could top it instead

ocean vortex Jul 12, 2025, 1:09 PM

#

torn mantle i just feel like wolfstride/stonebloom could top it instead

that's a threat for top spot for sure, is it still in arena though?

#

wolfstride

#

Seems like they are testing a bit random things and then retracting it

tribal aspen Jul 12, 2025, 1:10 PM

#

is opus 4 good or 2.5 pro good in mass coding?

#

like for starting to build a vibecoding project?

tribal aspen Jul 12, 2025, 1:11 PM

#

tribal aspen like for starting to build a vibecoding project?

^

#

ohh

#

why cant we upload images to claude 4 opus on lmarena direct chat?

ocean vortex Jul 12, 2025, 1:13 PM

#

tribal aspen like for starting to build a vibecoding project?

If you want for it to go absolutely off the rails doing the entire thing in one go probably Sonnet 3.7. That thing can hit the output cap just with code lol

#

Opus is gonna be significantly more concise, then Sonnet 4, but 3.7 even more unhinged

tribal aspen Jul 12, 2025, 1:14 PM

#

ocean vortex If you want for it to go absolutely off the rails doing the entire thing in one ...

how is the inferior class old family model better than their best model 4 opus thinking?

tribal aspen Jul 12, 2025, 1:14 PM

#

ocean vortex Opus is gonna be significantly more concise, then Sonnet 4, but 3.7 even more un...

which 3.7 is the best for coding?

#

what about gemini 2.5 pro?

#

so opus 4 not good for coding?

ocean vortex Jul 12, 2025, 1:14 PM

#

tribal aspen how is the inferior class old family model better than their best model 4 opus t...

Different fine-tuning. You need different tools for different tasks. Opus is not meant to be a workhorse for repetitive work

#

it is much more concise with outputs

tribal aspen Jul 12, 2025, 1:15 PM

#

also this

tribal aspen Jul 12, 2025, 1:15 PM

#

ocean vortex it is much more concise with outputs

ohh

#

what about 2.5 pro vs sonnet 3.7 vs sonnet 4

#

which one should I choose?

#

also the new Kimi 2

ocean vortex Jul 12, 2025, 1:16 PM

#

tribal aspen which one should I choose?

2.5Pro is a safe bet for almost everything tbh. Then you can do 3.7 if you feel 2.5 outputs are too short for any given task

tribal aspen Jul 12, 2025, 1:17 PM

#

ocean vortex 2.5Pro is a safe bet for almost everything tbh. Then you can do 3.7 if you feel ...

its one shots are about 6k to 9k characters (2.5 pro

#

)

#

do u have any idea about kimi 2?

ocean vortex Jul 12, 2025, 1:17 PM

#

tribal aspen its one shots are about 6k to 9k characters (2.5 pro

Well I did a coding task where it was 32k code in a single response with 3.7 lol

tribal aspen Jul 12, 2025, 1:17 PM

#

also so opus 4 and sonnet 4 are good for nothing?

tribal aspen Jul 12, 2025, 1:18 PM

#

ocean vortex Well I did a coding task where it was 32k code in a single response with 3.7 lol

💀

#

i might try it

#

im actually so confused

#

like at this point

#

there are so many models

#

2.5 pro, opus 4, sonnet 4, sonnet 3.7, grok 4, kimi 2

#

also oai's codex

alpine coral Jul 12, 2025, 1:19 PM

#

fwiw don't fixate on them too much

#

you're better off just using one / some of them, and getting a feel – they're not like insanely different at a surface level..

#

it's like maybe one will be the magic wand you want - build / fix whatever it is you're working on perfectly. but there's still effort along the way.. engaging with them.. they're all kinda similar in many ways

ocean vortex Jul 12, 2025, 1:22 PM

#

tribal aspen 2.5 pro, opus 4, sonnet 4, sonnet 3.7, grok 4, kimi 2

Kimi 2 you shouldn't worry about for now. It's competing with gpt4.1, 4.5, Deepseek V3 etc... Not with SOTA reasoning models

tribal aspen Jul 12, 2025, 1:23 PM

#

ocean vortex Kimi 2 you shouldn't worry about for now. It's competing with gpt4.1, 4.5, Deeps...

yea its actually competing with opus 4

#

I tried it and went crazy

keen beacon Jul 12, 2025, 1:23 PM

#

its so slow tho

tribal aspen Jul 12, 2025, 1:23 PM

#

like its one shot code is so much better than 2.5 pro and grok 4

tribal aspen Jul 12, 2025, 1:23 PM

#

keen beacon its so slow tho

yea but the quality is good ig

alpine coral Jul 12, 2025, 1:24 PM

#

tribal aspen I tried it and went crazy

good! don't look at the tables with benchmark comparisons.. just use the models and see what you think ha

tribal aspen Jul 12, 2025, 1:24 PM

#

alpine coral good! don't look at the tables with benchmark comparisons.. just use the models ...

yea lol

#

also its good on the benches

ocean vortex Jul 12, 2025, 1:25 PM

#

As a non-reasoner it's a good option in that segment yeah

rare python Jul 12, 2025, 1:25 PM

#

ocean vortex As a non-reasoner it's a good option in that segment yeah

I'm afraid that they can't serve 1T model for free for long

#

Unless they have insane fund

keen beacon Jul 12, 2025, 1:27 PM

#

theyre probably hosting it on their own gpus, cost is probably not that high for them

ocean vortex Jul 12, 2025, 1:27 PM

#

rare python I'm afraid that they can't serve 1T model for free for long

There are probably gonna be free 3rd party alternatives, like there are for R1 now, which is not that much smaller...

#

tribal aspen Jul 12, 2025, 1:27 PM

#

I created this one with gemini

#

yashjit.tech

#

http://yashjit.tech

rare python Jul 12, 2025, 1:27 PM

#

ocean vortex There are probably gonna be free 3rd party alternatives, like there are for R1 n...

Didn't chute remove the free tier for deepseek?

keen beacon Jul 12, 2025, 1:27 PM

#

yea i meant if you use it on their site. they aren't losing piles of money

#

probably

ocean vortex Jul 12, 2025, 1:28 PM

#

rare python Didn't chute remove the free tier for deepseek?

I thought so too but it works so ig not entirely removed lol

#

You also have 2 more providers

alpine coral Jul 12, 2025, 1:28 PM

#

for sure

keen beacon Jul 12, 2025, 1:28 PM

#

deepseek api is very slow too

rare python Jul 12, 2025, 1:28 PM

#

ocean vortex You also have 2 more providers

Do they use fp4 or some quant?

keen beacon Jul 12, 2025, 1:29 PM

#

i remember getting 60 tps. now its so slow and at times unstable

rare python Jul 12, 2025, 1:29 PM

#

DeepSeek V3 and R1 before it blew up to the mainstream is so peak. It's sooo fast

gusty helm Jul 12, 2025, 1:29 PM

#

I feel they dumped xAI too early in that market

ocean vortex Jul 12, 2025, 1:29 PM

#

rare python Do they use fp4 or some quant?

Nah it's the same as paid ones:

round haven Jul 12, 2025, 1:30 PM

#

Hey guys, does anyone know what happened to dragontail? I loved that model and it seems like none of the models publicly available today are as smart as dragontail.

ocean vortex Jul 12, 2025, 1:30 PM

#

fp8 is full precision

#

for R1

rare python Jul 12, 2025, 1:30 PM

#

dragontail seems familiar

civic flame Jul 12, 2025, 1:30 PM

#

https://x.com/mckaywrigley/status/1943385794414334032 saw this and tried it with wolfstride

Mckay Wrigley (@mckaywrigley)

My thoughts on Grok 4 Heavy after 12hrs:

Crazy good!

“Create an animation of a crowd of people walking to form “Hello world, I am Grok” as camera changes to birds-eye.”

And it 1-shotted the *entire* thing.

No other model comes close.

Watch the full clip.

rare python Jul 12, 2025, 1:30 PM

#

isn't it a Gemini model?

civic flame Jul 12, 2025, 1:30 PM

#

it did it 0-shot and arguably better

#

📎 message.txt

round haven Jul 12, 2025, 1:31 PM

#

The rumor was that it was a google model, yes, but it seems to be significantly smarter than 2.5 Pro

#

2.5 Pro still can't solve problems that dragontail has been able to solve when I tried it

civic flame Jul 12, 2025, 1:31 PM

#

civic flame

vs a swarm of agents too btw

#

this model is really good

round haven Jul 12, 2025, 1:31 PM

#

I would really like to use dragontail again

rare python Jul 12, 2025, 1:31 PM

#

round haven Jul 12, 2025, 1:32 PM

#

rare python

how do they know these are all gemini models? By asking them?

civic flame Jul 12, 2025, 1:32 PM

#

civic flame it did it 0-shot and arguably better

grok 4 heavy does it in 2 (first "hello world", then "i am grok") and with less people

#

wolfstride does it in 1, as the prompt asks for, and with more people

round haven Jul 12, 2025, 1:35 PM

#

dragontail has been able to solve problems none of the publicly available models I've tried can solve still

civic flame Jul 12, 2025, 1:35 PM

#

like?

round haven Jul 12, 2025, 1:35 PM

#

Like this one:

Let G be a group. Denote by a' the inverse of a for simplicity. Define the generalized commutator [a1,...,an] to be a1...ana1'...an'. So in particular [x,y] = xyx'y' is the usual commutator. Give a formula to write [a,b][c,d] as a single generalized commutator. Verify the derived formula step by step to check that everything cancels out nicely. I cannot read latex so use plaintext. Also make sure to use my convention by writing x' for the inverse of x.

civic flame Jul 12, 2025, 1:35 PM

#

what's the answer

round haven Jul 12, 2025, 1:35 PM

#

there are many

#

also you should try it yourself first, it's a fun puzzle
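Spoiler for anyone who wants to try it first. One candidate answer (not given in the chat, and not necessarily the one round haven has in mind) is [a,b][c,d] = [a, b, a'b', c, d, c'd']. The forward half is a b a'b' c d c'd' = [a,b][c,d], and the inverse half is a' b' (ba) c' d' (dc) = e, so everything cancels. The sketch below only sanity-checks that candidate on random permutations; the helper names are made up for this illustration:

```python
import random
from functools import reduce

def mul(p, q):
    """Compose permutations stored as tuples: (p*q)(i) = p(q(i))."""
    return tuple(p[i] for i in q)

def inv(p):
    """Inverse permutation."""
    out = [0] * len(p)
    for i, image in enumerate(p):
        out[image] = i
    return tuple(out)

def gen_commutator(*elems):
    """[a1,...,an] = a1...an a1'...an' in the chat's convention."""
    forward = reduce(mul, elems)
    backward = reduce(mul, [inv(x) for x in elems])
    return mul(forward, backward)

def candidate_holds(a, b, c, d):
    """Check [a,b][c,d] == [a, b, a'b', c, d, c'd'] for concrete elements."""
    lhs = mul(gen_commutator(a, b), gen_commutator(c, d))
    rhs = gen_commutator(a, b, mul(inv(a), inv(b)), c, d, mul(inv(c), inv(d)))
    return lhs == rhs

rng = random.Random(0)

def random_perm(n=6):
    """Random element of the symmetric group on n points."""
    items = list(range(n))
    rng.shuffle(items)
    return tuple(items)

all_ok = all(
    candidate_holds(*(random_perm() for _ in range(4))) for _ in range(200)
)
print(all_ok)
```

A numeric check like this is not a proof, but the cancellation in the lead-in is short enough to verify by hand.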

alpine coral Jul 12, 2025, 1:36 PM

#

rare python dragontail seems familiar

is it back in the arena?

ocean vortex Jul 12, 2025, 1:36 PM

#

What if you coded your own "heavy" except it has 5 instances of o3 and 5 of 2.5Pro, 5 of Grok4 (giving it the benefit of the doubt with that impressive arc-agi score)... Then can also add Opus4 for good measure I suppose for a total of 20. That would have been SOTA by a good margin 🤔

unborn ocean Jul 12, 2025, 1:37 PM

#

ocean vortex What if you coded your own "heavy" except it has 5 instances of o3 and 5 of 2.5P...

you might need a very good framework to figure the best response though

bleak venture Jul 12, 2025, 1:38 PM

#

Hi guys, is the open weights R1T2 Chimera on the Leaderboard already? https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera
If not, what's the typical process to add it to the board?

tngtech/DeepSeek-TNG-R1T2-Chimera · Hugging Face

keen beacon Jul 12, 2025, 1:38 PM

#

claim the benchmarks without releasing the scoring system 🤣

ocean vortex Jul 12, 2025, 1:39 PM

#

unborn ocean you might need a very good framework to figure the best response though

You do but I think it's not THAT critical. Any good model you use out of them for forming the final response should do a decent job seeing all of those attempts. 2.5Pro probably the most appropriate given the context size

alpine coral Jul 12, 2025, 1:40 PM

#

civic flame what's the answer

@round haven also curious

ocean vortex Jul 12, 2025, 1:42 PM

#

Those Pro/Heavy models do well because often you (or the model) can actually spot a good response or what other attempts missed just by looking at it

unborn ocean Jul 12, 2025, 1:42 PM

#

ocean vortex You do but I think it's not THAT critical. Any good model you use out of them fo...

would probably get a bit harder, because the models will obv prefer their own response the most bias wise

#

will not completely destroy the potential though

#

just as small hurdle

ocean vortex Jul 12, 2025, 1:44 PM

#

unborn ocean would probably get a bit harder, because the models will obv prefer their own re...

Yeah fair point. Probably make it rate their own ones then and use yet another model to then review the best from each you used to generate lol
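A minimal sketch of that two-stage idea, under loud assumptions: every model is just a stand-in Python callable, no real API is called, and `diy_heavy`, `make_stub`, and the selector functions are hypothetical names for illustration only. Each model picks the best of its own attempts, then a separate judge compares one finalist per model:

```python
from typing import Callable, Dict, List

Generate = Callable[[str], str]
Select = Callable[[str, List[str]], int]  # returns index of the preferred candidate

def diy_heavy(prompt: str,
              generators: Dict[str, Generate],
              self_select: Dict[str, Select],
              judge: Select,
              samples: int = 5) -> str:
    finalists: List[str] = []
    for name, generate in generators.items():
        # Stage 1: each model samples several attempts...
        attempts = [generate(prompt) for _ in range(samples)]
        # ...and ranks only its own attempts, so self-preference cannot
        # tilt the comparison between different models.
        finalists.append(attempts[self_select[name](prompt, attempts)])
    # Stage 2: a separate model reviews one finalist per generator.
    return finalists[judge(prompt, finalists)]

# Toy stand-ins so the sketch runs without any API access.
def make_stub(tag: str) -> Generate:
    counter = {"n": 0}
    def generate(prompt: str) -> str:
        counter["n"] += 1
        return f"{tag}-attempt-{counter['n']}"
    return generate

def pick_last(prompt: str, candidates: List[str]) -> int:
    return len(candidates) - 1

def pick_first(prompt: str, candidates: List[str]) -> int:
    return 0

stub_generators = {"model_a": make_stub("a"), "model_b": make_stub("b")}
stub_selectors = {"model_a": pick_last, "model_b": pick_first}
answer = diy_heavy("2+2?", stub_generators, stub_selectors,
                   judge=pick_last, samples=3)
print(answer)
```

In a real setup the judge would ideally be a model that did not generate any of the finalists, which is the point of the "yet another model" suggestion.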

unborn ocean Jul 12, 2025, 1:44 PM

#

most of them get RLAI training with some version of themselves in the process -> they produce output they like

#

would be interesting to fine tune a good open model as the final answer generator using some rl on rlvr

torn mantle Jul 12, 2025, 1:48 PM

#

ocean vortex that's a threat for top spot for sure, is it still in arena though?

yea still

civic flame Jul 12, 2025, 1:49 PM

#

round haven there are many

with the rate that they usually put stuff out i would expect a new anon google model on arena soon

#

oops ignore that reply

#

didn't realise it was still on it

rare python Jul 12, 2025, 1:49 PM

#

alpine coral is it back in the arena?

no

dawn wharf Jul 12, 2025, 1:49 PM

#

does anybody know a free provider for K2 api?

round haven Jul 12, 2025, 1:50 PM

#

civic flame didn't realise it was still on it

dragontail is still on lmarena?! 😮

#

why do i never get it

rare python Jul 12, 2025, 1:50 PM

#

round haven how do they know these are all gemini models? By asking them?

asking them and check the web dev arena code

#

get what?

round haven Jul 12, 2025, 1:50 PM

#

i thought they said it's still in the arena

rare python Jul 12, 2025, 1:50 PM

#

only stonebloom and wolfstride has Gemini mystery models iirc

rare python Jul 12, 2025, 1:51 PM

#

round haven i thought they said it's still in the arena

who?

round haven Jul 12, 2025, 1:51 PM

#

leo

#

anyway, a misunderstanding probably

#

When I use lmarena new UI i get these hanging responses. The generator keeps spinning and i never get a response. Grok 4 generator is still spinning from yesterday

#

v buggy

rare python Jul 12, 2025, 1:52 PM

#

true

#

I don't know if those models thought too much and crashed or it's just delay

#

No thinking available, even as summary

pure anvil Jul 12, 2025, 1:59 PM

#

dawn wharf does anybody know a free provider for K2 api?

none atm

dusky aurora Jul 12, 2025, 2:04 PM

#

rare python

so will there be an update this month or not?

rare python Jul 12, 2025, 2:04 PM

#

dusky aurora so will there be an update this month or not?

How can I know?

#

¯_(ツ)_/¯

dusky aurora Jul 12, 2025, 2:05 PM

#

sorry. I'm just hoping that there will be something better than 06-05

rare python Jul 12, 2025, 2:05 PM

#

in lmarena right now but don't know when it will debut

dusky aurora Jul 12, 2025, 2:06 PM

#

waiting for Gemini updates and Arena updates are my main joys

rare python Jul 12, 2025, 2:06 PM

#

and what they are

dusky aurora Jul 12, 2025, 2:07 PM

#

how do you use Battle mode?

torn mantle Jul 12, 2025, 2:08 PM

#

Kimi k2 is what i expected deepseek v4 to be

#

But can v4 really be better than k2?

#

I don't think deepseek will train a base model to 1T

#

K2 feels exactly like grok 3, smart for a base model

main gulch Jul 12, 2025, 2:09 PM

#

torn mantle But can v4 really be better than k2?

K2 (as all the Chinese models) is bad in non-English world knowledge

torn mantle Jul 12, 2025, 2:10 PM

#

main gulch K2 (as all the Chinese models) is bad in non-English world knowledge

Yea it missed some events when i asked it about French conflicts

#

Gave like 2 wrong dates as well

civic flame Jul 12, 2025, 2:12 PM

#

shhh

astral kayak Jul 12, 2025, 2:12 PM

#

https://tenor.com/view/gatekeep-why-gatekeep-why-hate-why-are-you-gatekeeping-gif-2058740858049169092

Tenor

jade egret Jul 12, 2025, 2:18 PM

#

openai sad

#

windsurf employees go to google

round haven Jul 12, 2025, 2:29 PM

#

ok this is epic

civic flame Jul 12, 2025, 2:33 PM

#

astral kayak https://tenor.com/view/gatekeep-why-gatekeep-why-hate-why-are-you-gatekeeping-gi...

because it's just going to get patched lol

alpine coral Jul 12, 2025, 2:35 PM

#

i've seen all kinda things in chinese language forums lately..

civic flame Jul 12, 2025, 2:36 PM

#

yikes

alpine coral Jul 12, 2025, 2:36 PM

#

kinda forget how epic it is how you can legit translate the content of webpages these days

civic flame Jul 12, 2025, 2:36 PM

#

linux.do is great but also people over there have 0 regard for like anybody else

alpine coral Jul 12, 2025, 2:36 PM

#

wasn't that long ago that translation was shite and clunky

pure anvil Jul 12, 2025, 2:37 PM

#

lol

civic flame Jul 12, 2025, 2:39 PM

#

wow what an edgy cool guy 🥶 🥶

pure anvil Jul 12, 2025, 2:39 PM

#

wdym?

civic flame Jul 12, 2025, 2:39 PM

#

💀

pure anvil Jul 12, 2025, 2:39 PM

#

did you interpret it that way?

#

if so

civic flame Jul 12, 2025, 2:41 PM

#

you just come across as a 14 yr old trying too hard to be edgy

alpine coral Jul 12, 2025, 2:41 PM

#

locusts?

#

im just confused

#

but yeah.. discussion = exposure = patching

pure anvil Jul 12, 2025, 2:43 PM

#

civic flame you just come across as a 14 yr old trying too hard to be edgy

Sorry if I came through that way

torn mantle Jul 12, 2025, 2:43 PM

#

Leo gatekeeps everything ive noticed that too

#

And sometimes he plays dumb

keen beacon Jul 12, 2025, 2:43 PM

#

you gotta do what you gotta do 🤷

torn mantle Jul 12, 2025, 2:44 PM

#

I can use deep think and other gemini new models for free but imma gatekeep as well

civic flame Jul 12, 2025, 2:45 PM

#

alpine coral locusts?

it's an ethnic slur for the mainland chinese in HK

pure anvil Jul 12, 2025, 2:45 PM

#

what?

#

no way

#

well I definitely didn't mean that

#

lmaoo

civic flame Jul 12, 2025, 2:45 PM

#

torn mantle Leo gatekeeps everything ive noticed that too

sometimes yes but generally it's for good reason

torn mantle Jul 12, 2025, 2:45 PM

#

Don't delete

keen beacon Jul 12, 2025, 2:45 PM

#

i dont think he meant that

civic flame Jul 12, 2025, 2:45 PM

#

then what did you mean 🙏😭

pure anvil Jul 12, 2025, 2:45 PM

#

I didn't even know that was a thing

#

I meant free loaders kekkkk

civic flame Jul 12, 2025, 2:46 PM

#

yeah nvm then mb

torn mantle Jul 12, 2025, 2:46 PM

#

alpine coral but yeah.. discussion = exposure = patching

I will share the code ive used to get deep think for free

alpine coral Jul 12, 2025, 2:47 PM

#

pure anvil I meant free loaders kekkkk

eh yeah i kinda thought that - but it'd be something more like parasite (leech or something) than locust

torn mantle Jul 12, 2025, 2:47 PM

#

I wont share how to use wolfstride and stonebloom for free or maybe i will

alpine coral Jul 12, 2025, 2:47 PM

#

but perhaps lost in translation

civic flame Jul 12, 2025, 2:47 PM

#

alpine coral eh yeah i kinda thought that - but it'd be something more like parasite (leech or s...

yeah that's what had me confused

pure anvil Jul 12, 2025, 2:47 PM

#

alpine coral eh yeah i kinda thought that - but it'd be something more like parasite (leech or s...

all very similar

keen beacon Jul 12, 2025, 2:47 PM

#

nah ive heard the term used like that before

torn mantle Jul 12, 2025, 2:47 PM

#

I want it to be patched just because some people are gatekeeping

keen beacon Jul 12, 2025, 2:47 PM

#

post it here i dare u lol

civic flame Jul 12, 2025, 2:47 PM

#

torn mantle I want it to be patched just because some people are gatekeeping

that doesn't really make sense

#

but alright

#

it's not hard to find if you do 2 minutes of research

#

🤷‍♂️

alpine coral Jul 12, 2025, 2:48 PM

#

keen beacon nah ive heard the term used like that before

yeah it's apparently a thing

torn mantle Jul 12, 2025, 2:49 PM

#

Yea but when i asked you, you played dumb

keen beacon Jul 12, 2025, 2:49 PM

#

alpine coral yeah it's apparently a thing

i meant the free loader thing not the ethnic slur lolll

alpine coral Jul 12, 2025, 2:49 PM

#

oh lol

keen beacon Jul 12, 2025, 2:49 PM

#

i didnt even know about that lol

torn mantle Jul 12, 2025, 2:49 PM

#

Just tell me im sorry cant share that with you

#

Although its an easy trick

alpine coral Jul 12, 2025, 2:49 PM

#

cause yeah.. that seems like a coincidence or intended.. or some translation thing

#

or im just too high

torn mantle Jul 12, 2025, 2:49 PM

#

Im just lazy to re

civic flame Jul 12, 2025, 2:49 PM

#

torn mantle Yea but when i asked you, you played dumb

what, where did you ask me ??

keen beacon Jul 12, 2025, 2:50 PM

#

alpine coral cause yeah.. that seems like a coincidence or intended.. or some translation t...

it seems unintended, i doubt the people who ive seen that used the word actually used it as an ethnic slur. that seems like a china specific thing

alpine coral Jul 12, 2025, 2:50 PM

#

keen beacon i meant the free loader thing not the ethnic slur lolll

kinda the same thing tho.. like was using locusts to mean parasites i thought.. and yeah locusts aren't really parasitic..

#

parasite = freeloader

civic flame Jul 12, 2025, 2:51 PM

#

I do think it would've made more sense to say parasites in that situation because it's more commonly used to describe free loaders but

#

oh well

pure anvil Jul 12, 2025, 2:51 PM

#

That is a better word come to think

#

of it

torn mantle Jul 12, 2025, 2:51 PM

#

Maybe i should contact lmarena devs

#

To patch it

keen beacon Jul 12, 2025, 2:51 PM

#

alpine coral kinda the same thing tho.. like was using locusts to mean parasites i thought.. ...

suppose so but the severity is different i guess even if its symbolically similar

torn mantle Jul 12, 2025, 2:51 PM

#

@keen beacon should i?

civic flame Jul 12, 2025, 2:51 PM

#

torn mantle Maybe i should contact lmarena devs

lol

torn mantle Jul 12, 2025, 2:52 PM

#

Yes or nah

#

Wild will decide

civic flame Jul 12, 2025, 2:52 PM

#

what's even the point

#

and you didn't answer my question

keen beacon Jul 12, 2025, 2:52 PM

#

torn mantle @keen beacon should i?

idrc tbh. do it if you want to 😂

civic flame Jul 12, 2025, 2:53 PM

#

pretty sure they know by now anyway

pure anvil Jul 12, 2025, 2:55 PM

#

civic flame linux.do is great but also people over there have 0 regard for like anybody else

they have high quality discussions imo

civic flame Jul 12, 2025, 2:57 PM

#

there are definitely some pretty smart people over there

#

but they're relentless lol

pure anvil Jul 12, 2025, 3:00 PM

#

civic flame there are definitely some pretty smart people over there

definitely one of the more cracked forums

civic flame Jul 12, 2025, 3:15 PM

#

!

tall summit Jul 12, 2025, 3:35 PM

#

civic flame linux.do is great but also people over there have 0 regard for like anybody else

linux.do my goat

drifting thorn Jul 12, 2025, 3:37 PM

#

that’s Kimi

#

Btw what’s the system prompt of Grok4 on LMArena?

#

It seems different to the version in Grok app and in API

jade egret Jul 12, 2025, 3:42 PM

#

guys

#

now that windsurf people work at google deepmind will google models become much better at coding?

barren prairie Jul 12, 2025, 3:53 PM

#

Kimi or kiki?

keen beacon Jul 12, 2025, 4:34 PM

#

depends on the task tbh

#

can you be more specific about what u typically want to use it for? tbh, you should just try out different local ai models yourself and get a feel of which model to use in what scenario

tribal aspen Jul 12, 2025, 4:36 PM

#

lmarena is so unusable in the direct chat mode when it generates code

blazing bison Jul 12, 2025, 4:37 PM

#

yes the site is not very optimized for code

#

wait for openai new model

keen beacon Jul 12, 2025, 4:39 PM

#

you need vision? or like the good at svg / web design type