#general


tepid lynx
#

Google is currently leading in steps in this video, and Grok is considered far, far away

#

Deepseek made a competitor to 4o at one time, I think they will do no worse now

indigo hazel
#

i published the link to the second video he published about his test for grok 4

indigo hazel
torn mantle
#

thanks

#

lets laugh at them

indigo hazel
indigo hazel
tepid lynx
rare python
#

This is peak

torn mantle
indigo hazel
torn mantle
#

yea sometimes its stuck in a loop generating base64

rare python
#

kek

#

How has DeepMind not found this bug?

torn mantle
#

they know

#

i guess its just hard to fix

#

yes

verbal nimbus
gusty helm
indigo hazel
gusty helm
#

ah sry, tagged the wrong reply 😄

verbal nimbus
#

Probably same fate as GPT-4.5

balmy mist
#

should I buy grok heavy?

#

lmaoo

#

i knew you would say that

#

did you buy it?

#

we might need a subscription that has all top ai models like the max plan for all, similar to how we did with cable back in the day, there are like 5 different companies that have $100 plus plans

#

grok, claude, perplex, google, and openai

#

but that comet browser seems interesting by perplex

pure anvil
balmy mist
#

i heard ppl saying o3 pro is better than grok heavy

balmy mist
#

that*

#

craig, if you could list which models are best for what, what would you pick? like lets do grok, gemini, claude and o3
ik claude is code
o3 maybe everyday use
gemini general? idk

#

we should actually make an internal poll or leaderboard for this in terms of vibes for these metrics instead of benchmarks

pure anvil
#

Gemini is good for image input

balmy mist
#

im similar but use o3 for mostly everything, and gemini for coding since its free, i dont use claude or grok

#

i dont have pro so i cant use 4.5 that much

#

i aint gonna lie, i think i like o3 the best

#

gemini is cool, but o3 just feels natural when i use it and its so much faster than it used to be

#

please dooooo

#

i cant imagine what they would charge for grok 4 heavy lol

steel quail
#

lmarena actually broken

indigo hazel
zealous panther
#

yep

steel quail
#

lmarena is using a broken api

indigo hazel
steel quail
#

the api used is broken lol, making it look like it's not even grok4

#

looks like a 2/3 mix

indigo hazel
#

but do they have to fix this tools thing, or will grok never be able to use tools in lmarena?

rare python
gusty helm
#

try same question on gemini

#

it thinks the same (joe biden and its 2023/2024)

#

it looks like an intentional limitation

rare python
#

You will get answer around May, June, July or even October

echo aurora
solar hollow
#
poll_question_text

based on your testing so far, is grok 4 the best model right now?

victor_answer_votes

13

total_votes

17

victor_answer_id

2

victor_answer_text

no

whole wagon
cedar tide
#

Im waiting for Kimi k2 thinking & vision

whole wagon
storm needle
whole wagon
#

he is an openai fanboi

#

so wouldnt expect him to say this

#

the open source model is clear efficiency SOTA. I guess they struggled to scale it up for whatever reason

#

Or maybe it's all just compressing together

#

The performance of different ai companies seems to be getting closer

alpine coral
#

did very poorly on the one question set i gave it

whole wagon
#

Simple bench is so slow to add new models

keen fulcrum
#
poll_question_text

How will OpenAI's new open source reasoning model perform?

victor_answer_votes

10

total_votes

11

victor_answer_id

3

victor_answer_text

Better than o4-mini, worse than o3.

whole wagon
#

By the time they add the new model all the hype already died

keen fulcrum
whole wagon
#

Well not really a hobby, he has entire YouTube channel and patreon lol

wind moth
#

will grok 4 top lmarena?

whole wagon
#

I doubt it

#

Though that doesn't necessarily mean it isnt the SOTA lol

#

Because Google optimised for llm arena

#

Grok 4 didn't optimise for it at all they had no secret model

brittle tiger
torn mantle
#

lmao

#

actually its 71% google

sacred quail
#

I remember google working on a new reasoning technique

#

It was "tree" something

torn mantle
#

kimi k2 should be added too asap

#

it could easily get like top 6

#

top 8 maybe

sacred quail
#

Why

torn mantle
#

its the equivalent of big brain in grok 4

#

aka heavy thinking

torn mantle
sacred quail
#

When i used kimi 1.5

#

It was nice but kinda mid

#

More abilities than deepseek but worse language and outputs

torn mantle
#

yea kimi 1.5 was alright

#

but they improved a lot on k2

sacred quail
#

hmmmmmm... Maybe i should check

unborn ocean
#

K2 is way larger and better in general (vs 1.5)

torn mantle
#

howso

steel quail
torn mantle
#

how should i read your name btw @unborn ocean

#

is it "not so"

#

or like a merged name

#

or it doesnt matter?

#

mm

unborn ocean
#

Split I guess

torn mantle
#

i see

unborn ocean
#

But I mean it obviously doesn’t matter

unborn ocean
sacred quail
#

@torn mantle is kimi k2 better than latest qwen ? Did you check

torn mantle
torn mantle
#

but its a non reasoning model

#

its kinda like grok 3

#

smart without being a reasoner

sacred quail
#

i like good base models

potent snow
#

is grok 4 image generator better than grok 3?

woeful viper
#

Is there an issue with grok 4 in lmarena ? I can't send an image in direct chat, but grok 4 has capability to deal with images ?

haughty siren
woeful viper
haughty siren
#

for lmarena*

torn mantle
#

🚨 Google Gemini leak reveals advanced #AgenticAI features:

🔹 Task Modes:

* AGENTIC_TASK: Autonomous agents plan & execute workflows.
* DUPLEX_TALK_TO_AGENT: Voice-call automation (e.g., bookings).
* DUPLEX_TRIGGER_AND_POLL: Automated polling for long tasks.

🔹 Immersive

keen beacon
#

@echo aurora I just want to take a moment to give a huge heartfelt thank you to lmarena.ai and the incredible team behind it. The fact that you're making access to the latest paid models available for free is not just generous, it's deeply humane. In a time when knowledge and creativity are increasingly locked behind paywalls, what you're doing is nothing short of empowering.
You're helping so many of us learn, create, and grow, without any barrier. It’s rare to see something this generous. Much love and endless respect.

echo aurora
coral vigil
#

Is there a way to view historical leaderboard data at LMArena?

whole wagon
#

Would be nice if it was possible in UI

echo aurora
unborn ocean
#

very thorough benchmarking

#

but wild is right, can't compete with the large us labs

whole wagon
#

They aren't

#

Grok 4 is 2T

unborn ocean
#

does not have to be, can also be more compute or larger experts (kimi k2 has only 32b active params) / different configuration (though typically a high simpleqa = large model)
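
To put the sparsity point in numbers, here is a rough Python sketch. It assumes K2 has about 1T total parameters (the figure used for it elsewhere in this chat) and the 32b active params quoted above; `active_fraction` is a made-up helper name, and none of this says anything about Grok 4's real size.

```python
def active_fraction(total_params: int, active_params: int) -> float:
    """Share of a mixture-of-experts model's weights used per token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    if not 0 <= active_params <= total_params:
        raise ValueError("active_params must be between 0 and total_params")
    return active_params / total_params


# Assumed figures: ~1T total (as quoted in this chat), 32b active per token.
K2_TOTAL_PARAMS = 1_000_000_000_000
K2_ACTIVE_PARAMS = 32_000_000_000

fraction = active_fraction(K2_TOTAL_PARAMS, K2_ACTIVE_PARAMS)
print(f"{fraction:.1%} of the weights are active per token")
```

So a model can be very large on disk while only a small slice of it runs for each token, which is why total size alone says little about serving cost.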

whole wagon
#

It's 2.7 T if you count all feed-forward plus attention experts (every shard on disk).

It's 1.7 T if you count only the parameters that sit inside the experts (Skip KV-cache buffers, embeddings, and some routing weights).

#

That's the exact figures for you lol

#

They are relaxed about sharing higher level arch details

#

They only trained grok 3 on 12.8T tokens also

#

I assume it's not as sparse as the open source models

#

To take all that compute for 12.8T training

ocean vortex
#

wtf is happening

#

@unborn ocean ?

hollow ocean
#

mogged by grok 4 heavy

whole wagon
#

Average meltdown over anyone talking about grok

whole wagon
#

😂

hollow ocean
#

count the Elon haters

unborn ocean
#

yeah, that was just too much speculation, "trust me bro", and jumping from one thing to the other with a gazillion unseen assumptions...

#

got nothing to do with grok

#

thought it would be fun to make u aware of it

#

well we can argue about the educated part

#

the point is i do not and you do not either

#

which is why the best guess is none / a very conservative one, instead of gemini 2.5 pro = 6t, to maintain best possibilities of being right

#

people like to overestimate their own ability to predict -> they are worse than uneducated people (that thus don't overestimate, because they are aware of their own unfamiliarity with the topic)

#

that just picks safe stuff

#

not on all things, but especially on things like these

#

yeah, i changed the comment for a reason dude

#

well, sometimes people have to hear it regardless

#

if i had to guess, i would not want to rule out something very close to or above 1t

#

just because of how sparse many of them have become

#

but the safe option is probably within that ballpark

ocean vortex
#

Dork4 describing how to deport Musk

hollow ocean
#

What other model has similar writing style or better than GPT 4.5

unborn ocean
#

well google has a model larger than 2.5 pro

hollow ocean
#

GPT 4.5 is just different

#

More natural

ocean vortex
#

Nah he must go 😇

ocean vortex
hollow ocean
unborn ocean
#

10tn might be a bit large, scaling a model effectively to a size like that, idk

ocean vortex
#

Dunno about that but they most certainly do not have one that performs better than 2.5Pro. They do have 1.0 Ultra which is bigger than Pro but it performs like crap. Not really better than 1.5Pro

unborn ocean
#

especially the big model you want to distill from

#

they are likely not in a lower quantisation (because of distilling and stuff like that), so a 10t model would be insane

ocean vortex
#

Dunno about "by and large", I think it was somewhat overhyped to be completely honest

#

they haven't finished post-training of 3.0 yet afaik

#

99% they are training it now

unborn ocean
#

100%

ocean vortex
#

-1

#

behemoth is a disaster

#

No lab should follow in those footsteps lmao

unborn ocean
#

meta is trying to match the big labs compute wise and has been known to heavily brute force for the llama 4 training

#

so they are probably more a sign that the other models are a bit smaller

ocean vortex
#

It's still "training". Because it's way too big to make sense

#

epic failure

unborn ocean
#

the reports on it talk about them using it for distilling and basically scrambling to get something out of the compute invested, by training the smaller models etc.

ocean vortex
#

Yeah but they did that way before there was any competition and way before anyone even knew how to do it

#

it's now 2025

#

not 2023

unborn ocean
#

they should have been ready

#

and those two are fails

#

massive compute went in

#

scout got like 40t or 60t tokens of training, idk (don't remember)

keen fulcrum
#

Grok 4

ocean vortex
#

Deepseek started training of their model I believe later than meta. The main difference is their model was not oversized to oblivion

#

so it didn't take ages to train

#

wdym. R1 is much smaller than Behemoth - fact

#

And they are training Behemoth for ages now

#

Deepseek released AND updated R1

#

in the meantime

unborn ocean
ocean vortex
#

Well that's what I mean. Meta failed so hard that we just HAVE TO talk about a model that is not even released lmao

#

In no way, shape or form did you refute what I said earlier. My point still stands lol

#

Behemoth is a disaster

#

At the end of the day, no one really cares about parameters. People only care about performance

dawn wharf
ocean vortex
#

And R1 already performs better than Behemoth ever could hope for, tbh

dawn wharf
#

where's huawei's GPUs

hazy quest
#

Hey guys, got again the "Which response do you prefer?" on AI Studio, within 5min of use. Last time I got it was this morning also very quickly, and the previous time before was a loooong while ago.

What about you, has it popped for you too today?

ocean vortex
#

It is a valid comparison because it scales exponentially. For Meta it's taking longer DESPITE having infinitely more compute lol

unborn ocean
#

they have more (not much, but a bit more than 10k)
the training only took like 2 weeks for v3 (obv just one run, they def did multiple test runs, tweaks etc.)

#

so idk what you are talking about

ocean vortex
unborn ocean
#

e.g. the gemini ultra vs 1.5 pro moment was also legendary

ocean vortex
#

Do you really think they are gonna tell you officially about all the compute that they have? lmao

#

Also, this is China

unborn ocean
#

brains > compute (sometimes)

ocean vortex
#

we are talking about

young verge
#

Hey, is grok 4 really sota? or did they game the benchmarks. When will we see it in the LMSYS arena leaderboard?

ocean vortex
#

Never said a million, but believing some official figure from China is silly. I don't think it would have been possible to do what they did in such a short period of time. Then host the model and train the updated version at the same time, just with this, tbh.

ocean vortex
unborn ocean
#

well a lot of smart people are not just randomly in a room together, deepseek hires these young grad students from top chinese unis for a lot of cash
-> there is no way these people don't get the compute they need (not saying there is no constraint though)

young verge
#

Post training on validation set is all you need

#

😄

ocean vortex
#

Grok3 was good, Grok4?... Can only say negative things about it to be completely honest...

#

Not only does it not perform well in reasoning, but the whole fine-tuning is super weird. It reasons for 50k tokens, which are all hidden, then gives you a one-word response. Wtf?? 💀

young verge
#

When I first saw the benchmarks and that they scaled post training, I thought they had a breakthrough, but guess not lol

unborn ocean
#

(or have yet to build their people)

#

but imagine researching new stuff in an 80h work week, seems unreasonable for most (bc it is)

ocean vortex
unborn ocean
#

nah, "we" (i am not from the us) still do that...

#

though i think that deepseek has a smart strat when it comes to this

unborn ocean
ocean vortex
#

New one is 99% of the response hidden, then the remaining 1% is the final output, which has a high chance of being incorrect (comparing with o3 or 2.5Pro which tend to not only give the correct answer but also provide thinking summary + detailed answer)

unborn ocean
#

probably was not articulating it right: people still get hired for imo stuff (world wide), though in the ai world, anyone with a good uni degree behind em gets a good offer

#

the most typical area where a lot of people get hired like that is quant finance though

ocean vortex
#

grok3 vs grok4 in a nutshell:

unborn ocean
#

like 90% of the smaller math events i know of are sponsored by em

ocean vortex
#

Like you can't even make this sht up lmao

#

in what universe is this an acceptable answer...

#

💀

unborn ocean
#

well from what i hear, it is still definitely a route for many, though honestly the performance on stuff like that is just less relevant in a lot of these areas these days

#

and deepseek probably also does not care, they want young people with good ideas, imo is not the only way to measure that

#

especially when you are recruiting cs and engineering talent that might have never even attended a math club or the like (and does not need to)

ocean vortex
#

Grok4 reminds me a bit of og gpt4 with no system prompt

#

extremely concise

#

except GPT4 performed very well at the time even despite that...

dawn wharf
unborn ocean
#

bro turned into bing chat, lol

ocean vortex
#

lol

#

Grok3 was correct though

dawn wharf
unborn ocean
#

@ornate agate, i don't really spend time in these communities anyways

#

a couple of my friends got in like that (might just be the uni degree in their cases)

ocean vortex
#

Summaries are not perfect but that I would still classify as "acceptable". Hiding it completely however, is not... Especially when your final responses are so insanely concise

primal orbit
torn mantle
#

k2

#

all models get it right

#

so far

ocean vortex
hollow ocean
#

Grok 4 heavy mogs it

ocean vortex
#

This would have been "fine", if it performed. But it doesn't seem to perform lol

ocean vortex
#

I really do not see how this public model can score that. Tempted to do some benchmark run myself from the ones they did but that's also going to be expensive af lol

hollow ocean
#

It’s the real deal

leaden palm
#

that's like comparing it to v3

ocean vortex
# hollow ocean You gotta try grok 4 heavy

Not interested in that. It's too expensive to be useful for 90% of people. If grok4 the normal one is sh'it, then those other versions are kinda irrelevant tbh... might as well just use o3/tools/pro

hollow ocean
ocean vortex
#

That prompt was one of many it fails. I just don't see it performing tbh

ocean vortex
leaden palm
#

i also find grok 4 (the model you get on the api) uninteresting but from what i've seen, when you're using it from grok.com, it's decent

ocean vortex
#

o3 is good with none of that, so is 2.5Pro, so... 🤷‍♂️

leaden palm
ocean vortex
leaden palm
#

i can kinda see why it does that

#

like imagine being an ai assistant

#

you have no opinions

#

then someone asks you of your opinion

#

all you know is that you're grok by xai and you want to be consistent with what you've said before

#

so you search what grok says and what your creator says

main gulch
#

do you mean by "serious" asking Grok their opinion about Israel/Palestine?

lone vector
#

that was quick

hollow ocean
#

No other model can do this tho

ocean vortex
#

And really... if you add tools into the equation, testing things like math skills becomes completely irrelevant. You are testing the python env at this point rather than the model lol

ocean vortex
ocean vortex
#

only 20% for xAI, that is insane. This just made me bet on xAI even though I hate Grok4 lmfao

#

it still has a very reasonable chance to top lmarena

lone vector
#

as long as nobody releases another model for 3 weeks

#

20% is pretty low though

ocean vortex
#

chatgpt-latest almost has that spot... the bar is not very high. Just some system prompt I think and their usual response formatting which tended to do very decent on lmarena (with grok3)

dusty hazel
#

Been the #1... For a week

ocean vortex
#

ok I'm not sure this is what I had in mind, but I suppose this is better than blank system prompt 😂

#

It's crazy though that this is not the same as (or a variation of) the system prompt used on the grok website

#

One would think they have the highest chances with that

main gulch
#

so wolfstride is Ultra?

sharp olive
#

Is the Grok 4 version of Chatbot Arena the reasoning one?

main gulch
#

Grok 4 is reasoning-only

elder rapids
#

in a lot of my testing it beats kingfall

main gulch
#

they love kingfall

elder rapids
#

pretty badly

sharp olive
elder rapids
#

stonebloom did pretty well universally and it was more concise

#

wolfstride needs to be tweaked tho

main gulch
elder rapids
#

depends if it even is its own model

#

pretty sure it would be

#

and thinking variance would be the differentiator

main gulch
#

I think there will be many models under the same brand

elder rapids
#

but still we don't know

#

btw didn't they say all tiers would have access to gpt 5

#

unlimited too

main gulch
#

are you talking about the March or May checkpoint?

#

March was better than May

elder rapids
#

what do you think the benchmarks are going to be

#

I'm ngl the ultra lineup is definitely the best I've used

#

out of all models

#

though 2.5 pro GA is similar

main gulch
#

pricing is more interesting

elder rapids
#

I agree

#

it'll probably be around 10 input

#

or 15 input

#

sum like 75 output
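
For a feel of what those guesses would mean per request, here is a small Python sketch. It assumes the numbers are dollars per million tokens (the chat does not state units), and the token counts and the `request_cost` helper are made up for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, with prices given per one million tokens (assumed unit)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical request: a short prompt and a long reasoning-heavy answer.
PROMPT_TOKENS = 2_000
ANSWER_TOKENS = 50_000

for input_price in (10.0, 15.0):  # the two input-price guesses from the chat
    cost = request_cost(PROMPT_TOKENS, ANSWER_TOKENS, input_price, 75.0)
    print(f"input ${input_price:.0f}/M, output $75/M -> ${cost:.2f} per request")
```

With a long answer the output price dominates, so whether input lands at 10 or 15 barely moves the total.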

main gulch
#

I think slightly cheaper than Opus 4

elder rapids
#

yeah definitely

elder rapids
#

nah

#

it'll be more expensive

main gulch
#

I always confuse SimpleQA with SimpleBench

elder rapids
#

ye

main gulch
#

but Ultra will be SOTA in both

misty vault
#

ye

elder rapids
#

wonder if theyre going to have major architecture changes to Gemini 3

main gulch
#

will Gemini Diffusion go into GA

elder rapids
#

that's what they're probably trying to figure out, I don't think it'd ever be the flagship model tho

main gulch
#

or at least open preview with API

elder rapids
#

i think it would be something the model calls to

main gulch
#

Veo 4 should be a great release btw

#

and maybe more important for Google in PR terms than Gemini

elder rapids
#

ye

#

veo 3 was the best thing to happen for Google in AI

#

imo Gemini 3 could be the largest leap they're trying to make

#

just simply based off the fact 2.5 was moved on with

#

and their claims of video understanding

#

and the ultra route they're taking

#

wonder how that'll affect its inherent spatial understanding in text

#

Gemini 4 might leave the realm of modern tokenizers at this rate

#

btw is there going to be an intermediate model between pro and flash?

main gulch
elder rapids
#

I've been seeing people talk about some new "2.5 standard"

#

sounds like bs to me

main gulch
sacred quail
#

is grok 4 thinking too long or is it just server issue

#

on lmarena

main gulch
hollow ocean
#

Get 3 months of Sentry’s team plan free: https://sentry.io/fireship

Elon Musk has the 'trust me bro' benchmarks to prove that Grok 4 is the world's most powerful AI model. But just how well does it compare against competitors in real life scenarios? And is it still calling itself MechaHitler?

#Grok4 #Grok #elonmusk #coding #tech

💬 Chat w...

dawn wharf
whole wagon
#

“The talks between OpenAI to buy the startup for $3 billion ended in recent days after Windsurf’s team raised concerns over how the coding assistant would fit into the OpenAI and Microsoft agreement, which requires OpenAI to share its technology with Microsoft, according to two people familiar with the company’s discussions.”
- The Information

#

💀

wintry tinsel
#

Grok 4 is such a benchmaxxer

#

What is it useful for?

#

Claude is better at coding and writing

#

It seems to be best at logic and math

whole wagon
ocean vortex
#

Dork fails creative writing 🧐

hollow ocean
#

o3 pro best creative writer

sacred quail
#

O3 is interesting choice

#

for creative writing

#

Too mechanical for me

torn mantle
#

nah i think it will be top 3

#

let me check the actual ranking

#

yea no3 for sure

dawn wharf
elder rapids
#

what does "high" mean for grok 4

#

lmfao

torn mantle
elder rapids
#

nah but I mean

#

it's not a bad model

#

so we can toss that idea aside

#

it just fails so much in practice

#

it's absurd

torn mantle
#

no i mean part of that is the reason why its bad at creativity

elder rapids
#

that's not the case

torn mantle
#

it is the case

#

what are you talking about

elder rapids
#

nah that's not how it works

torn mantle
#

how does it work then

elder rapids
#

not how claims work either

#

accuracy and technical things don't exclude creative ability

#

that's bs

#

seems like they rl'd too much for specific domains

#

rather than general abilities

torn mantle
#

when a model is trained for maximum truthfulness then it will likely be optimized to select the most statistically probable and factually supported words = which leads to more predictable and less creative outputs

elder rapids
#

that's not how it works

torn mantle
#

thats how probability distribution works

#

and there is a tradeoff between truthfulness/accuracy & creativity, the question is how did they fix that

elder rapids
#

it would be impossible for these things to prevent creative ability, but the LACK of creative ability in the first place seems to be the reason for all this

#

aka they never rl'd it for any sort of creativity in the first place which is pretty bad for the model, so the question is how did they mess that up

torn mantle
#

nah

dawn leaf
torn mantle
#

temp is just a sampling method applied on top of that already built distribution

#

training is what shapes that statistical distribution
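
A minimal Python sketch of the temperature point: the logits stand in for whatever distribution training produced, and temperature only rescales them at sampling time. The logit values and function names are invented for illustration and do not come from any real model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn fixed logits into probabilities, rescaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, less predictable distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, fixed by training.
LOGITS = [2.0, 1.0, 0.5, 0.1]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(LOGITS, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]} entropy={entropy(probs):.3f}")
```

Temperature flattens or sharpens the distribution but never reorders the tokens, so the ranking itself can only come from training.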

elder rapids
#

I'm ngl you could just ask chatgpt for this

#

to explain what im saying

torn mantle
#

you could use chatgpt to understand what im saying as well

#

you are missing the whole point here

elder rapids
dawn leaf
#

GPT 4.5, it's questionable to add.

elder rapids
#

this is basic

#

you don't know basic

#

you need chatgpt

#

I do not

torn mantle
#

i think you need gemini to dumb it down for you

elder rapids
#

yo

#

you are not someone who works with AI

#

stop larping dawg

#

just ask chatgpt

#

😭

torn mantle
#

just ask gemini

elder rapids
#

to put what I'm saying to you in other words? 😭

#

you never said anything explicit to me for it to define

#

you're not correcting me, nor are you making a claim

#

you're just expressing a bad understanding of model training and alignment

torn mantle
#

what was the prompt

elder rapids
#

that's not even new information

#

quit the ragebait

torn mantle
#

eli5 what asura was saying?

elder rapids
#

🫩

#

Craig level

torn mantle
#

im jk

#

dont take it seriously

#

ignore him

elder rapids
#

the level of troll

#

you say random shi

#

and that's what asura was doing

#

bugging asf

torn mantle
#

we planned to launch our open-weight model next week.

we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.

while we trust the community will build great things with this model, once weights are

#

idkd idkdkdi its 3 am

#

honestly i half-read what you said

#

but im still right tho

#

im always right

#

uhm

#

right

#

nod

#

agree

leaden palm
leaden palm
#

idk if they actually have the open model

whole wagon
#

Openai is an actual meme

#

What additional safety? There are already strong open source models

#

And the world did not end

#

"we are not yet sure how long it will take us." not even giving an ETA

leaden palm
#

this type of slop

hollow ocean
#

o3 and o3 pro score the highest on creative writing but no one uses them for writing 🤔

dawn wharf
hollow ocean
dawn wharf
hollow ocean
dawn wharf
#

so basically it's a useless benchmark

hollow ocean
#

I’ve never heard anyone say o3 or o3 pro is the best at creative writing

whole sundial
#

kimi k2 wins "KyrieBench", first open source model to get it right (prev. o1/o3/gpt4.1 and grok 4)

#

i now agree that kimi k2 is the real reason why openai's open source model was delayed

#

also first Chinese model to get it right

rare python
#

yeah

dawn wharf
#

because it can launch nukes or something idk

whole sundial
#

i don't think the chinese are doing much safety testing outside of making sure it doesn't tell people what happened in 1989, definitely just a stupid OpenAI excuse to delay the model

#

until it is not relevant anymore

#

when i said "OpenAI", it should be "ClosedAI" lol

dawn wharf
whole sundial
#

k2 reasoning and r2 will be better than their open source model

whole wagon
#

I am confused. Why would K2 cause a delay

#

I don't think they were the open source SOTA anyways. Not in absolute performance

dawn wharf
#

did they even say what the model's main goal is?

whole sundial
#

model performs like gpt 4.1, add some reasoning and it could be better than o3

whole wagon
#

The openAI model that is

#

The leaks are all like o4-mini level at best

#

I think there is another reason

#

That it's delayed

hollow ocean
#

Yall still trust benchmarks or no

#

🤣

#

Infinite o3 pro glitch

#

Multiple emails

whole wagon
#

😂

elder rapids
#

there's nothing about a small model like Sam Altman hinted at that needs to be compared to a 1T model

#

deadass

#

what's with the sensationalism getting worse here

elder rapids
#

😭😭

#

2 completely different caliber models

#

yeah o3 mini level

#

vs a 1T model

#

dumb comparison

#

you can't even begin to think that that model release somehow impacted theirs

#

he's right

#

once you release the weights

#

you can't unrelease them

#

it seems like nobody is trying to think about how impactful this could be to openAI's future

#

(VERY meaningful)

hollow ocean
#

5 accounts with o3 pro and 4.5 @deep adder

#

Best method

#

One runs out move to the next account

#

Paying $1 is better

#

99% off

#

🤣

#

Google ultra accs are being sold for cheap

#

๐Ÿคทโ€โ™‚๏ธ

#

Not buying them just saying it’s a thing

elder rapids
#

the model has to be completely safe and smart enough (aka aligned) for it to cement itself, especially for its size. It doesn't have to be an insane model, or a good model at all, but if it's not reliable just like their closed API then it's completely a waste. It's the first real example besides gpt-2 of a model they've developed that can be looked through. It also influences government decisions like open-weight performance to appeal to as a baseline from a partnered American company (and could influence laws etc etc). It also revives openAI open weight expectations and deflects legal claims against its closed off nature (and could reinforce its standard of "safety"). It also re-cements openAI into the open source community and could create a very large pool of developers/researchers who choose to use the open source model

#

this is the biggest thing openAI could do tbh

leaden palm
#

@deep adder @elder rapids on the "less than 100b params" and "small model like Sam Altman hinted at" point

*note the plural

#

ofc it could be relatively small compared to a 1t model

#

but "it's not a small model"

elder rapids
# leaden palm @deep adder @elder rapids on the "less than 100b params" and "...

who is he and how is what he said verifiable at all? and this all doesn't matter because the point isn't that you'd be comparing performance regardless, if it's not precisely a 1T model just like the Kimi model then it's simply wrong to say this is the cause, it could be 900B, it doesn't matter. I can also argue one's a reasoning model, the other isn't, so it's a completely refuted skepticism

#

I agree there could be nuance about its size though, the problem is Sam Altman has been all over Twitter clarifying hobbyist utility and it should be "ran on GPUs"

leaden palm
#

i'm awaiting the openai open model but i find sam's activities odd

elder rapids
#

they're in the same position as you with the skepticism lmao, big providers having access to the model isn't meaningful towards its alignment or sentiment towards delaying it a little bit. He also has no idea whether these big providers even have it

#

if they planned to release the model so soon obviously preparation would've been met, calling it off unexpectedly as he framed it would align with what I said

leaden palm
#

perhaps i'm too trusting but i'd expect being well known and from a decently sized inference/dev org (lambda/nous) to indicate having connections and indicate that their claims are based in reality

elder rapids
#

there's plenty of examples of people on Twitter in the AI space saying things and then getting corrected by the actual researchers

#

I forgot that dudes name

elder rapids
# leaden palm true...

I'm ngl it's kind of scary too how much it would sway public opinion, especially in the AI subreddits until actual confirmation arrived

#

but yeah regardless, I'm not saying you can't trust him just off the fact of trends (even though it wouldn't be a bad inference) there's just a lot of things that implicitly support Sam's (or the openAI researchers...) position here over the skepticism people have about what Sam said

#

kinda actually pains me that nobody tries to look for practical validity in the AI Twitter space, like you can inductively infer something you know

#

instead of expressing the opposing claim outright

pulsar tendon
#

Like I think their claims have validity

elder rapids
#

it's not that they have no validity, it's just that however valid they are, none of them could know unless they were actually there and could frame it non speculatively like how they're doing

#

"literally never trust this guy" like bro what are we doing

#

it's not even that either, there's just more evidence for the non speculation than the pro speculation

whole sundial
# leaden palm i'm awaiting the openai open model but i find sam's activities odd

it's true, US labs make promises but don't keep them. Meanwhile a Chinese company (Baidu) made an explicit promise when they released ERNIE 4.5 in March that the model would be open sourced on June 30. They released everything related to ERNIE 4.5 on that date. Meanwhile xAI made a promise in February to open source Grok 2 "in the coming months". Grok 4 has come out and the weights for any xAI model other than the original Grok are nowhere to be found. OpenAI made a similar promise, but with a special model made for open source. Every time the model is close to release, they delay it for arbitrary reasons.

TL;DR: China keeps promises better than US labs.

elder rapids
#

even tho I agree with him

whole sundial
#

true, but China is better at playing catchup

elder rapids
#

it's non evidence for anything other than "China companies cooler than America's!!"

#

and doesn't help the case here

whole sundial
#

I'm not saying there are not US companies that commit to open source, Meta and Nvidia have done it, Google has released lesser versions of their flagship models, but China has just done it better. There is no other way I can say this. US focuses on commercialization and their models are popular in those environments but China just releases the models openly to get themselves in the door. Maybe they are trying to influence US opinions with their models, we don't know. But the point is China just does it better. I'm not saying Chinese models are better than US ones, but the difference is that the US ones are closed and many Chinese ones are open. And the ones from the US that are open are worse than China's.

echo aurora
#

Lets keep things focussed on AI pls

whole sundial
#

ok

#

oh are you talking about Craig?

#

I was just talking about AI

#

but there may come a day that Chinese models become better than US flagships. They are improving themselves; Kimi K2 is based on DeepSeek's arch but is still better than V3 due to its size. And yes, they do distill from US companies but they are doing it out of desperation, they want to catch up to the US. And now their reasoning models are close enough to US ones in performance that they can distill from their own. Yes, they are still trying to distill from US models because China can't create enough English synthetic data with their own models, but they won't need to do that anymore. And US companies like Meta are basing their models off of what Chinese companies have done because they need to catch up in the open source race. Llama 5 may very well beat China's models when it comes out, considering Meta poached some of AI's best minds. Yes, US models will keep on improving themselves and will still be very good models. But I think China is just accelerating the race and also helping to push the frontier. They just don't have models that beat the US yet because of GPU sanctions. US companies are taking China's work and using their resources to make it better.

#

yes, but they are doing research that US companies are applying to do so

#

lol people on twitter are saying that K2 is to blame for OpenAI's delay (I do believe them, because that model probably beats their open source models in things that just can't be done well with smaller models, like knowledge based benchmarks like SimpleQA)

#

which OpenAI invented btw

#

it's not like they are telling us, US companies don't like releasing papers anymore

whole wagon
#

K2 is not a reasoning model i would assume openAI can beat it with a reasoning model

#

Firstly

whole sundial
#

it will soon become one though

#

they said they would make it reason and multimodal

whole wagon
#

Who cares really. It's always going to get overtaken in a few months that's how things work

whole sundial
#

true, it will be beaten by both US and other Chinese models

tidal schooner
#

i wonder how k2.5 will fare tho

rare python
tidal schooner
#

should clarify

rare python
tidal schooner
rare python
#

They might see a leap big enough to call their next gen model k3

#

and hopefully k2.5 or k3 will be smaller

elder rapids
#

I think it's extremely braindead

#

a lot of what Kiri is saying is non evidence, closed AI and open source are indistinct in the context they're trying to make an argument for

#

US open source models being worse than China's (which isn't true, the only outlier is deepseek) isn't meaningful at all, especially when you consider total competition is in context relative to the respective countries

#

China having a different open source culture is obviously true, but it's not as meaningful as made out to be

harsh flume
#

Anyone here tried deep research with Grok 4 Heavy?

elder rapids
#

not sure why he mentioned Chinese models eventually being better than "US flagships" like the US isn't progressing at a much much faster rate than the Chinese models he's calling into question, and acting like China hasn't been benefiting the most from US open source, enabling their progress

harsh flume
#

That's my only use case for Grok and i'm wondering if it performs better than std Grok 3/4

whole sundial
#

ok that makes sense

#

I'm sure Chinese models will be right behind the US though

#

And I'm sure China is using US models for training or other purposes

#

they would still be behind if they didn't

elder rapids
#

countries like Japan seem to be more experimental

whole sundial
#

by France you mean just Mistral

elder rapids
#

yep

#

but that's enough

whole sundial
#

and they have some closed source models, probably to get people to use their API, which is sad

#

but yes, mistral has done a lot for open source

#

7b is still one of the most famous and modified open source LLMs out there

#

Nemo is maintaining its legacy

harsh flume
#

lol Grok 4 Heavy has no Deep Research tool 💀 😵 💀

#

Just torched 300$ apparently

elder rapids
#

it's alr tho, they're adding a lot of stuff

#

so you might get a research agent eventually

harsh flume
#

I mean I cant help my curiosity and impulse bought to try some stuff

whole sundial
#

their early MoE models likely inspired later MoEs like DeepSeek

harsh flume
#

made a deep research styled prompt anyway and am running, hopefully it has research heuristics that are good tho not labeled as a selectable tool

whole sundial
#

(like 8x22B)

elder rapids
#

I agree

#

at least in the open source

whole sundial
#

Mistral models aren't great now though

#

I recently gave an easy prompt to Magistral Medium (the closed source commercial model!) and it went into an endless loop

tidal schooner
harsh flume
#

This one prompt ended in 4m

red sluice
#

It's really too easy to spot an answer made by ChatGPT or Grok 4. It kinda kills the fun of the arena. The fact that LLMs show personality isn't bad as such, but it really makes the whole system rigged. If I want to penalize Grok 4 in order to earn some money out of Polymarket, I can literally do it, just let me play lmarena non stop for 20 hours, pretty sure I can secure 100 lose-on-purpose battles on LMarena. If I take precautions not to get caught, I use free residential IP VPN providers, and I can go literally uncaught.
Kinda sucks, no system is perfect, but I hate to know that it's doable and that it probably has been attempted before.

#

ChatGPT: Emoji at the end, structure.
Grok: Grokking.

harsh flume
#

loool theres some Mechahitler-esque tone to its answer 🤣 (I made it cross-analyze deep research results from Gemini and o3 with its own findings)

tribal aspen
#

Hi so how can I send images to the model on Direct Chat on LMArena?

red sluice
tribal aspen
# red sluice

Oh never mind Im not getting it for Opus 4 and grok 4

indigo hazel
#

grok 4 is still great even if it cant use tools? better o3 or grok?

lime coral
#

why not just try?

quartz light
#

R R A L L I O R P

whole wagon
#

Sam kinda pathetic giving this safety excuse and underestimating the intelligence of everyone else on the planet

indigo hazel
whole wagon
#

They have been testing this for a month, you dont just suddenly realise the model is 'unsafe' just before the release

#

That's not how it works it's obvious to everyone

#

Musk calling grok 4 agi repeatedly also lol

tidal schooner
whole wagon
#

This ain't how you determine AGI

tribal aspen
#

also

#

will grok heavy be available on LMarena direct chat?

#

when api of that comes out?

whole wagon
#

I assume not. For same reason o3 pro not there

#

Expensive and slow

tidal schooner
whole wagon
#

When do we think grok 5 is coming

#

Before the end of the year is my guess

tidal schooner
#

december is stretching it tho

whole wagon
#

Didn't musk say grok 5 was cooking already

tidal schooner
#

it could arrive maybe a bit early then

#

grok 1 to grok 2 was about 9 months

#

grok 2 to grok 3 was about 6 months

#

grok 3 to grok 4 was about 5 months

#

keeps slashing it between releases

#

so earliest could theoretically be november but highly unlikely imo
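The extrapolation above can be written out as a rough sketch. It takes the three gaps exactly as quoted in the chat, and it assumes Grok 4 shipped in July 2025, which the chat itself never states. The "shrink by one month" rule is only a guess at the trend, not anything xAI has announced.

```python
# Gaps between Grok releases in months, as quoted in the chat (approximate).
gaps_months = [9, 6, 5]

# Assumption: Grok 4 shipped in July 2025 (not stated in the chat itself).
GROK4_RELEASE = (2025, 7)


def add_months(year, month, n):
    """Return the (year, month) that lies n months after the given one."""
    idx = (month - 1) + n
    return year + idx // 12, idx % 12 + 1


def next_release_window(gaps, last_release):
    """Earliest guess shrinks the last gap by one month; latest keeps it as is."""
    earliest = add_months(*last_release, gaps[-1] - 1)
    latest = add_months(*last_release, gaps[-1])
    return earliest, latest


print(next_release_window(gaps_months, GROK4_RELEASE))
```

Under those assumptions the window is November to December 2025, which lines up with both the "november" and the "before the end of the year" guesses.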

#

3.5 seems more realistic

whole wagon
#

There are 172 days till the end of the year
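A quick way to sanity check a day count like that, as a small sketch. The year 2025 is an assumption taken from the "end of 2025" remark elsewhere in the chat, and counting up to December 31 (rather than January 1) is also an assumption about how the number was computed.

```python
from datetime import date, timedelta

# Assumption: "end of the year" means December 31, 2025.
END_OF_YEAR = date(2025, 12, 31)


def days_left(today, end=END_OF_YEAR):
    """Number of days from `today` until `end`."""
    return (end - today).days


def date_with_days_left(n, end=END_OF_YEAR):
    """The calendar date on which exactly n days remain until `end`."""
    return end - timedelta(days=n)


print(date_with_days_left(172))
```

Running it the other way round, `days_left(date(2025, 7, 12))` gives the 172 quoted above.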

#

I think end of the year release makes sense, Gemini 3 is coming then too

tribal aspen
tidal schooner
tribal aspen
tidal schooner
#

haven't checked costs tho

tribal aspen
tidal schooner
#

shouldn't be as bad as the multi-agent framework grok 4 heavy uses tho but idk

tribal aspen
#

but gemini costs arent too much

#

they are cheap models

tidal schooner
tribal aspen
#

in lmarena?

#

like opus thinking and grok 4 are still high class models

#

even with the data i dont think its too practical to provide these models

whole wagon
#

Deepthink isnt going to be cheap for some reasons which cannot yet be stated lol

tribal aspen
#

cuz gemini models are cheap

whole wagon
#

It won't

#

They are cooking up something special there

#

Which is why the delay

tribal aspen
#

they already had a model that was performing better than the upper-tier models from other companies

#

I mean 2.5 pro was better than o3 pro and stuff

#

so there was no point of releasing another SOTA having their own model on top already

whole wagon
#

Very wrong

tribal aspen
whole wagon
#

There was no point releasing gpt 4 cos gpt 3 was already SOTA

#

Very very wrong

tribal aspen
alpine coral
dusky aurora
#

I really hope Gemini will have new updates this month

alpine coral
#

kingfall (one among 15 questions it was answering) passes. like it's possible to acknowledge the analogue clock possible interpretation, but not give that as the final answer

alpine coral
fleet lintel
hazy quest
frigid coral
#

damn viren's here

#

must be a legit discord

alpine coral
torn mantle
#

@alpine coral can u benchmark kimi k2 pls

#

No but seriously did anyone notice the first principles Elon was talking about?

#

'It will provide answers that you won't find on the internet'

#

didnt he say that too

rare python
keen fulcrum
#

Others have too

#

Smartest AI I have used

unborn ocean
#

forgot to add: for 4.1 / o3 + 8 is the current price

ocean vortex
#

For reference, Deepseek R1 official API is $2.19. This sounds insanely low and OpenAI have more traffic, but also more efficient GPUs and better infra, so I still think 2-3.

#

Deepseek's cost is probably around $2 dead, and those 20 cents is their entire "profit" lol

keen beacon
#

they have a 545% margin...
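The two claims above imply very different costs, which a few lines of arithmetic make visible. This is only a sketch: it assumes the 545% figure means profit divided by cost, and it uses the $2.19 price and the guessed $2 cost exactly as quoted in the chat, without checking either against any official source.

```python
# Price quoted in the chat for the DeepSeek R1 official API.
PRICE = 2.19


def margin_over_cost(price, cost):
    """Profit expressed as a fraction of cost, e.g. 0.095 means 9.5%."""
    return (price - cost) / cost


def cost_from_margin(price, margin):
    """Cost implied by a given profit-over-cost margin (5.45 means 545%)."""
    return price / (1.0 + margin)


# Claim 1: cost is about $2, so the margin is under ten percent.
print(f"margin at $2 cost: {margin_over_cost(PRICE, 2.0):.1%}")

# Claim 2: a 545% margin, so the implied cost is a fraction of the price.
print(f"cost at 545% margin: ${cost_from_margin(PRICE, 5.45):.2f}")
```

Under these assumptions the first claim puts the margin at roughly 9.5%, while the second puts the cost at roughly $0.34, so the two messages cannot both be right.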

frigid phoenix
#

Do we know when grok is going to be added to the leaderboard Hmm

ocean vortex
#

Since it's already collecting votes

#

this looks wild now though. OpenAI with no model released yet higher odds than xAI... lol

candid storm
#

Youre looking at end of 2025

frigid phoenix
#

Would be smart if staff here waits for the last day of the month and puts grok in top 1

torn mantle
#

i would still bet on google tbh

frigid phoenix
#

While having PM shares on it

ocean vortex
#

this is more realistic ok

torn mantle
#

yea

ocean vortex
#

Betting on Google still would be silly...

torn mantle
#

why

#

i would still bet on them

#

even if the profit is low given the high chance

candid storm
#

I would not bet on july personally rn

#

Too uncertain

ocean vortex
torn mantle
#

im not feeling grok 4

candid storm
#

But never bet against Elon!

ocean vortex
torn mantle
ocean vortex
#

If they tuned it for arena prompts I think chances are very reasonable

tribal aspen
#

Why can't open ai codex be on LMarena?

#

direct chat

torn mantle
ocean vortex
#

wolfstride

#

Seems like they are testing a bit random things and then retracting it

tribal aspen
#

is opus 4 good or 2.5 pro good in mass coding?

#

like for starting to build a vibecoding project?

tribal aspen
#

ohh

#

why cant we upload images to claude 4 opus on lmarena direct chat?

ocean vortex
#

Opus is gonna be significantly more concise, then Sonnet 4, but 3.7 even more unhinged

tribal aspen
tribal aspen
#

what about gemini 2.5 pro?

#

so opus 4 not good for coding?

ocean vortex
#

it is much more concise with outputs

tribal aspen
#

also this

tribal aspen
#

what about 2.5 pro vs sonnet 3.7 vs sonnet 4

#

which one should I choose?

#

also the new Kimi 2

ocean vortex
tribal aspen
#

)

#

do u have any idea about kimi 2?

ocean vortex
tribal aspen
#

also so opus 4 and sonnet 4 are good for nothing?

tribal aspen
#

i might try it

#

im actually so confused

#

like at this point

#

there are so many models

#

2.5 pro, opus 4, sonnet 4, sonnet 3.7, grok 4, kimi 2

#

also oai's codex

alpine coral
#

fwiw don't fixate on them too much

#

you're better off just using one / some of them, and getting a feel – they're not like insanely different at a surface level..

#

it's like maybe one will be the magic wand you want - build / fix whatever it is you're working on perfectly. but there's still effort along the way.. engaging with them.. they're all kinda similar in many ways

ocean vortex
tribal aspen
#

I tried it and went crazy

keen beacon
#

its so slow tho

tribal aspen
#

like its one-shot code is so much better than 2.5 pro and grok 4

tribal aspen
alpine coral
tribal aspen
#

also its good on the benches

ocean vortex
#

As a non-reasoner it's a good option in that segment yeah

rare python
#

Unless they have insane fund

keen beacon
#

theyre probably hosting it on their own gpus, cost is probably not that high for them

ocean vortex
tribal aspen
#

I created this one with gemini

#

yashjit.tech

rare python
keen beacon
#

yea i meant if you use it on their site. they aren't losing piles of money

#

probably

ocean vortex
#

You also have 2 more providers

alpine coral
#

for sure

keen beacon
#

deepseek api is very slow too

rare python
keen beacon
#

i remember getting 60 tps. now its so slow and at times unstable

rare python
#

DeepSeek V3 and R1 before it blew up to the mainstream is so peak. It's sooo fast

gusty helm
#

I feel they dumped xAI too early in that market

ocean vortex
round haven
#

Hey guys, does anyone know what happened to dragontail? I loved that model and it seems like none of the models publicly available today are as smart as dragontail.

ocean vortex
#

fp8 is full precision

#

for R1

rare python
#

dragontail seems familiar

civic flame
#

https://x.com/mckaywrigley/status/1943385794414334032 saw this and tried it with wolfstride

My thoughts on Grok 4 Heavy after 12hrs:

Crazy good!

“Create an animation of a crowd of people walking to form “Hello world, I am Grok” as camera changes to birds-eye.”

And it 1-shotted the *entire* thing.

No other model comes close.

Watch the full clip.

rare python
#

isn't it a Gemini model?

civic flame
#

it did it 0-shot and arguably better

round haven
#

The rumor was that it's a google model, yes, but it seems to be significantly smarter than 2.5 Pro

#

2.5 Pro still can't solve problems that dragontail has been able to solve when I tried it

civic flame
#

this model is really good

round haven
#

I would really like to use dragontail again

rare python
round haven
# rare python

how do they know these are all gemini models? By asking them?

civic flame
#

wolfstride does it in 1, as the prompt asks for, and with more people

round haven
#

dragontail has been able to solve problems none of the publicly available models I've tried can solve still

civic flame
#

like?

round haven
#

Like this one:

Let G be a group. Denote by a' the inverse of a for simplicity. Define the generalized commutator [a1,...,an] to be a1...ana1'....an'. So in particular [x,y] = xyx'y' is the usual commutator. Give a formula to write [a,b][c,d] as a single generalized commutator. Verify the derived formula step by step to check that everything cancels out nicely. I cannot read latex so use plaintext. Also make sure to use my convention by writing x' for the inverse of x.
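For anyone who wants to check a candidate formula (their own or a model's) without doing the cancellations by hand, here is a minimal sketch of a free-group word checker. It deliberately contains no solution. All helper names (`parse`, `inv`, `reduce_word`, `gen_commutator`) are made up for this illustration, and it assumes every generator is a single letter written with the x' convention from the prompt. A formula that holds in every group must hold in the free group, so comparing reduced words is enough.

```python
def parse(text):
    """Parse plaintext like "a b a' b'" into a list of (letter, exponent) pairs."""
    return [(tok[0], -1 if tok.endswith("'") else 1) for tok in text.split()]


def inv(word):
    """Inverse of a word: reverse it and invert every letter."""
    return [(g, -e) for g, e in reversed(word)]


def reduce_word(word):
    """Freely reduce a word by cancelling adjacent x x' and x' x pairs."""
    out = []
    for g, e in word:
        if out and out[-1] == (g, -e):
            out.pop()  # the new letter cancels the previous one
        else:
            out.append((g, e))
    return out


def gen_commutator(*elems):
    """[x1,...,xn] = x1 ... xn x1' ... xn', returned as a reduced word."""
    word = [letter for x in elems for letter in x]
    word += [letter for x in elems for letter in inv(x)]
    return reduce_word(word)


def show(word):
    """Render a word back in the x' plaintext convention ("1" for the identity)."""
    return " ".join(g + ("'" if e < 0 else "") for g, e in word) or "1"


# The product to be rewritten as a single generalized commutator.
target = reduce_word(parse("a b a' b' c d c' d'"))

# A naive guess, shown only to demonstrate how a failed check looks.
guess = gen_commutator(parse("a"), parse("b"), parse("c"), parse("d"))
print(show(guess), "| matches target:", guess == target)
```

To test a real candidate, pass each entry of the commutator as its own word, for example `gen_commutator(parse("a"), parse("b c'"), ...)`, and compare the result with `target`.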

civic flame
#

what's the answer

round haven
#

there are many

#

also you should try it yourself first, it's a fun puzzle

alpine coral
ocean vortex
#

What if you coded your own "heavy" except it has 5 instances of o3 and 5 of 2.5Pro, 5 of Grok4 (giving it the benefit of the doubt with that impressive arc-agi score)... Then can also add Opus4 for good measure I suppose for a total of 20. That would have been SOTA by a good margin 🤔
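A minimal sketch of what such a home-made "heavy" could look like. It makes some loud assumptions: each model is just a hypothetical callable that takes a prompt and returns a string (the stubs below stand in for real API clients), and the answers are combined by plain majority vote. Majority vote is a swapped-in placeholder; nothing in the chat says this is how Grok 4 Heavy or o3 pro pick their final answer.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

# Hypothetical interface: a model is any function from prompt to answer text.
Model = Callable[[str], str]


def heavy_ensemble(prompt: str, models: Dict[str, Model],
                   samples_per_model: int = 5) -> Tuple[str, int, int]:
    """Ask every model several times and return (winner, votes, total answers)."""
    answers = []
    for name, ask in models.items():
        for _ in range(samples_per_model):
            answers.append(ask(prompt).strip())
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, len(answers)


# Stub models for illustration only; real ones would call each provider's API.
stub_models: Dict[str, Model] = {
    "o3": lambda prompt: "42",
    "2.5 Pro": lambda prompt: "42",
    "Grok 4": lambda prompt: "41",
    "Opus 4": lambda prompt: "42",
}

print(heavy_ensemble("What is 6 * 7?", stub_models))
```

With four models and five samples each this gives the 20 instances mentioned above. Exact-match voting only works for short, checkable answers; for long-form output the aggregation step would need a judge model or some verifier instead.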

unborn ocean
bleak venture
keen beacon
#

claim the benchmarks without releasing the scoring system 🤣

ocean vortex
alpine coral
ocean vortex
#

Those Pro/Heavy models do well because often you (or the model) can actually spot a good response or what other attempts missed just by looking at it

unborn ocean
#

will not completely destroy the potential though

#

just as small hurdle

ocean vortex
unborn ocean
#

most of them get RLAI training with some version of themselves in the process -> they produce output they like

#

would be interesting to fine tune a good open model as the final answer generator using some rl on rlvr

civic flame
#

oops ignore that reply

#

didn't realise it was still on it

rare python
dawn wharf
#

does anybody know a free provider for K2 api?

round haven
#

why do i never get it

rare python
#

get what?

round haven
#

i thought they said it's still in the arena

rare python
#

only stonebloom and wolfstride has Gemini mystery models iirc

rare python
round haven
#

leo

#

anyway, a misunderstanding probably

#

When I use lmarena new UI i get these hanging responses. The generator keeps spinning and i never get a response. Grok 4 generator is still spinning from yesterday

#

v buggy

rare python
#

true

#

I don't know if those models thought too much and crashed or it's just delay

#

No thinking available, even as summary

pure anvil
dusky aurora
rare python
#

¯\_(ツ)_/¯

dusky aurora
#

sorry. I'm just hoping that there will be something better than 06-05

rare python
#

in lmarena right now but don't know when it will debut

dusky aurora
#

waiting for Gemini updates and Arena updates are my main joys

rare python
#

and what they are

dusky aurora
#

how do you use Battle mode?

torn mantle
#

Kimi k2 is what i expected deepseek v4 to be

#

But can v4 really be better than k2?

#

I don't think deepseek will train a base model to 1T

#

K2 feels exactly like grok 3, smart for a base model

main gulch
torn mantle
#

Gave like 2 wrong dates as well

civic flame
#

shhh

jade egret
#

openai sad

#

windsurf employees go to google

round haven
#

ok this is epic

civic flame
alpine coral
#

i've seen all kinda things in chinese language forums lately..

civic flame
#

yikes

alpine coral
#

kinda forget how epic it is how you can legit translate the content of webpages these days

civic flame
#

linux.do is great but also people over there have 0 regard for like anybody else

alpine coral
#

wasn't that long ago that translation was shite and clunky

pure anvil
#

lol

civic flame
#

wow what an edgy cool guy 🥶 🥶

pure anvil
#

wdym?

civic flame
#

💀

pure anvil
#

did you interpret it that way?

#

if so

civic flame
#

you just come across as a 14 yr old trying too hard to be edgy

alpine coral
#

locusts?

#

im just confused

#

but yeah.. discussion = exposure = patching

pure anvil
torn mantle
#

Leo gatekeeps everything ive noticed that too

#

And sometimes he plays dumb

keen beacon
#

you gotta do what you gotta do 🤷

torn mantle
#

I can use deep think and other gemini new models for free but imma gatekeep as well

civic flame
pure anvil
#

what?

#

no way

#

well I definitely didn't mean that

#

lmaoo

civic flame
torn mantle
#

Don't delete

keen beacon
#

i dont think he meant that

civic flame
#

then what did you mean 🙏😭

pure anvil
#

I didn't even know that was a thing

#

I meant free loaders kekkkk

civic flame
#

yeah nvm then mb

torn mantle
alpine coral
torn mantle
#

I wont share how to use wolfstride and stonebloom for free or maybe i will

alpine coral
#

but perhaps lost in translation

civic flame
keen beacon
#

nah ive heard the term used like that before

torn mantle
#

I want it to be patched just because some people are gatekeeping

keen beacon
#

post it here i dare u lol

civic flame
#

but alright

#

it's not hard to find if you do 2 minutes of research

#

🤷‍♂️

alpine coral
torn mantle
#

Yea but when i asked you, you played dumb

keen beacon
alpine coral
#

oh lol

keen beacon
#

i didnt even know about that lol

torn mantle
#

Just tell me im sorry cant share that with you

#

Although its an easy trick

alpine coral
#

cause yeah.. that seems like a coincidence or intended.. or some translation thing

#

or im just too high

torn mantle
#

Im just lazy to re

civic flame
keen beacon
alpine coral
#

parasite = freeloader

civic flame
#

I do think it would've made more sense to say parasites in that situation because it's more commonly used to describe free loaders but

#

oh well

pure anvil
#

That is a better word come to think

#

of it

torn mantle
#

Maybe i should contact lmarena devs

#

To patch it

keen beacon
torn mantle
#

@keen beacon should i?

civic flame
torn mantle
#

Yes or nah

#

Wild will decide

civic flame
#

what's even the point

#

and you didn't answer my question

keen beacon
civic flame
#

pretty sure they know by now anyway

pure anvil
civic flame
#

there are definitely some pretty smart people over there

#

but they're relentless lol

pure anvil
civic flame
#

!

drifting thorn
#

that’s Kimi

#

Btw what’s the system prompt of Grok4 on LMArena?

#

It seems different to the version in Grok app and in API

jade egret
#

guys

#

now that windsurf people work at google deepmind will google models become much better at coding?

barren prairie
#

Kimi or kiki?

keen beacon
#

depends on the task tbh

#

can you be more specific about what u typically want to use it for? tbh, you should just try out different local ai models yourself and get a feel of which model to use in what scenario

tribal aspen
#

lmarena is so unusable in the direct chat mode when it generates code

blazing bison
#

yes the site is not very optimized for code

#

wait for openai new model

keen beacon
#

you need vision? or like the good at svg / web design type