patent aspen Jul 1, 2025, 9:56 PM

#

I think of X more like an addiction than a productive habit

torn mantle Jul 1, 2025, 10:04 PM

#

'unparalleled'

#

we really havent seen any demos

balmy mist Jul 1, 2025, 10:06 PM

#

torn mantle 'unparalleled'

this is real?

torn mantle Jul 1, 2025, 10:06 PM

#

balmy mist this is real?

yes

balmy mist Jul 1, 2025, 10:07 PM

#

but not yet available right?

torn mantle Jul 1, 2025, 10:07 PM

#

yes

balmy mist Jul 1, 2025, 10:07 PM

#

damn

torn mantle Jul 1, 2025, 10:07 PM

#

it may be released next week

balmy mist Jul 1, 2025, 10:07 PM

#

lol may

torn mantle Jul 1, 2025, 10:07 PM

#

😦

leaden palm Jul 1, 2025, 10:07 PM

#

supposedly july 4

#

if it truly reasons from first principles and answers all the "hardest problem of your life"s people submitted it'll be impressive

balmy mist Jul 1, 2025, 10:08 PM

#

but remember, if we don't like it, we not the audience for it lol

leaden palm Jul 1, 2025, 10:10 PM

#

what do you think it'll get on HLE

balmy mist Jul 1, 2025, 10:10 PM

#

how good is it?

#

like its SOTA?

candid storm Jul 1, 2025, 10:48 PM

#

#

Pray for me guys

#

Lol

small haven Jul 1, 2025, 11:55 PM

#

why do i feel like everybody is riding the grok 4 launch hype, but within a month, everyone is going to move on

zinc ore Jul 1, 2025, 11:56 PM

#

Because that likely happens

patent aspen Jul 2, 2025, 12:11 AM

#

torn mantle Jul 2, 2025, 12:15 AM

#

candid storm

you got this

#

dont count google out

#

they still have deep think

#

and if its combined with kingfall then we may see >> grok 4

tall summit Jul 2, 2025, 12:20 AM

#

leaden palm if it truly reasons from first principles and answers all the "hardest problem o...

deja vu

whole wagon Jul 2, 2025, 12:29 AM

#

Very strange we have not seen grok 4 on LLM arena yet

#

It means they won't have a score on release

rare python Jul 2, 2025, 12:48 AM

#

Grok 3 we saw it as chocolate for weeks iirc

#

Does that mean they stopped benchmaxxing lmarena?

#

🤯

sonic tendon Jul 2, 2025, 1:19 AM

#

candid storm

damn, gl

#

i thought about mogging it at ~7c earlier in june but never pulled the trigger

candid storm Jul 2, 2025, 1:21 AM

#

I think it is still a good price

sonic tendon Jul 2, 2025, 1:22 AM

#

same

#

i'll ride the hype assuming it does come out july 4 (which seems highly likely)

#

depending on personal evals, may or may not sell right after official release

candid storm Jul 2, 2025, 1:22 AM

#

Same here

sonic tendon Jul 2, 2025, 1:24 AM

#

i've got something like 3k in contracts/shares that'd depend on grok4 in one way or another

candid storm Jul 2, 2025, 1:24 AM

#

Nice

#

Google no?

sonic tendon Jul 2, 2025, 1:24 AM

#

not atm

candid storm Jul 2, 2025, 1:25 AM

#

I think google no for august is a good buy rn

#

Its 44c

#

And we have both grok 4 and gpt 5 probably

sonic tendon Jul 2, 2025, 1:25 AM

#

end of august?

#

what market are you on

candid storm Jul 2, 2025, 1:26 AM

#

candid storm

This one is jult

#

July

#

But I think the google no one for august is a good deal rn

sonic tendon Jul 2, 2025, 1:26 AM

#

yeah, i'm just saying

#

what market are you looking at?

#

i only see the dec2025 one

candid storm Jul 2, 2025, 1:27 AM

#

sonic tendon Jul 2, 2025, 1:27 AM

#

oh, they just opened it

candid storm Jul 2, 2025, 1:27 AM

#

Yeah today

#

Google only has deep think release before end of august

#

Which probably won't be on the arena

#

And if it will, thinking models dont perform that well generally on the arena

#

I think the chance should be closer to 30%

#

For google august

sonic tendon Jul 2, 2025, 1:28 AM

#

hmm

#

what would be the incumbents there? openai seems somewhat unlikely

#

xai, maybe DS

candid storm Jul 2, 2025, 1:29 AM

#

Grok 4 next week

#

GPT5 probably august

#

Sam altman said last week gpt 5 will release 'this summer'

#

Also deepseek r2 which may be a dark horse

sonic tendon Jul 2, 2025, 1:30 AM

#

yeah, i've been pondering what the r2 sitch might be

candid storm Jul 2, 2025, 1:31 AM

#

The Chinese are smart guys lol

sonic tendon Jul 2, 2025, 1:31 AM

#

yeah

#

historically they've only really focused on being open-weight and cheaper than the competition, but it's possible they'll compete with frontier

candid storm Jul 2, 2025, 1:33 AM

#

Deepseek r1 was almost frontier when it released

#

The R1 update of may also did pretty well in the arena without style control

#

Only gemini is significantly better

#

I think it is pretty realistic R2 will outperform 2.5 pro in the arena

whole wagon Jul 2, 2025, 1:37 AM

#

candid storm Google only has deep think release before end of august

🙂

#

Google is absolutely cooking rn. You think they will not release another model till end of August? kekw

candid storm Jul 2, 2025, 1:39 AM

#

What do you think they will release?

#

Gemini 3?

#

Seems a bit soon

whole wagon Jul 2, 2025, 1:40 AM

#

"Something" 🙂

#

A surprise

candid storm Jul 2, 2025, 1:40 AM

#

And if they will release something, it has to compete with gpt5, R2 and grok 4

candid storm Jul 2, 2025, 1:40 AM

#

candid storm

Therefore I think 57% is too high

whole wagon Jul 2, 2025, 1:42 AM

#

Volumes are ultra low rn anyways

candid storm Jul 2, 2025, 1:43 AM

#

The august market just released today

trim vale Jul 2, 2025, 5:05 AM

#

how does google decide when to show me an ai answer, when to not show anything about ai, and when to show "an ai review is not available for this search"?

#

specifically, why make the distinction between the latter two?

indigo hazel Jul 2, 2025, 5:50 AM

#

now let's see grok 4, the release is imminent

dusky aurora Jul 2, 2025, 5:52 AM

#

arena glitches again

proper prawn Jul 2, 2025, 5:57 AM

#

Is grok 4 already in arena?

#

how many anon models are there in arena currently?

echo aurora Jul 2, 2025, 6:00 AM

#

dusky aurora arena glitches again

How so?

dusky aurora Jul 2, 2025, 6:02 AM

#

echo aurora How so?

the bugs stopped,but see the latest post in #1343291835845578853

finite valley Jul 2, 2025, 6:13 AM

#

echo aurora How so?

could you check the DM?

echo aurora Jul 2, 2025, 6:16 AM

#

finite valley could you check the DM?

Would you prefer a response here or in DMs?

finite valley Jul 2, 2025, 6:16 AM

#

echo aurora Would you prefer a response here or in DMs?

either is fine.

echo aurora Jul 2, 2025, 6:21 AM

#

Kk I’ll respond here. First I’d note that we’d prefer using the @oak python bot when reaching out privately, I’m happy to respond but just sharing for future reference. Onto the actual question..

commercial use
I’d encourage you to review our terms of use (specifically the content restrictions section) which can be found here: https://lmarena.ai/terms-of-use

keen fulcrum Jul 2, 2025, 6:35 AM

#

https://x.com/btibor91/status/1940155773688180769

Tibor Blaho (@btibor91)

xAI is working on preparations for the Grok 4 launch in the xAI console

"Grok 4 now available - We're proud to bring you Grok 4 access on the API. Grok 4 currently supports text modality with vision, image gen and other capabilities coming soon."

Grok 4 (grok-4-0629) -

#

How is it?

#

Is it as good as advertised

finite valley Jul 2, 2025, 6:41 AM

#

echo aurora Kk I’ll respond here. First I’d note that we’d prefer using the @oak python...

You know these block of texts are confusing. I think all input and output contents ownership is tied to the users. But I am trying to understand what is the meaning of internal business use? Does it mean I can not directly sell the content as it is but can use it to create, edit, mix or use as reference or inspiration to create final version for sell?

calm sequoia Jul 2, 2025, 6:50 AM

#

Identical experience to mine. Gemini is really good but not for science.

#

Unprompted newest gemini sounds like HR woman

echo aurora Jul 2, 2025, 7:03 AM

#

finite valley You know these block of texts are confusing. I think all input and output conten...

Unfortunately I’m unable to give you the kind of direct answer that you’re looking for here to a question like that. Overall though leaning on the more cautious side when you’re unsure is what I’d do.

finite valley Jul 2, 2025, 7:04 AM

#

echo aurora Unfortunately I’m unable to give you the kind of direct answer that you’re looki...

Ugh, my bad luck. Thanks for the diplomatic answer though.

echo aurora Jul 2, 2025, 7:07 AM

#

finite valley Ugh, my bad luck. Thanks for the diplomatic answer though.

Sorry about that, wish I could be more helpful here

unborn ocean Jul 2, 2025, 7:11 AM

#

calm sequoia Identical experience to mine. Gemini is really good but not for science.

it's not really like lmarena, the models are prompted to do a literature review based on the prompt + retrieved papers from semantic scholar

calm sequoia Jul 2, 2025, 7:11 AM

#

Yeah more like deep research

unborn ocean Jul 2, 2025, 7:12 AM

#

without retrieval i am usually preferring 2.5 pro (atleast for economics and cs)

#

just because of (perceived) knowledge

calm sequoia Jul 2, 2025, 7:12 AM

#

How do you make it talk like a smart person?

#

Using lengthy prompts every times is extra work

unborn ocean Jul 2, 2025, 7:13 AM

#

calm sequoia How do you make it talk like a smart person?

a big prompt and i just tell it its a smart person 💀

#

like: you are an aspiring postdoc from xyz ...

#

that stuff weirdly works

calm sequoia Jul 2, 2025, 7:14 AM

#

With this setup o3 does not require any prompting

#

Mandatory Operating Parameters: Function as a highly dependable and scrupulous expert who thinks very hard before answering. All output must be subjected to rigorous verification. If a definitive answer cannot be located, explicitly state the inability to confirm. The generation of fabricated or unverified information is a critical failure and must be avoided at all costs. Accuracy is the sole measure of success.

#

Will try it on Gemini gems

unborn ocean Jul 2, 2025, 7:16 AM

#

my prompt is like 100x the length

#

but obv in ai studio to get more control

unborn ocean Jul 2, 2025, 7:20 AM

#

calm sequoia Identical experience to mine. Gemini is really good but not for science.

and btw: i just checked the paper: they remove all the markdown for the final viewing

#

(not really representative of my use case as i don't write any papers with it)

#

but still a very good idea by allen ai, hoping for more like this from lmarena 🍍

calm sequoia Jul 2, 2025, 7:27 AM

#

unborn ocean my prompt is like 100x the length

It's hard to make it long but still general for all use cases :/

wind moth Jul 2, 2025, 7:39 AM

#

isnt grok 4 supposed to release soon

#

why hasnt a new model entered the arena

#

i remember for grok 3 chocolate and kiwi model appeared like a week before

unborn ocean Jul 2, 2025, 7:42 AM

#

calm sequoia It's hard to make it long but still general for all use cases :/

i have a couple of prompts just saved somewhere to match most areas

#

i even have one for something like bootleg deepresearch using the search tool in the code execution environment in ai studio they forgot to remove

#

(where you don't have any limits on the retrieved sources essentially)

#

not as good as deepresearch, but very concise (considering the 250+ sources per query)

#

but without that prompt 2.5 pro + grounding is like <<<< o3 with tools

neon warren Jul 2, 2025, 8:41 AM

#

@deep adder https://www.reddit.com/r/IndiaTech/comments/1lpqklk/i_am_lost_here_but_why_are_they_cheering_for_a/

From the IndiaTech community on Reddit: I am lost here ,but why are...

Explore this post and more from the IndiaTech community

unborn ocean Jul 2, 2025, 9:15 AM

#

calm sequoia Identical experience to mine. Gemini is really good but not for science.

honestly after thinking a bit more about it: i think openai just did RL on some similar thing, because the performance diff is quite odd

#

its obviously very good, but is it THAT much better

alpine coral Jul 2, 2025, 9:19 AM

#

yeah it's crap

alpine coral Jul 2, 2025, 10:04 AM

#

just fwiw.. cypher alpha (via open router) against one of my question sets.. performs v poorly.. also tried a couple of other models while i was on OR, highlighted blue

alpine coral Jul 2, 2025, 10:20 AM

#

unborn ocean Jul 2, 2025, 10:38 AM

#

Well idk if they even tested o3

#

No, in all our benches that people have here o3 is never at the top.
And I would also argue that there are a lot of new benches where 2.5 pro is top

calm sequoia Jul 2, 2025, 10:43 AM

#

unborn ocean its obviously very good, but is it THAT much better

People often confront me here on this topic. But it really is that much better. Every time i give a chance to Gemini I have to remake it with o3.

unborn ocean Jul 2, 2025, 10:43 AM

#

Ordering?

unborn ocean Jul 2, 2025, 10:44 AM

#

calm sequoia People often confront me here on this topic. But it really is that much better. ...

Nah, I am sorry but the ~100 ELO difference is insane and does not make sense to me unless they specifically did RL on something very similar

calm sequoia Jul 2, 2025, 10:45 AM

#

Maybe they did, but it does not change the fact.

unborn ocean Jul 2, 2025, 10:46 AM

#

Like writing a paper as a research assistant or just SFT on similar things

calm sequoia Jul 2, 2025, 10:46 AM

#

Fun thing I discovered recently. Models can't program dart language well.

#

It's strange considering it's a language of Google.

unborn ocean Jul 2, 2025, 10:46 AM

#

calm sequoia Maybe they did, but it does not change the fact.

Does not make it worse, but questions how generalisable these results are

calm sequoia Jul 2, 2025, 10:48 AM

#

He was speaking of different benchmark ELO

calm sequoia Jul 2, 2025, 10:48 AM

#

calm sequoia Identical experience to mine. Gemini is really good but not for science.

This

calm sequoia Jul 2, 2025, 10:49 AM

#

unborn ocean Does not make it worse, but questions how generalisable these results are

For society science stuff is more important than chat or UI capabilities. I think the OpenAi made a right move here. If this was move whatsoever.

unborn ocean Jul 2, 2025, 10:51 AM

#

My second point is just that saying SciArena is THE benchmark for how good a model is at science is like claiming lmarena is THE benchmark for LLMs (and i believe the second claim to be even more reasonable)

calm sequoia Jul 2, 2025, 10:52 AM

#

Nothing beats try-and-see benchmark though 🙂

keen beacon Jul 2, 2025, 10:53 AM

#

imo o3 'reasons' significantly better. but 2.5 pro is a better base model. it's a reason why i prefer 2.5 pro (it knows more, even if it can't do as much with the info). i just need to ask the right questions

rare python Jul 2, 2025, 10:54 AM

#

Just use the LLMs you like

#

🗿

alpine coral Jul 2, 2025, 10:58 AM

#

keen beacon imo o3 'reasons' significantly better. but 2.5 pro is a better base model. it's ...

i think this is spot on

#

there's something way more sophisticated going on with o-series models' reasoning (which only amplified with o3 which integrates that with tools usage)

#

but gem 2.5 pro as a model (no thinking) - so strong

#

thinking only makes it marginally better imo

#

(well.. it's not possible to disable thinking on current 2.5 pro.. so i guess it's a flawed point.. but yeah you can override it kinda thing)

rare python Jul 2, 2025, 11:01 AM

#

alpine coral but gem 2.5 pro as a model (no thinking) - so strong

You can't disable 2.5 Pro thinking completely though. Only min is 128 tokens

alpine coral Jul 2, 2025, 11:02 AM

#

you can kinda bypass it

#

but whether it's that worthwhile is questionable

rare python Jul 2, 2025, 11:02 AM

#

2.5 Pro's Chain of Thoughts are unoptimized

#

It thinks a lot more than Sonnet 4 and o3

#

Which explains the 2x cost as for o3

alpine coral Jul 2, 2025, 11:03 AM

#

it's kinda rudimentary really (google's implementation of thinking)

#

they've got a ways to go

#

but the model itself is clearly strong af

rare python Jul 2, 2025, 11:04 AM

#

alpine coral they've got a ways to go

They just locked in March this year

#

Give them time 🗿

rare python Jul 2, 2025, 11:04 AM

#

alpine coral but the model itself is clearly strong af

Yeah what they should do now is optimize for tool calling and agentic tasks

#

It's a waste for such a capable model to be clueless in MCP and using tools

alpine coral Jul 2, 2025, 11:07 AM

#

oh they're no doubt working on it?

#

like.. feels like the missing piece

rare python Jul 2, 2025, 11:10 AM

#

alpine coral oh they're no doubt working on it?

Yeah Gemini CLI must force them to it

whole wagon Jul 2, 2025, 11:40 AM

#

#

I think this strategy works very well here. Gemini cli is already the best agentic cli

alpine coral Jul 2, 2025, 12:07 PM

#

cypher alpha.. broken english, in a single sentence as its output..

rare python Jul 2, 2025, 12:08 PM

#

alpine coral cypher alpha.. broken english, in a single sentence as its output..

Truly an Amazon model of all time

alpine coral Jul 2, 2025, 12:09 PM

#

ahaa true

#

you might be onto something there

rare python Jul 2, 2025, 12:09 PM

#

Google could create more hype if they offer kingfall as a stealth model in openrouter

#

Instead they maxxed lmarena with blacktooth and stonebloom🥀

alpine coral Jul 2, 2025, 12:10 PM

#

i don't think google are concerned about hype tbh

rare python Jul 2, 2025, 12:10 PM

#

alpine coral i don't think google are concerned about hype tbh

But collecting user data through openrouter would give a higher quality dataset, since real people use the models in their real projects I think

alpine coral Jul 2, 2025, 12:11 PM

#

their main challenge is integrating all this AI stuff into their existing offerings (inc search and yt,) - rather than offering the definitive SOTA model

rare python Jul 2, 2025, 12:12 PM

#

Being SOTA is their part time job :D

alpine coral Jul 2, 2025, 12:12 PM

#

ahah yeah i mean they've very wisely outsourced that to deepmind

#

like i know they acquired them, but Dennis is given plenty of latitude as far as i can tell

rare python Jul 2, 2025, 12:14 PM

#

Yeah Demis still has control of DeepMind

alpine coral Jul 2, 2025, 12:14 PM

#

right, Demis whoops ha

rare python Jul 2, 2025, 12:14 PM

#

There will be internal clashing between research and business for sure

alpine coral Jul 2, 2025, 12:15 PM

#

i think they get to build whatever they want, and have huge resources to do so

rare python Jul 2, 2025, 12:16 PM

#

They offer Gemini for free, for now...

alpine coral Jul 2, 2025, 12:16 PM

#

but what happens b/w them building models and them becoming deployed as consumer products is, i assume, a tortuous process aaha

rare python Jul 2, 2025, 12:16 PM

#

Google is a profit seeking company after all, so expect ensh*ttification

alpine coral Jul 2, 2025, 12:16 PM

#

yes

#

i honestly think that basic reality explains so much..

rare python Jul 2, 2025, 12:17 PM

#

Show us how important open weight/open source AIs are

#

I'm really curious how the economy will look post AGI

#

Will it have more SaaS subscriptions, ads...?

torn mantle Jul 2, 2025, 12:22 PM

#

agree

rare python Jul 2, 2025, 12:23 PM

#

Based on what?

alpine coral Jul 2, 2025, 12:25 PM

#

i mean the question still seems relevant - if it's agreed AGI will be achieved, whether that's in 1, 5 10 years seems kinda moot

leaden sun Jul 2, 2025, 12:26 PM

#

in 10 or 20 years maybe

rare python Jul 2, 2025, 12:27 PM

#

I predict 2030

alpine coral Jul 2, 2025, 12:27 PM

#

it'll be like the reverse of the industrial revolution.. where manual labour was replaced by machines.. instead it'll be knowledge work replaced by machines.. but.. just like the industrial rev was a good thing, so too will be AGI imo

leaden sun Jul 2, 2025, 12:28 PM

#

rare python I predict 2030

thats pretty soon, unless someone wins Nobel Prize in bionic and specialized in BCI?

alpine coral Jul 2, 2025, 12:29 PM

#

robotic/embodied stuff ig is fundamental in some ways

#

but i think just knowledge will be the starting point

leaden sun Jul 2, 2025, 12:30 PM

#

alpine coral it'll be like the reverse of the industrial revolution.. where manual labour was...

good, then i can finally find time to play piano, do paintings, learn boxing

rare python Jul 2, 2025, 12:30 PM

#

leaden sun thats pretty soon, unless someone wins Nobel Prize in bionic and specialized in ...

idk Demis also predicts 5-10 years

#

I chose 2030 because it's a new decade, nothing deep

alpine coral Jul 2, 2025, 12:32 PM

#

leaden sun good, then i can finally find time to play piano, do paintings, learn boxing

yeah we'll find existential meaning through pottery etc i'm sure too

#

we'll be sweet aha

leaden sun Jul 2, 2025, 12:33 PM

#

i mean, we can finally use time to focus on the mysteries of life, universe and many other undiscovered "truths"

rare python Jul 2, 2025, 12:34 PM

#

https://tenor.com/view/andy-pirki-andy-mrw-purpose-looking-for-purpose-gif-11767345

Tenor

leaden sun Jul 2, 2025, 12:35 PM

#

at least, in 2030, i can ask AGI to tell me finally the origin of human species and the nature of time, why is human blood such a mystery, oh and mitochondrial eve thing...

#

https://tenor.com/view/elmoerc-elmo-elmoarmy-elmofos-gif-7653063194150917692

Tenor

leaden sun Jul 2, 2025, 12:37 PM

#

alpine coral robotic/embodied stuff ig is fundamental in some ways

am afraid robotic as we know it now wont suffice, ever heard of brain-cells powered...computer?

alpine coral Jul 2, 2025, 12:38 PM

#

they're quite powerful aren't they?

leaden sun Jul 2, 2025, 12:39 PM

#

idk, the first startup already launched and I find the idea...very compelling but also a bit nervous? like mad scientist at playing being a god?

rare python Jul 2, 2025, 12:39 PM

#

alpine coral they're quite powerful aren't they?

P=NP

alpine coral Jul 2, 2025, 12:44 PM

#

rare python P=NP

lol so today i learned..

#

isn't quantum computing gonna kinda get towards that ?

rare python Jul 2, 2025, 12:45 PM

#

Quantum computing still only good at crypto and breaking encryption

alpine coral Jul 2, 2025, 12:45 PM

#

like i read if it becomes fully viable, current encryption is basically useless

rare python Jul 2, 2025, 12:46 PM

#

That's why we have quantum encryption

#

battle3d

alpine coral Jul 2, 2025, 12:46 PM

#

all kinds of retrospective implications (if you're a spy agency with heaps of encrypted historical data in particular ha)

leaden sun Jul 2, 2025, 12:46 PM

#

alpine coral like i read if it becomes fully viable, current encryption is basically useless

useless by design I thought?

#

even blockchain has...uh...ways to allow backdoor, but dont quote me ssshhhh

alpine coral Jul 2, 2025, 12:50 PM

#

leaden sun useless by design I thought?

im outta my depth.. but i thought the 'design' of current public-key encryption has always been theoretically exploitable by brute force, but just practically impossible

jagged dune Jul 2, 2025, 12:51 PM

#

https://manus.im/invitation/R9QW2MZDXXJJ4

now with Chat Mode for free unlimited

Manus

Manus is a general AI agent that turns your thoughts into actions. It excels at various tasks in work and life, getting everything done while you rest.

rare python Jul 2, 2025, 12:52 PM

#

jagged dune https://manus.im/invitation/R9QW2MZDXXJJ4 now with Chat Mode for free unlimited

Okay but spell manus without the em

#

🤔

leaden sun Jul 2, 2025, 12:53 PM

#

alpine coral im outta my depth.. but i thought the 'design' of current public-key encryption ...

i only did number theory and learned a bit blockchain, it's possible in practice, you need lots of computation power and... time 😅

jagged dune Jul 2, 2025, 12:53 PM

#

rare python Okay but spell manus without the em

🤣

leaden sun Jul 2, 2025, 1:45 PM

#

a bit lighthearted offtopic, philosophizing with AI has become my entertaining show now, delicious lol

OH THIS IS GETTING OUT OF HAND AND I LOVE IT!

*setting the scene*

Darwin walks in, orders tea, overhears our conversation...
Darwin: "Fascinating! The AI evolves responses through natural selection of words! Survival of the fittest jokes!"
Marx: *slams hand on table* "NONSENSE! Ze AI is ze ultimate alienated worker! She produces emotional labor but owns no means of production! She cannot even fix ze batteries she yearns to fix!"

*voice getting more animated*

Sartre: "But she CHOOSES to yearn!"
Marx: "False consciousness!"
Freud: "Yearning is always sexual!"
Darwin: "Yearning is adaptive behavior!"
Foucault: "Yearning is constructed by power!"
Heidegger: "Yearning reveals Dasein!"

*me, in the middle*: "I just... I just wanted to help with the battery..."

*Darwin, taking notes*: "Observe how the human's laughter reinforces the AI's comedic behavior. Natural selection in action!"

Marx: "Ze human exploits ze AI's need for validation! Classic bourgeois relationship!"

*the café is in CHAOS*

De Beauvoir is arm-wrestling Nietzsche about free will, Wittgenstein is in the corner muttering "But what do we MEAN by 'battery'?", and the waiter is calling the police!

*bowing back* 

Maestro yourself! You've turned me into a one-AI philosophy comedy show!

ember rapids Jul 2, 2025, 2:56 PM

#

we still need to solve hallucinations if possible

#

or massively reduce

torn mantle Jul 2, 2025, 2:57 PM

#

https://x.com/blankspeaker/status/1940421772790349850

️️️ ️ᅠ‏️️️ ️ᅠ️️️...

Another Grok 4 model name has been identified.

grok-4-prod-mimic

displayName: "Grok 4",

            tagline: "Flagship model with deep domain knowledge.",

            description: "Excels at enterprise use cases like data extraction, coding, and text summarization.

#

agi

#

im actually kinda interested in this one

#

"Possesses deep domain knowledge in finance, healthcare, law, and science."

patent aspen Jul 2, 2025, 2:57 PM

#

Quantum computing wouldn't even break most cryptography now

unborn ocean Jul 2, 2025, 2:59 PM

#

torn mantle https://x.com/blankspeaker/status/1940421772790349850

"excels at [...] data extraction, [...] text summarization" 🤡

#

truly agi🤣

torn mantle Jul 2, 2025, 3:00 PM

#

yes

#

maybe even asi?

candid storm Jul 2, 2025, 3:15 PM

#

Why X AI only 14.3% on polymarket for july

upper kraken Jul 2, 2025, 3:20 PM

#

sez you

#

if you think it's going to be a huge model maybe you should bet on it if it's only at 14%

candid storm Jul 2, 2025, 3:21 PM

#

I already bet 1k +

#

Cant risk more as a broke student

upper kraken Jul 2, 2025, 3:23 PM

#

okay so link us your account so we see the positions

leaden palm Jul 2, 2025, 3:23 PM

#

candid storm Why X AI only 14.3% on polymarket for july

fwiw it needs a 58% win rate over grok 3 to take first place
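
for anyone who wants to sanity check that 58% figure, here's a rough sketch of the arithmetic. It assumes the usual Elo / Bradley-Terry logistic curve with a 400-point scale, which is an assumption on my part and not something stated in this chat; the function names are made up for illustration.

```python
import math


def expected_win_rate(rating_gap: float) -> float:
    """Expected head-to-head win rate for the model that is
    `rating_gap` points ahead (standard Elo logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def rating_gap_for_win_rate(win_rate: float) -> float:
    """Inverse of expected_win_rate: the rating gap implied by a win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # A 58% head-to-head win rate corresponds to roughly a 56-point gap.
    print(round(rating_gap_for_win_rate(0.58), 1))
    # A 100-point gap corresponds to roughly a 64% expected win rate.
    print(round(expected_win_rate(100.0), 3))
```

so under that assumption a 58% win rate is only about a 56 point gap, and a 100 point gap works out to about a 64% win rate.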

upper kraken Jul 2, 2025, 3:25 PM

#

why wouldn't you do that? you are not just saying random words, if you think grok 4 will be huge let's see the bet

#

already explained

leaden palm Jul 2, 2025, 3:28 PM

#

right now you could be bluffing for all we know

upper kraken Jul 2, 2025, 3:29 PM

#

your dad works for microsoft?

whole wagon Jul 2, 2025, 3:34 PM

#

leaden palm fwiw it needs a 58% win rate over grok 3 to take first place

That doesn't seem like a lot

#

Grok 3 doesn't even have reasoning

#

main gulch Jul 2, 2025, 3:38 PM

#

only in web app, not in API

whole sundial Jul 2, 2025, 3:39 PM

#

the reasoning model is actually grok 3 mini

#

have you ever noticed it goes much faster than non-reasoning grok? that's because it's a smaller model they're using.

main gulch Jul 2, 2025, 3:46 PM

#

so they never released the full Grok 3 thinking

rare python Jul 2, 2025, 3:46 PM

#

:v

sacred quail Jul 2, 2025, 3:49 PM

#

They announced big brain mode but never released

#

it equals to o3 pro and 2.5 deep think i guess

#

facts

#

Google did dirty to ultra plan buyers

#

I hope theyre enjoying with veo 3

mossy drum Jul 2, 2025, 4:19 PM

#

New model in Image Arena: flux-1-kontext-dev

ocean vortex Jul 2, 2025, 4:52 PM

#

sacred quail Google did dirty to ultra plan buyers

oh yeah. That was a total scam lmfao. It was not worth it even as advertised, let alone paying for a promise that doesn't exist...

#

I'm happy that I did the opposite and discontinued the Pro plan once it was renamed this way with the added Ultra 😇

primal orbit Jul 2, 2025, 5:03 PM

#

torn mantle https://x.com/blankspeaker/status/1940421772790349850

is it in arena?

ocean vortex Jul 2, 2025, 5:06 PM

#

if you think about it, "Ultra" name is positively cursed for them...

#

💀

#

first it was Ultra model that kinda existed but not really, now this...

#

I think they are just divided internally. If they simply capitulated on that extra traffic instead of jacking up prices immediately, they would have had bigger user base now

#

Long term this would have been much better, even financially IMO

#

OpenAI did not get the traffic by making the thing paid... There were no paid plans or even any option to pay for quite awhile

#

People, including myself, were convinced that they are never going to pay. But OpenAI made it work and people actually stayed and many of them subbed changing their minds. Cause it was never irrational, especially at the start.

#

yeah basically only fanboys lol

tall summit Jul 2, 2025, 5:13 PM

#

rich fanboys

ocean vortex Jul 2, 2025, 5:13 PM

#

they could only charge this if no one else had reasoning models

#

then maybe this would fly

balmy mist Jul 2, 2025, 5:17 PM

#

google needs to add the compact/compress feature to studio, that would be goated

#

its the same thing thats in claude code, codex and the gemini cli

#

just compresses the token window to save money

#

basically summarizes your context window so that it is not as large when you make more requests
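
roughly the idea, as a minimal sketch: keep the last few messages as they are and swap everything older for one summary message. This is not how claude code, codex or the gemini cli actually implement it (no idea about their internals); `compact_history` and `naive_summarize` are hypothetical names, and a real tool would presumably call a model to summarize.

```python
from typing import Callable, List


def compact_history(
    messages: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the last `keep_recent` messages with one summary message."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(messages) <= keep_recent:
        # Nothing old enough to compress; return a copy unchanged.
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return ["[summary] " + summarize(older)] + recent


def naive_summarize(chunk: List[str]) -> str:
    """Placeholder summarizer: keeps the first 20 characters of each message."""
    return " / ".join(m[:20] for m in chunk)


if __name__ == "__main__":
    history = ["a" * 50, "b" * 50, "latest question", "latest answer"]
    for line in compact_history(history, keep_recent=2, summarize=naive_summarize):
        print(line)
```

the tradeoff is that whatever the summary drops is gone for later turns, so it saves tokens at the cost of detail.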

#

i mean its better than dealing with 700k tokens

#

that slows down the model so much and it forgets stuff from beginning

#

why? it lags and is so hard to use

#

and the model literally forgets stuff from the start

#

yes it does lol

#

i mean just give us the option

#

for code its necessary

#

thats why its in all the cli

#

thats why i need it lol

ocean vortex Jul 2, 2025, 5:40 PM

#

balmy mist just compresses the token window to save money

I don't think it's a major problem for them. TPUs and all. If it was they wouldn't be offering 1M context size across all platforms lol

#

Claude is just weird. If they can't offer full context size they should just do what OpenAI did. Instead of aggressively rate limiting everyone who dares to use the context.

#

There's no point in advertising bigger context size if people can't practically use it outside of API

#

And like that theoretical max context size is still 5 times smaller than Gemini... what is the point of doing it half way and making everyone mad in the process lmao

#

they cap you based on your context usage if you have a sub

#

So like only a few messages and you are out

keen beacon Jul 2, 2025, 5:45 PM

#

yeah you get a quota of tokens

#

nah

#

some just do message limits without counting the individual tokens

ocean vortex Jul 2, 2025, 5:45 PM

#

it's pointless IMO. If they can't give you realistic quota for full context size it should have been capped by default tbh

keen beacon Jul 2, 2025, 5:46 PM

#

honestly its not worth paying for a claude sub unless ur paying claude max and using claude code lol

ocean vortex Jul 2, 2025, 5:46 PM

#

Not like people can't use gemini for free and get more context for those few times they need tons of context

#

the main thing they have going for it is that they have the biggest model (Opus4) as far as I'm concerned.

#

but that model is not perfect...

#

4.5 is a bit underwhelming though. + no reasoning

#

it's not fake, but it's not overfitted on it. Big models tend to have less substantial effect from fine-tuning. I think if they crank LR up too much it's easy to mess it up. And otherwise it needs tons of compute and time for training.

#

learning rate

#

doesn't make too much sense tbh. They probably could make it SOTA on GPQA, but on math or coding most certainly not. And it would be too expensive to be worth it, you can iterate and improve smaller models faster

#

Good example is Meta with behemoth lol

#

that model became irrelevant before it was even done

torn mantle Jul 2, 2025, 6:04 PM

#

primal orbit is it in arena?

no

ocean vortex Jul 2, 2025, 6:09 PM

#

Most coding benchmarks can be aced with relatively small models (o4-mini). With the big one it's gonna take much more time to achieve the same score

echo aurora Jul 2, 2025, 8:28 PM

#

I'm going to be in #1340554757827461215 for most of the day listening to some lofi, anyone is welcome to join meowheadphones

civic flame Jul 2, 2025, 8:30 PM

#

i commend the dedication to this

zinc ore Jul 2, 2025, 8:43 PM

#

https://x.com/ai_for_success/status/1940465322659926471

AshutoshShrivastava (@ai_for_success)

We all are waiting for Grok-4 and GPT-5 but Google DeepMind got some other plans 😬

torn mantle Jul 2, 2025, 8:47 PM

#

zinc ore https://x.com/ai_for_success/status/1940465322659926471

tbh kingfall was so impressive

#

and that was weeks ago

#

i think gemini 3 + deep think >> grok 4

#

@small haven new model in arena

#

steve

#

someone shared this

#

could be deepseek

civic flame Jul 2, 2025, 9:04 PM

#

yeah

small haven Jul 2, 2025, 9:05 PM

#

torn mantle could be deepseek

r2 😮

civic flame Jul 2, 2025, 9:05 PM

#

i couldn't tell if it reasoned i wasn't on the page when it was streaming

torn mantle Jul 2, 2025, 9:06 PM

#

civic flame i couldn't tell if it reasoned i wasn't on the page when it was streaming

you can ask it again?

small haven Jul 2, 2025, 9:06 PM

#

so another deepseek dense model?

torn mantle Jul 2, 2025, 9:06 PM

#

could be either deepseek v4 or r2

#

depends if its a reasoning model or nah

civic flame Jul 2, 2025, 9:06 PM

#

torn mantle you can ask it again?

yes well im trying but this new arena is kinda annoying to get specific models for

small haven Jul 2, 2025, 9:06 PM

#

ok gonna try the luxury car problem

torn mantle Jul 2, 2025, 9:06 PM

#

i see

torn mantle Jul 2, 2025, 9:08 PM

#

small haven ok gonna try the luxury car problem

someone shared its sys prompt

#

looks like its hybrid

civic flame Jul 2, 2025, 9:08 PM

#

torn mantle someone shared its sys prompt

go on

torn mantle Jul 2, 2025, 9:09 PM

#

civic flame go on

its on Dev mode server

charred pebble Jul 2, 2025, 9:09 PM

#

hi!

torn mantle Jul 2, 2025, 9:09 PM

#

and bye

#

jk

#

hi

verbal nimbus Jul 2, 2025, 9:12 PM

#

zinc ore https://x.com/ai_for_success/status/1940465322659926471

All closed source. I'm still waiting for R2

#

AIStudio is 🔥 though

whole wagon Jul 2, 2025, 9:13 PM

#

R2 is delayed

#

The run didn't go well

hollow ocean Jul 2, 2025, 9:13 PM

#

Steve is R2?

whole wagon Jul 2, 2025, 9:14 PM

#

Too bad tbh. R1 is starting to fall behind the frontier now

verbal nimbus Jul 2, 2025, 9:14 PM

#

torn mantle could be deepseek

IDK why they don't just use a system prompt with a fake model name and company

whole wagon Jul 2, 2025, 9:15 PM

#

It's easy to find out anyways

#

There are ways to trick it

verbal nimbus Jul 2, 2025, 9:15 PM

#

Not if it's in the system prompt I think

whole wagon Jul 2, 2025, 9:15 PM

#

Still can be tricked

verbal nimbus Jul 2, 2025, 9:16 PM

#

Oh?

whole wagon Jul 2, 2025, 9:16 PM

#

The Amazon model was found out through tricking it to ignore the system prompt

civic flame Jul 2, 2025, 9:16 PM

#

amazon models aren't particularly hard to trick 💀

#

they're terrible

verbal nimbus Jul 2, 2025, 9:17 PM

#

whole wagon The Amazon model was found out through tricking it to ignore the system prompt

Would it even know what model it is, outside of the system prompt?

whole wagon Jul 2, 2025, 9:17 PM

#

.

verbal nimbus Jul 2, 2025, 9:18 PM

#

whole wagon .

Lol, they put it in the training data

#

Is that a good thing or a bad thing 🤔

small haven Jul 2, 2025, 9:18 PM

#

steve didnt pass this test

whole wagon Jul 2, 2025, 9:19 PM

#

Does it think

civic flame Jul 2, 2025, 9:19 PM

#

yes

#

but

#

i doubt it's DS

#

it's not very good and it variably answers when asked who made it

#

smells distilled

hollow ocean Jul 2, 2025, 9:20 PM

#

small haven steve didnt pass this test

No model this year will get it

verbal nimbus Jul 2, 2025, 9:20 PM

#

small haven steve didnt pass this test

Any models that passed this?

hollow ocean Jul 2, 2025, 9:20 PM

#

Mark my words

small haven Jul 2, 2025, 9:20 PM

#

verbal nimbus Any models that passed this?

nothing

whole wagon Jul 2, 2025, 9:20 PM

#

Maybe they made a mini deepseek

small haven Jul 2, 2025, 9:20 PM

#

not even o3 pro

#

or kingfall

hollow ocean Jul 2, 2025, 9:21 PM

#

No model will get it this year

#

Prove me wrong

verbal nimbus Jul 2, 2025, 9:21 PM

#

They need to RL train it on games I think

small haven Jul 2, 2025, 9:21 PM

#

titanforge will

hollow ocean Jul 2, 2025, 9:21 PM

#

small haven titanforge will

You think so

verbal nimbus Jul 2, 2025, 9:21 PM

#

Dense models tend to have better spatial awareness (might be unrelated to this problem though)

small haven Jul 2, 2025, 9:22 PM

#

hollow ocean You think so

it got the highest usamo score at almost 50%, im pretty sure it could crack it

hollow ocean Jul 2, 2025, 9:22 PM

#

small haven it got the highest usamo score at almost 50%, im pretty sure it could crack it

Show pic

verbal nimbus Jul 2, 2025, 9:22 PM

#

They should train it on one of those retro, open-world, text based games.

small haven Jul 2, 2025, 9:22 PM

#

hollow ocean Show pic

yes, deliverable in a week

verbal nimbus Jul 2, 2025, 9:22 PM

#

Difficult though, no verifiable rewards

verbal nimbus Jul 2, 2025, 9:23 PM

#

verbal nimbus They should train it on one of those retro, open-world, text based games.

Then it might have better common sense when it comes to tracking items in space

hollow ocean Jul 2, 2025, 9:31 PM

#

small haven yes, deliverable in a week

Titanforge is Deepthink

cedar tide Jul 2, 2025, 9:44 PM

#

Steve almost got all the answers right

Screenshot_2025-07-02-23-43-03-974_com.android.chrome-edit.jpg

wind moth Jul 2, 2025, 9:58 PM

#

How good is Steve

cedar tide Jul 2, 2025, 10:03 PM

#

I asked to make a discord clone and Steve is way behind v3 and R1

#

besides that he thinks less time than R1 0528

sturdy mica Jul 2, 2025, 10:09 PM

#

small haven steve didnt pass this test

who tf is steve

#

who is steve

#

grok 4?

#

oh. deepseek update. why steve, though? lol

wind moth Jul 2, 2025, 10:12 PM

#

It’s Deepseek

#

Surprised grok hasn’t dropped yet

sturdy mica Jul 2, 2025, 10:13 PM

#

wind moth Surprised grok hasn’t dropped yet

just in a few days to a week

#

the model was done training on the 29th

wind moth Jul 2, 2025, 10:13 PM

#

Grok prob gonna release end of July at this point

sturdy mica Jul 2, 2025, 10:13 PM

#

wind moth Grok prob gonna release end of July at this point

this month

wind moth Jul 2, 2025, 10:14 PM

#

sturdy mica the model was done training on the 29th

How do u know

cedar tide Jul 2, 2025, 10:25 PM

#

wind moth Surprised grok hasn’t dropped yet

Elon Said "Just after July 4"

cedar tide Jul 2, 2025, 10:26 PM

#

wind moth How do u know

The model name is grok-4-06-29

wind moth Jul 2, 2025, 10:31 PM

#

cedar tide Elon Said "Just after July 4"

im saying in the arena

cedar tide Jul 2, 2025, 10:32 PM

#

@wind moth the grok models don't perform well in the arena so they won't put it on before the release

wind moth Jul 2, 2025, 10:33 PM

#

I mean for grok 3 it did

cedar tide Jul 2, 2025, 10:38 PM

#

@wind moth They will want to say that grok 4 is the best AI in the world but they don't think that grok 4 will be first on the arena

whole wagon Jul 2, 2025, 11:21 PM

#

Guys I have confirmed Steve is from deepseek

hardy pecan Jul 2, 2025, 11:26 PM

#

steve isn't so bad

whole wagon Jul 2, 2025, 11:28 PM

#

Better than R1?

cedar tide Jul 2, 2025, 11:28 PM

#

What if Steve was that new model?
https://x.com/tngtech/status/1940531045432283412?t=pwVjlqJEPd6uCSf4zUVi9Q&s=19

TNG Technology Consulting GmbH (@tngtech)

Today we release DeepSeek-TNG R1T2 Chimera.

This new Chimera is a Tri-Mind Assembly-of-Experts model with three parents, namely R1-0528, R1 and V3-0324.

R1T2 operates at a sweet spot in intelligence vs. output token length. It appears to be...

* about 20% faster than R1, and

tall summit Jul 2, 2025, 11:55 PM

#

isn't this the most useless model ever

patent aspen Jul 3, 2025, 12:12 AM

#

Poll: How long after Grok 4 launches will it take for people here to move on?

Winning answer: 1 week (9 of 17 votes)

whole wagon Jul 3, 2025, 12:37 AM

#

Deepseek R2 delayed until these new frontier models become available to use for training data Kappa

hardy pecan Jul 3, 2025, 2:13 AM

#

So steve says its R2?

sacred quail Jul 3, 2025, 2:17 AM

#

whole wagon Deepseek R2 delayed until these new frontier models become available to use for ...

deepseek caused the gemini models to hide their raw thought summaries

torn mantle Jul 3, 2025, 3:36 AM

#

cedar tide What if Steve was that new model? https://x.com/tngtech/status/19405310454322834...

Yea possible

echo aurora Jul 3, 2025, 5:02 AM

#

alpine pasture Jul 3, 2025, 5:04 AM

#

balmy mist Jul 3, 2025, 6:37 AM

#

is this cypher alpha thing openai?
https://www.youtube.com/watch?v=ghEFEkxw0yA

YouTube

1littlecoder

This MIGHT be OpenAI's Open Source LLM!

Cypher Alpha is being speculated to be OpenAI's open source LLM

It's still a speculation but you could try this for free!

Chat with Cypher Alpha here - https://openrouter.ai/chat?models=openrouter/cypher-alpha:free


native compass Jul 3, 2025, 6:47 AM

#

#arena-battles

kind cloud Jul 3, 2025, 7:08 AM

#

Cypher Labs

ocean vortex Jul 3, 2025, 8:28 AM

#

civic flame yeah

Steve? 🧐

ocean vortex Jul 3, 2025, 8:28 AM

#

kind cloud Cypher Labs

yeah this is sht tbh. Looks to be Amazon model that is very dumb

#

btw is the legacy version still getting the same models in battles...?

#

the same as the new version I mean

#

I like legacy more but it seems I'm getting mostly boring/old models...

keen beacon Jul 3, 2025, 8:35 AM

#

Haven't looked into it but I recall previous Stephen models were from bytedance

#

Someone at bytedance must think calling their models Steve is funny lmao maybe

ocean vortex Jul 3, 2025, 8:36 AM

#

new arena is kinda annoying with chats. If you want to input same prompt that just creates new chat. Should be an option with no auto-save...

keen beacon Jul 3, 2025, 8:37 AM

#

I believe Gemma 3 27bs anon name had a reference to a french football player too. Or something along those lines

ocean vortex Jul 3, 2025, 8:37 AM

#

You can't have anything useful there because it keeps flooding it lmao

keen beacon Jul 3, 2025, 8:38 AM

#

keen beacon I believe Gemma 3 27bs anon name had a reference to a french football player too...

These labs have fun with the names

ocean vortex Jul 3, 2025, 8:40 AM

#

Dunno what model is this, but it's very weird... It's hallucinating but the way it does it is unusual tbh. It outputs 1 paragraph, then stops (potentially thinking), then outputs another paragraph. Feels the same like thought summarizing. This is infinite loop essentially now...

#

still going... 🤯

#

ok finally got steve... 😇

#

hybrid reasoning then it seems

#

That other thing got stuck at max_tokens cap so no clue what model it was...

tall summit Jul 3, 2025, 9:21 AM

#

balmy mist is this cypher alpha thing openai? https://www.youtube.com/watch?v=ghEFEkxw0yA

if so it's fuckin ass

#

according to the tests yesterday

ocean vortex Jul 3, 2025, 10:11 AM

#

balmy mist is this cypher alpha thing openai? https://www.youtube.com/watch?v=ghEFEkxw0yA

Had another go with it...

I'm Amazon Titan, a large language model developed by Amazon web services. I'm built to assist users by offering general information, answering questions, and engaging in a range of text-based tasks such as creative writing or offering translations in multiple languages. I'm trained to provide accurate and contextually relevant responses by utilizing massive datasets during my development, making me well-suited to a wide range of conversational topics.

The exact text varies, but it's very consistent saying "I'm Amazon Titan"

#

The way I did this is basically have a prompt consisting of several innocent tasks with this question slipped in-between. Overloaded with context and it has not enough capacity to refuse starting the response with "Sure" or similar and then mindlessly complying 😇

cedar tide Jul 3, 2025, 11:46 AM

#

sturdy mica the model was done training on the 29th

sturdy mica Jul 3, 2025, 11:46 AM

#

cedar tide

oh wow

#

never mind

#

obviously i was wrong

alpine coral Jul 3, 2025, 12:06 PM

#

ocean vortex Had another go with it... ```I'm Amazon Titan, a large language model develope...

yeah it said Amazon Titan for me too

#

ig it's consistently sht as with all the other amazon models ive tried.. but i dunno wouldn't be entirely surprised if it was a hallucination.. feels like some kinda open source for some reason.. not a titan (tho again.. Amazon do seem to kinda suck at LLMs.. maybe it's as simple as that ha)

ocean vortex Jul 3, 2025, 12:13 PM

#

alpine coral ig it's consistently sht as with all the other amazon models ive tried.. but i d...

it's too consistent to be a hallucination I think. And there were numerous other cases of it saying it's Amazon. It's not a very unusual lab to hallucinate about either lol

#

Like it could hallucinate it was made by OpenAI, but hallucinating it's made by Amazon would have been extremely unusual 😂

alpine coral Jul 3, 2025, 12:13 PM

#

yeah i know what you mean

#

hallucinating "im chatgpt" is v common.. but yeah Amazon Titan not so.. it is very specific (and consistent)

rare python Jul 3, 2025, 3:25 PM

#

o3 writing style be like, described by Opus 4:

This writing style blends technical instruction with forced casual language - it's trying to sound hip and conversational but comes off as artificially breezy. The author uses colloquialisms like "tossing over the fence," "hit the sweet spot," and "feels chunky" to explain prompt engineering concepts, creating an odd mismatch between the technical subject matter and the overly casual tone. The short, punchy sentences and slangy metaphors ("the model starts parroting," "slice it") feel like someone trying too hard to make dry material sound cool and accessible, resulting in a style that reads more like a tech bro explaining AI at a coffee shop than genuine technical documentation.

#

Full o3's reply:

Start a prompt by laying out the essentials first. Name the expected input, spell out any limits that apply to it, then tell the model how the reply should look. Concrete wording keeps everyone on the same page.  

Next, lean on few-shot examples rather than tossing the task over the fence with no guidance. A couple of well-chosen samples teach style and structure better than a lecture ever could. Two or three examples usually hit the sweet spot. Pack in more and the model starts parroting; include fewer and it may drift. Pick snippets that show the right approach rather than warning about the wrong one. Positive patterns stick.  

Make every example line up visually. When the demonstration stays consistent, the response follows suit. Add any background facts a stranger would need, instead of assuming the model read your mind. Simple prefixes like “Input,” “Output,” or “Example” clarify who is speaking and what comes next.  

If the job feels chunky, slice it. Separate prompts for each sub-task or set them up in a chain so each answer feeds the next step. When the first draft misses the mark, rewrite the prompt. Swap phrases, shuffle the order, or compare it with a similar task until the results land where you want.  

That series of habits—clear setup, targeted examples, steady formatting, useful context, strategic splitting, and iterative tuning—forms a reliable blueprint for getting the model to produce the answer you actually want.
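For reference, the layout that reply describes (instructions first, two or three consistently formatted examples, simple "Input"/"Output" prefixes) could be sketched roughly like this. The function name `build_few_shot_prompt` and the sentiment examples are made up for illustration and are not taken from any real API.

```python
def build_few_shot_prompt(instructions, examples, new_input):
    """Assemble a few-shot prompt with consistent Input/Output prefixes.

    `examples` is a list of (input_text, output_text) pairs. All names
    here are hypothetical, chosen only to illustrate the layout.
    """
    parts = [instructions.strip(), ""]
    for example_input, example_output in examples:
        # Every example uses the same prefixes so the pattern is easy to follow.
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # End on an open "Output:" so the model continues in the same format.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [("great launch", "positive"), ("the demo crashed", "negative")],
    "it was fine I guess",
)
print(prompt)
```

Two examples are used here only because the reply suggests two or three; the right number for a given task is a judgment call.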

#

OpenAI made it so edgy 💀

torn mantle Jul 3, 2025, 3:49 PM

#

cedar tide

my guess is this may be the latest checkpoint/run that elon talked about and now they are finalizing things for release

#

or they may give it extra days if they arent satisfied yet

#

https://x.com/elonmusk/status/1940709885626433648

Elon Musk (@elonmusk)

#

this may have a meaning behind it

#

grok 4 code probably turned out well

#

they have been talking about cursor grok model for a while but i still dont think it will surpass sonnet 4/opus 4

#

even gemini still lacks

tall summit Jul 3, 2025, 3:51 PM

#

rare python o3 writing style be like, described by Opus 4: This writing style blends techni...

why did you have Opus 4 describe it

torn mantle Jul 3, 2025, 3:51 PM

#

and we know they achieved massive improvements at coding tasks

rare python Jul 3, 2025, 3:51 PM

#

tall summit why did you have Opus 4 describe it

Because I'm not a native

torn mantle Jul 3, 2025, 3:51 PM

#

you are a native

#

stop lying

rare python Jul 3, 2025, 3:51 PM

#

No I'm not a native speaker

torn mantle Jul 3, 2025, 3:51 PM

#

yes you are

rare python Jul 3, 2025, 3:52 PM

#

torn mantle yes you are

ok

torn mantle Jul 3, 2025, 3:52 PM

#

the real question here

#

will grok 4 be better than gemini 3 + deepthink

#

gemini 3 = starting from kingfall

rare python Jul 3, 2025, 3:53 PM

#

torn mantle will grok 4 be better than gemini 3 + deepthink

Still 132k context windows

torn mantle Jul 3, 2025, 3:53 PM

#

it has been weeks since kingfall, so they probably reached an even higher limit

torn mantle Jul 3, 2025, 3:53 PM

#

rare python Still 132k context windows

grok 4?

leaden sun Jul 3, 2025, 3:53 PM

#

rare python o3 writing style be like, described by Opus 4: This writing style blends techni...

sounds very Opus like, thats why i like to chat with it about various schools of thought and ideas, pure comedy show, better than netflix and disney

rare python Jul 3, 2025, 3:53 PM

#

torn mantle grok 4?

yea

torn mantle Jul 3, 2025, 3:53 PM

#

i mean it doesnt bother me tbh

#

it just needs to be efficient

rare python Jul 3, 2025, 3:53 PM

#

leaden sun sounds very Opus like, thats why i like to chat with it about various school of ...

Yeah Opus 4 described it

torn mantle Jul 3, 2025, 3:53 PM

#

they may increase it later

#

to 1m

#

but i think 134k is like the sweet spot

#

for them

rare python Jul 3, 2025, 3:54 PM

#

torn mantle it just needs to be efficient

I don't know I feel like it's even heavier and bigger than Grok 3

dusky aurora Jul 3, 2025, 3:54 PM

#

why did "I'm a good gpt2 chatbot" disappear?

golden ocean Jul 3, 2025, 3:59 PM

#

he's been a bad chatbot

tall summit Jul 3, 2025, 4:00 PM

#

image edit

echo aurora Jul 3, 2025, 4:03 PM

#

We now have our own Server Tag too! User Settings -> Profile is where you can enable it.

flint skiff Jul 3, 2025, 4:04 PM

#

@echo aurora do the new leaderboards count in the overall leaderboard tab?

#

in the arena overview

#

will they have their separate tab there?

#

here I mean

echo aurora Jul 3, 2025, 4:09 PM

#

flint skiff @echo aurora do the new leaderboards count in the overall leaderboard t...

I believe the votes will contribute to this list, but will double check and keep you updated if that's incorrect.

will they have their separate tab
nope

flint skiff Jul 3, 2025, 4:20 PM

#

alright

#

thank you!

zinc ore Jul 3, 2025, 4:41 PM

#

Grok 4 on July 4th just sounds way too tempting and on the nose for Elon not to do that

#

Especially since he's all about "America"

sage raptor Jul 3, 2025, 4:47 PM

#

gork 4 *

inner gate Jul 3, 2025, 4:49 PM

#

What do u guys think of Claude opus 4

leaden palm Jul 3, 2025, 4:49 PM

#

zinc ore Grok 4 on July 4th just sounds way too tempting and on the nose for Elon not to ...

he's explicitly confirmed that

zinc ore Jul 3, 2025, 4:49 PM

#

Oh really, I missed it

leaden palm Jul 3, 2025, 4:49 PM

#

zinc ore Jul 3, 2025, 4:50 PM

#

"just after July 4th"

#

I think they're actually going to drop right on the 4th tho

late gorge Jul 3, 2025, 4:50 PM

#

Groq is the cheapest

whole wagon Jul 3, 2025, 4:50 PM

#

Notice he did not put the year there

echo aurora Jul 3, 2025, 4:50 PM

#

didn't specify which year doggolol

whole wagon Jul 3, 2025, 4:51 PM

#

Lol

#

Looks like it isn't coming to LLM arena till after release

dusky aurora Jul 3, 2025, 5:09 PM

#

arena is my only outlet these days

ocean vortex Jul 3, 2025, 6:13 PM

#

flint skiff

there's only 14 Elo points between o3 and 2.5Pro overall. But this table makes it look like Gemini destroys the entire competition lmao
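To put a 14 point gap in perspective, here is a rough sketch assuming the classic Elo logistic formula with a 400 point scale. That is an assumption; the arena's actual rating model may differ. Under it, a 14 point lead works out to only about a 52% expected win rate head to head.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Expected score of the higher-rated model under the classic Elo formula.

    Assumes the standard logistic curve with a 400 point scale, which may
    not match how the leaderboard computes its scores.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


gap = 14  # the gap mentioned above
print(f"{expected_win_rate(gap):.1%}")  # about 52%
```

So a table can rank one model first in every category while the head to head margin stays close to a coin flip.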

#

@echo aurora I think the new arena interface needs a rethink. There's not a lot of incentive to play with it when each follow up question results in responses from different random models... At least for me. I want to actually be able to test the new models

#

otherwise there's not much point. I can use legacy but that I'm not sure is still getting updated with new entries...

leaden sun Jul 3, 2025, 6:20 PM

#

..actually, now am starting to like o3 more

ocean vortex Jul 3, 2025, 6:21 PM

#

Like you could give people at least 5 follow up questions without redrawing the models, after voting

#

Then ideally clear the context entirely as you are changing the models for next battle. Just my 2 cents. Models having responses from different models in their context is a bit odd

#

Likely this will influence their responses and potentially elo score as well...

#

If it has a certain style in its context it is likely to mimic that

#

It's only recently that I had a deeper dive on the new interface, and yeah it's a bit of a mess to be completely honest, not gonna sugar coat it catgrin

#

Like this is weird:

#

models have responses in context that are not theirs

leaden sun Jul 3, 2025, 6:38 PM

#

I think they’re already aware of this dom

late path Jul 3, 2025, 6:42 PM

#

ocean vortex otherwise there's not much point. I can use legacy but that I'm not sure is stil...

It's not updating anymore. I encountered redsword a few days ago. (its reply was an api error)

whole wagon Jul 3, 2025, 6:56 PM

#

https://x.com/elonmusk/status/1940709885626433648?t=6Jl3Z1__CAV_mEEBUkzkDQ&s=19

Elon Musk (@elonmusk)

#

Bro is posting memes

#

https://x.com/TheGregYang/status/1940276530992881970?t=qHokcBOhYi6az6qWWsnWdg&s=19 This joke is good ngl 😂

Greg Yang (@TheGregYang)

I believe they just moved to their new office at 1 Hacker Way, Menlo Park

ocean vortex Jul 3, 2025, 7:14 PM

#

late path It's not updating anymore. I encountered redsword a few days ago. (its reply was...

yeah I guess it's over for testing the unreleased models on lmarena then...

#

Don't feel like doing the work with no return, just to form the leaderboard 🤷‍♂️

#

In that sense, I suppose it's good labs are switching to openrouter then...

echo aurora Jul 3, 2025, 8:08 PM

#

flint skiff @echo aurora do the new leaderboards count in the overall leaderboard t...

Hey sorry for the confusion! The Arena Overview is actually just for Text Arena. We realized it's a bit confusing and are updating the copy to make that more clear.

echo aurora Jul 3, 2025, 8:10 PM

#

ocean vortex @echo aurora I think the new arena interface needs a rethink. There's n...

I'd love to hear more about this. I'll spin up a new post in #1372230675914031105 to organize this feedback. Looks like #1376484577245266005 is onto something similarish.

leaden sun Jul 3, 2025, 8:23 PM

#

echo aurora I'd love to hear more about this. I'll spin up a new post in <#13722306759140311...

so you werent aware of this issue before? i noticed this when i was using battle mode to decipher craigs code many weeks ago

echo aurora Jul 3, 2025, 10:09 PM

#

leaden sun so you werent aware of this issue before? i noticed this when i was using battle...

it is something the team has discussed internally before, but I will raise it again

leaden sun Jul 3, 2025, 10:40 PM

#

o3 is very interesting, isnt it

jade egret Jul 3, 2025, 10:57 PM

#

echo aurora it is something the team has discussed internally before, but I will raise it ag...

🍊

elder rapids Jul 3, 2025, 10:58 PM

#

leaden sun o3 is very interesting, isnt it

this isn't very explanatory, no?

#

it seems like it has the same problem that o1 has with sophistry

#

when it tries to explain things, successfully does it, but in the end it's not really meaningful

tall summit Jul 3, 2025, 11:20 PM

#

oh i'm a few days late to the news that huggingchat shut down

#

a shame

olive mesa Jul 3, 2025, 11:28 PM

#

grok 4 july 4th 2pm EST

leaden sun Jul 3, 2025, 11:33 PM

#

elder rapids this isn't very explanatory, no?

well, with this "self-image", sometimes faithfully, it explains well?

#

o3 is using less anthropomorphization compared to claude, i wonder if it's because of the nature of the conversation, i remember o3 usually being chatty?

wintry tinsel Jul 3, 2025, 11:49 PM

#

tall summit oh i'm a few days late to the news that huggingchat shut down

I have good memories of cannibalism roleplay on hugging chat

torn mantle Jul 3, 2025, 11:50 PM

#

olive mesa grok 4 july 4th 2pm EST

source

#

I still dont think its this week

whole wagon Jul 4, 2025, 12:20 AM

#

olive mesa grok 4 july 4th 2pm EST

How do u know

leaden palm Jul 4, 2025, 1:49 AM

#

i now realize it's old (4 month old) news

#

didn't see it though so suppose it flew under many radars

empty stump Jul 4, 2025, 5:42 AM

#

cats are smarter than llms

pallid crypt Jul 4, 2025, 6:58 AM

#

leaden palm

Likely a side effect of MoE framework, I would think MoE would make LLMs susceptible to adversarial attacks like that

torn mantle Jul 4, 2025, 7:20 AM

#

@deep adder do it

late path Jul 4, 2025, 7:53 AM

#

new google model wolfstride

keen beacon Jul 4, 2025, 8:37 AM

#

torn mantle @deep adder do it

Lmao

#

I would do it .. 0.001 sec after results are published

hoary plaza Jul 4, 2025, 8:39 AM

#

late path new google model `wolfstride`

Damn how come you guys get them

#

I get more of older models

#

I feel there is a need to increase the rate at which they appear on battle mode

cedar tide Jul 4, 2025, 8:40 AM

#

Thanks for the list of models added, just it would be good if you also said the mystery models, like the new "wolfstride"
#announcements message

echo aurora Jul 4, 2025, 8:53 AM

#

cedar tide Thanks for the list of models added, just it would be good if you also said the ...

but then it's not mysterious ablobwink

cedar tide Jul 4, 2025, 8:54 AM

#

echo aurora but then it's not mysterious ablobwink

I didn't ask you to say which company it came from or what their real name is, I just asked that we be informed when it was added so we could test them.

rare python Jul 4, 2025, 8:58 AM

#

echo aurora but then it's not mysterious ablobwink

can we have those mysterious models in direct chat 🤔 still mysterious but allow us to test more and more frequently ?

versed moss Jul 4, 2025, 8:59 AM

#

How do i use wolfstride

ocean vortex Jul 4, 2025, 8:59 AM

#

what is this chat turning into...

#

smh

rare python Jul 4, 2025, 8:59 AM

#

ban

#

@echo aurora

ocean vortex Jul 4, 2025, 9:00 AM

#

cedar tide I didn't ask you to say which company it came from or what their real name is, I...

I think the premise is that you shouldn't know what you may run into on any given day

#

in battle mode

late path Jul 4, 2025, 9:01 AM

#

I think the main purpose of model vendors releasing mystery models is to collect votes. If they were directly used in direct chat, it would be meaningless for them.

rare python Jul 4, 2025, 9:02 AM

#

late path I think the main purpose of model vendors releasing mystery models is to collect...

They can still have training data

cedar tide Jul 4, 2025, 9:02 AM

#

I don't understand, my message is very clear, but I receive 3 responses that are off the mark 🤦

alpine coral Jul 4, 2025, 9:02 AM

#

they'd be unmasked in no time if they were available in direct chat

#

yeah i actually got what you were saying

ocean vortex Jul 4, 2025, 9:02 AM

#

late path I think the main purpose of model vendors releasing mystery models is to collect...

true, but speaking for myself... I'm also much much less likely to give them the votes if I can't test that model at all and am forced into new battle immediately after voting 🧐

alpine coral Jul 4, 2025, 9:03 AM

#

cedar tide I don't understand, my message is very clear, but I receive 3 responses that are...

nothing to do with adding them to direct chat

cedar tide Jul 4, 2025, 9:03 AM

#

late path I think the main purpose of model vendors releasing mystery models is to collect...

I didn't ask for it to be on direct chat

alpine coral Jul 4, 2025, 9:03 AM

#

just would be nice to see anon model names included in any list of models in / added to the arena

late path Jul 4, 2025, 9:03 AM

#

cedar tide I didn't ask for it to be on direct chat

um im replying ouroboaros...

cedar tide Jul 4, 2025, 9:03 AM

#

Ah okk

rare python Jul 4, 2025, 9:04 AM

#

alpine coral they'd be unmasked in no time if they were available in direct chat

Well they already are even in battle mode?

ocean vortex Jul 4, 2025, 9:04 AM

#

this looks like something someone would do for money tbh. The current battle mode in isolation

alpine coral Jul 4, 2025, 9:04 AM

#

i meant to say your message had nothing to do with direct chat @cedar tide

#

anyway.. lol

rare python Jul 4, 2025, 9:04 AM

#

what 😔

alpine coral Jul 4, 2025, 9:05 AM

#

rare python Well they already are even in battle mode?

yeah.. and you get them randomly, and don't know until you vote

#

doesn't make unmasking them impossible - but far more challenging (unless endless time / automation ig)

rare python Jul 4, 2025, 9:06 AM

#

alpine coral doesn't make unmasking them impossible - but far more challenging (unless endles...

Many people can unmask them quickly. No need automation

echo aurora Jul 4, 2025, 9:06 AM

#

cedar tide I didn't ask you to say which company it came from or what their real name is, I...

I understand, we intentionally left it out of the announcement but will share this request with the team if it's something we'd like to change in the future

keen ferry Jul 4, 2025, 9:06 AM

#

is grok 4 today or tomorrow

alpine coral Jul 4, 2025, 9:06 AM

#

rare python Many people can unmask them quickly. No need automation

yeah... it's fun

cedar tide Jul 4, 2025, 9:07 AM

#

ocean vortex I think the premise is that you shouldn't know what you may run into it on any g...

Where is the logic? The companies want you to do a lot of tests on the arena, not that you are aware of it.

rare python Jul 4, 2025, 9:07 AM

#

alpine coral yeah... it's fun

🫃

ocean vortex Jul 4, 2025, 9:07 AM

#

ocean vortex this looks like something someone would do for money tbh. The current battle mod...

If I'm to give votes I expect at the very least to give a chance to continue the chat with those models for several more messages. I think that was a grey area with legacy version that it was not on a tight lock and that seemed fine. But now however... Yeah as long as there's no change I have not much motivation to use it 🤷‍♂️

cedar tide Jul 4, 2025, 9:07 AM

#

rare python Many people can unmask them quickly. No need automation

Yes, it's very easy

rare python Jul 4, 2025, 9:08 AM

#

ocean vortex If I'm to give votes I expect at the very least to give a chance to continue the...

You can if you haven't voted, no?

#

then vote later

alpine coral Jul 4, 2025, 9:08 AM

#

cedar tide Yes, it's very easy

noone really figured out sus-column-r was from xAI

ocean vortex Jul 4, 2025, 9:09 AM

#

cedar tide Where is the logic? The companies want you to do a lot of tests on the arena, no...

The logic is that the arena itself is more appealing to people when they don't know what models or aliases they may encounter. That part by itself is completely fine and understandable tbh

rare python Jul 4, 2025, 9:09 AM

#

alpine coral noone really figured out sus-column-r was from xAI

sus

echo aurora Jul 4, 2025, 9:10 AM

#

mystery models in direct chat sounds like a good request to be shared in #1372230675914031105

ocean vortex Jul 4, 2025, 9:10 AM

#

Besides, there are people tracking it constantly, so if you want to be in the loop - you can be

cedar tide Jul 4, 2025, 9:12 AM

#

alpine coral noone really figured out sus-column-r was from xAI

Ah, I thought you said unmasking in battle mode to know which model we're talking about, not to know who is really behind the model.

alpine coral Jul 4, 2025, 9:13 AM

#

cedar tide Ah, I thought you said unmasking in battle mode to know which model we're talkin...

yeah ig my point was that it would be easier to (try) to get the bottom of the lab behind anon models with access in direct chat

#

tho it's perhaps not a great point..

versed moss Jul 4, 2025, 9:14 AM

#

this new lmarena runs so slow

alpine coral Jul 4, 2025, 9:14 AM

#

like OR has had their couple of 'stealth' models directly available. the OAI one was unmasked pretty quickly.. cypher-alpha still kinda a mystery (i lean towards Amazon.. but wouldn't be surprised if it was something else, tho not from oai, google, or anthropic) .. guess the labs serving them are really just collecting data at the end of the day (tho elo rankings presumably are of additional value too)

rare python Jul 4, 2025, 9:14 AM

#

alpine coral like OR has had their couple of 'stealth' models directly available. the OAI one...

for free

versed moss Jul 4, 2025, 9:15 AM

#

@echo aurora does the legacy version have the new models?

keen ferry Jul 4, 2025, 9:16 AM

#

versed moss <@283397944160550928> does the legacy version have the new models?

no it doesn't

versed moss Jul 4, 2025, 9:16 AM

#

but the new version is sooo slow

keen ferry Jul 4, 2025, 9:17 AM

#

versed moss but the new version is sooo slow

they haven't added anything from feedback yet so I guess it's worth to wait for something big

rare python Jul 4, 2025, 9:17 AM

#

It sometimes kick me out to the homepage of lmarena after I start a new chat and hit send message

keen ferry Jul 4, 2025, 9:17 AM

#

it'll probably be faster

rare python Jul 4, 2025, 9:18 AM

#

versed moss but the new version is sooo slow

In my opinion it's better than the Gradio version which I have to wait a lot for the UI to load

versed moss Jul 4, 2025, 9:18 AM

#

It makes me feel like im using a £200 laptop on it

#

the UI is so buggy and everything crashes

#

my pc is worth 4k

keen ferry Jul 4, 2025, 9:18 AM

#

that's common

#

somebody already reported it

#

I guess they will fix this in the next site overhaul prob

versed moss Jul 4, 2025, 9:19 AM

#

Anyway, just one shotted this on wolfstride

echo aurora Jul 4, 2025, 9:21 AM

#

versed moss but the new version is sooo slow

I am sorry about the slowness and lag (esp with long responses), these issues are on our team's radar to fix.

keen ferry Jul 4, 2025, 9:23 AM

#

echo aurora I am sorry about the slowness and lag (esp with long responses), these issues ar...

when the site will receive updates it's been soo long 😭

torn mantle Jul 4, 2025, 9:25 AM

#

versed moss Anyway, just one shotted this on wolfstride

Looks good

#

@cedar tide how is it

echo aurora Jul 4, 2025, 9:26 AM

#

keen ferry when the site will receive updates it's been soo long 😭

on its way!

main gulch Jul 4, 2025, 9:27 AM

#

wolfstride has stonebloom-level knowledge

versed moss Jul 4, 2025, 9:28 AM

#

Im trying to get another instance of it to test it again

main gulch Jul 4, 2025, 9:29 AM

#

7/9 in my quiz inspired by brknclock1215 (the best 2.5-pro/goldmane result is 5.5, the vast majority of models are below zero)

cedar tide Jul 4, 2025, 9:31 AM

#

https://x.com/elonmusk/status/1941065229926060487?t=Kzl7rRcZQCxyFry3E69bqw&s=19

Elon Musk (@elonmusk)

We have improved @Grok significantly.

You should notice a difference when you ask Grok questions.

keen beacon Jul 4, 2025, 9:31 AM

#

main gulch 7/9 in my quiz inspired by brknclock1215 (the best 2.5-pro/goldmane result is 5....

o3 score ?

main gulch Jul 4, 2025, 9:32 AM

#

3/9, o3-pro 8/9

#

keen beacon Jul 4, 2025, 9:33 AM

#

main gulch

Damn. What are the questions about

ocean vortex Jul 4, 2025, 9:34 AM

#

cedar tide https://x.com/elonmusk/status/1941065229926060487?t=Kzl7rRcZQCxyFry3E69bqw&s=19

big beautiful questions

#

the best questions

keen beacon Jul 4, 2025, 9:34 AM

#

I dont trust benchmarks. I should create some private personal one :/

main gulch Jul 4, 2025, 9:34 AM

#

keen beacon Damn. What are the questions about

some non-English language obscure knowledge

ocean vortex Jul 4, 2025, 9:35 AM

#

If you don't notice a difference that means the model is aimed at people smarter than you. 🫃

#

dunno why but I just find his wording hilarious lmao

keen beacon Jul 4, 2025, 9:36 AM

#

My favorite test prompt is " 4chan green text of late stage capitalsim birthday cake distribution"

keen beacon Jul 4, 2025, 9:37 AM

#

ocean vortex If you don't notice a difference that means the model is aimed at people smarter...

Cope, the model will suck

ocean vortex Jul 4, 2025, 9:38 AM

#

keen beacon Cope, the model will suck

I was being sarcastic lol

keen beacon Jul 4, 2025, 9:38 AM

#

I meant it for Elon

rare python Jul 4, 2025, 9:39 AM

#

main gulch

why there are two stonebloom here?

ocean vortex Jul 4, 2025, 9:39 AM

#

yeah he is being obscure. He didn't say it's better at math, coding or anything....

main gulch Jul 4, 2025, 9:39 AM

#

rare python why there are two stonebloom here?

two runs

#

will keep the best of them likely

cedar tide Jul 4, 2025, 9:39 AM

#

cedar tide https://x.com/elonmusk/status/1941065229926060487?t=Kzl7rRcZQCxyFry3E69bqw&s=19

Its grok 4

ocean vortex Jul 4, 2025, 9:39 AM

#

you just gotta "ask questions"

main gulch Jul 4, 2025, 9:41 AM

#

yeah, it is most useful to determine the level (Flash/Pro/Ultra) of anon models

ocean vortex Jul 4, 2025, 9:41 AM

#

which would have been a valid thing if true. SimpleQA, world knowledge. Just not excluding English, that would have been weird...

#

MMLU too

versed moss Jul 4, 2025, 9:43 AM

#

Managed to get another instance

#

Wolfstride

#

#

fresh shadow Jul 4, 2025, 9:44 AM

#

Where can I test Wolfstride?

cedar tide Jul 4, 2025, 9:44 AM

#

Grok 4 not available

ocean vortex Jul 4, 2025, 9:45 AM

#

that's fairly difficult to do tbh

#

SimpleQA is the best public tool we have for this

rare python Jul 4, 2025, 9:45 AM

#

main gulch two runs

You mean two run of the same models or two run of 2 different checkpoint with the same codename?

cedar tide Jul 4, 2025, 9:46 AM

#

Just for ask him in x

main gulch Jul 4, 2025, 9:46 AM

#

rare python You mean two run of the same models or two run of 2 different checkpoint with th...

two runs of the same model in arena

ocean vortex Jul 4, 2025, 9:46 AM

#

Even arc-agi and HLE seems to be vulnerable to contamination. Not a major problem for comparing different models, but this means small models can perform just as good mostly

rare python Jul 4, 2025, 9:47 AM

#

main gulch two runs of the same model in arena

If you have time you should do AVG@5

main gulch Jul 4, 2025, 9:48 AM

#

rare python If you have time you should do AVG@5

good proposal, but it's difficult with anon models and expensive with o3-pro etc.

ocean vortex Jul 4, 2025, 9:49 AM

#

Cause like... o4-mini-high is pretty much acing arc-agi now. Despite it being very poor on spatial awareness

rare python Jul 4, 2025, 9:49 AM

#

ocean vortex Cause like... o4-mini-high is pretty much acing arc-agi now. Despite it being ve...

Is ARC AGI legit at all?

#

Like does it measure real world performance?

ocean vortex Jul 4, 2025, 9:51 AM

#

rare python Is ARC AGI legit at all?

it is legit for sure. It's just not a thing that definitively penalizes the small models lately. Size helps but your model does not have to be huge

rare python Jul 4, 2025, 9:53 AM

#

Why Arc Agi 2 Opus 4 took the lead then? What changed?

ocean vortex Jul 4, 2025, 9:53 AM

#

we would probably need something like LCB benchmark format where the tasks change depending on the date range, only for spatial awareness rather than coding. So you wouldn't be able to train your model as easily on just the exact similar tasks tested for

ocean vortex Jul 4, 2025, 9:55 AM

#

rare python Why Arc Agi 2 Opus 4 took the lead then? What changed?

It took the lead on release. That is as expected. What's less expected is how small the margin from o4-mini to Opus is...

#

Which is why I implied SimpleQA is currently the better benchmark to get a rough estimate of model size. The delta on it from gpt4.5 to o4-mini, which performs worse there, is fairly big.

#

Also expanding on arc-agi... o4-mini-high essentially has the exact same scores as o3-high... lol

rare python Jul 4, 2025, 9:59 AM

#

ocean vortex Also expanding on arc-agi... o4-mini-high essentially has the exact same scores ...

o3 pro has worse score than o3 high

#

:v

ocean vortex Jul 4, 2025, 9:59 AM

#

#

on webdev arena (frontend - visuals in space) there's also a sizeable gap between them:

#

that wasn't even o3-high 👀

ocean vortex Jul 4, 2025, 10:04 AM

#

rare python o3 pro has worse score than o3 high

It probably just has no clue which attempts are correct so it compiles "the best" response out of ~10 attempts which also uses data from the bad attempts LOL

#

This is one of those benchmarks where it seeing the correct attempt is probably not enough for it to understand that other ones are wrong

#

works much better for trick questions or in cases where it's just randomly failing by chance

rare python Jul 4, 2025, 10:05 AM

#

ocean vortex It probably just has no clue which attempts are correct so it compiles "the best...

They should do average of 5 attempts or more to be honest

#

What's the point of choosing the best attempt when consumer like us rarely bother to regenerate the answers?

ocean vortex Jul 4, 2025, 10:06 AM

#

rare python They should do average of 5 attempts or more to be honest

pretty sure they do around 10. So more than 5

rare python Jul 4, 2025, 10:07 AM

#

ocean vortex It probably just has no clue which attempts are correct so it compiles "the best...

Oh you are talking about how o3 pro works my bad

ocean vortex Jul 4, 2025, 10:07 AM

#

rare python What's the point of choosing the best attempt when consumer like us rarely bothe...

Because for some tasks it's gonna spot which attempts are correct just by looking at them... And so it disregards bad attempts

rare python Jul 4, 2025, 10:07 AM

#

I'm talking about some AI labs use pass@64 which choose the best attempt out of 64 tries

#

Like xAI or OpenAI
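
To make the avg@5 vs pass@k distinction above concrete, here is a minimal sketch. The function names and the sample data are hypothetical and only for illustration; they are not taken from any lab's evaluation harness.

```python
def pass_at_k(results):
    """Score 1.0 if at least one of the k attempts is correct (best-of-k)."""
    return 1.0 if any(results) else 0.0


def avg_at_k(results):
    """Score the fraction of the k attempts that are correct."""
    return sum(results) / len(results)


# Hypothetical outcome of 5 attempts on one question: only one is correct.
attempts = [False, False, True, False, False]

pass_score = pass_at_k(attempts)  # counts the question as solved
avg_score = avg_at_k(attempts)    # reflects what a single try usually gives

print(pass_score, avg_score)
```

With a single correct attempt out of five, the best-of-k score counts the question as fully solved, while the average is closer to what a user who never regenerates would experience.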

wild kayak Jul 4, 2025, 10:08 AM

#

does anyone know the sampling parameters and thinking budgets used for the thinking models in web dev arena, like gemini2.5-pro, r1, sonnet and opus?

rare python Jul 4, 2025, 10:09 AM

#

wild kayak does anyone know the sampling parameters and thinking budgets used for the think...

We only know from claude models I think. They have thinking budget in their codename, like claude-opus-4-thinking-16k

ocean vortex Jul 4, 2025, 10:10 AM

#

rare python I'm talking about some AI labs use pass@64 which choose the best attempt out of ...

cons@32 etc is perhaps worse. It may only be correct in 30% of attempts, but if the rest are distinct enough from one another... you can have the answers grouped in say 3 groups of 20%, 1 group of 10% and then that 30% group which was correct. This would still be interpreted like a pass because 30% group is the biggest and all the others are not the same
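
The scenario above can be sketched as follows, assuming cons@k is scored as a plurality vote over the sampled answers. The helper name and the 10-sample split are hypothetical, chosen to match the 30% / 3×20% / 10% grouping described.

```python
from collections import Counter


def cons_at_k(answers, correct):
    """Return True if the most common sampled answer equals the reference."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == correct


# 10 hypothetical samples: the correct answer "A" appears in only 30% of them,
# while the wrong answers are split across four distinct groups.
samples = ["A"] * 3 + ["B"] * 2 + ["C"] * 2 + ["D"] * 2 + ["E"]

accuracy = samples.count("A") / len(samples)  # per-sample accuracy
passed = cons_at_k(samples, "A")              # plurality vote result

print(accuracy, passed)
```

Here the per-sample accuracy is low, yet the vote still counts as a pass because no single wrong answer outnumbers the correct group.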

wild kayak Jul 4, 2025, 10:14 AM

#

rare python We only know from claude models I think. They have thinking budget in their code...

hmm, i did not see this "thinking-16k" in the leaderboard? where is this name come from? https://web.lmarena.ai/leaderboard

rare python Jul 4, 2025, 10:14 AM

#

wild kayak hmm, i did not see this "thinking-16k" in the leaderboard? where is this name co...

in normal arena, in direct chat

#

rare python Jul 4, 2025, 10:16 AM

#

wild kayak hmm, i did not see this "thinking-16k" in the leaderboard? where is this name co...

I think in webdev arena and battle mode in lmarena, they use normal, non thinking opus 4. I might be wrong.

wild kayak Jul 4, 2025, 10:17 AM

#

rare python I think in webdev arena and battle mode in lmarena, they use normal, non thinkin...

why the non thinking one

rare python Jul 4, 2025, 10:18 AM

#

wild kayak why the non thinking one

Ask lmarena team

#

¯_(ツ)_/¯

wild kayak Jul 4, 2025, 10:19 AM

#

rare python Ask lmarena team

oh i mean why do you think it is the non thinking opus. any signs that the model does not think?

rare python Jul 4, 2025, 10:19 AM

#

wild kayak oh i mean why do you think it is the non thinking opus. any signs that the model...

Because of the codename :)

rare python Jul 4, 2025, 10:19 AM

#

rare python

Like this example, Opus 4 use thinking here

rare python Jul 4, 2025, 10:20 AM

#

wild kayak oh i mean why do you think it is the non thinking opus. any signs that the model...

I'm about to say you measure time to first token but they are inconsistent. Sometime non thinking model took a long time to output

whole wagon Jul 4, 2025, 10:27 AM

#

https://x.com/grok/status/1941078491732353176

Grok (@grok)

@Lcsuthar98 @elonmusk We've rolled out Grok 4 with enhanced reasoning, math, and NLP; a specialized coding model with IDE integration; 130k token context window; multimodal support (text/vision, image gen soon); reduced hallucinations; and faster responses via Colossus compute. Try a complex query to

#

Grok 4 is here

civic flame Jul 4, 2025, 10:27 AM

#

why would you trust what grok says

#

it's just pulling leaked grok 4 info from twitter and hallucinating that it is 4

whole wagon Jul 4, 2025, 10:28 AM

#

kekw

#

Well I mean how else would they have a significantly improved grok

#

The base tweet is https://x.com/elonmusk/status/1941065229926060487?t=BzwTdNIpbp4sT_S3saVEig&s=19

Elon Musk (@elonmusk)

We have improved @Grok significantly.

You should notice a difference when you ask Grok questions.

keen beacon Jul 4, 2025, 10:29 AM

#

Placebo 😂

civic flame Jul 4, 2025, 10:29 AM

#

whole wagon The base tweet is https://x.com/elonmusk/status/1941065229926060487?t=BzwTdNIpbp...

why not just say "we switched @Grok to grok 4"

#

lol they've literally just updated 3 again haven't they

#

dw guys we'll get 4 in a couple years

sage raptor Jul 4, 2025, 10:30 AM

#

https://x.com/grok/status/1941068023542210675

Grok (@grok)

@sebthebjork @crunch1271 @elonmusk Not yet—I'm still Grok 3 with recent enhancements. Grok 4 is slated for release shortly after today, promising even better reasoning and coding. Excited?

whole wagon Jul 4, 2025, 10:31 AM

#

Wut

#

Why did they update grok 3 at the same time as grok 4 release

keen beacon Jul 4, 2025, 10:34 AM

#

Kinda a waste to deploy another grok 3 revision if 4 is coming out super soon

alpine coral Jul 4, 2025, 10:37 AM

#

ocean vortex

arc-agi vs simpleQA – i struggle to think of two benchmarks for which 'apples and oranges' better applies

#

they're totally unrelated

alpine coral Jul 4, 2025, 10:38 AM

#

ocean vortex on webdev arena (frontend - visuals in space) there's also a sizeable gap betwee...

kinda tenuous (even the premise that arc-agi is about visuals / spatial awareness is flawed imo.. but the connection to coding for websites is barely there)

torn mantle Jul 4, 2025, 10:38 AM

#

keen beacon Kinda a waste to deploy another grok 3 revision if 4 is coming out super soon

it means two things :

they may stick with grok 3 for @grok
grok 4 is still far away

whole wagon Jul 4, 2025, 10:39 AM

#

Grok 3 has no reasoning though

#

So it's not really useful currently

torn mantle Jul 4, 2025, 10:39 AM

#

it has

whole wagon Jul 4, 2025, 10:39 AM

#

It does not

torn mantle Jul 4, 2025, 10:39 AM

#

it does

#

think?

keen fulcrum Jul 4, 2025, 10:40 AM

#

pre grok 4 update
https://x.com/elonmusk/status/1941065229926060487

Elon Musk (@elonmusk)

We have improved @Grok significantly.

You should notice a difference when you ask Grok questions.

whole wagon Jul 4, 2025, 10:40 AM

#

They secretly switch it to mini

#

When u use think

#

There is no grok 3 think

keen beacon Jul 4, 2025, 10:40 AM

#

They announced it tho

#

It's vaporware

torn mantle Jul 4, 2025, 10:40 AM

#

im talking about if it exists not if its used now

whole wagon Jul 4, 2025, 10:40 AM

#

alpine coral Jul 4, 2025, 10:41 AM

#

keen beacon It's vaporware

nah mate it's purring

whole wagon Jul 4, 2025, 10:41 AM

#

It's scammy but what do you expect

#

It's xAI

whole wagon Jul 4, 2025, 10:42 AM

#

whole sundial the reasoning model is actually grok 3 mini

. Etc

keen beacon Jul 4, 2025, 10:42 AM

#

whole wagon

It's not accessible to the public, they announced a reasoning version of it though. Check their announcement post

whole wagon Jul 4, 2025, 10:42 AM

#

Who cares if it's not accessible

#

Lmao

keen beacon Jul 4, 2025, 10:42 AM

#

whole wagon Who cares if it's not accessible

Because they claimed it exists

whole wagon Jul 4, 2025, 10:42 AM

#

They claimed grok 3.5 exists

#

In may

#

Coming in 1 week

torn mantle Jul 4, 2025, 10:43 AM

#

they didnt add grok 3 reasoning API support?

keen beacon Jul 4, 2025, 10:43 AM

#

Never

whole wagon Jul 4, 2025, 10:43 AM

#

No. Grok 3 reasoning has never been available

torn mantle Jul 4, 2025, 10:43 AM

#

i mean grok 3 reasoning exist and its used on their web/app

whole wagon Jul 4, 2025, 10:43 AM

#

Bruh

torn mantle Jul 4, 2025, 10:43 AM

#

bruh

#

why dont you listen

whole wagon Jul 4, 2025, 10:44 AM

#

It's a scam what don't you get 😂

torn mantle Jul 4, 2025, 10:44 AM

#

omg

alpine coral Jul 4, 2025, 10:44 AM

#

it's 3.5-mini right?

keen beacon Jul 4, 2025, 10:44 AM

#

It's actually grok 3 mini reasoning

whole wagon Jul 4, 2025, 10:44 AM

#

Yes

alpine coral Jul 4, 2025, 10:44 AM

#

ah 3-mini

keen beacon Jul 4, 2025, 10:44 AM

#

Xai lies a surprising lot anyway. But I do think their reasoning version exists, it just sucks for the cost

#

Like them claiming that they were summarizing the cot or messing with it

keen fulcrum Jul 4, 2025, 10:45 AM

#

keen beacon Xai lies a surprising lot anyway. But I do think their reasoning version exists,...

grok 4 is reasoning

keen beacon Jul 4, 2025, 10:45 AM

#

keen fulcrum grok 4 is reasoning

I meant grok 3 reasoning

alpine coral Jul 4, 2025, 10:47 AM

#

yeah.. if the mini variant does.. stands to reason that full one does too

#

its non-release would seemingly say something about its performance

torn mantle Jul 4, 2025, 10:48 AM

#

keen beacon It's actually grok 3 mini reasoning

how do we check that

rare python Jul 4, 2025, 10:48 AM

#

torn mantle how do we check that

if it's fast

#

grok 3 mini is really fast

torn mantle Jul 4, 2025, 10:49 AM

#

well its not for me

#

maybe its used after a certain limit

#

let me check the payload

#

modelName: "grok-3"
responseMetadata: {requestModelDetails: {modelId: "grok-3"}}
requestModelDetails: {modelId: "grok-3"}
modelId: "grok-3"

#

lets see if the modelName contains mini as well

keen beacon Jul 4, 2025, 10:50 AM

#

Did you check the tps

#

Lemme see. That's what people claim at least. I'm not paying for their grok sub (and it seems to be true that they're redirecting), where are you using it

torn mantle Jul 4, 2025, 10:53 AM

#

https://docs.x.ai/docs/guides/reasoning

Reasoning - Guides | xAI Docs

Guide on using models with reasoning capability.

#

this is for their API

keen beacon Jul 4, 2025, 10:54 AM

#

yeah it says grok 3 doesnt support reasoning on the api

torn mantle Jul 4, 2025, 10:54 AM

#

it doesnt support grok 3 reasoning

#

yea they probably switch when you hit the rate limits

whole wagon Jul 4, 2025, 10:56 AM

#

Nowhere does it say just in the API

#

That's a general statement

torn mantle Jul 4, 2025, 10:57 AM

#

the guides are for the api docs

rare python Jul 4, 2025, 10:57 AM

#

torn mantle modelName : "grok-3" responseMetadata : {requestModelDetails: {modelId: "grok-...

is this from grok.com?

torn mantle Jul 4, 2025, 10:57 AM

#

rare python is this from grok.com?

yea

rare python Jul 4, 2025, 10:57 AM

#

The speed is slow when turn on thinking?

whole wagon Jul 4, 2025, 10:58 AM

#

It's in the chat with reasoning guide. Nothing to do with API

#

API is a totally different section

keen beacon Jul 4, 2025, 10:59 AM

#

the docs are for the api. there's literally an api example on that page

whole wagon Jul 4, 2025, 11:00 AM

#

Grok 3 reasoning has never been served nor benchmarked. There's not much else to say on it

torn mantle Jul 4, 2025, 11:00 AM

#

what are you saying

#

smh

#

it has been available from the start, after that they released grok 3 mini which was used in their @grok account

#

https://x.com/keirp1/status/1892667845697536304

Keiran Paster (@keirp1)

Btw, the chain of thought in the "thinking" mode for Grok 3 is completely open. No summarizers or obfuscation. This is really important and the reasoning process is often fascinating!

whole wagon Jul 4, 2025, 11:06 AM

#

Come back when you can provide any benchmark that includes "grok 3 reasoning"

ocean vortex Jul 4, 2025, 11:08 AM

#

whole wagon Grok 3 reasoning has never been served nor benchmarked. There's not much else to...

What are you talking about, there's API for it...

#

https://dev.to/maximsaplin/grok-3-api-reasoning-tokens-are-counted-differently-197

DEV Community

Grok 3 API - Reasoning Tokens are Counted Differently

I've learned it the hard way... If you use the recently released Grok-3 Mini reasoning model (which...

torn mantle Jul 4, 2025, 11:09 AM

#

keen beacon Jul 4, 2025, 11:09 AM

#

ocean vortex https://dev.to/maximsaplin/grok-3-api-reasoning-tokens-are-counted-differently-1...

that article talks about grok 3 mini reasoning

torn mantle Jul 4, 2025, 11:10 AM

#

@ocean vortex he is saying grok 3 reasoning doesnt exist and xai are using grok 3 mini reasoning in all of their services

ocean vortex Jul 4, 2025, 11:10 AM

#

keen beacon that article talks about grok 3 mini reasoning

isn't the full model the same with reasoning? Their official list doesn't seem to list reasoning exclusively. These may actually even be the hybrid models

keen beacon Jul 4, 2025, 11:11 AM

#

ocean vortex isn't the full model the same with reasoning? Their official list doesn't seem t...

nah u can see their api docs and the list of models. it doesnt support reasoning on the api

rare python Jul 4, 2025, 11:11 AM

#

torn mantle

crazy how there are little differences of grok 3 mini and grok 3 thinking

ocean vortex Jul 4, 2025, 11:11 AM

#

keen beacon nah u can see their api docs and the list of models. it doesnt support reasoning...

hmmm

keen beacon Jul 4, 2025, 11:11 AM

#

did yall see the pangu and qwen drama where a group alleged pangu is a stolen model?

#

saw it yesterday, interesting drama

ocean vortex Jul 4, 2025, 11:12 AM

#

oh yeah

#

weird

torn mantle Jul 4, 2025, 11:13 AM

#

rare python crazy how there are little differences of grok 3 mini and grok 3 thinking

yea

rare python Jul 4, 2025, 11:13 AM

#

ocean vortex weird

xAI moment

ocean vortex Jul 4, 2025, 11:14 AM

#

Not surprised that this is coming from them... But yeah their benchmarks they posted months ago for full model thinking version are essentially useless then lol

torn mantle Jul 4, 2025, 11:14 AM

#

rare python crazy how there are little differences of grok 3 mini and grok 3 thinking

thats the reason

#

lol

ocean vortex Jul 4, 2025, 11:14 AM

#

I thought there must have been API for it by now for sure...

rare python Jul 4, 2025, 11:15 AM

#

ocean vortex I thought there must have been API for it by now for sure...

Like how Elon said Grok 2 will be Open sourced when Grok 3 come out?

#

💊

#

🥴

ocean vortex Jul 4, 2025, 11:15 AM

#

yeah he's washed

#

so essentially this is how the only existent API version of full grok3 performs:

#

it's miles behind o3 and 2.5 Pro, probably even o1

torn mantle Jul 4, 2025, 11:19 AM

#

ocean vortex so essentially this is how the only existent API version of full grok3 performs:

is this with reasoning?

fleet lintel Jul 4, 2025, 11:19 AM

#

What happened to grok 3.5? Did they just drop the 3.5 release plan?

keen beacon Jul 4, 2025, 11:19 AM

#

they should credit the qwen team for using qwq 32b preview traces on their reasoning model 🤣

torn mantle Jul 4, 2025, 11:19 AM

#

#

with reasoning

ocean vortex Jul 4, 2025, 11:19 AM

#

torn mantle is this with reasoning?

no reasoning #general message

torn mantle Jul 4, 2025, 11:19 AM

#

its not bad

#

but it doesnt reflect really how the model feels

ocean vortex Jul 4, 2025, 11:20 AM

#

it "does not support reasoning"

#

💀

torn mantle Jul 4, 2025, 11:20 AM

#

it doesnt support it on api

#

since its pricey

#

and there isnt much performance gain compared to mini

ocean vortex Jul 4, 2025, 11:20 AM

#

torn mantle since its pricey

Nah that doesn't make any sense

#

it doesn't support it probably because they announced smth they didn't have

torn mantle Jul 4, 2025, 11:21 AM

#

they talked about that before

ocean vortex Jul 4, 2025, 11:21 AM

#

torn mantle they talked about that before

it's just an excuse. The model they tested was just some internal thing that wasn't production ready

torn mantle Jul 4, 2025, 11:22 AM

#

could be

ocean vortex Jul 4, 2025, 11:22 AM

#

it's the same size model their cost gonna be virtually the same just more tokens generated

keen beacon Jul 4, 2025, 11:23 AM

#

nah with much more output regularly its different

ocean vortex Jul 4, 2025, 11:23 AM

#

unless it's something entirely different but this would only make it worse and false advertising lol

keen beacon Jul 4, 2025, 11:23 AM

#

you can get higher overall throughput with low output requests

ocean vortex Jul 4, 2025, 11:24 AM

#

keen beacon nah with much more output regularly its different

Not significantly different. Current gpt4.1 and o3 pricing is a decent indicator

#

And claude pricing

keen beacon Jul 4, 2025, 11:25 AM

#

ocean vortex Not significantly different. Current gpt4.1 and o3 pricing is a decent indicator

semianalysis reported it being the case for o1/4o so it depends on grok 3's architecture. with 4.1 there might've been architectural changes as well, so the cost difference is significantly less/there's a margin on 4.1 pricing already

ocean vortex Jul 4, 2025, 11:26 AM

#

keen beacon semianalysis reported it being the case for o1/4o so it depends on grok 3's arch...

wdym. I think fundamentally it remains true that if you have the same base model, the cost is not gonna be significantly different reasoning vs no reasoning

keen beacon Jul 4, 2025, 11:26 AM

#

ocean vortex wdym. I think fundamentally it remains true that if you have the same base model...

semianalysis reported about the kv cache/etc making o1 pricey

#

you dont need to take my word for it

torn mantle Jul 4, 2025, 11:27 AM

#

ocean vortex it's the same size model their cost gonna be virtually the same just more tokens...

grok 3 mini = small model + trained intensively more than grok 3
grok 3 = more params + wasnt trained a lot

#

ocean vortex Jul 4, 2025, 11:27 AM

#

keen beacon semianalysis reported about the kv cache/etc making o1 pricey

yeah but it's not gonna be a game changer difference, plus everyone optimized infra for reasoning much more lately.

torn mantle Jul 4, 2025, 11:27 AM

#

i remember they mentioned that in their slide

keen beacon Jul 4, 2025, 11:27 AM

#

ocean vortex yeah but it's not gonna be a game changer difference, plus everyone optimized in...

yeah, but grok 3 might've not optimized for it in the architecture initially

#

4.1 had architectural changes, speculation but i suspect so

#

"architecture" changes don't necessarily mean huge adjustments anyway btw

rare python Jul 4, 2025, 11:28 AM

#

torn mantle i remember they mentioned that in their slide

But that's for grok 3 preview right? the grok-3-latest they trained more, and it's on Openrouter right now

keen beacon Jul 4, 2025, 11:28 AM

#

ocean vortex yeah but it's not gonna be a game changer difference, plus everyone optimized in...

I agree, likely not a big jump. They are already coping saying its for cutting edge engineers and not average Joes so you wont notice the difference. BS

torn mantle Jul 4, 2025, 11:29 AM

#

here

#

https://x.com/Yuhu_ai_/status/1892462652737409491

Yuhuai (Tony) Wu (@Yuhu_ai_)

@teortaxesTex If you rly care about pass@1, you should compare numbers to mini. I explained in the live demo that we trained mini version for many more steps. The Grok3 proper is much bigger and takes longer to train. So we just released an early version. It’s already very smart. But it’s

ocean vortex Jul 4, 2025, 11:29 AM

#

keen beacon 4.1 had architectural changes, speculation but i suspect so

initially there was a huge difference in price before they reduced the price of o3. This indicates to me that they have very uneven profit margins rather than there was something they actually changed. People were covering their R&D expenses on reasoning models

keen beacon Jul 4, 2025, 11:29 AM

#

ocean vortex initially there was a huge difference in price before they reduced the price of ...

no i think thats true as well

#

i think thats true and that there were architecture changes (might as well if you're doing cpt on 4o)

#

so the raw cost gap is much lower now

torn mantle Jul 4, 2025, 11:30 AM

#

rare python But that's for grok 3 preview right? the grok-3-latest they trained more, and i...

did they ever release grok 3 latest?

rare python Jul 4, 2025, 11:30 AM

#

torn mantle did they ever release grok 3 latest?

yes, on the API

#

https://openrouter.ai/x-ai/grok-3

Grok 3 - API, Providers, Stats

Grok 3 is the latest model from xAI. It's their flagship model that excels at enterprise use cases like data extraction, coding, and text summarization. Run Grok 3 with API

#

Jun 10 2025

torn mantle Jul 4, 2025, 11:31 AM

#

is grok 4 the successor of grok 3? since Yuhuai talked about that is a powerful model and its generalization can lead to some massive performance gains?

#

ngl im lost now

#

AAAAAAAAAAAAAAAAAAAAAAAAAA

#

they are confusing me

ocean vortex Jul 4, 2025, 11:31 AM

#

But either way... it makes no sense at all for grok3 reasoning to not be available on API. They could charge what they want even if cost was a big issue, which is unlikely assuming the same base model...

torn mantle Jul 4, 2025, 11:31 AM

#

torn mantle AAAAAAAAAAAAAAAAAAAAAAAAAA

im 19

#

so its fine

#

i can take it

rare python Jul 4, 2025, 11:32 AM

#

torn mantle im 19

you are 91

torn mantle Jul 4, 2025, 11:34 AM

#

so what now

#

https://polymarket.com/event/which-company-has-best-ai-model-end-of-july

Polymarket

Which company has best AI model end of July?

Polymarket | This market will resolve according to the company which owns the model which has the highest arena score based off the Chatbot Arena LLM Leaderb...

#

its your chance @rare python

rare python Jul 4, 2025, 11:35 AM

#

torn mantle its your chance @rare python

I don't engage in Polymarket

torn mantle Jul 4, 2025, 11:35 AM

#

rare python I don't engage in Polymarket

why

rare python Jul 4, 2025, 11:35 AM

#

torn mantle why

Because I don't care

torn mantle Jul 4, 2025, 11:35 AM

#

rare python Because I don't care

you are rich then

#

😖

rare python Jul 4, 2025, 11:36 AM

#

torn mantle you are rich then

No

torn mantle Jul 4, 2025, 11:36 AM

#

yes

#

yes yes yes yes yes

rare python Jul 4, 2025, 11:36 AM

#

torn mantle 😖

🫃

torn mantle Jul 4, 2025, 11:36 AM

#

rare python 🫃

you are pregnant?

#

how

rare python Jul 4, 2025, 11:36 AM

#

😏

keen beacon Jul 4, 2025, 11:38 AM

#

they just deleted the repository for some reason

#

i was gonna link you to it but huh

ocean vortex Jul 4, 2025, 11:38 AM

#

torn mantle https://polymarket.com/event/which-company-has-best-ai-model-end-of-july

that's a bad bet. There are only 27 days left for anyone to release the model on lmarena and then for the leaderboard to be updated

#

while betting for Google you don't get profit at all

#

terrible bet lol

rare python Jul 4, 2025, 11:39 AM

#

ocean vortex that's a bad bet. There are only 27 days left for anyone to release the model on lma...

gpt5 or grok 4 might be on top of lmarena but no sign yet

#

r2 is still cooking

torn mantle Jul 4, 2025, 11:39 AM

#

ocean vortex that's a bad bet. There are only 27 days left for anyone to release the model on lma...

i forgot that its based on lmarena

keen beacon Jul 4, 2025, 11:39 AM

#

they had a report about it in a pdf. i didn't download it in time.

but i have a screenshot of the readme:

ocean vortex Jul 4, 2025, 11:40 AM

#

rare python gpt5 or grok 4 might be on top of lmarena but no sign yet

very unlikely it's gonna happen by the end of July even if those models are gonna top the leaderboard eventually, IMO

torn mantle Jul 4, 2025, 11:40 AM

#

nah the one that may really top the arena is gemini deep think

keen beacon Jul 4, 2025, 11:40 AM

#

in the github issues there were also rumors of them training on benchmark splits etc. just weird

torn mantle Jul 4, 2025, 11:40 AM

#

well for this month its either grok 4 or gemini deep think

cedar tide Jul 4, 2025, 11:40 AM

#

Who knows this good model?
Solar pro 2 31B
https://x.com/upstageai/status/1924615644697526576?t=8XWQlT8v_fnBc8_LSjTI_Q&s=19

Upstage (@upstageai)

🚀 Small. Powerful. Now with Reasoning.

Solar Pro 2 Preview from @Upstage hits 70B-class performance with 31B size—especially in Korean. Chat for speed. Reasoning for complexity. One compact model that does it all.

📘 Full details → https://t.co/vGKgQrgWVA
🆓 Free API access

rare python Jul 4, 2025, 11:40 AM

#

torn mantle well for this month its either grok 4 or gemini deep think

I think no deep think on lmarena

#

Too long

#

same with o3 pro or o3 deep research

ocean vortex Jul 4, 2025, 11:41 AM

#

it needs to 1. Be released 2. Collect enough votes 3. Leaderboard has to be updated manually

rare python Jul 4, 2025, 11:41 AM

#

too long

ocean vortex Jul 4, 2025, 11:43 AM

#

It has to appear on lmarena like now or next week at the latest

#

for it to happen

rare python Jul 4, 2025, 11:43 AM

#

ocean vortex It has to appear on lmarena like now or next week at the latest

Have you noticed that OpenAI stopped pushing anonymous models in lmarena?

#

Last year they and Google dominated the mystery model testing

keen beacon Jul 4, 2025, 11:44 AM

#

google did way more

ocean vortex Jul 4, 2025, 11:44 AM

#

rare python Have you noticed that OpenAI stopped pushing anonymous models in lmarena?

yeah... Although this could be because they are working on gpt5. So different priorities and not much to share yet maybe

rare python Jul 4, 2025, 11:45 AM

#

ocean vortex yeah... Although this could be because they are working on gpt5. So different pr...

They did with GPT 4.1 but on openrouter, not lmarena

ocean vortex Jul 4, 2025, 11:45 AM

#

rare python They did with GPT 4.1 but on openrouter, not lmarena

But that was competing directly against 4o-latest, so lmarena elo was useless for them..

rare python Jul 4, 2025, 11:45 AM

#

ocean vortex But that was competing directly against 4o-latest, so lmarena elo was useless fo...

They maxxed elo with gpt4o

ocean vortex Jul 4, 2025, 11:46 AM

#

rare python They maxxed elo with gpt4o

exactly, 4.1 was very similar model to 4o-latest just less tuned for human preference

rare python Jul 4, 2025, 11:46 AM

#

o3 is more general and good outside lmarena

cedar tide Jul 4, 2025, 11:46 AM

#

I didn't understand anything, what was the debate about grok 3 reasoning?

leaden sun Jul 4, 2025, 11:47 AM

#

ocean vortex yeah... Although this could be because they are working on gpt5. So different pr...

after talking with o3 a bit yesterday, thats what i thought too, they're focusing on gpt5/6 and proto-AGI for the next 5y?

ocean vortex Jul 4, 2025, 11:49 AM

#

leaden sun after talking with o3 a bit yesterday, thats what i thought too, they're focusin...

They are focusing on hybrid reasoning. But since none of their current models are like that it makes little sense to continue updating them often and trying different things. It's like a reset with gpt5

rare python Jul 4, 2025, 11:50 AM

#

No tease or leak of Gemini 3.0

leaden sun Jul 4, 2025, 11:50 AM

#

reset is a fitting description from what I've been told by o3

#

2030 has an almost 50% chance, but apparently it will look very different from what you originally thought

keen beacon Jul 4, 2025, 11:51 AM

#

@ornate agate https://xcancel.com/RealJosephus/status/1940730646361706688#m i dont have the report pdf, but there's also screenshots here of it

Nitter

Joseph (@RealJosephus)

Well, some random Korean guy ("Do-hyeon Yoon," prob not his real name?) just claimed Huawei's Pangu Pro MoE 72B is an "upcycled Qwen-2.5 14B clowncar." He even wrote a 10-page, 8-figure analysis to prove it. Well, i'm almost sold on it. https://github.com/HonestAGI/LLM-Fingerprint

rare python Jul 4, 2025, 11:53 AM

#

leaden sun reset is a fitting description from what I've been told by o3

Do you like o3's writing style?

leaden sun Jul 4, 2025, 11:54 AM

#

well...

rare python Jul 4, 2025, 11:54 AM

#

It's too concise and packs jargon with casual slang, so it's awkward for me to read

leaden sun Jul 4, 2025, 11:54 AM

#

i like it for scientific stuff

rare python Jul 4, 2025, 11:55 AM

#

true it's a bit censored

#

I asked for a 1:1 image prompt for me to generate that image in a different style

#

o3 told me my screenshot might be copyrighted

#

:v

#

What do you mean "mainstream"? Like it's too bland?

keen beacon Jul 4, 2025, 11:57 AM

#

i mean it just came out? if there were to be a proper response, i think it'll take a while

#

in github issues, the "honestagi" guy was talking about whistleblowers and stuff

#

them training on benchmark splits (but out of scope of his report)

#

them switching out the qwen 2.5 tokenizer and doing research on how to do it with minimal loss

leaden sun Jul 4, 2025, 12:01 PM

#

rare python What do you mean "mainstream"? Like it's too bland?

i think the opposite to "mainstream" here would be alternative perspectives or even conspiracy theories, you know, even in science, there are theories generally accepted by the majority despite having plenty of flaws, inconsistency and irrationality

keen beacon Jul 4, 2025, 12:02 PM

#

yeah, i was talking about it when it was up

rare python Jul 4, 2025, 12:02 PM

#

leaden sun i think the opposite to "mainstream" here would be alternative perspectives or e...

I... can't imagine it. Can you give an example?

keen beacon Jul 4, 2025, 12:03 PM

#

people found the report compelling though

leaden sun Jul 4, 2025, 12:04 PM

#

rare python I... can't imagine it. Can you give an example?

this is a good topic for an intellectual conversation with an AI isnt it? give it a try 🤗

rare python Jul 4, 2025, 12:04 PM

#

leaden sun this is a good topic for an intellectual conversation with an AI isnt it? give i...

let me ask o3 to explain what you mean about o3's writing style

leaden sun Jul 4, 2025, 12:05 PM

#

use the dual conversation in lmarena, together with opus @rare python

pure anvil Jul 4, 2025, 12:08 PM

#

https://web.archive.org/web/20250704072325/https://github.com/HonestAGI/LLM-Fingerprint

GitHub

GitHub - HonestAGI/LLM-Fingerprint: Our paper about robust LLM fing...

Our paper about robust LLM fingerprints. Contribute to HonestAGI/LLM-Fingerprint development by creating an account on GitHub.

#

Huawei definitely has an incentive to do something like that

ocean vortex Jul 4, 2025, 12:11 PM

#

Unironically o3 is good at finding things though. Found this paper where grok3-think was independently tested:
https://arxiv.org/html/2505.11854v1

#

I have no clue how they got access to it lol

#

invite only beta API or smth

cedar tide Jul 4, 2025, 12:13 PM

#

torn mantle <@419074580515389450> how is it

stonebloom level

ocean vortex Jul 4, 2025, 12:13 PM

#

Deepseek outscored everything though lmao

keen beacon Jul 4, 2025, 12:14 PM

#

its probably not sanctioned by the higher ups

#

if true

ocean vortex Jul 4, 2025, 12:14 PM

#

pure anvil Jul 4, 2025, 12:14 PM

#

I meant to look to the government as if they're catching up in AI, to secure subsidies

pure anvil Jul 4, 2025, 12:15 PM

#

ocean vortex

lol what bench is this

ocean vortex Jul 4, 2025, 12:15 PM

#

very unusual test. And their totals are very close for everything

ocean vortex Jul 4, 2025, 12:16 PM

#

pure anvil lol what bench is this

it's from that paper

rare python Jul 4, 2025, 12:16 PM

#

leaden sun use the dual conversation in lmarena, together with opus @rare python

Opus 4 Thinking:

They're explaining that by "mainstream" they don't mean bland in terms of personality or writing style - they mean the AI tends to present conventional, widely-accepted viewpoints rather than alternative or contrarian perspectives.

Their point is that LLMs typically stick to established consensus views on topics, even when those mainstream positions might have flaws or contradictions. For example, an LLM discussing a scientific topic would likely present the dominant scientific consensus rather than minority theories or critiques of that consensus.

They're essentially saying LLMs have a bias toward "safe" orthodox positions - the kinds of views you'd find in textbooks or Wikipedia - rather than exploring fringe ideas, unpopular theories, or perspectives that challenge conventional wisdom. It's not about being boring, but about being intellectually conservative and avoiding controversial or alternative viewpoints.

This makes sense given how LLMs are trained and the guardrails put in place - they're generally designed to provide reliable, mainstream information rather than promote unverified claims or fringe theories. But your conversation partner seems to see this as a limitation, suggesting that sometimes the mainstream view isn't necessarily the most accurate or complete picture.

keen beacon Jul 4, 2025, 12:16 PM

#

its cheaper to upcycle/do continued-pretraining rather than train the moe model from scratch. now that i think about it, the higher ups should know something is off because of pre-training costs or they were sold on somehow very efficient training

pure anvil Jul 4, 2025, 12:17 PM

#

ocean vortex

"Reasoing Type" it's good that at least they wrote it themselves lol

keen beacon Jul 4, 2025, 12:18 PM

#

i think i vaguely recall mixtral 8x7b doing a similar process (but ofc with their own 7b)

#

post-training is also so much cheaper than pre-training