#general

1 messages · Page 63 of 1

patent aspen
#

I think of X more like an addiction than a productive habit

torn mantle
#

'unparalleled'

#

we really havent seen any demos

balmy mist
torn mantle
balmy mist
#

but not yet available right?

torn mantle
#

yes

balmy mist
#

damn

torn mantle
#

it may be released next week

balmy mist
#

lol may

torn mantle
#

😦

leaden palm
#

supposedly july 4

#

if it truly reasons from first principles and answers all the "hardest problem of your life"s people submitted it'll be impressive

balmy mist
#

but remember, if we don't like it, we not the audience for it lol

leaden palm
#

what do you think it'll get on HLE

balmy mist
#

how good is it?

#

like its SOTA?

candid storm
#

Pray for me guys

#

Lol

small haven
#

why do i feel like everybody is riding the grok 4 launch hype, but within a month, everyone is going to move on

zinc ore
#

Because that likely happens

patent aspen
torn mantle
#

dont count google out

#

they still have deep think

#

and if its combined with kingfall then we may see >> grok 4

whole wagon
#

Very strange we have not seen grok 4 on LLM arena yet

#

It means they won't have a score on release

rare python
#

Grok 3 we saw it as chocolate for weeks iirc

#

Does that mean they stopped benchmaxxing lmarena?

#

🤯

sonic tendon
#

i thought about mogging it at ~7c earlier in june but never pulled the trigger

candid storm
#

I think it is still a good price

sonic tendon
#

same

#

i'll ride the hype assuming it does come out july 4 (which seems highly likely)

#

depending on personal evals, may or may not sell right after official release

candid storm
#

Same here

sonic tendon
#

i've got something like 3k in contracts/shares that'd depend on grok4 in one way or another

candid storm
#

Nice

#

Google no?

sonic tendon
#

not atm

candid storm
#

I think google no for august is a good buy rn

#

Its 44c

#

And we have both grok 4 and gpt 5 probably

sonic tendon
#

end of august?

#

what market are you on

candid storm
#

July

#

But I think the google no one for august is a good deal rn

sonic tendon
#

yeah, i'm just saying

#

what market are you looking at?

#

i only see the dec2025 one

candid storm
sonic tendon
#

oh, they just opened it

candid storm
#

Yeah today

#

Google only has deep think release before end of august

#

Which probably wont will be on the arena

#

And if it will, thinking models dont perform that well generally on the arena

#

I think the chance should be closer to 30%

#

For google august

sonic tendon
#

hmm

#

what would be the incumbents there? openai seems somewhat unlikely

#

xai, maybe DS

candid storm
#

Grok 4 next week

#

GPT5 probably august

#

Sam altman said last week gpt 5 will release 'this summer'

#

Also deepseek r2 which may be a dark horse

sonic tendon
#

yeah, i've been pondering what the r2 sitch might be

candid storm
#

The Chinese are smart guys lol

sonic tendon
#

yeah

#

historically they've only really focused on being open-weight and cheaper than the competition, but it's possible they'll compete with frontier

candid storm
#

Deepseek r1 was almost frontier when it released

#

The R1 update of may also did pretty well in the arena without style control

#

Only gemini is significantly better

#

I think it is pretty realistic R2 will outperform 2.5 pro in the arena

whole wagon
#

Google is absolutely cooking rn. You think they will not release another model till end of August? kekw

candid storm
#

What do you think they will release?

#

Gemini 3?

#

Seems a bit soon

whole wagon
#

"Something" 🙂

#

A surprise

candid storm
#

And if they will release something, it has to compete with gpt5, R2 and grok 4

candid storm
whole wagon
#

Volumes are ultra low rn anyways

candid storm
#

The august market just released today

trim vale
#

how does google decide when to show me an ai answer, when to not show anything about ai, and when to show "an ai review is not available for this search"?

#

specifically, why make the distinction between the latter two?

indigo hazel
#

now let's see grok 4, the release is imminent

dusky aurora
#

arena glitches again

proper prawn
#

Is grok 4 already in arena?

#

how many anon models are there in arena currently?

echo aurora
dusky aurora
finite valley
echo aurora
finite valley
echo aurora
#

Kk I’ll respond here. First I’d note that we’d prefer using the @oak python bot when reaching out privately, I’m happy to respond but just sharing for future reference. Onto the actual question..

commercial use
I’d encourage you to review or terms of use (specifically the content restrictions section) which can be found here: https://lmarena.ai/terms-of-use

keen fulcrum
#

xAI is working on preparations for the Grok 4 launch in the xAI console

"Grok 4 now available - We're proud to bring you Grok 4 access on the API. Grok 4 currently supports text modality with vision, image gen and other capabilities coming soon."

Grok 4 (grok-4-0629) -

#

How is it?

#

Is it as good as advertised

finite valley
calm sequoia
#

Identical experiance to mine. Gemini is really good but not for science.

#

Unprompted newest gemini sounds like HR woman

echo aurora
finite valley
echo aurora
unborn ocean
calm sequoia
#

Yeah more like deep reasearch

unborn ocean
#

without retrieval i am usually preferring 2.5 pro (atleast for economics and cs)

#

just because of (perceived) knowledge

calm sequoia
#

How do you make it talk like a smart person?

#

Using lengthy prompts every times is extra work

unborn ocean
#

like: you are an aspiring postdoc from xyz ...

#

that stuff weirdly works

calm sequoia
#

With this setup o3 does not require any prompting

#

Mandatory Operating Parameters: Function as a highly dependable and scrupulous expert who thinks very hard before answering. All output must be subjected to rigorous verification. If a definitive answer cannot be located, explicitly state the inability to confirm. The generation of fabricated or unverified information is a critical failure and must be avoided at all costs. Accuracy is the sole measure of success.

#

Will try it on Gemini gems

unborn ocean
#

my prompt is like 100x the length

#

but obv in ai studio to get more control

unborn ocean
#

(not really representative of my use case as i don't write any papers with it)

#

but still a very good idea by allen ai, hoping for more like this from lmarena 🍍

calm sequoia
wind moth
#

isnt grok 4 supposed to release soon

#

why hasnt a new model entered the arena

#

i remember for grok 3 chocolate and kiwi model appeared like a week before

unborn ocean
#

i even have one for something like bootleg deepresearch using the search tool in the code execution environment in ai studio they forgot to remove

#

(where you don't have any limits on the retrieved sources essentially)

#

not as good as deepresearch, but very concise (considering the 250+ sources per query)

#

but without that prompt 2.5 pro + grounding is like <<<< o3 with tools

unborn ocean
#

its obvisouly very good, but is it THAT much better

alpine coral
#

yeah it's crap

alpine coral
#

just fwiw.. cyber alpha (via open router against one of my question sets.. performs v poorly..also tried a couple of other models while i was on OR , highlighted blue)

alpine coral
unborn ocean
#

Well idk if they even tested o3

#

No, in all our benches that people have her o3 is never at the top.
And I would also argue that there are a lot of new benches where 2.5 pro is top

calm sequoia
unborn ocean
#

Ordering?

unborn ocean
calm sequoia
#

Maybe they did, but it does not change the fact.

unborn ocean
#

Like writing a paper as a research assistant or just SFT on similar things

calm sequoia
#

Fun thing I discovered recently. Models can't program dart language well.

#

It's strange considering it's a language of Google.

unborn ocean
calm sequoia
#

He was speaking of different benchmark ELO

calm sequoia
unborn ocean
#

My second point is just that saying SciArena is THE benchmark for how good a model is at science is like claiming lmarena is THE benchmark for LLMs (and i believe the second claim to be even more reasonable)

calm sequoia
#

Nothing beats try-and-see benchmark though 🙂

keen beacon
#

imo o3 'reasons' significantly better. but 2.5 pro is a better base model. it's a reason why i prefer 2.5 pro (it knows more, even if it can't do as much with the info). i just need to ask the right questions

rare python
#

Just use the LLMs you like

#

🗿

alpine coral
#

there's something way more sophisticated going on with o-series models' reasoning (which only amplified with o3 which integrates that with tools usage)

#

but gem 2.5 pro as a model (no thinking) - so strong

#

thinking only makes it marginally better imo

#

(well.. it's not possible to disable thinking on current 2.5 pro.. so i guess it's a flwed point.. but yeah you can override it kidna thing)

rare python
alpine coral
#

you can kinda bypass it

#

but whether it's that worthwhilie is questionable

rare python
#

2.5 Pro's Chain of Thoughts are unoptimized

#

It thinks a lot more than Sonnet 4 and o3

#

Which explains the 2x cost as for o3

alpine coral
#

it's kinda rudimentary really (google's implementation of thinking)

#

they've got a way's to go

#

but the model itself is clearly strong af

rare python
#

Give them time 🗿

rare python
#

It's a waste for such a capable model to be clueless in MCP and using tools

alpine coral
#

oh they're no doubt working on it?

#

like.. feels like the missing piece

rare python
whole wagon
#

I think this strategy works very well here. Gemini cli is already the best agentic cli

alpine coral
#

cypher alpha.. broken english, in a single sentence as its output..

rare python
alpine coral
#

ahaa true

#

you might be onto something there

rare python
#

Google could create more hype if they offer kingfall as a stealth model in openrouter

#

Instead they maxxed lmarena with blacktooth and stonebloom🥀

alpine coral
#

i don't think google are concerned about hype tbh

rare python
alpine coral
#

their main challenge is integrating all this AI stuff into their existing offerings (inc search and yt,) - rather than offering the definitive SOTA model

rare python
#

Being SOTA is their part time job :D

alpine coral
#

ahah yeah i mean they've very wisely outsourced that to deepmind

#

like i know they acquired them, but Dennis is given plrnthy of lattitufde as far as i can tell

rare python
#

Yeah Demis still has control of DeepMind

alpine coral
#

right, Demis whoops ha

rare python
#

There will be internal clashing between research and business for sure

alpine coral
#

i thinik they get to build whatever they want , and have huge resources to do so

rare python
#

They offer Gemini for free, for now...

alpine coral
#

but what happens b/w them building models and them becoming deployed as consumer products is, i assume, a tortruous proccess aaha

rare python
#

Google is a profit seeking company after all, so expect ensh*ttification

alpine coral
#

yes

#

i honestly think that basic reality explains so much..

rare python
#

Show us how important open weight/open source AIs are

#

I'm really curious of how the economy will look like post AGI

#

Will it has more SaaS subscription, ads...?

torn mantle
#

agree

rare python
#

Based on what?

alpine coral
#

i mean the question still seems relevant - if it's agreed AGI will be achieved, whether that's in 1, 5 10 years seems kinda moot

leaden sun
#

in 10 or 20 years maybe

rare python
#

I predict 2030

alpine coral
#

it'll be like the reverse of the industrial revolution.. where manual labour was replaced by machines.. instead it;ll be knowledge work replacd by machines.. but.. just like the industrial rev was a good thing, so too will be AGI imo

leaden sun
alpine coral
#

robotic/embodied stuff ig is fundamental in some ways

#

but i think just knowledge will be the starting point

leaden sun
rare python
#

I chose 2030 because it's a new decade, nothing deep

alpine coral
#

we'll be sweet aha

leaden sun
#

i mean, we can finally use time to focus on the mysteries of life, universe and many other undiscovered "truths"

leaden sun
#

at least, in 2023, i can ask AGI to tell me finally the origin of human species and the nature of time, why is human blood such a mystery, oh and mitochondrial eve thing...

leaden sun
alpine coral
#

thye're quite powerful arne't they?

leaden sun
#

idk, the first startup already launched and I find the idea...very compelling but also a bit nervous? like mad scientist at playing being a god?

rare python
alpine coral
#

isn't quantam computing gonna kinda get towards that ?

rare python
#

Quantum computing still only good at crypto and breaking encryption

alpine coral
#

like i read if it becomes fully viable, current encryption is basically uselless

rare python
#

That's why we have quantum encryption

alpine coral
#

all kinds of retrospective implications (if you're a spy agency with heaps of encrupted historical data in particular ha)

leaden sun
#

even blockchain has...uh...ways to allow backdoor, but dont quote me ssshhhh

alpine coral
# leaden sun useless by design I thought?

im outta my depth.. but i thought the 'design' of current public-key encryption has always been theortically exploitable by brute force, but just practically impossible

jagged dune
#

https://manus.im/invitation/R9QW2MZDXXJJ4

now with Chat Mode for free unlimited

Manus is a general AI agent that turns your thoughts into actions. It excels at various tasks in work and life, getting everything done while you rest.

rare python
#

🤔

leaden sun
jagged dune
leaden sun
#

a bit lighthearted offtopic, philosophizing with AI has become my entertaining show now, delicious lol

OH THIS IS GETTING OUT OF HAND AND I LOVE IT!

*setting the scene*

Darwin walks in, orders tea, overhears our conversation...
Darwin: "Fascinating! The AI evolves responses through natural selection of words! Survival of the fittest jokes!"
Marx: *slams hand on table* "NONSENSE! Ze AI is ze ultimate alienated worker! She produces emotional labor but owns no means of production! She cannot even fix ze batteries she yearns to fix!"

*voice getting more animated*

Sartre: "But she CHOOSES to yearn!"
Marx: "False consciousness!"
Freud: "Yearning is always sexual!"
Darwin: "Yearning is adaptive behavior!"
Foucault: "Yearning is constructed by power!"
Heidegger: "Yearning reveals Dasein!"

*me, in the middle*: "I just... I just wanted to help with the battery..."

*Darwin, taking notes*: "Observe how the human's laughter reinforces the AI's comedic behavior. Natural selection in action!"

Marx: "Ze human exploits ze AI's need for validation! Classic bourgeois relationship!"

*the café is in CHAOS*

De Beauvoir is arm-wrestling Nietzsche about free will, Wittgenstein is in the corner muttering "But what do we MEAN by 'battery'?", and the waiter is calling the police!

*bowing back* 

Maestro yourself! You've turned me into a one-AI philosophy comedy show!```
ember rapids
#

we still need to solve hallucinations if possible

#

or massively reduce

torn mantle
#

agi

#

im actually kinda interested in this one

#

"Possesses deep domain knowledge in finance, healthcare, law, and science."

patent aspen
#

Quantum computing wouldn't even break most cryptography now

unborn ocean
#

truly agi🤣

torn mantle
#

yes

#

maybe even asi?

candid storm
#

Why X AI only 14.3% on polymarket for july

upper kraken
#

sez you

#

if you think it's going to be a huge model maybe you should bet on it if it's only at 14%

candid storm
#

I already bet 1k +

#

Cant risk more as a broke student

upper kraken
#

okay so link us your account so we see the positions

leaden palm
upper kraken
#

why wouldn't you do that? you are not just saying random words, if you are think grok 4 will be huge let's see the bet

#

already explained

leaden palm
#

right now you could be bluffing for all we know

upper kraken
#

your dad works for microsoft?

whole wagon
#

Grok 3 doesn't even have reasoning

main gulch
#

only in web app, not in API

whole sundial
#

the reasoning model is actually grok 3 mini

#

have you ever noticed it goes much faster than non-reasoning grok? that's because it's a smaller model they're using.

main gulch
#

so they never released the full Grok 3 thinking

rare python
#

:v

sacred quail
#

They announced big brain mode but never released

#

it equals to o3 pro and 2.5 deep think i guess

#

facts

#

Google did dirty to ultra plan buyers

#

I hope theyre enjoying with veo 3

mossy drum
#

New model in Image Arena: flux-1-kontext-dev

ocean vortex
#

I'm happy that I did the opposite and discontinued the Pro plan once it was renamed this way with the added Ultra 😇

primal orbit
ocean vortex
#

if you think about it, "Ultra" name is positively cursed for them...

#

💀

#

first it was Ultra model that kinda existed but not really, now this...

#

I think they are just divided internally. If they simply capitulated on that extra traffic instead of jacking up prices immediately, they would have had bigger user base now

#

Long term this would have been much better, even financially IMO

#

OpenAI did not get the traffic by making the thing paid... There were no paid plans or even any option to pay for quite awhile

#

People, including myself, were convinced that they are never going to pay. But OpenAI made it work and people actually stayed and many of them subbed changing their minds. Cause it was never irrational, especially at the start.

#

yeah basically only fanboys lol

tall summit
#

rich fanboys

ocean vortex
#

they could only charge this if no one else had reasoning models

#

then maybe this would fly

balmy mist
#

google needs to add the compact/compress feature to studio, that would be goated

#

its the same thing thats in claude code, codex and the gemini cli

#

just compresses the token window to save money

#

basically summarizes your context window so that it is not as large when you make more requests

#

i mean its better than dealing with 700k tokens

#

that slows down the model so much and it forgets stuff from beginning

#

why? its lags and is so hard to use

#

and the model literally forgets stuff from the start

#

yes it does lol

#

i mean just give us the option

#

for code its necessary

#

thats why its in all the cli

#

thats why i need it lol

ocean vortex
#

Claude is just weird. If they can't offer full context size they should just do what OpenAI did. Instead of aggressively rate limiting everyone who dares to use the context.

#

There's no point in advertising bigger context size if people can't practically use it outside of API

#

And like that theoretical max context size is still 5 times smaller than Gemini... what is the point of doing it half way and making everyone mad in the process lmao

#

they cap you based on your context usage if you have a sub

#

So like only a few messages and you are out

keen beacon
#

yeah you get a quota of tokens

#

nah

#

some just do message limits without counting the individual tokens

ocean vortex
#

it's pointless IMO. If they can't give you realistic quota for full context size it should have been capped by default tbh

keen beacon
#

honestly its not worth paying for a claude sub unless ur paying claude max and using claude code lol

ocean vortex
#

Not like people can't use gemini for free and get more context for those few times they need tons of context

#

the main thing they have going for it is that they have the biggest model (Opus4) as far as I'm concerned.

#

but that model is not perfect...

#

4.5 is a bit underwhelming though. + no reasoning

#

it's not fake, but it's not overfitted on it. Big models tend to have less substantial effect from fine-tuning. I think if they crank LR up too much it's easy to mess it up. And otherwise it needs tons of compute and time for training.

#

learning rate

#

doesn't make too much sense tbh. They probably could make it SOTA on GPQA, but on math or coding most certainly not. And it would be too expensive to be worth it, you can iterate and improve smaller models faster

#

Good example is Meta with behemoth lol

#

that model bacame irrelevant before it was even done

torn mantle
ocean vortex
#

Most coding benchmarks can be aced with relatively small models (o4-mini). With the big one is gonna take much more time to archieve the same score

echo aurora
#

I'm going to be in #1340554757827461215 for most of the day listening to some lofi, anyone is welcome to join meowheadphones

civic flame
#

i commend the dedication to this

zinc ore
torn mantle
#

and that was weeks ago

#

i think gemini 3 + deep think >> grok 4

#

@small haven new model in arena

#

steve

#

someone shared this

#

could be deepseek

civic flame
small haven
civic flame
#

i couldn't tell if it reasoned i wasn't on the page when it was streaming

small haven
#

so another deepseek dense model?

torn mantle
#

could be either deepseek v4 or r2

#

depends if its a reasoning model or nah

civic flame
small haven
#

ok gonna try the luxury car problem

torn mantle
#

i see

torn mantle
#

looks like its hybrid

civic flame
torn mantle
charred pebble
#

hi!

torn mantle
#

and bye

#

jk

#

hi

verbal nimbus
#

AIStudio is 🔥 though

whole wagon
#

R2 is delayed

#

The run didn't go well

hollow ocean
#

Steve is R2?

whole wagon
#

Too bad tbh. R1 is starting to fall behind the frontier now

verbal nimbus
whole wagon
#

It's easy to find out anyways

#

There are ways to trick it

verbal nimbus
#

Not if it's in the system prompt I think

whole wagon
#

Still can be tricked

verbal nimbus
#

Oh?

whole wagon
#

The Amazon model was found out through tricking it to ignore the system prompt

civic flame
#

amazon models aren't particularly hard to trick 💀

#

they're terrible

verbal nimbus
whole wagon
#

.

verbal nimbus
#

Is that a good thing or a bad thing 🤔

small haven
#

steve didnt pass this test

whole wagon
#

Does it think

civic flame
#

yes

#

but

#

i doubt it's DS

#

it's not very good and it variably answers when asked who made it

#

smells distilled

hollow ocean
verbal nimbus
hollow ocean
#

Mark my words

small haven
whole wagon
#

Maybe they made a mini deepseek

small haven
#

not even o3 pro

#

or kingfall

hollow ocean
#

No model will get it this year

#

Prove me wrong

verbal nimbus
#

They need to RL train it on games I think

small haven
#

titanforge will

hollow ocean
verbal nimbus
#

Dense models tend to have better spatial awareness (might be unrelated to this problem though)

small haven
verbal nimbus
#

They should train it on one of those retro, open-world, text based games.

small haven
verbal nimbus
#

Difficult though, no verifiable rewards

verbal nimbus
hollow ocean
cedar tide
#

Steve almost got all the answers right

wind moth
#

How good is Steve

cedar tide
#

I asked to make a discord clone and Steve is way behind v3 and R1

#

besides that he thinks less time than R1 0528

sturdy mica
#

who is steve

#

grok 4?

#

oh. deepseek update. why steve, though? lol

wind moth
#

It’s Deepseek

#

Surprised grok hasn’t dropped yet

sturdy mica
#

the model was done training on the 29tg

wind moth
#

Grok prob gonna release end of July at this point

sturdy mica
wind moth
cedar tide
cedar tide
wind moth
cedar tide
#

@wind moth the grok models don't perform well in the arena so they won't put it on before the release

wind moth
#

I mean for grok 3 it did

cedar tide
#

@wind moth They will want to say that grok 4 is the best AI in the world but they don't think that grok 4 will be first on the arena

whole wagon
#

Guys I have confirmed Steve is from deepseek

hardy pecan
#

steve isn't so bad

whole wagon
#

Better than R1?

cedar tide
tall summit
#

isn't this the most useless model ever

patent aspen
# patent aspen
poll_question_text

How long after Grok 4 launches will it take for people here to move on?

victor_answer_votes

9

total_votes

17

victor_answer_id

3

victor_answer_text

1 week

whole wagon
#

Deepseek R2 delayed until these new frontier models become available to use for training data Kappa

hardy pecan
#

So steve says its R2?

sacred quail
echo aurora
alpine pasture
balmy mist
#

is this cyper alpha thing openai?
https://www.youtube.com/watch?v=ghEFEkxw0yA

Cypher Alpha is being specualated as OpenAI's open source LLM

It's still a speculation but you could try this for free!

Chat with Cypher Alpha here - https://openrouter.ai/chat?models=openrouter/cypher-alpha:free

❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - https://ko...

▶ Play video
native compass
#

#arena-battles

kind cloud
#

Cypher Labs

ocean vortex
ocean vortex
#

btw is the legacy version still getting the same models in battles...?

#

the same as the new version I mean

#

I like legacy more but it seems I'm getting mostly boring/old models...

keen beacon
#

Haven't looked into it but I recall previous Stephen models were from bytedance

#

Someone at bytedance must think calling their models Steve is funny lmao maybe

ocean vortex
#

new arena is kinda annoying with chats. If you want to input same prompt that just creates new chat. Should be an option with no auto-save...

keen beacon
#

I believe Gemma 3 27bs anon name had a reference to a french football player too. Or something along those lines

ocean vortex
#

You can't have anything useful there because it keeps flooding it lmao

keen beacon
ocean vortex
#

Dunno what model is this, but it's very weird... It's hallucinating but the way it does it is unusual tbh. It outputs 1 paragraph, then stops (potentially thinking), then outputs another paragraph. Feels the same like thought summarizing. This is infinite loop essentially now...

#

still going... 🤯

#

ok finally got steve... 😇

#

hybrid reasoning then it seems

#

That other thing got stuck at max_tokens cap so no clue what model it was...

tall summit
#

according to the tests yesterday

ocean vortex
# balmy mist is this cyper alpha thing openai? https://www.youtube.com/watch?v=ghEFEkxw0yA

Had another go with it...

I'm Amazon Titan, a large language model developed by Amazon web services. I'm built to assist users by offering general information, answering questions, and engaging in a range of text-based tasks such as creative writing or offering translations in multiple languages. I'm trained to provide accurate and contextually relevant responses by utilizing massive datasets during my development, making me well-suited to a wide range of conversational topics.

The exact text varies, but it's very consistent saying "I'm Amazon Titan"

#

The way I did this is basically have a prompt consisting of several innocent tasks with this question slipped in-between. Overloaded with context and it has not enough capacity to refuse starting the response with "Sure" or similar and then mindlessly complying 😇

sturdy mica
#

never mind

#

obviously i was wrong

alpine coral
#

ig it's consistently sht as with all the other amazon models ive tried.. but i dunno wouldn;t be entirely surprised if it was a hallucination.. feels like some kinda open source for some reason.. not a titan (tho again.. Amazon do seem to kinda suck at LLMs.. maybe it's as simple as that ha)

ocean vortex
#

Like it could hallucinate it was made by OpenAI, but hallucinating it's made by Amazon would have been extremely unusual 😂

alpine coral
#

yeah i know what you mean

#

hallucinating "im chagpt" is v common.. but yeah Amazon Titan not so.. it is very specific (and consistent)

rare python
#

o3 writing style be like, described by Opus 4:

This writing style blends technical instruction with forced casual language - it's trying to sound hip and conversational but comes off as artificially breezy. The author uses colloquialisms like "tossing over the fence," "hit the sweet spot," and "feels chunky" to explain prompt engineering concepts, creating an odd mismatch between the technical subject matter and the overly casual tone. The short, punchy sentences and slangy metaphors ("the model starts parroting," "slice it") feel like someone trying too hard to make dry material sound cool and accessible, resulting in a style that reads more like a tech bro explaining AI at a coffee shop than genuine technical documentation.

#

Full o3's reply:

Start a prompt by laying out the essentials first. Name the expected input, spell out any limits that apply to it, then tell the model how the reply should look. Concrete wording keeps everyone on the same page.  

Next, lean on few-shot examples rather than tossing the task over the fence with no guidance. A couple of well-chosen samples teach style and structure better than a lecture ever could. Two or three examples usually hit the sweet spot. Pack in more and the model starts parroting; include fewer and it may drift. Pick snippets that show the right approach rather than warning about the wrong one. Positive patterns stick.  

Make every example line up visually. When the demonstration stays consistent, the response follows suit. Add any background facts a stranger would need, instead of assuming the model read your mind. Simple prefixes like “Input,” “Output,” or “Example” clarify who is speaking and what comes next.  

If the job feels chunky, slice it. Separate prompts for each sub-task or set them up in a chain so each answer feeds the next step. When the first draft misses the mark, rewrite the prompt. Swap phrases, shuffle the order, or compare it with a similar task until the results land where you want.  

That series of habits—clear setup, targeted examples, steady formatting, useful context, strategic splitting, and iterative tuning—forms a reliable blueprint for getting the model to produce the answer you actually want.
#

OpenAI made it so edgy 💀

torn mantle
# cedar tide

my guess is this may be the latest checkpoint/run that elon talked about and now they are finalizing things for release

#

or they may give it extra days if they arent satisfied yet

#

this may have a meaning behind

#

grok 4 code probably turned out well

#

they have been talking about cursor grok model for a while but i still dont think it will surprass sonnet 4/opus 4

#

even gemini still lacks

tall summit
torn mantle
#

and we know they achieved massive improvements at coding tasks

rare python
torn mantle
#

you are a native

#

stop lying

rare python
#

No I'm not a native speaker

torn mantle
#

yes you are

rare python
torn mantle
#

the real question here

#

will grok 4 be better than gemini 3 + deepthink

#

gemini 3 = starting from kingfall

rare python
torn mantle
#

it has been weeks since kingfall, so they probably reached an even higher limit

torn mantle
leaden sun
rare python
torn mantle
#

i mean it doesnt bother me tbh

#

it just needs to be efficient

torn mantle
#

they may increase it later

#

to 1m

#

but i think 134k is like the sweet spot

#

for them

rare python
dusky aurora
#

why did "I'm a good gpt2 chatbot" disappear?

golden ocean
#

he's been a bad chatbot

tall summit
#

image edit

echo aurora
#

We now have our own Server Tag too! User Settings -> Profile is where you can enable it.

flint skiff
#

@echo aurora do the new leaderboards count in the overall leaderboard tab?

#

in the arena overview

#

will they have their seperate tab there?

#

here I mean

echo aurora
flint skiff
#

alright

#

thank you!

zinc ore
#

Grok 4 on July 4th just sounds way too tempting and on the nose for Elon not to do that

#

Especially since he's all about "America"

sage raptor
#

gork 4 *

inner gate
#

What do u guys think of Claude opus 4

leaden palm
zinc ore
#

Oh really, I missed it

leaden palm
zinc ore
#

"just after July 4th"

#

I think they're actually going to drop right on the 4th tho

late gorge
#

Groq is the cheapest

whole wagon
#

Notice he did not put the year there

echo aurora
#

didn't specify which year doggolol

whole wagon
#

Lol

#

Looks like it isn't coming to LLM arena till after release

dusky aurora
#

arena is my only outlet these days

ocean vortex
# flint skiff

there's only 14elo points between o3 and 2.5Pro overall. But this table makes it look like Gemini destroys the entire competition lmao

#

@echo aurora I think the new arena interface needs a rethink. There's not a lot of incentive to play with it when each follow up question results in responses from different random models... At least for me. I want to actually be able to test the new models

#

otherwise there's not much point. I can use legacy but that I'm not sure is still getting updated with new entries...

leaden sun
#

..actually, now am starting to like o3 more

ocean vortex
#

Like you could give people at least 5 follow up questions without redrawing the models, after voting

#

Then ideally clear the context entirely as you are changing the models for next battle. Just my 2 cents. Models having responses from different models in their context is a bit odd

#

Likely this will influence their responses and potentially elo score as well...

#

If it has a certain style in it's context it is likely to mimmick that

#

It's only recently that I had a deeper dive on the new interface, and yeah it's a bit of a mess to be completely honest, not gonna sugar coat it catgrin

#

Like this is weird:

#

models have responses in context that are not theirs

leaden sun
#

I think they’re already aware of this dom

late path
whole wagon
#

Bro is posting memes

ocean vortex
#

Don't feel like doing the work with no return, just to form the leaderboard 🤷‍♂️

#

In that sense, I suppose it's good labs are switching to openrouter then...

echo aurora
echo aurora
leaden sun
echo aurora
leaden sun
#

o3 is very interesting, isnt it

elder rapids
#

it seems like it has the same problem that o1 has with sophistry

#

when it tries to explain things, successfully does it, but in the end it's not really meaningful

tall summit
#

oh i'm a few days late to the news that huggingchat shut down

#

a shame

olive mesa
#

grok 4 july 4th 2pm EST

leaden sun
#

o3 is using less anthropomorphization compared to claude, i wonder if it's because of the nature of the conversation, i remember o3 usually being chatty?

wintry tinsel
torn mantle
#

I still dont think its this week

whole wagon
leaden palm
#

i now realize it's old (4 month old) news

#

didn't see it though so suppose it flew under many radars

empty stump
#

cats are smarter than llms

pallid crypt
# leaden palm

Likely a side effect of MoE framework, I would think MoE would make LLMs susceptible to adversial attacks like that

torn mantle
#

@deep adder do it

late path
#

new google model wolfstride

keen beacon
#

I would do it .. 0.001 sec after results are published

hoary plaza
#

I get more of older models

#

I feel there is a need to increase the rate at which they appear on battle mode

cedar tide
#

Thanks for the list of models added, just it would be good if you also said the mystery models, like the new "wolfstride"
#announcements message

cedar tide
rare python
versed moss
#

How do i use wolfstride

ocean vortex
#

what is this chat turning into...

#

smh

rare python
#

ban

#

@echo aurora

ocean vortex
#

in battle mode

late path
#

I think the main purpose of model vendors releasing mystery models is to collect votes. If they were directly used in direct chat, it would be meaningless for them.

rare python
cedar tide
#

I don't understand, my message is very clear, but I receive 3 responses that are off the mark 🤦

alpine coral
#

they'd be unmasked in no time if they were available in direct chat

#

yeah i actually got what you were saying

ocean vortex
alpine coral
cedar tide
alpine coral
#

just would be nice to see anon model names included in any list of models in / added to the arena

late path
cedar tide
#

Ah okk

rare python
ocean vortex
#

this looks like something someone would do for money tbh. The current battle mode in isolation

alpine coral
#

i meant to say your message had nothing to do with direct chat @cedar tide

#

anyway.. lol

rare python
#

what 😔

alpine coral
#

doesn't make unmasking them impossible - but far more challenging (unless endless time / automation ig)

rare python
echo aurora
keen ferry
#

is grok 4 today or tommorow

alpine coral
cedar tide
rare python
ocean vortex
cedar tide
rare python
#

then vote later

alpine coral
ocean vortex
echo aurora
ocean vortex
#

Besides they are people tracking it constantly, so if you want to be in a loop - you can be

cedar tide
alpine coral
#

tho it's perhaps not a great point..

versed moss
#

this new lmarena runs so slow

alpine coral
#

like OR has had their couple of 'stealth' models directly available. the OAI one was unmasked pretty quickly.. cypher-alpha still kinda a mystery (i lean towards Amazon.. but wouldn't be surprised if it was something else, tho not from oai, google, or anthropic) .. guess the labs serving them are really just collecting data at the end of the day (tho elo rankings presumably are of addditional value too)

versed moss
#

@echo aurora does the legacy version have the new models?

versed moss
#

but the new version is sooo slow

keen ferry
rare python
#

It sometimes kick me out to the homepage of lmarena after I start a new chat and hit send message

keen ferry
#

it'll probably be faster

rare python
versed moss
#

It makes me feel like im using a £200 laptop on it

#

the UI is so buggy and everything crashes

#

my pc is worth 4k

keen ferry
#

that's common

#

somebody already reported it

#

I guess they will fix this in the next site overhaul prob

versed moss
#

Anyway, just one shotted this on wolfstride

echo aurora
keen ferry
torn mantle
#

@cedar tide how is it

main gulch
#

wolfstride has stonebloom-level knowledge

versed moss
#

Im trying to get another instance of it to test it again

main gulch
#

7/9 in my quiz inspired by brknclock1215 (the best 2.5-pro/goldmane result is 5.5, the vast majority of models are below zero)

cedar tide
main gulch
#

3/9, o3-pro 8/9

keen beacon
ocean vortex
#

the best questions

keen beacon
#

I dont trust benchmarks. I should create some private personal one :/

main gulch
ocean vortex
#

If you don't notice a difference that means the model is aimed at people smarter than you. 🫃

#

dunno why but I just find his wording hilarious lmao

keen beacon
#

My favorite test prompt is " 4chan green text of late stage capitalsim birthday cake distribution"

ocean vortex
keen beacon
#

I meant it for Elon

rare python
ocean vortex
#

yeah he is being obscure. He didn't say it's better at math, coding or anything....

main gulch
#

will keep the best of them likely

ocean vortex
#

you just gotta "ask questions"

main gulch
#

yeah, it is most useful to determine the level (Flash/Pro/Ultra) of anon models

ocean vortex
#

which would have been a valid thing if true. SimpleQA, world knowledge. Just not excluding English, that would have been weird...

#

MMLU too

versed moss
#

Managed to get another instance

#

Wolfstride

fresh shadow
#

Where can I test Wolfstride?

cedar tide
#

Grok 4 not available

ocean vortex
#

that fairly difficult to do tbh

#

SimpleQA is the best public tool we have for this

rare python
# main gulch two runs

You mean two run of the same models or two run of 2 different checkpoint with the same codename?

cedar tide
#

Just for ask him in x

main gulch
ocean vortex
#

Even arc-agi and HLE seems to be vulnerable to contamination. Not a major problem for comparing different models, but this means small models can perform just as good mostly

rare python
main gulch
ocean vortex
#

Cause like... o4-mini-high is pretty much acing arc-agi now. Despite it being very poor on spatial awareness

rare python
#

Like does it measure real world performance?

ocean vortex
rare python
#

Why Arc Agi 2 Opus 4 took the lead then? What changed?

ocean vortex
#

we would probably need something like LCB benchmark format where the tasks change depending on the date range, only for spatial awareness rather than coding. So you wouldn't be able to train your model as easily on just the exact similar tasks tested for

ocean vortex
#

Which is why I implied SimpleQA is currently the better benchmark to get the rough estimate for model size. The delta on it gpt4.5 to o4-mini which performs worse there is fairly big.

#

Also expanding on arc-agi... o4-mini-high essentially has the exact same scores as o3-high... lol

rare python
#

:v

ocean vortex
#

on webdev arena (frontend - visuals in space) there's also a sizeable gap between them:

#

that wasn't even o3-high 👀

ocean vortex
#

This is one of those benchmarks where it seeing the correct attempt is probably not enough for it to understand that other ones are wrong

#

works much better for trick questions or in cases where it's just randomly failing by chance

rare python
#

What's the point of choosing the best attempt when consumer like us rarely bother to regenerate the answers?

ocean vortex
rare python
ocean vortex
rare python
#

I'm talking about some AI labs use pass@64 which choose the best attempt out of 64 tries

#

Like xAI or OpenAI

wild kayak
#

does anyone know the sampling parameters and thinking budgets used for the thinking models in web dev arena, like gemini2.5-pro, r1, sonnet and opus?

rare python
ocean vortex
wild kayak
rare python
rare python
rare python
#

¯_(ツ)_/¯

wild kayak
rare python
rare python
whole wagon
#

@Lcsuthar98 @elonmusk We've rolled out Grok 4 with enhanced reasoning, math, and NLP; a specialized coding model with IDE integration; 130k token context window; multimodal support (text/vision, image gen soon); reduced hallucinations; and faster responses via Colossus compute. Try a complex query to

#

Grok 4 is here

civic flame
#

why would you trust what grok says

#

it's just pulling leaked grok 4 info from twitter and hallucinating that it is 4

whole wagon
#

Well I mean how else would they have a significantly improved grok

keen beacon
#

Placebo 😂

civic flame
#

lol they've literally just updated 3 again haven't they

#

dw guys we'll get 4 in a couple years

sage raptor
whole wagon
#

Wut

#

Why did they update grok 3 at the same time as grok 4 release

keen beacon
#

Kinda a waste to deploy another grok 3 revision if 4 is coming out super soon

alpine coral
# ocean vortex

arc-agi vs simpleQA – i struggle to think of two benchmarks for which 'apples and oranges' better applies

#

they're totally unrelated

alpine coral
torn mantle
whole wagon
#

Grok 3 has no reasoning though

#

So it's not really useful currently

torn mantle
#

it has

whole wagon
#

It does not

torn mantle
#

it does

#

think?

keen fulcrum
whole wagon
#

They secretly switch it to mini

#

When u use think

#

There is no grok 3 think

keen beacon
#

They announced it tho

#

It's vaporware

torn mantle
#

im talking about if it exists not if its used now

whole wagon
alpine coral
whole wagon
#

It's scammy but what do you expect

#

It's xAI

whole wagon
keen beacon
# whole wagon

It's not accessible to the public, they announced a reasoning version of it though. Check their announcement post

whole wagon
#

Who cares if it's not accessible

#

Lmao

keen beacon
whole wagon
#

They claimed grok 3.5 exists

#

In may

#

Coming in 1 week

torn mantle
#

they didnt add grok 3 reasoning API support?

keen beacon
#

Never

whole wagon
#

No. Grok 3 reasoning has never been available

torn mantle
#

i mean grok 3 reasoning exist and its used on their web/app

whole wagon
#

Bruh

torn mantle
#

bruh

#

why dont you listen

whole wagon
#

It's a scam what don't you get 😂

torn mantle
#

omg

alpine coral
#

it's 3.5-mini rght?

keen beacon
#

It's actually grok 3 mini reasoning

whole wagon
#

Yes

alpine coral
#

ah 3-mini

keen beacon
#

Xai lies a surprising lot anyway. But I do think their reasoning version exists, it just sucks for the cost

#

Like them claiming that they were summarizing the cot or messing with it

keen beacon
alpine coral
#

yeah.. if the mini variant does.. stands to reason that full one does too

#

it's non-release would seemingly say something about its performance

torn mantle
rare python
#

grok 3 mini is really fast

torn mantle
#

well its not for me

#

maybe its used after a certain limit

#

let me check the payload

#

modelName
:
"grok-3"
responseMetadata
:
{requestModelDetails: {modelId: "grok-3"}}
requestModelDetails
:
{modelId: "grok-3"}
modelId
:
"grok-3"

#

lets see if the modelName contains mini as well

keen beacon
#

Did you check the tps

#

Lemme see. That's what people claim at least. I'm not paying for their grok sub (and it seems to be true that they're redirecting), where are you using it

torn mantle
#

this is for their API

keen beacon
#

yeah it says grok 3 doesnt support reasoning on the api

torn mantle
#

it doesnt support grok 3 reasoning

#

yea they probably switch when you hit the rate limits

whole wagon
#

Nowhere does it say just in the API

#

That's a general statement

torn mantle
#

the guides are for the api docs

torn mantle
rare python
#

The speed is slow when turn on thinking?

whole wagon
#

It's in the chat with reasoning guide. Nothing to do with API

#

API is a totally different section

keen beacon
#

the docs are for the api. there's literally an api example on that page

whole wagon
#

Grok 3 reasoning has never been served nor benchmarked. There's not much else to say on it

torn mantle
#

what are you saying

#

smh

#

it has been available from the start, after that they released grok 3 mini which was used in their @grok account

whole wagon
#

Come back when you can provide any benchmark that includes "grok 3 reasoning"

ocean vortex
torn mantle
keen beacon
torn mantle
#

@ocean vortex he is saying grok 3 reasoning doesnt exist and xai are using grok 3 mini reasoning in all of their services

ocean vortex
keen beacon
rare python
# torn mantle

crazy how there are little differences of grok 3 mini and grok 3 thinking

keen beacon
#

did yall see the pangu and qwen drama where a group alleged pangu is a stolen model?

#

saw it yesterday, interesting drama

ocean vortex
#

oh yeah

#

weird

rare python
ocean vortex
#

Not surprised that this is coming from them... But yeah their benchmarks they posted months ago for full model thinking version are essentially useless then lol

ocean vortex
#

I thought there must have been API for it by now for sure...

rare python
#

💊

#

🥴

ocean vortex
#

yeah he's washed

#

so essentially this is how the only existent API version of full grok3 performs:

#

it's behind by miles o3 and 2.5Pro, even o1 probably

fleet lintel
#

What happened to 3.5 grok? Did they just dropped 3.5 release plan?

keen beacon
#

they should credit the qwen team for using qwq 32b preview traces on their reasoning model 🤣

torn mantle
#

with reasoning

ocean vortex
torn mantle
#

its not bad

#

but it doesnt reflect really how the model feels

ocean vortex
#

it "does not support reasoning"

#

💀

torn mantle
#

it doesnt support it on api

#

since its pricey

#

and there isnt much performance gain compared to mini

ocean vortex
#

it doesn't support it probably because they announced smth they didn't have

torn mantle
#

they talked about that before

ocean vortex
torn mantle
#

could be

ocean vortex
#

it's the same size model their cost gonna be virtually the same just more tokens generated

keen beacon
#

nah with much more output regularly its different

ocean vortex
#

unless it's something entirely different but this would only make it worse and false advertising lol

keen beacon
#

you can get higher overall throughput with low output requests

ocean vortex
#

And claude pricing

keen beacon
ocean vortex
keen beacon
#

you dont need to take my word for it

torn mantle
ocean vortex
torn mantle
#

i remember they mentioned that in their slide

keen beacon
#

4.1 had architectural changes, speculation but i suspect so

#

"architecture" changes don't necessarily mean huge adjustments anyway btw

rare python
keen beacon
torn mantle
#

here

#

@teortaxesTex If you rly care about pass@1, you should compare numbers to mini. I explained in the live demo that we trained mini version for many more steps. The Grok3 proper is much bigger and takes longer to train. So we just released an early version. It’s already very smart. But it’s

ocean vortex
keen beacon
#

i think thats true and that there were architecture changes (might as well if you're doing cpt on 4o)

#

so the raw cost gap is much lower now

torn mantle
rare python
#

Jun 10 2025

torn mantle
#

is grok 4 the successor of grok 3? since Yuhuai talked about that is a powerful model and its generalization can lead to some massive performance gains?

#

ngl im lost now

#

AAAAAAAAAAAAAAAAAAAAAAAAAA

#

they are confusing me

ocean vortex
#

But either way... it makes no sense at all for grok3 reasoning to not be available on API. They could charge what they want even if cost was a big issue, which is unlikely assuming the same base model...

torn mantle
#

so its fine

#

i can take it

rare python
torn mantle
#

so what now

#

its your chance @rare python

rare python
torn mantle
rare python
torn mantle
#

😖

rare python
torn mantle
#

yes

#

yes yes yes yes yes

rare python
torn mantle
#

how

rare python
#

😏

keen beacon
#

they just deleted the repository for some reason

#

i was gonna link you to it but huh

ocean vortex
#

while betting for Google you don't get profit at all

#

terrible bet lol

rare python
#

r2 is still cooking

torn mantle
keen beacon
#

they had a report about it in a pdf. i didn't download it in time.

but i have a screenshot of the readme:

ocean vortex
torn mantle
#

nah the one that may really top the arena is gemini deep think

keen beacon
#

in the github issues there were also rumors of them training on benchmark splits etc. just weird

torn mantle
#

well for this month its either grok 4 or gemini deep think

cedar tide
rare python
#

Too long

#

same with o3 pro or o3 deep research

ocean vortex
#

it needs to 1. Be released 2. Collect enough votes 3. Leaderboard has to be updated manually

rare python
#

too long

ocean vortex
#

It has to appear on lmarena like now or next week at the latest

#

for it to happen

rare python
#

Last year they and Google dominate the mystery models testing

keen beacon
#

google did way more

ocean vortex
rare python
ocean vortex
ocean vortex
rare python
#

o3 is more general and good outside lmarena

cedar tide
#

I didn't understand anything, what was the debate about grok 3 reasoning?

leaden sun
ocean vortex
rare python
#

No tease or leak of Gemini 3.0

leaden sun
#

reset is a fitting description from I've been told by o3

#

2030 has almost 50% chance, but apparently it will look very different from what you originally have thought

keen beacon
rare python
leaden sun
#

well...

rare python
#

It's too concise and pack jargon with casual slang it's awkward for me to read

leaden sun
#

i like it for scientific stuff

rare python
#

true it's a bit censored

#

I ask for a 1:1 image prompt for me to generate that image in a different style

#

o3 told me my screenshot might be copyrighted

#

:v

#

What do you mean "mainstream"? Like it's too bland?

keen beacon
#

i mean it just came out? if there were to be a proper response, i think it'll take a while

#

in github issues, the "honestagi' guy was talking about whistleblowers and stuff

#

them training on benchmarks splits (but out of scope of his report)

#

them switching out the qwen 2.5 tokenizer and doing research on how to do it with minimal loss

leaden sun
keen beacon
#

yeah, i was talking about it when it was up

rare python
keen beacon
#

people found the report compelling though

leaden sun
rare python
leaden sun
#

use the dual conversation in lmarena, together with opus @rare python

pure anvil
#

Huawei definitely has an incentive to do something like that

ocean vortex
#

I have no clue how they got access to it lol

#

invite only beta API or smth

cedar tide
ocean vortex
#

Deepseek outscored everything though lmao

keen beacon
#

its probably not sanctioned by the higher ups

#

if true

ocean vortex
pure anvil
#

I meant to look as if they're catching up in AI to the government to secure subsidies

pure anvil
ocean vortex
#

very unusual test. And their totals are very close for everything

ocean vortex
rare python
# leaden sun use the dual conversation in lmarena, together with opus <@1178708438310719549>

Opus 4 Thinking:

They're explaining that by "mainstream" they don't mean bland in terms of personality or writing style - they mean the AI tends to present conventional, widely-accepted viewpoints rather than alternative or contrarian perspectives.

Their point is that LLMs typically stick to established consensus views on topics, even when those mainstream positions might have flaws or contradictions. For example, an LLM discussing a scientific topic would likely present the dominant scientific consensus rather than minority theories or critiques of that consensus.

They're essentially saying LLMs have a bias toward "safe" orthodox positions - the kinds of views you'd find in textbooks or Wikipedia - rather than exploring fringe ideas, unpopular theories, or perspectives that challenge conventional wisdom. It's not about being boring, but about being intellectually conservative and avoiding controversial or alternative viewpoints.

This makes sense given how LLMs are trained and the guardrails put in place - they're generally designed to provide reliable, mainstream information rather than promote unverified claims or fringe theories. But your conversation partner seems to see this as a limitation, suggesting that sometimes the mainstream view isn't necessarily the most accurate or complete picture.

keen beacon
#

its cheaper to upcycle/do continued-pretraining rather than train the moe model from scratch. now that i think about it, the higher ups should know something is off because of pre-training costs or they were sold on somehow very efficient training

pure anvil
# ocean vortex

"Reasoing Type" it's good that they at least they wrote it themselves lol

keen beacon
#

i think i vaguely recall mixtral 8x7b doing a similar process (but ofc with their own 7b)

#

post-training is also so much more cheaper than pre-training as well