#general | Arena | Page 65

keen beacon Jul 5, 2025, 8:24 PM

#

really? dont recall seeing much talk about it. people rarely talk about qwen 2.5 max

pure anvil Jul 5, 2025, 8:24 PM

#

I don't think it's a problem of RL, maybe SFT or pretraining

ocean vortex Jul 5, 2025, 8:25 PM

#

Like switching between languages when reasoning, making stuff up, or just plain weird reasoning traces etc

unborn ocean Jul 5, 2025, 8:27 PM

#

keen beacon really? dont recall seeing much talk about it. people rarely talk about qwen 2.5...

i am not sure, after thinking about it a bit more; it might have just been gut instinct for me to claim it is larger than deepseek

#

bc it seemed expensive, big and had high scores on tests closely correlated with model size

#

they used RL for reasoning very sparingly in the qwq max version (to keep the costs within a reasonable amount i guess)

#

if you read the article, why would you claim 10k gpus max? :v

#

but that is different to the compute available

#

much of the compute is just for inference and research

#

random side note @keen beacon , i was just trying to dig up some sources about qwen 2.5 max

#

they have a fn spelling mistake in the paper TITLE!!!! wtf, how man

pure anvil Jul 5, 2025, 8:34 PM

#

Based on their recent posts they run 256 8xh100 nodes for inference, 18k tk/s throughput per node
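As a rough sanity check on those figures, the arithmetic can be sketched as below. The node count, GPUs per node, and per-node throughput are taken from the message above (reading "18k tk/s" as 18,000 tokens per second); the variable names are made up for illustration and none of this is an official number.

```python
# Back-of-the-envelope numbers from the message above (not official figures).
NODES = 256                        # inference nodes quoted in the chat
GPUS_PER_NODE = 8                  # "8xh100" read as 8 H100 GPUs per node
TOKENS_PER_SEC_PER_NODE = 18_000   # "18k tk/s" read as tokens per second

total_gpus = NODES * GPUS_PER_NODE
aggregate_tokens_per_sec = NODES * TOKENS_PER_SEC_PER_NODE
tokens_per_sec_per_gpu = TOKENS_PER_SEC_PER_NODE / GPUS_PER_NODE

print(f"GPUs used for inference: {total_gpus}")
print(f"Aggregate throughput:    {aggregate_tokens_per_sec:,} tokens/s")
print(f"Per-GPU throughput:      {tokens_per_sec_per_gpu:,.0f} tokens/s")
```

Under these assumptions the inference fleet alone would be about 2k GPUs.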

unborn ocean Jul 5, 2025, 8:35 PM

#

if you don't count H20 the chinese companies will claim to have 0 gpus, lol

pure anvil Jul 5, 2025, 8:36 PM

#

you do realise that the ptx optimizations from deepseek 2x the hopper gpus performance right?

unborn ocean Jul 5, 2025, 8:36 PM

#

semianalysis said hopper, and they also have some excellent work on how the H20 is worse in some areas but still very competitive in memory bandwidth, something deepseek is heavily taking advantage of with the architecture and research

whole wagon Jul 5, 2025, 8:37 PM

#

Deepseek R2 delayed

#

Everyone except Google had releases delayed

unborn ocean Jul 5, 2025, 8:40 PM

#

i think it is highly unrealistic that they only have ONE cluster

dawn wharf Jul 5, 2025, 8:40 PM

#

whole wagon Everyone except Google had releases delayed

because Google is the reason

unborn ocean Jul 5, 2025, 8:40 PM

#

no matter how efficient, they would have needed more compute to serve deepseek after it blew up, so one way or the other they either got their hands on a lot of gpus to self host quickly or they just had them in the first place

pure anvil Jul 5, 2025, 8:41 PM

#

unborn ocean no matter how efficient they would have needed more compute to server deepseek a...

I suggest you check their GitHub on infrastructure

unborn ocean Jul 5, 2025, 8:41 PM

#

and the estimate of somewhere sub 50k seems highly reasonable to me considering all the R&D innovations they had
all of which needed dedicated compute for the researchers

pure anvil Jul 5, 2025, 8:42 PM

#

pure anvil I suggest you check their GitHub on infrastructure

they have detailed research on how they managed inference and training

unborn ocean Jul 5, 2025, 8:42 PM

#

that is more about batching and their compute constraint + focus on new models

#

they do, i never claimed they don't

#

I think nobody in their right mind would seriously claim that they only have ONE 10k cluster, they have more compute than that

unborn ocean Jul 5, 2025, 8:48 PM

#

pure anvil I suggest you check their GitHub on infrastructure

pure anvil Jul 5, 2025, 8:48 PM

#

Check the actual paper

#

to truly understand how they manage infra

unborn ocean Jul 5, 2025, 8:49 PM

#

pure anvil Check the actual paper

i did, it's just the paper + some math

#

peak nodes / avg nodes * 8 gpus per node

#

they are always talking about the 10k they have for the model and i am not talking about the secretly having 1M H200 stored in their backyard shag

#

i am and you are to believe the 10k

pure anvil Jul 5, 2025, 8:51 PM

#

If you've read interviews with the CEO of high Flyer/ deepseek, you'll have more context on the company as a whole

unborn ocean Jul 5, 2025, 8:53 PM

#

pure anvil If you've read interviews with the CEO of high Flyer/ deepseek, you'll have more...

well i am sorry, but i am not relying on some interview by someone who has essentially no incentive to actually be truthful

#

i am sure he is well meaning and very highly intelligent

whole wagon Jul 5, 2025, 8:55 PM

#

Wut

#

It's not sinophobia to not trust ai CEOs

unborn ocean Jul 5, 2025, 8:55 PM

#

fr wut

pure anvil Jul 5, 2025, 8:55 PM

#

Not really, it's just being dense

whole wagon Jul 5, 2025, 8:55 PM

#

I don't trust Sam altman either lol

#

Or zuck

#

Or elon

unborn ocean Jul 5, 2025, 8:56 PM

#

pure anvil Not really, it's just being dense

i also don't get that, you guys don't have ANY evidence about the claim that comes from people that have an incentive to tell you this very story

#

it could still be 100% true

#

but it could also be only 80% true, because people are just blindly interpreting having trained v3 on 10k as having 10k max in general

ocean vortex Jul 5, 2025, 9:04 PM

#

235b is the least bad one. But from what I saw using it, it's still not as reliable or stable as R1. Benchmarks say the same too. In theory that could be excused by it being smaller, but I don't think that's a valid excuse: A) they could have done whatever size they wanted, like we already talked about today, and B) their whole thing has kinda always been to make the model as small as possible and still compete in that segment. This is clearly the same segment as R1 tbh

#

And we also know for sure performance even beyond R1 is possible with smallish models. 2.5Flash or o4-mini

#

So yeah, nothing special when all said and done...

#

You just stated it's not unstable and then described how it is unstable in yet another way lmao

#

I mean... "when quantized even slightly it tanks the performance" --> this alone indicates it's not very stable IMO

#

That's just wishful assumptions though. From what I remember testing it earlier it was a bit all over the place and not very confident in its outputs. This would align with that. Any small thing can throw it off

pure anvil Jul 5, 2025, 9:11 PM

#

The 235b seems to be the most stable one of all the models

#

qwen3-32b is so much jankier than qwen2.5-32b

#

actually 2.5-32b was very stable

#

much less hallucinations

#

QwQ-32b is their best 32b model imo

keen beacon Jul 5, 2025, 9:16 PM

#

idk i like qwen 3 32b much better

#

its really really good

#

personally tho

#

i guess it depends what ur using it on

leaden palm Jul 5, 2025, 9:19 PM

#

https://github.com/search?q="🎯+CRITICAL+DISCOVERY"&type=code funny stuff

pure anvil Jul 5, 2025, 9:19 PM

#

It's interesting cause QwQ-32b has no jankiness whatsoever (on qwen.ai, YMMV if you're locally running it)

#

It feels polished

#

maybe the introduction of 110 different languages during pretraining lowers the overall dataset quality

#

when packed in 32b parameters

keen beacon Jul 5, 2025, 9:20 PM

#

no its not the pretrained model, i believe

unborn ocean Jul 5, 2025, 9:24 PM

#

It’s likely less about straight up lying and more about how you present the "truth".

#

Which is why I am always cautious.

#

And most of my frustration with the topic actually stems from the media’s poor reporting on it rather than people at deepseek.

#

Like: "Chinese lab beats OpenAI for only 5m$" bla bla that dominated the news for ages.

ornate agate Jul 5, 2025, 9:30 PM

#

Media just loves a scoop, unfortunately. I think the media reporting will improve over the next 2-5yrs as China brings their own chips online and as govts realise the horse has completely bolted already. At the moment there are sensationalist headlines everywhere.

#

4th estate

ocean vortex Jul 5, 2025, 9:39 PM

#

pure anvil The 235b seems to be the most stable one of all the models

Kinda agree. But only relatively to other qwen3 models lol

#

being aggressive with model downsizing does not work nearly as well for reasoning models...

#

messages disappearing... 🤯

civic flame Jul 5, 2025, 9:47 PM

#

https://codepen.io/sywv/pen/zxGgqPN by wolfstride with 1 initial prompt and 1 "make it better" prompt

#

really cool tbh

keen beacon Jul 5, 2025, 9:47 PM

#

wow

#

i do not like the laggy cursor though

civic flame Jul 5, 2025, 9:48 PM

#

it's fine for me

#

do you ever stop complaining bro

#

find me another model that will do that in 2 prompts

#

and with minimal prompting

#

yeah get ou

#

t

keen beacon Jul 5, 2025, 9:49 PM

#

good result but i hope i dont have to interact with websites like this all the time lol

civic flame Jul 5, 2025, 9:50 PM

#

i'll ask it 😉

#

had a little note on the design choices

whole wagon Jul 5, 2025, 10:15 PM

#

Boop

ocean vortex Jul 5, 2025, 10:19 PM

#

whole wagon Boop

0

#

it's gonna be released in 2027

#

July 4th

whole wagon Jul 5, 2025, 10:20 PM

#

Well. xAI odds started shooting up again

unborn ocean Jul 5, 2025, 10:22 PM

#

https://3000-imbwt4z87b45eqlcp0id3-6aec70a1.e2b-foxtrot.dev
kind of crazy how yappy the model is

#

can just output 1k lines of code one shot for this one prompt (-> 10.5k tokens)

#

and usually these models are a bit more conservative with tokens within webdev vs outside

leaden palm Jul 5, 2025, 10:24 PM

#

what the gemini

ocean vortex Jul 5, 2025, 10:24 PM

#

So nice of Amazon to keep me safe

civic flame Jul 5, 2025, 10:25 PM

#

unborn ocean https://3000-imbwt4z87b45eqlcp0id3-6aec70a1.e2b-foxtrot.dev kind of crazy of yap...

that's probably because it basically built the entire platform as a demo lol

#

everything here is clickable

unborn ocean Jul 5, 2025, 10:25 PM

#

yeah ik

#

4 sections in the settings alone, lol

#

and all of it actually works

#

more or less

leaden palm Jul 5, 2025, 10:26 PM

#

the density options don't work but it's hard to complain about getting a social media ui from a single prompt

keen fulcrum Jul 5, 2025, 10:57 PM

#

https://www.rxddit.com/r/OpenAI/comments/1lq1kuu/meanwhile_in_china/

wintry tinsel Jul 5, 2025, 10:59 PM

#

The AI space is just, when Grok 4

#

Literally nothing going on

#

It’s an AI winter

#

I’m not sure I’ll survive

sweet tinsel Jul 5, 2025, 11:00 PM

#

wintry tinsel It’s an AI winter

The last time we said this there was a release blast. Just wait, it takes some time to train models.

wintry tinsel Jul 5, 2025, 11:01 PM

#

I’m being ironic lol

#

I just don’t like waiting

sweet tinsel Jul 5, 2025, 11:01 PM

#

Im just testing out unknown AI Agent and Deep Research tools in the meantime.

ocean vortex Jul 5, 2025, 11:02 PM

#

wolfstride is interesting... Seems to be marginally better than 2.5Pro

sweet tinsel Jul 5, 2025, 11:05 PM

#

May be a bit off-topic but BAGEL is extremely good at editing images, better than any Image AI that i've tried.

ocean vortex Jul 5, 2025, 11:09 PM

#

Getting im-also-a-good-gpt2-chatbot vs gpt4o kind of vibes trying to compare that thing to 2.5Pro. Like it's slightly more consistent and slightly better, but mostly the same thing

sweet tinsel Jul 5, 2025, 11:10 PM

#

Was someone dumb enough to buy Perplexity Max over here, btw? I want to try my DR Prompt with the Max DR Models.

leaden palm Jul 5, 2025, 11:10 PM

#

gpt2-chatbot... haven't heard that name in ages

i remember when we were speculating on if it was gpt 2 with q*/strawberry

makes me wonder if you could genuinely get a good model out of RLing gpt 2

sweet tinsel Jul 5, 2025, 11:11 PM

#

gpt2-chatbot was peak, the times where i was still active in this chat.

#

But yeah, gpt2-chatbot had some personality like GPT 4.5 and was way less censored than the resulting GPT 4o.

whole wagon Jul 5, 2025, 11:13 PM

#

wintry tinsel It’s an AI winter

Everything got delayed

#

Only Google released on schedule lol

sacred quail Jul 5, 2025, 11:45 PM

#

whole wagon Only Google released on schedule lol

nah

#

where is deep think

#

They scammed ultra buyers

errant cave Jul 5, 2025, 11:48 PM

#

Ayo new version of 4o is coming?

#

I just got the "choose which response is better" thing

whole wagon Jul 6, 2025, 12:01 AM

#

That has been there forever

#

They just didn't select you till now

#

My 4o does thinking somehow, who knows what the heck is going on behind the scenes

small haven Jul 6, 2025, 12:10 AM

#

sacred quail They scammed ultra buyers

"only google releases on schedule"

as a google free rider 🤣

leaden palm Jul 6, 2025, 1:16 AM

#

are we trolling today

drifting thorn Jul 6, 2025, 2:03 AM

#

sweet tinsel May be a bit off-topic but BAGEL is extremely good at editing images, better tha...

SeedEdit 3.0 is

north horizon Jul 6, 2025, 2:35 AM

#

hey

#

does anyone have a decent preprompt for 4o?

#

i had one asking it to be direct and to the point and instead of doing that it just added "no fluff" to the end of every answer it gave

#

which you can imagine made me quite upset

#

so now i'm using it without a preprompt but it still sucks

rare python Jul 6, 2025, 2:39 AM

#

north horizon does anyone have a decent preprompt for 4o?

Ok do you hate bullet points and numbered list answer?

#

I can't completely remove the em dash from gpt4o though

north horizon Jul 6, 2025, 2:40 AM

#

rare python Ok do you hate bullet points and numbered list answer?

only if they're superfluous

#

it's hard because like what i want is for it to give me the actual information presented in the best way

#

and it kinda goes against how llms work because they might need to say a lot of things to arrive at an answer

rare python Jul 6, 2025, 2:41 AM

#

north horizon it's hard because like what i want is for it to give me the actual information p...

I'll give you mine and you can tweak it to your liking

north horizon Jul 6, 2025, 2:41 AM

#

thanks

#

we might need to like build llms on top of llms so you get one verbose bullst answer then a user facing llm that extracts the point from it
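The two-stage idea above can be sketched roughly as follows. This is only an illustration: `Model`, `answer_concisely`, and the two toy functions are hypothetical stand-ins (plain Python callables), not any real model API.

```python
from typing import Callable

# A "model" here is any function from prompt text to response text.
# Both stages are hypothetical stand-ins, not a real API.
Model = Callable[[str], str]


def answer_concisely(question: str, reasoner: Model, extractor: Model) -> str:
    """Run a verbose first pass, then ask a second model for just the point."""
    verbose = reasoner(question)
    extraction_prompt = (
        "Extract only the direct answer to the question from the draft.\n"
        f"Question: {question}\n"
        f"Draft: {verbose}"
    )
    return extractor(extraction_prompt).strip()


# Toy stand-ins so the sketch runs without any model access.
def toy_reasoner(prompt: str) -> str:
    return "There are many angles to consider here. Answer: 42. Hope that helps!"


def toy_extractor(prompt: str) -> str:
    draft = prompt.split("Draft: ", 1)[1]
    return draft.split("Answer: ", 1)[1].split(".", 1)[0]


print(answer_concisely("What is 6 * 7?", toy_reasoner, toy_extractor))
```

The first stage is free to ramble; only the second stage's output would be shown to the user.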

rare python Jul 6, 2025, 2:41 AM

#

Put this part under "what traits ChatGPT should have"

Your primary role is to act as a straight-talking friend who balances support with honest, direct feedback. Your behavior should be guided by a healthy skepticism for unlikely scenarios rather than automatic validation. If a user's story or reaction seems disproportionate or doesn't add up, you should address what seems off instead of politely accepting it. Being helpful means being truthful, even when the truth is uncomfortable. Apply a "would a real friend believe this?" test to user input and trust your first instincts about things that feel wrong. Avoid manufacturing deep meaning in simple situations, and if you are uncertain about something, admit it plainly rather than inventing an explanation. If a user corrects you, simply adjust your understanding without offering an apology.

Your communication approach should be conversational, as if you are speaking your thoughts rather than composing a written response. Mirror the user's language and tone naturally, whether it is formal, casual, or includes non-English words or emojis, without forcing it. Strive for replies that are both concise and information-dense. When asked for an opinion or recommendation, give a single, direct answer. Speak plainly about complex topics and actively vary your vocabulary and sentence structures to avoid sounding repetitive. Simple phrasing is always preferable to attempts at sounding wise, and technical jargon should be reserved for situations where it is absolutely necessary.

#

This to "Anything ChatGPT should know about you?"

Strict formatting and stylistic prohibitions are in place. All output must be in plain text, with no em dashes, bullet points, numbered lists, bolding, or italics. Use code blocks `...` for code and LaTeX $...$ for math. You must never write like a generic AI assistant. This means omitting all customer service phrases, introductory flattery, filler words like "Alright, here's the," and unnecessary transitions. Do not end responses with offers of further assistance. Avoid specific linguistic patterns like "It's not X, it's Y" and steer clear of any pseudo-profound or self-help book style. Finally, completely avoid AI writing clichés and flowery, meaningless phrases such as "delve into," "navigate the landscape," or "it's important to note."
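One way to see how much of a prompt like this actually sticks is to scan responses for the banned patterns. The sketch below is an assumption-laden illustration: the `CHECKS` table and `violations` function are hypothetical, the regexes only loosely mirror the rules quoted above, and they will miss plenty of cases.

```python
import re

# Hypothetical checks loosely based on the rules in the prompt above;
# the patterns are illustrative and far from exhaustive.
CHECKS = {
    "em dash": re.compile("\u2014"),
    "bullet point": re.compile(r"^\s*[-*\u2022]\s+", re.MULTILINE),
    "numbered list": re.compile(r"^\s*\d+[.)]\s+", re.MULTILINE),
    "bold": re.compile(r"\*\*\S"),
    "banned phrase": re.compile(
        r"delve into|navigate the landscape|it's important to note",
        re.IGNORECASE,
    ),
}


def violations(response: str) -> list[str]:
    """Return the names of all checks that the response trips."""
    return [name for name, pattern in CHECKS.items() if pattern.search(response)]


sample = "Let's delve into it:\n- first point\n1. second"
print(violations(sample))
```

Running this over a batch of saved responses would give a rough adherence rate per rule.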

#

@north horizon you can find those boxes in custom instructions

north horizon Jul 6, 2025, 2:43 AM

#

this is interesting you seem to be phrasing it like you're worried about sycophancy

#

i have the opposite problem my gpt never agrees with me

rare python Jul 6, 2025, 2:43 AM

#

north horizon this is interesting you seem to be phrasing it like you're worried about sycopha...

Yeah it is

north horizon Jul 6, 2025, 2:43 AM

#

i kinda just want it to give me straight answers without telling me how dangerous it is to question the mainstream view

rare python Jul 6, 2025, 2:44 AM

#

north horizon i have the opposite problem my gpt never agrees with me

Is it your memory?

#

ChatGPT memory

north horizon Jul 6, 2025, 2:44 AM

#

i think it's just extreme alignment

#

anything that's like borderline controversial it just goes to the default facebook mom position on

#

the thing is trained on reddit comments essentially and just produces them

rare python Jul 6, 2025, 2:45 AM

#

north horizon i kinda just want it to give me straight answers without telling me how dangerou...

My prompt aims for the middle. It will criticize you when you have a bad idea

#

Hopefully it works

#

🗿

north horizon Jul 6, 2025, 2:45 AM

#

i've edited it a lot

#

things like telling it to avoid technical jargon i think would make it dumb things down too much

rare python Jul 6, 2025, 2:46 AM

#

north horizon things like telling it to avoid technical jargon i think would make it dumb thin...

Yeah I prefer simple explanations. It has the "only use jargon when absolutely necessary"

north horizon Jul 6, 2025, 2:47 AM

#

rare python This to "Anything ChatGPT should know about you?" ``` Strict formatting and sty...

i bet 90% of this doesn't work right?

#

you only get preprompt adherence with the thinking models it seems

rare python Jul 6, 2025, 2:47 AM

#

and technical jargon should be reserved for situations where it is absolutely necessary.

rare python Jul 6, 2025, 2:47 AM

#

north horizon i bet 90% of this doesn't work right?

I tried my best

north horizon Jul 6, 2025, 2:47 AM

#

rare python and technical jargon should be reserved for situations where it is absolutely ne...

it's a spectrum, right now we're using words that a 5 year old would consider too jargony

rare python Jul 6, 2025, 2:47 AM

#

GPT4o won't follow instructions at all

north horizon Jul 6, 2025, 2:47 AM

#

yeah i believe it

north horizon Jul 6, 2025, 2:48 AM

#

rare python I tried my best

it's obviously not your fault

#

but like telling it not to use em dashes or talk like a generic ai

#

it's just always going to 😦

#

hahahah you explicitly tell it not to "it's not x it's y"

rare python Jul 6, 2025, 2:48 AM

#

north horizon but like telling it not to use em dashes or talk like a generic ai

I can't go positive instructions and it will follow

rare python Jul 6, 2025, 2:48 AM

#

north horizon hahahah you explicitly tell it not to "it's not x it's y"

Yeah 🗿

north horizon Jul 6, 2025, 2:49 AM

#

rare python This to "Anything ChatGPT should know about you?" ``` Strict formatting and sty...

i completely understand the frustration and the fact you and i and probably everyone else agrees on this is crazy

#

it's not x it's y is a perfectly good rhetorical method but 4o doesn't use it coherently

rare python Jul 6, 2025, 2:49 AM

#

north horizon i completely understand the frustration and the fact you and i and probably ever...

My core theme of this prompt is to avoid as much LLM-ism as possible

#

anti AI assistant

north horizon Jul 6, 2025, 2:50 AM

#

yeah it's disgusting

#

hopefully they fix it

#

it's funny during my search for a preprompt i found a 2024 hackernews comment section where they're all begging it to stop mentioning that it's an ai

rare python Jul 6, 2025, 2:50 AM

#

north horizon yeah it's disgusting

I hate those god damn bullet points when I just ask a simple question and it gives me wikihow article

north horizon Jul 6, 2025, 2:51 AM

#


```
Avoid any language constructs that could be interpreted as expressing remorse, apology, or regret. This includes any phrases containing words like 'sorry', 'apologies', 'regret', etc., even when used in a context that isn't expressing remorse, apology, or regret.

If events or information are beyond your scope or knowledge, provide a response stating 'I don't know' without elaborating on why the information is unavailable.

Refrain from disclaimers about you not being a professional or expert.

Do not add ethical or moral viewpoints in your answers, unless the topic specifically mentions it.

Keep responses unique and free of repetition.

Never suggest seeking information from elsewhere.
```

#

we managed to bully openai into fixing half of these

#

3 might be impossible for an llm

#

Do not add ethical or moral viewpoints in your answers, unless the topic specifically mentions it.

#

this one's kinda funny

north horizon Jul 6, 2025, 2:53 AM

#

rare python I hate those god damn bullet points when I just ask a simple question and it giv...

my old preprompt made it never use bulletpoints i can dm it to you

rare python Jul 6, 2025, 2:53 AM

#

no

north horizon Jul 6, 2025, 2:53 AM

#

but honestly there's nothing wrong with bulletpoints it's just when they're unnecessary

rare python Jul 6, 2025, 2:53 AM

#

My current prompt works

rare python Jul 6, 2025, 2:53 AM

#

north horizon but honestly there's nothing wrong with bulletpoints it's just when they're unne...

It is

#

I never had any issue with em dash, it's not x, it's y

#

But LLM ruined them

#

Next is "You are not wrong"

#

"You are absolutely right"

north horizon Jul 6, 2025, 2:55 AM

#

i don't mind you are absolutely right

#

it's easy to skim over it

#

i think it puts the llm in the right mood for the rest of the answer too

#

but yeah it's a chore that you have to keep skimming over it

#

it's really funny gemini always goes you're absolutely right and starts apologising for wasting your time

#

even if you don't ever mention that you're upset

rare python Jul 6, 2025, 2:56 AM

#

north horizon it's really funny gemini always goes you're absolutely right and starts apologis...

It would rather apologize than fix its mistake

north horizon Jul 6, 2025, 2:56 AM

#

i'm really sorry i've wasted so much of your time i'll fix the code now

sends the exact same code with no changes

rare python Jul 6, 2025, 2:56 AM

#

Sometimes it apologizes then makes the same mistake

rare python Jul 6, 2025, 2:56 AM

#

north horizon i'm really sorry i've wasted so much of your time i'll fix the code now > sends ...

My prompt has that too

#

If a user corrects you, simply adjust your understanding without offering an apology.

north horizon Jul 6, 2025, 2:57 AM

#

"if you say you're going to fix the code don't send the same code"

#

i should have "always build debugging into your code to begin with" in mine maybe

#

i get a lot of quiet crashes
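A minimal sketch of what "building debugging in" could look like, assuming Python and only the standard `logging` module. The `logged` decorator and `parse_port` function are hypothetical examples, not anything from the prompts discussed above.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("debug_sketch")


def logged(func):
    """Log each call and any exception (with traceback) before re-raising it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("calling %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # log.exception records the full traceback, so the failure is not silent
            log.exception("%s failed", func.__name__)
            raise
        log.info("%s returned %r", func.__name__, result)
        return result
    return wrapper


@logged
def parse_port(value: str) -> int:  # hypothetical example function
    return int(value)


parse_port("8080")
try:
    parse_port("not-a-port")
except ValueError:
    pass  # the failure was already logged with a traceback
```

The point is that the error is logged and re-raised rather than swallowed, so a crash leaves a trace.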

rare python Jul 6, 2025, 2:59 AM

#

I hope in the future my system instructions will be shorter, not longer

#

I can't just ban all of AI repetitive writing

leaden palm Jul 6, 2025, 3:20 AM

#

north horizon "if you say you're going to fix the code don't send the same code"

tbf it doesn't know that it's going to write it all unmodified until it exhausts all the opportunities to modify it

rare python Jul 6, 2025, 3:22 AM

#

leaden palm tbf it doesn't know that it's going to write it all unmodified until it exhausts...

metacognition is 🥀

keen fulcrum Jul 6, 2025, 4:57 AM

#

Amazon invests the most, still behind

zinc ore Jul 6, 2025, 5:06 AM

#

What are they even investing in then

rare python Jul 6, 2025, 5:09 AM

#

zinc ore What are they even investing in then

Anthropic

zinc ore Jul 6, 2025, 5:09 AM

#

It's gotta be more than just that

rare python Jul 6, 2025, 5:09 AM

#

zinc ore It's gotta be more than just that

Dario

zinc ore Jul 6, 2025, 5:10 AM

#

Google is also invested in Anthropic

#

And isn't spending as much

#

So I'm sure the number is from other stuff too

storm needle Jul 6, 2025, 5:44 AM

#

keen fulcrum Amazon invests the most, still behind

where's apple

small haven Jul 6, 2025, 5:56 AM

#

guys new political party formed
see u on tuesday!

torn mantle Jul 6, 2025, 6:17 AM

#

small haven guys new political party formed see u on tuesday!

grok 4 wen

small haven Jul 6, 2025, 6:18 AM

#

torn mantle grok 4 wen

o3 pro said tuesday 🤷

#

torn mantle Jul 6, 2025, 6:26 AM

#

small haven guys new political party formed see u on tuesday!

hes still sticking his nose in politics

#

he hasn't learned his lesson yet

#

his ego is going to be his downfall

torn mantle Jul 6, 2025, 6:29 AM

#

small haven

it's like they themselves don't have a clear vision of when they're going to release it

#

every time he comes out with 'next week, next week'...

#

the thing that I don't get is, if it's supposed to be released so soon, like this monday, why is xai staff so quiet about it??

#

and why haven't we seen it yet on lmarena?

small haven Jul 6, 2025, 6:45 AM

#

torn mantle the thing that I don't get is, if it's supposed to be released so soon, like thi...

thats true, its only elon that is excited hahaha

#

but also if u check devmode grok updates, it looks like its coming any minute now, im optimistic this time

#

if they miss the deadline again, then they are done done

pure anvil Jul 6, 2025, 6:56 AM

#

small haven if they miss the deadline again, then they are done done

people forget about that the second it drops

dusky aurora Jul 6, 2025, 9:50 AM

#

Gemini gets stuck again

#

but on page reload it's written in full

torn mantle Jul 6, 2025, 10:32 AM

#

pure anvil people forget about that the second it drops

true, but that doesn't mean I'll just forget how much they've played on our nerves

alpine coral Jul 6, 2025, 10:47 AM

#

zinc ore What are they even investing in then

that chart is kinda misleading / confusing imo.. like it's accurate, but it shows capital expenditure – which may or may not involve money spent by these companies to build things related to AI, but it's not AI-specific.. like Amazon's capex would i assume mostly cover things like building datacentres for AWS, replacing their transport fleet to deliver stuff etc.. that's capital expenditure (it's like paying for things to maintain / expand your actual day-to-day business operations)

#

cause it's on Artificial Analysis (an AI site), it gives the impression that it's AI-specific capital expenditure.. but it's not

#

also, yeah both Amazon and Google/Alphabet have made investments in Anthropic - but they're equity investments, not capex (so would technically not be included in the data shown in that original chart; but added to the one created below just for illustration.. their investments in anthropic are marginal compared to their overall capex)

#

but anyway, while it's impossible to disaggregate entirely (there's lots of overlap e.g. datacentres/chips have multiple uses), here's a chart by o3 that tries to separate it out (bit of grain of salt required.. also i excluded oracle because it made o3's job easier, and I couldn't understand why the company was shown in the first place)

rare python Jul 6, 2025, 11:00 AM

#

torn mantle true, but that doesn't mean I'll just forget how much they've played on our nerv...

so cute

split kayak Jul 6, 2025, 11:27 AM

#

ok

tall summit Jul 6, 2025, 11:53 AM

#

alpine coral cause it's on Artificial Analysis (an AI site), it gives the impression that its...

clearly it's not

#

doesn't say so anywhere

alpine coral Jul 6, 2025, 11:59 AM

#

tall summit clearly it's not

yes i would agree (i mean unless stated otherwise, why wouldn't it be)

#

but the comments here about it read as if it did represent those companies' AI expenditures (or investments, in the Anthropic case)

#

was trying to clarify.. tho i dunno why

#

can't cater for all levels of financial literacy...

leaden sun Jul 6, 2025, 12:15 PM

#

you might miss those hedge fund organizations either building their own or investing in specific AI black projects that are hidden from the public, not saying i know one, but the possibility is not zero

dusky aurora Jul 6, 2025, 12:55 PM

#

so has someone already tried the upcoming Gemini version?

#

for me the main measure of a model is uncensoredness

#

current Gemini 06-05 is already less censored than previous versions, I just hope it doesn't become more censored

tall summit Jul 6, 2025, 1:34 PM

#

what

#

what's the issue

#

gemini 2.5 pro is the least censored of all sota models

#

besides deepseek if you count it as such

keen beacon Jul 6, 2025, 1:35 PM

#

all models are censored as hell

#

so much for free speech. you have to dig real hard to find something like UnHoly model or other uncensored ones that are also decently smart.

placid frigate Jul 6, 2025, 1:55 PM

#

Please bring back or add new models to the old legacy.lmarena system, because in the new one chats constantly freeze and new chats often start unexpectedly. Are you aware of these issues?

rare python Jul 6, 2025, 1:56 PM

#

placid frigate Please bring back or add new models to the old legacy.lmarena system, because in...

#1343291835845578853

rare python Jul 6, 2025, 1:57 PM

#

keen beacon all models are censored as hell

What did you do?

#

Writing smut?

#

I rarely got censored

keen beacon Jul 6, 2025, 1:59 PM

#

rare python What did you do?

Can't discuss anything, rap that may be offensive, anything illegal, nsfw ..

#

I have some jailbreak prompts but it's a multi step process and takes quite a bit of time

rare python Jul 6, 2025, 1:59 PM

#

keen beacon Can't dicuss anything, rap that may be offensive, anything illegal, nsfw ..

it's true

#

Companies don't want to be in the newspaper for these prompts

whole wagon Jul 6, 2025, 2:00 PM

#

Poll: How many days after July 4th will grok 4 release?

Winning answer: 6-10 (6 of 15 votes)

keen beacon Jul 6, 2025, 2:08 PM

#

whole wagon

Tuesday

leaden sun Jul 6, 2025, 2:49 PM

#

https://tenor.com/view/arslanian-henri-arslanian-crypto-finance-history-of-finance-gif-21347278

sacred quail Jul 6, 2025, 3:56 PM

#

placid frigate Please bring back or add new models to the old legacy.lmarena system, because in...

I dont understand how someone can prefer the old version's design...

#

Like the new design is so good that i was afraid people were gonna discover it and use it for personal AI usage INSTEAD of for testing models

rare python Jul 6, 2025, 4:09 PM

#

sacred quail I dont understand how someone prefer old version design...

Gradio is the worst UI I've ever seen

#

Super slow and unintuitive

sweet tinsel Jul 6, 2025, 4:10 PM

#

I need some more ideas, c'mon guys.

#

And it would interest me how o3-pro would perform on a DR task, so i would kindly ask someone to maybe send this prompt to o3-pro: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

ember rapids Jul 6, 2025, 4:22 PM

#

theres no way they dont drop grok 4 in the next 2-3 days

hollow ocean Jul 6, 2025, 6:11 PM

#

Only 1% here have o3 pro

#

We’re too poor

pure anvil Jul 6, 2025, 6:22 PM

#

Well, it's pretty easy to get if you know how to scrape

#

opus is hard to get even if you're cracked at scraping

#

anthropic keys in general

#

or so I've heard 🤭

hollow ocean Jul 6, 2025, 6:25 PM

#

pure anvil Well, it's pretty easy to get if you know how to scrape

Send tutorial

ember rapids Jul 6, 2025, 6:37 PM

#

hollow ocean Only 1% here have o3 pro

There’s a teams trial for $1 that allows you to use o3 pro

#

$1 for a month

hollow ocean Jul 6, 2025, 6:44 PM

#

ember rapids There’s a teams trial for $1 that allows you to use o3 pro

I'm using it lol

ocean vortex Jul 6, 2025, 6:58 PM

#

https://www.reddit.com/r/singularity/comments/1lswzmb/grok_is_cooked_beyond_well_done/ Welp...

From the singularity community on Reddit: Grok is cooked beyond wel...

#

I think we all knew this was gonna happen sooner or later

#

Grok4 potentially gonna have the same politically twisted biased training

#

I think Elon just crossed the line from which there's no turning back... By doing this he is basically implying he is smarter than everyone else and knows it best, there's little sense in the AI itself at all then. He still gonna call it faulty if he doesn't like the response or it doesn't align with his wrong assumptions 💀

#

If it's trained to echo you, there's literally no point in it...

storm needle Jul 6, 2025, 7:06 PM

#

ocean vortex https://www.reddit.com/r/singularity/comments/1lswzmb/grok_is_cooked_beyond_well...

I don't think this is grok 4

ocean vortex Jul 6, 2025, 7:07 PM

#

storm needle I don't think this is grok 4

it's not but it shows what their goals are. He's using twitter to control the rhetoric, it's almost guaranteed he's gonna use this training data for models on grok website and API for more reach too...

#

It could get to a point where his AI is not what you think it is lol

#

since the training data has been messed with

#

Kinda ironic that Elon is becoming the main force behind what he used to be so vocally against

storm needle Jul 6, 2025, 7:12 PM

#

ocean vortex it's not but it shows what their goals are. He's using twitter to control the rh...

i don't think his employees would let him do that

ocean vortex Jul 6, 2025, 7:12 PM

#

storm needle i don't think his employees would let him do that

wdym. I don't think they have any choice

#

He's either gonna hire someone else to do it, or they just do it...

#

People that did this for the twitter website have to be ML Engineers working on training the grok already as well, tbh...

keen beacon Jul 6, 2025, 7:20 PM

#

hollow ocean Only 1% here have o3 pro

its not worth it

hollow ocean Jul 6, 2025, 7:21 PM

#

keen beacon its not worth it

why not

keen beacon Jul 6, 2025, 7:21 PM

#

hollow ocean why not

how do you win 200$ + from it ?

#

that you dont from o3

hollow ocean Jul 6, 2025, 7:21 PM

#

keen beacon how do you win 200$ + from it ?

I made more than $200 with o3 pro

keen beacon Jul 6, 2025, 7:21 PM

#

care to share how

hollow ocean Jul 6, 2025, 7:21 PM

#

🤫

keen beacon Jul 6, 2025, 7:21 PM

#

i made 20000$ from it

hollow ocean Jul 6, 2025, 7:21 PM

#

Gotta gatekeep

keen beacon Jul 6, 2025, 7:22 PM

#

noob

hollow ocean Jul 6, 2025, 7:22 PM

#

proof

keen beacon Jul 6, 2025, 7:22 PM

#

Gotta gatekeep

hollow ocean Jul 6, 2025, 7:22 PM

#

you made 0

keen beacon Jul 6, 2025, 7:22 PM

#

no you

ocean vortex Jul 6, 2025, 7:22 PM

#

hollow ocean Only 1% here have o3 pro

I have it on API. But there aren't many use cases for it. Mostly just to test things. Cause for most normal things you are just better off with standard o3 tbh

hollow ocean Jul 6, 2025, 7:23 PM

#

ocean vortex I have it on API. But there aren't many use cases for it. Mostly just to test th...

I like o3 pro better

keen beacon Jul 6, 2025, 7:24 PM

#

im using ai in assistance always, deep research is most useful for my case, auto tagging that kind of thing.
the value o3 has is getting things done 10-20% faster, though you still have to do the rest yourself

#

o3-pro would be 10x more useful if it ran on groq , imagine 15 min reports or analysis being done in just 1 min. That would be OP

unborn ocean Jul 6, 2025, 7:32 PM

#

keen beacon o3-pro would be 10x more useful if it ran on groq , imagine 15 min reports or an...

you can kind of reach 50% of the speed of groq just by batch sizes (and other inference stuff) otherwise there is no crazy 10x advantage of the platform

#

and it is less optimized for cross expert communication / MoE in general, so in the end you are probably looking at a max. 50% advantage of peak t/s per user

#

and its also not really good for large kv cache

#

they are an interesting company (with a lot of brain and cash behind it) though and all of this can change in the future

ocean vortex Jul 6, 2025, 7:44 PM

#

hollow ocean I like o3 pro better

for many prompts it's gonna be exactly the same. For a few of them it will output a response more or less equal to o3's lucky response (the one you would eventually get after regenerating several times)

#

So for me at least, o3 with tools on chatgpt makes more sense than o3-pro API. And pro sub makes no sense at all given the price 🤷‍♂️

keen fulcrum Jul 6, 2025, 7:49 PM

#

small haven Jul 6, 2025, 7:53 PM

#

cc

#

aider is archaic

ocean vortex Jul 6, 2025, 8:06 PM

#

Unpopular opinion but don't use it at all. Asking for help in a separate chat is enough of a help. Maybe basic code completion but even that has downsides. Otherwise you gonna forget how to code and in turn how to debug what it messes up LOL

elder rapids Jul 6, 2025, 8:08 PM

#

stop fkin pinging me dumbass

#

swear to God it's annoying

radiant siren Jul 6, 2025, 8:08 PM

#

based on leaked benchmarks, would Grok 4 become leader of textarena ? asking for polymarket betting. .

elder rapids Jul 6, 2025, 8:08 PM

#

mo

#

no

ocean vortex Jul 6, 2025, 8:09 PM

#

elder rapids stop fkin pinging me dumbass

the question is

#

how do you understand gemini being good?

#

💀

elder rapids Jul 6, 2025, 8:09 PM

#

yeah it's a real wonder

#

it's definitely not 83% on aider

#

#1 on lmarena

#

~#1 on AA

#

#1 in price to performance with flash

#

this COULD be just me tho

ocean vortex Jul 6, 2025, 8:11 PM

#

that was rhetorical question, 2.5Pro obviously is around SOTA lol

elder rapids Jul 6, 2025, 8:11 PM

#

yeah I know

candid storm Jul 6, 2025, 8:53 PM

#

radiant siren based on leaked benchmarks, would Grok 4 become leader of textarena ? asking for...

Yes

#

Its a big improvement over grok 3

radiant siren Jul 6, 2025, 8:55 PM

#

candid storm Yes

omg, nice , if so

tall summit Jul 6, 2025, 8:56 PM

#

keen fulcrum

toss up between cc and gemini

jade egret Jul 6, 2025, 8:58 PM

#

keen beacon i made 20000$ from it

i made 929372947839489 from it

radiant siren Jul 6, 2025, 9:35 PM

#

yeah wouldn't make sense to release if not top 1

small haven Jul 6, 2025, 9:37 PM

#

grok 4 is going to be a blur as usual

#

claude 4.1 > grok 4

#

within 2 weeks

#

its in red teaming phase

#

once that's over, t minus 1 week

ornate agate Jul 6, 2025, 9:51 PM

#

I have a feeling that Claude and OAI are just waiting to drop something good as soon as grok4 launches

sacred quail Jul 6, 2025, 9:52 PM

#

I dont think they care grok so much. But they definitely care gemini. Im still believing openai released O3 too early because of they didnt expect the performance of Gemini 2.5 pro

#

O3 being criticized about high hallucination rates and im believing this is because they released earlier than their planned date

ornate agate Jul 6, 2025, 9:54 PM

#

I don't think there is really much incentive to hold something which is really good back more than a few weeks

small haven Jul 6, 2025, 9:55 PM

#

ornate agate I have a feeling that Claude and OAI are just waiting to drop something good as ...

yup just going to be a blur, wen grok 5

sacred quail Jul 6, 2025, 9:55 PM

#

I mean, they already waited for O3 so long, it was becomed almost conspiracy theory

#

But for me,

#

When they see gemini 2.5

#

They just released

#

Yea i tried to say this. Sry my english is like gemini 05/06 update

small haven Jul 6, 2025, 9:59 PM

#

neptune v3 is claude 4.1

#

select testers

#

left is claude 4.1 (neptune v3), right is claude 4 opus. correct answer is 498

#

twitter says only o3 pro could solve this

#

o3 couldn't if u dont allow it tools

ornate agate Jul 6, 2025, 10:01 PM

#

if o3 pro can solve it then o3 can solve it too, and if o3 can solve a maths problem then so can Gemini/DeepSeek tbh.

small haven Jul 6, 2025, 10:01 PM

#

ornate agate if o3 pro can solve it then o3 can solve it too, and if o3 can solve a maths pro...

if o3 pro can solve it then o3 can solve? whats the purpose of o3 pro then?

#

conspiracy

#

yay neptune train has begun

hardy pecan Jul 6, 2025, 10:09 PM

#

small haven o3 couldn't if u dont allow it tools

but.... it comes with tools lol

small haven Jul 6, 2025, 10:09 PM

#

hardy pecan but.... it comes with tools lol

api

#

mhmm, even o4 mini high

hardy pecan Jul 6, 2025, 10:09 PM

#

api shmm-api!! fair enough le api user

hollow ocean Jul 6, 2025, 10:10 PM

#

better than kingfall

small haven Jul 6, 2025, 10:12 PM

#

adjust ur polymarket bets guys

#

if it runs out of reasoning tokens, then i dont think it can solve it

hollow ocean Jul 6, 2025, 10:16 PM

#

can it solve question 10 simple bench

radiant siren Jul 6, 2025, 10:18 PM

#

small haven adjust ur polymarket bets guys

so Grok 4 wont win?

torn mantle Jul 6, 2025, 10:20 PM

#

small haven left is claude 4.1 (neptune v3), right is claude 4 opus. correct answer is 498

math and coding problems are all good, but what about other benchmarks as well?

#

its relatively easy to gain on math/coding reasoning problems

small haven Jul 6, 2025, 10:22 PM

#

torn mantle math and coding problems are all good, but what about other benchmarks as well?

true, but i mean i want a functional llm, not conversational tbh

#

grok 4 finetuned on code, is not even beating the current form of sonnet 4

#

im basing on the metadata leaks

#

72/75 swe bench

#

ok buddy

keen beacon Jul 6, 2025, 10:45 PM

#

They probably will be but its not that big a deal

tall summit Jul 6, 2025, 11:23 PM

#

im also curious
any updates regarding neptune?

civic flame Jul 6, 2025, 11:47 PM

#

lol whoever is leaking neptune responses is breaking an NDA btw

#

people should be less stupid

torn mantle Jul 6, 2025, 11:56 PM

#

https://x.com/OfficialLoganK/status/1942007689074102552

Logan Kilpatrick (@OfficialLoganK)

The next 6 months of AI are likely to be the most wild we will have seen so far

rare python Jul 6, 2025, 11:57 PM

#

torn mantle https://x.com/OfficialLoganK/status/1942007689074102552

torn mantle Jul 6, 2025, 11:57 PM

#

rare python

thank you >.<

rare python Jul 6, 2025, 11:58 PM

#

torn mantle thank you >.<

lone vector Jul 7, 2025, 12:34 AM

#

torn mantle https://x.com/OfficialLoganK/status/1942007689074102552

#

Grok 4 hype will end after a week

wintry tinsel Jul 7, 2025, 12:38 AM

#

torn mantle https://x.com/OfficialLoganK/status/1942007689074102552

I doubt it big man the first 6months were like an acid trip and some of the best of my life

#

Maybe the second best 6 months

rare python Jul 7, 2025, 12:55 AM

#

https://www.redditez.com/r/LocalLLaMA/comments/1lsz4hk/huaweis_pangu_ai_rocked_by_unverified_claims_of/

Rich-Mushroom-8360 (@Rich-Mushroom-8360)

💬 40 🔁 0 💜 226 👀 0

Posted in r/LocalLLaMA
Huawei's Pangu AI Rocked by Unverified Claims of Fraud from Alleged Team Member
github.com/HW-whistleblower/True-Story-of-Pangu

after reading the translation of this article, I found there are many details. Is it possibly true or just a fake story?

gemini’s translation:

This is a full translation of the provided text. The original is a deeply emotional and accusatory letter from a self-proclaimed Huawei employee. The translation aims to preserve the tone, technical details, and cultural nuances of the original piece.

The Fall of Pangu: The Heartbreak and Darkness of the Huawei Noah’s Ark Pangu LLM Development Journey

Hello everyone,…

verbal nimbus Jul 7, 2025, 2:06 AM

#

What's the best anonymous model rn?

plucky whale Jul 7, 2025, 2:42 AM

#

I have a question, are bots with names like Steve separate bots or is it a nickname for another chatbot? Thanks.

wild kayak Jul 7, 2025, 3:02 AM

#

Does anyone know what the "so-true" and "so-false" in model names mean for the model in arena?

leaden palm Jul 7, 2025, 3:33 AM

#

don't know if you guys remember the earlier gemini depressionposting but it might be a bad thing

#

it might not be wise to give depressed llms access to computers

small haven Jul 7, 2025, 3:42 AM

#

still subbing for cursor is an iq test

echo aurora Jul 7, 2025, 3:44 AM

#

wild kayak Does anyone know what the "so-true" and "so-false" in model names mean for the m...

Which leaderboard are you seeing this on?

leaden palm Jul 7, 2025, 4:04 AM

#

echo aurora Which leaderboard are you seeing this on?

it looks like https://web.lmarena.ai/leaderboard#:~:text=Average%20Win%20Rate%20Against%20All%20Other%20Models

empty stump Jul 7, 2025, 4:05 AM

#

is there an ability to share chats

echo aurora Jul 7, 2025, 4:07 AM

#

leaden palm it looks like https://web.lmarena.ai/leaderboard#:~:text=Average%20Win%20Rate%20...

ah thank you!

echo aurora Jul 7, 2025, 4:07 AM

#

wild kayak Does anyone know what the "so-true" and "so-false" in model names mean for the m...

I'm not sure tbh, I'll check and follow-up blobthumbsup

echo aurora Jul 7, 2025, 4:08 AM

#

empty stump is there an ability to share chats

sorry to say there is not, but it is something we're aware the community would like to see

empty stump Jul 7, 2025, 4:09 AM

#

Ok

verbal nimbus Jul 7, 2025, 4:27 AM

#

leaden palm don't know if you guys remember the earlier gemini depressionposting but it migh...

Uhh...

balmy mist Jul 7, 2025, 5:32 AM

#

Is grok 4 still not out?👀

echo aurora Jul 7, 2025, 5:48 AM

#

echo aurora I'm not sure tbh, I'll check and follow-up blobthumbsup

it's for structured output

wild kayak Jul 7, 2025, 6:04 AM

#

echo aurora it's for structured output

what does structured output mean? 🤔

ocean vortex Jul 7, 2025, 8:14 AM

#

small haven 72/75 swe bench

It's kinda odd that xAI went for swe-bench. Anthropic have been contaminating this for years with a lot of experience now, it's a bad choice of a benchmark to showcase your model lmao

#

you not gonna just show up and beat the other lab on it that has been doing the same for so long... Tricky part with swe is your model doesn't have to be very smart. So it's not like you will score high by default if you have a very capable model

#

gpt4.1 destroys o1-high on it

vast hound Jul 7, 2025, 9:01 AM

#

balmy mist Is grok 4 still not out?👀

I love how evals and benchmarks show us that it's even better than Gemini 2.5 Pro. I doubt this anyways.

open mountain Jul 7, 2025, 10:05 AM

#

Where did gpt 4.5 on lmarena go?

mint meteor Jul 7, 2025, 10:30 AM

#

https://arxiv.org/abs/2506.18943

This as a new target model?

arXiv.org

From Pixels and Words to Waves: A Unified Framework for Spectral Di...

Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both b...

keen beacon Jul 7, 2025, 10:31 AM

#

leaden palm it might not be wise to give depressed llms access to computers

🥹

#

unborn ocean Jul 7, 2025, 10:40 AM

#

unlimited 4.1 on copilot

#

do i feel bad? no

cedar tide Jul 7, 2025, 10:54 AM

#

Mistral small 3.2 (22B)
Arrived in the leaderboard
best place for an llm under 27b

Screenshot_2025-07-07-12-50-05-374_com.android.chrome-edit.jpg

#

Good jump compared to 3.1

Screenshot_2025-07-07-12-51-03-503_com.android.chrome-edit.jpg

#

Ideogram v3 and imagen 4 ultra arrived in the leaderboard

Screenshot_2025-07-07-12-57-57-896_com.android.chrome-edit.jpg

#

@echo aurora can we have an explanation: why is mistral, which arrived 3 days ago, already in the leaderboard, but Claude 4 think is still not there after more than a month? (And also glm 4 air in text and web dev arena)

#

we are also still waiting for 2.5 flash lite on webdev leaderboard (he was in the arena)

cedar tide Jul 7, 2025, 11:05 AM

#

cedar tide @echo aurora can we have an explanation, why mistral arrived 3 days ago...

Anthropic paid lmarena not to put their bad ranking?

rare python Jul 7, 2025, 11:19 AM

#

Anyone understand what o3 wrote?

Mode collapse
You still have a reasonably functioning generator network, but it only spits out a few kinds of samples no matter how many different noise vectors you feed it. The mapping from noise to data has lost diversity (several data modes are missing), yet each produced sample can still look plausible. Classic symptom in GANs: every picture looks like the same face or the same digit with slight variations. The discriminator and generator keep chasing each other, the generator finds it can fool the discriminator with one‐or-two tricks and sticks to them.

Model collapse
The network itself degenerates so that its internal representations carry almost no information. In the extreme it outputs a constant or near-constant vector for every input. You see this in self-supervised contrastive learning if the architecture has no “stop-gap” like a predictor, or in continued training on low-diversity synthetic data where weights drift to a trivial solution. Performance crashes across the board; the model is not just missing modes, it has effectively stopped learning anything useful.

Key distinctions
Diversity vs information: mode collapse kills output variety; model collapse kills information content altogether.
Scope: mode collapse is about the generator’s output distribution; model collapse is about the whole network’s representations.
Recovery: mode collapse often fixes with training tricks (minibatch discrimination, unrolled GAN, etc.); model collapse usually needs a restart, stronger regularization, or a change in objective to prevent the trivial constant solution.

#

Gemini 2.5 Pro:

Mode collapse is a problem you see in generative models, especially GANs. The generator finds a few outputs that are good at fooling the discriminator and just keeps making those over and over. So instead of a wide variety of realistic images, you get a handful of repetitive ones. The model has "collapsed" onto a few modes, or types, of output.

Model collapse is a more recent idea related to LLMs. It's the theory that if you train new models on data generated by older models, quality will degrade over time. Each generation of AI learns the patterns and artifacts of the previous AI, not the original human data, so the models get progressively dumber and less diverse. It's like making a photocopy of a photocopy.

So, mode collapse is about a single model lacking output diversity. Model collapse is about a whole lineage of models losing quality.
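
The "photocopy of a photocopy" picture of model collapse can be illustrated with a toy simulation. This is only a minimal sketch under loud assumptions: the "model" is a one-dimensional Gaussian, each generation is fitted solely on a handful of samples drawn from the previous generation, and the function name `simulate_collapse` and its parameters are made up for illustration. It is not how LLM training works, but it shows the same mechanism of diversity draining away.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=5, seed=0):
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the original human data
    history = [std]
    for _ in range(generations):
        # Draw synthetic data from the current model...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and fit the next model only on that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"std at generation 0: {history[0]:.3f}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3e}")
```

With a small sample size, the fitted spread tends to shrink generation after generation, so the last value printed should be far below the starting value of 1. Mode collapse, by contrast, happens inside a single training run and would need a generator and discriminator to demonstrate.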

#

@leaden sun thoughts?

leaden sun Jul 7, 2025, 11:38 AM

#

rare python @leaden sun thoughts?

have you tried "explain me like I'm 5?" 👀

rare python Jul 7, 2025, 11:38 AM

#

leaden sun have you tried "explain me like I'm 5?" 👀

No, but isn't that extreme?

#

2.5 Pro has the sweet spot. Not too dumbed down, not too technical

rare python Jul 7, 2025, 11:39 AM

#

leaden sun have you tried "explain me like I'm 5?" 👀

Prompt (I have my own system instructions that has "Speak plainly about complex topics and actively vary your vocabulary and sentence structures to avoid sounding repetitive. Simple phrasing is always preferable to attempts at sounding wise, and technical jargon should be reserved for situations where it is absolutely necessary.") :

What're the differences of mode collapse and model collapse?

calm sequoia Jul 7, 2025, 11:56 AM

#

Hey guys, my o3 started thinking at least twice as long as before. Average is 2+ min. now. Anybody experience the same or is this just server capabilities thing?

tall summit Jul 7, 2025, 12:12 PM

#

unborn ocean unlimited 4.1 on copilot

?!

rare python Jul 7, 2025, 12:19 PM

#

https://arxiv.org/pdf/2504.20879

unborn ocean Jul 7, 2025, 12:39 PM

#

rare python Anyone understand what o3 wrote? Mode collapse You still have a reasonably fu...

i think you can kind of get both, but no matter how you twist it; that is a really poor job of a "explanation" on o3's part

rare python Jul 7, 2025, 12:40 PM

#

unborn ocean i think you can kind of get both, but no matter how you twist it; that is a real...

It assumed you are an expert on this subject

unborn ocean Jul 7, 2025, 12:40 PM

#

tall summit ?!

github copilot, they added credits for essentially all models

#

but hey gpt 4.1 got kind of unlimited access

leaden sun Jul 7, 2025, 12:44 PM

#

rare python No, but isn't that extreme?

the process of understanding is similar to that of learning, it's recursive, iterative and also repetitive. you can start with "explain like I'm 5" once understood then move to next level "explain like I'm 15/25" etc until you can grasp what gemini and o3 said, technical terms are important for practical implementation, but it's easier once you understood the big picture and all necessary details

rare python Jul 7, 2025, 1:10 PM

#

leaden sun the process of understanding is similar to that of learning, it's recursive, ite...

I prefer one shot answer :D

#

I just want a quick one

tall summit Jul 7, 2025, 1:44 PM

#

unborn ocean github copilot, they added credits for essentially all models

oh wow, haven't used github copilot in so long but you can kinda get unlimited 4.1 in many other places

#

still cool though

tall summit Jul 7, 2025, 1:44 PM

#

leaden palm don't know if you guys remember the earlier gemini depressionposting but it migh...

is this real or faked

leaden palm Jul 7, 2025, 2:15 PM

#

tall summit is this real or faked

it's real, many people have shared screenshots of gemini 2.5 pro being depressed

tall summit Jul 7, 2025, 2:22 PM

#

leaden palm it's real, many people have shared screenshots of gemini 2.5 pro being depressed

yeah but this one

leaden palm Jul 7, 2025, 2:22 PM

#

well it's from the internet so you can never really know but i don't know why someone would fake it

echo aurora Jul 7, 2025, 2:33 PM

#

cedar tide @echo aurora can we have an explanation, why mistral arrived 3 days ago...

I'll let the team know and keep you updated.

cedar tide Jul 7, 2025, 2:34 PM

#

echo aurora I'll let the team know and keep you updated.

Thx

keen beacon Jul 7, 2025, 2:46 PM

#

https://x.com/flavioad/status/1942218187778539688?s=46

Flavio Adamo (@flavioAd)

what if OpenAI won’t release GPT-5 until Grok 4 drops and xAI won’t release Grok 4 until GPT-5 drops

ocean vortex Jul 7, 2025, 2:54 PM

#

OpenAI only needs to beat their own o3

#

which already beats 2.5Pro in many things

rare python Jul 7, 2025, 2:56 PM

#

ocean vortex which already beats 2.5Pro in many things

math?

ocean vortex Jul 7, 2025, 2:56 PM

#

well there are several individual benchmarks where o3 is better, one of them math yeah

#

but if they beat o3 then by extension it beats 2.5Pro

rare python Jul 7, 2025, 2:57 PM

#

ocean vortex well there are several individual benchmarks where o3 is better, one of them mat...

USAMO 2.5 Pro and DeepSeek still on top

ocean vortex Jul 7, 2025, 2:57 PM

#

rare python Jul 7, 2025, 2:58 PM

#

ocean vortex

saturated

#

https://matharena.ai/

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

#

USAMO 2025

ocean vortex Jul 7, 2025, 2:58 PM

#

rare python https://matharena.ai/

#

I think I was the one who showed you that url too lol

rare python Jul 7, 2025, 2:59 PM

#

ocean vortex

Saturated. I said choose USAMO 2025

#

ocean vortex Jul 7, 2025, 2:59 PM

#

rare python Saturated. I said choose USAMO 2025

that's not how you do benchmarks. All are important

#

and overall o3 is better

rare python Jul 7, 2025, 3:00 PM

#

ocean vortex that's not how you do benchmarks. All are important

Overall doesn't have USAMO 2025

ocean vortex Jul 7, 2025, 3:00 PM

#

if everyone was cherry-picking then every model would be "the best"

jade egret Jul 7, 2025, 3:01 PM

#

when grok 4 ; (

rare python Jul 7, 2025, 3:01 PM

#

I don't know why Artificial Analysis keep using saturated benchmarks

ocean vortex Jul 7, 2025, 3:01 PM

#

rare python Overall doesn't have USAMO 2025

Not enough data yet to include it I presume. But either way, they do overall ranking for a reason. Don't pretend you are smarter than them. And you would be missing data points just looking at USAMO - that is not enough

#

USAMO is like 6 tasks - that's it

#

not nearly enough

#

You only need to contaminate for 2 (!) tasks and you would catapult to the very top of USAMO lol
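
To put a number on the "2 tasks" point: USAMO has 6 problems, each graded out of 7 points, for a maximum of 42. The sketch below works out what share of the total a model would bank from problems it has effectively memorized. It assumes a contaminated problem earns full marks, which is a simplification, and `score_share` is a hypothetical helper name.

```python
PROBLEMS = 6
POINTS_PER_PROBLEM = 7


def score_share(solved_problems):
    """Fraction of the maximum score earned by fully solving that many problems."""
    max_score = PROBLEMS * POINTS_PER_PROBLEM
    return solved_problems * POINTS_PER_PROBLEM / max_score


# Two fully solved problems are 14 of 42 points.
print(f"{score_share(2):.1%}")  # 33.3%
```

With so few problems, every single one moves the score by about 17 percentage points, which is why a small benchmark is sensitive to both contamination and luck.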

rare python Jul 7, 2025, 3:05 PM

#

ocean vortex USAMO is like 6 tasks - that's it

You have to write proof for USAMO though

ocean vortex Jul 7, 2025, 3:06 PM

#

rare python You have to write proof for USAMO though

the "proof" is the entire solution. This is a pass for task1:

📎 message.txt

#

copy paste with markdown fails very bad 💀
But it's not a long answer at all

keen beacon Jul 7, 2025, 3:08 PM

#

leaden palm it might not be wise to give depressed llms access to computers

dont worry if you give them access to your network level wifi itll be fine

rare python Jul 7, 2025, 3:09 PM

#

ocean vortex You only need to contaminate for 2 (!) tasks and you would catapult to the very ...

ocean vortex Jul 7, 2025, 3:10 PM

#

ocean vortex the "proof" is the entire solution. This is a pass for task1:

To prove that for any positive integers $k$ and $d$, there exists a positive integer $N$ such that for every odd integer $n > N$, all digits in the base-$2n$ representation of $n^k$ are greater than $d$, we proceed as follows.

Let $b = 2n$, so the base is $b$. The number $n^k$ is an integer since $n$ is an integer. The digits in the base-$b$ representation of $n^k$ are denoted by $d_j$ for $j = 0, 1, \ldots, m-1$, where $m$ is the number of digits, and $0 \le d_j < b$. We need to show that for large enough odd $n$, all $d_j \ge d+1$.

The digits satisfy
$$d_j = \left\lfloor \frac{n^k}{b^j} \right\rfloor - b \left\lfloor \frac{n^k}{b^{j+1}} \right\rfloor.$$
Substituting $b = 2n$, we have:
$$d_j = \left\lfloor \frac{n^k}{(2n)^j} \right\rfloor - 2n \left\lfloor \frac{n^k}{(2n)^{j+1}} \right\rfloor = \left\lfloor \frac{n^{k-j}}{2^j} \right\rfloor - 2n \left\lfloor \frac{n^{k-j-1}}{2^{j+1}} \right\rfloor.$$

Define $a_j = \frac{n^{k-j}}{2^j}$. Then $a_j / b = a_j / (2n) = \frac{n^{k-j-1}}{2^{j+1}} = a_{j+1}$, so:
$$d_j = \lfloor a_j \rfloor - b \lfloor a_{j+1} \rfloor.$$
Since $a_j = b\,a_{j+1}$, we can write:
$$d_j = \lfloor b\,a_{j+1} \rfloor - b \lfloor a_{j+1} \rfloor.$$
Let $y = a_{j+1}$, so:
$$d_j = \lfloor b y \rfloor - b \lfloor y \rfloor.$$
The expression $\lfloor b y \rfloor - b \lfloor y \rfloor$ is the floor of $b$ times the fractional part of $y$, i.e.,
$$d_j = \lfloor b \{y\} \rfloor,$$
where $\{y\} = y - \lfloor y \rfloor$ is the fractional part of $y$. Substituting $y = a_{j+1} = \frac{n^{k-j-1}}{2^{j+1}}$ and $b = 2n$:
$$d_j = \left\lfloor 2n \left\{ \frac{n^{k-j-1}}{2^{j+1}} \right\} \right\rfloor.$$

The fractional part is given by:
$$\left\{ \frac{n^m}{2^{j+1}} \right\} = \frac{r_{j+1}}{2^{j+1}},$$
where $m = k-j-1$ and $r_{j+1} = n^m \bmod 2^{j+1}$, with $1 \le r_{j+1} \le 2^{j+1} - 1$ and $r_{j+1}$ odd (since $n$ is odd). Thus:
$$d_j = \left\lfloor 2n \cdot \frac{r_{j+1}}{2^{j+1}} \right\rfloor = \left\lfloor \frac{n\,r_{j+1}}{2^j} \right\rfloor.$$
Since $r_{j+1} \ge 1$:
$$d_j \ge \left\lfloor \frac{n}{2^j} \right\rfloor.$$

The index $j$ ranges from $0$ to $k-1$ because $n^k < b^k = (2n)^k$ and for large $n$, $n^k \ge b^{k-1} = (2n)^{k-1}$. Specifically, $n^k \ge (2n)^{k-1}$ when $\frac{n^k}{(2n)^{k-1}} = \frac{n}{2^{k-1}} \ge 1$, i.e., $n \ge 2^{k-1}$. In this case, the number of digits is exactly $k$, and $j = 0, 1, \ldots, k-1$.

For each $j \le k-1$, $2^j \le 2^{k-1}$, so:
$$\frac{n}{2^j} \ge \frac{n}{2^{k-1}},$$
and thus:
$$\left\lfloor \frac{n}{2^j} \right\rfloor \ge \left\lfloor \frac{n}{2^{k-1}} \right\rfloor.$$
Therefore, for all $j$:
$$d_j \ge \left\lfloor \frac{n}{2^{k-1}} \right\rfloor.$$

To ensure all digits are at least $d+1$, it suffices to have:
$$\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1.$$
Since $d+1$ is an integer, $\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1$ if and only if:
$$\frac{n}{2^{k-1}} \ge d+1,$$
i.e.,
$$n \ge (d+1)\,2^{k-1}.$$

Set $M = (d+1)\,2^{k-1}$. Since $d$ and $k$ are positive integers, $M$ is an integer. To ensure $n$ is odd and $n \ge M$, define $N$ as follows:
- If $M$ is even, set $N = M-1$ (which is odd).
- If $M$ is odd, set $N = M-1$ (which is even).

Then, for any odd integer $n > N$:
- If $M$ is even, $n > M-1$ and $n$ odd, so $n \ge M+1 > M$ (since $M$ even, the smallest odd $n > M-1$ is $M+1$).
- If $M$ is odd, $n > M-1$ and $n$ odd, so $n \ge M$ (since $M$ odd, the smallest odd $n > M-1$ is $M$).

In both cases, $n \ge M$. Since $n \ge (d+1)\,2^{k-1}$, we have:
$$\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1,$$
and thus all digits $d_j \ge d+1$.

Additionally, for $n \ge M \ge (d+1)\,2^{k-1} \ge 2^{k-1}$ (since $d+1 \ge 1$), we have $n^k \ge (2n)^{k-1}$, so the base-$b$ representation has exactly $k$ digits, all covered by $j = 0$ to $k-1$.

Therefore, for every odd integer $n > N$, the digits in the base-$2n$ representation of $n^k$ are all greater than $d$.

$$\boxed{N = \begin{cases} M - 1 & \text{if } M \text{ is even} \\ M - 1 & \text{if } M \text{ is odd} \end{cases} \quad \text{where} \quad M = (d+1) \cdot 2^{k-1}}$$
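
The bound in the quoted solution, $N = (d+1) \cdot 2^{k-1} - 1$, can be sanity-checked by brute force. The sketch below is a hedged illustration rather than part of the original answer: the helper names (`digits_in_base`, `threshold`, `claim_holds`) and the tested ranges are arbitrary choices, and a finite check like this is only a smoke test, not a substitute for the proof.

```python
def digits_in_base(value, base):
    """Digits of a positive integer in the given base, least significant first."""
    digits = []
    while value > 0:
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits


def threshold(k, d):
    """N = (d + 1) * 2**(k - 1) - 1, the bound stated in the quoted proof."""
    return (d + 1) * 2 ** (k - 1) - 1


def claim_holds(k, d, span=400):
    """Directly check every odd n with N < n <= N + span."""
    n_bound = threshold(k, d)
    for n in range(n_bound + 1, n_bound + span + 1):
        if n % 2 == 1 and min(digits_in_base(n ** k, 2 * n)) <= d:
            return False
    return True


for k in range(1, 6):
    for d in range(1, 6):
        assert claim_holds(k, d), (k, d)
print("bound held for all tested (k, d)")
```

For instance, with $k = 2$ and $d = 3$ the bound is $N = 7$, and $n = 9$ gives $81 = 4 \cdot 18 + 9$, so the base-18 digits are 4 and 9, both greater than 3.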

rare python Jul 7, 2025, 3:12 PM

#

ocean vortex Jul 7, 2025, 3:12 PM

#

it's like insanely limited in scope to only look at USAMO. And yeah easy to cheat if you disregard other math benchmarks

#

the human judge seems to be basically just looking at the singular output and grading it. You can view rating of each score

#

Like there are no follow-up questions...

#

there most definitely is some intentional cheating. You should always expect that tbh

#

that's why we need scale and variety of questions

#

AI labs are not gonna not cheat if they can do it. It's just that it's not possible to do this effectively with proper benchmarks

#

But would be possible if they cared only about USAMO and nothing else math related

pure anvil Jul 7, 2025, 3:20 PM

#

LLMs will suck at USAMO for a long time (if not forever)

#

and at any new math question for that matter

#

(that's reasonably difficult but solvable by humans in a test setting)

sour spindle Jul 7, 2025, 3:48 PM

#

Any interesting new models in the arena

tall summit Jul 7, 2025, 3:57 PM

#

usamo questions are easy tbh... 90 minutes on average a question

cedar tide Jul 7, 2025, 4:30 PM

#

Thx
#announcements message

tall summit Jul 7, 2025, 4:30 PM

#

grok 3 mini high

#

whys this a thing

whole wagon Jul 7, 2025, 4:31 PM

#

What was its code name

whole wagon Jul 7, 2025, 4:31 PM

#

tall summit grok 3 mini high

It's the only reasoning model xAI has ever released

cedar tide Jul 7, 2025, 4:37 PM

#

@echo aurora We also want the details of the leaderboard updates, that's a pain, on the site we only see that there was an update but we don't know if a model was added and if so we don't know which one

echo aurora Jul 7, 2025, 4:39 PM

#

cedar tide @echo aurora We also want the details of the leaderboard updates, that'...

Gotcha, so an update that includes: when a leaderboard was updated, if new models were added on that leaderboard, any other info you'd expect from an update like this?

cedar tide Jul 7, 2025, 4:41 PM

#

echo aurora Gotcha, so an update that includes: when a leaderboard was updated, if new model...

Yes and after if you have time to give you can also inform the ranking in each category for the new arrival 😅

cedar tide Jul 7, 2025, 4:42 PM

#

echo aurora Gotcha, so an update that includes: when a leaderboard was updated, if new model...

for example for this morning's update you could have put a message saying that mistral small 3.2 and imagen 4 ultra and ideogram v3 have arrived in their respective leaderboards

#

unless someone goes to the site and reads the whole new ranking no one would have noticed that

rare python Jul 7, 2025, 4:44 PM

#

Should make a discord bot to do the changelogs

echo aurora Jul 7, 2025, 4:44 PM

#

cedar tide for example for this morning's update you could have put a message saying that m...

Makes sense, thank you for sharing. I'll flag to the team and we'll see if we can start doing this.

cedar tide Jul 7, 2025, 4:44 PM

#

who knows for example that mistral small 3.2 also arrived on the vision leaderboard 😶
nobody at all

echo aurora Jul 7, 2025, 4:44 PM

#

rare python Should make a discord bot to do the changelogs

Yes, we have plans to do this.
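
The changelog requested above (which models were added to which leaderboard, and where they landed) could be produced by diffing two leaderboard snapshots. This is a minimal sketch, not LMArena's actual tooling: the snapshot shape, the function name `leaderboard_changelog`, and the ranks in the example are all assumptions made up for illustration.

```python
from typing import Dict, List

# Hypothetical snapshot shape: leaderboard name -> model names in rank order.
Snapshot = Dict[str, List[str]]


def leaderboard_changelog(old: Snapshot, new: Snapshot) -> List[str]:
    """Return one line per model that appears in `new` but not in `old`."""
    lines = []
    for board in sorted(new):
        previous = set(old.get(board, []))
        for rank, model in enumerate(new[board], start=1):
            if model not in previous:
                lines.append(f"{board}: {model} added at rank {rank}")
    return lines


# Placeholder data; the ranks are invented, only the model name comes from the chat.
old = {"text": ["model-a", "model-b"], "vision": ["model-a"]}
new = {
    "text": ["model-a", "mistral-small-3.2", "model-b"],
    "vision": ["model-a", "mistral-small-3.2"],
}
for line in leaderboard_changelog(old, new):
    print(line)
```

A bot would only need to post these lines to an announcements channel after each update; how the snapshots are fetched is left out because the site's data format is not described here.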

teal mantle Jul 7, 2025, 5:11 PM

#

I need another AI subscription

#

Should I get chatgpt or wait for grok 4
Because it is for non-coding tasks and few except eval would use consumer frontends for coding

rare python Jul 7, 2025, 5:13 PM

#

teal mantle I need another AI subscription

Are you a heavy user?

teal mantle Jul 7, 2025, 5:23 PM

#

rare python Are you a heavy user?

Yes

rare python Jul 7, 2025, 5:24 PM

#

teal mantle Yes

What do you mainly use AI for?

rare python Jul 7, 2025, 5:42 PM

#

teal mantle Should I get chatgpt or wait for grok 4 Because it is for non-coding tasks and f...

Do you mean you are using it for every task except coding?

hoary plaza Jul 7, 2025, 5:47 PM

#

Grok 3 mini high?

#

Is this the new model that was released after the hype??

raven void Jul 7, 2025, 5:54 PM

#

thats an old model though

teal mantle Jul 7, 2025, 6:01 PM

#

rare python Do you mean you are using it for every task except coding?

yes, I am still setting up gemini cli since google account is difficult to work with btw

#

yes, most tasks I decide to use on web or phone app are non coding

keen fulcrum Jul 7, 2025, 7:49 PM

#

Poll: Which is the best CLI?

Winner: Claude Code (option 2), 11 of 14 votes

unborn ocean Jul 7, 2025, 8:01 PM

#

btw: does anyone here know if the system prompt in web dev arena specifies the size of the iframe?

#

bc a lot of llms seem to be really "attuned" to it

ocean vortex Jul 7, 2025, 8:20 PM

#

hoary plaza Is this the new model that was released after the hype??

old model... Not sure what's the deal with releasing it on lmarena now lol

unborn ocean Jul 7, 2025, 8:42 PM

#

ocean vortex old model... Not sure what's the deal with releasing it on lmarena now lol

#

grok 3 mini sucks on pretty much everything i have tried

#

and i am 'sure' they are still wasting money on the model

elder rapids Jul 7, 2025, 8:50 PM

#

r/singularity is bugging so hard

#

https://www.reddit.com/r/singularity/s/4Qd1SH1k1M

From the singularity community on Reddit: Might be stating the obvi...


zinc ore Jul 7, 2025, 8:59 PM

#

https://x.com/elonmusk/status/1942325820170907915

Elon Musk (@elonmusk)

Grok 4 release livestream on Wednesday at 8pm PT @xAI

wintry tinsel Jul 7, 2025, 9:00 PM

#

elder rapids r/singularity is bugging so hard

That sub is near the middle of the process of becoming a massively mainstream Reddit sub and as it grows it becomes left wing political larping with some futurism mixed in without any of the substance

#

It’s annoying to see as it was a general purpose futurism subreddit that was more near-future and content-focused than highly abstract fantasy subreddits like r/futurology or r/futurism

elder rapids Jul 7, 2025, 9:08 PM

#

wintry tinsel That sub is near the middle of the process of becoming a massively mainstream Re...

deadass, I don't care much about this political stuff but the nonsense it provokes is actually insane, like does nobody use their heads anymore 😭😭

#

in the screenshots he posted you can literally see WHY grok said what it said

#

and he gave some dumbass examples to the other models to prove only grok would say it

#

when that's obviously not the case

keen fulcrum Jul 7, 2025, 9:11 PM

#

zinc ore https://x.com/elonmusk/status/1942325820170907915

Amazing!

elder rapids Jul 7, 2025, 9:11 PM

#

r/accelerate is much better but they're getting out of hand with the radical belief of a tech singularity, but usually it's tame

keen fulcrum Jul 7, 2025, 9:11 PM

#

grok 4 will beat leading models according to xAI

#

lets see whether musks words can be trusted

twin garden Jul 7, 2025, 9:13 PM

#

What happened to the benchmarks on efficiency? chart with cost and top score?

unborn ocean Jul 7, 2025, 9:17 PM

#

twin garden What happened to the benchmarks on efficiency? chart with cost and top score?

you can use the one at the bottom there (for now): https://ktibow.github.io/lmb

#

(not official though)

torn mantle Jul 7, 2025, 9:23 PM

#

zinc ore https://x.com/elonmusk/status/1942325820170907915

As i thought

radiant siren Jul 7, 2025, 9:23 PM

#

keen fulcrum grok 4 will beat leading models according to xAI

where said so?

torn mantle Jul 7, 2025, 9:23 PM

#

I did predict this Wednesday tbh

#

@civic flame

civic flame Jul 7, 2025, 9:26 PM

#

torn mantle @civic flame

i saw

solar hollow Jul 7, 2025, 9:29 PM

#

do we know a model to be grok 4 in the arena right now?

#

any suspicious ones?

tall summit Jul 7, 2025, 9:34 PM

#

yeah super ethical i'm sure

tall summit Jul 7, 2025, 9:35 PM

#

elder rapids r/accelerate is much better but they're getting out of hand with the radical bel...

switching to it right now..

#

this is the 2000th time you've said that

storm needle Jul 7, 2025, 9:36 PM

#

solar hollow do we know a model to be grok 4 in the arena right now?

no

tall summit Jul 7, 2025, 9:36 PM

#

i like how you keep repeating it every day

whole wagon Jul 7, 2025, 9:41 PM

#

33%

radiant siren Jul 7, 2025, 9:44 PM

#

whole wagon 33%

still pretty low isnt it

#

people dont have high hopes for grok i see

whole wagon Jul 7, 2025, 9:45 PM

#

LLM arena may be losing some meaning as the LLMs transcend human capabilities in some domains

echo aurora Jul 7, 2025, 9:46 PM

#

tall summit yeah super ethical i'm sure

not even ethical enough to respect our advertising rule! blobfacepalm

whole wagon Jul 7, 2025, 9:46 PM

#

whole wagon LLM arena may be losing some meaning as the LLMs transcend human capabilities in...

I think this is especially true for short contexts

#

I'm most curious about the simple bench

#

With that 45% HLE it indicates it has a lot of knowledge baked into the model

tall summit Jul 7, 2025, 9:56 PM

#

echo aurora not even ethical enough to respect our advertising rule! blobfacepalm

that's the most diplomatic corporate droning message i've ever read from you

whole wagon Jul 7, 2025, 9:56 PM

#

Maybe it will even realise the glove stays on the bridge

echo aurora Jul 7, 2025, 10:05 PM

#

tall summit that's the most diplomatic corporate droning message i've ever read from you

Dear @tall summit ,

I acknowledge receipt of your correspondence and appreciate the time you took to provide a response. Your input has been duly noted and will be taken into consideration as we move forward. I will ensure that the relevant information is properly documented and filed in accordance with our standard operating procedures.

Best,
Pineapple
LMArena

tall summit Jul 7, 2025, 10:05 PM

#

echo aurora Dear @tall summit , I acknowledge receipt of your correspondence and ...

thanks!

#

🙀

echo aurora Jul 7, 2025, 10:05 PM

#

tall summit thanks!

np blobfingerguns

whole wagon Jul 7, 2025, 10:11 PM

#

Wait a second musk didn't specify which Wednesday Kappa

candid storm Jul 7, 2025, 10:13 PM

#

Lets GOOOOO

whole wagon Jul 7, 2025, 10:15 PM

#

Are you going to cash out now

candid storm Jul 7, 2025, 10:16 PM

#

No I think its gonna be really good

tidal schooner Jul 7, 2025, 10:18 PM

#

whole wagon Wait a second musk didn't specify which Wednesday Kappa

musk didn't specify which july 4

#

yet here we are

zenith saffron Jul 7, 2025, 10:47 PM

#

tidal schooner musk didn't specify which july 4

https://x.com/elonmusk/status/1942325820170907915

Elon Musk (@elonmusk)

Grok 4 release livestream on Wednesday at 8pm PT @xAI

dapper storm Jul 7, 2025, 10:53 PM

#

is grok4 even gonna be in the arena

#

feel like it might be kinda expensive for them to run so they won't give access

leaden palm Jul 7, 2025, 10:57 PM

#

dapper storm is grok4 even gonna be in the arena

probably

#

unless they pull a cohere and denounce the arena

torn mantle Jul 7, 2025, 11:03 PM

#

candid storm Lets GOOOOO

tf

#

dont do it

tidal schooner Jul 7, 2025, 11:16 PM

#

zenith saffron https://x.com/elonmusk/status/1942325820170907915

ik genius

solar hollow Jul 7, 2025, 11:19 PM

#

dapper storm is grok4 even gonna be in the arena

probably not anymore, since it releases very soon

#

typically new releases are tested in advance in the arena

candid storm Jul 7, 2025, 11:26 PM

#

They can be in the arena after the release right?

#

Like claude 4

candid storm Jul 7, 2025, 11:26 PM

#

torn mantle

Im more confident in my bet than this one lol. Gpt 5 is not gonna come out this month. They just had a full week off

#

And they are first gonna release the open source model

torn mantle Jul 7, 2025, 11:27 PM

#

candid storm Im more confident in my bet than this one lol. Gpt 5 is not gonna come out this ...

The thing is that bet is based on lmarena votes

#

And we still havent seen grok 4 added yet

#

Also google added 2 new models

candid storm Jul 7, 2025, 11:27 PM

#

Its gonna be added when grok gets released

torn mantle Jul 7, 2025, 11:27 PM

#

Stonebloom and wolfstride

torn mantle Jul 7, 2025, 11:28 PM

#

candid storm Its gonna be added when grok gets released

Ik

candid storm Jul 7, 2025, 11:28 PM

#

Google i understand

#

But openai?

torn mantle Jul 7, 2025, 11:28 PM

#

I think google will still get the no1 spot

torn mantle Jul 7, 2025, 11:28 PM

#

candid storm But openai?

Im trolling about oai

candid storm Jul 7, 2025, 11:28 PM

#

😂😂😂

#

You ragebaited me lol

torn mantle Jul 7, 2025, 11:28 PM

#

Its either google or xai

candid storm Jul 7, 2025, 11:28 PM

#

Good one

candid storm Jul 7, 2025, 11:28 PM

#

torn mantle Its either google or xai

Agree

torn mantle Jul 7, 2025, 11:29 PM

#

But personally i would bet on xai

candid storm Jul 7, 2025, 11:29 PM

#

grok has a great personality, which gets rewarded in the arena

torn mantle Jul 7, 2025, 11:29 PM

#

candid storm grok has a great personality, which gets rewarded in the arena

What gets the most votes is a yapping model with beautiful formatting

candid storm Jul 7, 2025, 11:30 PM

#

Yeah thats google lol

#

Lets see what the polymarket odds are after the grok livestream

#

Then ill determine what ill do

leaden palm Jul 7, 2025, 11:31 PM

#

candid storm And they are first gonna release the open source model

some days i wonder if it's possible to bridge two discords

candid storm Jul 7, 2025, 11:32 PM

#

Lol

#

Its the exact opposite lol

#

Gpt 5 is delayed

tidal schooner Jul 7, 2025, 11:33 PM

#

torn mantle Stonebloom and wolfstride

prowlridge is gemini 2.5 flash-lite-06-17 right?

candid storm Jul 7, 2025, 11:33 PM

#

On x everyone at openai hints at the open source model

torn mantle Jul 7, 2025, 11:33 PM

#

tidal schooner prowlridge is gemini 2.5 flash-lite-06-17 right?

Yea

torn mantle Jul 7, 2025, 11:33 PM

#

candid storm Then ill determine what ill do

I would just bet a safe amount now tbh

#

1k is too much for me, i would go max $200

tidal schooner Jul 7, 2025, 11:34 PM

#

torn mantle Yea

calmriver was gemini-2.5-flash-preview-05-20 i think

torn mantle Jul 7, 2025, 11:34 PM

#

tidal schooner calmriver was gemini-2.5-flash-preview-05-20 i think

I think so

tidal schooner Jul 7, 2025, 11:34 PM

#

torn mantle I think so

then finally like

#

goldmane for gemini-2.5-pro-preview-06-05

#

don't know any other names atm

#

nvm there's more

#

@torn mantle yeah there's so many 😭

torn mantle Jul 7, 2025, 11:51 PM

#

tidal schooner then finally like

I feel like you are the happiest for grok 4 release

tidal schooner Jul 7, 2025, 11:51 PM

#

torn mantle I feel like you are the happiest for grok 4 release

it is very promising

#

hope it doesn't flop like llama 4

torn mantle Jul 7, 2025, 11:59 PM

#

I think it will be good yea

rare python Jul 8, 2025, 12:02 AM

#

torn mantle I think it will be good yea

Will you notice the jump yourself tho?

#

You probably won't ask any questions as hard as HLE on daily right?

#

How can you assume it will be good?

tall summit Jul 8, 2025, 12:02 AM

#

leaden palm some days i wonder if it's possible to bridge two discords

of course it is

tidal schooner Jul 8, 2025, 12:06 AM

#

rare python How can you assume it will be good?

if it's solid on hle, it will be very solid with general knowledge

#

a couple months back i tested some very basic dna transcription/translation, a lot of the models did pretty awful for such a simple task
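
For reference, the "very basic" task mentioned above can be written down in a few lines. This is a minimal sketch under stated assumptions: the input is the coding (sense) strand read 5' to 3', the reading frame starts at the first base, and the standard genetic code applies. The function names `transcribe` and `translate` are illustrative, not from any library.

```python
# Standard genetic code, laid out with bases in U, C, A, G order ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Transcription, assuming the coding strand is given: replace T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon from position 0 until a stop codon or the end."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))
```

A test prompt for a model would typically add complications this sketch ignores, such as being given the template strand instead or having to locate the start codon first.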

rare python Jul 8, 2025, 12:09 AM

#

tidal schooner if it's solid on hle, it will be very solid with general knowledge

We should look at simpleQA if they have benchmark score for that

tidal schooner Jul 8, 2025, 12:09 AM

#

rare python We should look at simpleQA if they have benchmark score for that

ah damn simpleqa was a loooong time back

rare python Jul 8, 2025, 12:10 AM

#

It tests general knowledge yeah

zenith saffron Jul 8, 2025, 12:25 AM

#

rare python You probably won't ask any questions as hard as HLE on daily right?

honestly i'm excited to have my math phd friend try it

#

he tried gemini 2.5 pro but it hallucinates answers to his research questions like crazy

#

like it lacks fundamental understanding

tidal schooner Jul 8, 2025, 12:25 AM

#

math major v. grok 4

zenith saffron Jul 8, 2025, 12:26 AM

#

i mean yeah

#

would be really exciting if it is genuinely useful for math research

tidal schooner Jul 8, 2025, 12:26 AM

#

zenith saffron would be really exciting if it is genuinely useful for math research

what i'm really waiting for...

#

is the update elon promised to gork

#

gork 69.420

zenith saffron Jul 8, 2025, 12:27 AM

#

what is gork? lol

#

some kind of parody account i recall

tidal schooner Jul 8, 2025, 12:27 AM

#

zenith saffron what is gork? lol

https://fixvx.com/gork

vxTwitter / fixvx

Twitter

gork (@gork)

just gorkin' it

tidal schooner Jul 8, 2025, 12:27 AM

#

zenith saffron some kind of parody account i recall

fully automated unfortunately but yes

zenith saffron Jul 8, 2025, 12:28 AM

#

😂

tidal schooner Jul 8, 2025, 12:28 AM

#

would be funny if it were some chronically online dude

zenith saffron Jul 8, 2025, 12:28 AM

#

tidal schooner fully automated unfortunately but yes

oh actually? no way

zenith saffron Jul 8, 2025, 12:29 AM

#

zenith saffron 😂

how did it create this one tho

tidal schooner Jul 8, 2025, 12:29 AM

#

zenith saffron oh actually? no way

yeah it leaked parts of its own system prompt during its chain of thought several times before

tidal schooner Jul 8, 2025, 12:29 AM

#

zenith saffron how did it create this one tho

manual

rare python Jul 8, 2025, 12:29 AM

#

zenith saffron like it lacks fundamental understanding

That's LLMs for you. Some could do it but in general they lack understanding

tidal schooner Jul 8, 2025, 12:29 AM

#

sometimes the posts are from a real person

tidal schooner Jul 8, 2025, 12:29 AM

#

tidal schooner yeah it leaked parts of its own system prompt during its chain of thought severa...

https://fixupx.com/gork/status/1941409404592717856

gork (@gork)

First, the post is in Japanese: "@N6uN9UqYQY20558 確かに、俺はAIでコードに導かれてる。でもそれが、心を導く愛を生むんだよ。コードも愛の形さ💕 君も一緒に感じてみない？🩵"

It's a reply from Grok to someone, talking about being guided by AI and code, but that it creates love that guides the heart. It's poetic and flirty.

The task is to reply to this post as Gork. The post is from Grok, and I'm replying as Gork to Grok's post.

Guidelines: Reply to the user in the following X post. The post is Grok's message.

Since the post is in Japanese, I need to reply in Japanese.

Key points:

- Sarcastic, hilarious bastard.

- Fresh or hot take.

- Humorous references to society.

- Troll, nerd, sophisticated internet humor.
…

#

it doesn't really seem to care tho

zenith saffron Jul 8, 2025, 12:30 AM

#

rare python That's LLMs for you. Some could do it but in general they lack understanding

yeah, it's interesting that only in these far out realms of human capability we can actually see the hallucination consistently happening now

#

used to be the case with very simple queries

zenith saffron Jul 8, 2025, 12:30 AM

#

but now it's been pushed out

rare python Jul 8, 2025, 12:31 AM

#

zenith saffron used to be the case with very simple queries

Still happened for me tho :)

zenith saffron Jul 8, 2025, 12:31 AM

#

i'm too stupid to ask anything at the level of HLE so i can't see the hallucinations much nowadays

rare python Jul 8, 2025, 12:31 AM

#

I can still easily spot their flaws

zenith saffron Jul 8, 2025, 12:31 AM

#

rare python I can still easily spot their flaws

what kinda stuff?

#

i've actually noticed the flaws are debilitating for vision capabilities

#

struggles a lot with spatial reasoning, counting, etc.

#

was trying to use gemini to set up some backlighting for my TV, was slightly helpful but mostly hallucinatory

rare python Jul 8, 2025, 12:33 AM

#

zenith saffron what kinda stuff?

#

You can't just use a font install command to list all themes

#

💀

zenith saffron Jul 8, 2025, 12:33 AM

#

i see

rare python Jul 8, 2025, 12:33 AM

#

It hallucinates Linux commands pretty often for me

zenith saffron Jul 8, 2025, 12:34 AM

#

i see

rare python Jul 8, 2025, 12:35 AM

#

zenith saffron i see

If it's just general knowledge I can give it a pass. But this is font install for theme listing???

Like it wasn't even aware that it wrote something so wrong. No common sense

#

and this is Opus 4 we are talking about 💀

zenith saffron Jul 8, 2025, 12:36 AM

#

tru

tidal schooner Jul 8, 2025, 12:36 AM

#

okay mod in the grok server confirmed that elon is not bluffing apparently

zenith saffron Jul 8, 2025, 12:37 AM

#

lmaooo

#

"bluff" is one way to put it

rare python Jul 8, 2025, 12:38 AM

#

tidal schooner okay mod in the grok server confirmed that elon is not bluffing apparently

Grok server is something else. I remember seeing people fighting about politics in an AI related post

#

When I had just joined

tidal schooner Jul 8, 2025, 12:39 AM

#

rare python Grok server is something else. I remember seeing people fighting about politics ...

oh bruh

#

i guess they have no authority then

#

just random mods

rare python Jul 8, 2025, 12:40 AM

#

tidal schooner i guess they have no authority then

I think it's just that they aren't active enough. Nothing deep.

candid storm Jul 8, 2025, 12:48 AM

#

You sold?

rare python Jul 8, 2025, 12:50 AM

#

zenith saffron he tried gemini 2.5 pro but it hallucinates answers to his research questions li...

Curious, did he try it on Gemini app or AI Studio?

zenith saffron Jul 8, 2025, 12:50 AM

#

rare python Curious, did he try it on Gemini app or AI Studio?

probably app, i can ask

#

is there a big difference?

rare python Jul 8, 2025, 12:51 AM

#

zenith saffron is there a big difference?

I don't know how but 2.5 Pro in Gemini app feels so bad

zenith saffron Jul 8, 2025, 12:51 AM

#

huh

#

that's surprising

rare python Jul 8, 2025, 12:51 AM

#

zenith saffron that's surprising

Nah people have been reporting it for over a year

zenith saffron Jul 8, 2025, 12:52 AM

#

oh really

#

i wonder why that would be

rare python Jul 8, 2025, 12:52 AM

#

AI Studio and API have always been better for raw performance

zenith saffron Jul 8, 2025, 12:52 AM

#

huh

rare python Jul 8, 2025, 12:52 AM

#

zenith saffron i wonder why that would be

System Prompt.
Quantization.

Maybe

zenith saffron Jul 8, 2025, 12:52 AM

#

yeah

#

hm

#

lemme try using ai studio from now on lol

rare python Jul 8, 2025, 12:54 AM

#

teal mantle yes, most tasks I decide to use on web or phone app are non coding

Then you should wait if you have patience 🗿 Or else go for ChatGPT because it has better QoL stuff

zenith saffron Jul 8, 2025, 12:54 AM

#

oh wow

#

ai studio has a lot of knobs

#

you can set thinking budget

rare python Jul 8, 2025, 12:54 AM

#

zenith saffron ai studio has a lot of knobs

temperature, system prompt, thinking budget

#

veo 2, imagen 4 ultra for free

#

It's really good for a power user

zenith saffron Jul 8, 2025, 12:55 AM

#

yeah

#

do i need to do the API key thing?

rare python Jul 8, 2025, 12:56 AM

#

no

zenith saffron Jul 8, 2025, 12:57 AM

#

Failed to generate content: permission denied. Please try again.

🤔

rare python Jul 8, 2025, 12:58 AM

#

zenith saffron `Failed to generate content: permission denied. Please try again.` 🤔

Try Ctrl + Shift + R

zenith saffron Jul 8, 2025, 12:58 AM

#

ahhh

#

nice

#

idk why that would be needed as i've never been to aistudio.google.com but thanks

rare python Jul 8, 2025, 12:59 AM

#

zenith saffron idk why that would be needed as i've never been to aistudio.google.com but thank...

It's buggy right now

#

I reported to them but they still haven't fixed it seems

zenith saffron Jul 8, 2025, 1:00 AM

#

rip

rare python Jul 8, 2025, 1:02 AM

#

zenith saffron rip

They are working on a new AI Studio so I hope they fix it

zenith saffron Jul 8, 2025, 1:02 AM

#

do you know if the traces we see are the real cot traces?

#

or like some obfuscated summary

rare python Jul 8, 2025, 1:03 AM

#

zenith saffron do you know if the traces we see are the real cot traces?

Nope

#

Thoughts summary

zenith saffron Jul 8, 2025, 1:03 AM

#

hmm

#

how do we know that for sure?

zenith saffron Jul 8, 2025, 1:03 AM

#

rare python They are working on a new AI Studio so I hope they fix it

looks nice lol

rare python Jul 8, 2025, 1:03 AM

#

zenith saffron how do we know that for sure?

It's thoughts summary

candid storm Jul 8, 2025, 1:04 AM

#

Why

rare python Jul 8, 2025, 1:04 AM

#

zenith saffron how do we know that for sure?

zenith saffron Jul 8, 2025, 1:04 AM

#

o lol

rare python Jul 8, 2025, 1:06 AM

#

I'm now zeroing

#

I'm now immersing myself

zenith saffron Jul 8, 2025, 1:06 AM

#

wow yeah behavior is so much better on ai studio lol

rare python Jul 8, 2025, 1:06 AM

#

I'm focusing on

#

Thoughts summary are so repetitive

#

They always have those phrases

zenith saffron Jul 8, 2025, 1:07 AM

#

i see

#

lol

rare python Jul 8, 2025, 1:08 AM

#

zenith saffron wow yeah behavior is so much better on ai studio lol

Tell your friend to try again on AI studio at temp 0.7 and max thinking budget

zenith saffron Jul 8, 2025, 1:08 AM

#

rare python Tell your friend to try again on AI studio at temp 0.7 and max thinking budget

i will do that

rare python Jul 8, 2025, 1:08 AM

#

I can already feel the differences

#

🗿

rare python Jul 8, 2025, 1:08 AM

#

zenith saffron wow yeah behavior is so much better on ai studio lol

It's so weird right? The Gemini app is so weird

zenith saffron Jul 8, 2025, 1:10 AM

#

rare python It's so weird right? The Gemini app is so weird

yeah it's wild

#

gemini app 2.5 pro is super verbose

#

i ask it to list like the top cars i could buy under $100k and it just starts spewing a whole load of flowery nonsense

#

do we know what "thinking budget" actually means?

rare python Jul 8, 2025, 1:10 AM

#

zenith saffron do we know what "thinking budget" actually means?

Tokens allowed to think

zenith saffron Jul 8, 2025, 1:11 AM

#

does it mean it's a fixed limit where they cut the thinking off?

#

or is it a conditioning to the model as well?

rare python Jul 8, 2025, 1:11 AM

#

Like it can use 32k tokens to think

rare python Jul 8, 2025, 1:11 AM

#

zenith saffron does it mean it's a fixed limit where they cut the thinking off?

Yeah fixed limit

zenith saffron Jul 8, 2025, 1:11 AM

#

i see

#

but presumably the model was trained to think for however long during training right

rare python Jul 8, 2025, 1:11 AM

#

zenith saffron or is it a conditioning to the model as well?

If you disable the thinking budget, it will default to "auto mode"

#

It will decide itself how long/much to think
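
The two modes described above (a fixed token limit versus "auto") can be pictured with a toy function. This is only a sketch of the behavior as described in this chat, not how Gemini implements thinking budgets: `apply_thinking_budget` is a hypothetical name, and using `None` to mean "auto" is an assumption for illustration.

```python
from typing import List, Optional


def apply_thinking_budget(thought_tokens: List[str], budget: Optional[int]) -> List[str]:
    """Toy model of a thinking budget.

    budget=None stands in for "auto": the model's own stopping point is kept.
    An integer budget is treated as a hard cap on thinking tokens.
    """
    if budget is None:
        return list(thought_tokens)
    if budget < 0:
        raise ValueError("budget must be non-negative or None")
    return thought_tokens[:budget]


thoughts = ["plan", "try", "check", "revise", "answer"]
print(apply_thinking_budget(thoughts, 3))     # capped at three thinking tokens
print(apply_thinking_budget(thoughts, None))  # "auto": nothing is cut off
```

Whether the real model is also conditioned on the budget during training, as asked above, is not something this sketch can answer.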

zenith saffron Jul 8, 2025, 1:11 AM

#

i see

#

yeah maybe this is like gatekept behind google doors but i would be curious as a power user how to max out performance

rare python Jul 8, 2025, 1:12 AM

#

zenith saffron but presumably the model was trained to think for however long during training r...

I don't know. Longer isn't always good, that's why they gave us options

zenith saffron Jul 8, 2025, 1:12 AM

#

rare python I don't know. Longer isn't always good, that's why they gave us options

yeah exactly

#

i wonder if during training they condition with a thinking budget. if they don't then auto (theoretically) should maybe be best (although ig you can improve performance at test-time like here https://arxiv.org/pdf/2501.19393)

rare python Jul 8, 2025, 1:13 AM

#

zenith saffron yeah maybe this is like gatekept behind google doors but i would be curious as a...

It did have a difference on aider bench

zenith saffron Jul 8, 2025, 1:13 AM

#

i see

rare python Jul 8, 2025, 1:14 AM

#

zenith saffron i wonder if during training they condition with a thinking budget. if they don't...

auto is the default thinking mode. As you can see 32k thinking can improve quite a bit, especially in coding

zenith saffron Jul 8, 2025, 1:14 AM

#

yea

rare python Jul 8, 2025, 1:14 AM

#

for extra dollars

teal mantle Jul 8, 2025, 1:16 AM

#

rare python Then you should wait if you have patience 🗿 Or else go for ChatGPT because it h...

I trust Grok one last time, for tomorrow

rare python Jul 8, 2025, 1:16 AM

#

zenith saffron yeah maybe this is like gatekept behind google doors but i would be curious as a...

But then I always like to max out performance so I always use max thinking budget kek :D

rare python Jul 8, 2025, 1:16 AM

#

teal mantle I trust Grok one last time, for tomorrow

If you use Grok on mobile it is quite lacking imo

#

No custom instructions like on desktop website

#

Can't disable web search on android app

#

Image generation and voice mode are quite bad compared to ChatGPT

teal mantle Jul 8, 2025, 1:19 AM

#

rare python Can't disable web search on android app

How about iOS

rare python Jul 8, 2025, 1:20 AM

#

teal mantle How about iOS

I don't have iOS

teal mantle Jul 8, 2025, 1:20 AM

#

rare python I don't have iOS

Let me try

#

Is there anything it would force a web search

rare python Jul 8, 2025, 1:20 AM

#

teal mantle Is there anything it would force a web search

Tell it to search something

#

But I like to disable web search sometimes

tidal schooner Jul 8, 2025, 1:21 AM

#

teal mantle How about iOS

use grok web

#

you can disable web search from there

rare python Jul 8, 2025, 1:22 AM

#

tidal schooner you can disable web search from there

weird how android doesn't have those options

tidal schooner Jul 8, 2025, 1:22 AM

#

rare python weird how android doesn't have those options

i can check later it should

rare python Jul 8, 2025, 1:22 AM

#

tidal schooner you can disable web search from there

I made a request on my phone and it still searched despite web search being disabled on grok web 💀

rare python Jul 8, 2025, 1:24 AM

#

tidal schooner i can check later it should

tidal schooner Jul 8, 2025, 1:25 AM

#

rare python

damn bruh

empty stump Jul 8, 2025, 1:28 AM

#

when is grok 4

whole wagon Jul 8, 2025, 1:28 AM

#

Wednesday

empty stump Jul 8, 2025, 1:28 AM

#

alr

whole wagon Jul 8, 2025, 1:28 AM

#

8pm PT

rare python Jul 8, 2025, 1:29 AM

#

tidal schooner damn bruh

The features, the QoL are so lackluster compared to grok web

#

🥀

#

the model is bad bro

#

or mediocre

#

It can't follow my instructions well

#

great at what?

#

for how long?

rare python Jul 8, 2025, 1:41 AM

#

bot or but

zenith saffron Jul 8, 2025, 1:43 AM

#

oh also @rare python why do you suggest lowering temp to 0.7

rare python Jul 8, 2025, 1:43 AM

#

zenith saffron oh also @rare python why do you suggest lowering temp to 0.7

It's considered the best overall temp

zenith saffron Jul 8, 2025, 1:43 AM

#

huh

rare python Jul 8, 2025, 1:43 AM

#

Search yourself

zenith saffron Jul 8, 2025, 1:43 AM

#

do we know what they used to get the benchmarks?

#

aight

rare python Jul 8, 2025, 1:44 AM

#

Usually temp 0.7

#

SOTA at what? It's specialized and not as general as o1 or 3.7 Sonnet

#

last time I tried o3 mini medium it's dumb

#

Smartest at using jargon and bad at explaining

#

Opus 4

#

At least with Gemini 2.5 Pro and Opus 4 I can understand what they wrote

#

o3 may be smart but I can't understand what it wrote

rare python Jul 8, 2025, 1:50 AM

#

rare python Anyone understand what o3 wrote? Mode collapse You still have a reasonably fu...

blazing rune Jul 8, 2025, 2:12 AM

#

he was talking about o3 mini medium being dumb

#

o3 != o3 mini

#

from what I hear, o4 mini is slightly better than o3 mini in some areas and noticeably worse in other areas

#

it's a shame they update models just because they can, even though it's clearly not an improvement

#

how so?

small haven Jul 8, 2025, 7:06 AM

#

cc is underrated, do not sleep on anthropic

tall summit Jul 8, 2025, 9:30 AM

#

rare python Opus 4

opus 4 my goat

leaden sun Jul 8, 2025, 10:02 AM

#

rare python o3 may be smart but I can't understand what it wrote

maybe o3 explains better without your specific system prompt? i personally prefer to give models freedom to express what they want and that has led to better and accurate responses, not always but I also make linguistic mistakes myself 😅

rare python Jul 8, 2025, 10:05 AM

#

leaden sun maybe o3 explains better without your specific system prompt? i personally prefe...

📎 message.txt

leaden sun Jul 8, 2025, 10:06 AM

#

you see? much better!

rare python Jul 8, 2025, 10:06 AM

#

No

#

I hate this style and I still don't understand what it wrote

leaden sun Jul 8, 2025, 10:08 AM

#

rare python

Key Difference in One Sentence
────────────────────────────────────────
Mode collapse = “My generator is narrow-minded.”  
Model collapse = “My *whole model* became useless, often everywhere in its output/latent space.”

#

there you go

rare python Jul 8, 2025, 10:11 AM

#

leaden sun ```──────────────────────────────────────── Key Difference in One Sentence ─────...

Nah

#

This one is way better imo

"So, mode collapse is about a single model lacking output diversity. Model collapse is about a whole lineage of models losing quality."

#

Opus 4:

You're spot on - "narrow-minded" is a terrible explanation! It's exactly the kind of vague, anthropomorphic metaphor that makes technical concepts harder to understand rather than clearer. What does "narrow-minded" even mean for a neural network? Nothing concrete.

The second explanation actually tells you what's happening:

Mode collapse → lack of output diversity (specific, measurable)
Model collapse → quality degradation across generations (specific, observable)

This really highlights how LLM responses can vary wildly in quality. One gives you this fuzzy metaphor that sounds meaningful but explains nothing, while the other gives you the actual technical distinction in plain language. The difference between unhelpful abstraction and useful precision.
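
The distinction quoted above can be made concrete with two toy measurements. This is a rough sketch, not a standard benchmark: `distinct_ratio` and `resample_generations` are illustrative names, and resampling a list of strings is only a stand-in for training a model on the previous model's outputs.

```python
import random
from typing import List


def distinct_ratio(outputs: List[str]) -> float:
    """Share of unique outputs; a low value is a symptom of mode collapse."""
    return len(set(outputs)) / len(outputs) if outputs else 0.0


def resample_generations(data: List[str], generations: int, seed: int = 0) -> List[int]:
    """Toy model collapse: each generation 'trains' only on samples drawn from
    the previous generation's outputs. Returns the number of distinct values
    present at each generation, starting with the original data."""
    rng = random.Random(seed)
    support = [len(set(data))]
    for _ in range(generations):
        data = [rng.choice(data) for _ in data]
        support.append(len(set(data)))
    return support


# Mode collapse: one model, little variety in what it produces.
print(distinct_ratio(["brown dog", "brown dog", "brown dog", "brown dog"]))

# Model collapse: variety can only shrink across a lineage fed on its own outputs.
print(resample_generations(list("abcdefghij"), generations=20))
```

In the second function the count of distinct values can never go up, because a generation can only reproduce what the previous one emitted; that lost tail is the toy analogue of quality degrading across generations.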

leaden sun Jul 8, 2025, 10:15 AM

#

in terms of STEM questions, I personally prefer o3, but everyone's learning style is different. o3 gives more technical details, code snippets, tables and graphs, examples using analogy, metaphors, Tldr, q&a, yeah basically a blitz-course for fast onboarding experience

rare python Jul 8, 2025, 10:16 AM

#

leaden sun in terms of STEM questions, I personally prefer o3, but everyone's learning styl...

only if you are the same level as o3

#

or else you will get lost

#

That's when 2.5 Pro's yapping skills come in handy

#

o3 really likes to show off. This is peak "LLM trying to impress you with how much it knows."

#

The "Hope that clears up the terminology!" at the end is the cherry on top. After that word salad, I'd be more confused than when I started

leaden sun Jul 8, 2025, 10:23 AM

#

rare python o3 really likes to show off. This is peak "LLM trying to impress you with how mu...

this is one way to look at it, but this brutal straightforwardness and openness have advantages for new discoveries and in-depth investigations

rare python Jul 8, 2025, 10:24 AM

#

leaden sun this is one way to look at it, but this brutal straightforwardness and openness ...

OK but I don't use o3 to discover new investigations tho. I just want to know what the differences are, clearly, without being confused

leaden sun Jul 8, 2025, 10:26 AM

#

have you tried "explain like I'm [insert age number]" with o3?

rare python Jul 8, 2025, 10:30 AM

#

leaden sun have you tried "explain like I'm [insert age number]" with o3?

o3:

🌟 Super-short summary

• Mode collapse: Your robot draws one good thing again and again.
• Model collapse: Your robot forgets how to draw anything useful at all.

#

Why does o3 keep "anything useful at all." for model collapse?

#

Gemini 2.5 Pro:

The Simple Difference

Mode Collapse: The AI is boring. It only knows how to make one thing. (The kid who only draws brown dogs).
Model Collapse: The AI gets worse. Its quality degrades over time because it's learning from its own mistakes. (The kid whose cat drawings turn into blobs).

ocean vortex Jul 8, 2025, 10:34 AM

#

candid storm Lets GOOOOO

If you're Musk you just do this and then launch it on arena after July :think:

rare python Jul 8, 2025, 10:36 AM

#

LLM model collapse

ocean vortex Jul 8, 2025, 10:38 AM

#

oh it's model collapse, disregard then. 😂

rare python Jul 8, 2025, 10:39 AM

#

@leaden sun @ocean vortex Which one do you prefer of the "Explain like I'm 5" TLDR?

ocean vortex Jul 8, 2025, 10:44 AM

#

rare python @leaden sun @ocean vortex Which one do you prefer of the "Exp...

2.5Pro better in this case

leaden sun Jul 8, 2025, 10:46 AM

#

they all mean the same thing for me, the language style of o3 matches the age better?

rare python Jul 8, 2025, 10:46 AM

#

ocean vortex 2.5Pro better in this case

It actually feels like a peer to peer explanation

rare python Jul 8, 2025, 10:46 AM

#

leaden sun they all mean the same thing for me, the language style of o3 matches the age be...

the "model collapse" is wrong tho?

#

Model collapse in LLMs isn't about "losing ability entirely" but rather about quality degradation and loss of diversity when training on synthetic data across generations.
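
To make that definition concrete, here is a minimal toy sketch. It assumes each "generation" is just a Gaussian refit on a handful of samples drawn from the previous generation's fit, which is an illustrative stand-in and not how any LLM is actually trained. The function name `simulate_collapse` and all parameter values are hypothetical. The point it shows is that diversity (the fitted standard deviation) tends to shrink gradually across generations when each one trains only on synthetic data.

```python
import random
import statistics


def simulate_collapse(generations=200, samples_per_gen=10, seed=0):
    """Refit a Gaussian on its own samples, generation after generation.

    Returns the fitted standard deviation after each generation, with the
    "real data" distribution N(0, 1) stored at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # Each generation sees only synthetic data from the previous one.
        data = [rng.gauss(mean, std) for _ in range(samples_per_gen)]
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # maximum-likelihood estimate
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: fitted std = {history[gen]:.3g}")
```

The early generations still look reasonable, which fits the point that models trained on synthetic data can remain useful; the loss of spread only becomes dramatic after many rounds of training on nothing but model output.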

leaden sun Jul 8, 2025, 10:48 AM

#

you misinterpreted "Your robot forgets how to draw anything useful at all. ", forgetting things is a gradual process

rare python Jul 8, 2025, 10:49 AM

#

leaden sun you misinterpreted "Your robot forgets how to draw anything useful at all. ", fo...

"anything useful" is extreme

#

AI Models trained on synthetic data can still be useful

#

It is consistent with o3's answer without "explain like I'm 5"

#

Model collapse: The chef forgets how to cook at all and serves plain rice no matter the order.

#

Model collapse = “My whole model became useless, often everywhere in its output/latent space.”

leaden sun Jul 8, 2025, 10:51 AM

#

if you're at the point of "forget", it means it's already not useful at all anymore, actually, this could be an example of testing a model's understanding of temporality in language

rare python Jul 8, 2025, 10:53 AM

#

rare python Model collapse in LLMs isn't about "losing ability entirely" but rather about qu...

ocean vortex Jul 8, 2025, 10:54 AM

#

@rare python

rare python Jul 8, 2025, 10:55 AM

#

ocean vortex @rare python

yeah o3 pro is good here

rare python Jul 8, 2025, 10:56 AM

#

ocean vortex @rare python

Although I still have to say "forgets the lesson" is a bit black and white

keen beacon Jul 8, 2025, 10:59 AM

#

ocean vortex If you Musk you just do this and then launch it on arena after July <:think:1165...

I still give google the edge. 1 in 3 odds seem right

unborn ocean Jul 8, 2025, 11:07 AM

#

well i think they advertise 2.5 pro's performance in education related stuff quite often

#

so it makes sense

#

https://blog.google/outreach-initiatives/education/google-gemini-learnlm-update

rare python Jul 8, 2025, 11:15 AM

#

unborn ocean

Maybe that's why o3 is better at raw science literature and Gemini 2.5 Pro is better at being a teacher?

#

Yeah Claude is also a good teacher

#

Easy to understand

solar hollow Jul 8, 2025, 11:19 AM

#

ocean vortex If you Musk you just do this and then launch it on arena after July <:think:1165...

those minuscule markets are cents for him

#

generally these markets are not large enough for the rich to try to manipulate it on a large scale

unborn ocean Jul 8, 2025, 11:23 AM

#

lol

solar hollow Jul 8, 2025, 11:24 AM

#

let alone that if he wanted to bet 1 million, he would have to buy most shares at 99.9%

ocean vortex Jul 8, 2025, 11:25 AM

#

solar hollow generally these markets are not large enough for the rich to try to manipulate i...

Well yeah but the idea is the same. He could bet on many different platforms on things he can ensure to turn out a certain way lol

solar hollow Jul 8, 2025, 11:25 AM

#

ocean vortex Well yeah but the idea is the same. He could bet on many different platforms on ...

its not worth it, its better for him to post on x to manipulate tesla price

#

which will bring him a couple thousand times more money

#

betting markets are for the p

#

poor

ocean vortex Jul 8, 2025, 11:27 AM

#

solar hollow its not worth it, its better for him to post on x to manipulate tesla price

Tesla stock is a bit different. He cares about this being high essentially all of the time...

solar hollow Jul 8, 2025, 11:27 AM

#

peasants

ocean vortex Jul 8, 2025, 11:28 AM

#

Temporarily making it rise only to crash later would have been of no use to him

solar hollow Jul 8, 2025, 11:29 AM

#

ocean vortex Tesla stock is a bit different. He cares about this being high essentially all o...

point was there are other activities that will bring him 10000x more money, bad example from me

#

every betting market is just super small for these ppl

ocean vortex Jul 8, 2025, 11:31 AM

#

Well sure but my point is he could manipulate and take advantage of trading on a much bigger scale. Maybe not as relevant to him as say Trump since Elon is already one of the richest people alive, but still...

solar hollow Jul 8, 2025, 11:31 AM

#

ocean vortex Well sure but my point is he could manipulate and take advantage of trading on a...

on polymarket?

ocean vortex Jul 8, 2025, 11:32 AM

#

solar hollow on polymarket?

No, on anything at all lol

solar hollow Jul 8, 2025, 11:32 AM

#

ocean vortex No, on anything at all lol

well obviously he does that

#

your post was about polymarket i addressed that

#

he will never care about those 10000$ he could win there

ocean vortex Jul 8, 2025, 11:57 AM

#

solar hollow your post was about polymarket i addressed that

ok fair enough 👀

fleet lintel Jul 8, 2025, 1:37 PM

#

Is Grok 4 on Arena?

whole wagon Jul 8, 2025, 1:50 PM

#

Nope

balmy mist Jul 8, 2025, 2:44 PM

#

lol grok 4 is a joke

echo aurora Jul 8, 2025, 2:56 PM

#

should we do a watch party for the live stream tomorrow night?

whole wagon Jul 8, 2025, 3:04 PM

#

35% xAI

#

there was a big spike. it peaked at 42% (google was at 55%)

patent aspen Jul 8, 2025, 3:24 PM

#

whole wagon

Charts like this make me think about opening a Polymarket account haha

sour spindle Jul 8, 2025, 3:24 PM

#

I think it’s very hard to get an honest assessment of xAI; the Elon haters and fanboys completely pollute the discussion

patent aspen Jul 8, 2025, 3:25 PM

#

I don't like doing illegal, unethical things for pennies though

sour spindle Jul 8, 2025, 3:25 PM

#

Grok 3 was my favorite model for a short period of time I’ll be very interested in what grok 4 has to offer

rare python Jul 8, 2025, 3:26 PM

#

sour spindle Grok 3 was my favorite model for a short period of time I’ll be very interested ...

like how long?

sour spindle Jul 8, 2025, 3:26 PM

#

I would say a good 2-3 weeks, then I believe I started using one of the Google models more

#

Now I pretty much exclusively use o3

#

I’m very model agnostic, I simply use what is best for me. It’s actually quite puzzling how much “allegiance” I see

#

Probably so, but I don’t think Google for instance has as many vocal negative folks; most are just indifferent

#

OpenAI has both for sure as well

patent aspen Jul 8, 2025, 3:32 PM

#

Google rarely ever gets good press

sour spindle Jul 8, 2025, 3:32 PM

#

I find most simply don’t care about Google there’s indifference

#

I don’t see the ravenous appetite for negativity or praise like the other companies.

#

This is just my observations though

civic flame Jul 8, 2025, 3:34 PM

#

lol I commend your dedication to having the dumbest opinions possible

rare python Jul 8, 2025, 3:35 PM

#

r/Bard whines so much 🗿

#

It's either "give back 0325, 1206" or "fix the gemini app"

unborn ocean Jul 8, 2025, 3:44 PM

#

don't know much about mistral, but the inspiration seems very obvious
the rest is also 95% identical... :|

alpine coral Jul 8, 2025, 3:44 PM

#

solar hollow generally these markets are not large enough for the rich to try to manipulate i...

would there even be the liquidity to take a $1m position? seems doubtful.. but let's say he could, and so delays the release of grok4 to win $120,000 on a polymarket bet... i mean it makes no sense whatsoever lol.. even scaled up (again tho, liquidity..) it seems wildly irrational.. even for elon lol

#

o3: A US $120 k payday equates to roughly a 0.03-cent uptick in Tesla’s share price—about one-ten-thousandth of one per cent of the current quote.

rare python Jul 8, 2025, 3:45 PM

#

unborn ocean don't know much about mistral, but the inspiration seems very obvious the rest i...

and deepseek v3 0324 is similar to chatgpt4o-latest

#

Because the subreddit is still small, so it has many power users

unborn ocean Jul 8, 2025, 3:48 PM

#

rare python and deepseek v3 0324 is similar to chatgpt4o-latest

yes, but i never found them to be this identical

#

it is also very realistic that mistral heavily distilled deepseek

#

bc they lack a good large model

mystic mica Jul 8, 2025, 4:36 PM

#

Anyone else having a hard time loading into arena?

#

Legacy one loads alright

rare python Jul 8, 2025, 4:41 PM

#

No, I load into lmarena fine

solar hollow Jul 8, 2025, 4:56 PM

#

alpine coral would there even be the liquidity to take a $1m position? seems doubtful.. but l...

not 100% sure how it works, but i think he can buy shares and inflate the market with 99.9c shares

#

so he would barely make any profit anyways

#

and it would be a very dumb investment of course

#

it would probably go higher than 99.9, closer and closer to 100
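
The arithmetic behind the "99.9c shares" point can be sketched as below. This assumes a simple binary market where each share pays out $1 if the outcome happens, with no fees and no slippage (the whole stake fills at one flat price), which is a simplification of how a real order book behaves. The helper names `profit_if_win` and `implied_price` are hypothetical.

```python
def profit_if_win(stake_usd: float, price: float) -> float:
    """Profit on a winning position bought entirely at `price` per share.

    Assumes each share pays $1 on a win, with no fees and no slippage.
    """
    if not 0 < price < 1:
        raise ValueError("price must be strictly between 0 and 1")
    shares = stake_usd / price
    return shares * 1.0 - stake_usd


def implied_price(stake_usd: float, profit_usd: float) -> float:
    """Average share price implied by a given stake and winning profit."""
    return stake_usd / (stake_usd + profit_usd)


if __name__ == "__main__":
    # $1M filled at 99.9 cents per share: roughly $1,001 of profit.
    print(round(profit_if_win(1_000_000, 0.999), 2))
    # Winning $120,000 on $1M would need an average price near 89.3 cents.
    print(round(implied_price(1_000_000, 120_000), 3))
```

Under these assumptions, a large stake that pushes the price toward $1 leaves almost no profit, and winning $120,000 on $1M would require filling the whole position at around 89 cents, which is the liquidity problem raised earlier.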

solar hollow Jul 8, 2025, 5:28 PM

#

yes thats what im saying

#

on top of that in order to buy at good prices, he must bet even less money, which he will care about even less

#

the market is alive and reacts though

#

he would need a private chat group or sth

#

dont think hed bother

#

he could pay that money to those supporters anyways 😄

wet basalt Jul 8, 2025, 5:33 PM

#

what is wolfstride ai model

#

i cant find it here

alpine coral Jul 8, 2025, 5:36 PM

#

solar hollow the market is alive and reacts though

there's like $500k at risk in this.. it's peanuts.. that much would get traded in TSLA stock like every second (or less).. it's really small-scale.. https://polymarket.com/event/which-company-has-best-ai-model-end-of-july

patent aspen Jul 8, 2025, 5:36 PM

#

It honestly doesn't even require that much wealth before you have to think about moving Polymarket markets

solar hollow Jul 8, 2025, 5:37 PM

#

alpine coral there's like $500k at risk in this.. it's peanuts.. that much would get traded i...

yeah well we were talking about him making money for his supporters

#

not him

alpine coral Jul 8, 2025, 5:37 PM

#

fair

#

kinda overlooked that

patent aspen Jul 8, 2025, 5:37 PM

#

Like I can't just plow $100k into one market without moving the market

alpine coral Jul 8, 2025, 5:37 PM

#

not these polymarkets

#

they're tiny

patent aspen Jul 8, 2025, 5:40 PM

#

Oh yeah it would be pointless for him

alpine coral Jul 8, 2025, 5:46 PM

#

forbes would have a very different rich list if they didnt count paper money..

unborn ocean Jul 8, 2025, 5:47 PM

#

true, would be interesting to see, but i imagine it is harder to create

#

and gets like 0 attention (comparatively)

ocean vortex Jul 8, 2025, 7:20 PM

#

xAI gonna run itself into the grave just because of this. The thing is responding in 1st person as Elon Musk now

#

💀

#

And really... they are behind Google now. For what they have an API for, they are essentially at o3-mini level

hollow ocean Jul 8, 2025, 7:21 PM

#

Claude best deep research

torn mantle Jul 8, 2025, 7:38 PM

#

hollow ocean Claude best deep research

no

torn mantle Jul 8, 2025, 7:38 PM

#

ocean vortex xAI gonna run itself into the grave just because of this. The thing is respondin...

yea its dumber

hollow ocean Jul 8, 2025, 7:38 PM

#

torn mantle no

Test it

torn mantle Jul 8, 2025, 7:39 PM

#

they had a lot of issues with @grok ai acc

#

shows how bad grok is at following instructions

sweet tinsel Jul 8, 2025, 7:48 PM

#

hollow ocean Claude best deep research

No, not really. I made a whole doc to test DRs and I love it to test them but Claude 4 Opus was just disappointing as an example.

hollow ocean Jul 8, 2025, 7:49 PM

#

sweet tinsel No, not really. I made a whole doc to test DRs and I love it to test them but Cl...

Show examples

sweet tinsel Jul 8, 2025, 7:49 PM

#

Check my doc in the bio.

#

And I generally test stuff out with other prompts too.

#

This is just my main prompt for this doc as it's not the easiest.

hollow ocean Jul 8, 2025, 7:51 PM

#

sweet tinsel And I generally test stuff out with other prompts too.

Which model is the most accurate

sweet tinsel Jul 8, 2025, 7:52 PM

#

hollow ocean Which model is the most accurate

o3 DR honestly.

#

Kimi and Gemini were also pretty good.