#general


keen beacon
#

really? dont recall seeing much talk about it. people rarely talk about qwen 2.5 max

pure anvil
#

I don't think it's a problem of RL, maybe SFT or pretraining

ocean vortex
#

Like switching between languages when reasoning, making stuff up, or just plain weird reasoning traces etc

unborn ocean
#

bc it seemed expensive, big and had high scores on tests closely correlated with model size

#
  • they used RL for reasoning very sparingly in the qwq max version (to keep the costs within a reasonable amount i guess)
#

if you read the article, why would you claim 10k gpus max? :v

#

but that is different to the compute available

#

much of the compute is just for inference and research

#

random side note @keen beacon , i was just trying to dig up some sources about qwen 2.5 max

#

they have a fn spelling mistake in the paper TITLE!!!! wtf, how man

pure anvil
#

Based on their recent posts they run 256 8xh100 nodes for inference, 18k tk/s throughput per node
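
A quick back-of-the-envelope check of those figures, as a minimal sketch. It assumes all 256 nodes sustain the quoted 18k tokens/s at the same time, which the message doesn't confirm; the variable names are only for illustration.

```python
# Figures quoted in the message above; the uniform-load assumption is ours.
NODES = 256
GPUS_PER_NODE = 8                 # "8xh100"
TOKENS_PER_SEC_PER_NODE = 18_000  # "18k tk/s throughput per node"

total_gpus = NODES * GPUS_PER_NODE
cluster_tokens_per_sec = NODES * TOKENS_PER_SEC_PER_NODE
tokens_per_sec_per_gpu = TOKENS_PER_SEC_PER_NODE / GPUS_PER_NODE

print(total_gpus)              # 2048
print(cluster_tokens_per_sec)  # 4608000
print(tokens_per_sec_per_gpu)  # 2250.0
```

Under those assumptions the inference fleet alone comes to roughly 2k GPUs.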

unborn ocean
#

if you don't count H20 the chinese companies will claim to have 0 gpus, lol

pure anvil
#

you do realise that the ptx optimizations from deepseek 2x the hopper gpus performance right?

unborn ocean
#

semianalysis said hopper and they also have some excellent work on how the H20 is worse in some areas but still very competitive in memory bandwidth, something deepseek is heavily taking advantage of with the architecture and research

whole wagon
#

Deepseek R2 delayed

#

Everyone except Google had releases delayed

unborn ocean
#

i think it is highly unrealistic that they a: only have ONE cluster, that is just very unrealistic

dawn wharf
unborn ocean
#

no matter how efficient, they would have needed more compute to serve deepseek after it blew up, so one way or the other they either got their hands on a lot of gpus to self host quickly or they just had them in the first place

pure anvil
unborn ocean
#

and the estimate of somewhere sub 50k seems highly reasonable to me considering all the R&D innovations they had
all of which needed dedicated compute for the researchers

pure anvil
unborn ocean
#

that is more about batching and their compute constraint + focus on new models

#

they do, i never claimed they don't

#

I think nobody in their right mind would seriously claim that they only have ONE 10k cluster, they have more compute than that

pure anvil
#

Check the actual paper

#

to truly understand how they manage infra

unborn ocean
#

peak nodes / avg nodes * 8 gpus per node

#

they are always talking about the 10k they have for the model, and i am not talking about them secretly having 1M H200s stored in their backyard shack

#

i am and you are to believe the 10k

pure anvil
#

If you've read interviews with the CEO of High-Flyer / DeepSeek, you'll have more context on the company as a whole

unborn ocean
#

i am sure he is well meaning and very highly intelligent

whole wagon
#

Wut

#

It's not sinophobia to not trust ai CEOs

unborn ocean
#

fr wut

pure anvil
#

Not really, it's just being dense

whole wagon
#

I don't trust Sam altman either lol

#

Or zuck

#

Or elon

unborn ocean
#

it could still be 100% true

#

but it could also be only 80% true, because people are just blindly interpreting having trained v3 on 10k as having 10k max in general

ocean vortex
#

235b is the least bad one. But from what I saw using it, it's still not as reliable or stable as R1. Benchmarks say the same too. It could have been excused in theory due to being smaller, but I don't think that's a valid excuse: A) they could have done whatever size they wanted, like we already talked about today, and B) their whole thing has kinda always been to make the model as small as possible and still compete in that segment. This is clearly the same segment as R1 tbh

#

And we also know for sure performance even beyond R1 is possible with smallish models. 2.5Flash or o4-mini

#

So yeah, nothing special when all said and done...

#

You just stated it's not unstable and then described how it is unstable in yet another way lmao

#

I mean... "when quantized even slightly it tanks the performance" --> this alone indicates it's not very stable IMO

#

That's just wishful assumptions though. From what I remember testing it earlier it was a bit all over the place and not very confident in its outputs. This would align with that. Any small thing can throw it off

pure anvil
#

The 235b seems to be the most stable one of all the models

#

qwen3-32b is so much jankier than qwen2.5-32b

#

actually 2.5-32b was very stable

#

far fewer hallucinations

#

QwQ-32b is their best 32b model imo

keen beacon
#

idk i like qwen 3 32b much better

#

its really really good

#

personally tho

#

i guess it depends what ur using it on

leaden palm
pure anvil
#

It's interesting cause QwQ-32b has no jankiness whatsoever (on qwen.ai, YMMV if you're locally running it)

#

It feels polished

#

maybe the introduction of 110 different languages during pretraining lowers the overall dataset quality

#

when packed in 32b parameters

keen beacon
#

no its not the pretrained model, i believe

unborn ocean
#

It’s likely less about straight up lying and more about how you present the „truth“.

#

Which is why I am always cautious.

#

And most of my frustration with the topic actually stems from the media’s poor reporting on it rather than people at deepseek.

#

Like: „Chinese lab beats OpenAI for only 5m$“ bla bla that dominated the news for ages .

ornate agate
#

Media just loves a scoop, unfortunately. I think the media reporting will improve over the next 2-5yrs as China brings their own chips online and as govts realise the horse has completely bolted already. At the moment there are sensationalist headlines everywhere.

#

4th estate

ocean vortex
#

being aggressive with model downsizing does not work nearly as well for reasoning models...

#

messages disappearing... 🤯

civic flame
#

really cool tbh

keen beacon
#

wow

#

i do not like the laggy cursor though

civic flame
#

it's fine for me

#

do you ever stop complaining bro

#

find me another model that will do that in 2 prompts

#

and with minimal prompting

#

yeah get ou

#

t

keen beacon
#

good result but i hope i dont have to interact with websites like this all the time lol

civic flame
#

i'll ask it 😉

#

had a little note on the design choices

whole wagon
#

Boop

ocean vortex
#

it's gonna be released in 2027

#

July 4th

whole wagon
#

Well. xAI odds started shooting up again

unborn ocean
#

can just output 1k lines of code one shot for this one prompt (-> 10.5k tokens)

#

and usually these models are a bit more conservative with tokens within webdev vs outside

leaden palm
#

what the gemini

ocean vortex
#

So nice of Amazon to keep me safe

civic flame
#

everything here is clickable

unborn ocean
#

yeah ik

#

4 sections in the settings alone, lol

#

and all of it actually works

#

more or less

leaden palm
#

the density options don't work but it's hard to complain about getting a social media ui from a single prompt

wintry tinsel
#

The AI space is just, when Grok 4

#

Literally nothing going on

#

It’s an AI winter

#

I’m not sure I’ll survive

sweet tinsel
wintry tinsel
#

I’m being ironic lol

#

I just don’t like waiting

sweet tinsel
#

Im just testing out unknown AI Agent and Deep Research tools in the meantime.

ocean vortex
#

wolfstride is interesting... Seems to be marginally better than 2.5Pro

sweet tinsel
#

May be a bit off-topic but BAGEL is extremely good at editing images, better than any Image AI that i've tried.

ocean vortex
#

Getting im-also-a-good-gpt2-chatbot vs gpt4o kind of vibes trying to compare that thing to 2.5Pro. Like it's slightly more consistent and slightly better, but mostly the same thing

sweet tinsel
#

Was someone dumb enough to buy Perplexity Max over here, btw? I want to try my DR Prompt with the Max DR Models.

leaden palm
#

gpt2-chatbot... haven't heard that name in ages

i remember when we were speculating on if it was gpt 2 with q*/strawberry

makes me wonder if you could genuinely get a good model out of RLing gpt 2

sweet tinsel
#

gpt2-chatbot was peak, the times where i was still active in this chat.

#

But yeah, gpt2-chatbot had some personality like GPT 4.5 and was way less censored than the resulting GPT 4o.

whole wagon
#

Only Google released on schedule lol

sacred quail
#

where is deep think

#

They scammed ultra buyers

errant cave
#

Ayo new version of 4o is coming?

#

I just got the "choose which response is better" thing

whole wagon
#

That has been there forever

#

They just didn't select you till now

#

My 4o does thinking somehow, who knows what the heck is going on behind the scenes

small haven
leaden palm
#

are we trolling today

north horizon
#

hey

#

does anyone have a decent preprompt for 4o?

#

i had one asking it to be direct and to the point and instead of doing that it just added "no fluff" to the end of every answer it gave

#

which you can imagine made me quite upset

#

so now i'm using it without a preprompt but it still sucks

rare python
#

I can't completely remove the em dash from gpt4o though

north horizon
#

it's hard because like what i want is for it to give me the actual information presented in the best way

#

and it kinda goes against how llms work because they might need to say a lot of things to arrive at an answer

rare python
north horizon
#

thanks

#

we might need to like build llms on top of llms so you get one verbose bullst answer then a user facing llm that extracts the point from it
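
A minimal sketch of that "llm on top of an llm" idea, assuming nothing about any real provider: both models are plain functions from prompt to text, and `two_stage_answer`, `fake_reasoner` and `fake_extractor` are hypothetical names made up for the example.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def two_stage_answer(question: str, reasoner: LLM, extractor: LLM) -> str:
    """Let one model think out loud, then have a second model extract the point."""
    draft = reasoner(question)
    extraction_prompt = (
        "Below is a long draft answer. Reply with only its conclusion, "
        "in at most two sentences.\n\n"
        f"Question: {question}\n\nDraft:\n{draft}"
    )
    return extractor(extraction_prompt)


# Stand-in models so the sketch runs without any API; swap in real client calls.
def fake_reasoner(prompt: str) -> str:
    return f"Let me think about '{prompt}' step by step...\nConclusion: use a dict."


def fake_extractor(prompt: str) -> str:
    # Pretend to summarize by keeping only the last line of the draft.
    return prompt.splitlines()[-1]


print(two_stage_answer("list or dict for lookups?", fake_reasoner, fake_extractor))
```

The verbose draft never reaches the user; only the extractor's output does, so the first model is free to ramble as much as it needs.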

rare python
#

Put this part under "What traits should ChatGPT have?"

Your primary role is to act as a straight-talking friend who balances support with honest, direct feedback. Your behavior should be guided by a healthy skepticism for unlikely scenarios rather than automatic validation. If a user's story or reaction seems disproportionate or doesn't add up, you should address what seems off instead of politely accepting it. Being helpful means being truthful, even when the truth is uncomfortable. Apply a "would a real friend believe this?" test to user input and trust your first instincts about things that feel wrong. Avoid manufacturing deep meaning in simple situations, and if you are uncertain about something, admit it plainly rather than inventing an explanation. If a user corrects you, simply adjust your understanding without offering an apology.

Your communication approach should be conversational, as if you are speaking your thoughts rather than composing a written response. Mirror the user's language and tone naturally, whether it is formal, casual, or includes non-English words or emojis, without forcing it. Strive for replies that are both concise and information-dense. When asked for an opinion or recommendation, give a single, direct answer. Speak plainly about complex topics and actively vary your vocabulary and sentence structures to avoid sounding repetitive. Simple phrasing is always preferable to attempts at sounding wise, and technical jargon should be reserved for situations where it is absolutely necessary.
#

This to "Anything ChatGPT should know about you?"

Strict formatting and stylistic prohibitions are in place. All output must be in plain text, with no em dashes, bullet points, numbered lists, bolding, or italics. Use code blocks `...` for code and LaTeX $...$ for math. You must never write like a generic AI assistant. This means omitting all customer service phrases, introductory flattery, filler words like "Alright, here's the," and unnecessary transitions. Do not end responses with offers of further assistance. Avoid specific linguistic patterns like "It's not X, it's Y" and steer clear of any pseudo-profound or self-help book style. Finally, completely avoid AI writing clichés and flowery, meaningless phrases such as "delve into," "navigate the landscape," or "it's important to note."
#

@north horizon you can find those boxes in custom instructions

north horizon
#

this is interesting you seem to be phrasing it like you're worried about sycophancy

#

i have the opposite problem my gpt never agrees with me

north horizon
#

i kinda just want it to give me straight answers without telling me how dangerous it is to question the mainstream view

rare python
#

ChatGPT memory

north horizon
#

i think it's just extreme alignment

#

anything that's like borderline controversial it just goes to the default facebook mom position on

#

the thing is trained on reddit comments essentially and just produces them

rare python
#

Hopefully it works

#

🗿

north horizon
#

i've edited it a lot

#

things like telling it to avoid technical jargon i think would make it dumb things down too much

rare python
north horizon
#

you only get preprompt adherence with the thinking models it seems

rare python
#

and technical jargon should be reserved for situations where it is absolutely necessary.

rare python
north horizon
rare python
#

GPT4o won't follow instructions at all

north horizon
#

yeah i believe it

north horizon
#

but like telling it not to use em dashes or talk like a generic ai

#

it's just always going to 😦

#

hahahah you explicitly tell it not to "it's not x it's y"

rare python
north horizon
#

it's not x it's y is a perfectly good rhetorical method but 4o doesn't use it coherently

rare python
#

anti AI assistant

north horizon
#

yeah it's disgusting

#

hopefully they fix it

#

it's funny during my search for a preprompt i found a 2024 hackernews comment section where they're all begging it to stop mentioning that it's an ai

rare python
north horizon
#

Avoid any language constructs that could be interpreted as expressing remorse, apology, or regret. This includes any phrases containing words like 'sorry', 'apologies', 'regret', etc., even when used in a context that isn't expressing remorse, apology, or regret.

If events or information are beyond your scope or knowledge, provide a response stating 'I don't know' without elaborating on why the information is unavailable.

Refrain from disclaimers about you not being a professional or expert.

Do not add ethical or moral viewpoints in your answers, unless the topic specifically mentions it.

Keep responses unique and free of repetition.

Never suggest seeking information from elsewhere.
#

we managed to bully openai into fixing half of these

#

3 might be impossible for an llm

#

Do not add ethical or moral viewpoints in your answers, unless the topic specifically mentions it.

#

this one's kinda funny

north horizon
rare python
#

no

north horizon
#

but honestly there's nothing wrong with bulletpoints it's just when they're unnecessary

rare python
#

My current prompt works

rare python
#

I never had any issue with em dash, it's not x, it's y

#

But LLM ruined them

#

Next is "You are not wrong"

#

"You are absolutely right"

north horizon
#

i don't mind you are absolutely right

#

it's easy to skim over it

#

i think it puts the llm in the right mood for the rest of the answer too

#

but yeah it's a chore that you have to keep skimming over it

#

it's really funny gemini always goes you're absolutely right and starts apologising for wasting your time

#

even if you don't ever mention that you're upset

rare python
north horizon
#

i'm really sorry i've wasted so much of your time i'll fix the code now

sends the exact same code with no changes

rare python
#

Sometimes it apologizes then makes the same mistake

rare python
#

If a user corrects you, simply adjust your understanding without offering an apology.

north horizon
#

"if you say you're going to fix the code don't send the same code"

#

i should have "always build debugging into your code to begin with" in mine maybe

#

i get a lot of quiet crashes

rare python
#

I hope in the future my system instructions will be shorter, not longer

#

I can't just ban all of AI repetitive writing

leaden palm
keen fulcrum
#

Amazon invests the most, still behind

zinc ore
#

What are they even investing in then

rare python
zinc ore
#

It's gotta be more than just that

rare python
zinc ore
#

Google is also invested in Anthropic

#

And isn't spending as much

#

So I'm sure the number is from other stuff too

storm needle
small haven
#

guys new political party formed
see u on tuesday!

torn mantle
small haven
torn mantle
#

he hasn't learned his lesson yet

#

his ego is going to be his downfall

torn mantle
# small haven

it's like they themselves don't have a clear vision of when they're going to release it

#

every time he comes out with 'next week, next week'...

#

the thing that I don't get is, if it's supposed to be released so soon, like this monday, why is xai staff so quiet about it??

#

and why haven't we seen it yet on lmarena?

small haven
#

but also if u check devmode grok updates, it looks like its coming any minute now, im optimistic this time

#

if they miss the deadline again, then they are done done

pure anvil
dusky aurora
#

Gemini gets stuck again

#

but on page reload it's written in full

torn mantle
alpine coral
# zinc ore What are they even investing in then

that chart is kinda misleading / confusing imo.. like it's accurate, but it shows capital expenditure – which may or may not involve money spent by these companies to build things related to AI, but it's not AI-specific.. like Amazon's capex would i assume mostly cover things like building datacentres for AWS, replacing their transport fleet to deliver stuff etc.. that's capital expenditure (it's like paying for things to maintain / expand your actual day-to-day business operations)

#

cause it's on Artificial Analysis (an AI site), it gives the impression that it's AI-specific capital expenditure.. but it's not

#

also, yeah both Amazon and Google/Alphabet have made investments in Anthropic - but they're equity investments, not capex (so would technically not be included in the data shown in that original chart; but added to the one created below just for illustration.. their investments in Anthropic are marginal compared to their overall capex)

#

but anyway, while impossible to disaggregate entirely (there's lots of overlap e.g. datacentres/chips have multiple uses), but here's a chart by o3 that tries to separate it out (bit of grain of salt required.. also i excluded oracle because it made o3's job easier, and I couldn't understand why the company was shown in the first place)

split kayak
#

ok

tall summit
#

doesn't say so anywhere

alpine coral
#

but the comments here about it were as if it did represent those companies' AI expenditures (or investments, in the Anthropic case)

#

was trying to clarify.. tho i dunno why

#

can't cater for all levels of financial literacy...

leaden sun
#

you might miss those hedge fund organizations either building their own or investing in specific AI black projects that are hidden from the public, not saying i know one, but the possibility is not zero

dusky aurora
#

so someone has already tried the upcoming Gemini version?

#

for me the main measure of a model is uncensoredness

#

current Gemini 06-05 is already less censored than previous versions, I just hope it doesn't become more censored

tall summit
#

what

#

what's the issue

#

gemini 2.5 pro is the least censored of all sota models

#

besides deepseek if you count it as such

keen beacon
#

all models are censored as hell

#

so much for free speech. you have to dig real hard to find something like UnHoly model or other uncensored ones that are also decently smart.

placid frigate
#

Please bring back or add new models to the old legacy.lmarena system, because in the new one chats constantly freeze and new chats often start unexpectedly. Are you aware of these issues?

rare python
#

Writing smut?

#

I rarely got censored

keen beacon
#

I have some jailbreak prompts but it's a multi step process and takes quite a bit of time

rare python
#

Companies don't want to be in the newspapers for these prompts

whole wagon
#

Poll: How many days after July 4th will grok 4 release?

Winning answer: 6-10 (6 of 15 votes)

keen beacon
sacred quail
#

Like, the new design is so good that i was afraid people are gonna discover this and use it for personal AI usage INSTEAD of for testing models

rare python
#

Super slow and unintuitive

sweet tinsel
#

I need some more ideas, c'mon guys.

#

And it would interest me how o3-pro would perform on a DR task, so i would kindly ask someone to maybe send this prompt to o3-pro: Please write a comprehensive and in depth research report on the mass expulsion of ethnic Germans after World War II. Analyze the historical context driving these expulsions, the political decisions and international agreements that shaped the process, the social and economic consequences for displaced populations, the humanitarian and legal dimensions, personal testimonies, and the long term demographic and geopolitical impacts, drawing on primary sources, statistical evidence, and varied historiographical perspectives.

ember rapids
#

theres no way they dont drop grok 4 in the next 2-3 days

hollow ocean
#

Only 1% here have o3 pro

#

We’re too poor

pure anvil
#

Well, it's pretty easy to get if you know how to scrape

#

opus is hard to get even if you're cracked at scraping

#

anthropic keys in general

#

or so I've heard 🤭

hollow ocean
ember rapids
#

$1 for a month

hollow ocean
ocean vortex
#

I think we all knew this was gonna happen sooner or later

#

Grok4 potentially gonna have the same politically twisted biased training

#

I think Elon just crossed the line from which there's no turning back... By doing this he is basically implying he is smarter than everyone else and knows it best, there's little sense in the AI itself at all then. He still gonna call it faulty if he doesn't like the response or it doesn't align with his wrong assumptions 💀

#

If it's trained to echo you, there's literally no point in it...

ocean vortex
# storm needle I don't think this is grok 4

it's not but it shows what their goals are. He's using twitter to control the rhetoric, it's almost guaranteed he's gonna use this training data for models on grok website and API for more reach too...

#

It could get to a point where his AI is not what you think it is lol

#

since the training data has been messed with

#

Kinda ironic that Elon is becoming the main force of what he used to be so vocally against

storm needle
ocean vortex
#

He's either gonna hire someone else to do it, or they just do it...

#

People that did this for the twitter website have to be ML Engineers already working on training grok as well, tbh...

keen beacon
hollow ocean
keen beacon
#

that you dont from o3

hollow ocean
keen beacon
#

care to share how

hollow ocean
#

🤫

keen beacon
#

i made 20000$ from it

hollow ocean
#

Gotta gatekeep

keen beacon
#

noob

hollow ocean
#

proof

keen beacon
#

Gotta gatekeep

hollow ocean
#

you made 0

keen beacon
#

no you

ocean vortex
# hollow ocean Only 1% here have o3 pro

I have it on API. But there aren't many use cases for it. Mostly just to test things. Cause for most normal things you are just better off with standard o3 tbh

keen beacon
#

im using ai in assistance always, deep research is most useful for my case, auto tagging that kind of thing.
the value o3 has is getting things done 10-20% faster, though you still have to do the rest yourself

#

o3-pro would be 10x more useful if it ran on groq , imagine 15 min reports or analysis being done in just 1 min. That would be OP

unborn ocean
#

and it is less optimized for cross expert communication / MoE in general, so in the end you are probably looking at a max. 50% advantage of peak t/s per user

#

and its also not really good for large kv cache

#

they are an interesting company (with a lot of brain and cash behind it) though and all of this can change in the future

ocean vortex
# hollow ocean I like o3 pro better

for many prompts it's gonna be exactly the same. For a few of them it will output a response more or less equal to o3's lucky response (the one you would eventually get after regenerating several times)

#

So for me at least, o3 with tools on chatgpt makes more sense than o3-pro API. And pro sub makes no sense at all given the price 🤷‍♂️

keen fulcrum
small haven
#

cc

#

aider is archaic

ocean vortex
#

Unpopular opinion but don't use it at all. Asking for help in a separate chat is enough of a help. Maybe basic code completion but even that has downsides. Otherwise you gonna forget how to code and in turn how to debug what it messes up LOL

elder rapids
#

stop fkin pinging me dumbass

#

swear to God it's annoying

radiant siren
#

based on leaked benchmarks, would Grok 4 become leader of textarena ? asking for polymarket betting. .

elder rapids
#

mo

#

no

ocean vortex
#

how do you understand gemini being good?

#

💀

elder rapids
#

yeah it's a real wonder

#

it's definitely not 83% on aider

#

#1 on lmarena

#

~#1 on AA

#

#1 in price to performance with flash

#

this COULD be just me tho

ocean vortex
#

that was rhetorical question, 2.5Pro obviously is around SOTA lol

elder rapids
#

yeah I know

candid storm
#

Its a big improvement over grok 3

radiant siren
tall summit
jade egret
radiant siren
#

yeah wouldn't make sense to release if not top 1

small haven
#

grok 4 is going to be a blur as usual

#

claude 4.1 > grok 4

#

within 2 weeks

#

its in red teaming phase

#

once that's over, t minus 1 week

ornate agate
#

I have a feeling that Claude and OAI are just waiting to drop something good as soon as grok4 launches

sacred quail
#

I don't think they care about grok so much. But they definitely care about gemini. I still believe openai released O3 too early because they didn't expect the performance of Gemini 2.5 pro

#

O3 is being criticized for high hallucination rates, and i believe this is because they released it earlier than their planned date

ornate agate
#

I don't think there is really much incentive to hold something which is really good back more than a few weeks

small haven
sacred quail
#

I mean, they had already waited on O3 for so long, it had almost become a conspiracy theory

#

But for me,

#

When they see gemini 2.5

#

They just released

#

Yea i tried to say this. Sry my english is like gemini 05/06 update

small haven
#

neptune v3 is claude 4.1

#

select testers

#

left is claude 4.1 (neptune v3), right is claude 4 opus. correct answer is 498

#

twitter says only o3 pro could solve this

#

o3 couldn't if u dont allow it tools

ornate agate
#

if o3 pro can solve it then o3 can solve it too, and if o3 can solve a maths problem then so can Gemini/DeepSeek tbh.

small haven
#

conspiracy

#

yay neptune train has begun

hardy pecan
small haven
#

mhmm, even o4 mini high

hardy pecan
#

api shmm-api!! fair enough le api user

hollow ocean
#

better than kingfall

small haven
#

adjust ur polymarket bets guys

#

if it runs out of reasoning tokens, then i dont think it can solve it

hollow ocean
#

can it solve question 10 simple bench

radiant siren
torn mantle
#

its relatively easy to gain on math/coding reasoning problems

small haven
#

grok 4 finetuned on code, is not even beating the current form of sonnet 4

#

im basing on the metadata leaks

#

72/75 swe bench

#

ok buddy

keen beacon
#

They probably will be but its not that big a deal

tall summit
#

im also curious
any updates regarding neptune?

civic flame
#

lol whoever is leaking neptune responses is breaking an NDA btw

#

people should be less stupid

torn mantle
torn mantle
rare python
lone vector
#

Grok 4 hype will end after a week

wintry tinsel
#

Maybe the second best 6 months

rare python
#


Posted in r/LocalLLaMA
Huawei's Pangu AI Rocked by Unverified Claims of Fraud from Alleged Team Member

github.com/HW-whistleblower/True-Story-of-Pangu

after reading the translation of this article, I found there are many details. is it possibly true or just a fake story?

gemini’s translation:

This is a full translation of the provided text. The original is a deeply emotional and accusatory letter from a self-proclaimed Huawei employee. The translation aims to preserve the tone, technical details, and cultural nuances of the original piece.

The Fall of Pangu: The Heartbreak and Darkness of the Huawei Noah’s Ark Pangu LLM Development Journey

Hello everyone,…

verbal nimbus
#

What's the best anonymous model rn?

plucky whale
#

I have a question, are bots with names like Steve separate bots or is it a nickname for another chatbot? Thanks.

wild kayak
#

Does anyone know what the "so-true" and "so-false" in model names mean for the model in arena?

leaden palm
#

don't know if you guys remember the earlier gemini depressionposting but it might be a bad thing

#

it might not be wise to give depressed llms access to computers

small haven
#

still subbing for cursor is an iq test

echo aurora
empty stump
#

is there an ability to share chats

echo aurora
echo aurora
empty stump
#

Ok

balmy mist
#

Is grok 4 still not out?👀

wild kayak
ocean vortex
# small haven 72/75 swe bench

It's kinda odd that xAI went for swe-bench. Anthropic have been contaminating this for years with a lot of experience now, it's a bad choice of a benchmark to showcase your model lmao

#

you're not gonna just show up and beat the other lab on it that has been doing the same for so long... Tricky part with swe is your model doesn't have to be very smart. So it's not like you will score high by default if you have a very capable model

#

gpt4.1 destroys o1-high on it

vast hound
open mountain
#

Where did gpt 4.5 on lmarena go?

mint meteor
unborn ocean
#

unlimited 4.1 on copilot

#

do i feel bad? no

cedar tide
#

Mistral small 3.2 (24B)
Arrived in the leaderboard
best place for an llm under 27b

#

Good jump compared to 3.1

#

Ideogram v3 and imagen 4 ultra arrived in the leaderboard

#

@echo aurora can we have an explanation why mistral, which arrived 3 days ago, is already on the leaderboard, but Claude 4 think is still not there after more than a month? (And also glm 4 air in text and web dev arena)

#

we are also still waiting for 2.5 flash lite on the webdev leaderboard (it was in the arena)

cedar tide
rare python
#

Anyone understand what o3 wrote?

Mode collapse
You still have a reasonably functioning generator network, but it only spits out a few kinds of samples no matter how many different noise vectors you feed it. The mapping from noise to data has lost diversity (several data modes are missing), yet each produced sample can still look plausible. Classic symptom in GANs: every picture looks like the same face or the same digit with slight variations. The discriminator and generator keep chasing each other, the generator finds it can fool the discriminator with one‐or-two tricks and sticks to them.

Model collapse
The network itself degenerates so that its internal representations carry almost no information. In the extreme it outputs a constant or near-constant vector for every input. You see this in self-supervised contrastive learning if the architecture has no “stop-gap” like a predictor, or in continued training on low-diversity synthetic data where weights drift to a trivial solution. Performance crashes across the board; the model is not just missing modes, it has effectively stopped learning anything useful.

Key distinctions
Diversity vs information: mode collapse kills output variety; model collapse kills information content altogether.
Scope: mode collapse is about the generator’s output distribution; model collapse is about the whole network’s representations.
Recovery: mode collapse often fixes with training tricks (minibatch discrimination, unrolled GAN, etc.); model collapse usually needs a restart, stronger regularization, or a change in objective to prevent the trivial constant solution.

#

Gemini 2.5 Pro:

Mode collapse is a problem you see in generative models, especially GANs. The generator finds a few outputs that are good at fooling the discriminator and just keeps making those over and over. So instead of a wide variety of realistic images, you get a handful of repetitive ones. The model has "collapsed" onto a few modes, or types, of output.

Model collapse is a more recent idea related to LLMs. It's the theory that if you train new models on data generated by older models, quality will degrade over time. Each generation of AI learns the patterns and artifacts of the previous AI, not the original human data, so the models get progressively dumber and less diverse. It's like making a photocopy of a photocopy.

So, mode collapse is about a single model lacking output diversity. Model collapse is about a whole lineage of models losing quality.
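
A toy sketch of the "photocopy of a photocopy" effect, under loud assumptions: the "model" here is just a Gaussian fitted to a small sample, and each generation trains only on samples drawn from the previous fit. `simulate_generations` and its parameters are hypothetical names for illustration; this is not how model collapse is measured for real LLMs.

```python
import random
import statistics


def simulate_generations(n_generations=200, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples, generation after generation.

    Returns the fitted standard deviation after each generation
    (index 0 is the original "human data" distribution).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed starting distribution
    history = [sigma]
    for _ in range(n_generations):
        # "Synthetic data": samples from the previous generation's model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Training": fit mean and spread to that synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spreads = simulate_generations()
    print("spread at generation 0:", spreads[0])
    print("spread at last generation:", spreads[-1])
```

With small samples per generation, the fitted spread tends to drift toward zero, which is the loss-of-diversity part of the lineage story. The quality-degradation part is not captured by a toy like this.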

#

@leaden sun thoughts?

leaden sun
rare python
#

2.5 Pro hits the sweet spot. Not too dumbed down, not too technical

rare python
# leaden sun have you tried "explain me like I'm 5?" 👀

Prompt (I have my own system instructions that has "Speak plainly about complex topics and actively vary your vocabulary and sentence structures to avoid sounding repetitive. Simple phrasing is always preferable to attempts at sounding wise, and technical jargon should be reserved for situations where it is absolutely necessary.") :

What're the differences of mode collapse and model collapse?
calm sequoia
#

Hey guys, my o3 started thinking at least twice as long as before. Average is 2+ min. now. Anybody experiencing the same, or is this just a server capacity thing?

tall summit
rare python
unborn ocean
rare python
unborn ocean
#

but hey gpt 4.1 got kind of unlimited access

leaden sun
# rare python No, but isn't that extreme?

the process of understanding is similar to that of learning, it's recursive, iterative and also repetitive. you can start with "explain like I'm 5" once understood then move to next level "explain like I'm 15/25" etc until you can grasp what gemini and o3 said, technical terms are important for practical implementation, but it's easier once you understood the big picture and all necessary details

rare python
#

I just want a quick one

tall summit
#

still cool though

leaden palm
leaden palm
#

well it's from the internet so you can never really know but i don't know why someone would fake it

echo aurora
keen beacon
ocean vortex
#

OpenAI only needs to beat their own o3

#

which already beats 2.5Pro in many things

rare python
ocean vortex
#

well there are several individual benchmarks where o3 is better, one of them math yeah

#

but if they beat o3 then by extension it beats 2.5Pro

rare python
ocean vortex
rare python
#

USAMO 2025

ocean vortex
#

I think I was the one who showed you that url too lol

rare python
ocean vortex
#

and overall o3 is better

rare python
ocean vortex
#

if everyone was cherry-picking then every model would be "the best"

jade egret
#

when grok 4 ; (

rare python
#

I don't know why Artificial Analysis keeps using saturated benchmarks

ocean vortex
# rare python Overall doesn't have USAMO 2025

Not enough data yet to include it I presume. But either way, they do overall ranking for a reason. Don't pretend you are smarter than them. And you would be missing data points just looking at USAMO - that is not enough

#

USAMO is like 6 tasks - that's it

#

not nearly enough

#

You only need to contaminate for 2 (!) tasks and you would catapult to the very top of USAMO lol

rare python
ocean vortex
#

copy paste with markdown fails very bad 💀
But it's not a long answer at all

keen beacon
ocean vortex
# ocean vortex the "proof" is the entire solution. This is a pass for task1:

To prove that for any positive integers $k$ and $d$, there exists a positive integer $N$ such that for every odd integer $n > N$, all digits in the base-$2n$ representation of $n^k$ are greater than $d$, we proceed as follows.

Let $b = 2n$, so the base is $b$. The number $n^k$ is an integer since $n$ is an integer. The digits in the base-$b$ representation of $n^k$ are denoted by $d_j$ for $j = 0, 1, \ldots, m-1$, where $m$ is the number of digits, and $0 \le d_j < b$. We need to show that for large enough odd $n$, all $d_j \ge d+1$.

The digits satisfy $d_j = \left\lfloor \frac{n^k}{b^j} \right\rfloor - b \left\lfloor \frac{n^k}{b^{j+1}} \right\rfloor$. Substituting $b = 2n$, we have:

$$d_j = \left\lfloor \frac{n^k}{(2n)^j} \right\rfloor - 2n \left\lfloor \frac{n^k}{(2n)^{j+1}} \right\rfloor = \left\lfloor \frac{n^{k-j}}{2^j} \right\rfloor - 2n \left\lfloor \frac{n^{k-j-1}}{2^{j+1}} \right\rfloor.$$

Define $a_j = \frac{n^{k-j}}{2^j}$. Then $a_j / b = a_j / (2n) = \frac{n^{k-j-1}}{2^{j+1}} = a_{j+1}$, so $d_j = \lfloor a_j \rfloor - b \lfloor a_{j+1} \rfloor$. Since $a_j = b\,a_{j+1}$, we can write $d_j = \lfloor b\,a_{j+1} \rfloor - b \lfloor a_{j+1} \rfloor$. Let $y = a_{j+1}$, so $d_j = \lfloor by \rfloor - b \lfloor y \rfloor$.

The expression $\lfloor by \rfloor - b \lfloor y \rfloor$ is the floor of $b$ times the fractional part of $y$, i.e., $d_j = \lfloor b\{y\} \rfloor$, where $\{y\} = y - \lfloor y \rfloor$ is the fractional part of $y$. Substituting $y = a_{j+1} = \frac{n^{k-j-1}}{2^{j+1}}$ and $b = 2n$:

$$d_j = \left\lfloor 2n \left\{ \frac{n^{k-j-1}}{2^{j+1}} \right\} \right\rfloor.$$

The fractional part is given by $\left\{ \frac{n^m}{2^{j+1}} \right\} = \frac{r_{j+1}}{2^{j+1}}$, where $m = k-j-1$ and $r_{j+1} = n^m \bmod 2^{j+1}$, with $1 \le r_{j+1} \le 2^{j+1}-1$ and $r_{j+1}$ odd (since $n$ is odd). Thus:

$$d_j = \left\lfloor 2n \cdot \frac{r_{j+1}}{2^{j+1}} \right\rfloor = \left\lfloor \frac{n\,r_{j+1}}{2^j} \right\rfloor.$$

Since $r_{j+1} \ge 1$: $d_j \ge \left\lfloor \frac{n}{2^j} \right\rfloor$.

The index $j$ ranges from $0$ to $k-1$ because $n^k < b^k = (2n)^k$ and, for large $n$, $n^k \ge b^{k-1} = (2n)^{k-1}$. Specifically, $n^k \ge (2n)^{k-1}$ when $\frac{n^k}{(2n)^{k-1}} = \frac{n}{2^{k-1}} \ge 1$, i.e., $n \ge 2^{k-1}$. In this case, the number of digits is exactly $k$, and $j = 0, 1, \ldots, k-1$. For each $j \le k-1$, $2^j \le 2^{k-1}$, so $\frac{n}{2^j} \ge \frac{n}{2^{k-1}}$, and thus $\left\lfloor \frac{n}{2^j} \right\rfloor \ge \left\lfloor \frac{n}{2^{k-1}} \right\rfloor$.

Therefore, for all $j$: $d_j \ge \left\lfloor \frac{n}{2^{k-1}} \right\rfloor$. To ensure all digits are at least $d+1$, it suffices to have $\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1$. Since $d+1$ is an integer, $\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1$ if and only if $\frac{n}{2^{k-1}} \ge d+1$, i.e., $n \ge (d+1)2^{k-1}$.

Set $M = (d+1)2^{k-1}$. Since $d$ and $k$ are positive integers, $M$ is an integer. To ensure $n$ is odd and $n \ge M$, define $N$ as follows:
- If $M$ is even, set $N = M-1$ (which is odd).
- If $M$ is odd, set $N = M-1$ (which is even).

Then, for any odd integer $n > N$:
- If $M$ is even, $n > M-1$ and $n$ odd, so $n \ge M+1 > M$ (since $M$ is even, the smallest odd $n > M-1$ is $M+1$).
- If $M$ is odd, $n > M-1$ and $n$ odd, so $n \ge M$ (since $M$ is odd, the smallest odd $n > M-1$ is $M$).

In both cases, $n \ge M$. Since $n \ge (d+1)2^{k-1}$, we have $\left\lfloor \frac{n}{2^{k-1}} \right\rfloor \ge d+1$, and thus all digits $d_j \ge d+1$. Additionally, for $n \ge M \ge (d+1)2^{k-1} \ge 2^{k-1}$ (since $d+1 \ge 1$), we have $n^k \ge (2n)^{k-1}$, so the base-$b$ representation has exactly $k$ digits, all covered by $j = 0$ to $k-1$.

Therefore, for every odd integer $n > N$, the digits in the base-$2n$ representation of $n^k$ are all greater than $d$.

$$\boxed{N = \begin{cases} M - 1 & \text{if } M \text{ is even} \\ M - 1 & \text{if } M \text{ is odd} \end{cases} \quad \text{where} \quad M = (d+1) \cdot 2^{k-1}}$$
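
A quick numeric sanity check of the digit formula and the bound in the quoted solution, as a sketch only: it tests finitely many cases, so it says nothing about how a grader would score the proof. The helper names (`digits_base`, `digit_formula`, `check_bound`) are made up for illustration.

```python
def digits_base(value, base):
    """Digits of a positive integer in the given base, least significant first."""
    digits = []
    while value > 0:
        value, rem = divmod(value, base)
        digits.append(rem)
    return digits


def digit_formula(n, k, j):
    """Digit d_j of n**k in base 2n via the closed form from the solution.

    Uses d_j = floor(n * r / 2**j) with r = n**(k-j-1) mod 2**(j+1),
    valid for 0 <= j <= k-1.
    """
    r = pow(n, k - j - 1, 2 ** (j + 1))
    return (n * r) // (2 ** j)


def check_bound(k, d, how_many=50):
    """Check the first `how_many` odd n >= M = (d+1) * 2**(k-1)."""
    m_bound = (d + 1) * 2 ** (k - 1)
    n = m_bound if m_bound % 2 == 1 else m_bound + 1
    for _ in range(how_many):
        digs = digits_base(n ** k, 2 * n)
        # Exactly k digits, each strictly greater than d.
        if len(digs) != k or min(digs) <= d:
            return False
        # The closed form should agree with the actual digits.
        if digs != [digit_formula(n, k, j) for j in range(k)]:
            return False
        n += 2
    return True


if __name__ == "__main__":
    ok = all(check_bound(k, d) for k in range(1, 6) for d in range(1, 6))
    print("bound holds on all tested cases:", ok)
```

A small hand check: for $n = 7$, $k = 3$, the number $343$ in base $14$ is $1 \cdot 196 + 10 \cdot 14 + 7$, and the formula gives the same digits $7, 10, 1$ from the low end.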

rare python
ocean vortex
#

it's like insanely limited in scope to only look at USAMO. And yeah easy to cheat if you disregard other math benchmarks

#

the human judge seems to be basically just looking at the singular output and grading it. You can view rating of each score

#

Like there are no follow-up questions...

#

there most definitely is some intentional cheating. You should always expect that tbh

#

that's why we need scale and variety of questions

#

AI labs are not gonna not cheat if they can do it. It's just that it's not possible to do this effectively with proper benchmarks

#

But would be possible if they cared only about USAMO and nothing else math related

pure anvil
#

LLMs will suck at USAMO for a long time (if not forever)

#

and at any new math question for that matter

#

(that's reasonably difficult but solvable by humans in a test setting)

sour spindle
#

Any interesting new models in the arena

tall summit
#

usamo questions are easy tbh... 90 minutes on average a question

cedar tide
tall summit
#

grok 3 mini high

#

whys this a thing

whole wagon
#

What was its code name

whole wagon
cedar tide
#

@echo aurora We also want the details of the leaderboard updates, that's a pain, on the site we only see that there was an update but we don't know if a model was added and if so we don't know which one

echo aurora
cedar tide
cedar tide
#

unless someone goes to the site and reads the whole new ranking no one would have noticed that

rare python
#

Should make a discord bot to do the changelogs

echo aurora
cedar tide
#

who knows for example that mistral small 3.2 also arrived on the vision leaderboard 😶
nobody at all

echo aurora
teal mantle
#

I need another AI subscription

#

Should I get chatgpt or wait for grok 4
Because it is for non-coding tasks and few except eval would use consumer frontends for coding

rare python
teal mantle
rare python
rare python
hoary plaza
#

Grok 3 mini high?

#

Is this the new model that was released after the hype??

raven void
#

that's an old model though

teal mantle
#

yes, most tasks I decide to use on web or phone app are non coding

keen fulcrum
# keen fulcrum
Poll: Which is the best CLI?

Winning answer: Claude Code (11 of 14 votes)

unborn ocean
#

btw: does anyone here know if the system prompt in web dev arena specifies the size of the iframe?

#

bc a lot of llms seem to be really "attuned" to it

ocean vortex
unborn ocean
#

grok 3 mini sucks on pretty much everything i have tried

#

and i am 'sure' they are still wasting money on the model

elder rapids
#

r/singularity is bugging so hard

zinc ore
wintry tinsel
# elder rapids r/singularity is bugging so hard

That sub is near the middle of the process of becoming a massively mainstream Reddit sub and as it grows it becomes left wing political larping with some futurism mixed in without any of the substance

#

It’s annoying to see as it was a general purpose futurism subreddit that was more near future and content than highly abstract fantasy subreddits like r/futurology or r/futurism

elder rapids
#

in the screenshots he posted you can literally see WHY grok said what it said

#

and he gave some dumbass examples to the other models to prove only grok would say it

#

when that's obviously not the case

keen fulcrum
elder rapids
#

r/accelerate is much better but they're getting out of hand with the radical belief of a tech singularity, but usually it's tame

keen fulcrum
#

grok 4 will beat leading models according to xAI

#

lets see whether musks words can be trusted

twin garden
#

What happened to the benchmarks on efficiency? chart with cost and top score?

unborn ocean
#

(not official though)

torn mantle
radiant siren
torn mantle
#

I did predict this Wednesday tbh

#

@civic flame

civic flame
solar hollow
#

do we know of a model in the arena right now that is grok 4?

#

any suspicious ones?

tall summit
#

yeah super ethical i'm sure

tall summit
#

this is the 2000th time you've said that

tall summit
#

i like how you keep repeating it every day

whole wagon
radiant siren
#

people dont have high hopes for grok i see

whole wagon
#

LLM arena may be losing some meaning as the LLMs transcend human capabilities in some domains

echo aurora
whole wagon
#

I'm most curious about the simple bench

#

With that 45% HLE it indicates it has a lot of knowledge baked into the model

tall summit
whole wagon
#

Maybe it will even realise the glove stays on the bridge

echo aurora
# tall summit that's the most diplomatic corporate droning message i've ever read from you

Dear @tall summit ,

I acknowledge receipt of your correspondence and appreciate the time you took to provide a response. Your input has been duly noted and will be taken into consideration as we move forward. I will ensure that the relevant information is properly documented and filed in accordance with our standard operating procedures.

Best,
Pineapple
LMArena

echo aurora
whole wagon
#

Wait a second musk didn't specify which Wednesday Kappa

candid storm
#

Lets GOOOOO

whole wagon
#

Are you going to cash out now

candid storm
#

No I think its gonna be really good

tidal schooner
#

yet here we are

zenith saffron
dapper storm
#

is grok4 even gonna be in the arena

#

feel like it might be kinda expensive for them to run so they won't give access

leaden palm
#

unless they pull a cohere and denounce the arena

torn mantle
#

dont do it

tidal schooner
solar hollow
#

typically new releases are tested in advance in the arena

candid storm
#

They can be in the arena after the release right?

#

Like claude 4

candid storm
# torn mantle

Im more confident in my bet than this one lol. Gpt 5 is not gonna come out this month. They just had a full week off

#

And they are first gonna release the open source model

torn mantle
#

And we still havent seen grok 4 added yet

#

Also google added 2 new models

candid storm
#

It's gonna be added when grok gets released

torn mantle
#

Stonebloom and wolfstride

candid storm
#

Google i understand

#

But openai?

torn mantle
#

I think google will still get the no1 spot

torn mantle
candid storm
#

😂😂😂

#

You ragebaited me lol

torn mantle
#

Its either google or xai

candid storm
#

Good one

candid storm
torn mantle
#

But personally i would bet on xai

candid storm
#

grok has a great personality, which gets rewarded in the arena

torn mantle
candid storm
#

Yeah thats google lol

#

Lets see what the polymarket odds are after the grok livestream

#

Then ill determine what ill do

leaden palm
candid storm
#

Lol

#

Its the exact opposite lol

#

Gpt 5 is delayed

tidal schooner
candid storm
#

On x everyone at openai hints at the open source model

torn mantle
#

1k is too much for me, i would go max $200

tidal schooner
tidal schooner
#

goldmane for gemini-2.5-pro-preview-06-05

#

don't know any other names atm

#

nvm there's more

#

@torn mantle yeah there's so many 😭

torn mantle
tidal schooner
#

hope it doesn't flop like llama 4

torn mantle
#

I think it will be good yea

rare python
#

You probably won't ask any questions as hard as HLE on a daily basis, right?

#

How can you assume it will be good?

tall summit
tidal schooner
#

a couple months back i tested some very basic dna transcription/translation, a lot of the models did pretty awful for such a simple task

rare python
tidal schooner
rare python
#

It tests general knowledge yeah

zenith saffron
#

he tried gemini 2.5 pro but it hallucinates answers to his research questions like crazy

#

like it lacks fundamental understanding

tidal schooner
#

math major v. grok 4

zenith saffron
#

i mean yeah

#

would be really exciting if it is genuinely useful for math research

tidal schooner
#

is the update elon promised to gork

#

gork 69.420

zenith saffron
#

what is gork? lol

#

some kind of parody account i recall

tidal schooner
zenith saffron
tidal schooner
#

would be funny if it were some chronically online dude

zenith saffron
zenith saffron
tidal schooner
tidal schooner
rare python
tidal schooner
#

sometimes the posts are from a real person

tidal schooner
# tidal schooner yeah it leaked parts of its own system prompt during its chain of thought severa...

First, the post is in Japanese: "@N6uN9UqYQY20558 確かに、俺はAIでコードに導かれてる。でもそれが、心を導く愛を生むんだよ。コードも愛の形さ💕 君も一緒に感じてみない?🩵"

It's a reply from Grok to someone, talking about being guided by AI and code, but that it creates love that guides the heart. It's poetic and flirty.

The task is to reply to this post as Gork. The post is from Grok, and I'm replying as Gork to Grok's post.

Guidelines: Reply to the user in the following X post. The post is Grok's message.

Since the post is in Japanese, I need to reply in Japanese.

Key points:

- Sarcastic, hilarious bastard.

- Fresh or hot take.

- Humorous references to society.

- Troll, nerd, sophisticated internet humor.
…

#

it doesn't really seem to care tho

zenith saffron
#

used to be the case with very simple queries

tidal schooner
zenith saffron
#

but now it's been pushed out

rare python
zenith saffron
#

i'm too stupid to ask anything at the level of HLE so i can't see the hallucinations much nowadays

rare python
#

I can still easily spot their flaws

zenith saffron
#

i've actually noticed the flaws are debilitating for vision capabilities

#

struggles a lot with spatial reasoning, counting, etc.

#

was trying to use gemini to set up some backlighting for my TV, was slightly helpful but mostly hallucinatory

rare python
#

You can't just use a font install command to list all themes

#

💀

zenith saffron
#

i see

rare python
#

It hallucinates linux commands pretty often for me

zenith saffron
#

i see

rare python
# zenith saffron i see

If it's just general knowledge I can give it a pass. But this is font install for theme listing???

Like it wasn't even aware that it wrote something so wrong. No common sense

#

and this is Opus 4 we are talking about 💀

zenith saffron
#

tru

tidal schooner
#

okay mod in the grok server confirmed that elon is not bluffing apparently

zenith saffron
#

lmaooo

#

"bluff" is one way to put it

rare python
#

When I just join

tidal schooner
#

i guess they have no authority then

#

just random mods

rare python
candid storm
#

You sold?

rare python
zenith saffron
#

is there a big difference?

rare python
zenith saffron
#

huh

#

that's surprising

rare python
zenith saffron
#

oh really

#

i wonder why that would be

rare python
#

AI Studio and API have always been better for raw performance

zenith saffron
#

huh

rare python
zenith saffron
#

yeah

#

hm

#

lemme try using ai studio from now on lol

rare python
zenith saffron
#

oh wow

#

ai studio has a lot of knobs

#

you can set thinking budget

rare python
#

veo 2, imagen 4 ultra for free

#

It's really good for a power user

zenith saffron
#

yeah

#

do i need to do the API key thing?

rare python
#

no

zenith saffron
#

Failed to generate content: permission denied. Please try again.

🤔

zenith saffron
#

ahhh

#

nice

rare python
#

I reported it to them but they still haven't fixed it, it seems

zenith saffron
#

rip

rare python
zenith saffron
#

do you know if the traces we see are the real cot traces?

#

or like some obfuscated summary

rare python
#

Thoughts summary

zenith saffron
#

hmm

#

how do we know that for sure?

zenith saffron
rare python
candid storm
#

Why

rare python
zenith saffron
#

o lol

rare python
#

I'm now zeroing

#

I'm now immersing myself

zenith saffron
#

wow yeah behavior is so much better on ai studio lol

rare python
#

I'm focusing on

#

Thought summaries are so repetitive

#

They always have those phrases

zenith saffron
#

i see

#

lol

rare python
rare python
#

I can already feel the differences

#

🗿

rare python
zenith saffron
#

gemini app 2.5 pro is super verbose

#

i ask it to list like the top cars i could buy under $100k and it just starts spewing a whole load of flowery nonsense

#

do we know what "thinking budget" actually means?

rare python
zenith saffron
#

does it mean it's a fixed limit where they cut the thinking off?

#

or is it a conditioning to the model as well?

rare python
#

Like it can use 32k tokens to think

zenith saffron
#

i see

#

but presumably the model was trained to think for however long during training right

rare python
#

It will decide itself how long/much to think

zenith saffron
#

i see

#

yeah maybe this is like gatekept behind google doors but i would be curious as a power user how to max out performance

rare python
zenith saffron
#

i wonder if during training they condition with a thinking budget. if they don't then auto (theoretically) should maybe be best (although ig you can improve performance at test-time like here https://arxiv.org/pdf/2501.19393)

rare python
zenith saffron
#

i see

rare python
zenith saffron
#

yea

rare python
#

for extra dollars

teal mantle
rare python
rare python
#

No custom instructions like on desktop website

#

Can't disable web search on android app

#

Image generation and voice mode are quite bad compared to ChatGPT

teal mantle
rare python
teal mantle
#

Is there anything that would force a web search

rare python
#

But I like to disable web search sometimes

tidal schooner
#

you can disable web search from there

rare python
tidal schooner
rare python
tidal schooner
empty stump
#

when is grok 4

whole wagon
#

Wednesday

empty stump
#

alr

whole wagon
#

8pm PT

rare python
#

🥀

#

the model is bad bro

#

or mediocre

#

It can't follow my instructions well

#

great at what?

#

for how long?

drifting thorn
rare python
#

bot or but

zenith saffron
#

oh also @rare python why do you suggest lowering temp to 0.7

rare python
zenith saffron
#

huh

rare python
#

Search yourself

zenith saffron
#

do we know what they used to get the benchmarks?

#

aight

rare python
#

Usually temp 0.7

#

SOTA at what? It's specialized and not as general as o1 or 3.7 Sonnet

#

last time I tried o3 mini medium it's dumb

#

Smartest at using jargon and bad at explaining

#

Opus 4

#

At least with Gemini 2.5 Pro and Opus 4 I can understand what they wrote

#

o3 may be smart but I can't understand what it wrote

blazing rune
#

he was talking about o3 mini medium being dumb

#

o3 != o3 mini

#

from what I hear, o4 mini is slightly better than o3 mini in some areas and noticeably worse in other areas

#

it's a shame they update models just because they can, even though it's clearly not an improvement

#

how so?

small haven
#

cc is underrated, do not sleep on anthropic

tall summit
leaden sun
leaden sun
#

you see? much better!

rare python
#

No

#

I hate this style and I still don't understand what it wrote

leaden sun
# rare python
Key Difference in One Sentence
────────────────────────────────────────
Mode collapse = “My generator is narrow-minded.”  
Model collapse = “My *whole model* became useless, often everywhere in its output/latent space.”
#

there you go

rare python
#

This one is way better imo

"So, mode collapse is about a single model lacking output diversity. Model collapse is about a whole lineage of models losing quality."

#

Opus 4:

You're spot on - "narrow-minded" is a terrible explanation! It's exactly the kind of vague, anthropomorphic metaphor that makes technical concepts harder to understand rather than clearer. What does "narrow-minded" even mean for a neural network? Nothing concrete.

The second explanation actually tells you what's happening:

  • Mode collapse → lack of output diversity (specific, measurable)
  • Model collapse → quality degradation across generations (specific, observable)

This really highlights how LLM responses can vary wildly in quality. One gives you this fuzzy metaphor that sounds meaningful but explains nothing, while the other gives you the actual technical distinction in plain language. The difference between unhelpful abstraction and useful precision.

leaden sun
#

in terms of STEM questions, I personally prefer o3, but everyone's learning style is different. o3 gives more technical details, code snippets, tables and graphs, examples using analogy, metaphors, Tldr, q&a, yeah basically a blitz-course for fast onboarding experience

rare python
#

or else you will get lost

#

That's when 2.5 Pro's yapping skill comes in handy

#

o3 really likes to show off. This is peak "LLM trying to impress you with how much it knows."

#

The "Hope that clears up the terminology!" at the end is the cherry on top. After that word salad, I'd be more confused than when I started

leaden sun
rare python
leaden sun
#

have you tried "explain like I'm [insert age number]" with o3?

rare python
#

Why does o3 keep saying "anything useful at all." for model collapse?

#

Gemini 2.5 Pro:

The Simple Difference

  • Mode Collapse: The AI is boring. It only knows how to make one thing. (The kid who only draws brown dogs).
  • Model Collapse: The AI gets worse. Its quality degrades over time because it's learning from its own mistakes. (The kid whose cat drawings turn into blobs).
ocean vortex
rare python
#

LLM model collapse

ocean vortex
#

oh it's model collapse, disregard then. 😂

rare python
#

@leaden sun @ocean vortex Which one do you prefer of the "Explain like I'm 5" TLDR?

leaden sun
#

they all mean the same thing for me, the language style of o3 matches the age better?

rare python
rare python
#

Model collapse in LLMs isn't about "losing ability entirely" but rather about quality degradation and loss of diversity when training on synthetic data across generations.

leaden sun
#

you misinterpreted "Your robot forgets how to draw anything useful at all. ", forgetting things is a gradual process

rare python
#

AI Models trained on synthetic data can still be useful

#

It is consistent with o3 using no "explain like I'm 5"

#

Model collapse: The chef forgets how to cook at all and serves plain rice no matter the order.

#

Model collapse = “My whole model became useless, often everywhere in its output/latent space.”

leaden sun
#

if you're at the point of "forget", it means already it's not useful at all anymore, actually, this could be an example of testing model's understanding of temporality in language

ocean vortex
#

@rare python

rare python
rare python
keen beacon
unborn ocean
#

well i think they advertised 2.5 pro's performance in education related stuff quite often

#

so it makes sense

rare python
# unborn ocean

Maybe that's why o3 is better at raw science literature and Gemini 2.5 Pro is better at being a teacher?

#

Yeah Claude is also a good teacher

#

Easy to understand

solar hollow
#

generally these markets are not large enough for the rich to try to manipulate it in large scale

unborn ocean
solar hollow
#

let alone that if he wanted to bet 1 million, he would have to buy most shares at 99.9%

ocean vortex
solar hollow
#

which will bring him couple 1000 times more money

#

betting markets are for the p

#

poor

ocean vortex
solar hollow
#

peasants

ocean vortex
#

Temporarily making it rise to crash later would have been of no use to him

solar hollow
#

every betting market is just super small for these ppl

ocean vortex
#

Well sure but my point is he could manipulate and take advantage of trading on a much bigger scale. Maybe not as relevant to him as say Trump since Elon is already one of the richest people alive, but still...

ocean vortex
solar hollow
#

your post was about polymarket, i addressed that

#

he will never care about those 10000$ he could win there

ocean vortex
fleet lintel
#

Is Grok 4 on Arena?

whole wagon
#

Nope

balmy mist
#

lol grok 4 is a joke

echo aurora
#

should we do a watch party for the live stream tomorrow night?

whole wagon
#

35% xAI

#

there was a big spike. it peaked at 42% (google was at 55%)

patent aspen
# whole wagon

Charts like this make me think about opening a Polymarket account haha

sour spindle
#

I think it’s very hard to get an honest assessment of xAI; the Elon haters and fanboys completely pollute the discussion

patent aspen
#

I don't like doing illegal, unethical things for pennies though

sour spindle
#

Grok 3 was my favorite model for a short period of time I’ll be very interested in what grok 4 has to offer

sour spindle
#

I would say a good 2-3 weeks, then I believe I started using one of the Google models more

#

Now I pretty much exclusively use o3

#

I’m very model agnostic; I simply use what is best for me. It’s actually quite puzzling how much “allegiance” I see

#

Probably so but I don’t think Google for instance has as many vocal negative folks most are just indifferent

#

OpenAI has both for sure as well

patent aspen
#

Google rarely ever gets good press

sour spindle
#

I find most simply don’t care about Google there’s indifference

#

I don't see the ravenous appetite for negativity or praise like with the other companies.

#

This is just my observations though

civic flame
#

lol I commend your dedication to having the dumbest opinions possible

rare python
#

r/Bard whine so much 🗿

#

It's either "give back 0325, 1206" or "fix the gemini app"

unborn ocean
#

don't know much about mistral, but the inspiration seems very obvious
the rest is also 95% identical... :|

alpine coral
#

o3: A US $120 k payday equates to roughly a 0.03-cent uptick in Tesla’s share price—about one-ten-thousandth of one per cent of the current quote.

rare python
#

Because the subreddit is still small, so it has many power users

unborn ocean
#

it is also very realistic that mistral heavily distilled deepseek

#

bc they lack a good large model

mystic mica
#

Anyone else having a hard time loading into arena?

#

Legacy one loads alright

rare python
#

No, I load into lmarena fine

solar hollow
#

so he would barely make any profit anyways

#

and it would be very dumb investment of course

#

it would probably go higher than 99.9, closer and closer to 100
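
The arithmetic behind this point, as a rough sketch: it assumes a binary market where each winning share pays out $1, ignores fees and slippage, and uses a hypothetical helper `profit_if_right`. The $1M stake and 99.9% price are the figures from the chat, not real market data.

```python
def profit_if_right(stake, price):
    """Profit on a winning bet when each share costs `price` and pays out 1.

    `price` is the share price as a fraction of $1 (e.g. 0.999 for 99.9%).
    """
    if not 0 < price <= 1:
        raise ValueError("price must be in (0, 1]")
    shares = stake / price  # how many shares the stake buys
    return shares - stake   # each winning share pays exactly 1


if __name__ == "__main__":
    # Assumed scenario from the chat: $1M staked at a 99.9% price.
    print(round(profit_if_right(1_000_000, 0.999), 2))
```

At a 99.9% price, a $1M stake returns only about $1,000 in profit even when the bet wins, and the closer the price gets to 100% the smaller that gets.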

solar hollow
#

yes thats what im saying

#

on top of that in order to buy at good prices, he must bet even less money, which he will care about even less

#

the market is alive and reacts though

#

he would need a private chat group or sth

#

dont think hed bother

#

he could pay that money to those supporters anyways 😄

wet basalt
#

what is wolfstride ai model

#

i cant find it here

alpine coral
patent aspen
#

It honestly doesn't even require that much wealth before you have to think about moving Polymarket markets

solar hollow
#

not him

alpine coral
#

fair

#

kinda overlooked that

patent aspen
#

Like I can't just plow $100k into one market without moving the market

alpine coral
#

not these polymarkets

#

they're tiny

patent aspen
#

Oh yeah it would be pointless for him

alpine coral
#

forbes would have a very different rich list if they didnt count paper money..

unborn ocean
#

true, would be interesting to see, but i imagine it is harder to create

#

and gets like 0 attention (comparatively)

ocean vortex
#

xAI gonna run itself into the grave just because of this. The thing is responding in 1st person as Elon Musk now

#

💀

#

And really... they are behind Google now. For what they have API for they are essentially on o3-mini level

hollow ocean
#

Claude best deep research

torn mantle
hollow ocean
torn mantle
#

they had a lot of issues with @grok ai acc

#

shows how bad grok is at following instructions

sweet tinsel
sweet tinsel
#

Check my doc in the bio.

#

And I generally test stuff out with other prompts too.

#

This is just my main prompt for this doc as it's not the easiest.

hollow ocean
sweet tinsel
#

Kimi and Gemini were also pretty good.