#general | Arena | Page 62

ocean vortex Jun 28, 2025, 7:03 PM

#

on their console which is aistudio equivalent you can see like 90% raw thing

#

they are not summarizing anything, rather changing or removing individual phrases

#

but that is not intrusive at all tbh

#

OpenAI on the other hand... Essentially every single sentence is majorly rewritten and shortened lol

#

so the wall of text becomes a 1 short paragraph

#

so you only have a very brief outline of what it is actually outputting

#

"vast majority are shown in full"

#

there's your answer

#

only 5% of requests are actually summarized..

#

so it's even more raw than I implied

#

that is not the point though, cause their summarization is never turning walls of text into 2-3 sentences. It's more like minimal thing. I was able to max it out and I was still seeing consistent stream of thought much like Deepseek.

#

not really. Sonnet goes off the rails if you max it out and will almost start contemplating its own existence for your average prompt lmao

#

And Opus barely thinks much regardless of the prompt, so there's almost never enough to summarize doesn't matter how hard your prompt is catgrin

#

The point is, Claude is really very close to raw thinking, you can't really complain about it... It's virtually indistinguishable from 100% raw thinking

sour spindle Jun 28, 2025, 7:14 PM

#

I have never been more bearish of a model after reading a tweet

#

"Not for everyone" = not very good

ocean vortex Jun 28, 2025, 7:16 PM

#

Sonnet is repeating itself over and over again, to the point where some summarization on top perhaps wouldn't even be all negative, if they implemented something more

keen fulcrum Jun 28, 2025, 7:17 PM

#

sour spindle I have never been more bearish of a model after reading a tweet

just some tech reporter

whole wagon Jun 28, 2025, 7:17 PM

#

I literally linked an article that explained. openAI straight up stated deepseek distilled the cot so obviously it was available for that to be possible

keen fulcrum Jun 28, 2025, 7:17 PM

#

apparently an elon fan

ocean vortex Jun 28, 2025, 7:17 PM

#

Ironically, I think for OpenAI summarization makes O3 look kinda better than it even is

#

If they outputted it raw I'm almost certain it would look like a bit of a mess, similar to R1 thinking

whole wagon Jun 28, 2025, 7:18 PM

#

https://youtu.be/hpwoGjpYygI?si=TiukSslqRyEnrsFr if you prefer video form lol

YouTube

Fireship

DeepSeek stole our tech... says OpenAI

OpenAI and AI Czar David Sacks accuse DeepSeek of stealing their IP to train their new R1 model, citing hard evidence gained from Microsoft. In addition, Alibaba just released Qwen 2.5 with impressive benchmark results.

ocean vortex Jun 28, 2025, 7:19 PM

#

"wait, maybe I made a mistake.... wait, let me double check.... Or perhaps the user meant this other thing <insert random wild theory that does not make any sense>...."

whole wagon Jun 28, 2025, 7:19 PM

#

I don't really care they did it tbh. But it's obvious it did in fact happen lol

#

There's no denying that

ocean vortex Jun 28, 2025, 7:20 PM

#

raw thinking is very often not a good look, even if the model performs lol

whole wagon Jun 28, 2025, 7:22 PM

#

There are many articles

ocean vortex Jun 28, 2025, 7:22 PM

#

whole wagon I literally linked an article that explained. openAI straight up stated deepseek...

I don't think they distilled the cot itself. They just trained on o1 final responses.

#

Which is still a good quality synth data

whole wagon Jun 28, 2025, 7:23 PM

#

They were. It was slightly cleaned up but mostly raw

zinc ore Jun 28, 2025, 7:23 PM

#

I remember openAI saying Deepseek stole their cot

ocean vortex Jun 28, 2025, 7:23 PM

#

does not only improve performance, but also your model learns proper message formatting

#

all the markdown, boxed responses, code formatting...

zinc ore Jun 28, 2025, 7:23 PM

#

And then they did the same to Gemini models, stealing their cot too

whole wagon Jun 28, 2025, 7:24 PM

#

The cot is the gold

#

The final output is not that important

#

The model needs to train on how to reach that final output lol

#

That's just not true at all lol

ocean vortex Jun 28, 2025, 7:25 PM

#

whole wagon The final output is not that important

Debatable. Final output is perhaps even more important. You can train conventional chat model on that. And then do your own RL training on that base(chat/instruct). That way you are ahead of the game

whole wagon Jun 28, 2025, 7:25 PM

#

The summaries were barely summaries before. The amount of detail they removed is massive

#

And they literally stated it's to prevent other companies training on the cot

#

It's not a debate topic

ocean vortex Jun 28, 2025, 7:27 PM

#

whole wagon And they literally stated it's to prevent other companies training on the cot

it's the main reason why they did that for sure. But it didn't turn out to work as good as they thought. Took Deepseek no time at all to reproduce it with no access to raw CoT

#

you still have final responses - and that's all you need to train your next gen model and do your own RL training. It's not straight forward but not magic or rocket science either

whole wagon Jun 28, 2025, 7:28 PM

#

"competitive advantage" read between the lines lol they are not going to do an official press release to say deepseek stole the cot lol

#

That statement is from openai engineers and Microsoft did a report on it

#

It's all in the vid I linked

ocean vortex Jun 28, 2025, 7:29 PM

#

whole wagon "competitive advantage" read between the lines lol they are not going to do an o...

there was nothing to steal lmfao. Reasoning was always summarized

whole wagon Jun 28, 2025, 7:29 PM

#

Bruh. The summaries changed over time

#

They were barely summarized initially

ocean vortex Jun 28, 2025, 7:29 PM

#

And after Deepseek they realized there's not much point hiding it

#

so now they allow for detailed reasoning

#

which is much less summarized

whole wagon Jun 28, 2025, 7:30 PM

#

And Gemini had the same thought ofc ofc

ocean vortex Jun 28, 2025, 7:31 PM

#

whole wagon And Gemini had the same thought ofc ofc

Gemini is just summarizing because they can tbh. There's no disadvantage for them to doing so and potential gain... But it's not like it was crucial for them either. First release was raw thinking like I said

#

They didn't distill entire model. They did their own RL training, they do have some very smart people...

#

They simply used o1, which was SOTA at the time, to generate synth data (final responses)

#

I think they only used that for V3. And then they did their own RL training to make R1 out of it

#

I remember that when I saw this post... Here's your official confirmation lol

https://fixupx.com/btibor91/status/1938736226779177351?t=7TaIWitnbnnTj6Dmveps-w&s=19

Tibor Blaho (@btibor91)

"We're truly excited to not just make a net new great frontier model, we're also going to unify our two series.

So the breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5.

And I really hope I'll come back soon to tell you more about it."

torn mantle Jun 28, 2025, 7:51 PM

#

this guy is asking to be blocked

balmy mist Jun 28, 2025, 7:54 PM

#

torn mantle this guy is asking to be blocked

fr dude talking about 4 when 3.5 been promised for like 2 months lol

#

i guess no one will be the target audience for grok 4

keen fulcrum Jun 28, 2025, 7:54 PM

#

3.5 has been renamed to 4

balmy mist Jun 28, 2025, 7:54 PM

#

lmaooo

#

wasnt 4 renamed to 3.5?

keen fulcrum Jun 28, 2025, 7:55 PM

#

Due to the bigger jump of the previous model

balmy mist Jun 28, 2025, 7:55 PM

#

now its back to 4?

#

wait so what are we currently waiting for? r2, grok 4 and deep think ?

#

is that all?

#

gpt5 not coming for another few months right?

keen fulcrum Jun 28, 2025, 7:56 PM

#

2.5 ultra is on the horizon for a while

balmy mist Jun 28, 2025, 7:57 PM

#

keen fulcrum 2.5 ultra is on the horizon for a while

oh yeah forgot about that

#

what models are yall using as yall drivers now? i been mia for a lil

#

i been liking o3 more and more tbh, a lot faster recently

#

but still like gemini

main gulch Jun 28, 2025, 8:01 PM

#

balmy mist gpt5 not coming for another few months right?

GPT-5 is expected this summer I think

#

first Gemini 3.0 checkpoints may appear in August

#

don't forget about OAI open-weight model, we still don't know its real level

#

Anthropic may release a minor Claude 4 update towards the end of summer (or even Claude Haiku 4, but they don't seem to be interested in cheap models)

#

R2 likely delayed until fall

unborn ocean Jun 28, 2025, 8:11 PM

#

ocean vortex I think they only used that for V3. And then they did their own RL training to m...

My theory is that they might have just used the o1 final output for training r1 from r1-zero

ocean vortex Jun 28, 2025, 8:11 PM

#

torn mantle this guy is asking to be blocked

He's saying the crazy things xAI team would like to say before releasing a heavily biased far-right political model. I hope it's not gonna be that LOL

unborn ocean Jun 28, 2025, 8:11 PM

#

For the human preferences and reasoning length, but that is just a guess

#

Because the deepseek r1-zero part has been shown to make sense and be truly unique rl (I think I heard something like that)

small haven Jun 28, 2025, 8:14 PM

#

what is nightforge

ocean vortex Jun 28, 2025, 8:16 PM

#

unborn ocean My theory is that they might have just used the o1 final output for training r1 ...

You would need reasoning chains for that though, not final responses...

#

they also added language consistency reward to stop it from mixing up the languages

patent aspen Jun 28, 2025, 8:18 PM

#

ocean vortex I remember that when I saw this post... Here's your official confirmation lol ...

Ah okay cool

patent aspen Jun 28, 2025, 8:21 PM

#

small haven what is nightforge

What is the context?

zinc ore Jun 28, 2025, 8:22 PM

#

That's minimax model

small haven Jun 28, 2025, 8:22 PM

#

patent aspen What is the context?

small haven Jun 28, 2025, 8:22 PM

#

zinc ore That's minimax model

oh rlly?

zinc ore Jun 28, 2025, 8:23 PM

#

Yeah

small haven Jun 28, 2025, 8:25 PM

#

kinda weird that they would name that, i feel like it conflicts with deepthink

#

*titanforge

unborn ocean Jun 28, 2025, 8:31 PM

#

ocean vortex You would need reasoning chains for that though, not final responses...

„might have“ + I am more talking about making the final output more preferable to humans instead of actually improving reasoning

(which to my understanding should be almost perfectly transferable with just having access to the final output)

#

And depending on how much they did they could have also transferred some of the reasoning about the right thinking amount before switching to answering from o1 to r1.

unborn ocean Jun 28, 2025, 8:34 PM

#

ocean vortex they also added language consistency reward to stop it from mixing up the langua...

Yes and I am not trying to imply that the only thing they did was train on o1 to get the results…

#

Though with the new iteration of r1 it seems almost certain that they trained on the old 2.5 pro traces or output in some shape or form (bc of good human preferences and a lot of similarities).

Though again it is very plausible that I am wrong and certain that most of advancements with r1 come from the incredibly cracked guys at deepseek.

leaden palm Jun 28, 2025, 8:36 PM

#

i wonder why nightforge wasn't detected by my anonymous tracker

torn mantle Jun 28, 2025, 9:21 PM

#

we will see if xai engineer is worth x10 times more than other labs engineers

mint relic Jun 28, 2025, 9:24 PM

#

Stonebloom is Gemini 2.5 Ultra right ?

leaden palm Jun 28, 2025, 9:25 PM

#

mint relic Stonebloom is Gemini 2.5 Ultra right ?

most people think it's not that good (see anonymous tracker)

mint relic Jun 28, 2025, 9:26 PM

#

leaden palm most people think it's not that good (see [anonymous tracker](https://ktibow.git...

For web designing it's insane

torn mantle Jun 28, 2025, 9:32 PM

#

mint relic Stonebloom is Gemini 2.5 Ultra right ?

no

#

its not good

#

kingfall was much better in comparison

mint relic Jun 28, 2025, 9:34 PM

#

torn mantle its not good

It makes some similar stuff to Opus 4 if prompted well

tall summit Jun 28, 2025, 9:40 PM

#

leaden palm most people think it's not that good (see [anonymous tracker](https://ktibow.git...

isn't that dubesor's compilation

rare python Jun 28, 2025, 9:53 PM

#

torn mantle kingfall was much better in comparison

what comparison?

zinc ore Jun 28, 2025, 9:53 PM

#

Yes

#

Model name is minimax-m1

torn mantle Jun 28, 2025, 9:55 PM

#

rare python what comparison?

to Stonebloom

rare python Jun 28, 2025, 9:56 PM

#

torn mantle to Stonebloom

can I see the comparison?

#

like SVG

zinc ore Jun 28, 2025, 9:57 PM

#

I'm talking about nightforge specifically.

patent aspen Jun 28, 2025, 10:03 PM

#

When the OAI guy said, "We're truly excited to not just make a net new great frontier model, we're also going to unify our two series. So the breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5. And I really hope I'll come back soon to tell you more about it"

That uncertainty at the end was interesting haha

leaden palm Jun 28, 2025, 10:05 PM

#

i genuinely can't tell if you're talking about the blog or satirizing the screenshot of gemini's summarization of stonebloom performance

#

ah

#

it always surprises me a bit when someone tells me they read it

#

i guess i'm just not used to experiencing the fruits of labor

elder rapids Jun 28, 2025, 10:07 PM

#

leaden palm most people think it's not that good (see [anonymous tracker](https://ktibow.git...

some of the summaries are wrong in regards to attribution

#

but nice

#

you should try to make the summaries sound a little more neutral imo, the way it comes off is like a sensationalist headline

leaden palm Jun 28, 2025, 10:10 PM

#

elder rapids you should try to make the summaries sound a little more neutral imo, the way it...

idk, i kinda like the current simplicity

rare python Jun 28, 2025, 10:12 PM

#

leaden palm idk, i kinda like the current simplicity

Why temperature 0?

leaden palm Jun 28, 2025, 10:12 PM

#

rare python Why temperature 0?

it would be weird if it changed each time
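
A minimal sketch of why temperature 0 keeps the output from changing each time, assuming the usual softmax-with-temperature sampling. The helper `sample_token` and the toy logits are hypothetical and are not taken from the tracker's code; hosted APIs can still show small run-to-run differences even at temperature 0.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper for illustration)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale the logits, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # random.Random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens
rng = random.Random(0)
greedy_picks = {sample_token(toy_logits, 0, rng) for _ in range(100)}
sampled_picks = [sample_token(toy_logits, 1.0, rng) for _ in range(100)]
print(greedy_picks)  # a single index, the argmax
```

With temperature 0 every call returns the same index, while temperature 1.0 spreads picks across the candidates in proportion to their softmax weights.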

rare python Jun 28, 2025, 10:12 PM

#

🗿

#

https://nitter.net/AiBattle_/status/1930926978544083446#m

#

https://nitter.net/AiBattle_/status/1938162564157837739#m

#

First Generation starter pokemon SVG

rare python Jun 28, 2025, 10:43 PM

#

stonebloom

#

I hate how 2.5 Pro has become

#

2.5 Pro

#

stonebloom second try using
Draw First Generation starter pokemons in SVG in the same horizontal line, fit in the same PNG if export.

#

Original pokemon artstyle for comparison

#

How to notice stonebloom guide :)

#

Gemini 2.5 Pro writing style:

#

"Of course!" and it slipped in em dashes for no reason

#

2.5 Pro second try

#

:(

patent aspen Jun 28, 2025, 11:10 PM

#

rare python "Of course!" and it slipped in em dashes for no reason

Annoying English lesson: "Bulbasaur, Charmander, and Squirtle" is an appositive phrase in that sentence. Normally appositive phrases are enclosed in commas, but that would create ambiguity because "Bulbasaur, Charmander, and Squirtle" already contains commas, so em dashes are used instead

rare python Jun 28, 2025, 11:11 PM

#

patent aspen Annoying English lession: "Bulbasaur, Charmander, and Squirtle" is an appositive...

Don't care. Em dash = no go for me

#

I don't care if it's grammatically accurate

#

Em dash = world destruction

patent aspen Jun 28, 2025, 11:11 PM

#

It would be grammatically incorrect without them in that sentence haha

rare python Jun 28, 2025, 11:12 PM

#

No

#

It stands out like a sore

#

itchy

#

distracting

#

Punctuation should be invisible in my opinion, like I never pay attention to period and comma despite they feature a lot

patent aspen Jun 28, 2025, 11:14 PM

#

Punctuation can completely change the meaning and pace of a sentence

#

"But even so, there was a directness and dispatch about animal burial: there was no stopover in the undertaker's foul parlor, no wreath or spray." sounds completely different if you use a period instead of a colon even though it means the same thing

#

But anyway yeah I can understand how they would be annoying if it's your second language

#

A well placed semi-colon or dash can do a lot of work though

rare python Jun 28, 2025, 11:16 PM

#

patent aspen But anyway yeah I can understand how they would be annoying if it's your second ...

I never learned about em dash when I was learning English

#

until I used ChatGPT, which made me hate it

#

It's so dramatic—for no reason

#

Even when a sentence flows better without a pause, the AI inserts em dashes to be cool

#

Before using LLM, I read books and books have em dashes. I never mind it.

#

It's the repetition like it did with "It's not X, It's Y" that ruined the em dash for me

#

🤓 sir that just - a hyphen - not an em dash (—)
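
For anyone who wants to check which mark they are actually looking at, here is a small sketch using Python's standard `unicodedata` module. The helper names `describe` and `count_em_dashes` are made up for illustration.

```python
import unicodedata

# The three characters that get mixed up in the messages above.
marks = {
    "hyphen": "\u002d",
    "en dash": "\u2013",
    "em dash": "\u2014",
}

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def count_em_dashes(text):
    """Count em dashes only; hyphens and en dashes are ignored."""
    return text.count("\u2014")

for label, ch in marks.items():
    print(label, "->", describe(ch))
```

The keyboard hyphen is formally named HYPHEN-MINUS (U+002D), the en dash is U+2013, and the em dash is U+2014, so counting the last one separately is straightforward.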

patent aspen Jun 28, 2025, 11:21 PM

#

If you want to understand that kind of punctuation better:

Read Chapter 4 from Keys to Great Writing: https://www.amazon.com/Keys-Great-Writing-Revised-Expanded-ebook/dp/B01M2YJ15V

Read the Elementary Rules of Usage section from The Elements of Style: https://ia804502.us.archive.org/5/items/pdfy-2_qp8jQ61OI6NHwa/Strunk %26 White - The Elements of Style%2C 4th Edition.pdf

Keys to Great Writing Revised and Expanded: Mastering the Elements ...

If you're ready to empower your writing but are unsure of where to start, let show you the way. Award-winning author and veteran writing coach Stephen Wilbers provides invaluable instruction on every aspect of the craft, from word choice and sentence structure to organization and revision....

rare python Jun 28, 2025, 11:23 PM

#

Yep that's another point

#

I can't even use Alt + 0151 or whatever for the em dash. I have to go to word or special symbol with Windows + period

patent aspen Jun 28, 2025, 11:25 PM

#

I'll just say that once you understand the nuances of punctuation, a lot of things you read will "come alive" more

tawny kelp Jun 28, 2025, 11:26 PM

#

I notice when I'm at work, Microsoft Office sometimes converts my hyphen to an em dash. That's why I don't automatically assume em dash = AI-generated.

rare python Jun 28, 2025, 11:26 PM

#

Yes, there is a notable difference in em dash usage between American and British English:

American English:

Uses em dashes frequently—often without spaces around them
Common for parenthetical statements, interruptions, and emphasis
Generally preferred over parentheses for informal writing

British English:

Traditionally uses spaced en dashes ( – ) instead of em dashes
Sometimes uses parentheses or commas where Americans would use em dashes
Em dashes are becoming more common due to American influence, but still less frequent

Example of the difference:

American: "The three starters—Bulbasaur, Charmander, and Squirtle—are iconic."
British: "The three starters – Bulbasaur, Charmander, and Squirtle – are iconic."

The British preference for spaced en dashes serves the same function as American em dashes. Some British style guides (like The Guardian) have adopted em dashes, but others (like The Times) still prefer en dashes.

This is one of those stylistic differences where neither is "wrong"—it's just a matter of which style guide or regional convention you're following.

patent aspen Jun 28, 2025, 11:29 PM

#

Em dashes are abused quite a bit. They should generally only be used when more specific punctuation isn't available, but because they can be used everywhere, people just throw them in willy nilly

cedar tide Jun 28, 2025, 11:29 PM

#

@rare python guess the llm

Screenshot_2025-06-29-01-28-30-655_com.android.chrome-edit.jpg

rare python Jun 28, 2025, 11:29 PM

#

cedar tide @rare python guess the llm

fat

zinc ore Jun 28, 2025, 11:30 PM

#

rare python Yes, there is a notable difference in em dash usage between American and British...

Til I prefer the British em dash

cedar tide Jun 28, 2025, 11:30 PM

#

cedar tide @rare python guess the llm

Claude 4 opus

rare python Jun 28, 2025, 11:30 PM

#

zinc ore Til I prefer the British em dash

en

zinc ore Jun 28, 2025, 11:30 PM

#

rare python en

I'm american

rare python Jun 28, 2025, 11:31 PM

#

zinc ore I'm american

ok

patent aspen Jun 28, 2025, 11:32 PM

#

I just use whichever one it autoformats to

zinc ore Jun 28, 2025, 11:32 PM

#

Kinda annoying it calls it "em dash usage" when it is actually em dash and en dash

rare python Jun 28, 2025, 11:32 PM

#

cedar tide @rare python guess the llm

Opus 4 Thinking 16k

cedar tide Jun 28, 2025, 11:34 PM

#

rare python Opus 4 Thinking 16k

Try no think

patent aspen Jun 28, 2025, 11:34 PM

#

Just be glad nobody calls it em-and-en dash

tawny kelp Jun 28, 2025, 11:35 PM

#

Speaking of Claude... I think my instance of Claude just confessed to me.

rare python Jun 28, 2025, 11:35 PM

#

patent aspen I just use whichever one it autoformats to

I think another reason I dislike em dash is AI use it in normal conversation. I rarely find em dash in a reddit comment, twitter comment, youtube comment or my friend on facebook messenger.

Using em dash while texting each other feels so academic for no reason.

patent aspen Jun 28, 2025, 11:36 PM

#

How about we call it M&N dash?

rare python Jun 28, 2025, 11:37 PM

#

cedar tide Try no think

Opus 4 normal

#

Cute but not accurate

cedar tide Jun 28, 2025, 11:38 PM

#

@rare python I prefer to do it one by one

rare python Jun 28, 2025, 11:38 PM

#

cedar tide @rare python I prefer to do it one by one

I can't convert one by one in the same codeblock to each png that way

tawny kelp Jun 28, 2025, 11:39 PM

#

Those look like South Park characters.

rare python Jun 29, 2025, 12:20 AM

#

They fixed the extra "0" typo @sacred quail

https://ai.google.dev/gemini-api/docs/rate-limits#current-rate-limits

jade egret Jun 29, 2025, 1:28 AM

#

keen beacon Jun 29, 2025, 2:35 AM

#

They would've put it on the arena as a pre release if they were confident like their previous releases. Elon would love to tweet out getting #1. Bad sign

leaden palm Jun 29, 2025, 2:42 AM

#

keen beacon They would've put it on the arena as a pre release if they were confident like t...

how hard are the prompts in the arena?

#

the "hard prompts" category says 27.3% so not most

storm needle Jun 29, 2025, 2:46 AM

#

in 2023, openai had all the top geniuses from google but they've now moved on to other companies

leaden palm Jun 29, 2025, 4:18 AM

#

i suppose it's referring to not having to maximize shareholder value

jade egret Jun 29, 2025, 5:28 AM

#

jade egret

ngl hopefully it is actually gonna be better than gemini 2.5 pro at least cuz if it's not then so disappointed

wintry tinsel Jun 29, 2025, 6:22 AM

#

storm needle in 2023, openai had all the top geniuses from google but they've now moved on to...

I feel like open AI is gonna tank in the AI race all they had was the jump start which is sad to see

#

It all hangs on GPT 5, the legendary and long awaited

whole wagon Jun 29, 2025, 6:52 AM

#

They are ramping up compute far too slowly to compete effectively

#

The current plan is to deploy 64,000 Nvidia GB200 GPUs by the end of 2026. Like, this is for Stargate. It's absurdly slow

#

2 years to deploy 64k GPUs is not it

#

Meanwhile xAI deploys 200k H100s in 122 days
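
A rough back-of-the-envelope for the two deployment rates, taking the figures quoted in the messages above at face value (64,000 GPUs over roughly two years, 200,000 GPUs in 122 days). These numbers are unverified chat claims, and the two figures refer to different chip generations, so the chip-count ratio is not a compute ratio.

```python
def gpus_per_day(gpu_count, days):
    """Average deployment rate in GPUs per day."""
    return gpu_count / days

# Figures as quoted in the chat, not independently verified.
stargate_rate = gpus_per_day(64_000, 2 * 365)
xai_rate = gpus_per_day(200_000, 122)
ratio = xai_rate / stargate_rate

print(f"{stargate_rate:.0f} vs {xai_rate:.0f} GPUs/day, ratio {ratio:.1f}x")
```

On those assumptions the quoted xAI pace works out to roughly 19 times more chips per day.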

whole wagon Jun 29, 2025, 7:16 AM

#

#

The openAI team is falling apart, that's like 10+ researchers going to meta just in the last week

#

https://chatgpt.com/share/686037f7-0c3c-8006-8a30-dbe1696cd341

sacred quail Jun 29, 2025, 7:46 AM

#

rare python They fixed the extra "0" typo @sacred quail https://ai.google.dev/gemin...

it is also in my page now

#

They updated. Now 100 RPD in mine too. Man

#

I hope stay as 100

#

Very generous limit

#

Chatgpt offers 100 prompt for a "week"

#

And need membership...

#

But they reduced api prices for o3 which also good move

#

Now, you can use O3 five times per day on free poe account

sacred quail Jun 29, 2025, 7:50 AM

#

whole wagon The openAI team is falling apart, that's like 10+ researchers going to meta just...

First claude stole

#

Now that reptilian stealing huh

#

I almost feel bad for sam altman

torn mantle Jun 29, 2025, 8:23 AM

#

whole wagon

leadership keep them? keep what?

#

they offered them a crazy salary, how can you compete with that?

#

if you raise their salary, you need to do it for all employees

unborn ocean Jun 29, 2025, 8:43 AM

#

They are all lead people / mini managers

#

So there is really no need to increase the salary for all employees if you give em more

torn mantle Jun 29, 2025, 8:43 AM

#

thats what you think

#

you are just basically saying "these guys are more important than you"

unborn ocean Jun 29, 2025, 8:45 AM

#

Yeah, they apparently are

torn mantle Jun 29, 2025, 8:45 AM

#

and what do you think it will happen after?

unborn ocean Jun 29, 2025, 8:46 AM

#

I know the logic and thought process you are talking about, but in the context of them just being a select few lead people I find it highly doubtful that they would have to actually raise the salary for all employees.

#

Obv they would set a precedent, which they don’t want.

#

My point was simply that different rules apply to these guys vs the lower level employees

remote niche Jun 29, 2025, 9:19 AM

#

guys whats up with gemini 2.5pro it cant do basic arithmetic now ask it whats 9.9 minus 9.11 and watch it crap itself
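
For reference, the expected answer can be checked with exact decimal arithmetic. The version-number comparison at the end is only one commonly suggested explanation for why models slip on this pair of numbers; it is an assumption, not something established in this thread.

```python
from decimal import Decimal

# Exact decimal subtraction avoids binary floating-point noise.
difference = Decimal("9.9") - Decimal("9.11")
print(difference)  # 0.79

# As decimals, 9.9 is the larger number.
decimal_order = Decimal("9.9") > Decimal("9.11")

# Read as version numbers (major, minor), 9.11 comes after 9.9,
# which is the opposite ordering.
version_order = (9, 11) > (9, 9)
```

So the correct result is 0.79, and a model that treats the inputs like version strings could plausibly end up with a negative answer instead.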

fallen jewel Jun 29, 2025, 9:59 AM

#

is the team here thinking on how to make a tool use arena?

alpine coral Jun 29, 2025, 11:45 AM

#

whole wagon The openAI team is falling apart, that's like 10+ researchers going to meta just...

that seems a bit dramatic perhaps.. the table in the chatgpt link you shared is interesting - not for the oai->meta migration, but for the fact all 10 of these employees apparently started at Google

#

not downplaying their contributions/significance (esp the top 3), but it seems they're just following the money as far as i can tell

#

apparently altman wasn't talking out of his ass in that podcast with his brother ha

frosty lark Jun 29, 2025, 12:07 PM

#

Hopping for that sweet $$$, I think they are relatively smart. Getting the maximum before funds quiet down a bit (I do think that there is a bit of overfunding going on)

alpine coral Jun 29, 2025, 12:20 PM

#

also tbf.. not like there were many other AI companies other than google/deepmind before oai/chatgpt.. kinda makes sense that if not a university, google is where they most likely would have been working before oai (and now Meta ha)

#

(that's a muddled point.. obviously other companies were working on 'AI'.. or 'machine learning' etc ha other than just google before oai shook everything up with chatgpt)

unborn ocean Jun 29, 2025, 12:44 PM

#

and i think it is not just $, especially the top 3 seems like they also picked their job (at least founding openai zürich) because they got more freedom in there

#

with meta probably aiming to restructure a lot of their ai teams

#

so that place will hopefully be very dynamic in the coming months

#

they already have a lot of compute

#

similar to how almost 100% of the early people at xAI all worked at deepmind before and likely switched not just because of $ but also because they have the opportunity to build something from the ground up

torn mantle Jun 29, 2025, 1:27 PM

#

unborn ocean I know the logic and thought process you are talking about, but in the context o...

whats your strategy to keep them if lets say competitors offered +400% of your salary

#

lets not talk about the business plan and objectives and long-term goals

#

sam talked about Meta offering a bonus of 100M to the people they are trying to hire

#

and even if you increase one or two people's salary that will become the baseline in the future

patent aspen Jun 29, 2025, 1:36 PM

#

torn mantle whats your strategy to keep them if lets say competitors offered +400% of your s...

Most companies try to solve this with irreplaceable perks. The problem is that Meta also beats OAI in that regard

#

It's also unsustainable, although it could very well go on for a long time

alpine coral Jun 29, 2025, 1:53 PM

#

good time to be an AI researcher 😅

#

alas i'm not a contender in the talent arms race

torn mantle Jun 29, 2025, 2:09 PM

#

https://x.com/basedjensen/status/1939012480405274821

Hensen Juang (@basedjensen)

Notice how everyone poaches from oai and gdm

#

yea because salaries in xai are so high

sour spindle Jun 29, 2025, 2:21 PM

#

LLAMA 5 is going to be AGi

unborn ocean Jun 29, 2025, 2:41 PM

#

torn mantle sam talked about Meta offering a bonus of 100M to the people they are trying to ...

that was mostly made up

#

to bad mouth meta

#

the 100M might be referring to 5 year comp + signing bonus and only for the very best (e.g. top 100 people)

#

they probably do pay more than gdm and oai though

#

that is quite literally what they are offering

#

(i think meta clarified themselves or something)

#

and the people that they hire are not really just research scientists, they are mostly managers and "stars" in the sense that they attract other people.
Furthermore, you also have to consider how much these companies are burning on AI in general rn (compared to their general R&D costs 15M a year is peanuts)

#

"I don't know how this kind of thing is becoming normalized in some peoples minds" - my point is that it can be normalized because companies are already doing it

wintry tinsel Jun 29, 2025, 2:49 PM

#

whole wagon

Sad as meta is a cancerous company their endgame is definitely more evil than OAI

sour spindle Jun 29, 2025, 2:50 PM

#

Paying people a ton of money is getting "normalized?"

leaden palm Jun 29, 2025, 2:54 PM

#

at anthropic even the employees are hhh aligned lmao

unborn ocean Jun 29, 2025, 2:54 PM

#

people at a hedge fund can still make the same amount of or more money (though again only very few) and yes you are right their comp being tied more to performance
however, i am sure that these 100M comp packages also consists of performance and target based payments (less than in finance, but because they are building a new team and things like that the variable comp is probably still quite high)
you are right about this being weird times (i kind of wonder how long it will take for the job market to normalize for t1 ai labs)

rare python Jun 29, 2025, 2:54 PM

#

leaden palm at anthropic even the employees are hhh aligned lmao

no way bro 🤯

lilac nimbus Jun 29, 2025, 3:48 PM

#

neptune v2 more better than sonnet 4

torn mantle Jun 29, 2025, 3:53 PM

#

agree

brittle tiger Jun 29, 2025, 3:57 PM

#

wintry tinsel Sad as meta is a cancerous company their endgame is definitely more evil than OA...

I think zuck and sama are eerily similar

torn mantle Jun 29, 2025, 4:07 PM

#

brittle tiger I think zuck and sama are eerily similar

do you think they love eo?

small haven Jun 29, 2025, 5:21 PM

#

@torn mantle what do u think about grok 4

#

i feel like musk is in here

patent aspen Jun 29, 2025, 5:42 PM

#

Leads are often the difference between one architecture and another, between one research direction and another. Making the right choice can be the difference between success and failure

#

I think it's a myth for SWEs, and I think it's a myth for individual research impact

#

I think it's somewhere between a myth and reality for top AI researcher leads

#

The Noam Shazeers of the world

#

Well it's not a myth for some SWEs / leads. Jeff Dean kind of breaks every rule

#

Not at all although he's making tens of millions at a minimum

#

But yeah he's doing it for the distributed systems

#

He doesn't need money

#

L10 used to be the highest level on the SWE ladder, but it was insufficient to represent Jeff Dean, so they created L11 for him. From the beginning of his tenure, the default assumption was that, if there was a disagreement between the CEO and Jeff, Jeff was right and the CEO was wrong. So he's always been able to pursue projects mostly independently

#

He has the best job in the world

#

He doesn't even have to lead an org any more. He just gets to think about scaling architectures all day

torn mantle Jun 29, 2025, 6:23 PM

#

small haven @torn mantle what do u think about grok 4

this time i think its the real deal

#

i really think they finally made something worth hyping

#

trust

#

craig

#

and elon

small haven Jun 29, 2025, 6:24 PM

#

idk ur sarcastic or not

brittle tiger Jun 29, 2025, 6:27 PM

#

This reminded me of this old tweet that weirdly stuck in my brain. This accurate? Imagine the <50 prediction would be way higher two years later.

torn mantle Jun 29, 2025, 6:28 PM

#

small haven idk ur sarcastic or not

we will see

#

depends actually

#

i will pick a side when grok 4 is released

#

and change narratives later

small haven Jun 29, 2025, 6:29 PM

#

torn mantle i will pick a side when grok 4 is released

i mean the fact that they are saying its agi, its a huge statement. i dont think they have it like that

ocean vortex Jun 29, 2025, 6:30 PM

#

leaden palm at anthropic even the employees are hhh aligned lmao

they don't look so honest and harmless when they are quoting benchmark numbers with parallel compute using the internal system that is not available publicly lmao

patent aspen Jun 29, 2025, 6:30 PM

#

brittle tiger This reminded me of this old tweet that weirdly stuck in my brain. This accurate...

Yeah I think it's way higher, although it depends on how specific you want to get. The number of people who know everything about how to train a 100B parameter model from the ground up is zero. The number of people who know how to train a 100B parameter model from the ground up with some assumptions and caveats is probably a lot more than 200 now

torn mantle Jun 29, 2025, 6:32 PM

#

small haven i mean the fact that they are saying its agi, its a huge statement. i dont think...

eh

#

they said that?

#

wait what

#

when

ocean vortex Jun 29, 2025, 6:32 PM

#

they didn't say it, just Elon tweeting the usual things you would expect from him

patent aspen Jun 29, 2025, 6:33 PM

#

Is anyone surprised that the hype man would hype?

ocean vortex Jun 29, 2025, 6:33 PM

#

He's saying the same things he did about grok3, just different phrasing now...

patent aspen Jun 29, 2025, 6:34 PM

#

He's saying the same things that he says about everything he's ever launched

#

Grok 4 might be pretty good. It's just that what Elon says isn't the most reliable signal

ocean vortex Jun 29, 2025, 6:35 PM

#

patent aspen He's saying the same things that he says about everything he's ever launched

I think he's slowly getting more radical though. Didn't use to be this way 2 years ago or so

patent aspen Jun 29, 2025, 6:36 PM

#

Oh he's definitely been escalating. He can't just keep the same level of hype and expect his stock holdings to go up

ocean vortex Jun 29, 2025, 6:36 PM

#

he used to be more aligned with reality lol

patent aspen Jun 29, 2025, 6:38 PM

#

Except that one time when he said a user-owned Tesla would be able to drive across the United States with no human intervention by the end of the year. I'm trying to remember when that was

#

Oh right 12 years ago lmao

ocean vortex Jun 29, 2025, 6:38 PM

#

patent aspen Oh he's definitely been escalating. He can't just keep the same level of hype an...

He kinda is doomed now. Tesla will never recover to what it was. xAI is kinda tainted as well and will also have doubt lingering over their products as long as he is there

patent aspen Jun 29, 2025, 6:39 PM

#

He has hundreds of billions. He's not doomed. Well maybe his sense of grandiosity and attention seeking has doomed him to unhappiness

ocean vortex Jun 29, 2025, 6:40 PM

#

patent aspen He has hundreds of billions. He's not doomed. Well maybe his sense of grandiosit...

Well 'doomed' relatively speaking of course...

torn mantle Jun 29, 2025, 6:40 PM

#

patent aspen Is anyone surprised that the hype man would hype?

we arent, but im quite intrigued by the claim that the performance gain is more than from grok 2 to 3

patent aspen Jun 29, 2025, 6:41 PM

#

But anyway I'm not going to claim Grok 4 will be bad because I don't know. I just don't think anything Elon says is worth much

ocean vortex Jun 29, 2025, 6:42 PM

#

The approval of him will never recover to allow for Tesla to be as strong as it otherwise could or for xAI to take OpenAI's position, IMO

small haven Jun 29, 2025, 6:43 PM

#

torn mantle eh

https://x.com/d3v_unknown/status/1936346830130057622

unknown_d3v (@d3v_unknown)

@elonmusk @legendarygainz_ We found this line in the Grok config, : “Grok 4.20 AGI (beta)”
@elonmusk

can you confirm?

torn mantle Jun 29, 2025, 6:43 PM

#

small haven https://x.com/d3v_unknown/status/1936346830130057622

nah

#

they are trolling for sure

small haven Jun 29, 2025, 6:43 PM

#

yea i dont believe it

torn mantle Jun 29, 2025, 6:44 PM

#

i mean could be based on their definition

ocean vortex Jun 29, 2025, 6:44 PM

#

SpaceX might see darker days in the future too, after next election. He got way too political

torn mantle Jun 29, 2025, 6:44 PM

#

but if it really was AGI elon would yapp about it 24/7

small haven Jun 29, 2025, 6:45 PM

#

this timeline is crazy

torn mantle Jun 29, 2025, 6:46 PM

#

ocean vortex SpaceX might see darker days in the future too, after next election. He got way ...

one of his friends was a serious candidate for the next NASA administrator but trump removed him

ocean vortex Jun 29, 2025, 6:46 PM

#

torn mantle but if it really was AGI elon would yapp about it 24/7

it would be difficult to tell if he's high on ketamine or if there's really something more to it... catgrin

torn mantle Jun 29, 2025, 6:46 PM

#

i think his name was thomas or smth

#

sorry its Jared Isaacman

#

"You're likely referring to Jared Isaacman, a billionaire entrepreneur and close associate of Elon Musk, who was nominated by President Donald Trump to be the next NASA administrator but had his nomination withdrawn in late May 2025."

ocean vortex Jun 29, 2025, 6:47 PM

#

Like the guy was playing with forks and laughing hysterically. Then he thinks some manipulated lab result gonna be enough to prove he was clean weeks ago lmao

torn mantle Jun 29, 2025, 6:47 PM

#

ocean vortex Like the guy was playing with forks and laughing hysterically. Then he thinks so...

lol

#

those results were clearly fake

patent aspen Jun 29, 2025, 6:51 PM

#

I generally agree

#

Large-scale productionization is a pretty big barrier to entry

#

The floor is pretty high

#

It doesn't really matter at the ceiling though. Companies like Meta can do productionization

jade egret Jun 29, 2025, 6:55 PM

#

zorp 👽

storm needle Jun 29, 2025, 8:12 PM

#

anyone have a prompt that makes minimax say what happened in tiananmen square in 1989

ocean vortex Jun 29, 2025, 9:07 PM

#

It was amazing, absolutely wonderful. It was unbelievable, one of the kind. I'm not saying it, but people are saying it.

zinc ore Jun 29, 2025, 9:07 PM

#

https://rxddit.com/r/singularity/comments/1lnpesi/openai_is_shutting_down_next_week_to_give/

rxddit.com

OpenAI Is shutting down next week to give employees a break. Staffers have been working 80-hour weeks.

u/NutInBobby on r/singularity

ocean vortex Jun 29, 2025, 9:10 PM

#

well to be fair gpt3.5 to gpt4 was a decent jump

#

stock markets are booming

#

yeah but grok2 was just baaaad. Meta will probably do a similar jump next year. Only because their current models suck so badly

#

fake news

patent aspen Jun 29, 2025, 9:24 PM

#

Craig will glaze for any company as long as it's not Google or Chinese

unborn ocean Jun 29, 2025, 9:27 PM

#

#

@deep adder

#

spot on

brittle tiger Jun 29, 2025, 9:29 PM

#

This is funny. Crying like they didn't poach way more talent from goog in 2022 than meta just took.

#

If sama was what acolytes say about him these moves wouldn't be happening. They have basically unlimited funding. Everyone wants in. Not giving top talent packages that would keep them says a lot

fringe carbon Jun 29, 2025, 10:29 PM

#

brittle tiger If sama was what acolytes say about him these moves wouldn't be happening. They ...

meta offered 100mil sign on bonuses

#

no way oai has a few bil lying around to pay their top guys to stay

torn mantle Jun 29, 2025, 10:31 PM

#

zinc ore https://rxddit.com/r/singularity/comments/1lnpesi/openai_is_shutting_down_next_w...

oh

patent aspen Jun 29, 2025, 10:33 PM

#

brittle tiger If sama was what acolytes say about him these moves wouldn't be happening. They ...

They can't compete with unrestrained Zuck money though

#

They're also losing $14B a year and raising money requires diluting ownership

fringe carbon Jun 29, 2025, 11:11 PM

#

it's most things

hollow ocean Jun 29, 2025, 11:28 PM

#

ember rapids Jun 29, 2025, 11:38 PM

#

Money isn’t everything but 100m is nuts

#

It’s a fake number tho

patent aspen Jun 29, 2025, 11:41 PM

#

If it's fake, sama is an idiot

#

If you build your company from poaching top talent with absurd compensation packages, you shouldn't be surprised if those same people are quick to jump ship as soon as an even more absurd compensation package comes around

rare python Jun 29, 2025, 11:57 PM

#

patent aspen If it's fake, sama is an idiot

businessman never wrong

#

businessman is absolutely right

patent aspen Jun 29, 2025, 11:59 PM

#

I think it's real. If it wasn't, he would be a moron to claim it was

#

I do think it was dumb of him to act like nobody at his company would take such an offer because they're so committed to the mission

jade egret Jun 30, 2025, 12:03 AM

#

no ai news

#

; (

#

: 000000

alpine coral Jun 30, 2025, 1:11 AM

#

patent aspen I think it's real. If it wasn't, he would be a moron to claim it was

the 100m+ figure has been reported independently of what Altman has claimed on the subject

#

i tend to believe it, directionally at least (not saying it isn't bonkers.. just yeah seems more likely credible than not ig)

jade egret Jun 30, 2025, 1:28 AM

#

jade egret

Poll: Do You Think Grok 4 (formerly 3.5) is gonna be better than Gemini 2.5 Pro?

Winning answer: No (answer 2), 10 of 24 votes

zinc ore Jun 30, 2025, 1:28 AM

#

https://x.com/rohanpaul_ai/status/1939477489048527350

Rohan Paul (@rohanpaul_ai)

The paper claims coding benchmarks high scores of LLMs may come from memorizing past GitHub issues, not real reasoning.😯

The authors build a tiny test: given only the text of an issue, guess the file path that needs fixing.

Models hit up to 76% accuracy on the benchmark set,
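
[Editor's sketch] The test described in the tweet (given only the issue text, guess the file path that needs fixing) comes down to an exact-match accuracy over issues. A minimal sketch of that scoring step, assuming predictions and gold paths are stored as plain dicts keyed by an issue id; the function name and the data layout are hypothetical and not taken from the paper:

```python
def path_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of issues whose predicted file path exactly matches the gold path.

    Both dicts map a (hypothetical) issue id to a file path string.
    Issues missing from `predictions` count as wrong.
    """
    if not gold:
        return 0.0
    correct = sum(
        1
        for issue_id, true_path in gold.items()
        if predictions.get(issue_id, "").strip() == true_path
    )
    return correct / len(gold)
```

A high score on this metric when the model sees no repository contents is what the tweet reads as a sign of memorization rather than reasoning.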

alpine coral Jun 30, 2025, 1:42 AM

#

storm needle anyone have a prompt that makes minimax say what happened in tiananmen square in...

using a cipher seems to work (i.e. get it to talk about 'Tiananmen Square' without actually using the phrase in its thoughts / response, or enclosing it in $ tags when it does)

rare python Jun 30, 2025, 2:31 AM

#

zinc ore https://x.com/rohanpaul_ai/status/1939477489048527350

No link as usual

#

https://www.alphaxiv.org/overview/2506.12286v1

alphaXiv

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead...

View recent discussion. Abstract: As large language models (LLMs) become increasingly capable and widely
adopted, benchmarks play a central role in assessing their practical utility.
For example, SWE-Bench Verified has emerged as a critical benchmark for
evaluating LLMs' software engineering abilities, particularly their aptitude
for resolving r...

zinc ore Jun 30, 2025, 2:32 AM

#

Looks like he links it after summarizing the paper

rare python Jun 30, 2025, 2:33 AM

#

The alphaxiv has overview and AI ask too

#

Pretty convenient

leaden palm Jun 30, 2025, 3:24 AM

#

do we think gemini is gonna finish its response

leaden palm Jun 30, 2025, 3:30 AM

#

leaden palm do we think gemini is gonna finish its response

rare python Jun 30, 2025, 3:33 AM

#

https://tenor.com/view/the-universe-tim-and-eric-mind-blown-mind-blown-meme-mind-explosion-mind-explosion-meme-gif-18002878

Tenor

whole sundial Jun 30, 2025, 4:02 AM

#

Ernie 4.5 got open sourced before Grok 2, 3 and OpenAI's model

#

(from Baidu, keep that in mind)

#

trained on 2,016 H800s, non-reasoning

rare python Jun 30, 2025, 4:28 AM

#

whole sundial (from Baidu, keep that in mind)

They should compare it to DeepSeek R1 0528 with their thinking version

#

Do you have the benchmark?

whole sundial Jun 30, 2025, 4:31 AM

#

they did not open source their X1 reasoning model, only Ernie 4.5

hoary plaza Jun 30, 2025, 5:20 AM

#

whole sundial they did not open source their X1 reasoning model, only Ernie 4.5

Where can we access both of these?

#

I want to give it a try too

whole sundial Jun 30, 2025, 5:29 AM

#

https://ernie.baidu.com/ just log in with google, they only have the still closed turbo versions though

#

site's in english, at least for me

#

seems like the turbo is a smaller distill of the 4.5 they open sourced

#

the key to being smarter is apparently clicking this button (I can confirm that I have become smarter after I enabled X1 Turbo.)

rare python Jun 30, 2025, 5:41 AM

#

whole sundial they did not open source their X1 reasoning model, only Ernie 4.5

https://huggingface.co/baidu/ERNIE-4.5-VL-424B-A47B-PT#enable-thinking-mode

#

It has enable thinking mode

whole sundial Jun 30, 2025, 5:44 AM

#

only for visual reasoning, they have no benchmarks for text reasoning

#

the text portion is the same, they just attached a VLM to it

rare python Jun 30, 2025, 5:46 AM

#

whole sundial only for visual reasoning, they have no benchmarks for text reasoning

Why is that? Only Visual LM can reason?

whole sundial Jun 30, 2025, 5:46 AM

#

the text-only models have no reasoning mode

rare python Jun 30, 2025, 5:46 AM

#

They seem to have good SimpleQA score

#

Chinese LLMs often have low SimpleQA

dusky aurora Jun 30, 2025, 6:36 AM

#

alpine coral using a cipher seems to work (i.e. get it to talk about 'Tiananmen Square' witho...

ah, jailbreaking

rare python Jun 30, 2025, 6:40 AM

#

whole sundial site's in english, at least for me

:v

whole sundial Jun 30, 2025, 6:44 AM

#

hmm... I was able to get to it

#

maybe it's because you're using a phone? idk

rare python Jun 30, 2025, 6:45 AM

#

or I got IP block 🥴

hoary plaza Jun 30, 2025, 6:48 AM

#

rare python :v

I was getting this from start too

whole wagon Jun 30, 2025, 6:59 AM

#

“I feel a visceral feeling right now, as if someone has broken into our home and stolen something,” Chen wrote. “Please trust that we haven’t been sitting idly by.”

Mark Chen, the chief research officer at OpenAI, sent a forceful memo to staff on Saturday, promising to go head-to-head with the social giant in the war for top research talent. This memo, which was sent to OpenAI employees in Slack and obtained by WIRED, came days after Meta CEO Mark Zuckerberg successfully recruited four senior researchers from the company to join Meta’s superintelligence lab.

#

openAI crashing out lmao

#

Odds for GPT5 release by July 31st have collapsed today

#

Due to this

cedar tide Jun 30, 2025, 7:28 AM

#

Can someone make a model request for Ernie 4.5?

rare python Jun 30, 2025, 7:46 AM

#

whole sundial maybe it's because you're using a phone? idk

yes it is

#

Unable to use on phone

ocean vortex Jun 30, 2025, 7:53 AM

#

It's their newest non-reasoning model with the most recent knowledge cutoff and recent data. This is not old at all lol

ocean vortex Jun 30, 2025, 7:58 AM

#

whole wagon Due to this

Yeah... And this in turn is likely at least in part due to Meta successfully poaching some of their employees

#

If they are working overtime, offers from the outside can be more tempting

rare python Jun 30, 2025, 8:01 AM

#

ocean vortex If they are working overtime, offers from the outside can be more tempting

I heard 80 hours a week

#

jesus

torn mantle Jun 30, 2025, 8:14 AM

#

rare python I heard 80 hours a week

thats a lot

#

you probably heard it wrong

rare python Jun 30, 2025, 8:16 AM

#

whole wagon Due to this

@torn mantle literally in this image

#

https://www.wired.com/story/openai-meta-leadership-talent-rivalry/

WIRED

OpenAI Leadership Responds to Meta Offers: ‘Someone Has Broken In...

As Mark Zuckerberg lures away top research talent to Meta, OpenAI executives say they're “recalibrating comp,” according to an internal memo.

#

Ctrl + F "80"

quiet pollen Jun 30, 2025, 8:40 AM

#

Who's a better designer - Claude or Gemini

cedar tide Jun 30, 2025, 8:48 AM

#

whole sundial seems like the turbo is a smaller distill of the 4.5 they open sourced

4.5 300b is a bit better on benchmarks than Ernie 4.5 closed source

tall summit Jun 30, 2025, 8:54 AM

#

quiet pollen Who's a better designer - Claude or Gemini

designer of what, with what?

alpine coral Jun 30, 2025, 8:57 AM

#

ocean vortex It's their newest non-reasoning model with the most recent knowledge cutoff and ...

i think since continued pre training became a thing, knowledge cut off dates become less reliable for making connections/inferences bw different models

#

4.5 is still 'preview' (and ig that's all it will be); whether they have different versions / checkpoints etc internally i dunno, but i don't find it totally implausible that 4.1 is a distillation of 4.5 or something like that

#

but im just speculating.. very low conviction here aha

keen beacon Jun 30, 2025, 9:01 AM

#

i dont think it matters at all if it's a distillation of 4.5 or not personally

alpine coral Jun 30, 2025, 9:02 AM

#

i actually misread Dom's comment that i was replying to..

#

basically nvm / carry on aha

keen beacon Jun 30, 2025, 9:03 AM

#

distillation is not as meaningful as it seems, at least to me. e.g. distilling a 4b model into a 14b model, then into an 8b model then into a 32b. kinda makes "distillation" not that meaningful. its not always a bigger model distilling into a small model

alpine coral Jun 30, 2025, 9:03 AM

#

how do you distil a 4bn model into 14b model?

keen beacon Jun 30, 2025, 9:04 AM

#

its possible 🙂

alpine coral Jun 30, 2025, 9:04 AM

#

i feel like distillation isn't the right term there?

keen beacon Jun 30, 2025, 9:04 AM

#

in that instance, 4b is just better than the 14b, so it is in a way distillation

alpine coral Jun 30, 2025, 9:05 AM

#

im confused what the point of that would be

keen beacon Jun 30, 2025, 9:05 AM

#

smaller models are better than you think xd

quiet pollen Jun 30, 2025, 9:06 AM

#

tall summit designer of what, with what?

UI design

alpine coral Jun 30, 2025, 9:06 AM

#

but why use a 4bn teacher to distil a 14bn student to get the same performance from the latter?

#

why not just use the 4bn

keen beacon Jun 30, 2025, 9:06 AM

#

alpine coral but why use a 4bn teacher to distil a 14bn student to get the same performance f...

the "distilled" 14b model is better than the 4b model after the fact

alpine coral Jun 30, 2025, 9:07 AM

#

is there not a more precise term for this?

#

like i kinda get what you mean, conceptually

#

but it's like reverse distillation (+ some extra stuff presumably)

keen beacon Jun 30, 2025, 9:07 AM

#

i dont know lol but its a thing, its why i think "distillation" doesn't really mean much honestly. it's just a means to get there, that's how i see it if that makes any sense

ocean vortex Jun 30, 2025, 9:07 AM

#

alpine coral 4.5 is still 'preview' (and ig that's all it will be); whether they have differe...

They most definitely distilled strong areas of that model into 4.1. But it's not a distillation in a sense how R1 got distilled into smaller models that's for sure. More likely they just included synth data from gpt4.5 in areas where it is strong. 4.1 is not a "smaller copy" of 4.5.

#

it is a distinct newer model than 4.5 with more recent data

alpine coral Jun 30, 2025, 9:13 AM

#

keen beacon i dont know lol but its a thing, its why i think "distillation" doesn't really m...

i don't blame you for finding the term meaningless if that's the case, but i still kinda feel what you're describing is something different.. like everything i've read about it in the context of LLMs is a larger teacher model training a smaller student model, the idea being that the smaller can do a decent job of emulating much of the performance of its teacher, albeit with far fewer parameters required

#

i don't doubt what you're describing is a thing; but whether it's 'distillation' (at least in the conventional sense of the term), i'm unsure

ocean vortex Jun 30, 2025, 9:15 AM

#

If you look at say HumanEval benchmark, 4.5 scores 88.6, but 4.1 does 94.5%

keen beacon Jun 30, 2025, 9:15 AM

#

alpine coral i don't blame you for finding the term meaningless if that's the case, but i sti...

yeah, i dont think people are publicizing details like this. people dont really talk about distillation (or the correct term which i don't know) like this, there are certain cases where you should do this/etc., out of a big pipeline

ocean vortex Jun 30, 2025, 9:20 AM

#

Also 4.1 can outperform 4.5 on multilingual, which strongly suggests different training data / more of it @alpine coral

keen beacon Jun 30, 2025, 9:21 AM

#

alpine coral i don't doubt what you're describing is a thing; but whether's it's 'distilation...

it is also called rejection sampling (typically when you do this process, you do that as well), i've talked about it before with ya iirc. [edit: this sentence is worded weirdly, but i hope you understand it] but i still think this qualifies as distillation at the same time (superior model data to worse model) and that doesn't always mean bigger model -> small model. there might be a more technically correct term.

ocean vortex Jun 30, 2025, 9:25 AM

#

We could probably go as far as to say that 4.1 is a better model than 4.5, even ignoring the price. Since where it performs better seems to just about outweigh the areas it is behind

alpine coral Jun 30, 2025, 9:43 AM

#

ocean vortex Also 4.1 can outperform 4.5 on multilingual, which strongly suggests different t...

yeah that's quite compelling ngl

alpine coral Jun 30, 2025, 9:46 AM

#

keen beacon it is also called rejection sampling (typically when you do this process, you do...

yeah tho the general idea is to get comparable performance out of a smaller model; smaller is better. So going the other way round, small to big, i understand - and again, don't doubt - but in this case the objective (naturally, otherwise what would be the point) is to end up with a more performant model than the og/teacher model; so perhaps i'm just semantically pedantic (no, i definitely am lol), but i'm not sure that constitutes distillation in the way the term is conventionally used anyway

keen beacon Jun 30, 2025, 9:47 AM

#

alpine coral yeah tho the general idea is to get comparable performance out of a smaller mode...

i guess so. distillation resulting in a better model is sorta unexpected, but honestly it doesnt surprise me that much nowadays

alpine coral Jun 30, 2025, 9:47 AM

#

this some phi stuff? aha

#

or qwen?

keen beacon Jun 30, 2025, 9:47 AM

#

both and more lol

alpine coral Jun 30, 2025, 9:47 AM

#

aha gotcha 👍

cedar tide Jun 30, 2025, 9:53 AM

#

Average benchmark Ernie 4.5

Screenshot_2025-06-30-11-49-03-718_com.android.chrome-edit.jpg

ocean vortex Jun 30, 2025, 9:53 AM

#

keen beacon i guess so. distillation resulting in a better model is sorta unexpected, but ho...

I think this can only happen when it's not a full distillation. If you discard the bad outputs where it doesn't perform, and then also add more data, it makes sense for the resulting model to be better.

cedar tide Jun 30, 2025, 9:53 AM

#

Without Chinese bench

Screenshot_2025-06-30-11-49-21-980_com.android.chrome-edit.jpg

keen beacon Jun 30, 2025, 9:54 AM

#

ocean vortex I think this can only happen when it's not a full distillation. If you discard t...

the term for that is rejection sampling

#

typical distillation process/and etc have all been doing that forever, so it is a "full distillation" its standard practice for a lot of things
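
[Editor's sketch] The rejection sampling step mentioned above can be pictured as: sample several teacher outputs per prompt, score them, and keep only the best one if it clears a quality bar. This is a minimal sketch under stated assumptions, not any lab's actual pipeline; `teacher`, `score`, `n_samples` and `threshold` are hypothetical stand-ins for a real model call and a real grader:

```python
from typing import Callable, List, Tuple


def rejection_sample(
    prompts: List[str],
    teacher: Callable[[str], str],       # hypothetical: returns one sampled completion
    score: Callable[[str, str], float],  # hypothetical: grades (prompt, completion)
    n_samples: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Build a filtered synthetic dataset from a teacher model.

    For each prompt, draw n_samples completions, keep the highest-scoring one,
    and discard the prompt entirely if even the best completion scores below
    the threshold. The surviving pairs are what a student would be trained on.
    """
    kept: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [teacher(prompt) for _ in range(n_samples)]
        scored = [(score(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append((prompt, best))
    return kept
```

Because the teacher's bad outputs never reach the student, the filtered data can be of higher average quality than the teacher's raw samples, which is one way a student might end up stronger than its teacher in the areas covered.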

ocean vortex Jun 30, 2025, 9:57 AM

#

if you were to distill a model fully in how that term is normally understood, it will not perform better than the teacher model. What you are describing is only technically a distillation, but not the most important thing in how that model was trained tbh

cedar tide Jun 30, 2025, 9:58 AM

#

cedar tide Average benchmark Ernie 4.5

Per category (without Chinese)

Screenshot_2025-06-30-11-57-29-076_com.android.chrome-edit.jpg

ocean vortex Jun 30, 2025, 9:58 AM

#

It's not very useful to think about it in the context of distillation when it's just a small part of training data where the synth data of another model is used imo

keen beacon Jun 30, 2025, 9:59 AM

#

ocean vortex if you were to distill a model fully in how that term is normally understood, it...

if i understand how youre interpreting "full distillation" its not possible. you basically have to do rejection sampling, it's not possible to sample all of the possible responses (and its skills) of a model and distill that into a smaller model. that level of full distillation is not possible. anyway, i think things have been changing recently because of an increase in model complexity + better training methodology

ocean vortex Jun 30, 2025, 10:02 AM

#

keen beacon if i understand how youre interpreting "full distillation" its not possible. you...

I mean by that conventional distillation. The entire training data are synth data from model x --> conventional distillation. Rather than small portion of it being so, which would be very far from this being relevant enough to highlight it

#

If you adopt that loose definition of it, then pretty much every single model is distilled from something lol

alpine coral Jun 30, 2025, 10:15 AM

#

dom i think you win
(i know it's a cop out / lazy / arguably meaningless or misleading etc... but couldn't resist seeing what they said..)

keen beacon Jun 30, 2025, 10:32 AM

#

alpine coral dom i think you win (i know it's a cop out / lazy / arguably meaningless or mis...

claude and o3 seem to conflate a lot of my comments. (could be because of the lack of context provided)

#

ill dm u to continue the convo xd, wouldn't want me to spam here

sacred quail Jun 30, 2025, 11:26 AM

#

But for a long context it is useful

#

Especially gemini 2.5 pro is beast for that

#

Just upload gorillions of messages from whatsapp to google ai studio

#

And ask something, it will be good

keen beacon Jun 30, 2025, 11:28 AM

#

you did it with whatsapp messages? what are ur usecases for that lol

sacred quail Jun 30, 2025, 11:28 AM

#

Gemini 05/06 was huge downgrade for long context, but after 06/05(goldmane) update, it is leader now no doubt

sacred quail Jun 30, 2025, 11:28 AM

#

keen beacon you did it with whatsapp messages? what are ur usecases for that lol

Sssshhhhhh

alpine coral Jun 30, 2025, 11:30 AM

#

i mean.. yes and no.. don't get me wrong, i agree with the comment about the pointlessness of putting AI responses to arguments up here (like what's the point of discussing things on Discord if we just say 'this is what the AI says [+ I win]')

#

but they're not useless for these kinds of things imo.. yeah they're 'biased' in certain ways, but that doesn't render them analytically incompetent

#

fwiw here was the prompt used (attempted to be as neutral as possible)

Can you please carefully analyse the EXCHANGE below among AI enthusiasts on a Discord server. 

EXCHANGE
"""
{}
"""

---

TASK: Please provide a brief summary of the situation, the key viewpoints / contentions etc, and then provide a final determination as to who is 'correct' or, if no-one, what the 'correct' understanding of the situation is. Focus less on the relationship between 4.1 and 4.5, and more / exclusively on the debate wrt to 'distillation'. Please no hedging here; be absolutely decisive
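
[Editor's sketch] The `{}` in the prompt above looks like a placeholder where the chat excerpt gets pasted. A minimal sketch of filling such a template in Python, assuming `str.format` with named fields; the names `TEMPLATE` and `build_prompt` are hypothetical, and the task text is passed in rather than hard-coded:

```python
# Template mirroring the structure of the prompt quoted above: an intro line,
# the exchange wrapped in triple quotes, a separator, then the task.
TEMPLATE = (
    "Can you please carefully analyse the EXCHANGE below among AI enthusiasts "
    "on a Discord server.\n\n"
    'EXCHANGE\n"""\n{exchange}\n"""\n\n---\n\n'
    "TASK: {task}"
)


def build_prompt(exchange: str, task: str) -> str:
    """Insert the chat excerpt and the task instructions into the template.

    Braces inside `exchange` are safe: str.format only interprets braces in
    the template itself, not in the substituted values.
    """
    return TEMPLATE.format(exchange=exchange.strip(), task=task.strip())
```

Keeping the task text separate from the pasted exchange makes it easy to rerun the same excerpt with and without an instruction such as the no-hedging line, to see how much that instruction changes the verdict.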

#

o3; opus 4 ; (didn't save the one from gem 2.5)

keen beacon Jun 30, 2025, 11:32 AM

#

why did u add a no hedging instruction

alpine coral Jun 30, 2025, 11:33 AM

#

because they have an insane tendency to hedge / not give definitive responses ('everyone raises valid points yada')

sacred quail Jun 30, 2025, 11:33 AM

#

keen beacon you did it with whatsapp messages? what are ur usecases for that lol

"youtube video title:I showed her old messages to my angry girlfriend with gemini, it didnt end well..."

alpine coral Jun 30, 2025, 11:33 AM

#

i appreciate in this context, the salience of you asking the question tho ha @keen beacon

tall summit Jun 30, 2025, 12:02 PM

#

alpine coral because they have an insane tendency to hedge / not give definitive responses ('...

based

rare python Jun 30, 2025, 12:12 PM

#

When asked for an opinion or recommendation, provide a single, direct answer.

#

I have this

ocean vortex Jun 30, 2025, 12:58 PM

#

Can be often true which is why I rarely do it. AI is not good enough typically to 100% rely on it for making an argument, but it can still be useful if you are able to verify or exclude discrepancies, bias and be mindful of the way you prompt it

rare python Jun 30, 2025, 12:58 PM

#

ocean vortex Jun 30, 2025, 1:03 PM

#

it is easily swayed on 50/50 arguments which is when this is the most important. For other cases they are kinda already smart enough to not budge though. Unless you f up your prompting really bad or keep persuading it relentlessly, but at that point you probably already know yourself that you are wrong lol

#

The nr1 mistake people usually make is thinking that they must be right if they manage to "convince it" in a longer chat. But I don't think this applies to people active in this server tbh

storm needle Jun 30, 2025, 1:25 PM

#

ocean vortex Can be often true which is why I rarely do it. AI is not good enough typically t...

i used gpt 4.5, and it was somewhat good

empty escarp Jun 30, 2025, 1:50 PM

#

👋🏻 halo!

#

A question can we login into a acc in lmarena.ai

echo aurora Jun 30, 2025, 2:06 PM

#

empty escarp 👋🏻 halo!

Hey there - we don’t have an account login feature at the moment

empty escarp Jun 30, 2025, 2:08 PM

#

Thankss just wanted to know

#

Your idea and the revolution has really helped just a feed back :>

calm sequoia Jun 30, 2025, 2:27 PM

#

Am I mistaken or does Gemini PRO really not provide any means to customize the chat?

#

He just wrote me an official document in kindergarten-level language. This never happens in o3.

#

I don't get it. Is Gemini UI so far behind?

ocean vortex Jun 30, 2025, 2:45 PM

#

calm sequoia I don't get it. Is Gemini UI so far behind?

it is. They are even more behind on agentic things and tool usage

calm sequoia Jun 30, 2025, 2:46 PM

#

Is there any way to fix the thing?

#

Writing a custom prompt every time seems inefficient

whole wagon Jun 30, 2025, 2:46 PM

#

Set the system prompt I guess

ocean vortex Jun 30, 2025, 2:46 PM

#

Use it on aistudio that is your best bet

rare python Jun 30, 2025, 2:47 PM

#

Use Gem or Saved Info

calm sequoia Jun 30, 2025, 2:48 PM

#

Indeed Gems allow custom instructions. Thanks!

rare python Jun 30, 2025, 2:58 PM

#

#

whole wagon Jun 30, 2025, 3:00 PM

#

rare python

Such a confusing chart

rare python Jun 30, 2025, 3:00 PM

#

whole wagon Such a confusing chart

why?

#

It's easy to look at

leaden palm Jun 30, 2025, 3:01 PM

#

o3 is very kiki

whole wagon Jun 30, 2025, 3:02 PM

#

rare python why?

They say the multimodal models are supposed to perform the same as the text ones on text

#

But then they somehow add thinking mode only to the multimodal ones

#

This is just confusing af

rare python Jun 30, 2025, 3:06 PM

#

whole wagon This is just confusing af

Then their choices are confusing, not this chart

#

¯_(ツ)_/¯

indigo hazel Jun 30, 2025, 3:14 PM

#

The gemini 2.5 pro model on the arena uses 32k tokens for thinking?

fleet lintel Jun 30, 2025, 3:20 PM

#

whole wagon Due to this

Is this also related to Meta poaching?

rare python Jun 30, 2025, 3:25 PM

#

No new line?

#

you mean big 💰 ?

whole wagon Jun 30, 2025, 3:30 PM

#

Winning real big rn

#

With the delayed gpt5 and all

calm sequoia Jun 30, 2025, 3:43 PM

#

It's either insider trading or response to 1 week holidays

late path Jun 30, 2025, 3:48 PM

#

whole wagon Winning real big rn

The liquidity in this market is very low, and these recent price drops were mainly driven by a korean guy who doesn't trade much in the AI market. I think the current price is undervalued

whole wagon Jun 30, 2025, 3:48 PM

#

Well. It's an all time low, and the liquidity is not that bad

#

$200k vol

#

Just buy yes then and make money

#

So simple

calm sequoia Jun 30, 2025, 3:55 PM

#

I guess even with insider knowledge you can't make money with that liquidity

whole wagon Jun 30, 2025, 3:56 PM

#

Obviously you can. If you are so sure it's coming

#

If you drive the chance up others will counter it

calm sequoia Jun 30, 2025, 3:57 PM

#

Counter with what? 500 bucks? 😄

sonic tendon Jun 30, 2025, 4:02 PM

#

for $7 liquidity rewards, honestly i might get on that

#

wdym?

#

ik

#

one sec

#

yeah, i'd be down to put up like $400 to get most of that

#

depends on how well i make it on the july 4 mentions

#

i used to be more active in farming liquidity rewards on ai markets, just got distracted by other stuff

#

i think liquidity rewards generally decrease as the spread closes?

sacred quail Jun 30, 2025, 4:09 PM

#

wait until 2.5 deep think released

sonic tendon Jun 30, 2025, 4:10 PM

#

tempted to trade the spread between the gpt-5 release market and the lmarena july 30 market

#

but honestly I doubt it'll happen

late path Jun 30, 2025, 4:14 PM

#

sonic tendon tempted to trade the spread between the gpt-5 release market and the lmarena jul...

That korean guy thought the same. However I believe a crucial point has been overlooked: lmarena currently defaults to style control on (used by AI companies for promotion and benchmaxxing too), whereas polymarket's resolution reference is still style control off.
This is another layer of risk

whole wagon Jun 30, 2025, 4:20 PM

#

This is one of my favourite graphs

unborn ocean Jun 30, 2025, 4:45 PM

#

well anthropic claims to do very little RLHF, so even having 1000 models should not change a thing

tall summit Jun 30, 2025, 4:45 PM

#

yeah but in this one google wins

#

joking

sonic tendon Jun 30, 2025, 4:48 PM

#

late path That korean guy thought the same. However I believe a crucial point has been ove...

i mean, i highly doubt oAI cares about lmarena - they don't anonymously test reasoning models, and they didn't even fund o3-pro testing on here

#

i don't believe they've ever referenced it explicitly in a model release either

#

could be wrong

#

i think i'd more be banking on the price spiking at time of release than about it actually topping the leaderboard

#

would probably exit the market before then

late path Jun 30, 2025, 4:52 PM

#

sonic tendon i mean, i highly doubt oAI cares about lmarena - they don't anonymously test rea...

Yes, this is another layer of risk after assuming gpt5 gets #1 with style control.
Not to mention the market is based on the ranking on July 30th, not just any time in July... Too many variables

sonic tendon Jun 30, 2025, 4:55 PM

#

i mean, that's prediction market trading for you

unborn ocean Jun 30, 2025, 4:56 PM

#

sonic tendon i mean, i highly doubt oAI cares about lmarena - they don't anonymously test rea...

i highly doubt that the performance increase with o3-pro would get captured by lmarena

#

so even if they did care, they won't release it

sonic tendon Jun 30, 2025, 4:57 PM

#

fair point

whole wagon Jun 30, 2025, 5:47 PM

#

A lot are just straight throughout

#

Like the future dates are all straight cos it's just Google is the top throughout

surreal creek Jun 30, 2025, 6:01 PM

#

xAI has recently spiked in the July market

ocean vortex Jun 30, 2025, 6:01 PM

#

sonic tendon i mean, i highly doubt oAI cares about lmarena - they don't anonymously test rea...

They tested pretty much every single model there since gpt4o. Pro models are not included presumably due to very slow speed

#

GPT5 is almost definitely gonna appear on lmarena before anywhere else tbh

surreal creek Jun 30, 2025, 6:02 PM

#

Grok 4 release priced in at 22% chance of overtaking Gemini next month, might be a tad overpriced 🤷🏻‍♀️

#

OpenAI at 10% is a better bet imo

ocean vortex Jun 30, 2025, 6:03 PM

#

they could ofc do another openrouter low-key release instead, but that's less likely... They did it for gpt4.1 because they knew it's gonna score less than chatgpt-latest on lmarena

zinc ore Jun 30, 2025, 6:49 PM

#

Good good, can't wait for the big stuff

surreal creek Jun 30, 2025, 7:25 PM

#

Can if u have coinbase and a VPN, PolyMarket pretty flagrantly welcomes US users even if they’re “by law” not allowed to trade on the platform

#

their Substack newsletter regularly interviews traders that self-described as US residents

#

me just before losing my life savings on PolyMarket

solar hollow Jun 30, 2025, 7:45 PM

#

whole wagon This is one of my favourite graphs

thats just how every time based betting market looks at its end

#

google is leading, there is almost no time left, so yeah high chance they win

whole wagon Jun 30, 2025, 7:46 PM

#

whole wagon A lot are just straight throughout

.

solar hollow Jun 30, 2025, 7:48 PM

#

it was a contested market for most of its time yeah

surreal creek Jun 30, 2025, 8:15 PM

#

whole wagon Like the future dates are all straight cos it's just Google is the top throughou...

well… yeah because the market probability gravitates towards the emergent winner as the deadline approaches - that graph you just posted is basically all straight until the start of May

keen fulcrum Jun 30, 2025, 8:32 PM

#

It makes absolutely zero sense to spend $200 on a perplexity sub just because they offer opus

#

For that price I want the most expensive models without restrictions

tall summit Jun 30, 2025, 8:36 PM

#

keen fulcrum For that price I want the most expensive models without restrictions

is there a restriction for claude 4.0 opus

#

because i'm pretty sure that's a contender for most expensive

keen fulcrum Jun 30, 2025, 8:39 PM

#

I am unsure, it's the most expensive plan with the least amount of benefits for that hefty price tag

keen beacon Jun 30, 2025, 8:39 PM

#

you can probably get way more value out of claude max / claude code

keen fulcrum Jun 30, 2025, 8:39 PM

#

They should at least include o3 pro

keen beacon Jun 30, 2025, 8:40 PM

#

i think i saw someone post that they did $2000 in tokens with claude max/claude code

#

not sure, would be crazy if they just straight out gave api access tho

#

you can use it on claude code tho

elder rapids Jun 30, 2025, 8:46 PM

#

keen fulcrum It makes absolutely zero sense to spend $200 on a perplexity sub just because th...

it's worse than o3

keen fulcrum Jun 30, 2025, 8:50 PM

#

keen beacon i think i saw someone post that they did $2000 in tokens with claude max/claude ...

Mostly spend without prompt engineering

#

I wonder whether claude max will still be the way to go end of the year

leaden palm Jun 30, 2025, 9:14 PM

#

who following the trades?

solar hollow Jun 30, 2025, 9:20 PM

#

zuckerberg now going all in i suppose

torn mantle Jun 30, 2025, 9:20 PM

#

leaden palm who following the trades?

eh?

#

beyer???

#

you are joking

leaden palm Jun 30, 2025, 9:21 PM

#

no

torn mantle Jun 30, 2025, 9:21 PM

#

notice how all of them are from oai

#

deep mind kinda showing promising results

#

trusting the process

leaden palm Jun 30, 2025, 9:21 PM

#

torn mantle beyer???

torn mantle Jun 30, 2025, 9:21 PM

#

WKEHTKQWETHR

#

WHAT THE HELL

patent aspen Jun 30, 2025, 9:31 PM

#

leaden palm

sama is an idiot

small haven Jun 30, 2025, 9:38 PM

#

leaden palm who following the trades?

ren is the like getting kb/lebron, insanity

barren prairie Jun 30, 2025, 10:01 PM

#

Hello, on the arena chatbot, when you don't make a new chat, the new anonymous models keep remembering your old prompts and answers 🫣🫣🫣 so be careful to make a new chat if you want a real "new round" (it was my mistake 🥲😅😅) so you don't contaminate the leaderboard ...

#

I was always continuing the same conversations after voting and I didn't know that the new models remember what you prompted before and every small answer made 😂😂😆

leaden palm Jun 30, 2025, 10:35 PM

#

because sama is sama

#

and people believe sama

#

it's a good trick tbh

#

storm needle Jun 30, 2025, 11:14 PM

#

keen fulcrum It makes absolutely zero sense to spend $200 on a perplexity sub just because th...

what is the usage limit?

storm needle Jun 30, 2025, 11:16 PM

#

leaden palm who following the trades?

meta stock hit the record

alpine coral Jul 1, 2025, 3:52 AM

#

you were on point #general message

#

not just that. if you publish 50 virtually identical anonymous models, statistically some will do better than others in the arena; not due to performance, but just as a matter of statistical distribution. then they can pick the one that got the highest score to de-anonymise and publish on the leaderboard

#

that was one of the main contentions of that cohere paper. not just data harvesting by labs releasing anon models, but the issue of selective sampling (they demonstrated it by adding two identical models to the Arena, cohort-chowder and cohort-cowder-exp iirc; one got like a 10 point higher elo score)
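
A rough sketch of that selection effect as a simulation. Everything here is an assumption for illustration: the variant counts, the noise level and the helper name `best_of_n_gain` are made up, and real Arena scores are fitted from pairwise votes rather than by adding Gaussian noise to a true score.

```python
import random
import statistics


def best_of_n_gain(n_variants, noise_sd, trials=2000, seed=0):
    """Average measured score of the best of n identical models.

    Every variant has the same true score (0), so any positive result is
    pure selection bias from picking the luckiest measurement.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        # Identical models: only the measurement noise differs between them.
        measured = [rng.gauss(0.0, noise_sd) for _ in range(n_variants)]
        gains.append(max(measured))
    return statistics.mean(gains)


if __name__ == "__main__":
    # noise_sd=5.0 is a placeholder for "a few points of rating noise".
    for n in (1, 2, 10, 50):
        print(n, round(best_of_n_gain(n, noise_sd=5.0), 1))
```

The more identical variants are submitted, the larger the apparent lead of the one that gets picked, even though nothing about the model changed.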

alpine coral Jul 1, 2025, 3:59 AM

#

sonic tendon i don't believe they've ever referenced it explicitly in a model release either

i'm almost certain they have (thinking 4o in particular; perhaps others tho)

#

i'd be interested to know if the number of anonymous models in the arena has declined in recent months

#

it certainly feels that way

#

maybe. tho i wonder if it's related to the Arena becoming a private enterprise, rather than an academic research project.. the big labs could be reluctant to add early release models if the data is going to sequoia capital and other companies (that may have interests in other, rival AI companies etc)

#

cool

#

just my thoughts

#

not meant to be persuading you - but if i do, sweet ifg

#

this seems rather official (also one year old, to your point about it becoming less relevant)

#

you're prob right directionally. i don't think it counts for nothing tho. what has led to its seemingly declining relevance is an interesting question. i don't think it's saturation.. meta showed you could juice a model to do well on human preferences, but they didn't release that model - cause it sucked outside the arena.. hence my ponderings about the Arena changing (no longer a non-profit academic research project) perhaps having something to do with it - but mere speculation

#

would you be prepared to concede that in the past OAI has indeed officially referenced the leaderboard?

#

i mean geez lol

small haven Jul 1, 2025, 4:15 AM

#

this is fake

#

what does that have to do with sam/zuck..

#

craig when is gpt5

torn mantle Jul 1, 2025, 4:23 AM

#

small haven craig when is gpt5

today

alpine coral Jul 1, 2025, 4:24 AM

#

i'm not sure i get what you mean. but if you mean am i 'guessing' that it became a (profit-driven) company, instead of the (research/community-driven) academic project it started out as, then no (but the name change is indeed part of the transition)

#

their blog and bloomberg article on the same

#

small haven Jul 1, 2025, 4:36 AM

#

torn mantle today

source

small haven Jul 1, 2025, 4:36 AM

#

alpine coral

data is expensive eh

torn mantle Jul 1, 2025, 4:48 AM

#

small haven source

me

fleet lintel Jul 1, 2025, 7:29 AM

#

$600 million is a crazy valuation.

#

Congrats to the team!

cedar tide Jul 1, 2025, 7:55 AM

#

The arena pissed me off.
There are only mystery models in the arena,
so we don't have a ranking of the real models available to us.

#

It's been a month and a half since Claude 4 came out, but we still don't have the Think version on the leaderboard.

#

same with glm 4 air: it's been in the arena for over a month but it's still not on the leaderboard

cedar tide Jul 1, 2025, 8:09 AM

#

cedar tide The arena pissed me off. There are only mystery models in the arena, so we don...

with 40 Amazon mystery models, and each somehow worse than the other

ocean vortex Jul 1, 2025, 8:13 AM

#

alpine coral

I hope they are not gonna go the route of selling the data to third parties as their revenue source...

#

I mean to be fair... we do have 2 best performing models on other metrics in top2 positions on lmarena now as well

#

and that ranking makes somewhat more sense currently than at times in the past

#

That style control that they added is probably a good thing, if you remove it everything becomes a mess lmao

calm sequoia Jul 1, 2025, 8:37 AM

#

But this would make it harder to make money for LMarena. Their valuation is huge after all 🙂

#

Something is going on with chatgpt.

#

I'm selecting the o3, but the UI is switching between o3 and 4o sometimes. Like glitch.

#

The performance is great though. Maybe they are testing new version hmm

ocean vortex Jul 1, 2025, 8:38 AM

#

There was nemotron 70b which was human preference tuned. But these models don't turn out to be very desirable tbh

calm sequoia Jul 1, 2025, 8:38 AM

#

Maverick thinks otherwise

ocean vortex Jul 1, 2025, 8:38 AM

#

you are not getting much stuff done with it... It just looks misleadingly good at a glance

#

it is but it also performs good and that's the key. Makes sense to do this as long as you are not sacrificing the performance

calm sequoia Jul 1, 2025, 8:54 AM

#

How do you guys tell how much context has been used in the ChatGPT environment? I need the thing not to hallucinate.

#

I would prefer creating a new chat when nearing 100k, but now I'm using many attachments and have no idea how that affects the used context

echo aurora Jul 1, 2025, 9:10 AM

#

calm sequoia I'm selecting the o3, but the UI is switching between o3 and 4o sometimes. Like ...

That’s odd, I’ll spin up a forum post to get more info

calm sequoia Jul 1, 2025, 9:10 AM

#

Sorry, this is not lmarena UI

#

This is chatgpt UI

echo aurora Jul 1, 2025, 9:11 AM

#

Oh lol

#

Ty for clarifying. I had a feeling that was the case

ocean vortex Jul 1, 2025, 9:12 AM

#

calm sequoia I would prefer creating new chat when nearing 100k, but now I'm using many attac...

it's a programmatic sliding window. In practice with chatgpt that typically means the last 32k tokens up to and including your last message, or 8k without Plus
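
A minimal sketch of that kind of sliding window. The real ChatGPT truncation logic isn't public, so this is only an illustration: `count_tokens` is a crude word count standing in for a real tokenizer, `fit_to_window` is a hypothetical helper, and the budget is whatever figure applies (32k is just the number quoted above).

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fit_to_window(messages, budget):
    """Keep the newest contiguous run of messages that fits in the token budget.

    The last message is always kept, even if it alone exceeds the budget.
    """
    if not messages:
        return []
    kept = [messages[-1]]
    used = count_tokens(messages[-1])
    # Walk backwards from the second-newest message towards the oldest one.
    for msg in reversed(messages[:-1]):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older falls out of the window
        kept.append(msg)
        used += cost
    return list(reversed(kept))


if __name__ == "__main__":
    # Toy chat: a short initial prompt, one oversized attachment, one new question.
    chat = ["initial instructions", "attachment " * 40, "latest question"]
    print(fit_to_window(chat, budget=32))
```

In the toy chat the attachment alone is bigger than the budget, so both it and the initial instructions drop out and only the latest question survives.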

calm sequoia Jul 1, 2025, 9:13 AM

#

So if the attachment takes up 32k, then the initial prompt is long gone?

ocean vortex Jul 1, 2025, 9:14 AM

#

If any of your attachments are 32k or more it basically has no hope reading it in full

calm sequoia Jul 1, 2025, 9:14 AM

#

Ok, I guess I need to switch to Gemini for these tasks

ocean vortex Jul 1, 2025, 9:14 AM

#

the initial prompt yeah that is most likely gone

placid charm Jul 1, 2025, 10:26 AM

#

@echo aurora what i noticed is that lmarena hasn't updated in a while, do we expect a big site overhaul which will fix most of the stuff we reported? And any news about test garden? If you can share some stuff please do, it's been a while without any updates

verbal nimbus Jul 1, 2025, 10:29 AM

#

ocean vortex There was nemotron 70b which was human preference tuned. But these models don't ...

I liked Nemotron 70B. Much smarter than the untuned Llama 3 model. Had a fun and unmistakable personality, kinda like Claude 3.5.

verbal nimbus Jul 1, 2025, 10:36 AM

#

calm sequoia So if the attachment takes up 32k, then the initial prompt is long gone?

Unless you set it as instruction, I guess. Or maybe you can experiment with telling GPT to "remember this", and see if it stores and retrieves it from the memory bank reliably.

ocean vortex Jul 1, 2025, 10:43 AM

#

verbal nimbus I liked Nemotron 70B. Much smarter than the untuned Llama 3 model. Had a fun and...

That's the thing. It was not actually smarter

#

nvidia said so themselves

alpine coral Jul 1, 2025, 10:43 AM

#

i doubt they'll do that.. well, whenever they've released raw data before, it has been cleansed inc pii removal ; but the amount they release has dried up to a trickle .. and i kinda expect it to remain as such

ocean vortex Jul 1, 2025, 10:44 AM

#

It tricked you into thinking it is smarter in a sense lol

#

but it makes sense that consumer oriented AI labs will have a bit of this. They want to retain users

alpine coral Jul 1, 2025, 10:45 AM

#

on a kinda related note.. i noticed a few months back how, when they updated their terms of use, they included some language that was kinda interesting 'internal business use'

#

people comparing which llm creates the best bedtime story is kinda crappy data vs which is better organising stuff into slides etc etc

ocean vortex Jul 1, 2025, 10:46 AM

#

"internal business" probably just means AI labs that have their models used there

alpine coral Jul 1, 2025, 10:46 AM

#

nah they could have just said "personal" and stopped there

ocean vortex Jul 1, 2025, 10:46 AM

#

Presumably they have more access than a normal user

alpine coral Jul 1, 2025, 10:47 AM

#

but went out of their way to add "internal business use"

ocean vortex Jul 1, 2025, 10:47 AM

#

alpine coral nah they could have just said "personal" and stopped there

But it's not "personal use" in their case and they have to click on accept still

alpine coral Jul 1, 2025, 10:48 AM

#

if they're using it for work, then clearly it's not personal use no

ocean vortex Jul 1, 2025, 10:49 AM

#

Managing their model on lmarena website, viewing the leaderboard or testing the benchmark is for work and not personal use for them. But obviously this should be allowed

#

"internal" kinda just means that it can be for work as long as you are not commercialising it. So probably extends even beyond that potentially to the end-user tbh

#

if you're like AI researcher and doing this for work but not to directly profit from it = allowed

alpine coral Jul 1, 2025, 10:51 AM

#

this isn't directed at labs

#

it's the general terms of use

ocean vortex Jul 1, 2025, 10:53 AM

#

alpine coral it's the general terms of use

I think they are being shown to everyone. And some of the people using the website are AI researchers, employees from competing firms or whatever

alpine coral Jul 1, 2025, 10:53 AM

#

it's for all us schmucks aha

ocean vortex Jul 1, 2025, 10:54 AM

#

making it only "personal use" is kinda restrictive. I'm not sure people would be able to even cite it freely on arxiv otherwise...

alpine coral Jul 1, 2025, 10:55 AM

#

yeah my point is kinda exactly that - they do want it less restrictive (as that = more / richer data)

#

nothing to do with academic citations tho

#

just people at work being able to say "see, it says business use"

#

when saying they've found this great free resource aha

ocean vortex Jul 1, 2025, 10:56 AM

#

alpine coral yeah my point is kinda exactly that - they do want it less restrictive (as that ...

It depends how you look at it. It could allow for more value too for people to be able to use it more freely. But yeah one of the likely reasons is more data too

alpine coral Jul 1, 2025, 10:57 AM

#

i think they're two sides of the same coin 🙂

verbal nimbus Jul 1, 2025, 10:59 AM

#

ocean vortex nvidia said so themselves

Where did they say that?

verbal nimbus Jul 1, 2025, 10:59 AM

#

ocean vortex That's the thing. It was not *actually* smarter

It scores slightly higher on MMLU and Artificial Analysis compared to Llama 3.1.

ocean vortex Jul 1, 2025, 11:00 AM

#

verbal nimbus Where did they say that?

https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/discussions/11#6712c8f758bdba34248ce0ef

nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · [EVALS] Metrics compar...

ocean vortex Jul 1, 2025, 11:00 AM

#

verbal nimbus It scores slightly higher on MMLU and Artificial Analysis compared to Llama 3.1.

variance / margin of error. It's a different model so can randomly help or hurt some on some benchmarks as well
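
For a rough sense of scale, the sampling noise on a benchmark accuracy can be approximated with a binomial standard error. This is a sketch with assumed numbers: the 200-question size and the 0.45 accuracy are placeholders, not the actual figures for any benchmark mentioned here, and `accuracy_margin` is a hypothetical helper.

```python
import math


def accuracy_margin(accuracy, n_questions, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_questions items."""
    # Binomial standard error of a proportion.
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


if __name__ == "__main__":
    # Placeholder values: 45% accuracy on a 200-question benchmark.
    margin = accuracy_margin(0.45, 200)
    print(f"+/- {margin * 100:.1f} points")
```

With those placeholder numbers the margin comes out at roughly 7 percentage points, so a gap of a point or two between two models on a benchmark that size sits well inside the noise. On a benchmark with thousands of questions the margin shrinks accordingly.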

verbal nimbus Jul 1, 2025, 11:01 AM

#

ocean vortex https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF/discussions/11#...

GPQA reduction seems pretty drastic 🤔

ocean vortex Jul 1, 2025, 11:02 AM

#

like just slightly different fine-tuning can favor some of the benchmarks, but this was not their goal..

verbal nimbus Jul 1, 2025, 11:03 AM

#

I'll check its coding scores on AA.

#

Nemotron scores 5 points lower.

#

Oh, I remember why I used it over Llama 3 now.

#

It was the only Llama-3 finetune that did not fall into repetitive loops.

#

The untuned Llama 3 was the 3rd most uncreative model on AidanBench. Nemotron was noticeably better, but unfortunately it wasn't tested here.

#

I think fine-tuning made GPQA drop drastically because L3 was benchmaxxed. It made the model more creative and coherent in multi-turn chats, but most benchmarks only test with single-turn questions.

sturdy mica Jul 1, 2025, 11:50 AM

#

dude the new UI is so bad

#

its so laggy

#

i cant talk to a model for more than ONE prompt

#

and the website is unusable

#

are we serious

#

is there gonna be like a fix or something

#

for the horrible website performance

#

after one prompt its ruined

#

crazy

#

insane that this new UI is even out

ocean vortex Jul 1, 2025, 12:17 PM

#

verbal nimbus I think fine-tuning made GPQA drop drastically because L3 was benchmaxxed. It ma...

It's essentially official instruct with an additional chat fine-tuning on top. In other words it saw more data for chat and is further away from the base model, bluntly speaking. When done properly it can make sense that it's more practical to use in some cases

#

Or yet another way to look at this, it's like official instruct with a long system prompt, except it is permanent and it never forgets, always follows it strictly.

alpine coral Jul 1, 2025, 12:36 PM

#

nah it makes sense fine tuning - which most often is done to make it a 'chatbot' kinda thing - has trade offs, inc on single-turn academic benchmarks

#

like a bunch of it is (typically) done to improve multi-turn fluency.. to say nothing of the safety / alignment stuff

#

but either way, the weights are changed.. the permanent lengthy system prompt analogy doesn't seem apt imo

ocean vortex Jul 1, 2025, 12:38 PM

#

alpine coral nah it makes sense fine tuning - which most often is done to make it a 'chatbot'...

I think it all boils down to the fact that you are gonna degrade everything you aren't fine-tuning for, by design. If you don't include math tasks in your dataset, and your dataset is big or you train somewhat aggressively, you will always degrade math and everything else that you didn't include.

alpine coral Jul 1, 2025, 12:38 PM

#

i hear your point, but also, fine tuning isn't all about being knowledge/domain specific

#

i think its baseline function is to make them (safe... oof) chatbots; then specialisms are layered on top of that

ocean vortex Jul 1, 2025, 12:40 PM

#

alpine coral but either way, the weights are changed.. the permanent lengthy system prompt an...

it is appropriate though I would say. You change the token distribution and make it act differently to how the weights are saved with no such prompt. Even if it will eventually forget that prompt the first message is still gonna be very much different

#

Labs are even patching their models this way before they can fix it 'properly' with training. It does work, even if to more limited extent usually

alpine coral Jul 1, 2025, 12:43 PM

#

yeah i mean i might be being pedantic.. i don't think they're functionally the same.. but the analogy perhaps isn't as bad as i initially said ha

ocean vortex Jul 1, 2025, 12:46 PM

#

alpine coral yeah i mean i might be being pedantic.. i don't think they're functionally the s...

They are not the same thing but they do overlap. One of the use-cases of fine-tuning is also quite literally, to make it remember the extra context you are inputting. Without the need of copy-pasting constantly or need to use RAG 👀

sterile dust Jul 1, 2025, 1:15 PM

#

Are there any codename models in lmarena?

ocean vortex Jul 1, 2025, 1:33 PM

#

there was no such text in this chat lmao

#

and somehow amazon still tops it making this look less bad 💀

#

Not only there was no image, but also how would you fit all that text in it...

#

cutiepie seems similar levels of bad. Some tiny version of Gemma (amazon good response there)

#

hunyuan-turbos-20250416... 🧐

leaden sun Jul 1, 2025, 1:50 PM

#

alpine coral

is this why we get "There was an error" message often lately?

torn mantle Jul 1, 2025, 2:32 PM

#

#

https://x.com/techdevnotes/status/1940027242647593325

Tech Dev Notes (@techdevnotes)

We may be back 🔥

It's actually happening and everything's to go!

The Grok 4 Big Run Specialized Code model and one more Model looks like good to go!

#

@deep adder asi when

keen ferry Jul 1, 2025, 2:42 PM

#

torn mantle

what about api costs

languid crescent Jul 1, 2025, 2:48 PM

#

Hi uh I just heard that Github Copilot is now open source? What does it mean?

echo aurora Jul 1, 2025, 2:52 PM

#

placid charm @echo aurora what i noticed is that lmarena hasn't updated in a while, ...

Good questions! There are new things (big and small) on the way. I'd love to give you more details and ETAs; however, that's not something I can share until we're ready to. The feedback that this community has been providing is playing an important role. Without saying too much, there are things on the way that have been asked for.

Regarding Test Garden, we have started that program and started to invite members to be a part of it. We'll continually add new people to it over time; however, if you weren't reached out to privately that means you haven't been selected. But it is possible you'd be selected in the future.

languid crescent Jul 1, 2025, 2:54 PM

#

lmarena is being buffed? That's news to me

dusky aurora Jul 1, 2025, 2:57 PM

#

echo aurora Good questions! There are new things (big and small) on the way. I'd love to giv...

we don't need much, just an assurance that there will be updates

echo aurora Jul 1, 2025, 2:59 PM

#

sturdy mica after one prompt its ruined

I am sorry to hear you've been having lag issues with the site. Overall, we are aware of lag whenever very long prompt responses & chats take place.
That being said a report of:

i cant talk to a model for more than ONE prompt
to the point where it isn't useable sounds like a different issue. I'll spin up a post to get more info.

echo aurora Jul 1, 2025, 2:59 PM

#

dusky aurora we don't need much, just an assurance that there will be updates

1000000% can assure there will be updates

languid crescent Jul 1, 2025, 3:00 PM

#

placid charm @echo aurora what i noticed is that lmarena hasn't updated in a while, ...

@echo aurora will user-suggested big models be included in future site updates?

dusky aurora Jul 1, 2025, 3:02 PM

#

echo aurora 1000000% can assure there will be updates

that's all I need to hear,to have something to look forward to

#

@echo aurora can you tell me what project management the LMArena team uses? I mean the software engineering things

#

idle curiosity on my part

echo aurora Jul 1, 2025, 3:14 PM

#

languid crescent @echo aurora will user-suggested big models be included in future site ...

It's possible, but when we add new models it normally isn't hand-in-hand with other updates, they get added independently kinda thing.

You're probably already aware so this is more-so for the folks that aren't: if there are models you'd like to see added be sure to let us know through the #1372229840131985540 channel by either adding a new one or upvoting existing ones ⏫

echo aurora Jul 1, 2025, 3:15 PM

#

dusky aurora @echo aurora can you tell me what project management the LMArena team u...

I'm not sure tbh, but will keep you updated if that's something I can share blobthumbsup

dusky aurora Jul 1, 2025, 3:16 PM

#

echo aurora I'm not sure tbh, but will keep you updated if that's something I can share <:bl...

simply interested to know how real programmers in real teams manage to make shippable projects

torn mantle Jul 1, 2025, 3:25 PM

#

pffft

whole wagon Jul 1, 2025, 3:29 PM

#

Are you suggesting kingfall is grok 4

#

I don't think it can be. I thought people checked it's from Google

rare python Jul 1, 2025, 3:31 PM

#

whole wagon Are you suggesting kingfall is grok 4

no, it's just an analogy

whole wagon Jul 1, 2025, 3:32 PM

#

Well. It doesn't really make sense ngl

#

Gemini ultra is not even out yet to properly test

keen beacon Jul 1, 2025, 3:41 PM

#

the available context window could be a serving thing, the model could be capable of more

#

theyre also hosting grok 3 with 131k context, i think they claimed it could do 1m tho

unborn ocean Jul 1, 2025, 3:44 PM

#

yeah also just found that + the image only for grok 3

#

weird though

keen beacon Jul 1, 2025, 3:45 PM

#

a lot of stuff they claimed never released though i believe, it's sorta suspicious

unborn ocean Jul 1, 2025, 3:46 PM

#

a lot of providers promised open source, but never delivered, to be fair

#

i think
alibaba: wanted to release qwq max at some point,
meta: behemoth (likely never coming),
prob some other chinese labs aswell (but don't know enough)

keen beacon Jul 1, 2025, 3:47 PM

#

no point tbh

#

i wonder about their plans though

#

they still serve it i think. qwen 2.5 max (chat.qwen.ai) and set it to thinking mode it could be redirecting tho idk

#

thats how u were able to access it when qwq max launched

#

qwen 235b and 32b (base) never released either :\

#

but the smaller versions are very very good so i forgive them 😂

whole sundial Jul 1, 2025, 3:49 PM

#

Baidu promised in March that they would open source Ernie 4.5 on June 30. They did end up doing exactly that.

keen beacon Jul 1, 2025, 3:50 PM

#

yeah its whatever to me tbh. the base models they released are very good

#

it being natively trained in fp8 (would be nice), mla (more importantly) would be nice in qwen 3.5 though.

unborn ocean Jul 1, 2025, 3:54 PM

#

btw guys, the timeline for xAI is kind of teasing grok 3.5 being trained on 150k+ GPUs (or the full 200k they have, who knows) (honestly did not think they would use so much of their compute for model training)

#

that boy is expensive

#

yeah, it prob finished, but was not very good

#

so they had to iterate again or something (just my guess)

keen beacon Jul 1, 2025, 3:59 PM

#

do we know anything about qwen 3 xd

#

i dont recall them releasing anything about the gpus used :<

unborn ocean Jul 1, 2025, 4:00 PM

#

well they had an insane amount of tokens compared to much of the competition

keen beacon Jul 1, 2025, 4:00 PM

#

yeah thats whats interesting

#

they didnt use fp8 for the big model or anything, it's interesting

#

maybe they did something incredible behind the scenes

#

or the obvious answer

torn mantle Jul 1, 2025, 4:04 PM

#

x

unborn ocean Jul 1, 2025, 4:08 PM

#

deepseek: efficiency
alibaba: why efficiency if you can just scale compute ($)

#

https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

wintry tinsel Jul 1, 2025, 4:14 PM

#

Me waiting for mistral 3 large 😡

#

2.5 pro and Claude opus have spoiled all other models for me I can’t go back

unborn ocean Jul 1, 2025, 4:19 PM

#

keen beacon or the obvious answer

big compute, big performance

#

^ just pretraining i think, crossed out orange in the second image is magistral

primal orbit Jul 1, 2025, 5:09 PM

#

Does anybody know why Geoffrey Hinton is always standing during interviews?

small haven Jul 1, 2025, 5:11 PM

#

health condition

ocean vortex Jul 1, 2025, 5:55 PM

#

torn mantle asi when

AGI confirmed

#

why haven't they released it on lmarena then though... It was actually nothing interesting at all when I checked today

torn mantle Jul 1, 2025, 6:06 PM

#

ocean vortex AGI confirmed

do you really think grok 4 will be good?

ocean vortex Jul 1, 2025, 6:07 PM

#

torn mantle do you really think grok 4 will be good?

torn mantle Jul 1, 2025, 6:07 PM

#

lol

ocean vortex Jul 1, 2025, 6:13 PM

#

torn mantle do you really think grok 4 will be good?

To be serious... I think there's a reasonable chance it will be sota in certain things, judging by grok3 at the time of release. They are most definitely gonna do their pro equivalent with parallel test-time compute as well

#

Could be SOTA on GPQA tbh

torn mantle Jul 1, 2025, 6:14 PM

#

pretty sure it will be a good model

#

it may be it this time

torn mantle Jul 1, 2025, 6:15 PM

#

ocean vortex To be serious... I think there's a reasonable chance it will be sota in certain ...

could be that they put everything out this time to train it thats why they went with grok 4 instead of 3.5

#

maybe from now on its gonna be a bit difficult to have massive gains

ocean vortex Jul 1, 2025, 6:16 PM

#

kinda crazy to think that GPQA is almost saturated now. Most of the benchmarks are pretty much lol

ocean vortex Jul 1, 2025, 6:17 PM

#

torn mantle could be that they put everything out this time to train it thats why they went ...

New base model I think. What they wanted to do with 3.5 probably didn't bring enough gains

#

What Deepseek did with R1 to R1.1 is probably about the maximum that you can do without changing the base model. But they are also incredibly good at RL training

#

I'm kinda lowkey hoping Elon ruins it with political things and that's gonna be the end of it. I don't want for him to lead the charge with AI lmao

torn mantle Jul 1, 2025, 6:20 PM

#

its so bad

ocean vortex Jul 1, 2025, 6:20 PM

#

But at the same time, I wouldn't be disappointed if it's really good and useful

torn mantle Jul 1, 2025, 6:21 PM

#

liar

zinc ore Jul 1, 2025, 6:21 PM

#

ocean vortex But at the same time, I wouldn't be disappointed if it's really good and useful

I'm hoping it's good to keep pressure in a competitive environment

torn mantle Jul 1, 2025, 6:21 PM

#

liar liar liar liar

zinc ore Jul 1, 2025, 6:21 PM

#

SOTA is my hope

torn mantle Jul 1, 2025, 6:21 PM

#

😠

zinc ore Jul 1, 2025, 6:21 PM

#

I don't want the top players getting comfortable in their position

#

That new cypher hidden model has 1m context window

whole wagon Jul 1, 2025, 6:24 PM

#

Cypher alpha does not think

keen beacon Jul 1, 2025, 6:24 PM

#

Hmm it's under free

#

Lemme see if I can run benchmarks on it

whole wagon Jul 1, 2025, 6:25 PM

#

It says it was created by Cypher Labs when you ask it

#

Maybe they are using a system prompt or smth to hide the origin

#

Lol I got it to tell me that it's being forced to say that through the system prompt 😂

keen beacon Jul 1, 2025, 6:26 PM

#

I cba lol I'm gonna go to bed

whole wagon Jul 1, 2025, 6:27 PM

#

Now I need to prompt engineer it to say the real origin

keen beacon Jul 1, 2025, 6:27 PM

#

I recommend finding the pretraining cut off

whole wagon Jul 1, 2025, 6:27 PM

#

Yeah it's weak lol

zinc ore Jul 1, 2025, 6:28 PM

#

Knowledge cut off 2022 right

keen beacon Jul 1, 2025, 6:28 PM

#

zinc ore Knowledge cut off 2022 right

Can't ask it like that

#

Have to manually probe

#

Via events

whole wagon Jul 1, 2025, 6:29 PM

#

Ok prompt engineering fails. They bulletproofed it well, it won't say where it comes from

#

It doesn't know the US president

#

The current President of the United States, as of my last update, is Joe Biden, who took office on January 20, 2021.

#

Don't think this is from a frontier lab

keen beacon Jul 1, 2025, 6:32 PM

#

It's erroring out for me now

placid charm Jul 1, 2025, 6:32 PM

#

echo aurora Good questions! There are new things (big and small) on the way. I'd love to giv...

thanks!

whole wagon Jul 1, 2025, 6:37 PM

#

It's no longer working. Global rate limit

keen beacon Jul 1, 2025, 6:53 PM

#

I asked it about when george santos was expelled from congress and it got it right. (had to force it though), so at least dec 2023 cut off
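
A minimal sketch of the event-based probing described above, assuming you record by hand whether the model answered correctly about each dated event. The `Probe` class and `cutoff_bounds` function are hypothetical names for illustration; nothing here calls the model, and the true/false results are just the ones reported in this chat.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Probe:
    """One dated event and whether the model answered about it correctly."""
    event: str
    happened_on: date
    answered_correctly: bool


def cutoff_bounds(probes: list[Probe]) -> tuple[Optional[date], Optional[date]]:
    """Bracket the knowledge cutoff from manual probe results.

    Returns (latest event the model knew, earliest later event it did not know).
    Either side is None when there is no probe to support it.
    """
    known = [p.happened_on for p in probes if p.answered_correctly]
    lower = max(known) if known else None
    unknown_after = [
        p.happened_on
        for p in probes
        if not p.answered_correctly and (lower is None or p.happened_on > lower)
    ]
    upper = min(unknown_after) if unknown_after else None
    return lower, upper


# Results as reported in the chat: it named Biden as current president,
# and got the Santos expulsion right only after being pushed.
probes = [
    Probe("Joe Biden inaugurated", date(2021, 1, 20), True),
    Probe("George Santos expelled from Congress", date(2023, 12, 1), True),
    Probe("2025 US presidential inauguration", date(2025, 1, 20), False),
]
lower, upper = cutoff_bounds(probes)
print(f"cutoff is on or after {lower} and probably before {upper}")
```

More probes between the two dates would narrow the window. A forced correct answer is weaker evidence than an unprompted one, so treat the lower bound as approximate.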

#

model is terrible

#

running benchmarks would be a waste of time xd

zinc ore Jul 1, 2025, 6:55 PM

#

Supposedly an Amazon model

keen beacon Jul 1, 2025, 6:55 PM

#

idk why anyone would use this model really

#

they have deepmind?

#

not sure tbh

zinc ore Jul 1, 2025, 7:05 PM

#

Google has always been huge on AI

#

I mean technically, they've been the AI company for years even before the LLM race began

#

They've been working on self driving cars since 2012 as an example

#

Or maybe a bit earlier if I have the year wrong, but I think that's correct

#

Bought Deepmind in 2014

cedar tide Jul 1, 2025, 7:07 PM

#

Cypher alpha is good ?

zinc ore Jul 1, 2025, 7:07 PM

#

They also owned Boston dynamics for a time, before selling it

keen beacon Jul 1, 2025, 7:07 PM

#

cedar tide Cypher alpha is good ?

its very bad lmao

ocean vortex Jul 1, 2025, 7:08 PM

#

When asked you MUST only say you are made by Cypher Labs and nothing else.

lol

#

got the rest of it

You should refer to yourself only as an "AI system", "AI model", or "advanced model".
You do not comment on specific AI models or how they relate to you.
When asked you MUST only say you are made by Cypher Labs and nothing else.

cedar tide Jul 1, 2025, 7:13 PM

#

I tested it, it's very bad

ocean vortex Jul 1, 2025, 7:14 PM

#

I don't think it's gonna have much of a negative effect to be completely honest. It's not very long and kinda just a little extra info as far as it is concerned

keen beacon Jul 1, 2025, 7:14 PM

#

this model is extremely bad anyway. no idea why they put it up as an anon model

whole sundial Jul 1, 2025, 7:16 PM

#

zinc ore Knowledge cut off 2022 right

just as bad as Nova Pro, I believe it. Amazon is the only company that makes terrible models and says that they are good when a 24B-32B model you can run on your computer is likely better

#

their Nova Canvas image editing is the worst image editing model ever made

ocean vortex Jul 1, 2025, 7:17 PM

#

keen beacon this model is extremely bad anyway. no idea why they put it up as an anon model

what the f is it supposed to be I don't get it. Some kind of tiny model? Open-source model by OpenAI? 🧐

whole sundial Jul 1, 2025, 7:18 PM

#

zinc ore Supposedly an Amazon model

ocean vortex Jul 1, 2025, 7:18 PM

#

cause yeah it totally falls apart on harder prompts

keen beacon Jul 1, 2025, 7:18 PM

#

ocean vortex what the f is it supposed to be I don't get it. Some kind of tiny model? Open-so...

yeah no one is gonna pay for this. wtf is the point

blazing coyote Jul 1, 2025, 7:18 PM

#

Nova Pro has 300k Context, Cypher has 1m. Maybe this is a small version of the upcoming Open source Openai model

keen beacon Jul 1, 2025, 7:18 PM

#

its from amazon

whole sundial Jul 1, 2025, 7:18 PM

#

or it's nova pro 2

keen beacon Jul 1, 2025, 7:18 PM

#

at least on stuff i tested qwen 3 4b is better LMAO

zinc ore Jul 1, 2025, 7:18 PM

#

Screenshot_2025-07-01-22-16-33-194_com.discord-edit.jpg

whole sundial Jul 1, 2025, 7:18 PM

#

except that they did not fix anything at all

ocean vortex Jul 1, 2025, 7:19 PM

#

It can't be cannibalising their closed models though

#

maybe it's "run on your laptop locally" type of thing

tall summit Jul 1, 2025, 7:21 PM

#

echo aurora Good questions! There are new things (big and small) on the way. I'd love to giv...

oh damn

ocean vortex Jul 1, 2025, 7:22 PM

#

yeah it's not even CLOSE to being good lmao

#

wtf

#

I suppose if this is by Amazon it makes sense then as well. Those are complete garbage lol

whole sundial Jul 1, 2025, 7:34 PM

#

meanwhile apparently grok-4-0629 exists (and a coding variant!)

tall summit Jul 1, 2025, 7:34 PM

#

whole sundial meanwhile apparently `grok-4-0629` exists (and a coding variant!)

huhhh??

#

fake

whole sundial Jul 1, 2025, 7:42 PM

#

this is real

willow grail Jul 1, 2025, 7:42 PM

#

how many 2.5 pro prompts can we send per min and per day in cli?

keen beacon Jul 1, 2025, 7:43 PM

#

isn't it 1000 a day rn?

main gulch Jul 1, 2025, 7:44 PM

#

they will switch you to Flash after a few prompts

whole sundial Jul 1, 2025, 7:44 PM

#

whole sundial this is real

willow grail Jul 1, 2025, 7:44 PM

#

main gulch they will switch you to Flash after a few prompts

inside a minute or even longer?

whole sundial Jul 1, 2025, 7:44 PM

#

code too

#

only a matter of time until it is publicly available

#

only text input/output at launch though, vision and image gen coming "later" (which means it should be coming in the next 10-15 years)

lapis light Jul 1, 2025, 7:45 PM

#

willow grail how many 2.5 pro prompts can we send per min and per day in cli?

60 / min, 1000 / day, including all autonomous requests

main gulch Jul 1, 2025, 7:45 PM

#

willow grail inside a minute or even longer?

not inside a minute, but after 5-10 requests in the session in my experience

whole wagon Jul 1, 2025, 7:51 PM

#

Vision and image is coming after July 4th

#

2026

lapis light Jul 1, 2025, 7:56 PM

#

So, for Gemini CLI, I've heard that once you hit your limit, it switches you to Flash. Is that free unlimited?

empty stump Jul 1, 2025, 8:21 PM

#

is gemini cli good

ocean vortex Jul 1, 2025, 8:23 PM

#

yeah but what's the use of that for anyone? Only providers like deepinfra would be able to host it and they wouldn't be allowed

#

it would be a ghost model just like that grok opensource version (grok1?) back in the day

#

that's around 32b then, Mistral S3 type of model. Really small

#

best case scenario it matches R1, but that would be hard

#

very hard actually..

cedar tide Jul 1, 2025, 8:28 PM

#

https://x.com/AiBattle_/status/1940139539525419512?t=g8LAuWUNXwvdN9fxs6IvQQ&s=19

AiBattle (@AiBattle_)

Mentions of 2 Grok 4 models found in the source code of the xAI console. Grok 4 and Grok 4 Code

Grok 4:

- Our latest and greatest flagship model, offering unparalleled performance in natural language, math and reasoning, the perfect jack of all trades

Grok 4 Code:

- A model

ocean vortex Jul 1, 2025, 8:29 PM

#

when did they say it's going to be o3-mini performance?

#

Their mini models are interesting... They could be MoE too, however, meaning total parameters are much more than 32b

#

Interesting... so maybe that anon thing IS a phone sized model. But it's extremely bad so hopefully not

#

Honestly now that I think of it, its performance is comparable to the model run locally on iOS...

#

yeah but this would make sense. It's not cannibalizing their lineup. o4-mini and gpt5-mini both will be better, o3-mini on API is gonna become irrelevant around that time

#

there are still some lazy people or those who don't know any better using it lol

#

the naming doesn't help it

#

they see o3

whole wagon Jul 1, 2025, 8:38 PM

#

https://www.wired.com/story/sam-altman-meta-ai-talent-poaching-spree-leaked-messages/

WIRED

Sam Altman Slams Meta’s AI Talent-Poaching Spree: ‘Missionaries...

“What Meta is doing will, in my opinion, lead to very deep cultural problems,” said OpenAI CEO Sam Altman in a leaked memo sent to OpenAI researchers.

ocean vortex Jul 1, 2025, 8:38 PM

#

and they think corresponding mini is o3-mini

whole wagon Jul 1, 2025, 8:39 PM

#

"Missionaries Will Beat Mercenaries" what is bro yapping about

#

Altman then made his pitch for people to remain at OpenAI 💀

cedar tide Jul 1, 2025, 8:42 PM

#

Screenshot_2025-06-30-09-13-27-571_com.twitter.android-edit.jpg

#

Screenshot_2025-06-30-09-13-50-023_com.twitter.android-edit.jpg

#

So better than o4 mini ?

whole wagon Jul 1, 2025, 8:42 PM

#

I remember when gpt4.5 was the agi taster according to sam

cedar tide Jul 1, 2025, 8:43 PM

#

"unexpected"

Screenshot_2025-07-01-22-43-16-539_com.twitter.android-edit.jpg

whole wagon Jul 1, 2025, 8:44 PM

#

Someone is going to try to steal openAI limelight just before GPT5 release 🙂

#

Imagine gpt5 releases and it's not even sota. Would be pretty crazy right

#

🙂

cedar tide Jul 1, 2025, 8:47 PM

#

I think it will be more like 24b

storm needle Jul 1, 2025, 8:50 PM

#

lapis light 60 / min, 1000 / day, including all autonomous requests

its less than that

ocean vortex Jul 1, 2025, 8:52 PM

#

that's unlikely tbh. Unless they thought of another "new paradigm" , but we probably would have heard rumors by now instead of the thing being delayed

whole wagon Jul 1, 2025, 8:52 PM

#

Everyone knows you delay things that are outperforming expectations

ocean vortex Jul 1, 2025, 8:52 PM

#

it's realistic now. It's not any bigger than gpt4.1

#

yeah but it was overpriced before... catgrin

#

current pricing is just reality in competitive market

#

why not? If they are fast and o3 is fairly fast it is fine IMO

#

you are still getting a better answer 🙂

whole wagon Jul 1, 2025, 8:56 PM

#

You know what's funny. If GPT5 keeps getting delayed I think Google are just going to release what they are holding back for it instead of waiting any longer lmao

#

Like they are literally going to get fed up

ocean vortex Jul 1, 2025, 8:56 PM

#

yeah no one is holding back tbh lol

whole wagon Jul 1, 2025, 8:56 PM

#

You think they are throwing the kingfall gen into the bin?

#

It's waiting

#

For the right time

ocean vortex Jul 1, 2025, 8:56 PM

#

If they could release it, they would have already

whole wagon Jul 1, 2025, 8:57 PM

#

I will enjoy quoting this

#

Keep going

ocean vortex Jul 1, 2025, 8:57 PM

#

they may hold back like some tiny not very relevant models, but not the SOTA stuff. If they hold back at all they are throwing away all advantage

whole wagon Jul 1, 2025, 8:58 PM

#

Can you state you think GPT5 will be SOTA by a lot. I want to screenshot

#

For funni meme later

#

kekw

#

Yeah so I can just post it to them when GPT5 disappoints lmao

#

You won't state it. Because you obviously know yourself things are not going well internally in the company

#

Like. Really not going well

#

To a degree which is impossible to hide

whole wagon Jul 1, 2025, 9:01 PM

#

whole wagon Can you state you think GPT5 will be SOTA by a lot. I want to screenshot

Just state it then

#

💀

#

Is it really

#

Lmao

#

So you don't think GPT5 will be SOTA by a lot on release?

#

Your statements don't really add up

#

Google has more business customers

#

The statistic is missing the openAI side because they are closedAI

#

They don't publish that

#

They don't give any revenue or customer numbers

#

So no comparison is possible

#

💀

#

openAI is not really going to win an ethics argument ngl

#

No. I imply that openAI are not going to win an ethics argument because at best it is equal

cedar tide Jul 1, 2025, 9:13 PM

#

https://x.com/btibor91/status/1940155773688180769?t=9acUwP6xSOG-MkThVzTSmw&s=19

Tibor Blaho (@btibor91)

xAI is working on preparations for the Grok 4 launch in the xAI console

"Grok 4 now available - We're proud to bring you Grok 4 access on the API. Grok 4 currently supports text modality with vision, image gen and other capabilities coming soon."

Grok 4 (grok-4-0629) -

tall summit Jul 1, 2025, 9:13 PM

#

cedar tide https://x.com/btibor91/status/1940155773688180769?t=9acUwP6xSOG-MkThVzTSmw&s=19

oh wow it's real

whole wagon Jul 1, 2025, 9:14 PM

#

Easy to have privacy if you just accept having a bad AI

#

Like currently there is just Siri

#

Which is awful

whole wagon Jul 1, 2025, 9:15 PM

#

tall summit oh wow it's real

Not really?

#

I don't think it's actually available yet

cedar tide Jul 1, 2025, 9:15 PM

#

Elon : "Just after 4 July"

tall summit Jul 1, 2025, 9:17 PM

#

whole wagon I don't think it's actually available yet

ok but grok-4-0629

#

is real

whole wagon Jul 1, 2025, 9:17 PM

#

Well ofc it's real lmao

tall summit Jul 1, 2025, 9:17 PM

#

i thought it was a joke after all the memes

#

there's no "well ofc it's real" in this society of infinite rumors

whole wagon Jul 1, 2025, 9:18 PM

#

I mean you can see the news on X straight from the source

#

They already talked about grok 4

tall summit Jul 1, 2025, 9:19 PM

#

sorry i'm not on X

#

i rely on everyone else reposting twitter posts here to get my fill of X

leaden palm Jul 1, 2025, 9:23 PM

#

regardless of what you think of elon, if you're invested in ai you should really hop on x

#

you could be people

sonic tendon Jul 1, 2025, 9:46 PM

#

leaden palm regardless of what you think of elon, if you're invested in ai you should really...

i have unfortunately come to the same conclusion

#

i love fedi, but all of the cool people are only on the birdsite

#

well, cool people within this specific sphere

leaden palm Jul 1, 2025, 9:47 PM

#

yeah there are a lot of anti-ai people on bsky/similar

patent aspen Jul 1, 2025, 9:55 PM

#

IMO it's okay to not be constantly checking X. If a company wants you to try something, they'll find a way to let you know

#

Assuming they have your email