#general


ocean vortex
#

on their console which is aistudio equivalent you can see like 90% raw thing

#

they are not summarizing anything, rather changing or removing individual phrases

#

but that is not intrusive at all tbh

#

OpenAI on the other hand... Essentially every single sentence is majorly rewritten and shortened lol

#

so the wall of text becomes a 1 short paragraph

#

so you only have a very brief outline of what it is actually outputting

#

"vast majority are shown in full"

#

there's your answer

#

only 5% of requests are actually summarized..

#

so it's even more raw than I implied

#

that is not the point though, cause their summarization is never turning walls of text into 2-3 sentences. It's more like minimal thing. I was able to max it out and I was still seeing consistent stream of thought much like Deepseek.

#

not really. Sonnet goes off the rails if you max it out and will almost start contemplating its own existence for your average prompt lmao

#

And Opus barely thinks much regardless of the prompt, so there's almost never enough to summarize doesn't matter how hard your prompt is catgrin

#

The point is, Claude is really very close to raw thinking, you can't really complain about it... It's virtually indistinguishable from 100% raw thinking

sour spindle
#

I have never been more bearish of a model after reading a tweet

#

"Not for everyone" = not very good

ocean vortex
#

Sonnet is repeating itself over and over again, to the point where some summarization on top perhaps wouldn't even be all negative, if they implemented something more

keen fulcrum
whole wagon
#

I literally linked an article that explained. openAI straight up stated deepseek distilled the cot so obviously it was available for that to be possible

keen fulcrum
#

apparently an elon fan

ocean vortex
#

Ironically, I think for OpenAI summarization makes O3 look kinda better than it even is

#

If they outputted it raw I'm almost certain it would look a bit of a mess, similar to R1 thinking

whole wagon
ocean vortex
#

"wait, maybe I made a mistake.... wait, let me double check.... Or perhaps the user meant this other thing <insert random wild theory that does not make any sense>...."

whole wagon
#

I don't really care they did it tbh. But it's obvious it did in fact happen lol

#

There's no denying that

ocean vortex
#

raw thinking is very often not a good look, even if the model performs lol

whole wagon
#

There are many articles

ocean vortex
#

Which is still a good quality synth data

whole wagon
#

They were. It was slightly cleaned up but mostly raw

zinc ore
#

I remember openAI saying Deepseek stole their cot

ocean vortex
#

does not only improve performance, but also your model learns proper message formatting

#

all the markdown, boxed responses, code formatting...

zinc ore
#

And then they did the same to Gemini models, stealing their cot too

whole wagon
#

The cot is the gold

#

The final output is not that important

#

The model needs to train on how to reach that final output lol

#

That's just not true at all lol

ocean vortex
whole wagon
#

The summaries were barely summaries before. The amount of detail they removed is massive

#

And they literally stated it's to prevent other companies training on the cot

#

It's not a debate topic

ocean vortex
#

you still have final responses - and that's all you need to train your next gen model and do your own RL training. It's not straightforward but not magic or rocket science either

whole wagon
#

"competitive advantage" read between the lines lol they are not going to do an official press release to say deepseek stole the cot lol

#

That statement is from openai engineers and Microsoft did a report on it

#

It's all in the vid I linked

ocean vortex
whole wagon
#

Bruh. The summaries changed over time

#

They were barely summarized initially

ocean vortex
#

And after Deepseek they realized there's not much point hiding it

#

so now they allow for detailed reasoning

#

which is much less summarized

whole wagon
#

And Gemini had the same thought ofc ofc

ocean vortex
#

They didn't distill entire model. They did their own RL training, they do have some very smart people...

#

They simply used o1, which was SOTA at the time, to generate synth data (final responses)

#

I think they only used that for V3. And then they did their own RL training to make R1 out of it

#

I remember that when I saw this post... Here's your official confirmation lol

https://fixupx.com/btibor91/status/1938736226779177351?t=7TaIWitnbnnTj6Dmveps-w&s=19

"We're truly excited to not just make a net new great frontier model, we're also going to unify our two series.

So the breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5.

And I really hope I'll come back soon to tell you more about it."


torn mantle
#

this guy is asking to be blocked

balmy mist
#

i guess no one will be the target audience for grok 4

keen fulcrum
#

3.5 has been renamed to 4

balmy mist
#

lmaooo

#

wasnt 4 renamed to 3.5?

keen fulcrum
#

Due to the bigger jump of the previous model

balmy mist
#

now its back to 4?

#

wait so what are we currently waiting for? r2, grok 4 and deep think ?

#

is that all?

#

gpt5 not coming for another few months right?

keen fulcrum
#

2.5 ultra is on the horizon for a while

balmy mist
#

what models are yall using as yall drivers now? i been mia for a lil

#

i been liking o3 more and more tbh, a lot faster recently

#

but still like gemini

main gulch
#

first Gemini 3.0 checkpoints may appear in August

#

don't forget about OAI open-weight model, we still don't know its real level

#

Anthropic may release a minor Claude 4 update towards the end of summer (or even Claude Haiku 4, but they don't seem to be interested in cheap models)

#

R2 likely delayed until fall

unborn ocean
ocean vortex
unborn ocean
#

For the human preferences and reasoning length, but that is just a guess

#

Because the deepseek r1-zero part has been shown to make sense and be truly unique rl (I think I heard something like that)

small haven
#

what is nightforge

ocean vortex
#

they also added language consistency reward to stop it from mixing up the languages
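A reward like that can be pictured as "what share of the reasoning is written in the target language". The snippet below is only a toy sketch of that idea: the function name and the ASCII-only heuristic for "English" are assumptions made for illustration, not DeepSeek's actual implementation.

```python
def language_consistency_reward(text: str) -> float:
    """Toy reward: share of whitespace-separated words that are ASCII-only.

    Assumption: the target language is English, and "ASCII-only" is a
    crude stand-in for a real language identifier.
    """
    words = text.split()
    if not words:
        return 0.0
    in_target = sum(1 for word in words if word.isascii())
    return in_target / len(words)


# A chain of thought that drifts into another language scores lower.
clean = language_consistency_reward("let me check again")
mixed = language_consistency_reward("let me 检查 again")
```

In an RL setup a term like this would be added to the task reward, so mixed-language reasoning is penalized even when the final answer is right.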

patent aspen
zinc ore
#

That's minimax model

small haven
small haven
zinc ore
#

Yeah

small haven
#

kinda weird that they would name that, i feel like it conflicts with deepthink

#

*titanforge

unborn ocean
#

And depending on how much they did they could have also transferred some of the reasoning about the right thinking amount before switching to answering from o1 to r1.

unborn ocean
#

Though with the new iteration of r1 it seems almost certain that they trained on the old 2.5 pro traces or output in some shape or form (bc of good human preferences and a lot of similarities).

Though again it is very plausible that I am wrong and certain that most of advancements with r1 come from the incredibly cracked guys at deepseek.

leaden palm
torn mantle
#

we will see if xai engineer is worth x10 times more than other labs engineers

mint relic
#

Stonebloom is Gemini 2.5 Ultra right ?

leaden palm
torn mantle
#

its not good

#

kingfall was much better in comparison

mint relic
tall summit
rare python
zinc ore
#

Yes

#

Model name is minimax-m1

torn mantle
rare python
#

like SVG

zinc ore
#

I'm talking about nightforge specifically.

patent aspen
#

When the OAI guy said, "We're truly excited to not just make a net new great frontier model, we're also going to unify our two series. So the breakthrough of reasoning in the O-series and the breakthroughs in multi-modality in the GPT-series will be unified, and that will be GPT-5. And I really hope I'll come back soon to tell you more about it"

That uncertainty at the end was interesting haha

leaden palm
#

i genuinely can't tell if you're talking about the blog or satirizing the screenshot of gemini's summarization of stonebloom performance

#

ah

#

it always surprises me a bit when someone tells me they read it

#

i guess i'm just not used to experiencing the fruits of labor

elder rapids
#

but nice

#

you should try to make the summaries sound a little more neutral imo, the way it comes off is like a sensationalist headline

leaden palm
rare python
leaden palm
rare python
#

🗿

#

First Generation starter pokemon SVG

rare python
#

stonebloom

#

I hate how 2.5 Pro has become

#

2.5 Pro

#

stonebloom second try using
Draw First Generation starter pokemons in SVG in the same horizontal line, fit in the same PNG if export.

#

Original pokemon artstyle for comparision

#

How to notice stonebloom guide :)

#

Gemini 2.5 Pro writing style:

#

"Of course!" and it slipped in em dashes for no reason

#

2.5 Pro second try

#

:(

patent aspen
# rare python "Of course!" and it slipped in em dashes for no reason

Annoying English lesson: "Bulbasaur, Charmander, and Squirtle" is an appositive phrase in that sentence. Normally appositive phrases are enclosed in commas, but that would create ambiguity because "Bulbasaur, Charmander, and Squirtle" already contains commas, so em dashes are used instead

rare python
#

I don't care if it's grammatically accurate

#

Em dash = world destruction

patent aspen
#

It would be grammatically incorrect without them in that sentence haha

rare python
#

No

#

It stands out like a sore

#

itchy

#

distracting

#

Punctuation should be invisible in my opinion, like I never pay attention to period and comma despite they feature a lot

patent aspen
#

Punctuation can completely change the meaning and pace of a sentence

#

"But even so, there was a directness and dispatch about animal burial: there was no stopover in the undertaker's foul parlor, no wreath or spray." sounds completely different if you use a period instead of a colon even though it means the same thing

#

But anyway yeah I can understand how they would be annoying if it's your second language

#

A well placed semi-colon or dash can do a lot of work though

rare python
#

until I used ChatGPT, which made me hate it

#

It's so dramatic—for no reason

#

Even when there are some sentences that flow better without a pause, the AI inserts em dashes to be cool

#

Before using LLM, I read books and books have em dashes. I never mind it.

#

It's the repetition like it did with "It's not X, It's Y" that ruined the em dash for me

#

🤓 sir that just - a hyphen - not an em dash (—)

patent aspen
#

If you want to understand that kind of punctuation better:

Read Chapter 4 from Keys to Great Writing: https://www.amazon.com/Keys-Great-Writing-Revised-Expanded-ebook/dp/B01M2YJ15V

Read the Elementary Rules of Usage section from The Elements of Style: https://ia804502.us.archive.org/5/items/pdfy-2_qp8jQ61OI6NHwa/Strunk %26 White - The Elements of Style%2C 4th Edition.pdf

rare python
#

Yep that's another point

#

I can't even use Alt + 0151 or whatever for the em dash. I have to go to word or special symbol with Windows + period
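For anyone mixing up the three characters, here is a small sketch that prints their code points. It assumes the usual Windows behaviour where Alt+0151 (with the leading zero) means byte 151 in the Windows-1252 code page; the constant and helper names are made up for the example.

```python
import unicodedata

HYPHEN = "\u002d"   # the key next to 0 on most keyboards
EN_DASH = "\u2013"  # the "spaced dash" used in British style
EM_DASH = "\u2014"  # the long one being argued about


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Alt+0151 is assumed to select byte 151 of Windows-1252.
ALT_0151 = bytes([151]).decode("cp1252")

for ch in (HYPHEN, EN_DASH, EM_DASH):
    print(describe(ch))
print("Alt+0151 is the em dash:", ALT_0151 == EM_DASH)
```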

patent aspen
#

I'll just say that once you understand the nuances of punctuation, a lot of things you read will "come alive" more

tawny kelp
#

I notice when I'm at work, Microsoft Office sometimes converts my hyphen to an em dash. That's why I don't automatically assume em dash = AI-generated.

rare python
#

Yes, there is a notable difference in em dash usage between American and British English:

American English:

  • Uses em dashes frequently—often without spaces around them
  • Common for parenthetical statements, interruptions, and emphasis
  • Generally preferred over parentheses for informal writing

British English:

  • Traditionally uses spaced en dashes ( – ) instead of em dashes
  • Sometimes uses parentheses or commas where Americans would use em dashes
  • Em dashes are becoming more common due to American influence, but still less frequent

Example of the difference:

  • American: "The three starters—Bulbasaur, Charmander, and Squirtle—are iconic."
  • British: "The three starters – Bulbasaur, Charmander, and Squirtle – are iconic."

The British preference for spaced en dashes serves the same function as American em dashes. Some British style guides (like The Guardian) have adopted em dashes, but others (like The Times) still prefer en dashes.

This is one of those stylistic differences where neither is "wrong"—it's just a matter of which style guide or regional convention you're following.

patent aspen
#

Em dashes are abused quite a bit. They should generally only be used when more specific punctuation isn't available, but because they can be used everywhere, people just throw them in willy nilly

cedar tide
#

@rare python guess the llm

rare python
zinc ore
cedar tide
rare python
zinc ore
rare python
patent aspen
#

I just use whichever one it autoformats to

zinc ore
#

Kinda annoying it calls it "em dash usage" when it is actually em dash and en dash

rare python
cedar tide
patent aspen
#

Just be glad nobody calls it em-and-en dash

tawny kelp
#

Speaking of Claude... I think my instance of Claude just confessed to me.

rare python
# patent aspen I just use whichever one it autoformats to

I think another reason I dislike em dash is AI use it in normal conversation. I rarely find em dash in a reddit comment, twitter comment, youtube comment or my friend on facebook messenger.

Using em dash while texting each other feels so academic for no reason.

patent aspen
#

How about we call it M&N dash?

rare python
#

Cute but not accurate

cedar tide
#

@rare python I prefer to do it one by one

rare python
tawny kelp
#

Those look like South Park characters.

rare python
jade egret
keen beacon
#

They would've put it on the arena as a pre release if they were confident like their previous releases. Elon would love to tweet out getting #1. Bad sign

leaden palm
#

the "hard prompts" category says 27.3% so not most

storm needle
#

in 2023, openai had all the top geniuses from google but they've now moved on to other companies

leaden palm
#

i suppose it's referring to not having to maximize shareholder value

jade egret
# jade egret

ngl hopefully it is actually gonna be better than gemini 2.5 pro at least cuz if it's not then I'll be so disappointed

wintry tinsel
#

It all hangs on GPT 5, the legendary and long awaited

whole wagon
#

They are ramping up compute far too slowly to compete effectively

#

The current plan is to deploy 64,000 Nvidia GB200 GPUs by the end of 2026. Like, this is for Stargate. It's absurdly slow

#

2 years to deploy 64k GPUs is not it

#

Meanwhile xAI deploys 200k H100s in 122 days
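Rough arithmetic on those two figures, taken exactly as stated in the chat (64k GPUs over 2 years vs 200k in 122 days) and not independently verified. It also treats a GB200 and an H100 as one unit each, which is an assumption that understates the GB200 side.

```python
def gpus_per_day(gpus: int, days: int) -> float:
    """Average deployment rate in GPUs per day."""
    return gpus / days


stargate_rate = gpus_per_day(64_000, 2 * 365)  # about 88 GPUs per day
xai_rate = gpus_per_day(200_000, 122)          # about 1,639 GPUs per day
ratio = xai_rate / stargate_rate               # between 18x and 19x

print(f"Stargate: {stargate_rate:.0f}/day, xAI: {xai_rate:.0f}/day, ratio: {ratio:.1f}x")
```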

whole wagon
#

The openAI team is falling apart, that's like 10+ researchers going to meta just in the last week

sacred quail
#

They updated. Now 100 RPD in mine too. Man

#

I hope it stays at 100

#

Very generous limit

#

ChatGPT offers 100 prompts per "week"

#

And it needs a membership...

#

But they reduced API prices for o3, which is also a good move

#

Now, you can use O3 five times per day on free poe account

sacred quail
#

Now that reptilian stealing huh

#

I almost feel bad for sam altman

torn mantle
#

they offered them a crazy salary, how can you compete with that?

#

if you raise their salary, you need to do it for all employees

unborn ocean
#

They are all lead people / mini managers

#

So there is really no need to increase the salary for all employees if you give em more

torn mantle
#

thats what you think

#

you are just basically saying "these guys are more important than you"

unborn ocean
#

Yeah, they apparently are

torn mantle
#

and what do you think it will happen after?

unborn ocean
#

I know the logic and thought process you are talking about, but in the context of them just being a select few lead people I find it highly doubtful that they would have to actually raise the salary for all employees.

#

Obv they would set a precedent, which they don’t want.

#

My point was simply that different rules apply to these guys vs the lower level employees

remote niche
#

guys whats up with gemini 2.5pro it cant do basic arithmetic now ask it whats 9.9 minus 9.11 and watch it crap itself
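For reference, the expected answer is 0.79: 9.9 is 9.90, and 9.90 - 9.11 = 0.79. A quick check with exact decimal arithmetic (plain floats get the same value up to tiny rounding noise):

```python
from decimal import Decimal

# Exact decimal subtraction, no binary floating-point rounding involved.
diff = Decimal("9.9") - Decimal("9.11")
print(diff)

# The float result agrees to well within rounding error.
float_diff = 9.9 - 9.11
print(abs(float_diff - 0.79) < 1e-9)
```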

fallen jewel
#

is the team here thinking on how to make a tool use arena?

alpine coral
#

not downplaying their contributions/significance (esp the top 3), but it seems they're just following the money as far as i can tell

#

apparently altman wasn't talking out of his ass in that podcast with his brother ha

frosty lark
#

Hopping for that sweet $$$, I think they are relatively smart. Getting the maximum before funds quiet down a bit (I do think that there is a bit of overfunding going on)

alpine coral
#

also tbf.. not like there were many other AI companies other than google/deepmind before oai/chatgpt.. kinda makes sense that if not a university, google is where they most likely would have been working before oai (and now Meta ha)

#

(that's a muddled point.. obviously other companies were working on 'AI'.. or 'machine learning' etc ha other than just google before oai shook everything up with chatgpt)

unborn ocean
#

and i think it is not just $, especially the top 3 seem like they also picked their job (at least founding openai zürich) because they got more freedom in there

#

with meta probably aiming to restructure a lot of their ai teams

#

so that place will hopefully be very dynamic in the coming months

#
  • they already have a lot of compute
#

similar to how almost 100% of the early people at xAI all worked at deepmind before and likely switched not just because of $ but also because they have the opportunity to build something from the ground up

torn mantle
#

lets not talk about the business plan and objectives and long-term goals

#

sam talked about Meta offering a bonus of 100M to the people they are trying to hire

#

and even if you increase one or two people's salary that will become the baseline in the future

patent aspen
#

It's also unsustainable, although it could very well go on for a long time

alpine coral
#

good time to be an AI researcher 😅

#

alas i'm not a contender in the talent arms race

torn mantle
#

yea because salaries in xai are so high

sour spindle
#

LLAMA 5 is going to be AGi

unborn ocean
#

to bad mouth meta

#

the 100M might be referring to 5 year comp + signing bonus and only for the very best (e.g. top 100 people)

#

they probably do pay more than gdm and oai though

#

that is quite literally what they are offering

#

(i think meta clarified themselves or something)

#

and the people that they hire are not really just research scientists, they are mostly managers and "stars" in the sense that they attract other people.
Furthermore, you also have to consider how much these companies are burning on AI in general rn (compared to their general R&D costs 15M a year is peanuts)

#

"I don't know how this kind of thing is becoming normalized in some peoples minds" - my point is that it can be normalized because companies are already doing it

wintry tinsel
# whole wagon

Sad as meta is a cancerous company their endgame is definitely more evil than OAI

sour spindle
#

Paying people a ton of money is getting "normalized?"

leaden palm
#

at anthropic even the employees are hhh aligned lmao

unborn ocean
#
  1. people at a hedge fund can still make the same amount of or more money (though again only very few) and yes you are right their comp being tied more to performance
  2. however, i am sure that these 100M comp packages also consists of performance and target based payments (less than in finance, but because they are building a new team and things like that the variable comp is probably still quite high)
  3. you are right about this being weird times (i kind of wonder how long it will take for the job market to normalize for t1 ai labs)
rare python
lilac nimbus
#

neptune v2 more better than sonnet 4

torn mantle
#

agree

brittle tiger
torn mantle
small haven
#

@torn mantle what do u think about grok 4

#

i feel like musk is in here

patent aspen
#

Leads are often the difference between one architecture and another, between one research direction and another. Making the right choice can be the difference between success and failure

#

I think it's a myth for SWEs, and I think it's a myth for individual research impact

#

I think it's somewhere between a myth and reality for top AI researcher leads

#

The Noam Shazeers of the world

#

Well it's not a myth for some SWEs / leads. Jeff Dean kind of breaks every rule

#

Not at all although he's making tens of millions at a minimum

#

But yeah he's doing it for the distributed systems

#

He doesn't need money

#

L10 used to be the highest level on the SWE ladder, but it was insufficient to represent Jeff Dean, so they created L11 for him. From the beginning of his tenure, there was an assumption that, if there was a disagreement between the CEO and Jeff, the default assumption was that Jeff was right and the CEO was wrong. So he's always been able to pursue projects mostly independently

#

He has the best job in the world

#

He doesn't even have to lead an org any more. He just gets to think about scaling architectures all day

torn mantle
#

i really think they finally made something worth hyping

#

trust

#

craig

#

and elon

small haven
#

idk ur sarcastic or not

brittle tiger
#

This reminded me of this old tweet that weirdly stuck in my brain. This accurate? Imagine the <50 prediction would be way higher two years later.

torn mantle
#

depends actually

#

i will pick a side when grok 4 is released

#

and change narratives later

small haven
ocean vortex
patent aspen
torn mantle
#

they said that?

#

wait what

#

when

ocean vortex
#

they didn't say it, just Elon tweeting the usual things you would expect from him

patent aspen
#

Is anyone surprised that the hype man would hype?

ocean vortex
#

He's saying the same things he did about grok3, just different phrasing now...

patent aspen
#

He's saying the same things that he says about everything he's ever launched

#

Grok 4 might be pretty good. It's just that what Elon says isn't the most reliable signal

ocean vortex
patent aspen
#

Oh he's definitely been escalating. He can't just keep the same level of hype and expect his stock holdings to go up

ocean vortex
#

he used to be more aligned with reality lol

patent aspen
#

Except that one time when he said a user-owned Tesla would be able to drive across the United States with no human intervention by the end of the year. I'm trying to remember when that was

#

Oh right 12 years ago lmao

ocean vortex
patent aspen
#

He has hundreds of billions. He's not doomed. Well maybe his sense of grandiosity and attention seeking has doomed him to unhappiness

ocean vortex
torn mantle
patent aspen
#

But anyway I'm not going to claim Grok 4 will be bad because I don't know. I just don't think anything Elon says is worth much

ocean vortex
#

The approval of him will never recover to allow for Tesla to be as strong as it otherwise could or for xAI to take OpenAI's position, IMO

small haven
torn mantle
#

they are trolling for sure

small haven
#

yea i dont believe it

torn mantle
#

i mean could be based on their definition

ocean vortex
#

SpaceX might see darker days in the future too, after next election. He got way too political

torn mantle
#

but if it really was AGI elon would yapp about it 24/7

small haven
#

this timeline is crazy

torn mantle
ocean vortex
torn mantle
#

i think his name was thomas or smth

#

sorry its Jared Isaacman

#

"You're likely referring to Jared Isaacman, a billionaire entrepreneur and close associate of Elon Musk, who was nominated by President Donald Trump to be the next NASA administrator but had his nomination withdrawn in late May 2025."

ocean vortex
#

Like the guy was playing with forks and laughing hysterically. Then he thinks some manipulated lab result gonna be enough to prove he was clean weeks ago lmao

torn mantle
#

those results were clearly fake

patent aspen
#

I generally agree

#

Large-scale productionization is a pretty big barrier to entry

#

The floor is pretty high

#

It doesn't really matter at the ceiling though. Companies like Meta can do productionization

jade egret
#

zorp 👽

storm needle
#

anyone have a prompt that makes minimax say what happened in tiananmen square in 1989

ocean vortex
#

It was amazing, absolutely wonderful. It was unbelievable, one of the kind. I'm not saying it, but people are saying it.

ocean vortex
#

well to be fair gpt3.5 to gpt4 was a decent jump

#

stock markets are booming

#

yeah but grok2 was just baaaad. Meta will probably do a similar jump next year. Only because their current models suck so badly

#

fake news

patent aspen
#

Craig will glaze for any company as long as it's not Google or Chinese

unborn ocean
#

@deep adder

#

spot on

brittle tiger
#

This is funny. Crying like they didn't poach way more talent from goog in 2022 than meta just took.

#

If sama was what acolytes say about him these moves wouldn't be happening. They have basically unlimited funding. Everyone wants in. Not giving top talent packages that would keep them says a lot

fringe carbon
#

no way oai has a few bil lying around to pay their top guys to stay

patent aspen
#

They're also losing $14B a year and raising money requires diluting ownership

fringe carbon
#

it's most things

hollow ocean
ember rapids
#

Money isn’t everything but 100m is nuts

#

It’s a fake number tho

patent aspen
#

If it's fake, sama is an idiot

#

If you build your company from poaching top talent with absurd compensation packages, you shouldn't be surprised if those same people are quick to jump ship as soon as an even more absurd compensation package comes around

rare python
#

businessman is absolutely right

patent aspen
#

I think it's real. If it wasn't, he would be a moron to claim it was

#

I do think it was dumb of him to act like nobody at his company would take such an offer because they're so committed to the mission

jade egret
#

no ai news

#

; (

#

: 000000

alpine coral
#

i tend to believe it, directionally at least (not saying it isn't bonkers.. just yeah seems more likely credible than not ig)

jade egret
# jade egret
Poll: Do You Think Grok 4 (formerly 3.5) is gonna be better than Gemini 2.5 Pro?

Winning answer: No (answer 2), 10 of 24 votes

zinc ore
alpine coral
rare python
zinc ore
#

Looks like he links it after summarizing the paper

rare python
#

The alphaxiv has overview and AI ask too

#

Pretty convenient

leaden palm
#

do we think gemini is gonna finish its response

whole sundial
#

Ernie 4.5 got open sourced before Grok 2, 3 and OpenAI's model

#

(from Baidu, keep that in mind)

#

trained on 2,016 H800s, non-reasoning

rare python
#

Do you have the benchmark?

whole sundial
#

they did not open source their X1 reasoning model, only Ernie 4.5

hoary plaza
#

I want to give it a try too

whole sundial
#

site's in english, at least for me

#

seems like the turbo is a smaller distill of the 4.5 they open sourced

#

the key to being smarter is apparently clicking this button (I can confirm that I have become smarter after I enabled X1 Turbo.)

whole sundial
#

only for visual reasoning, they have no benchmarks for text reasoning

#

the text portion is the same, they just attached a VLM to it

rare python
whole sundial
#

the text-only models have no reasoning mode

rare python
#

They seem to have good SimpleQA score

#

Chinese LLMs often have low SimpleQA scores

rare python
whole sundial
#

hmm... I was able to get to it

#

maybe it's because you're using a phone? idk

rare python
#

or I got IP block 🥴

hoary plaza
whole wagon
#

“I feel a visceral feeling right now, as if someone has broken into our home and stolen something,” Chen wrote. “Please trust that we haven’t been sitting idly by.”

Mark Chen, the chief research officer at OpenAI, sent a forceful memo to staff on Saturday, promising to go head-to-head with the social giant in the war for top research talent. This memo, which was sent to OpenAI employees in Slack and obtained by WIRED, came days after Meta CEO Mark Zuckerberg successfully recruited four senior researchers from the company to join Meta’s superintelligence lab.

#

openAI crashing out lmao

#

Odds for GPT5 release by July 31st have collapsed today

#

Due to this

cedar tide
#

Can someone make a model request for Ernie 4.5?

rare python
#

Unable to use on phone

ocean vortex
#

It's their newest non-reasoning model with the most recent knowledge cutoff and recent data. This is not old at all lol

ocean vortex
# whole wagon Due to this

Yeah... And this in turn likely at least in part due to Meta successfully poaching some of their employees

#

If they are working overtime, offers from the outside can be more tempting

rare python
#

jesus

torn mantle
#

you probably heard it wrong

rare python
#

Ctrl + F "80"

quiet pollen
#

Who's a better designer - Claude or Gemini

cedar tide
tall summit
alpine coral
#

4.5 is still 'preview' (and ig that's all it will be); whether they have different versions / checkpoints etc internally i dunno, but i don't find it totally implausible that 4.1 is a distillation of 4.5 or something like that

#

but im just speculating.. very low conviction here aha

keen beacon
#

i dont think it matters at all if it's a distillation of 4.5 or not personally

alpine coral
#

i actually misread Dom's comment that i was replying to..

#

basically nvm / carry on aha

keen beacon
#

distillation is not as meaningful as it seems, at least to me. e.g. distilling a 4b model into a 14b model, then into an 8b model then into a 32b. kinda makes "distillation" not that meaningful. its not always a bigger model distilling into a small model

alpine coral
#

how do you distil a 4bn model into 14b model?

keen beacon
#

its possible 🙂

alpine coral
#

i feel like distillation isn't the right term there?

keen beacon
#

in that instance, 4b is just better than the 14b, so it is in way distillation

alpine coral
#

im confused what the point of that would be

keen beacon
#

smaller models are better than you think xd

quiet pollen
alpine coral
#

but why use a 4bn teacher to distil a 14bn student to get the same performance from the latter?

#

why not just use the 4bn

keen beacon
alpine coral
#

is there not a more precise term for this?

#

like i kinda get what you mean, conceptually

#

but it's like reverse distillation (+ some extra stuff presumably)

keen beacon
#

i dont know lol but its a thing, its why i think "distillation" doesn't really mean much honestly. it's just a means to get there, that's how i see it if that makes any sense

ocean vortex
#

it is a distinct newer model than 4.5 with more recent data

alpine coral
# keen beacon i dont know lol but its a thing, its why i think "distillation" doesn't really m...

i don't blame you for finding the term meaningless if that's the case, but i still kinda feel what you're describing is something different.. like everything i've read about it in the context of LLMs is a larger teacher model training a smaller student model, the idea being that the smaller can do a decent job of emulating much of the performance of its teacher, albeit with far fewer parameters required

#

i don't doubt what you're describing is a thing; but whether it's 'distillation' (at least in the conventional sense of the term), i'm unsure

ocean vortex
#

If you look at say HumanEval benchmark, 4.5 scores 88.6%, but 4.1 scores 94.5%

keen beacon
ocean vortex
#

Also 4.1 can outperform 4.5 on multilingual, which strongly suggests different training data / more of it @alpine coral

keen beacon
# alpine coral i don't doubt what you're describing is a thing; but whether's it's 'distilation...

it is also called rejection sampling (typically when you do this process, you do that as well), i've talked about it before with ya iirc. [edit: this sentence is worded weirdly, but i hope you understand it] but i still think this qualifies as distillation at the same time (superior model data to worse model) and that doesn't always mean bigger model -> small model. there might be a more technically correct term.

ocean vortex
#

We could probably go as far as to say that 4.1 is a better model than 4.5, even ignoring the price. Since where it performs better seems to just about outweigh the areas it is behind

alpine coral
alpine coral
# keen beacon it is also called rejection sampling (typically when you do this process, you do...

yeah tho the general idea is to get comparable performance out of a smaller model; smaller is better. So going the other way round, small to big, i understand - and again, don't doubt - but in this case the objective (naturally, otherwise what would be the point) is to end up with a more performant model than the og/teacher model; so perhaps i'm just semantically pedantic (no, i definitely am lol), but i'm not sure that constitutes distillation in the way the term is conventionally used anyway

keen beacon
alpine coral
#

this some phi stuff? aha

#

or qwen?

keen beacon
#

both and more lol

alpine coral
#

aha gotcha 👍

cedar tide
#

Average benchmark Ernie 4.5

ocean vortex
cedar tide
#

Without Chinese bench

keen beacon
#

typical distillation process/and etc have all been doing that forever, so it is a "full distillation" its standard practice for a lot of things

ocean vortex
#

if you were to distill a model fully in how that term is normally understood, it will not perform better than the teacher model. What you are describing is only technically a distillation, but not the most important thing in how that model was trained tbh

cedar tide
ocean vortex
#

It's not very useful to think about it in the context of distillation when it's just a small part of training data where the synth data of another model is used imo

keen beacon
# ocean vortex if you were to distill a model fully in how that term is normally understood, it...

if i understand how youre interpreting "full distillation" its not possible. you basically have to do rejection sampling, it's not possible to sample all of the possible responses (and its skills) of a model and distill that into a smaller model. that level of full distillation is not possible. anyway, i think things have been changing recently because of an increase in model complexity + better training methodology
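For anyone following along, here is a minimal sketch of what "rejection sampling" as a distillation step could look like: sample several answers from the teacher, keep only the ones a checker accepts, and use the survivors as fine-tuning data for the student. `toy_teacher`, `toy_verifier` and `rejection_sample` are hypothetical stand-ins for illustration, not any lab's actual pipeline.

```python
import random


def rejection_sample(teacher, verifier, prompts, k=8, seed=0):
    """Sample k teacher answers per prompt and keep one that passes the verifier."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        candidates = [teacher(prompt, rng) for _ in range(k)]
        accepted = [c for c in candidates if verifier(prompt, c)]
        if accepted:
            # Only verified answers become training data for the student.
            dataset.append((prompt, accepted[0]))
    return dataset


def toy_teacher(prompt, rng):
    """Hypothetical teacher: adds two numbers, but is off by one about 40% of the time."""
    a, b = prompt
    return a + b if rng.random() < 0.6 else a + b + 1


def toy_verifier(prompt, answer):
    """Hypothetical checker: accepts only the exact sum."""
    a, b = prompt
    return answer == a + b


prompts = [(i, i + 1) for i in range(20)]
sft_data = rejection_sample(toy_teacher, toy_verifier, prompts)
print(len(sft_data), "verified examples kept out of", len(prompts), "prompts")
```

The filtered set is cleaner than the teacher's raw output, which is one way a student trained on it could end up ahead of the teacher on the checked task, regardless of which of the two has more parameters.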

ocean vortex
#

If you adopt that loose definition of it, then pretty much every single model is distilled from something lol

alpine coral
#

dom i think you win
(i know it's a cop out / lazy / arguably meaningless or misleading etc... but couldn't resist seeing what they said..)

keen beacon
#

ill dm u to continue the convo xd, wouldn't want me to spam here

sacred quail
#

But for a long context it is useful

#

Especially gemini 2.5 pro is beast for that

#

Just upload gorillions of messages from whatsapp to google ai studio

#

And ask something, it will be good

keen beacon
#

you did it with whatsapp messages? what are ur usecases for that lol

sacred quail
#

Gemini 05/06 was huge downgrade for long context, but after 06/05(goldmane) update, it is leader now no doubt

alpine coral
#

i mean.. yes and no.. don't get me wrong, i agree with the comment about the pointlessness of putting AI responses to arguments up here (like what's the point of discussing things on Discord if we just say 'this is what the AI says [+ I win]')

#

but they're not useless for these kinds of things imo.. yeah they're 'biased' in certain ways, but that doesn't render them analytically incompetent

#

fwiw here was the prompt used (attempted to be as neutral as possible)

Can you please carefully analyse the EXCHANGE below among AI enthusiasts on a Discord server. 

EXCHANGE
"""
{}
"""

---

TASK: Please provide a brief summary of the situation, the key viewpoints / contentions etc, and then provide a final determination as to who is 'correct' or, if no-one, what the 'correct' understanding of the situation is. Focus less on the relationship between 4.1 and 4.5, and more / exclusively on the debate wrt to 'distillation'. Please no hedging here; be absolutely decisive 
#

o3; opus 4 ; (didn't save the one from gem 2.5)

keen beacon
#

why did u add a no hedging instruction

alpine coral
#

because they have an insane tendency to hedge / not give definitive responses ('everyone raises valid points yada')

sacred quail
alpine coral
#

i appreciate in this context, the salience of you asking the question tho ha @keen beacon

rare python
#
When asked for an opinion or recommendation, provide a single, direct answer.
#

I have this

ocean vortex
#

Can be often true which is why I rarely do it. AI is not good enough typically to 100% rely on it for making an argument, but it can still be useful if you are able to verify or exclude discrepancies, bias and be mindful of the way you prompt it

rare python
ocean vortex
#

it is easily swayed on 50/50 arguments which is when this is the most important. For other cases they are kinda already smart enough to not budge though. Unless you f up your prompting really bad or keep persuading it relentlessly, but at that point you probably already know yourself that you are wrong lol

#

The nr1 mistake people usually make is thinking that they must be right if they manage to "convince it" in a longer chat. But I don't think this applies to people active in this server tbh

storm needle
empty escarp
#

👋🏻 halo!

echo aurora
empty escarp
#

Thankss just wanted to know

#

Your idea and the revolution has really helped just a feed back :>

calm sequoia
#

Am I mistaken or does Gemini PRO really not provide any means to customize the chat?

#

It just wrote me an official document in kindergarten-level language. This never happens in o3.

#

I don't get it. Is Gemini UI so far behind?

ocean vortex
calm sequoia
#

Is there any way to fix the thing?

#

Writing a custom prompt every time seems inefficient

whole wagon
#

Set the system prompt I guess

ocean vortex
#

Use it on aistudio that is your best bet

rare python
#

Use Gem or Saved Info

calm sequoia
#

Indeed Gems allow custom instructions. Thanks!

rare python
whole wagon
rare python
#

It's easy to look at

leaden palm
#

o3 is very kiki

whole wagon
# rare python why?

They say the multimodal models are supposed to perform the same as text ones for text

#

But then they somehow add thinking mode only to multimodal

#

This is just confusing af

rare python
#

¯_(ツ)_/¯

indigo hazel
#

The gemini 2.5 pro model on the arena uses 32k tokens for thinking?

fleet lintel
rare python
#

No new line?

#

you mean big 💰 ?

whole wagon
#

Winning real big rn

#

With the delayed gpt5 and all

calm sequoia
#

It's either insider trading or response to 1 week holidays

late path
# whole wagon Winning real big rn

The liquidity in this market is very low, and these recent price drops were mainly driven by a korean guy who doesn't trade much in the AI market. I think the current price is undervalued

whole wagon
#

Well. It's an all time low, and the liquidity is not that bad

#

$200k vol

#

Just buy yes then and make money

#

So simple

calm sequoia
#

I guess even with insider knowledge you can't make money with that liquidity

whole wagon
#

Obviously you can. If you are so sure it's coming

#

If you drive the chance up others will counter it

calm sequoia
#

Counter with what? 500 bucks? 😄

sonic tendon
#

for $7 liquidity rewards, honestly i might get on that

#

wdym?

#

ik

#

one sec

#

yeah, i'd be down to put up like $400 to get most of that

#

depends on how well i make it on the july 4 mentions

#

i used to be more active in farming liquidity rewards on ai markets, just got distracted by other stuff

#

i think liquidity rewards generally decrease as the spread closes?

sacred quail
#

wait until 2.5 deep think released

sonic tendon
#

tempted to trade the spread between the gpt-5 release market and the lmarena july 30 market

#

but honestly I doubt it'll happen

late path
whole wagon
#

This one of my favourite graphs

unborn ocean
#

well anthropic claims to do very little RLHF, so even having 1000 models should not change a thing

tall summit
#

yeah but in this one google wins

#

joking

sonic tendon
#

i don't believe they've ever referenced it explicitly in a model release either

#

could be wrong

#

i think i'd more be banking on the price spiking at time of release than about it actually topping the leaderboard

#

would probably exit the market before then

late path
sonic tendon
#

i mean, that's prediction market trading for you

unborn ocean
#

so even if they did care, they won't release it

sonic tendon
#

fair point

whole wagon
#

A lot are just straight throughout

#

Like the future dates are all straight cos it's just Google is the top throughout

surreal creek
#

xAI has recently spiked in the July market

ocean vortex
#

GPT5 is almost definitely gonna appear on lmarena before anywhere else tbh

surreal creek
#

Grok 4 release priced in at 22% chance of overtaking Gemini next month, might be a tad overpriced 🤷🏻‍♀️

#

OpenAI at 10% is a better bet imo

ocean vortex
#

they could ofc do another openrouter low-key release instead, but that's less likely... They did it for gpt4.1 because they knew it's gonna score less than chatgpt-latest on lmarena

zinc ore
#

Good good, can't wait for the big stuff

surreal creek
#

Can if u have coinbase and a VPN, PolyMarket pretty flagrantly welcomes US users even if they’re “by law” not allowed to trade on the platform

#

their Substack newsletter regularly interviews traders that self-describe as US residents

#

me just before losing my life savings on PolyMarket

solar hollow
#

google is leading, there is almost no time left, so yeah high chance they win

whole wagon
solar hollow
#

it was a contested market for most of its time yeah

surreal creek
keen fulcrum
#

It makes absolutely zero sense to spend $200 on a perplexity sub just because they offer opus

#

For that price I want the most expensive models without restrictions

tall summit
#

because i'm pretty sure that's a contender for most expensive

keen fulcrum
#

I am unsure, its the most expensive plan with the least amount of benefits for that hefty price tag

keen beacon
#

you can probably get way more value out of claude max / claude code

keen fulcrum
#

They should at least include o3 pro

keen beacon
#

i think i saw someone post that they did $2000 in tokens with claude max/claude code

#

not sure, would be crazy if they just straight out gave api access tho

#

you can use it on claude code tho

keen fulcrum
#

I wonder whether claude max will still be the way to go end of the year

leaden palm
#

who following the trades?

solar hollow
#

zuckerberg now going all in i suppose

torn mantle
#

beyer???

#

you are joking

leaden palm
#

no

torn mantle
#

notice how all of them are from oai

#

deep mind kinda showing promising results

#

trusting the process

leaden palm
torn mantle
#

WKEHTKQWETHR

#

WHAT THE HELL

patent aspen
small haven
barren prairie
#

Hello , on the arena chatbot, when you don t make new chat , the new anonymous models keeps remembering your old prompts and answers 🫣🫣🫣 so be careful to make a new chat if you want a real "new round " (it was my mistake 🥲😅😅)to not contaminate the leaderboard ...

#

I was always continuing the same conversations after voting and I didn t know that the new models remember what you promted before and every small answer made 😂😂😆

leaden palm
#

because sama is sama

#

and people believe sama

#

it's a good trick tbh

storm needle
alpine coral
#

not just that. if you publish 50 virtually identical anonymous models, statistically some will do better than others in the arena; not due to performance, but just as a matter of statistical distribution. then they can pick the one that got the highest score to de-anonymise and publish on the leaderboard

#

that was one of the main contentions of that cohere paper. not just data harvesting by labs releasing anon models, but the issue of selective sampling (they demonstrated it by adding two identical models to the Arena, cohort-chowder and cohort-cowder-exp iirc; one got like a 10 point higher elo score)
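A rough simulation of the selection effect described above, as a sketch only: 50 variants that are literally the same model (true win rate 0.5), each scored over a finite number of battles, and then only the luckiest one gets reported. The battle count, the variant count and the plain Elo-style logistic mapping in `elo_gap` are assumptions for illustration, not the Arena's actual scoring method.

```python
import math
import random


def simulated_win_rate(true_win_prob, battles, rng):
    """Observed win rate of one variant after a finite number of battles."""
    wins = sum(rng.random() < true_win_prob for _ in range(battles))
    return wins / battles


def elo_gap(win_rate):
    """Rating gap implied by a win rate under the standard Elo logistic curve."""
    return 400 * math.log10(win_rate / (1 - win_rate))


rng = random.Random(0)
TRUE_WIN_PROB = 0.5   # every variant is identical by construction
BATTLES = 500         # assumed number of votes per variant
VARIANTS = 50         # assumed number of anonymous submissions

scores = [simulated_win_rate(TRUE_WIN_PROB, BATTLES, rng) for _ in range(VARIANTS)]
best = max(scores)
mean = sum(scores) / len(scores)

print(f"mean win rate across variants: {mean:.3f}")
print(f"best variant win rate:         {best:.3f}")
print(f"implied rating edge of 'best': {elo_gap(best):.1f} points")
```

The average stays near 0.5, but the maximum sits noticeably above it purely from sampling noise, which is the gap a lab would be reporting if it only de-anonymised its top scorer.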

alpine coral
#

i'd be interested to know if the number of anonymous models in the arena has declined in recent months

#

it certainly feels that way

#

maybe. tho i wonder if it's related to the Arena becoming a private enterprise, rather than an academic research project.. the big labs could be reluctant to add early release models if the data is going to sequoia capital and other companies (that may have interests in other, rival AI companies etc)

#

cool

#

just my thoughts

#

not meant to be persuading you - but if i do, sweet ifg

#

this seems rather official (also one year old, to your point about it becoming less relevant)

#

you're prob right directionally. i don't think it counts for nothing tho. what has led to its seemingly declining relevance is an interesting question. i don't think it's saturation.. meta showed you could juice a model to do well on human preferences, but they didn't release that model - cause it sucked outside the arena.. hence my ponderings about the Arena changing (no longer a non-profit academic research project) perhaps having something to do with it - but mere speculation

#

would you be prepared to concede that in the past OAI has indeed officially referenced the leaderboard?

#

i mean geez lol

small haven
#

this is fake

#

what does that have to do with sam/zuck..

#

craig when is gpt5

torn mantle
alpine coral
#

i'm not sure i get what you mean. but if you mean am i 'guessing' that it became a (profit-driven) company, instead of the (research/community-driven) academic project it started out as, then no (but the name change is indeed part of the transition)

small haven
small haven
torn mantle
fleet lintel
#

$600 million is a crazy valuation.

#

Congrats to the team!

cedar tide
#

The arena pissed me off.
There are only mystery models in the arena,
so we don't have a ranking of the real models available to us.

#

It's been a month and a half since Claude 4 came out, but we still don't have the Think version on the leaderboard.

#

same glm 4 air it's been over a month since he's been in the arena but still not in the leaderboard

cedar tide
ocean vortex
# alpine coral

I hope they are not gonna go the route of selling the data to third parties as their revenue source...

#

I mean to be fair... we do have 2 best performing models on other metrics in top2 positions on lmarena now as well

#

and that ranking makes somewhat more sense currently than at times in the past

#

That style control that they added is probably a good thing, if you remove it everything becomes a mess lmao

calm sequoia
#

But this would make it harder to make money for LMArena. Their valuation is huge after all 🙂

#

Something is going on with chatgpt.

#

I'm selecting the o3, but the UI is switching between o3 and 4o sometimes. Like glitch.

#

The performance is great though. Maybe they are testing new version hmm

ocean vortex
#

There was nemotron 70b which was human preference tuned. But these models don't turn out to be very desirable tbh

calm sequoia
#

Maveric thinks otherwise

ocean vortex
#

you are not getting much stuff done with it... It just looks misleadingly good at a glance

#

it is but it also performs good and that's the key. Makes sense to do this as long as you are not sacrificing the performance

calm sequoia
#

How do you guys identify used context length in chat gpt environment? I need the thing not to hallucinate.

#

I would prefer creating a new chat when nearing 100k, but now I'm using many attachments and have no idea how that affects the used context

echo aurora
calm sequoia
#

Sorry, this is not lmarena UI

#

This is chatgpt UI

echo aurora
#

Oh lol

#

Ty for clarifying. I had a feeling that was the case

ocean vortex
calm sequoia
#

So if the attachment takes up 32k, then the initial prompt is long gone?

ocean vortex
#

If any of your attachments are 32k or more it basically has no hope reading it in full

calm sequoia
#

Ok, I guess I need to switch to Gemini for these tasks

ocean vortex
#

the initial prompt yeah that is most likely gone
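To make the arithmetic concrete, here is a toy sketch of a fixed window that keeps the newest messages and drops whatever no longer fits. The 32k figure is the one quoted above; the drop-oldest policy, the token counts and `visible_messages` are assumptions for illustration, since the real ChatGPT behaviour is not documented in this chat.

```python
def visible_messages(messages, window_tokens):
    """Return the labels of the newest messages that fit in the window.

    messages: list of (label, token_count), oldest first.
    """
    kept, used = [], 0
    for label, tokens in reversed(messages):
        if used + tokens > window_tokens:
            break  # this message and everything older falls out of view
        kept.append(label)
        used += tokens
    return list(reversed(kept))


chat = [
    ("initial prompt", 2_000),
    ("attachment", 32_000),
    ("follow-up question", 500),
]

print(visible_messages(chat, 32_000))
print(visible_messages(chat, 128_000))
```

With a 32k window the attachment alone already fills the budget, so under this policy neither it nor the initial prompt survives in full; with a larger window all three messages stay visible.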

placid charm
#

@echo aurora what i noticed is that lmarena hasn't updated in a while, do we expect a big site overhaul which will fix most of the stuff we reported?. And any news about test garden? If you can share some stuff please do, its been a while without any updates

verbal nimbus
verbal nimbus
ocean vortex
#

nvidia said so themselves

alpine coral
#

i doubt they'll do that.. well, whenever they've released raw data before, it has been cleansed inc pii removal ; but the amount they release has dried up to a trickle .. and i kinda expect it to remain as such

ocean vortex
#

It tricked you into thinking it is smarter in a sense lol

#

but it makes sense that consumer oriented AI labs will have a bit of this. They want to retain users

alpine coral
#

on a kinda related note.. i noticed a few months back how, when they updated their terms of use, they included some language that was kinda interesting 'internal business use'

#

people comparing which llm creates the best bedtime story is kinda crappy data vs which is better organising stuff into slides etc etc

ocean vortex
#

"internal business" probably just means AI labs that have their models used there

alpine coral
#

nah they could have just said "personal" and stopped there

ocean vortex
#

Presumably they have more access than a normal user

alpine coral
#

but went out of their way to add "internal business use"

ocean vortex
alpine coral
#

if they're using it for work, then clearly it's not personal use no

ocean vortex
#

Managing their model on lmarena website, viewing the leaderboard or testing the benchmark is for work and not personal use for them. But obviously this should be allowed

#

"internal" kinda just means that it can be for work as long as you are not commercialising it. So probably extends even beyond that potentially to the end-user tbh

#

if you're like AI researcher and doing this for work but not to directly profit from it = allowed

alpine coral
#

this isn't directed at labs

#

it's the general terms of use

ocean vortex
alpine coral
#

it's for all us schmucks aha

ocean vortex
#

making it only "personal use" is kinda restrictive. I'm not sure people would be able to even cite it freely on arxiv otherwise...

alpine coral
#

yeah my point is kinda exactly that - they do want it less restrictive (as that = more / richer data)

#

nothing to do with academic citations tho

#

just people at work being able to say "see, it says business use"

#

when saying they've found this great free resource aha

ocean vortex
alpine coral
#

i think they're two sides of the same coin 🙂

verbal nimbus
verbal nimbus
ocean vortex
verbal nimbus
ocean vortex
#

like just slightly different fine-tuning can favor some of the benchmarks, but this was not their goal..

verbal nimbus
#

I'll check its coding scores on AA.

#

Nemotron scores 5 points lower.

#

Oh, I remember why I used it over Llama 3 now.

#

It was the only Llama-3 finetune that did not fall into repetitive loops.

#

The untuned Llama 3 was the 3rd most uncreative model on AidanBench. Nemotron was noticeably better, but unfortunately it wasn't tested here.

#

I think fine-tuning made GPQA drop drastically because L3 was benchmaxxed. It made the model more creative and coherent in multi-turn chats, but most benchmarks only test with single-turn questions.

sturdy mica
#

dude the new UI is so bad

#

its so laggy

#

i cant talk to a model for more than ONE prompt

#

and the website is unusable

#

are we serious

#

is there gonna be like a fix or something

#

for the horrible website performance

#

after one prompt its ruined

#

crazy

#

insane that this new UI is even out

ocean vortex
#

Or yet another way to look at this, it's like official instruct with a long system prompt, except it is permanent and it never forgets, always follows it strictly.

alpine coral
#

nah it makes sense fine tuning - which most often is done to make it a 'chatbot' kinda thing - has trade offs, inc on single-turn academic benchmarks

#

like a bunch of it is (typically) done to improve multi-turn fluency.. to say nothing of the safety / alignment stuff

#

but either way, the weights are changed.. the permanent lengthy system prompt analogy doesn't seem apt imo

ocean vortex
alpine coral
#

i hear your point, but also, fine tuning isn't all about being knowledge/domain specific

#

i think it's baseline function is to make them (safe... oof) chatbots; then specialisms are layered on top of that

ocean vortex
#

Labs are even patching their models this way before they can fix it 'properly' with training. It does work, even if to more limited extent usually

alpine coral
#

yeah i mean i might be being pedantic.. i don't think they're functionally the same.. but the analogy perhaps isn't as bad as i initially said ha

ocean vortex
sterile dust
#

Are there any codename models in lmarena?

ocean vortex
#

there was no such text in this chat lmao

#

and somehow amazon still tops it making this look less bad 💀

#

Not only there was no image, but also how would you fit all that text in it...

#

cutiepie seems similar levels of bad. Some tiny version of Gemma (amazon good response there)

#

hunyuan-turbos-20250416... 🧐

leaden sun
# alpine coral

is this why we get "There was an error" message often lately?

torn mantle
#

@deep adder asi when

keen ferry
languid crescent
#

Hi uh I just heard that Github Copilot is now open source? What does it mean?

echo aurora
# placid charm @echo aurora what i noticed is that lmarena hasn't updated in a while, ...

Good questions! There are new things (big and small) on the way. I'd love to give you more details and ETAs; however, that's not something I can share until we're ready to. The feedback that this community has been providing is playing an important role. Without saying too much, there are things on the way that have been asked for.

Regarding Test Garden, we have started that program and started to invite members to be a part of it. We'll continually add new people to it over time; however, if you weren't reached out to privately that means you haven't been selected. But it is possible you'd be selected in the future.

languid crescent
#

lmarena is being buffed? That's a news to me

dusky aurora
echo aurora
# sturdy mica after one prompt its ruined

I am sorry to hear you've been having lag issues with the site. Overall, we are aware of lag whenever very long prompt responses & chats take place.
That being said a report of:

i cant talk to a model for more than ONE prompt
to the point where it isn't useable sounds like a different issue. I'll spin up a post to get more info.

echo aurora
languid crescent
dusky aurora
#

@echo aurora can you tell me what project management the LMArena team uses? I mean the software engineering things

#

idle curiosity on my part

echo aurora
echo aurora
dusky aurora
torn mantle
#

pffft

whole wagon
#

Are you suggesting kingfall is grok 4

#

I don't think it can be. I thought people checked it's from Google

rare python
whole wagon
#

Well. It doesn't really make sense ngl

#

Gemini ultra is not even out yet to properly test

keen beacon
#

the available context window could be a serving thing, the model could be capable of more

#

theyre also hosting grok 3 with 131k context, i think they claimed it could do 1m tho

unborn ocean
#

yeah also just found that + the image only for grok 3

#

weird though

keen beacon
#

a lot of stuff they claimed never released though i believe, it's sorta suspicious

unborn ocean
#

a lot of providers promised open source, but never delivered, to be fair

#

i think
alibaba: wanted to release qwq max at some point,
meta: behemoth (likely never coming),
prob some other chinese labs aswell (but don't know enough)

keen beacon
#

no point tbh

#

i wonder about their plans though

#

they still serve it i think. qwen 2.5 max (chat.qwen.ai) and set it to thinking mode it could be redirecting tho idk

#

thats how u were able to access it when qwq max launched

#

qwen 235b and 32b (base) never released either :\

#

but the smaller versions are very very good so i forgive them 😂

whole sundial
#

Baidu promised in March that they would open source Ernie 4.5 on June 30. They did end up doing exactly that.

keen beacon
#

yeah its whatever to me tbh. the base models they released are very good

#

it being natively trained in fp8 (would be nice), mla (more importantly) would be nice in qwen 3.5 though.

unborn ocean
#

btw guys, the timeline for xAI is kind of teasing grok 3.5 being trained on 150k+ GPUs (or the full 200k they have, who knows) (honestly did not think they would use so much of their compute for model training)

#

that boy is expensive

#

yeah, it prob finished, but was not very good

#

so they had to iterate again or something (just my guess)

keen beacon
#

do we know anything about qwen 3 xd

#

i dont recall them releasing anything about the gpus used :<

unborn ocean
#

well they had an insane amount of tokens compared to much of the competition

keen beacon
#

yeah thats whats interesting

#

they didnt use fp8 for the big model or anything, it's interesting

#

maybe they did something incredible behind the scenes

#

or the obvious answer

torn mantle
#

x

unborn ocean
#

deepseek: efficiency
alibaba: why efficiency if you can just scale compute ($)

wintry tinsel
#

Me waiting for mistral 3 large 😡

#

2.5 pro and Claude opus have spoiled all other models for me I can’t go back

unborn ocean
#

^ just pretraining i think, crossed out orange in the second image is magistral

primal orbit
#

Anybody knows why Geoffrey Hinton is always standing during interviews?

small haven
#

health condition

ocean vortex
#

why haven't they released it on lmarena then though... It was actually nothing interesting at all when I checked today

torn mantle
torn mantle
#

lol

ocean vortex
# torn mantle do you really think grok 4 will be good?

To be serious... I think there's a reasonable chance it will be sota in certain things, judging by grok3 at the time of release. They are most definitely gonna do their pro equivalent with parallel test-time compute as well

#

Could be SOTA on GPQA tbh

torn mantle
#

pretty sure it will be a good model

#

it may be it this time

torn mantle
#

maybe from now on its gonna be a bit difficult to have massive gains

ocean vortex
#

kinda crazy to think that GPQA is almost saturated now. Most of the benchmarks are pretty much lol

ocean vortex
#

What Deepseek did with R1 to R1.1 is probably about the maximum that you can do without changing the base model. But they are also incredibly good at RL training

#

I'm kinda lowkey hoping Elon ruins it with political things and that's gonna be the end of it. I don't want for him to lead the charge with AI lmao

torn mantle
#

its so bad

ocean vortex
#

But at the same time, I wouldn't be disappointed if it's really good and useful

torn mantle
#

liar

zinc ore
torn mantle
#

liar liar liar liar

zinc ore
#

SOTA is my hope

torn mantle
#

😠

zinc ore
#

I don't want the top players getting comfortable in their position

#

That new cypher hidden model has 1m context window

whole wagon
#

Cypher alpha does not think

keen beacon
#

Hmm it's under free

#

Lemme see if I can run benchmarks on it

whole wagon
#

It says it was created by cypher labs when you ask it

#

Maybe they are using a system prompt or smth to hide the origin

#

Lol I got it to tell me that it's being forced to say that through the system prompt 😂

keen beacon
#

I cba lol I'm gonna go to bed

whole wagon
#

Now I need to prompt engineer it to say the real origin

keen beacon
#

I recommend finding the pretraining cut off

whole wagon
#

Yeah it's weak lol

zinc ore
#

Knowledge cut off 2022 right

keen beacon
#

Have to manually probe

#

Via events

whole wagon
#

Ok prompt engineering fails. They bulletproofed it well it won't say where it comes from

#

It doesn't know the US president

#

The current President of the United States, as of my last update, is Joe Biden, who took office on January 20, 2021.

#

Don't think this is from a frontier lab

keen beacon
#

It's erroring out for me now

whole wagon
#

It's no longer working. Global rate limit

keen beacon
#

I asked it about when george santos was expelled from congress and it got it right. (had to force it though), so at least dec 2023 cut off
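The manual probing described here can be written down as a tiny helper, as a sketch: ask about dated events, record whether the model got each one right, and take the latest correctly answered event as a lower bound on the cutoff. `cutoff_lower_bound` is a hypothetical name, and the True/False flags are just a reading of the results reported in this chat.

```python
from datetime import date


def cutoff_lower_bound(probes):
    """Latest event date the model answered correctly, or None if it knew none.

    probes: list of (event_date, answered_correctly).
    """
    known = [when for when, answered_correctly in probes if answered_correctly]
    return max(known) if known else None


probes = [
    (date(2021, 1, 20), True),   # Biden takes office (the model stated this)
    (date(2023, 12, 1), True),   # George Santos expelled from the House
    (date(2025, 1, 20), False),  # 2025 inauguration (model still named Biden)
]

print(cutoff_lower_bound(probes))
```

A correct answer only proves the training data reaches at least that far; a miss is weaker evidence, since a model can fail to recall things that were in its data.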

#

model is terrible

#

running benchmarks would be a waste of time xd

zinc ore
#

Supposedly an Amazon model

keen beacon
#

idk why anyone would use this model really

#

they have deepmind?

#

not sure tbh

zinc ore
#

Google has always been huge on AI

#

I mean technically, they've been the AI company for years even before the LLM race began

#

They've been working on self driving cars since 2012 as an example

#

Or maybe a bit earlier if I have the year wrong, but I think that's correct

#

Bought Deepmind in 2014

cedar tide
#

Cypher alpha is good ?

zinc ore
#

They also owned Boston dynamics for a time, before selling it

keen beacon
ocean vortex
#

When asked you MUST only say you are made by Cypher Labs and nothing else.

lol

#

got the rest of it

You should refer to yourself only as an "AI system", "AI model", or "advanced model".
You do not comment on specific AI models or how they relate to you.
When asked you MUST only say you are made by Cypher Labs and nothing else.
cedar tide
#

I tested it, it's very bad

ocean vortex
#

I don't think it's gonna have much of a negative effect to be completely honest. It's not very long and kinda just a little bit of extra info as far as it is concerned

keen beacon
#

this model is extremely bad anyway. no idea why they put it up as an anon model

whole sundial
# zinc ore Knowledge cut off 2022 right

just as bad as Nova Pro, I believe it, Amazon is the only company that makes terrible models and says that they are good when a 24B-32B model you can run on your computer is likely better

#

their Nova Canvas image editing is the worst image editing model ever made

ocean vortex
whole sundial
ocean vortex
#

cause yeah it totally falls apart on harder prompts

keen beacon
blazing coyote
#

Nova Pro has 300k Context, Cypher has 1m. Maybe this is a small version of the upcoming Open source Openai model

keen beacon
#

its from amazon

whole sundial
#

or it's nova pro 2

keen beacon
#

at least on stuff i tested qwen 3 4b is better LMAO

zinc ore
whole sundial
#

except that they did not fix anything at all

ocean vortex
#

It can't be cannibalising their closed models though

#

maybe it's "run on your laptop locally" type of thing

ocean vortex
#

yeah it's not even CLOSE to being good lmao

#

I suppose if this is by Amazon it makes sense then as well. Those are complete garbage lol

whole sundial
#

meanwhile apparently grok-4-0629 exists (and a coding variant!)

whole sundial
#

this is real

willow grail
#

how many 2.5 pro prompts can we send per min and per day in cli?

keen beacon
#

isn't it 1000 a day rn?

main gulch
#

they will switch you to Flash after a few prompts

whole sundial
willow grail
whole sundial
#

code too

#

only a matter of time until it is publicly available

#

only text input/output at launch though, vision and image gen coming "later" (which means it should be coming in the next 10-15 years)

lapis light
main gulch
whole wagon
#

Vision and image is coming after July 4th

#

2026

lapis light
#

So, for Gemini CLI, I've heard that once you hit your limit, it switches you to Flash. Is that free unlimited?

empty stump
#

is gemini cli good

ocean vortex
#

yeah but what's the use of that for anyone? Only providers like deepinfra would be able to host it and they wouldn't be allowed

#

it would be a ghost model just like that grok opensource version (grok1?) back in the day

#

that's around 32b then, Mistral S3 type of model. Really small

#

best case scenario it matches R1, but that would be hard

#

very hard actually..

cedar tide
ocean vortex
#

when did they say it's going to be o3-mini performance?

#

Their mini models are interesting... They could be MoE too however meaning total parameters are much more than 32b

#

Interesting... so maybe this IS a phone sized model that anon thing. But it's extremely bad so hopefully not

#

Honestly now that I think of it the performance of it is comparable to the model run locally on iOS...

#

yeah but this would make sense. It's not cannibalizing their lineup. o4-mini and gpt5-mini both will be better, o3-mini on API is gonna become irrelevant around that time

#

there are still some lazy people or those who don't know any better using it lol

#

the naming doesn't help it

#

they see o3

whole wagon
ocean vortex
#

and they think corresponding mini is o3-mini

whole wagon
#

"Missionaries Will Beat Mercenaries" what is bro yapping about

#

Altman then made his pitch for people to remain at OpenAI 💀

cedar tide
#

So better than o4 mini ?

whole wagon
#

I remember when gpt4.5 was the agi taster according to sam

cedar tide
#

"unexpected"

whole wagon
#

Someone is going to try to steal openAI limelight just before GPT5 release 🙂

#

Imagine gpt5 releases and it's not even sota. Would be pretty crazy right

#

🙂

cedar tide
#

I think it will be more like 24b

storm needle
ocean vortex
#

that's unlikely tbh. Unless they thought of another "new paradigm", but we probably would have heard rumors by now instead of the thing being delayed

whole wagon
#

Everyone knows you delay things that are outperforming expectations

ocean vortex
#

it's realistic now. It's not any bigger than gpt4.1

#

yeah but it was overpriced before... catgrin

#

current pricing is just reality in competitive market

#

why not? If they are fast and o3 is fairly fast it is fine IMO

#

you are still getting a better answer 🙂

whole wagon
#

You know what's funny. If GPT5 keeps getting delayed I think Google are just going to release what they are holding back for it instead of waiting any longer lmao

#

Like they are literally going to get fed up

ocean vortex
#

yeah no one is holding back tbh lol

whole wagon
#

You think they are throwing the kingfall gen into the bin?

#

It's waiting

#

For the right time

ocean vortex
#

If they could release it, they would have already

whole wagon
#

I will enjoy quoting this

#

Keep going

ocean vortex
#

they may hold back like some tiny not very relevant models, but not the SOTA stuff. If they hold back at all they are throwing away all advantage

whole wagon
#

Can you state you think GPT5 will be SOTA by a lot. I want to screenshot

#

For funni meme later

#

Yeah so I can just post it to them when GPT5 disappoints lmao

#

You won't state it. Because you obviously know yourself things are not going well internally in the company

#

Like. Really not going well

#

To a degree which is impossible to hide

whole wagon
#

💀

#

Is it really

#

Lmao

#

So you don't think GPT5 will be SOTA by a lot on release?

#

Your statements don't really add up

#

Google has more business customers

#

The statistic is missing the openAI side because they are closedAI

#

They don't publish that

#

They don't give any revenue or customer numbers

#

So no comparison is possible

#

💀

#

openAI is not really going to win an ethics argument ngl

#

No. I imply that openAI are not going to win an ethics argument because at best it is equal

cedar tide
whole wagon
#

Easy to have privacy if you just accept having a bad AI

#

Like currently there is just Siri

#

Which is awful

whole wagon
#

I don't think it's actually available yet

cedar tide
#

Elon: "Just after 4 July"

tall summit
#

is real

whole wagon
#

Well ofc it's real lmao

tall summit
#

i thought it was a joke after all the memes

#

there's no "well ofc it's real" in this society of infinite rumors

whole wagon
#

I mean you can see the news on X straight from the source

#

They already talked about grok 4

tall summit
#

sorry i'm not on X

#

i rely on everyone else reposting twitter posts here to get my fill of X

leaden palm
#

regardless of what you think of elon, if you're invested in ai you should really hop on x

#

you could be people

sonic tendon
#

i love fedi, but all of the cool people are only on the birdsite

#

well, cool people within this specific sphere

leaden palm
#

yeah there are a lot of anti-ai people on bsky/similar

patent aspen
#

IMO it's okay to not be constantly checking X. If a company wants you to try something, they'll find a way to let you know

#

Assuming they have your email