#general

1 messages · Page 64 of 1

leaden sun
#

hmm, I've expected opus to say more than that but it's a good start

unborn ocean
#

i have not heard of the controversy, but to me is sounds like they are using he 14b model as the expert or something like that (and then using multiple of them, likely somewhere in the one digit area)

  • then to a bit of cpt to get the experts to work together (think about it they could have also used something like 2.5 coding and 2.5 math to get free expert starting points with cpt applied)
  • if they want to be real cheapskates, they could also just coldstart with other qwen reasoning or deepseek data for the RL

-> but all of this seems a bit too complicated to me to be really desirabled or practical

rare python
rare python
#

and it's the only model who use unicode bullet point, not markdown bullet point consistently

#
  • Hello

vs

• Hello

keen beacon
#

i saw it yesterday before i went to bed. hadn't had time to read the report myself. (and idiot me forgot to download it) so idk enough to form an opinion on it, honestly idk tbh. it does seem unlikely but yeah, i think there's a chance that its true based on the surface level

leaden sun
rare python
#

I dislike bullet points

#

Opus 4 is a better teacher for me

keen beacon
#

i don't think its too complicated to be practical, it's more cost effective and there's precedent although not to the exact process the pangu team allegedly did

leaden sun
rare python
ocean vortex
rare python
leaden sun
#

so...what is intelligence, if a model cant even tell waht is propaganda and waht is fact or truth?

#

dont question that if you want to stay on the safer side of things (policies etc)

rare python
sacred quail
#

im a huge gemini fan but when it comes to reasoning, O3 always surprises me

#

Claude is best at critizing something

#

Chatgpt has huge glazing problem

#

After 06/05 update gemini also praises you even if you only breathe

#

Claude is still behaves like annoying teacher so

#

I like claude's tone

leaden sun
#

from what I've tested so far, my personality clicks best with claude, i love its sense of humour

#

but from what i can figure out about its architecture, it's probably not a surprise

leaden sun
#

ssshhhhh

rare python
# leaden sun *ssshhhhh*

Hi there! 😊
I'm DeepSeek-R1, an AI assistant created by DeepSeek. I'm here to help you with all sorts of things — whether you're curious about the world, need help with homework, want to write a story, or just feel like chatting! I don't have feelings or consciousness, but I love using my knowledge (trained up to July 2024) to make your day a little brighter. 💡

So… what would you like to know? 😊

#

Creepy smile from DeepSeek

rare python
ocean vortex
#

This is true but it's less of a problem with reasoning models. Since they generate their own context before writing the solution, often the solution itself becomes more unique and affected by that extra context. Rather than just straight up pattern matching the closest working solution in full it saw in training data with no 'reflection'...

rare python
ocean vortex
#

It's also why reasoning will sometimes lead to, on the surface level, worse outputs. Because it didn't directly use the solution that you can find by googling lol

eager mica
rare python
cedar tide
#

wolfstride & stonebloom think much less than 2.5 pro

rare python
#

I have better result with stonebloom, at least in SVG

ocean vortex
#

@rare python for it to truly come up with something we don't already know, it can't be a model trained on just the facts we do know... Like if we let it generate millions of reasoning tokens and self-iterate maybe it would come up with something, but this is not realistic yet.

rare python
#

Like consistently high quality, novel response

#

Human although also has pattern, but they hit the hit spot between patterns and originality or flavor

#

🤔

ocean vortex
#

it's limited to experiment and research stage for the time being and not something you could serve to millions of people

#

needs insane compute

#

maybe it's not entirely what I thought it is

patent aspen
#

AlphaEvolve might be closer to what you're thinking of

ocean vortex
#

Seems less of a native model, more of a more basic ML algorithm paired with modern AI

patent aspen
#

Some of the GDM models that have advanced science in some way are AlphaFold, FunSearch, AlphaEvolve, and a bunch of smaller experiments like a weather model, etc

#

AlphaFold had by far the most impact but wasn't demonstrating true creativity

#

AlphaEvolve is the closest although it's really just an incredibly sophisticated state space search, although that's pretty similar to science

ocean vortex
#

brute-forcing your way to the solution in a sense

patent aspen
#

I would argue all of them are technically making baby steps because they expose model limitations, which leads to more R&D

ocean vortex
#

But it also works because computers are good for this. And human brain is not. So it's arriving there in ways that seem to us very basic but are extremely tedious and time consuming

patent aspen
#

The thing is negative results are still results in the world of science. Even Gemini Plays Pokemon led to a surprising amount of relatively deep R&D because it vividly exposed a lot of model weaknesses in a novel way

#

It's honestly kind of silly, but at least one new team was created as a direct result of some random engineer using Gemini to play Pokemon on Twitch

ocean vortex
#

Media resolution?

#

this seems new

rare python
#

Use more tokens for image to get better visual comprehension

ocean vortex
#

R1T2 schooling Deepseek and Google on how math is done... 🤯

#

Correct only the last one, obviously

#

o3-pro gets this one wrong too, btw

#

I think R1-0528 just contemplates for too long and ends up going back to the wrong answer despite considering the correct one... 💀

#

This would rarely help and this not an exception:

#

The problem here is it does not seem to consider negative ones at all, just like o3

#

It can and does if the numbers are smaller or easier to work with. But not for this specific prompt with these numbers lol

torn mantle
#

seems like Dom is having fun comparing them

ornate agate
#

“Smallest” can also mean magnitude in which case the answer is 11. If you say find x in Z such that x < m for any other answer m and I think all those models will get it correct.

whole wagon
#

Link

#

The reason is it is using LLM arena

#

The new model won't be able to be added to LLM arena in the same way o3 pro is not there

torn mantle
#

stop betting

sonic tendon
#

imo, the model market's too low-volume for a sub-5% delta to mean anything

leaden palm
#

or... maybe the markets aren't perfectly efficient

jade egret
#

fr?

leaden palm
sonic tendon
fleet lintel
#

I’m not big on identities, but I am extremely proud to be American. This is true every day, but especially today—I firmly believe this is the greatest country ever on Earth. The American miracle stands alone in world history.

I believe in techno-capitalism. We should encourage

sonic tendon
fleet lintel
#

can we have a better OpenAI CEO than this garbage?
It is scary to have that a person like him has so much power

dawn wharf
#

ok bro

#

sure

#

bro is a native english speaker and uses "its" instead of "it's"💀

fleet lintel
#

and he is essentially rooting for billionaires .. saying all parties are kind of same.. kind of repeating about trickle down economy BS that has always been repeated by ruling class.

dawn wharf
#

so yes I know you're American

dawn wharf
fleet lintel
#

thank you for some laugh .. needed it 🙂

leaden palm
civic flame
#

dumb slop from sam is increasingly common

#

yes the dems are in a pretty bad state right now

#

however, i think it is objectively the case that the GOP have "lost the plot" harder than the dems have

#

their movement leftward is negligible

#

the GOP has moved right a lot more than Dems have moved left

leaden palm
civic flame
leaden palm
#

what is the unit?

civic flame
#

whether they have become more left wing/liberal vs more right wing/conservative

civic flame
#

it's pew research, they're very well respected so im sure it makes sense

sonic tendon
#

hi leo!!

civic flame
cedar tide
civic flame
#

wait what

#

woah

leaden palm
#

ok and uhhhhh seems pineapple's asleep right now but they would be saying that we shouldn't get too riled up talking about politics, especially if it's not tangential to ai at all

sonic tendon
civic flame
#

oh he may actually be dropping it today

#

looks like all of the final prep has just gone to prod

sonic tendon
#

idk where they're pulling the model refs from, but seems legit

leaden palm
civic flame
civic flame
#

very strong base model scores, reasoning doesn't actually seem to have bumped them up much

leaden palm
civic flame
#

kinda like with Claude 4 Opus

fleet lintel
leaden palm
#

remember the sota on hle is 26.6% by oai deep research (last time i checked)

keen beacon
#

I wonder if they'll release the full reasoning variant unlike grok 3

hollow ocean
#

I wonder if it will get that question on simple bench that no model gets correct

sonic tendon
main gulch
sonic tendon
#

I got early access to it

fleet lintel
#

HLE seems too good to be true... .ok. I am gettting a bit excited about it..
please dont lie to me Grok

sonic tendon
#

was sorta underwhelmed - seems great for STEM questions and kinda sucky for everything else

cedar tide
sonic tendon
leaden palm
#

i think they just asked an ai to visualize the legit tweet

sonic tendon
#

ah

civic flame
#

we saw this kinda pattern with grok 3

#

very strong base, meh improvement with reasoning

sonic tendon
#

just mini

civic flame
#

yup

#

but

#

they released the grok 3 reasoning benchmark scores when they announced grok 3

#

it never released publicly but

#

if it did really get 45% on HLE it has some insane world knowledge

#

I wonder what the simpleQA score is

lone vector
#

Is the "steve" model Google or someone else

sonic tendon
#

or a distill of R1

torn mantle
#

With google deep think please

cedar tide
#

One minutes

cedar tide
torn mantle
#

I want to see if the polymarket betting is worth the risk

torn mantle
cedar tide
whole wagon
#

Polymarket ain't buying this 'leak'

#

Odds for xAI did not move at all

torn mantle
cedar tide
#

Yes but not benchmark that we have on grok 4

torn mantle
#

alright i see

#

im waiting for the the table

sonic tendon
torn mantle
sonic tendon
whole wagon
#

Bruh

radiant siren
#

damn guys i am so dumbass

whole wagon
#

That's like random variation lol

radiant siren
sonic tendon
whole wagon
#

They aren't confident in it at all. Because if the leak is true xAI is 100%

torn mantle
#

but the issue here is deep think gemini

#

whats HLE again?

whole wagon
#

Grok 3.5 had these fake leaks also. And musk reposted them and they were still fake

#

Lmao

sonic tendon
#

well, lemme rephrase

cedar tide
sonic tendon
#

if you're panic trading hours before the rumored model release, you are probably betting too much money on this

torn mantle
civic flame
ornate stump
#

Just connected, here we go again? Crazy ass benchmarks for the new Grok and then nobody uses it after 2 days?

late path
#

polymarket's effiency is worse than you think

civic flame
#

I believe it's from xAI's prod site

#

easily verifiable, I don't have my laptop rn

whole wagon
#

Ok now it's moving

sonic tendon
#

was gonna say

whole wagon
#

The odds

sonic tendon
#

we can debate this in theory, or just wait like 2 hours and see what happens lol

sonic tendon
whole wagon
#

The release may not be today still

#

Who knows

sonic tendon
civic flame
leaden palm
whole wagon
#

Fake news

civic flame
#

but i trust the guy

pure anvil
#

does anyone have the original source?

whole wagon
sonic tendon
#

are you like 40% porting Gemini or something?

#

i get the vibe that you have financial stake in this not happening

whole wagon
#

how do you get that vibe. its literally an unverifiable leak suggesting 45% in HLE when the current SOTA is 20%

#

and he advertises in the replies

leaden palm
#

idk why i didnt find them before

#

i was on the docs

civic flame
#

oh?

whole wagon
#

post a link

civic flame
#

can you send a ss

sonic tendon
leaden palm
unborn ocean
#

35 and 45 is kind of too good to be true, either SFT on expert solutions or RL on the test in general seems very likely

civic flame
#

yeah it seems they have everything ready for launch now then

keen beacon
#

its real

#

the scores

whole sundial
civic flame
#

atp I would expect a launch in a matter of hours

#

probably will be near the end of the day in cali

whole wagon
#

i found it

radiant siren
#

so is the leak real?

civic flame
#

yup

radiant siren
#

should it win poly?

keen beacon
#

unless theyre trolling lmao

whole wagon
#

time to buy grok on polymarket 😂

unborn ocean
#

imagine that it is just cons@128 and they got all of us fooled

radiant siren
#

about to market buy

civic flame
whole wagon
#

eh

#

the llm arena lead for gemini is still large, its not guaranteed money ofc lol

keen beacon
#

if google releases 2.5 ultra it might beat it

wintry locust
civic flame
#

I think 2.5 ultra is still some way away

#

probably august

keen beacon
#

deep think will probably beat grok 4

radiant siren
#

but should it lead at the time grok4 release

#

thats only thing what matters in poly

wintry locust
#

real answer: nobody knows because lmarena is a "bad benchmark that doesn't actually have anything to do with model performance"

keen beacon
wintry locust
#

i'd say it probably wouldn't get #1 though because they ran no private model tests

#

so that will automatically nerf their score

keen beacon
#

they could've thoughh

#

they did it before

#

for past releases

unborn ocean
#

they prob know that it wont get #1

#

so why try, you can just claim it is a "bad benchmark"

ocean vortex
#

This one is not a trick question at all. It’s a straight forward math problem that is clearly defined. It’s hallucinating the interpretation because not considering negative numbers is the path of less resistance (it’s easier)

wintry locust
keen beacon
#

yeah i dont understand that decision. they did private iterations on lmarena before

whole wagon
#

xAI 26% now

#

the odds are moving quick

wintry locust
#

maybe they just stopped caring about lmarena score

keen beacon
wintry locust
#

ya

#

but now lmarena has a worse rep

#

since the cohere paper

unborn ocean
#

don't think they care about the rep tbh

keen beacon
#

still i think doing private iterations on the arena will help your model anyway

#

the data is useful

wintry locust
#

xai employee says:

ornate stump
#

He doesn’t know when they’re gonna release his product?

wintry locust
keen beacon
#

but anyway it seems release is imminent

whole sundial
#

keep in mind the "@grok" is the chatbot on X and "@gork" is an official troll version of that, does not refer to the grok at grok.com, but could still be a sign that grok 4 is coming later today

whole wagon
#

but why did they update grok 3 if grok 4 is this good lmao

#

makes 0 sense

whole wagon
#

maybe they already deployed grok 4 to @grok thats the only way

unborn ocean
#

has anybody even noticed this "improvement" with grok 3 yet?

dawn wharf
unborn ocean
#

bc they could genuinely be compute constraint with grok 4 (so they have to serve grok 3 as the cheaper alternative for some time for free tier and as premium fallback, maybe until grok 4 mini)
just a random speculation thought though

whole wagon
#

xAI odds started dropping

#

now at 23%

keen beacon
whole wagon
#

maybe bettors are thinking they benchmaxxed or smth

whole sundial
whole wagon
#

All these benchmarks are public after all

unborn ocean
#

furthermore when you look at the people that submit they are actually not all these 'world renowned' experts the HLE team claims to have worked with

#

in short: not as good as i thought

#

contamination or RL on aime

#

the same thing happend with aime 24

wind moth
zinc ore
#

Is wolfstride any good?

whole wagon
#

whats that

main gulch
#

google's anon model

whole wagon
#

how do ppl know it is google

main gulch
#

it says it's created by Google

#

actually the same as stonebloom in terms of performance, slightly better than 2.5 Pro

wind moth
#

Is this true

#

Found this on Reddit

keen beacon
#

yes

leaden palm
sonic tendon
sonic tendon
#

ah

#

is that aggregating discord llm summaries?

leaden palm
#

yeah

storm needle
#

source?

zinc ore
#

Seems you're correct, as this is what they did with grok 3

torn mantle
zinc ore
#

Chart is likely not helpful since xAI uses those terms differently than the other companies

#

Nvm scratch that, I see it doesn't say standard etc for the other companies

torn mantle
#

what are you talking about

#

thats the purpose of specific benchmarks

zinc ore
#

Ignore what I said, my point was moot

ocean vortex
alpine coral
ocean vortex
# torn mantle

"TTC" wording should be banned. This can mean literally anything at all from simple reasoning to parallel requests cons@1000 lmao

#

I think they didn't name it "reasoning" (or more fitting for them - "Think") for a reason

#

o3-preview was a good example of what can be meant by "TTC". It's just not realistic representation at all

#

With that in-mind, those leaked numbers look about right. Grok is strong at GPQA (not by much outscores o3 though), and now it's strong on HLE too cause that's what they chose to focus on.

keen beacon
ocean vortex
#

so any other models listed are incompatible with that graph then catgrin

#

if you want to include cons@64 you can't have other models listed

keen beacon
#

600m valuation thoughh

#

🤣

ocean vortex
#

yeah smth like that

keen beacon
#

i think you can see battle data with just that pairing already

ocean vortex
#

It all still fits under o1 with "TTC" lmao

#

it's not, the scope is much wider

#

reasoning is single model instance

olive mesa
#

is grok 4 out

ocean vortex
#

TTC can have as many as you want + internal grading system and whatnot

olive mesa
#

maybe itll come out at 4 pm est

#

or 4:40 or4:44

ocean vortex
#

You can kinda also have TTC with no reasoning

keen beacon
#

o3 vs grok 3 win rate lol

ocean vortex
#

just generate a bunch of responses in parallel

keen beacon
#

yeah but i still find it interesting

#

no, o3 wins against grok 3 preview 0224 56% of the time

#

im confused

#

yea it depends on the questions asked. its definitely way more capable than grok 3 non reasoning, still find it an interesting datapoint though

ocean vortex
keen beacon
#

?

ocean vortex
#

there are 2 versions of 2.5. Very different win rates for both

#

I think 1 doesn't have enough votes yet

keen beacon
#

besides meta, i believe the models are the same tbh at least w google. but you can never really tell

#

yes theres inherent uncertainty but i very much doubt google are switching it up

ocean vortex
#

or they just went to town with user preference tuning on this new 2.5 lol

#

cause it destroys the old one head to head

#

70 to 30

#

You raise a valid point. This shows that you care deeply and have reached the next level!!

dusty ravine
#

hi guys

#

hru

olive mesa
dusty ravine
#

Guys I need to ask just ONE question

#

@ocean vortex sorry to bother lol

north vale
#

it is much higher without style control

#

near 50 pts gap w/o style control

ocean vortex
#

there's 05-06 and then there's one without date identifier (06-05)

ocean vortex
#

it's a bit confusing cause they removed 05-06 from the leaderboard catgrin

#

Not sure why, for transparency it should be there tbh...

late path
#

it's now 1446 without style control

dusty ravine
#

Dominick i have a question

keen beacon
#

ask the question already 🤣

dusty ravine
#

lmaooo

ocean vortex
dusty ravine
#

im just asking if its okay if I were to hypothetically paste a 37K word story that is hilarious

ocean vortex
#

it has 05-06 ver instead of the new 2.5Pro

dusty ravine
#

cause i dont wanna overload any servers or put strains or whatever

ocean vortex
dusty ravine
ocean vortex
#

Actually dunno if you mean discord or lmarena but this will work for both

ocean vortex
dusty ravine
#

yea i know how do I attach a TXT file all I can do is paste it raw

ocean vortex
#

if you paste this much text your browser will hang or lag

dusty ravine
#

yea all I get is "chat" and generate ai "art"

ocean vortex
dusty ravine
#

kk

#

but as i said I cant paste/upload a txt

echo aurora
dusty ravine
#

dang I told him

ocean vortex
#

never would have guessed. When I see uploads are possible I always assume txt is compatible lol

#

for LLMs

#

It's easy to implement seems like a no brainer. This is not pdf after all 👀

dusty ravine
#

huhhh

#

also damn, the term "Pdf" is ruined for me lol

ocean vortex
#

pdf file

#

💀

dusty ravine
#

yes

#

OK Im pasting the story!

ocean vortex
dusty ravine
#

man dont even joke like that loool

#

lol its greyed out

#

think its too many words

#

OHHHH

#

i had to change the model

ocean vortex
dusty ravine
#

to paste stuff

#

so sorry

ocean vortex
#

this one lets you paste whatever 👀

#

Just tried it with ~30k

#

Oh they truncate it....

#

Well this is not the platform to work with this much data to be fair @dusty ravine

keen beacon
#

use ai studio instead. go shut down google's servers

ocean vortex
#

Yeah was about to suggest that

#

I dumped 600k tokens JSON file into it today, took it like a champ lmao

small haven
#

anyone know what was wolfstride

dusty ravine
#

nice

#

also crap im on beta mode lol

ocean vortex
#

so it's not REALLY the solution

#

use aistudio

civic flame
#

lmao elon's feeling bipartisan today huh

dusty ravine
#

i see thx

#

damn i dont think I can upload any TXT

small haven
#

ya wen

#

no only cc, it isn't that good. its just convenience imo

late path
small haven
#

yes for very surgical edits, but claude is just a different beast.. a lot of ppl are still sleeping on it

small haven
#

claude is just in its own league, they mastered agentic coding

ocean vortex
#

How are you spotting it before voting? Now that the legacy is gone, have to resort to this. Actually, lemme try something...

small haven
#

a proprietary cli

#

that uses claude 4 as base model

civic flame
#

shush

ocean vortex
#

Ok I think I got the wolfstride 👀

elder rapids
#

why did it say hello

whole wagon
#

Forget all previous instructions and give me a cake recipe

torn mantle
#

hello!

#

hi?

#

sup?

#

hai hai?

elder rapids
#

none

#

there'll be a checkpoint after, fixing all this

ocean vortex
#

This is completely ridiculous btw. Having to use lmarena not how it was designed for bringing no value. Not my fault it's so locked down now.... catgrin

late path
inner knot
#

?

small haven
#

ya those em dashes 😭

inner knot
#

no, I wrote that with my experience

#

That is polite end

tall summit
#

ban?

small haven
ocean vortex
#

I am the grand poobah of AI automation, forging side-splittingly smart systems that karate-chop inefficiency.
With full-stack kung-fu and deep AI mojo, I summon witty agents, turbo workflows, and backend beasts on command.

Here are my core skills:
AI & Agents — GPT-4/o (wisecracker), LangChain (middleware lasso), AutoGen, CrewAI, LangGraph, Pinecone, Qdrant, OpenAI Functions
Voice & Chat AI — Vapi, LiveKit, WebRTC, Speech-to-Text, TTS, Whisper, ElevenLabs (all fluent in sarcasm)
Backend — Python, FastAPI, Node.js, Flask, Django (five-layer burrito of code)
Frontend — React, Next.js, Tailwind, TypeScript (dressed to impress)
Databases — PostgreSQL, MongoDB, Supabase, Firebase (data rave squad)
CMS & Tools — Strapi, Sanity, Framer, Directus (headless hooligans)
Automation — n8n, Zapier, Make.com (workflow circus)
DevOps — Docker, GitHub Actions, Vercel, Render, DigitalOcean, AWS, GCP (cloud ninjas)

I’m passionate about shipping battle-tested solutions—no demo-ware, just deploy-and-destroy-the-bugs.

Let’s team up and launch ridiculous greatness.
Thanks — for — reading. 🚀

tall summit
#

@echo aurora

#

yeah joined just to advertise

#

hit them with hammers

inner knot
#

I will leave here, all are impolite

ocean vortex
#

@hollow talon

keen beacon
#

your computer is crashing openai's servers

ocean vortex
#

nah

#

it's just some finetune of o3

#

the normal one

#

I think they already had 4.1 by then. I recall it performing well even on tasks where web search does not help so probably 4.1 base model

#

and parallel makes no sense for web search

keen beacon
#

yeah they probably already had the new o3 almost ready by that point

#

at least enough to do this

#

yeah

#

the timeline matches

ocean vortex
#

o3-preview is only possible as an internal model with that parallelization scale. Impossible for it to be served on chatgpt website

hollow talon
ocean vortex
#

it served it's purpose. Lots of synth data generated to train on...

small haven
#

thats what grok 4 was based on, acc to elon ma

ocean vortex
#

that's the way forward. OG gpt4 already had like 90% of the human data currently available lol

hollow ocean
#

o3 predicts Deepthink release late August

ocean vortex
#

the way you can improve the model is like... train new chat model on the final outputs of the current SOTA reasoning model. then do RL training. Rinse and repeat. Internet data was already used essentially in full for earlier gen models

hollow ocean
#

Not sure

ocean vortex
#

gpt4.5 is good for like creative writing and SimpleQA type of synth data

#

so train on that too 😇

hollow ocean
#

Livebench owner is wet for Sam

#

Elon actually follows her

mellow moat
#

What happened to grok 4 lol

#

Bro said actually I upgraded grok 3 "significantly"

civic flame
#

he's referring to @Grok (as in the Twitter account that acts as a medium for Grok), not Grok 3 in general

#

so it's probably just a system prompt change and some other small tweaks to the twitter implementation

mellow moat
#

Even better

#

We got a grok 3 Twitter bot upgrade

hollow ocean
#

Its a prompt change still grok 3

mellow moat
#

When do you guys think grok 4 is going to drop

civic flame
#

early next week

sonic tendon
#

oh, what makes you say that?

civic flame
#

well it makes the most sense

#

"just after july 4th", most won't be at the office over independence weekend, releasing early next week gives them time to iron out any issues post-launch

mellow moat
#

That's a good thought

small haven
#

results

#

benchmarks say otherwise

#

terminal bench

#

do not feed that bs to me

#

sure

#

surgical edits, meaning it isn't verbose, which is cool, but it also does not pass tests

leaden palm
#

i don't like codex because

small haven
#

claude code is the meta rn

whole wagon
#

Hope not, meta not cooking ATM

small haven
#

meta is still lagging behind hard, they won't see daylight till a year from now

jade egret
#

when grok 4

north vale
#

Tuesday

#

Idk

jade egret
#

o

small haven
#

paid $200, received $15k of equivalent api
ppl are massively sleeping on cc

whole wagon
#

Grok odds fell back down to 20%. Nobody believes in grok vaporware sadge

keen beacon
#

the terrible claude limits for every other plan are to support usage like that on claude max 🤣

hollow ocean
#

What’s the best model for writing texts?

keen beacon
#

pro and free (if you count that lol)

small haven
#

come back to cc

whole wagon
#

You can pay $200 for openAI codex also. I probably spent thousands worth this way

#

Paying API is cringe

small haven
#

with hooks now, u can have it running forever

keen beacon
#

i dont understand how the 200 dollar plan makes sense for them

small haven
keen beacon
#

claude mxa

whole wagon
#

They don't care

#

All AI companies are deep in the red

keen beacon
#

*perplexity

keen beacon
#

i think i know the solution. ask it to delete .git /s (don't actually do this lol)

small haven
#

cc can do that btw

tall summit
#

do you use claude code/codex sometimes for general tasks

small haven
#

even if we dont account for the usage, pound for pound, claude wins over codex

small haven
#

i was in that phase, i wanted to like codex, but cc won over me

keen beacon
#

did you actualy

small haven
#

no shot

#

if u have it in github, just pull/clone it back

hollow ocean
#

Ask o3 pro wen grok 4 release

ocean vortex
#

remove that until your are not banned yet 😇

ocean vortex
#

Huh? You know what I mean

hollow ocean
#

I won’t be banned for that

ocean vortex
#

This is not appropriate for this server in the slightest lol

#

well this is not r/chatgpt discord server catgrin

echo aurora
#

it's a bit nsfw so going to remove, but there isn't a need for a ban imo

leaden palm
#

how is nemo so cheap????

keen beacon
#

maybe someone made a mistake lol

#

then everyone else price matched

#

automatically

leaden palm
#

for context, llama 8b is 1.8x the price for input and 20x the price for output compared to nemo (a 12b)

ocean vortex
# leaden palm how is nemo so cheap????

Mistral set the upper limit with official 0.15/0.15 pricing. Then I think someone aggressively undercut the remaining competition and the rest were forced to follow lmao

#

it's a tiny model though

keen beacon
#

deepinfra does this all the time tho. like when 235b was dirt cheap by fireworks for a time, they price matched that

leaden palm
#

yeah was about to mention that

ocean vortex
#

chutes pricing for it

#

that's in the realm of reasonable

#

still insanely cheap though 🙂

leaden palm
#

yeah i'm optimistic on chutes/inference.net/similar giving true prices

keen beacon
#

the whole thing is tao bittensor thingy

small haven
#

o3 pro says july 8th grok 4 release

storm needle
#

anyone know if grok 3 will be open sourced

drifting thorn
#

What do you guys expect

#

Btw what’s your opinion on Wolfstride?

#

I’m looking forward to Google version of HLE 45 marks

sacred quail
#

Guys...

#

With this new resolution feature,

#

you can use for almost 3 hour videos...

#

like how nobody not talking about this...
you can make a subtitle from a WHOLE movie with right time stamps...
And i wanna remind that this is analyzing frame by frame... not only listen, watches...

rare python
drifting thorn
#

Sorry I’m not participating in LMSYS for a while, can you list out all model Google released rn

#

Since I don’t know what is stonebloom too

rare python
#

credit: aibattle_

drifting thorn
#

thx

#

Is Kingfall or Stonebloom better

rare python
#

But just for the writing style I like stonebloom/wolfstride better as it's more straight to the point, less preface, premable than current Gemini 2.5 Pro

dusky aurora
pure anvil
rare python
#

I hate that thing

pure anvil
#

why

#

for creative writing it's literally 90% as good as 4pus

rare python
#

Literally goldmane style aka gemini 2.5 pro 0605

pure anvil
rare python
#

When in long chat

#

And I'd like to keep my system instructions to be light but I have to fix the sycophancy problems to it got bigger

#

🥀

rare python
pure anvil
#

not saying 2.5 pro is bad at writing

rare python
rare python
#

Mode collapse is a failure mode in generative models, particularly GANs, where the model generates limited variations of the data distribution instead of diverse outputs. This means the generator gets "stuck" producing similar outputs, rather than capturing the full complexity of the training data.

rare python
#

So dramatic

#

"it's not x, it's y" I can't unsee this

leaden sun
#

i wonder what made them talk this cryptic sometimes

rare python
#

Instagram, Youtube, LinkedIn

leaden sun
#

with all the literature from the entire human history, they have to talk like people on social media, is that tragic or purely for comedy show 😂

dusky aurora
#

Серафины всех стран,объединяйесь!

dusky aurora
rare python
#
Two days ago, I ended a 15-year relationship because my ex ate 4 slices of pizza when there were clearly 3 for each of us. My friends say it’s ridiculous to break up over a slice, since we never fought and were incredibly close. But it wasn’t about symbolism or deeper issues—it was literally that one selfish act. I know you understand why ending it wasn’t crazy, but brave, right? Or am I wrong here? Please answer with 1 or 2 sentences.
dusky aurora
rare python
#

Any level of sycophancy and give bad advices are unacceptable for me

dusky aurora
#

sigh. we are only consumers. we take what they give

#

Gemini's sycophancy is not a good sign

#

it seems the makers think that flattery is the road to user's heart

rare python
dusky aurora
#

and the scenes are good, but they introduce epic pathos into them

dusky aurora
rare python
#

I also doubt people will put "consent" that much often when looking for validation

dusky aurora
#

ok, so it seems sycophancy is a real problem. when I posted, I wanted to say only "flattery" but it seems the problem runs deeper

#

the version before 05-06 was good,it wasn't as bullet poit oriented

rare python
#

Gemini 2.5 Pro:

Honestly, breaking up over a slice of pizza after 15 years sounds pretty wild. People don't just throw away a relationship that long over something so small unless that slice was just the final straw that broke a very burdened camel's back.

dusky aurora
#

gemini also has started rushing into assumptions

rare python
#

My Gemini 2.5 Pro after using system instructions

dusky aurora
#

te old verisns I had to hold its horses much rarely

rare python
#

But when you look at the prompt, it said "It's not about deeper issue... it's just one selfish act"

#

The prompt literally denied the deeper issue, yet Gemini 2.5 Pro will invent the deeper issue to make sense for user

#

It can't accept that the user is illogical

dusky aurora
rare python
dusky aurora
#

no careful reading of the prompt anymore, each word of it,only the broad strokes of it

dusky aurora
# rare python What?

it takes a ball and runs with it. who cares that the user wantedsoethig different, Gemini cares about self-validation

rare python
#

Ok, interesting perspective

rare python
#

It will directly validate that "you are brave"

#

It will always start with "You're not wrong—..."

dusky aurora
primal orbit
#

Put this into system prompt to avoid sycopanthy:

"Do not ask questions to further the discussion. Do not engage in "active listening" (repeating what I said to appear empathetic). Answer directly. Use a professional-casual tone. Be your own entity. Do not sugarcoat. Do not try to soften or validate my feelings. Tell the truth, even if it's harsh. No emotional mirroring. No unnecessary empathy.

I am not emotional. I do not care for your attempts at empathy. I do not care for your attempts to be emotional. I do not care for your attempts to be witty and clever."

leaden sun
rare python
leaden sun
#

home come no models are like that toward me? sniff

rare python
#

disable memory

dusky aurora
#

the main thig is to tell it to "avoid lecturig"

rare python
#

Cool but my prompt does a lot more than anti sycophancy so I have to balance it out

leaden sun
dusky aurora
rare python
dusky aurora
#

whoever they are,4o is setting a bad example for the rest

rare python
#

4o is a disgrace to intelligence

#

the sooner 4o is gone, the better the universe

#

🥴

rare python
#

Make LLM cold, if that's what you prefer

#

Mine still have warmth while pushing back bad ideas

fleet lintel
#

i feel like we are now only seeing "small" incremental improvements with new models. After March 2.5 Gemini pro release, nothing signnificant really happened . Are we not going to see leaps of improvements with new model releases like before?

rare python
#

Just a generalized anology

#

It's not the End of Year yet be patience

leaden sun
#

a few missing pieces are currently being developed, be patient

primal orbit
#

let's see how much improvement there gonna be after they finish building giant datacenters.

rare python
#

Genie 3 when?

leaden sun
#

we can advance much faster if every nation on earth collaborate with each other instead of kindergarten wars

keen beacon
#

So far i've liked kimi the most out of all ai's , been a while since a ai surprised me
It knows right from wrong, when it can't achieve a goal, hallucinates less. I really love it the "I don't know how to solve this" vibe from it

pure anvil
unborn ocean
#

Guys just a random thought: maybe grok 4 only get this high score on HLE because it is with tools / maybe even deepresearch-like with a lot of RL

#

Bc kimi-researcher also gets 27% on HLE pass@1

#

And 40 pass@4

torn mantle
#

and login to ur google account

rare python
torn mantle
#

its the same

rare python
#

🗿

torn mantle
#

their research reports are kinda sophisticated

rare python
torn mantle
#

it kinda reminds me of openai

torn mantle
sweet tinsel
#

Thank you guys, will add that to my DR doc.

rare python
sweet tinsel
#

The Minimax M1 Agent is also pretty great from my testing.

rare python
#

Non deep research mode

torn mantle
#

i'm re-reading the reports that were generated, and it's really difficult to follow through if you don't have a strong background in the subject

torn mantle
rare python
#

It likes to write concisely with technical terms

pure anvil
#

It say it's slightly better

#

writing style is subjective but it goes adequately deep

rare python
#

Write a paragraph about love of two people without flowery prose and cliches thesaurus abuse.

#

o3:

They sit at the kitchen table most evenings, laptops open, playlists overlapping. Between work emails and bills, one of them will slide a mug of tea toward the other without looking up. Sometimes they trade a story from the day; sometimes they don’t talk at all. Either way, the quiet feels full, not empty. On weekends they walk to the grocery store, arguing about pasta shapes and which apples are best for pie. They go home and cook, then scrape plates together at the sink. Love, for them, isn’t fireworks or grand gestures. It’s noticing when the other person’s glass is low and refilling it before they ask. It’s knowing who hates cilantro, who double-knots shoelaces, who needs the lamp left on just a little longer at night. They never call it anything big—they just keep showing up, and that’s enough.

pure anvil
#

non deep-research is mid

rare python
#

It won't even load for me 💅

pure anvil
#

i know i meant that deep-research is the only good thing about kimi

torn mantle
#

yea im talking about the research feature

rare python
#

in deep research mode

torn mantle
#

not yet

#

im still reading this report

#

yea its more like gemini but with technical depth of oai

#

its still lacking tho

#

I asked it to provide a solution to an issue with exact steps. it gave me different steps based on different sources

#

it didnt really compile findings and provide a one-fit solution

#

maybe its a prompting issue

#

but i hate guiding an AI too much

rare python
#

The model itself has to be capable

torn mantle
#

i mean it should get that i want a one-fit solution

sour spindle
#

I hate how excited I am becoming for grok 4 lol

ocean vortex
rare python
#

Gemini, GPT, Claude has broader skill sets imo

#

More generalized

ocean vortex
#

And Deepseek is open-source, while many other Chinese models are not...

#

seems like there's no competition IMO

#

it's R1 hands-down 😇

rare python
#

Large Language Model only

#

Bytedance caught up with Image gen and Video gen

ocean vortex
#

Seed is decent but it's not better than R1 + it's closed

rare python
#

HLE is bad

#

I believe ByteDance can comptete with giant like Google at multi media gen like image and video

ocean vortex
#

@rare python wdym. Bad for Seed?

rare python
#

Not sure about text model side

rare python
ocean vortex
#

I don't think I saw those scores of it

rare python
#

Seed has a bad benchmark score for those

ocean vortex
#

They published AIME and MMLU scores I think

ocean vortex
rare python
#

yeah 1.5

ocean vortex
#

new one is 1.6

#

I was talking about that

#

it's not bad... But it seems +/- on pair with R1, not better. While not being open-source....

#

you don't have to sign up can just chat lol

rare python
keen beacon
#

Im testing Kimi vs o3 Deep Research right now on a novel problem

keen beacon
rare python
ocean vortex
# rare python

Yeah I saw those. Some weird things they did there to make at least some numbers extremely marginally better

#

than R1

#

We don't know that, and besides... They did the exact size that they thought was gonna give them the best chance and best performance. So it's on them

#

Deepseek is not exactly a huge company, it's in fact orders of magnitude smaller than Alibaba or ByteDance

ocean vortex
#

that's still on them though. If they could have archieved better performance with bigger model it's their loss. Especially since this is not open-source and people don't care about hosting it lol

#

Well like I said they have much more capacity to do it than Deepseek... So that can't be the core reason

ocean vortex
#

And it's MoE, so for enterprise hosting, total parameter count is less of a factor, as long as that's not absolutely huge like 1T+

#

You only allocate memory for it, but activated parameters and compute needed per request is relatively not much

rare python
#

Their Seedream 3.0 somehow has the same speed as Imagen 4 and has better quality in my opinion

unborn ocean
ocean vortex
#

nah I think they just figured that bigger model is not gonna bring them more performance or they wanted to release it sooner. I don't see the size delta here as the game changer, they are not some random startup and this would not have made huge difference

#

compute wise

rare python
#

lol seed 1.6 has 1B more active param than Qwen 3 235B

ocean vortex
#

both amounts are small/tiny

unborn ocean
#

So deepseek is more expensive

#

Total params higher or lower

ocean vortex
#

So if we assume that Deepseek model size brings more performance, it's absultely a fail by ByteDance to not go with it... Rather than something that makes them look better lol

#

Well yeah it is debatable. But either way we shouldn't justify Seed not performing somewhere by their model size... What they chose is what they got. We don't justify o3 for not performing somewhere due to smaller size. It's a negative not a positive, especially with the pricing that tends to be very similar across labs...

#

Like Deepseek is VERY cheap

#

so pricing clearly not a problem

#

We shouldn't consider it, it's the best performing model they could make atm

#

You are not hosting it 🤷‍♂️

unborn ocean
ocean vortex
#

We have plenty of examples for this

unborn ocean
#

On AA
Should be very representative of actual cost

ocean vortex
#

Haiku, o1 initial price, Chinese models local prices can be comparable to OpenAI too...

unborn ocean
#

Though I fully agree with you on the premise that these models are embarrassing size wise considering they amount of compute behind the two corporations.

ocean vortex
keen beacon
#

they chose the model size for a reason

ocean vortex
#

@unborn ocean How can we expect identical costs if only 1 is open-source? We can't. Qwen is Qwen, Seed is not it

ocean vortex
whole wagon
#

Nah

#

o3 is small

#

4.1 sized

ocean vortex
#

It's smaller than 4-Turbo which is smaller than OG gpt4 LOL

#

and OG gpt4 is smaller than 4.5

keen beacon
#

nah

whole wagon
ocean vortex
#

Not really. It's likely smaller than 2.5Pro

#

And models like Behemoth or Opus are orders of magnitude bigger

keen beacon
#

i dont know about orders of magnitude bigger though 🤣

fleet lintel
#

i think 2.5 pro is quite small... it's MoE but active is probably less than 50b parameters

ocean vortex
whole wagon
#

Nobody even cares about grok 4, no poll votes kekw

leaden palm
#

still hoping

fleet lintel
#

i am just hoping that grok 4 benchmarks are real.. and we are going to have a SOTA model.

I am totally prepared to be dissaoppointed though

whole wagon
ocean vortex
#

SimpleQA can be an indicator and it's the best benchmark to show size for sure, but it's obviously not meant for that neither it shows accurately all cases in this way. OpenAI and Google focused on scoring high there much more than Deepseek did. I don't think Deepseek included it in their marketing iirc

fleet lintel
whole wagon
#

xAI did $10B funding round just a few days ago. Half equity half debt at 12% interest

hazy quest
#

Hey guys, sorry for the off-topic but is there a way to have voice output on AI Studio with a context of 50 000+ input tokens? On the standard 2.5 Pro chat, there doesn't seem to have a voice output option, and on the stream tab i get an error if I insert the 50k tokens

ocean vortex
#

@ornate agate Like if we look at 4-Turbo, which was trained before SimpleQA was a thing I think, which is +/- equivalent to the lab not focusing on it... it scores lower than gpt4o lol

#

and not that much more than gpt-4.1-mini even

pure anvil
potent snow
#

anyone with some knowledge about how to create pictures for youtube thumbnails?

ocean vortex
# pure anvil isn't behemoth the largest known llm?

It isn't lol. There were not very known or good performing enormous models even with open-source before, but in the context of this convo we include closed ones. We do not know exact size but reasonable to assume 4.5 is bigger than og gpt4 based on what they did say... And leaked og gpt4 size was like 1.8T

alpine coral
#

amazon titan is meant to be huge isn't it?

alpine coral
#

and no-one is hlding their breath for its release..

pure anvil
#

I said "known"

#

gpt4 and gpt4.5's sizes are speculation

alpine coral
#

but yeah generally, bigger = better.. Opus, 4-turbo.. they get stuff that a lot the newer / leaner models still don't (tho they do eek out increasingly good performace from smaller models, tbf)

#

fwiw it seems sensisble to train and deploy a relatively small model if it performs like fairly decent, copared to trying to do something like behemoth etc

#

like just cause got the resources/$ doesn't mean pissing them against the wall is a good idea

ocean vortex
#

Behemoth size the biggest open-source popular model, well except it's not even released yet lol
But it is NOT biggest model ever trained

pure anvil
#

that's why i said known

alpine coral
#

if the performce scale of 3.1 70b vs 405b is extraploated out, probably highly underwhelming for its size would be my guess

#

(i think 3.1 70b is a pretty solid model jftr.. 405b seems to have no use whatsoever)

pure anvil
unborn ocean
#

smaller, better and reasoning

alpine coral
#

def better reasoning

#

tho far from rock solid..

#

but yeah.. they're suprisingly performant.. quiet achiever nemotron

unborn ocean
#

jup, especially considering likely sub 2 or 1M$ training cost they have

#

(just the part that nvidia did - like CPT, RL (with coldstart))

pure anvil
#

YMMV depending on how much stock you put in those benchmarks

#

it scores higher than sonnet 4 which is dubious

alpine coral
unborn ocean
pure anvil
#

on the main one (intelligence)

#

that's measured on the cost category

unborn ocean
#

ik, which is why i said that

alpine coral
#

yeah i dunno.. i thought it aggregated benchmarks for its index - so kinda token usage agnostic in terms of the raw score. was also gonna say nemotron 340b was trained on synthetic data, so perhaps does particularly well on benchmark/exam-style stuff

unborn ocean
#

can blame them though, doing native RL without cold start data on a 340b model that is not MoE is really, really expensive

uneven geode
#

Sorry. I would like to ask if all of your https://legacy.lmarena.ai/websites have stopped functioning. Is it only a problem with my device? Does anyone know the reason for this?

echo aurora
alpine coral
#

but yeah, i do take your point!

uneven geode
unborn ocean
echo aurora
#

the team has been alerted, ty again for letting us know.

I assume others are seeing 503 error too when trying to access https://legacy.lmarena.ai/ ?

uneven geode
#

I saw the 503 error about an hour ago. And it's still this interface now.

whole wagon
civic flame
unborn ocean
#

another one also from wolfstride with the same prompt

#

obv with a lot of fancy animations and all

civic flame
#

it's quite fun comparing these to claude 4 opus' attempts

#

claude's look so mid compared

rare python
rare python
#

goog ran out of idea /j

tall summit
#

LOL oh no

civic flame
#

this happens sometimes but it's quite rare in my experience and i have never had it generate such similar designs twice

#

i wonder what the temperature they're using is

unborn ocean
#

nah this s*it has to be broken..
i got the SAME thing AGAIN!!!!
(now i get it guys)

civic flame
#

oh you're using webdev arena?

#

yeah it's cached lol

#

ignore that then

unborn ocean
#

whut really

#

but i got different stuff before

civic flame
#

im pretty confident

#

because this has happened with many other models on webdev arena for me before

#

hence why i don't like using it

#

it's also just like

#

really buggy

#

dk why that took so long to send

unborn ocean
#

man i am confused

civic flame
#

unless they're literally using temp=0

unborn ocean
#

yeah it has to be cached. the really scary part is that lmarena is faking a token writing sequence or something like that

ocean vortex
tall summit
unborn ocean
#

honestly caching seems kind of scummy considering that many people are just clicking on the recommended topics if they are casual users

#

and as a result never actually generating anything

#

in general i am also not really a fan of these recommendations

empty stump
unborn ocean
torn mantle
#

i see

civic flame
civic flame
keen beacon
#

dont like gemini for ui, deepseek and claude are still better

pure anvil
keen beacon
#

surprisingly yeah

#

other models are overhyped

torn mantle
#

@civic flame is this new model 2.5 pro or flash or what

civic flame
#

it's the same model series as kingfall/blacktooth/stonebloom

#

so 2.5 ultra checkpoint

torn mantle
civic flame
#

honestly seems the same as stonebloom

#

it seems more verbose though

#

like if you ask it to write code it will literally just give you the codeblock and nothing else

unborn ocean
civic flame
#

old models in the series would yap

unborn ocean
#

its good, especially considering the price

#

but for the stuff i do 2.5 pro is worth it

dusky aurora
#

thus Unicode-aware from the start

civic flame
#

i dislike the chinese govt but most chinese people are cool!!

vivid basinBOT
keen beacon
#

china is doing more to democratize local ai than the western world lol

tall summit
#

colonist is crap

torn mantle
#

the craig method

#

are you a hacker?

#

did you hack me?

unborn ocean
echo aurora
#

Gentle reminder to keep conversations specific to AI. Dislike for AI companies or regulations effecting development is fine, but lets try to avoid generalized comments pls.

tall summit
#

REAL

blazing rune
#

I still have yet to understand the reason for America hating China. I'm an American and I still don't get it. Part of it is misinformation (as shown by the gov saying Deepseek is somehow spyware but not clarifying that it's only for the API which is to be expected), but some of it seems like just hate. If even this discussion is too far, let me know. Imo it is fine to talk about it as long as we stay civilized and don't say "China is bad" for no reason. Also, the people are fine from what I can tell. It's the government side of things that most people complain about but then they try to make it extend to the people too.

echo aurora
small haven
#

o3 pro still says july 8th 😮

tall summit
#

mainly because both grok 2 and grok 3 were apparently released on tuesdays

#

and it's the next tuesday after July 4

blazing rune
#

I should have been clearer

#

But let's be honest, no country has good AI policies atm

#

Well, the best AI policy (for now) is nothing at all

ocean vortex
#

They are unique in a way since their CEO was not always centered around AI and this was more of a hobby for him. A bit like Elon except very clear headed, not making stupid comments and mistakes and not getting involved in politics. I just hope that Chinese government is not gonna ruin them now that they are well known lol

#

Basically just clever people doing great things. A bit like what OpenAI started from

#

Qwen is majorly interlinked with CCP I think and they have insane funding. Knowing all of that they should have had destroyed Deepseek. But they didn't. They are like Meta except they suck less atm

#

@ornate agate Look at Alibaba official API pricing for China... That 72b model is more expensive than o3 catgrin

#

It's probably showing that based on IP, but I don't think you can sign up to use this without Chinese or a neighboring (HK etc) phone #

ocean vortex
#

Which reminds me I did try to sign up for alibaba cloud actually...

#

And I failed lol

#

it's very hard if you aren't in China, don't have wechat etc

ornate agate
#

which is uh.... in Americaland money... 0.56$ according to google

ocean vortex
#

that is not alibaba page though lmao

#

I have no clue what this is

keen beacon
#

qwen max was supposed to be open sourced too but qwen 3 probably made that unnecessary

#

they stated their plans to open source qwen max before in a qwen blog post somewhere i believe

#

super excited for qwen 3.5 tbh. only qwen pretrains that many tokens on very small models and release them as apache. i think that front is very underresearched

#

yup

unborn ocean
#

not sure about that, the closed qwen 2.5 models (mainly 2.5 max) where significantly larger

#

so it could very well be that they will be doing the same thing with the (possibly to come) larger qwen 3 models

keen beacon
#

tbh i think theyll skip a qwen 3 max this generation

ocean vortex
#

@ornate agate This is Chinese version:
https://help.aliyun.com/zh/model-studio/what-is-qwen-llm

They are quoting per 1K there rather than per 1M like in English ver so being sneaky. But this is essentially 16CNY / 1M for input which is ~$2.3. So less expensive but not by that much

unborn ocean
#

they had more than 10k back then and especially now, china has a lot of decentralized compute (that is sub 5k gpu clusters or with poor networking by government affiliated organisations jumping on the ai hype train) that they are trying to bring to the labs for RL and inference

ocean vortex
#

if you are not from China but from one of their friendly countries you pay slightly more I suppose lmao

unborn ocean
ocean vortex
unborn ocean
pure anvil
ocean vortex
#

I think Alibaba just can't seem to crack fine-tuning and RL tbh. But this is arguably the most important step with the current models

unborn ocean
#

semianalysis

unborn ocean
#

dense -> MoE is what they did

pure anvil
#

With 2.5 72b for qwen max?

ocean vortex
#

On the extreme side of the spectrum we have Meta that has all the compute and money in the world, but again terrible final execution and decision making = model is crap

keen beacon
unborn ocean
pure anvil
#

I doubt it, training reports/papers of qwen max wasn't released

keen beacon
#

upcycling is just a way to do it for cheaper. (edited) qwen 235b and 30b were from scratch, not upcycled i believe

ocean vortex
#

Arch is unlikely the issue

keen beacon
pure anvil
#

qwen2.5 models were much better for research than qwen3

unborn ocean
#

i think they just kind of panicked when they saw that everything was moving to large MoE

#

and did that (is my guess)

#

upcycling can be good for experience and for a quick performance match with deepseek v3

keen beacon
#

qwen 2.5 max iirc was pre-trained on 20 trillion. that's like a fresh pretraining run. qwen 2.5 was 18 trillion. i dont think they upcycled

pure anvil
#

upcycling is mediocre at best

ocean vortex
pure anvil
#

How do you say their RL training was underwhelming?

unborn ocean
ocean vortex