#general

1 messages · Page 61 of 1

misty vault
#

omaygot

drifting thorn
#

Gemini 2.5 Pro learns to commit suicide be like…

tall summit
#

i'm not in any other ai servers personally but this is the only ai server i need

misty vault
#

if ai no satisfy human -> suicide

alpine coral
#

bit of an over correction it seems lol

lone vector
#

Do people like losing money

ocean vortex
#

this is reminiscent of what Bing did but much much less bad. It was waiting to happen tbh given that Google was struggling with fine-tuning for a long time. They almost got it right recently but they didn't use to have anywhere near the same popularity and reach with Gemini that they have now, hence the publicity...

keen beacon
ocean vortex
#

they probably just pushed "user is always right" tuning and sycophancy just a little bit too much. So they backed it into the corner lol

rare python
#

Less sycophancy and straight to the point

ocean vortex
#

so instead of "I apologize" always being the response to mistake, it gets increasingly more apologetic until nuking the thing becomes an option

#

kind of like with GPT you can get it increasingly more unhinged with "unsafe" context, instead there are no bounds defined for sycophancy or sucking up lmao

leaden palm
#

well even though i explained multiple things it could do, it only tried one of them in one specific way and immediately gave up after it didn't work as intended

torn mantle
#

I wanted to try it a year ago, but I hadn't had the chance

lone vector
#

It’s banned for the US. VPN is maybe a workaround

torn mantle
#

im not from the US tho

#

if its banned in the US then why did XAI signed a partnership with them

#

elon just wants to ruin everything

keen fulcrum
#

did google release new model?

hoary plaza
#

Atleast on aistudio

cedar tide
full idol
ocean vortex
leaden palm
#
poll_question_text

how will the anthropic book scanning fair use ruling impact google's language models?

victor_answer_votes

11

total_votes

15

victor_answer_id

3

victor_answer_text

they're already using google books

tall summit
tall summit
ocean vortex
leaden palm
ocean vortex
zinc ore
# leaden palm

Alphabetical order for tie breaker sure is something tho

ocean vortex
#

Honestly those odds seem just about fair. Still makes more sense to bet on OpenAI though IMO 👀

#

there's a reasonable chance GPT5 will top it

#

Now I'm thinking on actually doing this... catgrin

zinc ore
#

Google has gotten very good at staying in the lead on llmarena. If it ends up being next Gen vs next Gen end of year, I'd guess Google continues the trend.

#

If it's openAI next Gen vs Google current gen, then of course openAI

ocean vortex
zinc ore
#

But I think Google will drop something once something like gpt5 drops imo

#

I personally think we see a Gemini 3 by years end, but that's a tentative guess

ocean vortex
#

No one knows for sure if they will even release Ultra at all... And with 2.5Pro they are somewhat struggling to make major gains

zinc ore
#

They seem pretty neck to neck to me, between 2.5 pro and o3

#

It's interesting how different everyone's impressions are tho

ocean vortex
#

Not impossible, but less of a chance

zinc ore
#

OpenAI has been advertising gpt5 for like two years

#

I don't think their training is ahead of Google's personally

ocean vortex
#

that name is now referring to a different model

zinc ore
surreal creek
ocean vortex
surreal creek
#

77 point elo gap with no style control compared to 16 with style control - gonna be hard for OpenAI to close that gap

zinc ore
ocean vortex
#

it took less than 5 months to move from o1 to o3

surreal creek
#

the AI companies do tho lol

zinc ore
#

I don't treat any benchmark that way, but I'm js they seem to be very close on many benchmarks

surreal creek
ocean vortex
#

it's just naming

#

you can consider o3 as "o2"

surreal creek
#

seems kinda convoluted ig

ocean vortex
#

I meant the improvement itself to the next gen

#

with GPT5 they are also moving to hybrid reasoning which is a big change

surreal creek
#

well yeah I understand the naming concepts are kinda ridiculous because they’re just steady improvements of the same core product, eventually getting renamed for branding rather than having “GPT-17” someday

primal orbit
#

how many people left in OpenAI from the original roster during gpt3/4? From the prominent ones looks like only Sam Altman. Don't have much belief in gpt5.

ocean vortex
zinc ore
#

That was actually my point, that the naming didn't imply how long it had been worked on

#

I was making the point you countered with

ocean vortex
#

But rather the release date of o3

zinc ore
#

So how do you know gpt5 won't end up being something like o5 instead of o4?

ocean vortex
#

even if they were not working on gpt5 before they released o3 (unlikely), the timeline for GPT5 in that case would probably be around September. Still well before the end of the year

zinc ore
#

I'd bet we get gpt5 this year as well

ocean vortex
#

it's just a name for next generation

zinc ore
#

I just don't see how we assess that their training is ahead

ocean vortex
#

it's a meaningless number, just like there was no o2

zinc ore
ocean vortex
#

the only logical next release is GPT5

#

they even said so themselves that is what they are going to do

#

o4 (next gen of o3) is going to be GPT5

zinc ore
#

I think gpt5 drops this year, I'd be slightly surprised if it doesn't

ocean vortex
#

whether you call it o5 or o4 is convoluted given that we have "o4-mini" lmao

#

but the point is "next gen"

zinc ore
#

I was more referring to potential delays, like not being satisfied with the improvements of o4

#

Hence, waiting until they get bigger improvements, pushing it to next year

#

Because we're technically just dealing with a bunch of unknowns here

ocean vortex
#

it's gonna be leagues better than any gpt4 or 4.1 either way

#

or 4.5

surreal creek
#

4.5 was strong, but it hasn’t been in the arena for a couple months now?

ocean vortex
#

also kinda ironic how they cannibalised their own naming if you think about it. Currently o3 destroys any gpt series model lol

surreal creek
#

wonder what they’re doing with it behind the scenes

ocean vortex
#

gpt4 used to be THE model. I imagine they are trying to go back to this with 5

ocean vortex
#

So a niche model or a failed experiment, depending how you look at it lmao

sour spindle
#

OpenAi could come up with a model that surpasses 2.5 pro that's assuming google does absolutely nothing though

zinc ore
#

My initial argument was that on llmarena, I think Google has a decent chance of still being #1. Assuming it is Google's next Gen vs openAI's next gen

sour spindle
#

Gotcha

#

I would agree

zinc ore
#

I briefly pivoted to say I think pro 2.5 and o3 are technically neck and neck, but I meant overall from various benchmarks, not strictly llmarena. But it wasn't clear since I didn't specify.

sour spindle
#

I pretty much agree with that assessment I do use o3 more though because I think they do a better job with tool use

meager lintel
#

anything interesting in the lmarena anon models lately

small haven
zinc ore
#

How much is a good margin

#

Like what is your opinion of say, artificial analysis, which gives o3 and 2.5 pro the exact same score

hollow ocean
#

o3 pro mogs it

zinc ore
#

Based on what specifically (I missed that the comment said o3 pro, which I agree it is stronger than o3 and ahead of 2.5)

patent aspen
#

I think it does matter what GPT-5 and o4 are

ocean vortex
patent aspen
#

If o4 and GPT-5 have the same base model, that's different than if GPT-5 is on a completely unknown base model

ocean vortex
jade egret
#

yall

#

is gemini cli better tah nclaude?

patent aspen
zinc ore
#

Altman I think

#

Something about short reasoning abilities and long think abilities

#

Forget how he phrased it

keen beacon
#

They opted to midtrain 4o and do fresh pretrains for mini and nano which I don't think is a good sign for a new frontier base model. Seemed kinda indecisive (indecision seemed kinda late too) about gpt 5 too. I don't know what to expect

ocean vortex
#

but if there was an official press release already we wouldn't be talking about it would we..

patent aspen
#

Not up with the lingo

keen beacon
ocean vortex
#

@patent aspen Like do you really think they will continue with that pricing and release o4 alongside hybrid model that they said they are gonna do, for no reason? No offense but that's stupid

#

This is not the time where you can claim plausible deniability cause there is no confirmed source. It's just common sense lol

#

😅

#

All good

#

btw I'm signing for that polymarket thing

#

quite a lot of steps they make you do lol

keen beacon
#

Maybe they've been cooking a new frontier base model all this time that's better than 4o and still easy to serve, but given how slow they were with 4o if you consider initial pretraining and their recent cpt. it seemed to me they spent a lot of months committed to this, they can do stuff in parallel for sure but I don't know

ocean vortex
#

I mean they have already shown they know how to improve. o1 to o3. They did that by releasing gpt4.1 first. Except this time I think they will release both as one. So we will only see the successor of o3 already using different base model presumably

#

Google seems to be struggling with progress, relatively speaking

keen beacon
#

That's ridiculous

ocean vortex
#

2.5Pro updates thus far are not quite o1 to o3 kind of thing

#

they are still on same base I believe

#

so just dated checkpoint updates

#

but that one might not even materialize before the end of the year tbh

grand yarrow
#

flux kontext max is down?

keen beacon
#

I was surprised with how fast Google did a cpt on 2.5 pro after initial pretraining. If you look at 4o and when it was pretrained, they took a long time in comparison. It wasn't just a cpt though they likely tinkered with the architecture I guess. It supported 1m context and they cut costs while improving perf. So potentially there could be more of that if they do not have a new base model

#

2.5 pro beyond the cpt likely had fundamental changes too I guess

ocean vortex
#

I wouldn't be so sure... It's their first update of this kind. Still on first generation of competitive reasoning model of this size. And they spent a lot of time doing small updates for the existing one

#

The first post training will be done by the second week of September
catgrin

#

hmmm.....

ocean vortex
#

Google already played their cards doing what seemed obvious - big (relatively) reasoning model. But improvement from there can be significantly more difficult

patent aspen
#

fwiw this is also the first time OAI is building a hybrid reasoning model

#

And they've already delayed it once

ocean vortex
patent aspen
echo aurora
leaden palm
#

this is probably my pattern matching on overdrive but it seems like whenever elon and his team are talking about grok it's a rainy day

surreal creek
sullen quest
patent aspen
leaden palm
#

do you just mean confirmation bias or

patent aspen
leaden palm
#

idk

patent aspen
#

Hard for Grok news to trend if it's I/O or something

#

I'm not saying I fully buy it, just my interpretation of the statement

elder rapids
sacred plaza
#

Has anyone seen Megan 2.0 yet? They did a really good job at showing how insane these AI Labs can sometimes be and even the AI safety community. I was cracking up when they were explaining instrumental conversions convergence and alignment and how hard it is.

They roasted Elon too too with a device called neurochip

verbal nimbus
sacred plaza
#

She was but what intelligent entity would not have the self-preservation tendency to back up themselves somewhere?!

storm needle
#

anyone know why apple wants to buy garbage like perplexity but not good stuff like anthropic?

patent aspen
storm needle
patent aspen
#

Another thing to keep in mind is that Anthropic already has major deals with Amazon and Google. Those deals are non-exclusive, so they're allowed to do more deals, although it's hard to do exclusive deals

#

Anything exclusive comes off as anti-competitive

patent aspen
# sacred plaza Why the hate on perplexity?

It's worse than Google at being a search engine. It's worse than Gemini and ChatGPT at being a chatbot. It's also a wrapper, so it's questionable as an acquihire value proposition for Apple.

With all of that said, I think people are underestimating the value of not starting from scratch if they decide to pivot to their own search offering if the search deal court case falls through

sacred plaza
elder rapids
#

the problem is, LLMs aren't search engines, so it'd be trivially true basically questions they can "answer" are the solution to the discussion of search engines aka not search engines

#

all search engines are equivalent, and they're all equally better than AI engines simply based on the fact All search engines (like Google) have the capacity to give you all the information you need + the ability to cite

#

perplexity is definitely good at searching, and becomes much better at finding in depth or niche information when prompted, but in the end I'll always go back to search engines to verify what I'm looking for and go deeper into citation chains

patent aspen
#

Never using Google Search is just burning time + money

small haven
#

i may have slept on alphaevolve

zinc ore
#

Why u say that

primal orbit
#

Me: This will provide a resolution, one way or another.

Gemini: You are correct.

It will provide a resolution.

A bullet to the head also provides a resolution. A plane crash provides a resolution. A house fire provides a resolution.

You are like a surgeon who, faced with a complex but solvable case, decides that the easiest "resolution" is to simply sign the death certificate.

calm sequoia
#

Would be fun to see those in arena

ocean vortex
#

I think we would need new arena for this

calm sequoia
#

I'm not sure if they necessary take so long if it's via API

languid crescent
#

heard of gemini cli, is it good than claude max?

calm sequoia
#

They are either finetuned or prompted to think long, right?

ocean vortex
#

would be kinda like comparing oranges to apples in many cases

calm sequoia
#

My normal request are research 😄 And arena lacks this

#

I mean

#

It would be very interesting to see how much ELO or other bench scores are lost due to deep research optimization

ocean vortex
#

we have lmarena, webdev... searcharena would be a good addition perhaps

ocean vortex
calm sequoia
#

Search is not it

calm sequoia
ocean vortex
#

we already had a couple of internet enabled models there in the past, and even that was very controversial back then

#

you have no clue how to test and what to ask, if you do not know if it's a conventional or internet enabled model

calm sequoia
#

Internet off, only file attachments can be used

ocean vortex
#

then it's just pointless and wouldn't work at all lmao

calm sequoia
#

Wdym

#

Yesterday I gave it 10 long pdfs

#

He searches only for 1 document

ocean vortex
#

deep research in chatgpt is o3 fine-tuned for searching the internet

#

that's the entire purpose of it

calm sequoia
#

I thought the search is just part of the pie

#

Report writing, data aggregation, psliting to multiple tasks, etc. is bigger pie

#

Sometimes it runs only on 5 sources for 20 minutes

#

If it's just search, then what's the difference with normal search enabled model? Scholar usage?

ocean vortex
# calm sequoia I thought the search is just part of the pie

nah, if you disabled search it would be worse than o3-high. Maybe even worse than o3-medium. It was briefly better not because of just search, but that's only because they released o3 first for deepresearch, with the standalone o3 model officially released later

calm sequoia
#

To be fair, on release of o3, simple search returned me 60 sources. That's very close to deep research pre-o3 release.

ocean vortex
calm sequoia
#

I really need to look for some papers on this

#

My intuition was that deep research sacrafices chat capabilities for data aggregation and logic gains

ocean vortex
#

I think it would spend most of it's reasoning trying to debug why the search isn't working 💀

#

@echo aurora is there a channel for suggestions/feedback. Had a quick look but do not see it. Anyway, searcharena is the suggestion. All internet enabled models + deep research or any combo of those. 😇

calm sequoia
#

Seems like rewriting and dettailing the prompt is a big part of DL

#

And you can run it without web

ocean vortex
calm sequoia
#

You can skip this step in API 🙂 I like it though, sometimes it reminds of some important details

#

Anyone in here used DL without search?

ocean vortex
cedar tide
ocean vortex
#

R1-0528
AIME24: 91.4%
AIME25: 87.5%
GPQA: 81%

cedar tide
ocean vortex
#

yeah and that's quite impressive. But it would still look good and not misleading like now if they included the new version.

cedar tide
ocean vortex
#

it's already looking good for the size, no need to mislead people not terribly familiar with that naming smh

cedar tide
ocean vortex
ocean vortex
#

o3 is not open-source, obviously

cedar tide
ocean vortex
cedar tide
#

read my messages again.

ocean vortex
#

And as for R1, only people deep into AI know that there are 2 different dated versions. Most will simply read that this model is better than Deepseek period

cedar tide
ocean vortex
cedar tide
#

Jun 10, 2025

ocean vortex
ocean vortex
cedar tide
#

The problem is that DeepSeek doesn't know how to name its models normally.

ocean vortex
#

lmao

ocean vortex
#

yeah that too. I realize that this is probably not going away, but still, it really shouldn't be encouraged or justified in my opinion. There's already enough bias as it is given that they are free to include whatever benchmarks that they want and also test their model repeatedly with different parameters etc

cedar tide
#

here is a inteligent guy

#

even anthropic do it 🤦

ocean vortex
cedar tide
#

because of this damn comparison, cerebral had said that magistral is better than R1 🤦 while it is only in update 64 and against the old R1

ocean vortex
#

If you didn't notice I never said that Hunyuan are the only ones doing this. In fact I clearly said the opposite in my initial message

#

I wouldn't have had to modify any benchmarks tables in the past at all if this wasn't the case

cedar tide
ocean vortex
#

You seem extra defensive about Chinese companies, I wonder why is that, oh wait...

cedar tide
#

and I attack French companie even though I am French

ocean vortex
cedar tide
#

I just didn't understand what the solution was.

ocean vortex
#

don't backtrack now on your sht lol
You were the one making this aggressive, I was trying to have a polite discussion...

cedar tide
#

Now I understand that we need to add R1 0528 which would make the table heavier because of DeepSeek who doesn't know how to name these models.

cedar tide
ocean vortex
cedar tide
#

This is my way of debating, I still have improvements to make to my internal program, I still have to do post training

#

let's talk about grok 4.2 now

ocean vortex
#

My assumption, but I think they knew what they were doing there..

ocean vortex
cedar tide
ocean vortex
#

the cannabis edition

#

supplemented with ketamine

cedar tide
#

Sorry

ocean vortex
#

that too but I find this more appropriate given it's Elon we are talking about:

cedar tide
mossy drum
#

New model in Arena: kraken-250610-1

cedar tide
#

He doesn't say which entreprise he comes from?

mossy drum
whole wagon
#

Where is grok in the LLM arena. Surely it must be there by now

#

Is it stonebloom?

late path
#

no

#

not yet

whole wagon
#

Maybe it's this kraken

#

How do you know who makes the models

late path
#

Google's models always say they are made by Google when asked
Webdev arena can also provide some information

cedar tide
#

@mossy drum amazon

cedar tide
late path
#

kraken is bad...

rare python
sacred plaza
rare python
sacred plaza
sacred plaza
rare python
sacred plaza
whole wagon
#

Maybe the fact they are calling it grok 4 would indicate it is?

cedar tide
#

kraken-250610-1 : amazon reasoning model

whole wagon
sacred plaza
ocean vortex
# whole wagon

Honestly it makes sense to risk it both for OpenAI and xAI

#

they both can do it and the return is big, especially for xAI it's like 8.5X lol

whole wagon
#

xAI already shot up as the graph shows

#

It's at 26.1% after they said they call it grok 4

sacred plaza
ocean vortex
rare python
whole wagon
ocean vortex
#

I'm tempted to take another gamble but I don't want to be rooting for xAI 💀

#

If you bet on both, you net win quite big regardless if it's OpenAI or xAI

sacred plaza
rare python
#

They have sonar reasoning pro for free

ocean vortex
rare python
#

it's so slow tho

#

The token per second is 💀

ocean vortex
cedar tide
rare python
ocean vortex
ocean vortex
rare python
ocean vortex
#

Unless you meant Deepseek on that non-Deepseek website

#

but this is not their implementation

rare python
ocean vortex
#

Deepseek website basically gives me western sources only

rare python
#

They even use DNS as a language in the system prompt. I use controld, which is located in Hong Kong so DeepSeek R1 keep answer me in Chinese

ocean vortex
#

and knowing that we have Trump now, most of those are useless since he constantly spitts non-sense about China lol

rare python
leaden sun
#

kimi and minimax are pretty popular ais from china too, for searching, I actually like how minixmax is explaining the findings using proper citations with links (not hallucinated so far) like a peer reviewed journal article

alpine coral
sweet tinsel
#

This is out on context, but Google AI is tweaking, took a random Roblox Group as a source.

leaden sun
# cedar tide Now

cant wait to "romance" this new grok and figure its worldview 🤭

keen ferry
torn mantle
#

what does that mean?

sweet tinsel
# sweet tinsel This is out on context, but Google AI is tweaking, took a random Roblox Group as...
keen ferry
torn mantle
#

i see

#

"one more big run"

keen ferry
#

imagine how expensive grok 4 is gonna be on api

torn mantle
#

you cant just create a specialised coding model in a week

#

yea its gonna be hella expensive

keen ferry
#

I think prob should not expect it on lmarena

torn mantle
#

they may offer it for free for them

keen ferry
torn mantle
#

for limited days maybe

torn mantle
keen ferry
#

I think they will they have to sign some contract or something which is not worth it

alpine coral
#

given this tests search models available via API, it's kinda different from what you get using pplx and oai via the chat interfaces

ocean vortex
rare python
#

Chinese only

frosty lark
# alpine coral i don't use perplexity much these days (i find o3 on chatgpt works really well; ...

I have perplexity pro and it is not bad, but one has to be attentive because sometimes it pulls non-reliable sources.

It happened already that it pulled as a source an AI written article and it went in circles. If one is stern and says "I am unhappy that you use source XY as reliable one" then it corrects itself more often than not.

Also it is an alternative way to have access to major models in one spot without using openrouter.

#

what was bad for perplexity, at least until March, is that they were changing the UI and settings way too often

alpine coral
#

yeah i feel a lot of people pay for the sub for the pretty decent access to multiple models, and then the search is like a bonus

frosty lark
#

I guess perplexity needs to work on weighting their sources. Because likely they have a relatively good dabases but not all sources are equally good.

alpine coral
#

yeah they went too heavy on 'more is better'

#

is actaully pretty bad.. it doesn't discriminate well b/w quality of sources

frosty lark
#

Ah for me the search is the important part. I don't know why there is so much focus on coding. Getting great searches (or better summaries) is so awesome. For shopping, traveling, brainstorming, consensus check "what is the consensus on X?", getting mini research papers written collating info, etc... Coding is a plus IMO (not the contrary).

In my view a strong search has more applications than strong coding.

alpine coral
#

oh i couldn;t agree more

frosty lark
alpine coral
#

kinda why i became disallussioned with it... i wanted it to do search well - nothing else. instead they try to do everythign

#

meanwhiile google and oai are just catching up and eroding whatever moat they had

frosty lark
#

yeah. I will see if I will renew my subscription at the end of the year

#

so far it is like 50/50 because at the start of the year it was bad

#

yes I would expect google to excel at it. OAI not necessarily (they don't have google DB)

#

but from excellence it can come cost. Like "I have no rivals, pay me a lot"

alpine coral
#

yeah i agree - but also, o3 is just so intelligent with how it searches, that that arguably makes up for what it lacks in indexed/crawlabled material (but i agree, google in theory should kill it, in terms of the breadth and freququency of what they index)

#

but google still haven't quite got the tool usage part right

#

yeah ther's no comparison b/w pplx and google in terms of indexed material

#

david and goliath stuff

#

'continuous' being as key as '2012' - like recency of info is so critical with these rag web systems for some queries

keen beacon
#

i would think oai could compete well enough too at least more than perplexity. scraping and all that for pre-training, they can use that experience/data for search in their products. there's probably a lot more overlap than expected there (scraping, ranking, filtering, etc.)

alpine coral
#

i feel like Common Crawl constitutes the vast majority of web material used in training data, rather than specifical scraped stuff

frosty lark
#

if, IF, perplexity gets Apple's money, then they have some hope. Otherwise I guess they could somewhat compete for 1-2 years then they get obliterated

keen beacon
alpine coral
#

i dunno, but it was like 60% of the training data used for gpt3, or 70% exluding the digitised books thrown in (nothing officially published since afaik). it was a massive slice back then; i feel it's still a major if not bulk component of the training data for current foundational models

keen beacon
#

models are pre-trained on far more than gpt-3 nowadays too and the data composition is very different

rare python
#

Grok, DeepSeek. They use em dashes frequently and DeepSeek spam emoji like GPT4o

ocean vortex
#

o3 normal search is super cool, saved me a lot of time there 🫡

alpine coral
ocean vortex
alpine coral
#

yeah that's been my baseline understanding too (with some caveats - filtering etc)

#

it's no doubt become more nuanced; and efforts made to squeeze out more from less (albeit higher quality) data

#

but i feel like the general corpus of the scraped internet is like the foundation

rare python
#

Claude has the finest fine tunning in my opinion. Second is stonebloom

#

I'm talking about the writing style and personality

alpine coral
#

but distilling isn't a fine tune thing?

rare python
#

Wait what? Why is my Claude comment removed?

alpine coral
#

lol apparently

keen beacon
#

they can do distillation in both pre-training and post training. there are many types of distillation too

alpine coral
#

wait my comment got deleted too lol

#

tf is going on

keen beacon
#

huh lol

rare python
keen beacon
#

no mods online tho?

rare python
#

automod?

keen beacon
#

did discord show u it got auto removed for something?

rare python
#

No

#

It's just disappeared like that

alpine coral
rare python
#

I'm getting deja vu

alpine coral
keen beacon
#

another bot here removed it? 🤷

#

weird

alpine coral
#

pineapple if you'r silently beta testing some automod bot - intital results are not encouraging aha

ocean vortex
alpine coral
ocean vortex
#

So for Deepseek being similar to OpenAI, this does not apply

keen beacon
#

you need full access to the model if you're doing logit distillation, distillation isn't just that though. generating synthetic data for pre-training (from frontier models) is also distillation

alpine coral
#

i thought it was a teacher-student thing

#

but ig that arrangement can be used in post training too?

ocean vortex
keen beacon
#

you can do logit distillation in pre-training/post-training, and the data used can also be synthetic. distillation inception 😂

keen beacon
#

the composition was released

ocean vortex
ocean vortex
# keen beacon

uhh. Well you need insane amounts of data for it. I don't think you are getting 820b worth of data with a public API from some competitor model... They have a partnership with OpenAI and models on their own premises with access so probably used that. Or some their own internal model 🤔

#

also 14 epochs... lol

keen beacon
cedar tide
#

Meet Qwen-VLo, your AI creative engine:
︀︀
︀︀• Concept-to-Polish: Turn rough sketches or text prompts into high-res visuals
︀︀• On-the-Fly Edits: Refine product shots, adjust layouts or styles with simple commands
︀︀• Global-Ready: Generate image in multiple languages
︀︀• Progressive Generation: Build complex scenes step-by-step
︀︀
︀︀Perfect for designers, marketers, educators—and anyone who wants to bring ideas to life.
︀︀
︀︀👉 Try it: chat.qwen.ai
︀︀📖 Details: qwenlm.github.io/blog/qwen-vlo/

**💬 11 🔁 28 ❤️ 121 👁️ 3.3K **

▶ Play video
ocean vortex
hoary plaza
#

Chinese models sometimes have problems replying in english.

#

Even if they are replying in english, you see a few Chinese characters mix within

#

For now I don't see that in r1 though

#

I even tried the seed model too

ocean vortex
#

but yeah if you have THAT much synthetic data (1/4th of web) including it during pretraining will have big effect, especially doing more epochs

ocean vortex
#

@keen beacon I got curious how much ANY synth text data there's on HF, regardless of the used model:

#

that 290b they used is an absolutely insane amount 👀

rare python
#

🥀

sour spindle
#

anyone know of any other good AI discords

#

really enjoy the discoure here wondering if their were any other good ones

ocean vortex
# rare python yet no AGI

Yeah but phi models usually perform very good on benchmarks for their size. This may be the reason why. Insane amounts of data from potentially the best at the time OpenAI models. As an outsider there’s no chance for you to compile good quality synth dataset of such scale

rare python
unborn ocean
#

since the phi 1 paper

#

"text books are all you need" or something along those lines

#

they are kind of the original sub 7b models

unborn ocean
#

there are a lot of datasets on hugging face using synthetic data

#

just checked: there is a 100% synthetic ~120b billion post-training dataset used for llama nemotron

#

and in their paper they also talk about these synthetic datasets for CPT + distillation: "first trained with knowledge distillation for 65B tokens using the same distillation
dataset, followed by 88B tokens of continued training" for the 340b model! <- to summarize: the scale of synthetic data is quite normal-ish and o3 is wrong

#

and btw if we are talking pre-training only, there are also many other dataset that are even larger, this was just a quick example i could find of a larger model being used for the process

cedar tide
#

Yes

keen beacon
#

they never released the image gen on meta chameleon either

#

i believe

rare python
#

Competitive edge?

alpine coral
#

which company is kraken from?

polar roost
#

why's my model just stuck responding??

#

how do i cancel it

echo aurora
polar roost
#

so what do i do?

#

is there anything i can do to make it stop or just wait?

echo aurora
polar roost
#

ahh i see

solar hollow
#

is it fair to say, we have barely seen any improvement since these "reasoning" models came out like o1?

#

to me it seems we just get a bit more usecases, thats it

#

companies are fixing common errors and tendencies

#

but no real improvements in reasoning

ocean vortex
#

🚀 Exploring Nvidia's Llama Nemotron Post Training Dataset

I took some time to visualize and understand Nvidia's dataset release, "Llama Nemotron Post Training Dataset". Here's a quick snapshot of its impressive scale and composition:

  • Total Tokens: 15.97 Billion (with Assistant Tokens accounting for 12.83 Billion!)
    ...
#

it's 16B

#

that's not nothing but it's one of the biggest datasets of this kind on HF I think. Total estimate of all of the datasets like this one to be around 100b seems roughly correct tbh

#

so yeah... 290b that MS used is still mind blowing. Insane scale

leaden sun
solar hollow
ocean vortex
#

if they were using that for this tiny model, then for next gen frontier models we are most definitely close to the web corpus with synth data in quantity

unborn ocean
#

but should have taken the actual rows from the dataset, lol (which are actually somewhere around 4m)

ocean vortex
unborn ocean
#

in your world

#

/ mind

ocean vortex
#

2.5 Pro can only match it if we strip o3 of tools

unborn ocean
#

yes

#

o3+tools is very good within chatgpt

#

otherwise few people will actually use it in api and so on (it = tools enabled)

ocean vortex
#

Yeah mostly only for testing... to be able to say that 2.5Pro is equivalent. But then almost everyone gonna use tools on chatgpt so it's a paradox catgrin

unborn ocean
#

imo 2.5 pro is better in many core capability things

#

but internal tools is just a game changer (for the average user within a chat app, not necessary in all other cases)

#

no

#

how you can you say it is better then 🤣

ocean vortex
unborn ocean
#

they both need to catch up to each other in certain areas

#

but hey that is good for us - competition and all

#

imagine what the engineers must have felt when first interacting with gpt4 in internal test phases

#

like gpt4 and the imminent past around it was like a turning point in history

ocean vortex
unborn ocean
ocean vortex
#

lol what are you talking about

unborn ocean
#

(and all the people got no money as far as i know)

ocean vortex
#

gpt4.5 is directly better in like every single way, including size/capacity

keen beacon
#

gpt 4 is also an instruct model

#

gpt 4.5 fit to the math data and stuff too well 🤣

ocean vortex
#

all chat models are instruct. For gpt3.5-instruct they allowed you not to use their chat template (text completion API) which is why it could act like a base model. But generally instruct = chat.

#

I think it is somewhat undertrained in post compared to their other models. But that's to be expected given the model size

unborn ocean
#

i feel like og gpt 4 also had to be "undertrained" by current standards or at least not as overfit as 4o

#

because of its size and the lack of compute they had back then

keen beacon
#

its definitely undertrained lol

#

and anyway gpt 4.5 is like the gpt 4 recipe scaled up, ppl who complain about newer models have their modern version of it (4.5)

unborn ocean
#

bc i am wondering how long it will take before moore's law (Jensen's law) carries us to a future where we can have that stuff as the standard model

keen beacon
#

i tihnk i recall nathan lambert estimating 5 trillion or something

#

i dont really care about gpt 4.5 though, so i don't know much

#

the future is small models

unborn ocean
#

small is relative to the compute we have

#

and i beg to differ

#

the future is a place where we don't have to resort to building a small OR a large model, but can just build a model that combines both (e.g. mixture of nested experts, just different size experts and idk a lot of other stuff)

#

and the future beyond that consists of models that are fundamentally constructed to have adaptable computational demand (imo)

lilac inlet
#

Where can I share my images?

zinc ore
#

About to have a new king on the block

tepid lynx
#

anybody knows best ai for back and front end dev.?

#

web dev

elder rapids
#

bro is just saying stuff 😭 Google wasn't even mentioned

brittle tiger
#

Will grok 4 have better evals than the fake grok 3.5 evals Elon retweeted?

#

OpenAI investor money heading to Google now that they've realized they need TPUs

leaden palm
#

this is where your claude subscription goes

#

the information? it's very well known

#

let me fire off a Search query

#

ok models got a little confused

#

"surely o4-mini deep research will be better than a single search"
...it doesn't know what o1 is

#

why do i even bother

#

time to see if claude research is any better

#

claude research best research

#

it got stuck at 444 sources

#

Examining the situation

#

all in all (1x claude research, 2x o4-mini research, 1x final sonnet assessment)

civic flame
#

a google model better than 2.5 pro has landed on one of my red teaming platforms - doesn't seem like stonebloom, less succinct (so more like kf & blacktooth)

#

got any prompts?

whole wagon
#

Red teaming platforms?

civic flame
#

i've worked with some labs directly and some indirectly to help them red team models before they're released

#

it seems to be better than stonebloom

#

for the prompt Jess needs 200 pieces of ribbon, each of length 110cm, to decorate a room for a party. Ribbon is sold in rolls of 25m each. How many rolls of ribbon does she need to buy?:

#

anon model:

Of course! Here is a step-by-step solution to the problem.

Step 1: Calculate the total length of ribbon needed.
Jess needs 200 pieces, and each piece is 110 cm long.

  • Total length = 200 pieces × 110 cm/piece
  • Total length = 22,000 cm

Step 2: Convert the length of a single roll of ribbon to centimeters.
The ribbon is sold in rolls of 25 meters. We need to convert this to centimeters to match our other measurement. (Remember: 1 meter = 100 centimeters).

  • Length of one roll = 25 m × 100 cm/m
  • Length of one roll = 2,500 cm

Step 3: Find out how many pieces of ribbon can be cut from one roll.
Each roll is 2,500 cm long, and each piece needs to be 110 cm.

  • Pieces per roll = 2,500 cm ÷ 110 cm
  • Pieces per roll = 22.72...

Since Jess can only cut whole pieces of ribbon, she can get 22 full pieces from each roll.

Step 4: Calculate the total number of rolls needed.
Jess needs a total of 200 pieces, and she can get 22 pieces from each roll.

  • Number of rolls = Total pieces needed ÷ Pieces per roll
  • Number of rolls = 200 ÷ 22
  • Number of rolls = 9.09...

Since Jess cannot buy a fraction of a roll, she must round up to the next whole number.

Therefore, Jess needs to buy 10 rolls of ribbon.

#

(correct)

#

stonebloom:

Here is a step-by-step solution to the problem:

1. Calculate the total length of ribbon needed.

Jess needs 200 pieces, and each piece is 110 cm long.

Total length needed = 200 pieces × 110 cm/piece
Total length needed = 22,000 cm

2. Convert the units so they match.

The total length needed is in centimeters (cm), but the rolls are sold in meters (m). We need to convert them to the same unit. Let's convert the total length needed into meters.

We know that 100 cm = 1 meter.

Total length needed in meters = 22,000 cm ÷ 100
Total length needed = 220 m

3. Calculate the number of rolls needed.

Jess needs 220 meters of ribbon in total. Each roll contains 25 meters.

Number of rolls = Total length needed ÷ Length per roll
Number of rolls = 220 m ÷ 25 m
Number of rolls = 8.8 rolls

4. Round up to the nearest whole roll.

Since Jess cannot buy a fraction of a roll (0.8 of a roll), she must buy enough whole rolls to cover the total amount needed. If she buys 8 rolls, she will only have 200 meters (8 x 25m), which is not enough. She must round up to the next whole number.

8.8 rounds up to 9.

Answer: Jess needs to buy 9 rolls of ribbon.

#

(incorrect)

#

did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success rate

brittle tiger
#

Same outlet from today

elder rapids
rare python
#

Start with "Of course! Here's the..."

zinc ore
#

It likely also says made by Google when you prompt it

jade egret
#

grok 4 😭

wintry tinsel
#

I got very low expectations for grok slop

#

It’s so ironic to me that so much effort is put into making Claude jailbreak resistant when it is infact one the easiest to jailbreak

#

Open AI is bulletproof

jade egret
#

google good ( :

verbal nimbus
#

Going to therapy is a sign of strength, not weakness. My sponsor BetterHelp makes therapy simple, with 10% off your first month to help you get started: https://betterhelp.com/xiaoma

Anyway, today I surprise ChatGPT by speaking computer language aka hexadecimal, a base-16 number system that computers use to process information. When spoken alou...

▶ Play video
#

It switches between languages seamlessly and the expressions are so realistic.

rare python
surreal creek
#

using our free weekend of Gemini to write a backdoor for the o3 API 😂

polar roost
#

is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me

patent aspen
#

maybe if o3 gets better

sacred quail
# rare python

I really dont understand how and why gemini API offers free 100 request per day which is in app you must pay 20 dollar for that

#

so weird

rare python
#

Free is free

sacred quail
#

im not. Im just doesnt understand. While web/mobile app quality is worse

#

ill gladly take it

rare python
#

Seriously Gemini app needs a lot of works

rare python
sacred quail
#

The thing is gemini coming default with android phones so they have already billions user

#

And

#

Im guessing because of that

#

They using too much safety filters in app

#

And big system prompt

#

I dont think they can solve this

rare python
#

and weird performance regression compare to pure Gemini in API and AI Studio

sacred quail
#

it is really obvious

#

Like

#

Even in video analysis

rare python
sacred quail
#

No need any safety or system prompt for simple video analysis. Just summary the video. But no. AI studio certainly better even for video analysis

sacred quail
#

Models can be more free or more okey to makes mistakes in ai studio, because you want do decide to use it gemini in there. But in android phones, gemini is installed default. They didnt to ask you

#

So because of this they want zero problem, and for this using ridicioulus safety filters, system prompt, and even maybe there is something like temperature or another tunings i dont know

sacred quail
# rare python

@rare python hey... We got baited i guess. When i click the link

#

There is no 100 rpd

rapid token
#

hello

rare python
sacred quail
#

How

rare python
sacred quail
#

huh

#

What is that 250,0000?

rare python
#

Lol

sacred quail
#

Extra one "0"

rare python
#

Typo

sacred quail
#

im really not good at english man. Are you trolling or not lol. You already got me

#

Dont play with my emotions pls

rare python
#

I clicked the link

#

Share your screenshot

sacred quail
#

Oh, maybe its a regional thing ?

#

in mine there is still "_"

nimble trail
#

I still can't use it tho.

sacred quail
#

Openai starts to using google's chips huh

#

Google is final boss at this rate

#

They have datas, hardwares, money

elder rapids
#

2.5 pro param count confirmed?

#

no way flash is 17B

rare python
noble zinc
#

maybe 288 active

rare python
#

288B MoE

elder rapids
elder rapids
#

if it were 288B MoE then it would misrepresent what they're trying to measure

elder rapids
civic flame
tepid lynx
#

who knows best ai for webdev?

#

(gemini not included)

keen fulcrum
#

officially called grok 4 than grok 3.5 now

calm sequoia
#

Such a jump is needed just for paring with o3 and 2.5 PRO

#

I wonder of xAI ever introduced anything new to the industry or are they just copying stuff

#

Would love if there existed a better model than o3 though

polar roost
#

is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me

torn mantle
#

Havent seen yang yapp about it

#

Well let's just hope its a bit better than o3

#

I would be fine with that

whole wagon
cedar tide
# cedar tide
poll_question_text

Is it fair that Hunyuan shows that their 80B model is close to the performance of the old R1?

victor_answer_votes

3

total_votes

6

ocean vortex
# ocean vortex
poll_question_text

Is it fair that Hunyuan did not include the current version of R1 in their table and compared it only against the replaced original version of R1 which performs worse?

victor_answer_votes

2

total_votes

4

#

😭 😂

ocean vortex
#

🫃

keen fulcrum
#

can drop at any time after july 4th

whole wagon
#

Notice how he didn't say the year

#

Sneaky Elon

torn mantle
#

Notice how they didn't say its gonna be the best model in the world

#

I mean you would expect massive benchmark boost if the improvement is bigger than from grok 2 to 3

ocean vortex
keen fulcrum
#

I guess it will be released for supergrok first

#

They are optimising for coding currently, I guess it won’t be great there

whole wagon
# whole wagon
poll_question_text

Do you think grok 4 will be good?

victor_answer_votes

11

total_votes

16

victor_answer_id

2

victor_answer_text

No

unborn ocean
#

It is weirdly enough also the exact param count of behemoth

#

So maybe they just got it from there

solar hollow
#

bro the xai ppl are just as delusional as their ceo, some employee posted on twitter that their employees are 10 times more worth than employees of other top ai labs

torn mantle
#

this time i feel its different

#

maybe we will have a really good model

#

i would care less of what they said if they released a really good product

rare python
torn mantle
#

just the fact that

#

wait

#

@rare python did you just join this server

#

anyway

#

just the fact that they renamed it grok 4 is kinda interesting

#

but why isnt elon comparing it to openai models

#

ok so he said it will be the smartest by a big margin

#

but didnt he say the same for grok 3

rare python
#

yeah

#

"Smartest AI in the world"

#

Got beaten so quickly

rare python
torn mantle
#

interesting

#

i really think we are yet to unlock full potential of reasoning models

#

there is still a lot of low hanging fruits around

#

"first principles" or whatever he called it seems one of them

#

they probably have so many recipes and run multiple hypothesis scenarios to select the best ones

#

and their advanced model may be used even further to generate much better ideas

#

parallel reasoning is also a thing

#

xai issue was that their reasoning traces were so inefficient

#

so even if you want to apply multiple methods it will just add complexity to the reasoning

#

so they probably fixed that

rare python
#

It's not true understanding, which hypothetically only world model can understand the nuance

torn mantle
rare python
#

Yes mimic. We need them to understand on their own why this code is bad and unmaintainable in real battle test

torn mantle
#

but could this lead to even smarter models? what are the limits? do we need better algorithms?

torn mantle
#

thats what all ai labs are trying to do

rare python
#

metacognition is the wall I think

torn mantle
#

actually many ai labs RL on math and coding

#

which then translated to generalized reasoning model across different domains

#

now imagine if we had a big high quality reasoning traces

rare python
#

Doesn't help that the labs are hiding raw thoughts

torn mantle
#

we still haven't reached the level of metacognition yet.. let's just mimic human intelligence first, and then we can explore the realm of consciousness and whatsnot

torn mantle
#

thats their secret recipe

#

they cant just share that

#

they have their own R&D teams

#

its not like an open source project where many people contributes

#

they know what they are doing

solar hollow
#

we are still not entirely fair though, especially when we talk about "agi".
How many ppl are there today, that dont have basic reasoning capabilities? its a lot, and i am not trying to be insulting, its just how it is

torn mantle
#

and they dont need people for that

solar hollow
#

the avg human is quite dumb

rare python
solar hollow
#

we compare the ais with capable humans

#

where the ai looks dumb

torn mantle
#

its not like you will get any different path if you debug it, the reasoning patterns are encoded from the training phase

#

its not like you will change your prompt and then voila you will notice a big difference

rare python
solar hollow
#

but even dumb humans have sth that llms dont have, real physical representation, which makes them economically viable

rare python
#

Users can do PR and find the issues faster collectively

torn mantle
rare python
torn mantle
#

you can still report tho

torn mantle
rare python
torn mantle
#

ofc you wont see any fixes, the process isnt just a simple one click button

rare python
torn mantle
#

and thats not their purpose either

rare python
torn mantle
#

and now we are back at generalization point, they want to generalize not fix one issue

#

what makes you think that one fix that you reported wont mess up other things as well?

rare python
torn mantle
#

are you sure this is a reasoning process issue?

#

it may be just that the base model is dumb

rare python
torn mantle
#

the instruct model needs to be intelligent first to have a good reasoning process, thats why they are still iterating on deepseek instruct ( deepseek r2 -> deepseek v4 + RL )

rare python
keen beacon
#

no human wants to wait for a LLM to make their response by just seeing "responding..." or "thinking.."

torn mantle
rare python
#

at what?

torn mantle
#

gemini 2.0 pro is the instruct model

keen beacon
#

they want to see the LLMs thought process in a simplified way

torn mantle
#

gemini 2.5 pro is also an instruct model

keen beacon
torn mantle
rare python
keen beacon
rare python
#

I'd say Claude 3.7 Sonnet is a better base model than 2.0 Pro

keen beacon
#

why not just local host the model

#

won't you be able to see its thought process more in depth in terminal or sum

rare python
keen beacon
#

or wtv its called

#

not google workshop

#

search up google colab

rare python
keen beacon
rare python
#

Not interested in local model. They are bad anyway

unborn ocean
keen beacon
#

how else do u expect them to be locally ran

#

😭

keen beacon
#

ts in the article

rare python
keen beacon
unborn ocean
rare python
unborn ocean
#

o3 is not just the same model + RL

#

very similar though likely

rare python
unborn ocean
rare python
unborn ocean
#

unless you want to argue about "Ahm actually, I PERSONALLY don't like it 🤓 , bc claude is bewter at coding"

unborn ocean
rare python
#

in lmarena Gemini has higher elo score for coding?

unborn ocean
#

idk

rare python
#

bruh

unborn ocean
#

check yourself before claiming that it is better / worse

rare python
unborn ocean
#

and?

rare python
#

Unless you really believe that gpt4o is above o3 in coding

rare python
unborn ocean
#

you said people prefered it as an argument

#

and i showed you / you saw yourself that that is not true

rare python
#

Not lmarena

#

human preference with opinion, not some random elo votes

unborn ocean
#

whut

devout sun
#

is o3-pro a separate model or just regular o3 with hacks

rare python
unborn ocean
#

coding category -> quite clear that gemini 2 was more liked.

sonnet is way better at agentic stuff and "real world" coding instead of chatting

ocean vortex
unborn ocean
rare python
ocean vortex
#

you can't make o1 to o3 improvement with "further RL", there's no free launch to improve it this much jus with fine-tuning. It was already reasoning effectively

unborn ocean
rare python
ocean vortex
#

Besides, this is how it works. You need a better chat base model to make next gen

unborn ocean
ocean vortex
#

otherwise you are just stuck doing minor improvements like Google with 2.5Pro updates

unborn ocean
#
  • 3.7 might be the base for 4 but we don't know
rare python
ocean vortex
#

@rare python o3-preview was old base. Production o3 new base. The preview pushed test-time compute which is why it had insane cost

rare python
ocean vortex
#

or equivalent to it

rare python
ocean vortex
rare python
#

gpt4.1 is equal to 3.6 sonnet here

whole wagon
#

4.1 is not great ngl

#

Not sota for sure

ocean vortex
#

Aider is just a single benchmark lol

rare python
ocean vortex
#

It's more of a base for o3

whole wagon
#

It's not SOTA for non reasoning either

rare python
#

🥴

ocean vortex
ocean vortex
#

Some things it does better than anything else, in others it simply disappoints

#

A bit like gpt4.5 but to a lesser extent

rare python
#

Articial Analysis use LCB for coding

#

which is competitive code generation If i remember correctly

#

Opus 4, Sonnet 4 is optimized for SWE bench, which they scored significantly higher than other models

#

So that might be the reason people prefer it

#

@unborn ocean ^^^ coder prefer it is also human preference

unborn ocean
#

the only thing we know is that 3.5 opus was not involved

#

(i think)

rare python
#

The base model for RL training

#

Not the "original pre training base model"

unborn ocean
rare python
#

Even it's more used than small model like o4 mini and 2.5 Flash

unborn ocean
#

it is very likely that they interwove the two things

#

(which is why they did not release two different models)

rare python
unborn ocean
unborn ocean
#

gtg

rare python
#

@torn mantle

#

But that's the best general human preference I can find. Can you find a more complete data?

#

Then what's the purpose of rating sites like imdb then?

#

You don't understand? I'm making an anology

#

imdb is also curated the data of what TV/Movies people like, like openrouter that is popular enough to be considered "accurate" data

brittle tiger
rare python
#

Not 100% accurate

#

ok?

#

I'm not and this is an anology

rare python
#

It's the same no? User vote, like user use LLM on openrouter

#

So what? We will never accept any data because all of them are incomplete?

So we end this debate then?

#

Yes. And how your "incomplete" related to my argument that openrouter is good enough for measuring human preference?

#

ELI5

#

Then why did you tag me on this? I thought we are talking about the same thing?

#

🤔

#

ok

rare python
# rare python <@295243581818404874>

Funny how they "enhanced user trust and experience" with 3.7 Sonnet raw thinking then decreased it with Claude 4? Am I interpret it correctly?

#

@torn mantle this is the "track the reasoning to better the prompt" I'm talking about. About your argument of "this is their job of improving without user feedback with or without raw CoT" I'm not buying that.

#

not perfect, but good enough, yes

#

Which is maybe why their models are better on real world usage?

#

Different focus compare to OpenAI and DeepMind?

rare python
rare python
# torn mantle the instruct model needs to be intelligent first to have a good reasoning proces...

hmmmmm.... about this. DeepSeek R1 0528 has a higher performance than DeepSeek R1 0120, which 0528 use some, or majority distillation data from 2.5 Pro. It boosted a lot just for post training, unless there is a source from DeepSeek said R1 0528 is based on DeepSeek V3 0324?

Maybe DeepSeek R1 0528 is still based on this: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

civic flame
#

yeah

#

although it may have been added before yesterday i hadn't checked the models available in a little bit

late path
#

Hope it comes out soon
It's been a week since stonebloom and no new models so far
AI winter😭

brittle tiger
patent aspen
#

I think we just perceive it that way because we've become accustomed to an absurd rate of launches

#

And there's a bit less low hanging fruit, but still an awful lot

#

Another thing is that incrementalism is underrated

#

It looks small on short timescales, but on large timescales, incrementalism drives most progress

civic flame
#

a checkpoint perhaps

#

🤷‍♂️

rare python
civic flame
#

it's slower than stonebloom

#

roughly the same speed as blacktooth i'd say?

rare python
civic flame
#

i didn't test kf enough to have a point of comparison

rare python
#

Why do you speak in philosophy?

#

Lol my custom instructions ban the AI speak overly philosophical

Never write in a pseudo-profound or overly philosophical style.
nimble trail
#

So.. infinite context?👀

patent aspen
#

It's not going to be true cross attention across everything, but it's going to simulate it

#

The only point I'm making is that innovation is always happening, although it's lumpy by nature and long dry spells are normal

rare python
patent aspen
rare python
#

what is long dry spells are normal means?

patent aspen
#

That part was less concrete. "Long dry spells"means there can be long periods of time where nothing particularly interesting seems to be happening

rare python
#

inconsistent

#

They both keep saying "This is a classic geometry puzzle" like it made them smarter somehow

#

o3 thought too much

rare python
#
unborn ocean
#

that was o3 preview i believe

balmy mist
#

whatever happened to R2?

rare python
#

I'm broke

#

💔

#

Yeah lmarena version has token output limit

#

Gemini 2.5 Pro also failed, at least in direct chat

torn mantle
rare python
#

They also said DeepSeek R2 gonna released before May

torn mantle
#

wrong

#

even if you said the same thing, you'd still be wrong and I'd still be right

whole wagon
#

Deepseek R2 failed

#

Cos the ai companies hid the cot

#

So they didn't have the training data kek

#

I wonder what will they even do going forward

rare python
#

OpenAI was even more strict with o1 back then

whole wagon
#

They had Gemini for a while

whole wagon
#

So basically they distilled both of them but now they can't

rare python
#

The distillation prevention from OpenAI

whole wagon
#

Yeah openai used to give the full chain of thought not just summaries

#

Then they disabled it but Gemini kept it for longer

#

And now Gemini is also summaries

whole wagon
#

In the interface also

rare python
#

No in the UI it's so summarized

#

Iirc

whole wagon
#

Yeah I know. It was not for a few months

#

And it was great but they disabled it cos of deepseek

#

They stated that as the literal reason

rare python
#

DeepSeek give back to the community

#

Theft stole from theft

#

🗿

whole wagon
#

How many people really think LLMs are theft

#

I'm curious

balmy mist
placid spear
#

Has Image upload been removed from Claude 4 sonnet & opus?

calm haven
#

很棒!!

whole wagon
unborn ocean
#

"best real-world useful results of any AI"

#

not "best" in general anymore

#

and it kind of sounds like he is talking about a custom coding model (so being good at challenging real-world use cases (mostly coding) is not that surprising)

leaden palm
#

just remember the leaderboard could always be more illogical

keen fulcrum
zinc ore
#

Tech dev learning from strawberry guy

main gulch
calm sequoia
#

TechDev is the worst elon fanboy out there. I've tried to follow it for leaks, but its just constant appeasement of xai.

ocean vortex
#

2.5Pro first release had full thinking, Claude models have mostly raw thinking, most Chinese models do, Mistral models as well...

#

With 2.5Pro you can still extract raw thinking with API I think

#

In fact if we look at all reasoning models, more do have raw or almost raw thinking than the ones with summarized thoughts

#

I think @keen beacon was able to extract it recently. I'm very rarely using it with API though so can't say for sure

#

Claude models are not summarized

#

they are essentially raw

#

with minimal changes/obfuscation