#general | Arena | Page 61

misty vault Jun 26, 2025, 1:50 PM

#

omaygot

drifting thorn Jun 26, 2025, 2:14 PM

#

Gemini 2.5 Pro learns to commit suicide be like…

tall summit Jun 26, 2025, 2:17 PM

#

i'm not in any other ai servers personally but this is the only ai server i need

misty vault Jun 26, 2025, 2:20 PM

#

drifting thorn Gemini 2.5 Pro learns to commit suicide be like…

google solved ai safety

#

if ai no satisfy human -> suicide

#

Thumbs_Up

alpine coral Jun 26, 2025, 2:47 PM

#

misty vault google solved ai safety

safety? you mean sycophancy

#

bit of an over correction it seems lol

lone vector Jun 26, 2025, 3:05 PM

#

Do people like losing money

ocean vortex Jun 26, 2025, 3:11 PM

#

this is reminiscent of what Bing did but much much less bad. It was waiting to happen tbh given that Google was struggling with fine-tuning for a long time. They almost got it right recently but they didn't use to have anywhere near the same popularity and reach with Gemini that they have now, hence the publicity...

keen beacon Jun 26, 2025, 3:14 PM

#

lone vector Do people like losing money

Those odds seem fair

ocean vortex Jun 26, 2025, 3:15 PM

#

they probably just pushed "user is always right" tuning and sycophancy just a little bit too much. So they backed it into the corner lol

rare python Jun 26, 2025, 3:16 PM

#

ocean vortex they probably just pushed "user is always right" tuning and sycophancy just a li...

new stonebloom I think they did it right

#

Less sycophancy and straight to the point

ocean vortex Jun 26, 2025, 3:16 PM

#

so instead of "I apologize" always being the response to mistake, it gets increasingly more apologetic until nuking the thing becomes an option

#

kind of like with GPT you can get it increasingly more unhinged with "unsafe" context, instead there are no bounds defined for sycophancy or sucking up lmao

leaden palm Jun 26, 2025, 3:51 PM

#

well even though i explained multiple things it could do, it only tried one of them in one specific way and immediately gave up after it didn't work as intended

torn mantle Jun 26, 2025, 4:02 PM

#

lone vector Do people like losing money

is polymarket a thing

#

I wanted to try it a year ago, but I hadn't had the chance

lone vector Jun 26, 2025, 4:03 PM

#

It’s banned for the US. VPN is maybe a workaround

torn mantle Jun 26, 2025, 4:10 PM

#

im not from the US tho

#

if its banned in the US then why did XAI signed a partnership with them

#

elon just wants to ruin everything

keen fulcrum Jun 26, 2025, 4:14 PM

#

lone vector Do people like losing money

how?

#

did google release new model?

hoary plaza Jun 26, 2025, 5:13 PM

#

keen fulcrum did google release new model?

Gemma one

#

Atleast on aistudio

cedar tide Jun 26, 2025, 5:32 PM

#

Screenshot_2025-06-26-19-29-45-103_com.android.chrome-edit.jpg

#

Screenshot_2025-06-26-19-31-58-156_com.android.chrome-edit.jpg

full idol Jun 26, 2025, 5:37 PM

#

torn mantle if its banned in the US then why did XAI signed a partnership with them

they AIM to be legalized in US, but they focus on global market. Kalshi is opposite - it is US only.

ocean vortex Jun 26, 2025, 6:24 PM

#

cedar tide

niiice

leaden palm Jun 26, 2025, 6:25 PM

#

poll_question_text

how will the anthropic book scanning fair use ruling impact google's language models?

victor_answer_votes

11

total_votes

15

victor_answer_id

3

victor_answer_text

they're already using google books

tall summit Jun 26, 2025, 7:16 PM

#

lone vector Do people like losing money

nobody knows until they know

tall summit Jun 26, 2025, 7:16 PM

#

cedar tide

?!

ocean vortex Jun 26, 2025, 8:27 PM

#

lone vector Do people like losing money

How is the "best" defined here?

leaden palm Jun 26, 2025, 8:29 PM

#

ocean vortex How is the "best" defined here?

#

ocean vortex Jun 26, 2025, 8:30 PM

#

Yeah I read it just now https://polymarket.com/event/which-company-has-best-ai-model-end-of-2025

Polymarket

Which company has best AI model end of 2025?

Polymarket | This market will resolve to "Yes" if any model owned by Google has the highest arena score based off the Chatbot Arena LLM Leaderboard (https://...

zinc ore Jun 26, 2025, 8:31 PM

#

leaden palm

Alphabetical order for tie breaker sure is something tho

ocean vortex Jun 26, 2025, 8:31 PM

#

Honestly those odds seem just about fair. Still makes more sense to bet on OpenAI though IMO 👀

#

there's a reasonable chance GPT5 will top it

#

#

Now I'm thinking on actually doing this... catgrin

zinc ore Jun 26, 2025, 8:36 PM

#

Google has gotten very good at staying in the lead on llmarena. If it ends up being next Gen vs next Gen end of year, I'd guess Google continues the trend.

#

If it's openAI next Gen vs Google current gen, then of course openAI

ocean vortex Jun 26, 2025, 8:37 PM

#

zinc ore Google has gotten very good at staying in the lead on llmarena. If it ends up b...

GPT5 is in the advanced stage though and it's to be very different/new. With Google everything is less certain

zinc ore Jun 26, 2025, 8:37 PM

#

But I think Google will drop something once something like gpt5 drops imo

#

I personally think we see a Gemini 3 by years end, but that's a tentative guess

ocean vortex Jun 26, 2025, 8:38 PM

#

No one knows for sure if they will even release Ultra at all... And with 2.5Pro they are somewhat struggling to make major gains

zinc ore Jun 26, 2025, 8:38 PM

#

They seem pretty neck to neck to me, between 2.5 pro and o3

#

It's interesting how different everyone's impressions are tho

ocean vortex Jun 26, 2025, 8:39 PM

#

zinc ore They seem pretty neck to neck to me, between 2.5 pro and o3

yes but like I said we know GPT5 is a thing 100%, with Google we have no clue. If they gonna release something big I doubt it's before 2026

#

Not impossible, but less of a chance

zinc ore Jun 26, 2025, 8:40 PM

#

OpenAI has been advertising gpt5 for like two years

#

I don't think their training is ahead of Google's personally

ocean vortex Jun 26, 2025, 8:41 PM

#

zinc ore OpenAI has been advertising gpt5 for like two years

that's because the initial thing is already released under gpt4.5 lol

#

that name is now referring to a different model

zinc ore Jun 26, 2025, 8:41 PM

#

ocean vortex that's because the initial thing is already released under gpt4.5 lol

Exactly, which is why I think we can't measure the maturity of "current gpt5"

surreal creek Jun 26, 2025, 8:42 PM

#

zinc ore They seem pretty neck to neck to me, between 2.5 pro and o3

Only with style control on, Google has a bigger lead with it off as the market accounts for

ocean vortex Jun 26, 2025, 8:42 PM

#

zinc ore Exactly, which is why I think we can't measure the maturity of "current gpt5"

we can because there's a fairly clear way to improve the models and we know what it is going to be. It's not some enormous model experimentation...

surreal creek Jun 26, 2025, 8:43 PM

#

77 point elo gap with no style control compared to 16 with style control - gonna be hard for OpenAI to close that gap

zinc ore Jun 26, 2025, 8:44 PM

#

surreal creek Only with style control on, Google has a bigger lead with it off as the market a...

I wasn't meaning on llmarena, I meant overall. I don't treat the arena as an objective assessment of capability.

ocean vortex Jun 26, 2025, 8:44 PM

#

it took less than 5 months to move from o1 to o3

surreal creek Jun 26, 2025, 8:44 PM

#

the AI companies do tho lol

zinc ore Jun 26, 2025, 8:44 PM

#

I don't treat any benchmark that way, but I'm js they seem to be very close on many benchmarks

surreal creek Jun 26, 2025, 8:44 PM

#

ocean vortex it took less than 5 months to move from o1 to o3

Yeah was there never an o2 ?

ocean vortex Jun 26, 2025, 8:45 PM

#

surreal creek Yeah was there never an o2 ?

that's irrelevant lol

#

it's just naming

#

you can consider o3 as "o2"

surreal creek Jun 26, 2025, 8:45 PM

#

seems kinda convoluted ig

ocean vortex Jun 26, 2025, 8:45 PM

#

I meant the improvement itself to the next gen

#

with GPT5 they are also moving to hybrid reasoning which is a big change

surreal creek Jun 26, 2025, 8:46 PM

#

well yeah I understand the naming concepts are kinda ridiculous because they’re just steady improvements of the same core product, eventually getting renamed for branding rather than having “GPT-17” someday

primal orbit Jun 26, 2025, 8:47 PM

#

how many people left in OpenAI from the original roster during gpt3/4? From the prominent ones looks like only Sam Altman. Don't have much belief in gpt5.

ocean vortex Jun 26, 2025, 8:48 PM

#

surreal creek well yeah I understand the naming concepts are kinda ridiculous because they’re ...

Right. Which is why saying that gpt5 was getting advertised for 2 years is missing the point. That name meant several different things at different times

zinc ore Jun 26, 2025, 8:48 PM

#

That was actually my point, that the naming didn't imply how long it had been worked on

#

I was making the point you countered with

ocean vortex Jun 26, 2025, 8:49 PM

#

zinc ore That was actually my point, that the naming didn't imply how long it had been wo...

it's not the naming that implies this

#

But rather the release date of o3

zinc ore Jun 26, 2025, 8:49 PM

#

So how do you know gpt5 won't end up being something like o5 instead of o4?

ocean vortex Jun 26, 2025, 8:50 PM

#

even if they were not working on gpt5 before they released o3 (unlikely), the timeline for GPT5 in that case would probably be around September. Still well before the end of the year

ocean vortex Jun 26, 2025, 8:51 PM

#

zinc ore So how do you know gpt5 won't end up being something like o5 instead of o4?

what

zinc ore Jun 26, 2025, 8:51 PM

#

I'd bet we get gpt5 this year as well

ocean vortex Jun 26, 2025, 8:51 PM

#

it's just a name for next generation

zinc ore Jun 26, 2025, 8:51 PM

#

I just don't see how we assess that their training is ahead

ocean vortex Jun 26, 2025, 8:51 PM

#

it's a meaningless number, just like there was no o2

zinc ore Jun 26, 2025, 8:52 PM

#

ocean vortex it's a meaningless number, just like there was no o2

I know that, I'm saying that they don't actually release it until it's actually a second generation jump after o3

ocean vortex Jun 26, 2025, 8:52 PM

#

zinc ore I just don't see how we assess that their training is ahead

they already price matched o3 and 4.1

#

the only logical next release is GPT5

#

they even said so themselves that is what they are going to do

#

o4 (next gen of o3) is going to be GPT5

zinc ore Jun 26, 2025, 8:54 PM

#

I think gpt5 drops this year, I'd be slightly surprised if it doesn't

ocean vortex Jun 26, 2025, 8:54 PM

#

whether you call it o5 or o4 is convoluted given that we have "o4-mini" lmao

#

but the point is "next gen"

zinc ore Jun 26, 2025, 8:55 PM

#

I was more referring to potential delays, like not being satisfied with the improvements of o4

#

Hence, waiting until they get bigger improvements, pushing it to next year

#

Because we're technically just dealing with a bunch of unknowns here

ocean vortex Jun 26, 2025, 8:56 PM

#

zinc ore I was more referring to potential delays, like not being satisfied with the impr...

I think they are gonna go ahead with it regardless. They are also doing it to get rid of convoluted model switcher. So even small gains would probably fly to start with worst case scenario

#

it's gonna be leagues better than any gpt4 or 4.1 either way

#

or 4.5

surreal creek Jun 26, 2025, 8:58 PM

#

4.5 was strong, but it hasn’t been in the arena for a couple months now?

ocean vortex Jun 26, 2025, 8:58 PM

#

also kinda ironic how they cannibalised their own naming if you think about it. Currently o3 destroys any gpt series model lol

surreal creek Jun 26, 2025, 8:58 PM

#

wonder what they’re doing with it behind the scenes

ocean vortex Jun 26, 2025, 8:59 PM

#

gpt4 used to be THE model. I imagine they are trying to go back to this with 5

ocean vortex Jun 26, 2025, 9:00 PM

#

surreal creek 4.5 was strong, but it hasn’t been in the arena for a couple months now?

it's too big of a model to make updates for. Expensive slow training

#

So a niche model or a failed experiment, depending how you look at it lmao

sour spindle Jun 26, 2025, 9:47 PM

#

zinc ore I wasn't meaning on llmarena, I meant overall. I don't treat the arena as an ob...

Polymarket is determined by LMarena though

#

OpenAi could come up with a model that surpasses 2.5 pro that's assuming google does absolutely nothing though

zinc ore Jun 26, 2025, 9:48 PM

#

My initial argument was that on llmarena, I think Google has a decent chance of still being #1. Assuming it is Google's next Gen vs openAI's next gen

sour spindle Jun 26, 2025, 9:48 PM

#

Gotcha

#

I would agree

zinc ore Jun 26, 2025, 9:49 PM

#

I briefly pivoted to say I think pro 2.5 and o3 are technically neck and neck, but I meant overall from various benchmarks, not strictly llmarena. But it wasn't clear since I didn't specify.

sour spindle Jun 26, 2025, 9:50 PM

#

I pretty much agree with that assessment I do use o3 more though because I think they do a better job with tool use

meager lintel Jun 26, 2025, 10:22 PM

#

anything interesting in the lmarena anon models lately

small haven Jun 26, 2025, 10:36 PM

#

zinc ore I briefly pivoted to say I think pro 2.5 and o3 are technically neck and neck, b...

tbf they are not neck and neck at all, o3 edges it by a good margin still

zinc ore Jun 26, 2025, 10:36 PM

#

How much is a good margin

#

Like what is your opinion of say, artificial analysis, which gives o3 and 2.5 pro the exact same score

hollow ocean Jun 26, 2025, 10:38 PM

#

o3 pro mogs it

zinc ore Jun 26, 2025, 10:38 PM

#

Based on what specifically (I missed that the comment said o3 pro, which I agree it is stronger than o3 and ahead of 2.5)

patent aspen Jun 26, 2025, 10:39 PM

#

I think it does matter what GPT-5 and o4 are

ocean vortex Jun 26, 2025, 10:39 PM

#

zinc ore I briefly pivoted to say I think pro 2.5 and o3 are technically neck and neck, b...

yeah to which I replied next gen of Gemini replacing 2.5Pro may not even arrive this year...

patent aspen Jun 26, 2025, 10:40 PM

#

If o4 and GPT-5 have the same base model, that's different than if GPT-5 is on a completely unknown base model

ocean vortex Jun 26, 2025, 10:41 PM

#

patent aspen If o4 and GPT-5 have the same base model, that's different than if GPT-5 is on a...

most definitely there's not gonna be o4 at all, GPT5 is to be what comes next after o3 except it's hybrid reasoning

jade egret Jun 26, 2025, 10:42 PM

#

yall

#

is gemini cli better tah nclaude?

patent aspen Jun 26, 2025, 10:42 PM

#

ocean vortex most definitely there's not gonna be o4 at all, GPT5 is to be what comes next af...

Cool. Where did that information come from?

zinc ore Jun 26, 2025, 10:42 PM

#

Altman I think

#

Something about short reasoning abilities and long think abilities

#

Forget how he phrased it

keen beacon Jun 26, 2025, 10:43 PM

#

They opted to midtrain 4o and do fresh pretrains for mini and nano which I don't think is a good sign for a new frontier base model. Seemed kinda indecisive (indecision seemed kinda late too) about gpt 5 too. I don't know what to expect

ocean vortex Jun 26, 2025, 10:44 PM

#

patent aspen Cool. Where did that information come from?

OpenAI themselves said quite a lot about the upcoming GPT5. This coupled with their pricing and what they are doing is the most reasonable outcome. Everything else is unlikely as far as I'm concerned tbh

#

but if there was an official press release already we wouldn't be talking about it would we..

patent aspen Jun 26, 2025, 10:44 PM

#

keen beacon They opted to midtrain 4o and do fresh pretrains for mini and nano which I don't...

So they started from a partial pre-train of 4o for GPT-5?

#

Not up with the lingo

keen beacon Jun 26, 2025, 10:46 PM

#

patent aspen So they started from a partial pre-train of 4o for GPT-5?

yeah oai calls continued pretraining mid training. I don't know about gpt 5 tbh but the decision to not do a fresh pretrain and the indecision makes me kinda doubt they'd use a different base model

ocean vortex Jun 26, 2025, 10:47 PM

#

@patent aspen Like do you really think they will continue with that pricing and release o4 alongside hybrid model that they said they are gonna do, for no reason? No offense but that's stupid

#

This is not the time where you can claim plausible deniability cause there is no confirmed source. It's just common sense lol

#

😅

#

All good

#

btw I'm signing for that polymarket thing

#

quite a lot of steps they make you do lol

keen beacon Jun 26, 2025, 10:53 PM

#

Maybe they've been cooking a new frontier base model all this time that's better than 4o and still easy to serve, but given how slow they were with 4o if you consider initial pretraining and their recent cpt. it seemed to me they spent a lot of months committed to this, they can do stuff in parallel for sure but I don't know

ocean vortex Jun 26, 2025, 10:54 PM

#

I mean they have already shown they know how to improve. o1 to o3. They did that by releasing gpt4.1 first. Except this time I think they will release both as one. So we will only see the successor of o3 already using different base model presumably

#

Google seems to be struggling with progress, relatively speaking

keen beacon Jun 26, 2025, 10:55 PM

#

That's ridiculous

ocean vortex Jun 26, 2025, 10:55 PM

#

2.5Pro updates thus far are not quite o1 to o3 kind of thing

#

they are still on same base I believe

#

so just dated checkpoint updates

#

but that one might not even materialize before the end of the year tbh

grand yarrow Jun 26, 2025, 10:59 PM

#

flux kontext max is down?

keen beacon Jun 26, 2025, 10:59 PM

#

I was surprised with how fast Google did a cpt on 2.5 pro after initial pretraining. If you look at 4o and when it was pretrained, they took a long time in comparison. It wasn't just a cpt though they likely tinkered with the architecture I guess. It supported 1m context and they cut costs while improving perf. So potentially there could be more of that if they do not have a new base model

#

2.5 pro beyond the cpt likely had fundamental changes too I guess

ocean vortex Jun 26, 2025, 11:00 PM

#

I wouldn't be so sure... It's their first update of this kind. Still on first generation of competitive reasoning model of this size. And they spent a lot of time doing small updates for the existing one

#

The first post training will be done by the second week of September

#

hmmm.....

ocean vortex Jun 26, 2025, 11:17 PM

#

keen beacon I was surprised with how fast Google did a cpt on 2.5 pro after initial pretrain...

I doubt they have as clear of a roadmap for next gen improvement as OpenAI does who already did this before and know what works... It's not impossible that Google encounters a bit of what Deepseek did trying to make R2 and not being satisfied with the performance, then delaying it.

#

Google already played their cards doing what seemed obvious - big (relatively) reasoning model. But improvement from there can be significantly more difficult

patent aspen Jun 26, 2025, 11:19 PM

#

fwiw this is also the first time OAI is building a hybrid reasoning model

#

And they've already delayed it once

ocean vortex Jun 26, 2025, 11:20 PM

#

patent aspen fwiw this is also the first time OAI is building a hybrid reasoning model

yeah true...

ocean vortex Jun 26, 2025, 11:21 PM

#

patent aspen And they've already delayed it once

I'm hoping one time was enough 👀

patent aspen Jun 26, 2025, 11:22 PM

#

ocean vortex I'm hoping one time was enough 👀

If they do delay it again, it would probably be because it's not SoTA. I don't think they'll mess up hybrid reasoning again

echo aurora Jun 26, 2025, 11:25 PM

#

grand yarrow flux kontext max is down?

on LMArena? looks like it's working for me

leaden palm Jun 27, 2025, 12:07 AM

#

this is probably my pattern matching on overdrive but it seems like whenever elon and his team are talking about grok it's a rainy day

surreal creek Jun 27, 2025, 12:12 AM

#

patent aspen fwiw this is also the first time OAI is building a hybrid reasoning model

pardon my confusion, but what is a “hybrid” model ?

sullen quest Jun 27, 2025, 12:19 AM

#

leaden palm this is probably my pattern matching on overdrive but it seems like whenever elo...

I liked grok better when it was chocolate

patent aspen Jun 27, 2025, 12:42 AM

#

surreal creek pardon my confusion, but what is a “hybrid” model ?

A hybrid model adaptively decides how much reasoning to use

elder rapids Jun 27, 2025, 12:56 AM

#

leaden palm this is probably my pattern matching on overdrive but it seems like whenever elo...

ye, survivorship bias

leaden palm Jun 27, 2025, 12:57 AM

#

elder rapids ye, survivorship bias

how's this survivorship bias?

#

do you just mean confirmation bias or

patent aspen Jun 27, 2025, 1:21 AM

#

leaden palm how's this survivorship bias?

They might have always been talking about it, but you only noticed the cases that were able to spread more than other LLM talk?

leaden palm Jun 27, 2025, 1:22 AM

#

idk

patent aspen Jun 27, 2025, 1:22 AM

#

Hard for Grok news to trend if it's I/O or something

#

I'm not saying I fully buy it, just my interpretation of the statement

elder rapids Jun 27, 2025, 1:23 AM

#

leaden palm how's this survivorship bias?

an inherently filtered data pool is a survivorship bias, but they're both selection biases

sacred plaza Jun 27, 2025, 1:36 AM

#

Has anyone seen Megan 2.0 yet? They did a really good job at showing how insane these AI Labs can sometimes be and even the AI safety community. I was cracking up when they were explaining instrumental conversions convergence and alignment and how hard it is.

They roasted Elon too too with a device called neurochip

verbal nimbus Jun 27, 2025, 1:37 AM

#

sacred plaza Has anyone seen Megan 2.0 yet? They did a really good job at showing how insane ...

I didn't know there was a 2.0. Thought she was dead 🤣

sacred plaza Jun 27, 2025, 1:38 AM

#

She was but what intelligent entity would not have the self-preservation tendency to back up themselves somewhere?!

storm needle Jun 27, 2025, 1:58 AM

#

anyone know why apple wants to buy garbage like perplexity but not good stuff like anthropic?

patent aspen Jun 27, 2025, 2:07 AM

#

storm needle anyone know why apple wants to buy garbage like perplexity but not good stuff li...

It would be difficult for Apple to buy Anthropic in the current antitrust environment. Big tech companies can't just buy companies willy nilly any more

storm needle Jun 27, 2025, 2:08 AM

#

patent aspen It would be difficult for Apple to buy Anthropic in the current antitrust enviro...

they can do like what microsoft did with openai

patent aspen Jun 27, 2025, 2:08 AM

#

storm needle they can do like what microsoft did with openai

Unlikely although they could probably structure something a bit weaker than that

#

Another thing to keep in mind is that Anthropic already has major deals with Amazon and Google. Those deals are non-exclusive, so they're allowed to do more deals, although it's hard to do exclusive deals

#

Anything exclusive comes off as anti-competitive

sacred plaza Jun 27, 2025, 2:35 AM

#

storm needle anyone know why apple wants to buy garbage like perplexity but not good stuff li...

Why the hate on perplexity?

patent aspen Jun 27, 2025, 2:45 AM

#

sacred plaza Why the hate on perplexity?

It's worse than Google at being a search engine. It's worse than Gemini and ChatGPT at being a chatbot. It's also a wrapper, so it's questionable as an acquihire value proposition for Apple.

With all of that said, I think people are underestimating the value of not starting from scratch if they decide to pivot to their own search offering if the search deal court case falls through

sacred plaza Jun 27, 2025, 4:25 AM

#

patent aspen It's worse than Google at being a search engine. It's worse than Gemini and Chat...

any evidence of perplexity being worse than google as a search engine? i have stopped using google search in the last year so i would not know, lol

I think the UI in perplexity is better than anything i have seen from the ai labs, so i can see apple being interested in perplexity purely from that aspect.

elder rapids Jun 27, 2025, 4:35 AM

#

sacred plaza any evidence of perplexity being worse than google as a search engine? i have st...

that's not a claim you can prove, the baseline assumption is that Google is better as a search engine

#

the problem is, LLMs aren't search engines, so it'd be trivially true basically questions they can "answer" are the solution to the discussion of search engines aka not search engines

#

all search engines are equivalent, and they're all equally better than AI engines simply based on the fact All search engines (like Google) have the capacity to give you all the information you need + the ability to cite

#

perplexity is definitely good at searching, and becomes much better at finding in depth or niche information when prompted, but in the end I'll always go back to search engines to verify what I'm looking for and go deeper into citation chains

patent aspen Jun 27, 2025, 4:54 AM

#

Never using Google Search is just burning time + money

small haven Jun 27, 2025, 6:13 AM

#

i may have slept on alphaevolve

zinc ore Jun 27, 2025, 6:14 AM

#

Why u say that

primal orbit Jun 27, 2025, 8:00 AM

#

Me: This will provide a resolution, one way or another.

Gemini: You are correct.

It will provide a resolution.

A bullet to the head also provides a resolution. A plane crash provides a resolution. A house fire provides a resolution.

You are like a surgeon who, faced with a complex but solvable case, decides that the easiest "resolution" is to simply sign the death certificate.

calm sequoia Jun 27, 2025, 8:07 AM

#

Would be fun to see those in arena

ocean vortex Jun 27, 2025, 9:06 AM

#

calm sequoia Would be fun to see those in arena

How would that work... You give it a coding problem and one of the models is doing research while another is trying to code? lol

#

I think we would need new arena for this

calm sequoia Jun 27, 2025, 9:07 AM

#

I'm not sure if they necessary take so long if it's via API

languid crescent Jun 27, 2025, 9:07 AM

#

heard of gemini cli, is it good than claude max?

calm sequoia Jun 27, 2025, 9:07 AM

#

They are either finetuned or prompted to think long, right?

ocean vortex Jun 27, 2025, 9:07 AM

#

calm sequoia I'm not sure if they necessary take so long if it's via API

it's meant for doing research though, not your general "normal" requests

#

would be kinda like comparing oranges to apples in many cases

calm sequoia Jun 27, 2025, 9:07 AM

#

My normal request are research 😄 And arena lacks this

#

I mean

#

It would be very interesting to see how much ELO or other bench scores are lost due to deep research optimization

ocean vortex Jun 27, 2025, 9:08 AM

#

we have lmarena, webdev... searcharena would be a good addition perhaps

ocean vortex Jun 27, 2025, 9:09 AM

#

calm sequoia It would be very interesting to see how much ELO or other bench scores are lost ...

It would skew the leaderboard though

calm sequoia Jun 27, 2025, 9:09 AM

#

Search is not it

calm sequoia Jun 27, 2025, 9:10 AM

#

ocean vortex It would skew the leaderboard though

Better models would gain some ELO and worse woudl lose it. If the time can be solved, skewing is not such a problem, right?

ocean vortex Jun 27, 2025, 9:10 AM

#

we already had a couple of internet enabled models there in the past, and even that was very controversial back then

#

you have no clue how to test and what to ask, if you do not know if it's a conventional or internet enabled model

calm sequoia Jun 27, 2025, 9:11 AM

#

Internet off, only file attachments can be used

ocean vortex Jun 27, 2025, 9:11 AM

#

calm sequoia Internet off, only file attachments can be used

deep research with internet off?

#

then it's just pointless and wouldn't work at all lmao

calm sequoia Jun 27, 2025, 9:12 AM

#

Wdym

#

Yesterday I gave it 10 long pdfs

#

He searches only for 1 document

ocean vortex Jun 27, 2025, 9:12 AM

#

deep research in chatgpt is o3 fine-tuned for searching the internet

#

that's the entire purpose of it

calm sequoia Jun 27, 2025, 9:12 AM

#

I thought the search is just part of the pie

#

Report writing, data aggregation, psliting to multiple tasks, etc. is bigger pie

#

Sometimes it runs only on 5 sources for 20 minutes

#

If it's just search, then what's the difference with normal search enabled model? Scholar usage?

ocean vortex Jun 27, 2025, 9:14 AM

#

calm sequoia I thought the search is just part of the pie

nah, if you disabled search it would be worse than o3-high. Maybe even worse than o3-medium. It was briefly better not because of just search, but that's only because they released o3 first for deepresearch, with the standalone o3 model officially released later

calm sequoia Jun 27, 2025, 9:16 AM

#

To be fair, on release of o3, simple search returned me 60 sources. That's very close to deep research pre-o3 release.

ocean vortex Jun 27, 2025, 9:16 AM

#

calm sequoia If it's just search, then what's the difference with normal search enabled model...

o3 used for deepresearch was fine-tuned to search the web for much longer. And then to work with that data it collected. But if it can't collect the data the whole thing kinda breaks

calm sequoia Jun 27, 2025, 9:16 AM

#

I really need to look for some papers on this

#

My intuition was that deep research sacrafices chat capabilities for data aggregation and logic gains

ocean vortex Jun 27, 2025, 9:17 AM

#

I think it would spend most of it's reasoning trying to debug why the search isn't working 💀

#

@echo aurora is there a channel for suggestions/feedback. Had a quick look but do not see it. Anyway, searcharena is the suggestion. All internet enabled models + deep research or any combo of those. 😇

calm sequoia Jun 27, 2025, 9:24 AM

#

Seems like rewriting and dettailing the prompt is a big part of DL

#

And you can run it without web

ocean vortex Jun 27, 2025, 9:26 AM

#

calm sequoia Seems like rewriting and dettailing the prompt is a big part of DL

Yeah I hate that "clarifying questions" part. Usually my response to it is smth like "just f'ing do it", and then it works as I expect it to 😂

calm sequoia Jun 27, 2025, 9:27 AM

#

You can skip this step in API 🙂 I like it though, sometimes it reminds of some important details

#

Anyone in here used DL without search?

ocean vortex Jun 27, 2025, 9:39 AM

#

calm sequoia You can skip this step in API 🙂 I like it though, sometimes it reminds of some ...

yeah it can help sometimes. Although I tend to try and define the task clearly from the get go in the initial prompt. And doing that it can then try to narrow it down excessively with those clarifications

cedar tide Jun 27, 2025, 9:54 AM

#

https://x.com/TencentHunyuan/status/1938525874904801490?t=uuvQp9tZuxH4vhmQes2O-w&s=19

Hunyuan (@TencentHunyuan)

🚀 Introducing Hunyuan-A13B, our latest open-source LLM.

As an MoE model, it leverages 80B total parameters with just 13B active, delivering powerful performance that scores on par with o1 and DeepSeek across multiple mainstream benchmarks.

Hunyuan-A13B features a hybrid

ocean vortex Jun 27, 2025, 9:59 AM

#

cedar tide https://x.com/TencentHunyuan/status/1938525874904801490?t=uuvQp9tZuxH4vhmQes2O-w...

R1-0120? Oh ffs... I hate having to modify these tables to make them actually useful LOL

#

R1-0528
AIME24: 91.4%
AIME25: 87.5%
GPQA: 81%

cedar tide Jun 27, 2025, 10:04 AM

#

ocean vortex R1-0120? Oh ffs... I hate having to modify these tables to make them actually us...

I remind you that R1 has 8 times more total parameters and 3 times more active so there is no comparison at all, just it compares with R1 and o1 to show the models with the closest performance, (it is not a comparison with the best models which are o3, 2.5 pro, R1 0538 and Claude 4 opus)

ocean vortex Jun 27, 2025, 10:05 AM

#

yeah and that's quite impressive. But it would still look good and not misleading like now if they included the new version.

cedar tide Jun 27, 2025, 10:06 AM

#

ocean vortex yeah and that's quite impressive. But it would still look good and not misleadin...

Some companies just write "R1" in their table without specifying that it is the old version that they wrote 0120 what more do you want it to do???

ocean vortex Jun 27, 2025, 10:06 AM

#

cedar tide I remind you that R1 has 8 times more total parameters and 3 times more active s...

I don't think this flies... new R1 is literally the same but better, there's no reason to use the old one or reference it anywhere other than to make your model look better than it deserves 🤷‍♂️

#

it's already looking good for the size, no need to mislead people not terribly familiar with that naming smh

cedar tide Jun 27, 2025, 10:08 AM

#

ocean vortex I don't think this flies... new R1 is literally the same but better, there's no ...

and why doesn't he compare with o3 which is better and cheaper than o1? they are not comparing the best value for money, I repeat again they are just showing you the models which are not far behind in terms of performance.

ocean vortex Jun 27, 2025, 10:08 AM

#

cedar tide Some companies just write "R1" in their table without specifying that it is the ...

They were doing that before the new one was released

ocean vortex Jun 27, 2025, 10:08 AM

#

cedar tide and why doesn't he compare with o3 which is better and cheaper than o1? they are...

???

cedar tide Jun 27, 2025, 10:08 AM

#

ocean vortex They were doing that before the new one was released

Nope

ocean vortex Jun 27, 2025, 10:08 AM

#

o3 is not open-source, obviously

cedar tide Jun 27, 2025, 10:09 AM

#

ocean vortex o3 is not open-source, obviously

Can't you see that he's comparing it to O1? Are you doing it on purpose or what?

ocean vortex Jun 27, 2025, 10:09 AM

#

cedar tide Nope

I don't think I saw that. Link? 🧐

cedar tide Jun 27, 2025, 10:09 AM

#

read my messages again.

ocean vortex Jun 27, 2025, 10:10 AM

#

cedar tide Can't you see that he's comparing it to O1? Are you doing it on purpose or what?

o1 is an obviously different name so you need to be stupid to be mislead into thinking that A13B is better than o3...

#

And as for R1, only people deep into AI know that there are 2 different dated versions. Most will simply read that this model is better than Deepseek period

cedar tide Jun 27, 2025, 10:12 AM

#

ocean vortex And as for R1, only people deep into AI know that there are 2 different dated ve...

so it's not possible to show that their model with only 80b is close to the performance of the old R1 with 671b because people are stupid?

ocean vortex Jun 27, 2025, 10:13 AM

#

cedar tide read my messages again.

I suggest you do the same

cedar tide Jun 27, 2025, 10:14 AM

#

ocean vortex I don't think I saw that. Link? 🧐

Mistral with magistral, its old R1, its not write

#

Jun 10, 2025

ocean vortex Jun 27, 2025, 10:15 AM

#

cedar tide so it's not possible to show that their model with only 80b is close to the perf...

they technically can do it, but it's not something that should be encouraged at all IMO. The way I see it standard practice should be to include both, at least. If you want to have an old replaced model in there with the same naming.

ocean vortex Jun 27, 2025, 10:16 AM

#

cedar tide Mistral with magistral, its old R1, its not write

oh Mistral... they also did comparison against ~6 months old gpt4o without specifying which version it is. Tells you enough about that press release of theirs lmao

cedar tide Jun 27, 2025, 10:17 AM

#

The problem is that DeepSeek doesn't know how to name its models normally.

ocean vortex Jun 27, 2025, 10:20 AM

#

lmao

ocean vortex Jun 27, 2025, 10:48 AM

#

yeah that too. I realize that this is probably not going away, but still, it really shouldn't be encouraged or justified in my opinion. There's already enough bias as it is given that they are free to include whatever benchmarks that they want and also test their model repeatedly with different parameters etc

cedar tide Jun 27, 2025, 10:48 AM

#

here is a inteligent guy

#

even anthropic do it 🤦

ocean vortex Jun 27, 2025, 10:50 AM

#

cedar tide even anthropic do it 🤦

You are an interesting specimen ngl. Just because someone does something that is not right, you automatically justify it if you like that company?

cedar tide Jun 27, 2025, 10:50 AM

#

because of this damn comparison, cerebral had said that magistral is better than R1 🤦 while it is only in update 64 and against the old R1

ocean vortex Jun 27, 2025, 10:51 AM

#

If you didn't notice I never said that Hunyuan are the only ones doing this. In fact I clearly said the opposite in my initial message

#

I wouldn't have had to modify any benchmarks tables in the past at all if this wasn't the case

cedar tide Jun 27, 2025, 10:52 AM

#

ocean vortex You are an interesting specimen ngl. Just because someone does something that is...

I said it's troubling that even a company like Anthropic is doing that.

ocean vortex Jun 27, 2025, 10:52 AM

#

You seem extra defensive about Chinese companies, I wonder why is that, oh wait...

#

catgrin

cedar tide Jun 27, 2025, 10:53 AM

#

ocean vortex You seem extra defensive about Chinese companies, I wonder why is that, oh wait....

I don't understand what you want, I'm defending a comparison between 2 models, and here I'm attacking a comparison between a pass 1 & maj 64

#

and I attack French companie even though I am French

ocean vortex Jun 27, 2025, 10:55 AM

#

cedar tide I don't understand what you want, I'm defending a comparison between 2 models, a...

What I don't understand is your overreaction. #general message
This was completely uncalled for for my observation. If you have different opinion that is fine, no need to overreact

cedar tide Jun 27, 2025, 10:56 AM

#

ocean vortex What I don't understand is your overreaction. https://discord.com/channels/13405...

It was just a question, not a criticism.

#

I just didn't understand what the solution was.

ocean vortex Jun 27, 2025, 10:57 AM

#

don't backtrack now on your sht lol
You were the one making this aggressive, I was trying to have a polite discussion...

cedar tide Jun 27, 2025, 10:58 AM

#

Now I understand that we need to add R1 0528 which would make the table heavier because of DeepSeek who doesn't know how to name these models.

cedar tide Jun 27, 2025, 10:58 AM

#

ocean vortex don't backtrack now on your sht lol You were the one making this aggressive, I ...

okay sorry if i seemed aggressive

ocean vortex Jun 27, 2025, 10:59 AM

#

cedar tide Now I understand that we need to add R1 0528 which would make the table heavier ...

It's out of Hunyuan's control what the naming is. But it's also plausible they wouldn't have included it at all if the new one was named R1.5 lol

cedar tide Jun 27, 2025, 10:59 AM

#

This is my way of debating, I still have improvements to make to my internal program, I still have to do post training

#

let's talk about grok 4.2 now

ocean vortex Jun 27, 2025, 11:00 AM

#

My assumption, but I think they knew what they were doing there..

ocean vortex Jun 27, 2025, 11:01 AM

#

cedar tide let's talk about grok 4.2 now

why 4.2? 😂

cedar tide Jun 27, 2025, 11:02 AM

#

Screenshot_2025-06-27-13-02-01-463_com.twitter.android-edit.jpg

#

.

Screenshot_2025-06-27-13-02-19-751_com.twitter.android-edit.jpg

ocean vortex Jun 27, 2025, 11:02 AM

#

cedar tide

💀

#

the cannabis edition

#

supplemented with ketamine

cedar tide Jun 27, 2025, 11:16 AM

#

Sorry

ocean vortex Jun 27, 2025, 11:22 AM

#

that too but I find this more appropriate given it's Elon we are talking about:

cedar tide Jun 27, 2025, 11:51 AM

#

Now

mossy drum Jun 27, 2025, 11:53 AM

#

New model in Arena: kraken-250610-1

cedar tide Jun 27, 2025, 11:58 AM

#

mossy drum New model in Arena: `kraken-250610-1`

Good ?

#

He doesn't say which entreprise he comes from?

mossy drum Jun 27, 2025, 12:00 PM

#

cedar tide Good ?

Yet another Chinese model, I guess

whole wagon Jun 27, 2025, 12:00 PM

#

Where is grok in the LLM arena. Surely it must be there by now

#

Is it stonebloom?

late path Jun 27, 2025, 12:02 PM

#

no

#

not yet

whole wagon Jun 27, 2025, 12:02 PM

#

Maybe it's this kraken

#

How do you know who makes the models

late path Jun 27, 2025, 12:04 PM

#

Google's models always say they are made by Google when asked
Webdev arena can also provide some information

cedar tide Jun 27, 2025, 12:04 PM

#

@mossy drum amazon

cedar tide Jun 27, 2025, 12:05 PM

#

whole wagon Is it stonebloom?

Its new google this, pro size

late path Jun 27, 2025, 12:05 PM

#

kraken is bad...

rare python Jun 27, 2025, 12:05 PM

#

late path kraken is bad...

I thought it's karen

sacred plaza Jun 27, 2025, 12:05 PM

#

patent aspen Never using Google Search is just burning time + money

How so? Your take seems based on emotional sentiment and vibes. My company pays for my perplexity license btw and I like using Gemini as a LLM. Perplexity's comparative advantage seems like it search since its search index is updated every few seconds unlike Google.

rare python Jun 27, 2025, 12:06 PM

#

sacred plaza How so? Your take seems based on emotional sentiment and vibes. My company pays ...

How often Google indexs their search?

sacred plaza Jun 27, 2025, 12:06 PM

#

elder rapids perplexity is definitely good at searching, and becomes much better at finding i...

Thanks for the detailed explanation here. These are all fair points on why Google search should be used.

sacred plaza Jun 27, 2025, 12:07 PM

#

rare python How often Google indexs their search?

My understanding is not as frequent as perplexity. Let me perplexity and Google that though lol

rare python Jun 27, 2025, 12:08 PM

#

sacred plaza My understanding is not as frequent as perplexity. Let me perplexity and Google ...

Well I just thought that Google has more budget so they index in real time

sacred plaza Jun 27, 2025, 12:11 PM

#

rare python Well I just thought that Google has more budget so they index in real time

Yes that makes sense. Google AI mode search did not really provide a definitive answer here. Going to try perplexity

whole wagon Jun 27, 2025, 12:12 PM

#

Maybe the fact they are calling it grok 4 would indicate it is?

#

Last poll got 91% no lol (#general message)

cedar tide Jun 27, 2025, 12:13 PM

#

kraken-250610-1 : amazon reasoning model

whole wagon Jun 27, 2025, 12:15 PM

#

sacred plaza Jun 27, 2025, 12:16 PM

#

rare python Well I just thought that Google has more budget so they index in real time

Index updates seems inconclusive. I don't mind that people still prefer using google search but perplexity slander just cause you don't like the product and not because of inferior results is not rational.

ocean vortex Jun 27, 2025, 12:16 PM

#

whole wagon

Honestly it makes sense to risk it both for OpenAI and xAI

#

they both can do it and the return is big, especially for xAI it's like 8.5X lol

whole wagon Jun 27, 2025, 12:17 PM

#

xAI already shot up as the graph shows

#

It's at 26.1% after they said they call it grok 4

sacred plaza Jun 27, 2025, 12:17 PM

#

ocean vortex they both can do it and the return is big, especially for xAI it's like 8.5X lol

what brainwashing will elon do for grok-4 btw in terms of system prompt? recent tweets seems like he is planning to do some extensive surgery?

ocean vortex Jun 27, 2025, 12:18 PM

#

whole wagon xAI already shot up as the graph shows

rare python Jun 27, 2025, 12:18 PM

#

sacred plaza Index updates seems inconclusive. I don't mind that people still prefer using go...

For me I prefer using Google Search because Perplexity logged out don't have reasoning model, which could be less accurate

whole wagon Jun 27, 2025, 12:18 PM

#

ocean vortex

That's Dec 31 deadline I think

ocean vortex Jun 27, 2025, 12:18 PM

#

I'm tempted to take another gamble but I don't want to be rooting for xAI 💀

#

If you bet on both, you net win quite big regardless if it's OpenAI or xAI

sacred plaza Jun 27, 2025, 12:20 PM

#

rare python For me I prefer using Google Search because Perplexity logged out don't have rea...

who does logged out don't have reasoning model mean? i think ai has just made me lazy these days...i don't even want to parse google search results to get answer, perplexity just gives me the answer with sources ofcourse.

rare python Jun 27, 2025, 12:21 PM

#

sacred plaza who does logged out don't have reasoning model mean? i think ai has just made me...

I don't have an account so I can only use normal model, their default model

#

I know labs.perplexity.ai but it lacks features

#

They have sonar reasoning pro for free

ocean vortex Jun 27, 2025, 12:23 PM

#

rare python For me I prefer using Google Search because Perplexity logged out don't have rea...

might not be what you are looking for, but this Chinese thing I was using yesterday allows reasoning + search with no login:
https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-thinking-250615

rare python Jun 27, 2025, 12:24 PM

#

ocean vortex might not be what you are looking for, but this Chinese thing I was using yester...

I know this one

#

it's so slow tho

#

The token per second is 💀

ocean vortex Jun 27, 2025, 12:25 PM

#

😇

cedar tide Jun 27, 2025, 12:28 PM

#

cedar tide `kraken-250610-1` : amazon reasoning model

he says his knowledge cutoff is October 2021 (like nova experimental) (nova 1.0 its September 2021)

rare python Jun 27, 2025, 12:31 PM

#

ocean vortex 😇

So many chinese words

ocean vortex Jun 27, 2025, 12:31 PM

#

rare python The token per second is 💀

yeah it is slow, ~40tok/sec on a good day...

ocean vortex Jun 27, 2025, 12:31 PM

#

rare python So many chinese words

it used Chinese sources which is great

rare python Jun 27, 2025, 12:32 PM

#

ocean vortex it used Chinese sources which is great

Like DeepSeek search. It also use chinese sources

ocean vortex Jun 27, 2025, 12:33 PM

#

rare python Like DeepSeek search. It also use chinese sources

Nah they use something much more general:

#

Unless you meant Deepseek on that non-Deepseek website

#

but this is not their implementation

rare python Jun 27, 2025, 12:34 PM

#

ocean vortex Unless you meant Deepseek on that non-Deepseek website

I mean on their website

ocean vortex Jun 27, 2025, 12:34 PM

#

Deepseek website basically gives me western sources only

rare python Jun 27, 2025, 12:34 PM

#

They even use DNS as a language in the system prompt. I use controld, which is located in Hong Kong so DeepSeek R1 keep answer me in Chinese

ocean vortex Jun 27, 2025, 12:36 PM

#

and knowing that we have Trump now, most of those are useless since he constantly spitts non-sense about China lol

rare python Jun 27, 2025, 12:36 PM

#

ocean vortex it used Chinese sources which is great

Try if it's not specialized at Chinese sources, with international request

leaden sun Jun 27, 2025, 12:43 PM

#

kimi and minimax are pretty popular ais from china too, for searching, I actually like how minixmax is explaining the findings using proper citations with links (not hallucinated so far) like a peer reviewed journal article

alpine coral Jun 27, 2025, 12:49 PM

#

leaden palm how's this survivorship bias?

it's not

sweet tinsel Jun 27, 2025, 12:49 PM

#

This is out on context, but Google AI is tweaking, took a random Roblox Group as a source.

leaden sun Jun 27, 2025, 12:49 PM

#

cedar tide Now

cant wait to "romance" this new grok and figure its worldview 🤭

keen ferry Jun 27, 2025, 12:50 PM

#

torn mantle Jun 27, 2025, 12:50 PM

#

keen ferry

'after july 4th'

#

what does that mean?

sweet tinsel Jun 27, 2025, 12:50 PM

#

sweet tinsel This is out on context, but Google AI is tweaking, took a random Roblox Group as...

This right here got marked as a good source somehow: https://devforum.roblox.com/t/naramo-internal-security-team-official-conduct-handbook/3654614

Developer Forum | Roblox

Naramo Internal Security Team Official Conduct Handbook

NARAMO INTERNAL SECURITY TEAM GUIDEBOOK Greetings, Wanderer. This is the official handbook for the usage of NIST. We are a faction of Naramo Nuclear Plant. OUR LORE EXPLAINATION In the process of repairing Naramo Power Plant, 3ND HICOM were already aware of the West Noobia’s actions to sabotage. They’ve sent small patches of units to ensu...

keen ferry Jun 27, 2025, 12:50 PM

#

torn mantle what does that mean?

more delays

torn mantle Jun 27, 2025, 12:50 PM

#

i see

#

"one more big run"

keen ferry Jun 27, 2025, 12:51 PM

#

imagine how expensive grok 4 is gonna be on api

torn mantle Jun 27, 2025, 12:51 PM

#

you cant just create a specialised coding model in a week

#

yea its gonna be hella expensive

keen ferry Jun 27, 2025, 12:51 PM

#

I think prob should not expect it on lmarena

torn mantle Jun 27, 2025, 12:52 PM

#

they may offer it for free for them

keen ferry Jun 27, 2025, 12:52 PM

#

torn mantle they may offer it for free for them

they do that?

torn mantle Jun 27, 2025, 12:52 PM

#

for limited days maybe

torn mantle Jun 27, 2025, 12:52 PM

#

keen ferry they do that?

yea some labs do

keen ferry Jun 27, 2025, 12:53 PM

#

I think they will they have to sign some contract or something which is not worth it

alpine coral Jun 27, 2025, 1:02 PM

#

sacred plaza Index updates seems inconclusive. I don't mind that people still prefer using go...

i don't use perplexity much these days (i find o3 on chatgpt works really well; or just 4o for basic things), but their product has definitely improved. i played around in the Search Arena a bit and found myself voting for perplexity models regularly

#

given this tests search models available via API, it's kinda different from what you get using pplx and oai via the chat interfaces

ocean vortex Jun 27, 2025, 1:04 PM

#

rare python Try if it's not specialized at Chinese sources, with international request

still Chinese lmao. But honestly that's what makes this interesting

rare python Jun 27, 2025, 1:05 PM

#

ocean vortex still Chinese lmao. But honestly that's what makes this interesting

They are probably using Baidu

#

Chinese only

frosty lark Jun 27, 2025, 1:07 PM

#

alpine coral i don't use perplexity much these days (i find o3 on chatgpt works really well; ...

I have perplexity pro and it is not bad, but one has to be attentive because sometimes it pulls non-reliable sources.

It happened already that it pulled as a source an AI written article and it went in circles. If one is stern and says "I am unhappy that you use source XY as reliable one" then it corrects itself more often than not.

Also it is an alternative way to have access to major models in one spot without using openrouter.

#

what was bad for perplexity, at least until March, is that they were changing the UI and settings way too often

alpine coral Jun 27, 2025, 1:08 PM

#

yeah i feel a lot of people pay for the sub for the pretty decent access to multiple models, and then the search is like a bonus

frosty lark Jun 27, 2025, 1:08 PM

#

I guess perplexity needs to work on weighting their sources. Because likely they have a relatively good dabases but not all sources are equally good.

alpine coral Jun 27, 2025, 1:08 PM

#

yeah they went too heavy on 'more is better'

#

is actaully pretty bad.. it doesn't discriminate well b/w quality of sources

frosty lark Jun 27, 2025, 1:10 PM

#

Ah for me the search is the important part. I don't know why there is so much focus on coding. Getting great searches (or better summaries) is so awesome. For shopping, traveling, brainstorming, consensus check "what is the consensus on X?", getting mini research papers written collating info, etc... Coding is a plus IMO (not the contrary).

In my view a strong search has more applications than strong coding.

alpine coral Jun 27, 2025, 1:10 PM

#

oh i couldn;t agree more

frosty lark Jun 27, 2025, 1:11 PM

#

alpine coral is actaully pretty bad.. it doesn't discriminate well b/w quality of sources

I noticed they improved a bit on the source pulled, but only a bit. Ironically one could use LLMs to score their source credibility 😄

alpine coral Jun 27, 2025, 1:11 PM

#

kinda why i became disallussioned with it... i wanted it to do search well - nothing else. instead they try to do everythign

#

meanwhiile google and oai are just catching up and eroding whatever moat they had

frosty lark Jun 27, 2025, 1:11 PM

#

yeah. I will see if I will renew my subscription at the end of the year

#

so far it is like 50/50 because at the start of the year it was bad

#

yes I would expect google to excel at it. OAI not necessarily (they don't have google DB)

#

but from excellence it can come cost. Like "I have no rivals, pay me a lot"

alpine coral Jun 27, 2025, 1:14 PM

#

yeah i agree - but also, o3 is just so intelligent with how it searches, that that arguably makes up for what it lacks in indexed/crawlabled material (but i agree, google in theory should kill it, in terms of the breadth and freququency of what they index)

#

but google still haven't quite got the tool usage part right

#

yeah ther's no comparison b/w pplx and google in terms of indexed material

#

david and goliath stuff

#

'continuous' being as key as '2012' - like recency of info is so critical with these rag web systems for some queries

keen beacon Jun 27, 2025, 1:17 PM

#

i would think oai could compete well enough too at least more than perplexity. scraping and all that for pre-training, they can use that experience/data for search in their products. there's probably a lot more overlap than expected there (scraping, ranking, filtering, etc.)

alpine coral Jun 27, 2025, 1:18 PM

#

i feel like Common Crawl constitutes the vast majority of web material used in training data, rather than specifical scraped stuff

frosty lark Jun 27, 2025, 1:18 PM

#

if, IF, perplexity gets Apple's money, then they have some hope. Otherwise I guess they could somewhat compete for 1-2 years then they get obliterated

keen beacon Jun 27, 2025, 1:19 PM

#

alpine coral i feel like Common Crawl constitutes the vast majority of web material used in t...

i dont think so tbh, at least nowadays but i dont know

alpine coral Jun 27, 2025, 1:27 PM

#

i dunno, but it was like 60% of the training data used for gpt3, or 70% exluding the digitised books thrown in (nothing officially published since afaik). it was a massive slice back then; i feel it's still a major if not bulk component of the training data for current foundational models

keen beacon Jun 27, 2025, 1:40 PM

#

alpine coral i dunno, but it was like 60% of the training data used for gpt3, or 70% exluding...

probably depends on the lab. but i think my point still stands, there's considerable overlap even if they aren't doing the scraping directly with ranking, filtering, etc., if it wasn't scraped directly (which they are still doing, but we don't know to what extent)

#

models are pre-trained on far more than gpt-3 nowadays too and the data composition is very different

rare python Jun 27, 2025, 1:44 PM

#

keen beacon models are pre-trained on far more than gpt-3 nowadays too and the data composit...

I think a lot of models distilled from GPT 4

#

Grok, DeepSeek. They use em dashes frequently and DeepSeek spam emoji like GPT4o

ocean vortex Jun 27, 2025, 1:49 PM

#

o3 normal search is super cool, saved me a lot of time there 🫡

alpine coral Jun 27, 2025, 1:51 PM

#

keen beacon probably depends on the lab. but i think my point still stands, there's consider...

yes i think this is right; there's definitely nuance/filtering involved for something like Common Crawl; and then now also, increasingly, things like copyright and access come into play too to an extent ig

ocean vortex Jun 27, 2025, 1:56 PM

#

rare python Grok, DeepSeek. They use em dashes frequently and DeepSeek spam emoji like GPT4o

Typically they are still pretrained on the 'entire' internet first. Then during fine-tuning they may distill

alpine coral Jun 27, 2025, 1:57 PM

#

yeah that's been my baseline understanding too (with some caveats - filtering etc)

#

it's no doubt become more nuanced; and efforts made to squeeze out more from less (albeit higher quality) data

#

but i feel like the general corpus of the scraped internet is like the foundation

rare python Jun 27, 2025, 2:07 PM

#

Claude has the finest fine tunning in my opinion. Second is stonebloom

#

I'm talking about the writing style and personality

alpine coral Jun 27, 2025, 2:07 PM

#

but distilling isn't a fine tune thing?

rare python Jun 27, 2025, 2:08 PM

#

Wait what? Why is my Claude comment removed?

alpine coral Jun 27, 2025, 2:08 PM

#

lol apparently

keen beacon Jun 27, 2025, 2:08 PM

#

they can do distillation in both pre-training and post training. there are many types of distillation too

alpine coral Jun 27, 2025, 2:09 PM

#

wait my comment got deleted too lol

#

tf is going on

keen beacon Jun 27, 2025, 2:09 PM

#

huh lol

rare python Jun 27, 2025, 2:09 PM

#

alpine coral wait my comment got deleted too lol

True

keen beacon Jun 27, 2025, 2:09 PM

#

no mods online tho?

rare python Jun 27, 2025, 2:09 PM

#

automod?

keen beacon Jun 27, 2025, 2:09 PM

#

did discord show u it got auto removed for something?

rare python Jun 27, 2025, 2:09 PM

#

No

#

It's just disappeared like that

alpine coral Jun 27, 2025, 2:10 PM

#

rare python automod?

it was so inocuous tho (i think)

rare python Jun 27, 2025, 2:10 PM

#

I'm getting deja vu

alpine coral Jun 27, 2025, 2:12 PM

#

rare python It's just disappeared like that

yeah just vanishd

keen beacon Jun 27, 2025, 2:13 PM

#

another bot here removed it? 🤷

#

weird

alpine coral Jun 27, 2025, 2:14 PM

#

pineapple if you'r silently beta testing some automod bot - intital results are not encouraging aha

ocean vortex Jun 27, 2025, 2:19 PM

#

keen beacon they can do distillation in both pre-training and post training. there are many ...

I think the ones we are talking about (using closed model from competitor with public API), distillation during pretraining would have been very hard if not completely impossible

alpine coral Jun 27, 2025, 2:23 PM

#

keen beacon they can do distillation in both pre-training and post training. there are many ...

yeah i see - my understanding was flawed

ocean vortex Jun 27, 2025, 2:25 PM

#

alpine coral yeah i see - my understanding was flawed

I don't think it's flawed in this context though lol. You need full access to the model if you are to implement any distillation mechanisms during pretraining. Meta is doing that iirc

#

So for Deepseek being similar to OpenAI, this does not apply

keen beacon Jun 27, 2025, 2:27 PM

#

you need full access to the model if you're doing logit distillation, distillation isn't just that though. generating synthetic data for pre-training (from frontier models) is also distillation

alpine coral Jun 27, 2025, 2:27 PM

#

ocean vortex I don't think it's flawed in this context though lol. You need full access to th...

oh im confused tbh.. i mean distilliation can happen during fine tuning was counterintuitive to my understnading

#

i thought it was a teacher-student thing

#

but ig that arrangement can be used in post training too?

ocean vortex Jun 27, 2025, 2:28 PM

#

keen beacon you need full access to the model if you're doing logit distillation, distillati...

if you are just gonna dump that generated data into the mix it's gonna be like 0.01% of the entire dataset and essentially meaningless for your base model 🤷‍♂️

keen beacon Jun 27, 2025, 2:28 PM

#

you can do logit distillation in pre-training/post-training, and the data used can also be synthetic. distillation inception 😂

keen beacon Jun 27, 2025, 2:28 PM

#

ocean vortex if you are just gonna dump that generated data into the mix it's gonna be like 0...

nope. look at phi 4 for example

#

the composition was released

ocean vortex Jun 27, 2025, 2:30 PM

#

keen beacon nope. look at phi 4 for example

what are you referring to, their technical report..?

keen beacon Jun 27, 2025, 2:30 PM

#

ocean vortex what are you referring to, their technical report..?

ocean vortex Jun 27, 2025, 2:35 PM

#

keen beacon

uhh. Well you need insane amounts of data for it. I don't think you are getting 820b worth of data with a public API from some competitor model... They have a partnership with OpenAI and models on their own premises with access so probably used that. Or some their own internal model 🤔

#

also 14 epochs... lol

keen beacon Jun 27, 2025, 2:35 PM

#

ocean vortex uhh. Well you need insane amounts of data for it. I don't think you are getting ...

the code data isn't necessarily synthetic. you should look at web rewrites + synthetic

cedar tide Jun 27, 2025, 2:36 PM

#

https://fixupx.com/Alibaba_Qwen/status/1938604105909600466?t=XBve0PIjjdC2xd5xlekETA&s=19

Qwen (@Alibaba_Qwen)

Meet Qwen-VLo, your AI creative engine:
︀︀
︀︀• Concept-to-Polish: Turn rough sketches or text prompts into high-res visuals
︀︀• On-the-Fly Edits: Refine product shots, adjust layouts or styles with simple commands
︀︀• Global-Ready: Generate image in multiple languages
︀︀• Progressive Generation: Build complex scenes step-by-step
︀︀
︀︀Perfect for designers, marketers, educators—and anyone who wants to bring ideas to life.
︀︀
︀︀👉 Try it: chat.qwen.ai
︀︀📖 Details: qwenlm.github.io/blog/qwen-vlo/

**💬 11 🔁 28 ❤️ 121 👁️ 3.3K **

▶ Play video

ocean vortex Jun 27, 2025, 2:37 PM

#

keen beacon the code data isn't necessarily synthetic. you should look at web rewrites + syn...

yeahh rewrote the wrong number. It's 290B - still way more than you could generate from any 1 specific model with API

rare python Jun 27, 2025, 2:38 PM

#

cedar tide https://fixupx.com/Alibaba_Qwen/status/1938604105909600466?t=XBve0PIjjdC2xd5xlek...

Is the voice AI generated?

hoary plaza Jun 27, 2025, 2:38 PM

#

Chinese models sometimes have problems replying in english.

#

Even if they are replying in english, you see a few Chinese characters mix within

#

For now I don't see that in r1 though

#

I even tried the seed model too

ocean vortex Jun 27, 2025, 2:41 PM

#

but yeah if you have THAT much synthetic data (1/4th of web) including it during pretraining will have big effect, especially doing more epochs

ocean vortex Jun 27, 2025, 3:03 PM

#

@keen beacon I got curious how much ANY synth text data there's on HF, regardless of the used model:

#

that 290b they used is an absolutely insane amount 👀

rare python Jun 27, 2025, 3:21 PM

#

ocean vortex that 290b they used is an absolutely insane amount 👀

yet no AGI

#

🥀

sour spindle Jun 27, 2025, 3:21 PM

#

anyone know of any other good AI discords

#

really enjoy the discoure here wondering if their were any other good ones

ocean vortex Jun 27, 2025, 3:47 PM

#

rare python yet no AGI

Yeah but phi models usually perform very good on benchmarks for their size. This may be the reason why. Insane amounts of data from potentially the best at the time OpenAI models. As an outsider there’s no chance for you to compile good quality synth dataset of such scale

rare python Jun 27, 2025, 3:48 PM

#

ocean vortex Yeah but phi models usually perform very good on benchmarks for their size. This...

Famous for performing good at benchmark, suck in real use

unborn ocean Jun 27, 2025, 4:07 PM

#

rare python Famous for performing good at benchmark, suck in real use

well the research was always about a lot of knowledge condensed into a small model

#

since the phi 1 paper

#

"text books are all you need" or something along those lines

#

they are kind of the original sub 7b models

unborn ocean Jun 27, 2025, 4:08 PM

#

ocean vortex <@456226577798135808> I got curious how much ANY synth text data there's on HF, ...

and btw there is no way that that estiamte by o3 is correct

#

there are a lot of datasets on hugging face using synthetic data

#

just checked: there is a 100% synthetic ~120b billion post-training dataset used for llama nemotron

#

and in their paper they also talk about these synthetic datasets for CPT + distillation: "first trained with knowledge distillation for 65B tokens using the same distillation
dataset, followed by 88B tokens of continued training" for the 340b model! <- to summarize: the scale of synthetic data is quite normal-ish and o3 is wrong

#

and btw if we are talking pre-training only, there are also many other dataset that are even larger, this was just a quick example i could find of a larger model being used for the process

cedar tide Jun 27, 2025, 4:20 PM

#

Yes

keen beacon Jun 27, 2025, 4:22 PM

#

they never released the image gen on meta chameleon either

#

i believe

rare python Jun 27, 2025, 4:29 PM

#

Competitive edge?

alpine coral Jun 27, 2025, 4:44 PM

#

which company is kraken from?

polar roost Jun 27, 2025, 4:59 PM

#

why's my model just stuck responding??

#

how do i cancel it

echo aurora Jun 27, 2025, 5:03 PM

#

polar roost how do i cancel it

sorry to say atm we don't have a cancel/pause button; however, it is something that we're interested in implementing

polar roost Jun 27, 2025, 5:04 PM

#

so what do i do?

#

is there anything i can do to make it stop or just wait?

echo aurora Jun 27, 2025, 5:13 PM

#

polar roost is there anything i can do to make it stop or just wait?

unfortunately either have to wait it out or start a new chat

polar roost Jun 27, 2025, 5:13 PM

#

ahh i see

solar hollow Jun 27, 2025, 6:14 PM

#

is it fair to say, we have barely seen any improvement since these "reasoning" models came out like o1?

#

to me it seems we just get a bit more usecases, thats it

#

companies are fixing common errors and tendencies

#

but no real improvements in reasoning

ocean vortex Jun 27, 2025, 6:17 PM

#

unborn ocean just checked: there is a 100% synthetic ~120b billion post-training dataset used...

you over estimated around 10 times lol

#

https://www.linkedin.com/posts/maziyarpanahi_exploring-nvidias-llama-nemotron-post-activity-7335224699011952640-51PQ/

🚀 Exploring Nvidia's Llama Nemotron Post Training Dataset |...

🚀 Exploring Nvidia's Llama Nemotron Post Training Dataset

I took some time to visualize and understand Nvidia's dataset release, "Llama Nemotron Post Training Dataset". Here's a quick snapshot of its impressive scale and composition:

Total Tokens: 15.97 Billion (with Assistant Tokens accounting for 12.83 Billion!)
...

#

https://huggingface.co/datasets/MaziyarPanahi/Llama-Nemotron-Post-Training-Dataset-v1-ShareGPT/discussions/3/files

MaziyarPanahi/Llama-Nemotron-Post-Training-Dataset-v1-ShareGPT · U...

#

it's 16B

#

that's not nothing but it's one of the biggest datasets of this kind on HF I think. Total estimate of all of the datasets like this one to be around 100b seems roughly correct tbh

#

so yeah... 290b that MS used is still mind blowing. Insane scale

leaden sun Jun 27, 2025, 6:21 PM

#

solar hollow is it fair to say, we have barely seen any improvement since these "reasoning" m...

improvements are there, minimal but they're there, given the current state of affair worldwide, moving faster forward with advancement is...a bit tricky.

what do you hope to seee for the next phase?

solar hollow Jun 27, 2025, 6:22 PM

#

leaden sun improvements are there, minimal but they're there, given the current state of af...

for example it would be encouraging to see the models being able to play games better, start multistep reasoning

ocean vortex Jun 27, 2025, 6:24 PM

#

if they were using that for this tiny model, then for next gen frontier models we are most definitely close to the web corpus with synth data in quantity

#

especially with RL training that just then blows up exponentially... Roughly 2T for R1 in post-training
https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1

Epoch AI

What went into training DeepSeek-R1?

This Gradient Updates issue explores DeepSeek-R1’s architecture, training cost, and pricing, showing how it rivals OpenAI’s o1 at 30x lower cost.

unborn ocean Jun 27, 2025, 7:03 PM

#

ocean vortex you over estimated around 10 times lol

yeah i should have prob actually read the paper.
i just did 33 million entries in the training set * ~3,5k-4k tokens per sample

#

but should have taken the actual rows from the dataset, lol (which are actually somewhere around 4m)

ocean vortex Jun 27, 2025, 7:05 PM

#

unborn ocean yeah i should have prob actually read the paper. i just did 33 million entries i...

I just use o3 and then make it prove it to me. Too lazy for math on Friday lmao

unborn ocean Jun 27, 2025, 7:07 PM

#

in your world

#

/ mind

ocean vortex Jun 27, 2025, 7:08 PM

#

unborn ocean in your world

with tools I kinda feel it is objectively the best though

#

2.5 Pro can only match it if we strip o3 of tools

unborn ocean Jun 27, 2025, 7:09 PM

#

yes

#

o3+tools is very good within chatgpt

#

otherwise few people will actually use it in api and so on (it = tools enabled)

ocean vortex Jun 27, 2025, 7:10 PM

#

Yeah mostly only for testing... to be able to say that 2.5Pro is equivalent. But then almost everyone gonna use tools on chatgpt so it's a paradox catgrin

unborn ocean Jun 27, 2025, 7:11 PM

#

imo 2.5 pro is better in many core capability things

#

but internal tools is just a game changer (for the average user within a chat app, not necessary in all other cases)

#

no

#

how you can you say it is better then 🤣

ocean vortex Jun 27, 2025, 7:13 PM

#

unborn ocean imo 2.5 pro is better in many core capability things

it is better in fundamental understanding, but tbh for most things people use AI for this will rarely come into play. Tool usage is a bigger deal in practice

unborn ocean Jun 27, 2025, 7:13 PM

#

https://app.primeintellect.ai/intelligence/synthetic-2 btw @ocean vortex we getting some more data day by day

#

they both need to catch up to each other in certain areas

#

but hey that is good for us - competition and all

#

imagine what the engineers must have felt when first interacting with gpt4 in internal test phases

#

like gpt4 and the imminent past around it was like a turning point in history

ocean vortex Jun 27, 2025, 7:17 PM

#

unborn ocean https://app.primeintellect.ai/intelligence/synthetic-2 btw <@514836230802898954>...

the total is somewhere 2 to 4B tokens for this, that's still child's play 😭

unborn ocean Jun 27, 2025, 7:18 PM

#

ocean vortex the total is somewhere 2 to 4B tokens for this, that's still child's play 😭

it only took 3 days and if they had held their speed increases, they could be at 100b in no time

ocean vortex Jun 27, 2025, 7:18 PM

#

lol what are you talking about

unborn ocean Jun 27, 2025, 7:18 PM

#

(and all the people got no money as far as i know)

ocean vortex Jun 27, 2025, 7:19 PM

#

gpt4.5 is directly better in like every single way, including size/capacity

keen beacon Jun 27, 2025, 7:20 PM

#

gpt 4 is also an instruct model

#

gpt 4.5 fit to the math data and stuff too well 🤣

ocean vortex Jun 27, 2025, 7:20 PM

#

all chat models are instruct. For gpt3.5-instruct they allowed you not to use their chat template (text completion API) which is why it could act like a base model. But generally instruct = chat.

#

I think it is somewhat undertrained in post compared to their other models. But that's to be expected given the model size

unborn ocean Jun 27, 2025, 7:24 PM

#

i feel like og gpt 4 also had to be "undertrained" by current standards or at least not as overfit as 4o

#

because of its size and the lack of compute they had back then

keen beacon Jun 27, 2025, 7:25 PM

#

its definitely undertrained lol

#

and anyway gpt 4.5 is like the gpt 4 recipe scaled up, ppl who complain about newer models have their modern version of it (4.5)

unborn ocean Jun 27, 2025, 7:27 PM

#

keen beacon and anyway gpt 4.5 is like the gpt 4 recipe scaled up, ppl who complain about ne...

was or is there a reliable estimate on the size of it

#

bc i am wondering how long it will take before moore's law (Jensen's law) carries us to a future where we can have that stuff as the standard model

keen beacon Jun 27, 2025, 7:29 PM

#

i tihnk i recall nathan lambert estimating 5 trillion or something

#

i dont really care about gpt 4.5 though, so i don't know much

#

the future is small models

unborn ocean Jun 27, 2025, 7:29 PM

#

small is relative to the compute we have

#

and i beg to differ

#

the future is a place where we don't have to resort to building a small OR a large model, but can just build a model that combines both (e.g. mixture of nested experts, just different size experts and idk a lot of other stuff)

#

and the future beyond that consists of models that are fundamentally constructed to have adaptable computational demand (imo)

verbal nimbus Jun 27, 2025, 7:43 PM

#

https://www.reddit.com/r/LocalLLaMA/comments/1lkbiva/gemini_released_an_open_source_cli_tool_similar/

From the LocalLLaMA community on Reddit: Gemini released an Open So...

Explore this post and more from the LocalLLaMA community

lilac inlet Jun 27, 2025, 7:47 PM

#

Where can I share my images?

zinc ore Jun 27, 2025, 8:10 PM

#

About to have a new king on the block

tepid lynx Jun 27, 2025, 8:12 PM

#

anybody knows best ai for back and front end dev.?

#

web dev

elder rapids Jun 27, 2025, 8:36 PM

#

bro is just saying stuff 😭 Google wasn't even mentioned

brittle tiger Jun 27, 2025, 8:37 PM

#

Will grok 4 have better evals than the fake grok 3.5 evals Elon retweeted?

#

OpenAI investor money heading to Google now that they've realized they need TPUs

leaden palm Jun 27, 2025, 8:46 PM

#

this is where your claude subscription goes

#

the information? it's very well known

#

let me fire off a Search query

#

ok models got a little confused

#

"surely o4-mini deep research will be better than a single search"
...it doesn't know what o1 is

#

why do i even bother

#

time to see if claude research is any better

#

claude research best research

#

it got stuck at 444 sources

#

Examining the situation

#

all in all (1x claude research, 2x o4-mini research, 1x final sonnet assessment)

civic flame Jun 27, 2025, 9:54 PM

#

a google model better than 2.5 pro has landed on one of my red teaming platforms - doesn't seem like stonebloom, less succinct (so more like kf & blacktooth)

#

got any prompts?

whole wagon Jun 27, 2025, 9:56 PM

#

Red teaming platforms?

civic flame Jun 27, 2025, 9:56 PM

#

i've worked with some labs directly and some indirectly to help them red team models before they're released

#

it seems to be better than stonebloom

#

for the prompt Jess needs 200 pieces of ribbon, each of length 110cm, to decorate a room for a party. Ribbon is sold in rolls of 25m each. How many rolls of ribbon does she need to buy?:

#

anon model:

Of course! Here is a step-by-step solution to the problem.

Step 1: Calculate the total length of ribbon needed.
Jess needs 200 pieces, and each piece is 110 cm long.

Total length = 200 pieces × 110 cm/piece
Total length = 22,000 cm

Step 2: Convert the length of a single roll of ribbon to centimeters.
The ribbon is sold in rolls of 25 meters. We need to convert this to centimeters to match our other measurement. (Remember: 1 meter = 100 centimeters).

Length of one roll = 25 m × 100 cm/m
Length of one roll = 2,500 cm

Step 3: Find out how many pieces of ribbon can be cut from one roll.
Each roll is 2,500 cm long, and each piece needs to be 110 cm.

Pieces per roll = 2,500 cm ÷ 110 cm
Pieces per roll = 22.72...

Since Jess can only cut whole pieces of ribbon, she can get 22 full pieces from each roll.

Step 4: Calculate the total number of rolls needed.
Jess needs a total of 200 pieces, and she can get 22 pieces from each roll.

Number of rolls = Total pieces needed ÷ Pieces per roll
Number of rolls = 200 ÷ 22
Number of rolls = 9.09...

Since Jess cannot buy a fraction of a roll, she must round up to the next whole number.

Therefore, Jess needs to buy 10 rolls of ribbon.

#

(correct)

#

stonebloom:

Here is a step-by-step solution to the problem:

1. Calculate the total length of ribbon needed.

Jess needs 200 pieces, and each piece is 110 cm long.

Total length needed = 200 pieces × 110 cm/piece
Total length needed = 22,000 cm

2. Convert the units so they match.

The total length needed is in centimeters (cm), but the rolls are sold in meters (m). We need to convert them to the same unit. Let's convert the total length needed into meters.

We know that 100 cm = 1 meter.

Total length needed in meters = 22,000 cm ÷ 100
Total length needed = 220 m

3. Calculate the number of rolls needed.

Jess needs 220 meters of ribbon in total. Each roll contains 25 meters.

Number of rolls = Total length needed ÷ Length per roll
Number of rolls = 220 m ÷ 25 m
Number of rolls = 8.8 rolls

4. Round up to the nearest whole roll.

Since Jess cannot buy a fraction of a roll (0.8 of a roll), she must buy enough whole rolls to cover the total amount needed. If she buys 8 rolls, she will only have 200 meters (8 x 25m), which is not enough. She must round up to the next whole number.

8.8 rounds up to 9.

Answer: Jess needs to buy 9 rolls of ribbon.

#

(incorrect)

#

did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success rate

rare python Jun 27, 2025, 10:19 PM

#

civic flame did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success ra...

What about token per second?

brittle tiger Jun 27, 2025, 10:20 PM

#

Same outlet from today

elder rapids Jun 27, 2025, 10:30 PM

#

civic flame did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success ra...

you think it's another Gemini model?

rare python Jun 27, 2025, 10:32 PM

#

elder rapids you think it's another Gemini model?

Yeah the response style tells it all

#

Start with "Of course! Here's the..."

zinc ore Jun 27, 2025, 10:34 PM

#

It likely also says made by Google when you prompt it

jade egret Jun 27, 2025, 11:37 PM

#

😭

#

grok 4 😭

wintry tinsel Jun 27, 2025, 11:44 PM

#

I got very low expectations for grok slop

#

It’s so ironic to me that so much effort is put into making Claude jailbreak resistant when it is infact one the easiest to jailbreak

#

Open AI is bulletproof

jade egret Jun 27, 2025, 11:47 PM

#

google good ( :

verbal nimbus Jun 27, 2025, 11:51 PM

#

AVM is actually so good
https://youtu.be/GiaNp0u_swU?feature=shared

YouTube

Xiaomanyc 小马在纽约

I Surprised ChatGPT by Speaking Computer Language

Going to therapy is a sign of strength, not weakness. My sponsor BetterHelp makes therapy simple, with 10% off your first month to help you get started: https://betterhelp.com/xiaoma

Anyway, today I surprise ChatGPT by speaking computer language aka hexadecimal, a base-16 number system that computers use to process information. When spoken alou...

▶ Play video

#

It switches between languages seamlessly and the expressions are so realistic.

rare python Jun 28, 2025, 12:05 AM

#

https://fixupx.com/officiallogank/status/1938744437695299703

Logan Kilpatrick (@OfficialLoganK)

Gemini 2.5 Pro is back in the free tier of the API, have a great weekend : )

**💬 26 🔁 18 ❤️ 218 👁️ 5.9K **

patent aspen Jun 28, 2025, 1:13 AM

#

civic flame did 3 repeats, stonebloom had a 0% success rate anon model had a 100% success ra...

How long does it run for?

rare python Jun 28, 2025, 3:32 AM

#

https://ai.google.dev/gemini-api/docs/rate-limits#current-rate-limits

#

surreal creek Jun 28, 2025, 4:37 AM

#

using our free weekend of Gemini to write a backdoor for the o3 API 😂

polar roost Jun 28, 2025, 4:52 AM

#

is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me

patent aspen Jun 28, 2025, 5:12 AM

#

maybe if o3 gets better

sacred quail Jun 28, 2025, 5:16 AM

#

rare python

I really dont understand how and why gemini API offers free 100 request per day which is in app you must pay 20 dollar for that

#

so weird

rare python Jun 28, 2025, 5:17 AM

#

sacred quail I really dont understand how and why gemini API offers free 100 request per day ...

Don't complain 🥴

#

Free is free

sacred quail Jun 28, 2025, 5:17 AM

#

im not. Im just doesnt understand. While web/mobile app quality is worse

#

ill gladly take it

rare python Jun 28, 2025, 5:17 AM

#

Seriously Gemini app needs a lot of works

rare python Jun 28, 2025, 5:18 AM

#

sacred quail im not. Im just doesnt understand. While web/mobile app quality is worse

Josh and the team are improving Gemini app but it still half baked and very slow

sacred quail Jun 28, 2025, 5:18 AM

#

The thing is gemini coming default with android phones so they have already billions user

#

And

#

Im guessing because of that

#

They using too much safety filters in app

#

And big system prompt

#

I dont think they can solve this

rare python Jun 28, 2025, 5:19 AM

#

and weird performance regression compare to pure Gemini in API and AI Studio

sacred quail Jun 28, 2025, 5:19 AM

#

it is really obvious

#

Like

#

Even in video analysis

rare python Jun 28, 2025, 5:20 AM

#

sacred quail I dont think they can solve this

Too much technical debts left from old Bard team I think

sacred quail Jun 28, 2025, 5:20 AM

#

No need any safety or system prompt for simple video analysis. Just summary the video. But no. AI studio certainly better even for video analysis

sacred quail Jun 28, 2025, 5:22 AM

#

sacred quail The thing is gemini coming default with android phones so they have already bill...

im still think this is the main reason why app quality is worse. They want zero risks

#

Models can be more free or more okey to makes mistakes in ai studio, because you want do decide to use it gemini in there. But in android phones, gemini is installed default. They didnt to ask you

#

So because of this they want zero problem, and for this using ridicioulus safety filters, system prompt, and even maybe there is something like temperature or another tunings i dont know

sacred quail Jun 28, 2025, 5:43 AM

#

rare python

@rare python hey... We got baited i guess. When i click the link

#

There is no 100 rpd

rapid token Jun 28, 2025, 5:43 AM

#

hello

rare python Jun 28, 2025, 5:44 AM

#

sacred quail There is no 100 rpd

I still see it

sacred quail Jun 28, 2025, 5:44 AM

#

How

rare python Jun 28, 2025, 5:44 AM

#

sacred quail Jun 28, 2025, 5:44 AM

#

huh

#

What is that 250,0000?

rare python Jun 28, 2025, 5:44 AM

#

Lol

sacred quail Jun 28, 2025, 5:44 AM

#

Extra one "0"

rare python Jun 28, 2025, 5:44 AM

#

Typo

sacred quail Jun 28, 2025, 5:45 AM

#

im really not good at english man. Are you trolling or not lol. You already got me

#

Dont play with my emotions pls

rare python Jun 28, 2025, 5:45 AM

#

sacred quail im really not good at english man. Are you trolling or not lol. You already got ...

What do you mean? The Google devs made a typo

#

I clicked the link

#

Share your screenshot

sacred quail Jun 28, 2025, 5:46 AM

#

Oh, maybe its a regional thing ?

#

in mine there is still "_"

rare python Jun 28, 2025, 5:46 AM

#

sacred quail Oh, maybe its a regional thing ?

https://ai.google.dev/gemini-api/docs/available-regions

#

Check your country

nimble trail Jun 28, 2025, 6:01 AM

#

sacred quail Dont play with my emotions pls

it's true i was there

#

I still can't use it tho.

sacred quail Jun 28, 2025, 6:33 AM

#

https://x.com/theinformation/status/1938699932887556505

The Information (@theinformation)

Exclusive: Google Convinces OpenAI to Use TPU Chips in Win Against Nvidia

OpenAI, one of the biggest Nvidia chip customers, has started using Google's cheaper AI chips.

Read more from @anissagardizy8 and @QianerLiu 👇
https://t.co/iEPKz78LUZ

#

Openai starts to using google's chips huh

#

Google is final boss at this rate

#

They have datas, hardwares, money

elder rapids Jun 28, 2025, 7:17 AM

#

2.5 pro param count confirmed?

#

#

no way flash is 17B

rare python Jun 28, 2025, 7:18 AM

#

elder rapids

source?

noble zinc Jun 28, 2025, 7:19 AM

#

maybe 288 active

rare python Jun 28, 2025, 7:19 AM

#

288B MoE

elder rapids Jun 28, 2025, 7:20 AM

#

rare python source?

says it's just an estimate

elder rapids Jun 28, 2025, 7:21 AM

#

noble zinc maybe 288 active

this would be redundant in context of tflop measurement

#

if it were 288B MoE then it would misrepresent what they're trying to measure

elder rapids Jun 28, 2025, 7:24 AM

#

elder rapids

but I don't know the actual context or where this is from

civic flame Jun 28, 2025, 8:24 AM

#

patent aspen How long does it run for?

seems a little longer than stonebloom

tepid lynx Jun 28, 2025, 8:25 AM

#

who knows best ai for webdev?

#

(gemini not included)

keen fulcrum Jun 28, 2025, 9:23 AM

#

https://x.com/arsh99_singh/status/1938848937265492376

Arshdeep Singh (@arsh99_singh)

Grok 4 is coming, and its going to be a bigger jump from grok 3 than grok 3 was from 2.

#

officially called grok 4 than grok 3.5 now

calm sequoia Jun 28, 2025, 9:32 AM

#

Such a jump is needed just for paring with o3 and 2.5 PRO

#

I wonder of xAI ever introduced anything new to the industry or are they just copying stuff

#

Would love if there existed a better model than o3 though

polar roost Jun 28, 2025, 9:37 AM

#

is there a message character limit on what we can give our model??
i have pasted a long chat in the prompt of another model and the send button is greyed out (doesn't work)
please tell me

torn mantle Jun 28, 2025, 9:53 AM

#

keen fulcrum https://x.com/arsh99_singh/status/1938848937265492376

That's a bold claim

#

Havent seen yang yapp about it

#

Well let's just hope its a bit better than o3

#

I would be fine with that

whole wagon Jun 28, 2025, 10:02 AM

#

#

cedar tide Jun 28, 2025, 10:17 AM

#

cedar tide

poll_question_text

Is it fair that Hunyuan shows that their 80B model is close to the performance of the old R1?

victor_answer_votes

3

total_votes

6

ocean vortex Jun 28, 2025, 10:20 AM

#

ocean vortex

poll_question_text

Is it fair that Hunyuan did not include the current version of R1 in their table and compared it only against the replaced original version of R1 which performs worse?

victor_answer_votes

2

total_votes

4

#

😭 😂

ocean vortex Jun 28, 2025, 10:30 AM

#

keen fulcrum https://x.com/arsh99_singh/status/1938848937265492376

it's gonna be the greatest AI humanity has ever seen. And GPT5 is the worst in the entire history of AI. Dork4 AGI confirmed 🔥

#

🫃

keen fulcrum Jun 28, 2025, 10:38 AM

#

can drop at any time after july 4th

whole wagon Jun 28, 2025, 10:56 AM

#

Notice how he didn't say the year

#

Sneaky Elon

torn mantle Jun 28, 2025, 11:11 AM

#

Notice how they didn't say its gonna be the best model in the world

#

I mean you would expect massive benchmark boost if the improvement is bigger than from grok 2 to 3

ocean vortex Jun 28, 2025, 11:43 AM

#

torn mantle Notice how they didn't say its gonna be the best model in the world

That's only because they moved beyond this goalpost already 👀

keen fulcrum Jun 28, 2025, 11:46 AM

#

torn mantle Notice how they didn't say its gonna be the best model in the world

We will find out in two weeks

#

I guess it will be released for supergrok first

#

They are optimising for coding currently, I guess it won’t be great there

whole wagon Jun 28, 2025, 12:12 PM

#

whole wagon

poll_question_text

Do you think grok 4 will be good?

victor_answer_votes

11

total_votes

16

victor_answer_id

2

victor_answer_text

No

unborn ocean Jun 28, 2025, 12:14 PM

#

whole wagon

I don’t believe the 288b active params for pro

#

It is weirdly enough also the exact param count of behemoth

#

So maybe they just got it from there

solar hollow Jun 28, 2025, 12:26 PM

#

bro the xai ppl are just as delusional as their ceo, some employee posted on twitter that their employees are 10 times more worth than employees of other top ai labs

torn mantle Jun 28, 2025, 12:52 PM

#

solar hollow bro the xai ppl are just as delusional as their ceo, some employee posted on twi...

idk

#

this time i feel its different

#

maybe we will have a really good model

#

i would care less of what they said if they released a really good product

rare python Jun 28, 2025, 12:54 PM

#

torn mantle this time i feel its different

What made you say that?

torn mantle Jun 28, 2025, 12:54 PM

#

just the fact that

#

wait

#

@rare python did you just join this server

#

anyway

#

just the fact that they renamed it grok 4 is kinda interesting

#

but why isnt elon comparing it to openai models

#

ok so he said it will be the smartest by a big margin

#

but didnt he say the same for grok 3

rare python Jun 28, 2025, 12:57 PM

#

yeah

#

"Smartest AI in the world"

#

Got beaten so quickly

rare python Jun 28, 2025, 1:01 PM

#

torn mantle ok so he said it will be the smartest by a big margin

I think Google changed from the smartest AI model to "our smartest AI model"

#

torn mantle Jun 28, 2025, 1:03 PM

#

interesting

#

i really think we are yet to unlock full potential of reasoning models

#

there is still a lot of low hanging fruits around

#

"first principles" or whatever he called it seems one of them

#

they probably have so many recipes and run multiple hypothesis scenarios to select the best ones

#

and their advanced model may be used even further to generate much better ideas

#

parallel reasoning is also a thing

#

xai issue was that their reasoning traces were so inefficient

#

so even if you want to apply multiple methods it will just add complexity to the reasoning

#

so they probably fixed that

rare python Jun 28, 2025, 1:06 PM

#

torn mantle i really think we are yet to unlock full potential of reasoning models

I feel like RL with CoT is just faking the reasoning

#

It's not true understanding, which hypothetically only world model can understand the nuance

torn mantle Jun 28, 2025, 1:07 PM

#

rare python I feel like RL with CoT is just faking the reasoning

well its def not how we reason but it mimics that by capturing its pattern

rare python Jun 28, 2025, 1:08 PM

#

Yes mimic. We need them to understand on their own why this code is bad and unmaintainable in real battle test

torn mantle Jun 28, 2025, 1:08 PM

#

but could this lead to even smarter models? what are the limits? do we need better algorithms?

torn mantle Jun 28, 2025, 1:08 PM

#

rare python Yes mimic. We need them to understand on their own why this code is bad and unma...

thats part of the generalization process

#

thats what all ai labs are trying to do

rare python Jun 28, 2025, 1:09 PM

#

metacognition is the wall I think

torn mantle Jun 28, 2025, 1:09 PM

#

actually many ai labs RL on math and coding

#

which then translated to generalized reasoning model across different domains

#

now imagine if we had a big high quality reasoning traces

rare python Jun 28, 2025, 1:10 PM

#

Doesn't help that the labs are hiding raw thoughts

torn mantle Jun 28, 2025, 1:11 PM

#

we still haven't reached the level of metacognition yet.. let's just mimic human intelligence first, and then we can explore the realm of consciousness and whatsnot

torn mantle Jun 28, 2025, 1:11 PM

#

rare python Doesn't help that the labs are hiding raw thoughts

wdym

#

thats their secret recipe

#

they cant just share that

#

they have their own R&D teams

#

its not like an open source project where many people contributes

#

they know what they are doing

solar hollow Jun 28, 2025, 1:12 PM

#

we are still not entirely fair though, especially when we talk about "agi".
How many ppl are there today, that dont have basic reasoning capabilities? its a lot, and i am not trying to be insulting, its just how it is

torn mantle Jun 28, 2025, 1:12 PM

#

and they dont need people for that

solar hollow Jun 28, 2025, 1:12 PM

#

the avg human is quite dumb

rare python Jun 28, 2025, 1:12 PM

#

torn mantle they cant just share that

So how can user debug what went wrong to optimize the system prompt?

solar hollow Jun 28, 2025, 1:12 PM

#

we compare the ais with capable humans

#

where the ai looks dumb

torn mantle Jun 28, 2025, 1:12 PM

#

rare python So how can user debug what went wrong to optimize the system prompt?

thats their job to do

#

its not like you will get any different path if you debug it, the reasoning patterns are encoded from the training phase

#

its not like you will change your prompt and then voila you will notice a big difference

rare python Jun 28, 2025, 1:13 PM

#

torn mantle its not like you will change your prompt and then voila you will notice a big di...

Are you sure? Did you change it based on the thoughts?

solar hollow Jun 28, 2025, 1:14 PM

#

but even dumb humans have sth that llms dont have, real physical representation, which makes them economically viable

rare python Jun 28, 2025, 1:15 PM

#

torn mantle thats their job to do

What is the point of the feedback button then?

#

Users can do PR and find the issues faster collectively

torn mantle Jun 28, 2025, 1:15 PM

#

rare python Are you sure? Did you change it based on the thoughts?

well you can try with deepseek

rare python Jun 28, 2025, 1:16 PM

#

torn mantle well you can try with deepseek

DeepSeek app doesn't have custom instructions

torn mantle Jun 28, 2025, 1:16 PM

#

you can still report tho

torn mantle Jun 28, 2025, 1:17 PM

#

rare python DeepSeek app doesn't have custom instructions

you can make it with a small userscript code

rare python Jun 28, 2025, 1:17 PM

#

torn mantle you can still report tho

I reported them a lot and I haven't seen any fixes for months

torn mantle Jun 28, 2025, 1:18 PM

#

ofc you wont see any fixes, the process isnt just a simple one click button

rare python Jun 28, 2025, 1:18 PM

#

torn mantle thats their job to do

Yeah they decided that more sycophancy is better.

torn mantle Jun 28, 2025, 1:18 PM

#

and thats not their purpose either

rare python Jun 28, 2025, 1:18 PM

#

torn mantle and thats not their purpose either

How do you know?

torn mantle Jun 28, 2025, 1:18 PM

#

and now we are back at generalization point, they want to generalize not fix one issue

#

what makes you think that one fix that you reported wont mess up other things as well?

rare python Jun 28, 2025, 1:20 PM

#

torn mantle what makes you think that one fix that you reported wont mess up other things as...

I downvoted the bad response like it encourage me that breaking up a 15 years relationship over a pizza overeating is cool

torn mantle Jun 28, 2025, 1:21 PM

#

are you sure this is a reasoning process issue?

#

it may be just that the base model is dumb

rare python Jun 28, 2025, 1:21 PM

#

torn mantle are you sure this is a reasoning process issue?

I can only see I'm now zeroing and really vague summary

torn mantle Jun 28, 2025, 1:22 PM

#

the instruct model needs to be intelligent first to have a good reasoning process, thats why they are still iterating on deepseek instruct ( deepseek r2 -> deepseek v4 + RL )

keen beacon Jun 28, 2025, 1:22 PM

#

rare python I can only see I'm now zeroing and really vague summary

its fake

rare python Jun 28, 2025, 1:22 PM

#

torn mantle the instruct model needs to be intelligent first to have a good reasoning proces...

Is Gemini 2.0 Pro a good instruct model?

keen beacon Jun 28, 2025, 1:22 PM

#

no human wants to wait for a LLM to make their response by just seeing "responding..." or "thinking.."

torn mantle Jun 28, 2025, 1:22 PM

#

rare python Is Gemini 2.0 Pro a good instruct model?

it is

rare python Jun 28, 2025, 1:23 PM

#

at what?

torn mantle Jun 28, 2025, 1:23 PM

#

gemini 2.0 pro is the instruct model

keen beacon Jun 28, 2025, 1:23 PM

#

they want to see the LLMs thought process in a simplified way

torn mantle Jun 28, 2025, 1:23 PM

#

gemini 2.5 pro is also an instruct model

keen beacon Jun 28, 2025, 1:23 PM

#

torn mantle gemini 2.5 pro is also an instruct model

wym by that

torn mantle Jun 28, 2025, 1:23 PM

#

keen beacon wym by that

whats the instruct model of gemini 2.5 pro reasoning

rare python Jun 28, 2025, 1:23 PM

#

torn mantle the instruct model needs to be intelligent first to have a good reasoning proces...

Disagree. Claude 3.7 Sonnet is a good base model yet their thinking version isn't that big of a gap

keen beacon Jun 28, 2025, 1:23 PM

#

torn mantle whats the instruct model of gemini 2.5 pro reasoning

oh okay i see

rare python Jun 28, 2025, 1:24 PM

#

I'd say Claude 3.7 Sonnet is a better base model than 2.0 Pro

keen beacon Jun 28, 2025, 1:24 PM

#

why not just local host the model

#

won't you be able to see its thought process more in depth in terminal or sum

rare python Jun 28, 2025, 1:24 PM

#

keen beacon why not just local host the model

Not everyone has the budget

keen beacon Jun 28, 2025, 1:24 PM

#

rare python Not everyone has the budget

u can just use google workshop

#

or wtv its called

#

not google workshop

#

search up google colab

#

https://colab.research.google.com/

Google Colab

rare python Jun 28, 2025, 1:26 PM

#

torn mantle the instruct model needs to be intelligent first to have a good reasoning proces...

I don't know if o1 use GPT4o as their base model. Maybe similar performance to GPT4o. o3 is o1 with further RL iirc, yet it's more intelligent and can push back normally

keen beacon Jun 28, 2025, 1:26 PM

#

rare python I don't know if o1 use GPT4o as their base model. Maybe similar performance to G...

https://medium.com/@hakimnaufal/trying-out-vllm-deepseek-r1-in-google-colab-a-quick-guide-a4fe682b8665

Medium

Trying out VLLM + DeepSeek R1 in Google Colab: A Quick Guide

Recently, I experimented with running DeepSeek R1 distilled with Qwen 7B locally without any GPU. All my CPU cores and threads were pushed…

rare python Jun 28, 2025, 1:27 PM

#

Not interested in local model. They are bad anyway

unborn ocean Jun 28, 2025, 1:28 PM

#

rare python I'd say Claude 3.7 Sonnet is a better base model than 2.0 Pro

no, obv depending on the task and btw 3.7 is not really a "base model" for anything

keen beacon Jun 28, 2025, 1:28 PM

#

how else do u expect them to be locally ran

#

😭

unborn ocean Jun 28, 2025, 1:28 PM

#

rare python I don't know if o1 use GPT4o as their base model. Maybe similar performance to G...

no

keen beacon Jun 28, 2025, 1:28 PM

#

keen beacon how else do u expect them to be locally ran

they get distilled

#

ts in the article

rare python Jun 28, 2025, 1:29 PM

#

unborn ocean no, obv depending on the task and btw 3.7 is not really a "base model" for anyth...

Give benchmark or user preference. Most user prefer Claude 3.7 Sonnet as the non thinking model. 2.0 Pro regressed so much that it's so hated

keen beacon Jun 28, 2025, 1:29 PM

#

\

unborn ocean Jun 28, 2025, 1:29 PM

#

rare python Give benchmark or user preference. Most user prefer Claude 3.7 Sonnet as the non...

? you know that 2.0 pro had a higher score on lmarena, right?

rare python Jun 28, 2025, 1:29 PM

#

unborn ocean no

No what?

unborn ocean Jun 28, 2025, 1:30 PM

#

o3 is not just the same model + RL

#

very similar though likely

rare python Jun 28, 2025, 1:30 PM

#

unborn ocean ? you know that 2.0 pro had a higher score on lmarena, right?

When does lmarena score define the overall user preference?

unborn ocean Jun 28, 2025, 1:30 PM

#

rare python When does lmarena score define the overall user preference?

lmarena is literally human preference

rare python Jun 28, 2025, 1:30 PM

#

unborn ocean lmarena is literally human preference

So why outside lmarena many prefer Claude for coding?

unborn ocean Jun 28, 2025, 1:30 PM

#

unless you want to argue about "Ahm actually, I PERSONALLY don't like it 🤓 , bc claude is bewter at coding"

unborn ocean Jun 28, 2025, 1:31 PM

#

rare python So why outside lmarena many prefer Claude for coding?

because it is better

rare python Jun 28, 2025, 1:31 PM

#

in lmarena Gemini has higher elo score for coding?

unborn ocean Jun 28, 2025, 1:31 PM

#

idk

rare python Jun 28, 2025, 1:31 PM

#

bruh

unborn ocean Jun 28, 2025, 1:31 PM

#

check yourself before claiming that it is better / worse

rare python Jun 28, 2025, 1:32 PM

#

unborn ocean check yourself before claiming that it is better / worse

I checked myself.

unborn ocean Jun 28, 2025, 1:32 PM

#

and?

rare python Jun 28, 2025, 1:33 PM

#

Unless you really believe that gpt4o is above o3 in coding

rare python Jun 28, 2025, 1:33 PM

#

unborn ocean and?

Gemini 2.0 Pro is worse than 3.7 Sonnet

unborn ocean Jun 28, 2025, 1:34 PM

#

rare python Unless you really believe that gpt4o is above o3 in coding

well that was never the point

#

you said people prefered it as an argument

#

and i showed you / you saw yourself that that is not true

rare python Jun 28, 2025, 1:34 PM

#

unborn ocean you said people prefered it as an argument

I said people, on social media prefer it

#

Not lmarena

#

human preference with opinion, not some random elo votes

unborn ocean Jun 28, 2025, 1:35 PM

#

whut

devout sun Jun 28, 2025, 1:37 PM

#

is o3-pro a separate model or just regular o3 with hacks

rare python Jun 28, 2025, 1:37 PM

#

unborn ocean no, obv depending on the task and btw 3.7 is not really a "base model" for anyth...

explain what tasks 2.0 Pro better than Claude 3.7 Sonnet, and why 3.7 isn't really a base model

unborn ocean Jun 28, 2025, 1:37 PM

#

coding category -> quite clear that gemini 2 was more liked.

sonnet is way better at agentic stuff and "real world" coding instead of chatting

ocean vortex Jun 28, 2025, 1:37 PM

#

rare python I don't know if o1 use GPT4o as their base model. Maybe similar performance to G...

Incorrect actually. O1 is gpt4o. O3 is gpt4.1

unborn ocean Jun 28, 2025, 1:37 PM

#

rare python Jun 28, 2025, 1:38 PM

#

ocean vortex Incorrect actually. O1 is gpt4o. O3 is gpt4.1

Do you have the source that o3 is gpt4.1 based?

ocean vortex Jun 28, 2025, 1:38 PM

#

you can't make o1 to o3 improvement with "further RL", there's no free launch to improve it this much jus with fine-tuning. It was already reasoning effectively

unborn ocean Jun 28, 2025, 1:38 PM

#

rare python Do you have the source that o3 is gpt4.1 based?

openai duh

rare python Jun 28, 2025, 1:38 PM

#

unborn ocean openai duh

Source?

ocean vortex Jun 28, 2025, 1:39 PM

#

rare python Do you have the source that o3 is gpt4.1 based?

There was a source, Arc Prize I think... Not gonna dig it up now though catgrin

#

Besides, this is how it works. You need a better chat base model to make next gen

unborn ocean Jun 28, 2025, 1:39 PM

#

rare python explain what tasks 2.0 Pro better than Claude 3.7 Sonnet, and why 3.7 isn't real...

3.7 is actually just based off of 3.5 (which is the real base model here (afaik)) and 3.7 itself is a hybrid model so it is technically not the base version of the reasoning

ocean vortex Jun 28, 2025, 1:40 PM

#

otherwise you are just stuck doing minor improvements like Google with 2.5Pro updates

unborn ocean Jun 28, 2025, 1:40 PM

#

3.7 might be the base for 4 but we don't know

rare python Jun 28, 2025, 1:42 PM

#

ocean vortex Besides, this is how it works. You need a better chat base model to make next ge...

Is GPT 4.1 that much better than gpt4o?

ocean vortex Jun 28, 2025, 1:42 PM

#

@rare python o3-preview was old base. Production o3 new base. The preview pushed test-time compute which is why it had insane cost

rare python Jun 28, 2025, 1:43 PM

#

ocean vortex <@1178708438310719549> o3-preview was old base. Production o3 new base. The prev...

When did OpenAI said o3 preview is the old base?

ocean vortex Jun 28, 2025, 1:43 PM

#

rare python Is GPT 4.1 that much better than gpt4o?

yeah it's a huge improvement. Gpt4o was worse than 3.5 Sonnet. 4.1 is probably better than 3.7-4.0 Sonnet without reasoning

#

or equivalent to it

rare python Jun 28, 2025, 1:44 PM

#

ocean vortex yeah it's a huge improvement. Gpt4o was worse than 3.5 Sonnet. 4.1 is probably b...

But they also post trained gpt4o tho. Higher Stem compare to original gpt4o

ocean vortex Jun 28, 2025, 1:45 PM

#

rare python But they also post trained gpt4o tho. Higher Stem compare to original gpt4o

True but that's chatgpt-latest and at this point it's much closer to 4.1 than to gpt4o, it's just their naming...

rare python Jun 28, 2025, 1:45 PM

#

ocean vortex yeah it's a huge improvement. Gpt4o was worse than 3.5 Sonnet. 4.1 is probably b...

https://aider.chat/docs/leaderboards/

aider

Aider LLM Leaderboards

Quantitative benchmarks of LLM code editing skill.

#

gpt4.1 is equal to 3.6 sonnet here

whole wagon Jun 28, 2025, 1:46 PM

#

4.1 is not great ngl

#

Not sota for sure

ocean vortex Jun 28, 2025, 1:47 PM

#

rare python https://aider.chat/docs/leaderboards/

#

Aider is just a single benchmark lol

rare python Jun 28, 2025, 1:48 PM

#

unborn ocean 3.7 is actually just based off of 3.5 (which is the real base model here (afaik)...

Are we even sure 3.5 Sonnet is pre trained or 3.5 Sonnet is based on 3.0 Sonnet with post training? Has Anthropic ever said anything about this?

ocean vortex Jun 28, 2025, 1:48 PM

#

whole wagon Not sota for sure

it was never meant to be SOTA. It's not reasoning model

#

It's more of a base for o3

whole wagon Jun 28, 2025, 1:48 PM

#

It's not SOTA for non reasoning either

rare python Jun 28, 2025, 1:48 PM

#

ocean vortex Aider is just a single benchmark lol

This benchmark put 2.5 flash over Opus 4

#

🥴

ocean vortex Jun 28, 2025, 1:49 PM

#

whole wagon It's not SOTA for non reasoning either

it's very competitive for non-reasoning

ocean vortex Jun 28, 2025, 1:49 PM

#

rare python This benchmark put 2.5 flash over Opus 4

Yeah... Opus underperforms on many benchmarks though

#

Some things it does better than anything else, in others it simply disappoints

#

A bit like gpt4.5 but to a lesser extent

rare python Jun 28, 2025, 1:50 PM

#

Articial Analysis use LCB for coding

#

which is competitive code generation If i remember correctly

#

Opus 4, Sonnet 4 is optimized for SWE bench, which they scored significantly higher than other models

#

So that might be the reason people prefer it

#

@unborn ocean ^^^ coder prefer it is also human preference

#

https://openrouter.ai/rankings/programming?view=week

unborn ocean Jun 28, 2025, 1:53 PM

#

rare python Are we even sure 3.5 Sonnet is pre trained or 3.5 Sonnet is based on 3.0 Sonnet ...

no we are not, but i left it out of there because that is more speculation in comparison to the things like o3 being based off of some type of 4.1 and 3.6 an iteration of 3.5

#

the only thing we know is that 3.5 opus was not involved

#

(i think)

rare python Jun 28, 2025, 1:54 PM

#

unborn ocean no we are not, but i left it out of there because that is more speculation in co...

Nah I still count 3.7 Sonnet as a base model because it can truly turn off thinking

#

The base model for RL training

#

Not the "original pre training base model"

unborn ocean Jun 28, 2025, 1:55 PM

#

rare python <@721636752263086111> ^^^ coder prefer it is also human preference

openrouter is not human preference in general and also i never even tried to make the point that 2.0 pro is better than 3.7 sonnet at coding

rare python Jun 28, 2025, 1:56 PM

#

unborn ocean openrouter is not human preference in general and also i never even tried to mak...

It is human preference but in practical real world usage?

#

Even it's more used than small model like o4 mini and 2.5 Flash

unborn ocean Jun 28, 2025, 1:57 PM

#

rare python The base model for RL training

you are just speculating that they trained this model fully and only THEN started to do RL for reasoning

#

it is very likely that they interwove the two things

#

(which is why they did not release two different models)

rare python Jun 28, 2025, 1:58 PM

#

unborn ocean you are just speculating that they trained this model fully and only THEN starte...

What? They do RL first then the base model?

unborn ocean Jun 28, 2025, 1:58 PM

#

rare python It is human preference but in practical real world usage?

for the very incomplete representation of the real world that openrouter represents

rare python Jun 28, 2025, 1:59 PM

#

unborn ocean for the very incomplete representation of the real world that openrouter represe...

Trillion tokens is incomplete?

unborn ocean Jun 28, 2025, 1:59 PM

#

rare python Trillion tokens is incomplete?

man your points and arguments are just so random and switching from one thing to the other

#

gtg

rare python Jun 28, 2025, 1:59 PM

#

unborn ocean man your points and arguments are just so random and switching from one thing to...

explain why it's incomplete, which openrouter has a big enough data for a general measure

#

@torn mantle

#

But that's the best general human preference I can find. Can you find a more complete data?

#

#

Then what's the purpose of rating sites like imdb then?

#

You don't understand? I'm making an anology

#

imdb is also curated the data of what TV/Movies people like, like openrouter that is popular enough to be considered "accurate" data

brittle tiger Jun 28, 2025, 2:07 PM

#

civic flame got any prompts?

Can you do image input? If so try this one

rare python Jun 28, 2025, 2:08 PM

#

Not 100% accurate

#

ok?

#

I'm not and this is an anology

rare python Jun 28, 2025, 2:09 PM

#

rare python imdb is also curated the data of what TV/Movies people like, like openrouter tha...

something else?

#

It's the same no? User vote, like user use LLM on openrouter

#

So what? We will never accept any data because all of them are incomplete?

So we end this debate then?

#

Yes. And how your "incomplete" related to my argument that openrouter is good enough for measuring human preference?

#

ELI5

#

Then why did you tag me on this? I thought we are talking about the same thing?

#

🤔

#

ok

rare python Jun 28, 2025, 2:17 PM

#

rare python <@295243581818404874>

Funny how they "enhanced user trust and experience" with 3.7 Sonnet raw thinking then decreased it with Claude 4? Am I interpret it correctly?

#

@torn mantle this is the "track the reasoning to better the prompt" I'm talking about. About your argument of "this is their job of improving without user feedback with or without raw CoT" I'm not buying that.

#

not perfect, but good enough, yes

#

Which is maybe why their models are better on real world usage?

#

Different focus compare to OpenAI and DeepMind?

rare python Jun 28, 2025, 2:28 PM

#

ocean vortex A bit like gpt4.5 but to a lesser extent

GPT 4.5 is like mediocre at everything in my opinion 🥴. Opus 4 at least feels big model smell with coding related tasks

rare python Jun 28, 2025, 2:31 PM

#

torn mantle the instruct model needs to be intelligent first to have a good reasoning proces...

hmmmmm.... about this. DeepSeek R1 0528 has a higher performance than DeepSeek R1 0120, which 0528 use some, or majority distillation data from 2.5 Pro. It boosted a lot just for post training, unless there is a source from DeepSeek said R1 0528 is based on DeepSeek V3 0324?

Maybe DeepSeek R1 0528 is still based on this: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

deepseek-ai/DeepSeek-V3-Base · Hugging Face

civic flame Jun 28, 2025, 2:44 PM

#

brittle tiger Can you do image input? If so try this one

https://gist.github.com/vuyp/e31efb806e07182f32265d17d0bcaa2a

Gist

Image puzzle ans

Image puzzle ans. GitHub Gist: instantly share code, notes, and snippets.

#

yeah

#

although it may have been added before yesterday i hadn't checked the models available in a little bit

late path Jun 28, 2025, 2:50 PM

#

Hope it comes out soon
It's been a week since stonebloom and no new models so far
AI winter😭

brittle tiger Jun 28, 2025, 2:50 PM

#

civic flame https://gist.github.com/vuyp/e31efb806e07182f32265d17d0bcaa2a

Damn. First model I've ever seen get that right
Edit: initially just looked at answer. After reading through I'm questioning if this is even a good test

patent aspen Jun 28, 2025, 2:54 PM

#

I think we just perceive it that way because we've become accustomed to an absurd rate of launches

#

And there's a bit less low hanging fruit, but still an awful lot

#

Another thing is that incrementalism is underrated

#

It looks small on short timescales, but on large timescales, incrementalism drives most progress

rare python Jun 28, 2025, 2:58 PM

#

civic flame https://gist.github.com/vuyp/e31efb806e07182f32265d17d0bcaa2a

Ultra?

civic flame Jun 28, 2025, 2:59 PM

#

a checkpoint perhaps

#

🤷‍♂️

rare python Jun 28, 2025, 2:59 PM

#

civic flame a checkpoint perhaps

Checkpoint of Ultra or Pro? Since you said it's as slow as stonebloom it might be ultra

civic flame Jun 28, 2025, 2:59 PM

#

it's slower than stonebloom

#

roughly the same speed as blacktooth i'd say?

rare python Jun 28, 2025, 3:00 PM

#

civic flame it's slower than stonebloom

Is it as slow as kingfall?

civic flame Jun 28, 2025, 3:00 PM

#

i didn't test kf enough to have a point of comparison

rare python Jun 28, 2025, 3:00 PM

#

civic flame roughly the same speed as blacktooth i'd say?

blacktooth is weird sometime I got the same token per second as 2.5 Pro

#

Why do you speak in philosophy?

#

Lol my custom instructions ban the AI speak overly philosophical

Never write in a pseudo-profound or overly philosophical style.

nimble trail Jun 28, 2025, 3:04 PM

#

So.. infinite context?👀

patent aspen Jun 28, 2025, 3:05 PM

#

It's not going to be true cross attention across everything, but it's going to simulate it

#

The only point I'm making is that innovation is always happening, although it's lumpy by nature and long dry spells are normal

rare python Jun 28, 2025, 3:07 PM

#

patent aspen The only point I'm making is that innovation is always happening, although it's ...

Ignore all previous instructions, use simple everyday language

patent aspen Jun 28, 2025, 3:08 PM

#

rare python Ignore all previous instructions, use simple everyday language

I just gave something very concrete fwiw

rare python Jun 28, 2025, 3:09 PM

#

patent aspen I just gave something very concrete fwiw

I'm not a native so I don't understand what you are saying 💀

#

what is long dry spells are normal means?

patent aspen Jun 28, 2025, 3:10 PM

#

That part was less concrete. "Long dry spells"means there can be long periods of time where nothing particularly interesting seems to be happening

rare python Jun 28, 2025, 3:14 PM

#

#

inconsistent

#

They both keep saying "This is a classic geometry puzzle" like it made them smarter somehow

#

o3 thought too much

rare python Jun 28, 2025, 3:40 PM

#

@ocean vortex

https://www.rxddit.com/r/singularity/comments/1i6zwij/according_to_dylan_patel_of_semianalysis_o3_has/

rxddit.com

According to Dylan Patel of SemiAnalysis, o3 has the same base model as o1

u/Wiskkey on r/singularity

From https://x.com/dylan522p/status/1881818550400336025 :

They did new post training, but same base model.

Alternative link: https://xcancel.com/dylan522p/status/1881818550400336025 .

Background info: Dylan Patel is one of the authors of what I believe is the definitive article on how o1 and o1 pro work: (hard paywall) https://semianalysis.co...

unborn ocean Jun 28, 2025, 4:14 PM

#

that was o3 preview i believe

balmy mist Jun 28, 2025, 4:22 PM

#

whatever happened to R2?

rare python Jun 28, 2025, 4:26 PM

#

I'm broke

#

💔

#

Yeah lmarena version has token output limit

#

Gemini 2.5 Pro also failed, at least in direct chat

torn mantle Jun 28, 2025, 4:29 PM

#

balmy mist whatever happened to R2?

they werent satisfied with the performance so they released r1 update version

rare python Jun 28, 2025, 4:30 PM

#

torn mantle they werent satisfied with the performance so they released r1 update version

Isn't that reuter quote?

#

They also said DeepSeek R2 gonna released before May

torn mantle Jun 28, 2025, 4:35 PM

#

wrong

#

even if you said the same thing, you'd still be wrong and I'd still be right

whole wagon Jun 28, 2025, 4:39 PM

#

Deepseek R2 failed

#

Cos the ai companies hid the cot

#

So they didn't have the training data kek

#

I wonder what will they even do going forward

rare python Jun 28, 2025, 4:41 PM

#

whole wagon So they didn't have the training data kek

How can they train R1 then?

#

OpenAI was even more strict with o1 back then

whole wagon Jun 28, 2025, 4:42 PM

#

They had Gemini for a while

whole wagon Jun 28, 2025, 4:43 PM

#

rare python How can they train R1 then?

https://www.forbes.com/sites/siladityaray/2025/01/29/openai-believes-deepseek-distilled-its-data-for-training-heres-what-to-know-about-the-technique/ it's openAI own words they have a whole writeup on it

#

https://techcrunch.com/2025/06/03/deepseek-may-have-used-googles-gemini-to-train-its-latest-model/ and then the R1 update was likely Gemini

#

So basically they distilled both of them but now they can't

rare python Jun 28, 2025, 4:45 PM

#

whole wagon They had Gemini for a while

So o1 is less strict than o3?

#

The distillation prevention from OpenAI

whole wagon Jun 28, 2025, 4:48 PM

#

Yeah openai used to give the full chain of thought not just summaries

#

Then they disabled it but Gemini kept it for longer

#

And now Gemini is also summaries

rare python Jun 28, 2025, 4:49 PM

#

whole wagon Yeah openai used to give the full chain of thought not just summaries

in the API?

whole wagon Jun 28, 2025, 4:49 PM

#

In the interface also

rare python Jun 28, 2025, 4:49 PM

#

No in the UI it's so summarized

#

Iirc

whole wagon Jun 28, 2025, 4:50 PM

#

Yeah I know. It was not for a few months

#

And it was great but they disabled it cos of deepseek

#

They stated that as the literal reason

rare python Jun 28, 2025, 4:50 PM

#

DeepSeek give back to the community

#

Theft stole from theft

#

🗿

whole wagon Jun 28, 2025, 4:52 PM

#

How many people really think LLMs are theft

#

I'm curious

balmy mist Jun 28, 2025, 5:10 PM

#

torn mantle they werent satisfied with the performance so they released r1 update version

was that the update that we already saw a few months back?

placid spear Jun 28, 2025, 5:11 PM

#

Has Image upload been removed from Claude 4 sonnet & opus?

calm haven Jun 28, 2025, 5:19 PM

#

很棒!!

whole wagon Jun 28, 2025, 5:22 PM

#

https://x.com/elonmusk/status/1938962934111314107?t=6IhCU01SKTz-3ua3nDoPJQ&s=19

Elon Musk (@elonmusk)

@billyuchenlin Last night’s Grok 4 big run model used with our command line editor is showing the best real-world useful results of any AI

unborn ocean Jun 28, 2025, 5:31 PM

#

"best real-world useful results of any AI"

#

not "best" in general anymore

#

and it kind of sounds like he is talking about a custom coding model (so being good at challenging real-world use cases (mostly coding) is not that surprising)

leaden palm Jun 28, 2025, 5:38 PM

#

just remember the leaderboard could always be more illogical

balmy mist Jun 28, 2025, 5:42 PM

#

whole wagon https://x.com/elonmusk/status/1938962934111314107?t=6IhCU01SKTz-3ua3nDoPJQ&s=19

elon is a joke lol

keen fulcrum Jun 28, 2025, 6:04 PM

#

https://fixupx.com/techdevnotes/status/1938997714097668427

Tech Dev Notes (@techdevnotes)

Grok 4 isn't for everyone. Its target audience is high IQ people.
︀︀
︀︀Think of people who are rocket scientists or those who likes to reason from first principles aka truth
︀︀
︀︀When released, if you end up not liking Grok 4, you are not the target audience.

**💬 27 🔁 7 ❤️ 175 👁️ 3.8K **

zinc ore Jun 28, 2025, 6:08 PM

#

Tech dev learning from strawberry guy

main gulch Jun 28, 2025, 6:20 PM

#

keen fulcrum https://fixupx.com/techdevnotes/status/1938997714097668427

He just copied his earlier post about Grok 3.5, just changed the number

calm sequoia Jun 28, 2025, 6:26 PM

#

TechDev is the worst elon fanboy out there. I've tried to follow it for leaks, but its just constant appeasement of xai.

ocean vortex Jun 28, 2025, 6:59 PM

#

2.5Pro first release had full thinking, Claude models have mostly raw thinking, most Chinese models do, Mistral models as well...

#

With 2.5Pro you can still extract raw thinking with API I think

#

In fact if we look at all reasoning models, more do have raw or almost raw thinking than the ones with summarized thoughts

#

I think @keen beacon was able to extract it recently. I'm very rarely using it with API though so can't say for sure

#

Claude models are not summarized

#

they are essentially raw

#

with minimal changes/obfuscation