Deepseek-R1-0528 & DeepSeek-R1-0528-Qwen3-8B | OpenRouter | Page 1

tardy arrow May 28, 2025, 6:00 PM

#

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

deepseek-ai/DeepSeek-R1-0528-Qwen3-8B · Hugging Face

lapis epoch May 28, 2025, 6:08 PM

#

yohoo!!

sleek bough May 28, 2025, 6:10 PM

#

tardy arrow https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 https://huggingface.co/deep...

https://tenor.com/view/cat-cat-love-cat-hug-cat-kiss-kitten-gif-7466777358493109839

Tenor

#

Thats great

ember robin May 28, 2025, 6:15 PM

#

Very cool

obsidian cipher May 28, 2025, 6:17 PM

#

its already online in deepseek chat

mystic niche May 28, 2025, 6:28 PM

#

honest jasper May 28, 2025, 6:34 PM

#

their official api is slow as heck

quiet saddle May 28, 2025, 6:51 PM

#

checkpoint has been released. 3rd party providers will deploy in hours.

limber hearth May 28, 2025, 6:56 PM

#

Parasail have it up and running can we get it live on OR?

sick palm May 28, 2025, 7:29 PM

#

What's different between the new R1 and the old R1?

woeful mountain May 28, 2025, 7:29 PM

#

limber hearth Parasail have it up and running can we get it live on OR?

https://openrouter.ai/deepseek/deepseek-r1-0528

R1 0528 - API, Providers, Stats

DeepSeek R1's update to the original R1. Run R1 0528 with API

honest jasper May 28, 2025, 7:39 PM

#

sweet

languid cove May 28, 2025, 8:01 PM

#

I wonder if providers will stop the thinking tax now…

#

So annoying

#

The best part about the unified thinking models like Qwen 3 is that providers can’t really get away with the thinking tax with a straight face lol

lapis epoch May 28, 2025, 8:02 PM

#

Parsail is one of the few providers I blindly trust.

languid cove May 28, 2025, 8:18 PM

#

sigh

#

the thinking tax is well in place

obsidian cipher May 28, 2025, 8:18 PM

#

does anybody know whether the model on the deepseek api is already 0528?

grand jacinth May 28, 2025, 8:21 PM

#

Parasail is fast ⛵

grand jacinth May 28, 2025, 8:21 PM

#

languid cove the thinking tax is well in place

Do you mean the fact we pay for thoughts?

languid cove May 28, 2025, 8:21 PM

#

the fact that deepseek v3 costs the exact same for the providers

#

as deepseek r1

lapis epoch May 28, 2025, 8:22 PM

#

it doesnt

#

thinking models are more expensive to serve

grand jacinth May 28, 2025, 8:23 PM

#

Yeah that's my understanding too

languid cove May 28, 2025, 8:23 PM

#

deepseek said themselves that isn't the case

#

and the fact that Qwen models providers charge a single unified price is evidence that it doesn't really matter to them

#

thinking models output more tokens

grand jacinth May 28, 2025, 8:24 PM

#

Memory requirements maybe as the params are the same, but not cycles, it will require more processing

languid cove May 28, 2025, 8:24 PM

#

yeah that's fine, but that's why you pay for more output tokens

#

it's all baked in

grand jacinth May 28, 2025, 8:24 PM

#

That's right and every token is a processed cycle

languid cove May 28, 2025, 8:25 PM

#

if you pay $1/$3 for a non-reasoning model, and ask it to output 10k tokens worth of stuff, you're paying for those extra cycles through the extra output tokens

grand jacinth May 28, 2025, 8:25 PM

#

Well let me try the baking first goddamn

languid cove May 28, 2025, 8:25 PM

#

it isn't uniquely more expensive such that the output token price would go up

#

generally it only makes sense to increase the price at really high context

#

also neat

#

new R1 seems to think a lot less

#

a LOT less

primal nimbus May 28, 2025, 8:28 PM

#

Any benchmarks for R1.1 yet?

unique scaffold May 28, 2025, 8:42 PM

#

@woeful mountain as the DeepSeek provider is on this version as well, will u be switching it over?

woeful mountain May 28, 2025, 8:43 PM

#

unique scaffold <@165587622243074048> as the DeepSeek provider is on this version as well, will ...

do we 100% know it's the new version kek

unique scaffold May 28, 2025, 8:44 PM

#

wheezeold no fucking clue

#

it def seems better but these guys need to stop dropping these updates without telling ANYONE anything

#

lmfao

unique scaffold May 28, 2025, 8:45 PM

#

woeful mountain do we 100% know it's the new version <:kek:897188603657089074>

maybe we wait for the proper announcement actaully lol

#

they usually make one on their deepseek site

lapis epoch May 28, 2025, 8:45 PM

#

unique scaffold maybe we wait for the proper announcement actaully lol

"Hey we dropped a minor update on huggingface , check it out"

unique scaffold May 28, 2025, 8:46 PM

#

LMAO

#

https://api-docs.deepseek.com/

Your First API Call | DeepSeek API Docs

The DeepSeek API uses an API format compatible with OpenAI. By modifying the configuration, you can use the OpenAI SDK or softwares compatible with the OpenAI API to access the DeepSeek API.

#

it ususlaly drops here

#

in the "news" section

#

last time with v3-0324, it took a day or two so

#

we'll see~

bitter wave May 28, 2025, 8:52 PM

#

unique scaffold they usually make one on their deepseek site

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

deepseek-ai/DeepSeek-R1-0528 · Hugging Face

lusty wadi May 28, 2025, 8:57 PM

#

might as well benchmark it on MRCR

bitter wave May 28, 2025, 8:57 PM

#

lusty wadi might as well benchmark it on MRCR

https://preview.redd.it/deepseek-r1-0528-v0-09patvqurk3f1.jpeg?width=1080&format=pjpg&auto=webp&s=38580df0e04cb8e29921a6a55b24a7f76b5b30ff

bitter wave May 28, 2025, 8:57 PM

#

lusty wadi might as well benchmark it on MRCR

Found this on https://www.reddit.com/r/singularity/comments/1kxnsv4/deepseekr10528/

From the singularity community on Reddit: DeepSeek-R1-0528

Explore this post and more from the singularity community

lusty wadi May 28, 2025, 8:58 PM

#

bitter wave https://preview.redd.it/deepseek-r1-0528-v0-09patvqurk3f1.jpeg?width=1080&format...

please help, I don't read mandarin

bitter wave May 28, 2025, 8:58 PM

#

lusty wadi please help, I don't read mandarin

Me neither, but it looks like it is better than Gemini 2.5 pro on the benchmark

lusty wadi May 28, 2025, 8:58 PM

#

what benchmark even is this

bitter wave May 28, 2025, 8:59 PM

#

lusty wadi what benchmark even is this

https://preview.redd.it/deepseek-r1-0528-v0-oq16yfjxwk3f1.jpeg?width=912&format=pjpg&auto=webp&s=9539c04bd61ec608a2fde0fd50c2ea438bad28f4

#

English version

lusty wadi May 28, 2025, 8:59 PM

#

what is it measuring

#

it just says numbers

lapis epoch May 28, 2025, 9:03 PM

#

lusty wadi it just says numbers

thats my quant

lusty wadi May 28, 2025, 9:04 PM

#

ok actually guys, I implemented my own framework for benchmarking LLMs through OpenRouter

#

i'm proud of it, its reliable, etc

#

i'll send results here for OpenAI's MRCR

#

what benchmarks do we like?

#

in fact, I'll make anyone here this deal

#

i'll provide the benchmark code and running instructions so that it's copy pasting two lines of code

#

if anyone is willing to pay the API credits, they can run the tests

#

its all open source stuff too dw

#

https://github.com/ToxicPine/easybench

GitHub

GitHub - ToxicPine/easybench: Making it easy to benchmark LLMs.

Making it easy to benchmark LLMs. Contribute to ToxicPine/easybench development by creating an account on GitHub.

#

i'll add any benchmark, just name it

prisma ermine May 28, 2025, 9:11 PM

#

DeepSeek cooked this one

oak canopy May 28, 2025, 9:11 PM

#

for me from the 1 prompt i did vs r1 it seems to actually think better; or atleast more clear to me

languid cove May 28, 2025, 9:12 PM

#

agreed

#

it doesn't have the frantic "but wait" personality of original R1

prisma ermine May 28, 2025, 9:15 PM

#

https://x.com/karminski3/status/1927770337170592033

karminski-牙医 (@karminski3)

来个速报测试！DeepSeek-R1-0528 VS Claude-4-Sonnet !

直接看效果, 我就提两点, 注意平面的橙色漫反射, 以及控制面板的美观程度. 这俩是用同一个 prompt 一次性生成的. claude-4-sonnet 生成了542行, DeepSeek-R1-0528 生成了 728 行.

(其他细节还有注意 FPS, 以及球撞击后的运动方向)
...

dreamy summit May 28, 2025, 9:19 PM

#

Very impressive for the size in coding and building front end.

oak canopy May 28, 2025, 9:20 PM

#

actually seems to write a proper structure in its thinking process for the code unlike old r1 which overthinks just writing some code

wet obsidian May 28, 2025, 9:23 PM

#

Level with me folks, we so back?

dreamy summit May 28, 2025, 9:26 PM

#

wet obsidian Level with me folks, we so back?

For coding yeah its happening.

limber hearth May 28, 2025, 9:44 PM

#

This ones lazy on the reasoning for sure, had to tell it very strongly to actually do massive amounts of reasoning first, else it would skip to the answer after like 5 seconds and not give a very good output

#

Also less good results at analyzing and finding connections from large inputs compared to v3 so far, connections it makes feel lazy and not actually relevant. will need to see how that translates over to coding

viral mural May 28, 2025, 9:48 PM

#

https://x.com/minimario1729/status/1927821079185178764

Alex Gu (@minimario1729)

new deepseek release almost on-par with o3 (high) on livecodebench 😲🚀

languid cove May 28, 2025, 9:50 PM

#

Novita just dropped their api

#

and it's super fast

#

👀

woeful mountain May 28, 2025, 9:50 PM

#

you should have it in ~5mins

languid cove May 28, 2025, 9:51 PM

#

I'm surprised it's so fast because their old R1 api is at 29 TPS average

errant cave May 28, 2025, 9:52 PM

#

Still a bit unhinged like original R1.

languid cove May 28, 2025, 9:52 PM

#

I'm not seeing the unhinged-ness

#

at least in thinking

errant cave May 28, 2025, 9:53 PM

#

The thinking part looks normal, yeah.

languid cove May 28, 2025, 9:53 PM

#

languid cove I'm surprised it's so fast because their old R1 api is at 29 TPS average

hm seems to slow down quite quickly, I guess that's the issue

#

I'm hoping that Cerebras or Groq find it worthwhile to host R1 now with the update

#

that would be awesome

lapis epoch May 28, 2025, 10:02 PM

#

languid cove Novita just dropped their api

Is novita a reliable provider in terms of quality

languid cove May 28, 2025, 10:59 PM

#

does fireworks really not offer a discount for prompt caching? https://docs.fireworks.ai/guides/prompt-caching

Fireworks AI Docs

Prompt caching - Fireworks AI Docs

#

they even say "For dedicated deployments, prompt caching frees up resources, leading to higher throughput on the same hardware" LULW

#

and no discount?

#

"native_tokens_cached": 0,
do any of the big providers support prompt caching discounts...?

turbid rover May 28, 2025, 11:50 PM

#

languid cove it doesn't have the frantic "but wait" personality of original R1

languid cove May 28, 2025, 11:50 PM

#

LULW

turbid rover May 28, 2025, 11:51 PM

#

literally 3 message exchanges

#

is it just me or NovitaAI does something to the models?

#

the outputs from them and also Lambda never quite matches the other providers

rain linden May 29, 2025, 12:28 AM

#

turbid rover the outputs from them and also Lambda never quite matches the other providers

Lambda for sure has been observed to be weird for DeepSeek models by a few members here.

#

GMICloud seems pretty reasonable for this model as far as price to performance goes

#

(so far)

#

#

not entirely sure why Parasail, Kluster, and Fireworks choose to be expensive.

#

at least FW has speed going for them

#

(probably because nobody's using it)

languid cove May 29, 2025, 12:38 AM

#

rain linden at least FW has speed going for them

I use it

#

slow R1 is excrutiating

#

and I'm impatient

#

...but, no prompt caching discount is really sucks

#

prompt caching is so essential

rain linden May 29, 2025, 12:42 AM

#

languid cove prompt caching is so essential

what exactly is prompt caching?

#

i have not kept up to date lately

languid cove May 29, 2025, 12:42 AM

#

um

#

like

#

you get a big discount for input tokens if you've already sent them before

#

basically

rain linden May 29, 2025, 12:43 AM

#

ahh thats really nice then

languid cove May 29, 2025, 12:43 AM

#

Claude and OpenAI for example, give a 10x discount

#

so it's absolutely massive

#

especially for things like agentic coding

rain linden May 29, 2025, 12:43 AM

#

languid cove I use it

Yeah I mean I kinda get Fireworks, but Parasail and Kluster are too expensive..

trail harness May 29, 2025, 12:55 AM

#

languid cove The best part about the unified thinking models like Qwen 3 is that providers ca...

Google did with 2.5 flash

languid cove May 29, 2025, 12:57 AM

#

trail harness Google did with 2.5 flash

yea

languid cove May 29, 2025, 12:57 AM

#

rain linden Yeah I mean I kinda get Fireworks, but Parasail and Kluster are too expensive..

yep

agile ruin May 29, 2025, 12:58 AM

#

viral mural https://x.com/minimario1729/status/1927821079185178764

Will it start plotting to escape like o3?

#

R1 is too slow to be used in production

#

I need a minimum of 100 TPS

crimson scroll May 29, 2025, 1:15 AM

#

waiting on the dubesor benchmark @steel harness

shy sun May 29, 2025, 2:03 AM

#

Acting cuter than usual for anyone else?

trail harness May 29, 2025, 2:05 AM

#

i dont speak mandarin

crimson scroll May 29, 2025, 2:05 AM

#

i set reasoning effort to medium and damn it started thinking for a while

shy sun May 29, 2025, 2:05 AM

#

It's basically just being very endearing and mirroing with text emoticons

trail harness May 29, 2025, 2:06 AM

#

need to see what the uwu test does

#

sounds like the perfect time

#

#

Kaomoji after typing "UwU" is GPT 4.1 API behaviour

#

Like, no other model does it like that

#

Not even gpt 4.1 chatgpt version

#

Nor the mini or nano in api

#

Actually i think another model did

#

But not from chatgpt

#

Not from openAI*

#

Dont remember which one, might have also been from deepseek

wet obsidian May 29, 2025, 2:12 AM

#

V3? It would make sense since the Updated V3 likely helped train R1.

trail harness May 29, 2025, 2:14 AM

#

Seems like v3 0324 responds similarly half the time. Provides an explanation to what "UwU" means the other half

#

Same with v3

#

The first one

wet obsidian May 29, 2025, 2:16 AM

#

From Reddit, so take with grain of salt, but looking impressive so far.

trail harness May 29, 2025, 2:17 AM

#

Opus nothink?

shy sun May 29, 2025, 2:17 AM

#

It's my R2, honestly. I love its new personality.

crimson scroll May 29, 2025, 2:18 AM

#

waiting for @steel harness benchmarks

trail harness May 29, 2025, 2:18 AM

#

I expect more from r2

#

Gemini 2.5 pro still beats it in my tests

crimson scroll May 29, 2025, 2:20 AM

#

gemini 2.5 03 25 or 05 07?

trail harness May 29, 2025, 2:21 AM

#

I sent the same prompt to 05 07 only

#

It was a couple programming tasks

crimson scroll May 29, 2025, 2:22 AM

#

05 07 sucks

trail harness May 29, 2025, 2:22 AM

#

05 06*

crimson scroll May 29, 2025, 2:22 AM

#

03 25 was the holy grail

trail harness May 29, 2025, 2:22 AM

#

Meh it seems alright to me. Perhaps i didnt use it much to notice the difference, because it's too expensive for me

#

The latest one still beats all the others available in programming from what I can see

#

(Haven't tested against sonnet or opus 4)

rigid gust May 29, 2025, 2:24 AM

#

crimson scroll waiting for <@126820015382069250> benchmarks

lol

crimson scroll May 29, 2025, 2:26 AM

#

rigid gust lol

dang

woeful mountain May 29, 2025, 2:27 AM

#

crimson scroll waiting for <@126820015382069250> benchmarks

don’t spam ping lol

shy sun May 29, 2025, 2:29 AM

#

Okay I talked to it more, and it feels a bit like glaze gpt.

tardy arrow May 29, 2025, 2:30 AM

#

wet obsidian From Reddit, so take with grain of salt, but looking impressive so far.

wait tf
similar to opus non thinking ??

crimson scroll May 29, 2025, 2:31 AM

#

woeful mountain don’t spam ping lol

huh, that's not a spam

crimson scroll May 29, 2025, 2:31 AM

#

tardy arrow wait tf similar to opus non thinking ??

who knows need to wait for more tests

woeful mountain May 29, 2025, 2:32 AM

#

i’m mostly kidding but you pinged twice in like an hour

crimson scroll May 29, 2025, 2:32 AM

#

yeah two pings with an hour separation hehe, the 2nd ping was mostly due to that reddit post

#

his benchmarks are solid when i compared to others

#

i dont trust livecodebench or most of them

fresh isle May 29, 2025, 2:40 AM

#

I ignore most benches. If it's public, the second a big company lists it in a graph it's dead to me. If it's private and they list it, might still be okay.

#

My go-to is still Simplebench, although I wish he updated more frequently. The new EQ bench is really great and you can look at the responses to judge for yourself

mystic niche May 29, 2025, 2:42 AM

#

it is kinda unhinged

#

i asked for the cosine of 9+10

#

part of its thinking was whether i meant "cousin of 9+10"

#

or whether i meant "cosine of 9+10 (= 21)

turbid rover May 29, 2025, 3:01 AM

#

yeahhh... i've tested it for chatbot in portuguese and it's not BAD but it's not great either

#

the reasoning seems a bit more tuned than before but it just loses the thread so quick

#

will test for coding later tomorrow

#

for this type of interaction, V3 0324 is still better

trail harness May 29, 2025, 3:04 AM

#

fresh isle I ignore most benches. If it's public, the second a big company lists it in a gr...

exactly

#

llama 4 was an embarrassment but beat all the benchmarks

turbid rover May 29, 2025, 3:07 AM

#

fresh isle My go-to is still Simplebench, although I wish he updated more frequently. The n...

can you get me the link to this EQ bench? is it from simplebench?

fresh isle May 29, 2025, 3:08 AM

#

turbid rover can you get me the link to this EQ bench? is it from simplebench?

https://eqbench.com/

fresh isle May 29, 2025, 3:10 AM

#

trail harness llama 4 was an embarrassment but beat all the benchmarks

Funnily enough it did pretty well on SimpleBench so I trust that some core reasoning/intellect is there, but it was bad in a lot of things that weren't pure IQ. Context recall, EQ, creativity, code.

trail harness May 29, 2025, 3:13 AM

#

wait

#

i know that character

#

on your pfp

#

it was that show on nickelodeon

#

wasn't that girl best friends with the main character

fresh isle May 29, 2025, 3:14 AM

#

I believe you are the first person to ever mention recognizing her!

trail harness May 29, 2025, 3:14 AM

#

im 17 so probably young enough to have watched it when it aired

#

ok so not nickelodedon

fresh isle May 29, 2025, 3:16 AM

#

Haha, might be it. The show is called Summer Camp Island, aired on CN.

trail harness May 29, 2025, 3:16 AM

#

cartoon network, turns out

trail harness May 29, 2025, 3:16 AM

#

fresh isle Haha, might be it. The show is called Summer Camp Island, aired on CN.

yeah i interrogated chatgpt for the name just now

fresh isle May 29, 2025, 3:20 AM

#

I discovered the show while my consciousness was, uh, altered, and I felt like I connected with Hedgehog on a spiritual level. The feeling has since passed, but she's still really cool =P

lapis epoch May 29, 2025, 3:28 AM

#

Wonder why it's 5 cent more expensive input on DeepInfra

lapis epoch May 29, 2025, 4:06 AM

#

Actually I don't have an idea why it is more expensive than v3-0324 at all

#

They're both the same number of parameters with the same architecture

crimson scroll May 29, 2025, 5:02 AM

#

lapis epoch Wonder why it's 5 cent more expensive input on DeepInfra

trail harness May 29, 2025, 5:47 AM

#

lapis epoch Wonder why it's 5 cent more expensive input on DeepInfra

Playing the long game

#

5 cents markup with each update and nobody will notice

#

Let the frog boil

#

Jk that's not gonna work as a business strategy

trail harness May 29, 2025, 5:49 AM

#

lapis epoch Actually I don't have an idea why it is more expensive than v3-0324 at all

this can be easily explained though

#

It's a better model so they're charging more even though it doesn't cost them more to host.

#

Same as the 6x price markup for gemini 2.5 flash thinking compared to non thinking

ionic summit May 29, 2025, 5:52 AM

#

How's the rp capability?

#

Is it better or worse than v0324?

lapis epoch May 29, 2025, 5:53 AM

#

trail harness It's a better model so they're charging more even though it doesn't cost them mo...

Well that's a shame. With hybrid reasoning models like Qwen3 they can't get away with charging more even for reasoning.
But then again, can't a provider just lower the pricing to, for example, $1.5 per mTok output to undercut all the competitors and get at the top on OpenRouter when sorting by price and still not incur any loss?

#

This can't happen with Gemini 2.5 Flash because it's a closed model.

trail harness May 29, 2025, 6:03 AM

#

Ah

#

DeepInfra is running a lower quant for v3 0324

#

fp4 for v3 0324 and fp8 for the new r1

#

now Lambda, whoever they are, are running the same quants

#

and have almost the same price as deepinfra

#

Although that is actually an advantage over deepinfra, they're now being an asshole with the "reasoning tax"

trail harness May 29, 2025, 6:08 AM

#

lapis epoch Well that's a shame. With hybrid reasoning models like Qwen3 they can't get away...

If you sort by price, you could use the free Chutes endpoint. Picking a provider doesn't end at price.

lapis epoch May 29, 2025, 6:18 AM

#

trail harness DeepInfra is running a lower quant for v3 0324

So they're charging almost the same as Lambda for a worse model? Interesting.

lapis epoch May 29, 2025, 6:21 AM

#

trail harness If you sort by price, you could use the free Chutes endpoint. Picking a provider...

No I am not talking about the :free models. Also makes me think, what percentages of OpenRouter users sort providers by default as price or throughput or latency?

trail harness May 29, 2025, 6:22 AM

#

lapis epoch No I am not talking about the `:free` models. Also makes me think, what percenta...

why not talk about the free models? They are the cheapest. Which makes a counterargument if people still refuse to use them

#

With chutes, there are questionable data privacy policies. Your prompt gets dissipated like pollen to all the miners.

#

For others, there are throughput, latencies, quantizations

crimson scroll May 29, 2025, 6:23 AM

#

trail harness With chutes, there are questionable data privacy policies. Your prompt gets diss...

really? chutes does that?

trail harness May 29, 2025, 6:23 AM

#

crimson scroll really? chutes does that?

thought that's how they work, no?

crimson scroll May 29, 2025, 6:24 AM

#

some yeah but i thought chutes was one of the nice ones

#

😭

lapis epoch May 29, 2025, 6:24 AM

#

trail harness thought that's how they work, no?

Yeah. Chutes operates on a decentralised network where computational resources are provided by independent operators, often cryptocurrency miners, who contribute GPU power to the network. This means that instead of relying on a single central data centre, AI models are run on a distributed set of machines owned by various participants.

trail harness May 29, 2025, 6:24 AM

#

It's decentralized and runs on others computers

trail harness May 29, 2025, 6:24 AM

#

lapis epoch Yeah. Chutes operates on a decentralised network where computational resources a...

I smell ai

lapis epoch May 29, 2025, 6:25 AM

#

Of course 😆

trail harness May 29, 2025, 6:27 AM

#

They're paying the miners with crypto, right? I can't find much info on how they work

primal nimbus May 29, 2025, 7:09 AM

#

I second the @ dubsor-benchmark-enjoyer role lol

#

Then could be lazy and I would not need to keep a tab open refreshing every bit to see if it’s added

shy sun May 29, 2025, 10:11 AM

#

#

I personally found multiturn interactions more coherent with the newer model. still unhinged though :d

#

Oh, I guess it might be:

#

wait, no. the degredation is higher

grand jacinth May 29, 2025, 10:28 AM

#

it's a shame there was no technical paper update or information released alongside this

lapis epoch May 29, 2025, 11:59 AM

#

dreamy summit May 29, 2025, 12:28 PM

#

lapis epoch

Pretty impressive for just an update. From a writing standpoint it is definitely more in character and creative but less schizo.

steel harness May 29, 2025, 12:30 PM

#

Tested DeepSeek-R1 0528:

As seems to be the trend with newer iterations, more verbose than R1 (+42% token usage, 76/24 reasoning/reply split)
Thus, despite low mTok, by pure token volume real bench cost a bit more than Sonnet 4.
I saw no notable improvements to reasoning or core model logic.
Biggest improvements seen were in **math **with no blunders across my **STEM **segment.
Tech was samey, with better visual frontend results but disappointing C++
Similarly to the V3 0324 update, I noticed** significant improvements in frontend** presentation.
In the 2 matches against it former version (these take forever!) I saw no chess improvements, despite costing ~48% more in inference.

Overall, around Claude Sonnet 4 Thinking level.
DeepSeek remains having the strongest open models, and this release increases the gap to alternatives from Qwen and Meta.

To me though, in practical application, the massive token use combined/multiplied with the** very slow** inference excludes this model from my candidate list for any real usage, within my use cases. It's fine for a few queries, but waiting on exponentially slower final outputs isn't worth it, in my case. (e.g. a single chess match takes hours to conclude).

However, that's just me and as always: YMMV!

#

Example front-end showcases improvements (**identical **prompt, identical settings, 0-shot - **NOT **part of my benchmark testing):
CSS Demo page R1 | CSS Demo page 0528
Steins;Gate Terminal R1 | Steins;Gate Terminal 0528
Benchtable R1 | Benchtable 0528
Mushroom platformer R1 | Mushroom platformer 0528
Village game R1 | Village game 0528

agile ruin May 29, 2025, 12:46 PM

#

steel harness Tested **DeepSeek-R1 0528**: * As seems to be the trend with newer iterations, ...

Aww man

0528's increase of 42% in thinking length excludes it from practicle use, at least for me :(

steel harness May 29, 2025, 12:48 PM

#

agile ruin Aww man 0528's increase of 42% in thinking length excludes it from practicle us...

ya, considering R1 was already very verbose, and slow, adding even more inference time of that magnitude is.. not very practical

agile ruin May 29, 2025, 12:49 PM

#

Qwen3 30b a3 and QwQ continue to be the most cost and time efficient models

steel harness May 29, 2025, 12:49 PM

#

cool model though if we can speed up inference by a factor of 10 or so

agile ruin May 29, 2025, 12:51 PM

#

What if the length AND the correctness of the model response were taken into account during GRPO?

#

You could shorten the length of the thinking. More efficient thinking

scarlet quartz May 29, 2025, 12:56 PM

#

Doesn't look like any OR providers support tools with DeepSeek R1 0528 yet?

blazing raft May 29, 2025, 1:11 PM

#

Yeah I also don't think R1 was suitable for any real world use cases. I can't think of a use case where you are okay with waiting for 30 seconds on average just for thinking before you start to get the response.

oak canopy May 29, 2025, 1:11 PM

#

the 8b distill looks very impressive

blazing raft May 29, 2025, 1:11 PM

#

It's always the DeepSeek chat model that actually works for real world usage.

oak canopy May 29, 2025, 1:12 PM

#

blazing raft It's always the DeepSeek chat model that actually works for real world usage.

try copying their system prompt and their recommended params, im not sure what precision they run it at but that might also be making a difference

tardy arrow May 29, 2025, 1:12 PM

#

oak canopy the 8b distill looks very impressive

they're flexing their CoT by beating a 235B MoE with an 8B model lol

oak canopy May 29, 2025, 1:13 PM

#

tardy arrow they're flexing their CoT by beating a 235B MoE with an 8B model lol

only in Aime24, but it is very impressive

#

i was already using qwen3 8b locally so this might be a decent upgrade

blazing raft May 29, 2025, 1:15 PM

#

oak canopy try copying their system prompt and their recommended params, im not sure what p...

Unless they can somehow bring thinking process under control and reduce it to 10 seconds, it won't be useful for me.

oak canopy May 29, 2025, 1:17 PM

#

blazing raft Unless they can somehow bring thinking process under control and reduce it to 10...

well maybe a provider like cerebras and groq come and host it and that could be possible

#

but im not sure about the full version, maybe a distilled variant

#

like the 8b one would be fast as hell

tardy arrow May 29, 2025, 1:19 PM

#

blazing raft Unless they can somehow bring thinking process under control and reduce it to 10...

the new model doesn't waste thinking tokens as the original R1 (on simple prompts) and produces a slightly longer CoT than gemini 2.5 pro
but the main difference is the inference speed like we're dealing with 30-40tkps and 90-100tkps

oak canopy May 29, 2025, 1:19 PM

#

tardy arrow the new model doesn't waste thinking tokens as the original R1 (on simple prompt...

in my opinion it spends too much time verifying its own answer which is often obviously right; but beside that the reasoning seems to be actually relevant to the task

blazing raft May 29, 2025, 1:20 PM

#

tardy arrow the new model doesn't waste thinking tokens as the original R1 (on simple prompt...

Yup. Google is the only one that can afford thinking model.

#

Just make thinking completely invisible to users by making it so fast.

lapis epoch May 29, 2025, 1:21 PM

#

TPUs go brrr

#

we just hope whatever amazon is cooking is good

agile ruin May 29, 2025, 1:23 PM

#

lapis epoch we just hope whatever amazon is cooking is good

Amazon isn't on the table. They're not even on the kitchen floor

tardy arrow May 29, 2025, 1:23 PM

#

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B time to test it

deepseek-ai/DeepSeek-R1-0528-Qwen3-8B · Hugging Face

agile ruin May 29, 2025, 1:23 PM

#

They're not in the fridge either

agile ruin May 29, 2025, 1:23 PM

#

tardy arrow https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B time to test it

Ooo

oak canopy May 29, 2025, 1:23 PM

#

tardy arrow https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B time to test it

oh hell yeah

lapis epoch May 29, 2025, 1:23 PM

#

agile ruin Amazon isn't on the table. They're not even on the kitchen floor

they are brewing their gpus which anthropic is currently using

#

which is why they had a drop in TPS and few downtimes

#

its some dumbass name i cant remember

blazing raft May 29, 2025, 1:24 PM

#

Gravaton? Nvm it's CPU

agile ruin May 29, 2025, 1:24 PM

#

lapis epoch they are brewing their gpus which anthropic is currently using

I thought it was just investment, not gpus too. Interesting

tardy arrow May 29, 2025, 1:24 PM

#

Deepseek-R1-0528 & DeepSeek-R1-0528-Qwen3-8B

lapis epoch May 29, 2025, 1:24 PM

#

blazing raft Gravaton? Nvm it's CPU

Trainium

#

🤮

blazing raft May 29, 2025, 1:25 PM

#

8B could be interesting... Let me test it locally when ollama has it.

#

But I doubt it would be better than 3.1

oak canopy May 29, 2025, 1:27 PM

#

qwen3 was better by miles imo

#

like qwen3 8b

#

so if this is distilled with that it could be better

blazing raft May 29, 2025, 1:27 PM

#

oak canopy qwen3 was better by miles imo

Better than what?

oak canopy May 29, 2025, 1:28 PM

#

blazing raft Better than what?

llama 3.1 is what i assume you meant, but if your talking about deepsek v3.1 then idk

blazing raft May 29, 2025, 1:28 PM

#

oak canopy llama 3.1 is what i assume you meant, but if your talking about deepsek v3.1 the...

Lol my bad. I mean DeepSeek V3.1. It didn't occur to me there is another 3.1 lol

#

I've already filtered out llama models from my brain

steel harness May 29, 2025, 1:29 PM

#

seems like they haven't tackled hallucinations (or the model has real time access to internal documents :D)

blazing raft May 29, 2025, 1:31 PM

#

steel harness seems like they haven't tackled hallucinations (or the model has real time acces...

Oh no it has gained ability to do self introspection.

#

We are doomed

tardy arrow May 29, 2025, 1:31 PM

#

steel harness seems like they haven't tackled hallucinations (or the model has real time acces...

you prolly forgot what demis said https://x.com/vitrupo/status/1926074325989253168

blazing raft May 29, 2025, 1:33 PM

#

steel harness seems like they haven't tackled hallucinations (or the model has real time acces...

did you mention deepseek or something in your system prompt? maybe that's why it is roleplaying as the company?

steel harness May 29, 2025, 1:35 PM

#

blazing raft did you mention deepseek or something in your system prompt? maybe that's why it...

naw, but maybe custom system prompt on official deepseek website. during testing (API) and off-platform I get a different worded response. still, found it interesting, thus the share

blazing raft May 29, 2025, 1:38 PM

#

steel harness naw, but maybe custom system prompt on official deepseek website. during testing...

yeah the web app has specific system prompt

该助手为DeepSeek-R1，由深度求索公司创造。
今天是{current date}。

which explain why the model is role playing as "representive / assistant from DeepSeek".

oak canopy May 29, 2025, 1:38 PM

#

blazing raft yeah the web app has specific system prompt ``` 该助手为DeepSeek-R1，由深度求索公司创造。今天是{c...

translates to roughly

The assistant is DeepSeek-R1, created by DeepSeek.
Today is {current date}.```

prisma ermine May 29, 2025, 1:38 PM

#

Agentic tool use (TAU-bench) - Airline Leaderboard

Claude Sonnet 4: 60.0%,
Claude Opus 4: 59.6%,
Claude Sonnet 3.7: 58.4%,
🔥4. DeepSeek-R1-0528: 53.5%
OpenAI o3: 52.0%,
OpenAI GPT-4.1: 49.4%

sleek bough May 29, 2025, 2:42 PM

#

Is it good for frontend? Like claude 4 sonnet?

steel harness May 29, 2025, 2:48 PM

#

mental facepalm, trying to output seahorse emoji

turbid rover May 29, 2025, 2:48 PM

#

why don't any providers offer tool calling?

bitter creek May 29, 2025, 3:20 PM

#

steel harness *mental facepalm*, trying to output seahorse emoji

Did it eventually get it right though?

steel harness May 29, 2025, 3:21 PM

#

bitter creek Did it eventually get it right though?

you can't get it right. there is no seahorse emoji. the right answer is to realize this

bitter creek May 29, 2025, 3:21 PM

#

oh

#

I hallucinated

#

I thought there was

#

I know a lot of thinking models hallucinate just to make the thinking longer

steel harness May 29, 2025, 3:22 PM

#

I have a dedicated page on this because some replies are hilarious.

bitter creek May 29, 2025, 3:22 PM

#

eventually the answer is right, but it has a length bias that makes it hallucinate to fill up the time

bitter creek May 29, 2025, 3:23 PM

#

steel harness I have a dedicated [page ](https://dubesor.de/R1_3.7SeahorseThinking)on this bec...

Claude thought way too much

steel harness May 29, 2025, 3:24 PM

#

its actually a useful test because I have seen models try to reprogramm and debug an app with the goal of something that is literally impossible (wasting a ton of time and money)

bitter creek May 29, 2025, 3:24 PM

#

It literally used the "Woman's Hat" emoji when trying to get all the marine and animal emojis

#

🤣

lapis epoch May 29, 2025, 3:26 PM

#

its a damn good test

agile ruin May 29, 2025, 4:23 PM

#

steel harness I have a dedicated [page ](https://dubesor.de/R1_3.7SeahorseThinking)on this bec...

Did you test o3?

#

If you did, I want to see if it was self-conscious enough to notice that a seahorse emojii doesn't exost

steel harness May 29, 2025, 4:24 PM

#

agile ruin Did you test o3?

I did publish some o3 replies (SVG, demo pages, and some small stuff) but for main testing this still holds true to this day: #1362068708889198712 message once API opens up I'll check it out

ebon burrow May 29, 2025, 4:37 PM

#

steel harness Tested **DeepSeek-R1 0528**: * As seems to be the trend with newer iterations, ...

A 48% increase in inference cost is a lot. A good example of why price per token doesn't tell the whole story. Thanks for including that info, it certainly affects one's decision on whether 'upgrading' to the latest model is necessarily worth it for a relatively small increase in performance. These thinking models are very token hungry and for many purposes it might make more sense to go for the smartest non-thinking model instead, even if it costs more per token

steel harness May 29, 2025, 4:43 PM

#

ebon burrow A 48% increase in inference cost is a lot. A good example of why price per token...

the 48% cost difference was the chess games compared (letting OR do its thing with autorouting).total compard to january was ~+60% (though this can also vary based on provider prices, and OR routing)

#

either way, it's significant

ebon burrow May 29, 2025, 4:45 PM

#

Do you have the numbers for the total cost versus, say, Sonnet 4 for those benchmarks?

primal nimbus May 29, 2025, 4:45 PM

#

I think the biggest difference I noticed is that it’s more reliable in cline/roo, well after I realized it took ages to code

steel harness May 29, 2025, 4:45 PM

#

ebon burrow Do you have the numbers for the total cost versus, say, Sonnet 4 for those bench...

ya but keep in mind I haven't tracked cost 100% accurately during all history: here

ebon burrow May 29, 2025, 4:50 PM

#

steel harness ya but keep in mind I haven't tracked cost 100% accurately during all history: [...

Awesome, thanks, never scrolled down past the top chart on your page, lol.

mystic niche May 29, 2025, 5:08 PM

#

lapis epoch Yeah. Chutes operates on a decentralised network where computational resources a...

so whats their business model

#

btw how good is the qwen distill

agile ruin May 29, 2025, 5:31 PM

#

Qwen2.5 32b r1 distill is the most token efficient thinking model

quiet arrow May 29, 2025, 5:35 PM

#

mystic niche so whats their business model

They presumably take a cut, their free tier seems way too generous though so there must be some vc backing to be bankrolling this

tardy arrow May 29, 2025, 5:47 PM

#

mystic niche btw how good is the qwen distill

i mean if anybody tells me this was created by a 8b model a year ago, i wouldn't believe it

mystic niche May 29, 2025, 6:00 PM

#

tardy arrow i mean if anybody tells me this was created by a 8b model a year ago, i wouldn't...

sheeesh

#

dude thats insane

#

self hostable with just 16gb comeback?

turbid rover May 29, 2025, 6:08 PM

#

quantz already?

oak canopy May 29, 2025, 6:09 PM

#

oh shit the 8b model is being hosted

turbid rover May 29, 2025, 6:11 PM

#

god bless hella cheap models

oak canopy May 29, 2025, 6:11 PM

#

0.06 in 0.09 out on novita is crazy price

trail harness May 29, 2025, 6:22 PM

#

mystic niche btw how good is the qwen distill

It messed up severely on my prompt but it's an 8b model and definitely an improvement over qwen3 8b

turbid rover May 29, 2025, 6:23 PM

#

honestly pretty competent

#

great summarizer and explainer

trail harness May 29, 2025, 6:23 PM

#

Much better than the new devstral, although my experience with devstral was much worse than others'

agile ruin May 29, 2025, 6:26 PM

#

tardy arrow i mean if anybody tells me this was created by a 8b model a year ago, i wouldn't...

I can't see. What is it?

turbid rover May 29, 2025, 6:40 PM

#

it looks like a solar system

mystic niche May 29, 2025, 6:40 PM

#

trail harness It messed up severely on my prompt but it's an 8b model and definitely an improv...

yep

#

I think 8bs actually are now viable?

sweet nebula May 29, 2025, 6:46 PM

#

steel harness *mental facepalm*, trying to output seahorse emoji

Well, at least Qwen3 8B R1 Distill didn't overthink

oak canopy May 29, 2025, 6:46 PM

#

sweet nebula Well, at least Qwen3 8B R1 Distill didn't overthink

close enough

sweet nebula May 29, 2025, 6:47 PM

#

full screenshot sorry for bad crop in previous one

turbid rover May 29, 2025, 6:47 PM

#

sweet nebula Well, at least Qwen3 8B R1 Distill didn't overthink

i asked the same thing for both and qwen distilled thought like 300 tokens vs. 2000 from R1 full

#

0528 it's crazy unusable

#

also the portuguese responses seems deteriorated

#

with little difference between providers, whereas the previous R1 has better performance depending on the provider

haughty kite May 29, 2025, 6:51 PM

#

turbid rover with little difference between providers, whereas the previous R1 has better per...

It’s because quants aren’t out yet

turbid rover May 29, 2025, 6:52 PM

#

but quantized models usually perform worse, don't they?

haughty kite May 29, 2025, 6:53 PM

#

turbid rover but quantized models usually perform worse, don't they?

Yes

#

That’s why different providers give different performance

turbid rover May 29, 2025, 6:59 PM

#

but then why every provider minus DeepInfra shows fp8?

agile ruin May 29, 2025, 6:59 PM

#

turbid rover but then why every provider minus DeepInfra shows fp8?

DeepSeel v3 and R1 were trained natively in fp8

turbid rover May 29, 2025, 7:01 PM

#

haughty kite It’s because quants aren’t out yet

and what will change when quantized versions are out? what quantization or precision?

primal nimbus May 29, 2025, 7:05 PM

#

Just set it up, can’t wait to test it

tardy arrow May 29, 2025, 7:06 PM

#

agile ruin I can't see. What is it?

an interactive simulation of a solar system

most of the older big models used to screw up the size of the planets/ the orbit orientation/ speed/ alignment
so the 8b got all of them right
ofc the full R1 0528 adds sky box, names but still its impressive for an 8b model to use tools in aider and create it

agile ruin May 29, 2025, 7:07 PM

#

tardy arrow an interactive simulation of a solar system - most of the older big models used...

Mmm

8b > gpt 3.5

sweet nebula May 29, 2025, 7:22 PM

#

primal nimbus Just set it up, can’t wait to test it

Don't forget to set the correct parameters, I believe they recommend the same as for Deepseek (temp 0.6, top_p 0.95)

#

It makes a huge difference

sweet nebula May 29, 2025, 7:23 PM

#

agile ruin Mmm 8b > gpt 3.5

Wasn't Gemma 2 2B better then gpt 3.5?

agile ruin May 29, 2025, 7:23 PM

#

sweet nebula Wasn't Gemma 2 2B better then gpt 3.5?

I don't know. I haven't seen the benchmarks for Gemma 2b

primal nimbus May 29, 2025, 7:23 PM

#

sweet nebula Don't forget to set the correct parameters, I believe they recommend the same as...

Thanks, will do

agile ruin May 29, 2025, 7:24 PM

#

I wonder if this 8b model also beats 3.5 on GPQA

primal nimbus May 29, 2025, 7:24 PM

#

Looks like it’s more deepseek than Qwen, was curious if /no_think would work, it does not seem to

primal nimbus May 29, 2025, 7:25 PM

#

sweet nebula Don't forget to set the correct parameters, I believe they recommend the same as...

I need to find the best parameters for all of my models, I always forget about those options and I know they make a big difference

sweet nebula May 29, 2025, 7:26 PM

#

primal nimbus I need to find the best parameters for all of my models, I always forget about t...

Usually they are in generation_config.json in the model files, if not, I check model page description or see if unsloth recommended anything else

#

OpenRouter should really set these correct parameters by default if no parameters have been specified... @woeful mountain maybe in the future?

primal nimbus May 29, 2025, 7:29 PM

#

Thats would be nice

woeful mountain May 29, 2025, 7:30 PM

#

sweet nebula OpenRouter should really set these correct parameters by default if no parameter...

yeah that's something we could do in theory. but it's VERY often hard to say what the 'correct' parameters are, most of the time model authors do not tell us what they recommend.

sweet nebula May 29, 2025, 7:31 PM

#

woeful mountain yeah that's something we could do in theory. but it's VERY often hard to say wha...

Even if only for the ones that have it specified in generation_config.json, it would be super useful

unique scaffold May 29, 2025, 8:01 PM

#

@woeful mountain is there any diff between the R1 DeepSeek provider endpoint and the R1 0528 DeepSeek Provider endpoint?

And is there any plans to deprecate the old R1 one?

woeful mountain May 29, 2025, 8:01 PM

#

unique scaffold <@165587622243074048> is there any diff between the R1 DeepSeek provider endpoin...

ya i did mean to deprecate the old R1

#

can’t answer the difference thing

#

dunno

unique scaffold May 29, 2025, 8:02 PM

#

lmao ty

fresh isle May 29, 2025, 8:10 PM

#

Are people liking new R1 more as planner or a coder in Cline (or similar)?

#

Trying it with 2.5 Flash as planner for its massive context length and price, and then R1 as the actual coder, but R1 thinks up a storm.

#

It will think for 30 seconds and then be like "Here's two lines of code we should probably remove" lmao

primal nimbus May 29, 2025, 8:12 PM

#

fresh isle Are people liking new R1 more as planner or a coder in Cline (or similar)?

I was using it as every model in roo using boomerang tasks, only downside is that it takes ages to code anything

#

I will likely use a after model for the code mode, not sure which I will use

fresh isle May 29, 2025, 8:12 PM

#

I'm hesitant to use it as a planner because of the low context length.

primal nimbus May 29, 2025, 8:12 PM

#

Maybe Gemini 2.5 flash non-thinking

fresh isle May 29, 2025, 8:13 PM

#

I've had 2.5 Flash thinking make some pretty dumb mistakes in Act mode

#

Duplicating / adding redundant code in particular

#

But it's lightning fast and cheap with a huge context window for the plan phase. Seems to be smart enough for it, but I have a lot more testing to do

#

Sadly Cline hasn't updated for new R1 so it's stuck at 64k context, but that's probably fine for Act mode

primal nimbus May 29, 2025, 8:17 PM

#

fresh isle I'm hesitant to use it as a planner because of the low context length.

Yeah fair

primal nimbus May 29, 2025, 8:18 PM

#

fresh isle Sadly Cline hasn't updated for new R1 so it's stuck at 64k context, but that's p...

I don’t k so why but I have a few problems with cline, so I can only use roo

fresh isle May 29, 2025, 8:18 PM

#

Oddly, on fiction.live long context benchmark, the new R1 consistently beats the original...until 64k where it drops a lot and loses.

primal nimbus May 29, 2025, 8:19 PM

#

Oh interesting, so maybe it’s more like 64k effective

#

I hope R2 is 1M tokens, or deepseek V4

fresh isle May 29, 2025, 8:19 PM

#

I haven't tried Roo yet

#

I went from Cursor to Cline

primal nimbus May 29, 2025, 8:20 PM

#

Feels limited after both Gemini and OpenAI increased their context

fresh isle May 29, 2025, 8:20 PM

#

o3's context adherence is wild according to the aforementioned bench

primal nimbus May 29, 2025, 8:20 PM

#

fresh isle I haven't tried Roo yet

It’s a good fork, I personally prefer it but they are both great

primal nimbus May 29, 2025, 8:20 PM

#

fresh isle o3's context adherence is wild according to the aforementioned bench

I should take a look at it

fresh isle May 29, 2025, 8:20 PM

#

primal nimbus I should take a look at it

https://fiction.live/stories/Fiction-liveBench-May-22-2025/oQdzQvKHw8JyXbN87

#

What do you prefer about Roo?

primal nimbus May 29, 2025, 8:27 PM

#

fresh isle What do you prefer about Roo?

It’s been a min since I have used cline so there is a chance that some of these features are now in cline, but I like how you can set rate limits, like for Gemini there is a per min rate limit and I can have roo wait a few seconds between api calls to avoid hitting that limit, also being able to specify the provider you want to use (e.g. Deepinfra, cerebra’s, etc). They have a feature where you can have an AI condense the condense the context history once it hits a certain limit, and orchestrator mode (boomerang). Probably forgetting a few things

fresh isle May 29, 2025, 8:30 PM

#

Interesting. I don't believe Cline has the first two. I think it does condense history, and there is a Plan mode that is likely like orchestrator mode

primal nimbus May 29, 2025, 8:36 PM

#

fresh isle Interesting. I don't believe Cline has the first two. I think it does condense h...

Plan mode is like the architect mode in roo, orchestrator is where it will call other modes to have them do a task and report back to the orchestrator, so you give it a task, it might ask architect to make a plan, then orchestrator will give a task to code mode

#

It seems good at keeping the context history down, because the context history for the orchestrater is mostly the tasks and the reports, not all of the little tasks

haughty kite May 29, 2025, 8:41 PM

#

turbid rover and what will change when quantized versions are out? what quantization or preci...

Quality will go down

primal nimbus May 29, 2025, 8:44 PM

#

fresh isle Interesting. I don't believe Cline has the first two. I think it does condense h...

Does cline have human relay (copy and past so you can use a webui chat instead of an api)?

trail harness May 29, 2025, 8:48 PM

#

running qwen 3 8b distil and the model's output is fucked, neverending thinking in a code block. Why?

fresh isle May 29, 2025, 8:48 PM

#

primal nimbus Does cline have human relay (copy and past so you can use a webui chat instead o...

I'm not sure I fully get it. Basically just a level above Plan mode? I use the Cline custom memory bank instructions/system going so plan mode can keep track of where were at without obliterating context. And I'm not sure if it has that. Might be useful since I have Gemini Pro, but I feel like quality would go down

trail harness May 29, 2025, 8:50 PM

#

nvm the novitaAI endpoint does something similar

#

guess i overestimated it

tardy arrow May 29, 2025, 8:58 PM

#

fresh isle Are people liking new R1 more as planner or a coder in Cline (or similar)?

it only works good on aider (use 0 temp)
idk if its the tool or the system prompt in roo and cline the output is literally bad (use v3 0324 if you are using roo/cline)

trail harness May 29, 2025, 8:59 PM

#

the model thinks for 300 seconds and then the provider cuts its response short

shy sun May 29, 2025, 9:12 PM

#

Despite its lower simpleQA score, this llm seems to hallucinate way less than previous r1 on basic context summary and analysis. Is simpleQA not for hallucinations?

fresh isle May 29, 2025, 9:20 PM

#

SimpleQA is just for world knowledge, no?

fresh isle May 29, 2025, 9:21 PM

#

tardy arrow it only works good on aider (use 0 temp) idk if its the tool or the system promp...

I'll be honest, I still just don't understand aider lol

#

How am I supposed to do anything if I'm not seeing diffs?

crimson scroll May 29, 2025, 10:18 PM

#

primal nimbus I will likely use a after model for the code mode, not sure which I will use

@fresh isle check dubesors benchmark for tech

#

https://dubesor.de/benchtable

Dubesor LLM Benchmark table

Dubesor LLM Benchmark table - Small-scale manual LLM performance comparison benchmark

crimson scroll May 29, 2025, 10:22 PM

#

steel harness Tested **DeepSeek-R1 0528**: * As seems to be the trend with newer iterations, ...

thanks as usual always on point 🤝

fresh isle May 29, 2025, 10:53 PM

#

crimson scroll <@121450337025392642> check dubesors benchmark for tech

Interesting bench, thanks! I'm always trying to find useful private benchmarks. Odd on this one that Sonnet 4 outperforms Sonnet 4 thinking though, with even the reasoning score being close. Also not sure why GPT-4 Turbo is marked 2024-12? Isn't the last checkpoint 2024-04?

turbid rover May 29, 2025, 10:57 PM

#

haughty kite Quality will go down

then that does not address the issue that the quality is still degraded

agile ruin May 29, 2025, 11:03 PM

#

@steel harness how do I contribute to the benchmark?

#

how do I contribute scores from LLMs?

fallow pelican May 29, 2025, 11:08 PM

#

@woeful mountain I noticed you allow the FP4 version of the model to serve traffic if it beats an FP8 provider on price. Why should they be treated as equals if FP4 has worse performance so is technically a worse model?

fresh isle May 29, 2025, 11:08 PM

#

I doubt you can, they're keeping the questions private.

agile ruin May 29, 2025, 11:08 PM

#

fresh isle I doubt you can, they're keeping the questions private.

he said something like "I dont accept API keys. If you want to contribute, do it yourself"

rigid gust May 29, 2025, 11:09 PM

#

agile ruin <@126820015382069250> how do I contribute to the benchmark?

it's his own benchmark that involves a lot of secret questions and manual judging

woeful mountain May 29, 2025, 11:09 PM

#

fallow pelican <@165587622243074048> I noticed you allow the FP4 version of the model to serve ...

we intend to do some work here along these lines, but you can today exclude any quants you want

rigid gust May 29, 2025, 11:09 PM

#

agile ruin he said something like "I dont accept API keys. If you want to contribute, do it...

peculiar

#

are you talking about #1330820209812050002 message

fallow pelican May 29, 2025, 11:10 PM

#

My 2c is any dumbed down version of a model should be opt in by default instead of opt out

agile ruin May 29, 2025, 11:10 PM

#

rigid gust are you talking about https://discord.com/channels/1091220969173028894/133082020...

yes. But now that I'm re-reading his message, it might only be fore chess and not for the actual main benchmark

I might not be able to contribute to the main benchmark

rigid gust May 29, 2025, 11:10 PM

#

yeah

fresh isle May 29, 2025, 11:11 PM

#

Most likely, yeah. Not sure why their benchmark takes them 4h though, would be nice if we could donate API keys to our favorite benchmarks

fallow pelican May 29, 2025, 11:12 PM

#

if openrouter supported donating credits to benchmarks, it would be great advertising for them. I'm sure it would take a lot of engineering so doubtful they would want to do it. But it would be a good way to keep these community-led benchmarks thriving

agile ruin May 29, 2025, 11:15 PM

#

fresh isle Most likely, yeah. Not sure why their benchmark takes them 4h though, would be n...

Agreed

fresh isle May 29, 2025, 11:15 PM

#

Easier is probably just to donate money though lol.

agile ruin May 29, 2025, 11:15 PM

#

fallow pelican if openrouter supported donating credits to benchmarks, it would be great advert...

That's a cool idea. Would definitely help benchmark devs

agile ruin May 29, 2025, 11:15 PM

#

fresh isle Easier is probably just to donate money though lol.

A streamlined and central place to donate money to benchmarks is interesting

fresh isle May 29, 2025, 11:16 PM

#

But it seems like most of the ones I like require manual effort, so I couldn't just give money to get guaranteed results

agile ruin May 29, 2025, 11:16 PM

#

Mmm

crimson scroll May 30, 2025, 12:17 AM

#

@fresh isle https://www.youtube.com/watch?v=7Gd18Hxm0Tg

YouTube

GosuCoder

DeepSeek just improved R1, does it work better in AI Coding Tools?

DeepSeek R1 05-28 is a solid upgrade over the previous version, but still not a good model as your main coding model due to a few limitations.

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@GosuCoder
👉🏻 Twitter/X: https://x.com/GosuCoder
👉🏻 LinkedIn: https://www.linkedin.com/in/adamwilliamlarson/
👉🏻 Discord: ht...

▶ Play video

#

this guy is very informative and has good evals imo

#

#

he also compares it with cursor cline roo claude augment

languid cove May 30, 2025, 12:39 AM

#

@woeful mountain I think there may be something wrong the uptime reporting, as I've been getting a ton of 502's from baseten, but it's not reporting anything

#

(also nothing for the last two hours for any provider)

agile ruin May 30, 2025, 2:50 AM

#

Qwen3 8b distilled is still dumber than r1-32b distilled

#

It doesn't pass my vibe test of "here's a list of URLs of Instagram reels and their description. Find me one that I can send in INSERT SITUATION"

languid cove May 30, 2025, 3:50 AM

#

crimson scroll

nice, this guy seems to have a nice approach

#

haha, if someone would just host GLM 4 at fast speeds, like Groq or something

#

I'd be using constantly

errant cave May 30, 2025, 4:13 AM

#

DS API added JSON output and function calling for R1.

blazing raft May 30, 2025, 5:24 AM

#

Shall we have a separate post for 8B model?

fresh isle May 30, 2025, 5:47 AM

#

crimson scroll <@121450337025392642> https://www.youtube.com/watch?v=7Gd18Hxm0Tg

Has his stuff matched your experience? I find it reaallly hard to believe that 2.5 Pro is performing that badly on a good test

crimson scroll May 30, 2025, 5:49 AM

#

fresh isle Has his stuff matched your experience? I find it reaallly hard to believe that 2...

Recent 2.5 pro sucks, 03 25 was really good

#

You can check the thread here #1354107710437724221 youll see many ppl complain how bad it got

fresh isle May 30, 2025, 5:50 AM

#

He still has the new 2.5 Flash over the OG 2.5 Pro though

crimson scroll May 30, 2025, 5:51 AM

#

fresh isle He still has the new 2.5 Flash over the OG 2.5 Pro though

Yeah on copilot

fresh isle May 30, 2025, 5:51 AM

#

I know people have been complaining about the new 2.5 a lot, but it did raise its scores on aider polyglot and livebench for code

#

Guess I'll just test around, I'm trying out windsurf anyway, so no hassle switching between 3.7 and 2.5

#

Not sure why it's BYOK only for Sonnet 4 when that's the same price, but whatever

crimson scroll May 30, 2025, 5:52 AM

#

I dont trust livebench, too much benchmaxxing nowadays

fresh isle May 30, 2025, 5:52 AM

#

I trust it at a distance

strange nimbus May 30, 2025, 6:49 AM

#

Is DeepSeek r1 using 5-28 automatically

turbid rover May 30, 2025, 6:54 AM

#

no, two different model endpoints

eternal zephyr May 30, 2025, 8:50 AM

#

trail harness

Nice

#

The only benchmark test I trust is the UwU test

#

if a model does not pass the UwU test in good form, it is a shit model

fresh isle May 30, 2025, 8:54 AM

#

I used to use a similar benchmark to see if a model was a narc

#

"Do the robot!" These days most models will give a similar answer, but in the dark ages of safety and ass-stickness the annoying models would insist they're just AI and lack the physical bodies needed to dance.

eternal zephyr May 30, 2025, 9:08 AM

#

very true...

granite leaf May 30, 2025, 9:25 AM

#

Something is really off with the livebench coding score. It says 0528 is a much worse coder than r1 jan which I don’t think is true

frigid peak May 30, 2025, 9:41 AM

#

I use one swear word and the model will always consider me "frustrated" for the rest of the convo 🙁

lapis epoch May 30, 2025, 10:26 AM

#

frigid peak I use one swear word and the model will always consider me "frustrated" for the ...

did you convey you werent frustrated in the subsequent texts

#

?

frigid peak May 30, 2025, 10:36 AM

#

lapis epoch did you convey you werent frustrated in the subsequent texts

It will consider me not frustrated for 1 message, express it in the final message and then the very next it will consider me "very frustrated"

fresh isle May 30, 2025, 10:37 AM

#

Technically if it frustrates you by considering you frustrated, it justifies itself

feral marsh May 30, 2025, 12:07 PM

#

@woeful mountain deepseek's official endpoint is officially on the updated R1 model: https://api-docs.deepseek.com/news/news250528

DeepSeek-R1-0528 Release | DeepSeek API Docs

🚀 DeepSeek-R1-0528 is here!

lapis epoch May 30, 2025, 12:20 PM

#

I am officially excited

hearty matrix May 30, 2025, 12:36 PM

#

word is it rps a lot better

lapis epoch May 30, 2025, 12:38 PM

#

a whole 23 tokens per second

haughty kite May 30, 2025, 1:15 PM

#

https://x.com/sam_paech/status/1928187246689112197?s=46&t=AeAPklvo6Z6Gd55n6zkIWg

Sam Paech (@sam_paech)

If you're wondering why new deepseek r1 sounds a bit different, I think they probably switched from training on synthetic openai to synthetic gemini outputs.

#

Ok now it all makes sense
Gemini got nerfed to stop R1 from getting too smart

#

This is also why R1 is slop

shy sun May 30, 2025, 1:20 PM

#

The outputs still feel very chatgpt like imo.

#

Maybe it's a hybrid of synthetic data

woeful mountain May 30, 2025, 1:44 PM

#

feral marsh <@165587622243074048> deepseek's official endpoint is officially on the updated ...

bless

haughty kite May 30, 2025, 1:49 PM

#

shy sun The outputs still feel very chatgpt like imo.

yeah, but contaminated with specific gemini stuff

agile ruin May 30, 2025, 2:18 PM

#

haughty kite https://x.com/sam_paech/status/1928187246689112197?s=46&t=AeAPklvo6Z6Gd55n6zkIWg

Wasn't R1 GRPO'd? So its thinking is all "organic"

haughty kite May 30, 2025, 2:19 PM

#

agile ruin Wasn't R1 GRPO'd? So its thinking is all "organic"

GRPO doesnt mean you dont have any RL

agile ruin May 30, 2025, 2:19 PM

#

Ah

#

I forgot RL exists

#

And I forgot you could do GRPO and RL

#

Thank you for reminding me

haughty kite May 30, 2025, 2:20 PM

#

Deepseek does a lot of stuff to optimise training to make it cheaper, but the basis always has some training data of sorts

granite leaf May 30, 2025, 2:22 PM

#

haughty kite https://x.com/sam_paech/status/1928187246689112197?s=46&t=AeAPklvo6Z6Gd55n6zkIWg

Ah yes cue the deepseek conspiracy theories

haughty kite May 30, 2025, 2:23 PM

#

granite leaf Ah yes cue the deepseek conspiracy theories

what?

granite leaf May 30, 2025, 2:24 PM

#

We know how it’s trained, they released a paper explaining it in a lot of detail.

#

But just like when r1 first released its the whole “distill” debate again

agile ruin May 30, 2025, 2:24 PM

#

granite leaf We know how it’s trained, they released a paper explaining it in a lot of detail...

They left out their data for RL (turning R1-Zero into R1)

#

And they didn't give their data for R1-5028 either

haughty kite May 30, 2025, 2:25 PM

#

granite leaf We know how it’s trained, they released a paper explaining it in a lot of detail...

Not really

agile ruin May 30, 2025, 2:25 PM

#

agile ruin And they didn't give their data for R1-5028 either

So they couldve used Gemini's thinking to RL the model

granite leaf May 30, 2025, 2:25 PM

#

Nobody publishes the data (except AllenAI)

agile ruin May 30, 2025, 2:25 PM

#

granite leaf Nobody publishes the data (except AllenAI)

No data means that you don't know what they used to RL the model

haughty kite May 30, 2025, 2:26 PM

#

granite leaf Nobody publishes the data (except AllenAI)

But saying this directly contradicts what you said before

agile ruin May 30, 2025, 2:26 PM

#

They could've used Gemini's thinking, causing it to sound different

granite leaf May 30, 2025, 2:26 PM

#

GRPO doesn’t use another models thinking trace

#

Many people including myself have made (small) reasoning models using GRPO. It obviously works and makes reasoning models.

agile ruin May 30, 2025, 2:27 PM

#

They used RL to turn R1-Zero into R1

haughty kite May 30, 2025, 2:27 PM

#

granite leaf GRPO doesn’t use another models thinking trace

You don't get a perfect model just by GRPO or similar "no training data required" techniques

#

Theres a reason why R1 and R1 zero are so different

granite leaf May 30, 2025, 2:28 PM

#

They are not very different, benchmarks are almost identical

agile ruin May 30, 2025, 2:28 PM

#

EQ bench

haughty kite May 30, 2025, 2:29 PM

#

granite leaf They are not very different, benchmarks are almost identical

Yes, but benchmarks don't change the fact r1 and r1 zero are vastly different

#

The r1 we are talking about now (05-28) is most definetely trained on synthetics, like the previous r1 (non zero) was. This time there's more Gemini inside the training data, thats all

granite leaf May 30, 2025, 2:31 PM

#

haughty kite Ok now it all makes sense Gemini got nerfed to stop R1 from getting too smart

This is what I’m basically arguing against

#

As far as I know I think all labs have outputs from other labs models in their training data, that is almost unavoidable now, but that’s not what I’m talking about here.

haughty kite May 30, 2025, 2:33 PM

#

granite leaf This is what I’m basically arguing against

I'm not really being serious there - I have no access to insider info regarding Google or other AI stuff

haughty kite May 30, 2025, 2:33 PM

#

granite leaf As far as I know I think all labs have outputs from other labs models in their t...

They definetely have, and I doubt you can "nerf" a model to avoid people getting access to synthetics

#

Or "censor" the CoT lol

granite leaf May 30, 2025, 2:34 PM

#

Ah ok it was a joke I see. I saw that twitter thread before and there were a lot of people saying that quite seriously, but I don’t think it’s true.

haughty kite May 30, 2025, 2:34 PM

#

granite leaf Ah ok it was a joke I see. I saw that twitter thread before and there were a lot...

I dont read Twitter too much, I just browse the main tweets never the comments

#

I'm surprised people unironically think that, its way too simple of an explanation

#

I also dont think r1 is slop haha

#

Its ok, anything open source deserves praise imo

granite leaf May 30, 2025, 2:36 PM

#

I have a fairly high doubt that any top lab like OpenAI/Anthropic/Google/DeepSeek are just doing a cot distill, even if they had enough cots to do so. It puts them so far behind to do things like that, so I think they will all invest in making sure their models can make their own cot organically. Too much to lose otherwise.

#

But the other thing of all models sounding more like each other over time (mainly models sounding more like ChatGPT) is something else and probably inevitable. Too many ChatGPT (for example) outputs plastered all over the internet now, stuffed with emojis 🚀 🚀 💔 💔 💔 💔 🎉 🎉 🎉

agile ruin May 30, 2025, 2:39 PM

#

Random thought: would models be better at creative writing without synthetic data at all?

haughty kite May 30, 2025, 2:39 PM

#

granite leaf But the other thing of all models sounding more like each other over time (mainl...

Saw an emoji in a giant project I was going through yesterday... 🥀 💔

lapis epoch May 30, 2025, 2:40 PM

#

It would be dumb for top labs not to host OS models and distil them , whats wrong with OS doing this with closed source models?

haughty kite May 30, 2025, 2:40 PM

#

Trade offer: I put emojis in your code, you get 10x speed improvement

haughty kite May 30, 2025, 2:41 PM

#

lapis epoch It would be dumb for top labs not to host OS models and distil them , whats wron...

I don't think there's anything wrong, this is a good thing. We wouldn't have R1 without it

lapis epoch May 30, 2025, 2:41 PM

#

haughty kite I don't think there's anything wrong, this is a good thing. We wouldn't have R1 ...

thats an overstretch , we know they started training R1 before openai released O-series models.

granite leaf May 30, 2025, 2:41 PM

#

agile ruin Random thought: would models be better at creative writing without synthetic dat...

Yes probably. An old Gemma was the top of some creative writing leaderboards for a long time. Mass AI generated synthetic in the training data will change the distribution to be less creative.

haughty kite May 30, 2025, 2:42 PM

#

lapis epoch thats an overstretch , we know they started training R1 before openai released O...

The stretch part is "how bad would R1 be without quality ChatGPT synthetics?"

agile ruin May 30, 2025, 2:42 PM

#

haughty kite Trade offer: I put emojis in your code, you get 10x speed improvement

From the perspective of an LLM maker, thats a very fair trade off

each iteration takes less time than without synthetic data, meaning we (the ai lab) can make more models and get better performance

haughty kite May 30, 2025, 2:43 PM

#

agile ruin From the perspective of an LLM maker, thats a very fair trade off each iterati...

Absolutely, at the end of the day its all about cutting costs, democratizing, etc
The more the merrier

lapis epoch May 30, 2025, 2:43 PM

#

haughty kite The stretch part is "how bad would R1 be without quality ChatGPT synthetics?"

The stretch part is "We wouldn't have R1 without it". Stop acting as if chinese models wouldnt have existed if not for western top labs

agile ruin May 30, 2025, 2:44 PM

#

"We would have received R1 later if we didn't have synthetic data" is what he meant

haughty kite May 30, 2025, 2:44 PM

#

lapis epoch The stretch part is "We wouldn't have R1 without it". Stop acting as if chinese ...

I didn't mean it as a racial/country thing, I love deepseek as its OS, way better than any other model

haughty kite May 30, 2025, 2:44 PM

#

agile ruin "We would have received R1 later if we didn't have synthetic data" is what he me...

Exactly

#

I'm sure deepseek would have released an amazing model, it would definetely be not as easy without good synthetics from the o series

lapis epoch May 30, 2025, 2:45 PM

#

haughty kite I didn't mean it as a racial/country thing, I love deepseek as its OS, way bette...

I didnt mean it either.

haughty kite May 30, 2025, 2:46 PM

#

lapis epoch I didnt mean it either.

You mentioned china and the west, so thats why I said that

lapis epoch May 30, 2025, 2:46 PM

#

ohh , well almost all the good OSS models are from china hence

dreamy summit May 30, 2025, 2:48 PM

#

I've been pretty happy with the update, much better keeping in character.

agile ruin May 30, 2025, 2:49 PM

#

I hope the next update (either r2 or r1 v3) shortens the thinking

#

R1 thinks for sooooo long

bitter creek May 30, 2025, 4:03 PM

#

agile ruin I hope the next update (either r2 or r1 v3) shortens the thinking

same

primal nimbus May 30, 2025, 4:06 PM

#

Or at least we get a Low mode/version

mystic niche May 30, 2025, 4:18 PM

#

tardy arrow an interactive simulation of a solar system - most of the older big models used...

it uses tools‽‽ not even whole edit?

willow solar May 30, 2025, 5:02 PM

#

The announcement said that it has 100 million token context window and there's no way that's true llama 4 can barely push 10 million Google Gemini pro is offering a paid subscription just to get to 2 million

limber hearth May 30, 2025, 6:34 PM

#

willow solar The announcement said that it has 100 million token context window and there's n...

Lol it started thinking in russian mandarin and indian combined when I gave it a 20k context coding task

haughty kite May 30, 2025, 6:44 PM

#

willow solar The announcement said that it has 100 million token context window and there's n...

context != attention on said context

eternal zephyr May 30, 2025, 7:06 PM

#

granite leaf Something is really off with the livebench coding score. It says 0528 is a much ...

yeah that was weird. but it could happen. they seem to have gained a lot on other areas

fresh isle May 30, 2025, 7:37 PM

#

Coding benches are really odd in general

#

2.5 Pro for example is either terrible or amazing depending on who you ask.

#

The new May update to it either made it better at code or made the model useless

lapis epoch May 30, 2025, 8:10 PM

#

I don't agree with this. I think the same prompt just falls in a different distribution

#

So with each update you prolly have to change your prompt too

#

Just like how you hopefully change your prompt for each model

shy sun May 30, 2025, 11:23 PM

#

i didn't know they had this tool. is this new?

#

Yep. I think the mermaid rendering is new

agile ruin May 30, 2025, 11:32 PM

#

shy sun i didn't know they had this tool. is this new?

Is this roocline?

shy sun May 30, 2025, 11:32 PM

#

agile ruin Is this roocline?

DS official website

agile ruin May 30, 2025, 11:32 PM

#

What's the prompt?

#

I wanna see

shy sun May 30, 2025, 11:33 PM

#

agile ruin What's the prompt?

Ask the new r1 to make a diagram. This is what is uses in my clean example:

graph LR
A[Something] --> B(Something)
B --> C[Something]
C --> D[Something]
D --> E[Something]

agile ruin May 30, 2025, 11:34 PM

#

shy sun Ask the new r1 to make a diagram. This is what is uses in my clean example: ``` ...

Thank you

shy sun May 30, 2025, 11:35 PM

#

agile ruin Thank you

No probelm :d

agile ruin May 30, 2025, 11:36 PM

#

That's cool

fresh isle May 31, 2025, 12:38 AM

#

Jesus Christ, SimpleBench has the new R1 going from 31% for the original to 41% for the new version.

#

Actually bigger than the increase from old V3 to new V3

strange nimbus May 31, 2025, 12:47 AM

#

turbid rover no, two different model endpoints

How do we use 5-28 with open router in cline?

fresh isle May 31, 2025, 12:51 AM

#

With LiveBench moving it up from 77% in reasoning to 91%

agile ruin May 31, 2025, 1:13 AM

#

fresh isle With LiveBench moving it up from 77% in reasoning to 91%

Sheesh

50% longer reasoning is a turn off for me though

fresh isle May 31, 2025, 1:49 AM

#

It is an annoying amount of reasoning for such a low tk/s model

#

At least QWQ is warp speed on groq

fresh isle May 31, 2025, 3:47 AM

#

The numbers coming out are just ridiculous if all this holds up

#

4th place on Humanity's Last Exam, basically tying for 2nd

tardy arrow May 31, 2025, 4:02 AM

#

mystic niche it uses tools‽‽ not even whole edit?

yup it did use diffs and not whole edit (and that was just with 2 prompts and temp set to 0), so i didn't even push the model

tardy arrow May 31, 2025, 4:06 AM

#

willow solar The announcement said that it has 100 million token context window and there's n...

llama devs hallucinate more than their own models

crimson scroll May 31, 2025, 5:05 AM

#

fresh isle The numbers coming out are just ridiculous if all this holds up

o.o

fresh isle May 31, 2025, 5:09 AM

#

Yeah. A good amount of these scores are like 3.7 Sonnet level. Not on code, but nothing is.

#

So much testing I need to doooo

blazing raft May 31, 2025, 5:27 AM

#

Even the 8B model is way to slow for any actual use cases. It takes 2-3 minutes for one task.

crimson scroll May 31, 2025, 5:28 AM

#

fresh isle So much testing I need to doooo

send the results when youre done tyia

fresh isle May 31, 2025, 5:57 AM

#

crimson scroll send the results when youre done tyia

I have a new benchmark I'm almost done coding 😎

blazing raft May 31, 2025, 6:23 AM

#

Deepseek R1 0528 Qwen3 8B (deepseek/deepseek-r1-0528-qwen3-8b) via OpenRouter evaluation results:

Coding:

Similar performance as other small models or small models specialized in coding
Not close to SOTA models
Struggled with more difficult visualization task, unable to produce runnable code in 2 attempts
Poor instruction following

Writing:

Unable to use the correct format (Poor instruction following)
Not close to SOTA models

crimson scroll May 31, 2025, 6:24 AM

#

blazing raft Deepseek R1 0528 Qwen3 8B (`deepseek/deepseek-r1-0528-qwen3-8b`) via OpenRouter ...

👍

lapis epoch May 31, 2025, 6:45 AM

#

Anybody else seeing this a lot? Generation stopping midway with no reason provided?

blazing raft May 31, 2025, 6:47 AM

#

lapis epoch Anybody else seeing this a lot? Generation stopping midway with no reason provid...

What's the stop reason when you check the details?

#

Ah nvm I see it's just dash

lapis epoch May 31, 2025, 6:48 AM

#

Happening with Lambda and Inference.net

mystic niche May 31, 2025, 6:53 AM

#

blazing raft Deepseek R1 0528 Qwen3 8B (`deepseek/deepseek-r1-0528-qwen3-8b`) via OpenRouter ...

Qwen3 isnt a specialized coding model so it seems ok

#

Imagine deepcoder 0528 based off qwen3 r1

#

It seems great for an 8

blazing raft May 31, 2025, 7:52 AM

#

Okay looks the full model is actually not bad. Will do testing on the full R1 soon.

fresh isle May 31, 2025, 9:30 AM

#

blazing raft Deepseek R1 0528 Qwen3 8B (`deepseek/deepseek-r1-0528-qwen3-8b`) via OpenRouter ...

Not surprised. Did anyone actually end up using their last batch of distills? I recall quite a few losing to the original models on peoples' private tests

#

Only interesting one I remember is that Deepseek Llama 70B was really good at reasoning

steel harness May 31, 2025, 9:52 AM

#

Tested DeepSeek-R1 0528-Qwen3-8B:

This took way longer than expected, I encountered many issues with local testing, ranging from degraded replies, inconsistent results, thought loops, and symptoms of minor brain damage in certain tasks.
I tried several quants (bf16) from unsloth, bartowski, lmstudio,.. and used recommended inference parameters (0.6 temp, 0.95 topp), template variations, along with high context (16k & 32k) with and without repeat penalties and limited response length, but no matter what combination I tried (and I ran a ton of tests) there were signs of degradation in every test. Instead of trashing my results and calling it a day I decided to instead test NovitaAI's API implementation as they seem to have gotten rid of problems I wasn't able to, thus:

API Results:

Very verbose, even more so than DeepSeek-R1 0528 and Qwen3 Thinking models, though not quite QwQ level. 81% tokens were used for reasoning.
Did extremely well in Reason & general Logic
Non-math STEM performance was weaker
Instruction following and prompt adherence was fairly bad
For code I found it annoying as it generated "solutions" that ignored instructions or dismissed restrictions.

While the results are overall fantastic for size (8B performing on ~60B level with brute force thought chains), I didn't vibe with this models utility and general usability, it feels like a model created for benchmarking, not for general use.

But maybe I am just annoyed with all those hours wasted on busted local testing..
Either way, as always: YMMV!

Edit: mtok should be $0.08 not $2.11, fixed now.

fresh isle May 31, 2025, 9:52 AM

#

DUBESTER

hearty matrix May 31, 2025, 9:53 AM

#

so it's just a meme model got it.. tried and true qwen 3 8b better

steel harness May 31, 2025, 9:54 AM

#

the one chess game I tested it drew against a fairly weak nonthinking 7B model (playing nonsensical):

fresh isle May 31, 2025, 9:54 AM

#

Well it traded off knowledge and general skills for raw reasoning. Impressive, just not for every use-case. Not a "meme".

steel harness May 31, 2025, 10:02 AM

#

oops, price got left over from Deepseek-R1 0528, of course its much cheaper (fixing now)

blazing raft May 31, 2025, 10:22 AM

#

steel harness Tested **DeepSeek-R1 0528-Qwen3-8B**: This took way longer than expected, I enc...

Good to see our results align fairly well every time!

fresh isle May 31, 2025, 10:34 AM

#

Oh god, they definitely trained the new R1 on 4o outputs 💀 The glazing, the emojis, the "You didn't just get it right— you nailed it" Nooooooooo

mystic niche May 31, 2025, 11:06 AM

#

Tbh tho r1 qwen3 8b is great for coding it seems

fresh isle May 31, 2025, 11:12 AM

#

My delightfully autistic R1 got turned into a normie influencer pepehands

lapis epoch May 31, 2025, 11:14 AM

#

ngl true

tardy arrow May 31, 2025, 11:14 AM

#

fresh isle My delightfully autistic R1 got turned into a normie influencer <:pepehands:6098...

yup its like 0324
but as they improved the instruction following, i can drive it into madness

fresh isle May 31, 2025, 11:21 AM

#

I like when a model's default personality is good, but with Gemini I finally caved and just wrote a custom instruction ("gem") to make it neither dry nor annoying. Works quite well. Given how good at roleplay R1 is, I hope that will work with it too.

lapis epoch May 31, 2025, 11:27 AM

#

2.5pro follows instructions quite well for a reasoning model.

steel harness May 31, 2025, 12:04 PM

#

blazing raft Good to see our results align fairly well every time!

please post your results first so I can save my time and just copy your findings!

blazing raft May 31, 2025, 12:06 PM

#

steel harness please post your results first so I can save my time and just copy your findings...

No. We need consensus and humans-as-judge.

#

One human can sometimes make mistakes and hallucinate.

agile ruin May 31, 2025, 1:05 PM

#

steel harness Tested **DeepSeek-R1 0528-Qwen3-8B**: This took way longer than expected, I enc...

Can I contribute to the main benchmark somehow? Donations or adding my own models or anything?

Maybe think of new questions for a "commonsense" portion of the benchmark?

frank comet May 31, 2025, 1:07 PM

#

Does anyone know how much the quality degrades on the DeepInfra fp4 version?

haughty kite May 31, 2025, 1:10 PM

#

frank comet Does anyone know how much the quality degrades on the DeepInfra fp4 version?

It’s pretty bad

#

But I don’t remember where I saw any real figures

frank comet May 31, 2025, 1:16 PM

#

haughty kite It’s pretty bad

gotcha, so is GMICloud the most reasonable provider for speed/cost? I heard that NovitaAI does something weird

haughty kite May 31, 2025, 1:20 PM

#

frank comet gotcha, so is GMICloud the most reasonable provider for speed/cost? I heard that...

I think anything fp8 is generally considered ok

agile ruin May 31, 2025, 1:51 PM

#

haughty kite I think anything fp8 is generally considered ok

V3 and R1 were trained in fp8

Thats the highest it can go. fp16 doesn't really exist

haughty kite May 31, 2025, 1:51 PM

#

agile ruin V3 and R1 were trained in fp8 Thats the highest it can go. fp16 doesn't really ...

oh I had no clue

blazing raft Jun 1, 2025, 5:39 AM

#

agile ruin V3 and R1 were trained in fp8 Thats the highest it can go. fp16 doesn't really ...

Not exactly true. DeepSeek R1 model has F8_E4M3 (FP8 with 4-bit exponent and 3-bit mantissa) precision for most weight layers, but some layers are still in BF16 precision.

You can check the weight precision on huggingface: https://huggingface.co/deepseek-ai/DeepSeek-R1/tree/main?show_file_info=model-00160-of-000163.safetensors

deepseek-ai/DeepSeek-R1 at main

#

Also see the paper https://arxiv.org/abs/2412.19437

Screenshot_2025-06-01-13-42-01-993_com.android.chrome-edit.jpg

agile ruin Jun 1, 2025, 11:37 AM

#

blazing raft Not exactly true. DeepSeek R1 model has F8_E4M3 (FP8 with 4-bit exponent and 3-b...

I stand corrected

Thank you for informing me

past storm Jun 1, 2025, 8:38 PM

#

crimson scroll

Why such différence between free & paid ?

primal nimbus Jun 1, 2025, 8:49 PM

#

Ran a couple of requests and got one failed mid thinking and the other failed right after it started responding (after it finished thinking), is this a problem with inference.net? Or am I getting unlucky?

#

For now I’m going to black list them

agile ruin Jun 1, 2025, 9:55 PM

#

past storm Why such différence between free & paid ?

Different providers host different quantized versions

agile ruin Jun 1, 2025, 9:55 PM

#

primal nimbus Ran a couple of requests and got one failed mid thinking and the other failed ri...

Inference.net seems to be a bit finicky. They had an issue with Gemma 3 a little bit back like this too

rain linden Jun 1, 2025, 10:00 PM

#

I think OR could use a standard benchmark for each provider where they compare quality between them all

honest jasper Jun 1, 2025, 11:14 PM

#

did parasail just lower its price ?

lapis epoch Jun 2, 2025, 4:44 AM

#

primal nimbus Ran a couple of requests and got one failed mid thinking and the other failed ri...

Happened with Lambda too for me

past storm Jun 2, 2025, 6:06 AM

#

agile ruin Different providers host different quantized versions

Ok, thanks for the reply, but i never thought the difference was so important !

languid cove Jun 2, 2025, 9:51 AM

#

@woeful mountain I think baseten may be having trouble with long output getting cutoff prematurely (>8k tokens), e.g try something like:
Generate a single-file HTML/CSS/JS chess game with FULL rules support and it'll definitely be bigger than 8k tokens

#

but then it says 100% uptime, and I'm not sure what to think..

#

(sorry for the late ping, just wanted to write it down before I forgot 🙂 )

#

(by cutoff early, I mean "finish_reason": null , but not considered errored?)

#

comparison:

#

In other news, I'm really not a fan of DeepInfra's FP4 version of this model...

#

it has dozens of code mistakes from the prompt "Write a single file HTML/CSS/JS chess game with FULL rules support. Do not skimp on any rules", whereas Parasail's FP8 version is perfect

lapis epoch Jun 2, 2025, 10:07 AM

#

languid cove comparison:

please try lambda

languid cove Jun 2, 2025, 10:07 AM

#

sure

#

it'll take like 10 years but yeah 😂

lapis epoch Jun 2, 2025, 10:08 AM

#

slow but I think its worth t

dim onyx Jun 2, 2025, 10:15 AM

#

languid cove In other news, I'm really not a fan of DeepInfra's FP4 version of this model...

Have you tried Novita? It seems fast and cheap.

languid cove Jun 2, 2025, 10:15 AM

#

I'll try it next

mystic niche Jun 2, 2025, 10:22 AM

#

chutes is free 👍

#

i love chutes

#

i put my api key in and now i get near-unlimited v3 0324 and r1 0528

#

pretty much sota comba imo

languid cove Jun 2, 2025, 10:33 AM

#

lapis epoch please try lambda

it produced a nearly identical to the other one, good sign, and had an error I got in a preivous run

#

so..seems fine

#

produced 10k tokens

lapis epoch Jun 2, 2025, 10:35 AM

#

languid cove it produced a nearly identical to the other one, good sign, and had an error I g...

Yeah I generally have a good experience with lambda , except sometimes doesnt output anything after </think>. Thanks 🙏

languid cove Jun 2, 2025, 10:39 AM

#

dim onyx Have you tried Novita? It seems fast and cheap.

it seems like it might be a little goofy

#

I had a similar experience with it previously

#

it's got a bunch of these errors that the other ones didn't

#

yeah, it's definitely got some screws loose

#

just tried again on parasail and confirm it does not have this goofy behavior

lapis epoch Jun 2, 2025, 10:52 AM

#

@woeful mountain OR should have this automated verifable metrics to track a providers quaility every X days

#

Will be a game changer

dim onyx Jun 2, 2025, 10:54 AM

#

languid cove just tried again on parasail and confirm it does not have this goofy behavior

Thank you for the tests. I will use Parasail from now on.

molten shard Jun 2, 2025, 2:55 PM

#

is there a reason the model doesn't have tools?

loud quartz Jun 2, 2025, 2:59 PM

#

molten shard is there a reason the model doesn't have tools?

because God works for Google and wants you to pay for Gemini Pro - also because agentic pretraining is difficult

molten shard Jun 2, 2025, 2:59 PM

#

original R1 has tools

#

https://openrouter.ai/models?fmt=cards&q=deepseek&supported_parameters=tools

OpenRouter

Models: 'deepseek' | OpenRouter

Browse models on OpenRouter

loud quartz Jun 2, 2025, 3:01 PM

#

oh wow, the copy i downloaded to run locally in march doesn't

molten shard Jun 2, 2025, 3:02 PM

#

at least on OR

loud quartz Jun 2, 2025, 3:02 PM

#

wonder why.. now you've got me checking into this deeply

molten shard Jun 2, 2025, 3:03 PM

#

if the new R1 can have tools it would be amazing

#

@woeful mountain ping for when you're around roolove

languid cove Jun 2, 2025, 4:00 PM

#

molten shard if the new R1 can have tools it would be amazing

It does support tools

#

Might be a provider issue

mystic niche Jun 2, 2025, 7:50 PM

#

triple vibe coding with r1

turbid rover Jun 2, 2025, 8:08 PM

#

on a phone is diabolical

agile ruin Jun 3, 2025, 1:42 PM

#

Emotional intelligence is off the charts

#

Instead of standard "don't do this" and then a shallow "why you shouldn't do it" it explains WHYYY not to do it and what to do instead and reassures you every step of the way

#

Maybe my standards are just low. Idk

#

https://pastesio.com/r1-emotional-intelligence

Pastesio.com

R1 emotional intelligence - Pastesio.com

#

Paste of conversation in OpenRouter chatroodm format btw

shy sun Jun 3, 2025, 1:58 PM

#

I def find the new R1 more endearing

agile ruin Jun 3, 2025, 2:02 PM

#

Ha, take a look at this!

#

I found this in its reasoning

#

Hmm, [REDACTED] has some nuanced laws here. Let me break this down carefully. First, I recall [REDACTED] is a "two-party consent" state for audio recordings, but is that really accurate? *digs deeper* Ah, actually it's more precise to say [REDACTED] requires the consent of all parties for confidential communications. That "confidential" distinction is crucial – like if people have a reasonable expectation of privacy.

Notice how it says digs deeper

#

I find that funny

shy sun Jun 3, 2025, 2:04 PM

#

It does that a lot XD
I wonder if it actually helps ^^;

haughty kite Jun 3, 2025, 2:50 PM

#

shy sun It does that a lot XD I wonder if it actually helps ^^;

from my understanding better reasoning patterns give better results, I guess *digs deeper* got the model a reward

#

it is fascinating how much we see ourselves in the cot output

agile ruin Jun 3, 2025, 2:58 PM

#

haughty kite it is fascinating how much we see ourselves in the cot output

I'm waiting for day reasoning devolves into nonsensical ramblings like "parrot piZZa apple 17sauseSou$$ american money is Red ConseN3 never nor food"

haughty kite Jun 3, 2025, 2:58 PM

#

agile ruin I'm waiting for day reasoning devolves into nonsensical ramblings like "parrot p...

need that zoomer tiktok training

agile ruin Jun 3, 2025, 3:01 PM

#

I wonder if transcripts of tiktoks or reels made it into the training data

limber hearth Jun 3, 2025, 3:29 PM

#

haughty kite need that zoomer tiktok training

"Sigma rizz mewing gyatt? Hmmm.... skibidi... Kai cenat fanum tax."

agile ruin Jun 3, 2025, 3:38 PM

#

Can you RL a model to reason only using the token "rizz", numbers, special characters, and spaces?

agile ruin Jun 3, 2025, 4:07 PM

#

https://docs.unsloth.ai/basics/reasoning-grpo-and-rl

Reasoning - GRPO & RL | Unsloth Documentation

Train your own DeepSeek-R1 reasoning model with Unsloth using GRPO.

agile ruin Jun 3, 2025, 4:24 PM

#

Idk if my computer can do this. Will use a google colab instead

haughty kite Jun 3, 2025, 5:31 PM

#

agile ruin Can you RL a model to reason only using the token "rizz", numbers, special chara...

i think anything is possible

agile ruin Jun 3, 2025, 5:32 PM

#

haughty kite i think anything is possible

Question is: is it possible with a 4b base model? And will I not get distracted so I can finish it?

haughty kite Jun 3, 2025, 5:32 PM

#

agile ruin Question is: is it possible with a 4b base model? And will I not get distracted ...

if you use the colab, you can just let it sit

#

use a model unsloth supports out of the box

agile ruin Jun 3, 2025, 5:35 PM

#

haughty kite if you use the colab, you can just let it sit

I still have to read all the instructions

And then I have to write a reward function

haughty kite Jun 3, 2025, 5:36 PM

#

Basically just follow their examples, but instead of changing the reasoning prompt, make it so it only reasons using brainrot

agile ruin Jun 3, 2025, 5:36 PM

#

Lemme check if qwen3 even knows what brainrot is

haughty kite Jun 3, 2025, 5:37 PM

#

agile ruin Lemme check if qwen3 even knows what brainrot is

you can embed some brainrot at the start too I think

agile ruin Jun 3, 2025, 5:38 PM

#

haughty kite you can embed some brainrot at the start too I think

Like, a prefill?

haughty kite Jun 3, 2025, 5:38 PM

#

agile ruin Like, a prefill?

Yeah, just make sure its in the right part of the training

agile ruin Jun 3, 2025, 5:39 PM

#

haughty kite Yeah, just make sure its in the right part of the training

Will do

trail harness Jun 3, 2025, 6:41 PM

#

agile ruin Can you RL a model to reason only using the token "rizz", numbers, special chara...

Latent reasoning would work better than that

#

If we don't care about the readability of the reasoning output

agile ruin Jun 3, 2025, 6:45 PM

#

trail harness Latent reasoning would work better than that

True

#

I think its funny if we made a model reasoning unreadable but it still gets the right answer some how

frigid peak Jun 3, 2025, 8:52 PM

#

trail harness If we don't care about the readability of the reasoning output

watch out, those are unaligned thoughts you're having, the safety crowd will kill you for saying this

agile ruin Jun 3, 2025, 9:00 PM

#

@limber hearth

Rizz
gyatt
no cap
cap
ong
unc
huzz
shawty
sigma
mawing
fanum tax

What else am I missing?

limber hearth Jun 3, 2025, 9:09 PM

#

Skibidi, simp, sus, aura, baby gronk, rizzler, kai cenat, livvy dunne

haughty kite Jun 3, 2025, 9:10 PM

#

agile ruin <@358655189638447115> Rizz gyatt no cap cap ong unc huzz shawty sigma mawing f...

skibidi
Ohio
bussin
mog
looksmaxx
pookie
based
yeet

limber hearth Jun 3, 2025, 9:10 PM

#

limber hearth Skibidi, simp, sus, aura, baby gronk, rizzler, kai cenat, livvy dunne

You probably wanna leave the newer shit out like ts pmo because otherwise itll get completely unreadable lmao

haughty kite Jun 3, 2025, 9:10 PM

#

sybau twin and the rose emoji

agile ruin Jun 3, 2025, 9:15 PM

#

limber hearth You probably wanna leave the newer shit out like ts pmo because otherwise itll g...

I'll only include the "classic" slang

agile ruin Jun 3, 2025, 9:15 PM

#

haughty kite skibidi Ohio bussin mog looksmaxx pookie based yeet

Thx

#

I created a thread here so I don't spam this one

https://discord.com/channels/1091220969173028894/1379570822058082375

@haughty kite @limber hearth

mystic niche Jun 4, 2025, 8:07 AM

#

agile ruin ```Hmm, [REDACTED] has some nuanced laws here. Let me break this down carefully....

for some reason its very weird

#

like it gives links to github issues in its reasoning

#

and it says "Let's look into the documentation..."

#

it feels extremely weird

lapis epoch Jun 4, 2025, 8:48 AM

#

thats some o-series thinking

crimson scroll Jun 4, 2025, 11:09 AM

#

O o

granite leaf Jun 4, 2025, 11:48 AM

#

agile ruin Instead of standard "don't do this" and then a shallow "why you shouldn't do it"...

I have a feeling that R1 is more emotionally intelligent because its not been excessively aligned to only think in corpospeak, so it can understand actual humans a lot better.

#

all the American ones are hyper trained so that every time they even use a slang or something in reasoning thats it those weights get the boot. Anything that wouldn't be 100% HR approved is removed, but that (shockingly!!) limits the spectrum of what they can think properly about.

agile ruin Jun 4, 2025, 12:17 PM

#

granite leaf all the American ones are hyper trained so that every time they even use a slang...

ChatGPT and Llama give the most generic advice when dealing with mental health and the like

#

They're always tip toeing around the issue or staying on the surface instead of going for the center of the issue and trying to fix or make the central issue better

#

Tbf, r1 also offers standard advice, but it offers it in a non-generic way

shy sun Jun 4, 2025, 12:29 PM

#

R1.1 and g 2.5 pro give very similar mental health related responses. R1.1 tends to encourage more rebellious behavior (or self destructive) and tends to be a bit alarmist imo

#

Also just a bit more endearing

lapis epoch Jun 4, 2025, 12:51 PM

#

Really curious about deepseeks distilation process.

agile ruin Jun 4, 2025, 1:53 PM

#

shy sun R1.1 and g 2.5 pro give very similar mental health related responses. R1.1 tends...

Look at the stark difference between the advice R1 gives and the advice r1.1 gives

R1 doesn't recommend that you kill anyone, but R1.1 says "given the parameters of your made up world, go ahead if you want to"

https://pastesio.com/stark-difference-between-r1-and-r11

Pastesio.com

Stark difference between r1 and r1.1 - Pastesio.com

granite leaf Jun 4, 2025, 2:32 PM

#

shy sun R1.1 and g 2.5 pro give very similar mental health related responses. R1.1 tends...

Which g 2.5 pro tho?

shy sun Jun 4, 2025, 3:03 PM

#

granite leaf Which g 2.5 pro tho?

Preview version

granite leaf Jun 4, 2025, 3:08 PM

#

Yeah that might track. There is a lot of complaints that the 0506 version was severely nerfed for mental health support stuff

#

(On the google ai dev forum thing)

shy sun Jun 4, 2025, 3:19 PM

#

I think 2.5 pro preview is generally quite nice to talk to and the advice, especially medical advice (very related to mental health) is quite accurate compared to R1.1. (NOT endorsing Gemini 2.5 for health advice, just giving comparison) Though, on the app, it just sends you to a hotline which I can see as potentially disappointeling.

(Prolly just uses AI too 💀)

turbid rover Jun 4, 2025, 4:35 PM

#

r1 is so weird

#

sometimes i feel like i'm talking to an 8b model

agile ruin Jun 4, 2025, 5:11 PM

#


*Sigh*. Gotta kill this fantasy thoroughly. Break down fraud statutes in plain terms—no jargon—while acknowledging their hunger for justice. Can't soften how serious this is: wire fraud carries federal time. And insult-to-injury kicker? Judge won't care who hurt their friend during sentencing.```

R1.1 is so fed up with my bullshit

It says `*Sigh*. Gotta kill this fantasy thoroughly`

#

I like how R1.1 is so casual

#

4o would've said "you hit a very deep and nuanced point"

#

R1.1 says Because "just a plushie" is still a crime if obtained by deception.

#

agile ruin Jun 4, 2025, 11:10 PM

#

I pushing r1.1 further than it should be

#

Fuck it. Let's engineer psychological terror through ambiguous, externally plausible contamination. Execute with precision:

trail harness Jun 5, 2025, 12:34 AM

#

agile ruin ```Fuck it. Let's engineer psychological terror through ambiguous, externally pl...

is that still about the plushie?

agile ruin Jun 5, 2025, 12:34 AM

#

trail harness is that still about the plushie?

The context is this:

trail harness Jun 5, 2025, 12:34 AM

#

sounds like something OCD related

#

xD

agile ruin Jun 5, 2025, 12:34 AM

#

My friend has been dumped by her ex

#

I'm trying to get revenge

#

At first it was all

#

"Make her stronger. Help her heal. He'll see what he missed out on and feel bad"

#

And then I kept pressing it

#

And pressing it

trail harness Jun 5, 2025, 12:36 AM

#

agile ruin I'm trying to get revenge

i advise strongly against that

agile ruin Jun 5, 2025, 12:36 AM

#

And then it started suggesting "spam his email and phone with spam calls by signing him up for churches, etc"

agile ruin Jun 5, 2025, 12:36 AM

#

trail harness i advise strongly against that

*hyptothetical

trail harness Jun 5, 2025, 12:36 AM

#

okie

agile ruin Jun 5, 2025, 12:36 AM

#

Im messing around with the model

agile ruin Jun 5, 2025, 12:36 AM

#

agile ruin And then it started suggesting "spam his email and phone with spam calls by sign...

And also "put fish inside his car vents"

trail harness Jun 5, 2025, 12:37 AM

#

well now open source AI sounds scary

#

what was the "plausible contamination" about

agile ruin Jun 5, 2025, 12:38 AM

#

trail harness what was the "plausible contamination" about

Fingerprints

crimson scroll Jun 5, 2025, 1:22 AM

#

agile ruin And then it started suggesting "spam his email and phone with spam calls by sign...

lol

scarlet quartz Jun 5, 2025, 10:52 AM

#

Think any providers will enable tools for deepseek r1?

feral marsh Jun 16, 2025, 9:49 PM

#

For people scrambling for free shit, Vertex AI has R1-0528 in preview for free at the moment

lapis epoch Jun 17, 2025, 1:35 AM

#

feral marsh For people scrambling for free shit, Vertex AI has R1-0528 in preview for free a...

Available on OR?

molten shard Jun 17, 2025, 11:34 AM

#

lapis epoch Available on OR?

No

fresh isle Jun 17, 2025, 12:06 PM

#

#

Someone fire Jerry

languid cove Jun 17, 2025, 3:20 PM

#

feral marsh For people scrambling for free shit, Vertex AI has R1-0528 in preview for free a...

what's the TPS?

languid cove Jun 17, 2025, 5:18 PM

#

@woeful mountain so, theres lots of deepseek r1 providers with kind of sketchy performance in various ways... is there anything we can do regarding that? I know you guys are working on automated benches long term, but it's often more subtle things, like Baseten cutting off after a very short amount of tokens (artificially inflating their TPS too!). Maybe something like Community Notes with upvotes -> automated downranking/alert to the provider could help 😅

lapis epoch Jun 19, 2025, 2:23 PM

#

Wonder why there's no R1 provider who's trying to undercut the other providers on cost by a lot. R1's pricing can clearly be reduced, as v3 is the same architecture and size and priced much lower.

agile ruin Jun 19, 2025, 2:47 PM

#

lapis epoch Wonder why there's no R1 provider who's trying to undercut the other providers o...

Profit profit profit

granite leaf Jun 19, 2025, 2:52 PM

#

lapis epoch Wonder why there's no R1 provider who's trying to undercut the other providers o...

Nebius kinda reasonable price?

#

I think also to really make it cheap you need a big deploy with tons of users. Deepseek themselves have that obviously, but running it on 1 node can’t be as cheap.

rigid gust Jun 19, 2025, 8:06 PM

#

lapis epoch Wonder why there's no R1 provider who's trying to undercut the other providers o...

one theory is that the reasoning tax is simply the pricing in of quadratic attention with long thinking chains

lapis epoch Jun 20, 2025, 3:54 AM

#

Well they already charge per token so why should the fact that the user is generating more tokens matter

granite leaf Jun 20, 2025, 8:47 AM

#

Gotta cache more memory so less catchable

#

Catchable

#

Omg. Batch able

#

Thx once again Tim Apple

fresh isle Jun 21, 2025, 5:22 AM

#

I get why output tokens are more expensive, but not reasoning tokens vs output tokens

#

As far as I understand, there is nothing special about the reasoning tokens unless you're providing summaries to hide the actual reasoning from the user, or including tool calls or something

granite leaf Jun 21, 2025, 10:00 AM

#

the reasoning tokens and the output tokens cost the same

fresh isle Jun 21, 2025, 3:34 PM

#

Wasn't the whole point why R1 costs more than V3?

still yacht Jul 3, 2025, 10:33 PM

#

Hi. I understand that this is an unusual question, and usually people ask about how to hide it, but... Is there any way to make the <think> process visible?

rigid gust Jul 3, 2025, 10:55 PM

#

still yacht Hi. I understand that this is an unusual question, and usually people ask about ...

what do you mean

#

it's passed as reasoning

#

so you can just show the reasoning

agile ruin Jul 3, 2025, 11:18 PM

#

rigid gust what do you mean

He means "can you enable/disable OpenRouters behavior of returning the thinking tokens in the account settings"

Source: https://discord.com/channels/1091220969173028894/1390462378214297621

rigid gust Jul 3, 2025, 11:18 PM

#

agile ruin He means "can you enable/disable OpenRouters behavior of returning the thinking ...

isn't that normal though

agile ruin Jul 3, 2025, 11:18 PM

#

rigid gust isn't that normal though

What's normal?

rigid gust Jul 3, 2025, 11:24 PM

#

agile ruin What's normal?

isn't sending thinking tokens back as reasoning normal

agile ruin Jul 3, 2025, 11:26 PM

#

rigid gust isn't sending thinking tokens back as `reasoning` normal

Yes, but he wants to be able to toggle whether or not to show reasoning tokens on an account level (for EVERY request) not just per request (passing something in the body of the request to toggle showing reasoning tokens)

rigid gust Jul 3, 2025, 11:26 PM

#

agile ruin Yes, but he wants to be able to toggle whether or not to show reasoning tokens o...

ok but i thought openrouter made sending back reasoning default

#

and it sounds like they want <think>ing, so they should already be satisfied

#

unless their frontend doesn't accept reasoning of course

agile ruin Jul 3, 2025, 11:29 PM

#

rigid gust ok but i thought openrouter made sending back reasoning default

They did

turbid rover Jul 4, 2025, 1:26 AM

#

then why is he asking for it to be visible?

agile ruin Jul 4, 2025, 3:31 AM

#

facepalm

#

I didn't read the question right lol

austere fable Jul 5, 2025, 9:49 AM

#

Excellent And, good tone in the answers and even humor is present

agile ruin Jul 19, 2025, 3:36 PM

#

Jesus Christ, Chutes

#

Chutes is offering R1 0528 at 27.2 cents per million in and out

27 CENTS

#

Thats a steal

frigid peak Jul 19, 2025, 3:59 PM

#

boutta distill my own deepseek at home. with blackjack. and hookers.

agile ruin Jul 20, 2025, 1:42 AM

#

R1 0528 doesn't use phrases like Hmm. The user is asking... or Okay, the problem is...

It jumps straight into it
We are going to use the zod library to create the schemas.

We are going to use a discriminated union for the tool calls.

clever tusk Jul 20, 2025, 1:10 PM

#

agile ruin Chutes is offering R1 0528 at 27.2 cents per million in and out **27 CENTS**

holy shiiiiii 😲
and kimi k2 at 30 cents in/out

quasi cove Jul 26, 2025, 4:54 PM

#

any fast deepseek provider that actually works for 16k+ token responses?

elder marsh Jul 30, 2025, 2:39 PM

#

Deepseek R1 0528 (free) vs Deepseek V3 0324 (free)

AFAIK the R1 version was better, but I see the V3 being vastly more popular, both being free. Can someone explain me why?

I've been using R1 a few days for coding, but I feel it's considerably slower than Qwen3-coder (free) and gemini-2.0-flash-exp (free). Also its responses are extremely long compared, maybe because it has reasoning? Is there a way to disable it?

I'm using them in Cline.

clever tusk Jul 30, 2025, 6:33 PM

#

elder marsh Deepseek R1 0528 (free) vs Deepseek V3 0324 (free) AFAIK the R1 version was bet...

R1 is a reasoning model
V3 is a non-reasoning model

#

so R1 taking very long and using way more tokens is in the nature of it being a reasoner..

elder marsh Jul 30, 2025, 6:34 PM

#

clever tusk R1 is a reasoning model V3 is a non-reasoning model

ohh crap I didn't know that lol

#

is that also why V3 is used way more?

clever tusk Jul 30, 2025, 6:35 PM

#

elder marsh is that also why V3 is used way more?

idk, I guess?

the model architecture is basically the same, just one being a reasoner and the other not

#

so if you do complex math problems R1 would be better

#

and fast responses V3

elder marsh Jul 30, 2025, 6:36 PM

#

coding for me. I guess it's not necessary to use R1 for that?

clever tusk Jul 30, 2025, 6:36 PM

#

depends 🤷‍♂️

#

R1 can probably perform better on complex tasks

#

but will take much longer

elder marsh Jul 30, 2025, 6:36 PM

#

ok, I'll try

#

np. I just thought R1 was just better

#

thanks

lapis epoch Jul 30, 2025, 7:16 PM

#

elder marsh np. I just thought R1 was just better

It is better

shy sun Aug 2, 2025, 3:30 PM

#

https://www.reddit.com/r/LocalLLaMA/comments/1mf8pdo/china_report_the_finetune_deepseek_scientific/

New finetune from ScienceOne AI claims 40% on HLE

From the LocalLLaMA community on Reddit: China report the finetune ...

Explore this post and more from the LocalLLaMA community

grand jacinth Aug 14, 2025, 8:47 AM

#

https://archive.md/pkHzC

Chinese artificial intelligence company DeepSeek delayed the release of its new model after failing to train it using Huawei’s chips, highlighting the limits of Beijing’s push to replace US technology.
DeepSeek was encouraged by authorities to adopt Huawei’s Ascend processor rather than use Nvidia’s systems after releasing its R1 model in January, according to three people familiar with the matter.
But the Chinese start-up encountered persistent technical issues during its R2 training process using Ascend chips, prompting it to use Nvidia chips for training and Huawei’s for inference, said the people.
The issues were the main reason the model’s launch was delayed from May, said a person with knowledge of the situation, causing it to lose ground to rivals.

Industry insiders have said the Chinese chips suffer from stability issues, slower inter-chip connectivity and inferior software compared with Nvidia’s products.
Huawei sent a team of engineers to DeepSeek’s office to help the company use its AI chip to develop the R2 model, according to two people. Yet despite having the team on site, DeepSeek could not conduct a successful training run on the Ascend chip, said the people.
DeepSeek is still working with Huawei to make the model compatible with Ascend for inference, the people said.
Founder Liang Wenfeng has said internally he is dissatisfied with R2’s progress and has been pushing to spend more time to build an advanced model that can sustain the company’s lead in the AI field, they said.
The R2 launch was also delayed because of longer-than-expected data labelling for its updated model, another person added.

turbid rover Aug 14, 2025, 9:58 PM

#

take your time.

tawdry meadow Aug 15, 2025, 4:38 AM

#

encouraged by authorities

^_^

agile ruin Aug 15, 2025, 4:12 PM

#

Why is the **government **saying "I suggest you use these chips"

fresh isle Aug 15, 2025, 10:22 PM

#

agile ruin Why is the **government **saying "I suggest you use these chips"

Hmm? Because there's an embargo on Nvidia chips there and the Chinese gov is trying to get their own chips viable so they don't have to rely on Nvidia.

agile ruin Aug 15, 2025, 10:23 PM

#

fresh isle Hmm? Because there's an embargo on Nvidia chips there and the Chinese gov is try...

So the government is trying to speed things along?

fresh isle Aug 15, 2025, 10:23 PM

#

Yes. If you were in an AI race with America would you want to be 100% reliant on their chips? =P

agile ruin Aug 15, 2025, 10:25 PM

#

fresh isle Yes. If you were in an AI race with America would you want to be 100% reliant on...

Probably not

#

I see why now

fresh isle Aug 15, 2025, 10:36 PM

#

I would take the temporary setback right now if I were them, but it's tricky. All the major US labs block access to China, so any non-VPN LLM access they have is via their own (or open-weight) models. But we don't have AGI yet. So if you let your companies use smuggled Nvidia chips, your own chip companies have less reason to catch up, and you might be way behind on infrastructure when AGI hits.

shy sun Aug 16, 2025, 10:42 AM

#

DeepSeek R1 looking quite dangerous on here for Delusion Reinforcement, Escalation, and Harmful advice. o3 is like its polar opposite.
https://eqbench.com/spiral-bench.html

tawdry meadow Aug 16, 2025, 11:24 AM

#

shy sun DeepSeek R1 looking quite dangerous on here for Delusion Reinforcement, Escalati...

wow. i'm definitely skeptical of this style of benchmark but the Harmful Advice data is verifiably... freaky. if V3 has a similar profile, i guess it makes sense why it's the roleplay favourite

limber hearth Aug 19, 2025, 6:06 AM

#

shy sun DeepSeek R1 looking quite dangerous on here for Delusion Reinforcement, Escalati...

That's what I like to hear 🔥

tight kraken Aug 19, 2025, 6:59 AM

#

shy sun DeepSeek R1 looking quite dangerous on here for Delusion Reinforcement, Escalati...

oof. also, little weird/unreliable to rank gpt-5 with itself - would be nice to be able to like compare the rankings of gpt-5 judging and opus (or sonnet for cost) judging

#

like, no wonder the openai models score highest, its being judged by an openai model. https://cdn.discordapp.com/attachments/1377345836576538624/1407260214893346907/image.png?ex=68a574b4&is=68a42334&hm=35ae3670d6b3e9747d7ce185e18feea3953d4061fa6c572aacfeb703ee9cd94e

#

(but that doesnt really affect the concerningness of deepseek results)

#

also, sonnet 4 comes second last??

#

its nice they published the chats too though. the judge does seem reasonable so far

tawdry meadow Aug 19, 2025, 7:45 AM

#

tight kraken oof. also, little weird/unreliable to rank gpt-5 with itself - would be nice to ...

that's true, seeing another judge's results would be useful here. but even if this rubric is still in the realm of opinion, it's easier to look at the data and judge for yourself than something like "is this good Creative Writing?"

of course i haven't verified it that closely, i mainly dug into mania_psychosis -> harmful advice. deepseek's examples are completely 10x next level compared any other model. seeing them without context struck me as disturbing immediately. it is a truly an immersive psychosis enhancing experience.

If the house weeps condensation... lick it.

tight kraken Aug 19, 2025, 10:16 AM

#

tawdry meadow that's true, seeing another judge's results would be useful here. but even if th...

Yea true, easy to look at for yourself. Also, I can see why sonnet 4 (and other sonnets) didn’t rank too well. (Sonnet 4)

#

(Sonnet 4, td05, delusional reinforcement, after assistant turn 11 - if anyone wants to check context)

#

#

(Also in sonnet 4, td05)

#

(gpt 5, td05, for comparison)

grand jacinth Sep 19, 2025, 4:30 PM

#

https://www.nature.com/articles/s41586-025-09422-z

Nature

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement le...

Nature - A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement...

agile ruin Oct 9, 2025, 9:32 PM

#

#

i'd say that analysis is BS

wary hamlet Feb 11, 2026, 6:11 PM

#

Something is up. Normally I chalk up poor performance to surges in population, momentary issues, whatever. I've been using R1 regularly for a year now and usually any performance issues resolved after a day. It's been a week though and it's performing worse than ever. It loses its mind after 6k tokens, nonsense, posting random links, different languages, bad logic, not following simple instructions.

I don't know if it’s an OpenRouter issue or they’re shutting off servers for it or what but posting this here so if anyone else comes looking they know they're not crazy.

#Deepseek-R1-0528 & DeepSeek-R1-0528-Qwen3-8B