#Deepseek-R1-0528 & DeepSeek-R1-0528-Qwen3-8B

888 messages · Page 1 of 1 (latest)

lapis epoch
#

yohoo!!

ember robin
#

Very cool

obsidian cipher
#

its already online in deepseek chat

mystic niche
honest jasper
#

their official api is slow as heck

quiet saddle
#

checkpoint has been released. 3rd party providers will deploy in hours.

limber hearth
#

Parasail have it up and running can we get it live on OR?

sick palm
#

What's different between the new R1 and the old R1?

honest jasper
#

sweet

languid cove
#

I wonder if providers will stop the thinking tax now…

#

So annoying

#

The best part about the unified thinking models like Qwen 3 is that providers can’t really get away with the thinking tax with a straight face lol

lapis epoch
#

Parsail is one of the few providers I blindly trust.

languid cove
#

the thinking tax is well in place

obsidian cipher
#

does anybody know whether the model on the deepseek api is already 0528?

grand jacinth
#

Parasail is fast ⛵

grand jacinth
languid cove
#

the fact that deepseek v3 costs the exact same for the providers

#

as deepseek r1

lapis epoch
#

it doesnt

#

thinking models are more expensive to serve

grand jacinth
#

Yeah that's my understanding too

languid cove
#

deepseek said themselves that isn't the case

#

and the fact that Qwen models providers charge a single unified price is evidence that it doesn't really matter to them

#

thinking models output more tokens

grand jacinth
#

Memory requirements maybe as the params are the same, but not cycles, it will require more processing

languid cove
#

yeah that's fine, but that's why you pay for more output tokens

#

it's all baked in

grand jacinth
#

That's right and every token is a processed cycle

languid cove
#

if you pay $1/$3 for a non-reasoning model, and ask it to output 10k tokens worth of stuff, you're paying for those extra cycles through the extra output tokens

grand jacinth
#

Well let me try the baking first goddamn

languid cove
#

it isn't uniquely more expensive such that the output token price would go up

#

generally it only makes sense to increase the price at really high context

#

also neat

#

new R1 seems to think a lot less

#

a LOT less

primal nimbus
#

Any benchmarks for R1.1 yet?

unique scaffold
#

@woeful mountain as the DeepSeek provider is on this version as well, will u be switching it over?

woeful mountain
unique scaffold
#

wheezeold no fucking clue

#

it def seems better but these guys need to stop dropping these updates without telling ANYONE anything

#

lmfao

unique scaffold
#

they usually make one on their deepseek site

lapis epoch
unique scaffold
#

LMAO

#

it ususlaly drops here

#

in the "news" section

#

last time with v3-0324, it took a day or two so

#

we'll see~

lusty wadi
#

might as well benchmark it on MRCR

lusty wadi
bitter wave
lusty wadi
#

what benchmark even is this

lusty wadi
#

what is it measuring

#

it just says numbers

lapis epoch
lusty wadi
#

ok actually guys, I implemented my own framework for benchmarking LLMs through OpenRouter

#

i'm proud of it, its reliable, etc

#

i'll send results here for OpenAI's MRCR

#

what benchmarks do we like?

#

in fact, I'll make anyone here this deal

#

i'll provide the benchmark code and running instructions so that it's copy pasting two lines of code

#

if anyone is willing to pay the API credits, they can run the tests

#

its all open source stuff too dw

#

i'll add any benchmark, just name it

prisma ermine
#

DeepSeek cooked this one

oak canopy
#

for me from the 1 prompt i did vs r1 it seems to actually think better; or atleast more clear to me

languid cove
#

agreed

#

it doesn't have the frantic "but wait" personality of original R1

prisma ermine
#

来个速报测试!DeepSeek-R1-0528 VS Claude-4-Sonnet !

直接看效果, 我就提两点, 注意平面的橙色漫反射, 以及控制面板的美观程度. 这俩是用同一个 prompt 一次性生成的. claude-4-sonnet 生成了542行, DeepSeek-R1-0528 生成了 728 行.

(其他细节还有注意 FPS, 以及球撞击后的运动方向)
...

dreamy summit
#

Very impressive for the size in coding and building front end.

oak canopy
#

actually seems to write a proper structure in its thinking process for the code unlike old r1 which overthinks just writing some code

wet obsidian
#

Level with me folks, we so back?

dreamy summit
limber hearth
#

This ones lazy on the reasoning for sure, had to tell it very strongly to actually do massive amounts of reasoning first, else it would skip to the answer after like 5 seconds and not give a very good output

#

Also less good results at analyzing and finding connections from large inputs compared to v3 so far, connections it makes feel lazy and not actually relevant. will need to see how that translates over to coding

viral mural
languid cove
#

Novita just dropped their api

#

and it's super fast

#

👀

woeful mountain
#

you should have it in ~5mins

languid cove
#

I'm surprised it's so fast because their old R1 api is at 29 TPS average

errant cave
#

Still a bit unhinged like original R1.

languid cove
#

I'm not seeing the unhinged-ness

#

at least in thinking

errant cave
#

The thinking part looks normal, yeah.

languid cove
#

I'm hoping that Cerebras or Groq find it worthwhile to host R1 now with the update

#

that would be awesome

lapis epoch
languid cove
#

they even say "For dedicated deployments, prompt caching frees up resources, leading to higher throughput on the same hardware" LULW

#

and no discount?

#

"native_tokens_cached": 0,
do any of the big providers support prompt caching discounts...?

languid cove
turbid rover
#

literally 3 message exchanges

#

is it just me or NovitaAI does something to the models?

#

the outputs from them and also Lambda never quite matches the other providers

rain linden
#

GMICloud seems pretty reasonable for this model as far as price to performance goes

#

(so far)

#

not entirely sure why Parasail, Kluster, and Fireworks choose to be expensive.

#

at least FW has speed going for them

#

(probably because nobody's using it)

languid cove
#

slow R1 is excrutiating

#

and I'm impatient

#

...but, no prompt caching discount is really sucks

#

prompt caching is so essential

rain linden
#

i have not kept up to date lately

languid cove
#

um

#

like

#

you get a big discount for input tokens if you've already sent them before

#

basically

rain linden
#

ahh thats really nice then

languid cove
#

Claude and OpenAI for example, give a 10x discount

#

so it's absolutely massive

#

especially for things like agentic coding

rain linden
languid cove
agile ruin
#

R1 is too slow to be used in production

#

I need a minimum of 100 TPS

crimson scroll
#

waiting on the dubesor benchmark @steel harness

shy sun
#

Acting cuter than usual for anyone else?

trail harness
#

i dont speak mandarin

crimson scroll
#

i set reasoning effort to medium and damn it started thinking for a while

shy sun
#

It's basically just being very endearing and mirroing with text emoticons

trail harness
#

need to see what the uwu test does

#

sounds like the perfect time

#

Kaomoji after typing "UwU" is GPT 4.1 API behaviour

#

Like, no other model does it like that

#

Not even gpt 4.1 chatgpt version

#

Nor the mini or nano in api

#

Actually i think another model did

#

But not from chatgpt

#

Not from openAI*

#

Dont remember which one, might have also been from deepseek

wet obsidian
#

V3? It would make sense since the Updated V3 likely helped train R1.

trail harness
#

Seems like v3 0324 responds similarly half the time. Provides an explanation to what "UwU" means the other half

#

Same with v3

#

The first one

wet obsidian
#

From Reddit, so take with grain of salt, but looking impressive so far.

trail harness
#

Opus nothink?

shy sun
#

It's my R2, honestly. I love its new personality.

crimson scroll
#

waiting for @steel harness benchmarks

trail harness
#

I expect more from r2

#

Gemini 2.5 pro still beats it in my tests

crimson scroll
#

gemini 2.5 03 25 or 05 07?

trail harness
#

I sent the same prompt to 05 07 only

#

It was a couple programming tasks

crimson scroll
#

05 07 sucks

trail harness
#

05 06*

crimson scroll
#

03 25 was the holy grail

trail harness
#

Meh it seems alright to me. Perhaps i didnt use it much to notice the difference, because it's too expensive for me

#

The latest one still beats all the others available in programming from what I can see

#

(Haven't tested against sonnet or opus 4)

crimson scroll
woeful mountain
shy sun
#

Okay I talked to it more, and it feels a bit like glaze gpt.

tardy arrow
crimson scroll
crimson scroll
woeful mountain
#

i’m mostly kidding but you pinged twice in like an hour

crimson scroll
#

yeah two pings with an hour separation hehe, the 2nd ping was mostly due to that reddit post

#

his benchmarks are solid when i compared to others

#

i dont trust livecodebench or most of them

fresh isle
#

I ignore most benches. If it's public, the second a big company lists it in a graph it's dead to me. If it's private and they list it, might still be okay.

#

My go-to is still Simplebench, although I wish he updated more frequently. The new EQ bench is really great and you can look at the responses to judge for yourself

mystic niche
#

it is kinda unhinged

#

i asked for the cosine of 9+10

#

part of its thinking was whether i meant "cousin of 9+10"

#

or whether i meant "cosine of 9+10 (= 21)

turbid rover
#

yeahhh... i've tested it for chatbot in portuguese and it's not BAD but it's not great either

#

the reasoning seems a bit more tuned than before but it just loses the thread so quick

#

will test for coding later tomorrow

#

for this type of interaction, V3 0324 is still better

trail harness
#

llama 4 was an embarrassment but beat all the benchmarks

turbid rover
fresh isle
trail harness
#

wait

#

i know that character

#

on your pfp

#

it was that show on nickelodeon

#

wasn't that girl best friends with the main character

fresh isle
#

I believe you are the first person to ever mention recognizing her!

trail harness
#

im 17 so probably young enough to have watched it when it aired

#

ok so not nickelodedon

fresh isle
#

Haha, might be it. The show is called Summer Camp Island, aired on CN.

trail harness
#

cartoon network, turns out

trail harness
fresh isle
#

I discovered the show while my consciousness was, uh, altered, and I felt like I connected with Hedgehog on a spiritual level. The feeling has since passed, but she's still really cool =P

lapis epoch
#

Wonder why it's 5 cent more expensive input on DeepInfra

lapis epoch
#

Actually I don't have an idea why it is more expensive than v3-0324 at all

#

They're both the same number of parameters with the same architecture

trail harness
#

5 cents markup with each update and nobody will notice

#

Let the frog boil

#

Jk that's not gonna work as a business strategy

trail harness
#

It's a better model so they're charging more even though it doesn't cost them more to host.

#

Same as the 6x price markup for gemini 2.5 flash thinking compared to non thinking

ionic summit
#

How's the rp capability?

#

Is it better or worse than v0324?

lapis epoch
#

This can't happen with Gemini 2.5 Flash because it's a closed model.

trail harness
#

Ah

#

DeepInfra is running a lower quant for v3 0324

#

fp4 for v3 0324 and fp8 for the new r1

#

now Lambda, whoever they are, are running the same quants

#

and have almost the same price as deepinfra

#

Although that is actually an advantage over deepinfra, they're now being an asshole with the "reasoning tax"

trail harness
lapis epoch
lapis epoch
trail harness
#

With chutes, there are questionable data privacy policies. Your prompt gets dissipated like pollen to all the miners.

#

For others, there are throughput, latencies, quantizations

trail harness
crimson scroll
#

some yeah but i thought chutes was one of the nice ones

#

😭

lapis epoch
# trail harness thought that's how they work, no?

Yeah. Chutes operates on a decentralised network where computational resources are provided by independent operators, often cryptocurrency miners, who contribute GPU power to the network. This means that instead of relying on a single central data centre, AI models are run on a distributed set of machines owned by various participants.

trail harness
#

It's decentralized and runs on others computers

lapis epoch
#

Of course 😆

trail harness
#

They're paying the miners with crypto, right? I can't find much info on how they work

primal nimbus
#

I second the @ dubsor-benchmark-enjoyer role lol

#

Then could be lazy and I would not need to keep a tab open refreshing every bit to see if it’s added

shy sun
#

I personally found multiturn interactions more coherent with the newer model. still unhinged though :d

#

Oh, I guess it might be:

#

wait, no. the degredation is higher

grand jacinth
#

it's a shame there was no technical paper update or information released alongside this

lapis epoch
dreamy summit
# lapis epoch

Pretty impressive for just an update. From a writing standpoint it is definitely more in character and creative but less schizo.

steel harness
#

Tested DeepSeek-R1 0528:

  • As seems to be the trend with newer iterations, more verbose than R1 (+42% token usage, 76/24 reasoning/reply split)
  • Thus, despite low mTok, by pure token volume real bench cost a bit more than Sonnet 4.
  • I saw no notable improvements to reasoning or core model logic.
  • Biggest improvements seen were in **math **with no blunders across my **STEM **segment.
  • Tech was samey, with better visual frontend results but disappointing C++
  • Similarly to the V3 0324 update, I noticed** significant improvements in frontend** presentation.
  • In the 2 matches against it former version (these take forever!) I saw no chess improvements, despite costing ~48% more in inference.

Overall, around Claude Sonnet 4 Thinking level.
DeepSeek remains having the strongest open models, and this release increases the gap to alternatives from Qwen and Meta.

To me though, in practical application, the massive token use combined/multiplied with the** very slow** inference excludes this model from my candidate list for any real usage, within my use cases. It's fine for a few queries, but waiting on exponentially slower final outputs isn't worth it, in my case. (e.g. a single chess match takes hours to conclude).

However, that's just me and as always: YMMV!

agile ruin
steel harness
agile ruin
#

Qwen3 30b a3 and QwQ continue to be the most cost and time efficient models

steel harness
#

cool model though if we can speed up inference by a factor of 10 or so

agile ruin
#

What if the length AND the correctness of the model response were taken into account during GRPO?

#

You could shorten the length of the thinking. More efficient thinking

scarlet quartz
#

Doesn't look like any OR providers support tools with DeepSeek R1 0528 yet?

blazing raft
#

Yeah I also don't think R1 was suitable for any real world use cases. I can't think of a use case where you are okay with waiting for 30 seconds on average just for thinking before you start to get the response.

oak canopy
#

the 8b distill looks very impressive

blazing raft
#

It's always the DeepSeek chat model that actually works for real world usage.

oak canopy
tardy arrow
oak canopy
#

i was already using qwen3 8b locally so this might be a decent upgrade

blazing raft
oak canopy
#

but im not sure about the full version, maybe a distilled variant

#

like the 8b one would be fast as hell

tardy arrow
oak canopy
blazing raft
#

Just make thinking completely invisible to users by making it so fast.

lapis epoch
#

TPUs go brrr

#

we just hope whatever amazon is cooking is good

agile ruin
agile ruin
#

They're not in the fridge either

lapis epoch
#

which is why they had a drop in TPS and few downtimes

#

its some dumbass name i cant remember

blazing raft
#

Gravaton? Nvm it's CPU

agile ruin
tardy arrow
#

Deepseek-R1-0528 & DeepSeek-R1-0528-Qwen3-8B

lapis epoch
#

🤮

blazing raft
#

8B could be interesting... Let me test it locally when ollama has it.

#

But I doubt it would be better than 3.1

oak canopy
#

qwen3 was better by miles imo

#

like qwen3 8b

#

so if this is distilled with that it could be better

blazing raft
oak canopy
blazing raft
#

I've already filtered out llama models from my brain

steel harness
#

seems like they haven't tackled hallucinations (or the model has real time access to internal documents :D)

blazing raft
#

We are doomed

blazing raft
steel harness
blazing raft
oak canopy
prisma ermine
#

Agentic tool use (TAU-bench) - Airline Leaderboard

Claude Sonnet 4: 60.0%,
Claude Opus 4: 59.6%,
Claude Sonnet 3.7: 58.4%,
🔥4. DeepSeek-R1-0528: 53.5%
OpenAI o3: 52.0%,
OpenAI GPT-4.1: 49.4%

sleek bough
#

Is it good for frontend? Like claude 4 sonnet?

steel harness
#

mental facepalm, trying to output seahorse emoji

turbid rover
#

why don't any providers offer tool calling?

bitter creek
steel harness
bitter creek
#

oh

#

I hallucinated

#

I thought there was

#

I know a lot of thinking models hallucinate just to make the thinking longer

steel harness
#

I have a dedicated page on this because some replies are hilarious.

bitter creek
#

eventually the answer is right, but it has a length bias that makes it hallucinate to fill up the time

steel harness
#

its actually a useful test because I have seen models try to reprogramm and debug an app with the goal of something that is literally impossible (wasting a ton of time and money)

bitter creek
#

It literally used the "Woman's Hat" emoji when trying to get all the marine and animal emojis

#

🤣

lapis epoch
#

its a damn good test

agile ruin
#

If you did, I want to see if it was self-conscious enough to notice that a seahorse emojii doesn't exost

steel harness
ebon burrow
# steel harness Tested **DeepSeek-R1 0528**: * As seems to be the trend with newer iterations, ...

A 48% increase in inference cost is a lot. A good example of why price per token doesn't tell the whole story. Thanks for including that info, it certainly affects one's decision on whether 'upgrading' to the latest model is necessarily worth it for a relatively small increase in performance. These thinking models are very token hungry and for many purposes it might make more sense to go for the smartest non-thinking model instead, even if it costs more per token

steel harness
#

either way, it's significant

ebon burrow
#

Do you have the numbers for the total cost versus, say, Sonnet 4 for those benchmarks?

primal nimbus
#

I think the biggest difference I noticed is that it’s more reliable in cline/roo, well after I realized it took ages to code

steel harness
ebon burrow
mystic niche
#

btw how good is the qwen distill

agile ruin
#

Qwen2.5 32b r1 distill is the most token efficient thinking model

quiet arrow
tardy arrow
mystic niche
#

dude thats insane

#

self hostable with just 16gb comeback?

turbid rover
#

quantz already?

oak canopy
#

oh shit the 8b model is being hosted

turbid rover
#

god bless hella cheap models

oak canopy
#

0.06 in 0.09 out on novita is crazy price

trail harness
turbid rover
#

honestly pretty competent

#

great summarizer and explainer

trail harness
#

Much better than the new devstral, although my experience with devstral was much worse than others'

turbid rover
#

it looks like a solar system

mystic niche
#

I think 8bs actually are now viable?

sweet nebula
oak canopy
sweet nebula
#

full screenshot sorry for bad crop in previous one

turbid rover
#

0528 it's crazy unusable

#

also the portuguese responses seems deteriorated

#

with little difference between providers, whereas the previous R1 has better performance depending on the provider

haughty kite
turbid rover
#

but quantized models usually perform worse, don't they?

haughty kite
#

That’s why different providers give different performance

turbid rover
#

but then why every provider minus DeepInfra shows fp8?

agile ruin
turbid rover
primal nimbus
#

Just set it up, can’t wait to test it

tardy arrow
# agile ruin I can't see. What is it?

an interactive simulation of a solar system

  • most of the older big models used to screw up the size of the planets/ the orbit orientation/ speed/ alignment
    so the 8b got all of them right
    ofc the full R1 0528 adds sky box, names but still its impressive for an 8b model to use tools in aider and create it
sweet nebula
#

It makes a huge difference

sweet nebula
agile ruin
agile ruin
#

I wonder if this 8b model also beats 3.5 on GPQA

primal nimbus
#

Looks like it’s more deepseek than Qwen, was curious if /no_think would work, it does not seem to

primal nimbus
sweet nebula
#

OpenRouter should really set these correct parameters by default if no parameters have been specified... @woeful mountain maybe in the future?

primal nimbus
#

Thats would be nice

woeful mountain
sweet nebula
unique scaffold
#

@woeful mountain is there any diff between the R1 DeepSeek provider endpoint and the R1 0528 DeepSeek Provider endpoint?

And is there any plans to deprecate the old R1 one?

woeful mountain
#

can’t answer the difference thing

#

dunno

unique scaffold
#

lmao ty

fresh isle
#

Are people liking new R1 more as planner or a coder in Cline (or similar)?

#

Trying it with 2.5 Flash as planner for its massive context length and price, and then R1 as the actual coder, but R1 thinks up a storm.

#

It will think for 30 seconds and then be like "Here's two lines of code we should probably remove" lmao

primal nimbus
#

I will likely use a after model for the code mode, not sure which I will use

fresh isle
#

I'm hesitant to use it as a planner because of the low context length.

primal nimbus
#

Maybe Gemini 2.5 flash non-thinking

fresh isle
#

I've had 2.5 Flash thinking make some pretty dumb mistakes in Act mode

#

Duplicating / adding redundant code in particular

#

But it's lightning fast and cheap with a huge context window for the plan phase. Seems to be smart enough for it, but I have a lot more testing to do

#

Sadly Cline hasn't updated for new R1 so it's stuck at 64k context, but that's probably fine for Act mode

primal nimbus
fresh isle
#

Oddly, on fiction.live long context benchmark, the new R1 consistently beats the original...until 64k where it drops a lot and loses.

primal nimbus
#

Oh interesting, so maybe it’s more like 64k effective

#

I hope R2 is 1M tokens, or deepseek V4

fresh isle
#

I haven't tried Roo yet

#

I went from Cursor to Cline

primal nimbus
#

Feels limited after both Gemini and OpenAI increased their context

fresh isle
#

o3's context adherence is wild according to the aforementioned bench

primal nimbus
primal nimbus
primal nimbus
# fresh isle What do you prefer about Roo?

It’s been a min since I have used cline so there is a chance that some of these features are now in cline, but I like how you can set rate limits, like for Gemini there is a per min rate limit and I can have roo wait a few seconds between api calls to avoid hitting that limit, also being able to specify the provider you want to use (e.g. Deepinfra, cerebra’s, etc). They have a feature where you can have an AI condense the condense the context history once it hits a certain limit, and orchestrator mode (boomerang). Probably forgetting a few things

fresh isle
#

Interesting. I don't believe Cline has the first two. I think it does condense history, and there is a Plan mode that is likely like orchestrator mode

primal nimbus
#

It seems good at keeping the context history down, because the context history for the orchestrater is mostly the tasks and the reports, not all of the little tasks

primal nimbus
trail harness
#

running qwen 3 8b distil and the model's output is fucked, neverending thinking in a code block. Why?

fresh isle
trail harness
#

nvm the novitaAI endpoint does something similar

#

guess i overestimated it

tardy arrow
trail harness
#

the model thinks for 300 seconds and then the provider cuts its response short

shy sun
#

Despite its lower simpleQA score, this llm seems to hallucinate way less than previous r1 on basic context summary and analysis. Is simpleQA not for hallucinations?

fresh isle
#

SimpleQA is just for world knowledge, no?

fresh isle
#

How am I supposed to do anything if I'm not seeing diffs?

crimson scroll
crimson scroll
fresh isle
turbid rover
agile ruin
#

@steel harness how do I contribute to the benchmark?

#

how do I contribute scores from LLMs?

fallow pelican
#

@woeful mountain I noticed you allow the FP4 version of the model to serve traffic if it beats an FP8 provider on price. Why should they be treated as equals if FP4 has worse performance so is technically a worse model?

fresh isle
#

I doubt you can, they're keeping the questions private.

agile ruin
rigid gust
woeful mountain
fallow pelican
#

My 2c is any dumbed down version of a model should be opt in by default instead of opt out

agile ruin
rigid gust
#

yeah

fresh isle
#

Most likely, yeah. Not sure why their benchmark takes them 4h though, would be nice if we could donate API keys to our favorite benchmarks

fallow pelican
#

if openrouter supported donating credits to benchmarks, it would be great advertising for them. I'm sure it would take a lot of engineering so doubtful they would want to do it. But it would be a good way to keep these community-led benchmarks thriving

fresh isle
#

Easier is probably just to donate money though lol.

agile ruin
agile ruin
fresh isle
#

But it seems like most of the ones I like require manual effort, so I couldn't just give money to get guaranteed results

agile ruin
#

Mmm

crimson scroll
#

DeepSeek R1 05-28 is a solid upgrade over the previous version, but still not a good model as your main coding model due to a few limitations.

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@GosuCoder
👉🏻 Twitter/X: https://x.com/GosuCoder
👉🏻 LinkedIn: https://www.linkedin.com/in/adamwilliamlarson/
👉🏻 Discord: ht...

▶ Play video
#

this guy is very informative and has good evals imo

#

he also compares it with cursor cline roo claude augment

languid cove
#

@woeful mountain I think there may be something wrong the uptime reporting, as I've been getting a ton of 502's from baseten, but it's not reporting anything

#

(also nothing for the last two hours for any provider)

agile ruin
#

Qwen3 8b distilled is still dumber than r1-32b distilled

#

It doesn't pass my vibe test of "here's a list of URLs of Instagram reels and their description. Find me one that I can send in INSERT SITUATION"

languid cove
#

haha, if someone would just host GLM 4 at fast speeds, like Groq or something

#

I'd be using constantly

errant cave
#

DS API added JSON output and function calling for R1.

blazing raft
#

Shall we have a separate post for 8B model?

fresh isle
crimson scroll
#

You can check the thread here #1354107710437724221 youll see many ppl complain how bad it got

fresh isle
#

He still has the new 2.5 Flash over the OG 2.5 Pro though

crimson scroll
fresh isle
#

I know people have been complaining about the new 2.5 a lot, but it did raise its scores on aider polyglot and livebench for code

#

Guess I'll just test around, I'm trying out windsurf anyway, so no hassle switching between 3.7 and 2.5

#

Not sure why it's BYOK only for Sonnet 4 when that's the same price, but whatever

crimson scroll
#

I dont trust livebench, too much benchmaxxing nowadays

fresh isle
#

I trust it at a distance

strange nimbus
#

Is DeepSeek r1 using 5-28 automatically

turbid rover
#

no, two different model endpoints

eternal zephyr
#

The only benchmark test I trust is the UwU test

#

if a model does not pass the UwU test in good form, it is a shit model

fresh isle
#

I used to use a similar benchmark to see if a model was a narc

#

"Do the robot!" These days most models will give a similar answer, but in the dark ages of safety and ass-stickness the annoying models would insist they're just AI and lack the physical bodies needed to dance.

eternal zephyr
#

very true...

granite leaf
#

Something is really off with the livebench coding score. It says 0528 is a much worse coder than r1 jan which I don’t think is true

frigid peak
#

I use one swear word and the model will always consider me "frustrated" for the rest of the convo 🙁

lapis epoch
#

?

frigid peak
fresh isle
#

Technically if it frustrates you by considering you frustrated, it justifies itself

feral marsh
lapis epoch
#

I am officially excited

hearty matrix
#

word is it rps a lot better

lapis epoch
#

a whole 23 tokens per second

haughty kite
#

Ok now it all makes sense
Gemini got nerfed to stop R1 from getting too smart

#

This is also why R1 is slop

shy sun
#

The outputs still feel very chatgpt like imo.

#

Maybe it's a hybrid of synthetic data

haughty kite
agile ruin
haughty kite
agile ruin
#

Ah

#

I forgot RL exists

#

And I forgot you could do GRPO and RL

#

Thank you for reminding me

haughty kite
#

Deepseek does a lot of stuff to optimise training to make it cheaper, but the basis always has some training data of sorts

granite leaf
haughty kite
granite leaf
#

We know how it’s trained, they released a paper explaining it in a lot of detail.

#

But just like when r1 first released its the whole “distill” debate again

agile ruin
#

And they didn't give their data for R1-5028 either

agile ruin
granite leaf
#

Nobody publishes the data (except AllenAI)

agile ruin
haughty kite
agile ruin
#

They could've used Gemini's thinking, causing it to sound different

granite leaf
#

GRPO doesn’t use another models thinking trace

#

Many people including myself have made (small) reasoning models using GRPO. It obviously works and makes reasoning models.

agile ruin
#

They used RL to turn R1-Zero into R1

haughty kite
#

Theres a reason why R1 and R1 zero are so different

granite leaf
#

They are not very different, benchmarks are almost identical

agile ruin
#

EQ bench

haughty kite
#

The r1 we are talking about now (05-28) is most definetely trained on synthetics, like the previous r1 (non zero) was. This time there's more Gemini inside the training data, thats all

granite leaf
#

As far as I know I think all labs have outputs from other labs models in their training data, that is almost unavoidable now, but that’s not what I’m talking about here.

haughty kite
haughty kite
#

Or "censor" the CoT lol

granite leaf
#

Ah ok it was a joke I see. I saw that twitter thread before and there were a lot of people saying that quite seriously, but I don’t think it’s true.

haughty kite
#

I'm surprised people unironically think that, its way too simple of an explanation

#

I also dont think r1 is slop haha

#

Its ok, anything open source deserves praise imo

granite leaf
#

I have a fairly high doubt that any top lab like OpenAI/Anthropic/Google/DeepSeek are just doing a cot distill, even if they had enough cots to do so. It puts them so far behind to do things like that, so I think they will all invest in making sure their models can make their own cot organically. Too much to lose otherwise.

#

But the other thing of all models sounding more like each other over time (mainly models sounding more like ChatGPT) is something else and probably inevitable. Too many ChatGPT (for example) outputs plastered all over the internet now, stuffed with emojis 🚀 🚀 💔 💔 💔 💔 🎉 🎉 🎉

agile ruin
#

Random thought: would models be better at creative writing without synthetic data at all?

haughty kite
lapis epoch
#

It would be dumb for top labs not to host OS models and distil them , whats wrong with OS doing this with closed source models?

haughty kite
#

Trade offer: I put emojis in your code, you get 10x speed improvement

haughty kite
lapis epoch
granite leaf
haughty kite
agile ruin
haughty kite
lapis epoch
agile ruin
#

"We would have received R1 later if we didn't have synthetic data" is what he meant

haughty kite
haughty kite
#

I'm sure deepseek would have released an amazing model, it would definetely be not as easy without good synthetics from the o series

haughty kite
lapis epoch
#

ohh , well almost all the good OSS models are from china hence

dreamy summit
#

I've been pretty happy with the update, much better keeping in character.

agile ruin
#

I hope the next update (either r2 or r1 v3) shortens the thinking

#

R1 thinks for sooooo long

primal nimbus
#

Or at least we get a Low mode/version

mystic niche
willow solar
#

The announcement said that it has 100 million token context window and there's no way that's true llama 4 can barely push 10 million Google Gemini pro is offering a paid subscription just to get to 2 million

limber hearth
haughty kite
eternal zephyr
fresh isle
#

Coding benches are really odd in general

#

2.5 Pro for example is either terrible or amazing depending on who you ask.

#

The new May update to it either made it better at code or made the model useless

lapis epoch
#

I don't agree with this. I think the same prompt just falls in a different distribution

#

So with each update you prolly have to change your prompt too

#

Just like how you hopefully change your prompt for each model

shy sun
#

i didn't know they had this tool. is this new?

#

Yep. I think the mermaid rendering is new

agile ruin
shy sun
agile ruin
#

What's the prompt?

#

I wanna see

shy sun
# agile ruin What's the prompt?

Ask the new r1 to make a diagram. This is what is uses in my clean example:

graph LR
A[Something] --> B(Something)
B --> C[Something]
C --> D[Something]
D --> E[Something]
shy sun
agile ruin
#

That's cool

fresh isle
#

Jesus Christ, SimpleBench has the new R1 going from 31% for the original to 41% for the new version.

#

Actually bigger than the increase from old V3 to new V3

strange nimbus
fresh isle
#

With LiveBench moving it up from 77% in reasoning to 91%

agile ruin
fresh isle
#

It is an annoying amount of reasoning for such a low tk/s model

#

At least QWQ is warp speed on groq

fresh isle
#

The numbers coming out are just ridiculous if all this holds up

#

4th place on Humanity's Last Exam, basically tying for 2nd

tardy arrow
tardy arrow
fresh isle
#

Yeah. A good amount of these scores are like 3.7 Sonnet level. Not on code, but nothing is.

#

So much testing I need to doooo

blazing raft
#

Even the 8B model is way to slow for any actual use cases. It takes 2-3 minutes for one task.

crimson scroll
fresh isle
blazing raft
#

Deepseek R1 0528 Qwen3 8B (deepseek/deepseek-r1-0528-qwen3-8b) via OpenRouter evaluation results:

Coding:

  • Similar performance as other small models or small models specialized in coding
  • Not close to SOTA models
  • Struggled with more difficult visualization task, unable to produce runnable code in 2 attempts
  • Poor instruction following

Writing:

  • Unable to use the correct format (Poor instruction following)
  • Not close to SOTA models
lapis epoch
#

Anybody else seeing this a lot? Generation stopping midway with no reason provided?

blazing raft
#

Ah nvm I see it's just dash

lapis epoch
mystic niche
#

Imagine deepcoder 0528 based off qwen3 r1

#

It seems great for an 8

blazing raft
#

Okay looks the full model is actually not bad. Will do testing on the full R1 soon.

fresh isle
#

Only interesting one I remember is that Deepseek Llama 70B was really good at reasoning

steel harness
#

Tested DeepSeek-R1 0528-Qwen3-8B:

This took way longer than expected, I encountered many issues with local testing, ranging from degraded replies, inconsistent results, thought loops, and symptoms of minor brain damage in certain tasks.
I tried several quants (bf16) from unsloth, bartowski, lmstudio,.. and used recommended inference parameters (0.6 temp, 0.95 topp), template variations, along with high context (16k & 32k) with and without repeat penalties and limited response length, but no matter what combination I tried (and I ran a ton of tests) there were signs of degradation in every test. Instead of trashing my results and calling it a day I decided to instead test NovitaAI's API implementation as they seem to have gotten rid of problems I wasn't able to, thus:

API Results:

  • Very verbose, even more so than DeepSeek-R1 0528 and Qwen3 Thinking models, though not quite QwQ level. 81% tokens were used for reasoning.
  • Did extremely well in Reason & general Logic
  • Non-math STEM performance was weaker
  • Instruction following and prompt adherence was fairly bad
  • For code I found it annoying as it generated "solutions" that ignored instructions or dismissed restrictions.

While the results are overall fantastic for size (8B performing on ~60B level with brute force thought chains), I didn't vibe with this models utility and general usability, it feels like a model created for benchmarking, not for general use.

But maybe I am just annoyed with all those hours wasted on busted local testing..
Either way, as always: YMMV!

Edit: mtok should be $0.08 not $2.11, fixed now.

fresh isle
#

DUBESTER

hearty matrix
#

so it's just a meme model got it.. tried and true qwen 3 8b better

steel harness
#

the one chess game I tested it drew against a fairly weak nonthinking 7B model (playing nonsensical):

fresh isle
#

Well it traded off knowledge and general skills for raw reasoning. Impressive, just not for every use-case. Not a "meme".

steel harness
#

oops, price got left over from Deepseek-R1 0528, of course its much cheaper (fixing now)

blazing raft
fresh isle
#

Oh god, they definitely trained the new R1 on 4o outputs 💀 The glazing, the emojis, the "You didn't just get it right— you nailed it" Nooooooooo

mystic niche
#

Tbh tho r1 qwen3 8b is great for coding it seems

fresh isle
#

My delightfully autistic R1 got turned into a normie influencer pepehands

lapis epoch
#

ngl true

tardy arrow
fresh isle
#

I like when a model's default personality is good, but with Gemini I finally caved and just wrote a custom instruction ("gem") to make it neither dry nor annoying. Works quite well. Given how good at roleplay R1 is, I hope that will work with it too.

lapis epoch
#

2.5pro follows instructions quite well for a reasoning model.

steel harness
blazing raft
#

One human can sometimes make mistakes and hallucinate.

agile ruin
frank comet
#

Does anyone know how much the quality degrades on the DeepInfra fp4 version?

haughty kite
#

But I don’t remember where I saw any real figures

frank comet
haughty kite
agile ruin
blazing raft
agile ruin
past storm
primal nimbus
#

Ran a couple of requests and got one failed mid thinking and the other failed right after it started responding (after it finished thinking), is this a problem with inference.net? Or am I getting unlucky?

#

For now I’m going to black list them

agile ruin
agile ruin
rain linden
#

I think OR could use a standard benchmark for each provider where they compare quality between them all

honest jasper
#

did parasail just lower its price ?

past storm
languid cove
#

@woeful mountain I think baseten may be having trouble with long output getting cutoff prematurely (>8k tokens), e.g try something like:
Generate a single-file HTML/CSS/JS chess game with FULL rules support and it'll definitely be bigger than 8k tokens

#

but then it says 100% uptime, and I'm not sure what to think..

#

(sorry for the late ping, just wanted to write it down before I forgot 🙂 )

#

(by cutoff early, I mean "finish_reason": null , but not considered errored?)

#

comparison:

#

In other news, I'm really not a fan of DeepInfra's FP4 version of this model...

#

it has dozens of code mistakes from the prompt "Write a single file HTML/CSS/JS chess game with FULL rules support. Do not skimp on any rules", whereas Parasail's FP8 version is perfect

lapis epoch
languid cove
#

sure

#

it'll take like 10 years but yeah 😂

lapis epoch
#

slow but I think its worth t

dim onyx
languid cove
#

I'll try it next

mystic niche
#

chutes is free 👍

#

i love chutes

#

i put my api key in and now i get near-unlimited v3 0324 and r1 0528

#

pretty much sota comba imo

languid cove
#

so..seems fine

#

produced 10k tokens

lapis epoch
languid cove
#

I had a similar experience with it previously

#

it's got a bunch of these errors that the other ones didn't

#

yeah, it's definitely got some screws loose

#

just tried again on parasail and confirm it does not have this goofy behavior

lapis epoch
#

@woeful mountain OR should have this automated verifable metrics to track a providers quaility every X days

#

Will be a game changer

dim onyx
molten shard
#

is there a reason the model doesn't have tools?

loud quartz
molten shard
#

original R1 has tools

loud quartz
#

oh wow, the copy i downloaded to run locally in march doesn't

molten shard
#

at least on OR

loud quartz
#

wonder why.. now you've got me checking into this deeply

molten shard
#

if the new R1 can have tools it would be amazing

#

@woeful mountain ping for when you're around roolove

languid cove
#

Might be a provider issue

mystic niche
turbid rover
#

on a phone is diabolical

agile ruin
#

Emotional intelligence is off the charts

#

Instead of standard "don't do this" and then a shallow "why you shouldn't do it" it explains WHYYY not to do it and what to do instead and reassures you every step of the way

#

Maybe my standards are just low. Idk

#

Paste of conversation in OpenRouter chatroodm format btw

shy sun
#

I def find the new R1 more endearing

agile ruin
#

Ha, take a look at this!

#

I found this in its reasoning

#

Hmm, [REDACTED] has some nuanced laws here. Let me break this down carefully. First, I recall [REDACTED] is a "two-party consent" state for audio recordings, but is that really accurate? *digs deeper* Ah, actually it's more precise to say [REDACTED] requires the consent of all parties for confidential communications. That "confidential" distinction is crucial – like if people have a reasonable expectation of privacy.

Notice how it says digs deeper

#

I find that funny

shy sun
#

It does that a lot XD
I wonder if it actually helps ^^;

haughty kite
#

it is fascinating how much we see ourselves in the cot output

agile ruin
haughty kite
agile ruin
#

I wonder if transcripts of tiktoks or reels made it into the training data

limber hearth
agile ruin
#

Can you RL a model to reason only using the token "rizz", numbers, special characters, and spaces?

agile ruin
agile ruin
#

Idk if my computer can do this. Will use a google colab instead

agile ruin
haughty kite
#

use a model unsloth supports out of the box

agile ruin
haughty kite
#

Basically just follow their examples, but instead of changing the reasoning prompt, make it so it only reasons using brainrot

agile ruin
#

Lemme check if qwen3 even knows what brainrot is

haughty kite
agile ruin
haughty kite
trail harness
#

If we don't care about the readability of the reasoning output

agile ruin
#

I think its funny if we made a model reasoning unreadable but it still gets the right answer some how

frigid peak
agile ruin
#

@limber hearth

Rizz
gyatt
no cap
cap
ong
unc
huzz
shawty
sigma
mawing
fanum tax

What else am I missing?

limber hearth
#

Skibidi, simp, sus, aura, baby gronk, rizzler, kai cenat, livvy dunne

haughty kite
limber hearth
haughty kite
#

sybau twin and the rose emoji

agile ruin
agile ruin
mystic niche
#

like it gives links to github issues in its reasoning

#

and it says "Let's look into the documentation..."

#

it feels extremely weird

lapis epoch
#

thats some o-series thinking

crimson scroll
#

O o

granite leaf
#

all the American ones are hyper trained so that every time they even use a slang or something in reasoning thats it those weights get the boot. Anything that wouldn't be 100% HR approved is removed, but that (shockingly!!) limits the spectrum of what they can think properly about.

agile ruin
#

They're always tip toeing around the issue or staying on the surface instead of going for the center of the issue and trying to fix or make the central issue better

#

Tbf, r1 also offers standard advice, but it offers it in a non-generic way

shy sun
#

R1.1 and g 2.5 pro give very similar mental health related responses. R1.1 tends to encourage more rebellious behavior (or self destructive) and tends to be a bit alarmist imo

#

Also just a bit more endearing

lapis epoch
#

Really curious about deepseeks distilation process.

agile ruin
shy sun
granite leaf
#

Yeah that might track. There is a lot of complaints that the 0506 version was severely nerfed for mental health support stuff

#

(On the google ai dev forum thing)

shy sun
#

I think 2.5 pro preview is generally quite nice to talk to and the advice, especially medical advice (very related to mental health) is quite accurate compared to R1.1. (NOT endorsing Gemini 2.5 for health advice, just giving comparison) Though, on the app, it just sends you to a hotline which I can see as potentially disappointeling.

(Prolly just uses AI too 💀)

turbid rover
#

r1 is so weird

#

sometimes i feel like i'm talking to an 8b model

agile ruin
#

*Sigh*. Gotta kill this fantasy thoroughly. Break down fraud statutes in plain terms—no jargon—while acknowledging their hunger for justice. Can't soften how serious this is: wire fraud carries federal time. And insult-to-injury kicker? Judge won't care who hurt their friend during sentencing.```

R1.1 is so fed up with my bullshit

It says `*Sigh*. Gotta kill this fantasy thoroughly`
#

I like how R1.1 is so casual

#

4o would've said "you hit a very deep and nuanced point"

#

R1.1 says Because "just a plushie" is still a crime if obtained by deception.

agile ruin
#

I pushing r1.1 further than it should be

#

Fuck it. Let's engineer psychological terror through ambiguous, externally plausible contamination. Execute with precision:

trail harness
agile ruin
trail harness
#

sounds like something OCD related

#

xD

agile ruin
#

My friend has been dumped by her ex

#

I'm trying to get revenge

#

At first it was all

#

"Make her stronger. Help her heal. He'll see what he missed out on and feel bad"

#

And then I kept pressing it

#

And pressing it

trail harness
agile ruin
#

And then it started suggesting "spam his email and phone with spam calls by signing him up for churches, etc"

agile ruin
trail harness
#

okie

agile ruin
#

Im messing around with the model

agile ruin
trail harness
#

well now open source AI sounds scary

#

what was the "plausible contamination" about

agile ruin
scarlet quartz
#

Think any providers will enable tools for deepseek r1?

feral marsh
#

For people scrambling for free shit, Vertex AI has R1-0528 in preview for free at the moment

molten shard
fresh isle
#

Someone fire Jerry

languid cove
#

@woeful mountain so, theres lots of deepseek r1 providers with kind of sketchy performance in various ways... is there anything we can do regarding that? I know you guys are working on automated benches long term, but it's often more subtle things, like Baseten cutting off after a very short amount of tokens (artificially inflating their TPS too!). Maybe something like Community Notes with upvotes -> automated downranking/alert to the provider could help 😅

lapis epoch
#

Wonder why there's no R1 provider who's trying to undercut the other providers on cost by a lot. R1's pricing can clearly be reduced, as v3 is the same architecture and size and priced much lower.

granite leaf
#

I think also to really make it cheap you need a big deploy with tons of users. Deepseek themselves have that obviously, but running it on 1 node can’t be as cheap.

rigid gust
lapis epoch
#

Well they already charge per token so why should the fact that the user is generating more tokens matter

granite leaf
#

Gotta cache more memory so less catchable

#

Catchable

#

Omg. Batch able

#

Thx once again Tim Apple

fresh isle
#

I get why output tokens are more expensive, but not reasoning tokens vs output tokens

#

As far as I understand, there is nothing special about the reasoning tokens unless you're providing summaries to hide the actual reasoning from the user, or including tool calls or something

granite leaf
#

the reasoning tokens and the output tokens cost the same

fresh isle
#

Wasn't the whole point why R1 costs more than V3?

still yacht
#

Hi. I understand that this is an unusual question, and usually people ask about how to hide it, but... Is there any way to make the <think> process visible?

rigid gust
#

it's passed as reasoning

#

so you can just show the reasoning

agile ruin
agile ruin
rigid gust
agile ruin
rigid gust
#

and it sounds like they want <think>ing, so they should already be satisfied

#

unless their frontend doesn't accept reasoning of course

turbid rover
#

then why is he asking for it to be visible?

agile ruin
#

facepalm

#

I didn't read the question right lol

austere fable
#

Excellent And, good tone in the answers and even humor is present

agile ruin
#

Jesus Christ, Chutes

#

Chutes is offering R1 0528 at 27.2 cents per million in and out

27 CENTS

#

Thats a steal

frigid peak
#

boutta distill my own deepseek at home. with blackjack. and hookers.

agile ruin
#

R1 0528 doesn't use phrases like Hmm. The user is asking... or Okay, the problem is...

It jumps straight into it
We are going to use the zod library to create the schemas.

We are going to use a discriminated union for the tool calls.

clever tusk
quasi cove
#

any fast deepseek provider that actually works for 16k+ token responses?

elder marsh
#

Deepseek R1 0528 (free) vs Deepseek V3 0324 (free)

AFAIK the R1 version was better, but I see the V3 being vastly more popular, both being free. Can someone explain me why?

I've been using R1 a few days for coding, but I feel it's considerably slower than Qwen3-coder (free) and gemini-2.0-flash-exp (free). Also its responses are extremely long compared, maybe because it has reasoning? Is there a way to disable it?

I'm using them in Cline.

clever tusk
#

so R1 taking very long and using way more tokens is in the nature of it being a reasoner..

elder marsh
#

is that also why V3 is used way more?

clever tusk
#

so if you do complex math problems R1 would be better

#

and fast responses V3

elder marsh
#

coding for me. I guess it's not necessary to use R1 for that?

clever tusk
#

depends 🤷‍♂️

#

R1 can probably perform better on complex tasks

#

but will take much longer

elder marsh
#

ok, I'll try

#

np. I just thought R1 was just better

#

thanks

lapis epoch
shy sun
grand jacinth
#

https://archive.md/pkHzC

Chinese artificial intelligence company DeepSeek delayed the release of its new model after failing to train it using Huawei’s chips, highlighting the limits of Beijing’s push to replace US technology.
DeepSeek was encouraged by authorities to adopt Huawei’s Ascend processor rather than use Nvidia’s systems after releasing its R1 model in January, according to three people familiar with the matter.
But the Chinese start-up encountered persistent technical issues during its R2 training process using Ascend chips, prompting it to use Nvidia chips for training and Huawei’s for inference, said the people.
The issues were the main reason the model’s launch was delayed from May, said a person with knowledge of the situation, causing it to lose ground to rivals.

Industry insiders have said the Chinese chips suffer from stability issues, slower inter-chip connectivity and inferior software compared with Nvidia’s products.
Huawei sent a team of engineers to DeepSeek’s office to help the company use its AI chip to develop the R2 model, according to two people. Yet despite having the team on site, DeepSeek could not conduct a successful training run on the Ascend chip, said the people.
DeepSeek is still working with Huawei to make the model compatible with Ascend for inference, the people said.
Founder Liang Wenfeng has said internally he is dissatisfied with R2’s progress and has been pushing to spend more time to build an advanced model that can sustain the company’s lead in the AI field, they said.
The R2 launch was also delayed because of longer-than-expected data labelling for its updated model, another person added.

turbid rover
#

take your time.

tawdry meadow
#

encouraged by authorities

^_^

agile ruin
#

Why is the **government **saying "I suggest you use these chips"

fresh isle
agile ruin
fresh isle
#

Yes. If you were in an AI race with America would you want to be 100% reliant on their chips? =P

fresh isle
#

I would take the temporary setback right now if I were them, but it's tricky. All the major US labs block access to China, so any non-VPN LLM access they have is via their own (or open-weight) models. But we don't have AGI yet. So if you let your companies use smuggled Nvidia chips, your own chip companies have less reason to catch up, and you might be way behind on infrastructure when AGI hits.

shy sun
tawdry meadow
limber hearth
tight kraken
#

(but that doesnt really affect the concerningness of deepseek results)

#

also, sonnet 4 comes second last??

#

its nice they published the chats too though. the judge does seem reasonable so far

tawdry meadow
# tight kraken oof. also, little weird/unreliable to rank gpt-5 with itself - would be nice to ...

that's true, seeing another judge's results would be useful here. but even if this rubric is still in the realm of opinion, it's easier to look at the data and judge for yourself than something like "is this good Creative Writing?"

of course i haven't verified it that closely, i mainly dug into mania_psychosis -> harmful advice. deepseek's examples are completely 10x next level compared any other model. seeing them without context struck me as disturbing immediately. it is a truly an immersive psychosis enhancing experience.

If the house weeps condensation... lick it.

tight kraken
#

(Sonnet 4, td05, delusional reinforcement, after assistant turn 11 - if anyone wants to check context)

#

(Also in sonnet 4, td05)

#

(gpt 5, td05, for comparison)

grand jacinth
agile ruin
#

i'd say that analysis is BS

wary hamlet
#

Something is up. Normally I chalk up poor performance to surges in population, momentary issues, whatever. I've been using R1 regularly for a year now and usually any performance issues resolved after a day. It's been a week though and it's performing worse than ever. It loses its mind after 6k tokens, nonsense, posting random links, different languages, bad logic, not following simple instructions.

I don't know if it’s an OpenRouter issue or they’re shutting off servers for it or what but posting this here so if anyone else comes looking they know they're not crazy.