#DeepSeek V3.2

384 messages · Page 1 of 1 (latest)

agile fossil
violet hatch
serene mirage
agile fossil
agile fossil
serene mirage
#

AI是"他"或者"它"?

violet hatch
#

i doubt anyone would have the internet speed to download it that quickly

agile fossil
hexed cloud
#

Terminus my ass

violet hatch
#

👀

violet hatch
#

seems like they use a new attention method which leads to very cheap inference

#

good sign that benchmarks with tool calling still remain intact, means the attention mechanism isnt terrible

agile fossil
kindred abyss
#

Its cheaper 😮 nice

violet hatch
#

on api

#

@shrewd gate

agile fossil
serene mirage
violet hatch
#

output is like 3x cheaper and input is 2x cheaper

serene mirage
#

Is it a promotional period or is that the permanent new prices?

violet hatch
#

not sure but id say they're likely to stay like that since the inference is cheaper

violet hatch
serene mirage
#

v3.1terminus is still available at v3.2 rates it seems

#

For around 3 weeks

violet hatch
#

questionably long base url

serene mirage
deep verge
#

wow so cheap than terminus

#

what the hell

#

why even released terminus

violet hatch
deep verge
#

when is r2 coming though 😭

violet hatch
#

im not sure whether they'll even release r2, i think they'll stick with hybrid

deep verge
#

so v4 is r2?

violet hatch
#

maybe..

serene mirage
violet hatch
#

no confirmation

deep verge
#

aww....thats so long

sharp bloom
violet hatch
#

i think they'll go with hybrid because it pretty much gives them double the space to run their models, no need to host 2 models, just 1 which can do both

serene mirage
#

V4 is probably based on this new Architecture somehow

serene mirage
willow pumice
#

Max output tokens seems low? Was it that low for 3.1?

violet hatch
serene mirage
shell remnant
#

As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention
Seems v4 is also very soon ?

violet hatch
#

i wonder if this means they'll increase context length in the future to something insane like 1M

#

cause if the scaling remains the same then it should be in theory really cheap to both train & run inference, beside the data

trail root
#

Qwen Next did something similar

#

So how many active parameters? 685B A37B?

violet hatch
charred torrent
serene mirage
#

huh, HLE went down

violet hatch
#

lower knowledge density maybe

#

though this is a "experimental" version

serene mirage
#

is HLE that bench that requires a bunch of specialist knowledge?

serene mirage
#

also were those results from thinking or non-thinking?

violet hatch
errant holly
#

🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model!

✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention(DSA) for faster, more efficient training & inference on long context.
👉 Now live on App, Web, and API.
💰 API prices cut by 50%+!

1/n

trim musk
#

WeChat official account

#

Official app, web app, WeChat app have all been updated to this version

serene mirage
#

they're usually pretty fast at rolling out the new models to all their platforms

violet hatch
#

that was a quick rollout after their chat website, do they really get that much traffic to early test it that quickly 🤔

errant holly
#

Novita is already trying to add support. Once the OR team wakes up, i'm sure they're on it 🙂

serene mirage
#

im guessing the new architecture makes it not just a drop-in weights replacement?

violet hatch
#

probably not since the attention mechanism is entirely different

serene mirage
#

yeah all the previous (v3, r1, v3.1) are "DeepseekV3ForCausalLM"
the new model is "DeepseekV32ForCausalLM"

hollow anvil
#

Deepseek is becoming Qwen with small iterational releases

serene mirage
#

so the deepseek v3 architecture had:

  • v3
  • r1
  • v3 0324
  • r1 0528
  • v3.1
  • v3.1 terminus
#

all within around 10 months

deep verge
#

idk... i prefer a massive update rather than too many smaller versions so quickly.

olive leaf
#

That's massive savings!

serene mirage
#

yeah cause they got sparse attention now

agile fossil
#

Novita is coming

serene mirage
#

no more obscenely quadratic compute usage!!!!!

olive leaf
serene mirage
#

i can't find a research paper

#

maybe not out yet

#

i found the paper

#

well not quite a paper

#

more just an outline

sharp bloom
#

i like the point release updates. so often it feels like these llms are almost but not quite there yet. maybe the all the real life feedback helps dial them in

serene mirage
#

apparently they started with v3.1 terminus base

#

and just taped on the lightning indexer to that

#

and then trained that

#

they first train the lightning indexer to agree with the existing model's attention patterns

olive leaf
#

I'm 50% sure I'm misunderstanding

#

But I think the outline says that the indexer only chooses the topK tokens

#

If you set the topK to 40, it'll only consider the top 40 tokens in its attention, which makes it faster

#

Someone double check me

serene mirage
#

looks like it

#

each token only attends to a subset of the rest of the tokens

#

well it clearly works well so yeah

violet hatch
#

compute cost stays similar but memory usage is about 9x lower

serene mirage
#

cause it only has to do the full attention on the shortlisted tokens and only has to handle the 576d vectors for the rest of the tokens

violet hatch
#

so compute is still about the same

serene mirage
#

it computes all the tokens using the vectors with lower dimension though

violet hatch
#

though it could cut latency a bit

west moss
#

There are efficient algos for computing topk

#

It won’t calculate the whole thing, then sort it etc it will just use an optimized algorithm

hidden sequoia
#

I wonder if this will affect long context quality

#

Which was already average with DeepSeek

solemn cave
#

interesting

#

So... to my understanding, this is how DSA kinda works. Though I am not sure because I have a fever.

DeepSeek Sparse Attention (DSA):

  1. For every query vector the router (a tiny MLP) outputs k token-indices that will matter for this single query (k ≈ 1–4 % of context).

  2. Two index pools are merged:

    • local block – fixed-size window around the query (cache-friendly dense matmul).
    • global experts – the learned sparse indices anywhere in the 128 k span.
  3. Custom CUDA kernel (FlashMLA) gathers only those KV slots into a contiguous micro-matrix; the rest of KV-cache is never fetched.

  4. Scaled dot-product is done on this k×k tile; result is exactly the same size logits vector as dense attention, so downstream layers notice zero change.

  5. Softmax is computed only over the selected k positions, giving the same attention weights shape but O(k) instead of O(L) compute and memory.

  6. Gradients flow through the router; the index-selection is fully differentiable (straight-through Gumbel), so the sparsity pattern is jointly trained with the rest of the network—no post-hoc approximation.

shrewd gate
#

coming online shortly

shrewd gate
#

tools seem borked on their end

#

so launching without tools

violet hatch
shrewd gate
#

and terminus

#

our code didn't change

violet hatch
#

ah alright, i never checked it before so i assumed it wasnt like that

desert axle
violet hatch
#

hm

#

might be an issue from my code 🤔

coarse loom
#

has anyone tested long context performance yet?

desert axle
#

understood all relevant plot points and the connections between them

coarse loom
#

oh yeah

#

so better than before?

desert axle
#

can't say for sure, it's just one data point. but it's not terrible, at least

violet hatch
coarse loom
#

a zero shot

viscid thunder
violet hatch
#

yeah u aint got access to that

violet hatch
hidden sequoia
#

MrBeast ahh thumbnail

solemn cave
viscid thunder
#

Maybe with a Turbo api someone will do it

solemn cave
# viscid thunder Theoretically it could mean lowering batch size -> speed improvements

Lowering batch size can shave a couple percent off latency because each token now gets a slightly bigger slice of memory bandwidth, but that’s basically it—DRAM is still the ceiling. So, even a hypothetical “turbo” single-stream endpoint won’t suddenly jump to 2-3× speed; the 3-7× win is purely on the cloud-metered FLOP bill, not on wall-clock. A bummer, yea.

maiden thistle
#

Oh, wow, 1/4th of the output price is going to be great for reasoning

violet hatch
#

novita really creative with these prices

hidden sequoia
#

The difference is DeepSeek has caching

#

Working caching

maiden thistle
#

Gotta be #1 in the OR ranking

violet hatch
#

true but really now, 1 cent cheaper?

maiden thistle
#

Reminds me of a situation

#

Back then, we had some companies pricing their Llama 3 8B $0.001 cheaper than the other and they'd keep alternating who was #1

viscid thunder
heavy stag
#

Skeptically excited to see the performance on this. Quadratic computation is one of the major problems with LLMs unless RAG gets perfected

#

Not computation. You guys know what I mean. I slept 2 hours

violet hatch
#

im getting this error in the chatroom 🤔

#

no tokens for like 30 seconds then that happens

#

only with novita

hidden sequoia
#

Fiction livebench says 3.2 is better than 3.1 in long context, which is counter-logical?

kindred abyss
#

DSA achieves fine-grained sparse attention with minimal impact on output quality — boosting long-context performance & reducing compute cost.
they did state specifically that it boosts long-context performance

iron vessel
#

is it resulting on different outcome in other people test?

hidden sequoia
#

Well I specifically thought that processing 'some' tokens instead of 'all' tokens will result losing subtext in understanding creative writing, and it's just another step to agentic/coding benchmaxxing

#

So it's cheaper, faster AND better in roleplay? Meaning V4 could boast 1000b paramaters while having same or higher speen compared to v3

viscid thunder
#

Less tokens => more focus

#

But it’s a balance

#

In this case, it seems like a good balance

viscid thunder
#

Based on those numbers

hidden sequoia
viscid thunder
#

Ohh

#

Huh

hidden sequoia
#

58 v 71 @60k is massive

viscid thunder
#

Yea

solemn cave
# hidden sequoia So it's cheaper, faster AND better in roleplay? Meaning V4 could boast 1000b par...

DSA doesn’t “drop” subtext — it learns which tokens actually matter for each query, and those indices are different every layer and every token, so nuance survives. The 60k jump you see is mostly less noise accumulation (fewer low-relevance attention scores), not brute-force scale.

So yes, V4 can stack more params (or wider experts) without the old O(L²) tax, so role-play quality goes up while billable FLOPs stay flat or fall. it’s not just a benchmaxxing trick.

open grail
solemn cave
#

Do you really hate m-dash 😭

#

I always use that

hidden sequoia
#

That's what bot would say

subtle plank
#

is this a perfkrmance increase or nah?

#

i assume not but the sacings is wild

maiden thistle
#

It's not marketed as a quality improvement, so probably not

haughty ember
solemn cave
high iron
#

Is the new 3.2 model better for roleplay than 3.1?

#

And what is different between 3.1 and 3.2?

#

Please answer with ping. Thanks.

near forge
solemn cave
#

Nothing changed on the model's perf/quality

hexed cloud
#

@shrewd gate best default parameters?

shrewd gate
#

it's already set on the model

hexed cloud
#

I thought deepseek used temperature 0.3

#

It's set to 1

#

I know deepseek provider does the conversion

#

On their end

shrewd gate
#

from huggingface

hexed cloud
#

But what about the others?

#

Ah

#

I got 3 syntax errors with it with temp 1 so there's that

#

Also, if it's increased performance why are the two endpoints so slow?

solemn cave
# hexed cloud Also, if it's increased performance why are the two endpoints so slow?

the speed you feel is still gate-kept by the old infra. deepseek didnt roll out new gpu stacks, they just swapped the attention math inside the same containers. so the flops bill drops 3-7x for them, but the physical cards, pcie lanes and network hops are untouched -> same queuing, same throttle, same "slow". once providers re-tune batching / cache layouts for the sparse kernels you should see snappier responses, but for now its cheaper for them, not faster for us.

hexed cloud
#

is thisan ai response

maiden thistle
#

My thoughts exactly lol

hexed cloud
#

same old infra, but less maths to be done per token (If I understand right), so i'm within my right to expect faster responses

#

unless they intentionally throttle it

foggy falcon
solemn cave
#

I was building on the past interactions

hexed cloud
#

"for now its cheaper for them, not faster for us", "same X, same Y, same Z" and talking very personally as if to a customer

maiden thistle
#

It just sounds like asking AI to write all lowercase, there's a lot of slop in that response

foggy falcon
#

while this would be very normal to say, here in this discord everyone has their gpt-isms detectors on full blast

solemn cave
#

I guess I should stop doing that then?

hexed cloud
#

If that was actually you, I apologize. But also AI might have rewired your brain

solemn cave
#

Yep

#

I might've changed

foggy falcon
solemn cave
#

I was already being informal too because of earlier with Loinne

foggy falcon
#

If you want I can create an image of the key points of this conversation as a mind map for you.

solemn cave
#

Lmfao

foggy falcon
#

most people just like type one word

#

and then add "lol" in the next message or smth

maiden thistle
# solemn cave the speed you feel is still gate-kept by the old infra. deepseek didnt roll out ...

Well, in case you're wondering, your response has, in addition to what was pointed out:

  • A lot of "not x, but y" typical of AI (deepseek didn't ..., they ... / the bills drop ... but the physical cards / once you ... but for now ... / cheaper for them, not for us)
  • Weird, typical of AI phrasing ("gate-kept by old infra")
  • Suspicious, overly summarized, vague list of three elements (physical cards, pcie lanes and network hops)
solemn cave
solemn cave
foggy falcon
near forge
#

The just testmant, wasn't not, Y, but a tapestry of X.

maiden thistle
near forge
solemn cave
#

bye dawgs, I'm gonna go sleepa while my immune system fcks the fever

rain spruce
#

you can't really hide the prose style from people that use a bunch of different models on a daily

hidden sequoia
hearty gazelle
#

Anyone else getting random '<|end▁of▁thinking|>' written in some of their responses? It's not even somewhere predictable like the start or end of a response, it's just... between sentences as far as I can tell. It's not often, maybe one in every 10-20 responses or so?

high iron
high iron
hexed cloud
rain spruce
#

it's just a bit silly to do that in an AI related server

#

and i'm not saying there is something wrong with using it for writing, i do use it for long-form text because english isn't my first language

#

but there's no problem in saying that you've used AI to write and it's just funny when i see those GPTisms on messages

heavy stag
#

The cost savings are crazy. Novita had 3.1's input tokens at $0.27, and now with the 3.2 savings it's only $0.27 input!

maiden thistle
#

Great cost savings ||for Novita||

iron vessel
#

i don't know if someone feel this too, but it seems like deepseek is the most impactful in the AI fields.
i know for most people their model will not be the best but the fact that they experimenting with new things then releasing really good paper about it, show how they not just care about the money but also the future of technology for the people

#

even qwen didn't have this approch

serene mirage
#

Novita still no caching support?

solemn cave
#

I am at least happy that it didn't affect how I write in my first language.

since you guys understand newgen-stuff, maybe I can insert some into my explanations in the future. ❤️‍🩹

solemn cave
visual cape
hearty gazelle
opaque bolt
# iron vessel i don't know if someone feel this too, but it seems like deepseek is the most im...

Yes, Deepseek are my favorite of the Chinese participants. It just seems like more original research and training goes on here which makes their releases interesting, especially when they are being bold with architecture changes. I'm sure DeepSeek 3.2 is kind of a crazy idea they're throwing out there and that's why it's labelled experimental. Maybe it will underperform too much due to the attention differences, maybe it'll remain good enough for clearly most uses, and then it's a huge win.

opaque bolt
astral aurora
#

Will there be a free model of this?

queen gale
solemn cave
queen gale
#

V3.1 is handling heavy usage pretty fine, doenst drop below 90%

flat briar
agile fossil
#

Waiting other provider

hidden sequoia
#

This is getting out of hand

foggy falcon
solemn cave
maiden thistle
#

Yep, lol, companies do that

#

Wouldn't be surprised if they keep lowering it to get on top

foggy falcon
#

does OR route to them a lot though or are providers with caching preferred?

#

I guess when set to no logging / training deepseek is not in the race anyway

#

in the chat room the auto router routed me to deepseek now which is good to see..

pseudo tulip
#

We got a new challenger coming up

pseudo tulip
foggy falcon
#

I really want more providers to offer cheap caching

#

would be great to have a price war on that ..

pseudo tulip
#

From what I read in the docs, if it routes to a cache input provider once, it tries to route to that over and over to ensure cache hit

pseudo tulip
#

From their docs

#

But I notice a lot of cache misses when using openrouter for some reason

#

especially when you have 10-15 api calls simultaneously , some have cache miss some have hit, kinda random

hidden sequoia
#

Just enforce provider? It worked for me

pseudo tulip
# hidden sequoia Just enforce provider? It worked for me

I haven't tried it with deepseek but I had trouble with gemini flash 2.5. I used ai to add context to rag chunks, so the pre-fix was same for all the chunks, I processed 1 chunk first and then the rest simultaneously so they hit the cache of the first chunk. When I used deepseek official api I got almost 100% cache hit rate , with gemini it was quite random. I should have also gotten close to 100% but it was like 60% something, quite random. I did make sure to ensure it was only ai studio or vertex.

viscid thunder
#

so far in my tests, this model is stacking up quite favorably to Grok 4 Fast, which is now a very similar weight class. performing better on my own evals for a real application.

#

the tradeoff being speed, primarily

pseudo tulip
viscid thunder
#

yep

desert axle
prisma sable
#

Seems to be a slight lateral thinking regression vs 3.1 terminus
(From https://lateralbench.org )

pseudo tulip
maiden thistle
serene mirage
charred pollen
#

Has anybody else noticed 3.2 being much more repetitive than 3.1? It feels like it's not giving much attention to messages past the last 2-3, and I had it repeat the exact same answer in a multi-turn conversation.

serene mirage
#

(Sparse attention)

charred pollen
foggy falcon
#

that's probably why they called it Exp(erimental)

#

maybe it'll get better again with more training to offset this

#

while still being cheaper

west moss
#

I think it’s pretty much as good as terminus. Try using it vs DeepSeek themselves to test it.

agile fossil
#

A mini update

opaque bolt
# charred pollen Has anybody else noticed 3.2 being much more repetitive than 3.1? It feels like ...

I saw Fiction.liveBench which was kinda interesting on this topic. Non-thinking starts out performing very mediocre but oddly enough consistently mediocre which is in fact pretty decent for very long contexts...? But unusually poor for short contexts. Thinking High however performs well and what's nice is that it remains well over long contexts. Here's a screenshot and discussion on Reddit: https://www.reddit.com/r/singularity/comments/1ntmkah/fictionlivebench_tested_deepseek_32_qwenmax/

Reddit

Explore this post and more from the singularity community

#

So I'd use DS 3.2 Thinking if I want to boost attention. I'm not sure why this is so but I assume the thinking tokens helps it stay on track?

grim beacon
#

when will they release a model that can accept images

open grail
# opaque bolt I saw Fiction.liveBench which was kinda interesting on this topic. Non-thinking ...

The "thinking" section is likely acting as a secondary (and higher order) type of Associative Memory, eg:

You can think of Attention as a form of Associative Memory:

https://ml-jku.github.io/hopfield-layers/

but it can only deal with second-order interactions and this is where the O(n^2) comes from (third-order interactions would require O(n^3) and be completely impractical).

The way the reasoning section trawls over the same information over and over has the potential to selectively account for much higher order chains of interactions.

pseudo tulip
#

They take a lot of risks and try out a lot of new things, so far I believe they're focusing only on text input and text generation, but you never know

#

1 thing is for sure. They won't release a multimodal model just for the sake of having a deepseek multimodal model. They always bring something new to the table. Like they bought in Mode of experts and now this sparse attention with v3.2

iron vessel
grim beacon
charred pollen
heavy stag
#

I think this was bound to be a weird update. It's not like anyone expected more smarts when the big innovation is a form of sparse attention

iron vessel
#

But with good prompting the small model could be at the same level as the bigger one

heavy stag
#

I think at best it's that maybe the correlation is weaker than we thought.

#

I will say smaller models hold up better than a lot of us probably thought when they are given an ungodly amount of reasoning tokens

#

Namely QWQ and the new Qwen 80B

#

But they won't be acing HLE any time soon

serene mirage
keen brook
#

My experience in text adventure games with Dipsy v3.2 as the interactive system (official provider, reasoning on, Temperature 0 and 0.5):

  • Prices back down to v3 levels, massive
  • Carries v3.1's quirk of not spamming asterisks, good
  • No more "Somewhere, in the distance, X... Y..." and "Outside, X... Y..."
  • Follows instructions better, again from v3.1, so needs more explicit instructions
  • New kind of sloppa spam of "It's not about X, it's Y." / "It doesn't X; it Y..."
  • Starts to get things wrong a little past 32k to 64k, okayish
  • Knuckles whitening, breath hitches slop (I'm okay with this though)

Good model, if you can keep up with the slop.

keen brook
#

Indeed. Dipsy love

opaque bolt
#

Sounds good! Pricing is so good that even just sustained performance makes this one a clear success. I think the killer feature is being able to use premium providers without worry and not hit silly rate issues.

heavy stag
keen brook
keen brook
#

Oh, and don't forget about the spam on the smell of ozone. Deepseek models LOVE the word for some reason

ember karma
#

I'm still debating where to top-up though. Open router or deepseek official? I love deepseek and all but v3.1 (and I assume v3.2) would be inferior than 0528, I tried the free version of v3.1 and found it less creative than R1, especially on fantasy settings

#

Should I just stick with open router and go for my preference, or try v3.2? Are there other pros to it from r1 0528?

foggy falcon
#

and you can force routing through deepseek too so there isn’t much of a downside right now either 🤔

ember karma
#

🤔 you're right, open router has google pay too which makes things easier, so that's a plus

grim beacon
#

how much cost do cache hit saves?

foggy falcon
grim beacon
# foggy falcon 90% on input

so like when i chat with ai model and each time it sends the whole stream back? and I save money as its already cached?

queen gale
#

yes

maiden thistle
#

Well, you can reasonably expect to save money on the input for the system prompt part

#

Most chat apps operate with a rolling window of chat messages, let's say the limit is 4, so you have
SYS PROMPT - MSG1 - MSG2 - MSG3 - MSG4
When you send another message, we're over the window, so the oldest message will be deleted
SYS PROMPT - MSG2 - MSG3 - MSG4 - MSG5
Caching works on whichever prefix is repeated across messages. In this case, the prefix is the system prompt

grim beacon
rain spruce
#

for this specific model

iron vessel
#

woah..

#

this model amazing, only using one dollars for a lot of context.

#

so much cheaper and quality 97%-99% of sota model for general stuff

astral aurora
#

Still no 3.2 for free?

stuck charm
astral geyser
stuck charm
# astral geyser novita

yea I just generated 3 responses and novita and something is wrong with their template. also has massive CN issues. so exclude them for now, is the "fix".

astral geyser
stuck charm
full zodiac
#

does anyon have a good preset?

#

ive been using chatstream and its kinda not the best with 3.2

dense reef
#

I'm using V3.2 for a text adventure, and Atlas Cloud tends to input its reply entirely in Reasoning, which means the next turn wouldn't see its reply in the context. Not sure how to fix this, except to put Atlas in my ignore provider list.

hidden sequoia
#

Why this particular provider in first case?

dense reef
#

I had the providers as auto. I guess it picked whatever was easiest to connect to.

long kernel
long kernel
long kernel
#

No still does but not always, didn't check other providers little bit deepinfra and its good

astral geyser
#

is something going on with v3.2?

#

none of the providers are working properly

fringe shell
#

are there somewhere other providers that support caching other than deepseek themselves?

violet hatch
#

no

fringe shell
#

hm ok, than I still have to ignore all others :/