The model is now available on website and HuggingFace
https://huggingface.co/collections/deepseek-ai/deepseek-v32-68da2f317324c70047c28f66
#DeepSeek V3.2
384 messages · Page 1 of 1 (latest)
how did you even access this page
这是从哪里?他们的微信?
Yes.Now you can use deepseek v3.2 on their website
Now you can't access the page.They accidentally leaked it out before and were discovered by others
ah
有人下载了他(它?)吗?
AI是"他"或者"它"?
i doubt anyone would have the internet speed to download it that quickly
No.There have no any files just a blank
Terminus my ass
👀
seems like they use a new attention method which leads to very cheap inference
good sign that benchmarks with tool calling still remain intact, means the attention mechanism isnt terrible
Its cheaper 😮 nice
That is significantly cheaper
output is like 3x cheaper and input is 2x cheaper
Is it a promotional period or is that the permanent new prices?
not sure but id say they're likely to stay like that since the inference is cheaper
v3.1terminus is still available at v3.2 rates it seems
As an experimental version, although DeepSeek-V3.2-Exp has been validated for effectiveness on public evaluation sets, it still requires broader and larger-scale testing in real user scenarios to identify potential issues in certain long-tail use cases. To facilitate comparative testing by users, we have temporarily retained additional API acces...
For around 3 weeks
questionably long base url
Probably deliberately to discourage use
as a bug fix pretty much
when is r2 coming though 😭
im not sure whether they'll even release r2, i think they'll stick with hybrid
so v4 is r2?
maybe..
Probably
no confirmation
aww....thats so long
smart tbh, openrouter should do this
i think they'll go with hybrid because it pretty much gives them double the space to run their models, no need to host 2 models, just 1 which can do both
V4 is probably based on this new Architecture somehow
Also improves context caching ability for people who switch between think and nothink
true
Max output tokens seems low? Was it that low for 3.1?
based on internet archive seemingly yes
It’s always been max 8k for chat
As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention
Seems v4 is also very soon ?
i wonder if this means they'll increase context length in the future to something insane like 1M
cause if the scaling remains the same then it should be in theory really cheap to both train & run inference, beside the data
seemingly somewhere around that
huh, HLE went down
is HLE that bench that requires a bunch of specialist knowledge?
yeah i think so
also were those results from thinking or non-thinking?
based on this, WITH reasoning
English twitter announcement: https://x.com/deepseek_ai/status/1972604768309871061
🚀 Introducing DeepSeek-V3.2-Exp — our latest experimental model!
✨ Built on V3.1-Terminus, it debuts DeepSeek Sparse Attention(DSA) for faster, more efficient training & inference on long context.
👉 Now live on App, Web, and API.
💰 API prices cut by 50%+!
1/n
WeChat official account
Official app, web app, WeChat app have all been updated to this version
they're usually pretty fast at rolling out the new models to all their platforms
that was a quick rollout after their chat website, do they really get that much traffic to early test it that quickly 🤔
Novita is already trying to add support. Once the OR team wakes up, i'm sure they're on it 🙂
im guessing the new architecture makes it not just a drop-in weights replacement?
probably not since the attention mechanism is entirely different
yeah all the previous (v3, r1, v3.1) are "DeepseekV3ForCausalLM"
the new model is "DeepseekV32ForCausalLM"
Deepseek is becoming Qwen with small iterational releases
so the deepseek v3 architecture had:
- v3
- r1
- v3 0324
- r1 0528
- v3.1
- v3.1 terminus
all within around 10 months
idk... i prefer a massive update rather than too many smaller versions so quickly.
yeah cause they got sparse attention now
Novita is coming
no more obscenely quadratic compute usage!!!!!
I gotta read the examination for thaf
i can't find a research paper
maybe not out yet
i found the paper
well not quite a paper
more just an outline
i like the point release updates. so often it feels like these llms are almost but not quite there yet. maybe the all the real life feedback helps dial them in
Thankies
apparently they started with v3.1 terminus base
and just taped on the lightning indexer to that
and then trained that
they first train the lightning indexer to agree with the existing model's attention patterns
I'm 50% sure I'm misunderstanding
But I think the outline says that the indexer only chooses the topK tokens
If you set the topK to 40, it'll only consider the top 40 tokens in its attention, which makes it faster
Someone double check me
looks like it
each token only attends to a subset of the rest of the tokens
well it clearly works well so yeah
If anyone here is fluent in PyTorch then this looks like the new indexer logic:
https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/inference/model.py#L431
according to GPT5 its like this
compute cost stays similar but memory usage is about 9x lower
looks like it also reduces compute
cause it only has to do the full attention on the shortlisted tokens and only has to handle the 576d vectors for the rest of the tokens
im pretty sure it computes all of the tokens then cut its down to top k rather than only computing top k
so compute is still about the same
it computes all the tokens using the vectors with lower dimension though
im pretty sure thats a minimal cut, since the meaning between 2 tokens is relatively trivial to compute
you still have to compute it across all of the context, then mask it to the top k then run the forward pass & repeat
though it could cut latency a bit
There are efficient algos for computing topk
It won’t calculate the whole thing, then sort it etc it will just use an optimized algorithm
I wonder if this will affect long context quality
Which was already average with DeepSeek
interesting
So... to my understanding, this is how DSA kinda works. Though I am not sure because I have a fever.
DeepSeek Sparse Attention (DSA):
-
For every query vector the router (a tiny MLP) outputs k token-indices that will matter for this single query (k ≈ 1–4 % of context).
-
Two index pools are merged:
- local block – fixed-size window around the query (cache-friendly dense matmul).
- global experts – the learned sparse indices anywhere in the 128 k span.
-
Custom CUDA kernel (FlashMLA) gathers only those KV slots into a contiguous micro-matrix; the rest of KV-cache is never fetched.
-
Scaled dot-product is done on this k×k tile; result is exactly the same size logits vector as dense attention, so downstream layers notice zero change.
-
Softmax is computed only over the selected k positions, giving the same attention weights shape but O(k) instead of O(L) compute and memory.
-
Gradients flow through the router; the index-selection is fully differentiable (straight-through Gumbel), so the sparsity pattern is jointly trained with the rest of the network—no post-hoc approximation.
coming online shortly
I have a fever
Get well soon!
yeah this was the case for v3.1 too
and terminus
our code didn't change
ah alright, i never checked it before so i assumed it wasnt like that
I was able to use tools with deepseek-chat via direct api
has anyone tested long context performance yet?
It performed decently on a ~32k token test
understood all relevant plot points and the connections between them
can't say for sure, it's just one data point. but it's not terrible, at least
okay my code just dropped all the connections for no reason
a zero shot
Disappointing that the efficiency gains didn’t improve TPS
yeah u aint got access to that
Timestamps:
00:00 - Intro
00:41 - First Look
02:08 - Technical Look
03:36 - Web Browser OS Test
06:57 - 3D Racing Game Test
10:01 - Freestyle Technical Test
11:59 - Creative Web Design Test
13:48 - Closing Thoughts
AI Integration & Consulting: https://bijanbowen.com
Join the Discord: https://discord.gg/hfaR2exy7S
In this video, we test the ne...
MrBeast ahh thumbnail
yea, because it's not a speed-improvement
ok that's pretty sick!
Theoretically it could mean lowering batch size -> speed improvements
Maybe with a Turbo api someone will do it
Lowering batch size can shave a couple percent off latency because each token now gets a slightly bigger slice of memory bandwidth, but that’s basically it—DRAM is still the ceiling. So, even a hypothetical “turbo” single-stream endpoint won’t suddenly jump to 2-3× speed; the 3-7× win is purely on the cloud-metered FLOP bill, not on wall-clock. A bummer, yea.
Oh, wow, 1/4th of the output price is going to be great for reasoning
novita really creative with these prices
Gotta be #1 in the OR ranking
true but really now, 1 cent cheaper?
Reminds me of a situation
Back then, we had some companies pricing their Llama 3 8B $0.001 cheaper than the other and they'd keep alternating who was #1
Being the Default routed provider is a powerful thing!
Skeptically excited to see the performance on this. Quadratic computation is one of the major problems with LLMs unless RAG gets perfected
Not computation. You guys know what I mean. I slept 2 hours
im getting this error in the chatroom 🤔
no tokens for like 30 seconds then that happens
only with novita
Fiction livebench says 3.2 is better than 3.1 in long context, which is counter-logical?
why counter logical?
DSA achieves fine-grained sparse attention with minimal impact on output quality — boosting long-context performance & reducing compute cost.
they did state specifically that it boosts long-context performance
is it resulting on different outcome in other people test?
Well I specifically thought that processing 'some' tokens instead of 'all' tokens will result losing subtext in understanding creative writing, and it's just another step to agentic/coding benchmaxxing
So it's cheaper, faster AND better in roleplay? Meaning V4 could boast 1000b paramaters while having same or higher speen compared to v3
“All tokens” has the downside of muddying the waters
Less tokens => more focus
But it’s a balance
In this case, it seems like a good balance
Where do you see better? It seems worse to me
Based on those numbers
Reasoning version, top half
58 v 71 @60k is massive
Yea
nice
DSA doesn’t “drop” subtext — it learns which tokens actually matter for each query, and those indices are different every layer and every token, so nuance survives. The 60k jump you see is mostly less noise accumulation (fewer low-relevance attention scores), not brute-force scale.
So yes, V4 can stack more params (or wider experts) without the old O(L²) tax, so role-play quality goes up while billable FLOPs stay flat or fall. it’s not just a benchmaxxing trick.
Yeah, I wouldn't have expected that! The paper said they did post-training on nearly 1T samples to learn the top-k indexing heads though, so maybe this data had a lot of what fictionlivebench tests for?
Mdash GPTism detected
breh
Do you really hate m-dash 😭
I always use that
That's what bot would say
It's not marketed as a quality improvement, so probably not
DeepSeek claim its the same performance while vastly more context efficient
Not.
Others have already explained above too.
Is the new 3.2 model better for roleplay than 3.1?
And what is different between 3.1 and 3.2?
Please answer with ping. Thanks.
Biggest difference? It's much cheaper.
DSA is the difference: It makes things cheaper.
Nothing changed on the model's perf/quality
@shrewd gate best default parameters?
it's already set on the model
I thought deepseek used temperature 0.3
It's set to 1
I know deepseek provider does the conversion
On their end
from huggingface
But what about the others?
Ah
I got 3 syntax errors with it with temp 1 so there's that
Also, if it's increased performance why are the two endpoints so slow?
the speed you feel is still gate-kept by the old infra. deepseek didnt roll out new gpu stacks, they just swapped the attention math inside the same containers. so the flops bill drops 3-7x for them, but the physical cards, pcie lanes and network hops are untouched -> same queuing, same throttle, same "slow". once providers re-tune batching / cache layouts for the sparse kernels you should see snappier responses, but for now its cheaper for them, not faster for us.
is thisan ai response
My thoughts exactly lol
same old infra, but less maths to be done per token (If I understand right), so i'm within my right to expect faster responses
unless they intentionally throttle it
lol yeah had that vibe too 😂
well no, because if it was I would be speaking too formally
I was building on the past interactions
"for now its cheaper for them, not faster for us", "same X, same Y, same Z" and talking very personally as if to a customer
It just sounds like asking AI to write all lowercase, there's a lot of slop in that response
while this would be very normal to say, here in this discord everyone has their gpt-isms detectors on full blast
damn
I guess I should stop doing that then?
If that was actually you, I apologize. But also AI might have rewired your brain
great observation! but it might actually be us being hyper focused on filtering out content generated by AI systems and not your language, that is perfectly normal and good.
I was already being informal too because of earlier with Loinne
If you want I can create an image of the key points of this conversation as a mind map for you.
"great observation"
Lmfao
yeah I guess what made it kinda sussy is that it was like one longer message
most people just like type one word
and then add "lol" in the next message or smth
😉
Well, in case you're wondering, your response has, in addition to what was pointed out:
- A lot of "not x, but y" typical of AI (deepseek didn't ..., they ... / the bills drop ... but the physical cards / once you ... but for now ... / cheaper for them, not for us)
- Weird, typical of AI phrasing ("gate-kept by old infra")
- Suspicious, overly summarized, vague list of three elements (physical cards, pcie lanes and network hops)
My classmates also get sussy with me whenever I write something formally 💀 Doakes ahh
Yea, maybe I should stop doing that.
at this rate everything is gonna be sus to us in a few years 💀
The just testmant, wasn't not, Y, but a tapestry of X.
Hi, please tick this
bye dawgs, I'm gonna go sleepa while my immune system fcks the fever
"gpt, write without em dashes and lowercase only please"
you can't really hide the prose style from people that use a bunch of different models on a daily
You are getting exposed as clanker. You better admit it
Anyone else getting random '<|end▁of▁thinking|>' written in some of their responses? It's not even somewhere predictable like the start or end of a response, it's just... between sentences as far as I can tell. It's not often, maybe one in every 10-20 responses or so?
Alright thanks
Absolutely not. AI would had added — into it.
It's like their breathing, AI can't write without goddamn em dashes.
Not the only LLMism out there and can be removed or replaced by regular dashes by user after
it's just a bit silly to do that in an AI related server
and i'm not saying there is something wrong with using it for writing, i do use it for long-form text because english isn't my first language
but there's no problem in saying that you've used AI to write and it's just funny when i see those GPTisms on messages
The cost savings are crazy. Novita had 3.1's input tokens at $0.27, and now with the 3.2 savings it's only $0.27 input!
Great cost savings ||for Novita||
i don't know if someone feel this too, but it seems like deepseek is the most impactful in the AI fields.
i know for most people their model will not be the best but the fact that they experimenting with new things then releasing really good paper about it, show how they not just care about the money but also the future of technology for the people
even qwen didn't have this approch
Novita still no caching support?
I might've been infected by clanker-style ever since I started using the darned thang in 2022
I am at least happy that it didn't affect how I write in my first language.
since you guys understand newgen-stuff, maybe I can insert some into my explanations in the future. ❤️🩹
What's with Novita?
is this with Novita or DeepSeek endpoints?
I bet you use text completion ?
Looks like it was hitting the Deepseek endpoint from the activity logs. I enabled reasoning.
Yes, Deepseek are my favorite of the Chinese participants. It just seems like more original research and training goes on here which makes their releases interesting, especially when they are being bold with architecture changes. I'm sure DeepSeek 3.2 is kind of a crazy idea they're throwing out there and that's why it's labelled experimental. Maybe it will underperform too much due to the attention differences, maybe it'll remain good enough for clearly most uses, and then it's a huge win.
I'd wait for this one to mature and get more widespread competition among the providers. Decent chances this one will go lower than before. Maybe even at Novita themselves...
Will there be a free model of this?
with how cheap V3.2 is listed? certainly
It'll get crowded fast
V3.1 is handling heavy usage pretty fine, doenst drop below 90%
Any update on this? tools are still disabled for DeepSeek provider
Maybe have, but not now
Waiting other provider
This is getting out of hand
meanwhile they all don't have caching..
Is this some sort of competition for them 😭 wth is with the pricing
Yep, lol, companies do that
Wouldn't be surprised if they keep lowering it to get on top
does OR route to them a lot though or are providers with caching preferred?
I guess when set to no logging / training deepseek is not in the race anyway
in the chat room the auto router routed me to deepseek now which is good to see..
We got a new challenger coming up
You can select your preference based on cost,latency and throughput .
A lot of the people select the cheapest one as preference so even 0.01 less dollars cost than your competition can ensure that it routes to you
but does that ignore cached input?
I really want more providers to offer cheap caching
would be great to have a price war on that ..
From what I read in the docs, if it routes to a cache input provider once, it tries to route to that over and over to ensure cache hit
yep that's true
From their docs
But I notice a lot of cache misses when using openrouter for some reason
especially when you have 10-15 api calls simultaneously , some have cache miss some have hit, kinda random
Just enforce provider? It worked for me
I haven't tried it with deepseek but I had trouble with gemini flash 2.5. I used ai to add context to rag chunks, so the pre-fix was same for all the chunks, I processed 1 chunk first and then the rest simultaneously so they hit the cache of the first chunk. When I used deepseek official api I got almost 100% cache hit rate , with gemini it was quite random. I should have also gotten close to 100% but it was like 60% something, quite random. I did make sure to ensure it was only ai studio or vertex.
so far in my tests, this model is stacking up quite favorably to Grok 4 Fast, which is now a very similar weight class. performing better on my own evals for a real application.
the tradeoff being speed, primarily
I personally think v3.2 is better than grok 4 fast and cheaper too.
For my evals deepseek almost always used around 500 reasoning tokens(Which is actually insane since my task was reasoning intensive), grok 4 is cheaper in terms of input and output tokens but when you consider the reasoning tokens that it uses, deepseek v3.2 is cheaper
yep
not really anymore, grok 4 fast's TPS is way down
Seems to be a slight lateral thinking regression vs 3.1 terminus
(From https://lateralbench.org )
LateralBench AI Model Performance Leaderboard - Interactive accuracy and pricing comparison
What is this benchmark about? I can't open the link for some reason
If any of these open model providers added caching support they would probably blow literally everyone else out the water with price while simultaneously increasing their capacity massively
Has anybody else noticed 3.2 being much more repetitive than 3.1? It feels like it's not giving much attention to messages past the last 2-3, and I had it repeat the exact same answer in a multi-turn conversation.
Well it quite literally isn’t going as much attention
(Sparse attention)
Yeah, I don't see how 3.2 is in any way a replacement for 3.1 right now, it's literally way worse.
that's probably why they called it Exp(erimental)
maybe it'll get better again with more training to offset this
while still being cheaper
I think it’s pretty much as good as terminus. Try using it vs DeepSeek themselves to test it.
This update is just make the price down only
A mini update
I saw Fiction.liveBench which was kinda interesting on this topic. Non-thinking starts out performing very mediocre but oddly enough consistently mediocre which is in fact pretty decent for very long contexts...? But unusually poor for short contexts. Thinking High however performs well and what's nice is that it remains well over long contexts. Here's a screenshot and discussion on Reddit: https://www.reddit.com/r/singularity/comments/1ntmkah/fictionlivebench_tested_deepseek_32_qwenmax/
So I'd use DS 3.2 Thinking if I want to boost attention. I'm not sure why this is so but I assume the thinking tokens helps it stay on track?
when will they release a model that can accept images
The "thinking" section is likely acting as a secondary (and higher order) type of Associative Memory, eg:
You can think of Attention as a form of Associative Memory:
https://ml-jku.github.io/hopfield-layers/
but it can only deal with second-order interactions and this is where the O(n^2) comes from (third-order interactions would require O(n^3) and be completely impractical).
The way the reasoning section trawls over the same information over and over has the potential to selectively account for much higher order chains of interactions.
I think deepseek as a company is a lot more prone to doing experimental things for the sake of experimentation (If you've seen the interview with their founder/head after deep seek reasoner crashed the Nvidea stock market).
They take a lot of risks and try out a lot of new things, so far I believe they're focusing only on text input and text generation, but you never know
1 thing is for sure. They won't release a multimodal model just for the sake of having a deepseek multimodal model. They always bring something new to the table. Like they bought in Mode of experts and now this sparse attention with v3.2
Yeah, pretty interesting company for sure.
They didn't really follow the trend and trying to be the one making the trend
yah if it weren't for them most companies would just be racing for more parameters , and cost , innovation would have been slow
Do people really care about more parameters at this point, though? All of the top proprietary models don't even reveal their parameter count, and for open weights, larger size is a downside, not upside. I'd think test scores would be much more popular compared to size, as flawed as they are
I think this was bound to be a weird update. It's not like anyone expected more smarts when the big innovation is a form of sparse attention
I heard somewhere that said the performance of small model actually could be boost through prompting to be close with their bigger version.
It said that with bigger parameters there's additional context added to each token that why it able to produce more quality output.
But with good prompting the small model could be at the same level as the bigger one
I think at best it's that maybe the correlation is weaker than we thought.
I will say smaller models hold up better than a lot of us probably thought when they are given an ungodly amount of reasoning tokens
Namely QWQ and the new Qwen 80B
But they won't be acing HLE any time soon
deepseek has cacheijg
I mean like one of those providers that host a bunch of open source models, like Novita or stuff
My experience in text adventure games with Dipsy v3.2 as the interactive system (official provider, reasoning on, Temperature 0 and 0.5):
- Prices back down to v3 levels, massive
- Carries v3.1's quirk of not spamming asterisks, good
- No more "Somewhere, in the distance, X... Y..." and "Outside, X... Y..."
- Follows instructions better, again from v3.1, so needs more explicit instructions
- New kind of sloppa spam of "It's not about X, it's Y." / "It doesn't X; it Y..."
- Starts to get things wrong a little past 32k to 64k, okayish
- Knuckles whitening, breath hitches slop (I'm okay with this though)
Good model, if you can keep up with the slop.
dipsy ❤️❤️
Indeed. Dipsy love
Sounds good! Pricing is so good that even just sustained performance makes this one a clear success. I think the killer feature is being able to use premium providers without worry and not hit silly rate issues.
Do you mean new for DS? That's the most famous slop there is across models.
yea, or I just didn't notice it before for dipsy
Oh, and don't forget about the spam on the smell of ozone. Deepseek models LOVE the word for some reason
I'm still debating where to top-up though. Open router or deepseek official? I love deepseek and all but v3.1 (and I assume v3.2) would be inferior than 0528, I tried the free version of v3.1 and found it less creative than R1, especially on fantasy settings
Should I just stick with open router and go for my preference, or try v3.2? Are there other pros to it from r1 0528?
you can use v3.2 via openrouter so if you want to experiment I’d top up on openrouter instead of deepseek ..
and you can force routing through deepseek too so there isn’t much of a downside right now either 🤔
🤔 you're right, open router has google pay too which makes things easier, so that's a plus
how much cost do cache hit saves?
90% on input
so like when i chat with ai model and each time it sends the whole stream back? and I save money as its already cached?
yes
Well, you can reasonably expect to save money on the input for the system prompt part
Most chat apps operate with a rolling window of chat messages, let's say the limit is 4, so you have
SYS PROMPT - MSG1 - MSG2 - MSG3 - MSG4
When you send another message, we're over the window, so the oldest message will be deleted
SYS PROMPT - MSG2 - MSG3 - MSG4 - MSG5
Caching works on whichever prefix is repeated across messages. In this case, the prefix is the system prompt
what is the limit if i use open-webui , because yesterday wen I was chatting it went to 90k tokens after 70-80 messages , was it keep track of every message?
the limit is 164k tokens
for this specific model
woah..
this model amazing, only using one dollars for a lot of context.
so much cheaper and quality 97%-99% of sota model for general stuff
Still no 3.2 for free?
any fix for this?
which provider? sounds like a broken template.
novita
yea I just generated 3 responses and novita and something is wrong with their template. also has massive CN issues. so exclude them for now, is the "fix".
so which provider doesn't add that annoying <|end▁of▁sentence|>
i have seen chinese but not that tag, vanilla openrouter? default settings? an app? hard to tell with no infos
does anyon have a good preset?
ive been using chatstream and its kinda not the best with 3.2
I'm using V3.2 for a text adventure, and Atlas Cloud tends to input its reply entirely in Reasoning, which means the next turn wouldn't see its reply in the context. Not sure how to fix this, except to put Atlas in my ignore provider list.
Why this particular provider in first case?
I had the providers as auto. I guess it picked whatever was easiest to connect to.
atlas-cloud/fp8 did that:
<|tool▁calls▁begin|><|tool▁call▁begin|>list<|tool▁sep|>{}<|tool▁call▁end|><|tool▁calls▁end|>
I believe DeepInfra does that too, but it happens when DeepSeek needs tool call.
so atlas removed it?
No still does but not always, didn't check other providers little bit deepinfra and its good
are there somewhere other providers that support caching other than deepseek themselves?
no
hm ok, than I still have to ignore all others :/