#DeepSeek V3
1605 messages · Page 2 of 2 (latest)
this gotta be a new record, 0.07t/s
bahahaha i was just going to post this
folks. we did it. negative uptime 😂
now over 100% uptime
still down
If deepinfra and hyperbolic are proxying then does that mean their privacy is the same as that of deepseek? As in prompts sent through deepinfra would not be private from deepseek and may be used for training by deepseek?
how do I route to hyperbolic?
It's not that they are proxying Deepseek
Is that are not ready to handle the whole load of traffic when Deepseek is down
So when Deepseek is down, they saturate too
u shouldn't
Is anyone else experiencing this issue? I've even tried creating a new API key.
Personally seeing silent failures documented here: #1324894987217014855 message
Hyperbolic now runs at precisely 0t/s
Deepseek has context caching
https://api-docs.deepseek.com/guides/kv_cache
If I use deepseek-v3 through openrouter, will this context caching still have effect?
thanks a lot
yeah deepseek at high contexts is not really working
also watching Hyperbolic fluctuate between "dead" and "barely alive"
is hilarious XD
how high context? I've been submitting big files and its still just about holding together. high latency tho
which provider?
the DeepSeek provider specifically gives me issues
about twice a day
Occasionally deepseek limits their context to 8k probably due to load
deepseek model so friggin slow for me
same, can't use it for anything atm. 😦
also is it just me or
is Fireworks' DeepSeek-v3 completely diff lmao
this shit is SLOW and RLLY RLLY BAD
lmao
why is DeepSeek so hard to run?
i get that it's massive, but isn't MoE supposed to run a bit quicker?
Imagine 671B model have 37B active parameters in given moment, then there 18 people using that model at the same time with different thing, isn't it going to activate all of the parameters anyway.
so in simple term because there is a lot of demand that why is hard.
That's my thought of it, could be wrong.
what quantizations are the non-official API for DeepSeek?
Deepseek had months to optimize their infra for MoE for a year now. Other providers never had this much demand for a MoE and most llm serving software don't have first class support for batched high throughput MoE support simple as that.
also quite a different MOE to server
also there deployment is 1 expert a gpu .. with a unit of scale of 320 gpus
most try to run that of units of 8xh200's
Event after all of that Deepseek still have hard time to run it, i mean they increase their node last time to accommodate the demand.
This one model really attract people to them, good for them to hit the hype train.
they did something i would have deemed impossible on a shoestring budget
soo they deserve all the hype they got
is 2048xH800 really a shoestring budget?
6m as training cost for a 600b+ model is a drop in the ocean
a 100b dense costs 50-100m
just in compute
and a 2k gpu cluster is minimal for such a model
405b from llama used over 20k gpu's
they did fantastic work with the limited stuff they had
also mind you the h800 is more on paar with a an a100
as the chip is capped
assuming H800 costs $42000 as it does on ebay, then the cost of 2048xH100 is $86M
h100 is around 28k in bulk
and we talking compute cost not hardware acquesition cost
2 very different things
i see
does anyone know how well merging the experts of MoEs together into a dense model works
the only attempts I've seen are attempts with Mixtral 8x22B
you wont be .. as knowledge wont cluster that way
and they were not evaluated very well
even at mistral that wont work
here its even way different
as the gate selects more then just 2 layers
mistral is. 2experts
here its 9 ?
oh
so each expert is ~4B?
i'd imagine pruning wouldn't work well since the experts are so small already
knowledge clustering happens differntly then one would assume
so the only way to get a smaller deepseek-v3 would be to logit-distil it or pretrain a smaller model on the same dataset?
moe is really just fractioning stuff out
you have overlaps and no central clustering
aka you dont say expert 38 is the math expert
just doesnt work that way
its a logical separation and inference hack
not really isolation on knowledge clustering
moe isnt new .. first paper was written in 97 about that
with 100k experts
its just getting steam after mistral
and deepseek has successfully integrated it as well
but non the less they did very fine work
inference on batch is still tricky
at deepseek they run 320 gpu's as unit of scale
and each expert is pinned to 1 gpu
the 3rd party guys dont have any of that
does the router need its own gpu
need is a strong word but you have kv and otherwise the inf cost is just like a dense 670ish B model
as in batch odds are close to all experts are hot
aka performs like a dense model on the gpu
if you pin an expert to a gpu - that becomes the unit of scale and you have massive gains in perf
as you have a 3-4 b model on a gpu vs a 680b over a few with tensor and data parallelism
since the experts are so small could you use 8~12gb vram gpus for the experts then? or does it still need large gpus?
im semi affiliated with mistral - but i really have massive respect for the boys at deepseek
they did very good work
issue is you never know what expert will be on
what you could eventually do is prune it down to just use 2 experts but after pruning you would need to retrain the gates
not sure how well that would work
im spitballing here
no my question is if you have something like 320 small GPUs
assign the experts to the different gpus
ya that work if you have the code to pin it
kv may rape you .. a little
and the network
but yes
👍
but then the power cost - and throughput
i think the cheaper option is to use the api
320 small gpu's will drain at least 200 w each aprox ..
thats a big steep bill .. not user if 8xh200 at 2 usd per gpu aka 15k a month wont be the cheaper option and you have a higher throughput - at least on paper
we just need different matmul processing that is faster and faster memory technology
in 1-2 decades we run models like that on our toaster
yeah
hmm somebody got 5.37t/s on 8xM4: https://blog.exolabs.net/day-2/
yeah
- prompt processing over longer ctx on such a setup will be horrible
its a great demo for sure
but i would not call that viable
for every day use that is
- the inital investment for 8 macs .. at that config . wont be a easy paletteable investment for most
given in a year or 2 its probably close to worthless
- if you try to spend 20k on the api given that price
you have a few years runway
lol
so not really sensible fiscaly
i have a hard time spending 100 bucks on deepseek with daily use
so it seems that the most improvements in local models will probably be dense models with higher-quality data and maybe new finetuning methods?
stuff gets bigger before it gets smaller
yeah
distillation and then newer dense models can train of that
oh yeah i forgot about distillation
i would love to run that local but no dice
im capped with 96g vram and 256 gb ram on my normal workstation
and that is already more then what the average user has available
Why is fireworks
So diff from Deepseek as a provider for this model
The response is so diff
I Also thought if there a other way to make the attention calculation requiring less step while able to result in the same attention capabilities as their original formula it will make our model more efficient and faster, i mean looks at MLP/FFN alone without other part are quite fast and memory efficient.
Are there already research for that tho?
Firework seems off. It's like it has it's own temp settings. Together is expensive and DeepSeek doesn't always work. It's a shame.
The DeepSeek endpoint doesn't seem to work at all rn
yeah. it's not working.
looks like a small anomaly
every part in the llm is matmul .. so if you optimize matmul you get speedups cross the board
custom asic's could help .. and some are working on it .. see cerebras / groq .. tpu
cerebras is pretty much unbeatable at this point as the interconnect is legaly fenched off to them
so groq still has some gains to be gotten once they get the v2 hardware out of the samsung run
but the devices are individually small as its all sram
so deployments aint cheap
Does matmul mean "matrix multiplication"?
yes
tpu and tensor cores are systolic arrays
pretty much vector processors
to accelerate matmul
modern cpu's have avx for that but way slower then simd or other achitectures
my bet is still on photonics .. but there is alot missing in terms of material science
next 1-2 decades are going to be interresting
Yeah, but attention head specially are more resource hog compare to the other part of the block as it have to get all the context so it's have more step and longer calculation.
I mean we can see that from layer with FFN/MLP only we get O(n) and if then we add attention layer into it then we will get O(n^2).
problem is you cant just change the arch as after effect
at least not without massive cont. pretraining
otherwise you get noise only
so that kills is very much for 99.9% of the smaller guys / labs
how to select provider ?
workaround was to block every provider but deepseek
I block them in the Openrouter settings on the website.
deepseek v3 works very well with roo cline
Has Fireworks like
fixed their issues yet
bc other than DeepSeek
all other providers have terrible responses
You can ignore provider though openrouter settings
ye but this feels like a non obvious workaround.
i mean if i select deepseek, i want deepseek.. not some 3rd party provider with higher prices. feels like intentional money grab behind the scenes
oh i see it's done in the code, not on website. thx
OpenRouter doesn't make money by routing you to a more expensive provider
surely they get kickbacks for allowing it to happen by default 😛
As far as I know, they make all their profit from deposit fees
accidentally routing you to openai's o1 (or later o3) would make them bank then - due to automatic top ups 😆
Anyone know why, no matter what settings I use, I can only get max response of approx 2000 tokens using deepseek provider? I've set max tokens to 8k.
Even when I ask specifically for up to 5000 words. Cheers
Does the response randomly cut off or does it just stop?
If it just stops, that's an LLM issue. After a certain point, it just stops talking probably because it's training data didn't have many examples past 2k tokens
It stops naturally, like it starts generating its response "knowing" it's limited to 2k or so words if that makes sense.
its not a rp model mate
dont expect it to write long stuff for you .. it doesnt have to - if anything most us want answers as short as possible and to the point
More for knowledge articles and wiki creation. Yeah it's a shame cause it's my go to model right now, and Gemini could pump out full articles in one prompt but deepseek take a couple prompts. No real issue just was checking if I'm missing something in settings.
It’s good at creative writing actually.
After all, do you ever really want any model 1 shotting a large of piece of text beyond say code refactoring? Just like a human writer, it's best to start with an outline, and progressively flesh it out.
is it down
I get errors over deepseek api, but not on official website.
imo its a bad thing if model answer always short, i dont like how sonnet answer a problem without details explanation of solving it even when you ask it where in other hand o1 always giving out details about that problem.
also the other person talk about how its limited to 2k even when its adverst to have more than that, so make sense that person thought it should have been outputting than 2k just as deepseek told us in their label.
overrated tbh .. - you can always use continue - and steer it
people who cry are the guys who think ai will do all alone ..
its a TOOL
i think you miss the one and two point, there use case where longer outputing could be really beneficial as in my case its about understanding problem and the longer its the better it could be layout also if it being adverst that it can outputing 8K token then you shouldnt need to steer it so it outputing as what being adverst where also if you ask it to continue then its not as what being adverst to be 8k token output.
i agree its a tool but its still shouldnt have that adverst in the first place if it cant get to it
It can output 8k but the model doesent feel it is necessery. Suppose you give it a translation task which requires 8k output then it will do that because the model feels the need to output all 8k tokens.
are you telling it to generate 8k tokens or are you telling it to like "respond in 10 paragraphs with 5 minimum sentences each" which can equal 8k tokens or something similar
has you actually try it to generate 8k token for other thing than translate? i has and its really are limited to below 8k token if you ask it to make story even if you steering it to do so, imo its shouldnt adverst as something that could generate 8k token if it cant do it for many thing other than translate and i think eternal answer are more suit for this thing as it is a model problem than other thing where there no example that goes 2k token on the training data.
then you should probably use a different model for your use case
has anyone figured out why the models return such different responses dpeneding on provider?
Everyone is running a different inference engine and potentially different quantization of weights. DeepSeek's inference engine is proprietary, Fireworks apparently does some form of quantization (https://x.com/FireworksAI_HQ/status/1874231432203337849?t=Y8xmqor0UFhkvzPAJd4H6A&s=19), and everyone else is a wildcard.
DeepSeek V3, a state-of-the-art open model, is now available on Fireworks Serverless and Enterprise!
🥇 SOTA open model for coding and reasoning
🥇 Best performing open model on Chatbot Arena and WebDev Arena
🧠 671B MoE parameters, 37B activated parameters
Congrats to the
I don't think there's been confirmation that the different open source inference engines produce the same output for deepseek at the moment.
also sampers make an impact how they process logits - fireworks will be some triton based inf
the best bet specially with moe's / new architectures is always the original model providers api
damn
it's v v different unforutnately
yikes
MTP module seems to alter model's behavior a lot
It's much more deterministic with MTP module on
source: local inference w/o MTP module compared to DeepSeek Platform
example: with MTP on, model stays stable at temp 2, w/o it it's much more chaotic
MTP?
Multi-Token Prediction
it's basically a specifically trained 14B model for speculative decoding
more complex than that, but it's the basic idea
spec decoding is changing outputs a lot, its very unlikey the official api is doing that.
Sorry for the intrusion, but can you share the prompt to me too?
I thought spec decoding didn't change the outputs
I thought the entire purpose was to be faster and not change outputs
I'd like to report that, unless I'm misunderstanding something, Deepseek seems to be limiting their context to only 10k, and it has been this way for days. Any time a prompt exceeds that amount, it never generates a response, and if you lower the context size to 10k, it starts working again.
This seems to be an issue in implementation on the providers side rather than a model issue
I use exclusively deepseek as the provider, and disable the others in the settings.
The others just output gibberish.
Though the other providers do generate a response even when over 10k. It's just usually garbage.
Sometimes it works over 8 to 10k but it is probably limited due to load
Seems like that should be listed somewhere. I've tried it at random times over the last several days and I've never had it generate a response over 10k. I get that it's a good model at a low price, but I was using it expecting to get the full context listed.
I agree
And I was wondering why my input that was about 12K tokens resulted in infinite timeout from DeepSeek
Could be the same issue
@lament yacht Will OR look into this? Because if DeepSeek is indeed limiting input tokens to under 10K, either DeepSeek or OR is lying about DeepSeek having 64K context window
I've been only hovering around 2-3k and I get the same issue.
I guess it's weekend so no one is answering or investigating this
there’s often a difference between a model’s technically supported limits and what the provider actually enables in prod, likely deepseek had to limit context length to serve more people
example being google charging more for tokens over 128k while the models can support 2M tokens
just checked deepseek discord, it’s 100% an issue on their end
It's fine to limit context length. Many Llama and Qwen fine-tune models hosted by Infermatic and Featherless actually support up to 128K context window, yet no host will actually host them with such context window
It's just that it should be clear and honest about it on OR's model page
Yeah, that's my main issue. It needs to be stated somewhere. This isn't a free service. We are paying for it, and if they are having problems then it should be clearly stated.
I think we need to chat with DeepSeek upstream, we're following their docs: https://api-docs.deepseek.com/quick_start/pricing
yeah was gonna say, deepseek themselves don’t say anywhere that they’re enforcing this limit
They’re increased input price of deepseek chat but decreased output price from 0.27 to 0.14
This is good
You misread the docs. $0.27 is the "regular" input price that will kick in next month. Current input is "promotional" $0.14.
They replied about the hanging problem in the Chinese channel but not in the English channel 😂
They just replied in the English channel
Even though it's mainly about Cline, basically they just suggest against using it with longer contexts
"Our team is actively optimizing calculation and server load management to improve the overall performance and stability."
Basically canned response
Also they recommended using other providers when the context is longer. Oh god, if only they know Fireworks often returns rubbish
And no mention about fixing their documentation
Thank you
I think its mostly they are running out of hardware to scale further
is there any caching with fireworks?
this is totally ok
my issue with them is like
when DeepInfra eats shit and gets messed up, and can't serve a model, it's clear, either their accepted context lowers and we're all aware, or the provider just turns red and we're all still aware
the issue is its not down or red because under 8 to 10k still works and you cant derank because then ppl will complain about costs
long context does still work sometimes tho, its not like a blanket ban on anything over 10k
well, they did literally say on their discord "use other providers" (with Cline specifically because of long ctx spam) so it does look that way. Can they not elastic scale using GPUs from like bytedancers or tencent or others?
Does anyone know if Together is hosting the model fp8 (i,e, as huggingface), or did they quantize it?
I've used this and Claude and Claude is faster than Deepseek on Cline. So IDK why Deepseek is slow or just hangs up. Frustrating
😅
is any provider working for anyone? all providers just load forever.
yep
I had to add deepinfra and together to my block list because I aint paying $1.25 for deepseek which defeats the purpose of deepseek
So now DeepSeek is FP8 and with 10K context?
In short, for RP it is better to go back to Sonnet and Wizard. Thank you
What happened to the 'Fireworks' provider btw?
#general message advice from staff
Yeah, sadly. Despite the problems, Fireworks was like the best in terms of quality/price ratio. I hope they'll return eventually.
looks like NovitaAI joined the race, taking place of Fireworks in some ways, let's see how it goes!
the main problem and this seem to be a issue for long now, no matter the params you set, the providers seem to tune the models in their own ways such that each provider acts a little different than the other. which disturbs the overall flow 😦
this was also noticeable with qwq, some providers did not work well with it
The latency problem with Together and NovitaAI are crazy
How’s the privacy policy of Together and NovitaAI compare to Fireworks? Their ToS seems pretty similar at first blush
and novitaai have now cut down their output as well!
Oof, that's rough.
looks like it died first
Yeah, looks like a lot of providers are struggling with serving such a gigantic model
what's the current next best in terms of coding usecase?
3.5 sonnet
True
Although I prefer o1, but it's test-time compute based and really quite expensive.
DeepSeek provider seems like the ONLY good option for generating a story. Every other provider just gives a load of rubbish.
I've never experienced this with any other model.
Idk, for me personally, Fireworks or Together are the only ones that work for creative writing.
What temp do you use for those two?
Around 1
do you touch the penalties?
Yeah, i crank some quite high too. Between 0.5 and 0.8
Really? I'm experiencing the opposite. I thought the model was bad for creative writing, but recently I got routed to deepinfra when the official one is down, and it's vastly better. Disabling fallback, the quality degrades severely again. Still experimenting but my guesses are: 1. Official API might be using a very short max input. 2. It uses a different method for censorship, maybe using logit bias or filtering the input with regex. So it still gives output unlike GPT, but the quality will be extremely bad. Currently the official API is very unstable so I haven't tested too much.
Yeah, i think official API uses some extreme optimization options for cost cutting which causes the performance to degrade too.
It's because I've been setting the temp too high for other providers. I'll try again at some point.
Were you using Novita? I find that provider reacts to temp way more than official. The deepseek docs recommend 1.3-1.5 for creative writing, but novita can only generate nonsense at as low as 1.1 . Other providers are also more sensitive to temp than official, but not nearly as much as novita. It also likes to add some comment, like an author's note, at the beginning of its response. None of the other providers does that.
I'm not an expert but this difference between providers really seem strange to me. I thought they should produce mostly the same results, but in deepseek's case, they are very different.
Overall Notiva's output is super weird. Not only it adds "author's notes", it also ignores my request to summarize the story (so that I can test whether it cuts context or filter the input). Suspecting the provider's honesty, I asked it which of 9.11 and 9.9 is larger, and it got it completely wrong. Not only the result was wrong, it also used a completely different format. Other providers, include official, ALWAYS starts with "To determine..., let's compare them step by step", and then lists out the steps. Novita doesn't do this, so I'm starting to doubt whether it's even actually deepseek.
It's why I've got all the providers except DeepSeek blocked via OR settings. I'll switch to DeepInfra once I'm done with MiniMax.
Responses also seem different if I use my own deepseek key or BYOK compared to through openrouter. I wonder if they did something to censor openrouter key
Just started using Together with temp 1.1, rep penalty 0.5, freq penalty 0.7. The difference is staggering.
True
Rep penalty 0.5 to me reports that it is an invalid value, minimum must be 1.
Are you using text completion or chat completion?
Then use 1.5
Like this?
Yes
Then I still have repetitions, ouch!!!
And I also tried Presence Penality = 0.7
And repetitions have a standard behaviour, they always appear when a speech or scene is prolonged.
If you have a faster pace they do not appear.
Are you sure you're using Together as the provider?
Yes.
If I prolong the scene there are always repetitions, not striking but it recomposes the sentences with the same terms.
Much less than before but using Sonnet, Wizard and Hermes 3 you realise that it cannot be like them, without repeating itself.
I get it. Sadly other models are either expensive, moderated or trained on 50 shades of grey. I have a bot I always go back to. It involves a guy with a secret identity with an online account to contact User. While other models struggle with the concept, DeepSeek just gets it. And it always seem to follow the prompt to the dot while other models seem to derail after a while.
You're right, I think the models I mentioned are much more nuanced on the sexual side and that prevents repetitions.
I too find DeepSeek a very good model for some of my SFW cards, but when I use the NSFW ones the repetitions always come back, as if at some point in the scene his vocabulary of phrases runs out or is otherwise more limited than other LLM models.
Why are you using Text completion instead of the Chat one?
I use text because the website won't let me choose.
I see
The deepseek provider has been really slow lately, or is it just me?
Yeah, a lot of providers are struggling with running Deepseek v3 adequately
actually I probably don't know if I can or know how to do it on OR, this stuff goes way over my head
it's always been like that for me, try other providers
I tried other providers, though they work, their responses are very different. Though right now my temps are 1.8-2 for creative writing
Like I mentioned before, temp should be much lower - close to 1 - for other providers like Together. #1321401252588032010 message
It was on a roll yesterday for me. Shame it shit the bed today.
it was laggy yesterday for me as well, but a bit more reliable than today
i feel like by the time it returns, the promo price will end already, which will suck
apparently they're limiting deepseek's official openrouter connection because of just too many concurrent requests, and if you byok through openrouter or just use their api directly it's fine
i did want to just top up $2 but i genuinely won't be using deepseek enough before the promo price ends to bother with it lol
do i have to top up on deekseek directly too or can i just use my OR credits?
deepseek directly unfortunately
there's probably atleast 500 people right now generating complete automated slop using deepseek on openrouter
it's too cheap to not mess around with
they gotta ruin all the fun smh
my first day my cline went wack and made over 2500 files when i was experimenting with a simple refactor
if you could put $2 on deepseek directly it works well, you can use the credits for the rumoured r1 or r1-lite launch soon too :)
i wish openrouter wasn't limited though
what is the r1/lite?
deepseek's reasoning model, o1 equivalent
what will be the differences between the main one?
Completely unusable for me
Same, Deepseek provider doesn't even work for me. I'm thinking this will remain until their promotional price ends. Their deal is too good so it's flooded
The original price is still worth it though imo
Use BYOK. It's so much faster.
I just looked it up, what's that? Is that running it on your PC or like through OpenRouter?
It's Bring Your Own Key. An OpenRouter features that lets you bring your own provider API Key and use it on top of OpenRouter
So it can act as primary or fallback
damn gotta top up on a different site too on top of it
lol yeah. thanks for letting me know though. I'll try it, deepseek roleplay is too good
Yes. OR still takes 5% of it. But i think it's still worth it. Even you can make it just as fallback not primary.
the top still shows 64K, but Together now has 131K. Fireworks just released theirs at 131K too
R1 Zero weights released 13 minutes ago.
https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero/tree/main
No model card or report atm. Same size as V3.
Nobody knows yet. Files just appeared, no information yet.
But it should be better than V3.
Yeah, fair enough, since V3 was distilled from r1
Similar to how v3 was released initially, so we will probably get the README in a few hours
insane size 685B parameters.
deepseek just on a roll lately, hope it's something big
I expect it to be at least close to o1-mini performance or a little below that
it destroys o1-mini, on par with o1-medium
https://fixupx.com/StringChaos/status/1880317308515897761?mx=2
Interesting
This is a different model to preview tho. Plus, on Deepseek chat website it wasn't performing as good as this benchmark claims, but this is just my personal tests and observations.
we back
I'm so excited to try it out!
Do you feel Deepseek v3 has the ability to properly adapt in multi-turn? Like, It doesn't repeat the same mistake if I give it the tiniest amount of feedback on its math. It's almost humble :>
What do you mean "multi-turn"? Is it when you ask it multiple follow up questions?
I tested this and it seems fine to me. Is this in the context of RP?
I mean it is good at multi-turn, indicated by the :>, :>
How come we have to use BYOK for lower latency? It seemed to be working fine a few days but now I’m not even getting responses. I’ve switched to BYOK and it works fine now
Good question. @lament yacht what's the technical reason behind this?
with BYOK you gets your own ratelimit with upstream provider
So openrouter is constrained by rate limits?
Yeah -- we're in contact with them to see if we can increase it
DeepSeek API does NOT constrain user's rate limit. We will try out best to serve every request.
🙈
There's probably just something like a global queue
: OPENROUTER PROCESSING
I keep getting this for DeepSeek
This took like 65s to complete
Is this the same for yall?
@lament yacht
edit: talking about wrong model, my mistake
OpenRouter doesn't forward reasoning tokens, and it's setup to generate up to 32K tokens of reasoning tokens. At 10 tokens/s, it could take up to an hour before you see the first non-reasoning output. whoops wrong model
deepseek-chat
ah sorry my mistake, should've read the title more carefully 😂 - long day
this is happening more frequently. any solution to this or something wrong?
Yeah I used BYOK and it fixed this
I don't need to additionaly credits to deepseek right?
You need to yeah
yikes!
I retract this statement. This model makes me wanna kms in multiturn.
Do you mean do you also need to have Deepseek credits to use it in OR?
Same question. Hoping someone from OR confirm this
I was asking if that's what you meant. If so, no.
If you BYOK, you need Deepseek and OR credits
BYOK pricing is 5% of the provider pricing deducted from your OR credits
For every $1.00 you pay directly to the provider, we'll charge you $.05 in OpenRouter credits.
and, without BYOK, just using OR credits?
Standard pricing from whichever provider serves your request
is the OPENROUTER Processing issue related to BYOK?
I am not sure, can you see what provider was behind that generation in your activity?
How’s it going with OR getting better rate limits from Deepseek?
It's very strange to me that openrouter has been having so many issues with their connection with deepseek
I've been using chat.deepseek.com for several days multiple times a day and the tok/s and ttft is always blazing fast and has not timed out on me once
granted, you're one user versus our thousands 😅
almost 2B tokens through v3 today on OpenRouter
& almost 1B on R1
I meant to post this in the r1 chat (since I've been using r1) but confused myself haha
that's fair but I was talking about deepseek's inference capacity in general
deepseek the company not the model
Oh my god V3 works in Sillytavern now too 😄
Now? Pretty sure it worked since release...
did you try telling them that your key(s) are basically representing 1000s of users?
otherwise maybe they do not know, and think you are just their biggest superfan, who simply cannot get enough deepseek api spam
Did not work for me or a decent number of others. Staff confirmed multiple fixes were required on the backend.
It was provider dependent
But Seek and Infra did not like ST
Isn't it still provider-dependent or they fixed DeepSeek provider?
I would need to check, i guess
It's fixed
In my testing at least. Obviously the DeepSeek provider is getting kind of rocked by requests rn tho
hi
I am also curios to see why the DeepSeek app is always blazing fast with both R1 and v3 but it is usually either down or too slow from OR (All providers are ignored but DeepSeek). Does BYOK help with up-time and latency? I would like to see real usage example from someone before topping up some credit to DeepSeek API directly 😄
Byok helps
OpenRouter is being rate limited
Oh thanks for letting me know!
Is it noticeable increase? Are you using it that way yourself?
Its a noticable increase for DeepSeek, since OpenRouter is being heavily rate-limited. In all other cases, the increase in speed is negligible.
I'm not using it myself, but I have seen other people see better speeds with BYOK
I will give it a try, thanks a lot!
hug of death'd #1330820209812050002 message
official website no longer works, even without web search enabled. I'm doomed T_T
I got a question, my gen ID with 0 token response and a timeout tells me that I am not using BYOK, but I am...why?
"id": 3996858223,
"generation_id": "gen-1738012106-DvtnKTf72ZruFYOrbaNR",
"provider_name": "DeepSeek",
"model": "deepseek/deepseek-chat-v3",
"app_id": null,
"streamed": true,
"cancelled": false,
"generation_time": 289575,
"latency": 13767,
"moderation_latency": null,
"created_at": "2025-01-27T21:13:29.867706+00:00",
"tokens_prompt": 1384,
"tokens_completion": 0,
"native_tokens_prompt": 1691,
"native_tokens_completion": 0,
"native_tokens_reasoning": null,
"num_media_prompt": null,
"num_media_completion": null,
"num_search_results": null,
"origin": "",
"usage": 0.0002343726,
"usage_cache": null,
"usage_data": -0.0000023674,
"usage_web": null,
"provider_responses": [
{
"provider_name": "DeepSeek",
"status": null,
"latency": 10000
},
{
"provider_name": "DeepSeek",
"status": 200,
"latency": 13767
}
],
"is_byok": false,
"finish_reason": null,
"native_finish_reason": null
}```
Can you check your key settings (specifically the fallback value) https://openrouter.ai/docs/integrations#automatic-fallback ?
noting that this did have an error
The 1st one with DeepSeek failed, then the 2nd one fallback to our key and got a 200
Still very bad that it's straight up 0 completion tokens...
yeah it's the deepseek issue from today
they're havin a bad time
i would like to not be charged for it tho XD
tho i unders;tand it's trying times, they're having a bad day
they are trying to go on holidays but they can't
GPUs getting absolutely hammered from around the world
The speed was beyond amazing, it's not even R1.
Even though i provided the direct API Key (BYOK), but it was not using it.
interesting 0t/s
hmm will check on this
patch is being deployed
@silent torrent the situation is getting freaky
how so? for v3 specifically?
Yeah, NovitaAI is bricked, official DeepSeek is getting cyberattacked, and DeepInfra has to absorb all traffic which sucks.
ah, yeah. IMO things will stabilize pretty significantly in a few days or a week or so
lots of hype and lots of difficulties providing stable inference
Hopefully, I am betting on more providers trying to do it, plus more distilled models
why deepseek v3 is so extremely slow?
I hope so, now its extremely slow
Which provider is slow for you?
how can I choose provider? I can do it in api call?
I see
seems that everything is slow, all providers
Its so massive. 670b parameters
LOL. It's hundreds billions params.
Guys it’s not slow because of the size, there are only a few active params during inference. The reason it’s slow is because of the massive model hype and low amount of providers supporting it
Along with deepseek getting ddosed and novitaai being broken
@tough whale someone said the magic word, OpenHands were notified but really NovitaAI has something weird, say didu use it under LiteLLM or direct API?
I just repeated what you said, I didn't actually test novitaai hah
Yeah I need to file the report sooner or later XD
What's the good replacement for deepseek v3?
For now I'm back to claude 3.5 sonnet but my wallet is not liking it
Also it's painfully lazy
Try the Qwen models
Wow
For what? Code?
Generating exercises for english learners in json format / translation.
Assuming it has to be cheap-ish, Llama 70B 3.3 should be good
It's very good at instruction following and strictly adhering to JSON
Very linguistic model overall, not finetuned so hard on math and such
Actually I noticed that llama is kinda bad in polish, I get better results from Qwen or models that focus to be multimodal
Which version of Qwen (is this version available on OpenRouter?) "models that focus to be multimodal" - any examples? 😄
@naive rock tried also Llama 70B 3.3 as you recommened and it knows Polish words better than others, need to test multiple cases, but sounds interesting
Curios if openrouter has a default system prompt, because it gave me good responses, but api returns differently
Ah, I should have asked if this was English only instruction or mixed
It should show in the OR chat thing. I can check in a bit.
Sorry, I meant multilingual not multimodal
All Qwen versions promise to be strong at multilingual tasks
Try Qwen 2.5 72b
It shouldn’t have any, it’s probably because of temperature, set it to 0 if you want consistency. For creative tasks like you want you are probably better off with higher temperatures
The default (1) is already pretty high tho
is this a common issue right now even on DeepSeek V3? I think I've read about this on R1 @silent torrent
0 token completion
Yes unfortunately
https://openrouter.ai/deepseek/deepseek-chat
Fireworks.ai is missing from the deepseek-v3 provider list.
DeepSeek-V3 is the latest model from the DeepSeek team, building upon the instruction following and coding abilities of the previous versions. Pre-trained on nearly 15 trillion tokens, the reported evaluations reveal that the model outperforms other open-source models and rivals leading closed-source models. Run DeepSeek V3 with API
I don’t know if it was mentioned before but Nebius has v3 as well. Will it be added as provider for v3 as well?
Discussing internally!
Should be live soon :)
By the way, nebius gives error for shortly from time to time saying that you are out of token or sometimes reached rate limit, then it works a few seconds later. Is it also known issue?
They don't have automatic top up so we briefly ran out of funds there, but we're good to go now. The rate limit is likely to happen if you're routing specifically to them
That’s exactly what I was doing but it is my intention to do so. Trying to keep the cost minimum.
Nevertheless, isn’t a bit weird I got that message in the very second request right after the first one. So it is not after a very long discussion
Everyone is also using them currently, so the rate limits are kind of shared
Hmm makes sense. Thank you very much
finally starting to get some usable providers which is nice
deepseek's endpoint is nice, but way too unreliable
Nebius just
Doesn’t respond properly
90% of the time
It like jsut chucks out context lmao
Those cheap providers either respond with context(V3)/reasoning process(R1), don't respond at all or stop after a few sentences.
yeee
like it reprints context?
no like it forgets context
ah
Is this in the chatroom?
API, but chattoom too on certain providers
Nebius in particular
For the chatroom, try increasing message limit to max
Yeah I had an api call route to nebius for some reason and it was awfully slow and then just stopped half way through
The DeepSeek: DeepSeek V3 (free)/deepseek/deepseek-chat:free has a problem with the provider Targon, the caching creates incoherency in some moments. Like in this situation: In a RP the user goes to a store, and for some reason all the swipes will show the same shopkeeper "Emily, blonde, 20's", I'm not sure if it's a caching problem or the AI just loves that name and age
I honestly don't understand how people manage to do good RP, let alone ERP, with DeepSeek and DeepSeek r1: it's slow, it often crashes, it often goes crazy, sometimes it's smart and other times it looks lobotomized, if you change providers you have to change presets and system prompts that each provider has its own different DeepSeek.
Boh, it may be me stupid and ignorant, but when I do RP and ERP I want to relax, not fight to get a single decent answer.
Really? How so?
V3 is too repetitive for RP. R1 is lobotomized in the sense that it often does what it wants and takes it too far. But it's so different from everything else we have at the moment, it's entertaining. Yes, I do have to edit almost every reply to keep it away from responding as me, but everything else is either basic, repetitive or moderated. R1 is like an obstacle course but the rewards are often suprisingly good.
Like I ask it to remember a number
Then next message it doesn’t know the number LMAO
Only for certain providers
It's pretty simple and at the same time it's just "use a good System Prompt and adjust the parameters", it has a 10% chance of messing up a reply but it's pretty good for RP, since it remembers a lot since it's context, R1 is slow but the V3 is faster to give replies, sometimes repeat one or two phrases but it happens to a lot of models
Yet I forgot to mention that ``The Provider Targor``` sometimes doesn't deliver the complete replies, sometimes it gives cut off replies with half of the reply, not finished
Hi, does all providers support Tools (function call) & structured output? I tried the standard version (not free) but failed, anyone could help me on that?
it looks like the DeepSeek provider is the only one supporting Tools at this time. https://openrouter.ai/deepseek/deepseek-chat
Got it! Thank you
it was decent on the official endpoint back when it actually worked
I swapped to deepseek api, but sometimes it happens too
I migrated to qwen32b coder instead
@silent torrent
Together on DeepSeekV3
what hapepneda lmfao
👀
i do not have a key configured
(i never did)
looking
Fixed now!
How do deepseek r1's sampling parameters compare on official website & open router? I seem to observe that same prompt gives better reasoning on deepseek.com and on openrouter it's shorter and more superficial. I'm concerned that the default sampling parameters don't match deepseek.com's
DEEPSEEK LOWERS OFF-PEAK API PRICING BY UP TO 75%
hope it will get reflected in OpenRouter pricing
https://x.com/Sino_Market/status/1894682095706128430
🇨🇳#BREAKING
DEEPSEEK LOWERS OFF-PEAK API PRICING BY UP TO 75% - STATEMENT
#CHINA #AI #DEEPSEEK
Source:
https://t.co/Us3Q42PFce
https://t.co/Us3Q42PFce
Finally turns out to be 545% true 🔥
And they just kindly opensourced their moat
what's a good temp for deepseek-v3 if I'm using the free version with only the targon provider?
when I'm using the non-free version, i use the official deepseek provider and for that one, i need to crank up the temp to 1.85 for decent results, i'm wondering what temp should i use to match that in the free version with targon
I personally use between 0.8 and 1, but the recommended values are much higher, ranging from 0 for math/code to 1.5 for creative writing: https://api-docs.deepseek.com/quick_start/parameter_settings
oh yeah i forgot, i'm using it for RP, and I needed to crank it to 1.9 to have good results
The RP recommendation was generally around 1.8. Higher has higher chance of vomit after 400+ tokens of output in a single generation, unless they fixed that. Think I've only used this model for about 2 weeks after release. And we had to fight it with a little bit of freq penalty.
GA on Azure (US reigons) https://techcommunity.microsoft.com/blog/machinelearningblog/announcing-deepseek-v3-on-azure-ai-foundry-and-github/4390438
$1.14/$4.56 1M in/out
($1.25/$5 if you use regional instead of global inferencing)
not very competitive pricing
Compared to official deepseek pricing it’s much more expensive
That’s exactly Toven said
MSFT aren't trying to be cheap, all the governance and procurement fluff they have to do for their government & enterprise clients...means they get those same clients without much competition
Yeah
I think Chutes(provider) glitched on v3 (Free):
its generating UNREALISTICALLY FAST responses, like 900 tokens per second...
but the responses are all being exactly the same
Yeah, probably cached responses.
So, what's happening with all the free DeepSeek models? I don't know which providers are the ones with the cache thing that makes replies be the same when you swipe/regenerate, but the cache thing is making me question everything. I'm using those models for RP in SillyTavern, and I have no idea about what to do with the cache thing. And I think I'm gonna suggest in the suggestions that the providers but have a tag in the model provider list, something that says "This provider may cache your prompts—learn more in [link]"
caching should not affect the outputs at all.
it’s just for processing the inputs faster in theory
but you could try blacklisting one provider to see if it fixes the issue
I have proof DeepSeek v3 was trained on the Bee movie script, and it's funny
It's logical to think that it was trained on the Bee movie script, but it's still funny.
Show us, enlight us..
I didn't make it output the entire script, but it went on like this for a little bit, and it matched up exactly with the bee movie script
I had to input some of the script, though
The new V3 is surprisingly fun to interact with. Maybe the best personality of any model.
Curious how they trained it in. Very playful.
better rlhf?
I love it rn, asked it for business advice and suggested me shady stuff straight away hahahaha
Yeah it's super fun
I was brainstorming with it and my system instruction doesn't have anything telling it to be casual, or rude, or dictating its tone or whatsoever
Yet it's the only model that says "your character won't be betrayed so easily unless there is a damn good reason"
lol
I went on a small rant about hating semi-colons and how we should just get rid of them. Every other model leaned heavily toward "Well actually here's why they're still a good idea-" whereas new V3 got playful with it and encouraged me on, squeaking in the counter-arguments as "counterarguments from grammar nerds". It finally ends with:
Compromise Proposal
Banish semi-colons except for:
- Winking at Grammar Nerds (to acknowledge their pain).
- Artistic Use (e.g., pretentious novel titles: "The Rain in Spain; The Lies We Weep").
Otherwise, let the comma and period split the semi-colon’s duties like a divorced couple dividing assets. The world might not end—just get slightly more breathless.
Verdict: Proceed with caution. Or recklessly. Language is a democracy (or should be).
Contender for my favorite LLM response of all time. This was in no way instructed, with a bog standard system prompt and temp
what did you ask it?
it's never refused me
Just a random task lol, never happened to me before either. Worked after one retry