#Serverless deepseek-ai/DeepSeek-R1 setup?

kindred surge
#

How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?

kindred tapir
#

if not, you can make a model that can run inference for that model

kindred surge
#

Basic config, 2 GPU count

#

Once it is running, I try the default hello world request and it just gets stuck IN_QUEUE for 8 minutes..

kindred tapir
#

or OOM

kindred tapir
#

seems like R1 is a really huge model, isn't it?

kindred surge
kindred tapir
#

in your workers or endpoint?

kindred surge
#

Oh, wait!! I just ran the 1.5B model and got this response:

#

When I tried running the larger model, I got errors about not enough memory

#

"Uncaught exception | <class 'torch.OutOfMemoryError'>; CUDA out of memory. Tried to allocate 3.50 GiB. GPU 0 has a total capacity of 44.45 GiB of which 1.42 GiB is free"

kindred tapir
#

seems like you got oom ya..

kindred surge
#

So how do I configure it?

kindred tapir
#

R1 is such a huge model, seems like you need 1 TB+ of VRAM
don't know exactly how to calculate it, but my estimate is somewhere in the range of 700 GB+ of VRAM
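
A rough way to sanity-check that estimate: memory for the weights alone is approximately the parameter count times the bytes per parameter. The sketch below assumes DeepSeek-R1's published size of 671B total parameters and ignores KV cache and activation memory, so real requirements are higher. The function name and the 48 GB per-GPU figure are illustrative choices, not something from this thread.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

R1_PARAMS = 671e9  # assumption: DeepSeek-R1 total parameter count

fp8_gb = weight_memory_gb(R1_PARAMS, 1)   # 8-bit weights: 1 byte per parameter
bf16_gb = weight_memory_gb(R1_PARAMS, 2)  # 16-bit weights: 2 bytes per parameter

# Illustrative: how many 48 GB GPUs just to hold the 8-bit weights.
gpus_needed = math.ceil(fp8_gb / 48)

print(f"8-bit weights:  ~{fp8_gb:.0f} GB")
print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"48 GB GPUs for 8-bit weights alone: {gpus_needed}")
```

Under these assumptions the 8-bit figure lands near the 700 GB+ guess and the 16-bit figure near the 1 TB+ guess, which is why a 2 GPU config runs out of memory.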

kindred surge
#

wow

#

so it's not really an option to deploy?..

kindred tapir
#

not sure, depends on your use hahah

kindred surge
#

I mean, Deepseek offers their own API keys

#

I thought it could be more cost effective to just run a serverless endpoint here but..

kindred tapir
kindred surge
#

hmm.. I see

#

Thanks for your help

kindred tapir
#

you're welcome bro

robust blaze
#

Hey @kindred tapir, I can still deploy the 7B DeepSeek R1 model instead of the huge model, right?

#

I am facing this issue

#

I am not that good at resolving issues.

rich leaf
#

Did you find a solution ?

robust blaze
#

Not yet...

kindred tapir
#

use trust remote code = true

robust blaze
#

where should i put this

#

in environment?

kindred tapir
#

env variable

#

like this

sly vine
# robust blaze

Is the model you are trying to run a GGUF quant? You'll need a custom script for GGUF quants, or if there are multiple models in a single repo

tardy verge
buoyant temple
#

try 48GB gpu, see if that helps.

tardy verge
#

Hello there, I increased the max token setting but I'm still getting only the beginning of the thinking. How can I fix that?

tardy verge
kindred tapir
kindred tapir
#

in your request, or use an OpenAI client SDK

tardy verge
tardy verge
kindred tapir
tardy verge
# kindred tapir How did you configure it

basically used this model casperhansen/deepseek-r1-distill-qwen-32b-awq with vLLM and RunPod serverless; except for lowering the model max length to 11000, I didn't modify any other settings

#

my input looks like this now:

#

{
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant."
      },
      {
        "role": "user",
        "content": "Explain llm models"
      }
    ],
    "max_tokens": 3000,
    "temperature": 0.7,
    "top_p": 0.95,
    "n": 1,
    "stream": false,
    "stop": [],
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "logit_bias": {},
    "user": "utilisateur_123",
    "best_of": 1,
    "echo": false
  }
}

kindred tapir
#

Not the correct way

tardy verge
#

Ah ok, do you have an example of correct input for this model?

kindred tapir
#

was going to give an example after this

#

wait

#
{
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant."
      },
      {
        "role": "user",
        "content": "Explain llm models"
      }
    ],
    "sampling_params": {
      "max_tokens": 3000,
      "temperature": 0.7,
      "top_p": 0.95,
      "n": 1,
      "stream": true,
      "presence_penalty": 0,
      "frequency_penalty": 0
    }
  }
}
#

like this, inside sampling_params
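
The difference between the two payloads can be sketched as a small helper that moves the generation settings under `sampling_params`. This is an illustration based only on the payloads posted in this thread: the helper name and the set of keys treated as sampling parameters are assumptions, not part of the worker's API. If top-level settings are ignored by the worker, a default token limit would apply, which would be consistent with getting only the beginning of the thinking.

```python
# Keys the helper treats as sampling settings (assumed set, taken from this thread).
SAMPLING_KEYS = {
    "max_tokens", "temperature", "top_p", "n",
    "presence_penalty", "frequency_penalty",
}

def nest_sampling_params(flat_input: dict) -> dict:
    """Return a request body with sampling settings moved under 'sampling_params'."""
    sampling = {k: v for k, v in flat_input.items() if k in SAMPLING_KEYS}
    rest = {k: v for k, v in flat_input.items() if k not in SAMPLING_KEYS}
    rest["sampling_params"] = sampling
    return {"input": rest}

payload = nest_sampling_params({
    "messages": [{"role": "user", "content": "Explain llm models"}],
    "max_tokens": 3000,
    "temperature": 0.7,
})
print(payload)
```

Keys outside that set, including `stream`, are left where they are; check the worker docs for where each one belongs.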

#

if not, just use the OpenAI SDK, it's easier (the docs are easily accessible on OpenAI's site) hahah

tardy verge
kindred tapir
#

no, for the client only

#

you can use packages from openai ( for the client) to connect using that url

#

the endpoint ID placeholder in that URL should be replaced with your own endpoint ID

#

and use your runpod api key as the auth in the openai client

#

try reading the docs on the RunPod website, the vLLM worker part
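
A minimal sketch of the client setup described above, assuming the `openai` Python package (v1 or later) and assuming the endpoint exposes an OpenAI-compatible route at `https://api.runpod.ai/v2/<endpoint id>/openai/v1`. The URL is not shown in this thread, so confirm it against the RunPod vLLM worker docs. The model name is the one mentioned earlier in the thread; the function names are hypothetical.

```python
def runpod_openai_base_url(endpoint_id: str) -> str:
    """Build the base URL for the OpenAI client.

    Assumed URL pattern; verify it against the RunPod vLLM worker docs.
    """
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

def ask(endpoint_id: str, runpod_api_key: str, prompt: str) -> str:
    """Send one chat request through the OpenAI SDK and return the reply text."""
    # Imported lazily so the URL helper above works without the package installed.
    from openai import OpenAI

    client = OpenAI(
        base_url=runpod_openai_base_url(endpoint_id),
        api_key=runpod_api_key,  # the RunPod API key is used as the auth token
    )
    response = client.chat.completions.create(
        model="casperhansen/deepseek-r1-distill-qwen-32b-awq",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3000,
    )
    return response.choices[0].message.content

# Example call (needs a deployed endpoint and a valid key):
# print(ask("YOUR_ENDPOINT_ID", "YOUR_RUNPOD_API_KEY", "Explain llm models"))
```

With the SDK, `max_tokens` is passed as a normal argument, so there is no need to build the `sampling_params` structure by hand.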

tardy verge
kindred tapir
#

Ya that's if you use http request directly

tardy verge
sly vine
# kindred tapir Ya that's if you use http request directly

Can you check the Cloudflare proxy (not in serverless) for vLLM OpenAI-compatible servers? Batched requests keep getting aborted only on proxied connections (not on direct connections using TCP forwarding(?)).

Related Github Issue: https://github.com/vllm-project/vllm/issues/2484

When the problem happens, the logs look something like this:

INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 0e89f1d2d94c4a039f868222c100cc8a.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request be67046b843244b5bf1ed3d2ff2f5a02.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request b532ed57647945869a4cae499fe54f23.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 6c56897bbc9d4a808b8e056c39baf91b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 75b645c69d7449509f68ca23b34f1048.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request eb87d6473a9d4b3699ca0cc236900248.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request ca15a251849c45329825ca95a2fce96b.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request c42bbea2f781469e89e576f98e618243.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 14b94d4ffd6646d69d4c2ad36d7dfd50.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 83c7dd9cbe9d4f6481b26403f46f1731.
INFO 01-17 03:11:25 async_llm_engine.py:134] Aborted request 3e98245d88534c53be230aa25c56d99a.
INFO 01-17 03:11:26 async_llm_engine.py:111] Finished request 49b84ef96af44f069056b2fc43526cdd.

kindred tapir
#

What's batched request?

#

Can you open a ticket

sly vine
#

Also that sdk in langchain doesn't support streaming requests in batch mode

kindred tapir
#

I see

#

If it doesn't abort on streaming, that means there might be some timeout here that's limiting it

sly vine
#

Can i do it tomorrow..?

kindred tapir
#

So no response and it's aborted by the proxy or smth

kindred tapir
#

Or the LangChain client

sly vine
#

on that github issue,

kindred tapir
sly vine
#

ppl have problems with nginx or some kind of proxy in front of the server

sly vine
kindred tapir
#

Yeah. You can check your audit logs maybe and tell them it's deleted

#

In website

sly vine
#

thx for the info!

kindred tapir
#

You're welcome!

sly vine
# kindred tapir You're welcome!

It was a Cloudflare problem that's described on the blog here.
https://blog.runpod.io/when-to-use-or-not-use-the-proxy-on-runpod/

btw does serverless use cloudflare proxies too?

#

If so, how do i run long-running requests on serverless without streaming?

kindred tapir
#

I'm not sure, ask in the ticket ya
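
One common pattern for long-running jobs without streaming is to submit the job asynchronously and poll for its status, so no single HTTP connection has to stay open for the whole run. The sketch below assumes RunPod serverless's asynchronous `/run` and `/status/<job id>` routes and its job status strings. Whether this avoids the proxy issue discussed here is not confirmed in the thread. `fetch_status` is a hypothetical callable, injected so the loop can be exercised without network access.

```python
import time
from typing import Callable

# Assumed terminal job states; "IN_QUEUE" and "IN_PROGRESS" mean keep waiting.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def wait_for_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until the job reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')} after {timeout_s}s")
        sleep(interval_s)
```

In practice `fetch_status` would wrap an authenticated GET to the status route for the job ID returned by the `/run` call.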

primal umbra
tardy verge
#

Is there any difference between using the fast deployment > vLLM option and using the pre-built Docker image?

kindred tapir
#

Quick deploy right? You can configure it before deploying using a setup

tardy verge
kindred tapir
#

Yep

#

Or from your endpoint's env variables

tardy verge
#

Ok 🙂 about my issue with the DeepSeek R1 distill, it seems the system prompt is weird and tricky to use. If anyone knows a good uncensored model to use with vLLM, let me know (I’m using Llama 3.3 but it’s too censored)

kindred tapir
#

Look for a fine-tuned model like the Dolphin one, I forgot the name

tardy verge
kindred tapir
#

Ya

tardy verge
#

ok:)