SGLang DeepSeek-V3-0324 | Runpod | Page 1

fluid scroll Apr 15, 2025, 7:23 PM

#

I have been trying to run DeepSeek-V3-0324 on an Instant Cluster with 2 x (8 x H100) and have so far been unsuccessful. I am trying to get the model to run multi-node and multi-GPU.

I have downloaded the model from Hugging Face onto a persistent volume, and I attach that volume to my Instant Cluster before launching. After launching, I run the PyTorch demo script from https://docs.runpod.io/instant-clusters/pytorch to make sure that the network is working (it is).

I then follow the instructions to get Deepseek-V3-0324 running according to: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Instead of following the absolute default instructions and doing:

# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

In its place, I run the following command on each node:

python3 -m sglang.launch_server --model-path DeepSeek-V3-0324 --tp 16 --dist-init-addr ${MASTER_ADDR}:${MASTER_PORT} --nnodes ${NUM_NODES} --node-rank ${NODE_RANK} --trust-remote-code

The issue is that this hangs. I check nvidia-smi to watch the model load, and memory usage on each GPU only ever reaches almost 1 GB and then goes no further.

Any help would be greatly appreciated.

Deploy with PyTorch | RunPod Documentation

Learn how to deploy an Instant Cluster and run a multi-node process using PyTorch.

GitHub

sglang/benchmark/deepseek_v3 at main · sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models. - sgl-project/sglang


spice hill Apr 15, 2025, 10:12 PM

#

Before discussing this problem

#

Does a 600B+ parameter fp16 model fit in 16xH100s?

#

With reasonable context length?

#

Or is it fp8?

#

Idk

#

Anyways why are you doing tensor parallel over network

gloomy kestrel Apr 16, 2025, 12:48 AM

#

spice hill Anyways why are you doing tensor parallel over network

They're using the instant cluster so it should work

#

Let me try and see how to run it on instant cluster, if it works I'll update here

gloomy kestrel Apr 16, 2025, 1:23 AM

#

launch_server.py: error: argument --nnodes: invalid int value: '${NUM_NODES}'

hmm i get this for every environment variable, including the world size one
im putting the command in the CMD directly, without going into the web terminal

#

im using the lmsysorg/sglang:latest

#

yeah can you try the 4bit first, and see if it works? or try more gpu vram

spice hill Apr 16, 2025, 2:27 AM

#

gloomy kestrel They're using the instant cluster so it should work

Its tensor parallel tho

#

Over network you should pipeline parallel

spice hill Apr 16, 2025, 2:28 AM

#

gloomy kestrel Let me try and see how to run it on instant cluster, if it works I'll update her...

Wow u r rich lol

#

Apparently it's fp8 in the official repo

#

So it should work
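A rough check of the memory question above, as a sketch only. It assumes DeepSeek-V3's 671B total parameters and 80 GB per H100, and it counts weights only (no KV cache, activations, or framework overhead), so the real headroom is smaller than these numbers suggest. The function names are made up for the example.

```shell
#!/usr/bin/env bash
# Weights-only memory estimate. Assumptions: 671B total parameters,
# 80 GB per H100, 1 byte/param for FP8 and 2 bytes/param for FP16/BF16.
PARAMS_B=671        # parameters, in billions
GPU_MEM_GB=80       # memory per H100, in GB

weights_gb() {      # $1 = bytes per parameter
  echo $(( PARAMS_B * $1 ))
}

pool_gb() {         # $1 = number of GPUs
  echo $(( GPU_MEM_GB * $1 ))
}

echo "FP8 weights:  $(weights_gb 1) GB"
echo "FP16 weights: $(weights_gb 2) GB"
echo "8 x H100:     $(pool_gb 8) GB"
echo "16 x H100:    $(pool_gb 16) GB"
```

Under these assumptions the FP8 weights (671 GB) fit in 16 x H100 (1280 GB) but not in 8 x H100 (640 GB), and FP16 weights (1342 GB) exceed both.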

#

https://www.naddod.com/blog/tensor-parallelism?srsltid=AfmBOor49ZgjIS_hjk9dWAmdVUDmjia5fEk7X1E18SsXs4voHbwh0dFI

Tensor Parallelism - NADDOD Blog

Tensor parallelism alleviates memory issues in large-scale training. RoCE enables efficient communication for GPU tensor parallelism, accelerating computations.

spice hill Apr 16, 2025, 4:26 AM

#

gloomy kestrel launch_server.py: error: argument --nnodes: invalid int value: '${NUM_NODES}' h...

Maybe because it's CMD

#

Try bash -c 'command'
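Why `bash -c` helps, as a minimal sketch: `${NUM_NODES}` is only expanded by a shell, so a command that is executed directly, without a shell in between, hands the literal text to the program. The example uses `printf` as a stand-in for the real server, and the value of `NUM_NODES` is made up.

```shell
#!/usr/bin/env bash
export NUM_NODES=2   # illustrative value; the cluster sets the real one

# No shell expansion: single quotes keep the text literal, which mimics a
# command that is executed directly without passing through a shell.
literal=$(printf '%s' '--nnodes ${NUM_NODES}')

# With a shell in between: bash -c expands the variable before running.
expanded=$(bash -c 'printf "%s" "--nnodes ${NUM_NODES}"')

echo "without shell: $literal"
echo "with bash -c:  $expanded"
```

The first line shows the same unexpanded `${NUM_NODES}` that `launch_server.py` rejected as an invalid int; the second shows the value the server needs.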

gloomy kestrel Apr 16, 2025, 1:17 PM

#

spice hill Wow u r rich lol

Yeah well not for an hour just testing

fluid scroll Apr 16, 2025, 9:37 PM

#

Yeah, I just want to run this thing. I'm happy to spend on GPUs for a period of time to get it running. But I can't even get the basics to work unfortunately... has anyone seen an example, on any infrastructure, of this working multi-node with pipeline parallelism? If not on RunPod, then anywhere else? It seems that no one has gotten this running anywhere.

gloomy kestrel Apr 17, 2025, 1:43 AM

#

i haven't tried running anything via network honestly, and im interested in this too 🙂

#

is this a normal expectation for cluster's network speed?
bgts5433fn5f2j
d2d7wb5ale6zhl

i feel like its a bit too slow

gloomy kestrel Apr 17, 2025, 2:14 AM

#

Don't know whats wrong here, nccl seems to be communicating

📎 logs.txt

#

oh and also riverfog, i've tried your recommendation, bash -c works! it reads the env correctly.
also the other recommendation, tp 8 with 8*2 gpus (16 total), will use only 8 gpus i guess. it doesnt load so i cant tell for sure, but when i try it some gpus' vram usage is stuck at 0%, some at 2%, some at 1%

when using tp 16, all gpus are stuck between 1% and 2%

inner runeBOT Apr 17, 2025, 2:17 AM

#

fluid scroll I have been trying to run Deepseek-V3-0324 using instant clusters with 2 x (8 x ...

@fluid scroll

Escalated To Zendesk

The thread has been escalated to Zendesk!

gloomy kestrel Apr 17, 2025, 2:18 AM

#

maybe try opening this

#

i think sglang doesnt support pp, only tp

fluid scroll Apr 17, 2025, 2:43 AM

#

gloomy kestrel oh and also riverfog, i've tried your rcommendation, bash -c works! it reads the...

Yes, I have exactly this issue.

fluid scroll Apr 17, 2025, 2:45 AM

#

gloomy kestrel i think sglang doesnt support pp, only tp

They claim to support pipeline parallelism:

#

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 from this url


gloomy kestrel Apr 17, 2025, 2:54 AM

#

fluid scroll They claim to support pipeline parallelism:

I dont see the pipeline parallelism here

#

isnt it supposed to be a configuration argument?

fluid scroll Apr 17, 2025, 3:11 AM

#

nnodes = 2

#

It’s a torchrun argument that gets passed through to the equivalent in SGLang I believe.

gloomy kestrel Apr 17, 2025, 3:13 AM

#

i see

#

yeah then it might use both

gloomy kestrel Apr 17, 2025, 3:14 AM

#

inner rune @fluid scroll

did you open a ticket?

#

i will try vllm in the future maybe its better, id recommend you to try it too

fluid scroll Apr 17, 2025, 3:27 AM

#

@gloomy kestrel have you tried vLLM with instant clusters? I believe the communication mechanism under the hood doesn't work with the way that Runpod sets up inter-node communication. I couldn't get it to work (this was a few weeks ago when it was still in beta though).

fluid scroll Apr 17, 2025, 3:28 AM

#

gloomy kestrel did you open a ticket?

I wasn't sure where to open a ticket because I'm not sure where the error is really coming from... I think it's an SGLang issue but I wasn't clear.

spice hill Apr 17, 2025, 3:46 AM

#

fluid scroll Yeah, I just want to run this thing. I'm happy to spend on GPUs for a period of ...

are you still here?

#

ive never used sglang but

#

vllm pipeline parallel works well with multi node

#

even with not-that-good network bandwidth

fluid scroll Apr 17, 2025, 4:13 AM

#

I'm still trying @spice hill. Have you tested vLLM with Instant Cluster or do you have another solution where I can test multi-gpu in the cloud to run this?

spice hill Apr 17, 2025, 4:14 AM

#

@fluid scroll but do you really need multigpu?

fluid scroll Apr 17, 2025, 4:14 AM

#

Yeah, I specifically need to test tensor parallelism and pipeline parallelism: 2 nodes of 8 x H100

spice hill Apr 17, 2025, 4:14 AM

#

i mean

#

you can host the same model in 1 node

#

you want deepseek v3 at fp8

#

right?

fluid scroll Apr 17, 2025, 4:18 AM

#

My requirements are to run Deepseek-V3-0324 over two nodes by whatever means - I just have to see that pipeline and tensor parallelism can work for the model

spice hill Apr 17, 2025, 4:18 AM

#

okay

#

is sglang required too?

fluid scroll Apr 17, 2025, 4:18 AM

#

It's less about actually using it - more about showing it can work

#

No, it can be anything

#

vLLM would be fine too

spice hill Apr 17, 2025, 4:18 AM

#

vllm should work

#

ive done it in the past

fluid scroll Apr 17, 2025, 4:18 AM

#

Have you got that working in Runpod Instant Clusters?

spice hill Apr 17, 2025, 4:18 AM

#

not with 2x8

#

but that doesnt matter

spice hill Apr 17, 2025, 4:19 AM

#

fluid scroll Have you got that working in Runpod Instant Clusters?

no in AWS

#

should work with runpod tho

fluid scroll Apr 17, 2025, 4:19 AM

#

I had trouble running the basic torchrun script:

https://docs.runpod.io/instant-clusters/pytorch


#

It failed to run it when I tried with VLLM

#

Yeah, I tried AWS but they wouldn't give me any GPUs so now just trying with runpod

#

I will try vllm again

spice hill Apr 17, 2025, 4:20 AM

#

can you ping the other pod

fluid scroll Apr 17, 2025, 4:20 AM

#

Yeah I can ping it

spice hill Apr 17, 2025, 4:20 AM

#

with ip

fluid scroll Apr 17, 2025, 4:20 AM

#

It's something to do with Ray, which vLLM uses under the hood

spice hill Apr 17, 2025, 4:20 AM

#

do u use vllm docker image

#

or sth else?

fluid scroll Apr 17, 2025, 4:21 AM

#

I was using something else - but I can use that docker image

spice hill Apr 17, 2025, 4:21 AM

#

i succeeded with the docker image

#

soo

fluid scroll Apr 17, 2025, 4:21 AM

#

Thanks for letting me know, I'll try that out and let you know how it goes

spice hill Apr 17, 2025, 4:23 AM

#

@fluid scroll https://docs.vllm.ai/en/latest/serving/distributed_serving.html

here's the multinode docs

#

https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh

GitHub

vllm/examples/online_serving/run_cluster.sh at main · vllm-project...

A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm

#

the cluster making script

#

docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

this is the docker run command so

#

you can modify this and run it

fluid scroll Apr 17, 2025, 4:24 AM

#

Where do I run this? When I create the pod?

spice hill Apr 17, 2025, 4:24 AM

#

no so what the docs say is

#

you have two physical machines

#

then you create a ray container on both physical machines and make a cluster.

#

but in your case you have no access to physical machines

fluid scroll Apr 17, 2025, 4:25 AM

#

Yeah, the issue I had is that runpod doesn't have that

#

I have to work within those bounds

#

I don't have AWS or anything to work with

spice hill Apr 17, 2025, 4:25 AM

#

so you should translate the docker run command to runpod's template

spice hill Apr 17, 2025, 4:27 AM

#

spice hill docker run \ --entrypoint /bin/bash \ --network host \ --name node \...

docker run --entrypoint /bin/bash --network host --name node --shm-size 10.24g --gpus all -v /path/to/the/huggingface/home/in/this/node:/root/.cache/huggingface -e VLLM_HOST_IP=ip_of_this_node vllm/vllm-openai -c "ray start --block --address=ip_of_head_node:6379"

fluid scroll Apr 17, 2025, 4:28 AM

#

Yeah, this was my next idea - I just have to figure out how to modify runpod to work with this since "docker" can't be run inside of a pod once I start it

#

It has to be part of a template or something. I'm pretty new to this side of Runpod.

spice hill Apr 17, 2025, 4:30 AM

#

should be
image name: vllm/vllm-openai
CMD: python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379
mount nw volume to ~/.cache/

for the worker

#

image name: vllm/vllm-openai
CMD: python3 -m vllm.entrypoints.openai.api_server -c ray start --block --head --port=6379
env: VLLM_HOST_IP=ip_of_this_node

#

for the head

fluid scroll Apr 17, 2025, 4:33 AM

#

Are you starting these as two separate pods or using Instant Cluster?

#

In this case it looks like you're using two separate pods with global networking or something

spice hill Apr 17, 2025, 4:34 AM

#

can you apply diff images

fluid scroll Apr 17, 2025, 4:34 AM

#

Not with instant cluster

spice hill Apr 17, 2025, 4:34 AM

#

for the two pods in clusters?

#

or diff setting at least

fluid scroll Apr 17, 2025, 4:34 AM

#

Screenshot_2025-04-17_at_12.34.39_AM.png

#

Doesn't look like it

spice hill Apr 17, 2025, 4:36 AM

#

lol

#

should we write a script?

#

its solvable

fluid scroll Apr 17, 2025, 4:37 AM

#

Yeah I'd love to lol, been trying to run DeepSeek V3 across two nodes for a while now

#

vllm serve /path/to/the/model/in/the/container
--tensor-parallel-size 8
--pipeline-parallel-size 2

#

I was thinking this should just work

#

If I spin up a cluster

#

And go into each node and run this

#

And make sure to pass in the right information for the host...

#

But then it fails because of Ray

spice hill Apr 17, 2025, 4:37 AM

#

yeah but we need a script

#

to build the ray cluster first

#

it should run INSIDE a ray cluster

fluid scroll Apr 17, 2025, 4:38 AM

#

Yeah, I feel like that's outside the default scope of instant clusters. Are you suggesting we set up a ray cluster inside of our non-ray cluster?

spice hill Apr 17, 2025, 4:38 AM

#

that's what i meant

#

😄

#

the writing a script part was for that

fluid scroll Apr 17, 2025, 4:39 AM

#

That would be cool... solve a lot of problems lol

spice hill Apr 17, 2025, 4:43 AM

#

ill try with global networking first

#

just to see if it works

fluid scroll Apr 17, 2025, 4:43 AM

#

I'll try form a ray cluster in Instant Cluster again

spice hill Apr 17, 2025, 4:43 AM

#

try
python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379

this inside a vllm container

fluid scroll Apr 17, 2025, 4:44 AM

#

Sure will try now

spice hill Apr 17, 2025, 4:45 AM

#

so you have separate ssh access to

#

the two nodes right?

fluid scroll Apr 17, 2025, 4:45 AM

#

Yes

spice hill Apr 17, 2025, 4:45 AM

#

good

fluid scroll Apr 17, 2025, 4:45 AM

#

Will send sc in a sec

#

api_server.py: error: argument --block-size: expected one argument

spice hill Apr 17, 2025, 4:47 AM

#

isnt it --block?

fluid scroll Apr 17, 2025, 4:47 AM

#

I copied what you sent and that's what it gave me

#

In my case: python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=10.65.0.2:6379

spice hill Apr 17, 2025, 4:47 AM

#

python3 -m vllm.entrypoints.openai.api_server -c ray start --block --address=ip_of_head_node:6379
this?

fluid scroll Apr 17, 2025, 4:48 AM

#

Yeah, that's what I did ^

spice hill Apr 17, 2025, 4:49 AM

#

hmm

#

vllm/vllm-openai
the image is this

fluid scroll Apr 17, 2025, 4:50 AM

#

Yep

spice hill Apr 17, 2025, 4:51 AM

#

i think i'll test in mine first

#

wait a sec

#

oh

#

it was

#

just
ray start --block --address=10.65.0.2:6379

#

or bash -c "ray start ...."

fluid scroll Apr 17, 2025, 4:59 AM

#

Yeah, I'm doing that right now actually

spice hill Apr 17, 2025, 4:59 AM

#

didnt see the --entrypoint /bin/bash

fluid scroll Apr 17, 2025, 4:59 AM

#

But can't get the worker to connect

spice hill Apr 17, 2025, 4:59 AM

#

xD

gloomy kestrel Apr 17, 2025, 4:59 AM

#

fluid scroll @gloomy kestrel have you tried vLLM with instant clusters? I believe the c...

no ihavent

fluid scroll Apr 17, 2025, 4:59 AM

#

Screenshot_2025-04-17_at_12.59.20_AM.png

spice hill Apr 17, 2025, 4:59 AM

#

fluid scroll But can't get the worker to connect

wdym?

gloomy kestrel Apr 17, 2025, 4:59 AM

#

fluid scroll But can't get the worker to connect

any errors or logs?

fluid scroll Apr 17, 2025, 4:59 AM

#

Tried 6379 and couldn't get that to work

gloomy kestrel Apr 17, 2025, 4:59 AM

#

fluid scroll Tried 6379 and couldn't get that to work

thats the port from env?

fluid scroll Apr 17, 2025, 4:59 AM

#

So tried a port I knew was exposed 29400 since I can ping between the nodes with that
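Note that `ping` only shows the host is reachable over ICMP; it says nothing about whether a particular TCP port accepts connections. A small sketch for checking a port from the worker, using bash's `/dev/tcp` redirection. The function name `check_port` is made up for this example, and the address and port in the comment are the ones mentioned in the thread.

```shell
#!/usr/bin/env bash
# check_port HOST PORT -> prints "open" or "closed"
# Uses bash's built-in /dev/tcp pseudo-device; `timeout` (coreutils)
# bounds the wait in case packets are silently dropped.
# "closed" here means the connection was refused, timed out, or failed.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Example: is the Ray head port reachable from this node?
# check_port 10.65.0.2 6379
```

If the head reports "closed" for 6379 from the worker while `ray start --head` is running, the problem is likely in networking or binding and not in vLLM itself.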

spice hill Apr 17, 2025, 5:00 AM

#

ray start --block --address=10.65.0.2:6379

#

this?

fluid scroll Apr 17, 2025, 5:00 AM

#

spice hill this?

This works

spice hill Apr 17, 2025, 5:00 AM

#

is adding --block

#

make a diff?

fluid scroll Apr 17, 2025, 5:00 AM

#

But connecting from worker doesn't

#

Let me try with block

spice hill Apr 17, 2025, 5:00 AM

#

ray start --block --head --port=6379

#

for the head

#

ray start --block --address=10.65.0.2:6379

#

for the worker

fluid scroll Apr 17, 2025, 5:02 AM

#

spice hill Apr 17, 2025, 5:02 AM

#

maybe it doesnt work bc u already started a vllm process in the start command

fluid scroll Apr 17, 2025, 5:02 AM

#

I actually haven't started anything in this one

#

I am not using the vLLM image this time. I started a new cluster that doesn't have vLLM. I pip installed it.

#

Regardless, Ray should work independently

spice hill Apr 17, 2025, 5:03 AM

#

yeah

#

same thought

#

but had nothing to blame other than that

#

check

#

ufw

#

just in case

fluid scroll Apr 17, 2025, 5:04 AM

#

What is UFW?

spice hill Apr 17, 2025, 5:04 AM

#

and other firewalls too

#

ubuntu firewall

fluid scroll Apr 17, 2025, 5:05 AM

#

Ah okay let me check

spice hill Apr 17, 2025, 5:05 AM

#

that was the problem in my last attempt

#

@fluid scroll i have one question

fluid scroll Apr 17, 2025, 5:07 AM

#

Trying to check but have to install packages

#

@spice hill yeah what's up?

spice hill Apr 17, 2025, 5:07 AM

#

why does the first pic say 172.xx

#

but second pic says 10.60.sth

fluid scroll Apr 17, 2025, 5:07 AM

#

That's the "master address"

spice hill Apr 17, 2025, 5:08 AM

#

master addr?

fluid scroll Apr 17, 2025, 5:08 AM

#

#

https://docs.runpod.io/instant-clusters/

Overview | RunPod Documentation

Instant Clusters enable high-performance computing across multiple GPUs with high-speed networking capabilities.

#

NODE_ADDR is the address of the individual node

fluid scroll Apr 17, 2025, 5:08 AM

#

gloomy kestrel thats the port from env?

That's the one that ray uses

fluid scroll Apr 17, 2025, 5:09 AM

#

gloomy kestrel no ihavent

vLLM uses Ray under the hood and it isn't playing nicely

#

That's why I was hoping SGLang would work since it uses pytorch

#

But then we have that weird bug where it hangs model loading at 1% lol

#

I suspect that it's actually only loading the pytorch stuff

#

And never actually loads any of the weights in

spice hill Apr 17, 2025, 5:11 AM

#

maybe it binds to the wrong nic?

fluid scroll Apr 17, 2025, 5:11 AM

#

We use eth1 I think here

spice hill Apr 17, 2025, 5:11 AM

#

and receives from the public ip

#

but not from private ip
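If the wrong interface is also what stalls the NCCL/PyTorch side (the SGLang hang), one thing to try is pinning the interface explicitly before launching. `NCCL_SOCKET_IFNAME`, `GLOO_SOCKET_IFNAME`, and `NCCL_DEBUG` are standard NCCL/Gloo environment variables; `eth1` is only the guess made in this thread, and `CLUSTER_IFACE` is a name invented for the sketch.

```shell
#!/usr/bin/env bash
# CLUSTER_IFACE is a hypothetical helper variable; eth1 is the thread's guess
# for the cluster-internal interface and should be verified on the node.
CLUSTER_IFACE="${CLUSTER_IFACE:-eth1}"

export NCCL_SOCKET_IFNAME="$CLUSTER_IFACE"   # interface NCCL uses for socket traffic
export GLOO_SOCKET_IFNAME="$CLUSTER_IFACE"   # same for PyTorch's Gloo backend
export NCCL_DEBUG=INFO                        # log which interface and transport NCCL picks

echo "NCCL/Gloo pinned to ${CLUSTER_IFACE}"
```

With `NCCL_DEBUG=INFO` set, the startup log should show which interface NCCL actually selected, which helps confirm or rule out this theory.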

fluid scroll Apr 17, 2025, 5:12 AM

#

Issue is that I'm not sure if that's something we can even fix under the hood with vLLM... I just don't know enough about how vLLM works

spice hill Apr 17, 2025, 5:12 AM

#

if ray works vllm works

fluid scroll Apr 17, 2025, 5:12 AM

#

And vLLM uses that same ray cluster?

spice hill Apr 17, 2025, 5:12 AM

#

yeah

fluid scroll Apr 17, 2025, 5:13 AM

#

Hmm

spice hill Apr 17, 2025, 5:13 AM

#

you can just use the ray cluster

#

as one computer

#

vllm does the finicky things by itself
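As a sketch of what "use the ray cluster as one computer" would look like: once a Ray cluster spans both nodes, the `vllm serve` command quoted earlier in the thread is launched once, with tensor parallel size equal to the GPUs per node and pipeline parallel size equal to the number of nodes. The helper `vllm_cmd` is made up for this example and only prints the command; the model path is the placeholder from the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: TP = GPUs per node, PP = number of nodes.
# It prints the command instead of running it.
vllm_cmd() {
  local model=$1 gpus_per_node=$2 nodes=$3
  echo "vllm serve ${model} --tensor-parallel-size ${gpus_per_node} --pipeline-parallel-size ${nodes}"
}

# 2 nodes of 8 x H100, as in this thread: 8 * 2 = 16 GPUs in total.
vllm_cmd /path/to/the/model/in/the/container 8 2
```

Keeping tensor parallelism inside a node and pipeline parallelism across nodes matches the earlier advice in the thread about not doing tensor parallel over the network.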

fluid scroll Apr 17, 2025, 5:13 AM

#

I actually can't even ping between each machine now

spice hill Apr 17, 2025, 5:14 AM

#

maybe ray start --block --head --port 6379 --node-ip-address 10.65.0.2
in the head?

spice hill Apr 17, 2025, 5:14 AM

#

fluid scroll I actually can't even ping between each machine now

wut

fluid scroll Apr 17, 2025, 5:14 AM

#

spice hill Apr 17, 2025, 5:15 AM

#

can u just ping it

#

ping 10.sth

fluid scroll Apr 17, 2025, 5:16 AM

#

spice hill Apr 17, 2025, 5:16 AM

#

ufw status?

fluid scroll Apr 17, 2025, 5:17 AM

#

Interesting

#

My environment is messed up now

#

I can't run the default torch script here

#

https://docs.runpod.io/instant-clusters/pytorch


spice hill Apr 17, 2025, 5:17 AM

#

lol

fluid scroll Apr 17, 2025, 5:17 AM

#

So I messed something up with whatever we tried

spice hill Apr 17, 2025, 5:17 AM

#

hmm

#

maybe start with a fresh pytorch image

fluid scroll Apr 17, 2025, 5:17 AM

#

I'

spice hill Apr 17, 2025, 5:17 AM

#

and install everything (ray and vllm)

fluid scroll Apr 17, 2025, 5:18 AM

#

I'm going to have to refund this account lol, it won't let me start another pod

#

Not enough money in the account lol

#

I may sleep for a bit and get back to this, interesting problem to solve

spice hill Apr 17, 2025, 5:18 AM

#

great

#

the community says

#

--node-ip-address providing this should make it bind to the proper address

#

so maybe try that next time

#

in the head node
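Putting the thread's Ray commands together, here is a sketch of the "one script for every node" idea, since an Instant Cluster runs the same command on each pod. It assumes the `NODE_RANK`, `MASTER_ADDR`, and `NODE_ADDR` variables mentioned in the thread are set, that rank 0 is the head, and that `MASTER_ADDR` is the head's cluster-internal IP. Passing `--node-ip-address` on the worker as well is an addition of this sketch. `RUN`, `RAY_PORT`, and `ray_cmd` are names made up for the example, and the script only prints the command unless `RUN=1` is set. Whether this fixes the worker connection was not confirmed in the thread.

```shell
#!/usr/bin/env bash
RAY_PORT="${RAY_PORT:-6379}"

# Print the ray command for this node based on its rank.
# $1 = node rank, $2 = head address, $3 = this node's IP
ray_cmd() {
  local rank=$1 master=$2 node_ip=$3
  if [ "$rank" -eq 0 ]; then
    echo "ray start --block --head --port=${RAY_PORT} --node-ip-address=${node_ip}"
  else
    echo "ray start --block --address=${master}:${RAY_PORT} --node-ip-address=${node_ip}"
  fi
}

# Only act when the cluster variables are present.
if [ -n "${NODE_RANK:-}" ]; then
  cmd=$(ray_cmd "$NODE_RANK" "${MASTER_ADDR:-}" "${NODE_ADDR:-}")
  echo "$cmd"
  # Execute only when explicitly asked to; word splitting of $cmd is intended.
  if [ "${RUN:-0}" = "1" ]; then
    exec $cmd
  fi
fi
```

Run without `RUN=1` first on each node to check that the printed addresses are the 10.x cluster addresses and not the 172.x ones.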
