I have been trying to run Deepseek-V3-0324 using instant clusters with 2 x (8 x H100s) and have so far been unsuccessful. I am trying to get the model to run multi-node + multi-gpu.
I have downloaded the model from Huggingface onto a persistent and attach the persistent volume to my instant cluster before launching. After launching, I then run the Pytorch demo script as presented in https://docs.runpod.io/instant-clusters/pytorch to make sure that the network is working (it does).
I then follow the instructions to get Deepseek-V3-0324 running according to: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
Instead of following the absolute default instructions and doing:
# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# node 2
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code
In its place, I run the following command on each node:
python3 -m sglang.launch_server --model-path DeepSeek-V3-0324 --tp 16 --dist-init-addr ${MASTER_ADDR}:${MASTER_PORT} --nnodes ${NUM_NODES} --node-rank ${NODE_RANK} --trust-remote-code
The issue is that this hangs. I check nvidia-smi to see the model loading and it only ever loads each GPU up to almost 1GB before it goes up no further.
Any help would be greatly appreciated.