Incredibly long startup time when running 70b models via vllm | Runpod | Page 1

storm summit Nov 12, 2024, 2:13 PM

#

I have been trying to deploy 70b models as a serverless endpoint and am seeing startup times of almost 1 hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached. If anyone can spot an issue or knows how to deploy 70b models so that they work reliably, I would greatly appreciate it.

Some other observations:

In support, someone told me that I need to manually set the env BASE_PATH=/workspace, which I am now always doing.
I sometimes, but not always, see this in the logs: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'..., even though I am deploying a completely different model.
I sometimes, but not always, get issues when I don't specify the chat template:

2024-11-12 12:59:15.351
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 73, in __init__
[rank0]:     self.chat_template = load_chat_template(chat_template)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/chat_utils.py", line 335, in load_chat_template
[rank0]:     with open(chat_template, "r") as f:
[rank0]: TypeError: expected str, bytes or os.PathLike object, not dict
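
The traceback shows `open()` being handed a dict where it expects a path. A minimal sketch of that failure, plus a type check that would surface the problem earlier, might look like the following. The helper names `reproduce_open_error` and `check_chat_template` are hypothetical and are not part of vLLM or the Runpod worker. Why the chat template ends up as a dict in this deployment is not established here.

```python
def reproduce_open_error(value):
    """Call open() on an arbitrary value and return the TypeError text, if any."""
    try:
        with open(value, "r") as f:
            return f.read()
    except TypeError as exc:
        return str(exc)


def check_chat_template(value):
    """Hypothetical guard: accept only None or a string (path or template text)."""
    if value is None or isinstance(value, str):
        return value
    raise TypeError(
        f"chat template must be a string or None, got {type(value).__name__}"
    )


# A dict standing in for whatever non-string value reached load_chat_template.
bad_template = {"default": "{{ messages }}"}
print(reproduce_open_error(bad_template))
```

Passing the chat template explicitly as a string avoids this code path receiving a dict, which matches the observation that the error only appears when no template is specified.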

📎 logs-dolphin-2.9.1-llama-3-70b-fb.txt

#

I think my main issue is the same as https://github.com/runpod-workers/worker-vllm/issues/112

GitHub

Randomly the machine get stuck on loading model · Issue #112 · runp...

Hi, as the title suggests completely random, the machine gets stuck on Using model weights format ['*.safetensors'] and I have to manually terminate the worker and restart it. Do you have a...

true knotBOT Nov 12, 2024, 2:19 PM

#

@storm summit

Escalated To Zendesk

The thread has been escalated to Zendesk!

ivory ginkgo Nov 12, 2024, 2:19 PM

#

I'll try to open a ticket, you can check from that button

storm summit Nov 12, 2024, 2:22 PM

#

Thanks, it now says Ticket created

tepid stirrup Nov 12, 2024, 2:22 PM

#

Usually you don't want to download models when a request is sent.

storm summit Nov 12, 2024, 2:25 PM

#

Yes, it would indeed be better if that wasn't necessary, but this is how the vllm-worker appears to be implemented. I could live with a long startup time because I mostly want to do batch requests, but if you know how to deploy the vllm template with a preloaded model, then I'd gladly use that.
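
For a sense of scale, here is a rough back-of-the-envelope sketch of how long downloading 70b weights at request time could take. It assumes 2 bytes per parameter (16-bit weights). The throughput figures are purely illustrative and are not measurements from Runpod.

```python
def estimate_weight_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def estimate_download_minutes(size_gb: float, throughput_mb_per_s: float) -> float:
    """Approximate download time in minutes at a sustained throughput in MB/s."""
    return size_gb * 1000 / throughput_mb_per_s / 60


size_gb = estimate_weight_size_gb(70e9)  # assumes 16-bit weights
for rate in (50, 100, 500):  # hypothetical sustained MB/s
    minutes = estimate_download_minutes(size_gb, rate)
    print(f"{size_gb:.0f} GB at {rate} MB/s: about {minutes:.0f} min")
```

Under these assumptions the download alone can take tens of minutes, before the weights are loaded onto the GPUs, which is why keeping the weights on storage the worker can already read matters for cold starts.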

tepid stirrup Nov 12, 2024, 2:26 PM

#

Also, serverless does not use /workspace.

storm summit Nov 12, 2024, 2:27 PM

#

OK, that is what I was told when I opened a support ticket yesterday, but then I will remove it again.

#

📎 message.txt
