#50/50 success with running a standard vllm template

9 messages · Page 1 of 1 (latest)

north herald
#

When I start vllm/vllm-openai:latest on 2xA100 or 4xA40, it only comes up successfully about one time in two or three. I haven't noticed any pattern behind it; it just fails sometimes. Here are the parameters I use for 2xA100, for instance: --host 0.0.0.0 --port 8000 --model meta-llama/Llama-3.3-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key key --max-model-len 16256 --tensor-parallel-size 2
I also have some logs.

cyan bear
#

Whoops, did you post your API key here, @north herald?

sleek creek
#

--max-model-len may be too large; try a smaller number.
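
As a rough sanity check of that suggestion, here is a back-of-the-envelope memory budget for the flags quoted above. It is only a sketch under assumptions that are not stated in the thread: 80 GB A100s, 48 GB A40s, and the published Llama 3 70B shape (about 70.6 billion parameters, 80 layers, 8 KV heads, head dimension 128). The helper names are made up for illustration, and the estimate ignores activations, the CUDA context, and communication buffers.

```python
# Back-of-the-envelope memory budget for the launch flags quoted in this thread.
# Assumptions (not from the thread): 80 GB A100s, 48 GB A40s, and the published
# Llama 3 70B shape (~70.6e9 parameters, 80 layers, 8 KV heads, head dim 128).

GB = 1e9  # decimal gigabytes, to keep the arithmetic simple


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Total weight size; bfloat16 stores each parameter in 2 bytes."""
    return n_params * bytes_per_param


def kv_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def budget(n_gpus: int, gpu_gb: float, utilization: float,
           max_model_len: int, n_params: float = 70.6e9) -> dict:
    """Summed over all GPUs: usable memory, weights, and KV for one full-length sequence."""
    usable = n_gpus * gpu_gb * GB * utilization
    weights = weight_bytes(n_params)
    one_seq_kv = kv_bytes_per_token() * max_model_len
    return {
        "usable_gb": usable / GB,
        "weights_gb": weights / GB,
        "kv_one_full_sequence_gb": one_seq_kv / GB,
        "headroom_gb": (usable - weights - one_seq_kv) / GB,
    }


if __name__ == "__main__":
    for name, n, gb in [("2xA100", 2, 80), ("4xA40", 4, 48)]:
        rounded = {k: round(v, 1) for k, v in budget(n, gb, 0.95, 16256).items()}
        print(name, rounded)
```

Under these assumptions the 2xA100 setup has only about 5 GB of headroom left across both GPUs after the weights and one full-length sequence, so a lower --max-model-len would buy some margin there. The same arithmetic leaves far more room on 4xA40, so it would not by itself explain failures on that configuration.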

north herald
# sleek creek --max-model-len may be too large; try a smaller number.

Then why can I often start it fine with these parameters? To use this in a commercial application I need consistency: either it works and I can use it, or it doesn't and I can troubleshoot it. I use RunPod in my job and almost always stick with it for prototyping, but I cannot recommend it to my clients for production because such issues do occur occasionally. That's a pity, because I like RunPod.

zenith mason
#

Check the logs. It might be because the GPU or container is OOMing; that's a common problem we see.

north herald
#

@zenith mason can you please explain what OOMing means and what I should look for in the logs? It appears the problem has gotten much worse, and now I can't start the pod at all.

zenith mason
#

@sleek creek are you able to check? See if we also log it in the pod logs when they OOM.
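
For context, OOM is short for "out of memory": either the GPU cannot satisfy an allocation, or the kernel kills the container's process because host RAM ran out. A minimal sketch of what checking a saved log for that could look like follows. The patterns are strings that commonly accompany such failures (PyTorch's "CUDA out of memory" message and the Linux OOM killer's messages). Whether the pod logs actually contain them is an assumption, not something confirmed in this thread, and `find_oom_lines` is a hypothetical helper name.

```python
import re
import sys

# Substrings that commonly accompany out-of-memory failures. Whether a given
# platform surfaces them in its pod logs is an assumption.
OOM_PATTERNS = [
    r"CUDA out of memory",             # PyTorch GPU allocation failure message
    r"OutOfMemoryError",               # exception class name in PyTorch tracebacks
    r"Out of memory: Killed process",  # Linux kernel OOM killer (host RAM)
    r"oom-kill",                       # kernel / cgroup OOM event marker
]
OOM_RE = re.compile("|".join(OOM_PATTERNS), re.IGNORECASE)


def find_oom_lines(log_text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for each OOM-looking line."""
    return [
        (i, line.strip())
        for i, line in enumerate(log_text.splitlines(), start=1)
        if OOM_RE.search(line)
    ]


if __name__ == "__main__":
    # Usage: python find_oom.py saved_pod_log.txt  (file name is hypothetical)
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
            for lineno, line in find_oom_lines(f.read()):
                print(f"{lineno}: {line}")
```

If nothing matches, that does not rule out an OOM: a process killed by the kernel may leave no message in its own output, only an abrupt stop.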

sleek creek
cyan bear