So when I start vllm/vllm-openai:latest on 2xA100 or 4xA40, I am only able to do it 1 out of 2 or 1 out of 3 times. I haven't noticed any logic behind it, it just fails sometimes. Here are the parameters I use for 2xA100, for instance: --host 0.0.0.0 --port 8000 --model meta-llama/Llama-3.3-70B-Instruct --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.95 --api-key key --max-model-len 16256 --tensor-parallel-size 2
I also have some logs.
#50/50 success with running a standard vllm template
Woop did you post your API key here @north herald
max-model-len may be too large, try a smaller number.
Then why can I often start it fine with these parameters? To use this in a commercial application I need consistency: either it works and I can use it, or it doesn't and I can troubleshoot it. I use runpod in my job and I almost always stick to it for prototyping, but I cannot suggest it to my clients for production because such issues do occur occasionally. That's a pity because I like runpod.
check logs, it might be because the gpu or container is ooming, that's a common problem we see
@zenith mason can you please explain what ooming means and what I should look for in the logs? It appears that the problem got much worse and now I can't start the pod at all
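For the "what should I look for" part, here is a minimal sketch of scanning a saved log for typical out-of-memory markers. The pattern list is an assumption based on strings that commonly show up with memory failures (PyTorch's "CUDA out of memory" message, a bare "Killed", exit code 137); it is not an official list from runpod or vLLM, and `find_oom_lines` is a hypothetical helper name.

```python
import re

# Assumed markers that commonly accompany out-of-memory failures.
# This list is illustrative, not exhaustive or official.
OOM_PATTERNS = [
    r"CUDA out of memory",   # PyTorch GPU allocation failure message
    r"OutOfMemoryError",     # exception class name in PyTorch tracebacks
    r"\bOOM\b",              # generic abbreviation
    r"\bKilled\b",           # process killed, often by the kernel OOM killer
    r"exit code 137",        # 128 + SIGKILL (9), typical of a container OOM kill
]
OOM_RE = re.compile("|".join(OOM_PATTERNS), re.IGNORECASE)


def find_oom_lines(log_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that match any OOM marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if OOM_RE.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample text for demonstration, not real pod output.
    sample = "server starting\nCUDA out of memory\nKilled"
    for number, line in find_oom_lines(sample):
        print(f"line {number}: {line}")
```

A GPU-side failure usually shows up as the PyTorch message, while a container-side (system RAM) failure often leaves only "Killed" or exit code 137, so it helps to check for both.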
@sleek creek are you able to check? see if we log it also in pod logs when they oom
Hey, could you share your pod id with me?
Out of memory, happens when your vram is too small to handle the model with that specific config
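To put rough numbers on that for the 2xA100 config above, here is a back-of-the-envelope sketch. It assumes things the thread does not state: 80 GB A100s, roughly 70e9 parameters, and the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128). It ignores activations, the CUDA context, and communication buffers, so treat it as an estimate only.

```python
# Rough VRAM budget for the launch parameters in the first message.
# Assumptions (not stated in the thread): 80 GB A100s, ~70e9 parameters,
# and a model shape of 80 layers, 8 KV heads, head dimension 128.

GB = 1e9


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """bfloat16 stores each parameter in 2 bytes."""
    return n_params * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """One key vector and one value vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def budget(n_gpus: int, gpu_gb: float, utilization: float,
           n_params: float, max_model_len: int) -> dict[str, float]:
    """Compare usable VRAM with weight size and one full-length sequence."""
    usable = n_gpus * gpu_gb * GB * utilization
    weights = weight_bytes(n_params)
    kv_one_seq = kv_cache_bytes_per_token(80, 8, 128) * max_model_len
    return {
        "usable_gb": usable / GB,
        "weights_gb": weights / GB,
        "left_after_weights_gb": (usable - weights) / GB,
        "kv_one_full_sequence_gb": kv_one_seq / GB,
    }


if __name__ == "__main__":
    for name, value in budget(2, 80, 0.95, 70e9, 16256).items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, about 152 GB is usable at 0.95 utilization, the weights take about 140 GB, and a single 16256-token sequence needs about 5.3 GB of KV cache out of the remaining 12 GB. That fits on paper but with little headroom, so lowering --max-model-len or --gpu-memory-utilization could be worth testing.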