#Serverless


arctic cobalt
#

Hi team 👋,

I ran into an issue with unexpected billing (around $400) on my serverless vLLM endpoint while it was idle.
Support explained it was caused by a CUDA 12.9 misconfiguration in my endpoint settings. They kindly applied a $100 credit 🙏, but I’d like to make sure I configure things correctly moving forward.

Could you clarify:

Which CUDA version is recommended for running meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on vLLM?

How to ensure the pod truly scales down to zero when idle so I don’t continue to incur charges unnecessarily?

Appreciate your guidance 🚀

azure cape
#

Which VLLM are you using?

arctic cobalt
# azure cape if you're using https://github.com/runpod-workers/worker-vllm the base one 12.1 ...

Hi @azure cape 👋, thanks for jumping in earlier!

I’m working on deploying meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on serverless vLLM with RunPod.
Here’s my current config:

GPU: 2 × H200 SXM (141 GB each)

CUDA Versions allowed: 12.1 – 12.7

Device: cuda

Idle Timeout: 5s (to scale-to-zero quickly)

Execution Timeout: 600s

Flashboot: enabled

Env Vars:

MAX_MODEL_LEN=128000

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

RAW_OPENAI_OUTPUT=1

OPENAI_SERVED_MODEL_NAME_OVERRIDE=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

But I’m still running into OOM issues:

ERROR 08-21 09:43:51 [multiproc_executor.py:511] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB. GPU 0 has a total capacity of 139.72 GiB of which 260.69 MiB is free. Process 668548 has 139.46 GiB memory in use. Of the allocated memory 138.04 GiB is allocated by PyTorch, and 8.63 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

👉 Could you please confirm the optimal configuration for this FP8 model on H200 SXM?

Is 2× H200 actually required, or should it run fine on 1 GPU?

Is my MAX_MODEL_LEN=128000 too aggressive for FP8 mode?

Any best practices you’d recommend for memory allocation/env vars with this model on vLLM?

Would appreciate your guidance 🚀

azure cape
#

Probably 2x or more. I'm not sure, let me check. You should also configure another env var:

#

Env Vars:

MAX_MODEL_LEN=128000

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

RAW_OPENAI_OUTPUT=1

OPENAI_SERVED_MODEL_NAME_OVERRIDE=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
TENSOR_PARALLEL_SIZE=2

#

Or, more generally, set TENSOR_PARALLEL_SIZE to the number of GPUs per worker.
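(Editor's sketch) A rough way to sanity-check the GPU count before picking TENSOR_PARALLEL_SIZE is to compare the weight footprint with the usable VRAM per GPU. The numbers below are assumptions, not measurements: roughly 400B total parameters for Maverick (as listed on the model card), 1 byte per parameter for FP8, the 139.72 GiB capacity from the error log above, and a 0.9 usable fraction mirroring vLLM's default `gpu_memory_utilization`. The helper names are hypothetical.

```python
import math

def weight_footprint_gib(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GiB."""
    return total_params_billion * 1e9 * bytes_per_param / 2**30

def min_gpus_for_weights(weights_gib: float, gpu_mem_gib: float,
                         usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights alone.

    This ignores KV cache and activations, so treat it as a lower bound.
    """
    return math.ceil(weights_gib / (gpu_mem_gib * usable_fraction))

def next_power_of_two(n: int) -> int:
    """Round up to a power of two, a common (assumed) choice for tensor parallelism."""
    return 1 << (n - 1).bit_length()

# Assumed inputs: ~400B total params, FP8 (1 byte/param), H200 capacity from the log.
weights_gib = weight_footprint_gib(400, 1)
lower_bound = min_gpus_for_weights(weights_gib, 139.72)
tp_size = next_power_of_two(lower_bound)

print(f"weights ~{weights_gib:.0f} GiB, lower bound {lower_bound} GPUs, TP size {tp_size}")
```

Under these assumptions the weights alone exceed the combined memory of two H200s, which would be consistent with the OOM above; the real requirement depends on the checkpoint and vLLM version, so verify against an actual load.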

azure cape
#

If you still get an OOM with that, try reducing MAX_MODEL_LEN. It might help because a shorter context window uses less VRAM.
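(Editor's sketch) To see why a smaller max model len frees VRAM, here is the usual back-of-the-envelope KV cache estimate for standard attention: one key and one value vector per layer, per KV head, per token. The layer count, KV head count, head dimension, and 2-byte cache dtype below are illustrative placeholders, not confirmed values for this model, and the function name is hypothetical.

```python
def kv_cache_gib(max_model_len: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for one full-length sequence."""
    # Factor 2 accounts for storing both keys and values.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_model_len * bytes_per_token / 2**30

# Illustrative (assumed) architecture values; check the model's config.json.
ARCH = dict(num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

for length in (128_000, 64_000, 32_000):
    print(f"MAX_MODEL_LEN={length}: ~{kv_cache_gib(length, **ARCH):.2f} GiB per sequence")
```

The estimate scales linearly with context length, so halving MAX_MODEL_LEN halves the per-sequence cache. It does not shrink the weights, so it only helps once the weights themselves fit.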

shadow pilot
#

Also, for your refund - did you get to reach out to support yet or no? Not a problem either way, just making sure you're good.

arctic cobalt