Hi @azure cape 👋, thanks for jumping in earlier!
I’m working on deploying meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 on serverless vLLM with RunPod.
Here’s my current config:
GPU: 2 × H200 SXM (141 GB each)
CUDA Versions allowed: 12.1 – 12.7
Device: cuda
Idle Timeout: 5s (to scale-to-zero quickly)
Execution Timeout: 600s
Flashboot: enabled
Env Vars:
MAX_MODEL_LEN=128000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
RAW_OPENAI_OUTPUT=1
OPENAI_SERVED_MODEL_NAME_OVERRIDE=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
But I’m still running into OOM issues:
ERROR 08-21 09:43:51 [multiproc_executor.py:511] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB. GPU 0 has a total capacity of 139.72 GiB of which 260.69 MiB is free. Process 668548 has 139.46 GiB memory in use. Of the allocated memory 138.04 GiB is allocated by PyTorch, and 8.63 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
👉 Could you please confirm the optimal configuration for this FP8 model on H200 SXM?
Is 2 × H200 actually required, or should it run fine on 1 GPU?
Is my MAX_MODEL_LEN=128000 too aggressive for FP8 mode?
Any best practices you’d recommend for memory allocation/env vars with this model on vLLM?
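For context, here is my rough back-of-envelope math for the weights alone. It's only a sketch: the ~400B total parameter count (all 128 experts, not just the 17B active) is my assumption, FP8 is taken as 1 byte per parameter, and KV cache, activations and CUDA overhead are not counted at all.

```python
import math

# ASSUMPTION: ~400B total parameters across all 128 experts (not only the 17B active ones).
TOTAL_PARAMS = 400e9
BYTES_PER_PARAM = 1      # FP8 stores one byte per weight
GPU_MEM_GIB = 139.72     # per-GPU capacity reported in the OOM log above
NUM_GPUS = 2

# Memory needed just to hold the weights, in GiB.
weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30

# Total memory available across the configured GPUs.
total_gpu_gib = GPU_MEM_GIB * NUM_GPUS

# Do the weights alone fit, ignoring KV cache and activations?
fits = weights_gib < total_gpu_gib

# Lower bound on GPU count for the weights only (real requirement will be higher).
min_gpus_for_weights = math.ceil(weights_gib / GPU_MEM_GIB)

print(f"weights: {weights_gib:.1f} GiB, available: {total_gpu_gib:.1f} GiB, fits: {fits}")
print(f"lower bound on GPUs for weights alone: {min_gpus_for_weights}")
```

If that assumption holds, the FP8 weights by themselves would already exceed 2 × H200 before any KV cache for MAX_MODEL_LEN=128000, which would be consistent with the OOM above. Please correct me if the parameter count is off.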
Would appreciate your guidance 🚀