# Is it possible to run Llama 4 on serverless?

leaden matrix
#

Hi RunPod team,

I’m currently running Llama-3.3 in production on RunPod Serverless using vLLM, with a worker that remains warm and handles continuous traffic successfully.

I’m now trying to upgrade this setup to Llama-4, and I’m looking for official guidance on how this should be configured on Serverless rather than confirmation of whether it’s theoretically possible.

Specifically, I’m looking for help with:

Reference Docker images

Do you provide (or recommend) a RunPod-maintained Docker image for running Llama-4 with vLLM on Serverless?

If not, is there a reference image or example you recommend as a starting point?

Model loading strategy on Serverless

For production Serverless workloads, is the recommended approach to:

bake Llama-4 weights into the Docker image, or

download weights at startup and rely on the warm worker lifecycle?

Are there size or startup-time thresholds where one approach is preferred?

Serverless-specific constraints

Are there known Serverless limits (ephemeral disk size, image size, startup timeout, worker recycling behavior) that differ from Pods and that we should explicitly account for when running larger models like Llama-4?

Production recommendations

Given a production use case with steady traffic and warm workers, is Serverless still a supported and recommended product for Llama-4, or do you advise migrating this workload to Pods?

I want to ensure I’m following the intended and supported setup for production rather than relying on behavior that might change.

Thanks for any concrete guidance or references you can share.

Best,
Wilbur

arctic tree
#

Just build it yourself with either of these 2 options. I'd recommend the **second** one.
(Scroll down on the link I sent to see this.)

#

Downloading weights at startup is almost always more cost-efficient for serverless.
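
A minimal sketch of what "download weights at startup" could look like in a worker, assuming the weights come from Hugging Face and that a persistent path is available to cache them between cold starts. The model ID, the `/runpod-volume/models` path, the `MODEL_ID` / `MODEL_CACHE` / `HF_TOKEN` variable names and the `ensure_weights` helper are all assumptions for illustration, not something RunPod confirmed in this thread:

```python
import os
from pathlib import Path
from typing import Callable

# Assumed defaults: adjust the model ID and cache location to your own setup.
MODEL_ID = os.environ.get("MODEL_ID", "meta-llama/Llama-4-Scout-17B-16E-Instruct")
CACHE_ROOT = Path(os.environ.get("MODEL_CACHE", "/runpod-volume/models"))


def _hf_download(repo_id: str, target: Path) -> None:
    """Download a full model snapshot from Hugging Face into `target`."""
    # Imported lazily so this module still loads where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id,
        local_dir=str(target),
        token=os.environ.get("HF_TOKEN"),  # needed for gated models
    )


def ensure_weights(
    repo_id: str = MODEL_ID,
    cache_root: Path = CACHE_ROOT,
    download: Callable[[str, Path], None] = _hf_download,
) -> Path:
    """Return the local weights directory, downloading only on the first start."""
    target = cache_root / repo_id.replace("/", "--")
    marker = target / ".complete"  # written only after a successful download
    if marker.exists():
        return target
    target.mkdir(parents=True, exist_ok=True)
    download(repo_id, target)
    marker.touch()
    return target
```

You would call `ensure_weights()` once before starting vLLM and point vLLM at the returned directory. The marker file is there so that a download interrupted halfway is retried on the next start instead of being treated as a finished cache.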

#

Are there known Serverless limits (ephemeral disk size, image size, startup timeout, worker recycling behavior) that differ from Pods and that we should explicitly account for when running larger models like Llama-4?

Not really, from what I can see, and it depends on how you use it (serverless scales automatically; on pods you need to scale manually).

On pods you also pay for the Docker image download time from the moment the pod starts, but the pricing is usually cheaper ($/hour),
and a pod doesn't have an idle timeout or a startup timeout (not sure what you mean by startup timeout, though).

*Some users report that some tasks are slower on serverless.
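
For the pods vs. serverless pricing point, a rough way to compare is to work out how busy a serverless worker has to be before a flat hourly pod becomes cheaper. This is only a sketch: it assumes serverless is billed per second of worker run time, and both prices below are hypothetical placeholders, not real RunPod rates.

```python
def break_even_busy_fraction(pod_price_per_hour: float,
                             serverless_price_per_second: float) -> float:
    """Fraction of each hour a serverless worker must be running
    before a pod at the flat hourly rate becomes the cheaper option."""
    serverless_full_hour = serverless_price_per_second * 3600
    return pod_price_per_hour / serverless_full_hour


# Hypothetical prices, for illustration only.
pod_price = 2.00           # $ per hour
serverless_price = 0.001   # $ per second of worker run time

fraction = break_even_busy_fraction(pod_price, serverless_price)
print(f"A pod is cheaper once the worker runs more than {fraction:.0%} of the time")
```

With steady traffic and a worker that stays warm most of the hour, the busy fraction is high, so it is worth plugging in the current rates for your GPU type before deciding.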