#Guide to deploy Llama 405B on Serverless?
@random forge - you need to attach a network volume to the endpoint. The volume should have at least 1 TB of space to hold the 405B model (unless you are using a quantized model). Then increase the number of workers to match the model's GPU requirement (like ten 48 GB GPUs).
I tried several 405B models from HF but get an error related to rope_scaling. Looks like we need to set it to null and try again. To do this I need to download all the files and upload them again.
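A minimal sketch of that workaround, assuming the model files already sit in a local directory (for example on the attached network volume) and that only config.json needs to change. The function name `null_rope_scaling` and the path in the usage comment are hypothetical. The thread does not confirm that the model loads or behaves correctly afterwards, and dropping the Llama 3.1 scaling parameters may affect long-context behavior.

```python
import json
from pathlib import Path


def null_rope_scaling(config_path):
    """Set rope_scaling to null in a local Hugging Face config.json.

    Returns the previous value so it can be restored later.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("rope_scaling")
    config["rope_scaling"] = None  # written out as JSON null
    path.write_text(json.dumps(config, indent=2) + "\n")
    return previous


# Hypothetical usage; the path below is an assumption, not from the thread:
# old_value = null_rope_scaling("/runpod-volume/llama-405b/config.json")
```

Keeping the returned value makes it easy to put the original block back once the worker ships a vLLM version that understands it.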
does the vllm worker support this yet?
@soft meadow not sure about this, do we have a document or page that lists vllm's support for a model?
on the docs of vllm, not on runpod
look at the right versions, maybe current vllm is outdated
looks like it supports
LlamaForCausalLM
Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc.
okay, again..
check the current vllm worker's vllm version
i think last time I checked it hadn't been updated yet
I am using runpod/worker-vllm:stable-cuda12.1.0
since I am using serverless I am unable to run any command
No, I get an error related to rope_scaling
llama 3.1's config.json has lots of params under rope_scaling
rope scaling huh, i think you can't set that either with the current vllm worker version
but the current vllm accepts only two params
these are the docs for the current vllm-worker's vllm version:
https://docs.vllm.ai/en/v0.3.2/models/supported_models.html
2024-07-24T04:42:22.063990694Z engine.py :110 2024-07-24 04:42:22,063 Error initializing vLLM engine: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
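To make the mismatch in that log line concrete, here is a small sketch of the rule the message describes. It is not the library's actual code, just an illustration under the assumption that the check wants exactly the two keys `type` and `factor` (and that a null value is accepted). The function name `passes_two_field_check` is made up for this example.

```python
def passes_two_field_check(rope_scaling):
    """Return True if rope_scaling satisfies the rule quoted in the log."""
    if rope_scaling is None:
        return True  # assumption: null means "no scaling" and is accepted
    return isinstance(rope_scaling, dict) and set(rope_scaling) == {"type", "factor"}


# Value copied from the error message above (Llama 3.1 config.json).
llama31 = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}

# False: five fields, and the key is "rope_type" rather than "type"
print(passes_two_field_check(llama31))
# True: the older two-field shape
print(passes_two_field_check({"type": "linear", "factor": 2.0}))
```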
yep, check this
thats the matching docs for current vllm-worker
ok got it, 405 is not in there
seems like it just got updated on the newest version yeah
soo only the newest version, and we have to wait until vllm-worker updates to the latest or stable version of vllm
ok.. is it done automatically or should we raise a ticket etc
yeah. about that, we just wait until runpod's staff updates it
they say they're working on it, don't worry
im also waiting for it 🙂
You could try to use https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference, but with a GPU. The ollama worker was updated and now also supports Llama 3.1. We only tested this with 8B, but I don’t see why this shouldn’t also work with 405B 🙏
I will also test this later today with 70 and 405.
@slender echo pls let me know if ollama worker worked with 405
Meta’s recent release of the Llama 3.1 405B model has made waves in the AI community. This groundbreaking open-source model not only matches but even surpasses the performance of leading closed-source models. With impressive scores on reasoning tasks (96.9 on ARC Challenge and 96.8 on GSM8K)
That's super cool! How can we also do this on serverless? We can’t add multiple GPUs to a worker, so is there any other way?
Yeah, currently you can’t, 405b needs too much memory. 😂😂😂
Hmm, how much memory does it need?
I suspect about 200+ GB
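A rough back-of-the-envelope for that question, counting weights only (parameter count times bytes per parameter). It ignores KV cache, activations and runtime overhead, so real requirements will be higher. The 48 GB GPU size comes from earlier in the thread; the precision labels and helper names are illustrative assumptions.

```python
import math

PARAMS = 405_000_000_000  # parameter count implied by the "405B" name

# Bytes per parameter at a few common precisions (illustrative labels).
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(params, bytes_per_param):
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def gpus_needed(total_gb, gpu_gb=48):
    """Minimum number of GPUs whose combined memory holds total_gb."""
    return math.ceil(total_gb / gpu_gb)


for name, nbytes in BYTES_PER_PARAM.items():
    gb = weight_gb(PARAMS, nbytes)
    print(f"{name}: {gb:.1f} GB, at least {gpus_needed(gb)} x 48 GB GPUs")
# 16-bit: 810.0 GB, at least 17 x 48 GB GPUs
# 8-bit: 405.0 GB, at least 9 x 48 GB GPUs
# 4-bit: 202.5 GB, at least 5 x 48 GB GPUs
```

The 4-bit row is roughly in line with the "200+" guess, while unquantized 16-bit weights still fit under the 1 TB volume suggested at the top of the thread.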