RunPod · 2mo ago
octopus

Guide to deploy Llama 405B on Serverless?

Hi, can any Serverless experts advise on how to deploy Llama 405B on Serverless?
33 Replies
Suba · 2mo ago
@octopus - you need to attach a network volume to the endpoint. The volume should have at least 1 TB of space to hold the 405B model (unless you are using a quantized model). Then increase the number of workers to match the model's GPU requirement (like ten 48 GB GPUs). I tried several 405B models from HF but got an error related to rope_scaling. Looks like we need to set it to null and try again. To do this I need to download all the files and upload them again.
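Something like this rough sketch is what I mean, assuming huggingface_hub and a destination repo of your own (the repo names below are hypothetical, and the weights/tokenizer files would have to be mirrored there as well):

```python
# Rough sketch: pull config.json, null out rope_scaling, and push the
# patched copy to a mirror repo the worker can load. Repo names are
# hypothetical; the model weights and tokenizer files would also need
# to be copied to DST_REPO (omitted here).
import json
from huggingface_hub import hf_hub_download, upload_file

SRC_REPO = "meta-llama/Meta-Llama-3.1-405B-Instruct"
DST_REPO = "your-username/llama-3.1-405b-patched"  # hypothetical

# Only the config needs to change; the weights stay as-is.
cfg_path = hf_hub_download(repo_id=SRC_REPO, filename="config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Older vLLM rejects Llama 3.1's extended rope_scaling dict, so null it out.
cfg["rope_scaling"] = None

patched_path = "/tmp/config.json"
with open(patched_path, "w") as f:
    json.dump(cfg, f, indent=2)

upload_file(path_or_fileobj=patched_path, path_in_repo="config.json",
            repo_id=DST_REPO)
```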
nerdylive · 2mo ago
does the vllm worker support this yet?
Suba · 2mo ago
@nerdylive not sure about this, do we have a document or page that lists the models vLLM supports?
nerdylive · 2mo ago
on the vLLM docs, not on RunPod's. look at the right version though, maybe the current vLLM in the worker is outdated
nerdylive · 2mo ago
yes, that one is for the latest version. hope vllm-worker is on the latest now
Suba · 2mo ago
looks like it supports LlamaForCausalLM (Llama 3.1, Llama 3, Llama 2, LLaMA, Yi): meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc.
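One way to double-check from Python, assuming ModelRegistry is still exported at the top level the way it was around vLLM 0.5.x:

```python
# List the architectures a given vLLM install supports and check for Llama.
from vllm import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print("LlamaForCausalLM" in archs)  # True if Llama-family models are supported
```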
nerdylive · 2mo ago
okay, again... check the current vllm worker's vLLM version. I think last time I checked it hadn't been updated yet
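For reference, a minimal way to check, assuming you can run the worker image locally with an interactive shell (not possible from inside a Serverless endpoint itself):

```python
# Print the vLLM version baked into the image. Llama 3.1's extended
# rope_scaling format landed around vLLM 0.5.3.post1, so anything older
# will likely choke on the config.
import vllm
print(vllm.__version__)
```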
Suba · 2mo ago
I am using runpod/worker-vllm:stable-cuda12.1.0. Since I am using serverless, I am unable to run any commands
nerdylive · 2mo ago
and is it working with llama 3.1 now? yeah ofc, lemme check the repo, one sec
Suba · 2mo ago
No, I get an error related to rope_scaling. Llama 3.1's config.json has lots of params under rope_scaling
nerdylive · 2mo ago
rope scaling, huh. I think you're unable to set that either on the current vllm worker version
Suba · 2mo ago
but the current vLLM accepts only two params (type and factor)
Suba · 2mo ago
2024-07-24T04:42:22.063990694Z engine.py :110 2024-07-24 04:42:22,063 Error initializing vLLM engine: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
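For reference, the shape mismatch the error describes - the old-format values below are illustrative, the new-format ones are copied from the error:

```python
# What the older vLLM parser insists on: exactly two fields.
old_format = {"type": "linear", "factor": 8.0}  # illustrative values

# What Llama 3.1's config.json actually ships (copied from the error above):
llama31_format = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}
```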