Guide to deploy Llama 405B on Serverless?
Hi, can any experts on Serverless advise on how to deploy Llama 405B on Serverless?
@octopus - you need to attach a network volume to the endpoint. The volume should have at least 1 TB of space to hold the 405B model (unless you are using a quantized model). Then increase the GPU allocation to match the model's requirements (e.g. 10x 48 GB GPUs).
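Once the endpoint is deployed, a quick smoke test from Python would look roughly like this (a minimal sketch assuming a worker-vllm endpoint; ENDPOINT_ID, the RUNPOD_API_KEY env var, and the exact input fields are placeholders/assumptions — check the worker-vllm README for the payload it actually expects):

```python
import os
import requests

# Placeholders: fill in for your own deployment.
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = os.environ["RUNPOD_API_KEY"]

# Synchronous run via the RunPod serverless REST API.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
payload = {
    "input": {
        # Assumed worker-vllm-style fields; adjust to your worker's schema.
        "prompt": "Say hello in one sentence.",
        "sampling_params": {"max_tokens": 128, "temperature": 0.7},
    }
}

resp = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```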
I tried several 405B models on HF but I get an error related to rope_scaling. Looks like we need to set it to null and try again. To do this I need to download all the files and upload them again.
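A rough sketch of that workaround with huggingface_hub (TARGET_REPO is a placeholder for your own mirror repo, and the full weights would still have to be copied there separately — also note that dropping the new rope_scaling fields may affect long-context behaviour):

```python
import json
from huggingface_hub import HfApi, hf_hub_download

SOURCE_REPO = "meta-llama/Meta-Llama-3.1-405B-Instruct"
TARGET_REPO = "your-username/Meta-Llama-3.1-405B-patched"  # placeholder

# Download just the config, patch rope_scaling, re-upload to your mirror.
config_path = hf_hub_download(repo_id=SOURCE_REPO, filename="config.json")
with open(config_path) as f:
    config = json.load(f)

# Older vLLM only accepts {"type": ..., "factor": ...} (or null) here,
# so strip the Llama 3.1-specific fields.
config["rope_scaling"] = None

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

HfApi().upload_file(
    path_or_fileobj="config.json",
    path_in_repo="config.json",
    repo_id=TARGET_REPO,
)
```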
does the vLLM worker support this yet?
@nerdylive not sure about this, do we have a document or page that lists the models vLLM supports?
it's in vLLM's docs, not RunPod's
look at the right version though, maybe the current vLLM is outdated
yes that one is for the latest version
hope the vllm-worker is on the latest now
looks like it supports it:
LlamaForCausalLM — Llama 3.1, Llama 3, Llama 2, LLaMA, Yi (e.g. meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc.)
okay, again..
check the current vllm worker's vllm version
i think last time it hadn't been updated yet
I am using runpod/worker-vllm:stable-cuda12.1.0
since I am using Serverless I am unable to run any commands
and does it work with Llama 3.1 now?
yeah ofc, lemme check the repo one sec
No, I get an error related to rope_scaling
Llama 3.1's config.json has lots of params under rope_scaling
rope_scaling huh, I think you're unable to set that either on the current vLLM worker version
but the current vLLM accepts only two params
these are the docs for the vLLM version in the current vllm-worker:
https://docs.vllm.ai/en/v0.3.2/models/supported_models.html
2024-07-24T04:42:22.063990694Z engine.py :110 2024-07-24 04:42:22,063 Error initializing vLLM engine: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
yep, check this
that's the matching docs for the current vllm-worker
ok got it, 405B is not in there
seems like it just got added in the newest version, yeah
soo it's only in the newest version, and we have to wait until the vllm-worker updates to the latest or a stable version of vLLM
ok.. is it done automatically or should we raise a ticket etc
yeah, about that, we just wait until RunPod's staff updates it
they say they're working on it, don't worry
im also waiting for it 🙂
great, thank you very much for your time
🙂
You could try to use https://docs.runpod.io/tutorials/serverless/cpu/run-ollama-inference, but with a GPU. The ollama worker was updated and it now also supports Llama 3.1. We only tested this with 8B, but I don't see why this shouldn't also work with 405B 🙏
ah ollama interesting
thanks for sharing, I will look at that too hahah
I will also test this later today with 70B and 405B.
@nerdylive would like to know if you've got any news on the vLLM update
for 405B
No, not yet, I don't know
They're still working on it
@NERDDISCO pls let me know if the ollama worker worked with 405B
RunPod Blog
Run Llama 3.1 405B with Ollama: A Step-by-Step Guide
Meta's recent release of the Llama 3.1 405B model has made waves in the AI community. This groundbreaking open-source model not only matches but even surpasses the performance of leading closed-source models. With impressive scores on reasoning tasks (96.9 on ARC Challenge and 96.8 on GSM8K)…
That's super cool! How can we also do this on Serverless? We can't add multiple GPUs to a worker, so is there any other way?
Yeah, currently you can't, 405B needs too much memory 😂😂😂
Hmm, how much memory does it need?
I suspect about 200+ GB
damn must be really good
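For a rough sense of those numbers, here's a back-of-the-envelope estimate of the weight memory alone (405B parameters; KV cache, activations and runtime overhead not included, so real requirements are higher):

```python
# Back-of-the-envelope weight memory for a 405B-parameter model.
# Ignores KV cache, activations, and framework overhead.
params = 405e9

bytes_per_param = {
    "fp16/bf16": 2.0,
    "fp8/int8": 1.0,
    "int4 (quantized)": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{precision:>17}: ~{gb:,.0f} GB of weights")

# fp16/bf16        : ~810 GB -> roughly why a ~1 TB network volume was suggested
# fp8/int8         : ~405 GB -> in the ballpark of 10 x 48 GB GPUs
# int4 (quantized) : ~203 GB -> the "200+" figure mentioned above
```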