Can't run 70B
Any tips to run a 70B model, for example mlabonne/Llama-3.1-70B-Instruct-lorablated?
I tried this:
Config:
80GB GPU
2 GPUs / worker
Container disk: 500 GB
Env vars:
MAX_MODEL_LEN 15000
MODEL_NAME mlabonne/Llama-3.1-70B-Instruct-lorablated
But it doesn't work.
Without MAX_MODEL_LEN 15000, I got: "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
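For context, that error means the VRAM left over after loading the 70B weights can only hold ~18k tokens of KV cache, so the 131072-token default context doesn't fit: either cap the context or give vLLM more of the GPU. A minimal sketch of the same knobs, assuming you drive vLLM directly rather than through the worker's env vars (values are illustrative):

# The worker's env vars map roughly onto these engine arguments:
# MAX_MODEL_LEN -> max_model_len, GPU_MEMORY_UTILIZATION -> gpu_memory_utilization,
# 2 GPUs/worker -> tensor_parallel_size=2.
from vllm import LLM

llm = LLM(
    model="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    tensor_parallel_size=2,        # split the 70B weights across both 80GB GPUs
    max_model_len=15000,           # keep the context below what the KV cache can hold
    gpu_memory_utilization=0.95,   # fraction of VRAM vLLM may claim (assumed value)
)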
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container
And nothing after that.
The logs you have provided are the System Logs; they show how the container is launching. Have you checked your Container Logs? Those show what your application is doing. You can see them on the RunPod website: if you click on a running worker and then click Logs, you will get a choice of System Logs and Container Logs. You have to be quick to catch them, because the Container Logs will not be visible after the worker stops running.
thanks. It's working now.
I don't know what I changed, but it's so slow 😒
7-8 sec for 2000 input tokens, 200 out,
on 2 GPUs (H100) per worker.
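For anyone wanting to reproduce that timing, a quick sketch against the endpoint's OpenAI-compatible route (the endpoint ID, API key, and prompt are placeholders, and the route is an assumption about the vLLM worker setup):

# Hypothetical timing check; replace ENDPOINT_ID / RUNPOD_API_KEY with real values.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

start = time.time()
resp = client.chat.completions.create(
    model="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    messages=[{"role": "user", "content": "..."}],  # ~2000-token prompt in the real test
    max_tokens=200,
)
print(f"{time.time() - start:.1f}s, {resp.usage.completion_tokens} tokens out")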
I don't really do much with vLLM so I cannot comment on your speed. To me 160GB seems like a lot and 7-8 seconds sounds fast, but I have never needed more than 48GB for any model I have ever run. What kind of speed are you looking for with the 2000 input tokens and 200 output tokens? Are you trying to do something live?
Using together.ai, I get under 3-4 sec.
But I don't want to run the vanilla model; that's why I need RunPod.
Are you using a 70B model with 48GB, or a smaller model?
I am not running any vLLM. I do image generation, lipsync, voice cloning, along with some custom apps.
ok!
Good luck!
thanks for your help
@Thibaud you are talking about "pods", not serverless, right? Because I wonder how you would get two GPUs to run this model.
no sorry, it's serverless
hmm, but you can't have more than one GPU / worker when using serverless. That's why I'm wondering what you are using to run this
ah forget it 😄 I just saw it in the UI
You can
I have never used it yet
Yes totally, was my fault 😄
But for me it's not working, and I don't know why.
@Karlas what is your problem? Also related to llama 3.1 70b?
Yea
It keeps getting stuck at the same place
I will take a look at both things
thanks
@NERDDISCO it worked
just that it takes a bit of time to load for the first time in a while
perfect. You are also running on 2x80GB GPUs?
I am running 4x48
80GB GPUs are hard to get
this is easier
just a problem with cold start right now
Are you using a network volume? Doing so can add 30-60 seconds to delayTime.
I thought network volume speeds it up?
I wasn't using before
But i just tried setting it up
You would think... but it does just the opposite. I have done direct comparisons with a baked-in model vs a network volume, and the network volume always has 30-60 seconds more delayTime. IMHO nobody should be using a network volume, unless you don't care about response time.
You're better off with a bigger image than using a network volume.
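If you do want to bake the model in, a minimal sketch of a build-time download script (assuming a huggingface_hub-based image build; the script name and target path are hypothetical):

# bake_model.py (hypothetical): run during `docker build` so the weights ship
# inside the image instead of being fetched at cold start.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    local_dir="/models/llama-3.1-70b-lorablated",  # assumed path inside the image
)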
It takes 6 minutes without a network volume
for the first time
Have you tested how long it takes with a network volume?
Also, how many in/out tokens for those 6 min? Thibaud was saying it was taking him 7-8 sec for 2000 input tokens, 200 out,
on 2 GPUs (H100) per worker.
@Encyrption would you mind sharing your test results with me? I was looking into this the other day with our engineers and they said that network volumes were only affected in EU-RO and EU-SE at the start of this week.
It wasn't leaving the queue with it
All my data comes from EU-RO as I have compute (non GPU) workloads as well.
There's about 4100 input tokens and about 150 out.
4 A40 per worker
So it's fine to use network volumes
I am confused
When did you test this?
If the 6 minute delay is only happening on the initial request, my guess is your code is loading model(s) during this time period... if you have pre-loaded them, I would double check whether they are being downloaded again (overwriting what you have pre-loaded).
Yeah, sorry for this. We had some users reporting that network volumes were slow. They were all from the two regions I mentioned, and there was some maintenance happening on our end at the start of this week that would have explained the slow response times of the network volumes.
Right now I'm trying to figure out if this is still ongoing by talking with users who had problems with slow network volumes before.
I last tested it about 1 month ago; since then I no longer use a network volume. @briefPeach did some more recent testing with large diffusion models. She also came to the same conclusion, although I am not sure what region she was testing in.
@Thibaud sorry for polluting your thread with the network volume stuff. I will open a new thread for this.
@Karlas so when using the network volume, you don't have to download the model again, which decreases the cold start since the model already exists
@Thibaud I can't seem to get the model you are using to work at all
I am using the VLLM setup from runpod
When I use network volume its on EU-RO because thats where the gpu is available but it keeps getting stuck in queue
@Thibaud did you change anything else related to the env variables?
Because with the config you provided at the start, I can't get it to do anything
it will get my request and produce errors
No, just the max_len, but it's not a very good idea; the model has less memory.
I thought this just controls the maximum length of the context. What do you mean by "the model has less memory"?
Welp, they optimize their runtime, models, etc. of course...
Plus they use more GPUs.
Yep, but their cost is low, so it's not only more GPUs.
More GPUs, more optimization, maybe more users too, I don't know. What do you think it could be?
More GPUs reduce the cost per GPU, of course.
More optimization too.
I think we can get better settings at our level.
What level?
Hi @Thibaud - I am having the same issue with max seq len (131072) being larger than max number of tokens in KV cache ... Using H100 v16 80GB GPU RAM on Runpod using vLLM worker... I'm curious what you did to solve it? Thanks !!! 🙂
I am using Neuralmagic's FP8 quantized model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Try 2 GPUs / worker.
The GPU instance is an H100 14 vCPU 80GB VRAM.
RunPod Serverless settings:
MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
MAX_MODEL_LEN = 131072
GPU_MEMORY_UTILIZATION = 0.99
2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False
engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
I also tried:
KV_CACHE_DTYPE = fp8
Try with 2 GPUs,
or use MAX_MODEL_LEN 20000.
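To make that concrete, a sketch of the adjusted engine settings for the FP8 model on a single H100, following the advice above (values are illustrative):

# Cap the context below the 20736-token KV-cache budget reported in the error.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    max_model_len=20000,           # <= the 20736 tokens that fit in the KV cache
    gpu_memory_utilization=0.99,
    kv_cache_dtype="fp8",          # optional: smaller KV entries leave room for more context
)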
Thanks, will do
@Thibaud were you able to get the execution time lowered? I compared mlabonne/Llama-3.1-70B-Instruct-lorablated with Llama-70B-3.0 (https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5), which is the original that the 3.1 model is based on, and the difference is striking: 3-5 secs for 3.1 vs only 0.6-0.8 secs for 3.0.