RunPod•3mo ago
Thibaud

can't run 70b

any tips to run a 70B model, for example mlabonne/Llama-3.1-70B-Instruct-lorablated? I tried this:

config: 80GB GPU, 2 GPUs / Worker
container disk: 500 GB
env vars: MAX_MODEL_LEN 15000, MODEL_NAME mlabonne/Llama-3.1-70B-Instruct-lorablated

Without MAX_MODEL_LEN 15000 it doesn't work, I got: "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."

With it, the system log stops after the container starts:

2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container

...and nothing after.
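A minimal sketch of the vLLM engine arguments those settings correspond to; the argument names are vLLM's own, but the env-var-to-argument mapping and the 0.95 utilization value are assumptions, not the worker's exact code:

```python
# Rough sketch: what MODEL_NAME / MAX_MODEL_LEN / 2 GPUs per worker translate to
# when the vLLM engine is initialized (mapping assumed; argument names are real vLLM args).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mlabonne/Llama-3.1-70B-Instruct-lorablated",  # MODEL_NAME
    tensor_parallel_size=2,        # 2 GPUs / worker
    max_model_len=15000,           # MAX_MODEL_LEN: caps how much KV cache must be reserved
    gpu_memory_utilization=0.95,   # raise this, or lower max_model_len, to clear the KV cache error
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=200))
print(outputs[0].outputs[0].text)
```

Lowering max_model_len (or raising gpu_memory_utilization) is exactly what the error message asks for: the engine refuses to start unless the full context length fits in the KV cache it can allocate.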
51 Replies
Encyrption
Encyrption•3mo ago
The logs you have provided are the System Logs, which show how the container is launching. Have you checked your Container Logs? Those show what your application is doing. You can see them on the RunPod website: if you click on a running worker and then click Logs, you get a choice of System Logs and Container Logs. You have to be quick to catch them while the worker is running, because the Container Logs are no longer visible after the worker stops.
Thibaud
Thibaud•3mo ago
thanks. It's working now. I don't know what I changed, but it's so slow 😒 7-8 sec for 2000 input tokens / 200 out, on 2 GPUs (H100) per worker
Encyrption
Encyrption•3mo ago
I don't really do much with vLLM so I cannot comment on your speed. To me 160GB seems like a lot and 7-8 seconds sounds fast, but I have never needed more than 48GB for any model I have ever run. What kind of speed are you looking for with the 2000 input tokens and 200 output tokens? Are you trying to do something live?
Thibaud
Thibaud•3mo ago
using together.ai, I get less, 3-4 sec. But I don't want to run a vanilla model, that's why I need RunPod. Are you using a 70B model with 48GB, or a smaller model?
Encyrption
Encyrption•3mo ago
I am not running any vLLM. I do image generation, lipsync, voice cloning, along with some custom apps.
Thibaud
Thibaud•3mo ago
ok!
Encyrption
Encyrption•3mo ago
Good luck!
Thibaud
Thibaud•3mo ago
thanks for your help
NERDDISCO
NERDDISCO•3mo ago
@Thibaud you are talking about "pods", not serverless, right? Because I wonder how you would get two GPUs to run this model.
Thibaud
Thibaud•3mo ago
no sorry, it's serverless
NERDDISCO
NERDDISCO•3mo ago
hmm, but you can't have more than one GPU / worker when using serverless. That's why I'm wondering what you are using to run this... ah, forget it 😄 I just saw it in the UI
Emad
Emad•3mo ago
You can
NERDDISCO
NERDDISCO•3mo ago
I have never used it yet. Yes totally, that was my fault 😄
Emad
Emad•3mo ago
But for me it's not working, and I don't know why
NERDDISCO
NERDDISCO•3mo ago
@Karlas what is your problem? Also related to llama 3.1 70b?
Emad
Emad•3mo ago
Yeah, it keeps getting stuck at the same place
Emad
Emad•3mo ago
[image attachment, no description]
NERDDISCO
NERDDISCO•3mo ago
I will take a look at both things
Emad
Emad•3mo ago
thanks @NERDDISCO, it worked. It just takes a bit of time to load the first time in a while
NERDDISCO
NERDDISCO•3mo ago
perfect. You are also running on 2x80GB GPUs?
Emad
Emad•3mo ago
I am running 4x48GB. 80GB GPUs are hard to get, this is easier. Just a problem with cold start right now
Encyrption
Encyrption•3mo ago
Are you using a network volume? Doing so can add 30-60 seconds to delayTime.
Emad
Emad•3mo ago
I thought a network volume speeds it up? I wasn't using one before, but I just tried setting it up
Encyrption
Encyrption•3mo ago
You would think so... but it does just the opposite. I have done direct comparisons with a baked-in model vs a network volume, and the network volume always has 30-60 seconds more delayTime. IMHO nobody should be using a network volume unless you don't care about response time. You're better off with a bigger image than with a network volume.
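For reference, "baking the model in" just means downloading the weights at image build time so nothing has to be fetched at cold start. A minimal sketch, assuming a huggingface_hub download step run while building the Docker image (the /models path is made up for illustration):

```python
# download_model.py - run once at image build time (e.g. from a RUN step)
# so the weights ship inside the image instead of living on a network volume.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    local_dir="/models/Llama-3.1-70B-Instruct-lorablated",  # assumed bake location
)
```

The image ends up large (hence container disks of several hundred GB), but the worker starts with the weights already on local disk.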
Emad
Emad•3mo ago
It takes 6 minutes without a network volume the first time
Encyrption
Encyrption•3mo ago
Have you tested how long it takes with a network volume? Also, how many in/out tokens for those 6 min? Thibaud was saying it was taking him 7-8 sec for 2000 input tokens / 200 out, on 2 GPUs (H100) per worker
NERDDISCO
NERDDISCO•3mo ago
@Encyrption would you mind sharing your test results with me? I was looking into this the other day with our engineers and they said that network volumes were only affected in EU-RO and EU-SE at the start of this week.
Emad
Emad•3mo ago
It wasn't leaving the queue with it
Encyrption
Encyrption•3mo ago
All my data comes from EU-RO as I have compute (non GPU) workloads as well.
Emad
Emad•3mo ago
There are about 4100 input tokens and about 150 out, on 4x A40 per worker. So is it fine to use network volumes? I am confused
NERDDISCO
NERDDISCO•3mo ago
When did you test this?
Encyrption
Encyrption•3mo ago
If the 6 minute delay is only happening on the initial request, my guess is your code is loading model(s) during this time period... If you have pre-loaded them, I would double check whether they are being downloaded again, overwriting what you have pre-loaded.
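One way to make that failure mode obvious is to forbid Hub access at runtime and point the engine at the baked-in path, so an accidental re-download fails loudly instead of silently adding minutes to the cold start. A sketch, with the path and file list as assumptions:

```python
# Hypothetical startup check that the pre-baked model is actually used
# (BAKED_MODEL_DIR and the expected file list are illustrative, not the worker's code).
import os

BAKED_MODEL_DIR = "/models/Llama-3.1-70B-Instruct-lorablated"

missing = [f for f in ("config.json", "tokenizer_config.json")
           if not os.path.exists(os.path.join(BAKED_MODEL_DIR, f))]
if missing:
    raise RuntimeError(f"Baked model incomplete, missing: {missing}")

# Block network fetches so any attempt to re-download raises an error instead.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Then pass the local directory (not the Hub repo id) as the model name,
# e.g. MODEL_NAME=/models/Llama-3.1-70B-Instruct-lorablated.
```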
NERDDISCO
NERDDISCO•3mo ago
yeah, sorry for this. We had some users reporting that network volumes were slow. They were all from the two regions I mentioned, and there was some maintenance happening on our end at the start of this week that would explain the slow response times of the network volumes. Right now I'm trying to figure out if this is still ongoing by talking with users that had problems with slow network volumes before
Encyrption
Encyrption•3mo ago
I last tested it about 1 month ago; since then I no longer use network volumes. @briefPeach did some more recent testing, with large diffusion models. She also came to the same conclusion, although I am not sure what region she was testing in.
NERDDISCO
NERDDISCO•3mo ago
@Thibaud sorry for polluting your thread with the network volume stuff, I will open a new thread for it. @Karlas so when using a network volume, you don't have to download the model again, which decreases cold start since the model already exists. @Thibaud I can't seem to get the model you are using to work at all
Emad
Emad•3mo ago
I am using the vLLM setup from RunPod. When I use a network volume it's on EU-RO because that's where the GPU is available, but it keeps getting stuck in the queue
NERDDISCO
NERDDISCO•3mo ago
@Thibaud did you change anything else related to the env variables? Because with the config you provided at the start, I can't get it to do anything: it gets my request and produces errors
Thibaud
Thibaud•3mo ago
no, just the max_len, but it's not a very good idea. The model has less memory
NERDDISCO
NERDDISCO•3mo ago
I thought that this just controls the maximum length of the context, or what do you mean with "the model has less memory"?
nerdylive
nerdylive•3mo ago
welp, they optimize their runtime, models, etc. of course... plus they use more GPUs
Thibaud
Thibaud•3mo ago
yep, but their cost is low, so it's not only more GPUs
nerdylive
nerdylive•3mo ago
More GPUs, more optimization, maybe more usage too, I don't know. What do you think it could be?
Thibaud
Thibaud•3mo ago
More GPUs reduce the cost per GPU, of course, and more optimization too. I think we can get better settings at our level.
nerdylive
nerdylive•3mo ago
What level?
Markrr
Markrr•3mo ago
Hi @Thibaud, I am having the same issue with the max seq len (131072) being larger than the max number of tokens in the KV cache... Using an H100 v16 with 80GB GPU RAM on RunPod with the vLLM worker. I'm curious what you did to solve it? Thanks!!! 🙂
Thibaud
Thibaud•3mo ago
try 2 GPUs / worker
Markrr
Markrr•3mo ago
The GPU instance is an H100 14 vCPU 80GB VRAM. Runpod Serverless settings:
 MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
 MAX_MODEL_LEN = 131072 
GPU_MEMORY_UTILIZATION = 0.99

Log excerpt:
2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False
engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

I also tried: KV_CACHE_DTYPE = fp8
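For context on why the error shows up even at GPU_MEMORY_UTILIZATION = 0.99: the FP8 weights alone take roughly 70 GB, leaving only a few GB of an 80 GB card for the KV cache. A back-of-the-envelope sketch, assuming the published Llama-3.1-70B architecture numbers (80 layers, 8 KV heads via GQA, head dim 128) and an fp16 KV cache:

```python
# Rough KV-cache sizing for Llama-3.1-70B (architecture numbers assumed from the model card).
def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Both K and V are cached per layer, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()            # ~320 KiB per token
full_context_gib = per_token * 131_072 / 2**30    # ~40 GiB needed for max_model_len=131072
reported_fit_gib = per_token * 20_736 / 2**30     # ~6.3 GiB, what vLLM says is available

print(f"{per_token / 2**10:.0f} KiB/token, "
      f"{full_context_gib:.1f} GiB for 131072 tokens, "
      f"{reported_fit_gib:.1f} GiB fits in the reported KV cache")
```

At ~320 KiB per token, the full 131072-token context would need about 40 GiB of KV cache, so on a single 80 GB GPU you either shrink MAX_MODEL_LEN or split the model across more GPUs (tensor parallelism frees weight memory per card). KV_CACHE_DTYPE = fp8 halves the per-token cost but still would not fit the full context here.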
Thibaud
Thibaud•3mo ago
try with 2 GPUs, or set MAX_MODEL_LEN to 20000
Markrr
Markrr•3mo ago
Thanks, will do
octopus
octopus•2mo ago
@Thibaud were you able to get the execution time lowered? I compared mlabonne/Llama-3.1-70B-Instruct-lorablated with the Llama 3.0 70B model (https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5), which is the original that 3.1 is based on, and the difference is striking: 3-5 secs for 3.1 vs only 0.6-0.8 secs for 3.0