Can't run 70B
Any tips to run a 70B model, for example mlabonne/Llama-3.1-70B-Instruct-lorablated?
I tried this:
Config:
80GB GPU
2 GPUs / worker
Container disk: 500 GB
Env vars:
MAX_MODEL_LEN 15000
MODEL_NAME mlabonne/Llama-3.1-70B-Instruct-lorablated
But it doesn't work.
Without MAX_MODEL_LEN 15000, I got: "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18368). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine."
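For context, that error means the VRAM left over after loading the 70B weights can only hold ~18k tokens of KV cache, so the 131072-token default context doesn't fit: either cap the context or give vLLM more of the GPU. A minimal sketch of the same knobs, assuming you drive vLLM directly rather than through the worker's env vars (values are illustrative):

# The worker's env vars map roughly onto these engine arguments:
# MAX_MODEL_LEN -> max_model_len, GPU_MEMORY_UTILIZATION -> gpu_memory_utilization,
# 2 GPUs/worker -> tensor_parallel_size=2.
from vllm import LLM

llm = LLM(
    model="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    tensor_parallel_size=2,        # split the 70B weights across both 80GB GPUs
    max_model_len=15000,           # keep the context below what the KV cache can hold
    gpu_memory_utilization=0.95,   # fraction of VRAM vLLM may claim (assumed value)
)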
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Extracting [==================================================>] 32B/32B
2024-08-08T12:44:26Z 4f4fb700ef54 Pull complete
2024-08-08T12:44:26Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:26Z Status: Downloaded newer image for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:26Z worker is ready
2024-08-08T12:44:38Z create pod network
2024-08-08T12:44:38Z create container runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z stable-cuda12.1.0 Pulling from runpod/worker-v1-vllm
2024-08-08T12:44:38Z Digest: sha256:44f3a3d209d0df623295065203da969e69f57fe0b8b73520e9bef47fb9d33c70
2024-08-08T12:44:38Z Status: Image is up to date for runpod/worker-v1-vllm:stable-cuda12.1.0
2024-08-08T12:44:38Z worker is ready
2024-08-08T12:44:39Z start container
2024-08-08T12:48:14Z start container
And nothing after that.
The logs you have provided are the System Logs; they show how the container is launching. Have you checked your Container Logs? Those show what your application is doing. You can see them on the RunPod website: if you click on a running worker and then click Logs, you will get a choice of System Logs and Container Logs. You have to be quick to catch them, because the Container Logs will not be visible after the worker stops running.
thanks. It's working now.
I don't know what I changed, but it's so slow 😒
7-8 sec for 2000 input tokens, 200 out,
on 2 GPUs (H100) per worker.
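For anyone wanting to reproduce that timing, a quick sketch against the endpoint's OpenAI-compatible route (the endpoint ID, API key, and prompt are placeholders, and the route is an assumption about the vLLM worker setup):

# Hypothetical timing check; replace ENDPOINT_ID / RUNPOD_API_KEY with real values.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

start = time.time()
resp = client.chat.completions.create(
    model="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    messages=[{"role": "user", "content": "..."}],  # ~2000-token prompt in the real test
    max_tokens=200,
)
print(f"{time.time() - start:.1f}s, {resp.usage.completion_tokens} tokens out")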
I don't really do much with vLLM so I cannot comment on your speed. To me 160GB seems like a lot and 7-8 seconds sounds fast, but I have never needed more than 48GB for any model I have ever run. What kind of speed are you looking for with the 2000 input tokens and 200 output tokens? Are you trying to do something live?
Using together.ai, I get under 3-4 sec.
But I don't want to run the vanilla model; that's why I need RunPod.
Are you using a 70B model with 48GB, or a smaller model?
I am not running any vLLM. I do image generation, lipsync, voice cloning, along with some custom apps.
ok!
Good luck!
thanks for your help
@Thibaud you are talking about "pods", not serverless, right? Because I wonder how you would get two GPUs to run this model.
no sorry, it's serverless
hmm, but you can't have more than one GPU / worker when using serverless. That's why I'm wondering what you are using to run this
ah forget it 😄 I just saw it in the UI
You can
I have never used it yet
Yes totally, was my fault 😄
But for me it's not working, and I don't know why.
@Karlas what is your problem? Also related to llama 3.1 70b?
Yea
It keeps getting stuck at the same place
I will take a look at both things
thanks
@NERDDISCO it worked
just that it takes a bit of time to load for the first time in a while
perfect. You are also running on 2x80GB GPUs?
I am running 4x48
80GB GPUs are hard to get
this is easier
just a problem with cold start right now
Are you using a network volume? Doing so can add 30-60 seconds to delayTime.
I thought network volume speeds it up?
I wasn't using before
But i just tried setting it up
You would think... but it does just the opposite. I have done direct comparisons with a baked-in model vs a network volume, and the network volume always has 30-60 seconds more delayTime. IMHO nobody should be using a network volume, unless you don't care about response time.
You're better off with a bigger image than using a network volume.
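If you do want to bake the model in, a minimal sketch of a build-time download script (assuming a huggingface_hub-based image build; the script name and target path are hypothetical):

# bake_model.py (hypothetical): run during `docker build` so the weights ship
# inside the image instead of being fetched at cold start.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlabonne/Llama-3.1-70B-Instruct-lorablated",
    local_dir="/models/llama-3.1-70b-lorablated",  # assumed path inside the image
)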
It takes 6 minutes without a network volume
for the first time
Have you tested how long it takes with a network volume?
Also, how many in/out tokens for those 6 min? Thibaud was saying it was taking him 7-8 sec for 2000 input tokens, 200 out,
on 2 GPUs (H100) per worker.
@Encyrption would you mind sharing your test results with me? I was looking into this the other day with our engineers and they said that network volumes were only affected in EU-RO and EU-SE at the start of this week.
It wasn't leaving the queue with it
All my data comes from EU-RO as I have compute (non GPU) workloads as well.
There's about 4100 input tokens and about 150 out.
4 A40 per worker
So it's fine to use network volumes
I am confused
When did you test this?
If the 6 minute delay is only happening on the initial request, my guess is your code is loading model(s) during this time period... if you have pre-loaded them, I would double check whether they are being downloaded again (overwriting what you have pre-loaded).
Yeah, sorry for this. We had some users reporting that network volumes were slow. They were all from the two regions I mentioned, and there was some maintenance happening on our end at the start of this week that would have explained the slow response times of the network volumes.
Right now I'm trying to figure out if this is still ongoing by talking with users who had problems with slow network volumes before.
I last tested it about 1 month ago; since then I no longer use a network volume. @briefPeach did some more recent testing with large diffusion models. She also came to the same conclusion, although I am not sure what region she was testing in.
@Thibaud sorry for polluting your thread with the network volume stuff. I will open a new thread for this.
@Karlas so when using the network volume, you don't have to download the model again, which decreases the cold start since the model already exists
@Thibaud I can't seem to get the model you are using to work at all
I am using the VLLM setup from runpod
When I use network volume its on EU-RO because thats where the gpu is available but it keeps getting stuck in queue
@Thibaud did you change anything else related to the env variables?
Because with the config you provided at the start, I can't get it to do anything
it will get my request and produce errors
No, just the max_len, but it's not a very good idea; the model has less memory.
I thought this just controls the maximum length of the context. What do you mean by "the model has less memory"?
Welp, they optimize their runtime, models, etc. of course...
Plus they use more GPUs.
Yep, but their cost is low, so it's not only more GPUs.
More GPUs, more optimization, maybe more users too, I don't know. What do you think it could be?
More GPUs reduce the cost per GPU, of course.
More optimization too.
I think we can get better settings at our level.
What level?
Hi @Thibaud - I am having the same issue with max seq len (131072) being larger than max number of tokens in KV cache ... Using H100 v16 80GB GPU RAM on Runpod using vLLM worker... I'm curious what you did to solve it? Thanks !!! 🙂
I am using Neuralmagic's FP8 quantized model: https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Try 2 GPUs / worker.
The GPU instance is an H100 14 vCPU 80GB VRAM.
RunPod Serverless settings:
MODEL = neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
MAX_MODEL_LEN = 131072
GPU_MEMORY_UTILIZATION = 0.99
2024-08-14T21:49:06.860520596Z tokenizer_name_or_path: neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8, tokenizer_revision: None, trust_remote_code: False
engine.py :113 2024-08-14 20:33:15,350 Error initializing vLLM engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (20736). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
I also tried:
KV_CACHE_DTYPE = fp8
Try with 2 GPUs,
or use MAX_MODEL_LEN 20000.
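To make that concrete, a sketch of the adjusted engine settings for the FP8 model on a single H100, following the advice above (values are illustrative):

# Cap the context below the 20736-token KV-cache budget reported in the error.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    max_model_len=20000,           # <= the 20736 tokens that fit in the KV cache
    gpu_memory_utilization=0.99,
    kv_cache_dtype="fp8",          # optional: smaller KV entries leave room for more context
)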
Thanks, will do
@Thibaud were you able to get the execution time lowered? I compared mlabonne/Llama-3.1-70B-Instruct-lorablated with Llama-70B-3.0 (https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5), which is the original that the 3.1 model is based on, and the difference is striking: 3-5 secs for 3.1 vs only 0.6-0.8 secs for 3.0.