Error with the pre-built serverless Docker image
Hi
It's completely random, because sometimes it works smoothly, but when using the RunPod serverless vLLM image the machine gets stuck on
Using model weights format ['*.safetensors']
and I have to manually terminate the worker and restart it.
Do you have any suggestions? (I've attached my current envs)
Thank you 🙂
Have you tried a different type of GPU?
yes I did 😅
Just to make sure I understand correctly, are you saying that when you’re trying the vLLM serverless, sometimes the worker wakes up and completes the request, but other times it doesn’t? And you’ve tried different GPUs? Could you provide a bit more info on the size of the model, which GPUs you’ve tried, and how often this happens (like 1%, 10% of the time)? If you have a worker ID and a specific time when it didn’t work, that would be really helpful.
Correct, it happens on this endpoint vllm-rxlyakgq58h7lf when running on a single 80GB GPU PRO.
I'm running this model ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit
It always gets stuck on this log:
2024-09-21T12:30:53.355632180Z (VllmWorkerProcess pid=161) INFO 09-21 12:30:53 model_runner.py:997] Starting to load model ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit...
2024-09-21T12:30:54.669155104Z (VllmWorkerProcess pid=161) INFO 09-21 12:30:54 weight_utils.py:242] Using model weights format ['*.safetensors']
It doesn't happen all the time (maybe 30-40% of it), but as I've found on Discord I'm not the only one with this problem, and once I delete the worker and start it again it runs smoothly.
Basically, once the model is loaded and the machine is not in cooldown it can process requests, but once it scales down and spins back up to process a new request it sometimes gets stuck on that log, and I have to manually terminate the worker and run it again.
I have tried with 2x 80GB GPUs (not PRO), and so far it doesn't break, but the boot-up time increases a lot (from 30 seconds, when the GPU PRO works, to 2 minutes).
Thank you for your time in the meantime 🙂
Could it be that when your request comes in there are no GPUs available that meet your criteria? Is 2x 80GB a highly available option? Are you using a specific region?
Yes, always highly available, and I'm using all the available regions.
It's strange, because if there are no GPUs available it should throw me an error. It's a big problem in production, because a request can be stuck loading forever or return an empty response (when a response limit is set).
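As a client-side stopgap while this gets sorted out, one option is to enforce your own deadline and cancel jobs that hang in the loading state, using RunPod's /run, /status and /cancel REST endpoints. A minimal Python sketch; the endpoint ID, API key, payload shape and 5-minute deadline are placeholders, not values from this thread:

import time
import requests

# Placeholders: substitute your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "YOUR_RUNPOD_API_KEY"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def run_with_deadline(payload, deadline_s=300):
    """Submit a job, poll its status, and cancel it if it exceeds the deadline."""
    job = requests.post(f"{BASE_URL}/run", json={"input": payload},
                        headers=HEADERS, timeout=30).json()
    job_id = job["id"]
    started = time.time()
    while time.time() - started < deadline_s:
        status = requests.get(f"{BASE_URL}/status/{job_id}",
                              headers=HEADERS, timeout=30).json()
        if status.get("status") == "COMPLETED":
            return status.get("output")
        if status.get("status") in ("FAILED", "CANCELLED", "TIMED_OUT"):
            return None
        time.sleep(5)
    # Deadline exceeded (e.g. a worker stuck on model loading): cancel rather than wait forever.
    requests.post(f"{BASE_URL}/cancel/{job_id}", headers=HEADERS, timeout=30)
    return None

# Illustrative payload shape for the vLLM worker:
# result = run_with_deadline({"prompt": "Hello"}, deadline_s=300)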
I reviewed the logs for this endpoint, and after the log entry you mentioned, ‘starting to load model…’, I can see that the model eventually loads. It’s pretty normal for it to take a couple of minutes before the loading completes. I’m not an expert on vLLM, but I believe this is related to some kind of initialization process. For large models, it can take up to 4 or 5 minutes. So, it might just require a bit more patience.
When there's little traffic, the cold start can be quite long. If your endpoint doesn't have much activity, the model is likely to be removed from GPU memory, and you'll see this delay again. But if the model is being used constantly, it should perform much better.
If you're looking for faster load times, in my experience SXM GPUs (like the A100 or H100) tend to be faster. You could also try using 2x 48GB GPUs. When using two GPUs, enable these settings and see if they help with loading (there's a rough sketch of what they map to after the list):
• TENSOR_PARALLEL_SIZE = 2
• MAX_PARALLEL_LOADING_WORKERS = 2
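For reference, a rough sketch of what those two variables correspond to inside the worker, assuming they are passed straight through to vLLM's engine arguments (the model name and GPTQ quantization are taken from the thread; this is not the worker's actual code):

from vllm import LLM

# Sketch only: the serverless worker reads TENSOR_PARALLEL_SIZE and
# MAX_PARALLEL_LOADING_WORKERS from the environment and forwards them
# as engine arguments roughly like this.
llm = LLM(
    model="ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit",
    quantization="gptq",             # 4-bit GPTQ checkpoint from the thread
    tensor_parallel_size=2,          # TENSOR_PARALLEL_SIZE=2: shard the model across both GPUs
    max_parallel_loading_workers=2,  # MAX_PARALLEL_LOADING_WORKERS=2: load shards in parallel
)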