Serverless vLLM doesn't work and gives no error message
I've spent a few hours trying to deploy a serverless vLLM endpoint according to the instructions at https://docs.runpod.io/serverless/workers/vllm/get-started
The endpoint doesn't work, and there's no error message or any other indication of what's wrong.
All the requests I send just stay "in queue" and their status never changes.
The logs show an initialization message and some warnings, but no errors, and the requests themselves never appear in the logs.
The endpoint id is o13ejihy2p9hi8.
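For reference, this is roughly how I'm submitting requests and checking their status. The /run and /status routes are from the serverless API docs; the shape of the "input" field is my assumption about what the vLLM worker expects, so treat that part as a sketch:

```python
# Sketch of how the requests are submitted and polled. The /run and /status
# routes come from the RunPod serverless API; the shape of "input" is what I
# assume the vLLM worker expects -- double-check it against the worker docs.
import os
import time
import requests

ENDPOINT_ID = "o13ejihy2p9hi8"
API_KEY = os.environ["RUNPOD_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# Submit a job.
job = requests.post(
    f"{BASE}/run",
    headers=HEADERS,
    json={"input": {"prompt": "What is the capital of France?"}},
    timeout=30,
).json()
job_id = job["id"]

# Poll until the job leaves the queue (in my case it never does).
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json()
    print(status.get("status"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        print(status)
        break
    time.sleep(5)
```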
@Alpay Ariyak
Could you share the environment variables you set, please?
It looks like the problem was the default timeout value of 600 seconds. Initialization appeared to take longer than that, so the instance kept getting rebooted every 600 seconds and never finished initializing. After I increased the timeout, initialization completed and I got a different error (GPU VRAM exceeded; I'll try a different-sized model).
Edit: I'm still having problems but I think my current issues are related to vLLM and not RunPod so maybe I shouldn't bug you with these issues.
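For completeness, my reading of the RunPod docs is that there is also a per-request execution policy, so the timeout can be raised for a single job as well as at the endpoint level. The field names below are my assumption from the docs, not something I've verified:

```python
# Per-request execution policy (my reading of the RunPod docs -- verify the
# field names before relying on this). Times are in milliseconds; this dict
# would be the JSON body sent to the endpoint's /run route.
payload = {
    "input": {"prompt": "Hello"},
    "policy": {
        "executionTimeout": 1_200_000,  # allow the job to run for up to 20 minutes
    },
}
```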
I'm still happy to help. What are the issues?
Thanks!
1. I'm trying to run TheBloke/airoboros-l2-13B-gpt4-1.4.1-GPTQ on vLLM and it goes OOM with the 24GB VRAM GPU selected; 48GB seems to be the smallest it runs on. The same model fits in under 12GB when I run it in oobabooga. Is it expected behavior that vLLM needs so much more VRAM for the same model? (There's a config sketch after question 3 below.)
2. The output quality seems degraded compared to running the same model in oobabooga. For example, asking "What is the capital of France?" sometimes yields "", sometimes "<<SYS>>", sometimes "< Paris\n\n". I suspect vLLM isn't using the correct prompt template. The recommended API has abstractions that prevent me from seeing what actually goes into the LLM. I was hoping to find a way to feed the entire prompt into vLLM, special tokens included, but I haven't found one. Is that possible without modifying the vLLM source code?
3. I'm trying to understand how FlashBoot works. It's so fast that it must be holding the model weights in VRAM, and it's obviously not possible for RunPod to hold everybody's stuff in VRAM all the time. What happens when I get a "bad cold start" and my workload is executed on a new instance? Is it going to take 15 minutes to spin up, like it took for the first instance? Or am I going to experience a much more reasonable cold start like maybe 20 seconds to load the weights from disk to VRAM?
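Config sketch for 1. (referenced above): as far as I understand, vLLM preallocates KV-cache memory up to a fraction of VRAM, which is why it can need far more than the weights alone. These are the knobs I'd expect to matter; the parameter names are from the vLLM Python API as I understand it, and I don't know which of them the serverless worker exposes, so treat this as a sketch:

```python
# Sketch of a local vLLM setup for the GPTQ model, to compare memory behaviour.
# vLLM preallocates KV-cache memory up to gpu_memory_utilization, so it can
# "use" far more VRAM than the weights alone; lowering max_model_len shrinks
# the per-sequence cache it has to plan for. Parameter names are my reading
# of the vLLM Python API -- verify against the version you're running.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/airoboros-l2-13B-gpt4-1.4.1-GPTQ",
    quantization="gptq",           # load the 4-bit GPTQ weights
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM is allowed to claim
    max_model_len=4096,            # cap context length to bound KV-cache size
    dtype="float16",
)

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```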
Regarding 3., I finally got the "bad cold start" I was trying to trigger in my experiments. It was 45 seconds. I assume this means my container and weights were available on disk on more than one host, and the 45 seconds went into spinning up the container and loading the model weights into VRAM. 45 seconds is too fast to have downloaded the weights, at least.
1. I'm not sure how to calculate the RAM needed for inference, but I think you can quantize the model or change the dtype to a smaller one to fit in the VRAM.
GPTQ is 4-bit quantized and the same model fits in 12GB VRAM in oobabooga
2. Yeah, probably. Check the environment variables and the vLLM source/docs for templating.
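For anyone who lands here with the same templating issue, this is the kind of request I'd try: build the full prompt string yourself (using whatever template the model card specifies, special tokens included) and send it in a raw prompt field. I'm assuming the worker passes a raw "prompt" straight through without applying a chat template, so verify that against the worker's README:

```python
# Sketch: send a fully formatted prompt to the serverless endpoint, bypassing
# any chat templating. Assumes the vLLM worker accepts a raw "prompt" field and
# a "sampling_params" dict -- check the worker's README for the actual schema.
import os
import requests

ENDPOINT_ID = "o13ejihy2p9hi8"
API_KEY = os.environ["RUNPOD_API_KEY"]

# Build the prompt exactly as the model card for the checkpoint specifies;
# the template below is only a placeholder.
prompt = (
    "A chat between a curious user and an assistant.\n"
    "USER: What is the capital of France?\n"
    "ASSISTANT:"
)

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": prompt,
            "sampling_params": {"max_tokens": 64, "temperature": 0.7},
        }
    },
    timeout=120,
)
print(resp.json())
```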
Oh, that's a huge difference. I have no idea about this yet.