How long does it normally take to get a response from your VLLM endpoints on RunPod?
Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod VLLM image, but the job takes 30+ seconds each time: 99% of it is loading the engine and the model (counted as delay time), while the execution itself is under 1 s. Flashboot is on. Is this normal, or is there a setting or something else I should check to make Flashboot kick in? How long do your models and endpoints normally take to return a response?
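For reference, this is roughly how I'm measuring where the time goes. A minimal sketch, assuming the standard serverless HTTP API (`/run` + `/status`), that the completed job response includes `delayTime` / `executionTime` in milliseconds, and the usual vLLM worker input format; the env var names are placeholders:

```python
# Minimal sketch: submit one job and compare cold-start delay vs. execution time.
import os
import time
import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]   # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]           # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# Submit a tiny request to the vLLM worker (input format assumed)
job = requests.post(
    f"{BASE}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello", "sampling_params": {"max_tokens": 8}}},
    timeout=30,
).json()

# Poll until the job finishes
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS, timeout=30).json()
    if status.get("status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

print("delayTime (ms):", status.get("delayTime"))           # queue + engine/model load
print("executionTime (ms):", status.get("executionTime"))   # actual handler time
```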
What kind of setup do you use? Do you build the image yourself?
Where do you store the model: in network storage or inside the image?
It's the official VLLM selected in the RunPod dashboard. I added only the model name and used Ray. Otherwise, everything should be the default
Network volume?
Nope
Try using one. You're downloading the model every time, which is why it takes around 30 seconds.
Flashboot just doesn't seem to work with the Ray distributed executor backend, as I can see now. That makes sense, I guess. Ray is overkill for single-node inference anyway, so I'll stick with MP, which works. But good to know. I'll try to discourage everyone from using Ray with my custom image.
When you disable flashboot what happens then?
What's MP btw?
It would behave the same, since Flashboot has no effect with Ray. That's what I meant.
VLLM has two possible distributed executor backends, Ray or MultiProcessing (MP), which you need if you want to use VLLM's continuous batching and the RunPod worker concurrently.
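For example, with vLLM's own Python API the backend is just one engine argument. A minimal sketch, assuming a recent vLLM version where `distributed_executor_backend` is accepted as an engine arg; this is not the RunPod worker code itself:

```python
# Minimal sketch: choosing the distributed executor backend in vLLM's offline API.
from vllm import LLM, SamplingParams

# MultiProcessing backend for single-node inference ("ray" selects the Ray backend instead)
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    distributed_executor_backend="mp",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```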
Ohh ic
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Maybe RunPod's team should look at this too.