RunPod · 17h ago
3WaD

How long does it normally take to get a response from your VLLM endpoints on RunPod?

Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod VLLM image. But the job takes 30+ seconds each time - 99% of it is loading the engine and the model (counted as delay time), and the execution itself is under 1s. Flashboot is on. Is this normal or is there a setting or something else I should check to make the Flashboot kick in? How long do your models and endpoints normally take to return a response?
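For reference, the split between cold-start delay and actual inference shows up in the endpoint response itself. A minimal sketch, assuming a standard RunPod serverless /runsync call - the endpoint ID, API key variable, and input payload shape are placeholders, and the delayTime/executionTime field names are as commonly seen in RunPod responses, not verified against this exact setup:

```python
# Hedged sketch: call a RunPod serverless endpoint and compare the reported
# delay (queue + cold start / model load) vs. execution (inference) time.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"          # placeholder, not a real endpoint
API_KEY = os.environ["RUNPOD_API_KEY"]    # assumes the key is set in the environment

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello in one sentence."}},  # payload shape is an assumption
    timeout=120,
)
data = resp.json()

# delayTime and executionTime are reported in milliseconds in the responses
# I've seen - treat the field names as an assumption.
print("status:          ", data.get("status"))
print("delayTime (ms):  ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
```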
11 Replies
nerdylive · 15h ago
What kind of setup do you use? Do you build the image yourself? Where do you store the model, network storage or inside the image?
3WaD (OP) · 14h ago
It's the official VLLM image selected in the RunPod dashboard. I only added the model name and used Ray. Otherwise everything should be at the defaults.
nerdylive · 14h ago
Network volume?
3WaD (OP) · 14h ago
Nope
nerdylive · 13h ago
Try using one, because you're downloading the model every time - that's why it takes around 30s.
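A hedged sketch of that suggestion: pre-download the model onto a network volume so the worker doesn't fetch it from the Hub on every cold start. The /runpod-volume mount path and the target directory name are assumptions:

```python
# Hedged sketch: cache the model weights on a RunPod network volume once,
# then point the worker's model path at that directory instead of the Hub repo id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct",
    local_dir="/runpod-volume/models/Qwen2.5-0.5B-Instruct",  # mount path is an assumption
)
```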
3WaD (OP) · 13h ago
The Flashboot just doesn't seem to work with the Ray distributed executor backend, as I see now. That makes sense, I guess. Ray is overkill for single-node inference anyway, so I'll stick with MP, which works. But good to know. I'll try to discourage everyone from using it with my custom image.
nerdylive · 13h ago
When you disable Flashboot, what happens then? What's MP, btw?
3WaD (OP) · 13h ago
Then it would work the same, since Flashboot has no effect with Ray. That's what I meant. VLLM has two possible distributed executor backends - Ray and MultiProcessing (MP) - which you need if you want to use VLLM's continuous batching and the RunPod worker's concurrency together.
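For context, this is the setting being discussed. A minimal sketch using vLLM's engine argument for the executor backend (argument name as in recent vLLM releases; the model just mirrors the one tested in this thread):

```python
# Hedged sketch: pick the distributed executor backend explicitly.
# "mp" (multiprocessing) is the usual choice for a single-node serverless worker;
# "ray" spins up a Ray cluster and, per this thread, doesn't play well with Flashboot.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    distributed_executor_backend="mp",   # or "ray" for multi-node setups
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```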
nerdylive · 12h ago
Ohh ic
Poddy · 12h ago
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive · 12h ago
Maybe RunPod's team should look at this too