How long does it normally take to get a response from your VLLM endpoints on RunPod?
Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod VLLM image, but the job takes 30+ seconds each time: 99% of it is loading the engine and the model (counted as delay time), while the execution itself is under 1 s. Flashboot is on. Is this normal, or is there a setting or something else I should check to make Flashboot kick in? How long do your models and endpoints normally take to return a response?
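For reference, this is roughly how I'm measuring where the time goes. A minimal sketch, assuming the standard serverless HTTP API (`/run` + `/status`), that the completed job response includes `delayTime` / `executionTime` in milliseconds, and the usual vLLM worker input format; the env var names are placeholders:

```python
# Minimal sketch: submit one job and compare cold-start delay vs. execution time.
import os
import time
import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]   # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]           # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# Submit a tiny request to the vLLM worker (input format assumed)
job = requests.post(
    f"{BASE}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello", "sampling_params": {"max_tokens": 8}}},
    timeout=30,
).json()

# Poll until the job finishes
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS, timeout=30).json()
    if status.get("status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

print("delayTime (ms):", status.get("delayTime"))           # queue + engine/model load
print("executionTime (ms):", status.get("executionTime"))   # actual handler time
```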
What kind of setup do you use? Do you build the image yourself?
Where do you store the model: in network storage or inside the image?
It's the official VLLM selected in the RunPod dashboard. I added only the model name and used Ray. Otherwise, everything should be the default
Network volume?
Nope
Try using one. You're downloading the model every time, which is why it takes around 30 seconds.
Flashboot just doesn't seem to work with the Ray distributed executor backend, as I can see now. That makes sense, I guess. Ray is overkill for single-node inference anyway, so I'll stick with MP, which works. But good to know. I'll try to discourage everyone from using Ray with my custom image.
When you disable flashboot what happens then?
What's MP btw?
It would behave the same, since Flashboot has no effect with Ray. That's what I meant.
VLLM has two possible distributed executor backends, Ray or MultiProcessing (MP), which you need if you want to use VLLM's continuous batching and the RunPod worker concurrently.
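For example, with vLLM's own Python API the backend is just one engine argument. A minimal sketch, assuming a recent vLLM version where `distributed_executor_backend` is accepted as an engine arg; this is not the RunPod worker code itself:

```python
# Minimal sketch: choosing the distributed executor backend in vLLM's offline API.
from vllm import LLM, SamplingParams

# MultiProcessing backend for single-node inference ("ray" selects the Ray backend instead)
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    distributed_executor_backend="mp",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```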
Ohh ic
@3WaD
Escalated To Zendesk
The thread has been escalated to Zendesk!
Maybe RunPod's team should look at this too.