RunPod
•Created by 3WaD on 1/7/2025 in #⚡|serverless
Optimizing VLLM for serverless
Hello. I am trying to optimize vLLM for a serverless endpoint. The default vLLM settings are blazing fast for cached workers (~1s) but unusable with cold-start initialization (40-60+ seconds).
Forcing eager mode removes the CUDA graph capture and helps push the initialization cold starts down to ~20s, at the price of slower generation. Beyond that, I feel stuck on what could be improved, since the longest tasks are currently creating the LLM engine and vLLM's memory profiling stage, each taking up to 6 seconds. I am attaching the complete log file with time comments from such a job.
I am wondering if anyone has found a settings sweet spot for the fastest cold starts with acceptable generation speed, or if there is a way to remove the initialization step for newly spawned workers. I have already researched many options, from automatic caching on a network volume (which didn't work at all; with bitsandbytes models no cache is saved) to snapshotting and trying to share the initialized state between workers (which is probably not possible).
Any ideas or help would be appreciated.
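For reference, here is a minimal sketch of the eager-mode setup I mean, assuming the standard vLLM Python API inside a RunPod handler. The model name, sampling values, and handler shape are placeholders, not the exact config from the attached log:
```python
# Minimal sketch: vLLM engine with CUDA graph capture disabled inside a RunPod handler.
# enforce_eager=True skips graph capture (faster init, slower generation).
import runpod
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    enforce_eager=True,                  # skip CUDA graph capture to cut cold-start time
    gpu_memory_utilization=0.90,
)

def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(max_tokens=256, temperature=0.7)
    outputs = llm.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```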
2 replies
RunPod
•Created by 3WaD on 11/22/2024 in #⚡|serverless
How long does it normally take to get a response from your vLLM endpoints on RunPod?
Hello. I've tested a very tiny model (Qwen2.5-0.5B-Instruct) on the official RunPod vLLM image, but the job takes 30+ seconds each time. 99% of it is loading the engine and the model (counted as delay time), and the execution itself is under 1s. FlashBoot is on. Is this normal, or is there a setting or something else I should check to make FlashBoot kick in? How long do your models and endpoints normally take to return a response?
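For anyone comparing numbers, this is roughly how I read the delay/execution split back from a job, assuming the usual /run and /status endpoints and the delayTime / executionTime fields reported in job status. The endpoint ID and API key are placeholders:
```python
# Sketch: separate queue/cold-start delay from handler execution time by polling job status.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# Submit an async job.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "Hello"}}).json()

# Poll until the job finishes.
while True:
    status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

# delayTime covers queueing plus cold-start init; executionTime is the handler itself.
print(status.get("delayTime"), status.get("executionTime"))
```
In my case almost everything lands in the delay part, which is why it looks like FlashBoot isn't kicking in.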
13 replies
RunPod
•Created by 3WaD on 10/11/2024 in #⚡|serverless
OpenAI Serverless Endpoint Docs
Hello. From what I could find in the support threads here, you should be able to make a standard OpenAI request, not wrapped in the "input" param, if you hit your endpoint at https://api.runpod.ai/v2/<ENDPOINT ID>/openai/...
The handler should then receive two new params, "openai_route" and "openai_input," but it's been a couple of months since the threads, and I can't find any official docs about this or the ability to test this locally with the RunPod lib. Can someone please confirm that this works in custom images too? If so, what is the structure of the parameters received? Does "input" in
handler(input)
contain "openai_input" and "openai_route" params directly? Is there any way I can develop this locally?47 replies