vLLM model loading, TTFT unhappy path
I am looking for a way to reduce latency on the unhappy path of vLLM endpoints.
I use the quickstart vLLM template, backed by network storage for model weights, with FlashBoot enabled. By default, the worker loads the model weights on the first request. This, however, poses the risk of exposing my customers to an unhappy path with latency measured in minutes; at scale we could see this in significant absolute numbers.
What would be the best way for me to make sure that a worker is considered ready only >after< it has loaded the model checkpoints, and to trigger checkpoint loading without sending the first request? Should I roll my own vLLM container image? Or is there an idiomatic way to parameterize the quickstart template to achieve this? I would prefer to use the Runpod-supplied, properly supported vLLM image, if possible.
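For context, here is roughly how I reproduce the unhappy path today: a minimal sketch that times a cold request against the serverless /runsync endpoint. The endpoint ID is a placeholder, and the payload shape is what I understand the quickstart vLLM worker to accept; adjust for your setup.

```python
import os
import time

import requests

ENDPOINT_ID = "my-vllm-endpoint"  # placeholder, substitute your endpoint ID
API_KEY = os.environ["RUNPOD_API_KEY"]

# Minimal one-token generation, just enough to force the worker
# to start and load the model checkpoints.
payload = {
    "input": {
        "prompt": "ping",
        "sampling_params": {"max_tokens": 1},
    }
}

start = time.monotonic()
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,  # cold starts can take minutes
)
resp.raise_for_status()
print(f"cold-start latency (full response): {time.monotonic() - start:.1f}s")
```

As far as I can tell, the endpoint's /health route only reports worker and job counts, not whether the model is actually in memory, which is why I measure with a real generation request.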
1 Reply
Not possible:
> What would be the best way for me to make sure that a worker is considered ready only >after< it has loaded the model checkpoints

This part is possible; just enable active workers:
> and to trigger checkpoint loading without sending the first request?
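If you want to set that programmatically rather than in the console, something like the sketch below should work. Treat it as a hedged sketch: it assumes the public GraphQL API's saveEndpoint mutation and its workersMin field, which may require additional fields or have changed since I last checked; the "Active Workers" setting on the endpoint's console page does the same thing.

```python
import os

import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "my-vllm-endpoint"  # placeholder, substitute your endpoint ID

# Assumed mutation/field names (saveEndpoint, workersMin); verify
# against the current GraphQL docs before relying on this.
mutation = f"""
mutation {{
  saveEndpoint(input: {{ id: "{ENDPOINT_ID}", workersMin: 1 }}) {{
    id
    workersMin
  }}
}}
"""

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": API_KEY},
    json={"query": mutation},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

With at least one active worker, the container starts and loads the model as soon as it is provisioned, so no customer request ever pays the load cost on that worker.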
On the technical side: when a request comes in, the container starts on the GPU and then loads the model. If the worker then goes unused, FlashBoot tries to keep your model in VRAM (provided you actually loaded it); check the serverless docs for more details on FlashBoot.
But after idling for some time, the worker is torn down like a new one: it shows as ready, but without the model loaded.
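If you don't want to pay for an active worker, a scheduled warm-up ping is a crude alternative: it re-triggers the model load before a real customer lands on a cold worker. A minimal sketch, assuming the same placeholder endpoint ID and quickstart payload shape as in your snippet:

```python
import os
import time

import requests

ENDPOINT_ID = "my-vllm-endpoint"  # placeholder, substitute your endpoint ID
API_KEY = os.environ["RUNPOD_API_KEY"]
WARMUP_INTERVAL_S = 120  # keep this shorter than the endpoint's idle timeout

def warm_up() -> None:
    """Queue a one-token job; queuing alone is enough to spin a worker up."""
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": "ping", "sampling_params": {"max_tokens": 1}}},
        timeout=30,
    )
    resp.raise_for_status()

while True:
    warm_up()
    time.sleep(WARMUP_INTERVAL_S)
```

Note this still costs you the warm-up tokens and some GPU seconds, and it only keeps one worker warm; active workers are the supported way to guarantee a loaded model.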