Speed up cold start on large models
I'm trying to do some intermittent testing on a 70B LLM, but any time a vLLM worker does a cold start, it downloads the model from HF. This takes about 19 minutes, so costs add up and requests made to the API time out and fail. Once the model is loaded, things are fine, with inference running in 12-15 seconds. Is there any good solution for working with this larger model without keeping a constant worker, which would defeat the whole purpose of running it on serverless for very intermittent testing?
12 Replies
Either bake the model into your Docker image, or attach network storage to your endpoint and store the model on the network storage disk instead of the container disk.
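For the network storage route, a minimal sketch of a one-time download job that pulls the weights onto the volume so later cold starts read from disk instead of re-downloading from Hugging Face. The mount path, model ID, and env var name are assumptions; adjust them to your setup:

```python
# One-time job: pull the weights onto the network volume so cold starts
# read from local disk instead of re-downloading ~140GB from Hugging Face.
# Assumes the volume is mounted at /runpod-volume and huggingface_hub is installed.
import os
from huggingface_hub import snapshot_download

MODEL_ID = "meta-llama/Llama-2-70b-hf"   # hypothetical example model
CACHE_DIR = "/runpod-volume/models"      # path on the attached network volume

os.makedirs(CACHE_DIR, exist_ok=True)

local_path = snapshot_download(
    repo_id=MODEL_ID,
    cache_dir=CACHE_DIR,                 # weights land on the network volume
    token=os.environ.get("HF_TOKEN"),    # needed for gated repos
)
print(f"Model available at: {local_path}")
```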
Baking the fp16 model in would create a ~170GB image, which Docker Hub won't support, so I'm not sure how I'd get it onto the worker. I'm willing to try network storage, though I'd need documentation on how to set that up properly and access the model on it from a cold-start worker.
Sadly there's no AWQ 4-bit quant of the model, only GGUF, which vLLM doesn't support. I'd make a quant myself if I could figure out how to do that successfully.
How are you using a vLLM worker if you say it doesn't support GGUF? Network storage is used automatically by the RunPod vLLM worker if you attach it to the endpoint.
I'm using the unquantized model, not a GGUF.
Which is why it's so big.
Yeah, I recommend network storage then.
I'll give that a try. Ideally I'll find up-to-date info on how to quantize it to AWQ that isn't paywalled. Thank you!
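For what it's worth, a minimal sketch of AWQ quantization using the AutoAWQ library, assuming enough RAM/VRAM to load the fp16 70B checkpoint and that autoawq plus transformers are installed; the paths are placeholders:

```python
# Rough AWQ quantization sketch with AutoAWQ; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/original-70b-model"   # fp16 source checkpoint
quant_path = "path/to/output-70b-awq"       # where the 4-bit quant is saved

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the source model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize (uses AutoAWQ's default calibration data)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights + tokenizer so vLLM can load them with AWQ quantization enabled
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting folder can then be baked into the image or dropped on network storage like any other HF-format model.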
Yeah, it's unfortunate that people love paywalling information these days instead of just sharing it freely.
How will baking the models into the Docker image help reduce cold start time? The engine itself (in my case ComfyUI) loads its models from the same path anyway, which is on the network storage (because the ComfyUI engine is on the network storage), so it will have to do a copy operation and a load operation either way.
I could make the entire engine and models part of the Docker image, but wouldn't that make the cold start time longer because of the download time?
Currently I'm seeing crazy weird numbers: sometimes it takes 0.1s delay and then 160 seconds of execution time, sometimes 40s delay and 200s exec, and as long as the pod isn't idle, it takes about 1s delay and 20-40s of exec time.
I'm trying to better understand why
Also, assuming I bake the image with my entire ComfyUI + nodes + models installation, I'd still need to copy it to network storage during installation, no?
No, actually it stays downloaded, so it doesn't download your image on every run.
That's FlashBoot when the time is fast, and yes, it's kinda random, but it speeds up cold starts if your requests come in constantly.
oh that might actually be pretty good then
Yee
> sometimes it takes 0.1 delay and then 160 seconds of execution time, sometimes it takes 40 delay and 200 exec, and as long as the pod isn't idle, it takes about 1s delay and 20-40s of exec time.

This was a common problem when people tried to bake large files into AWS AMIs. AWS would start the EC2 instance quickly but lazy-load the bytes needed to run applications, so while the app was quick to start, it still had to pull in the entire data file before serving a request, and the initial request would take forever. There are techniques to do this with containers as well: only load the layers needed for initial application startup, then load the rest of the layers whenever their files are accessed by the running application. I would bet something similar is happening here. It's generally a good way to improve startup times for containers that don't need massive files/libraries to run their main application, but it's a pretty sneaky anti-pattern for the use case of storing LLM weights in containers. You think the container is starting up fast, but really it's going to take an equal or longer amount of time to download the LLM weights behind the scenes :/
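One way to at least front-load that hidden cost is to touch every weight file during worker startup, before the handler accepts requests, so any lazily pulled layers (or slow network storage reads) are materialized up front. A rough sketch, with the model directory and chunk size as assumptions:

```python
# Hypothetical warm-up step for a worker's startup script: read every weight
# file end-to-end once so lazily pulled data is fetched before the first
# inference request arrives. Point MODEL_DIR at wherever your weights live.
import os
import time

MODEL_DIR = "/runpod-volume/models"   # or the baked-in path inside the image
CHUNK = 64 * 1024 * 1024              # read in 64MB chunks

start = time.time()
total = 0
for root, _, files in os.walk(MODEL_DIR):
    for name in files:
        with open(os.path.join(root, name), "rb") as f:
            while True:
                data = f.read(CHUNK)  # discard the data; we only want it pulled/cached
                if not data:
                    break
                total += len(data)

print(f"Touched ~{total / 1e9:.1f} GB in {time.time() - start:.0f}s before serving requests")
```

It doesn't make the bytes arrive any faster, but it moves the wait out of the first user-facing request and makes the real cold-start cost visible in your logs.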