Speed up cold start on large models

I'm trying to do some intermittent testing on a 70B model, but any time a vLLM worker does a cold start, it downloads the model from HF. This takes about 19 minutes, so costs add up and the API requests time out and fail. Once the model is loaded, things are fine, with inference running in 12-15 seconds. Is there any good solution for working with this larger model without keeping a constant worker running, which would defeat the whole purpose of using serverless for very intermittent testing?
12 Replies
digigoblin (6mo ago)
Either bake the model into your docker image, or attach network storage to your endpoint and store the model on the network storage disk instead of the container disk.
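Either way, the idea is to get the weights onto local or attached storage ahead of time so cold starts read from disk instead of pulling the full model from Hugging Face. A minimal sketch of that pre-download step, assuming a hypothetical repo id and target path (the `/runpod-volume` mount point is an assumption about how the network volume is exposed):

```python
# Sketch: pre-download the weights once, either at Docker image build time
# (to bake them into the image) or once from a pod with the network volume
# attached (to store them on network storage).
from huggingface_hub import snapshot_download

# Hypothetical repo id and target directory -- adjust for your model and volume mount.
REPO_ID = "meta-llama/Llama-2-70b-hf"
TARGET_DIR = "/runpod-volume/models/llama-70b"

snapshot_download(
    repo_id=REPO_ID,
    local_dir=TARGET_DIR,
    # Optional: skip formats vLLM won't use, assuming safetensors weights exist in the repo.
    ignore_patterns=["*.gguf"],
)
```

After that, the weights are read from disk at cold start rather than fetched over the network on every scale-up.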
Charixfox (OP, 6mo ago)
Baking in the f16 model would create a ~170GB image, which Docker Hub won't support, so I'm not sure how I'd get it onto the worker. I'm willing to try network storage, though I'd need documentation on how to set that up properly and access the model on it from a cold-start worker. Sadly there's no AWQ 4-bit quant of the model, only GGUF, which vLLM doesn't support. I'd make a quant myself if I could figure out how to do it successfully.
digigoblin (6mo ago)
How are you using a vLLM worker if you say it doesn't support GGUF? Network storage is used automatically by the RunPod vLLM worker if you attach it to the endpoint.
Charixfox (OP, 6mo ago)
I'm using the unquantized model, not a GGUF. Which is why it's so big.
digigoblin (6mo ago)
Yeah, I recommend network storage then.
Charixfox (OP, 6mo ago)
I'll give that a try. Ideally I'll also find up-to-date, non-paywalled info on how to quantize it to AWQ. Thank you!
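For reference, a minimal sketch of the usual non-paywalled route using the AutoAWQ library (model paths here are placeholders, and quantizing a 70B model needs a machine with substantial RAM/VRAM):

```python
# Sketch: 4-bit AWQ quantization with AutoAWQ (pip install autoawq).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-org/your-70b-model"   # hypothetical source repo or local path
quant_path = "your-70b-model-awq"        # output directory for the quantized model

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then save the quantized weights plus the tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

vLLM should then be able to load the output with its `quantization="awq"` option, and the 4-bit weights are roughly a quarter of the f16 download size.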
digigoblin (6mo ago)
Yeah, it's unfortunate that people love paywalling information these days instead of just sharing it freely.
Sassy Pantsy (6mo ago)
How will baking the models into the docker image help reduce cold start time? The engine itself (in my case ComfyUI) loads its models from the same path anyway, which is on the network storage (because the ComfyUI engine is on the network storage), so it will have to do a copy operation and a load operation anyway. I could make the entire engine and models part of the docker image, but won't that make the cold start longer due to the download time? Currently I'm getting crazy, inconsistent numbers: sometimes 0.1s delay and then 160s of execution time, sometimes 40s delay and 200s execution, and as long as the pod isn't idle, about 1s delay and 20-40s of execution. I'm trying to better understand why. Also, assuming I bake the image with my entire ComfyUI + nodes + models installation, I'd still need to copy it to network storage during installation, no?
nerdylive (6mo ago)
No, actually it stays downloaded, so it doesn't re-download your image every run. The fast times are FlashBoot, and yes, it's kinda random, but it speeds up cold starts if your requests come in constantly.
Sassy Pantsy (6mo ago)
oh that might actually be pretty good then
nerdylive (6mo ago)
Yee
Noah Yoshida (6mo ago)
> sometimes 0.1s delay and then 160 seconds of execution time, sometimes 40s delay and 200s exec, and as long as the pod isn't idle, about 1s delay and 20-40s of exec time.
This was a common problem when people tried to bake large files into AWS AMIs. AWS would start the EC2 instance quickly but lazy-load the bytes needed to run applications, so while the app was quick to start, it still had to pull in the entire data file before serving a request, and the initial request would take forever. There are techniques to do this with containers as well: only load the layers needed for initial application startup, then load the remaining layers whenever the running application tries to access their files. I would bet something similar is happening here. It's generally a good way to improve startup times for containers that don't need massive files/libraries to run their main application, but it's a pretty sneaky anti-pattern for the use case of storing LLM weights in containers. You think the container is starting up fast, but really it's going to take an equal or longer amount of time to download the LLM weights behind the scenes :/