Guidance on Mitigating Cold Start Delays in Serverless Inference
We are experiencing delays during cold starts of the serverless server we use for inference with a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model we trained), which the Python code fetches through the Hugging Face package. We are exploring possible solutions and need guidance on feasibility and best practices.
Additional Context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is being used for inference as part of a production system requiring low latency.
- We offer streaming inference, where low latency is critical for usability. Currently, many calls experience delays exceeding 5 seconds, which makes the solution infeasible for our purposes (see the image showing this month's delay times).
Solution Options
1. Injecting the Model Directly into the Docker Container
This involves embedding the fully trained model within the Docker container. The server would bring up the container with the model file already included.
Cons: This will result in significantly larger Docker images.
Questions:
- Will this approach impact cold-start times, considering the increased image size?
- Are there recommended limits on Docker image sizes for serverless environments?
2. Using Storage - Disk Volume
Store the model weights on a disk volume provided by the cloud provider. The serverless instance would mount the storage to access the weights.
Cons: Potential additional storage costs.
Questions:
- Does the serverless platform support disk volume storage? We could only find documentation about using storage with Pods.
- If supported, is mounting disk volume storage expected to improve cold-start performance?
3. Using Storage - Network Storage
Host the model weights on an intranet storage solution to enable faster downloads (compared to public repositories like Hugging Face).
Cons: Possible network storage costs and additional management overhead.
Questions:
- Does the serverless platform support network storage for serverless instances? Again, documentation appears focused on Pods.
- Are there recommendations or best practices for integrating network storage with serverless instances?
We would like some guidance on which approach we should pursue, considering that we want to use it for streaming inference.
If none of these options is optimal, could you suggest an alternative?
5 Replies
Most of the time, if your Whisper model isn't too big and you only have one model, you can go with the first option: bake the model into the image / Docker container.
The cold start would be faster because the model is already on the local NVMe disk, compared to network storage.
If supported, is mounting disk volume storage expected to improve cold-start performance?
No, the serverless disk gets cleared every time the worker goes down. The other option is network storage, which is decentralized storage and is usually slower than baking the models into the container.
Are there recommendations or best practices for integrating network storage with serverless instances?
if you want, just download it when the model doesn't exist in /workspace, because it's mounted there
for streaming inference I'd suggest using a TCP connection: open a TCP port for your service (edit the endpoint, then expand the options to open a port), so external clients can connect directly to your service in RunPod
in short: with this TCP port connection, you use /run only to start the worker, and then connect directly to the worker afterwards
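To make the direct-connection idea concrete, here is a rough sketch of a worker-side TCP service using only the Python standard library. The framing (a 4-byte length prefix per audio chunk, one text line per reply), the port number, and `fake_transcribe` are all assumptions for illustration; the real service would call the Whisper model there, and how the port is exposed depends on the endpoint settings mentioned above.

```python
import asyncio
import struct


def fake_transcribe(chunk: bytes) -> str:
    # Placeholder for the real model call (hypothetical); only reports the chunk size.
    return f"received {len(chunk)} bytes"


async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    try:
        while True:
            # Each frame: 4-byte big-endian length, then that many bytes of audio.
            (size,) = struct.unpack(">I", await reader.readexactly(4))
            if size == 0:  # a zero-length frame means the client is done
                break
            chunk = await reader.readexactly(size)
            writer.write((fake_transcribe(chunk) + "\n").encode())
            await writer.drain()
    except asyncio.IncompleteReadError:
        pass  # client disconnected mid-frame
    finally:
        writer.close()
        await writer.wait_closed()


async def serve(host: str = "0.0.0.0", port: int = 8765) -> None:
    # Assumption: 8765 is the port opened in the endpoint settings.
    server = await asyncio.start_server(handle_client, host, port)
    async with server:
        await server.serve_forever()


async def demo_roundtrip() -> list:
    """Start the server on a free local port and stream two chunks through it."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        replies = []
        for chunk in (b"\x00" * 320, b"\x00" * 160):
            writer.write(struct.pack(">I", len(chunk)) + chunk)
            await writer.drain()
            replies.append((await reader.readline()).decode().strip())
        writer.write(struct.pack(">I", 0))  # tell the server we are finished
        await writer.drain()
        writer.close()
        await writer.wait_closed()
    return replies
```

On the worker, the handler started by /run would call `asyncio.run(serve())`; `asyncio.run(demo_roundtrip())` exercises the same protocol locally without any model.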
read more: https://discord.com/channels/912829806415085598/1238153273077141574
@nerdylive
Actually, we download the models only during the build, so they are not downloaded again during cold starts. However, we still think the "normal" cold starts are too long, at about 10s (loading the models themselves usually takes about 2-5s).
Furthermore, we have no idea why, in some rare cases, it takes an absurd amount of time, like more than 100s. This is our biggest problem.
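One way to narrow down where the rare slow starts come from is to time each initialization phase and log it together with a worker identifier. This is only a sketch: the phase names are made up, and reading the worker id from the `RUNPOD_POD_ID` environment variable is an assumption about the platform.

```python
import os
import time
from contextlib import contextmanager

PHASES = {}  # phase name -> seconds spent


@contextmanager
def timed(phase, sink=PHASES):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[phase] = time.perf_counter() - start


def report(sink=PHASES):
    """Format the recorded phases as one log line tagged with the worker id."""
    worker = os.environ.get("RUNPOD_POD_ID", "unknown")  # assumption: set by the platform
    parts = ", ".join(f"{name}={secs:.2f}s" for name, secs in sink.items())
    return f"worker={worker} cold-start phases: {parts}"
```

Wrapping the imports, the model load, and the first warm-up inference in separate `with timed("..."):` blocks and printing `report()` once at startup would show whether the slow starts are spent inside the model load or before the handler code even runs.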
so sometimes it takes longer to load? is it on specific workers?
hmm that's weird, which library do you use to load it?
and nothing is failing either?
Yeah. Sometimes it happened on specific workers. I use faster-whisper to load them, and nothing is failing.
But that still doesn't explain how I got more than 100s of delay time.
maybe ask support, i hope they can check