Guidance on Mitigating Cold Start Delays in Serverless Inference
We are experiencing delays during cold starts of our serverless inference server, which runs a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices.
Additional Context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is being used for inference as part of a production system requiring low latency.
- We offer streaming inference, where low latency is critical for usability. Currently, many calls experience delays exceeding 5 seconds, making the solution infeasible for our purposes (see the attached image showing the delay times from this month).
Solution Options
1. Injecting the Model Directly into the Docker Container
This involves embedding the fully trained model within the Docker container. The server would bring up the container with the model file already included.
Cons: This will result in significantly larger Docker images.
Questions:
- Will this approach impact cold-start times, considering the increased image size?
- Are there recommended limits on Docker image sizes for serverless environments?
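For reference, a build-time download for option 1 could look roughly like the sketch below, run as a step in the Dockerfile (e.g. `RUN python download_model.py`); the repo id and target directory are placeholders for our custom model.

```python
# download_model.py -- executed once at image build time, not at cold start
import os

from huggingface_hub import snapshot_download

MODEL_REPO = "your-org/whisper-custom"   # placeholder repo id for the custom model
TARGET_DIR = "/models/whisper-custom"    # fixed path baked into the image

# Fetch every weight/config file into the image layer so the inference code
# can load the model locally, with no network access during worker startup.
snapshot_download(
    repo_id=MODEL_REPO,
    local_dir=TARGET_DIR,
    token=os.environ.get("HF_TOKEN"),    # build-time secret for a private repo, if needed
)
```

The inference code would then point at TARGET_DIR instead of the repo id, so the cold-start cost becomes reading from local disk and loading onto the GPU.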
2. Using Storage - Disk Volume
Store the model weights on a disk volume provided by the cloud provider. The serverless instance would mount the storage to access the weights.
Cons: Potential additional storage costs.
Questions:
- Does the serverless platform support disk volume storage? We could only find documentation about using storage with Pods.
- If supported, is mounting disk volume storage expected to improve cold-start performance?
3. Using Storage - Network Storage
Host the model weights on an intranet storage solution to enable faster downloads (compared to public repositories like Hugging Face).
Cons: Possible network storage costs and additional management overhead.
Questions:
- Does the serverless platform support network storage for serverless instances? Again, documentation appears focused on Pods.
- Are there recommendations or best practices for integrating network storage with serverless instances?
We would like some guidance on which approach we should pursue, considering we intend to use it for streaming inference.
If any of those options are not optimal, could you suggest an alternative?

20 Replies
Most of the time, if your Whisper model isn't too big and you only have one model, you may want to choose the first option: inject the model inside the Docker image / container.
The cold-start time would be faster since the model is already on the local NVMe disk, compared to network storage.
If supported, is mounting disk volume storage expected to improve cold-start performance?
No, the serverless disk gets cleared every time the worker goes down. The other option is to use network storage, which is a decentralized storage that is usually slower than baking the models into the container.
Are there recommendations or best practices for integrating network storage with serverless instances?
If you want, just download the model when it doesn't exist in /workspace, because the network volume is mounted there.
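A minimal sketch of that check-before-download pattern, assuming the volume is mounted at /workspace and using a placeholder repo id:

```python
import os

from huggingface_hub import snapshot_download

MODEL_REPO = "your-org/whisper-custom"          # placeholder
MODEL_DIR = "/workspace/models/whisper-custom"  # lives on the mounted network volume

def ensure_model() -> str:
    """Download the weights to the volume only if they are not already there."""
    if not os.path.isdir(MODEL_DIR) or not os.listdir(MODEL_DIR):
        snapshot_download(repo_id=MODEL_REPO, local_dir=MODEL_DIR)
    return MODEL_DIR
```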
For streaming inference I'd suggest opening a TCP port for your service (edit the endpoint, then expand the options to open a port), so external clients can connect directly to your service in RunPod.
In short: with this TCP port connection, you can use /run only to start the worker, and then connect directly to the worker later.
read more: https://discord.com/channels/912829806415085598/1238153273077141574
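For illustration only, the worker-side piece of that direct TCP setup could look roughly like this; the port is an assumption and has to match whatever TCP port you expose on the endpoint, and the echo loop just stands in for the real streaming transcription logic:

```python
import socket

HOST, PORT = "0.0.0.0", 8000  # assumed port; must match the TCP port opened on the endpoint

def serve() -> None:
    """Accept direct TCP connections from external clients and stream data back."""
    with socket.create_server((HOST, PORT)) as server:
        while True:
            conn, _addr = server.accept()
            with conn:
                # Placeholder: read audio chunks from the client and write
                # transcription results back as soon as they are available.
                while chunk := conn.recv(4096):
                    conn.sendall(chunk)

if __name__ == "__main__":
    serve()
```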
@nerdylive Actually, we download the models only during the build, so they are not being downloaded again during cold starts. However, we still think the "normal" cold starts are too long, taking about 10 s (loading the model itself usually takes about 2-5 s).
Furthermore, we have no idea why in some rare cases it takes an absurd amount of time, like the >100 s cases. This is our biggest problem.
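One way to pin down where those rare >100 s delays come from is to log timestamps around each startup phase, so the worker logs show whether the time is spent before the handler starts (image pull / container setup), in model loading, or in queueing. A rough sketch, with `load_model` as a placeholder for our actual loading code:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cold-start")

_import_t = time.monotonic()  # module import marks the start of worker init

def load_model():
    """Placeholder for the real model-loading code (faster-whisper in our case)."""
    ...

_t0 = time.monotonic()
model = load_model()
log.info("model load took %.1f s", time.monotonic() - _t0)

def handler(event):
    # Anything much larger than the model-load time here points at delays
    # that happen before or outside the handler (image pull, scheduling, queue).
    log.info("request handled %.1f s after module import", time.monotonic() - _import_t)
    ...
```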
so sometimes it takes longer to load? is it on specific workers?
hmm that's weird, what library do you use to load it?
there's nothing failing either?
Yeah, sometimes it did happen on specific workers. I use faster-whisper to load the models, and there's nothing failing.
But this still does not explain how I got more than 100 s of delay time.
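For context, the load itself is just the usual faster-whisper call on the local path already baked into the image; the path, device, and compute type below are assumptions:

```python
from faster_whisper import WhisperModel

MODEL_DIR = "/models/whisper-custom"  # assumed local path of the converted weights in the image

# Loading from a local directory means no Hugging Face download at cold start;
# the remaining 2-5 s is reading the files and moving the weights onto the GPU.
model = WhisperModel(MODEL_DIR, device="cuda", compute_type="float16")

segments, info = model.transcribe("sample.wav")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```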
maybe ask support, i hope they can check
@nerdylive I noticed that sometimes a worker takes too much time to completely setup a docker image, and sometimes the worker that is "downloading" the docker image is set as "idle" instead of "initializing". I think this is a bug. What can happen in this case is that a request may be allocated to this bugged worker, and I believe this is why the delay time may be huge sometimes.
Would using a Network Volume solve this problem? Note: I already download the models when building the docker image, so they're already cached. The problem is when a new worker is started and needs to download the docker image. My image is 8 GiB total, so it's not that big, but downloading it takes too much time through RunPod.
Or is the Network Volume completely unrelated in this case?
any endpoint id / worker id support can check on?
when it's downloading and it's "idle", yes, it shouldn't be like that
does that happen often?
I passed both info to support yesterday:
request id:
sync-2fbf700d-b754-44d2-8df2-9ac9fb536005-u1
worker id: l8q3x9g7a1prqj
While I'm not 100% sure that this happened (since I did not annotate the exact worker id), I noticed in the log that the worker that had a "running" status was downloading the docker image.
But after the worker executed the request, the previous log disappeared.
Network volume is unrelated, yep.
Thanks thanks
oh just give the endpoint id then
to the ticket
Oh yeah, did it already.
yea hopefully they can check that
alright
@nerdylive the problem I mentioned is actually happening right now

I don't believe this worker should be considered idle
If you refresh it?
The refresh button on the top right corner
And what changes did you make to the endpoint that caused that condition?
or is it like that when you create a new one?
I'm using the same endpoint, just terminated the other workers as a test.
I did refresh the page but with F5.
Hmm and it's still idle? After you refresh it?
I mean, the same endpoint shouldn't be downloading the image again, right? What changes did you make, or did you just create that endpoint?
I mean, now it's okay since it's been some time since everything downloaded.
I had enabled the network volume before, thinking it could be a solution. Then I disabled it and terminated all the workers to get new ones on the "latest version" of the endpoint. Some workers already had the cached docker image (probably because I used them before), but the ones that didn't needed to download it.
And I see this all the time: different workers downloading the image, even in the same endpoint. I thought it was standard.
You don't need to terminate it btw
You might send this in the ticket too, to say what changes you made to the endpoint to get that.