Guidance on Mitigating Cold Start Delays in Serverless Inference

We are experiencing delays during the cold starts of our serverless workers used for inference of a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices.

Additional context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is used for inference as part of a production system requiring low latency.
- We offer streaming inference, where low latency is critical for usability. Currently, many calls experience delays exceeding 5 seconds, which makes the solution unfeasible for our purposes (see the attached image showing the delay times from this month).

Solution options

1. Injecting the model directly into the Docker container
Embed the fully trained model within the Docker container, so the server brings up the container with the model file already included.
Cons: significantly larger Docker images.
Questions:
- Will this approach impact cold-start times, considering the increased image size?
- Are there recommended limits on Docker image sizes for serverless environments?

2. Using storage - disk volume
Store the model weights on a disk volume provided by the cloud provider; the serverless instance would mount the storage to access the weights.
Cons: potential additional storage costs.
Questions:
- Does the serverless platform support disk volume storage? We could only find documentation about using storage with Pods.
- If supported, is mounting disk volume storage expected to improve cold-start performance?

3. Using storage - network storage
Host the model weights on an intranet storage solution to enable faster downloads compared to public repositories like Hugging Face.
Cons: possible network storage costs and additional management overhead.
Questions:
- Does the serverless platform support network storage for serverless instances? Again, the documentation appears focused on Pods.
- Are there recommendations or best practices for integrating network storage with serverless instances?

We would like guidance on which approach to pursue, considering that it will be used for streaming inference. If none of these options is optimal, could you suggest an alternative?
[Attached image: delay times from this month]
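For option 1 (baking the weights into the image), a minimal build-time sketch is shown below. It assumes the custom model lives in a Hugging Face repo; the repo id your-org/whisper-custom and the target directory /models/whisper are placeholders. The script would be invoked from a RUN step in the Dockerfile so the weights end up in an image layer instead of being fetched at cold start.

```python
# download_weights.py - run at image build time (e.g. "RUN python download_weights.py"
# in the Dockerfile) so the model weights are baked into the image layer.
from huggingface_hub import snapshot_download

# Placeholder repo id and target path; replace with the real custom model repo.
MODEL_REPO = "your-org/whisper-custom"
MODEL_DIR = "/models/whisper"

snapshot_download(
    repo_id=MODEL_REPO,
    local_dir=MODEL_DIR,  # weights land inside the image, not in a runtime cache
)
print(f"Model weights cached at {MODEL_DIR}")
```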
nerdylive (2mo ago)
Most of the time, if your Whisper model isn't too big and you only have one model, you may choose the first option and inject the model into the image / Docker container. The cold start would be faster because the model is already on the local NVMe disk, compared to network storage.

"If supported, is mounting disk volume storage expected to improve cold-start performance?" No, the serverless disk gets cleared every time the worker goes down. The other option is network storage, which is network-attached and usually slower than baking the models into the container.

"Are there recommendations or best practices for integrating network storage with serverless instances?" If you want, just download the model when it doesn't exist in /workspace, because that's where the volume is mounted.

For streaming inference, I'd suggest running a TCP connection with your service: open a port for TCP (edit the endpoint, then expand the options to open a port) so external clients can connect directly to your service in RunPod. In short: with this TCP port connection, you can use /run only to start the worker, then connect directly to the worker later.

Read more: https://discord.com/channels/912829806415085598/1238153273077141574
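A minimal sketch of the "/workspace fallback" mentioned above, assuming a network volume is mounted at /workspace and using a placeholder Hugging Face repo id; the weights are downloaded only when they are not already present on the volume.

```python
import os

from huggingface_hub import snapshot_download

# /workspace is where RunPod mounts an attached network volume.
# The repo id below is a placeholder for the custom model.
MODEL_DIR = "/workspace/whisper-custom"

def ensure_model() -> str:
    """Download the weights only if they are not already on the volume."""
    if not os.path.isdir(MODEL_DIR) or not os.listdir(MODEL_DIR):
        snapshot_download(repo_id="your-org/whisper-custom", local_dir=MODEL_DIR)
    return MODEL_DIR
```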
Rodka (2mo ago)
@nerdylive Actually, we download the models only during the build, so they are not being downloaded again during cold starts. However, we still think the "normal" cold starts are too long, taking about 10s (loading the model itself usually takes about 2-5s). Furthermore, we have no idea why in some rare cases it takes an absurd amount of time, like >100s. This is our biggest problem.
nerdylive (2mo ago)
So sometimes it takes longer to load? Is it on specific workers? Hmm, that's weird. What library do you use to load it? And nothing is failing either?
Rodka (2mo ago)
Yeah, sometimes it happened on specific workers. I used faster-whisper to load them, and nothing was failing. But that still doesn't explain how I got more than 100s of delay.
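As an aside, a common pattern with faster-whisper on serverless workers is to load the model once at module scope, outside the request handler, so the 2-5 s load cost is paid during the worker's cold start rather than on every request. A minimal sketch, assuming the weights were baked into the image at /models/whisper and the RunPod Python SDK handler pattern is used; the input schema is hypothetical.

```python
import runpod
from faster_whisper import WhisperModel

# Loaded once per worker at import time, so each request only pays inference cost.
# /models/whisper is a placeholder for wherever the weights were baked into the image.
model = WhisperModel("/models/whisper", device="cuda", compute_type="float16")

def handler(job):
    audio_path = job["input"]["audio"]  # hypothetical input schema
    segments, _info = model.transcribe(audio_path)
    return {"text": " ".join(segment.text for segment in segments)}

runpod.serverless.start({"handler": handler})
```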
nerdylive (2mo ago)
Maybe ask support; I hope they can check.
Rodka (2mo ago)
@nerdylive I noticed that sometimes a worker takes too much time to completely set up a Docker image, and sometimes the worker that is "downloading" the image is shown as "idle" instead of "initializing". I think this is a bug. What can happen in this case is that a request gets allocated to this bugged worker, and I believe this is why the delay time is sometimes huge. Would using a Network Volume solve this problem?

Note: I already download the models when building the Docker image, so they're already cached. The problem is when a new worker is started and it needs to pull the Docker image. My image is 8 GiB total, so it's not that big, but the download takes too much time through RunPod. Or is the Network Volume completely unrelated in this case?
nerdylive (2mo ago)
Any endpoint id / worker id support can check on? When it's downloading and shows as "idle", yes, it shouldn't be like that. Does that happen often?
Rodka (2mo ago)
I passed both to support yesterday:
request id: sync-2fbf700d-b754-44d2-8df2-9ac9fb536005-u1
worker id: l8q3x9g7a1prqj
While I'm not 100% sure this is what happened (since I did not note down the exact worker id), I noticed in the log that the worker with a "running" status was downloading the Docker image. But after the worker executed the request, the previous log disappeared.
nerdylive (2mo ago)
Network volume is unrelated here, yep.
Rodka (2mo ago)
Thanks thanks
nerdylive (2mo ago)
Oh, just give the endpoint id in the ticket then.
Rodka (2mo ago)
Oh yeah, did it already.
nerdylive (2mo ago)
Yeah, hopefully they can check that, alright.
Rodka (2mo ago)
@nerdylive the problem I mentioned is actually happening right now
[Attached screenshot: a worker shown as "idle" while still downloading the image]
Rodka (2mo ago)
I don't believe this worker should be considered idle
nerdylive (2mo ago)
What if you refresh it? There's a refresh button in the top right corner. And what changes did you make to the endpoint that caused that condition, or is it like that when you create a new one?
Rodka (2mo ago)
I'm using the same endpoint; I just terminated the other workers as a test. I did refresh the page, but with F5.
nerdylive (2mo ago)
Hmm, and it's still idle after you refresh it? I mean, the same endpoint shouldn't be downloading the image again, right? What changes did you make, or did you just create that endpoint?
Rodka (2mo ago)
I mean, now it's okay, since it's been some time and everything has downloaded. I had enabled the network volume before, thinking it could be a solution. Then I disabled it and terminated all the workers to get new ones on the "latest version" of the endpoint. Some workers already had the Docker image cached (probably because I had used them before), but the ones that didn't needed to download it. And I see this all the time: different workers downloading the image, even in the same endpoint. I thought it was standard behavior.
nerdylive (2mo ago)
You don't need to terminate it, btw. You might send this in the ticket too, mentioning what changes you made to the endpoint to get into that state.
