How to load a model into memory before the first run of a pod?
In the template worker's handler file, it is written:
"I am loading my model here."
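Roughly, my handler follows the standard RunPod Python SDK pattern; here is a simplified sketch with the actual model load replaced by a placeholder:

```python
import time
import runpod

def load_my_model():
    # Placeholder for the real load (reading weights into VRAM, ~10 s).
    time.sleep(10)
    return lambda x: x

# Module scope: "I am loading my model here".
# This runs once, when the worker process starts.
model = load_my_model()

def handler(job):
    # Per-job work; the model should already be in memory at this point.
    return model(job["input"])

runpod.serverless.start({"handler": handler})
```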
But when a new pod starts in my endpoint, its first run systematically takes more than 10 s because it is loading the model.
This results in some requests taking more than 10x longer than the expected latency.
Is there a way to load the model as soon as the new pod is "active"?
Thanks.
Enable FlashBoot, but it's only effective if you have a constant flow of requests.
By the way, serverless and pods are two completely different things; there are no pods in serverless, only workers.
What is FlashBoot doing? Is it running this part ahead of time?
Why does it not run it when my flow of requests is not constant?
Because workers are shared between customers.
You can also set active workers, but they run constantly and are pretty expensive.
The part you're asking about is for loading the model into VRAM. Say you have an SD model: it gets loaded on first boot, and after a job is done the model is kept in VRAM so it does not need to be loaded again. This mostly applies to active workers; a normal worker goes down after the job is done.
Then how do you explain that the first request hitting the worker takes much more time than the next ones, even after the worker has been down for some time?
What I would expect is that on the boot of the worker:
- image is loaded
- first part of the handler runs (loading my model)
So then when a request hits the worker for the first time, it will be as quick as the next ones.
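One way to check where the time actually goes (a sketch, assuming the standard RunPod Python SDK; all names here are illustrative) is to log a timestamp at module import and another one per job:

```python
import time
import runpod

BOOT_TS = time.time()
print("[boot] handler module imported")

def load_my_model():
    time.sleep(10)  # stand-in for the real model load
    return lambda x: x

model = load_my_model()
print(f"[boot] model ready {time.time() - BOOT_TS:.1f}s after import")

def handler(job):
    # If the first job lands right after "[boot] model ready", the worker
    # was cold-started by that very request, so the request pays the load.
    print(f"[job] received {time.time() - BOOT_TS:.1f}s after import")
    return model(job["input"])

runpod.serverless.start({"handler": handler})
```

If the first job shows up only seconds after the boot lines, the worker was started on demand because of that request, which would explain the latency.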
FlashBoot, as I said.
FlashBoot is also not guaranteed, as I said; it depends on your flow of requests.