How to load a model into memory before the first run of a pod?
In the template worker's handler file, it is written:
"I am loading my model here."
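Roughly, my handler follows the standard RunPod Python SDK pattern; here is a simplified sketch with the actual model load replaced by a placeholder:

```python
import time
import runpod

def load_my_model():
    # Placeholder for the real load (reading weights into VRAM, ~10 s).
    time.sleep(10)
    return lambda x: x

# Module scope: "I am loading my model here".
# This runs once, when the worker process starts.
model = load_my_model()

def handler(job):
    # Per-job work; the model should already be in memory at this point.
    return model(job["input"])

runpod.serverless.start({"handler": handler})
```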
But when a new pod starts in my endpoint, its first run systematically takes more than 10 s because it is loading the model.
This results in some requests taking more than 10x longer than the expected latency.
Is there a way to load the model as soon as the new pod is "active"?
Thanks.
Enable FlashBoot, but it's only effective if you have a constant flow of requests.
By the way, serverless and pods are two completely different things; there are no pods in serverless, only workers.
What is FlashBoot doing? Is it running this part ahead of time?
Why does it not run it when my flow of requests is not constant?
Because workers are shared between customers.
You can also set active workers, but they run constantly and are pretty expensive.
The part you're asking about is for loading the model into VRAM. Say you have an SD model: it gets loaded on first boot, and after a job is done the model is kept in VRAM so it does not need to be loaded again. This mostly applies to active workers; a normal worker goes down after the job is done.
Then how do you explain that the first request hitting the worker takes much more time than the next ones, even after the worker has been down for some time?
What I would expect is that on the boot of the worker:
- image is loaded
- first part of the handler runs (loading my model)
So then when a request hits the worker for the first time, it will be as quick as the next ones.
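One way to check where the time actually goes (a sketch, assuming the standard RunPod Python SDK; all names here are illustrative) is to log a timestamp at module import and another one per job:

```python
import time
import runpod

BOOT_TS = time.time()
print("[boot] handler module imported")

def load_my_model():
    time.sleep(10)  # stand-in for the real model load
    return lambda x: x

model = load_my_model()
print(f"[boot] model ready {time.time() - BOOT_TS:.1f}s after import")

def handler(job):
    # If the first job lands right after "[boot] model ready", the worker
    # was cold-started by that very request, so the request pays the load.
    print(f"[job] received {time.time() - BOOT_TS:.1f}s after import")
    return model(job["input"])

runpod.serverless.start({"handler": handler})
```

If the first job shows up only seconds after the boot lines, the worker was started on demand because of that request, which would explain the latency.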
FlashBoot, as I said.
FlashBoot is also not guaranteed, as I said; it depends on your flow of requests.