Serverless Inference

Ben · 12mo ago

Hi, I have been using RunPod to train my model, and I'm very interested in using serverless computing to deploy it. I have successfully created a Docker image that loads the model and contains an inference endpoint function. However, the model is rather large, and I'm curious whether there is a way to hold the model in RAM so it doesn't have to be reloaded every time the container is stopped and restarted. If not, could anyone recommend another resource for model deployment? Is a traditional server a better option here?
3 Replies
ashleyk · 12mo ago
You can't avoid loading it unless you use active workers, which are very expensive. FlashBoot helps a bit, but only with a constant flow of requests.
Ben (OP) · 12mo ago
Gotcha, thanks
justin · 12mo ago
A traditional server can do better for this, but obviously you pay for the uptime in between requests. Cold starts are just inherently a problem with setups like this.

Your best bet, if you have a fairly regular usage pattern, is to prewarm your workers ahead of time by sending empty requests (there's a sketch of this at the end of this reply), or to find a balance by setting your serverless workers to idle for a while after each request, so a worker stays ready to pick up the next request and you only eat the cold start on the first one. But you need to weigh your cold start time against the average gap between requests, since an unnecessarily long idle window just means paying for idle uptime. For example, if a cold start costs you ~30 seconds and requests typically arrive a minute or two apart, a short idle window is probably worth it; if requests are hours apart, it probably isn't.

The other thing is that in your code you should be doing:
import runpod

# Load the model once at module import time, outside the handler,
# so the cost is paid during worker start-up rather than per request.
model = load_model()  # load_model stands in for your own loading code

def handler(event):
    # model is already in memory here; just run inference
    ...

runpod.serverless.start({"handler": handler})
This way, loading the model is part of the worker's start-up / cold start delay rather than its execution time. If you instead loaded the model inside the handler, the function scope would end when the handler returns, throwing the model away even if your worker keeps idling after the job, ready to pick up the next request immediately.
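As a rough sketch of the prewarming idea above: a small script that periodically sends an empty request to your endpoint to keep a worker warm. This assumes RunPod's serverless HTTP API (POST to /v2/<endpoint_id>/run with a bearer token); the endpoint ID, API key, interval, and the "warmup" input field are placeholders you'd adapt, and your handler would need to treat that input as a quick no-op.

import os
import time

import requests

# Placeholders: set these for your own endpoint.
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
API_KEY = os.environ["RUNPOD_API_KEY"]
WARM_INTERVAL_S = 240  # tune against your idle timeout and cold start time

URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

while True:
    # An empty/warmup input that the handler should recognise and
    # return from quickly, without running real inference.
    resp = requests.post(URL, headers=HEADERS, json={"input": {"warmup": True}})
    print("warmup request:", resp.status_code)
    time.sleep(WARM_INTERVAL_S)

Whether this is worth running is the same trade-off as the idle timeout: you're paying for warm time to avoid cold starts, so it only makes sense if requests arrive often enough to justify the cost.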
