RunPod · 2mo ago
Anubhav

Fixed number of Total Workers - Any work around?

Currently our team has a pool of ~150 workers on RunPod serverless, running on RTX A4000/A5000/A6000 GPUs. We have 10 different models deployed across serverless endpoints that we call at inference time. Each model has a different number of active/max workers depending on the load it receives, its position in our pipeline, and the nature of the model. My question: what are the best practices for RunPod serverless? Should we deploy multiple models within the same image and do the routing inside the handler? That would let me create more endpoints within the given worker quota, but with this approach one of my models could completely block requests for the others.
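If you do route inside a single handler, a minimal sketch might look like the following. The model names and the registry are hypothetical placeholders; the only real API assumed is the standard RunPod SDK entrypoint `runpod.serverless.start`, which is left commented out so the dispatch logic stands alone:

```python
# Hypothetical multi-model routing inside one RunPod serverless handler.
# MODEL_REGISTRY maps a model name to a callable; in a real image these
# would be preloaded model objects, not lambdas.
MODEL_REGISTRY = {
    "classifier": lambda payload: {"label": "demo"},
    "embedder": lambda payload: {"vector": [0.0]},
}

def handler(job):
    """Dispatch a job to one of several models based on an 'input.model' field."""
    inp = job.get("input", {})
    model_name = inp.get("model")
    model = MODEL_REGISTRY.get(model_name)
    if model is None:
        return {"error": f"unknown model: {model_name}"}
    return model(inp.get("payload", {}))

# To wire this into RunPod serverless (standard SDK entrypoint):
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

Note the downside you already identified: all models behind this handler share one request queue, so a burst of traffic for one model delays the others.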
1 Reply
yhlong00000 · 2mo ago
There are both benefits and drawbacks to this approach. Benefits: you'll have fewer endpoints, making management easier, and since all traffic is concentrated on fewer endpoints, you can enable active workers to reduce cold start times. Drawbacks: building large images can be challenging, and loading multiple models may require higher-end GPUs; if you have to swap models in and out of GPU VRAM, it will likely hurt speed. I'm not an expert in this area, but it's worth testing. I think combining 2-3 models into one endpoint could be beneficial, but going for something like 10 models in one might introduce unexpected bottlenecks. Feel free to test it and share your results here. Good luck!