RunPod
RunPod•3mo ago
Hello

Offloading multiple models

Hi guys, does anyone have experience with an inference pipeline that uses multiple models? Wondering how best to manage model loading when the models together exceed a worker's VRAM if everything is kept in VRAM at once. Any best practices / examples on how to keep model load time as minimal as possible? Thanks!
2 Replies
nerdylive
nerdylive•3mo ago
Use a bigger GPU, or offload models to RAM in code, based on a trigger you can detect locally in each GPU pod. It's platform/library specific, so you'll have to work it out yourself 🙂
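To make the "offload to RAM in code" idea concrete, here is a minimal sketch of swapping models between VRAM and CPU RAM with PyTorch. It assumes each model fits in VRAM on its own; the ModelSwapper class and its names are illustrative, not a RunPod or PyTorch API.
```python
import torch

class ModelSwapper:
    """Keeps all models resident in CPU RAM and at most one on the GPU."""

    def __init__(self, models: dict[str, torch.nn.Module], device: str = "cuda"):
        # Load everything into CPU RAM up front so a GPU "load" is just a
        # host-to-device transfer, not a disk read.
        self.models = {name: m.to("cpu").eval() for name, m in models.items()}
        self.device = device
        self.active: str | None = None

    def get(self, name: str) -> torch.nn.Module:
        # Move the previously active model back to CPU RAM before pulling the
        # requested one onto the GPU, so only one model occupies VRAM at a time.
        if self.active == name:
            return self.models[name]
        if self.active is not None:
            self.models[self.active].to("cpu")
            torch.cuda.empty_cache()
        self.models[name].to(self.device)
        self.active = name
        return self.models[name]

# Usage inside a handler: call swapper.get("encoder") before that stage runs,
# then swapper.get("decoder") for the next stage of the pipeline.
```
The trade-off is that every swap costs a CPU-to-GPU transfer, so it only pays off when the models genuinely cannot share VRAM.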
yhlong00000
yhlong00000•3mo ago
btw, you can also select multiple GPUs per worker if you need to load large models. Some tips to reduce start time:
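In case it helps picture the multi-GPU option above: a minimal sketch of sharding one large model across all visible GPUs with Hugging Face transformers and accelerate. The model id is a placeholder, and this assumes a transformers-based pipeline, which the thread doesn't specify.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "my-org/large-model"  # hypothetical model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halve the memory footprint
    device_map="auto",          # let accelerate split layers across all GPUs
)
```
With device_map="auto", layers are placed across the GPUs attached to the worker, so a model that doesn't fit on one card can still be served from a single worker.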