RunPod
RunPod•3mo ago
Hello

Offloading multiple models

Hi guys, does anyone have experience with an inference pipeline that uses multiple models? Wondering how best to manage model loading when the models together exceed a worker's VRAM if everything is kept in VRAM at once. Any best practices / examples on how to keep model load time as minimal as possible? Thanks!
2 Replies
nerdylive
nerdylive•3mo ago
Use a bigger GPU, or offload models to RAM in code, based on a trigger you can detect locally in each GPU pod. It's platform/library specific, so you'll have to work it out yourself 🙂
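To make the "offload to RAM in code" idea concrete, here is a minimal sketch of swapping models between VRAM and CPU RAM with PyTorch. It assumes each model fits in VRAM on its own; the ModelSwapper class and its names are illustrative, not a RunPod or PyTorch API.
```python
import torch

class ModelSwapper:
    """Keeps all models resident in CPU RAM and at most one on the GPU."""

    def __init__(self, models: dict[str, torch.nn.Module], device: str = "cuda"):
        # Load everything into CPU RAM up front so a GPU "load" is just a
        # host-to-device transfer, not a disk read.
        self.models = {name: m.to("cpu").eval() for name, m in models.items()}
        self.device = device
        self.active: str | None = None

    def get(self, name: str) -> torch.nn.Module:
        # Move the previously active model back to CPU RAM before pulling the
        # requested one onto the GPU, so only one model occupies VRAM at a time.
        if self.active == name:
            return self.models[name]
        if self.active is not None:
            self.models[self.active].to("cpu")
            torch.cuda.empty_cache()
        self.models[name].to(self.device)
        self.active = name
        return self.models[name]

# Usage inside a handler: call swapper.get("encoder") before that stage runs,
# then swapper.get("decoder") for the next stage of the pipeline.
```
The trade-off is that every swap costs a CPU-to-GPU transfer, so it only pays off when the models genuinely cannot share VRAM.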
yhlong00000
yhlong00000•3mo ago
btw, you can also select multiple GPUs per worker if you need to load large models. Some tips to reduce start time:
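In case it helps picture the multi-GPU option above: a minimal sketch of sharding one large model across all visible GPUs with Hugging Face transformers and accelerate. The model id is a placeholder, and this assumes a transformers-based pipeline, which the thread doesn't specify.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "my-org/large-model"  # hypothetical model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halve the memory footprint
    device_map="auto",          # let accelerate split layers across all GPUs
)
```
With device_map="auto", layers are placed across the GPUs attached to the worker, so a model that doesn't fit on one card can still be served from a single worker.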