Loading models from network volume cache is taking too long.

Hello all, I'm loading my model like the following so that I can use the cache from my network volume:

```python
import os

from transformers import AutoModel

model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",  # cache directory on the network volume
    local_files_only=True,
)
```

Recently, loading models has started taking a really long time. Originally it took 3-4 seconds; now I'm seeing ~40 seconds during the daytime. How can I resolve this issue? I'm using US-OR-1 for my network volume.
nerdylive · 6mo ago
How big is the model? I think 3-4 secs is with FlashBoot, and 40 secs is just first-time loading, like a cold start.
thisisfine (OP) · 6mo ago
They are small (2 GB and 1 GB), and I have FlashBoot on. Does that mean all my workers should be flashbooted when they cold start? I only log the time it takes to load the models, so I'm not sure whether cold start or FlashBoot affects it. I'm seeing a lot of 30-40 sec latency recently. Please let me know if there is a way to optimize this!
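A minimal sketch of isolating the model-load time from the rest of the cold start, reusing the `MODEL_NAME` env var and volume path from the original post; the timer and the log line are illustrative, not the OP's actual logging code:

```python
import os
import time

from transformers import AutoModel

start = time.perf_counter()
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",  # cache lives on the network volume
    local_files_only=True,
)
load_seconds = time.perf_counter() - start

# Logged per request, this separates model loading from container cold start
# and from inference time in the worker's logs.
print(f"model load took {load_seconds:.1f}s")
```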
nerdylive · 6mo ago
Yes it does... wait, I'm not sure if it does, but maybe it should, even after subsequent requests?
thisisfine (OP) · 6mo ago
Assume these are all cold starts. I'm still seeing varying latency (from 3 to 40 secs), and I think it's a network volume issue.
nerdylive · 6mo ago
Might be, but who knows... If you want to ask support, try creating a ticket on the website. Also, if you want to test for lower latency, try active workers.
thisisfine (OP) · 6mo ago
Yeah. That could be an option too. But if anyone knows how to fundamentally resolve this issue, pls lmk!
digigoblin · 6mo ago
Loading from network storage is inherently slow and the larger your model, the more you will be affected by slow loading times.
nerdylive · 6mo ago
Yeah, and those don't seem like big models. Hey thisisfine, try using active workers and sending some requests, then look at your time log for model loading.
thisisfine (OP) · 6mo ago
Yeah, with an active instance it's much faster, taking only 3-7 secs. So I think it's a mix of cold start + establishing a new connection with the network volume + etc.?
Encyrption · 6mo ago
I've been experiencing something similar. When I started building out serverless workers, the numbers indicated it was less expensive to use a network volume whenever possible, so I did that. But I'm starting to think it's not worth the savings. It seems that with FlashBoot you're better off just building models into your image, especially with small models.
digigoblin · 6mo ago
Yeah, it's definitely better to bake your models into your image wherever possible, but unfortunately for LLMs, models can be extremely large. Also, last I checked, you could only build vLLM Docker images on a machine with a GPU, which sucks, because that's why most people are using RunPod in the first place.
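For the 1-2 GB models in this thread, a minimal sketch of baking the weights into the image at build time; the model id and the `/models/my-model` path are placeholders, and the script would be invoked from a step like `RUN python download_model.py` during `docker build`:

```python
# download_model.py -- run during the Docker build so the weights
# ship inside the image instead of living on a network volume.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",  # placeholder, replace with your model id
    local_dir="/models/my-model",                      # placeholder path baked into the image
)
```

At runtime the worker can then load with `AutoModel.from_pretrained("/models/my-model", local_files_only=True)` and never touch the network volume at all.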
nerdylive · 6mo ago
That should be your inference time; the connection with the network volume takes at most like 1 sec. Haha, it should be much less than that on average.
thisisfine (OP) · 6mo ago
Nope, this is not our inference time. It's only for model loading. Our inference time has been consistently fast; it's just that the model loading latency has been unpredictable.
nerdylive · 6mo ago
Oh, what is your inference time then? Like in milliseconds? Then I'd suggest waiting on your support ticket; I'm not really sure what could be causing that latency.