Loading models from network volume cache is taking too long.
Hello all,
I'm loading my model like the following so that I can use the cache from my network volume.
import os
from transformers import AutoModel

# /runpod-volume is where the network volume is mounted on serverless workers
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
Recently, loading the model has started taking a really long time. Originally it took 3-4 seconds; now I'm seeing around 40 secs during the daytime. How can I resolve this?
I'm using US-OR-1 for my network volume
14 Replies
How big is the model?
I think the 3-4 secs is with FlashBoot and the 40 secs is just the first-time load / cold start.
They are small (2GB, 1GB)
I have FlashBoot on. Does that mean all my workers should be flashbooted when they cold start?
And I only log the time it takes to load the models. I'm not sure whether cold start or FlashBoot affects it.
I'm seeing a lot of 30~40 sec latency recently. Please let me know if there's a way to optimize this!
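To be concrete, the number I'm logging is just the wall clock around the load call, roughly like this (simplified sketch of my logging; print stands in for the real logger):

import os
import time
from transformers import AutoModel

# Time only the model load; this is the number that swings between ~3 and ~40 secs.
start = time.perf_counter()
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
print(f"model load took {time.perf_counter() - start:.1f}s")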
Yes it does... wait, I'm not sure if it does, but maybe it should.
even after subsequent requests?
Assume these are all cold starts. I'm still seeing varying latency (from 3~40 secs), and I think it's a network volume issue.
Might be, but who knows... If you want to ask support, try creating a ticket on the website.
Also, if you want to test lower latency, try active workers.
Yeah. That could be an option too. But if anyone knows how to fundamentally resolve this issue, pls lmk!
Loading from network storage is inherently slow and the larger your model, the more you will be affected by slow loading times.
Yeah, and those don't seem like big models.
Hey thisisfine, try using active workers and send some requests, then look at your time log for model loading.
Yeah, with an active instance it's much faster, taking only 3~7 secs. So I think it's a mix of cold start + establishing a new connection with the network volume + etc.?
I've been experiencing something similar. When I started building out serverless workers, the numbers indicated it was less expensive to use a network volume wherever possible, so I did that. But I'm starting to think it's not worth the savings. It seems that with FlashBoot you're better off just building models into your image, especially with small models.
Yeah, it's definitely better to bake your models into your image wherever possible, but unfortunately for LLMs, models can be extremely large.
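For small models like these, baking them in is basically just pre-downloading at build time, something like this (a sketch, assuming huggingface_hub; the script name and paths are illustrative, and you'd run it from a RUN line in your Dockerfile):

# download_model.py -- run once at image build time (e.g. RUN python download_model.py)
# so the weights ship inside the image instead of being read off the network volume.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],  # same model id you pass to from_pretrained
    local_dir="/app/models",           # illustrative path baked into the image
)

Then at runtime you point from_pretrained at /app/models with local_files_only=True, and the network volume drops out of the load path entirely.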
Also, last I checked, you could only build vLLM Docker images on a machine with a GPU, which sucks, because that's why most people are using RunPod in the first place.
That should be your inference time; the connection to the network volume takes max like 1 sec, hahah, it should be much less than that on average.
Nope, this is not our inference time. It's only model loading. Our inference time has been consistently fast; it's just the model-loading latency that has been unpredictable.
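For context, the split looks roughly like this (simplified sketch, not my exact handler; run_inference is a placeholder): the load happens once at module import, i.e. on cold start, and the handler only times inference.

import os
import time

import runpod
from transformers import AutoModel

# Paid once per cold start -- this is the 3~40 sec number I'm talking about.
t0 = time.perf_counter()
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
print(f"model load: {time.perf_counter() - t0:.1f}s")

def handler(event):
    # Inference only -- this part has stayed consistently fast.
    t0 = time.perf_counter()
    output = run_inference(model, event["input"])  # run_inference is a placeholder
    print(f"inference: {(time.perf_counter() - t0) * 1000:.0f}ms")
    return output

runpod.serverless.start({"handler": handler})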
Oh, what is your inference time then?
Like in milliseconds?
Then I'd suggest waiting on your support ticket; I'm not really sure what could be causing that latency.