Loading models from network volume cache is taking too long.
Hello all,
I'm loading my model like the following so that I can use the cache from my network volume.
import os
from transformers import AutoModel

# /runpod-volume is where the network volume is mounted on serverless workers
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
Recently, loading the model has started taking a really long time. Originally it took 3-4 seconds; now I'm seeing around 40 secs during the daytime. How can I resolve this?
I'm using US-OR-1 for my network volume
14 Replies
How big is the model?
I think the 3-4 secs is with FlashBoot and the 40 secs is just the first-time load / cold start.
They are small (2GB, 1GB)
I have FlashBoot on. Does that mean all my workers should be flashbooted when they cold start?
And I only log the time it takes to load the models. I'm not sure whether cold start or FlashBoot affects it.
I'm seeing a lot of 30~40 sec latency recently. Please let me know if there's a way to optimize this!
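To be concrete, the number I'm logging is just the wall clock around the load call, roughly like this (simplified sketch of my logging; print stands in for the real logger):

import os
import time
from transformers import AutoModel

# Time only the model load; this is the number that swings between ~3 and ~40 secs.
start = time.perf_counter()
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
print(f"model load took {time.perf_counter() - start:.1f}s")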
Yes it does... wait, I'm not sure if it does, but maybe it should.
even after subsequent requests?
Assume these are all cold starts. I'm still seeing varying latency (from 3~40 secs), and I think it's a network volume issue.
Might be, but who knows... If you want to ask support, try creating a ticket on the website.
Also, if you want to test lower latency, try active workers.
Yeah. That could be an option too. But if anyone knows how to fundamentally resolve this issue, pls lmk!
Loading from network storage is inherently slow and the larger your model, the more you will be affected by slow loading times.
Yeah, and those don't seem like big models.
Hey thisisfine, try using active workers and send some requests, then look at your time log for model loading.
Yeah, with an active instance it's much faster, taking only 3~7 secs. So I think it's a mix of cold start + establishing a new connection with the network volume + etc.?
I've been experiencing something similar. When I started building out serverless workers, the numbers indicated it was less expensive to use a network volume wherever possible, so I did that. But I'm starting to think it's not worth the savings. It seems that with FlashBoot you're better off just building models into your image, especially with small models.
Yeah, it's definitely better to bake your models into your image wherever possible, but unfortunately for LLMs, models can be extremely large.
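For small models like these, baking them in is basically just pre-downloading at build time, something like this (a sketch, assuming huggingface_hub; the script name and paths are illustrative, and you'd run it from a RUN line in your Dockerfile):

# download_model.py -- run once at image build time (e.g. RUN python download_model.py)
# so the weights ship inside the image instead of being read off the network volume.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],  # same model id you pass to from_pretrained
    local_dir="/app/models",           # illustrative path baked into the image
)

Then at runtime you point from_pretrained at /app/models with local_files_only=True, and the network volume drops out of the load path entirely.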
Also, last I checked, you could only build vLLM Docker images on a machine with a GPU, which sucks, because that's why most people are using RunPod in the first place.
That should be your inference time; the connection to the network volume takes max like 1 sec, hahah, it should be much less than that on average.
Nope, this is not our inference time. It's only model loading. Our inference time has been consistently fast; it's just the model-loading latency that has been unpredictable.
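For context, the split looks roughly like this (simplified sketch, not my exact handler; run_inference is a placeholder): the load happens once at module import, i.e. on cold start, and the handler only times inference.

import os
import time

import runpod
from transformers import AutoModel

# Paid once per cold start -- this is the 3~40 sec number I'm talking about.
t0 = time.perf_counter()
model = AutoModel.from_pretrained(
    os.getenv("MODEL_NAME"),
    cache_dir="/runpod-volume/models",
    local_files_only=True,
)
print(f"model load: {time.perf_counter() - t0:.1f}s")

def handler(event):
    # Inference only -- this part has stayed consistently fast.
    t0 = time.perf_counter()
    output = run_inference(model, event["input"])  # run_inference is a placeholder
    print(f"inference: {(time.perf_counter() - t0) * 1000:.0f}ms")
    return output

runpod.serverless.start({"handler": handler})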
Oh, what is your inference time then?
Like in milliseconds?
Then I'd suggest waiting on your support ticket; I'm not really sure what could be causing that latency.