Slow network volume
Some people have reported that loading models from network volumes is very slow compared to baking the model into the image itself.
@Encyrption would you mind sharing your experience / tests on this topic again?
@briefPeach would you mind sharing your experience / tests on this topic?
I ran identical payloads on identical images, with the only difference being that one used a network volume and the other had the models baked into the image. While I saw no discernible difference in executionTime, I consistently saw an additional 30-60 seconds of delayTime when using the network volume. I only tested this in EU-RO.
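For anyone who wants to reproduce this kind of comparison, here is a rough sketch of how you might collect those two numbers per request. It assumes the serverless status response exposes delayTime and executionTime; the endpoint ID and payload are placeholders:

```python
import os
import time

import requests

# Placeholder endpoint ID and payload; set RUNPOD_API_KEY in your environment.
ENDPOINT_ID = "your-endpoint-id"
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

def run_and_time(payload: dict) -> dict:
    """Submit one job and return the timing fields reported by the status endpoint."""
    job = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS).json()
    while True:
        status = requests.get(f"{BASE}/status/{job['id']}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED"):
            # delayTime covers queueing + cold start, executionTime the handler itself (ms).
            return {k: status.get(k) for k in ("status", "delayTime", "executionTime")}
        time.sleep(2)

print(run_and_time({"prompt": "hello"}))
```

Running the same payload against a network-volume endpoint and a baked-in endpoint and comparing the averages should show the same gap.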
And all of this was happening a month ago right?
yes
@NERDDISCO When I tried a network volume with mine (it was EU-RO), it wouldn't leave the queue
@Karlas this sounds strange, not sure if this is related to the network volume. Did it resolve in the end?
Nope, wasn't able to resolve it
I removed the network volume and it was back to working fine
@NERDDISCO How much network volume do you think I need for an 8B model?
Also, it was on EU-SE-1
@Karlas you should be good with around 20 GB, because the sum of all files in https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/tree/main is roughly 18 GB. Maybe that was the issue with your worker: the volume wasn't large enough?
alright
should i try again with the 8b model with a network volume?
Yeah, I would try, because maybe that was why it got stuck. You can totally create situations where something breaks, for example if the storage is not big enough.
So if you have some time and energy, I would appreciate it if you could test this again
and can i test on any region?
Would you mind creating a new post, so we can talk about all the things llama 3.1 8B? I want to keep the info here about the network volumes 🙏
Works now
Not getting stuck in queue
perfect!
But again, I am not trying with a 70B model like I was before, just with an 8B model
When I used 70B I gave the network volume 150 GB
@NERDDISCO Is a network volume supposed to be slower than baking the model into the container image? When it's baked into the container, the model is stored physically on the GPU machine, but with a network volume the model has to be transferred over the network before it can be loaded onto the GPU machine.
Because when I stored models on a network volume, the step of loading the model into GPU VRAM took way longer than when the model was baked into the container
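A quick way to put numbers on that, assuming the volume is mounted at /runpod-volume and the baked-in copy lives under /models (paths and shard names are just examples), is to time how long one safetensors shard takes to load from each location:

```python
import time
from pathlib import Path

from safetensors.torch import load_file

# Example paths: /runpod-volume is where RunPod mounts the network volume on serverless
# workers; /models stands in for a directory baked into the image. Shard names are examples.
CANDIDATES = {
    "network volume": Path("/runpod-volume/llama-3.1-8b/model-00001-of-00004.safetensors"),
    "baked into image": Path("/models/llama-3.1-8b/model-00001-of-00004.safetensors"),
}

for label, path in CANDIDATES.items():
    if not path.exists():
        continue
    start = time.perf_counter()
    tensors = load_file(path)  # reads the full shard into CPU memory
    elapsed = time.perf_counter() - start
    size_gb = path.stat().st_size / 1e9
    print(f"{label}: {len(tensors)} tensors, {size_gb:.1f} GB in {elapsed:.1f}s "
          f"({size_gb / elapsed:.2f} GB/s)")
```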
There shouldn't be significant performance problems when loading the model from the network volume. That's why I'm trying to find people who have problems, so that we can find out what the underlying problem might be.
So if you have any data (which data centers, which models, and when this was tested), I can collect it and present it to our team so we can take a look at this.
I'm confused. Why shouldn't there be a significant difference when loading the model? I'd think local physical storage would be significantly faster than network storage.
Ok once I try again, I’ll give you that data
That is true, but yeah I don’t have actual numbers here. So this is something I want to validate too, so that we can provide better reasoning for the community!
Thank you so much! I will also do some tests next week.
thank you! Yeah some benchmarks would be super helpful for us to choose which one to use in different situations
It's back to being extremely slow over the last several days. The network volume is in EU-RO-1
I have currently unselected all EU-, EUR-, and US-OR regions. These seem to be the regions experiencing issues.
@Encyrption which regions do you recommend?
I don't use a network volume, so I'm not tied to a region. I select global, then unselect any problematic regions... then I can use any others.
once those issues are resolved, I will add those regions back in.
Just tested it out and I'm seeing that US-TX is at least 2x faster
Hopefully this gets resolved soon. I'm using storage to hold a 65GB model on specific GPU hardware and 40+ sec/it to load it is not good at all.
Based on your numbers, you’re getting around 1.6 GB/s. Do you have any specific speed expectations or benchmarks you were aiming for?
That's 40+ s/it across seven segments, so a full load of the model takes at least 280 seconds in those instances, but about 21 seconds in other geographical areas.
Even more oddly, sometimes it will load two segments at 1-4 s/it, then the next at 38 s/it, and then the next three at 50-60 each. It's very inconsistent. The 4+ minute load is a cold start that I'm paying for every second of, and it can happen when the container is destroyed immediately after a run; while the container is doing that cold start, any requests routed to it time out on the client side.
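To separate raw volume read speed from model deserialization, a minimal sequential-read check like this (path is a placeholder) shows whether you are getting roughly the ~3 GB/s implied by the 21-second loads or the ~0.2 GB/s implied by a 280-second load of a 65 GB model:

```python
import time

# Placeholder path: point this at any large file on the volume.
PATH = "/runpod-volume/llama-3.1-8b/model-00001-of-00004.safetensors"
CHUNK = 64 * 1024 * 1024  # 64 MiB reads

read_bytes = 0
start = time.perf_counter()
with open(PATH, "rb", buffering=0) as f:
    while chunk := f.read(CHUNK):
        read_bytes += len(chunk)
elapsed = time.perf_counter() - start
print(f"{read_bytes / 1e9:.1f} GB in {elapsed:.1f}s -> {read_bytes / 1e9 / elapsed:.2f} GB/s")
```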
If you include your model as part of the Docker image, it could help reduce cold start times. Loading the model from the host disk is generally faster and more consistent.
Is that a viable option for such a large model? I was under the impression it only scaled well for smaller models.
How big is your model? I've seen customers with 350 GB+ Docker images and it works for them.
Wow, 350 GB+! I will no longer feel bad about my 40 GB images LOL 😉
Okie, let's try baking the Llama 3 model into a Docker image
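In case it helps, here is a minimal sketch of the kind of download script you could RUN during docker build so the weights end up in the image instead of on a network volume. The repo ID, target directory, and file patterns are examples, and gated repos like Llama require an HF token:

```python
# download_model.py -- run during `docker build` (e.g. RUN python download_model.py)
# so the weights live on the image filesystem instead of a network volume.
# Repo ID, target directory, and file patterns are examples; gated repos like Llama
# require an HF access token passed in as a build secret or env var.
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-8B",
    local_dir="/models/llama-3.1-8b",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],  # skip optional extras
    token=os.environ.get("HF_TOKEN"),
)
```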
The network volumes are very slow. Loading models from them is usually a bad idea. On an A1111 image, the inference time with SDXL is at least doubled due to network disk access.
I agree 💯! My issue has been finding a way to force loading them from disk. I recently had the OpenVoice model refusing to use the baked-in model and trying to download it instead. Is there a way to make it use the baked-in model regardless?
This is specific to the software you're running, so no idea
Local disk is generally faster than network volumes, and working with many small files on a network volume may result in slower speeds—it’s better to compress them. If multiple pods on the same machine use the network volume, they will share the bandwidth.
I split the model layers just fine, but one of the stock layers on the worker-vllm image is just shy of 13GB when built, so I'll be poking at that for a bit.
What does it mean to compress a model when it’s a .pth or .safetensor file?
I mean if you have a bunch of images, it's better to compress them and send them as one archive rather than one file at a time. A model file is usually one big chunk and should be fine.
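As a concrete example of the "compress them first" approach, something like this bundles a folder of small files into one archive before writing it to the volume (paths are placeholders):

```python
import tarfile
from pathlib import Path

# Placeholder directories: a local folder full of small files and the mounted network volume.
SRC = Path("outputs/images")
DEST = Path("/runpod-volume/archives/images.tar.gz")
DEST.parent.mkdir(parents=True, exist_ok=True)

# One compressed archive means one large sequential write instead of thousands of
# small round-trips to the network filesystem.
with tarfile.open(DEST, "w:gz") as tar:
    tar.add(SRC, arcname=SRC.name)
```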
Which model did you use?
How the hell can they bake a 350 GB model into a Docker image? Where do they host their image? Docker Hub?
The biggest I have ever used was 85.8 GB; I host that on Docker Hub. I run many models. I am trying to build out an AI marketplace.
I'm currently running into a resource constraint (max workers) with RunPod, so I may have to work with other providers (e.g., Modal) to add additional models.
I have heard of vast.ai, Fly, and Lambda Labs myself. What would you suggest?
I am still trying to figure that out myself. Not sure any of them scale from 0 like RunPod.
Yep, docker hub
Don't use vast.ai. I recommend looking into Modal, but their syntax is confusing af.
I've found a way... after my PC's Docker engine broke because of some weird WSL issue that I still haven't figured out...
I start a GCP VM with the Deep Learning Linux image and a big boot disk and run docker build there. You get to take advantage of enterprise-grade networking, and builds can be much bigger too.
A 20 GB Docker image takes less than 5 minutes to push to Docker Hub, whereas it would've taken 40 minutes on my residential toaster wifi