Is there a way to speed up reading from external disks (network volumes)?
Is there a way to speed up reading from an external disk? The network volume is a bit slow. Or are there any plans for something along those lines?
I need to load a 6.4 GB model from the external disk, but it takes 7 times longer than loading it from the container volume.
How are you loading things from the network volume?
Other threads say to expect network loads to take longer, so bake it into the image if you can.
Too many models, more than 100 GB
Create a network volume and set the mount directory in the template.
How do you do that?
I agree with @Charixfox; if you want the fastest possible responses from your serverless worker, you should bake the model into the image. I routinely run 80+ GB images and have seen other people running 300+ GB images. A network volume will always be slower than baking the model in.
You cannot set a mount point for your network volume with a serverless worker. That is only allowed with Pods. Under serverless your network volume will mount at /runpod-volume, so what you need to do is create symbolic links in your image that map the storage paths your code needs onto the network volume at /runpod-volume.
How do I do that?
It varies depending on the image you are using. Here is an example where I created a symbolic link for the GFPGAN model:
This links /app/gfpgan/weights to /runpod-volume/gfpgan (in my network volume), and now anything read from or written to /app/gfpgan/weights happens on the network volume.
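For anyone following along, a minimal sketch of that idea in Python, assuming the link is created when the worker starts; the paths are illustrative, not necessarily the exact ones from the example above:

```python
import os

# Serverless workers mount the network volume at /runpod-volume.
NETWORK_DIR = "/runpod-volume/gfpgan"   # where the weights actually live
LOCAL_DIR = "/app/gfpgan/weights"       # where the application expects them

os.makedirs(NETWORK_DIR, exist_ok=True)
os.makedirs(os.path.dirname(LOCAL_DIR), exist_ok=True)

if not os.path.islink(LOCAL_DIR):
    # Remove an empty placeholder directory baked into the image, if present.
    if os.path.isdir(LOCAL_DIR) and not os.listdir(LOCAL_DIR):
        os.rmdir(LOCAL_DIR)
    os.symlink(NETWORK_DIR, LOCAL_DIR)
```

The same link can also be created in the Dockerfile with `ln -s`; it will dangle at build time and resolve once the volume is mounted at runtime.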
I am using the runpod one
What is the "runpod" one?
FROM runpod/pytorch:3.10-2.0.0-117
If you are using a pre-built image it will depend on how the author coded it. It will either take advantage of your network volume or it will not. Suggest you research which image might best work for you before picking one. If you are more technical you can create a custom image and configure it however you like.
Thank you very much, I will try it
By the way, what image registry do you use?
I use Docker Hub. For $11 a month you can get unlimited private repos. Everywhere else I looked had caps on image upload size.
Do you use cloud build?
Sorry, I am not sure what "cloud build" is, I guess I am not using it LOL.
I have the Pro account on Docker Hub so I do not get access to Build Cloud.
Sorry, I was talking about Docker Build Cloud. I'm using a very large image now, and build, push, and pull are all very slow. Do you have any experience with that?
I am just starting to experience this... My latest build is 81.5GB and pushing to repo is taking a long time. Since my plan is to keep baking in more and more models until it breaks I should look into build cloud. I've heard of people running 300+ GB images. What size are your images?
Good luck, I'll talk to you when I find a good method.
What machine do you even use to build a 300 GB Docker image?
I have a similar issue and considered doing this locally for the models I have on my local machine, which would create a 300 GB Docker image, but each push would eat up my residential data bandwidth for the month
I haven't actually tried to build a 300GB image yet. I also have not run into your bandwidth limitations, but pushing large images does take a very long time. I haven't found a solution to building big images as of yet. I may try Build Cloud on Docker Hub.
Are you using LoRAs? What's your use case for so many models?
I am building an AI marketplace where users can come and run a variety of models. I asked RunPod to increase my max workers and they did up it to 35. It was during this time that RunPod suggested that rather than have an endpoint for each model, I should try to combine multiple models into the same endpoint. The more models I can add to the same endpoint, the more options my marketplace can have. As a test, I currently have an image using ComfyUI with the Flux Schnell, Flux dev, SD3, and SDXL models loaded.
Although, I have already discovered one issue with this method. Since the endpoint has to decide which model to load based upon the JSON input, I cannot pre-load the model beforehand, which adds 30-40 seconds to each request.
Seems like no matter how I try to proceed I keep running into roadblocks. I am forced to have either a small number of fast-responding models or a lot of slower-responding models. What do you suggest I do?
That is a tough problem to solve with how limited VRAM is. Ideally it's a small number of endpoints, or a single endpoint. Possibly in the future we may allow some type of routing of jobs to the worker you want.
Yea, pretty much the same use case here, but with even less user control. I'm building a personalized portraits app and abstracting away all the prompts and underlying LoRA/base models. And I'm using about 300 GB worth of models (combined LoRAs, SD base models, textual inversions/embeddings, VAEs, ESRGAN, ControlNet, and Real-ESRGAN), so creating a giant Docker image isn't really feasible.
My current solution is a 350 GB network volume that's attached, and the most commonly used models like BLIP, ONNX, the interrogator, and ControlNet are built into the Docker image. Everything else that's "prompt-related" (specific LoRAs, base models, and embeddings) is what I store in the RunPod network volume.
I haven't fully tested swapping out different models yet though; I'm currently reworking another serverless endpoint to support both single-image inference and batch inference jobs.
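For what it's worth, the baked-in vs network-volume split above can be handled with a small path helper; a rough sketch, where the directory names are hypothetical:

```python
import os

# Hypothetical layout: frequently used models are baked into the image,
# everything prompt-specific lives on the attached network volume.
BAKED_DIR = "/app/models"              # inside the Docker image
VOLUME_DIR = "/runpod-volume/models"   # on the network volume

def resolve_model_path(filename: str) -> str:
    """Prefer the copy baked into the image; fall back to the network volume."""
    baked = os.path.join(BAKED_DIR, filename)
    return baked if os.path.exists(baked) else os.path.join(VOLUME_DIR, filename)
```

That keeps the hot models on fast local disk, while the long tail only pays the network-volume read cost when a request actually needs it.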
And what I mean by residential bandwidth limit is this https://www.xfinity.com/learn/internet-service/data#:~:text=Get%20the%20speed%20you%20need%20at%20a%20great%20price.&text=Think%20you%20might%20need%20more,Unlimited%20Data%20options%20are%20available.&text=Customers%20who%20use%20more%20than,billed%20for%20exceeding%20the%20limit.
I'm based in the Bay Area and Xfinity is my ISP. There was a month where I downloaded models locally and then uploaded them back into GCP Cloud Storage. This ate up my bandwidth for the month and basically bricked my wifi
The LoRAs I use alone already take up 60 GB 😅
I also have about 100 other base models (most are between 3-6 GB each)
In your case, it might actually make sense to programmatically build your Docker images so that you end up with per-base-model Docker images
I think this is how Replicate was able to have cheap and fast inference time at the model level
Then, in your front end, you would basically map each model request to its respective model endpoint and won't need to wait for startup time
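A rough sketch of that routing layer, assuming one serverless endpoint per model and RunPod's standard `/run` HTTP API; the endpoint IDs, the env var name, and the shape of the `input` payload are placeholders that would depend on your handlers:

```python
import os
import requests

# Hypothetical mapping: each supported model has its own endpoint whose image
# has only that model baked in.
MODEL_ENDPOINTS = {
    "sd15": "ENDPOINT_ID_FOR_SD15",          # placeholder IDs
    "revanimated": "ENDPOINT_ID_FOR_REVA",
}

def submit_job(model: str, prompt: str) -> dict:
    """Route the request to the endpoint whose image has `model` baked in."""
    endpoint_id = MODEL_ENDPOINTS[model]
    resp = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/run",
        headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
        json={"input": {"prompt": prompt}},  # payload shape depends on the handler
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains a job id you can poll via /status
```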
I'm not sure how building per-base-model Docker images dynamically is going to allow my requests to complete faster? Wouldn't that just add more delay to the response to a request?
The problem I face now is that traditionally I can load the model in the global scope, and then for each subsequent request (using FlashBoot or active workers) I don't have to load it again and I can respond to queries in just a few seconds. Instead, by trying to use multiple models in a single endpoint, I am forced to select a specific model (of several) in the request. Because of this I have to load the model for each request, and it is taking ~60 seconds per request. So instead of paying $0.00057 per image, I am paying 20 times as much ($0.0114) and taking 20 times longer to process the request. I do not think people will pay me to run models with such a slow response time, nor do I think people will pay for a small 5- or 6-model selection.
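For context, here is roughly the pattern being described, as a sketch rather than the actual handler (the model ID and return shape are placeholders):

```python
import runpod
import torch
from diffusers import DiffusionPipeline

# Single-model pattern: the pipeline loads once in the global scope when the
# worker starts, so warm requests (FlashBoot / active workers) skip the load.
MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"  # placeholder
PIPELINE = DiffusionPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# Multi-model pattern (the slow case): the model name only arrives in
# job["input"], so from_pretrained() would have to run inside handler() on
# every request -- that per-request load is the extra ~60 seconds.

def handler(job):
    prompt = job["input"]["prompt"]
    image = PIPELINE(prompt).images[0]
    # ... encode/upload the image and return its URL or base64 here ...
    return {"status": "completed"}

runpod.serverless.start({"handler": handler})
```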
I am starting to worry that maybe RunPod is incompatible with the type of service I am trying to develop. This really makes me sad 😦 as I put a lot of time and effort into building the software for it and I really like the community here.
Oh, what I meant was: say you support SD 1.5 and ReV Animated on your service.
Instead of having both the SD 1.5 and ReV Animated models in the same Docker image for a single RunPod endpoint…
You have a dedicated RunPod endpoint for SD 1.5, with a Docker image that has only SD 1.5 baked in
And another separate RunPod endpoint for ReV Animated, with a different Docker image that has only ReV Animated baked in
And you'd do this for each model you support. And since each endpoint has its own workers (but you're not charged until there's an active request), you'd take advantage of FlashBoot when requests come in while minimizing model loading time
Generally speaking though, how might this be different from Replicate?
Yeah, that is how I was doing it, but I have 35 max workers. I figure each endpoint that I build would need at least 3 max workers. So that means I can have only 11 endpoints. 11 x 3 = 33.
I was just advised to try building the image using a diffusers pipeline. Supposedly it can pre-load multiple models in the global scope.
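Presumably that suggestion looks something like this: a sketch that assumes the models can coexist via CPU offload, with placeholder model IDs:

```python
import runpod
import torch
from diffusers import DiffusionPipeline

# Pre-load every supported model in the global scope so warm requests only
# have to pick a pipeline, not load one.  Model IDs are placeholders.
MODEL_IDS = {
    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
    "sd3": "stabilityai/stable-diffusion-3-medium-diffusers",
}

PIPELINES = {}
for name, model_id in MODEL_IDS.items():
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    # Keep weights in system RAM and move each component to the GPU only while
    # it is needed, so several pipelines can coexist without exhausting VRAM.
    pipe.enable_model_cpu_offload()
    PIPELINES[name] = pipe

def handler(job):
    model = job["input"].get("model", "sdxl")
    prompt = job["input"]["prompt"]
    image = PIPELINES[model](prompt).images[0]
    # ... encode/upload the image here ...
    return {"model": model, "status": "completed"}

runpod.serverless.start({"handler": handler})
```

CPU offload trades VRAM for a few seconds of transfer time per request, which is still much cheaper than a full per-request load, but whether it fits depends on the GPU and how many models you keep resident.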
I may have to run everything on an 80 GB GPU lol
That might work. But then some Stable Diffusion models aren't compatible with diffusers directly, and you might run out of VRAM lol
I'm waiting for Google to develop their TPU Stable Diffusion ecosystem. Supposedly inference is a lot faster and cheaper, but most of the open-source projects (like ControlNet, ReActor, ADetailer, etc.) that are super useful for improving quality would need to migrate over, which is a huge project. It's literally moving from PyTorch to TensorFlow, which means pretty much rebuilding everything from scratch
Maybe Claude Sonnet or Gemini could actually do it soon though
Yeah... I heard that PyTorch is moving towards working with non-CUDA hardware. If that happens it could really hurt NVIDIA.
If multiple GPUs in RunPod can be connected using NVLink, this will not be a problem.
Docker Build Cloud has a 200 GB disk size limit, as I experienced today.
Godly
Are you pretty glued to Docker Build Cloud, or are you open to GCP Artifact Registry?
I use it for non-API / typical backend Docker images, and apparently the limit is 5 TB https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling
Docker Build Cloud reports that the disk limit is exceeded when the image is too large; it's not a Docker Hub limit
Is this due to your model being embedded in the container image?
Yes, this is what we want to try.