Created by Riley on 3/27/2024 in #⚡|serverless
Serverless worker loading with stable diffusion pipeline
Hello, I am trying to create a serverless endpoint with a Stable Diffusion pipeline from the Diffusers library. I used the https://github.com/runpod-workers/worker-sdxl repository as a template to cache the model so it never has to be re-downloaded from Hugging Face after the Docker image is built. However, whenever a request is sent to a newly initialized idle worker (i.e. one that hasn't processed any requests yet), it can take up to a minute for the pipeline to load even though the model is cached. Below are the last few lines of what this looks like in the logs while the pipeline loads, in case that helps clarify what I mean:
97%|█████████████████████████████████████▉ | 865M/890M [00:13<00:00, 82.9MiB/s]
98%|██████████████████████████████████████▎| 873M/890M [00:13<00:00, 82.1MiB/s]
99%|██████████████████████████████████████▌| 881M/890M [00:13<00:00, 82.9MiB/s]
100%|██████████████████████████████████████▉| 889M/890M [00:13<00:00, 81.6MiB/s]
After the worker has gone through this initial loading on its first request, it is fine for all subsequent requests and the delay time is very short. But it often happens that a request gets sent to a newly initialized idle worker, and in that case it takes far too long for a single image to be generated (usually 1-2 minutes of loading, when the actual generation time for an image is 5 seconds). Is there some way to prevent this loading from happening when a worker receives its first request?
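For context on where that load happens, here is a minimal sketch of the handler pattern described above, assuming the weights were baked into the image at a local path during the Docker build. The path, pipeline class, and handler body are illustrative, not the exact code in question:

import runpod
import torch
from diffusers import StableDiffusionXLPipeline

# Illustrative path; assumes the weights were copied into the image at build time.
MODEL_DIR = "/model"

# Loading at module level means the cost is paid once when the worker boots,
# not inside the request handler itself.
pipe = StableDiffusionXLPipeline.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    local_files_only=True,  # never contact Hugging Face at runtime
).to("cuda")


def handler(job):
    # Generate a single image for the prompt in the request payload.
    prompt = job["input"]["prompt"]
    image = pipe(prompt).images[0]
    out_path = "/tmp/output.png"
    image.save(out_path)
    return {"image_path": out_path}


runpod.serverless.start({"handler": handler})

With this layout, the one-to-two-minute load described in the logs happens when the worker process starts, which is why the first request to an idle worker pays it and subsequent requests do not.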
1 reply
Start and stop multiple pods
I have a product that allows users to submit video editing requests, each of which can require anywhere from 0 to 8 minutes of RTX 4090 GPU processing to complete. To manage multiple requests, I wanted to implement a system that turns a group of GPU pods, all running the same Docker image, on and off. That way, if request volume is high at a given time, all requests can still be handled. However, in my experience, when pods are stopped, the GPU attached to a pod may no longer be available when I attempt to restart it later. This would obviously be a problem, because if a GPU is no longer available when a request requires the pod to turn back on, the request could not be processed correctly. Is there any way to work around this issue of GPUs being unassigned, so I can turn pods on and off and make this system feasible? I saw the serverless option, which seems like it would work for this product, but the cost does not seem feasible. Thank you!
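For reference, a minimal sketch of the on/off pool idea described above, assuming the runpod Python SDK's pod-control calls; the API key and pod IDs are placeholders, and resuming is exactly the step that can fail when the original GPU has since been allocated to someone else:

import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

# Pods created in advance from the same Docker image (IDs are placeholders).
POD_POOL = ["pod_id_1", "pod_id_2", "pod_id_3"]


def try_resume(pod_id):
    # Attempt to re-attach a GPU to a stopped pod; returns None if no GPU
    # can be re-attached because it was taken while the pod was stopped.
    try:
        return runpod.resume_pod(pod_id, gpu_count=1)
    except Exception as err:
        print(f"Could not resume {pod_id}: {err}")
        return None


def release(pod_id):
    # Stop the pod when the queue is empty; its storage is kept but the GPU
    # is released, which is why it may not be available on the next resume.
    runpod.stop_pod(pod_id)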
9 replies