RunPod
Created by Xqua on 12/12/2024 in #⚡|serverless
Docker Image EXTREMELY Slow to load on endpoint but blazing locally
This is the first time I'm encountering this issue with a serverless endpoint. I've got a Docker image that loads the model (Flux Schnell) very fast and runs a job fairly fast on my local machine with a 4090. When I use a 4090 on RunPod, though, the image gets stuck loading the model:
self.pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
This can take around 5 minutes, which is (1) enormous and (2) not workable in production. What could be causing this?
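A common cause of this is that the weights are fetched from Hugging Face (or a slow volume) at cold start instead of being shipped inside the image. A minimal sketch of baking the weights in at build time and loading them strictly from local disk at runtime — the path and the build-time step are assumptions, not RunPod specifics:

```python
# Hypothetical local path baked into the Docker image at build time.
MODEL_DIR = "/models/flux-schnell"

def download_at_build_time():
    # Run once in the Dockerfile (e.g. RUN python -c "...") so the image
    # ships with the weights already on disk.
    from huggingface_hub import snapshot_download
    snapshot_download("black-forest-labs/FLUX.1-schnell", local_dir=MODEL_DIR)

def load_pipeline():
    # At runtime, load only from the local copy; local_files_only=True
    # guarantees no network round-trips during cold start.
    import torch
    from diffusers import FluxPipeline
    return FluxPipeline.from_pretrained(
        MODEL_DIR, torch_dtype=torch.bfloat16, local_files_only=True
    )
```

If loading from local disk is still slow, the bottleneck is more likely the worker's disk or host I/O than the model code itself.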
2 replies
RunPod
Created by Xqua on 12/6/2024 in #⚡|serverless
How can I use Multiprocessing in Serverless ?
Hi, I am trying to do something fairly simple:
def run(self):
    print("TRAINER: Starting training")
    train = Train()
    trainer = self.ctx.Process(target=train.train, args=(self.config.config_path,))
    trainer.start()
    print("TRAINER: Starting watcher")
    self.watch()
    trainer.join()
I have a training script in a training loop, and I want a watcher to check in on it periodically. It runs fine locally, but as soon as I put it in the Docker container, I get:
lora-trainer-1 | --- Starting Serverless Worker | Version 1.7.5 ---
lora-trainer-1 | WARN | test_input.json not found, exiting.
from the trainer process. Why is this happening? It looks as if it's launching a whole new job and request handler.
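This symptom matches the "spawn" start method: the child process re-imports the main module, so anything with side effects at module top level (such as starting the serverless worker) runs again in the child, printing the worker banner a second time. A minimal sketch of the usual fix, with the training function as a hypothetical stand-in:

```python
import multiprocessing as mp

def train(config_path):
    # Hypothetical stand-in for the real training loop.
    print(f"training with {config_path}")

def run():
    # With "spawn", the child starts a fresh interpreter and re-imports
    # this module, so top-level code must be guarded (see below).
    ctx = mp.get_context("spawn")
    trainer = ctx.Process(target=train, args=("config.yaml",))
    trainer.start()
    trainer.join()

if __name__ == "__main__":
    # Everything with side effects -- runpod.serverless.start(...), etc. --
    # belongs under this guard, so a re-import by a child process does not
    # start a second worker and request handler.
    run()
```

If the handler script calls `runpod.serverless.start(...)` unconditionally at module level, moving that call under the `if __name__ == "__main__":` guard should stop the duplicate worker startup.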
5 replies
RunPod
Created by Xqua on 10/29/2024 in #⚡|serverless
Delay times on requests
No description
6 replies
RunPod
Created by Xqua on 10/10/2024 in #⚡|serverless
Some serverless requests are Hanging forever
I'm not sure why, but often enough I have jobs that just hang, even when multiple GPUs are available on my serverless endpoint. New jobs might come in and go through while the old job just stalls there. Any idea why?
5 replies
RunPod
Created by Xqua on 10/10/2024 in #⚡|serverless
Application error on one of my serverless endpoints
Getting a TypeError: e.machine is null when I try to access https://www.runpod.io/console/serverless/user/endpoint/[HIDDEN].
6 replies
RunPod
Created by Xqua on 9/26/2024 in #⚡|serverless
6x speed reduction with network storage in serverless
To reduce my Docker image size I wanted to use network storage to hold the models, but the main issue I'm running into is that I went from 20 s per request to 120 s. Looking at the logs, it takes almost 100 s (vs. a few seconds) to load the model into GPU memory. Why is network storage so slow? It's a major drawback, and it means you and I have to handle tens of GB of Docker image for nothing.
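One common workaround: model loading performs many small reads, which are slow over a network filesystem, while a single sequential bulk copy is comparatively fast. A minimal sketch that copies the model from the network volume to local disk once at cold start, then loads from the local copy — the mount point and directory layout are assumptions:

```python
import os
import shutil

# Assumed layout: RunPod network volumes mount at /runpod-volume;
# the model subdirectory is hypothetical.
NETWORK_MODEL_DIR = "/runpod-volume/models/flux-schnell"
LOCAL_MODEL_DIR = "/tmp/models/flux-schnell"

def localize_model(src=NETWORK_MODEL_DIR, dst=LOCAL_MODEL_DIR):
    """Copy the model to local disk if not already present; return the local path."""
    if not os.path.isdir(dst):
        # One sequential copy over the network, paid once per cold start,
        # instead of many small random reads on every model load.
        shutil.copytree(src, dst)
    return dst
```

The trade-off is extra cold-start time for the copy and enough local disk for the weights; subsequent loads on a warm worker read from local disk at full speed.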
3 replies