RunPod
Created by Xqua on 12/12/2024 in #⚡|serverless
Docker Image EXTREMELY Slow to load on endpoint but blazing locally
This is the first time I'm encountering this issue with a serverless endpoint. I've got a Docker image that loads the model (Flux Schnell) very fast and runs a job fairly fast on my local machine with a 4090. When I use a 4090 on RunPod, though, the image gets stuck loading the model:
self.pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
This can take around 5 minutes, which is (1) enormous and (2) not workable in production. What could be causing this?
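A common cause of this is that the weights are fetched from Hugging Face (or a slow volume) at cold start instead of being shipped inside the image. A minimal sketch of baking the weights in at build time and loading them strictly from local disk at runtime — the path and the build-time step are assumptions, not RunPod specifics:

```python
# Hypothetical local path baked into the Docker image at build time.
MODEL_DIR = "/models/flux-schnell"

def download_at_build_time():
    # Run once in the Dockerfile (e.g. RUN python -c "...") so the image
    # ships with the weights already on disk.
    from huggingface_hub import snapshot_download
    snapshot_download("black-forest-labs/FLUX.1-schnell", local_dir=MODEL_DIR)

def load_pipeline():
    # At runtime, load only from the local copy; local_files_only=True
    # guarantees no network round-trips during cold start.
    import torch
    from diffusers import FluxPipeline
    return FluxPipeline.from_pretrained(
        MODEL_DIR, torch_dtype=torch.bfloat16, local_files_only=True
    )
```

If loading from local disk is still slow, the bottleneck is more likely the worker's disk or host I/O than the model code itself.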
2 replies
RunPod
Created by Xqua on 12/6/2024 in #⚡|serverless
How can I use Multiprocessing in Serverless ?
Hi, I am trying to do something fairly simple:
def run(self):
    print("TRAINER: Starting training")
    train = Train()
    trainer = self.ctx.Process(target=train.train, args=(self.config.config_path,))
    trainer.start()
    print("TRAINER: Starting watcher")
    self.watch()
    trainer.join()
I have a training script in a training loop, and I want a watcher to check in on it periodically. It runs fine locally, but as soon as I put it in the Docker container, I get:
lora-trainer-1 | --- Starting Serverless Worker | Version 1.7.5 ---
lora-trainer-1 | WARN | test_input.json not found, exiting.
from the trainer process. Why is this happening? It looks as if it's launching a whole new job and request handler.
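This symptom matches the "spawn" start method: the child process re-imports the main module, so anything with side effects at module top level (such as starting the serverless worker) runs again in the child, printing the worker banner a second time. A minimal sketch of the usual fix, with the training function as a hypothetical stand-in:

```python
import multiprocessing as mp

def train(config_path):
    # Hypothetical stand-in for the real training loop.
    print(f"training with {config_path}")

def run():
    # With "spawn", the child starts a fresh interpreter and re-imports
    # this module, so top-level code must be guarded (see below).
    ctx = mp.get_context("spawn")
    trainer = ctx.Process(target=train, args=("config.yaml",))
    trainer.start()
    trainer.join()

if __name__ == "__main__":
    # Everything with side effects -- runpod.serverless.start(...), etc. --
    # belongs under this guard, so a re-import by a child process does not
    # start a second worker and request handler.
    run()
```

If the handler script calls `runpod.serverless.start(...)` unconditionally at module level, moving that call under the `if __name__ == "__main__":` guard should stop the duplicate worker startup.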
5 replies
RunPod
Created by Xqua on 10/29/2024 in #⚡|serverless
Delay times on requests
No description
6 replies
RunPod
Created by Xqua on 10/10/2024 in #⚡|serverless
Some serverless requests are Hanging forever
I'm not sure why, but often enough I have jobs that just hang, even when multiple GPUs are available on my serverless endpoint. New jobs might come in and go through while the old job just stalls there. Any idea why?
5 replies
RunPod
Created by Xqua on 10/10/2024 in #⚡|serverless
Application error on one of my serverless endpoints
Getting a TypeError: e.machine is null when I try to access https://www.runpod.io/console/serverless/user/endpoint/[HIDDEN].
6 replies
RunPod
Created by Xqua on 9/26/2024 in #⚡|serverless
6x speed reduction with network storage in serverless
To reduce my Docker image size I wanted to use network storage to hold the models, but the main issue I'm running into is that I went from 20 s per request to 120 s. Looking at the logs, it takes almost 100 s (vs. a few seconds) to load the model into GPU memory. Why is network storage so slow? It's a major drawback, and it means you and I have to handle tens of GB of Docker image for nothing.
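One common workaround: model loading performs many small reads, which are slow over a network filesystem, while a single sequential bulk copy is comparatively fast. A minimal sketch that copies the model from the network volume to local disk once at cold start, then loads from the local copy — the mount point and directory layout are assumptions:

```python
import os
import shutil

# Assumed layout: RunPod network volumes mount at /runpod-volume;
# the model subdirectory is hypothetical.
NETWORK_MODEL_DIR = "/runpod-volume/models/flux-schnell"
LOCAL_MODEL_DIR = "/tmp/models/flux-schnell"

def localize_model(src=NETWORK_MODEL_DIR, dst=LOCAL_MODEL_DIR):
    """Copy the model to local disk if not already present; return the local path."""
    if not os.path.isdir(dst):
        # One sequential copy over the network, paid once per cold start,
        # instead of many small random reads on every model load.
        shutil.copytree(src, dst)
    return dst
```

The trade-off is extra cold-start time for the copy and enough local disk for the weights; subsequent loads on a warm worker read from local disk at full speed.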
3 replies