Serverless is timing out before full load

I have a serverless endpoint that loads a bunch of LoRAs on top of SDXL, and the first load takes a long time (more than 500 seconds). This used to work well until I added even more LoRAs; now it times out, logs "removing container", and restarts again and again. Any tips to fix this?
16 Replies
lou · 7mo ago
any tips?
nerdylive · 7mo ago
Wow, how many LoRAs do you use? And how do you load them in code? lou, do you have the same problem too?
blabbercrab (OP) · 7mo ago
I use load_lora_weights, and I load them outside the handler. I'm using the SDXL worker serverless template.
nerdylive · 7mo ago
Maybe that's normal for Hugging Face models? Did you also use .to("cuda")?
blabbercrab (OP) · 7mo ago
Yes, I did:
```python
import concurrent.futures

import torch
from compel import Compel, ReturnedEmbeddingsType
from diffusers import StableDiffusionXLPipeline

# lora_weights and character_configs are defined elsewhere in the worker.


class ModelHandler:
    def __init__(self):
        self.base = None
        self.compel = None
        self.load_models()

    def load_base(self):
        # Load the SDXL base checkpoint from a single .safetensors file.
        base_pipe = StableDiffusionXLPipeline.from_single_file("./anime/autism.safetensors", add_watermarker=False, use_safetensors=True, torch_dtype=torch.float16, safety_checker=None)
        base_pipe = base_pipe.to("cuda", silence_dtype_warnings=True)
        base_pipe.enable_xformers_memory_efficient_attention()
        # Compel handles prompt weighting across both SDXL text encoders.
        compel = Compel(tokenizer=[base_pipe.tokenizer, base_pipe.tokenizer_2], text_encoder=[base_pipe.text_encoder, base_pipe.text_encoder_2], returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED, requires_pooled=[False, True])

        # Attach every pose and character LoRA up front, each under its own adapter name.
        for weight in lora_weights:
            print("Loading pose weight:", weight)
            base_pipe.load_lora_weights(f"./anime/loras/{weight}.safetensors", adapter_name=weight)
        for character in list(character_configs.keys()):
            print("Loading character weight:", character)
            base_pipe.load_lora_weights(f"./anime/loras/{character}.safetensors", adapter_name=character)

        return base_pipe, compel

    def load_models(self):
        # Loading runs in a worker thread, but .result() blocks until everything is loaded.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_base = executor.submit(self.load_base)
            self.base, self.compel = future_base.result()


MODELS = ModelHandler()
```
Any idea why this happens? It's a lot of LoRAs, I think around 30. It's not a memory issue as far as I know; it's the first run of the serverless worker timing out before all the LoRAs load in. What I'm doing is calling load_lora_weights outside the handler and then using set_adapter() inside the handler to activate the LoRA I need. If anyone has a clue on how to fix this, please let me know.
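For reference, the handler side of that setup looks roughly like this (a simplified sketch, not my exact worker code; the job input keys, the output handling, and the adapter name are just placeholders):

```python
# Sketch only: assumes MODELS is the preloaded ModelHandler from the snippet above,
# and that the job input carries "lora" and "prompt" keys (placeholder schema).
import runpod


def handler(job):
    job_input = job["input"]
    adapter = job_input["lora"]  # must match an adapter_name passed to load_lora_weights

    # Activate only the requested adapter; the other ~30 stay loaded but inactive.
    # (Recent diffusers exposes this as set_adapters on the pipeline.)
    MODELS.base.set_adapters([adapter], adapter_weights=[1.0])

    image = MODELS.base(prompt=job_input["prompt"]).images[0]
    image.save("/tmp/output.png")
    return {"image_path": "/tmp/output.png"}


runpod.serverless.start({"handler": handler})
```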
digigoblin · 7mo ago
You can't really fix this; RunPod expects the handler to kick in within a certain time period. Why do you need to load 30 LoRAs anyway?
blabbercrab (OP) · 7mo ago
Can't I extend this time period? It's for our image generation service. I don't use all those LoRAs at once; rather, I load them all and then use set_adapter to activate only the ones I require. This way I don't have to load and unload every LoRA on every request.
nerdylive · 7mo ago
No, I don't really know whether this is normal... I've never tried loading that many LoRAs, but I guess it's a normal amount of time for loading all of them?
Charixfox · 7mo ago
If it's not getting them into memory, that's a problem I'm not sure about. If it's not getting them onto disk from the download before dying, that's one I've seen and solved by using network storage, as long as the download resumes. Then it can work on the downloads until it dies, pick up where it left off the next time it tries, eventually finish, and not have to download them again afterward.
blabbercrab (OP) · 7mo ago
The files are already in the Docker container, @Charixfox. It dies before it's able to load everything into RAM.
Charixfox · 7mo ago
Ah. I have no advice on that one unfortunately.
nerdylive · 7mo ago
Maybe it's normal? Did you benchmark somewhere else and find it faster than the performance here? I mean, 30 LoRAs is a bunch of LoRAs, right, on top of loading SDXL too.
blabbercrab (OP) · 7mo ago
I don't mind it loading for however long it wants, but I'd like it to fully load. What happens is that before it loads all 30 LoRAs, there's some sort of timeout that restarts the worker, and it retries loading all of them again, and that keeps continuing. Anyway, I came up with a different solution to my problem, so it's all good now.
nerdylive · 7mo ago
Oh, how did you solve it? Oh yeah, I guess it's a RunPod problem; it might time out because they think the worker isn't loading.
blabbercrab (OP) · 7mo ago
I'm loading each LoRA once, at request time, and then not unloading it for new requests. It always checks whether the LoRA is already loaded, so for any user who requests a specific LoRA it only takes extra time once.
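Roughly like this (a simplified sketch of the idea, not the exact code; the paths and input keys are placeholders):

```python
# Sketch of the lazy-loading idea: nothing heavy at startup, each LoRA is loaded the
# first time a request asks for it and then kept attached for later requests.
# LORA_DIR and the "lora"/"prompt" input keys are placeholders, not the real schema.
import runpod
import torch
from diffusers import StableDiffusionXLPipeline

LORA_DIR = "./anime/loras"
loaded_adapters = set()  # adapter names already attached to the pipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "./anime/autism.safetensors", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")


def handler(job):
    adapter = job["input"]["lora"]

    # Pay the loading cost only the first time this LoRA is requested.
    if adapter not in loaded_adapters:
        pipe.load_lora_weights(f"{LORA_DIR}/{adapter}.safetensors", adapter_name=adapter)
        loaded_adapters.add(adapter)

    # Activate just the requested adapter for this generation.
    pipe.set_adapters([adapter], adapter_weights=[1.0])

    image = pipe(prompt=job["input"]["prompt"]).images[0]
    image.save("/tmp/output.png")
    return {"image_path": "/tmp/output.png"}


runpod.serverless.start({"handler": handler})
```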
nerdylive · 7mo ago
Nice, progressive loading. Yeah, never mind; the faster-whisper worker loads all of its models at once. But it can still unload itself if your worker is inactive for some time.