CUDA out of memory (80GB GPU)
Hi there, I am trying to run a Dreambooth training job through a serverless endpoint, using an A100 80 GB GPU. Is this perhaps not a good GPU for this type of training?
I'm using this template as a base, but I modified it a bit and also changed the accelerate command with some other params. Even so, I wouldn't expect it to run out of memory:
https://github.com/runpod-workers/worker-lora_trainer
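To be clear, this isn't the full worker code, just roughly the shape of the command I end up launching (a simplified sketch, not my exact paths or values; the flags are standard kohya sd-scripts options, assuming train_db.py as the Dreambooth entry point):

import subprocess

# Rough sketch of the training launch; paths, steps and learning rate are placeholders.
cmd = [
    "accelerate", "launch", "--num_cpu_threads_per_process", "4",
    "train_db.py",
    "--pretrained_model_name_or_path", "/models/base.safetensors",
    "--train_data_dir", "/inputs/instance_images",
    "--output_dir", "/outputs",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--max_train_steps", "1500",
    "--learning_rate", "1e-6",
    "--mixed_precision", "fp16",    # fp16 training
    "--gradient_checkpointing",     # trade compute for memory
    "--xformers",                   # memory-efficient attention
    "--cache_latents",              # precompute VAE latents up front
]
subprocess.run(cmd, check=True)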
That's ridiculous for training a LoRA; you can do full Dreambooth training with 24 GB.
I am not training a LoRA; I am using that template as a base and modified the accelerate command under the hood to perform a Dreambooth training.
That's why I don't understand why it's running out of memory
Must be something wrong with your implementation. Nobody can help you unless you share the code.
I understand. I’ll share a repository in a sec.
https://github.com/flamed0g/runpod-kohya-worker
The actual command that starts the training is here: https://github.com/flamed0g/runpod-kohya-worker/blob/d7e189465766890e21e941835db153de7813b369/src/handler.py#L67
The serverless script uses sd-scripts under the hood and is a customized script based on runpod-workers/worker-lora_trainer.
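For anyone not clicking through: the handler is essentially the usual RunPod serverless wrapper that takes the job input and kicks off that accelerate command, roughly this shape (a simplified sketch with placeholder input keys, not the real schema; the actual logic is in handler.py at the link above):

import subprocess
import runpod

def handler(job):
    # job["input"] carries the training parameters sent to the endpoint (placeholder keys).
    params = job["input"]
    cmd = [
        "accelerate", "launch", "train_db.py",
        "--train_data_dir", params["instance_data_dir"],
        "--output_dir", params["output_dir"],
        "--mixed_precision", "fp16",
    ]
    # Run the training synchronously and report where the model was written.
    subprocess.run(cmd, check=True)
    return {"output_dir": params["output_dir"]}

runpod.serverless.start({"handler": handler})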