RunPod · 6mo ago
smoke

CUDA out of memory (80GB GPU)

Hi there, I am trying to run a Dreambooth training through a serverless endpoint using the A100 80 GB GPU. Is this perhaps not a good GPU for this type of training? I'm using this template as a base, but I did modify it a bit, and I also modified the accelerate command with some other params, but I wouldn't expect it to run out of memory: https://github.com/runpod-workers/worker-lora_trainer
4 Replies
digigoblin · 6mo ago
That's ridiculous for training a LoRA; you can do full Dreambooth training with 24 GB.
smoke (OP) · 6mo ago
I am not training a LoRA. I am using that template as a base and modified the accelerate command under the hood to perform a Dreambooth training. That's why I don't understand why it's running out of memory.
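For reference, when an sd-scripts Dreambooth run OOMs even on a large card, the memory-relevant launch options are usually the first thing to check. Below is a minimal, illustrative sketch of that kind of argument list, assuming kohya-ss/sd-scripts' train_db.py; flag spellings can vary between versions, and none of these values are taken from the worker discussed here:

```python
# Illustrative only: typical VRAM-saving flags for kohya-ss/sd-scripts
# Dreambooth (train_db.py). Values are placeholders, not this worker's config.
memory_saving_args = [
    "--train_batch_size=1",        # batch size is the biggest lever on activation memory
    "--resolution=512,512",        # higher resolutions multiply activation size
    "--mixed_precision=fp16",      # half-precision activations and gradients
    "--gradient_checkpointing",    # trade compute for a large drop in activation memory
    "--cache_latents",             # cache VAE latents once instead of encoding every step
    "--xformers",                  # memory-efficient attention
    "--optimizer_type=AdamW8bit",  # 8-bit optimizer states (requires bitsandbytes)
]
```

Without gradient checkpointing, with a larger batch size or resolution, or with prior-preservation doubling the effective batch, a full fine-tune can exhaust far more memory than expected.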
digigoblin · 6mo ago
Must be something wrong with your implementation. Nobody can help you unless you share the code.
smoke (OP) · 6mo ago
I understand, I'll share a repository in a sec: https://github.com/flamed0g/runpod-kohya-worker
The actual command that starts the training is here: https://github.com/flamed0g/runpod-kohya-worker/blob/d7e189465766890e21e941835db153de7813b369/src/handler.py#L67
The serverless script uses sd-scripts under the hood and is a customized script based on runpod-workers/worker-lora_trainer.
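For anyone skimming the thread, the rough shape of such a worker is a RunPod serverless handler that builds an accelerate launch command and runs sd-scripts as a subprocess. The sketch below illustrates that pattern under those assumptions; the job fields, paths, and flags are hypothetical and do not mirror the linked handler.py:

```python
# Hypothetical sketch of a RunPod serverless handler that launches an
# sd-scripts Dreambooth run as a subprocess. All job fields and paths
# are placeholders, not the linked repository's actual code.
import subprocess
import runpod


def handler(job):
    params = job["input"]
    cmd = [
        "accelerate", "launch", "--num_cpu_threads_per_process=2",
        "train_db.py",  # kohya-ss/sd-scripts Dreambooth entry point
        f"--pretrained_model_name_or_path={params['base_model']}",
        f"--train_data_dir={params['train_data_dir']}",
        f"--output_dir={params['output_dir']}",
        f"--max_train_steps={params.get('steps', 1000)}",
        "--train_batch_size=1",
        "--mixed_precision=fp16",
        "--gradient_checkpointing",
        "--cache_latents",
        "--xformers",
    ]
    # Capture the training log so CUDA OOM errors surface in the endpoint output.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return {"error": result.stderr[-4000:]}
    return {"status": "completed", "output_dir": params["output_dir"]}


runpod.serverless.start({"handler": handler})
```

Returning the tail of stderr on failure makes it much easier to see which allocation actually triggers the OOM when the worker runs remotely.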