Why "CUDA out of memory" Today ? Same image to generate portrait, yesterday is ok , today in not.
"delayTime": 133684,
"error": "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.68 GiB total capacity; 18.84 GiB already allocated; 1.47 GiB free; 20.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
"executionTime": 45263,
"id": "ae1e4066-e2b7-43c1-8f37-3525bda03893-e1",
Ask the developer of the application; it has nothing to do with RunPod.
seems like out of memory error
meaning you need a bigger gpu for that
or try unloading your other models somehow
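If several checkpoints end up resident at the same time, freeing the ones you no longer need before loading the next is usually enough. A minimal sketch in plain PyTorch, with an illustrative stand-in for whatever model object the worker keeps around:

import gc
import torch

# Stand-in for a previously loaded checkpoint; in practice this would be
# whatever model object your worker is holding a reference to.
old_model = torch.nn.Linear(8192, 8192).cuda()

# Drop every Python reference, force garbage collection, then hand the
# cached blocks back to the GPU so the next load has room.
del old_model
gc.collect()
torch.cuda.empty_cache()

print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB still allocated")
print(f"{torch.cuda.memory_reserved() / 1024**3:.2f} GiB still reserved")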
I am the developer. When I use my AI app, I get CUDA out of memory. I haven't changed anything in the app.
Then it needs a larger GPU as nerdy said.
It looks like you are trying to use a 24GB GPU when you need more VRAM. Try running it on a 48GB GPU. If that is still not enough, try an 80GB GPU.
OK, I see, I will test.
I have exactly the same problem. We have changed nothing in our setup; just today most image generations fail.
I have a second serverless endpoint running that uses the same template. That one is running fine.
how does your setup work?
does it unload models?
what models are loaded in VRAM? maybe too many models are loaded
I have just realised this only happened on one specific worker:
m07jdb658oetph
that's why not all of the generations failed and my other endpoint runs fine
Interesting. When it happens, try to collect a traceback or logs.
I have not switched it back on. But I can give you the logs from the weekend when it happened
Any stacktrace?
maybe somewhere here:
sorry, not sure how I would get a stacktrace. I just downloaded the logs directly from runpod
This?
{
  "endpointId": "6oe3safoiwidj3",
  "workerId": "m07jdb658oetph",
  "level": "info",
  "message": "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.",
  "dt": "2024-08-03 18:27:11.64919904"
}
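One way to get a real stack trace out of a serverless worker, rather than just dashboard fragments like the line above, is to catch exceptions inside the handler and log traceback.format_exc(). A minimal sketch of that pattern; run_generation is a hypothetical placeholder for the actual A1111 call, not code from this endpoint:

import traceback

import runpod

def handler(job):
    try:
        # run_generation is hypothetical; substitute whatever your handler
        # actually does with job["input"] (e.g. the call into the A1111 API).
        result = run_generation(job["input"])
        return {"images": result}
    except Exception:
        # Print the full traceback so it lands in the worker logs, and return
        # it in the error field so it also shows up in the failed job output.
        tb = traceback.format_exc()
        print(tb)
        return {"error": tb}

runpod.serverless.start({"handler": handler})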
What's the application that you're using?
that creates that
we run stable diffusion with automatic1111
so all workers run the same specific model, LoRAs, etc.?
yes
then all of them should be throwing that OOM error if they load the same models
so there must be some workers that are loading and unloading models dynamically
and that's what you should find out from the application that you're using
we don't have that functionality in our code.
They should all load the very same way
This specific worker had a 100% fail rate though.
where's the code for loading the model tho
i'll try to look at it
try reporting it to runpod for now, but i guess this is more of an application issue
I'll send you the start.sh and handler script.
maybe you didn't unload the models somewhere
How? I don't see a file option in DM.
just the model loading code maybe
a1111 loads models dynamically as far as i know
It's an OOM issue. Why are you using SDP attention and not xformers?
what about it? xformers has lower vram usage?
yes
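In A1111 itself that's a launch-flag change (--xformers instead of --opt-sdp-attention in the webui arguments), i.e. a start.sh tweak rather than a code change. As a rough Python-side illustration of the same idea, this is what memory-efficient attention looks like if you drive Stable Diffusion through the diffusers library instead of A1111 (the model id is a placeholder, not the checkpoint from this thread):

import torch
from diffusers import StableDiffusionPipeline

# Placeholder model id; substitute whichever checkpoint you actually serve.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# xformers memory-efficient attention has a lower peak VRAM footprint than
# full SDP attention, which can be the difference on a borderline 24 GB card.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("studio portrait photo", num_inference_steps=25).images[0]
image.save("portrait.png")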
I'll try this in a new deployment. Just thought it was odd that only this one worker failed.
A1111 can fail intermittently with OOM errors depending on the request. I experienced random/intermittent OOM and had to upgrade from the 24GB to the 48GB GPU tier.
Thanks for that tip