RunPod · 8mo ago
houmie

Memory usage on serverless too high

I finally managed to get the serverless setup working.
I just sent a very simple POST request with a minimal prompt, but it runs out of memory. I'm using this heavily quantized model, which should fit on a 24 GB GPU: Dracones/Midnight-Miqu-70B-v1.0_exl2_2.24bpw. I chose a 48 GB GPU, so there should be plenty of room. Why is it running out of memory?

Error message:

2024-04-29T18:12:32.121035837Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 71.38 MiB is free. Process 2843331 has 44.27 GiB memory in use. Of the allocated memory 43.81 GiB is allocated by PyTorch, and 13.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
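For scale, a quick back-of-the-envelope check (my own estimate, assuming 70B parameters at 2.24 bits per weight and ignoring the KV cache and runtime overhead) puts the exl2 weights at roughly 18 GiB, far below the 43.81 GiB PyTorch actually allocated, which hints that the worker never loaded the exl2 quant at all:

```python
# Rough VRAM estimate for the quantized weights alone.
# Assumptions (not from the thread): 70e9 parameters, 2.24 bits per weight;
# KV cache and activation overhead are ignored.
params = 70e9
bits_per_weight = 2.24

weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1024**3:.1f} GiB of weights")  # prints ~18.3 GiB
```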
4 Replies
haris · 8mo ago
@houmie would you be able to DM me the endpoint ID you're facing this on?
houmie (OP) · 8mo ago
Sure, done. Thanks
haris · 8mo ago
@houmie which front end are you using for this?
houmie (OP) · 8mo ago
None, just curl from terminal to test it out. The issue is that vLLM doesn't support exl2.
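For anyone who hits the same wall: exl2 checkpoints are loaded with the exllamav2 library rather than vLLM, so a serverless worker would need a custom handler around it. A minimal sketch, assuming exllamav2 is installed, the checkpoint sits in a hypothetical /models/... directory, and the API matches the exllamav2 example scripts (worth checking against the version you pin):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path; point this at the downloaded exl2 checkpoint.
config = ExLlamaV2Config()
config.model_dir = "/models/Midnight-Miqu-70B-v1.0_exl2_2.24bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # loads weights incrementally, splitting across GPUs if needed

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Hello,", settings, num_tokens=64))
```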