RunPod · 8mo ago
houmie

Memory usage on serverless too high

I finally managed to get the serverless setup working.
I just sent a very simple POST request with a minimal prompt, but it runs out of memory. I'm using this heavily quantized model, which should fit on a 24 GB GPU: Dracones/Midnight-Miqu-70B-v1.0_exl2_2.24bpw. I chose a 48 GB GPU, so there should be plenty of room. Why is it running out of memory?

Error message:

2024-04-29T18:12:32.121035837Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 71.38 MiB is free. Process 2843331 has 44.27 GiB memory in use. Of the allocated memory 43.81 GiB is allocated by PyTorch, and 13.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
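For scale, a quick back-of-the-envelope check (my own estimate, assuming 70B parameters at 2.24 bits per weight and ignoring the KV cache and runtime overhead) puts the exl2 weights at roughly 18 GiB, far below the 43.81 GiB PyTorch actually allocated, which hints that the worker never loaded the exl2 quant at all:

```python
# Rough VRAM estimate for the quantized weights alone.
# Assumptions (not from the thread): 70e9 parameters, 2.24 bits per weight;
# KV cache and activation overhead are ignored.
params = 70e9
bits_per_weight = 2.24

weight_bytes = params * bits_per_weight / 8
print(f"~{weight_bytes / 1024**3:.1f} GiB of weights")  # prints ~18.3 GiB
```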
4 Replies
haris · 8mo ago
@houmie would you be able to DM me the endpoint ID you're facing this on?
houmie (OP) · 8mo ago
Sure, done. Thanks
haris · 8mo ago
@houmie which front end are you using for this?
houmie (OP) · 8mo ago
None, just curl from terminal to test it out. The issue is that vLLM doesn't support exl2.
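For anyone who hits the same wall: exl2 checkpoints are loaded with the exllamav2 library rather than vLLM, so a serverless worker would need a custom handler around it. A minimal sketch, assuming exllamav2 is installed, the checkpoint sits in a hypothetical /models/... directory, and the API matches the exllamav2 example scripts (worth checking against the version you pin):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path; point this at the downloaded exl2 checkpoint.
config = ExLlamaV2Config()
config.model_dir = "/models/Midnight-Miqu-70B-v1.0_exl2_2.24bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # loads weights incrementally, splitting across GPUs if needed

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Hello,", settings, num_tokens=64))
```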