Memory usage on serverless too high
I finally managed to get the serverless setup working.
I just sent a very simple POST request with a minimal prompt, but it runs out of memory. I'm using this heavily quantized model, which should fit on a 24 GB GPU: Dracones/Midnight-Miqu-70B-v1.0_exl2_2.24bpw. I chose a 48 GB GPU, so there should be plenty of room. Why is it running out of memory?

Error message:
2024-04-29T18:12:32.121035837Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 71.38 MiB is free. Process 2843331 has 44.27 GiB memory in use. Of the allocated memory 43.81 GiB is allocated by PyTorch, and 13.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
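For reference, my back-of-envelope math for why I expected the weights to fit (a rough sketch; the ~70B parameter count is my assumption from the model name, and this ignores the KV cache, activations, the CUDA context, and anything else the worker loads):

```python
# Rough estimate of VRAM needed for the 2.24 bpw EXL2 weights alone.
# Assumes ~70e9 parameters; ignores KV cache, activations, CUDA context, etc.
params = 70e9                 # Midnight-Miqu-70B parameter count (approximate)
bits_per_weight = 2.24        # from the _exl2_2.24bpw quantization
weight_bytes = params * bits_per_weight / 8
print(f"Quantized weights: ~{weight_bytes / 1024**3:.1f} GiB")  # ~18.3 GiB
```

So even with generous overhead I don't see how it reaches 44 GiB, unless the worker is loading something other than the quantized weights.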