GPU memory already in use when pod starts
I have seen this happen multiple times across different GPU types and regions. When launching a pod, some of the GPU memory is already in use, and any attempt to make full use of the GPU's memory results in errors/crashes. For example, I have been trying to deploy 2xA100 GPUs in the Romania data center for the past hour. Each time I launch a pod, one of the GPUs already shows 40% of its memory in use, and attempting to utilize that GPU results in a crash. This is a screenshot of my GPU usage immediately after launching the pod, before any model had been loaded (or even downloaded). Restarting the pod and deleting/recreating the pod does not resolve the issue.
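For anyone who wants to check this on their own pod, here is a minimal sketch of how to print per-GPU usage right after launch, assuming the pod image ships with PyTorch (note that initializing CUDA itself reserves a small amount of memory on each GPU it touches):

```python
# Sketch: print per-GPU memory usage right after the pod starts,
# before anything has been loaded. Assumes PyTorch is in the pod image.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # driver-level numbers, in bytes
    used = total - free
    print(f"GPU {i}: {used / 1e9:.1f} / {total / 1e9:.1f} GB used "
          f"({100 * used / total:.0f}%)")
```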
If I'm paying to rent a GPU, I expect to be able to make full use of it, and not have half of the memory locked up for no apparent reason.
Oh, and I tried running koboldcpp in the CA region, which doesn't have this problem, but for some reason it is unable to create a Cloudflare URL there (this only happens in the CA region; I've seen it for 2+ months now).
Honestly, I'm getting very frustrated with Runpod's service and am strongly considering moving to a different provider. I spend half my time (and money) troubleshooting these errors rather than actually using the service I'm paying for.

3 Replies
Screenshot of trying to run koboldcpp in the CA datacenter:

Running nvidia-smi shows no processes using memory, yet 32 GB of memory is in use. This is ridiculous.
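For what it's worth, the same information nvidia-smi reports can be pulled straight from NVML inside the pod; a minimal sketch, assuming the pynvml bindings (e.g. the nvidia-ml-py package) are installed:

```python
# Sketch: query NVML for the same data nvidia-smi shows, i.e. per-GPU
# memory usage plus the list of compute processes holding memory.
# Assumes pynvml is installed (e.g. via the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(f"GPU {i}: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB used, "
              f"{len(procs)} compute process(es)")
        for p in procs:
            print(f"  pid {p.pid}: {p.usedGpuMemory}")
finally:
    pynvml.nvmlShutdown()
```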

Sorry for the inconvenience. I’ve sent a message to the internal team to take a look.