RunPod · 2mo ago
stevex

I'm seeing 93% GPU Memory Used even in a freshly restarted pod.

Not sure what to do about this. nvidia-smi shows there are no processes running, but when I try to run a job it shows "Process 1726743 has 42.25 GiB memory in use". How do I find and kill that?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 18.44 MiB is free. Process 1726743 has 42.25 GiB memory in use. Process 3814980 has 2.23 GiB memory in use. Of the allocated memory 1.77 GiB is allocated by PyTorch, and 53.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
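(Side note: the allocator setting that the error message itself suggests can be exported before launching the job. A minimal sketch, with the script name as a placeholder:)

    # Allocator option suggested by the error text, meant to reduce fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    # Hypothetical launch command; substitute your actual script
    python train.py

That won't help if another process really is holding 42 GiB, but it rules out fragmentation on the PyTorch side.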
3 Replies
nerdylive · 2mo ago
Linux kill command

It sounds like there's a ghost process (Process ID 1726743) that's holding onto your GPU memory, even though nvidia-smi isn't showing any active processes. Here are some steps to locate and terminate that process:

1. Confirm the process exists. Run the following command to check if the process is still active:
   ps -p 1726743
   If it's running, you'll see the process details; otherwise, the terminal will indicate that no such process exists.

2. Find the process with fuser or lsof. Sometimes processes won't show up in nvidia-smi, especially if they aren't actively using the GPU but are still holding onto memory. You can use fuser or lsof to identify processes using the CUDA devices:
   sudo fuser -v /dev/nvidia*
   This lists all processes using the NVIDIA device files.

3. Kill the process. If you identify process 1726743 as still running, terminate it with:
   sudo kill -9 1726743
   If it doesn't respond, rerun the previous commands to confirm it's actually terminated.

4. Double-check with nvidia-smi. After killing the process, run nvidia-smi again to ensure the GPU memory is freed up; you should see the memory released.

5. Reset the GPU (as a last resort). If none of the above steps free up the memory, you may need to reset the GPU. This can impact other users if you're on a shared system, so proceed with caution:
   sudo nvidia-smi --gpu-reset
   Note that this command may require elevated permissions or may be disabled on some systems.

Following these steps should help you clear the memory and run your PyTorch job without the CUDA out-of-memory error. Or restart the pod as the last resort.

(That is from ChatGPT, but it's worth trying.)
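A consolidated sketch of those commands, assuming the PID from the error message above (adjust it to whatever your error reports):

    # 1. Check whether the reported PID is visible from inside the pod
    ps -p 1726743

    # 2. List every process that has an NVIDIA device file open
    sudo fuser -v /dev/nvidia*

    # 3. If the offending PID shows up, terminate it, then re-check GPU memory
    sudo kill -9 1726743
    nvidia-smi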
stevex (OP) · 2mo ago
I tried most of that... the process ID it quoted doesn't show up in ps -ef (and the number itself is unusually high). If there were a process holding onto memory, restarting the pod would have cleared it.
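For anyone hitting the same thing, a couple of checks that show whether that PID is even visible inside the pod (a sketch; the assumption here is that the process lives outside the container's PID namespace, e.g. on the host or another tenant sharing the card):

    # Per-process GPU memory as nvidia-smi reports it
    nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv

    # If /proc/<pid> doesn't exist inside the pod, the process isn't in this
    # container's PID namespace, so it can't be killed from here
    ls -d /proc/1726743 || echo "PID not visible from this pod"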
nerdylive · 2mo ago
Ah, so it's not coming from the pod itself? Can you try another pod first, to see if it's the same?