I'm seeing 93% GPU Memory Used even in a freshly restarted pod.
Not sure what to do about this. nvidia-smi shows no processes running, but when I try to run a job, the CUDA out-of-memory error says "Process 1726743 has 42.25 GiB memory in use". How do I find and kill that process?
3 Replies
Linux kill command
It sounds like there’s a ghost process (Process ID 1726743) that’s holding onto your GPU memory, even though nvidia-smi isn’t showing any active processes. Here are some steps to locate and terminate that process:
1. Confirm the Process Exists
Run the following command to check if the process is still active:
ps -p 1726743
If it’s running, you’ll see the process details; otherwise, the terminal will indicate that no such process exists.
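If ps comes up empty and you want a second check, /proc works too (a sketch, assuming a standard Linux /proc; the PID is just the one from your error):
ls /proc/1726743          # the directory only exists while the process is alive
cat /proc/1726743/comm    # prints the process name if the PID exists in your namespace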
2. Find the Process with fuser or lsof
Sometimes, processes won’t show up in nvidia-smi, especially if they aren’t actively using the GPU but are still holding onto memory. You can use fuser or lsof to identify processes using CUDA devices:
sudo fuser -v /dev/nvidia*
This command lists all processes using the NVIDIA device files.
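If fuser isn't installed in your image, lsof gives roughly the same view (assuming lsof is available; the device paths are the standard ones):
sudo lsof /dev/nvidia*    # one line per open handle on the NVIDIA device files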
3. Kill the Process
If you identify Process 1726743 as still running, you can terminate it with:
sudo kill -9 1726743
Afterwards, rerun the previous commands to confirm the process is actually gone.
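If you'd rather not go straight to SIGKILL, a gentler sequence (a sketch using the PID from your error) is to send SIGTERM first and only force-kill if it lingers:
sudo kill -15 1726743                    # ask the process to exit cleanly
sleep 5                                  # give it a moment
ps -p 1726743 && sudo kill -9 1726743    # force-kill only if it is still alive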
4. Double-Check with nvidia-smi
After killing the process, run nvidia-smi again to ensure the GPU memory is freed up. You should see the memory released.
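If you only care about the memory numbers, nvidia-smi's query mode is handy (these query fields are standard, though output formatting can vary by driver version):
nvidia-smi --query-gpu=memory.used,memory.total --format=csv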
5. Restart the GPU (as a Last Resort)
If none of the above steps free up the memory, you may need to reset the GPU. This step can impact other users if you’re on a shared system, so proceed with caution. You can reset the GPU with:
sudo nvidia-smi --gpu-reset
Note that this command may require elevated permissions or may be disabled on some systems.
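On multi-GPU nodes the reset usually has to target one idle GPU at a time; a hedged variant, assuming the stuck card is index 0, is:
sudo nvidia-smi --gpu-reset -i 0    # resets only GPU 0; fails if anything still holds the device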
Following these steps should help you clear the memory and allow you to run your PyTorch job without the CUDA out of memory error.
Or restart the pod as a last resort*
That is from ChatGPT, worth trying.
I tried most of that... the process ID it quoted doesn't show up in
ps -ef
(and the number is a bit unusual).
If there was a process holding onto memory, restarting the pod would clear that.
Ah, so it's not from the pod?
Can you try another pod first? To see if it's the same
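One way to confirm that theory: the pod runs in its own PID namespace, so a host-side PID like 1726743 won't normally appear in ps -ef inside it, and a freshly restarted pod can't be the owner. A rough sketch of checking from the node itself (this assumes you have SSH or equivalent access to the node; it won't work from inside the pod):
ps -p 1726743 -o pid,user,cmd    # on the node: shows who owns the process
sudo fuser -v /dev/nvidia*       # on the node: lists everything holding the GPU device files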