AMD pods don't properly support pinned memory allocation
Hello! I've been trying to build a ROCm/HIP-based package to run on RunPod's ROCm-templated pods (or in a custom-built container/template), and I ran into memory issues that I believe I've tracked down to how RunPod is starting up docker containers.
In particular, pinned memory allocation fails with a misleading `Error: Failed to allocate pinned memory: out of memory (2)`. Inspecting the GPU devices inside the container also shows unusual permissions, e.g.:
I was able to do some testing on compute infrastructure with AMD hardware, and identified that:
1. The error does not occur when running directly on the machine.
2. The error does occur when running in Docker with the devices bound via `docker run --device /dev/kfd --device /dev/dri/renderDXXX ...`.
3. The error is resolved by adding `--security-opt seccomp=unconfined` to the `docker run` arguments, as prescribed by the ROCm docs (this also returns the in-container device permissions to something normal).
I'll attach the code I used for testing in a reply.
It'd be great to have AMD pods use a more permissive security profile to improve AMD GPU support. Let me know if I can help with this in any way.
Here's my script for quickly testing this, in case anyone wants to reproduce it:
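A minimal sketch of such a test (an assumed reconstruction, not necessarily the attached file): it attempts an ordinary `hipMalloc` followed by a pinned `hipHostMalloc` and prints the HIP error string and code if either call fails.

```cpp
// test_hip_malloc.cpp -- minimal sketch of a pinned-memory allocation test
// (reconstruction for illustration, not necessarily the original attachment).
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Print the HIP error string and numeric code, then exit on failure.
static void check(hipError_t err, const char* what) {
    if (err != hipSuccess) {
        std::fprintf(stderr, "Error: %s failed: %s (%d)\n",
                     what, hipGetErrorString(err), static_cast<int>(err));
        std::exit(1);
    }
    std::printf("OK: %s\n", what);
}

int main() {
    int count = 0;
    check(hipGetDeviceCount(&count), "hipGetDeviceCount");
    std::printf("Found %d HIP device(s)\n", count);

    const size_t bytes = 64UL * 1024 * 1024;  // 64 MiB test allocation

    // Plain device memory allocation, for comparison with the pinned case below.
    void* dev_ptr = nullptr;
    check(hipMalloc(&dev_ptr, bytes), "hipMalloc (device memory)");
    check(hipFree(dev_ptr), "hipFree");

    // Pinned (page-locked) host memory -- the allocation reported to fail
    // with "out of memory (2)" under the default seccomp profile.
    void* pinned_ptr = nullptr;
    check(hipHostMalloc(&pinned_ptr, bytes, hipHostMallocDefault),
          "hipHostMalloc (pinned host memory)");
    check(hipHostFree(pinned_ptr), "hipHostFree");

    std::printf("All allocations succeeded\n");
    return 0;
}
```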
You can compile and run this with `hipcc -o test_hip_malloc test_hip_malloc.cpp && ./test_hip_malloc`.
@flash-singh
we do not directly pass devices for security
@ktabrizi do you run into this with every workload? we have run LLMs mostly and did not have issues similar to yours
we do; our application is compute-intensive and involves PyTorch, but isn't an LLM or diffusion model. I think as soon as the software involved does anything custom with ROCm/HIP, someone will hit these kinds of issues. It'd be great to be able to run on RunPod's AMD pods as more and more applications are built to take advantage of the MI300Xs.
definitely fair, though I imagine there's a slightly more permissive security profile that will allow these pinned memory allocations without dropping seccomp altogether.
if the only way forward is with `seccomp`, then that will occur when we start deploying containers using Kata Containers, which uses microVMs; our CPU instances currently do this and offer privileged access in containers. will see if we can fit this into Q3
Sounds good, thanks for the update. If there's any way to be notified if/when this is supported, please let me know!