RunPod · 6mo ago
ktabrizi

AMD pods don't properly support GPU memory allocation

Hello! I've been trying to build a ROCm/HIP-based package to run on RunPod's ROCm-templated pods (or in a custom-built container/template), and I ran into memory issues that I believe I've tracked down to how RunPod starts its docker containers. In particular, pinned memory allocation fails with a misleading "Failed to allocate pinned memory: out of memory (2)". Inspecting the GPU devices shows unusual permissions, e.g.:
# ls -l /dev/dri/*
crw-rw-rw- 1 nobody nogroup 226, 144 Jun 27 13:07 /dev/dri/renderD144
I was able to do some testing on compute infrastructure with AMD hardware, and identified that:
1. The error does not occur when running directly on the machine.
2. The error does occur when running in docker with the devices bound via docker run --device /dev/kfd --device /dev/dri/renderDXXX ...
3. The error is resolved by adding --security-opt seccomp=unconfined to the docker arguments, as prescribed by the ROCm docs (this also returns the in-container device permissions to something normal). Example invocations are below.
I'll attach the code I used for testing in a reply. It'd be great to have AMD pods use a more permissive security profile to improve AMD GPU support. Let me know if I can help with this in any way.
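For reference, the two invocations from my testing looked roughly like this (the image name and render node are illustrative placeholders – substitute whatever ls /dev/dri shows on your host, with the test program compiled inside the container):

# fails: hipHostMalloc returns "out of memory (2)" under docker's default seccomp profile
docker run --device /dev/kfd --device /dev/dri/renderD128 rocm/dev-ubuntu-22.04 ./test_hip_malloc

# works: relax the seccomp profile, as the ROCm container docs prescribe
docker run --device /dev/kfd --device /dev/dri/renderD128 \
    --security-opt seccomp=unconfined rocm/dev-ubuntu-22.04 ./test_hip_malloc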
7 Replies
ktabrizi (OP) · 6mo ago
Here's my script for quickly testing this, in case anyone wants to reproduce it:
#include <hip/hip_runtime.h>
#include <iostream>
#include <sstream>
#include <stdexcept>

#define CHECK_RESULT(result, errorMessage) \
    if (result != hipSuccess) { \
        std::stringstream m; \
        m << errorMessage << ": " << hipGetErrorString(result) << " (" << result << ")"; \
        throw std::runtime_error(m.str()); \
    }

int main() {
    unsigned int* pinnedCountBuffer = nullptr;
    hipError_t result;

    try {
        // Attempt to allocate pinned (page-locked) host memory
        result = hipHostMalloc((void**)&pinnedCountBuffer, 2 * sizeof(unsigned int), hipHostMallocNumaUser);
        CHECK_RESULT(result, "Failed to allocate pinned memory");

        std::cout << "Successfully allocated pinned memory." << std::endl;

        // Use the allocated memory
        pinnedCountBuffer[0] = 42;
        pinnedCountBuffer[1] = 84;

        std::cout << "Values stored: " << pinnedCountBuffer[0] << ", " << pinnedCountBuffer[1] << std::endl;

        // Free the allocated memory
        result = hipHostFree(pinnedCountBuffer);
        CHECK_RESULT(result, "Failed to free pinned memory");

        std::cout << "Successfully freed pinned memory." << std::endl;
    }
    catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}
You can compile and run this with hipcc -o test_hip_malloc test_hip_malloc.cpp && ./test_hip_malloc.
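One quick way to check which security profile you're running under (assuming a standard Linux procfs): the kernel reports a process's seccomp mode in /proc/self/status, where 0 means no filter and 2 means a seccomp filter is active.

grep Seccomp /proc/self/status
# Seccomp:   2   -> a seccomp filter is applied (e.g. docker's default profile)
# Seccomp:   0   -> no filter (e.g. started with seccomp=unconfined)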
nerdylive · 6mo ago
@flash-singh
Madiator2011 (Work)
we do not pass devices through directly, for security reasons
flash-singh · 6mo ago
@ktabrizi do you run into this with every workload? we've mostly run LLMs and haven't seen issues similar to yours
ktabrizi (OP) · 6mo ago
we do – our application is compute-intensive and involves PyTorch, but isn't an LLM or diffusion model. I think as soon as the software does anything custom with ROCm/HIP, people will hit these kinds of issues. It'd be great to be able to run on RunPod's AMD pods as more and more applications are built to take advantage of the MI300Xs.
(re: not passing devices through directly) definitely fair, though I imagine there's a slightly more permissive seccomp profile that would allow these pinned memory allocations without dropping seccomp altogether.
flash-singh · 6mo ago
if the only way forward is relaxing seccomp, then that will happen when we start deploying containers using Kata Containers, which use microVMs. our CPU instances already do this and offer privileged access in containers. will see if we can fit this into Q3
ktabrizi (OP) · 6mo ago
Sounds good, thanks for the update. If there's any way to be notified if/when this is supported, please let me know!