RunPod · 5mo ago
ktabrizi

AMD pods don't properly support GPU memory allocation

Hello! I've been trying to build a ROCm/HIP-based package to run on RunPod's ROCm-templated pods (or in a custom-built container/template), and I ran into memory issues that I believe I've tracked down to how RunPod is starting up docker containers. In particular, pinned memory allocation fails with a misleading `Error: Failed to allocate pinned memory: out of memory (2)`. Inspecting the GPU devices shows unusual permissions, e.g.:
# ls -l /dev/dri/*
crw-rw-rw- 1 nobody nogroup 226, 144 Jun 27 13:07 /dev/dri/renderD144
I was able to do some testing on compute infrastructure with AMD hardware, and identified that:
1. The error does not occur when running directly on the machine.
2. The error does occur when running in Docker with the devices bound via `docker run --device /dev/kfd --device /dev/dri/renderDXXX ...`.
3. The error is resolved by adding `--security-opt seccomp=unconfined` to the docker arguments, as prescribed by the ROCm docs (this also returns the in-container device permissions to something normal).

I'll attach the code I used for testing in a reply. It'd be great to have AMD pods use a more permissive security profile to improve AMD GPU support. Let me know if I can help with this in any way.
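For reference, the full invocation that resolved the error in my testing looked roughly like this (the image name and render node are placeholders; substitute your own):

```shell
# Bind the AMD GPU devices into the container and relax the seccomp
# profile, per the ROCm container docs. The image and renderD144 node
# below are placeholders; use `ls /dev/dri` to find your render node.
docker run -it \
  --device /dev/kfd \
  --device /dev/dri/renderD144 \
  --security-opt seccomp=unconfined \
  rocm/dev-ubuntu-22.04
```

Without the `--security-opt seccomp=unconfined` line, the pinned memory allocation in the test program below fails; with it, the allocation succeeds.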
7 Replies
ktabrizi
ktabrizi (OP) · 5mo ago
Here's my script for quickly testing this, in case anyone wants to reproduce it:
#include <hip/hip_runtime.h>
#include <iostream>
#include <sstream>
#include <stdexcept>

#define CHECK_RESULT(result, errorMessage) \
    if (result != hipSuccess) { \
        std::stringstream m; \
        m << errorMessage << ": " << hipGetErrorString(result) << " (" << result << ")"; \
        throw std::runtime_error(m.str()); \
    }

int main() {
    unsigned int* pinnedCountBuffer = nullptr;
    hipError_t result;

    try {
        // Attempt to allocate pinned memory
        result = hipHostMalloc((void**)&pinnedCountBuffer, 2 * sizeof(unsigned int), hipHostMallocNumaUser);
        CHECK_RESULT(result, "Failed to allocate pinned memory");

        std::cout << "Successfully allocated pinned memory." << std::endl;

        // Use the allocated memory
        pinnedCountBuffer[0] = 42;
        pinnedCountBuffer[1] = 84;

        std::cout << "Values stored: " << pinnedCountBuffer[0] << ", " << pinnedCountBuffer[1] << std::endl;

        // Free the allocated memory
        result = hipHostFree(pinnedCountBuffer);
        CHECK_RESULT(result, "Failed to free pinned memory");

        std::cout << "Successfully freed pinned memory." << std::endl;
    }
    catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}
You can compile and run this with `hipcc -o test_hip_malloc test_hip_malloc.cpp && ./test_hip_malloc`.
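Independent of any HIP code, you can also check which seccomp mode a container is running under by reading the `Seccomp` field from `/proc/self/status` (0 means unconfined, 2 means a filter such as Docker's default profile is active):

```shell
# Print the seccomp mode of the current shell process.
# 0 = disabled (unconfined), 1 = strict, 2 = filter (Docker's default profile).
grep '^Seccomp:' /proc/self/status
```

In a container started with `--security-opt seccomp=unconfined` this reports 0; under Docker's default profile it reports 2.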
nerdylive
nerdylive · 5mo ago
@flash-singh
Madiator2011 (Work)
we do not directly pass devices for security
flash-singh
flash-singh · 5mo ago
@ktabrizi do you run into this with every workload? we have run LLMs mostly and did not have issues similar to yours
ktabrizi
ktabrizi (OP) · 5mo ago
we do – our application is compute intensive and involves PyTorch, but isn't an LLM or diffusion model. I think as soon as the software involved does anything custom with ROCm/HIP, someone will hit these kinds of issues. It'd be great to be able to run on RunPod's AMD pods as more and more applications are built to take advantage of the MI300Xs.

definitely fair, though I imagine there's a slightly more permissive security profile that would allow these pinned memory allocations without dropping seccomp altogether.
flash-singh
flash-singh · 5mo ago
if the only way forward is with seccomp, then that will come when we start deploying containers using Kata Containers, which uses microVMs. our cpu instances currently do this and offer privileged access in containers. will see if we can fit this into Q3
ktabrizi
ktabrizi (OP) · 5mo ago
Sounds good, thanks for the update. If there's any way to be notified if/when this is supported, please let me know!