ktabrizi
RunPod
Created by ktabrizi on 8/30/2024 in #⛅|pods
Two pods disappeared from my account
After a two-week hiatus from RunPod, I returned and was frustrated to find that at least two (on-demand, secure cloud) pods are missing from my account. These take time, effort, and money to set up, and I was happily paying their storage costs. My account is set up to autopay and roll over, so the balance is always > $100 (i.e., this is not a non-payment issue). There have been no reported storage outages AFAIK, and my audit logs show no activity whatsoever between 8/17, when I last used RunPod, and today. Billing, however, indicates one pod being dropped on the 23rd and another on the 24th. Can anyone shed some light on what's going on here, and ideally help me restore my missing pods? Similar issues, for reference: https://discord.com/channels/912829806415085598/1195670955939332157 https://discord.com/channels/912829806415085598/1263150831717449728
7 replies
RunPod
Created by ktabrizi on 7/9/2024 in #⛅|pods
AMD pods don't properly support GPU memory allocation
Hello! I've been trying to build a ROCm/HIP-based package to run on RunPod's ROCm-templated pods (or in a custom-built container/template), and I ran into memory issues that I believe I've tracked down to how RunPod is starting up docker containers. In particular, pinned memory allocation fails with a misleading Error: Failed to allocate pinned memory: out of memory (2). Inspecting the GPU devices shows unusual permissions, e.g.:
# ls -l /dev/dri/*
crw-rw-rw- 1 nobody nogroup 226, 144 Jun 27 13:07 /dev/dri/renderD144
I was able to do some testing on compute infrastructure with AMD hardware, and identified that:
1. The error does not occur when running directly on the machine.
2. The error does occur when running in Docker with the devices bound via docker run --device /dev/kfd --device /dev/dri/renderDXXX ...
3. The error is resolved by adding --security-opt seccomp=unconfined to the docker arguments, as prescribed by the ROCm docs (this also returns the in-container device permissions to something normal).
I'll attach the code I used for testing in a reply. It'd be great to have AMD pods use a more permissive security profile to improve AMD GPU support. Let me know if I can help with this in any way.
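For anyone wanting to reproduce the comparison, here is a minimal Python sketch of the two checks described above. It is not the test code mentioned in the post. It assumes that nobody/nogroup ownership on a device node is the "unusual" sign to look for, and the helper names and the image name my-rocm-image are hypothetical placeholders. The docker flags are the ones quoted in the post, plus --rm to clean up the container afterwards.

```python
import shlex


def parse_ls_line(line):
    """Split one `ls -l` line for a device node into (mode, owner, group, path)."""
    fields = line.split()
    return fields[0], fields[2], fields[3], fields[-1]


def has_unmapped_owner(line):
    """Assumption: nobody/nogroup ownership marks the unusual permissions."""
    _mode, owner, group, _path = parse_ls_line(line)
    return owner == "nobody" or group == "nogroup"


def docker_run_args(image, render_node, unconfined_seccomp=True):
    """Build the docker run argument list, optionally with seccomp=unconfined."""
    args = ["docker", "run", "--rm",
            "--device", "/dev/kfd",
            "--device", render_node]
    if unconfined_seccomp:
        # The option the post reports as resolving the pinned-memory error.
        args += ["--security-opt", "seccomp=unconfined"]
    args.append(image)
    return args


# Sample line copied from the listing in the post.
sample = "crw-rw-rw- 1 nobody nogroup 226, 144 Jun 27 13:07 /dev/dri/renderD144"
print(has_unmapped_owner(sample))

# "my-rocm-image" is a placeholder; substitute your own image and render node.
print(shlex.join(docker_run_args("my-rocm-image", "/dev/dri/renderD144")))
```

Running the printed command with and without unconfined_seccomp, then repeating the ls -l check inside each container, should show whether the seccomp profile is what changes the device ownership on a given host.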
10 replies