ktabrizi
Two pods disappeared from my account
After a two-week hiatus from RunPod, I returned and was frustrated to find that at least two (on-demand, secure cloud) pods are missing from my account. These take time, effort, and money to set up, and I was happily paying their storage costs. My account is set up to autopay and roll over, so the balance is always > $100 (i.e., this is not a non-payment issue). There have been no reported storage outages AFAIK, and my audit logs show no activity whatsoever between 8/17, when I last used RunPod, and today. Billing, however, indicates one pod being dropped on the 23rd and another on the 24th.
Can anyone shed some light on what's going on here, and ideally help me restore my missing pods?
Similar issues, for reference:
https://discord.com/channels/912829806415085598/1195670955939332157
https://discord.com/channels/912829806415085598/1263150831717449728
7 replies
AMD pods don't properly support GPU memory allocation
Hello! I've been trying to build a ROCm/HIP-based package to run on RunPod's ROCm-templated pods (or in a custom-built container/template), and I ran into memory issues that I believe I've tracked down to how RunPod is starting up docker containers.
In particular, pinned memory allocation fails with a misleading error:
Error: Failed to allocate pinned memory: out of memory (2)
Inspecting the GPU devices inside the container also shows unusual permissions.
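For anyone who wants to check the device permissions themselves, here is a minimal sketch of one way to do it. It is not the output from the original post. It assumes the GPU is exposed through /dev/kfd and /dev/dri/renderD* (the nodes passed to docker run later in this thread), and the function names are made up for illustration.

```python
import glob
import os
import stat


def describe_device(path):
    """Return an `ls -l`-style summary for one device node, or a note if it cannot be read."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return f"{path}: unavailable ({exc.strerror})"
    # stat.filemode renders the type and permission bits, e.g. 'crw-rw----'
    mode = stat.filemode(st.st_mode)
    return f"{path}: {mode} uid={st.st_uid} gid={st.st_gid}"


def list_gpu_devices(patterns=("/dev/kfd", "/dev/dri/renderD*")):
    """Expand the (assumed) ROCm device paths and describe each one."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the raw pattern if nothing matched, so the gap shows up in the report
        paths.extend(matches if matches else [pattern])
    return [describe_device(p) for p in paths]


if __name__ == "__main__":
    for line in list_gpu_devices():
        print(line)
```

Running this once on the host and once inside the container makes it easy to compare the mode bits and owner IDs side by side.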
I was able to do some testing on compute infrastructure with AMD hardware, and identified that:
1. The error does not occur when running directly on the machine
2. The error does occur when running in a Docker container with the devices bound via docker run --device /dev/kfd --device /dev/dri/renderDXXX ...
3. The error is resolved by adding --security-opt seccomp=unconfined to the docker arguments, as prescribed by the ROCm docs (this also returns the in-container device permissions to something normal)
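The two container configurations compared above can be sketched as a small helper that assembles the docker run arguments with or without the seccomp override. This is only an illustration: the image name my-rocm-image and the function name are hypothetical, and the render node is left as the renderDXXX placeholder because the actual node differs per machine.

```python
import shlex


def docker_run_args(image, render_node, unconfined_seccomp=True):
    """Build a `docker run` argument list that binds the AMD GPU device nodes."""
    args = [
        "docker", "run",
        "--device", "/dev/kfd",      # ROCm compute interface
        "--device", render_node,     # per-GPU render node, e.g. /dev/dri/renderDXXX
    ]
    if unconfined_seccomp:
        # The option that resolved the pinned-memory error in the tests above
        args += ["--security-opt", "seccomp=unconfined"]
    args.append(image)
    return args


if __name__ == "__main__":
    # Hypothetical image name; substitute a real image and render node
    for flag in (False, True):
        cmd = docker_run_args("my-rocm-image", "/dev/dri/renderDXXX", unconfined_seccomp=flag)
        print(shlex.join(cmd))
```

Printing both variants makes it easy to reproduce the failing and working cases back to back on the same machine.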
I'll attach the code I used for testing in a reply.
It'd be great to have AMD pods use a more permissive security profile to improve AMD GPU support. Let me know if I can help with this in any way.
10 replies