bghira
RunPod
•Created by 41see on 4/11/2025 in #⛅|pods-clusters
CUDA device uncorrectable ECC error
i know how this feels. i have a free 8xH100 system that "fell out of a billing system" and its 100gbps port flaps so it's like, cool, i guess, but also kinda useless. i told the vendor. they haven't done anything in ~2 weeks
84 replies
it's this one but it's gone now
"Based on the information you’ve provided, we don’t believe this is an issue with the RunPod platform. Unfortunately, without more specific GPU details, our reliability team is unable to investigate further or escalate the matter to our hardware vendor. We’ve also reviewed our monitoring metrics and didn’t observe any ECC errors in the past 30 days, as I mentioned earlier."
:maaaaan: i can't even believe the audacity of the support team to respond like this, do you guys want us wasting money running broken pods? really? blaming me even when these threads here around the same time were indicating it was a shared issue by more than one user :maaaaan:
well i went to cu128 container as recommended and one of the GPUs still gives ECC error i think
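For anyone trying to pin down which GPU is reporting the error, here is a rough sketch of one way to collect per-GPU detail. It assumes `nvidia-smi` is on the PATH and that the driver exposes the `ecc.errors.uncorrected.volatile.total` query field; the function names are made up for illustration and are not part of any RunPod or NVIDIA tooling.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the ECC field is assumed to be supported by the driver.
QUERY_FIELDS = "index,uuid,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Parse `nvidia-smi --format=csv,noheader` rows into dicts.

    Counts that are not plain integers (for example "[N/A]") become None.
    """
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, uuid, count = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "uuid": uuid,
            "uncorrected": int(count) if count.isdigit() else None,
        })
    return rows


def gpus_with_ecc_errors(rows):
    """Return only the GPUs whose uncorrected ECC count is known and non-zero."""
    return [row for row in rows if row["uncorrected"]]


def query_nvidia_smi():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_report(result.stdout)
```

On a pod, `gpus_with_ecc_errors(query_nvidia_smi())` would list the index and UUID of any GPU with a non-zero count, which is the kind of specific detail a support ticket can include.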
TPUs are easy versus Cerebras crap
it's really frustrating too
you know the other annoying thing is if the OS auto-updates the nvidia drivers :KEKLEO:
must be the outage mentioned from yesterday
that happens if the host system has the fabric manager crash
this is a pretty rare issue for SXM5 systems but i'm on H100 PCIe which are a lot more "meh" from past experience (not a RunPod problem, it's an NVIDIA problem)
different case
thanks for taking a look
been there. i understand
at my org we have something like 200-300 H100s and i've seen CUDA permissions error or NVML init error, but never this kinda ongoing ECC error, it's new to me, so i'm curious too
support looking at the ticket with the pod ID. i'll update here if they give any more insight into the reason
see the original post in this thread 🫠
also, 12.8 will need an entirely different version of pytorch. that isn't something someone can do in production
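As a side note on why the container tag matters: PyTorch wheels usually carry the CUDA build in the local part of the version string (for example `+cu124`), so a wheel built for one CUDA version is a different install from one built for another. The sketch below only parses such strings; the helper names and sample version strings are hypothetical, and on a real system the input would come from `torch.__version__`.

```python
import re


def cuda_tag_of_build(version_string):
    """Return the CUDA tag (e.g. "cu124") from a version like "2.6.0+cu124", else None."""
    _, _, local = version_string.partition("+")
    return local if re.fullmatch(r"cu\d{2,}", local) else None


def cuda_version_from_tag(tag):
    """Turn a tag such as "cu128" into "12.8" (last digit is the minor version)."""
    match = re.fullmatch(r"cu(\d+)(\d)", tag.lower())
    if match is None:
        raise ValueError(f"not a CUDA tag: {tag!r}")
    return f"{match.group(1)}.{match.group(2)}"


def build_matches_container(version_string, container_tag):
    """True when the wheel's CUDA tag equals the container's tag (case-insensitive)."""
    build_tag = cuda_tag_of_build(version_string)
    return build_tag is not None and build_tag == container_tag.lower()
```

With these helpers, `build_matches_container("2.6.0+cu124", "cu128")` is false, which is the situation described above: moving the container to 12.8 means reinstalling a matching PyTorch build.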
it's not the problem..
it's H100s. it's not going to be 11.8. 🙂
Cu124