RunPod · 3w ago
bwa

Faulty node?

Since this morning, I've encountered this error multiple times: 'CUDA error: uncorrectable ECC error encountered'. Every time, after terminating the pod and starting a new one, the problem went away. All incidents were on US-GA-2, H100-PCIe.
10 Replies
yhlong00000 · 3w ago
Hey, could you share the pod IDs here? We can take a look.
bwa (OP) · 3w ago
Didn't note them down, but I just encountered one again: 3sc3qsn1qhu0mz
Another one. Two in a row: 3g93y1byjkjq1o
Can I get a refund for the time I wasted on these? I've had more than 10 of these in the past 2 days.
Another one: g91ov3ym70j0rc
BTW, this happens when I run kohya-scripts. But the exact same script and config work on non-faulty nodes.
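For anyone hitting the same thing, a pre-flight check before launching training can record the pod ID and the GPU's uncorrectable ECC counter so there is something concrete to report. The following is a minimal sketch, not an official RunPod tool: it assumes `nvidia-smi` is on the PATH inside the container and that the pod ID is exposed through a `RUNPOD_POD_ID` environment variable; the helper names (`parse_ecc_csv`, `read_ecc_counts`, `faulty_gpus`, `main`) are made up for illustration. A zero counter does not prove the card is healthy, since the volatile counter resets on driver reload and an error may only appear under load.

```python
import os
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        index, name, count = parts[0], ",".join(parts[1:-1]), parts[-1]
        # GPUs without ECC support report "[N/A]" instead of a number.
        errors = int(count) if count.isdigit() else None
        gpus.append({"index": int(index), "name": name, "uncorrected": errors})
    return gpus


def read_ecc_counts():
    """Ask nvidia-smi for the uncorrectable ECC counter of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)


def faulty_gpus(gpus):
    """Return the GPUs whose uncorrectable ECC counter is non-zero."""
    return [gpu for gpu in gpus if gpu["uncorrected"]]


def main():
    # Assumption: the pod ID is available as RUNPOD_POD_ID inside the pod.
    pod_id = os.environ.get("RUNPOD_POD_ID", "unknown")
    gpus = read_ecc_counts()
    print(f"pod={pod_id} gpus={gpus}")
    bad = faulty_gpus(gpus)
    if bad:
        raise SystemExit(f"pod {pod_id}: uncorrectable ECC errors on {bad}")
```

Calling `main()` at the start of a training launcher would print the pod ID on every run and stop early if the counter is already non-zero.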
nerdylive · 3w ago
I think if you hit abnormal errors like this, stop using that pod first and try another region. But yeah, I think they can refund it if the error isn't on your side.
yhlong00000 · 3w ago
All three pods landed on the same machine, and I’ve delisted that machine to avoid further issues. I’ll DM you with more details.
bwa (OP) · 2w ago
@yhlong00000 Hey there, just got two more faulty instances: 6kxad780u6bda9, oweexcwlv8y62k. Same error. H100 NVL.
Also: hq57ofbzb1xmhb, cz6iu4pzb8z8h4
nerdylive · 2w ago
Wow, that's a lot, huh. All H100?
bwa (OP) · 2w ago
Yep. Probably the same machine. It seems to start a bit slower than working ones.
yhlong00000 · 2w ago
What is the error message you see in the container log? The server looks good from my end.
bwa (OP) · 2w ago
Same as before: 'CUDA error: uncorrectable ECC error encountered' when I ran kohya scripts. The container itself launched fine.
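Since the container starts fine and the error only shows up once the training script runs, one way to catch it early is to wrap the training command and watch its output for the error string quoted above. This is a rough sketch under that assumption; `run_and_check` and `has_ecc_error` are hypothetical helper names, and the command passed in is whatever you normally launch.

```python
import subprocess
import sys

# The message reported in this thread.
ECC_SIGNATURE = "uncorrectable ECC error encountered"


def has_ecc_error(log_text):
    """Return True if the text contains the ECC error message."""
    return ECC_SIGNATURE.lower() in log_text.lower()


def run_and_check(cmd):
    """Run cmd, echo its combined output, and return (exit_code, ecc_seen)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    ecc_seen = False
    for line in proc.stdout:
        sys.stdout.write(line)  # keep the normal training log visible
        if has_ecc_error(line):
            ecc_seen = True
    return proc.wait(), ecc_seen
```

If `ecc_seen` comes back `True`, that is a good moment to note the pod ID before terminating the pod, so support can map it to a machine.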
yhlong00000 · 2w ago
All the pods you listed above ran on the same machine and are associated with the same card. I will reach out to the DC and check it. Ping me here if you see more of this error.