Faulty node?
Since this morning, I've encountered this error multiple times: 'CUDA error: uncorrectable ECC error encountered'.
Every time, after terminating the pod and starting a new one, the problem went away.
All incidents were on US-GA-2, H100-PCIe
10 Replies
Hey, could you share the pod IDs here? We can take a look.
I didn't note them down, but I just encountered another one: 3sc3qsn1qhu0mz
Another one. Two in a row: 3g93y1byjkjq1o
Can I get a refund for the time I wasted on these? I've had more than 10 of these in the past 2 days.
Another one: g91ov3ym70j0rc
BTW, this happens when I run kohya-scripts, but the exact same script and config work on non-faulty nodes.
I think if you hit unusual errors, just stop using that pod first and try another region.
But yeah, I think they can refund it if the error isn't on your side.
All three pods landed on the same machine, and I’ve delisted that machine to avoid further issues. I’ll DM you with more details.
@yhlong00000 Hey there, just got two more faulty instances: 6kxad780u6bda9, oweexcwlv8y62k. Same error. H100 NVL
Also: hq57ofbzb1xmhb, cz6iu4pzb8z8h4
Wow, that's a lot. All H100?
Yep.
Probably the same machine. It seems to start a bit slower than working ones.
What is the error message you see in the container log? The server looks good from my end.
Same as before: 'CUDA error: uncorrectable ECC error encountered' when I ran kohya scripts. The container itself launched fine.
All the pods you listed above ran on the same machine and are associated with the same card. I will reach out to the DC and check it. Ping me here if you see more of this error.
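For anyone who lands on this thread with the same error: a minimal sketch of a pre-flight check you could run inside a fresh pod before starting a long job, assuming `nvidia-smi` is available in the container. It reads the GPU UUID and the uncorrected ECC counters, so a bad card can be spotted early and its UUID reported along with the pod ID. The function names (`query_gpus`, `parse_ecc_report`, `faulty_gpus`) are made up for this example, and some driver and GPU combinations report `[N/A]` for these counters, which the sketch treats as "unknown" rather than "healthy". A zero count also does not guarantee that the card is fine.

```python
import csv
import io
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = (
    "uuid,name,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def query_gpus():
    """Run nvidia-smi and return its raw CSV output (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def _to_count(value):
    """Convert a counter field to int, or None for values like '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_ecc_report(csv_text):
    """Parse the CSV text into a list of dicts, one per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 4:
            continue  # skip blank or unexpected lines
        uuid, name, volatile, aggregate = (field.strip() for field in record)
        rows.append(
            {
                "uuid": uuid,
                "name": name,
                # Volatile counters reset when the driver reloads.
                "volatile": _to_count(volatile),
                # Aggregate counters persist, so they may include older errors.
                "aggregate": _to_count(aggregate),
            }
        )
    return rows


def faulty_gpus(rows):
    """Return GPUs whose uncorrected ECC counters are known and non-zero."""
    return [
        row
        for row in rows
        if (row["volatile"] or 0) > 0 or (row["aggregate"] or 0) > 0
    ]
```

To use it, call `faulty_gpus(parse_ecc_report(query_gpus()))` at pod startup and, if the list is non-empty, stop the pod and post the UUIDs here with the pod ID.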