AstraliteHeart
RunPod
Created by AstraliteHeart on 4/9/2025 in #⛅|pods-clusters
H100 servers having issues?
Hey RunPod folks, is something going on with the H100 Secure Cloud machines? I first hit a number of weird issues on an 8xH100 (SXM) server (cross-GPU links going down randomly? Hard to say exactly what is going on; I get random timeouts in multi-GPU comms after days of work). I then tried spinning up new machines (ID: nyotnwudbsq0mu, ID: 23xahufe1yk33g), but they are stuck loading the Docker images from our private Docker registry (which works great and which I can access from other RunPod machines). Can someone please have a look?
25 replies