H100 servers having issues?
Hey RunPod folks, is something going on with the H100 secure cloud machines? I first hit a number of weird issues on an 8xH100 (SXM) server (cross-GPU links going down randomly? Hard to say exactly what is going on; I get random timeouts in multi-GPU comms after days of work).
I tried spinning up new machines (ID: nyotnwudbsq0mu, ID: 23xahufe1yk33g), but they are stuck loading the Docker images from our private Docker registry (which works great and which I can access from other RunPod machines).
Can someone please have a look?
19 Replies
I am also having issues with web SSH just getting stuck on a loading/white page, even for machines I can SSH into normally.
As a data point: I just spun up a community cloud machine (8xH100 SXM) and everything works great. Web SSH connects, it loaded the Docker image super fast, and so far there are no inter-GPU connectivity issues.
Oh hey I know your name from civitai hi
@AstraliteHeart
Escalated To Zendesk
The thread has been escalated to Zendesk!
@AstraliteHeart there was a thread with a solution to not so good gpu interconnects
can send the link here
Unfortunately forgot the name
It was a post about H100s, I think
I see
It was related to something NVSwitch
was that about disabling p2p by any chance? here or on the website?
i think it was something about nvlink
So like they run commands or contact support to solve it?
Hahah
run commands to change some stuff and solved it
related to env variables
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
export CUDA_VISIBLE_DEVICES=0,1 # only see GPU 0 and 1
maybe
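If the thread being remembered was the one about disabling P2P, a rough sketch of that kind of workaround is below. It assumes the multi-GPU job uses NCCL for its comms, which nobody here has confirmed. NCCL_DEBUG and NCCL_P2P_DISABLE are real NCCL environment variables; the helper name nccl_debug_env is made up for illustration. Disabling P2P routes traffic away from direct GPU-to-GPU transfers, so it costs performance and is a diagnostic step rather than a fix.

```python
import os


def nccl_debug_env(base=None, disable_p2p=True):
    """Return a copy of an environment with NCCL diagnostic settings added.

    nccl_debug_env is a hypothetical helper, not part of any library.
    """
    # Copy so the caller's environment (or os.environ) is never mutated.
    env = dict(os.environ if base is None else base)
    # Verbose NCCL logging, useful for seeing which transport each link uses.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Turns off direct GPU peer-to-peer transport in NCCL.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```

The returned dict can be passed as env= to subprocess.run when launching the training script. If the timeouts stop with P2P disabled, that points at the GPU interconnect rather than the job itself.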
#CUDA device uncorrectable ECC error probably related
Hi and sorry for the delay in responding here, the GPUs on this machine have been reset so they should(?) be good to go again.
I'm also getting just a white page on every GPU I'm trying. Don't know if this is an issue with the service?
That sounds unrelated, do you want to make another thread explaining the issue and I can help you out?
Probably the same Cloudflare issue