RunPod • 12mo ago
KurtUwe

H100 PCIe and SXM stability issues

I have been working on an 8xH100 PCIe system. While it initially worked well, after some time the GPUs start throwing CUDA errors, and overall the system seems unstable. I always install Transformer Engine to enable FP8, so maybe some incompatibility has come up. Then I got the chance to test an SXM system, but strangely with this one (a 6x) the whole process halted just before training. I'm using axolotl for everything. Thanks.
0 Replies