KurtUwe
H100 PCIe and SXM stability issues
I have been working on an 8xH100 PCIe system. While it initially worked well, after some time the GPUs start throwing CUDA errors. Overall, the setup seems to be unstable.
I always install Transformer Engine to enable FP8, so maybe some incompatibility has come up.
Then I got the chance to test an SXM system, but strangely, on this one (a 6x) the whole process halted just before training. I'm using axolotl for everything.
Thanks.