Linux kernel version is 5.4.0
per accelerate:
https://github.com/huggingface/accelerate/blob/85a75d4c3d0deffde2fc8b917d9b1ae1cb580eb2/src/accelerate/utils/other.py#L314C1-L331C1
Basically, H100s are currently unusable for me: training hangs when I use accelerate to train models.
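For anyone who wants to check their own machine, here is a minimal sketch. It assumes the linked accelerate lines are its kernel-version check and that the recommended minimum is 5.5.0; that is my reading of the link, not something confirmed in this thread. The function names are hypothetical and are not accelerate's API.

```python
import platform
import re
from typing import Tuple

# Assumption: 5.5.0 is the recommended minimum kernel; adjust if the linked code says otherwise.
MIN_KERNEL = (5, 5, 0)


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a release string like '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"Unrecognized kernel release string: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '6.1') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def kernel_below_minimum(release: str, minimum: Tuple[int, int, int] = MIN_KERNEL) -> bool:
    """Return True if the kernel release is older than the given minimum."""
    return parse_kernel_release(release) < minimum


if __name__ == "__main__":
    if platform.system() == "Linux":
        release = platform.release()
        if kernel_below_minimum(release):
            print(f"Kernel {release} is below {'.'.join(map(str, MIN_KERNEL))}")
        else:
            print(f"Kernel {release} meets the assumed minimum")
    else:
        print("Not running on Linux; check skipped")
```

With the 5.4.0 kernel mentioned above, `kernel_below_minimum` would return True under that assumed minimum.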
@Alpay Ariyak