PyTorch Lightning training with the DDP strategy crashes with no error caught on a multi-GPU serverless worker
It looks like the serverless worker crashes when spawning new processes from the handler. It crashes right after the first process is spawned ("Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2"). The same code works fine in a multi-GPU pod web terminal.
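To help narrow this down, here is a minimal sketch that checks whether plain process spawning from the handler works at all in the serverless environment, before Lightning's DDP launcher gets involved. The `handler` function and its payload shape are illustrative placeholders, not a real serverless SDK signature:

```python
# Hedged diagnostic sketch: spawn one child Python process per simulated
# rank from inside the handler, mimicking the point where Lightning's DDP
# launcher starts worker processes. If this also dies silently, the issue
# is with process spawning in the worker environment, not with Lightning.
import subprocess
import sys

def handler(event: dict) -> dict:
    # "world_size" stands in for the number of GPUs/DDP ranks.
    world_size = event.get("world_size", 2)
    outputs = []
    for rank in range(world_size):
        # Each child just announces its rank and exits.
        proc = subprocess.run(
            [sys.executable, "-c", f"print('GLOBAL_RANK: {rank}')"],
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(proc.stdout.strip())
    return {"spawned": outputs}

if __name__ == "__main__":
    print(handler({"world_size": 2}))
```

If the sketch succeeds but the Lightning run still crashes, the difference is likely in how the DDP strategy forks/spawns and wires up the distributed process group inside the handler.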