PyTorch Lightning training with the DDP strategy crashes with no error caught on a multi-GPU worker

It looks like a serverless worker crashes when the handler spawns new processes. It crashes right after the first process is spawned, at "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in a multi-GPU pod web terminal.
2 Replies
nerdylive, 5w ago
What process is that? Python with PyTorch Lightning code? I tried spawning multiple processes from the handler and it works fine (though I'm sure it's not the same use case as yours). Try this:
# From a bash or sh shell:
unset LOCAL_RANK

# From a Python file (maybe inside your handler):
import os
os.environ.pop('LOCAL_RANK', None)  # unlike del, does not raise KeyError if the variable is unset
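If you want to do the Python variant at the top of the handler, something like the sketch below might help. The helper name `clear_ddp_env` and its parameters are hypothetical (not part of any library), and it assumes that a leftover `LOCAL_RANK` in the worker's environment is what confuses the DDP launch; that is a guess based on the suggestion above, not a confirmed diagnosis.

```python
import os


def clear_ddp_env(environ=None, names=("LOCAL_RANK",)):
    """Remove leftover distributed-launch variables from an environment mapping.

    Hypothetical helper: `environ` defaults to os.environ, and `names` lists
    the variables to drop. Returns a dict of the variables that were removed,
    which can be logged to see whether anything was actually set.
    """
    env = os.environ if environ is None else environ
    removed = {}
    for name in names:
        if name in env:
            # pop() removes the key and returns the old value
            removed[name] = env.pop(name)
    return removed


if __name__ == "__main__":
    # Call this before constructing the trainer inside the handler.
    print("removed:", clear_ddp_env())
```

Printing the returned dict is a cheap way to check whether the variable was set on the serverless worker at all; if it comes back empty, the cause is probably something else.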
Bell Chen, 5w ago
Oh... yes. I will try