RunPod•11mo ago

PyTorch Lightning training DDP strategy crashed with no error caught on multi-GPU worker

It looks like a serverless worker crashes when the handler spawns new processes. It crashes right after the first process is spawned, at "Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2". The same code works fine in the web terminal of a multi-GPU pod.

2 Replies

Jason•11mo ago

What process is that? Python with PyTorch Lightning code? I tried spawning multiple processes from the handler and it works fine (though I'm sure my use case isn't the same as yours). Try this:

# From a shell (bash or sh)
unset LOCAL_RANK

# From a Python file (maybe inside your handler)
import os
os.environ.pop('LOCAL_RANK', None)  # unlike del, no KeyError if the variable is not set
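
If you want this as a reusable step at the top of the handler, here is a minimal sketch. The helper name `clear_rank_vars` and its parameters are made up for illustration; only `LOCAL_RANK` comes from this thread, and whether other variables also need clearing in your setup is an assumption you would have to check yourself.

```python
import os
from typing import Dict, Iterable, MutableMapping, Optional


def clear_rank_vars(
    names: Iterable[str] = ("LOCAL_RANK",),
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Remove the given variables from env and return the ones that were set.

    Hypothetical helper: defaults to os.environ and to LOCAL_RANK only,
    which is the variable suggested in this thread.
    """
    if env is None:
        env = os.environ
    removed: Dict[str, str] = {}
    for name in names:
        # pop with a default never raises, even if the variable is missing
        value = env.pop(name, None)
        if value is not None:
            removed[name] = value
    return removed


if __name__ == "__main__":
    # Call this before building the trainer inside the handler,
    # then log what was removed to help with debugging.
    print("cleared:", clear_rank_vars())
```

Returning the removed values makes it easy to log what the worker environment had pre-set, which may help confirm whether `LOCAL_RANK` was really the cause.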


Bell ChenOP•11mo ago

Oh... yes. I will try
