ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor
Hi, I keep getting
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem.
8 Replies
it's not possible to edit shm on RunPod
Okay, do you know how to avoid the problem?
umm, lower the batch size maybe
Also, changing the number of dataloader workers seems to have an effect
But now I'm only utilizing 20% of GPU memory
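A rough way to see why both the batch size and the number of DataLoader workers matter: with worker processes, PyTorch hands finished batches to the main process through shared memory, and each worker keeps up to `prefetch_factor` batches ready (2 by default). The sketch below is only a back-of-envelope estimate. The function name `estimate_shm_bytes`, the 3x224x224 float32 image shape, and the worker count are assumptions for illustration, not values from this thread.

```python
def estimate_shm_bytes(batch_size, sample_shape, bytes_per_element=4,
                       num_workers=4, prefetch_factor=2):
    """Rough estimate of batch data waiting in shared memory.

    Hypothetical helper: assumes every worker holds `prefetch_factor`
    complete batches and ignores labels, padding, and allocator overhead.
    """
    elements_per_sample = 1
    for dim in sample_shape:
        elements_per_sample *= dim
    batch_bytes = batch_size * elements_per_sample * bytes_per_element
    in_flight_batches = num_workers * prefetch_factor
    return batch_bytes * in_flight_batches


if __name__ == "__main__":
    # Assumed input: float32 images of shape (3, 224, 224), 8 workers.
    for bs in (16, 64, 256):
        est = estimate_shm_bytes(bs, (3, 224, 224), num_workers=8)
        print(f"batch_size={bs:4d} -> ~{est / 2**20:.0f} MiB in flight")
```

If the estimate is close to the size of `/dev/shm`, reducing `num_workers` or `prefetch_factor` lowers shared-memory pressure without touching the batch size, at the cost of slower data loading.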
I'm not sure what you're training and what you're using
Sorry, I'm training an image transformer model with pytorch.
RunPod Pytorch 2.2.0
@Papa Madiator Is shm the same for all pod types?
yes it's static
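Since the shm size is fixed, it can help to check how much is actually available inside the pod before picking a batch size. A minimal sketch, assuming a Linux container where shared memory is mounted at `/dev/shm`; the helper name `shm_usage` is hypothetical.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the shared-memory mount.

    Returns None if the path does not exist (for example on non-Linux systems).
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


if __name__ == "__main__":
    info = shm_usage()
    if info is None:
        print("/dev/shm not found on this system")
    else:
        total, used, free = info
        print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
              f"used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```

Running this once before training and once while the dataloader is active shows how close the job gets to the limit.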
Can't train anything with a batch size larger than 16 😦
I just get
Killed
now