RunPod•7mo ago
Cajoek

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally. I found https://github.com/pytorch/pytorch#docker-image and https://pytorch.org/docs/stable/multiprocessing.html#strategy-management, but I'm not sure how to fix the problem.
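One way to see why the error shows up on the pod but not locally is to compare how much shared memory each machine exposes. The snippet below is a minimal sketch that assumes a Linux environment where shared memory is mounted at /dev/shm; the helper name `shm_usage_mib` is made up for illustration.

```python
import os
import shutil

def shm_usage_mib(path="/dev/shm"):
    """Return total/used/free space at `path` in MiB, or None if the path is missing."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total": usage.total / mib,
        "used": usage.used / mib,
        "free": usage.free / mib,
    }

# Run this on both the pod and the local machine and compare the totals.
print(shm_usage_mib())
```

If the pod reports a much smaller total than the local machine, that would be consistent with the error only appearing on RunPod.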
8 Replies
Madiator2011•7mo ago
It's not possible to edit shm on RunPod.
Cajoek•7mo ago
Okay, do you know how to avoid the problem?
Madiator2011•7mo ago
Umm, lower the batch size maybe.
Cajoek•7mo ago
Also, changing the number of dataloader workers seems to have an effect. But now I'm only utilizing 20% of GPU memory.
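Both knobs mentioned so far (batch size and number of dataloader workers) feed into how much data sits in shared memory at once. In PyTorch's `DataLoader`, each worker prefetches `prefetch_factor` batches (2 by default when `num_workers > 0`) and hands them to the main process through shared memory. The sketch below is only a rough estimate of that footprint; the image shape, dtype size, and worker count are hypothetical, it counts inputs only, and the function name `prefetch_bytes` is invented for this example.

```python
import math

def prefetch_bytes(batch_size, sample_shape, bytes_per_element=4,
                   num_workers=4, prefetch_factor=2):
    """Rough upper estimate of bytes held in prefetched input batches."""
    elements_per_sample = math.prod(sample_shape)
    batch_bytes = batch_size * elements_per_sample * bytes_per_element
    return batch_bytes * num_workers * prefetch_factor

# Hypothetical setup: 3x224x224 float32 images, 8 workers, default prefetch_factor.
for bs in (16, 64):
    mib = prefetch_bytes(bs, (3, 224, 224), num_workers=8) / 2**20
    print(f"batch_size={bs}: ~{mib:.0f} MiB in flight")
```

Under these assumptions the estimate grows linearly in both batch size and worker count, so cutting workers frees shared memory without shrinking the batch that reaches the GPU.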
Madiator2011•7mo ago
I'm not sure what you're training and what you're using.
Cajoek•7mo ago
Sorry, I'm training an image transformer model with PyTorch, using the RunPod PyTorch 2.2.0 image.
@Papa Madiator Is shm the same for all pod types?
Madiator2011•7mo ago
Yes, it's static.
Cajoek•7mo ago
I can't train anything with a batch size larger than 16 😦 I just get "Killed" now.
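A bare "Killed" with no Python traceback often means the process was stopped from outside, for example by the Linux out-of-memory killer when a container exceeds its RAM limit. That is a guess here, since the thread does not confirm it. A minimal sketch for checking the container's memory limit follows; the file paths are the standard cgroup v2 and v1 locations, but whether they are readable inside a given pod is an assumption, and `container_memory_limit` is a hypothetical helper name.

```python
from pathlib import Path

# Standard cgroup v2 and v1 limit files; availability inside a pod is assumed.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)

def container_memory_limit(paths=CGROUP_LIMIT_FILES):
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    for p in paths:
        try:
            raw = Path(p).read_text().strip()
        except OSError:
            continue  # file missing or unreadable, try the next location
        if raw.isdigit():
            # Note: cgroup v1 reports a very large number when no limit is set.
            return int(raw)
        # cgroup v2 reports the string "max" when no limit is set.
    return None

limit = container_memory_limit()
print("no limit found" if limit is None else f"{limit / 2**30:.1f} GiB")
```

Comparing this limit with the process's memory use at larger batch sizes would show whether host RAM, rather than shm or GPU memory, is what runs out.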