RunPod•10mo ago
Cajoek

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally. I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management, but I'm not sure how to fix the problem.
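For context, the strategy-management page I linked describes switching the tensor sharing strategy. A minimal sketch of what that looks like (I'm not sure it's actually the right fix for this error):
```python
import torch.multiprocessing as mp

# Inspect what the platform supports; on Linux the default is 'file_descriptor'.
print(mp.get_all_sharing_strategies())
print(mp.get_sharing_strategy())

# Switch worker tensor sharing to named files in shared memory;
# this has to run before any DataLoader workers are started.
mp.set_sharing_strategy('file_system')
```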
11 Replies
Madiator2011
Madiator2011•10mo ago
It's not possible to edit the shm size on RunPod.
Cajoek
CajoekOP•10mo ago
Okay, do you know how to avoid the problem?
Madiator2011
Madiator2011•10mo ago
Umm, maybe lower the batch size?
Cajoek
CajoekOP•10mo ago
Also, changing the number of DataLoader workers seems to have an effect, but now I'm only utilizing 20% of GPU memory.
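Roughly what my loader looks like (simplified sketch; `train_dataset` and the exact numbers stand in for my real values):
```python
from torch.utils.data import DataLoader

# Worker processes hand batches back to the main process through /dev/shm,
# so shm pressure grows with num_workers and batch size.
train_loader = DataLoader(
    train_dataset,   # placeholder for my actual image dataset
    batch_size=16,   # larger batches hit the shm bus error
    num_workers=2,   # lowering this (0 = load in the main process) avoids the crash
    pin_memory=True,
    shuffle=True,
)
```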
Madiator2011
Madiator2011•10mo ago
I'm not sure what you're training or what you're using.
Cajoek
CajoekOP•10mo ago
Sorry, I'm training an image transformer model with PyTorch, on the RunPod PyTorch 2.2.0 image. @Papa Madiator Is shm the same for all pod types?
Madiator2011
Madiator2011•10mo ago
Yes, it's static.
Cajoek
CajoekOP•10mo ago
I can't train anything with a batch size larger than 16 😦 I just get "Killed" now.
Santosh
Santosh•3w ago
Bumping this again; setting the shm size is essential for distributed training on GPUs.
Madiator2011
Madiator2011•3w ago
The shm size can't be changed at the pod level.
yhlong00000
yhlong00000•3w ago
Usually, the server memory depends on the type of GPU card and the number of GPUs in your pod. If you need more physical server memory, you can either:
1. Upgrade to a higher-end GPU card for your pod.
2. Use multiple lower-end GPUs.
