RunPod•10mo ago
Cajoek

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally. I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management, but I'm not sure how to fix the problem.
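For context, the strategy-management page I linked describes switching the tensor sharing strategy. A minimal sketch of what that looks like (I'm not sure it's actually the right fix for this error):
```python
import torch.multiprocessing as mp

# Inspect what the platform supports; on Linux the default is 'file_descriptor'.
print(mp.get_all_sharing_strategies())
print(mp.get_sharing_strategy())

# Switch worker tensor sharing to named files in shared memory;
# this has to run before any DataLoader workers are started.
mp.set_sharing_strategy('file_system')
```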
11 Replies
Madiator2011
Madiator2011•10mo ago
It's not possible to edit the shm size on RunPod.
Cajoek
CajoekOP•10mo ago
Okay, do you know how to avoid the problem?
Madiator2011
Madiator2011•10mo ago
Umm, maybe lower the batch size?
Cajoek
CajoekOP•10mo ago
Also, changing the number of DataLoader workers seems to have an effect, but now I'm only utilizing 20% of GPU memory.
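Roughly what my loader looks like (simplified sketch; `train_dataset` and the exact numbers stand in for my real values):
```python
from torch.utils.data import DataLoader

# Worker processes hand batches back to the main process through /dev/shm,
# so shm pressure grows with num_workers and batch size.
train_loader = DataLoader(
    train_dataset,   # placeholder for my actual image dataset
    batch_size=16,   # larger batches hit the shm bus error
    num_workers=2,   # lowering this (0 = load in the main process) avoids the crash
    pin_memory=True,
    shuffle=True,
)
```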
Madiator2011
Madiator2011•10mo ago
I'm not sure what you're training or what you're using.
Cajoek
CajoekOP•10mo ago
Sorry, I'm training an image transformer model with PyTorch, on the RunPod PyTorch 2.2.0 image. @Papa Madiator Is shm the same for all pod types?
Madiator2011
Madiator2011•10mo ago
Yes, it's static.
Cajoek
CajoekOP•10mo ago
I can't train anything with a batch size larger than 16 😦 I just get "Killed" now.
Santosh
Santosh•3w ago
Bumping this again; setting the shm size is essential for distributed training on GPUs.
Madiator2011
Madiator2011•3w ago
The shm size can't be changed at the pod level.
yhlong00000
yhlong00000•3w ago
Usually, the server memory depends on the type of GPU card and the number of GPUs in your pod. If you need more physical server memory, you can either:
1. Upgrade to a higher-end GPU card for your pod.
2. Use multiple lower-end GPUs.
