ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor
Hi I keep getting
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem.
11 Replies
It's not possible to change the shm size on RunPod.
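Since the shm size is fixed, it can help to check how much the pod actually provides before tuning anything. A minimal sketch using only the Python standard library, assuming shared memory is mounted at `/dev/shm` (the usual Linux location; the helper name `shm_usage` is made up for this example):

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in MiB for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return tuple(round(v / mib, 1) for v in (usage.total, usage.used, usage.free))


if __name__ == "__main__":
    # /dev/shm is an assumption; skip quietly if it is not mounted here.
    if os.path.exists("/dev/shm"):
        total, used, free = shm_usage()
        print(f"/dev/shm: total={total} MiB, used={used} MiB, free={free} MiB")
    else:
        print("/dev/shm not found on this system")
```

Running this inside the pod while training is active shows whether `/dev/shm` is actually filling up.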
Okay, do you know how to avoid the problem?
Umm, lower the batch size maybe.
Also, changing the number of DataLoader workers seems to have an effect.
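Both suggestions point the same way: PyTorch DataLoader workers hand batches to the main process through shared memory, so the amount in flight grows with batch size and with worker count. A rough back-of-the-envelope sketch, assuming each worker keeps `prefetch_factor` batches queued (the DataLoader default is 2 when `num_workers > 0`) and that a batch is dominated by one dense tensor. The function name and the example shapes are made up for illustration, and real usage will differ:

```python
def prefetch_bytes(batch_size, sample_shape, bytes_per_element=4,
                   num_workers=4, prefetch_factor=2):
    """Rough estimate of bytes held in shared memory by prefetched batches."""
    elements = 1
    for dim in sample_shape:
        elements *= dim
    batch_bytes = batch_size * elements * bytes_per_element
    # With num_workers=0 data is loaded in the main process,
    # so nothing is passed between processes via shared memory.
    in_flight = max(num_workers, 0) * prefetch_factor
    return batch_bytes * in_flight


if __name__ == "__main__":
    # Hypothetical setup: 64 float32 images of 3x224x224, 8 workers.
    estimate = prefetch_bytes(64, (3, 224, 224), num_workers=8)
    print(f"{estimate / 2**20:.0f} MiB")  # 588 MiB
```

If the estimate is close to the size of `/dev/shm`, lowering `num_workers` or `prefetch_factor` may let you keep a larger batch size, at the cost of slower data loading.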
But now I'm only utilizing 20% of GPU memory
I'm not sure what you're training or what you're using.
Sorry, I'm training an image transformer model with pytorch.
RunPod Pytorch 2.2.0
@Papa Madiator Is shm the same for all pod types?
Yes, it's static.
Can't train anything with a batch size larger than 16 😦
I just get
Killed
now.
Bumping this again: setting the shm size is essential for distributed training on GPU.
The shm size can't be changed at the pod level.
Usually, the server memory depends on the type of GPU card and the number of GPUs in your pod. If you need more physical server memory, you can either:
1. Upgrade to a higher-end GPU card for your pod.
2. Use multiple lower-end GPUs.
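A bare `Killed` message, as reported earlier in the thread, typically means the kernel's out-of-memory killer ended the process, which points at the pod's overall memory limit rather than shm alone. A minimal sketch for reading that limit from inside the container, assuming it is enforced through Linux cgroups and exposed at one of the standard cgroup v2 or v1 paths (whether a given pod exposes these files is an assumption; the helper names are made up for this example):

```python
from pathlib import Path

# Standard cgroup v2 and v1 locations; either may be absent in a given container.
CGROUP_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_limit(text):
    """Parse a cgroup memory limit; cgroup v2 reports 'max' when unlimited."""
    text = text.strip()
    if text == "max":
        return None
    return int(text)


def container_memory_limit(paths=CGROUP_FILES):
    """Return the memory limit in bytes, or None if unlimited or not found."""
    for path in paths:
        candidate = Path(path)
        if candidate.is_file():
            try:
                return parse_limit(candidate.read_text())
            except (OSError, ValueError):
                continue
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No cgroup memory limit found")
    else:
        print(f"Memory limit: {limit / 2**30:.1f} GiB")
```

On cgroup v1 an unlimited container reports a very large number instead of `max`, so treat an implausibly large value as "no limit".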