Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem.
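For anyone hitting the same error, a quick way to see how much shared memory the pod actually has is to inspect the /dev/shm mount. This is only a minimal sketch using the Python standard library; the helper name shm_usage is made up for illustration, and it assumes a Linux container where shared memory is mounted at /dev/shm.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in MiB for the mount at `path`, or None if it is absent.

    `shm_usage` is a hypothetical helper name; /dev/shm is the usual Linux
    location for shared memory, which is an assumption about the pod image.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.used // mib, usage.free // mib


result = shm_usage()
if result is None:
    print("No /dev/shm mount found on this machine")
else:
    total, used, free = result
    print(f"/dev/shm: total={total} MiB, used={used} MiB, free={free} MiB")
```

Running this both locally and on the pod may help explain why the error only shows up in one place.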
#ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor
It's not possible to edit shm on RunPod.
Okay, do you know how to avoid the problem?
Umm, lower the batch size maybe?
Also, changing the number of DataLoader workers seems to have an effect.
But now I'm only utilizing 20% of GPU memory
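Since the shm size itself can't be changed here, the knobs mentioned above are the batch size and the number of DataLoader workers. A rough sketch of what that looks like in PyTorch follows. With num_workers=0 batches are loaded in the main process, so no batches are handed over from worker processes; with workers enabled, prefetch_factor * num_workers batches are prefetched across all workers. The make_loader helper and the random stand-in dataset are assumptions for illustration, not the original training code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size, num_workers=0, prefetch_factor=None):
    """Build a DataLoader with an explicit worker count (hypothetical helper).

    prefetch_factor is only passed when workers are enabled, because
    DataLoader rejects it when num_workers is 0.
    """
    kwargs = dict(batch_size=batch_size, shuffle=True, num_workers=num_workers)
    if num_workers > 0 and prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor
    return DataLoader(dataset, **kwargs)


# Tiny random stand-in for an image dataset (assumption, not real data).
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))
dataset = TensorDataset(images, labels)

# Load in the main process: slower input pipeline, but no worker processes.
loader = make_loader(dataset, batch_size=32, num_workers=0)
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)
```

A middle ground would be something like make_loader(dataset, batch_size=32, num_workers=2, prefetch_factor=1), which keeps some parallel loading while holding fewer batches in flight at once.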
I'm not sure what you're training or what you're using.
Sorry, I'm training an image transformer model with PyTorch, using the RunPod PyTorch 2.2.0 image.
@compact haven Is shm the same for all pod types?
Yes, it's static.
Can't train anything with a batch size larger than 16 😦
I just get Killed now
Bumping this again: setting the shm size is essential for distributed training on GPUs.
The shm size can't be changed at the pod level.