#ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor

dry lintel
#

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem.
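
For reference, a minimal sketch of how to check how much shared memory a container actually exposes. It assumes a Linux pod where shm is mounted at /dev/shm; the helper name `shm_usage_gib` is made up for illustration and is not part of PyTorch or RunPod.

```python
import os
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (total, used, free) in GiB for the given mount, or None if it is missing.

    Assumption: shared memory is mounted at /dev/shm, as on most Linux containers.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return (usage.total / gib, usage.used / gib, usage.free / gib)


if __name__ == "__main__":
    # Prints a (total, used, free) tuple in GiB, or None if /dev/shm is absent.
    print(shm_usage_gib())
```

Running this both locally and on the pod may help explain why the error only shows up in one place, since a much smaller total on the pod would point at the shm limit.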

compact haven
#

it's not possible to change the shm size on RunPod

dry lintel
#

Okay, do you know how to avoid the problem?

compact haven
#

umm, lower the batch size maybe

dry lintel
#

Also, changing the number of DataLoader workers seems to have an effect

#

But now I'm only utilizing 20% of GPU memory
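
A rough sketch of why both the batch size and the worker count matter here: DataLoader workers hand finished batches to the main process through shared memory, so the amount in flight grows with batch size, number of workers, and how many batches each worker prefetches. The function name `estimate_shm_bytes` and the 3x224x224 float32 image shape are assumptions for illustration only; real usage depends on the dataset and collate function.

```python
from math import prod


def estimate_shm_bytes(batch_size, sample_shape, bytes_per_element,
                       num_workers, prefetch_factor=2):
    """Rough estimate of bytes of batches held in shared memory at once.

    Assumes every worker keeps `prefetch_factor` full batches queued
    (2 is the PyTorch DataLoader default when num_workers > 0) and that
    each sample is a single dense tensor of `sample_shape`.
    """
    bytes_per_sample = prod(sample_shape) * bytes_per_element
    bytes_per_batch = batch_size * bytes_per_sample
    batches_in_flight = max(num_workers, 1) * prefetch_factor
    return bytes_per_batch * batches_in_flight


if __name__ == "__main__":
    # Hypothetical setup: float32 images of shape 3x224x224.
    for workers in (2, 8):
        for batch in (16, 128):
            size = estimate_shm_bytes(batch, (3, 224, 224), 4, workers)
            print(f"workers={workers} batch={batch}: {size / 1024**2:.1f} MiB")
```

This only covers the batches passed between processes, not GPU memory, so a setup can run out of shm while the GPU is still mostly empty.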

compact haven
#

I'm not sure what you're training or what you're using

dry lintel
#

Sorry, I'm training an image transformer model with PyTorch, using the RunPod PyTorch 2.2.0 image.

dry lintel
#

@compact haven Is shm the same for all pod types?

compact haven
#

yes it's static

dry lintel
#

Can't train anything with a batch size larger than 16 😦
I just get Killed now

icy glacier
#

bumping this again, setting shm size is essential for distributed training on GPU

compact haven
#

shm size can't be changed at the pod level