I'm training models locally on Arch Linux with two 8 GB 5500 XT GPUs, using the latest mainline RVC and ROCm 5.2. After a while, training stops working and, nearly 30 minutes in, fails with this error message:
Process Process-1:
Traceback (most recent call last):
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/thetrustedcomputer/Software/Git/RVC/infer/modules/train/train.py", line 268, in run
    train_and_evaluate(
  File "/home/thetrustedcomputer/Software/Git/RVC/infer/modules/train/train.py", line 496, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/home/thetrustedcomputer/Software/Git/RVC/python-3.10/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/thetrustedcomputer/Software/Git/RVC/python-3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 199, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
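For what it's worth, the 1800000 ms in the message works out to exactly 30 minutes, which as far as I can tell matches the default timeout of `torch.distributed` process groups. If so, the hang itself may start well before the error is printed. Here is a quick sketch of that arithmetic in plain Python; the helper name `gloo_timeout` is made up for illustration and is not part of RVC or PyTorch:

```python
import re
from datetime import timedelta


def gloo_timeout(message: str) -> timedelta:
    """Extract the 'Timed out waiting <N>ms' value from an error message.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"Timed out waiting (\d+)ms", message)
    if match is None:
        raise ValueError("no timeout found in message")
    return timedelta(milliseconds=int(match.group(1)))


# Last line of the traceback above, copied verbatim.
error = (
    "RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] "
    "Timed out waiting 1800000ms for recv operation to complete"
)

timeout = gloo_timeout(error)
print(timeout.total_seconds() / 60)  # 30.0 (minutes)
```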
There appears to be a deadlock: training never continues unless I restart the RVC process to resume from the checkpoint. Even then, radeontop shows one GPU still under full load after RVC exits, and I have to kill the leftover Python processes to make that GPU idle.
The prebuilt PyTorch 1.13 stable and 2.0.0 nightly versions are both affected. Single-GPU training works fine but is twice as slow. How can I resolve this issue? Are the developers already aware of it? If so, please provide a link to the GitHub issue. Thank you very much!

