#Problems with distributed training using AMD GPUs on RVC


trim quartz
#

I'm training models with two 8 GB 5500 XT GPUs on the latest mainline RVC, using ROCm 5.2 locally on Arch Linux. After a while, it stops working, and nearly 30 minutes in it gives me this error message:

Process Process-1:
Traceback (most recent call last):
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/thetrustedcomputer/Software/Git/RVC/infer/modules/train/train.py", line 268, in run
    train_and_evaluate(
  File "/home/thetrustedcomputer/Software/Git/RVC/infer/modules/train/train.py", line 496, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/home/thetrustedcomputer/Software/Git/RVC/python-3.10/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/thetrustedcomputer/Software/Git/RVC/python-3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 199, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete

There appears to be a deadlock somehow, so it never actually continues unless I restart the RVC process to resume from the checkpoint. Even then, radeontop shows one GPU still under full load after RVC exits. I additionally had to kill these Python processes to make that GPU idle.
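
For reference, the number in the Gloo error is a timeout in milliseconds. A minimal sketch (the helper name `gloo_timeout_from_error` is made up for illustration) that pulls it out of the message and converts it shows that 1800000ms is exactly 30 minutes, which lines up with how long the error took to appear:

```python
import re
from datetime import timedelta
from typing import Optional


def gloo_timeout_from_error(message: str) -> Optional[timedelta]:
    """Extract the 'Timed out waiting <N>ms' value from a Gloo error message.

    Hypothetical helper for illustration; returns None if the pattern is absent.
    """
    match = re.search(r"Timed out waiting (\d+)ms", message)
    if match is None:
        return None
    return timedelta(milliseconds=int(match.group(1)))


error = (
    "RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] "
    "Timed out waiting 1800000ms for recv operation to complete"
)
timeout = gloo_timeout_from_error(error)
print(timeout.total_seconds() / 60, "minutes")
```

In other words, the error would only be raised once a rank has already been waiting on the other one for the full timeout, so the actual stall probably starts well before the traceback shows up.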

Prebuilt PyTorch versions 1.13 stable and 2.0.0 nightly are both affected. Single-GPU training works fine but is twice as slow. How can I resolve this issue? And are the developers already aware of it? If so, please provide a link to the GitHub issue. Thank you very much!

deft hedge
#

there's several other examples of that one issue happening

#
#

Unrelated, but there's apparently a way to re-enable support for navi1 cards on the latest ROCm, though the process is quite tedious

trim quartz
#

It's interesting to see that other users are facing the same Gloo timeout issue.

#

I've Googled this error, but there's not much in terms of solutions.

deft hedge
#

...Oh and use the arch repo-provided torch as well, else it'll rely on what pip downloads if I remember correctly

trim quartz
#

Indeed, Arch does have them in their repos.

extra/python-pytorch 2.2.0-1 [installed]
    Tensors and Dynamic neural networks in Python with strong GPU acceleration
extra/python-pytorch-cuda 2.2.0-1
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with CUDA)
extra/python-pytorch-opt 2.2.0-1
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with AVX2 CPU optimizations)
extra/python-pytorch-opt-cuda 2.2.0-1
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with CUDA and AVX2 CPU optimizations)
extra/python-pytorch-opt-rocm 2.2.0-1
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with ROCm and AVX2 CPU optimizations)
extra/python-pytorch-rocm 2.2.0-1
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with ROCm)
extra/python-torchvision 0.16.1-2 [installed]
    Datasets, transforms, and models specific to computer vision
extra/python-torchvision-cuda 0.16.1-2
    Datasets, transforms, and models specific to computer vision (with GPU support)
extra/torchvision 0.16.1-2
    Datasets, transforms, and models specific to computer vision (C++ library only)
extra/torchvision-cuda 0.16.1-2
    Datasets, transforms, and models specific to computer vision (C++ library only with GPU support)

#

And it's also nice to have a PKGBUILD tailored for specific cards.

deft hedge
#

If needed. Should work as long as you alter the current 6.0.0-1 release and not any newer ones. Supposedly this makes it generate the required files for navi cards.

The version I sent will only compile for gfx1010 and nothing else, just to make it build faster. Hope this isn't an issue.

#

This shit is so frustrating, I can't believe it's allowed to be released in such a state. It's pretty much the same mess for polaris (rx 4xx & rx 5xx) cards

trim quartz
#

One good thing is that ROCm is open source, so it allows us to patch code for older cards.

trim quartz
deft hedge
#

I'm on gfx1032 and there's no way to generate the optimized files for those, can only do gfx1030 and then override to that

#

So you gotta generate the gfx1010 files and then override here too, but it shouldn't cause as many issues as when you try the 10.3.0 override on a 1010
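
To illustrate the override being discussed: assuming it refers to the `HSA_OVERRIDE_GFX_VERSION` environment variable (the thread never names it), a rough sketch of setting it from Python before PyTorch loads could look like this. The helper name and the two-entry mapping are only for illustration and cover just the build targets mentioned above.

```python
import os

# Assumption: the "override" is HSA_OVERRIDE_GFX_VERSION, whose value is the
# dotted form of the gfx target the ROCm libraries were built for.
# Only the two targets mentioned in this thread are listed.
OVERRIDE_FOR_TARGET = {
    "gfx1010": "10.1.0",  # e.g. ROCm built only for gfx1010
    "gfx1030": "10.3.0",  # e.g. a gfx1032 card reusing gfx1030 files
}


def apply_gfx_override(build_target: str) -> str:
    """Set the override for the given build target and return the value used.

    Hypothetical helper; it must run before torch is imported, because the
    ROCm runtime reads the variable when it initializes.
    """
    version = OVERRIDE_FOR_TARGET[build_target]
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = version
    return version


apply_gfx_override("gfx1030")
# import torch  # would go here, after the variable is set
```

Exporting the variable in the shell before launching the training script should have the same effect; the Python form is just easier to keep next to the code that depends on it.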

trim quartz
#

@scenic lagoon Do you have the same problem as me? If not, go to the dedicated help channels.

trim quartz
#

@deft hedge I forgot to mention that I'm using it inside a Python 3.10 venv, so it's not going to use Arch's packages.
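
A quick way to confirm which PyTorch a given interpreter actually picks up is sketched below. It assumes nothing beyond the standard `sys.prefix` / `sys.base_prefix` comparison and the `torch.version.hip` attribute; the function names are made up for illustration. A venv created without `--system-site-packages` will not see the distro's `python-pytorch-rocm` package at all.

```python
import sys


def in_venv() -> bool:
    """True when the interpreter runs inside a virtual environment."""
    return sys.prefix != sys.base_prefix


def torch_build_info():
    """Return (torch version, HIP version), or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.hip is None on builds without ROCm support.
    return torch.__version__, getattr(torch.version, "hip", None)


print("running in venv:", in_venv())
print("torch build:", torch_build_info())
```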

deft hedge
#

Oh yeah 😅

#

Python is lovely isn't it ?

trim quartz
#

So your solution will definitely not work.

deft hedge
#

I mean, maybe messing with LD_PRELOAD to tell the venv to load the system ROCm libs would work

#

Or not, never mind. All it takes is one single thing not being compiled for gfx1010 and it's over, and the PyTorch releases still build against ROCm 5.7, so no kittyblep

trim quartz
#

@deft hedge I took the time to build the ROCm toolchain from source to target my card, and sadly, the deadlock remains. At least I have a later version (5.4.3) that works without the override.

deft hedge
trim quartz
#

In fact, I discovered that the minimum ROCm version needed to build the latest PyTorch is 5.4. Anything lower won't build without patching.

trim quartz
#

I did this inside an Ubuntu 22.04 Docker container, so I know the build works. Compared with the prebuilt package, I've noticed that mine is a bit slower when running the unit tests.

#

Prebuild:

Using cuda device
Downloading dataset...
Loading dataset...
/home/thetrustedcomputer/Desktop/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:521: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Epoch: 0020 (0.0991s) loss_train: 1.8383 acc_train: 0.5722 loss_val: 1.8489 acc_val: 0.4975
Epoch: 0040 (0.0992s) loss_train: 1.7356 acc_train: 0.7310 loss_val: 1.7581 acc_val: 0.6525
Epoch: 0060 (0.0991s) loss_train: 1.6418 acc_train: 0.7897 loss_val: 1.6742 acc_val: 0.7225
Epoch: 0080 (0.0990s) loss_train: 1.5715 acc_train: 0.8096 loss_val: 1.6111 acc_val: 0.7550
Epoch: 0100 (0.0991s) loss_train: 1.5301 acc_train: 0.8348 loss_val: 1.5773 acc_val: 0.7950
Epoch: 0120 (0.0989s) loss_train: 1.4854 acc_train: 0.8601 loss_val: 1.5404 acc_val: 0.8000
Epoch: 0140 (0.0990s) loss_train: 1.4703 acc_train: 0.8691 loss_val: 1.5295 acc_val: 0.8125
Epoch: 0160 (0.0990s) loss_train: 1.4487 acc_train: 0.8818 loss_val: 1.5100 acc_val: 0.8125
Epoch: 0180 (0.0989s) loss_train: 1.4363 acc_train: 0.8827 loss_val: 1.5002 acc_val: 0.8300
Epoch: 0200 (0.0990s) loss_train: 1.4338 acc_train: 0.8935 loss_val: 1.5048 acc_val: 0.8175
Epoch: 0220 (0.0990s) loss_train: 1.4095 acc_train: 0.8944 loss_val: 1.4804 acc_val: 0.8100
Epoch: 0240 (0.0989s) loss_train: 1.4118 acc_train: 0.9143 loss_val: 1.4835 acc_val: 0.8250
Epoch: 0260 (0.0990s) loss_train: 1.4041 acc_train: 0.9025 loss_val: 1.4705 acc_val: 0.8375
Epoch: 0280 (0.0989s) loss_train: 1.3981 acc_train: 0.9043 loss_val: 1.4669 acc_val: 0.8400
Epoch: 0300 (0.0989s) loss_train: 1.4038 acc_train: 0.9152 loss_val: 1.4782 acc_val: 0.8350
Test set results: loss 1.4906 accuracy 0.7892
#

My build:

Using cuda device
Dataset already downloaded...
Loading dataset...
Epoch: 0020 (0.1179s) loss_train: 1.8383 acc_train: 0.5722 loss_val: 1.8489 acc_val: 0.4975
Epoch: 0040 (0.1178s) loss_train: 1.7356 acc_train: 0.7310 loss_val: 1.7581 acc_val: 0.6525
Epoch: 0060 (0.1179s) loss_train: 1.6418 acc_train: 0.7897 loss_val: 1.6742 acc_val: 0.7225
Epoch: 0080 (0.1180s) loss_train: 1.5715 acc_train: 0.8096 loss_val: 1.6111 acc_val: 0.7550
Epoch: 0100 (0.1181s) loss_train: 1.5301 acc_train: 0.8348 loss_val: 1.5773 acc_val: 0.7950
Epoch: 0120 (0.1180s) loss_train: 1.4854 acc_train: 0.8601 loss_val: 1.5404 acc_val: 0.8000
Epoch: 0140 (0.1180s) loss_train: 1.4703 acc_train: 0.8691 loss_val: 1.5295 acc_val: 0.8125
Epoch: 0160 (0.1179s) loss_train: 1.4487 acc_train: 0.8818 loss_val: 1.5100 acc_val: 0.8125
Epoch: 0180 (0.1178s) loss_train: 1.4363 acc_train: 0.8827 loss_val: 1.5002 acc_val: 0.8300
Epoch: 0200 (0.1179s) loss_train: 1.4338 acc_train: 0.8935 loss_val: 1.5048 acc_val: 0.8175
Epoch: 0220 (0.1179s) loss_train: 1.4095 acc_train: 0.8944 loss_val: 1.4804 acc_val: 0.8100
Epoch: 0240 (0.1176s) loss_train: 1.4118 acc_train: 0.9143 loss_val: 1.4835 acc_val: 0.8250
Epoch: 0260 (0.1178s) loss_train: 1.4041 acc_train: 0.9025 loss_val: 1.4705 acc_val: 0.8375
Epoch: 0280 (0.1177s) loss_train: 1.3981 acc_train: 0.9043 loss_val: 1.4669 acc_val: 0.8400
Epoch: 0300 (0.1178s) loss_train: 1.4038 acc_train: 0.9152 loss_val: 1.4782 acc_val: 0.8350
Test set results: loss 1.4906 accuracy 0.7892
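
To put a number on "a bit slower", here is a small sketch that averages the per-epoch timings copied from the two logs above and compares them. The variable names are arbitrary; the values are exactly the ones printed in parentheses in each log.

```python
from statistics import mean

# Per-epoch times in seconds, copied from the "Prebuild" log.
prebuilt = [
    0.0991, 0.0992, 0.0991, 0.0990, 0.0991, 0.0989, 0.0990, 0.0990,
    0.0989, 0.0990, 0.0990, 0.0989, 0.0990, 0.0989, 0.0989,
]

# Per-epoch times in seconds, copied from the "My build" log.
custom = [
    0.1179, 0.1178, 0.1179, 0.1180, 0.1181, 0.1180, 0.1180, 0.1179,
    0.1178, 0.1179, 0.1179, 0.1176, 0.1178, 0.1177, 0.1178,
]

ratio = mean(custom) / mean(prebuilt)
print(f"prebuilt: {mean(prebuilt):.4f}s  custom: {mean(custom):.4f}s")
print(f"custom build is about {(ratio - 1) * 100:.0f}% slower per epoch")
```

That works out to roughly 19% slower per epoch for the custom build on this test, with identical losses and accuracies in both runs.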
trim quartz
#

UPDATE: I switched to the Zen kernel, and it seemed to remedy the deadlock. I don't know why, but I was able to train with both GPUs without hearing the fans spinning down after a while.

deft hedge
#

Really weird

#

Are the training speeds fine by the way ? Curious because I haven't seen dual-consumer AMD GPUs being used yet

trim quartz
#

@deft hedge

#

Yes, they're fine. In fact, it takes about 30 seconds per epoch on a 10-minute dataset with my two 5500 XTs.

deft hedge
#

Huh, that's like double the speeds I get on my 6600xt

#

I guess you have batch size set high ?

trim quartz
#

I set batch size to 4.

deft hedge
#

I guess AMD really cheaped out on the memory bus for the 6600xt then

#

thing offers barely more bandwidth than the 5500xt 8GB

trim quartz
#

And for a 16-minute dataset => roughly 40 seconds per epoch

#

So yeah, with my custom build of ROCm for the latest PyTorch and RVC's distributed training capabilities, creating models locally has become possible for me, at least for now.

#

I can't use Applio to train, though, because it uses the newer PyTorch API. So I'm stuck with mainline RVC unless I use my custom build from Docker.