#AMD GPU on Linux


clear heron
#

I got everything installed, but it's only using the CPU:
Fresh Ubuntu 20.04 VM
GPU passed through
Followed the Linux install instructions at https://github.com/invoke-ai/InvokeAI/blob/main/docs/installation/INSTALL_LINUX.md (though ln -sf on the model caused problems, so I just copied it instead)
pip install -r requirements-lin-AMD.txt

Notably, I have not installed any amd drivers or anything like that yet. Wanted input before I start installing every driver and library I can find :p
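
Before installing drivers, it can help to see what PyTorch itself reports. This is a minimal sketch, assuming the `torch` package pulled in by the requirements file is importable in the active environment; the helper name `describe_torch_device` is made up for illustration. ROCm builds of PyTorch expose the GPU through the regular `torch.cuda` API and set `torch.version.hip`.

```python
import importlib.util


def describe_torch_device() -> str:
    """Summarize whether PyTorch can see a GPU (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch  # imported lazily so the check above can fail gracefully

    # ROCm builds set torch.version.hip to a version string; other builds leave it as None.
    hip_version = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return f"CPU only (torch {torch.__version__}, hip={hip_version})"
    name = torch.cuda.get_device_name(0)
    return f"GPU: {name} (torch {torch.__version__}, hip={hip_version})"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this reports `hip=None`, the installed wheel is probably not a ROCm build, and no driver install will change that on its own.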

#
$ lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 05)
00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
[etc]
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)
02:00.0 Communication controller: Red Hat, Inc. Virtio console (rev 01)
03:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c1)
05:00.0 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
06:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller

$ lsmod | grep amdgpu
amdgpu               9805824  1
iommu_v2               24576  1 amdgpu
gpu_sched              45056  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
drm_ttm_helper         16384  2 qxl,amdgpu
ttm                    86016  3 qxl,amdgpu,drm_ttm_helper
drm_kms_helper        307200  3 qxl,amdgpu
drm                   618496  9 gpu_sched,drm_kms_helper,qxl,amdgpu,drm_ttm_helper,ttm

quasi coyote
#

I guess the starting point would be: do you have ROCm installed?
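
One rough way to answer that question from inside the VM is sketched below. It assumes ROCm's default install prefix `/opt/rocm` and the `/dev/kfd` compute device node that the amdgpu kernel driver exposes; the function name `rocm_status` is hypothetical. It only checks that things are present, not that they work.

```python
import os
import shutil


def rocm_status(rocm_dir="/opt/rocm", kfd_node="/dev/kfd"):
    """Return basic ROCm presence checks as a dict (hypothetical helper)."""
    return {
        # Command-line tools shipped with ROCm.
        "rocminfo_on_path": shutil.which("rocminfo") is not None,
        "rocm_smi_on_path": shutil.which("rocm-smi") is not None,
        # Default install prefix; an assumption, since installs can be relocated.
        "rocm_dir_exists": os.path.isdir(rocm_dir),
        # Compute device node; missing usually means the kernel side is not set up.
        "kfd_node_exists": os.path.exists(kfd_node),
    }


if __name__ == "__main__":
    for check, ok in rocm_status().items():
        print(f"{check}: {'yes' if ok else 'no'}")
```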

sharp moss
#

Yeah, there's definitely a requirement on the Ubuntu side to have some kind of driver

clear heron
#
$ rocm-smi


======================= ROCm System Management Interface =======================
WARNING: No AMD GPUs specified
================================= Concise Info =================================
GPU  Temp  AvgPwr  SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%  
================================================================================
============================= End of ROCm SMI Log ==============================
#

I'm running into an issue with amdgpu-install where it doesn't like dkms, and even with --no-dkms I still get the error.

#
$ sudo apt install rocm-opencl-runtime
Reading package lists... Done
Building dependency tree       
Reading state information... Done
rocm-opencl-runtime is already the newest version (5.1.0.50100-36).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Requesting to save current system state
Successfully saved as "autozsys_dhvg96"
Setting up amdgpu-dkms (1:5.13.20.5.1.50100-1395274) ...
Removing old amdgpu-5.13.20.5.1-1395274 DKMS files...

------------------------------
Deleting module version: 5.13.20.5.1-1395274
completely from the DKMS tree.
------------------------------
Done.
Loading new amdgpu-5.13.20.5.1-1395274 DKMS files...
Building for 5.15.0-50-generic
Building for architecture x86_64
Building initial module for 5.15.0-50-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/amdgpu-dkms-firmware.0.crash'
Error! Bad return status for module build on kernel: 5.15.0-50-generic (x86_64)
Consult /var/lib/dkms/amdgpu/5.13.20.5.1-1395274/build/make.log for more information.
dpkg: error processing package amdgpu-dkms (--configure):
 installed amdgpu-dkms package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 amdgpu-dkms
#

gonna try downgrading my kernel 🙃

#

or not:
```
Command failed.
Adding boot menu entry for UEFI Firmware Settings
done
Errors were encountered while processing:
amdgpu-dkms
ZSys is adding automatic system snapshot to GRUB menu
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

#

I don't understand this error. I'm explicitly telling it not to do anything with dkms, and yet it still tries to set up amdgpu-dkms:

#
$ sudo amdgpu-install --usecase=rocm --no-dkms
Hit:1 http://us.archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]                               

Fetched 397 kB in 1s (400 kB/s)     
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
rocm-dev is already the newest version (5.1.0.50100-36).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Requesting to save current system state
Successfully saved as "autozsys_tnrh66"
Setting up amdgpu-dkms (1:5.13.20.5.1.50100-1395274) ...
Removing old amdgpu-5.13.20.5.1-1395274 DKMS files...

------------------------------
Deleting module version: 5.13.20.5.1-1395274
completely from the DKMS tree.
------------------------------
Done.
Loading new amdgpu-5.13.20.5.1-1395274 DKMS files...
Building for 5.15.0-50-generic
Building for architecture x86_64
Building initial module for 5.15.0-50-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/amdgpu-dkms-firmware.0.crash'
Error! Bad return status for module build on kernel: 5.15.0-50-generic (x86_64)
Consult /var/lib/dkms/amdgpu/5.13.20.5.1-1395274/build/make.log for more information.
dpkg: error processing package amdgpu-dkms (--configure):
 installed amdgpu-dkms package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 amdgpu-dkms

sharp moss
#

I haven’t touched dkms in years 🤷

carmine radish
#

Oh lord. Kernel module hell. It's been a while since I've run into it but I still have nightmares.

#

Try looking in make.log at the path mentioned in the error message. My guess is that you're going to find that there is a missing build requirement, or possibly that the code doesn't compile cleanly with your current kernel.
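
A quick way to skim that log is sketched below. The path is copied from the dpkg error output earlier in the thread; the search patterns are only an assumption about what compiler and make failures usually look like, and `find_error_lines` is a made-up helper name.

```python
import re
from pathlib import Path

# Path copied from the "Consult ... make.log" line in the dpkg output.
MAKE_LOG = Path("/var/lib/dkms/amdgpu/5.13.20.5.1-1395274/build/make.log")

# Assumed patterns for typical compiler and make failures; adjust as needed.
ERROR_RE = re.compile(r"error|fatal|no such file or directory|\*\*\*", re.IGNORECASE)


def find_error_lines(text):
    """Return (line_number, line) pairs for lines that look like build errors."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if ERROR_RE.search(line)
    ]


if __name__ == "__main__":
    if MAKE_LOG.exists():
        for number, line in find_error_lines(MAKE_LOG.read_text(errors="replace")):
            print(f"{number}: {line}")
    else:
        print(f"{MAKE_LOG} not found")
```

The first hit is usually the interesting one. A missing header tends to mean a missing build dependency, while an error inside the driver source tends to mean the module doesn't build against the running kernel.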

clear heron
#

It's the default kernel on Ubuntu 20.04.5, or whatever the current 20.04 point release is.

#

That said, I tore down the VM in frustration after I installed 3 kernels lol. I'll try again with 22.04, maybe tomorrow.

carmine radish
#

Oh, you're running this in a VM on a Windows machine? You know for sure that the VM can use the GPU?

clear heron
#

No, it's a VM on Unraid, alongside my Windows VM, so I know things like GPU passthrough are working correctly.
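
Whether the Linux guest itself sees the card can be checked through sysfs. The sketch below assumes the usual `/sys/class/drm` layout and relies on `0x1002` being AMD's PCI vendor ID; `amd_drm_cards` is a hypothetical helper name. It lists each AMD card together with the kernel driver bound to it.

```python
import re
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID assigned to AMD/ATI


def amd_drm_cards(drm_root="/sys/class/drm"):
    """List (card_name, bound_driver) for DRM cards whose PCI vendor is AMD."""
    cards = []
    root = Path(drm_root)
    if not root.is_dir():
        return cards
    for entry in sorted(root.iterdir()):
        # Skip connector entries such as card0-HDMI-A-1 and render nodes.
        if not re.fullmatch(r"card\d+", entry.name):
            continue
        vendor_file = entry / "device" / "vendor"
        if not vendor_file.is_file():
            continue
        if vendor_file.read_text().strip().lower() != AMD_VENDOR_ID:
            continue
        # "driver" is a symlink; its target's name is the bound kernel driver.
        driver_link = entry / "device" / "driver"
        driver = driver_link.resolve().name if driver_link.exists() else None
        cards.append((entry.name, driver))
    return cards


if __name__ == "__main__":
    print(amd_drm_cards() or "no AMD DRM cards found")
```

A result like `[('card1', 'amdgpu')]` would mean the passed-through GPU is visible and bound to amdgpu. It says nothing about whether ROCm supports that particular chip.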

clear heron
#

Lots of research later, it seems my GPU is just plain unsupported by ROCm. It's a 5700 XT / Navi 10 / gfx1010.

coral fog
#

I believe ROCm should work on Navi10. Do you remember where you read that?

#

You might have to compile PyTorch yourself with ROCm on gfx1010.

#

I'll do some testing myself, but it'll have to wait till after work tomorrow (1700 EST).

clear heron