#Hanging ROCm drivers


hallow bobcat
#

I'm continually plagued by driver problems. Since updating to ROCm 7, we have a system where a GPU crashes or something, causing all ROCm applications to hang. Is there a way to reset the driver without rebooting the computer? I've already tried sudo cat /sys/kernel/debug/dri/*/amdgpu_gpu_recover, but this also hangs on the affected GPU and eventually hangs the entire system, forcing a reboot. There were some ominous kernel messages preceding the hang. The system's specs are as follows:
OS: Ubuntu 22.04
ROCm: 7.0.1 (including AMDGPU-DKMS driver)
CPU: AMD EPYC 7713P 64-Core
Motherboard: ASRock Rack ROMED8QM-2T
Memory: 512 GB DDR4 ECC
GPU: 2x AMD Radeon V620 and 2x AMD Radeon RX 9070 XT
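
For reference, the glob in the command above touches the recovery file of every GPU in the box, including the healthy ones. A minimal sketch of narrowing it to one card follows. It assumes that each numbered directory under /sys/kernel/debug/dri has a `name` file containing a `dev=<PCI address>` token, which should be verified on the kernel in use. The helper names `bdf_from_name` and `find_dri_node` are hypothetical, and the sketch only prints the path rather than triggering a reset.

```python
import os
import re
from pathlib import Path
from typing import Optional

# Assumption: /sys/kernel/debug/dri/<N>/name holds a line such as
# "amdgpu dev=0000:43:00.0 unique=0000:43:00.0". Check this on your kernel.
_DEV_RE = re.compile(r"\bdev=([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])")


def bdf_from_name(text: str) -> Optional[str]:
    """Extract the PCI address from the contents of a debugfs 'name' file."""
    match = _DEV_RE.search(text)
    return match.group(1).lower() if match else None


def find_dri_node(bdf: str, root: str = "/sys/kernel/debug/dri") -> Optional[Path]:
    """Return the lowest-numbered DRI debugfs directory that belongs to bdf."""
    try:
        entries = os.listdir(root)
    except OSError:
        return None  # debugfs not mounted, or not running as root
    # Sort numerically so the primary node (e.g. 2) is tried before the
    # render node (e.g. 130) of the same device.
    for entry in sorted((e for e in entries if e.isdigit()), key=int):
        name_file = Path(root) / entry / "name"
        try:
            text = name_file.read_text()
        except OSError:
            continue
        if bdf_from_name(text) == bdf.lower():
            return Path(root) / entry
    return None


if __name__ == "__main__":
    # 0000:43:00.0 is the address that appears in the reset trace in this thread.
    node = find_dri_node("0000:43:00.0")
    if node is None:
        print("no matching DRI node found (debugfs mounted? running as root?)")
    else:
        print(f"recovery file for this GPU: {node / 'amdgpu_gpu_recover'}")
```

Targeting a single node does not make the reset itself any safer; it only avoids poking the GPUs that are still working.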

(continued because discord character limit)

#

I've seen this a few times:

Sep 26 06:58:41 server kernel: {15}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
Sep 26 06:58:41 server kernel: {15}[Hardware Error]: It has been corrected by h/w and requires no further action
Sep 26 06:58:41 server kernel: {15}[Hardware Error]: event severity: corrected
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:  Error 0, type: corrected
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   section_type: PCIe error
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   port_type: 4, root port
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   version: 0.2
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   command: 0x0407, status: 0x0010
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   device_id: 0000:00:03.3
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   slot: 49
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   secondary_bus: 0x06
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   class_code: 060400
Sep 26 06:58:41 server kernel: {15}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0012
Sep 26 06:58:41 server kernel: pcieport 0000:00:03.3: AER: aer_status: 0x00000040, aer_mask: 0x00000000
Sep 26 06:58:41 server kernel: pcieport 0000:00:03.3:    [ 6] BadTLP
Sep 26 06:58:41 server kernel: pcieport 0000:00:03.3: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
Sep 26 06:58:47 server kernel: pcieport 0000:00:03.3: AER: aer_status: 0x00000040, aer_mask: 0x00000000
Sep 26 06:58:47 server kernel: pcieport 0000:00:03.3:    [ 6] BadTLP
Sep 26 06:58:47 server kernel: pcieport 0000:00:03.3: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
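
A single corrected BadTLP is harmless, but a steady stream from one root port usually points at a marginal link (card, riser, or slot). A rough sketch for tallying these events per port from kernel log lines in the format above follows. The bit names are the short labels the Linux kernel prints for the PCIe AER Correctable Error Status register; the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Bit positions in the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux kernel uses in its log output.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches e.g. "pcieport 0000:00:03.3: AER: aer_status: 0x00000040, ..."
_AER_RE = re.compile(r"(\S+): AER: aer_status: (0x[0-9a-fA-F]+)")


def decode_correctable(status: int) -> List[str]:
    """Return the names of all correctable-error bits set in status."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items()) if status & (1 << bit)]


def count_aer_events(lines: Iterable[str]) -> Counter:
    """Count (device, error name) pairs across kernel log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = _AER_RE.search(line)
        if not match:
            continue
        device, status = match.group(1), int(match.group(2), 16)
        for name in decode_correctable(status):
            counts[(device, name)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 26 06:58:41 server kernel: pcieport 0000:00:03.3: AER: aer_status: 0x00000040, aer_mask: 0x00000000",
        "Sep 26 06:58:47 server kernel: pcieport 0000:00:03.3: AER: aer_status: 0x00000040, aer_mask: 0x00000000",
    ]
    for (device, name), n in count_aer_events(sample).items():
        print(device, name, n)
```

To run it over real logs, feed it the lines of the kernel journal instead of the two sample lines. The port can then be matched to the device behind it via the secondary_bus value in the hardware error record (0x06 here).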
#

also random hangs like this (this was before the hardware errors)

Sep 25 20:39:06 server kernel: task:kworker/43:9    state:D stack:0     pid:1144790 tgid:1144790 ppid:2      flags:0x00004000
Sep 25 20:39:06 server kernel: Workqueue: events amdgpu_tlb_fence_work [amdgpu]
Sep 25 20:39:06 server kernel: Call Trace:
Sep 25 20:39:06 server kernel:  <TASK>
Sep 25 20:39:06 server kernel:  __schedule+0x27c/0x6a0
Sep 25 20:39:06 server kernel:  schedule+0x33/0x110
Sep 25 20:39:06 server kernel:  schedule_timeout+0x157/0x170
Sep 25 20:39:06 server kernel:  dma_fence_default_wait+0x13d/0x210
Sep 25 20:39:06 server kernel:  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
Sep 25 20:39:06 server kernel:  dma_fence_wait_timeout+0x116/0x140
Sep 25 20:39:06 server kernel:  amdgpu_tlb_fence_work+0x29/0x140 [amdgpu]
Sep 25 20:39:06 server kernel:  process_one_work+0x184/0x3a0
Sep 25 20:39:06 server kernel:  worker_thread+0x306/0x440
Sep 25 20:39:06 server kernel:  ? __pfx_worker_thread+0x10/0x10
Sep 25 20:39:06 server kernel:  kthread+0xf2/0x120
Sep 25 20:39:06 server kernel:  ? __pfx_kthread+0x10/0x10
Sep 25 20:39:06 server kernel:  ret_from_fork+0x47/0x70
Sep 25 20:39:06 server kernel:  ? __pfx_kthread+0x10/0x10
Sep 25 20:39:06 server kernel:  ret_from_fork_asm+0x1b/0x30
Sep 25 20:39:06 server kernel:  </TASK>
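
The trace above shows a kworker stuck in state D (uninterruptible sleep) while waiting on a DMA fence. To see what else is stuck before the machine locks up completely, something like the following sketch can list every task currently in state D by reading /proc/<pid>/stat. The function names are hypothetical, and only the standard Linux procfs layout is assumed.

```python
import os
from typing import List, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def uninterruptible_tasks(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return sorted (pid, comm) pairs for tasks currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited meanwhile, or the file is unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A task that is briefly in state D during normal I/O is not a problem; the ones worth reporting are those that stay there across repeated runs.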
#

and this when trying to recover the GPU

Sep 26 10:04:11 server kernel: amdgpu 0000:43:00.0: amdgpu: failed to suspend display audio
Sep 26 10:04:11 server kernel: ------------[ cut here ]------------
Sep 26 10:04:11 server kernel: Could not get user_gpu_id from dev->id:8123
Sep 26 10:04:11 server kernel: WARNING: CPU: 61 PID: 1554107 at /tmp/amd.whWM8svS/amd/amdgpu/../amdkfd/kfd_events.c:1279 kfd_signal_reset_>
Sep 26 10:04:11 server kernel: Modules linked in: cpuid veth nfsv3 nfs_acl nf_conntrack_netlink xt_set ip_set xt_addrtype xfrm_user xt_CHE>
Sep 26 10:04:11 server kernel:  libcrc32c raid1 raid0 hid_generic cdc_ether usbhid usbnet mii hid amdgpu(OE) amddrm_ttm_helper(OE) amdttm(>
Sep 26 10:04:11 server kernel: CPU: 61 PID: 1554107 Comm: kworker/u256:0 Tainted: G           OE      6.8.0-83-generic #83~22.04.1-Ubuntu
Sep 26 10:04:11 server kernel: Hardware name: ASRockRack 2U4G-ROME/2T/ROMED8QM-2T, BIOS P3.10  07/12/2021
Sep 26 10:04:11 server kernel: Workqueue: amdgpu-reset-dev amdgpu_debugfs_reset_work [amdgpu]
Sep 26 10:04:11 server kernel: RIP: 0010:kfd_signal_reset_event+0x305/0x3e0 [amdgpu]
Sep 26 10:04:11 server kernel: Code: 00 80 fb 01 0f 87 ec 3a 6b 00 83 e3 01 0f 85 26 ff ff ff 41 8b 76 24 48 c7 c7 20 b2 7a c1 c6 05 15 0a>
#
Sep 26 10:04:11 server kernel: RSP: 0018:ffffbb0eb76bbc60 EFLAGS: 00010246
Sep 26 10:04:11 server kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 26 10:04:11 server kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 26 10:04:11 server kernel: RBP: ffffbb0eb76bbcf8 R08: 0000000000000000 R09: 0000000000000000
Sep 26 10:04:11 server kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9e2d58825000
Sep 26 10:04:11 server kernel: R13: 00000000ffffffea R14: ffff9e2d04edb600 R15: ffff9e2d254ca0c8
Sep 26 10:04:11 server kernel: FS:  0000000000000000(0000) GS:ffff9eaa4cc80000(0000) knlGS:0000000000000000
Sep 26 10:04:11 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 26 10:04:11 server kernel: CR2: 00005ecf09ee2be8 CR3: 000000209fa3c005 CR4: 0000000000f70ef0
Sep 26 10:04:11 server kernel: PKRU: 55555554
Sep 26 10:04:11 server kernel: Call Trace:
Sep 26 10:04:11 server kernel:  <TASK>
Sep 26 10:04:11 server kernel:  ? show_regs+0x6d/0x80
Sep 26 10:04:11 server kernel:  ? __warn+0x89/0x160
Sep 26 10:04:11 server kernel:  ? kfd_signal_reset_event+0x305/0x3e0 [amdgpu]
Sep 26 10:04:11 server kernel:  ? report_bug+0x17e/0x1b0
Sep 26 10:04:11 server kernel:  ? handle_bug+0x6e/0xb0
Sep 26 10:04:11 server kernel:  ? exc_invalid_op+0x18/0x80
Sep 26 10:04:11 server kernel:  ? asm_exc_invalid_op+0x1b/0x20
Sep 26 10:04:11 server kernel:  ? kfd_signal_reset_event+0x305/0x3e0 [amdgpu]
#
Sep 26 10:04:12 server kernel:  kgd2kfd_pre_reset+0x9c/0xe0 [amdgpu]
Sep 26 10:04:12 server kernel:  amdgpu_amdkfd_pre_reset+0x1a/0x30 [amdgpu]
Sep 26 10:04:12 server kernel:  amdgpu_device_halt_activities.constprop.0+0x136/0x270 [amdgpu]
Sep 26 10:04:12 server kernel:  amdgpu_device_gpu_recover+0x11b/0x3b0 [amdgpu]
Sep 26 10:04:12 server kernel:  amdgpu_debugfs_reset_work+0x69/0x90 [amdgpu]
Sep 26 10:04:12 server kernel:  process_one_work+0x184/0x3a0
Sep 26 10:04:12 server kernel:  worker_thread+0x306/0x440
Sep 26 10:04:12 server kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Sep 26 10:04:12 server kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
Sep 26 10:04:12 server kernel:  ? __pfx_worker_thread+0x10/0x10
Sep 26 10:04:12 server kernel:  kthread+0xf2/0x120
Sep 26 10:04:12 server kernel:  ? __pfx_kthread+0x10/0x10
Sep 26 10:04:12 server kernel:  ret_from_fork+0x47/0x70
Sep 26 10:04:12 server kernel:  ? __pfx_kthread+0x10/0x10
Sep 26 10:04:12 server kernel:  ret_from_fork_asm+0x1b/0x30
Sep 26 10:04:12 server kernel:  </TASK>
Sep 26 10:04:12 server kernel: ---[ end trace 0000000000000000 ]---
#

I'd appreciate any help with this, or really any driver help at all, because the state of the AMD driver is still really terrible in my experience

vapid escarp
#

@misty quest have you seen these before?

#

@hallow bobcat I heard you're working as a contractor with AMD. I would recommend getting in touch with your contacts there to escalate this quickly as well

#

This sounds like an unpleasant experience and shouldn't happen so often. May I know which workloads you tried to run that caused it? Also, have you considered trying the latest linux-firmware and/or the DKMS drivers from amdgpu-install?

hallow bobcat
#

This is the latest firmware, I think (from ROCm 7.0.1). The workload is general development of ROCm libraries; I don't really know of a specific application other than the tests from rocprim/rocthrust/rocrand/ck/miopen etc

hallow bobcat
#

yes

vapid escarp
#

ah okay

hallow bobcat
#

Thanks. Yeah, maybe, I can create an issue there too. Unfortunately I get these all the time, depending on the machine. This is a specific instance that started happening recently, and I don't really know how to reproduce it yet

vapid escarp
#

no worries, just having one created is a start to track it

#

yeah sure let's discuss internally. Are you on teams?

#

you can DM me

hallow bobcat
#

I would if I knew who you are

#

im rvoetter

hallow bobcat
#

We took out one of the V620s in this system. It appears to be broken in some way, since it doesn't want to use PCIe 3. Also, it has an older VBIOS. I hope that this will resolve the hangs
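
One way to check what a card actually negotiated is to compare the current and maximum link speed that the kernel exposes in sysfs. The sketch below reads the current_link_speed, max_link_speed, current_link_width and max_link_width attributes of a PCI device. The helper names are hypothetical, and the address in the example is simply the one from the reset trace earlier in the thread, not necessarily the V620 in question.

```python
import os
from typing import Dict, Optional

# Per-lane transfer rate in GT/s for each PCIe generation.
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def parse_gts(text: str) -> Optional[float]:
    """Parse strings such as '8.0 GT/s PCIe' into 8.0; None if unparseable."""
    parts = text.split()
    if not parts:
        return None
    try:
        return float(parts[0])
    except ValueError:
        return None


def _read(path: str) -> str:
    """Read a sysfs attribute, returning '' if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def link_report(bdf: str, root: str = "/sys/bus/pci/devices") -> Dict[str, object]:
    """Summarise negotiated versus maximum PCIe link parameters for one device."""
    dev = os.path.join(root, bdf)
    cur = parse_gts(_read(os.path.join(dev, "current_link_speed")))
    top = parse_gts(_read(os.path.join(dev, "max_link_speed")))
    return {
        "current_gen": GEN_BY_GTS.get(cur),
        "max_gen": GEN_BY_GTS.get(top),
        "current_width": _read(os.path.join(dev, "current_link_width")) or None,
        "max_width": _read(os.path.join(dev, "max_link_width")) or None,
        "below_max": cur is not None and top is not None and cur < top,
    }


if __name__ == "__main__":
    print(link_report("0000:43:00.0"))
```

A link running below its maximum is not proof of a fault on its own, since GPUs may drop the link speed while idle to save power. It is more telling when the speed stays low under load or when it coincides with AER errors on the upstream port.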

rare quarry
#

Hi I'm new here

#

So what are the updates here?

#

I'm asking about AMD graphics cards

#

How much are they?

hallow bobcat
#

Can you not spam random threads please? Thanks