#AMDGPU crash adventure

late salmon
#

let's move here

normal bluff
#

yes

late salmon
#

ok so i need to read the voltages, but the pp files have changed since i moved to my new gpu

#

i don't know how now

normal bluff
#

let me look around

late salmon
#

/sys/kernel/debug/dri/0/amdgpu_pm_info shows me current voltage

normal bluff
#

818 mV (VDDGFX) for me

late salmon
#

yea but thats instant, thats not what we want

#

we want the table

#

pp_od_clk_voltage

#

but it doesn't show the voltages on my end

normal bluff
#

let me google it

late salmon
#

post yours here, let me see

normal bluff
#

od_clk_voltage?

#
OD_SCLK:
0: 500Mhz
1: 2104Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       2600Mhz
MCLK:     674Mhz       1075Mhz
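
A rough sketch of how a dump like the one above could be parsed, in case it helps with comparing tables between cards. The function names (`parse_od_table`, `mhz_values`, `summarize`) are made up for illustration, and the parser only assumes the layout shown in this paste, not every ASIC's format.

```python
import re

# Sample text copied from the pp_od_clk_voltage paste above.
SAMPLE = """\
OD_SCLK:
0: 500Mhz
1: 2104Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       2600Mhz
MCLK:     674Mhz       1075Mhz
"""


def parse_od_table(text):
    """Group the lines of a pp_od_clk_voltage dump by their OD_* section."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("OD_") and line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def mhz_values(line):
    """Return every '<number>Mhz' value on a line, ignoring case."""
    return [int(v) for v in re.findall(r"(\d+)\s*mhz", line, re.IGNORECASE)]


def summarize(text):
    """Pull out the sclk min/max, the voltage offset and the allowed ranges."""
    sections = parse_od_table(text)
    sclk = [mhz_values(entry)[0] for entry in sections.get("OD_SCLK", [])]
    ranges = {}
    for entry in sections.get("OD_RANGE", []):
        name, _, rest = entry.partition(":")
        ranges[name] = tuple(mhz_values(rest))
    offset = None
    if sections.get("OD_VDDGFX_OFFSET"):
        offset = int(re.match(r"-?\d+", sections["OD_VDDGFX_OFFSET"][0]).group())
    return {"sclk_min_max": sclk, "vddgfx_offset_mv": offset, "range": ranges}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

With the paste above, the SCLK ceiling of 2104 sits well inside the 500 to 2600 range the driver advertises, which is the part that gets questioned later in the thread.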
#

amdgpu_pm_info
```
GFX Clocks and Power:
1000 MHz (MCLK)
2130 MHz (SCLK)
1825 MHz (PSTATE_SCLK)
1000 MHz (PSTATE_MCLK)
818 mV (VDDGFX)
42.0 W (average GPU)
```

late salmon
#

like mine, they changed it with rdna2

normal bluff
#

hmm

late salmon
#

or rdna1, i come from gcn

normal bluff
#

gcn is pretty old isnt it

late salmon
#

eh, depends

#

The OD_VDDGFX_OFFSET parameter in the Linux kernel is related to overclocking the graphics card. It is used by the AMDGPU driver to set the offset for the GPU voltage.

normal bluff
#

i can set the offset with corectrl and through sysfs

#

so i could try tweaking it like that

#

and i just did a test

#

i ran the system at 2130 for a bit on high, and like clockwork it crashed

#

no voltage modification

normal bluff
#

this is what the screen froze on btw

late salmon
#

memory 1000mhz? 🤔

normal bluff
#

yeah

#

i was saying earlier, it always sits at 1000

late salmon
#

there is some fuckery going on

normal bluff
#

ive never seen it go higher or lower

#

i can manually change it, but id like to know the root cause

late salmon
#

have you changed installs ever?

#

i mean of os

normal bluff
#

no

#

maybe i should put arch on a usb or something and test

late salmon
#

you would need to stress test heavily, not easily done with a usb stick

normal bluff
#

i guess thats true

late salmon
#

do you have a friend nearby who can test it

normal bluff
#

i could actually use this ssd i have that has nothing on it to put arch

late salmon
#

i still feel this must be a software issue

normal bluff
#

me too

late salmon
#

what kernel are you on

normal bluff
#

im on 6.1.27-gentoo-r1

late salmon
#

try 6.3

normal bluff
#

but ive tried 6.3 and 5.11, 5.15

#

okay

late salmon
#

from unstable

#

oh nvm

normal bluff
#

i already have it compiled, so i could just reboot into it

#

want my config?

late salmon
#

we must first reset everything

normal bluff
#

maybe im really really really dumb and have it set terribly?

late salmon
#

no absolutely don't do the config manually

normal bluff
#

oh, i always do manual config on my kernels

late salmon
#

go for the dist kernel

normal bluff
#

ive never used a bin

#

i will try

late salmon
#

yeah but for this testing only

normal bluff
#

let me try

#

i have to figure out how to do that, bc ive never been able to boot a bin kernel, it can never find rootfs

#

let me emerge gentoo-kernel-bin

#

thats dist right?

late salmon
#

be absolutely sure that the kernel you are booting on is the one with all the modules

normal bluff
#

what ver should i test?

late salmon
#

the gentoo-kernel

normal bluff
#

should i stick with the 6.1 version im doing, now, or 6.3?

late salmon
#

latest you can get without breaking the system

normal bluff
#

i feel like for the sake of variable consistency we should test with just one

late salmon
#

i don't think it matters, 6900 support should have landed a long time ago

normal bluff
#

ill do 6.1.24

#

i have 6800 but yeah

late salmon
#

oh

normal bluff
#

let me emerge with initramfs flag bc i dont use one normally but i cant boot this without

late salmon
#

oh right corectrl is bugged

#

what does it do?

normal bluff
#

well i compile my btrfs driver into kernel manually, but in bin kernels it is M, so i need an initramfs otherwise i cant load the btrfs module and boot

#

compiling rn, then ill test it out

#

would be fantastic if i loaded up this kernel and corectrl showed the right mem speed for example

late salmon
#

i would be happy if it showed the right card

#

whats up with that

normal bluff
#

just loaded up bin kernel

#

no issue getting it to work for once

#

[ 31.705033] xhci_hcd 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0016 address=0xffef7000 flags=0x0000]

#

odd dmesg thing but nothing is broken so idk

#

same thing with memory being 1000mhz

#

i will now attempt to induce a crash

normal bluff
#

fixable maybe?

#

i can change the mem speed with corectrl

#

at least on my other kernel i could

late salmon
#

can you go higher?

#

like, whats the maximum stock?

normal bluff
#

with bin kernel max is 1000

#

but on my other kernel i could go to 2000

#

also bin kernel is for some reason not letting me use my audio

#

like my usb dac for my headphones doesnt register

#

very odd

late salmon
#

I have these valid refresh rate options (Hz):
50, 60, 75, 100, 120, 144, 165, 240
50, 60, 75, 100 and 240 are using

cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz *
1: 500Mhz
2: 625Mhz
3: 875Mhz
120 and 165 are using

cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz
1: 500Mhz
2: 625Mhz
3: 875Mhz *
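
A small sketch for checking which memory level is active in output like the two listings above, assuming the usual layout where the current level carries a trailing `*`. The helper name `active_level` is hypothetical.

```python
# The two listings quoted above: idle and high memory clock.
IDLE = """\
0: 100Mhz *
1: 500Mhz
2: 625Mhz
3: 875Mhz
"""

PINNED_HIGH = """\
0: 100Mhz
1: 500Mhz
2: 625Mhz
3: 875Mhz *
"""


def active_level(text):
    """Return (index, mhz) for the level marked with '*', or None if no line is marked."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-1] == "*":
            index = int(parts[0].rstrip(":"))
            mhz = int(parts[1][:-3])  # drop the trailing "Mhz"
            return index, mhz
    return None


if __name__ == "__main__":
    print(active_level(IDLE))
    print(active_level(PINNED_HIGH))
```

On a live system the text would come from `/sys/class/drm/card0/device/pp_dpm_mclk`; the `card0` index is an assumption and can differ per machine.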

#

do you have a high refresh rate monitor?

late salmon
#

it means it's reaching maximum

normal bluff
#

odd that 1000 is maximum

late salmon
#

it's the same as ram, it gets reported halved

#

it's a matter of measurement

normal bluff
#

i didnt know ram got reported half either

late salmon
#

on ram you usually get a measurement of 1667

#

or something like that

normal bluff
#

i think 2133

late salmon
#

that's without overclock profile

#

its the common speed without xmp

normal bluff
#

yes

#

144hz on my main, and 85hz on my second, 60hz on my third

late salmon
#

...

#

that complicates things

#

but yea for that many monitors, some of them with higher refresh rates, i think that memory freq is normal

#

how are the benchmarks of the card?

#

is it holding up with the performance?

normal bluff
#

wdym?

#

like do i get the same fps as others?

#

i think so

late salmon
#

alright

#

have you stress tested it with the new kernel?

normal bluff
#

trying to rn

#

this checks out

#

i get even more fps than this

#

doing my testing rn, problem is that crash is almost completely random timing wise, and my audio doesnt work so its gonna suck to play like this for a bit

late salmon
#

its really weird, that kernel should have every module

#

did you enable something specifically for your sound card?

normal bluff
#

probably have it built in

#

i have to check the wiki now to see

late salmon
#

i have to go now

#

but i think the problem is that max voltage was too low

#

for some reason

#

good luck in this, you can keep posting here your findings so when i get back i read them

normal bluff
#

ok thanks for helping, i will update you

#

just got a hang using dist kernel

#

oddly enough, no dmesg output from the hang

#

i will be going back to my own kernel now that i know that the hang occurs on both

#

its now between a voltage problem or some kind of hardware problem

#

stress testing now with increased voltage

#

crashed with these settings

#

just to rule out the voltage, ill raise to 200 now and see what happens, all else being the same

#

ok, that crashed too

#

so we can rule out a voltage issue

#

next is freq

#

im going to start by reducing min clock and max clock by 100

#

ok so min 2025, and max 2030 im not crashing (at least not super quick)

#

i guess that helps narrow this down to freq, so if i keep the frequency within 2000mhz, i should be good

#

played a bit with the settings like this, seems stable

normal bluff
#

ok so solution seems to be to limit the gpu freq to something below 2100

#

sucks that auto and high dont work anymore

#

but also voltage isnt the problem

#

even +200 mv doesnt help anything so idk but at least i can use my system
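
For reference, a hedged sketch of how the clock cap described above might be applied through sysfs instead of corectrl. It relies on the kernel's documented write syntax for Vega20 and newer ASICs (`s 0 <MHz>` for minimum sclk, `s 1 <MHz>` for maximum sclk, `c` to commit, after setting `power_dpm_force_performance_level` to `manual`). The `card0` path, the function names and the dry-run default are assumptions for illustration; real writes need root and overdrive enabled in the driver.

```python
from pathlib import Path

DEVICE = Path("/sys/class/drm/card0/device")  # assumption: the GPU is card0


def cap_commands(min_mhz, max_mhz, sclk_range):
    """Build the command strings that set min/max sclk, checked against OD_RANGE."""
    lo, hi = sclk_range
    if not (lo <= min_mhz <= max_mhz <= hi):
        raise ValueError(
            f"need {lo} <= min <= max <= {hi}, got min={min_mhz} max={max_mhz}"
        )
    return [f"s 0 {min_mhz}", f"s 1 {max_mhz}", "c"]


def apply(commands, device=DEVICE, dry_run=True):
    """Print the commands, or write them one at a time when dry_run is False."""
    if dry_run:
        for cmd in commands:
            print(f"would write {cmd!r} to {device / 'pp_od_clk_voltage'}")
        return
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    od_file = device / "pp_od_clk_voltage"
    for cmd in commands:
        # Each command is its own write, like separate `echo ... >` calls.
        od_file.write_text(cmd + "\n")


if __name__ == "__main__":
    # Values from this thread: min 2025, max 2030, SCLK range 500 to 2600.
    apply(cap_commands(2025, 2030, (500, 2600)))
```

Writing `r` to the same file is documented as resetting to the default levels, so the cap can be undone without a reboot.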

normal bluff
#

@late salmon yo whenever you're free, can we discuss the results here?

late salmon
#

30min or more

normal bluff
#

yeah take your time bro dw, just dont forget about me lol

#

i already contacted amd support and updated the bug report

late salmon
#

i'm here

#

@normal bluff

normal bluff
#

ahoy

#

so @late salmon i found some pretty interesting things

#

first of all, kernel version seems entirely irrelevant

#

what actually prevents the crashes is reducing my gpu's mhz by even 100

#

the question i have though, is why?

#

for what reason does the kernel (or the gpu) set its freq higher than it can go by default?

#

and on top of that, why cant my gpu reach those speeds?

late salmon
#

are you sure the frequency it's set to is higher than stock?

normal bluff
#

stock, or auto, brings the freq to 2130 max

#

which crashes

#

but reducing the max to 2030 fixes all crashes

#

i had a buddy show me his, he had a reddragon one, and he was getting well over 2300 no crashes

normal bluff
#

let me look, i think it's 1800

#

ok

#

so base clock is 1700, game clock is 1815, and boost is 2105

late salmon
#

Base Clock: 1825 MHz
Game Clock: 2015 MHz
Boost Clock: 2250 MHz

normal bluff
#

where did you find that?

late salmon
#

oh yours is the non xt?

normal bluff
#

yeah

#

non xt

#

just 6800

late salmon
#

ok, what is the output of glxinfo | grep renderer

#

this will give us an idea of what the system thinks the card is

normal bluff
#
GLX_MESA_copy_sub_buffer, GLX_MESA_query_renderer, GLX_MESA_swap_control,
    GLX_MESA_query_renderer, GLX_MESA_swap_control, GLX_OML_swap_method,
Extended renderer info (GLX_MESA_query_renderer):
OpenGL renderer string: AMD Radeon RX 6800 (navi21, LLVM 15.0.7, DRM 3.52, 6.3.2-gentoo)
#

seems correct

late salmon
#

because corectrl reporting a 6900 triggers me

normal bluff
#

same

#

but i think its because lspci just sees navi 21 so it defaults to rx 6900 xt

late salmon
#

post lspci pls

normal bluff
#

just lspci?

late salmon
#

yea

#

just the card part

normal bluff
#

ok

late salmon
#

i need to see what generation it refers to

normal bluff
#
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB
late salmon
#

ok, still navi21

normal bluff
#

yeah all same series

late salmon
#

welp, it's correctly identified

#

are you 100% sure amdgpu module is setting last frequency at 2000+mhz?

#

can you post a pp_od_clk_voltage?

normal bluff
#

yup, so right now, i have it capped at 2030 as you can see

#

but it seems that the range can be up to 2600

late salmon
#

it has only two states, mh

#

same as mine, but for example polaris had way more

#

also doesn't list voltages

normal bluff
#

yeah they def changed something

#

i did work on a server rig with amdgpus and noticed the same thing

#

polaris gpus as well

late salmon
#

< For Vega10 and previous ASICs >

Reading the file will display:

a list of engine clock levels and voltages labeled OD_SCLK

a list of memory clock levels and voltages labeled OD_MCLK

a list of valid ranges for sclk, mclk, and voltage labeled OD_RANGE

To manually adjust these settings, first select manual using power_dpm_force_performance_level. Enter a new value for each level by writing a string that contains "s/m level clock voltage" to the file. E.g., "s 1 500 820" will update sclk level 1 to be 500 MHz at 820 mV; "m 0 350 810" will update mclk level 0 to be 350 MHz at 810 mV. When you have edited all of the states as needed, write "c" (commit) to the file to commit your changes. If you want to reset to the default power levels, write "r" (reset) to the file to reset them.

< For Vega20 and newer ASICs >

Reading the file will display:

minimum and maximum engine clock labeled OD_SCLK

minimum(not available for Vega20 and Navi1x) and maximum memory clock labeled OD_MCLK

three <frequency, voltage> points labeled OD_VDDC_CURVE. They can be used to calibrate the sclk voltage curve.

voltage offset(in mV) applied on target voltage calculation. This is available for Sienna Cichlid, Navy Flounder and Dimgrey Cavefish. For these ASICs, the target voltage calculation can be illustrated by "voltage = voltage calculated from v/f curve + overdrive vddgfx offset"

a list of valid ranges for sclk, mclk, and voltage curve points labeled OD_RANGE
normal bluff
#

ok so we can set a curve based on voltage, where max voltage caps out?

late salmon
#

no no, this was just to see that it's a documented change in the kernel module

normal bluff
#

oh

#

so this is just showing that something def changed between polaris and navi

late salmon
#

yeah, it also tells me there is something else going on, because neither me nor you have the voltages

#

they can definitely be read tho, as rocm-smi reports them

#

sorry not rocm-smi, the other info command

normal bluff
#

let me look

#

rocminfo

#

emerging it now

late salmon
#

that doesn't show voltage sorry

#

also you might require rocm to run it

normal bluff
#

i could emerge it

#

but if it doesnt help then it doesnt help ig

#

so what voltage exactly are we looking for? the table?

#

i did just emerge it btw if you want any info from rocminfo

late salmon
#

dw they aren't useful, i wanted a command we ran before

#

its a sysfs file that displays current info

#

like current voltage applied
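
One candidate for a sysfs source of the currently applied voltage is the amdgpu hwmon interface, where `in0_input` is documented as the GPU voltage in millivolts and `freq1_input` as the graphics clock in hertz. The sketch below is only a guess at what was meant here; the `card0` path and the helper names are assumptions, and it reads an instantaneous value, not the voltage table.

```python
from pathlib import Path


def find_amdgpu_hwmon(device="/sys/class/drm/card0/device"):
    """Return the first hwmon directory under the device, or None if there is none."""
    dirs = sorted(Path(device).glob("hwmon/hwmon*"))
    return dirs[0] if dirs else None


def read_hwmon(hwmon_dir):
    """Read the current voltage (mV) and graphics clock (MHz) from a hwmon directory."""
    hwmon_dir = Path(hwmon_dir)

    def read_int(name):
        try:
            return int((hwmon_dir / name).read_text().strip())
        except (OSError, ValueError):
            return None  # file missing, unreadable, or not a number

    vddgfx_mv = read_int("in0_input")   # millivolts
    sclk_hz = read_int("freq1_input")   # hertz
    sclk_mhz = None if sclk_hz is None else sclk_hz // 1_000_000
    return {"vddgfx_mv": vddgfx_mv, "sclk_mhz": sclk_mhz}


if __name__ == "__main__":
    hwmon = find_amdgpu_hwmon()
    if hwmon is None:
        print("no hwmon directory found under card0")
    else:
        print(read_hwmon(hwmon))
```

Sampling this in a loop while stress testing would at least show the last voltage and clock pair before a hang, since dmesg stays silent.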

normal bluff
#

this is all that i found thats even a bit relevant anyways: Max Clock Freq. (MHz): 2475

#

ill be back in like an hour but yeah something is def fucky in the amdgpu space rn, and i think it has to do with either a kernel regression or a hardware problem

late salmon
#

i want to see where these max and min values are taken

normal bluff
#

good point