#AMDGPU crash adventure
yes
ok so i need to read the voltages, but the pp files have changed since i moved to my new gpu
i don't know how now
let me look around
/sys/kernel/debug/dri/0/amdgpu_pm_info shows me current voltage
818 mV (VDDGFX) for me
yea but thats instant, thats not what we want
we want the table
pp_od_clk_voltage
but it doesn't show the voltages on my end
let me google it
post yours here, let me see
od_clk_voltage?
OD_SCLK:
0: 500Mhz
1: 2104Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK: 500Mhz 2600Mhz
MCLK: 674Mhz 1075Mhz
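A minimal sketch for reading that table programmatically, assuming the RDNA2-style layout pasted above (section headers starting with `OD_`, clock lines ending in `Mhz`/`MHz`). The function names and the `SAMPLE` constant are made up for illustration; on a live system the text would come from the `pp_od_clk_voltage` sysfs file instead.

```python
import re

# Sample text copied from the paste above.
SAMPLE = """\
OD_SCLK:
0: 500Mhz
1: 2104Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK: 500Mhz 2600Mhz
MCLK: 674Mhz 1075Mhz
"""


def parse_od_table(text):
    """Split the text into {section_name: [lines]} keyed by the OD_* headers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("OD_") and line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def mhz_values(line):
    """Return every '<number>MHz' value on a line as ints (case-insensitive)."""
    return [int(v) for v in re.findall(r"(\d+)\s*mhz", line, flags=re.IGNORECASE)]


def summarize(text):
    """Pull out sclk/mclk limits, the voltage offset, and the allowed ranges."""
    sec = parse_od_table(text)
    summary = {
        "sclk": [mhz_values(l)[0] for l in sec.get("OD_SCLK", [])],
        "mclk": [mhz_values(l)[0] for l in sec.get("OD_MCLK", [])],
        "offset_mv": None,
        "range": {},
    }
    for line in sec.get("OD_VDDGFX_OFFSET", []):
        m = re.match(r"(-?\d+)\s*mV", line)
        if m:
            summary["offset_mv"] = int(m.group(1))
    for line in sec.get("OD_RANGE", []):
        name = line.split(":")[0]
        values = mhz_values(line)
        if len(values) == 2:
            summary["range"][name] = (values[0], values[1])
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

This only makes the numbers easier to compare between cards; it does not recover the per-state voltages, since this layout does not list them.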
amdgpu_pm_info:
GFX Clocks and Power:
1000 MHz (MCLK)
2130 MHz (SCLK)
1825 MHz (PSTATE_SCLK)
1000 MHz (PSTATE_MCLK)
818 mV (VDDGFX)
42.0 W (average GPU)
like mine, they changed it with rdna2
hmm
or 1, i come from gcn
gcn is pretty old isnt it
eh, depends
The OD_VDDGFX_OFFSET parameter in the Linux kernel is related to overclocking the graphics card. It is used by the AMDGPU driver to set the offset for the GPU voltage.
i can set the offset with this corectrl and through sysfs
so i could try tweaking it like that
and i just did a test
i ran the system at 2130 for a bit on high, and like clockwork it crashed
no voltage modification
memory 1000mhz? 🤔
there is some fuckery going on
ive never seen it go higher or lower
i can manually change it, but id like to know the root cause
im thinking so too, bc all the reports of this gpu say 2000mhz is normal
you would need to stress test heavily, not easily done with a usb stick
i guess thats true
do you have a friend nearby who can test it
i could actually use this ssd i have that has nothing on it to put arch
i still feel this must be a software issue
me too
what kernel are you on
im on 6.1.27-gentoo-r1
try 6.3
we must first reset everything
maybe im really really really dumb and have it set terribly?
no absolutely don't do the config manually
oh, i always do manual config on my kernels
go for the dist kernel
yeah but for this testing only
let me try
i have to figure out how to do that, bc ive never been able to boot a bin kernel, it can never find rootfs
let me emerge gentoo-kernel-bin
thats dist right?
be absolutely sure that the kernel you are booting on is the one with all the modules
yea
wdym?
what ver should i test?
the gentoo-kernel
should i stick with the 6.1 version im doing, now, or 6.3?
latest you can get without breaking the system
i feel like for the sake of variable consistency we should test with just one
i don't think it matters, 6900 support should have landed a long time ago
oh
let me emerge with initramfs flag bc i dont use one normally but i cant boot this without
well i compile my btrfs driver into kernel manually, but in bin kernels it is M, so i need an initramfs otherwise i cant load the btrfs module and boot
compiling rn, then ill test it out
would be fantastic if i loaded up this kernel and corectrl showed the right mem speed for example
just loaded up bin kernel
no issue getting it to work for once
[ 31.705033] xhci_hcd 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0016 address=0xffef7000 flags=0x0000]
odd dmesg thing but nothing is broken so idk
same thing with memory being 1000mhz
i will now attempt to induce a crash
https://www.reddit.com/r/linux_gaming/comments/qp6n9n/amd_6700_xt_on_pop_os_2104_stuck_at_1000mhz/
https://bbs.archlinux.org/viewtopic.php?id=263256
https://forums.linuxmint.com/viewtopic.php?t=380124
seems to be a common thing
fixable maybe?
i can change the mem speed with corectrl
at least on my other kernel i could
with bin kernel max is 1000
but on my other kernel i could go to 2000
also bin kernel is for some reason not letting me use my audio
like my usb dac for my headphones dont register
very odd
I have these valid refresh rate options (Hz):
50, 60, 75, 100, 120, 144, 165, 240
50, 60, 75, 100 and 240 are using
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz *
1: 500Mhz
2: 625Mhz
3: 875Mhz
120 and 165 are using
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz
1: 500Mhz
2: 625Mhz
3: 875Mhz *
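A small sketch of how the two listings above can be compared in a script, assuming the `index: <clock>Mhz` layout shown and that the currently active level is the one marked with `*`. The function name `active_level` is made up for illustration.

```python
import re

# One "index: clock" line, optionally followed by the '*' active marker.
LEVEL_RE = re.compile(r"\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def active_level(text):
    """Return (index, mhz) of the level marked with '*', or None if none is marked."""
    for line in text.splitlines():
        m = LEVEL_RE.match(line)
        if m and m.group(3):
            return int(m.group(1)), int(m.group(2))
    return None


# The two listings from the quoted post.
LOW_REFRESH = "0: 100Mhz *\n1: 500Mhz\n2: 625Mhz\n3: 875Mhz\n"
HIGH_REFRESH = "0: 100Mhz\n1: 500Mhz\n2: 625Mhz\n3: 875Mhz *\n"

if __name__ == "__main__":
    print(active_level(LOW_REFRESH))
    print(active_level(HIGH_REFRESH))
```

On a real system the text would be read from `/sys/class/drm/card0/device/pp_dpm_mclk`; the card index may differ.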
do you have a high refresh rate monitor?
ah that's good
it means it's reaching maximum
odd that 1000 is maximum
i didnt know ram got reported half either
i think 2133
but
...
that complicates things
but yea for that many monitor, some of which higher in freq, i think that memory freq is normal
how are the benchmarks of the card?
is it holding up with the performance?
trying to rn
this checks out
i get even more fps than this
doing my testing rn, problem is that crash is almost completely random timing wise, and my audio doesnt work so its gonna suck to play like this for a bit
its really weird, that kernel should have every module
did you enable something specifically for your sound card?
i have to go now
but i think the problem is that max voltage was too low
for some reason
good luck in this, you can keep posting here your findings so when i get back i read them
ok thanks for helping, i will update you
just got a hang using dist kernel
oddly enough, no dmesg output from the hang
i will be going back to my bin kernel now that i know that the hang occurs on both
its now between a voltage problem or some kind of hardware problem
stress testing now with increased voltage
crashed with these settings
just to rule out the voltage, ill raise to 200 now and see what happens, all else being the same
ok, that crashed too
so we can rule out a voltage issue
next is freq
im going to start by reducing min clock and max clock by 100
ok so min 2025, and max 2030 im not crashing (at least not super quick)
i guess that helps narrow this down to freq, so if i keep the frequency within 2000mhz, i should be good
played a bit with the settings like this, seems stable
ok so solution seems to be to limit the gpu freq to something below 2100
sucks that auto and high dont work anymore
but also voltage isnt the problem
even +200 mv doesnt help anything so idk but at least i can use my system
@late salmon yo whenever you're free, can we discuss the results here?
30min or more
yeah take your time bro dw, just dont forget about me lol
i already contacted amd support and updated the bug report
ahoy
so @late salmon i found some pretty interesting things
first of all, kernel version seems entirely irrelevant
what actually prevents the crashes is reducing my gpu's mhz by even 100
the question i have though, is why?
for what reason does the kernel (or the gpu) set its freq higher than it can go by default?
and on top of that, why cant my gpu reach those speeds?
are you sure the frequency it's set to is higher than stock?
stock, or auto, brings the freq to 2130 max
which crashes
but reducing the max to 2030 fixes all crashes
i had a buddy show me his, he had a reddragon one, and he was getting well over 2300 no crashes
no i meant, what is the stock frequency of an amd reference 6800 xt
let me look, i think it's 1800
ok
so base clock is 1700, game clock is 1815, and boost is 2105
Base Clock: 1825 MHz
Game Clock: 2015 MHz
Boost Clock: 2250 MHz
ok, what is the output of glxinfo | grep renderer
this will give us an idea of what the system thinks the card is
GLX_MESA_copy_sub_buffer, GLX_MESA_query_renderer, GLX_MESA_swap_control,
GLX_MESA_query_renderer, GLX_MESA_swap_control, GLX_OML_swap_method,
Extended renderer info (GLX_MESA_query_renderer):
OpenGL renderer string: AMD Radeon RX 6800 (navi21, LLVM 15.0.7, DRM 3.52, 6.3.2-gentoo)
seems correct
because corectrl reporting a 6900 triggers me
post lspci pls
just lspci?
ok
i need to see what generation it refers to
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
0b:00.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a6
0b:00.3 Serial bus controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 USB
ok, still navi21
yeah all same series
welp, it's correctly identified
are you 100% sure amdgpu module is setting last frequency at 2000+mhz?
can you post a pp_od_clk_voltage?
yup, so right now, i have it capped at 2030 as you can see
but it seems that the range can be up to 2600
it has only two states, mh
same as mine, but for example polaris had way more
also doesn't list voltages
yeah they def changed something
i did work on a server rig with amdgpus and noticed the same thing
polaris gpus as well
< For Vega10 and previous ASICs >
Reading the file will display:
- a list of engine clock levels and voltages labeled OD_SCLK
- a list of memory clock levels and voltages labeled OD_MCLK
- a list of valid ranges for sclk, mclk, and voltage labeled OD_RANGE
To manually adjust these settings, first select manual using power_dpm_force_performance_level. Enter a new value for each level by writing a string that contains "s/m level clock voltage" to the file. E.g., "s 1 500 820" will update sclk level 1 to be 500 MHz at 820 mV; "m 0 350 810" will update mclk level 0 to be 350 MHz at 810 mV. When you have edited all of the states as needed, write "c" (commit) to the file to commit your changes. If you want to reset to the default power levels, write "r" (reset) to the file to reset them.
< For Vega20 and newer ASICs >
Reading the file will display:
- minimum and maximum engine clock labeled OD_SCLK
- minimum (not available for Vega20 and Navi1x) and maximum memory clock labeled OD_MCLK
- three <frequency, voltage> points labeled OD_VDDC_CURVE. They can be used to calibrate the sclk voltage curve.
- voltage offset (in mV) applied on target voltage calculation. This is available for Sienna Cichlid, Navy Flounder and Dimgrey Cavefish. For these ASICs, the target voltage calculation can be illustrated by "voltage = voltage calculated from v/f curve + overdrive vddgfx offset"
- a list of valid ranges for sclk, mclk, and voltage curve points labeled OD_RANGE
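A hedged sketch of what capping the clock through this file might look like on a Vega20-or-newer card. It assumes the newer write syntax is "s <index> <clock>" with index 0 for the minimum and 1 for the maximum, followed by "c" to commit; that part is my reading of the same kernel doc page and is not in the excerpt quoted above. The helper names are made up, the path assumes card0, and the 2025/2030 values are just the ones tried earlier in this thread.

```python
from pathlib import Path

# Assumed location; the card index differs between systems.
OD_PATH = Path("/sys/class/drm/card0/device/pp_od_clk_voltage")


def cap_sclk_commands(min_mhz, max_mhz, sclk_range):
    """Build the strings to write, refusing values outside the OD_RANGE limits.

    Assumption: index 0 sets the minimum sclk, index 1 the maximum,
    and "c" commits the pending values.
    """
    lo, hi = sclk_range
    if not (lo <= min_mhz <= max_mhz <= hi):
        raise ValueError(
            f"need {lo} <= min <= max <= {hi}, got min={min_mhz} max={max_mhz}"
        )
    return [f"s 0 {min_mhz}", f"s 1 {max_mhz}", "c"]


def apply(commands, path=OD_PATH):
    """Write each command with its own open/write.

    Needs root, and power_dpm_force_performance_level set to manual first.
    Not called below, so running this file changes nothing.
    """
    for cmd in commands:
        with open(path, "w") as f:
            f.write(cmd + "\n")


if __name__ == "__main__":
    # Only prints the commands; SCLK range taken from the OD_RANGE paste.
    for cmd in cap_sclk_commands(2025, 2030, (500, 2600)):
        print(cmd)
```

Keeping the string building separate from the sysfs write means the commands can be checked by eye before anything touches the card.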
ok so we can set a curve based on voltage, where max voltage caps out?
no no, this was just to see that it's a documented change in the kernel module
yeah, it also tells me there is something else going on, because neither me nor you have the voltages
they can definitely be read tho, as rocm-smi reports them
sorry not rocm-smi, the other info command
i could emerge it
but if it doesnt help then it doesnt help ig
so what voltage exactly are we looking for? the table?
i did just emerge it btw if you want any info from rocminfo
dw they aren't useful, i wanted a command we run before
its a sysfs file that displays current info
like current voltage applied
this is all that i found thats even a bit relevant anyways Max Clock Freq. (MHz): 2475
ill be back in like an hour but yeah something is def fucky in the amdgpu space rn, and i think it has to do with either a kernel regression or a hardware problem
i want to see where these max and min values are taken
good point