#First time 2v2 AMD driver timeout


rugged canopy
real peak
#

thank you for the reports! We're looking into GPU crash issues - here's a couple troubleshooting articles that might help, but it could be a game issue too

https://theorycraft.helpshift.com/hc/en/3-theorycraft/faq/25-outdated-graphics-drivers/

https://theorycraft.helpshift.com/hc/en/3-theorycraft/faq/28-crashing-black-screen-performance-audio-and-other-common-tech-issues/

rugged canopy
real peak
#

oh yes, this is very possible! This is like when you call for network support from your ISP and they say "is the router plugged in? Can you turn it on and off again?" and you roll your eyes like "yeah of course". Gotta check the obvious stuff first

#

shot in the dark here, but can you check if you have "Lumen" enabled in your video settings, and if so, turn it off? It should only affect the lobby, but I see some caching misses in the log

Not likely to be a fix, but worth a shot

rugged canopy
real peak
#

it is the same crash, yep

#

D3D12RHI::TerminateOnGPUCrash if you're interested 🙂

real peak
#

ah - would also be handy if you can record a video showing when it happens

rugged canopy
#

It feels like it happens when some champion is doing something~

#

Or I could try using OBS CPU-based recording that would not interrupt on screen restart. Hmm.

real peak
#

we have an issue I'm thinking of where some folks seem to be crashing when certain UI shows up, so I'm wondering if this is the same issue hitting you here

#

yeah OBS is the way to go IMO if you have it setup

rugged canopy
#

And if the game itself somehow uses the UE5 built-in internet browser, that also might be an issue. AMD REALLY does not like hardware acceleration in non-gaming apps turned on when gaming apps are opened. I've crashed more in a month of using AMD than in 15 years of using nvidia because of this.

real peak
#

yeah AMD drivers are not the best IMHO, love the CPUs but the GPUs can have weird stuff like that

rugged canopy
#

Also AMD has biiig issues with DX12 lately.

#

Now, starting the recording session and jumping into the 4v4 queue. Should crash in a couple of games max.

#

I'll actually stream it to twitch so it will auto-archive

rugged canopy
#

Soooo.
I've played a couple of 4v4 matches and didn't crash once. That's after disabling Lumen.

real peak
#

hmmmmm, that is... very suspicious

#

because it really shouldn't matter. I'll pass this along to our rendering pro

novel galleon
#

I can add I've had a single crash on supervive and I think it was after enabling lumen after seeing guntharf talk about it on a post, it only happened to me once though

#

But I'm pretty sure it was after I enabled the setting. Yeah, I just looked it up: I enabled it on the 27th and I crashed a couple of days ago, like 4 or so at most

gentle pumice
#

Interesting, Lumen is only active as opt in from menus and only active on Lobby/Hero select. We force disable when in game

real peak
#

keep us posted - if you continue to have no crashes that would be interesting information. I'm wondering if maybe it's just luck of the draw since your crashes don't look that frequent

rugged canopy
#

Played a bit more and still no crashes.
I've checked my 4v4 stats on supervive stats website and at the moment of creating this thread I had a crash every 10 games.
I've played more than 10 games now and still no crash.
Will try more tomorrow!
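
A rough way to put a number on the "luck of the draw" question: the sketch below assumes, purely for illustration, that every game crashes independently with the same probability (1 in 10, taken from the stats quoted above). The function name and the independence assumption are not from the thread.

```python
def prob_no_crash(games: int, crash_rate: float = 0.1) -> float:
    """Chance of seeing zero crashes in `games` games.

    Assumption (not from the thread): each game crashes independently
    with the same probability `crash_rate`.
    """
    if games < 0:
        raise ValueError("games must be non-negative")
    if not 0.0 <= crash_rate <= 1.0:
        raise ValueError("crash_rate must be between 0 and 1")
    return (1.0 - crash_rate) ** games


# How surprising is a crash-free streak of various lengths?
for n in (10, 20, 30):
    print(f"{n} games without a crash: {prob_no_crash(n):.1%}")
```

Under that assumption, 10 clean games in a row still happens about a third of the time, so a longer crash-free streak would be needed before crediting the Lumen change.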

gentle pumice
#

thanks for the update!

real peak
rugged canopy
real peak
#

can you do the following steps please?

Head to your Temp folder C:\Users\%USERNAME%\AppData\Local\Temp

Look for recently created files ending in dmp with the format PACKER_*_*.dmp

Examples: %TEMP%\PACKER_9719935093_25724.dmp or %TEMP%\PACKER_954071203_58416.dmp

Zip them up and send them to me directly (do not share here as they may contain sensitive information). They may be large; if so, they can be uploaded to a secure location such as Google Drive and then shared directly with me
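
The collection steps above could be scripted roughly as follows. This is only a sketch: the function name and the output file name are hypothetical, and it assumes the dumps sit directly in the Temp folder and match the `PACKER_*_*.dmp` pattern quoted above.

```python
import glob
import os
import tempfile
import zipfile
from typing import List, Optional


def zip_packer_dumps(out_zip: str, temp_dir: Optional[str] = None) -> List[str]:
    """Zip every PACKER_*_*.dmp file found directly in `temp_dir`.

    `temp_dir` defaults to the current user's temp folder. Returns the
    dump paths that were added, newest first; writes nothing if none exist.
    """
    folder = temp_dir or tempfile.gettempdir()
    pattern = os.path.join(folder, "PACKER_*_*.dmp")
    dumps = sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
    if not dumps:
        return []
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in dumps:
            # Store only the file name so the archive does not expose the user path.
            zf.write(path, arcname=os.path.basename(path))
    return dumps
```

Calling `zip_packer_dumps("packer_dumps.zip")` would leave a single archive to share privately, as requested above, rather than posting it in the thread.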

rugged canopy
#

I'm wondering if it's crashing only when I play certain characters.
I do seem to remember that on each of the crashes I had, there was an enemy Kingpin in my vision range. But I can't say for sure.

real peak
#

if it's a rendering thing, it's possible certain visual effects cause the issue, tough to isolate though

rugged canopy
#

Okay my gpu driver just crashed while the game was in the background for a couple of minutes. xP But that might be because yesterday I turned on discord hardware acceleration, since I wanted to see video attachments actually played in 4k. That's AMD's fault, if that was the case.

gentle pumice
#

mm yeah, unsure what occurred here

```
[2024.12.10-19.15.16:153][ 46]LogD3D12RHI: Error: CurrentQueue.Fence.D3DFence->GetCompletedValue() failed
 at C:\TheoryCraft\build-staging\Engine\Source\Runtime\D3D12RHI\Private\D3D12Submission.cpp:1013
 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG
```

seems to indicate just a GPU/Driver timeout and app closing because of it
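
For anyone scanning their own logs for the same signature, here is a minimal sketch that pulls the error and reason codes out of entries shaped like the one above. The regex and function name are illustrative assumptions based on this single excerpt, not a documented Unreal log format.

```python
import re

# Matches the "with error X with Reason: Y" tail seen in the excerpt above.
# The "with Reason" part is treated as optional (an assumption).
DEVICE_ERROR = re.compile(
    r"with error (?P<error>DXGI_ERROR_\w+)"
    r"(?: with Reason: (?P<reason>DXGI_ERROR_\w+))?"
)


def find_device_errors(log_text: str):
    """Return (error, reason) pairs for every DXGI error found in the text."""
    return [(m.group("error"), m.group("reason"))
            for m in DEVICE_ERROR.finditer(log_text)]


SAMPLE_LOG = r"""[2024.12.10-19.15.16:153][ 46]LogD3D12RHI: Error: CurrentQueue.Fence.D3DFence->GetCompletedValue() failed
 at C:\TheoryCraft\build-staging\Engine\Source\Runtime\D3D12RHI\Private\D3D12Submission.cpp:1013
 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG"""

print(find_device_errors(SAMPLE_LOG))
```
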
rugged canopy
# gentle pumice mm yeah, unsure what occured here [2024.12.10-19.15.05:484][ 46]LogD3D12RHI: ...

So yes, confirming it's the AMD problem with hardware acceleration.

If more than one app runs on the gpu, AMD is losing its shit basically.
Any internet browser with hardware acceleration, discord with hardware acceleration, STEAM with hardware acceleration. Alt-tabbing to them while in a resource-intensive game results in the driver timing out for some godforsaken reason.

But AMD does not give a square crap about it. :/

gentle pumice
#

Yeah, I'm uncertain what internal scheduling and logic inside the driver might cause the instability.
Some folks online have mentioned that modifying settings to favor performance over power savings has helped in this area

rugged canopy
gentle pumice
rugged canopy
rugged canopy
gentle pumice
#

The last 2 reports don't contain any logs or crash info..

Do you have any overlays enabled? AMD, Discord, Steam?

rugged canopy
#

Oops, wrong folder zipped. xP
Steam and amd yes. Discord overlay is disabled.

gentle pumice
#

mm same issue as before. the GPU seems to be timing out.

Does it occur when disabling Lumen from the graphics settings?

If it still crashes, does it crash on other quality settings? Low, Mid, High?

rugged canopy
rugged canopy
#

Also I'll add what I've told safelocked:

Locating the source of this problem might be ridiculously hard. AMD drivers are pure bullshit and for the past 1-2 years they've had enormous problems with DX12 in many games.

For example:
Firefox with hardware acceleration with a certain website opened as active tab -> launching Diablo 4 -> 100% crash each time when trying to enter the world
Changing the tab to any other website, no crash, proper load... And if I remember correctly it got fixed.

I've already gone through disabling hardware acceleration in discord, firefox and steam because it's causing driver timeouts, mostly when alt-tabbing.

feral pasture
#

I got curious,
TBH, it sounds like something is dying in the GPU / FW, causing an unhandled issue and then a timeout.
Does AMD have some kind of post-mortem tool that gives you draw calls and compute dispatches and can create an in-depth dump? 🤔 That'd be the best bet imo

rugged canopy
#

I've sent AMD bug report dumps to Koalifier, and I think he did pass it to Felipe

feral pasture
#

AMD bug reports are not necessarily a proper post-mortem tool

rugged canopy
#

Most game crashes are resolved by AMD after a couple of months

#

well I don't have anything better than that

#

From user pov it's just driver freezing and watchdog kicking in and restarting it.
The interesting part is that it's not always freezing. Sometimes the watchdog kicks in even though everything was rendered smoothly~
:D

feral pasture
rugged canopy
feral pasture
#

wait until he wants them

#

or not

#

maybe he wants something else or better :)). I think AMD has multiple post-mortem tools IIRC

#

I just work as a gpu driver engineer so I'm used to these types of things :'')) it's why I got curious, since it's a gpu and DX issue

gentle pumice
#

I could take a look at the dumps to see if there is anything special or a pattern there.. but this doesn't seem to be widespread enough for us to suspect it is due to a code defect in Unreal Engine or the AMD driver.

I've been searching online for these issues on AMD and oh boy.. there are a ton of speculative solutions and nothing seems to be a sure thing. Some folks had luck disabling XMP, some folks updated motherboard bios, some folks replaced power supply and/or power cables, some folks found success raising the TdrDelay values in the Registry, disabling Windows MPO worked for some

I've been running here locally with a few AMD (RX 480 and 580) cards lately to see if I hit some of the issues but haven't hit any obvious problems
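
As one concrete illustration of the TdrDelay suggestion mentioned above, the sketch below generates a `.reg` file that sets the value. The key path and the `TdrDelay` DWORD name are the ones Microsoft documents for TDR; the helper name, the example value of 10 seconds, and the 1-60 second guard are assumptions made for this example. Microsoft describes these keys as meant for testing and debugging, and nobody in the thread confirmed that raising the delay helps with this crash.

```python
# Registry key that holds the Windows TDR (Timeout Detection and Recovery) settings.
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg_file(seconds: int) -> str:
    """Build the text of a .reg file that sets TdrDelay to `seconds`.

    The 1..60 range is an arbitrary sanity guard chosen for this sketch.
    """
    if not 1 <= seconds <= 60:
        raise ValueError("pick a delay between 1 and 60 seconds")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        f'"TdrDelay"=dword:{seconds:08x}',  # DWORD written as 8 hex digits
        "",
    ]
    return "\n".join(lines)


print(tdr_delay_reg_file(10))
```

The output is only text; importing it into the registry remains a manual, at-your-own-risk step.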

rugged canopy
# gentle pumice I could take a look a the dumps to see if there is anything special or a pattern...

Yeah, I went through them all. I've had more driver crashes in a year of AMD usage than in almost 15 years of nvidia usage.
Hardware acceleration is the #1 issue. Then come the drivers, which they update much less frequently. With diablo 4 I've had plenty of crashes, but after some updates I haven't had any since... well, feels like forever now. I'll see if they'll update the driver for supervive on launch. Hopefully.

I've tried the post-mortem tool but after I think 10+ games (which I've streamed also), no crashes. Though, my FPS were affected baaadly. Dips even below 40 FPS on 7900 XTX on medium. This tool is a butcher lol but if it will produce anything interesting on crash, I'm down to continue using it.

gentle pumice
#

I wonder if the throttling from the tool (copying GPU state to ram, etc) reduces load enough to improve stability..

feral pasture
# gentle pumice I wonder if the throttling from the tool (copying GPU state to ram, etc) reduces...

Since it's an external tool doing the capture, it has to modify the cmd buffers that are being submitted / activate some states. So the time in between submission of cmd buffers is going to be higher.

Though it means that it's likely not contained in a single submit but potentially spread across multiple submits (including across apps).

To me it sounds like a scheduling issue on the hw side of AMD.

Hard to say what it could be without an actual hw dump though with these kinda issues :(

I've not looked at the actual dump though since I don't have a laptop with me 🙈

rugged canopy
#

Well today was something new. I've made a screenshot of my lobby that contained mostly bots, and it crashed. Lol.

feral pasture
feral pasture
#

they're binary files

rugged canopy
feral pasture
#

what is the UEMinidump.dmp memory dump file?

#

in the zip

rugged canopy
#

first .zip is from amd

feral pasture
#

can you open the RGD file ?

#

the text file just says what memory and GPU are being used; the RGD file is the Radeon file that should have all the information

rugged canopy
#

that's the .txt that opens when i double click rgd file in the radeon developer panel

feral pasture
#

Ahhh

#

MARKERS IN PROGRESS

INFO: no markers in progress since no command buffers were in flight during the crash.

=====================
EXECUTION MARKER TREE

INFO: execution marker tree is empty since no command buffers were in flight during the crash.

==================
PAGE FAULT SUMMARY

INFO: no page fault detected.

#

All work is completed but the VK fence didn't get an answer back

#

🙄

#

It's useful to know that everything completed correctly but if this is accurate and the VK fence didn't get signaled this sounds like an amd driver issue.

rugged canopy
#

i've checked it now so might be more

#

but idk how to open the rgd file

feral pasture
#

There's like a PowerPoint from amd that explains it

#

slide 23

rugged canopy
feral pasture
#

That's fine, is there anything in the output log at the bottom that looks suspicious?

#

I think they just updated their ui

rugged canopy
feral pasture
#

Well, if the log is correct then the GPU is basically idling because it has no work, and then the fence times out since it didn't get signaled. Not sure what's happening :/

rugged canopy
#

For now I'm disabling the post-mortem capture tool because it's decimating my FPS and also makes my CPU stay on 75°C+ during the game instead of ~50s.
If you'd like any other post-mortem tool to be used Moozels just tell me which one

gentle pumice
#

inspecting these, it mostly seems caused by some page fault (reading or writing from a restricted or unavailable part of memory) from the driver side..
DirectX usually fails or returns with error codes when the application (Unreal Engine) feeds it bad inputs and parameters, this issue however seems to be mostly on the driver side.

Offending VA: 0x61000 mostly says that there is no permission to read or write into that offset, or the offset is invalid
Tool doesn't seem to link that offset to any relevant resource (texture, buffer, render target, etc)

rugged canopy
gentle pumice
#

It’s been this dance between Unreal Engine, AMD/Nvidia and Microsoft..

It really helps that Fortnite is still doing well as that has encouraged those 3 companies to work well toward making UE5, DirectX 12 and GPUs play nicely. Things used to be quite bad like a year ago (I feel bad for any UE5 game that shipped during that window). We try to keep the version of Unreal 5 fairly up to date (one large revision behind) to take on all the fixes to improve overall performance and stability.

It’s too bad that AMD doesn’t have a version of what Nvidia calls “Studio Drivers” which tend to sacrifice a tiny bit of performance but with much added stability

rugged canopy
#

Oh there were actually. I can't find them right now, but those were drivers without Adrenalin bloatware. I was considering using them, but the Radeon Chill feature is only available with Adrenalin unfortunately, and I really like using it.

feral pasture
#

the page fault itself is real weird.
because you're still seeing a timeout (since the fence is hitting 5 seconds). I feel that if there was such a fatal page fault, the gpu should just crash and reset directly rather than wait and time out, unless this is how AMD HW works.
It feels more like the fault is due to the timeout happening; then, as memory gets cleared, you hit a wrong VA.
also none of the MDI's have started which should mean nothing should (?) have even begun being accessed.

I feel bad for any engine makers trying to get performance tbh 😢 Having to support so many products with so many needs for WA to get performance out of each products 😂

#

@rugged canopy I wonder if having the 2 GPUs being recognized and used is causing some weird issues with shuffling of resources and allocating resources to the wrong places

wise belfry
#

I get this all the time too, happens during games a lot. sometimes my entire pc just crashes which idk if that is related to sv but it would never happen before.

rugged canopy
feral pasture
rugged canopy
#

yeah just crashed on last 4 teams in game on small circle...

rugged canopy
#

Well, I've played a couple of games now, with the new driver, and I did not crash even once.
Suspicious.

rugged canopy
pallid sentinel
#

@rugged canopy still facing this?

rugged canopy
pallid sentinel
#

working now or what

rugged canopy
pallid sentinel
#

ok bro.

gentle pumice
rugged canopy
#

@gentle pumice I wonder, is there a possibility of introducing GPU re-hooking in the case of gpu crash?
Some games do it.
After gpu crashing, the application does not close itself, but rather re-hooks to the new instance (?) of the gpu that reboots itself after the driver crash, and continues normally after.
For example - Heroes of the Storm. My gpu can crash there, but in 8/10 of cases the game does not close. It uses the after-crash instance of GPU and continues working, except for some small visuals like cooldowns opacity.

It would almost solve the problem of the timeouts by reducing the time to reconnect to the game by 99% (steam keeps thinking SV is running, so you have to restart steam, it takes time...)

rugged canopy
#

I haven't crashed even once since servers are up, that's 7 hours Hmm

rugged canopy
#

Nevermind, after 9h crash. That must be something with hardware acceleration.

/edit:
I'm crashing quite frequently now :(

gentle pumice
#

Heya,
About re-hooking the GPU
I think it's very unlikely that we would have the bandwidth and resources to implement that on our own, at least not at the current team size and resources.
Unreal Engine's RHI layer is massive, and especially the DX12 layer is "fat" in complexity. UE is also not great at the moment at resetting state, doing smart reloads, etc. Even for simple state changes, the solution is to "restart the client", which is a bit absurd at times.

We would hope that Epic adds more robustness to the DX12 RHI layer to support restoring state and re-hooking after a crash, and we can make sure SUPERVIVE is able to easily merge the latest changes made to Unreal Engine

rugged canopy
#

Thank you for the explanation!
I'll need to make a .bat script to kill steam background processes to make restarting quicker then :)
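
A rough Python equivalent of such a script is sketched below. The process names (`steam.exe`, `steamwebhelper.exe`, `steamservice.exe`) are assumptions about what a typical Steam install runs in the background, and the helper names are hypothetical. It defaults to a dry run, since force-killing Steam mid-update or mid-download is not something the thread vouches for.

```python
import subprocess
import sys

# Assumed Steam background processes; adjust to what Task Manager actually shows.
STEAM_PROCESSES = ("steam.exe", "steamwebhelper.exe", "steamservice.exe")


def taskkill_commands(process_names=STEAM_PROCESSES):
    """Build one `taskkill /F /IM <name>` command per process name."""
    return [["taskkill", "/F", "/IM", name] for name in process_names]


def kill_steam(dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False on Windows."""
    for cmd in taskkill_commands():
        if dry_run or sys.platform != "win32":
            print("would run:", " ".join(cmd))
        else:
            # check=False: a process that is not running should not abort the loop.
            subprocess.run(cmd, check=False)
```

Running `kill_steam(dry_run=False)` from an elevated prompt may be needed for the service process; the same three `taskkill` lines could equally go straight into a `.bat` file.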