#Random crash to black screen, driver fails?

brisk shadow
#

Hello, for a week or so I started facing random crashes to black screen.

Setup:

  1. Latest SwarmUI 0.9.6.0 (2025-05-06)
  2. Windows 11, RTX 4090, 64GB DDR5, 7950X3D
  3. Driver: 576.28

What is happening and when:

  1. I load an Illustrious model, generate at 896x1152, no issues as expected.
  2. When I get something that I like, I use it as init.
  3. I turn on the refiner, selecting a different Illustrious model at low control and 1.5x upscale (0.15 for example, but it doesn't seem to matter if it's low or high).
  4. Most of the time it works, but randomly, without any prior indication, everything freezes, the screens turn completely black, and the driver crashes. After roughly 30 seconds the screens turn back on with their order reversed, the main-screen selection reset, resolutions messed up, etc. That makes me think it's a driver crash.

Anyone faced this before? I was doing tests with VRAM monitoring and it never went to full, but sometimes got close.
And, which is critical, I've been successfully doing this exact flow (1.5x upscale with different refiner model than base) for months.

Note: I tried using DDU and rolling back to earlier driver version, but issue persisted.
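
For the VRAM monitoring mentioned above, one option is to log samples to disk so the last readings before a black-screen crash survive. The sketch below is only an illustration, not something from the thread: it assumes `nvidia-smi` is on the PATH and that the GPU reports numeric values for every queried field (some cards report `[N/A]` for power draw, which this parser does not handle). The file name `gpu_log.csv` and the function names are hypothetical.

```python
import os
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu=...`
QUERY_FIELDS = "timestamp,memory.used,memory.total,temperature.gpu,power.draw"


def parse_sample(line: str) -> dict:
    """Parse one CSV line produced by the nvidia-smi query in read_sample()."""
    timestamp, used, total, temp, power = [part.strip() for part in line.split(",")]
    return {
        "timestamp": timestamp,
        "vram_used_mib": int(used),
        "vram_total_mib": int(total),
        "temp_c": int(temp),
        "power_w": float(power),
    }


def read_sample(gpu_index: int = 0) -> dict:
    """Ask nvidia-smi for a single sample from one GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
            f"--id={gpu_index}",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


def log_forever(path: str = "gpu_log.csv", interval_s: float = 1.0) -> None:
    """Append one sample per interval, forcing each line to disk immediately."""
    with open(path, "a", encoding="utf-8") as log_file:
        while True:
            s = read_sample()
            log_file.write(
                f"{s['timestamp']},{s['vram_used_mib']},{s['vram_total_mib']},"
                f"{s['temp_c']},{s['power_w']}\n"
            )
            log_file.flush()
            os.fsync(log_file.fileno())  # so the last line survives a hard crash
            time.sleep(interval_s)


# log_forever()  # uncomment to start logging; stop with Ctrl+C
```

Running it in a separate terminal while generating gives a record of VRAM use, temperature, and power draw in the seconds before a crash.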

rancid storm
#

Might be driver bug (recently nv drivers are a bit wonk), might be hardware issue (yike), might be overheat -- in the lattermost case, downclocking the GPU and/or lowering the powerlimit will prevent that for only a modest % loss in gen speed

#

and yes full crash to black screen is "the graphics card died and came back" which can be anywhere from driver to hardware

brisk shadow
#

Well damn. There is literally zero other issues with the card, or the system. This is the only situation where this happens. Any ideas how to troubleshoot it further?

#

And how should I go about downclocking, any reliable software for that?

#

I'm perfectly fine with performance loss as a trade for stability.

rancid storm
#

quick glance at google says you can set a power limit 15-25% lower than max and only have a 3-5% speed loss, on a 4090

brisk shadow
#

All right, power limit set to 80% in Afterburner, I'll do some further tests later. Thanks!
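
For anyone who prefers the command line to Afterburner, a percentage limit can be converted to watts and applied with `nvidia-smi -pl`. This is a rough sketch, not a tested recipe: the 450 W default is an assumption for a reference RTX 4090 (the actual default can be read with `nvidia-smi --query-gpu=power.default_limit --format=csv`), the helper names are hypothetical, and setting the limit normally needs an elevated prompt.

```python
def power_limit_watts(default_limit_w: float, percent: float) -> int:
    """Convert a percentage of the default power limit into whole watts."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return round(default_limit_w * percent / 100)


def nvidia_smi_power_limit_command(watts: int, gpu_index: int = 0) -> list[str]:
    """Build (but do not run) the nvidia-smi command that sets the limit."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


# Assumed default of 450 W; check your own card before using this value.
watts = power_limit_watts(450, 80)
print(" ".join(nvidia_smi_power_limit_command(watts)))
```

The command is only printed here so it can be reviewed before being run by hand.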

rancid storm
#

I wouldn't expect a graphics crash from ram errors? but yeah can't hurt to test

brisk shadow
#

So far I only set the power limit. I couldn't test extensively, but there were no driver crashes.

brisk shadow
#

No issues since I've set the power limit to 80%, thanks!

brisk shadow
#

Hmm, errors returned, but this time I managed to crash the gen before it bricked my PC, so I have console log

brisk shadow
#

I'm trying to figure out if, when using two different models from same architecture (SDXL Illustrious), I can somehow save memory by loading CLIP and VAE once, for example.

brisk shadow
#

I'm at my wits' end. It can't be a VRAM issue. I tested generating with one IL model as main and another IL model as refiner, and with everything fully loaded I cap out at 19 / 24 gigs of memory, but the refine/upscale part often results in the system hanging. It grinds to a halt, then kind of unlocks itself enough for me to be able to close the SwarmUI cmd.

Visually, it generates the image on original model fully, then FULLY refines it using refiner model - and then hangs. At the very end of refining phase.

I'm not sure when upscaling kicks in. If it happens at the end of refinement, maybe that's something to look at?

Testing it further: power limit doesn't seem to be a factor. At 70/80/90/100% it tends to give the same results. I verified this by running some synthetic GPU stress tests, and the card was stable.
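
To put numbers on the upscale step: a 1.5x upscale per side means 2.25x the pixels, so any step that runs at the upscaled resolution handles a much larger image than the base pass. A small worked example follows. The factor of 8 is the usual SDXL-family VAE downscale; the function name is made up for illustration, and how SwarmUI rounds dimensions internally is not confirmed here.

```python
VAE_DOWNSCALE = 8  # SDXL-family VAEs map an 8x8 pixel patch to one latent position


def upscale_stats(width: int, height: int, factor: float) -> dict:
    """Return the upscaled pixel size, latent size, and pixel-count ratio."""
    new_w = round(width * factor)
    new_h = round(height * factor)
    return {
        "pixels": (new_w, new_h),
        "latent": (new_w // VAE_DOWNSCALE, new_h // VAE_DOWNSCALE),
        "pixel_ratio": (new_w * new_h) / (width * height),
    }


stats = upscale_stats(896, 1152, 1.5)
print(stats["pixels"])       # upscaled image size in pixels
print(stats["latent"])       # matching latent size
print(stats["pixel_ratio"])  # how many times more pixels than the base image
```

For 896x1152 at 1.5x this gives a 1344x1728 image, a 168x216 latent, and 2.25 times the pixel count of the base generation.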

rancid storm
#

if you're doing multiple models at once, that makes sense potentially

rancid storm
#

if you want some aggro memory reduction for XL models, Server->Backends->edit backend->in ExtraArgs add --fp8_e4m3fn-unet

rancid storm
#

enabling VAE Tiling can fix that

brisk shadow
#

Will try that. What worries me is that I made thousands of gens where that was not the issue, then bam - for the past two weeks or so, crashes.

#

Changed log to Debug, found this before whole desktop crashed (recovered without restart with resolutions fucked and screen order reversed, but recovered).

[ComfyUI-0/STDERR] Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.

#

And I was nowhere near memory capacity.
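
When the log is set to Debug it gets long, so a quick filter for memory-related lines can help spot warnings like the one above. This is a minimal sketch under the assumption that the log has been saved as plain text; the pattern list and function name are illustrative and not an exhaustive set of messages ComfyUI or SwarmUI can emit.

```python
import re

# Case-insensitive match for common out-of-memory wording (illustrative only).
OOM_PATTERN = re.compile(r"out of memory|OutOfMemoryError", re.IGNORECASE)


def find_memory_warnings(log_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that mention running out of memory."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if OOM_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


sample_log = (
    "[ComfyUI-0/STDERR] Warning: Ran out of memory when regular VAE encoding, "
    "retrying with tiled VAE encoding.\n"
)
for number, line in find_memory_warnings(sample_log):
    print(f"{number}: {line}")
```

To scan a real log, read the saved file into a string and pass it to `find_memory_warnings`.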

rancid storm
#

VAE ran out of memory would be VRAM ftr

rancid storm
# brisk shadow Will try that. What worries is me is that I made thousands of gens where that wa...

So, some options here:
(A) coincidence, sometimes stuff just happens
(B) windows or driver updates. nvidia's recent driver updates of the past few months have had some reported issues.
(C) comfy updates or custom node installs (unlikely it'd only affect you in particular but who knows)
(D) something eating memory. maybe some unrelated software updated, maybe you installed some new background thing. eg my fucking discord is eating 1.5 GiB of ram rn for some ungodly reason, so this is absolutely a risk.
(E) hardware issues. If the initial issue was an overheat, it's quite common for repeated overheats to leave lasting damage even after you put a stop to the overheating. Usually that would just manifest as continued stability issues, not memory behavior changing tho.
(F) change in your workflow. maybe prior to two weeks or so ago, you were using a different model or not upscaling as much, or whatever

#

I'd focus on the least convoluted first. It says memory issue, you have memory issue, look at memory things.

brisk shadow
#

Thanks, I'll go over those suggestions. Your help is much appreciated!

pseudo fjord
#

have you found a solution to this? i'm facing similar issues

brisk shadow
#

No, nothing permanent. Newest issue with Wan 2.2 may suggest something about offloading being fucked or something, but I'm grasping tbh

thick urchin
#

Try this...