#POD spent 9 hours in a GPU ERR state doing nothing

indigo notch
The GPU usage was at 100% in the GUI, too, so I thought it was doing work.

  • ERR! in Fan column indicates HARDWARE FAILURE
  • 100% GPU Utilization with ZERO running processes
  • GPU consuming 140W power while doing NOTHING
  • Only 1MiB/46068MiB memory used but GPU stuck at 100%
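
A check along these lines might catch the state described above from inside a pod before hours are lost. It is only a minimal sketch under stated assumptions: the function names and thresholds are hypothetical, and the exact text `nvidia-smi` prints for a failed fan sensor is assumed to contain "err". Because `nvidia-smi` inside a container often cannot list processes, the sketch looks at memory in use rather than the process table.

```python
import subprocess

# Real nvidia-smi query fields; the thresholds below are illustrative guesses.
QUERY = "index,utilization.gpu,memory.used,power.draw,fan.speed"


def to_float(text):
    """Return a float, or None for non-numeric values such as "[N/A]"."""
    try:
        return float(text)
    except ValueError:
        return None


def find_suspect_gpus(csv_text, util_min=90.0, mem_max_mib=16.0):
    """Flag GPUs that report high utilization with almost no memory in use,
    or whose fan field contains an error marker."""
    suspects = []
    for line in csv_text.strip().splitlines():
        index, util, mem, power, fan = [f.strip() for f in line.split(",")]
        reasons = []
        util_v, mem_v = to_float(util), to_float(mem)
        if util_v is not None and mem_v is not None:
            if util_v >= util_min and mem_v <= mem_max_mib:
                reasons.append(f"{util}% utilization with only {mem} MiB in use")
        # Assumption: a failed sensor shows up as text containing "err".
        # "[N/A]" is normal on passively cooled cards and is not flagged.
        if "err" in fan.lower():
            reasons.append(f"fan reading is {fan!r}")
        if reasons:
            suspects.append((index, reasons))
    return suspects


def query_gpus():
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `find_suspect_gpus(query_gpus())` at job start, or periodically from a watchdog, would return a list of `(index, reasons)` pairs that a script could use to stop the pod early instead of waiting on a GPU that will never do work.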

[rank0]: RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

I just stopped the pod. I am not terminating it in case you guys need to investigate, and you can ask here for more details, like the pod ID.
