# How worried should we be about "SCSI:recovered error" in the debug event log?


nocturne rock
#

Example:
1/27/2025 02:21:08 s01-01 DEBUG disk.IO.status: deviceName="0a.02.16", ETime="45", cdb="0x8f:000000014f352800:00000400", victimRetryCount="0", retryCount="0", timeoutRetryCount="0", pathRetryCount="0", adapterStatus="0x0", targetStatus="0x2", sSenseKey="SCSI:recovered error", sSenseCode="", iSenseKey="0x1", iASC="0xb", iASCQ="0x96", pathsTried="1", basicTimeout="10", returnCode="5", disk_information="Disk 0a.02.16 Shelf 2 Drawer 2 Slot 4 Bay 16 [NETAPP X318_HARHE08TA07 NA01] S/N [XXXXXXXX] UID [5000CCA2:5480459C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]"
1/27/2025 02:21:08 s01-01 INFORMATIONAL disk.ioRecoveredError.retry: Recovered error on disk 0a.02.16: op 0x8f:000000038e352800:00000400 sector 10 SCSI:recovered error - Disk used internal retry algorithm to obtain data (1 b 96 96) (45) Disk 0a.02.16 Shelf 2 Drawer 2 Slot 4 Bay 16 [NETAPP X318_HARHE08TA07 NA01] S/N [XXXXXXXX] UID [5000CCA2:5480459C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]
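
For anyone who wants to pull these events apart, here is a rough Python sketch that splits the `key="value"` fields of a `disk.IO.status` line into a dict and names the sense key. The `sample` string is an abridged copy of the event above, `parse_event_fields` and `SENSE_KEYS` are made-up names for illustration, and the table lists only a few sense keys from the SCSI standard. The vendor-specific ASC/ASCQ values are left undecoded.

```python
import re

# Matches key="value" pairs as they appear in the disk.IO.status event.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')

# A few sense key names from the SCSI standard (not a complete table).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
}


def parse_event_fields(line):
    """Return every key="value" pair found in the line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


# Abridged copy of the event shown above (most fields left out for brevity).
sample = (
    'disk.IO.status: deviceName="0a.02.16", ETime="45", retryCount="0", '
    'targetStatus="0x2", sSenseKey="SCSI:recovered error", sSenseCode="", '
    'iSenseKey="0x1", iASC="0xb", iASCQ="0x96", returnCode="5"'
)

fields = parse_event_fields(sample)
sense = int(fields["iSenseKey"], 16)
print(fields["deviceName"], SENSE_KEYS.get(sense, "other/unknown"))
```

Sense key 0x1 means the command completed and the drive is only reporting that it needed some internal recovery to get there.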

We get this on a few disks... seems kinda random... the disks are not failed or anything... and if we show the event log without "priv set diag", the information is not shown... I can see it has been there for a while... question is whether we need to preventively fail the affected disks? Or wait until ONTAP does it for us?

#

...and would NetApp just replace a disk that has been failed manually?

still loom
#

make sure the firmware is up to date... I've never had an issue with getting a replacement for a disk I had to fail manually

nocturne rock
#

the fw is up to date... and we got like 3 disks reporting errors in the logs that I can see going back the last 3 days... so there may be more... the question is whether this is critical... the error is recovered after all... and ONTAP doesn't fail the disk... so...?

still loom
#

disks aren't perfect. Once in a while one gets a "bad batch" or a new revision that needs some fw tweaking. If the errors aren't causing hiccups, then it's not a problem. If they are all in the same raid group, you might want to fail at least one, so you don't get 3 failures at the same time.
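
The raid group check can be sketched in a few lines of Python. This is only an illustration: `groups_with_multiple_affected` is a made-up helper, and the disk-to-raid-group mapping below is hypothetical example data that you would fill in by hand from your own system's disk listing.

```python
from collections import defaultdict


def groups_with_multiple_affected(affected, raid_group_of):
    """Return {raid_group: [disks]} for groups holding more than one affected disk.

    Disks missing from raid_group_of are skipped rather than guessed at.
    """
    by_group = defaultdict(list)
    for disk in affected:
        group = raid_group_of.get(disk)
        if group is not None:
            by_group[group].append(disk)
    return {g: sorted(d) for g, d in by_group.items() if len(d) > 1}


# Hypothetical example data, not taken from the thread.
affected_disks = ["0a.02.16", "0a.02.17", "0b.03.05"]
raid_group_of = {
    "0a.02.16": "rg0",
    "0a.02.17": "rg0",
    "0b.03.05": "rg1",
}

print(groups_with_multiple_affected(affected_disks, raid_group_of))
```

Any group that shows up in the result is one where failing a disk ahead of time might be worth considering.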

queen sandal
#

As long as you don't see any performance impact, I would just let the disk be there until it is failed by the system. If you see performance issues (usually easily visible in statit -b/statit -e output as increased latency on that single disk), just fail it manually.

nocturne rock
#

I found out that there actually was a newer FW for most of the disks complaining ("X318_WVELE08TA07"). They are now all upgraded, and I have yet to see the same errors... I also noticed that in all the errors I saw before, each disk always reported the same sector (not the same sector across disks), typically a low number like 1, 3, 7 or 10... but hopefully the FW has fixed this...
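
To check the "same sector every time" pattern without reading the log by eye, a small Python sketch like this could tally the sectors reported per disk. The regex is based only on the `disk.ioRecoveredError.retry` message shown earlier in the thread, `sectors_by_disk` is a made-up helper name, and the second disk in the sample data is invented for illustration.

```python
import re
from collections import defaultdict

# Based on the message text: "Recovered error on disk <name>: op <cdb> sector <n>"
RETRY_RE = re.compile(r"Recovered error on disk ([^:\s]+): op \S+ sector (\d+)")


def sectors_by_disk(lines):
    """Map each disk name to the set of sector numbers it reported."""
    seen = defaultdict(set)
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            seen[match.group(1)].add(int(match.group(2)))
    return dict(seen)


log_lines = [
    # Abridged from the event earlier in the thread, listed twice to mimic a repeat.
    "Recovered error on disk 0a.02.16: op 0x8f:000000038e352800:00000400 sector 10 SCSI:recovered error",
    "Recovered error on disk 0a.02.16: op 0x8f:000000038e352800:00000400 sector 10 SCSI:recovered error",
    # Invented line for a second disk, for illustration only.
    "Recovered error on disk 0a.02.17: op 0x8f:0000000000000000:00000400 sector 3 SCSI:recovered error",
]

for disk, sectors in sorted(sectors_by_disk(log_lines).items()):
    print(disk, sorted(sectors))
```

A disk that keeps reporting a single sector will show a one-element set here.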

queen sandal
#

Usually, newer drive firmware updates only improve the error reporting mechanisms in the drive, so that the drive reports errors sooner/better, and ONTAP can better decide whether to fail the disk early. In any case: no need to worry as long as you have a spare and an active support contract 👍

nocturne rock
#

Yeah... I already got a few "recovered errors" on one of the drives... my worry is that this could cause multiple disks to fail at the same time, or that a rebuild will kill other drives... 🙂 but then again we run scrubs all the time (it seems)... so a rebuild would be a similar workload I guess...

#

note to self: stay away from debug mode 😉