# Aggr reconstruction did not start
You are probably missing spare disks in the corresponding pool
Hm well you seem to have at least one spare in pool 0... Did you check the event log to see why it's not rebuilding?
sysconfig -r should be showing the reconstruction in progress, but it is not starting for an unknown reason. Is there any way to start it manually?
3/26/2024 09:28:09 NA-Polux ERROR raid.rg.recons.cantStart: The reconstruction cannot start in RAID group /aggr1/plex0/rg1: No matching disks available in spare pool, targeting any spare pool
from the event log
event log show
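If you are scanning a longer EMS dump for this, a small parser helps pick out every RAID group whose reconstruction refused to start. This is a minimal sketch. It assumes each line follows the layout of the message quoted above (date, time, node, severity, message name, text). That layout is inferred from this one line and is not a documented format. On the box itself, filtering with something like event log show -message-name raid.rg.recons* should narrow the output as well, but check the options available on your ONTAP release.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

# Assumed EMS line layout (inferred from the message in this thread):
# <date> <time> <node> <SEVERITY> <message.name>: <text>
EMS_LINE = re.compile(
    r"^(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<node>\S+)\s+"
    r"(?P<severity>[A-Z]+)\s+(?P<name>[\w.]+):\s+(?P<text>.*)$"
)
# RAID group paths appear in the text as e.g. "/aggr1/plex0/rg1:".
RG_PATH = re.compile(r"RAID group (?P<rg>/\S+?):")


class EmsEvent(NamedTuple):
    node: str
    severity: str
    name: str
    raid_group: Optional[str]
    text: str


def parse_ems(line: str) -> Optional[EmsEvent]:
    """Parse one EMS line; return None if it does not match the assumed layout."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    rg = RG_PATH.search(m["text"])
    return EmsEvent(
        node=m["node"],
        severity=m["severity"],
        name=m["name"],
        raid_group=rg["rg"] if rg else None,
        text=m["text"],
    )


def reconstruction_blocked(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (node, raid_group) for every raid.rg.recons.cantStart event."""
    for line in lines:
        ev = parse_ems(line)
        if ev is not None and ev.name == "raid.rg.recons.cantStart":
            yield ev.node, ev.raid_group
```

Fed the event quoted above, reconstruction_blocked yields ("NA-Polux", "/aggr1/plex0/rg1"). Lines that do not fit the assumed layout are skipped silently.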
I was planning to bring one of the broken disks back online, then run the disk replace command with the valid spare, to force the reconstruction to start that way,
as per this KB article: https://kb.netapp.com/onprem/ontap/hardware/Aggregate_offline_due_to_multiple_failed_disks
No, there is no command to manually start the reconstruction.
It's strange, because it should at least pick up disk 0b.00.23 as a spare. Try removing ownership of that partition and reassigning it (should be disk removeowner -disk ... -data true followed by disk assign -node NA-Polux -all true).
I did something similar yesterday, and it took more than a day to zero that spare 😩
You don't need to zero the spares; if a disk fails, even non-zeroed spares will be picked.
OK, so you want me to remove 1.0.23P1 from the spare pool and then reassign it?
yep
Any other ideas?
It looks like you have 3 spare disks on the other node: 0a.00.2P1, 0a.00.20P1 and 0b.00.21P1. I would try reassigning one of them to the partner node, just to see if it likes that one better. Something like disk assign -disk 0a.00.2P1 -owner <othernodename> (I think you need to add "-force true", or alternatively unassign it first).
Do a sysconfig -a pls
Fun fact: there are at least 20 customers with castor/pollux HA pairs 🤣
The most of any I've seen. Yogi/Booboo and Mario/Luigi are also popular.
It's safer to disable auto-assignment, then remove ownership of the data partition, then reassign it to the other node.
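The order matters here: if auto-assignment is still on, ONTAP may grab the partition back before you reassign it. The sketch below just builds that command sequence so the steps are explicit. It is a sketch, not a tested runbook. The command syntax reflects my understanding of the ONTAP 9 clustershell (storage disk option modify, storage disk removeowner and storage disk assign with -data true to target the data partition of a partitioned disk), so check the man pages on your release before running anything. The function name and the example arguments are hypothetical.

```python
from typing import List


def reassign_data_partition(container_disk: str, target_node: str) -> List[str]:
    """Hypothetical helper: build the command sequence for moving a data
    partition to another node, with auto-assignment off while it runs.

    container_disk is the physical disk name (e.g. "1.0.23"); "-data true"
    is assumed to select its data partition rather than the root partition.
    """
    return [
        # 1. Keep ONTAP from auto-assigning the partition while we work.
        "storage disk option modify -node * -autoassign off",
        # 2. Release ownership of the data partition only.
        f"storage disk removeowner -disk {container_disk} -data true",
        # 3. Give the data partition to the node that should use it as a spare.
        f"storage disk assign -disk {container_disk} -owner {target_node} -data true",
        # 4. Turn auto-assignment back on once ownership looks right.
        "storage disk option modify -node * -autoassign on",
    ]


if __name__ == "__main__":
    for cmd in reassign_data_partition("1.0.23", "NA-Polux"):
        print(cmd)
```

Verify with storage aggregate show-spare-disks (or sysconfig -r) between steps 3 and 4, rather than pasting all four lines blindly.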
Not sure if you already fixed your assignment, but I guess buller7929 is right. EMS says "The reconstruction cannot start in RAID group /aggr1/plex0/rg1: No matching disks available in spare pool, targeting any spare pool".
If I had to hazard a guess, it would be netapp1-01 and netapp1-02, because "-01/02" is standard and people have no imagination when it comes to names 🙂 Personally, I liked a customer where all hostnames were beers and their filers were called "gambrinus-light" and "gambrinus-dark" 🙂
Oh, another thing mentioned in the logs on this system is the "SHUTDOWN PENDING" message... I'm not sure, but maybe the system will shut down if two disks are broken and it cannot rebuild? As a precaution, like when it's too warm, etc.? (I hope it doesn't.) There are also a few other issues on the system that should be addressed eventually, like the volume that is running out of inodes, and net.ifgrp.lacp.link.inactive, which looks like the intercluster links are down? And of course the "well-known" disk.dynamicqual.fail.parse issue, which I think is fixed in later ONTAP releases... but of course, focus on getting the rebuild up and running first.
What I'm wondering about is this: the system clearly has 1 spare of the matching size in the correct pool... No idea why it doesn't like that one; it should at least have started rebuilding one of the failed disks.
Agreed, it's a bit strange... I've only had something similar happen when I was running some disk replace commands to "move" disks around to free up a shelf... It was some time ago, but those were also partitioned disks and I had to have support check it. It turned out to be because the two data partitions were on the same node... somehow that kept the disk replace from starting... It was most likely my own fault for not paying attention to the partitions and just adding "-force" when it didn't do what I told it to 🙂
Hello Team,
I resolved the issue by checking the failover state: aggr1 was on the partner node. After I performed the giveback, the reconstruction started. I believe that unless the aggregate is returned to its own (home) node, it cannot reconstruct.
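That explanation can be written down as a toy eligibility check: a spare only counts if it is owned by the node currently serving the aggregate, is in the right pool and is big enough. This is a deliberately simplified model of the behaviour described in this thread, not ONTAP's actual spare-selection logic, which considers more than this. "partner-node" and the size values are hypothetical placeholders; only 1.0.23P1, NA-Polux and pool 0 come from the thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spare:
    name: str
    owner: str     # node that currently owns the spare partition
    pool: int      # spare pool number (pool 0 in this thread)
    size_gb: int   # hypothetical usable size


def pick_spare(spares: List[Spare], aggr_owner: str, pool: int,
               min_size_gb: int) -> Optional[Spare]:
    """Simplified model: return the first spare owned by the node currently
    serving the aggregate, in the same pool, and at least min_size_gb."""
    for s in spares:
        if s.owner == aggr_owner and s.pool == pool and s.size_gb >= min_size_gb:
            return s
    return None


spares = [Spare("1.0.23P1", owner="NA-Polux", pool=0, size_gb=100)]

# During takeover aggr1 is served by the partner, so the spare is ignored.
print(pick_spare(spares, "partner-node", 0, 100))    # prints None
# After giveback aggr1 is back on NA-Polux and the spare qualifies.
print(pick_spare(spares, "NA-Polux", 0, 100).name)   # prints 1.0.23P1
```

The point of the model is the owner check: the same spare flips from ineligible to eligible purely because the aggregate moved back to its home node.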
They bought another NetApp storage system and named it Tauros.
Every time I need to do a manual assignment, I disable autoassign to avoid issues while I'm doing the job.