#Aggregate resync pct ?

prisma terrace
#

Hi...
We just moved one node in an MCC and had to power down one side of the MCC. Now that the node can see all its disks again, it has started its resync:

Plex /stor12_data/plex4 (online, normal, resyncing 1% completed, pool1)
RAID group /stor12_data/plex4/rg0 (recomputing parity 38% completed, block checksums)

Which resync percentage is correct? 😉 It looks like the 1% stays put while the recompute keeps moving up.

Any idea?

prisma terrace
#

Update: the recomputing has completed, but sadly the resync is going very slowly. I'm not sure I understand why it needed to do the recomputing, or why it needs this long to resync two aggregates that were only disconnected for 3 hours. I am pretty sure the "aggregate snapshot reserve" wasn't maxed out, so AFAIK it should only need to sync up the differences, not everything. It seems we are looking at 5-6 hours of resync now. This is a two-node stretch MCC A400 with 120 x 7.6TB SAS SSDs in total, split 50% to each aggregate/controller. The CPU is maxed out while we are resyncing, with reads/writes at about 1-2GBps. It seems to me like it's doing a total resync, but I'm not sure how to see if that's the case.

#

BTW: Can you do a switchback while the aggregates are resyncing? Right now we only have one node up, the other is booted into maint mode...

grizzled terrace
#

Those are two different processes. Resyncing is the one you're looking for (it rebuilds the plex mirrors); recomputing parity just checks the stripes within the aggregate for lost parity blocks and recomputes them. It is usually much faster, since the system knows which RAID stripes need to be recomputed.
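To keep the two counters apart when watching progress, something like the following could pull them out of the status text. It is only a rough sketch in Python: the regular expressions are assumptions based on the two status lines pasted earlier in this thread, and `parse_progress` is a made-up helper name, not anything ONTAP ships.

```python
import re

# Assumed line shape, taken from the two status lines quoted in this thread:
#   Plex <path> (<details>)   /   RAID group <path> (<details>)
STATUS_RE = re.compile(
    r"^(?P<kind>Plex|RAID group)\s+(?P<path>\S+)\s+\((?P<detail>.*)\)\s*$"
)
PROGRESS_RE = re.compile(
    r"(?P<activity>resyncing|recomputing parity)\s+(?P<pct>\d+)% completed"
)


def parse_progress(text):
    """Return (kind, path, activity, percent) for each line reporting progress."""
    results = []
    for line in text.splitlines():
        status = STATUS_RE.match(line.strip())
        if not status:
            continue
        progress = PROGRESS_RE.search(status.group("detail"))
        if not progress:
            continue
        results.append(
            (
                status.group("kind"),
                status.group("path"),
                progress.group("activity"),
                int(progress.group("pct")),
            )
        )
    return results


sample = """\
Plex /stor12_data/plex4 (online, normal, resyncing 1% completed, pool1)
RAID group /stor12_data/plex4/rg0 (recomputing parity 38% completed, block checksums)
"""

for kind, path, activity, pct in parse_progress(sample):
    print(f"{kind:10} {path:28} {activity}: {pct}%")
```

The plex line carries the mirror resync percentage and the RAID group line carries the parity recompute percentage, so the two numbers are not expected to match.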

#

And yes, you can do a switchback while the aggregates are resyncing (with override vetoes), but you might lose some resync progress since the last checkpoint. But first check this

#

i.e. set raid.resync.perf_impact to high and raid.resync.num_concurrent_ios_per_rg to 16 or so

prisma terrace
#

OK, but how can I see whether it has started a "full" resync?

#

Since the CPU is already at 99%, I don't want to change the resync priority 😉 It's set to the default (low) right now. It has reached 30% and 60% on the two aggregates, so it should be finished tomorrow morning, at which point we will do a switchback. It just seems like a very long time for a resync compared to others I have done. When I cut the connections between the two sites, I just powered off the ATTO bridges. Not sure if there is a "smoother" way than that?
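For a rough finish-time guess from percentages like these, a simple linear extrapolation between two readings is one option. This is a sketch only: it assumes the resync rate stays constant, `estimate_remaining_hours` is a hypothetical helper, and the earlier readings and the one-hour interval in the example are invented for illustration (only the 30% and 60% figures come from this thread).

```python
def estimate_remaining_hours(pct_earlier, pct_now, hours_between):
    """Linearly extrapolate the hours left until 100%.

    Returns None when no progress was observed or the interval is not positive,
    since no rate can be derived in that case.
    """
    gained = pct_now - pct_earlier
    if hours_between <= 0 or gained <= 0:
        return None
    rate_pct_per_hour = gained / hours_between
    return (100 - pct_now) / rate_pct_per_hour


# Hypothetical earlier readings of 20% and 40%, taken one hour before 30% and 60%.
for name, earlier, now in [("aggregate A", 20, 30), ("aggregate B", 40, 60)]:
    remaining = estimate_remaining_hours(earlier, now, 1.0)
    print(f"{name}: about {remaining:.1f} h remaining")
```

A resync that competes with client I/O on a CPU at 99% may well not progress linearly, so treat the result as a ballpark figure.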

grizzled terrace
#

You should see it in the EMS logs: something about starting a "Full", "Level 0", or "L0" resync. You can also check whether a mirror resync snapshot still exists on the aggregate (node run snap list -A).

#

but yeah, from how you describe it, it sounds like a full resync

jagged monolith
grizzled terrace
#

and if there are too many changes or it takes too long, then the resync snapshot will be deleted and you'll have to do a full L0 resync 🤷‍♂️

jagged monolith
#

it was about the "smoother way" and not so much about the resync. That takes the time it takes.

#

i've been running metrocluster for 15 years

prisma terrace
#

One additional question: our ATTO 7600 bridges have no management IP set up, but ONTAP can see their status OK. Will we be able to update the firmware on the ATTOs via ONTAP even without the IP setup?

jagged monolith
#

IIRC, ATTO started using in-band communication over the FC links, but it's documented. It's been a while since I ran FC MCC.

prisma terrace