We just moved a fairly large volume using some switch gear, that is, snapmirror from a primary system onto a temp-system, then moving the temp-system to another location, changing IPs and continuing the snapmirror (OK). Then we started a snapmirror of the temp-system onto the final-system... When everything was in sync, we deleted all the snapmirror relationships and set up a new cluster peer from the primary system to the final-system... did a snapmirror create and a snapmirror resync... It looks like it's working, but the resync has been running for a few hours now, and I am just wondering what it is actually doing? It doesn't seem to transfer much from the source, but the disks are at 100% util on the destination... Is it trying to read all data on the destination in "chunks" and compare it against the source? Because that is going to take a while then 😉 We can only see that the status is Transferring and the progress is slowly climbing... Of course it would be nice to know how far it has to go...
#SnapMirror Resync... What is actually happening?
What's the relationship state and status?
Forgot to ask: Are these FlexVols or FlexGroups?
Just one FlexGroup with two volumes in it
First want to confirm you're using -expand to view progress of the constituents. The parent relationship doesn't report progress until a given Snapshot is done transferring. Looking at the constituents will give a better idea of progress.
...if I look at the snapshots on the destination FG, it has the latest snapshots compared to the source...
...and the "Transfer Snapshot" is also the latest snapshot...
I am of course afraid that it has to read all the data or maybe even transfer it once again, which was what we wanted to avoid in the first place...
Resync reverts the destination to the latest in-common Snapshot then transfers a new Snapshot (just like an update), so it will only be moving the differences between the latest in-common and that new Snapshot.
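To make the description above concrete, here is a minimal Python sketch of the idea: find the newest Snapshot present on both sides, treat anything newer on the destination as discarded by the revert, and transfer only the delta up to a new Snapshot. This is a toy model and not ONTAP's implementation; the function names, the dictionary keys, and the Snapshot names are all hypothetical.

```python
from typing import Dict, List, Optional


def latest_common_snapshot(source: List[str], destination: List[str]) -> Optional[str]:
    """Return the newest Snapshot name present on both sides, or None.

    Both lists are assumed to be ordered oldest-first.
    """
    on_destination = set(destination)
    for name in reversed(source):  # walk the source from newest to oldest
        if name in on_destination:
            return name
    return None


def resync_plan(source: List[str], destination: List[str], new_snapshot: str) -> Dict[str, object]:
    """Model a resync: revert to the common base, then send base -> new_snapshot."""
    base = latest_common_snapshot(source, destination)
    if base is None:
        # In this model, no common Snapshot means there is nothing to resync from.
        raise ValueError("no common Snapshot: resync cannot proceed")
    # Anything newer than the base on the destination is lost by the revert.
    newer_on_destination = destination[destination.index(base) + 1:]
    return {
        "revert_to": base,
        "discarded_on_destination": newer_on_destination,
        "delta": (base, new_snapshot),
    }


if __name__ == "__main__":
    src = ["hourly.01", "hourly.02", "hourly.03"]  # hypothetical names
    dst = ["hourly.01", "hourly.02"]
    print(resync_plan(src, dst, "resync.new"))
```

In this model, the amount of data moved depends only on how much changed between the common base and the new Snapshot, not on the total volume size.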
One thing to note about FlexGroups is if "Progress" at the constituent level stops but we're still Transferring, it could be in Finalizing.
That's a reporting limitation with FlexGroup SM, it doesn't actually say "Finalizing" when it's doing that.
yeah that's what we expect... but just concerned that it takes this long
before we altered the setup a normal hourly update took under 20 minutes... and it has been at it now for 4 hours..
one thing... when we started this last resync, we used a slightly different snapmirror policy, which we can see has deleted some old snapshots on the destination...
Is constituent progress increasing (just slowly)?
yep
If progress is moving up, then 99% likely the issue is that it's just bottlenecked somewhere, due to contention or perhaps network limits with the new setup.
But there is no pct. anywhere... we are at 50+47GB out of 150TB 😉
And yes the connection to the primary system is limited, but I can verify that the line is not maxed out...
It is just annoying not knowing if something is wrong, or how long it will likely take... or how much data it has to read or transfer....
It seems to just read a lot on the destination... with 100% disk util... (no scrubbing is going on... also checked SIS processes and other snapmirrors...)
If the progress is really slow, read on the destination could indicate Finalizing when we read buffered reference ops. These ops get buffered to disk and have to be read and applied to the destination filesystem.
well from the last screen shot we have progressed about 10GB per volume...so yes really slow 😉
so I guess I will just have to leave it alone and hope that it finishes up at some unknown point in time? 😉
If it's in Finalizing, there is a reporting limitation with FlexGroup SnapMirrors that it continues to report "Transferring" instead of "Finalizing".
It is also a limitation that the percentage field we created for Finalizing doesn't work with FlexGroups.
I have an enhancement open for both of these. If you want to ask your account team to open FEPVRs or otherwise push this forward: CONTAP-454339
OK, nice to know. We are on 9.16.1P3, but I guess it's still an issue then... well it's late here in Europe, so I will leave it while sleeping and hopefully it's done in time for my morning coffee
To be honest 10GB in 30min is less likely finalizing and more likely just slow data replication, but due to the reporting it's hard to discern sometimes.
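For a rough feel of what "10GB in 30min" means, the arithmetic can be sketched in Python as below. The observed rate comes from the thread; the remaining amount is unknown here (that is exactly the reporting gap being discussed), so the `remaining_gb` value in the example is purely a hypothetical placeholder, and the function names are made up for illustration.

```python
def transfer_rate_gb_per_hour(gb_moved: float, minutes: float) -> float:
    """Convert an observed progress delta into GB per hour."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return gb_moved / (minutes / 60.0)


def eta_hours(remaining_gb: float, rate_gb_per_hour: float) -> float:
    """Hours left at a constant rate; real transfers rarely stay constant."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return remaining_gb / rate_gb_per_hour


if __name__ == "__main__":
    rate = transfer_rate_gb_per_hour(10, 30)  # observed: 10 GB in 30 minutes
    remaining_gb = 1000                       # hypothetical: the real delta is not reported
    print(f"rate: {rate:.1f} GB/h, ETA: {eta_hours(remaining_gb, rate):.1f} h")
```

Sampling the constituent progress twice, some time apart, and feeding the difference into a calculation like this gives at least a trend, even without a percentage field.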
you might be right... (hopefully)... I cannot look up "CONTAP-454339" with my mortal rights on NOW, but I will give my local NetApp guy a yell about this 😉
just a hunch, but compression/compaction settings are the same on source and destination?
Well yes they are. As I tried to explain to begin with, we had this running nicely with snapmirror from the primary to our temp-system, and from the temp-system to our final backup system... We then broke the intercluster peers (and the snapmirrors) and set up a direct connection from the primary system to the final backup system... did a snapmirror create... and then a snapmirror resync... We can see that the snapmirror processes are working on the latest snapshot in the list, but they have been for 17 hours... so it's a bit strange... We can see that right now the transfer between the primary and backup is much less than when we had it set up to the temp-system and the updates were running normally... so we do not think the connection is the limiting factor... we think it's more a question of the disks being loaded 100%... even though sysstat doesn't show many IOPS or much throughput... and no scrubbing or SIS processes are running... What actually makes this worse is that we cannot transfer new snapshots while this is going on, and we don't know how long this is going to take...
I think I figured it out... while looking at the volumes in detail it states that "Volume Status: Volume BG3DLOG_Archive_dest__0001 has one or more scanners active."
and with a "wafl scan show" we can see that it seems to be scanning the filesystem... and very slowly, I might add
...can this be sped up?
With this speed I can calculate it will take about a year to complete...
But did we do something wrong? We had to remove the snapmirror relationship in order to remove the cluster peering, so we did a "snapmirror delete" on the relationship... After we had set up the new cluster peering we tried a snapmirror resync right out of the box, but we were unable to make it work unless we did a "snapmirror create" first... we were then able to do the snapmirror resync... (we didn't use the option "-quick-resync"), just -type XDP and a policy... and it started right up... no errors or anything... so it's a bit strange...
FYI... I have just opened a support case on this...
While we wait for NetApp's SSP partner the smaller of the two sub-volumes seems to have completed... even though the WAFL scan has not yet completed, so maybe they are not even related.... but maybe there's hope
...volume _0002 took about 24 hours to complete... sadly volume _0001 is 10 times larger... so I guess we are looking at 9-10 days before this completes.... I am excited to see what NetApp support can find out here...
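The 9-10 day guess above is a simple linear scaling, which can be written out as a small Python sketch. It assumes duration grows in proportion to constituent size and that throughput stays the same, neither of which is guaranteed; the function name is hypothetical.

```python
def scaled_duration_hours(reference_hours: float, size_ratio: float) -> float:
    """Scale a measured duration linearly by how much larger the other volume is."""
    if reference_hours < 0 or size_ratio <= 0:
        raise ValueError("reference_hours must be >= 0 and size_ratio > 0")
    return reference_hours * size_ratio


if __name__ == "__main__":
    total_hours = scaled_duration_hours(24, 10)  # _0002 took ~24 h, _0001 is ~10x larger
    elapsed_hours = 24                           # both constituents started together
    print(f"total: {total_hours / 24:.0f} days, "
          f"remaining: {(total_hours - elapsed_hours) / 24:.0f} days")
```

With the first day already elapsed, that works out to roughly 9 days remaining out of 10 in total, matching the estimate in the message.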
It's definitely not Finalizing, it's just slow, so they should be looking for bottlenecks. You could give them a head-start by collecting performance archives for a 4hr period of slowness when both were running: https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/What_are_performance_archives_and_how_are_they_triggered