# Advice to migrate millions of files and folders

spring lion
#

Hi all,
We have a large number of files and folders (about 64 million) in a FlexVol (around 11 TiB used) on an A400 used purely for CIFS. To be more flexible and get more predictable performance, we are in the process of migrating to a new FlexGroup on the same system.
We are having considerable difficulty copying the data.
We tried XCP, ndmpcopy and multi-threaded robocopy, but each of them takes at least 5 days.

SnapMirror and conversion to FlexGroup were not advised by NetApp.

Looking for any advice on how to copy the data faster (read: in less time).

Thanks!
Martijn
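
For scale, the numbers in the post can be turned into rates. This is only a back-of-the-envelope sketch using the figures quoted above (64 million files, about 11 TiB, 5 days); it is not a measurement. The implied throughput is low for an A400, which suggests that per-file overhead rather than bandwidth is what dominates the copy time.

```python
# Back-of-the-envelope rates implied by the numbers in the post.
FILES = 64_000_000        # "about 64 million" files and folders
DATA_BYTES = 11 * 2**40   # "around 11 TiB" used
DAYS = 5                  # reported duration of a full copy

seconds = DAYS * 86_400
files_per_second = FILES / seconds
mib_per_second = DATA_BYTES / 2**20 / seconds
avg_file_kib = DATA_BYTES / FILES / 1024

print(f"{files_per_second:.0f} files/s")       # about 148
print(f"{mib_per_second:.1f} MiB/s")           # about 26.7
print(f"{avg_file_kib:.0f} KiB average size")  # about 185
```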

ancient geode
#

XCP is quite good usually but just to clarify: You want it to copy faster than your estimated 5 days? What's so bad about 5 days?
You can do the initial transfer in the background via XCP (create a snapshot first and mount that, since for SMB "XCP does not support modifying data on the source volume during migration"), then plan the downtime, do the incremental copy with the sync cmd, and then cut over.
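
The sequence described here could be sketched roughly as below. This only builds the command lines, it does not run them. The share paths and the snapshot name are hypothetical, and the sketch assumes the snapshot is reachable over SMB under a `~snapshot` directory on the share; check both that and the `xcp copy` / `xcp sync` syntax against your XCP version before using anything like it.

```python
def xcp_migration_plan(src_share, dst_share, snapshot_name):
    """Return XCP command lines: baseline copy from a snapshot, then a sync.

    Assumption: the snapshot is visible over SMB as
    <share>\\~snapshot\\<snapshot_name>.
    """
    snapshot_path = "\\".join([src_share, "~snapshot", snapshot_name])
    return [
        # 1. Baseline from the read-only snapshot while users keep working.
        ["xcp", "copy", snapshot_path, dst_share],
        # 2. During the downtime window: incremental pass from the live share.
        ["xcp", "sync", src_share, dst_share],
    ]


# Hypothetical names, for illustration only.
for cmd in xcp_migration_plan(r"\\svm1\data", r"\\svm1\data_fg", "xcp_baseline"):
    print(" ".join(cmd))
```

The cut-over itself (repointing the share or DFS target) is left out on purpose, since that depends on the environment.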

teal radish
#

yeah, with incremental transfers (even robocopy can do these, for example daily after the initial full copy) the time for migration shouldn't really matter that much...

have your users continue to work on the existing volume/data while you do the initial sync job in the background. then schedule regular updates daily, later n-hourly, and for the cutover you can have as short a downtime as an hour or so
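
If you go the robocopy route, the repeated passes could look something like the sketch below. The paths, thread count and log file name are placeholders. Note that `/MIR` also deletes files on the destination that no longer exist on the source, so test it on a scratch target first. Robocopy exit codes below 8 mean the pass completed without failures, which is what `robocopy_ok` checks.

```python
def robocopy_pass(src, dst, threads=32, log_file="robocopy.log"):
    """Build one robocopy pass; re-running it only copies changed files."""
    return [
        "robocopy", src, dst,
        "/MIR",              # mirror the tree (also purges extras on dst)
        "/COPYALL",          # data, attributes, timestamps, ACLs, owner, audit
        f"/MT:{threads}",    # multi-threaded copy
        "/R:1", "/W:1",      # one retry, one second wait, so locked files don't stall
        "/NP",               # no per-file progress in the log
        f"/LOG+:{log_file}"  # append to the log across passes
    ]


def robocopy_ok(exit_code):
    """Robocopy uses a bitmask exit code; 8 and above indicates failures."""
    return exit_code < 8


# Hypothetical paths, for illustration only.
print(" ".join(robocopy_pass(r"\\svm1\data", r"\\svm1\data_fg")))
```

The same command can be scheduled daily at first and then every few hours as the cut-over approaches.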

vivid lintel
#

Did you get a reason why conversion to FlexGroup was not recommended?

You could clone the FlexVol and give it a try, so it's not touching your production volume. Then run a rebalance afterwards as a possible action to balance the data across all constituents (when you convert a FlexVol to a FlexGroup, all existing data ends up in the first constituent; that's why it's such a quick process).

teal radish
vivid lintel
#

Yeah, in the past we've had to use XCP to copy data from one sub-directory, or qtree, in a FlexGroup into another (and keep repeating) in order to rebalance FlexGroups. FlexGroup Rebalance is a cool feature, but has some severe limitations that make it mostly unusable (the main one is that it blows out snapshots very easily).

I'd still give it a try if there is capacity. Schedule the rebalance to only run outside of hours, or in low usage times, and see how it goes.

spring lion
vivid lintel
#

That is because so much of the data has to be touched and moved. Usually it's a small percentage of the data on a constituent that is moved, but after a conversion we're talking about a huge percentage of the files having to be moved. So the I/O load is large, because the rebalance automatically scans all files and moves them to balance.

The "volume rebalance file-move start" is fine, but is one file at a time, and you have to manually choose a destination for every file. We're talking 64M files in this case. No one is going to run and monitor ~50M file move commands, that's just a non-starter. Also, the documentation says "not recommended", not "unsupported". It can be run, but there are considerations to think about.

The flexgroup rebalance is not recommended after a flexvol to flexgroup conversion because it's going to place high I/O on that individual first constituent. Usually you're moving files from many constituents to many others, so the load is spread across many volumes. In this case all the read I/O will be targeting a single volume.
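
To put a rough number on that: if everything starts in the first constituent and the aim were an even spread over N constituents, a fraction (N-1)/N of the data would have to leave that one volume. The sketch below assumes an even spread, which is an upper bound (rebalance does not actually aim for identical fill levels), and the constituent counts are hypothetical.

```python
def data_to_move_tib(total_tib, constituents):
    """Data that must leave the first constituent for an even spread."""
    return total_tib * (constituents - 1) / constituents


TOTAL_TIB = 11  # "around 11 TiB" from the original post

for n in (4, 8, 16):  # hypothetical constituent counts
    moved = data_to_move_tib(TOTAL_TIB, n)
    share = (n - 1) / n * 100
    print(f"{n} constituents: {moved} TiB to move ({share}% of the data)")
```

All of that is read from a single volume, which is where the concentrated load comes from.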

That is 100% why I suggested testing it with a clone of the volume. Do a full clone-split so it's a completely different FlexVol being hit, and even put it on a different controller. Then test the rebalance and only allow it to run outside of normal operating hours, where any performance impact on users and production systems is minimal. A rebalance doesn't just start and then finish when it's complete. You can limit its duration, and by default it only touches files with no blocks trapped in snapshots. You can use it, just be careful and understand the options and implications.

I'd 100% try it rather than hacking together millions of file-move start commands.

Or, copy the data again using XCP if you wish. I'd personally at least test the flexgroup conversion and rebalance to see. You might be able to get away with running it for several hours overnight for a few weeks and it achieves the same outcome, and is non-disruptive and transparent to end-users.

spring lion
#

I will start with the XCP on a snapshot!

spring lion
#

First copy run xcp from snapshot

timber tulip
#

you can speed up the initial XCP copy (usually) by setting a few things
-parallel, default is the CPU count, you can adjust this as needed to increase/decrease. we ended up using 12 instead of 16 and it did speed up a bit for some reason.
-bs X block size, default is 1M but you can play with this for what works best for you. We ended up setting ours to 4
set loglevel to error, default is info
make sure you aren't using some things unless they are needed, such as:
-preserve-atime
-acl
-aclverify
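
Put together, a tuned command line could be assembled something like this. It is only a sketch: the flag names are the ones mentioned in this post, the values and share paths are placeholders, and the command is printed rather than executed. Check `xcp help copy` on your version for the exact options and units.

```python
def xcp_copy_command(src, dst, parallel=None, block_size=None, extra_flags=()):
    """Build an `xcp copy` command line with optional tuning flags."""
    cmd = ["xcp", "copy"]
    if parallel is not None:
        cmd += ["-parallel", str(parallel)]  # default is the CPU count
    if block_size is not None:
        cmd += ["-bs", str(block_size)]      # value and unit are up to you
    # Only pass flags such as -preserve-atime or -acl if you really need them.
    cmd += list(extra_flags)
    cmd += [src, dst]
    return cmd


# Hypothetical shares; 12 is the -parallel value that worked in this post.
print(" ".join(xcp_copy_command(r"\\svm1\data", r"\\svm1\data_fg", parallel=12)))
```

Timing the same small subtree with a few different `-parallel` values is a cheap way to find what works on your hardware before starting the full run.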

we had to copy ~110 million files (30TB) and it took less than 4 days for the initial copy
once the copy was done we started doing a sync/verify every other day
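
As a very rough comparison, those figures can be scaled to the 64 million files from the original question. This assumes the copy time scales with file count and ignores differences in hardware, file sizes and network, so treat it as an order-of-magnitude estimate only.

```python
REF_FILES = 110_000_000    # "~110million files" in this post
REF_DAYS = 4               # "less than 4 days", so this is an upper bound
TARGET_FILES = 64_000_000  # from the original question

ref_rate = REF_FILES / (REF_DAYS * 86_400)       # files per second
est_days = TARGET_FILES / REF_FILES * REF_DAYS   # linear scaling by file count

print(f"reference rate: at least {ref_rate:.0f} files/s")      # about 318
print(f"estimate for 64M files: about {est_days:.1f} days")    # about 2.3
```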

reef flax
#

@spring lion Are you wanting to move to a FlexGroup just for additional capacity, or do you need to improve the performance of the existing data? If it is JUST for added capacity then doing a conversion is fine. There is nothing wrong with having an imbalanced FlexGroup; ONTAP will simply start to fill the new volumes. However, as noted, this will not improve the performance of the existing data. The only way to do that is to do a copy, as was suggested with XCP.

spring lion
#

It is for performance, not capacity

main birch
#

Been through this a few times with customer(s). It's worth it to properly re-ingest it for optimal performance.

vivid lintel
# reef flax @spring lion Are you wanting to move to a FlexGroup just for additiona...

Not with the FlexGroup Proactive Resizing feature introduced in ONTAP 9.8. I turn that off via node options on every single system I create flexgroups on. It brought down massive multi-PB scale flexgroups for my customers because of how it works and lack of data points used, all because of unbalanced constituents.

Disable it, adjust the warning and danger thresholds and we're golden.

spring lion
#

@vivid lintel Thanks! So your advice is not to do a conversion? Or to do a conversion with these options disabled on the nodes?

errant vigil
#

FG Rebalance isn't for high inode configurations, but low inode configurations. It uses multi-part inodes, which is more work for the various subsystems. It doesn't add any benefit for performance but degrades it in those situations. If you have something like VMDKs, databases, or something else with large files, rebalance makes sense.

plain estuary
#

Rebalance also has a fairly high min-file size. Wasn't even able to do it on ours, which are a lot of sub 4k files.

reef flax
#

Yep. The goal is NOT to get all member volumes to the same capacity level. That's not optimal. The goal of rebalance is to prevent one member volume from filling up by moving the FEWEST files possible. Thus it targets the large files.
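
A quick way to see whether rebalance could help at all is to check how much of the data sits in files at or above the minimum file size. The sketch below is one way to do that on a sample directory. The threshold is a parameter you would set to whatever your ONTAP version uses, and it assumes files at or above that size are the candidates. Walking tens of millions of files over SMB is slow in itself, so point it at a representative subtree rather than the whole share.

```python
import os


def rebalance_candidates(root, min_file_size):
    """Count files (and bytes) at or above min_file_size under root."""
    total_files = total_bytes = 0
    eligible_files = eligible_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total_files += 1
            total_bytes += size
            if size >= min_file_size:
                eligible_files += 1
                eligible_bytes += size
    return {
        "total_files": total_files,
        "total_bytes": total_bytes,
        "eligible_files": eligible_files,
        "eligible_bytes": eligible_bytes,
    }
```

If almost nothing is eligible, as with a share full of sub-4k files, a rebalance will not have much to move regardless of how it is scheduled.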

vivid lintel
#

@spring lion I'm probably a bit overly sensitive about FlexGroups because of some past experiences. They're getting better every release. They have their place and I still use them where it makes sense. I'd still test out a clone of the volume, convert that to a flexgroup, and test out the rebalance. That is what I personally would be doing if it were my system. It's very quick and you don't have anything to lose by testing that, but can gain a lot.

I'd stick with the defaults and let the system manage the flexgroup. My experiences in the past were in the transition phase, with existing flexgroups when proactive resizing was being introduced, and there were some teething issues.

But since then, I have several customers with many flexgroups. Some are insanely massive, close to 200 constituents and nearing 20PB each (I have one that has gone larger with 300TB flexvol support).