We have a two-node cluster with one aggregate on each node, and a large FlexGroup volume with two underlying volumes, one on each aggregate.
This works fine, but we see a pretty large difference when writing to the FlexGroup. Write speeds are about 800MB/s when we write via the node that holds both the LIF and the aggregate currently being written to... but every other file is written to the other node's aggregate, so the traffic has to go over the cluster interconnect, and we then see speeds of about 200MB/s...
AFAIK there is no way around this? Even if you give the SVM a LIF on each node, CIFS is not able to figure out that it is more efficient to write to the volume that is local to the LIF... not sure if NFSv4 can do some magic in this regard, but I doubt it?
So we are considering relocating the aggregate so that both are on the same node, because we are sure that we will get better write performance overall... the reads are not important to us.
But will the FlexGroup be OK with this operation? 😉 Or should we be aware of any issues?
PS: The aggregates are "full disk" aggregates (no partitions), so the nodes are booting from their own SSD aggregates.
#FlexGroup performance difference...
...hmm just tried relocating the aggregate to the same node, which didn't have the expected result... so it must be the aggregate... which is a bit strange since the two aggregates are very similar... basically just a DS460C with half of the disks for each aggregate... but it seems like all disks are connected to channel A (aggr status -r)... I think there is/was a way to trigger ONTAP to distribute the disk connections between A and B? (storage disk show -p) shows that there are both A and B connections to all disks...
hmmm tried the "storage load balance -node ..." yet it doesn't seem to change anything... all disks are connected via channel SA:A... the boot aggregate on the node is connected to SA:B... so it's not because there isn't another path on the system.. it's an FAS2750 so only two initiators (HBAs)... # Never mind... the "storage load balance" is apparently for LUNs and not the disks 😦
I don't think it's relevant what channel the disks are connected to. A DS460C should handle 60x HDDs easily.
Maybe check if you see any errors on the IOM, disks, etc. Also, maybe one disk in an RG of the slow-performing aggr has higher latency, which often leads to slower overall perf in the whole RG.
Also, why do you only have 1x constituent per aggr? I thought that for bigger FlexGroups (which this seems to be) the default is 4x constituents per aggr.
also check the usual suspects: is the aggregate very full, for example?
also, are you sure you're really copying data and not just silently using ODX in the background ?
Well I can clearly see the same performance trend with "sysstat" on the NetApp... so it's not just the Windows server showing me this... and we have plenty of space... so I guess I will have to look at some "statit" details to see if a disk stands out as we copy...
What would be the point of more than one constituent per aggregate anyway? This is just one host copying larger files to the share... the FlexGroup started as just one volume and was converted later on; the two underlying volumes are pretty much the same size, which is why the performance changes from one aggregate to the other... yeah I will have a look at statit tomorrow
having multiple volumes per aggregate increases performance in general, due to waffinity. But you're right, during a single-stream copy from one client, you won't see any impact
since you mentioned it is a flexvol to flexgroup conversion, this might be relevant though
Interesting... and I love the solution of just creating a new volume and migrating your data 😉 Is there an ONTAP copy tool by any chance, or do you have to pull the data out and put it in again? Maybe some hacking in the systemshell will help me? 😉
So the QoS latency output shows that data01__0002 has considerably higher latency than 0001, which makes sense when we look at the performance numbers.
Sorry the automod didn’t like your cmd output.
statit shows that some disks in the data04 aggregate have many more xfers and writes than the others... this is not the case on data03... again, data04 is where the 0002 volume is located, which is the one with the performance issues...
we are not seeing any disk related errors in the event logs...
`Workload        ID     Latency  Network   Cluster  Data     Disk  QoS Max  QoS Min  NVRAM     Cloud  FlexCache  SM Sync  VA   AVSCAN
data01__0002-..  4686   83.65ms  185.00us  0ms      83.04ms  0ms   0ms      0ms      425.00us  0ms    0ms        0ms      0ms  0ms
data01__0002-..  4686   14.21ms  184.00us  0ms      13.49ms  0ms   0ms      0ms      537.00us  0ms    0ms        0ms      0ms  0ms
data01__0001-..  13969  4.04ms   194.00us  0ms      3.55ms   0ms   0ms      0ms      292.00us  0ms    0ms        0ms      0ms  0ms
data01__0001-..  13969  4.00ms   194.00us  0ms      3.45ms   0ms   0ms      0ms      348.00us  0ms    0ms        0ms      0ms  0ms`
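To make the breakdown easier to read, here is a minimal Python sketch that parses the four sample rows above and reports which component accounts for most of the latency. The column order is an assumption taken from the header of the pasted output, and the rows are only the snippets posted here, so treat the result as illustrative rather than a proper measurement.

```python
# Component columns in the order they appear after "Latency" in the pasted
# header (assumption: the values line up with the header one-to-one).
COLUMNS = ["Network", "Cluster", "Data", "Disk", "QoS Max", "QoS Min",
           "NVRAM", "Cloud", "FlexCache", "SM Sync", "VA", "AVSCAN"]

# The four sample rows copied from the thread.
ROWS = """\
data01__0002-.. 4686 83.65ms 185.00us 0ms 83.04ms 0ms 0ms 0ms 425.00us 0ms 0ms 0ms 0ms 0ms
data01__0002-.. 4686 14.21ms 184.00us 0ms 13.49ms 0ms 0ms 0ms 537.00us 0ms 0ms 0ms 0ms 0ms
data01__0001-.. 13969 4.04ms 194.00us 0ms 3.55ms 0ms 0ms 0ms 292.00us 0ms 0ms 0ms 0ms 0ms
data01__0001-.. 13969 4.00ms 194.00us 0ms 3.45ms 0ms 0ms 0ms 348.00us 0ms 0ms 0ms 0ms 0ms
"""


def to_ms(value):
    """Convert a value such as '185.00us' or '83.65ms' to milliseconds."""
    if value.endswith("us"):
        return float(value[:-2]) / 1000.0
    if value.endswith("ms"):
        return float(value[:-2])
    raise ValueError(f"unexpected unit in {value!r}")


def breakdown(line):
    """Return (workload, total_ms, largest_component, largest_component_ms)."""
    fields = line.split()
    workload, total = fields[0], to_ms(fields[2])
    parts = dict(zip(COLUMNS, (to_ms(v) for v in fields[3:])))
    top = max(parts, key=parts.get)
    return workload, total, top, parts[top]


results = [breakdown(line) for line in ROWS.splitlines() if line.strip()]
for workload, total, top, top_ms in results:
    print(f"{workload}: {total:.2f} ms total, "
          f"{top_ms:.2f} ms in {top} ({top_ms / total:.0%})")
```

With this column mapping, nearly all of the latency in every row lands in the Data column, while Network and NVRAM stay well under a millisecond.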
now I know that the QOS numbers are just a few snippets..
What worries me is the statit output, where the last 7 disks in the aggregate have way higher util than the others..
...could it be that we added these disks to the aggregate later on, and this is the reason? And isn't it possible to do some reallocate magic?
we are ready to nuke all snapshots and run some reallocate if it will solve this issue 🙂
Waiting for the volume reallocation measure to kick off...
yeah, client-side copy is the only way in that case. Converting a FlexVol to a FlexGroup was never a good idea; even NetApp themselves advise against doing it, preferring a client-side copy instead.
I would try and create a test FlexGroup from scratch (thin provisioned), and see if the issue persists there, or if that works fine from a performance standpoint
Yeah I think we will try that... but if the aggregate has had disks added to it, I guess the data written will hit those disks more than the others... which is what we see on the data04 aggregate... and my guess is that we should be able to "fix" this with reallocate... and it's just a bit easier than copying everything around 😉
....the job is still queued.. is there a way to "kick it off" 😉
`s01::volume reallocation*> show
Vserver  Description                 Schedule       State
Reallocation scans are on
privat   /vol/data01__0002,measure   reallocate_1d  Queued`
...sad to say that creating a new FlexGroup gives us the same issue... but yes, it creates 4 volumes if you create it from scratch...
hmm so... either reallocate, or create a new temp aggregate to "empty" out the "problem" aggregate and then recreate it... has anyone had any real luck with reallocate? To be honest it's been a while since I used it myself... maybe even back in the 7-Mode days...
come to think of it... maybe we can create a new aggregate, then do a "vol move" of the problem volume___02 to that aggregate... then recreate the problem aggregate and move it back...
100% you should use XCP to migrate the data to a new FlexGroup. Otherwise you leave artifacts of the old FlexVol around and it’ll never balance out right. I don’t like that we weren’t more strict on that guidance. In the words of Dr Ian Malcolm, “Just because ya can, doesn’t mean you should.”
But Nick, I just tried to create a new FG, which gave me the same issues... so things point to the aggregate. I cannot remember exactly, but we may have added 7-10 disks at some point, which will have skewed the balance, and since we mostly write to this and never delete anything, I am more and more certain that this is the issue... so either we rebuild the aggregate or we add more disks to it... and adding disks is not an option... we do have a lot of spare disks on other nodes in the cluster, but not the same size... but I guess we could create a new aggregate and use it as temporary storage while we re-create the aggregate...
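To illustrate the suspicion about the late-added disks, here is a toy Python model. It assumes, purely for illustration, that new writes are spread across disks in proportion to each disk's free space; that is a simplification and not a description of how ONTAP's write allocator really works. The disk counts and free-space figures are hypothetical.

```python
def write_share(free_units):
    """Share of new writes per disk, assuming writes follow free space."""
    total = sum(free_units)
    return [free / total for free in free_units]


# Hypothetical aggregate: 20 original disks that are mostly full, plus
# 7 disks added later that are still mostly empty.
original_disks = [100] * 20   # free units left on each original disk
added_disks = [1000] * 7      # free units left on each added disk

shares = write_share(original_disks + added_disks)
per_original = shares[0]
per_added = shares[-1]

print(f"each original disk takes {per_original:.1%} of new writes")
print(f"each added disk takes    {per_added:.1%} of new writes")
print(f"an added disk is {per_added / per_original:.0f}x busier")
```

Under that assumption the few emptier disks absorb most of the write load, which would look like the handful of high-util disks in statit. It does not prove that this is the cause.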
Side question... what is the difference between "fast zeroed" and the normal, very slow zeroing? Just started this aggregate creation, and I guess I will have to wait several hours for the zeroing to complete... but why aren't they all just fast zeroed? 😉
I stand corrected... apparently the aggregate is now ready.. so didn't have to wait that long 🙂
...even at 1GB/s transfer speeds, this is still going to take several days to complete 😉 I guess we are limited by the cluster interconnect links, which are 10Gb...
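A quick back-of-the-envelope sketch of that estimate in Python. The volume size is not stated in the thread, so the 200 TB figure below is purely hypothetical; units are decimal (1 TB = 1000 GB), and the 10 Gbit/s conversion ignores protocol overhead.

```python
def link_gbit_to_gbyte(gbit_per_s):
    """Raw line rate in GB/s for a link speed given in Gbit/s."""
    return gbit_per_s / 8.0


def transfer_days(size_tb, rate_gb_per_s):
    """Days needed to move size_tb terabytes at rate_gb_per_s GB/s."""
    seconds = (size_tb * 1000.0) / rate_gb_per_s
    return seconds / 86400.0


link_rate = link_gbit_to_gbyte(10)   # 10 Gbit/s link -> 1.25 GB/s raw
example_size_tb = 200                # hypothetical size, not from the thread

print(f"10 Gbit/s is at most {link_rate:.2f} GB/s")
print(f"{example_size_tb} TB at 1 GB/s takes "
      f"{transfer_days(example_size_tb, 1.0):.1f} days")
```

So a sustained 1 GB/s is already close to what a single 10Gb link can carry, and anything in the hundreds of TB lands in the "several days" range.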
how is fast zeroing different from full zeroing? Well, full zeroing writes zeroes over every block. Fast Zeroing writes zeroes only over some important areas and keeps the rest of the data as-is. Since data is retained in each block, there needs to be a different method of finding out if a block is "zeroed" or not (for parity calculations etc.), because simply reading the block and testing all bits as 0 doesn't work.
Technically, the way it is done is that there's a random signature that gets generated when the disk is zeroed. This signature is stored in the checksum area of each block during write (the 8 bytes after 512 data bytes for 520 bps disks, or the 64 bytes after 4096 data bytes on 4k sectored disks). Every time a block is read, the signature is checked and if it doesn't match the one created during disk zeroing, the block is treated as zeroed.
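A toy Python model of the idea described above, just to make the mechanism concrete. The class and method names are made up for illustration and this is not ONTAP code; it only mimics the "per-disk signature stored next to each block" logic on 4k blocks.

```python
import os

BLOCK_SIZE = 4096   # data bytes per block (the 4k-sector case)
SIG_SIZE = 8        # size of the toy signature, chosen arbitrarily


class ToyDisk:
    """Hypothetical model: a block only counts as written if it carries
    the signature generated at the most recent fast zero."""

    def __init__(self, num_blocks):
        # Each entry is (data, signature stored in the checksum area).
        self.blocks = [(bytes(BLOCK_SIZE), b"")] * num_blocks
        self.signature = b""
        self.fast_zero()

    def fast_zero(self):
        """'Zero' the disk by generating a new signature. No block is touched."""
        new_signature = os.urandom(SIG_SIZE)
        while new_signature == self.signature:
            new_signature = os.urandom(SIG_SIZE)
        self.signature = new_signature

    def write(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        # Stamp the block with the current signature on every write.
        self.blocks[index] = (data, self.signature)

    def read(self, index):
        data, stored_signature = self.blocks[index]
        # A stale or missing signature means the block is logically zeroed.
        if stored_signature != self.signature:
            return bytes(BLOCK_SIZE)
        return data


disk = ToyDisk(num_blocks=4)
payload = b"\x01" * BLOCK_SIZE
disk.write(0, payload)
print(disk.read(0) == payload)            # block written after the last zeroing
disk.fast_zero()
print(disk.read(0) == bytes(BLOCK_SIZE))  # same block now reads back as zeroes
```

In the model, the old bytes are still physically there after `fast_zero()`, but every read treats them as zeroes because the stored signature no longer matches. That is why the operation finishes almost immediately.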
makes total sense... when I created the new aggregate it took a little while, and I managed to run a quick "aggr status -r", and you can see the result above... so I was worried that it would have to do the "long zeroing" on some of the disks, as it was indicating... but after a minute it completed... all with "fast zeroed" status now... 🙂