#FlexGroup performance difference...


jagged crag
#

We have a cluster with two nodes that has an aggregate on each node. We then have a large FlexGroup volume with two underlying volumes, one on each aggregate.
This works fine, but we see a pretty large difference when writing to the FlexGroup. We see write speeds of about 800 MB/s when we write through the node that has both the LIF and the aggregate currently being written to... but every other file is written to the other node's aggregate, so the traffic has to go over the cluster interconnect, and we then see speeds of about 200 MB/s....
AFAIK there is no way around this? Even if you give the SVM a LIF on each node, CIFS is not able to figure out that it is more efficient to write to the volume that is local to the LIF... not sure if NFSv4 can do some magic in this regard, but I doubt it?
So we are considering relocating the aggregate to the same node, because we are sure that we will get better write performance overall... the reads are not important to us.
But will the FlexGroup be OK with this operation? 😉 Or should we be aware of any issues?
PS: The aggregates are "full disk" aggregates (no partitions), so the nodes are booting from their own SSD aggregates.
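
A rough way to see what the two numbers above mean for overall throughput: if files alternate between the fast local path (800 MB/s) and the slow remote path (200 MB/s), the combined rate is the harmonic mean, not the arithmetic mean. This is only a sketch, assuming equal-sized files written one at a time; the function name is made up for illustration.

```python
def effective_throughput(rates_mb_s):
    """Overall MB/s when equal-sized files are written one at a time,
    round-robin, over paths with the given per-path rates.

    Assumption (not from the thread): files are the same size and the
    writes do not overlap, so total time is the sum of per-file times.
    """
    time_per_mb = sum(1.0 / rate for rate in rates_mb_s)
    return len(rates_mb_s) / time_per_mb


local_rate = 800.0   # MB/s, LIF and aggregate on the same node
remote_rate = 200.0  # MB/s, traffic crossing to the other node

# Rounded to one decimal: 320.0 MB/s, well below the 500 MB/s arithmetic mean.
print(round(effective_throughput([local_rate, remote_rate]), 1))
```

The slow path dominates because each file on it takes four times as long, which is why fixing the slow constituent matters more than the average suggests.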

jagged crag
#

...hmm just tried to relocate the aggregate to the same node which didn't have the expected results... so it must be the aggregate... which is a bit strange since the two aggregates are very similar... basically just a DS460c with half of the disks for each aggregate... but it seems like all disks are connected to channel A... (aggr status -r) I think there is/was a way to trigger ONTAP to distribute the disk connections between A and B ? (storage disk show -p) shows that there are both A and B connection to all disks...

jagged crag
#

hmmm tried the "storage load balance -node ..." yet it doesn't seem to change anything... all disks are connected via channel SA:A... the boot aggregate on the node is connected to SA:B... so it's not because there isn't another path on the system.. it's an FAS2750 so only two initiators (HBAs)... # Never mind... the "storage load balance" is apparently for LUNs and not the disks 😦

shadow dock
#

I don't think it's relevant what channel the disks are connected to. A DS460C should handle 60x HDDs easily.
Maybe check if you see any errors on the IOM, disks, etc. Also, maybe one disk in a RAID group (RG) of the slow-performing aggr has higher latency, which often leads to slower overall perf in the whole RG.

#

Also, why do you only have 1x constituent per aggr? I thought for bigger FlexGroups (which this seems to be) the default is 4x constituents per aggr.

limber lake
#

also check the usual suspects: is the aggregate very full, for example?

#

also, are you sure you're really copying data and not just silently using ODX in the background ?

jagged crag
#

So the QoS latency output shows that data01__0002 has considerably higher latency than data01__0001, which matches the performance numbers we are seeing.

somber pond
#

Sorry the automod didn’t like your cmd output.

jagged crag
#

statit shows that some disks in the data04 aggregate have many more xfers and writes than the others... this is not the case on data03... again, data04 is where the 0002 volume is located, the one on which we are seeing performance issues...

#

we are not seeing any disk related errors in the event logs...

#

`Workload        ID     Latency  Network   Cluster  Data     Disk  QoS Max  QoS Min  NVRAM     Cloud  FlexCache  SM Sync  VA   AVSCAN
data01__0002-..  4686   83.65ms  185.00us  0ms      83.04ms  0ms   0ms      0ms      425.00us  0ms    0ms        0ms      0ms  0ms
data01__0002-..  4686   14.21ms  184.00us  0ms      13.49ms  0ms   0ms      0ms      537.00us  0ms    0ms        0ms      0ms  0ms
data01__0001-..  13969  4.04ms   194.00us  0ms      3.55ms   0ms   0ms      0ms      292.00us  0ms    0ms        0ms      0ms  0ms
data01__0001-..  13969  4.00ms   194.00us  0ms      3.45ms   0ms   0ms      0ms      348.00us  0ms    0ms        0ms      0ms  0ms`
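
To read the rows above more easily, here is a small sketch that picks out which latency column dominates for each sample. It only uses the four rows pasted above; the column order is taken from the header line, and the helper names are made up for illustration.

```python
# Per-component columns, in the order of the header line above
# (after Workload, ID and total Latency).
COLUMNS = ["Network", "Cluster", "Data", "Disk", "QoS Max", "QoS Min",
           "NVRAM", "Cloud", "FlexCache", "SM Sync", "VA", "AVSCAN"]

ROWS = """\
data01__0002-.. 4686 83.65ms 185.00us 0ms 83.04ms 0ms 0ms 0ms 425.00us 0ms 0ms 0ms 0ms 0ms
data01__0002-.. 4686 14.21ms 184.00us 0ms 13.49ms 0ms 0ms 0ms 537.00us 0ms 0ms 0ms 0ms 0ms
data01__0001-.. 13969 4.04ms 194.00us 0ms 3.55ms 0ms 0ms 0ms 292.00us 0ms 0ms 0ms 0ms 0ms
data01__0001-.. 13969 4.00ms 194.00us 0ms 3.45ms 0ms 0ms 0ms 348.00us 0ms 0ms 0ms 0ms 0ms
"""


def to_ms(value):
    """Convert a value such as '185.00us' or '83.04ms' to milliseconds."""
    if value.endswith("us"):
        return float(value[:-2]) / 1000.0
    if value.endswith("ms"):
        return float(value[:-2])
    raise ValueError(f"unexpected unit in {value!r}")


def dominant_component(line):
    """Return (workload, total_ms, component_name, component_ms)."""
    fields = line.split()
    workload, total_ms = fields[0], to_ms(fields[2])
    parts = dict(zip(COLUMNS, (to_ms(f) for f in fields[3:])))
    name = max(parts, key=parts.get)
    return workload, total_ms, name, parts[name]


for row in ROWS.splitlines():
    workload, total_ms, name, part_ms = dominant_component(row)
    print(f"{workload}: {total_ms:.2f} ms total, {part_ms:.2f} ms in {name}")
```

In all four samples nearly the whole latency sits in the Data column, while Network stays under 0.2 ms and Cluster is 0 ms, so these snippets do not point at the network or the cluster interconnect.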

#

now I know that the QOS numbers are just a few snippets..

#

What worries me is the statit output, where the last 7 disks in the aggregate have way higher util than the others..

#

...could it be that we added these disks to the aggregate later on, and this is the reason? And isn't it possible to do some reallocate magic?

#

we are ready to nuke all snapshots and run some reallocate if it will solve this issue 🙂

#

Waiting for the volume reallocation measure to kick off...

jagged crag
#

....the job is still queued.. is there a way to "kick it off" 😉

#

`s01::volume reallocation*> show
Vserver Description Schedule State


Reallocation scans are on
privat /vol/data01__0002,measure reallocate_1d Queued`

#

...sad to say that creating a new FlexGroup gives us the same issue... but yes, it creates 4 volumes if you create it from scratch...

#

hmm so... either reallocate, or create a new temp aggregate to "empty" out the "problem" aggregate and then recreate it... has anyone had any real good luck with reallocate? To be honest, it's been a while since I used it myself... maybe even back in the 7-Mode days...

#

come to think of it... maybe we can create a new aggregate, then do a "vol move" of the problem volume___02 to that aggregate... then recreate the problem aggregate and move it back...

solar solstice
#

100% you should use XCP to migrate the data to a new FlexGroup. Otherwise you leave artifacts of the old FlexVol around and it’ll never balance out right. I don’t like that we weren’t more strict on that guidance. In the words of Dr Ian Malcolm, “Just because ya can, doesn’t mean you should.”

jagged crag
#

But Nick, I just tried to create a new FG, which gave me the same issues... so things point to the aggregate. I cannot remember exactly, but we may have added 7-10 disks at some point, which will have skewed the balance, and since we mostly write to this and don't delete anything, I am more and more certain that this is the issue... so either we rebuild the aggregate or we add more disks to it... and adding disks is not an option... we do have a lot of spare disks on other nodes in the cluster, but not the same size... but I guess we could create a new aggregate and use it as temporary storage while we re-create the aggregate...

#

Side question... what is the difference between "Fast zeroed" and normal, very slow zeroing? Just started this aggregate creation, and I guess I will have to wait several hours for the zeroing to complete... but why aren't they all just fast zeroed? 😉

#

I stand corrected... apparently the aggregate is now ready.. so didn't have to wait that long 🙂

#

...even with 1 GB/s transfer speeds, this is still going to take several days to complete 😉 I guess we are limited by the cluster interconnect links, which are 10 Gb...
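
As a back-of-the-envelope check on "several days": a 10 Gb/s link tops out at 1.25 GB/s before any protocol overhead, so 1 GB/s is already close to the ceiling. The sketch below turns a data size and a rate into days. The thread does not state how much data has to move, so the 300 TB figure is purely a hypothetical placeholder.

```python
SECONDS_PER_DAY = 86_400


def days_to_copy(size_tb, rate_gb_per_s):
    """Days needed to move size_tb terabytes at a sustained rate in GB/s
    (decimal units: 1 TB = 1e12 bytes, 1 GB = 1e9 bytes)."""
    seconds = (size_tb * 1e12) / (rate_gb_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


link_ceiling_gb_per_s = 10 / 8  # 10 Gb/s link = 1.25 GB/s theoretical maximum

hypothetical_size_tb = 300  # assumption for illustration only, not from the thread

print(f"at 1.00 GB/s: {days_to_copy(hypothetical_size_tb, 1.0):.1f} days")
print(f"at link ceiling: {days_to_copy(hypothetical_size_tb, link_ceiling_gb_per_s):.1f} days")
```

With that placeholder size the move takes roughly three and a half days at 1 GB/s, and even a fully saturated 10 Gb link only brings it down to a little under three.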

limber lake
# jagged crag Side question... what is the difference between "Fast zeroed" and normal very sl...

how is fast zeroing different from full zeroing? Well, full zeroing writes zeroes over every block. Fast Zeroing writes zeroes only over some important areas and keeps the rest of the data as-is. Since data is retained in each block, there needs to be a different method of finding out if a block is "zeroed" or not (for parity calculations etc.), because simply reading the block and testing all bits as 0 doesn't work.

Technically, the way it is done is that a random signature gets generated when the disk is zeroed. This signature is stored in the checksum area of each block during write (the 8 bytes after the 512 data bytes on disks with 520 bytes per sector, or the 64 bytes after the 4096 data bytes on 4k-sectored disks). Every time a block is read, the signature is checked, and if it doesn't match the one created during disk zeroing, the block is treated as zeroed.
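
The mechanism described above can be illustrated with a toy model. This is not ONTAP code, just a sketch of the idea: the class, its method names, and the 8-byte signature length are all assumptions made for the example.

```python
import os


class FastZeroDisk:
    """Toy model of signature-based fast zeroing (illustrative only)."""

    def __init__(self, n_blocks, block_size=4096):
        self.block_size = block_size
        self.signature = os.urandom(8)  # assumed signature length
        # Each block holds (data, signature stored in its checksum area).
        # Start with leftover data written under the current signature.
        self.blocks = [(os.urandom(block_size), self.signature)
                       for _ in range(n_blocks)]

    def fast_zero(self):
        """'Zero' the disk by replacing the signature; data is not touched."""
        new_signature = os.urandom(8)
        while new_signature == self.signature:  # guard against a collision
            new_signature = os.urandom(8)
        self.signature = new_signature

    def write(self, index, data):
        """Store data together with the current signature."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = (data, self.signature)

    def read(self, index):
        """Return the data, or zeroes if the block predates the last zeroing."""
        data, stored_signature = self.blocks[index]
        if stored_signature != self.signature:
            return bytes(self.block_size)
        return data


disk = FastZeroDisk(n_blocks=4)
disk.fast_zero()
print(disk.read(0) == bytes(4096))   # stale block now reads as zeroes
disk.write(1, b"x" * 4096)
print(disk.read(1) == b"x" * 4096)   # freshly written block reads back
```

The point of the design is that `fast_zero` costs the same no matter how large the disk is, because the old data stays on the platters and is simply never trusted again.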

jagged crag
#

makes total sense... when I created the new aggregate it just took a while, and I managed to run a quick "aggr status -r", and you can see the result above... so I was worried that it would have to do the "long zeroing" on some of the disks, as it was indicating... but after a minute it completed.... all with "fast zeroed" status now... 🙂