We have a backup datacenter to which we snapmirror data from other clusters... to make this a bit easier to manage we are considering adding 3-4 FAS27xx HA nodes into one single cluster.
We of course need a pair of cluster switches for this... and since the FAS27xx only has SFP+ 10G ports, it doesn't make sense to buy a 25G switch for this.
We do however have a pair of CN1610 which I guess would work OK for this task?
I am aware that this is an EOA/EOS switch, but at least we would be using it only for cluster traffic rather than just using ports in our Nexus switches...
The main reason why it makes sense to set up one cluster is the possibility to move volumes between nodes to balance it all, and we don't have to change the SnapMirror relationships...
There is no "high load" on any of these systems because most of them are behind firewalls etc., so the speed is limited to about 1-2 Gb/s per cluster...
The FAS27xx systems all have valid NetApp service for 3-4 years (I think), but would setting this up invalidate our support on everything? (I'm pretty sure it won't, just asking about the implications).
If you wanted to set this up in a fully supported configuration, which switches would you use? The last multi-node setup I did was with C800 and 100G Cisco switches, and looking at the models I can find in HWU they all look like they have QSFP ports... so would you just use a breakout cable? Seems like a waste to use an overpowered switch for this...
When we upgrade to the FAS2800 we can consider upgrading the switches to ones that support 25G SFP28...
I have heard "horror stories" about using unsupported switches for cluster switches, but is it really that bad if you just use common sense?
#Clustering FAS27xx with CN1610 because why not?
Technically, this will work, of course. I think you could even get temporary support for the CN1610 switches for migration scenarios. Is it supported? No. But if you open a support case that's unrelated, chances are that nobody even looks at the switches. However, if you're having issues that can be reasonably traced back to the cluster network, you might not get any support.
The "horror stories" are mostly if you're using misconfigured switches that don't have jumbo frames enabled or similar. But there's no "magic" involved in the cluster backend, it's just a regular TCP/IP network, so if you are careful, it shouldn't be a problem.
I have been tempted to just use our Nexus switches, because this is non-critical stuff... And after looking at the NetApp RCF files, which have the supported configuration, you are right, there is no real magic here... only common sense for setting up a switch with jumbo frames, LACP and VLANs to separate things... the CN1610 doesn't even have VLANs configured as far as I remember... 🙂 I guess if we have the 2U available rack space for the CN1610s it would make me sleep a bit better using them 🙂
We have used various non-supported switches in the past, either for customers who had systems that were already out of support, or for temporary migration scenarios. Since we do 1st and 2nd level support ourselves anyway this wasn't a big issue, and we never encountered any issues
I have been working with NetApp as a tech for the past 25 years, so I also know that NetApp support is one of the best out there, especially if you are really in trouble, and I have also had cases where they worked on systems that were not even in service (we were migrating to a newer NetApp). They are fair to their customers if they are loyal to them 🙂 Now I also have a few customers to whom I have sold SSP support via Arrow (our local distributor), and they are also nice to work with if they are local or from the UK. I have been working as a TSE for Fujitsu locally when they sold NetApp (which they don't any longer), so I also know a bit of the back-end 🙂 But support was always better "back in the day" when you got hold of some of the WAFL developers in the US and learned something new every time 😉
Yeah it's a bit harder since we've grown to get direct with devs. If devs took every case, we'd never get ONTAP developed.
Thanks for the kind words about our support. 🙂
If the CN1610 is temporary this should be fine, but make sure to watch ifstat (can also look at NIC View in PAS if you have access) to make sure frame drop rate isn't terrible.
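For the ifstat suggestion above, here is a rough way to turn two counter samples into a drop rate. This is only a minimal Python sketch: the function names and the idea of passing in "total frames" and "dropped frames" counters are assumptions for illustration (you would read the numbers off ifstat yourself), and what counts as "terrible" is left to your own judgment.

```python
def drop_rate(frames_total: int, frames_dropped: int) -> float:
    """Return dropped frames as a fraction of all frames (0.0 if no traffic)."""
    if frames_total < 0 or frames_dropped < 0:
        raise ValueError("counters must be non-negative")
    if frames_total == 0:
        return 0.0
    return frames_dropped / frames_total


def interval_drop_rate(total_before: int, dropped_before: int,
                       total_after: int, dropped_after: int) -> float:
    """Drop rate for the traffic seen between two counter samples."""
    # Counters are cumulative, so only the difference describes the interval.
    return drop_rate(total_after - total_before, dropped_after - dropped_before)


if __name__ == "__main__":
    # Hypothetical samples taken some minutes apart on one cluster port.
    rate = interval_drop_rate(1_000_000, 20, 3_000_000, 220)
    print(f"drop rate over interval: {rate:.6%}")
```

Comparing intervals rather than lifetime totals makes it easier to see whether drops are still happening or are left over from an earlier event.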
And we really don't want to use customer switches because we really don't want other traffic to influence cluster switches. A 2720 won't be powerful enough to matter, but bigger systems can jam multi GB/s down switches and they have to be able to efficiently handle the traffic.
Actually the most annoying thing is the alerts I get after connecting four nodes together over a Cisco 10G Nexus switch... it just complains like a *itch 😉 Even if you configure your cluster switch with the correct SNMP name etc., it still just won't give it a rest... and to my knowledge it's not possible to just disable this... Is there a Nexus switch which makes sense to use with the FAS2700 (10G cluster links)? Most of the supported switches are 40G or 100G now... Looking at HWU I can see that an old 5010 might do the job, but I guess that will also complain, as will the CN1610? I know that this works... all ports are healthy and stressing the ports works just fine... but I just cannot live with the error every time I log in 😉 I give up 🙂
@dim musk any example of said error you refer to? I’ve done lots of clusters with 10G on 1610 and a handful on the old Nexus 5k (supported models). Not sure I’ve seen what you are referring to.
I get "Switch-Health: degraded"
and UnsupportedSwitch_Alert
Because my switches are not one of the supported switches
Ah. That makes sense then
I'm trying with a Nexus 3172PQ... (N3K, but it's using the N9K IOS)... I have a pair of CN1610s, but I think it's a bit much to take up two rack units in our already pretty full rack 😉 So I was just hoping that it would work... knowing that it wasn't supported... but I was not expecting that the alerts couldn't be disabled somehow...
It's when I do a "system health alerts show"...
Yeah that one maybe not so much
Subsystem Health
SAS-connect ok
Environment ok
Memory ok
Service-Processor ok
Switch-Health degraded
CIFS-NDO ok
Motherboard ok
IO ok
MetroCluster ok
MetroCluster_Node ok
FHM-Switch ok
FHM-Bridge ok
SAS-connect_Cluster ok
13 entries were displayed.
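If you paste a listing like the one above into a script, picking out the unhappy subsystems is simple. A minimal Python sketch, assuming the two-column "name, health" layout shown in this thread; the function name and the sample text are just for illustration and this is not an ONTAP tool.

```python
def degraded_subsystems(output: str) -> list:
    """Return subsystem names whose health column is anything other than 'ok'."""
    found = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows have exactly two columns: subsystem name and health.
        if len(parts) != 2:
            continue
        name, health = parts
        # Skip the header row and any dashed separator row.
        if name == "Subsystem" or name.startswith("-"):
            continue
        if health != "ok":
            found.append(name)
    return found


SAMPLE = """\
Subsystem Health
SAS-connect ok
Environment ok
Switch-Health degraded
MetroCluster ok
13 entries were displayed.
"""

if __name__ == "__main__":
    print(degraded_subsystems(SAMPLE))
```

The trailing "entries were displayed" line is ignored because it does not have exactly two columns.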
You shouldn’t get that message with the 1610 unless you’re using ONTAP 9.13+
Strange because I know that if you run a metrocluster with non-mirrored disks, you also get an alert here (under metrocluster) but it doesn't show this on the frontpage when you login...
Yeah... we are of course using 9.15 😉
Metro cluster also supports other switches in certain configurations
Not a fan. But it works if they are properly configured
well this one customer I have is a MCC-FC...
I think I will revert to two separate clusters and live with the hassle of changing the source clusters' cluster peers...
That was my idea... to have one cluster and not have to "think" about setting up all the source clusters with different cluster peers and having to make a choice every time you want to protect a volume...
...with one cluster I could move the volumes around between the aggregates...
You do know about the multi-cluster RCF for the Nexus 9336? It allows for up to two or up to four separate clusters on the same pair.
Yeah, but the 9336 is very much overkill with a FAS2700 😉
It's a great switch... I think I have just set up a MCC-IP with four C800 nodes... or I think it was that switch... they are all the same to me 😉
Just looked it up: N9K-C9336C-FX2 🙂
We had to get the long-range 100G QSFP28 40 km optics from Cisco... eye-watering prices 😉
But how would you go about connecting a FAS2700? Is there a QSFP28 to SFP+ adapter that is supported?
..or maybe a 100G to 10 x 10G in one port? 😉
You really need long range for a local cluster breakout?
In order to do optical breakout, you need an optic with MPO. Most long-range optics (well, all that I can see) are LC-based, so no breakout.
I’ve certainly used the 100g SR4 optics and then an MPO -> 4xLC breakout works
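To put a number on the breakout idea: an SR4-style optic carries four lanes over one MPO connector, so one QSFP port can fan out to four LC links. A minimal Python sketch of the port count, assuming each node has one cluster link to each switch (an assumption about the cabling, not something stated in this thread) and that the switch port can actually run in a breakout mode at a lane speed the node's SFP+ port supports, which you would need to confirm in HWU and the RCF.

```python
import math

LANES_PER_QSFP = 4  # SR4-style optics: four lanes behind one MPO connector


def qsfp_ports_per_switch(nodes: int, links_per_node_per_switch: int = 1) -> int:
    """QSFP ports needed on each switch when every port is broken out 4 ways."""
    if nodes < 0 or links_per_node_per_switch < 0:
        raise ValueError("counts must be non-negative")
    links = nodes * links_per_node_per_switch
    return math.ceil(links / LANES_PER_QSFP)


if __name__ == "__main__":
    # Hypothetical: three or four HA pairs, i.e. six or eight nodes.
    for nodes in (6, 8):
        print(nodes, "nodes ->", qsfp_ports_per_switch(nodes), "QSFP ports per switch")
```

Under these assumptions, even four HA pairs would occupy only two QSFP ports per switch, which is the "overkill" being discussed.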
No no, that was a MetroCluster-IP we did 🙂 with about 30 km between the two HA nodes 😉 so you need four long-range optics between the two fabrics...
Anyway... I will revert the setup to two clusters tomorrow... good thing only one of the HA pairs has to be nuked and set up as a new cluster... sometimes I miss 7-Mode 😉
you can disable health monitoring for the switches and then acknowledge/delete the alerts. they shouldn't show up again after that
Something like this? "system switch ethernet modify -is-monitoring-enabled-admin false -device *"
That might do it
I might have made a little "buh buh"... (nothing critical). Before I joined the two pairs into one cluster, I snapmirrored a few volumes over to the cluster where I would later join the "new" pair... I might not have been awake, because I moved from a non-NAE aggregate to an aggregate with NAE (encrypted)... and now I am trying to move it back... which is possible with simple FlexVols by adding "-encrypt-destination false -encrypt-with-aggr-key false" to the volume move command... but I have a "small" FlexGroup with two constituents which I cannot move in the same way... or is there a way? Maybe a local SnapMirror will do it? Or maybe just mounting it on a host and doing a copy (it's just "dumb" files anyway)... I get this error when I try: "Error: command failed: Conversion to a plain-text volume is not supported for FlexGroup constituent "data01_dest__0001" because one or more of the constituents in FlexGroup "data01_dest" resides on an NAE (NetApp Aggregate Encryption) aggregate."
...seems like SnapMirror also complains that there is no encryption on the destination... so I guess I need to copy via a host then...
When moving a volume for encryption, provided you have enough space you can “move” it to the same aggregate specifying the proper encryption
The two are mutually exclusive, I think: NAE or NVE. So if you specify -encrypt-destination true then it removes NAE, and if you specify -encrypt-with-aggr-key true then it becomes NAE and removes NVE.
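The flag combinations mentioned so far in this thread can be summed up as a small lookup. A minimal Python sketch of that reasoning only: the function name is made up, the mapping is taken from what posters said here rather than from the ONTAP documentation, and the "both true" case is rejected because the thread describes the two options as mutually exclusive, not because the actual CLI behaviour was checked.

```python
def resulting_encryption(encrypt_destination: bool, encrypt_with_aggr_key: bool) -> str:
    """Map the two volume move flags to the encryption state described in the thread."""
    if encrypt_destination and encrypt_with_aggr_key:
        # Described as mutually exclusive in the thread; actual CLI behaviour not verified.
        raise ValueError("choose either NVE or NAE, not both")
    if encrypt_with_aggr_key:
        return "NAE"        # encrypted with the aggregate key
    if encrypt_destination:
        return "NVE"        # encrypted with a per-volume key
    return "plaintext"      # both false: the FlexVol case from the earlier post


if __name__ == "__main__":
    for flags in ((False, False), (True, False), (False, True)):
        print(flags, "->", resulting_encryption(*flags))
```

Note that the plaintext case is exactly the one that fails for FlexGroup constituents in the error quoted earlier.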
You absolutely need to have encryption on the hosts
It may be as simple as
security key-manager onboard sync
To sync all hosts to the passphrase
Of course if you don’t have the key/code and it’s under support, contact support. It’s a free license
(VE)
I have the license... just need to find the keyphrase 😉
Good. You’ll be in business then
nah
not quite... I think I need to enable NAE on the new aggregates?
I get this: s01::*> volume move start -vserver privat -volume data01_dest__0001 -destination-aggregate data03 -encrypt-destination false -encrypt-with-aggr-key false
Error: command failed: Conversion to a plain-text volume is not supported for FlexGroup constituent "data01_dest__0001" because one or more of the constituents in
FlexGroup "data01_dest" resides on an NAE (NetApp Aggregate Encryption) aggregate.
I think I am stuck in "encrypted land" 😉 I think the easiest way is to mount the two volumes on a host and copy from one to the other...
In order to flip the aggr bit, all volumes on the aggr must be NVE first. Then you flip the bit, then you convert to NAE. There is a very detailed KB article on doing this.
And any root SVMs should be migrated to an NAE volume with -encrypt-with-aggr-key true first. However, I’ve heard that with 9.14 there are new rules regarding encrypting the SVM root.
I think this is the KB, recently updated
And this is probably more appropriate
I have read the FAQ about NVE and NAE, but would it make sense for us to convert everything to NAE on our SnapMirror destination clusters, which have different clusters connected that mirror volumes that are unencrypted and NVE... It doesn't seem to be "transparent" to restore a volume back from an NAE aggregate to an aggregate with no encryption. It "fails" in the GUI, but it is possible on the command line (just not for FlexGroups)... I wonder if applications like SnapCenter can figure this out? 😉
I usually just work with customers and enable nae all around. Get everything to the same configuration for encryption. Sometimes it takes a while.
Well, I have a handful of older systems, and the customers have not taken up the task of converting everything to NAE...
If you have the space, never use “volume encryption conversion start”; it just takes way too long compared to doing a “volume move” with encryption. Plus, that only works to get to NVE. To go to NAE you must use a volume move anyway.
"way too long" is an understatement... in-place encryption is even slower than you can imagine 😉
we have seen 4+ months for a 100 TB FlexGroup, for example
I know someone that tried it on a single constituent. Yeah. It was a MONTHS long process.
Good thing we didn't start this, then we would not be finished within a year 😉 I think we will advise our customers to use NVE... but what happens if one customer has volumes on NAE, can we still SnapMirror this to a non-NAE aggregate and maybe include NVE? 😉 Or are we more or less forced to create an aggregate with NAE?
when using snapmirror, it is agnostic about encryption. It does not matter on the source or destination
When you try to do something local on the cluster, that is something else.
If you are doing SnapMirror (on the CLI at least) you create your destination DP volume. When you create that DP volume, you can specify whether it is NVE or NAE. Then when SnapMirror runs, it reads the source and dumps into the destination regardless of encryption.
ahh ok.. I understand... I am still thinking of my issue with the flexgroup 😉
yeah. I have run into mostly every scenario with customers regarding encryption and trying to come up with supported ways to get from A to A'
I think when we eventually build new systems or aggregates we will just enable NAE because why not... and as we move the volumes there it will be encrypted automatically...