#LIF failover policies
I change them to broadcast-domain-wide for LIFs that can go anywhere in the cluster. For node mgmt LIFs I set the policy to disabled. This prevents a routine EMS alert from being generated because a LIF has no failover target. (Most node mgmt LIFs have nowhere to fail over to anyway. The policy defaults to local-only, but if there is just one port you get a weekly EMS alert.)
From a failover perspective it really doesn’t matter where the LIF goes unless it’s a node takeover. Then I’d prefer the LIF on the same node as the storage.
Oh, and for clarification, system-defined will typically do the correct thing.
You can verify where the lifs will go:
net int show -failover
That shows you which ports ONTAP will fail the LIFs over to.
Yes; just wondering why it wouldn't be the partner node
I could see this causing issues if, for example, node3 had different link speeds and the policy were left at the default
Looking at a 4 node system with the command you mentioned verifies the below statement from KB:
https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/Network_interface_failover_policies_behavior_and_uses
system-defined: This policy is the default for LIFs of type data. This policy prioritizes failover to ports that are not on the storage failover (SFO) partner. In a 4-node cluster, for example, LIFs homed to ports on node 1 would fail over to node 3.
This is a good tip; thanks
Well, it should be failing over based on broadcast domains anyway. That should imply the links and VLANs are symmetrical.
If they are not, I’d suggest creating your own failover groups that contain only the specific ports you want, and then using that failover group instead of the broadcast domain
I've no issues; works as advertised 🙂
Just wondering why it would prioritize a different HA pair instead of the partner
huh... that's new to me...
I've always thought that system-defined for NAS data LIFs would be equal to broadcast-domain-wide. 🤨
system-defined - The system determines appropriate failover targets for the LIF. The default behavior is that failover targets are chosen from the LIF's current hosting node and also from one other non-partner node when possible
I mean, we also usually change it manually to broadcast-domain-wide. But does that mean the only difference with system-defined is that there is a prioritization?
- 4-node: if any other non-SFO-node is available --> failover there; if only SFO-node is available --> use this
- 2-node: there's only SFO-node --> failover there
...when possible"
Whereas broadcast-domain-wide would see all the ports in the failover-group as equal?
I think there is a priority based on the order they are listed in the broadcast domain. Try it. Do a "broadcast-domain show", then "net int show -failover", and see if there is any correlation
I guess the reasoning is to reduce the load in case of a failover or something. When the HA partner takes over it already has a higher load and you would want to move protocol load somewhere else if possible. Technically there's no difference between the HA partner and any other node, as all the traffic has to go through the cluster network anyways
you can even see that for some workloads, client traffic is actually faster (at least in benchmarks) if the LIF is not where the storage is
and with the latencies in the cluster networks of a few microseconds, there's no performance impact anyway (with a 10gig Cluster backend in our lab we see latencies around 40µs to 120µs)
hmm.... I do have to say that the output of net int show -failover is a bit misleading to me. At least with the following output I would have thought that LIF cifsB has ONLY two possible failover targets. Whereas there are actually possible failover targets on all nodes, but system-defined filters them and only shows the nodes that would be used first...
4-node cluster consisting of:
n11 & n12 --> HA-pair 1
n13 & n14 --> HA-pair 2
cluster1::*> net int show -vserver SVM2 -lif cifsA,cifsB -failover
  (network interface show)
         Logical         Home                  Failover        Failover
Vserver  Interface       Node:Port             Policy          Group
-------- --------------- --------------------- --------------- ---------------
SVM2
         cifsA           n12:a0a-2126          broadcast-domain-wide VLAN2126
                         Failover Targets: n12:a0a-2126, n13:a0a-2126,
                                           n14:a0a-2126, n11:a0a-2126
         cifsB           n12:a0a-2126          system-defined  VLAN2126
                         Failover Targets: n12:a0a-2126, n13:a0a-2126
2 entries were displayed.
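For comparing target lists across many LIFs, output like the above can be pulled apart with a small script. This is only a rough sketch: the function name `parse_failover_targets` and the regular expression are made up for illustration, and the parsing assumes the output keeps the shape pasted in this thread (a LIF row followed by "Failover Targets:" lines, where a trailing comma means the list continues on the next line).

```python
import re
from typing import Dict, List, Optional

# Sample rows copied from the output pasted in this thread.
SAMPLE = """\
cifsA n12:a0a-2126 broadcast-domain-wide VLAN2126
Failover Targets: n12:a0a-2126, n13:a0a-2126,
n14:a0a-2126, n11:a0a-2126
cifsB n12:a0a-2126 system-defined VLAN2126
Failover Targets: n12:a0a-2126, n13:a0a-2126
"""

# Assumed row shape: <lif> <node:port> <policy> [<group>]
LIF_ROW = re.compile(r"^(\S+)\s+(\S+:\S+)\s+(\S+)")


def parse_failover_targets(text: str) -> Dict[str, List[str]]:
    """Map each LIF name to the failover targets listed under it."""
    targets: Dict[str, List[str]] = {}
    current: Optional[str] = None
    continuing = False  # True while a target list wraps onto the next line
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Failover Targets:"):
            if current is None:
                continue
            payload = line.split(":", 1)[1]
        elif continuing:
            payload = line
        else:
            match = LIF_ROW.match(line)
            if match:
                current = match.group(1)
                targets[current] = []
            continue
        targets[current].extend(p.strip() for p in payload.split(",") if p.strip())
        continuing = line.endswith(",")
    return targets


if __name__ == "__main__":
    for lif, ports in parse_failover_targets(SAMPLE).items():
        print(lif, len(ports), ports)
```

A LIF that comes back with a single target (only its own port) would be a candidate for the EMS alert mentioned earlier in the thread.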
I guess if I set the n13:a0a-2126 port administratively down (-up-admin false), the cifsB LIF would then show n14 as a failover target.
And if I down that too it would show n11 (the SFO partner).
Am I correct?
(can't test it since it's a customer environment and we don't have any 4-node clusters in our lab)
yeah, someone decided to reconfigure all our lab systems as switchless clusters 😉 you should be able to use the simulator though?
To me this looks according to the documented behavior:
system-defined - The system determines appropriate failover targets for the LIF. The default behavior is that failover targets are chosen from the LIF's current hosting node and also from one other non-partner node when possible.
so in that example, one target on n12 and one on either n13 or n14
Yes, but these will not be the only ones. The LIF may also fail over to n11 if the ports on both n13 and n14 are down. So it's basically still broadcast-domain-wide; it only does some prioritization first.
yeah because the list is re-computed at every port status change, and so the possible failover targets change whenever a port goes up or down. So it's always two, but the two are not static. At least that's how I understand it
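To make that understanding concrete, here is a toy model in Python. It is not ONTAP's actual algorithm, just the behavior as described in this thread: the list holds the LIF's current node plus one other node, preferring a non-partner and falling back to the SFO partner only when no non-partner port is up. The function name, the `HA_PARTNER` map, and the alphabetical tie-break between non-partner nodes are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical HA layout, taken from the 4-node example in this thread.
HA_PARTNER: Dict[str, str] = {"n11": "n12", "n12": "n11", "n13": "n14", "n14": "n13"}


def system_defined_targets(current_node: str, port: str,
                           nodes_with_port_up: Set[str]) -> List[str]:
    """Toy model: current node plus ONE other node, non-partner preferred."""
    partner = HA_PARTNER[current_node]
    targets = [current_node] if current_node in nodes_with_port_up else []
    # Assumption: pick the first non-partner node alphabetically.
    non_partners = sorted(
        n for n in nodes_with_port_up if n not in (current_node, partner)
    )
    if non_partners:
        targets.append(non_partners[0])
    elif partner in nodes_with_port_up:
        # Only the SFO partner is left, so use it.
        targets.append(partner)
    return [f"{node}:{port}" for node in targets]


if __name__ == "__main__":
    up = {"n11", "n12", "n13", "n14"}
    # Re-compute the list after each simulated port status change.
    for port_going_down in (None, "n13", "n14"):
        if port_going_down:
            up.discard(port_going_down)
        print(sorted(up), "->", system_defined_targets("n12", "a0a-2126", up))
```

If the understanding above is right, downing the port on n13 and then on n14 moves the second target from n13 to n14 and finally to n11. That would need to be confirmed on a real 4-node cluster or the simulator.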
however, there are some bugs related to failover targets, e.g. this one (not public): "Bug ID 1182625: Allow more failover targets on additional node in case of two down nodes scenario."
I've seen system-defined fail over a LIF to a node that then went down as part of a batch upgrade, leaving it nowhere to fail over to. That was back in the ONTAP 8 or very early 9 days, so hopefully that race condition has been fixed by now.
Probably before 8.3.
I’ve seen LIFs fail over to the wrong locations, like a data LIF moving to e0M. Pretty sure it’s fixed, but I still create broadcast domains and failover groups as needed to avoid it
I don't remember where I got the info from (maybe on here?) but that is how it was explained to me. They want to get the LIF off of an HA pair that is in a degraded state.
This has caused us issues with LIFs that only have ports configured on a single HA pair... they have nowhere to go if the policy is system-defined.