#9.12.1P4 issues


thorn cosmos
#

Heads up on 9.12.1P4. On a small 2-node cluster, we've already run into 2 issues since Friday. The first is EMERGENCY messages for vol.duplicate.msid events on our root mirrors. Support says to ignore the messages or filter them out.

We're also running into issues with the scheduled snapmirrors on the LS mirrors failing. We are also working this with Support.

Neither is major, but we don't normally see issues this quickly following an ONTAP release. Interestingly enough, we did not see either issue on the first cluster we upgraded. When's P5 due out? 🙂

#

To make life more challenging, I expect some people will be upgrading to this to mitigate the AD auth issues...

green niche
#

Thanks for the heads-up, but I don't remember when we last deployed LS mirrors for SVM roots. Never really saw the need.
When an aggr fails, sh*t has already hit the fan. Yes, with an SVM root mirror enabled, vols on other aggrs would still be accessible. But if the aggr failure is due to high temperature or something like that (shelf shutdown), then the other aggrs will usually follow.
Mostly it's safer to completely switch to the DR site.

thorn cosmos
#

It's still a best practice to create an LS mirror for SVM roots on another node.
https://docs.netapp.com/us-en/ontap/data-protection/create-load-sharing-mirror-task.html

"You should create a load-sharing mirror (LSM) for each SVM root volume that serves NAS data in the cluster. You can create the LSM on any node other than the one containing the root volume, such as the partner node in an HA pair, or preferably in a different HA pair. For a two-node cluster, you should create the LSM on the partner of the node with the SVM root volume.

If you create an LSM on the same node, and the node is unavailable, you have a single point of failure, and you do not have a second copy to ensure the data remains accessible to clients. But when you create the LSM on a node other than the one containing the root volume, or on a different HA pair, your data is still accessible in the event of an outage."

fair mauve
#

That guidance is for clusters with more than 2 nodes using NAS. If you have an HA pair and one node goes away, SFO kicks in. If both nodes go down, there is no data anyway. With 4 or more nodes, the LS mirror lets clients access the remaining data if the HA pair containing svm-root goes offline.

#

With SAN, nothing should be mounted in the namespace anyway, no LS mirror needed

thorn cosmos
#

If it's only for more than 2-node clusters, why does it say "For a two-node cluster, you should create the LSM on the partner of the node with the SVM root volume"?

#

With the prevalence of flexgroups these days, creating an LS mirror has less value since a pair failure could shut you down anyway.

fair mauve
#

with 2-nodes:
If node 1 fails, then node 2 takes over and serves all data. No problem. The same applies if node 2 fails: node 1 will take over all data from node 2 and continue.

The LS mirror does not really provide any benefit in this case. That document is from the 2016 (ONTAP 9.1) docs.

Looks like the "protection" it provides is against the SVM root becoming unusable. My thought there is that if the root is unusable, that state gets replicated, and the LS mirrors would be unusable too.

Certainly use it if you'd like. I am seeing zero benefit with 2 nodes and only suggest them with >2 nodes.

From the ONTAP 9 Release Notes:
Updated guidance on load-sharing mirrors:

It is no longer required that you create an SVM load-sharing mirror (LSM) on every node in the cluster. You can create the LSM on any node other than the one containing the root volume, preferably in a different HA pair.

All ONTAP 9 releases

Read into that: any node other than the one containing the root volume, preferably in a different HA pair.
Meaning, in a 2-node cluster, don't worry. In a 4-node cluster, you should have 1 x LS mirror.
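
For anyone who wants the placement rule from that note spelled out, here is a minimal Python sketch. It only encodes the wording quoted above ("any node other than the one containing the root volume, preferably in a different HA pair"). The `Node` type, the `ha_pair` labels, and `pick_lsm_node` are hypothetical names made up for illustration; none of this is an ONTAP API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    ha_pair: str  # hypothetical label for the HA pair this node belongs to


def pick_lsm_node(nodes: List[Node], root_node_name: str) -> Optional[Node]:
    """Pick a node for one LS mirror of the SVM root volume.

    Rule modeled: never the node holding the root volume; prefer a node
    in a different HA pair; otherwise fall back to any other node.
    """
    root = next(n for n in nodes if n.name == root_node_name)
    candidates = [n for n in nodes if n.name != root.name]
    if not candidates:
        return None  # single-node cluster: nowhere else to place it
    other_pair = [n for n in candidates if n.ha_pair != root.ha_pair]
    return (other_pair or candidates)[0]


# Hypothetical clusters, names invented for the example.
two_node = [Node("n1", "ha1"), Node("n2", "ha1")]
four_node = two_node + [Node("n3", "ha2"), Node("n4", "ha2")]

print(pick_lsm_node(two_node, "n1"))
print(pick_lsm_node(four_node, "n1"))
```

In the 2-node case the only candidate is the HA partner, which is exactly the case being debated here: the rule permits it, and whether it buys you anything is the open question.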

snow rivet
fair mauve
#

"if it's taken over by another node"? The ONLY node that takes over storage is its HA partner, and it will/should continue to serve all data, including svm_roots. In the entire existence of GX/CDOT/Clustered ONTAP, I have never seen a single case where root did not continue being served in a 2-node cluster during takeover (as long as SFO was actually working). Please read the release notes. You are welcome to do it, but it is no longer recommended with a 2-node cluster.

snow rivet
#

If SFO worked, it's fine. However, if SFO happened but corruption or something else affected access to svm_root, then it's a problem. In the enterprise we want to ensure protection at all levels. Corruption can happen, and there can be several reasons why the SVM (Storage Virtual Machine) root volume becomes unavailable or corrupt: a hardware failure in the storage system, such as a disk failure or a controller issue; disk errors, such as bad sectors or disk media failure; sudden power outages or improper shutdowns...

#

Yes, yes, I saw the release notes. What it means is you don't have to mirror the svm_root on all nodes. It says "It is no longer required that you create an SVM load-sharing mirror (LSM) on every node in the cluster. You can create the LSM on any node other than the one containing the root volume, preferably in a different HA pair." It doesn't say it is not recommended for a 2-node cluster.

thorn cosmos
#

Updated guidance on load-sharing mirrors:
Awesome - thanks for pointing that out. We missed that comment somewhere along the way.

lofty iris
#

I never understood the use for LS mirrors either. As TMAC stated, HA takes care of that in all cases. If you want to ensure that "corruption cannot happen" (whatever that means?) then you would also need to have a synchronous mirror of all your data volumes, too, as those are actually more important and you're worse off if they get "corrupted" than when your empty root volume gets "corrupted".

Also, for example, for SAN workloads, you wouldn't even need a root volume as the volumes containing the LUNs don't even need to be mounted to be accessible. Even NFS mounts continue to work if their volume is unmounted (this one caught some of our engineers by surprise, but it's not surprising if you know how NFS handles work)

lofty iris
# snow rivet If SFO worked its fine however SFO happened but corroption or something that aff...

also "If there is a hardware failure in the storage system, such as a disk failure [...]can cause the SVM root volume to become unavailable or corrupt.."
umm.. have you heard of RAID? WAFL? Those things were designed to prevent these exact issues.
You're describing failure scenarios that are so remotely unlikely that it's far more likely that your datacenter burns down or gets flooded, in which case your root volumes are the least of your problems. For example, in 15+ years of working with NetApp, I have never seen a (genuine) triple-disk failure in even a single RAID-DP group. I'm sure they have happened, but not in our install base of 1000+ systems

green niche
# fair mauve with 2-nodes: If node 1 fails, then node 2 takes over and serves all data. No pr...

"I am seeing zero benefit with 2 nodes"
The only possible benefit I can think of from having LSMs for SVM root-vols in a 2-node cluster:
Say you have a FAS8300 with two shelves connected, each shelf used for one data-aggr:
shelf1 --> aggr1 (plus the root-aggrs of both nodes)
shelf2 --> aggr2

You have an SVM with 1x data-vol on aggr1 and 1x data-vol on aggr2. The SVM root-vol is also on aggr2.
If the shelf of aggr2 - for reasons unknown - suddenly goes offline now, you obviously lose access to everything on aggr2.
But since the SVM root-vol is gone too, you also can't access the data-vol on aggr1 - even though it's still online because shelf1 with aggr1 is fine.

Question is how likely this scenario might be. Maybe you cabled one shelf wrong and put both power-cables to the same phase which then failed. Or maybe a shelf-IOM firmware-update went rogue (never experienced that, also needs to happen for both IOMs).
Basically LSMs in this scenario add additional protection. But it's not like without LSMs there would be any single-point of failure. Everything is redundant already, you simply chose to cover a possible misconfiguration. (Which shouldn't exist if you use ConfigAdvisor and do your failover-tests before deployment.)
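
To make the shelf scenario above concrete, here is a toy Python model of it. It assumes, as the scenario does, that NAS access through the namespace needs at least one online copy of the SVM root volume (the root-vol itself or an LSM). The function name and the dict/set layout are made up for illustration, and the model says nothing about how ONTAP actually decides this (panics, takeover, and already-mounted NFS clients are all ignored).

```python
from typing import Dict, Set


def accessible_volumes(
    volumes: Dict[str, str],   # data-vol name -> aggr it lives on
    online_aggrs: Set[str],    # aggrs that are still online
    root_copies: Set[str],     # aggrs holding the SVM root-vol or an LSM of it
) -> Set[str]:
    """Return the data-vols reachable through the namespace in this toy model."""
    # Assumption: without any online root copy, nothing is reachable.
    if not (root_copies & online_aggrs):
        return set()
    return {vol for vol, aggr in volumes.items() if aggr in online_aggrs}


volumes = {"data1": "aggr1", "data2": "aggr2"}
after_shelf2_failure = {"aggr1"}

# SVM root-vol only on aggr2, no LSM.
print(accessible_volumes(volumes, after_shelf2_failure, {"aggr2"}))

# SVM root-vol on aggr2 plus an LSM on aggr1.
print(accessible_volumes(volumes, after_shelf2_failure, {"aggr2", "aggr1"}))
```

Under these assumptions the first call leaves nothing reachable even though aggr1 is online, and the second keeps the data-vol on aggr1 reachable, which is the whole argument for the LSM in this layout.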

fair mauve
#

Doesn’t ONTAP panic with a multi disk failure at that point? That’s been my experience when a whole shelf goes offline

green niche
#

Hm... not sure actually. Would it panic if its root-aggr stays online? I know there are scenarios where aggrs simply go offline, and where a node decides to "multi-disk panic".

But that shouldn't change much. Node2 panics, node1 takes over. The data-vol on aggr1 is still inaccessible.
With an LS mirror of the SVM root-vol on aggr1, the data-vol on aggr1 should be accessible now. I guess?

thorn cosmos
#

In >40 years in IT, I have seen many disks fail (and the fun of replacing disks pre-RAID). I have seen a double-disk failure in the same RAID5 set exactly once, and I was clairvoyant enough to have that RAID5 set mirrored between data centers so there was no data loss. I have felt the impact of sysadmins failing to patch code that could have prevented a change from corrupting an XFS file system and taking 160M customer-facing files offline. I have seen user error result in data loss many, many times... I have also seen ONTAP return incorrect data when hardware issues were not detected.
Given competent admins, it's FAR, FAR more likely that software bugs or user errors will result in data loss than it is for disk errors to cause data loss.

normal hound
#

The only real-world use-case I'd support is a big NAS deployment like 6+ node cluster, LIFs on every node, 1000+ users, DNS round-robin to access the LIFs, LS-mirror on every node...

In other words, real LOAD-sharing...

thorn cosmos
serene orbit
#