# ONTAP Upgrade: using normal NFSv4.x recovery procedures?

ashen prism
#

The text below is excerpted from the ONTAP Upgrade Advisor. We are not sure how to interpret it:

  1. What are "normal NFSv4.x recovery procedures"?
  2. How can we end the sessions (before upgrading)?
  3. If the clients can recover automatically from connection losses, why do we have to end the customers' sessions first?

NFSv4.x: NFSv4.x clients will automatically recover from connection losses experienced during the upgrade using normal NFSv4.x recovery procedures. Applications might experience a temporary I/O delay during this process. You should direct users to end their sessions before you upgrade.

pine shore
#
  1. Read this, it explains very well how NFSv4 works: https://community.netapp.com/t5/Tech-ONTAP-Blogs/NFSv3-and-NFSv4-What-s-the-difference/ba-p/441316

  2. Something like umount -f -l [path]; the exact command depends on the client.
    To check in ONTAP which clients are connected: vserver nfs connected-clients show

  3. Because, as mentioned, the apps might experience a temporary I/O delay. Depending on the application, that is either no issue or a big issue.
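
For Linux clients, one way to see which mounts would be affected before deciding whether to unmount them is to scan /proc/mounts for NFSv4 entries. The sketch below is only an illustration under assumptions: it relies on the Linux kernel reporting NFSv4 mounts with the filesystem type `nfs4`, the helper name `nfs4_mounts` is hypothetical, and it does not apply to ESXi hosts.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    server_export: str  # e.g. "10.0.0.5:/vol1"
    mountpoint: str
    fstype: str
    options: str


def nfs4_mounts(mounts_text: str) -> List[Mount]:
    """Return the NFSv4 entries found in /proc/mounts-style text.

    Each line is expected to look like:
    <device> <mountpoint> <fstype> <options> <dump> <pass>
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype == "nfs4":  # assumption: NFSv3 mounts show up as "nfs"
            found.append(Mount(device, mountpoint, fstype, options))
    return found


# Usage on a Linux client (not executed here):
#   with open("/proc/mounts") as f:
#       for m in nfs4_mounts(f.read()):
#           print(m.mountpoint, "<-", m.server_export)
```

The resulting list of mount points can then be cross-checked against the output of vserver nfs connected-clients show on the ONTAP side.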

What OS are your NFSv4 clients using? (please don't say ESXi)

ashen prism
#
  1. I have already read the link, but I am looking for a specific procedure that lets my NFSv4 clients recover automatically from connection losses, as Upgrade Advisor indicates.
  2. Unmounting file systems before the upgrade is a disruptive operation. I am seeking a non-disruptive upgrade, as Upgrade Advisor instructs.
  3. Yes, ESXi is the OS. Is there anything you would say about that?

honest bay
#

Historically there have always been issues with NFSv4 and ESXi. Search the NetApp communities. Search Reddit. I think there may have been a discussion here, too. It's just not very reliable. There is no real benefit to using v4, nothing but headaches. Build new NFSv3 datastores and migrate.

ashen prism
#

Yeah, we kind of know about the issues. But there are a lot of KBs, and according to our NetApp rep, the issues described in the community are no longer a concern and a non-disruptive upgrade is supported. Given all that, plus the fact that we already run v4, we decided to perform a non-disruptive upgrade. But Upgrade Advisor, under the non-disruptive upgrade plan, asks us to end users' sessions. All of that sounds contradictory...

Bottom line: a non-disruptive upgrade should work for v4 with ESXi, unless NetApp officially tells me otherwise.

pine shore
#

Nothing is ever 100% non-disruptive. If it were, there would be no need for an Upgrade Advisor. You could just always click update and that's it.
There have always been bugs, or clients which might misbehave. There have been NFSv3 bugs; I have had takeovers with NFSv3 datastores which were "disruptive" for some apps on some VMs because they were very time-sensitive and everything was under heavy load. NFS VAAI may also have been misconfigured, which led to some "disruptions" as well.
Experience (and the internet) simply tells me that the chance of being impacted by disruptions during failovers is much lower with NFSv3 than with NFSv4.

There have been many bugs with NFSv4, but many of them have also been fixed, both on the ONTAP side and on the vSphere side.
Here are some newer ones: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/ESXi_NFS4.1_VMDK_inaccessible_after_takeover. One of the bugs has been fixed recently; for the other I'm not sure.
Also this KB says "Don't use NFS 4.1 for ESXi datastores at current time" - I understand this sort of contradicts what the IMT or Chance said in the other thread.

But ultimately: NetApp will never guarantee with 100% certainty that nothing will ever break. You can't simply rely on documentation telling you whether "this is supported" or not. You ALWAYS need to test how your production workloads behave during maintenance in your own environment. You will never know unless you do failover tests and check your apps for impact, latency spikes, etc.
It's the same for backup: If you never confirm that restoring your backups actually works, there is no use in creating them in the first place.
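
One way to make such a failover test measurable is to run a small write probe on a client (or inside a guest VM whose disk lives on the datastore) while the takeover and giveback happen. The sketch below is only an illustration, not a NetApp or VMware tool: the function names, the `.io_probe` file name and the 1-second stall threshold are all assumptions chosen for the example.

```python
import os
import time
from typing import List, Tuple


def probe_write_latency(directory: str, samples: int = 60,
                        interval_s: float = 1.0) -> List[float]:
    """Write and fsync a tiny file repeatedly; return per-write latency in seconds."""
    path = os.path.join(directory, ".io_probe")  # hypothetical probe file name
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())  # force the write out to storage
        latencies.append(time.monotonic() - start)
        time.sleep(interval_s)
    os.remove(path)
    return latencies


def stalls(latencies: List[float],
           threshold_s: float = 1.0) -> List[Tuple[int, float]]:
    """Return (sample index, latency) for every write slower than the threshold."""
    return [(i, lat) for i, lat in enumerate(latencies) if lat > threshold_s]
```

Running `probe_write_latency` against a directory on the affected storage during a planned takeover, and then passing the result to `stalls`, gives a rough picture of how long I/O paused. Whether a pause of that length matters still depends on the application.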

You will need to decide whether to take the risk and see how everything behaves. Maybe do it on a weekend. Or don't take the risk: announce a maintenance window and shut down the important workloads where you can't take the chance.

#

If you want an official answer from support, this is the wrong place. Open a NetApp support case.

ashen prism
#

Can you please point out where you saw the statement below? I cannot find it:
Also this KB says "Don't use NFS 4.1 for ESXi datastores at current time"

pine shore
#

It's not mentioned there anymore. I mean, it's over 1.5 years later. Maybe the bugs which led to that statement got fixed, but I don't know. I still don't have any customers with NFSv4 and ESXi. Currently I would choose either NFSv3 with nconnect or NVMe/TCP.

ashen prism
#

My understanding is that "session trunking" spreads I/O across multiple LIFs on the same node at the same time, while "nconnect" opens multiple connections to a single LIF. So I cannot tell what the performance difference between them is.

Can you please explain in depth why nconnect can replace NFS session trunking, or do the same job? NFS session trunking is supported starting with ONTAP 9.14.1.

pine shore
#

I'm not saying nconnect and session trunking are doing the same or provide the same performance. But I know that NFSv3 with nconnect is faster than NFSv3 without nconnect (sometimes 2x).

Session trunking is a feature of NFSv4, there is no concept of a session in NFSv3. Since we don't want NFSv4 we have not really tested the performance benefits of session trunking in detail.
But even session trunking can be improved by nconnect since you can combine both features: Multiple connections per session (same IP) plus multiple sessions (different IPs but same node). With ESXi this works since 8.0U3: https://www.vmware.com/docs/whats-new-with-vsphere-8-core-storage (page 11)

If you manage to saturate your link with NFSv3 nconnect to a single port of your storage system, then of course NFSv4 with session trunking could be faster, since you can use multiple ports. But as always, you could use port channels or other ways of load-balancing your links to counter that.
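
The reasoning above can be put into a back-of-the-envelope model. This is a deliberately simplified sketch under stated assumptions: it ignores protocol overhead, CPU limits and port channels, and the function names and example link speeds are hypothetical, not measurements.

```python
def connection_count(nconnect: int, trunked_ips: int = 1) -> int:
    """TCP connections from one client: nconnect per IP, times the number of trunked IPs."""
    return nconnect * trunked_ips


def throughput_ceiling_gbps(client_link_gbps: float,
                            storage_port_gbps: float,
                            storage_ports_used: int) -> float:
    """Upper bound on throughput: the slower of the client link and the storage ports in use."""
    return min(client_link_gbps, storage_port_gbps * storage_ports_used)


# nconnect alone targets one IP, so in this model it is capped by a single storage port.
nconnect_only = throughput_ceiling_gbps(100, 25, storage_ports_used=1)

# Session trunking across two LIFs on two ports raises the cap in this model.
with_trunking = throughput_ceiling_gbps(100, 25, storage_ports_used=2)

print(connection_count(nconnect=4), nconnect_only)                 # 4 connections, 25 Gbit/s cap
print(connection_count(nconnect=4, trunked_ips=2), with_trunking)  # 8 connections, 50 Gbit/s cap
```

More connections only help until one of the links is saturated; beyond that point, only adding ports (through trunking or a port channel) raises the ceiling in this model.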