#NFS4.x Datastore could cause disruptions during failover/failback.

inland iris
#

We know that NFSv4.x datastores give us neither session trunking nor pNFS, so we gain nothing from converting NFSv3 datastores to NFSv4.x for VMware. On top of that, VMware doesn't support DRS with v4.

Based on discussions with multiple NetApp engineers, and on the Upgrade Advisor text quoted below, there could be VM outages or service disruptions during failover/failback. Everything points to possible disruptions.
"Using normal NFSv4.x recovery procedures. Applications might experience a temporary I/O delay during this process. You should direct users to end their sessions before you upgrade."

Meanwhile, the following two KBs indicate there shouldn't be disruptions. So we are receiving contradictory information on whether NFSv4.x datastores can cause service disruptions.
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMWARE_Virtual_Machines__went_offline_during_upgrade
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/VMware_NFSv4.1_datastores_see_disruption_during_failover_events_for_ONTAP_9

The reality is that we have already converted some datastores to NFSv4.x. So we face a dilemma: should we revert them or leave the NFSv4.x datastores as is before the next ONTAP upgrade?

I created this thread hoping to get a clear picture of whether NFSv4.x datastores allow non-disruptive failover/failback, which can happen in cases such as an ONTAP upgrade.

unborn pewter
#

I think you are looking for some "official" NetApp statement regarding NFSv4 and VMware datastores, so I will only recommend this post, which explains quite nicely what NFSv4 does differently and why there might be disruptions:
https://community.netapp.com/t5/Tech-ONTAP-Blogs/NFSv3-and-NFSv4-What-s-the-difference/ba-p/441316

quasi glacier
#

NFSv3 is tried and true. The problem isn't ONTAP but VMware.

#

VMware has been slow to adopt V4, and honestly I don't see a benefit.

#

Maybe there is one...

fading nymph
#

We have a huge VMware environment with over 40,000 VMs, all on different ONTAP clusters, and we have talked a lot with NetApp and VMware about v4. Our hope was that session trunking on ONTAP could help us. Unfortunately it is only possible at the node level, so there is no benefit for us. Also, VMware's support for and development of v4 is a bit underwhelming, sadly. Our next steps will be some PoCs with NVMe/TCP, in the hope that it will be the next step for VMware and storage.

quasi glacier
#

I don't think that exists through NFS.

inland iris
#

After reading the document forwarded by @unborn pewter, it is clear that NFSv4 can cause service interruptions during failover/failback or when the network is interrupted. I am hoping the two KBs I referred to in my original message could be added to what the document says, to make the topic more complete. I am also hoping the document could elaborate on what happens when NFSv4 is used for VMware datastores. That said, should we revert those NFSv4 datastores to NFSv3 or leave them as is? I know it is up to us, or up to management. But what would you do if you were the manager?

fading nymph
#

@inland iris when you look at the VMware KB https://kb.vmware.com/s/article/76136 the issue "should" be fixed in vSphere 7. I can only say that we, as a service provider, currently stay on NFSv3 because there you know what you have, and v4 has more disadvantages than advantages on VMware.

unborn pewter
#

What was the reason you moved some datastores to NFSv4? If none of its features are needed in your environment, I would revert to NFSv3.

inland iris
#

The reason we moved some datastores was that people here mistakenly thought NFSv4 datastores support "session trunking" and would therefore improve performance, which turned out not to be true. About 24 of our 380 NFS datastores are on v4 now. So the question is: should we leave them as is, or revert them to v3, and why?

fading nymph
#

I personally would revert them back to v3 so you have a clear standard

inland iris
#

So, the link that @unborn pewter sent was very helpful, and it indicates there could be NFS datastore interruptions during failover/failback or when networking is unstable. But, sorry for my persistence, how can we explain what the two KBs say? The two KBs I posted in my original message clearly indicate that the datastore disruption was fixed in a VMware patch, or in vCenter 7.0. So these two kinds of documents seem to contradict each other.

unborn pewter
#

My understanding of working with NFSv4 is:

  1. Make sure ESXi and ONTAP are on supported versions.
  2. Set the correct timeout values for your systems so that a storage failover will not break things.
    So for VMware workloads make sure you are using the ONTAP Tools for vSphere VM to set the suggested values for your ESXi-hosts (make sure to reboot the hosts afterwards): https://docs.netapp.com/us-en/ontap-tools-vmware-vsphere/configure/task_configure_esx_server_multipathing_and_timeout_settings.html
    You can also set the values manually: https://docs.netapp.com/us-en/ontap-tools-vmware-vsphere/configure/reference_esxi_host_values_set_by_vsc_for_vmware_vsphere.html#esxi-advanced-configuration

Then you simply need to test and see how your VMs respond during a failover. You might also need to adjust timeouts on your VMs: https://docs.netapp.com/us-en/ontap-tools-vmware-vsphere/configure/reference_configure_guest_operating_system_scripts.html
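The checklist above could be backed by a small script that compares a host's advanced settings with the values you intend to apply. This is only a sketch: the setting names and numbers below are illustrative placeholders, not values copied from the linked NetApp pages, and how the current values are collected from each host (for example from an export of its advanced settings) is left as an assumption.

```python
# Placeholder targets (assumption): replace names and values with the ones
# listed on the NetApp documentation pages linked above.
RECOMMENDED = {
    "NFS.HeartbeatFrequency": 12,
    "NFS.HeartbeatTimeout": 5,
    "NFS.HeartbeatMaxFailures": 10,
}


def find_drift(current, recommended=RECOMMENDED):
    """Return {setting: (current_value, recommended_value)} for each mismatch.

    `current` is a dict of setting name -> value as read from one host.
    A setting missing from `current` is reported with a current value of None.
    """
    drift = {}
    for name, want in recommended.items():
        have = current.get(name)
        if have != want:
            drift[name] = (have, want)
    return drift


if __name__ == "__main__":
    # Hypothetical values for a single host, for illustration only.
    host_settings = {"NFS.HeartbeatFrequency": 12, "NFS.HeartbeatTimeout": 10}
    for name, (have, want) in find_drift(host_settings).items():
        print(f"{name}: current={have} recommended={want}")
```

Running a check like this before a planned failover test shows which hosts still need the settings applied (and a reboot) before the results of the test mean anything.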

#

Or you could ping @vital current, he's the VMware guru at NetApp. Or simply create a support case to go the official way.

inland iris
#

If we have to do everything @unborn pewter suggested, then the KB didn't make that clear enough. Support also contradicts itself: some said there could be interruptions, others said no. The KB said the issue was resolved and is dated 12/14/2020. But we talked to a Sr. Engineer from Support in August 2022, and he said there could be interruptions. At that time, we didn't know about that KB.
The following is what Upgrade Advisor says under "Consideration for session-oriented protocols". It confuses me further. On one hand it says clients will automatically recover; on the other, applications might experience a temporary I/O delay. I know "clients" are different from "applications", but how can the two differ as far as interruptions are concerned?
• NFSv4.x: NFSv4.x clients will automatically recover from connection losses experienced during the upgrade using normal NFSv4.x recovery procedures. Applications might experience a temporary I/O delay during this process. You should direct users to end their sessions before you upgrade.
@vital current, based on what has been discussed, would you please comment?
So there are two questions here:
First, is there any benefit to using NFSv4 for VMware NFS datastores? There seems to be none. Second, could there be interruptions during failover/failback? The KB said no, it was a bug that has already been fixed, but "steiner" said yes, there could be interruptions.

severe shore
#

I’ll tag @late vault too - he’s a VMware guru as well

vital current
#

NFS v4.1 is expected to be perfectly suitable for any production datastore use as long as you follow the advice above from @unborn pewter . With NFS v4.1 there is a period of time where the NFS client reclaims its locks after the ONTAP node fails over.

In my opinion, however, unless I needed to use Kerberos auth I would hold off on using NFS v4.1 until session trunking was declared as fully supported with datastores. To me, if you don't need Kerb, there's not much more value add and the downsides associated with lock management overhead aren't worth it yet.

#

That was for @inland iris

inland iris
#

Thank you @vital current !

unborn pewter
#

fixed in 9.9.1P14, 9.10.1P11 and 9.12.1
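Assuming the post above lists the ONTAP releases that carry the fix, a quick way to check a cluster's version against that list might look like the sketch below. The version string format (`X.Y.Z` with an optional `Pn` patch suffix) and the handling of release families not named in the post (treated as unconfirmed, so `False`) are assumptions.

```python
import re

# Minimum patch level per release family, taken from the post above.
FIXED_IN_PATCH = {(9, 9, 1): 14, (9, 10, 1): 11}
# Assumption: 9.12.1 and every later release include the fix.
FIRST_FIXED_RELEASE = (9, 12, 1)


def parse_ontap_version(text):
    """Split a string such as '9.10.1P11' into ((9, 10, 1), 11)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:P(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized ONTAP version: {text!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    patch = int(match.group(4) or 0)
    return release, patch


def has_fix(version):
    """Return True only when the version is known, per the list, to be fixed."""
    release, patch = parse_ontap_version(version)
    if release >= FIRST_FIXED_RELEASE:
        return True
    min_patch = FIXED_IN_PATCH.get(release)
    # Release families not named in the post are treated as unconfirmed.
    return min_patch is not None and patch >= min_patch
```

For example, `has_fix("9.9.1P13")` returns `False` and `has_fix("9.10.1P11")` returns `True`; a release such as 9.11.1 returns `False` simply because the post does not mention it.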