#Please share your thoughts about implementing NFS trunking?


willow crow
#

We have a large NFS environment: almost 3,000 VMs (on NFS datastores), 600 NFS volumes (some CIFS), and 8 AFF/C nodes. We know ONTAP 9.14 has an NFS trunking feature that allows us to set up multipathing for NFS volumes/datastores. But it is restricted to LIFs on a single node and cannot span multiple nodes.

I read the online document “Manage NFS trunking” and some other posts as well, but I still have a lot of questions about the implementation. I am hoping NetApp will publish a best-practice document on it, but we have not seen one. Also, I cannot find anyone out there who has really implemented it.

Questions like: how will it handle failover/failback, and is it non-disruptive, given that it is a stateful protocol? Should we create a brand-new SVM for it, or continue to use the existing SVM, which requires service interruptions? We will have to add more LIFs, as opposed to one LIF per node, with each LIF corresponding to a client; with so many LIFs, how should we manage exports? Also, how should we manage failover groups, etc.? So the entire infrastructure looks different, and it does not look easy to do to me.

I’d like to know all the details, but the purpose of this post is not really to seek answers to all these questions. My questions really are:

• How much performance improvement should we expect?
• How many of you really have implemented NFS trunking?
• Should we implement it, and is it worth implementing at all?

Thank you in advance for your inputs!

weak granite
#

I would suggest using nconnect. With NCONNECT 8 you get almost double the performance out of NFS in some workloads (usually the gain is much smaller but still not negligible)

tawdry dove
#

I agree with @Darkstar and would be looking at NCONNECT first. VMware fully supports NCONNECT now with NFSv3 datastores, and you can change it online. I believe that requires ESXi 8.0U2 or U3.

There are limits to how many sessions per LIF so you'd need to take that into account so you don't hit the limits. But I'd hope you'd have different SVMs, or obviously different LIFs also, for your NFS datastores and your NFS Clients. I'd be more aggressive where you can control the connection counts, vs your 600 NFS clients where they could slam your LIFs.
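The connection budget mentioned above can be sketched as simple arithmetic. This is only a rough upper bound under stated assumptions: every host mounts every datastore behind a LIF, and each mount opens its own `nconnect` connections. The function names, the `ceiling` of 8, and the per-LIF limit used in the example are all hypothetical placeholders, not ONTAP figures; take the real limit from the documentation for your platform and release.

```python
def connections_per_lif(hosts: int, datastores_per_lif: int, nconnect: int) -> int:
    """Upper bound on TCP connections landing on one LIF.

    Assumes every host mounts every datastore behind the LIF and that
    each mount opens `nconnect` connections of its own.
    """
    return hosts * datastores_per_lif * nconnect


def max_nconnect(hosts: int, datastores_per_lif: int, lif_limit: int, ceiling: int = 8) -> int:
    """Largest nconnect value in 1..ceiling that stays within lif_limit.

    Returns 0 if even nconnect=1 would exceed the limit.
    `lif_limit` and `ceiling` are placeholders supplied by the caller.
    """
    for n in range(ceiling, 0, -1):
        if connections_per_lif(hosts, datastores_per_lif, n) <= lif_limit:
            return n
    return 0


# Example with made-up numbers: 64 hosts, 4 datastores per LIF,
# and a placeholder limit of 1000 connections per LIF.
print(max_nconnect(hosts=64, datastores_per_lif=4, lif_limit=1000))
```

Running the same calculation separately for the datastore LIFs and the general client LIFs shows where a higher nconnect value is affordable and where it is not.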

scenic pulsar
#

Regarding your question if it's worth it:

  • If you're already utilizing NFSv4.1 for your clients and/or VMware datastores then I would try it in another SVM since clients need to be remounted for session trunking (which I guess you are talking about since you're referring to ONTAP 9.14).
  • If you are using NFSv3 there is no concept of multipathing in the spec and I don't think it's worth the effort to switch to NFSv4.1, at least for VMware datastores.
    2-4x 100GbE ports per node and 2x NFS LIFs per port, plus 4-8x datastores per node (mounted via the different LIFs), should be more than enough to saturate the system's resources.
    If you have a VM with really high throughput requirements, yes, session trunking could help, but you could also simply put its disks on different datastores (each mounted via a different LIF).
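The layout described above (a few ports per node, two LIFs per port, datastores spread across the LIFs) can be sketched as a round-robin assignment. This is an illustration only: `plan_layout`, the port names, and the LIF and datastore naming are hypothetical, and nothing here talks to ONTAP or vCenter.

```python
from itertools import cycle


def plan_layout(ports, lifs_per_port, datastores):
    """Assign each datastore to one LIF, round-robin.

    LIFs are ordered so that consecutive datastores alternate between
    physical ports before reusing a port. Returns {datastore: lif_name}.
    """
    lifs = [
        f"{port}-lif{i + 1}"
        for i in range(lifs_per_port)
        for port in ports
    ]
    return {ds: lif for ds, lif in zip(datastores, cycle(lifs))}


# Example: 2 ports x 2 LIFs per port, 8 datastores on one node.
layout = plan_layout(["portA", "portB"], 2, [f"ds{n:02d}" for n in range(8)])
for ds, lif in layout.items():
    print(f"{ds} -> mount via {lif}")
```

Each LIF ends up serving the same number of datastores, so a busy VM can be spread across paths by placing its disks on datastores that map to different LIFs.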

As mentioned you could also use nconnect to utilize additional connections on the existing mounts. This works for both NFSv3 and NFSv4.1 (latter only since 8.0U3).
It's easy to implement and I don't know of any real drawbacks. Here's a good guide on how to configure it: https://community.netapp.com/t5/Tech-ONTAP-Blogs/Automate-nconnect-and-snapshot-offload-for-NFS-datastores-and-turn-your-VMware/ba-p/455598

weak granite
#

another option, if you're using Linux clients, is to use pNFS

tawdry dove
#

Does VMware support pNFS yet for NFS Datastores? I've been wanting that for years.

weak granite
#

nope. VMware doesn't support it and probably never will

white garden
#

I still say the best VMware NFS mount is v3. Version 4.1, even with trunking, still feels wonky, with severe limits.

Mount the datastores using NFSv3. Set the ONTAP parameters for optimizing NFS, then go to each host and set nconnect (start with 4 and increase as needed). Reboot each ESXi host for best effect.

willow crow
#

From all your responses, and also from what I’ve browsed on the Internet, frankly, I cannot find anybody who has really implemented NFS trunking on NetApp.

We are currently using NFSv3. Managers here are pushing us to get NFSv4 --> NFS trunking done. After reviewing NetApp’s document “Manage NFS trunking”, I raised the questions listed in my original post. There seem to be dramatic changes to the storage infrastructure. I don’t feel comfortable with them, plus there is the uncertainty about what outcome they would result in.

Most of you recommended that I use nconnect instead, but the managers have different views. I am sorry, but I need your help to focus on NFS trunking and explain why, technically, NFS trunking is not worth doing, as most of you concluded. I need your help on what issues or what exact limitations (critical?) it could pose to us, not just the conclusion. Overall, NFSv4 on NetApp was wonky before, but, as indicated by a lot of NetApp KBs, that is no longer the case now.

scenic pulsar
#

What are they trying to solve with NFS session trunking?
You need to remount every export if you switch from v3 to 4.1.
Compare that time and effort (moving VMs away, remounting, moving back, etc.) and the additional space needs (moving VMs from volA to volB means an increase in snapshot delta) vs. running the automated script from above to enable nconnect on all existing datastores without any downtime, remounting, moving stuff from A to B, etc., and possibly gaining a comparable throughput increase.

weak granite
# willow crow From all your responses, and also from Internet I’ve browsed, frankly, I cannot ...

simple answer: session trunking will cause trouble and problems. Almost every customer of ours who has ever tried it got bitten by it in the past, especially with ESX (it reportedly works a bit better for Linux clients). If "management" insists, I would get written confirmation that they want to do it against your advice and then implement it. When everything breaks you still have that E-Mail 🙂 From my experience just asking for such a written confirmation is usually enough to get the issue resolved quickly

willow crow
#

We know there will be quite some work to do, but if it is worth doing, then we should do it. I was told about, and also found in the NetApp doc, the advantage that we can gain from NFS trunking:
NFSv4.1 clients can take advantage of session trunking to open multiple connections to different LIFs on the NFS server, thereby increasing the speed of data transfer and providing resiliency through multipathing.

I am sorry for my persistence. We need to know why, technically, NFS trunking is not worth doing. And how much performance improvement, or what issues, could it really bring us?

white garden
#

Honestly, if you are looking at v4 for “performance”, then you should really look at using NVMe/TCP. YES, it’s block-based, but the performance should be better, with lower latency.

In my opinion, NFSv4.1 should never be used without full-on testing, including node failover testing, network failure testing, ESXi host failure testing, and more. I’ve seen enough issues that I would never really trust it when v3 works just fine.

scenic pulsar
# willow crow We know there will be quite some work to do, but, if it is worthy of doing it, t...

Ultimately it's up to you, but why are you / your managers saying that switching to session trunking "is worth it" (whatever that means) but switching to nconnect isn't? Again my question: What are they trying to solve with NFS session trunking? If you (may) get the same performance improvements with nconnect without all the effort of switching to session trunking, why not pursue it? You mentioned you have 600 NFS volumes; do you really want to remount all of them and test everything in detail (like TMAC mentioned)?

Why not simply create two new SVMs, one for nconnect with NFSv3 & one for session trunking NFSv4.1 and then compare for yourself? You can't really break anything with a new SVM.
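For a first side-by-side look at two such test SVMs from a Linux client, a rough sequential-write probe like the one below could be pointed at each mount. It is a minimal sketch, not a benchmark: the function name, file name, default sizes, and the mount paths in the usage comment are hypothetical, and a single-stream write says nothing about failover behaviour or mixed VM workloads. For ESXi datastores the equivalent would be a proper load test run inside test VMs on each datastore.

```python
import os
import time


def write_throughput_mib_s(directory: str, size_mib: int = 256, block_kib: int = 1024) -> float:
    """Write size_mib of random data into `directory`, fsync, and return MiB/s.

    The probe file is removed afterwards. Timing includes the final fsync
    so that client-side caching does not hide the transfer time.
    """
    block = os.urandom(block_kib * 1024)
    blocks = size_mib * 1024 // block_kib
    path = os.path.join(directory, "throughput_probe.bin")

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start

    os.remove(path)
    return size_mib / elapsed


# Hypothetical usage, with each test export mounted on the client:
#   write_throughput_mib_s("/mnt/svm_nconnect_v3")
#   write_throughput_mib_s("/mnt/svm_trunking_v41")
```

Repeating each run several times and comparing medians gives a fairer picture than a single measurement.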

white garden
#

No, but you can break the ESXi host. When v4.1 fails, it usually does so spectacularly! I see the ESXi host just plain hang, and the only fix is to go to the out-of-band management (CIMC, iDRAC, etc.) and power-cycle the host. Sometimes multiple hosts. Not pretty at all.

weak granite
white garden
#

Like I said before... in my opinion, it is just not worth the headaches that will (not maybe... will) come.

willow crow
#

We knew about some weird issues with v4 before. However, you can find from the KBs and also from the community that those issues were all fixed in newer ONTAP releases. Our manager firmly believes those issues are no longer the case.
@white garden @scenic pulsar @weak granite Do you have bug numbers for the issues you just described? Personally, I agree with a lot of the things you said. But it's not up to me.