# Will 30 seconds be long enough to break NFS connections?


vague thunder
#

We have maintenance scheduled on multiple routers. I don’t know the details, but I was told the interruption will be less than 30 seconds. We have a lot of NFS datastores and volumes, plus some CIFS. I know it largely depends on whether the applications can tolerate an interruption of up to 30 seconds. But in general, based on your experience, would that cause issues for most applications?

next cradle
#

as for applications I cannot really tell, but I have seen VMs recover easily after even longer NFS outages like minutes or even hours... They just freeze and then when NFS returns they go on... Obviously anyone trying to actually do anything with those VMs will be out of luck 🙂

jovial mural
#

It’s usually based on TCP timeouts. Generally as long as the connection is reestablished in less than 8-9 minutes, NFS will usually continue. Some apps are more sensitive so that is a general statement. Poorly written apps may not handle it well

fossil seal
#

We recently tested this with a couple of applications. All worked fine except IBM's MQ.

jovial mural
#

LOL... usually when this is observed, you know you have a problem and rush to fix it. VMware generally starts behaving again if NFS is restored within that 5-7 minute window. After that, the datastore is basically dead and a host reboot is usually needed. Now, if that VM is running a database over an NFS datastore (which you should not be doing; it should be a guest-attached iSCSI LUN instead), I have seen very bad things happen, to the point where the database is corrupted beyond repair and needs to be restored from a backup. And that only takes a few seconds or so!

I also had a customer using Citrix. The crazy-heads over there (at Citrix) somehow think it is a brilliant idea to use soft mounts and then knock the NFS timeout settings down to 10 seconds. Well, the customer had cooling fail, and the NetApp decided to save itself and do a proper shutdown... before the Citrix servers with soft mounts. Well, when things came back online, ALL the VMs were corrupted (due to the soft mounts and 10-second timeout settings). You can even read the man page under Citrix and it says soft mounts are a bad idea. They put the "usability" of the nodes ahead of data integrity. Just awful.
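
For a rough sense of why a 10-second soft-mount timeout is so fragile: the Linux nfs(5) man page gives `timeo` in tenths of a second, describes linear backoff for NFS over TCP (each retransmission waits `timeo` longer than the last, capped at 600 seconds), and says a soft mount fails the request after `retrans` retransmissions. The sketch below is only a literal reading of that description, not a model of any particular kernel. The function name and the `retrans=2` value are assumptions, since the retry setting in that Citrix setup was not given.

```python
def retransmit_times(timeo_ds, retrans, cap_s=600.0):
    """Seconds after the first send at which each retransmission goes out.

    timeo_ds is the `timeo` mount option in deciseconds.
    Assumes linear backoff: the k-th wait is k * timeo, capped at cap_s.
    """
    timeo_s = timeo_ds / 10.0
    times = []
    elapsed = 0.0
    for k in range(1, retrans + 1):
        elapsed += min(k * timeo_s, cap_s)
        times.append(elapsed)
    return times


# timeo=100 is 10 seconds; retrans=2 is an assumed value, not from the thread.
print(retransmit_times(100, 2))  # [10.0, 30.0]
```

Under those assumptions the last retransmission goes out roughly 30 seconds in, so an outage lasting minutes would outlast the retries and the soft mount would hand an I/O error to the application instead of waiting.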

low pulsar
#

could have been worse. They could have been using VMware vSAN, hehe

jovial mural
#

well...they had to recover all the VMs, so that was pretty bad.

vague thunder
#

It sounds like VMware datastores can survive a fairly long outage.

For Oracle, for instance, a drop of up to 30 seconds shouldn't cause anything, based on the link below, correct?
The default timeout values are: 300 seconds for TCP listeners and 60 seconds for HTTP listeners. The maximum timeout value is 7200 seconds.
https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/lbaas/connection-timeout-settings.htm

jovial mural
#

Honestly, I believe it depends on the application and how it was programmed to handle this scenario

next cradle
#

yeah, it's basically freezing time and then suddenly it's an hour later for your application... Similar to suspend/resume on a desktop PC. Some applications handle it well, others... not so much

vague thunder
#

What about SnapMirror? Would that break the SnapMirror transfers or the relationship? I'm not sure how long a loss it can sustain.

next cradle
#

SnapMirror does not use NFS. In general, SnapMirror aborts pretty quickly but it will re-connect and restart the transfer during the next update schedule

vague thunder
#

SnapMirror does not use NFS.
Correct.
They only warned us about the primary site. You probably don't know our design, but I am going to ask you anyway. In a typical design, would the routers between the primary and DR sites be included in this maintenance?

fossil seal
#

vague thunder: "How did you do such a test? How long were the connections lost?"

  1. We removed some NFS clients from the export policy for 45, 60, and 90 seconds to see how the applications behaved. Some froze, but they came back when the NFS client was re-added to the export policy, and the mounted filesystem was immediately available.
  2. We had a controller panic during an ONTAP upgrade due to a bug. The aggregate and volumes were offline for around 25 minutes while the core file was written to disk. This was much more disruptive; however, not many applications using NFS complained.
  3. The network team was asked to block port 2049 for some applications using NFS for 45, 60, and 90 seconds.
  4. The network team keeps rebooting switches and firewalls when performing firmware upgrades.
  5. IBM MQ has a timeout set to 20 seconds. If the storage isn't available, it initiates failover to the secondary site. During ONTAP upgrades, we specifically ask them to fail over to the secondary site as a planned activity.
  6. TIBCO uses soft mounts with a timeout of 30 seconds and a retrans value of 3. We tested disruptions of 45 and 60 seconds and it worked fine, as the application is set up with a secondary EMS server.
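
If you want to measure how long a client actually loses the server during a test like the port 2049 block in item 3, a small probe can help. This is a minimal sketch, assuming the NFS server answers on TCP port 2049; the function names, the one-second interval, and the host name in the usage note are hypothetical. It only checks that a TCP connection can be opened, which is not the same as the mount being usable.

```python
import socket
import time


def port_open(host, port=2049, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Longest outage in seconds.

    samples is a list of (timestamp, reachable) pairs in time order.
    An outage runs from the first failed probe to the next successful
    one, or to the last sample if the server never comes back.
    """
    longest = 0.0
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        longest = max(longest, last_ts - start)
    return longest


def watch(host, duration=120.0, interval=1.0):
    """Probe host every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), port_open(host)))
        time.sleep(interval)
    return samples
```

For example, `longest_outage(watch("filer.example.com", duration=180.0))` run during the maintenance window would report the longest stretch of failed probes, which you can then compare against the 20-second and 30-second application timeouts mentioned above.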