We have maintenance coming up on multiple routers. I don't know the details, but I was told there will be less than 30 seconds of interruption. We have a lot of NFS datastores and volumes, and also some CIFS. I know it will largely depend on the applications and whether they can tolerate an interruption of up to 30 seconds. But in general, based on your experience, would that cause issues for most applications?
#Will 30 seconds be long enough to break NFS connections?
as for applications I cannot really tell, but I have seen VMs recover easily after even longer NFS outages like minutes or even hours... They just freeze and then when NFS returns they go on... Obviously anyone trying to actually do anything with those VMs will be out of luck 🙂
It’s usually based on TCP timeouts. Generally as long as the connection is reestablished in less than 8-9 minutes, NFS will usually continue. Some apps are more sensitive so that is a general statement. Poorly written apps may not handle it well
We recently tested this with a couple of applications. All worked fine except IBM's MQ.
Are the TCP timeouts you are talking about something that can be set on the storage, or at the application level?
How did you run that test? How long were the connections lost?
LOL...usually when this is observed, you know you have a problem and rush to fix it. VMware generally starts behaving again if NFS is restored in that 5-7 minute window. After that, the datastore is basically dead and a host reboot is usually needed. Now, if that VM is running a database over an NFS datastore (which you should not be doing, it should be a guest-attached iSCSI LUN instead), I have seen very bad things happen, to the point that the database is corrupted beyond repair and needs to be restored from a backup. And that only takes a few seconds or so!
I also had a customer using Citrix. The crazy-heads over there (at Citrix) somehow think it is a brilliant idea to use soft mounts and then knock the NFS timeout settings down to 10 seconds. Well, the customer had cooling fail, and the NetApp decided to save itself and do a proper shutdown...before the Citrix servers with soft mounts. Well, when things came back online, ALL the VMs were corrupted (due to the soft mounts and 10-second timeout settings). You can even read the man page under Citrix and it says soft mounts are a bad idea. They put the "usability" of the nodes ahead of data integrity. Just awful
could have been worse. They could have been using VMware vSAN, hehe
well...they had to recover all the VMs, so that was pretty bad.
It sounds like VMware datastores can survive fairly long outages.
For Oracle, for instance, a drop of at most 30 seconds shouldn't cause any problems, based on the link below, correct?
The default timeout values are: 300 seconds for TCP listeners and 60 seconds for HTTP listeners. The maximum timeout value is 7200 seconds.
https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/lbaas/connection-timeout-settings.htm
On Compute Cloud@Customer, you can configure load balancer listeners to control the maximum idle time allowed during each TCP connection or HTTP request and response pair.
Honestly, I believe it depends on the application and how it was programmed to handle this scenario
yeah, it's basically freezing time and then suddenly it's an hour later for your application... Similar to suspend/resume on a desktop PC. Some applications handle it well, others... not so much
What about SnapMirror? Would that break the mirroring or the relationship? I'm not sure how long a loss it can sustain.
SnapMirror does not use NFS. In general, SnapMirror aborts pretty quickly but it will re-connect and restart the transfer during the next update schedule
SnapMirror does not use NFS.
Correct.
They only warned us about the primary site. You probably don't know our design, but I am going to ask you anyway. In a typical design, would the routers between the primary and DR sites be included in this maintenance?
sorry, I don't know if your routers will be included in the maintenance, if that's what you're asking?
Again, SnapMirror will break after a few seconds if it loses TCP connection, but the destination system will re-start the transfer roughly where it left off (restart checkpoint)
- We removed some NFS clients from the export policy for 45, 60, and 90 seconds to see how the applications behaved. Some froze, but they came back when the NFS client was re-added to the export policy, and the mounted filesystem was immediately available.
- We had a controller panic during an ONTAP upgrade due to a bug. The aggregate and volumes were offline for around 25 minutes while the core file was written to disk. This was much more disruptive, yet not many applications using NFS complained.
- The network team was asked to block port 2049 for some applications using NFS for 45, 60, and 90 seconds.
- The network team keeps rebooting switches and firewalls when performing firmware upgrades.
- IBM MQ has a timeout set to 20 seconds. If the storage isn't available, it initiates failover to the secondary site. During ONTAP upgrades, we specifically ask them to fail over to the secondary site as a planned activity.
- TIBCO uses soft mounts with a timeout of 30 seconds and a retrans value of 3. We tested disruptions of 45 and 60 seconds and it worked fine, as the application is set up with a secondary EMS server.
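
If you want to watch tests like these from the client side, a small probe can help. The following is only a rough sketch, not something anyone in this thread actually ran: it writes and fsyncs a tiny file on the mount in a loop and records how long each call takes. The function names and the `.nfs_probe` file name are made up for illustration. The assumption is that on a hard mount the call simply blocks for the length of the outage, while on a soft mount it may come back with an error instead.

```python
import os
import tempfile
import time


def probe_once(directory):
    """Write and fsync a small file in `directory`; return (seconds, ok)."""
    path = os.path.join(directory, ".nfs_probe")  # hypothetical file name
    start = time.monotonic()
    try:
        with open(path, "w") as handle:
            handle.write(str(time.time()))
            handle.flush()
            os.fsync(handle.fileno())  # push the write out to the server
        ok = True
    except OSError:
        ok = False  # assumption: a soft mount surfaces the outage as an I/O error
    return time.monotonic() - start, ok


def summarize(samples, stall_threshold=1.0):
    """Return (longest_call_seconds, stalled_calls, failed_calls)."""
    longest = max((seconds for seconds, _ in samples), default=0.0)
    stalled = sum(1 for seconds, _ in samples if seconds >= stall_threshold)
    failed = sum(1 for _, ok in samples if not ok)
    return longest, stalled, failed


def run_probe(directory, count=5, interval=0.0):
    """Probe `directory` `count` times, sleeping `interval` seconds between calls."""
    samples = []
    for _ in range(count):
        samples.append(probe_once(directory))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Point this at a directory on the NFS mount under test. A local
    # temporary directory is used here only so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as target:
        print(summarize(run_probe(target, count=5, interval=0.2)))
```

Left running across the maintenance window, the longest call time gives a rough idea of how long the mount was frozen as seen by that client, and the failed count shows whether any errors actually reached the application layer.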