#NFS server not responding


abstract jackal
#

Hi everyone! I'm facing an intermittent issue after upgrading our OpenShift Container Platform (OCP) to version 4.13. We're seeing frequent error messages related to our NFS server, like this one:
nfs: server xxx.xx.xx.x not responding, still trying
Has anyone faced similar issues after upgrading to OCP 4.13?


crisp shell
#

This means the worker node cannot set up a TCP connection to ONTAP for NFS. A few possibilities: a) the IP address changed and a firewall or client match rule prevents access, b) the NFS protocol version changed (e.g. from NFSv3 to NFSv4) and the SVM is not configured correctly for that version, or c) security/SELinux policies or something similar on the worker nodes are preventing the NFS connection.
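A quick way to test the "cannot set up a TCP connection" theory from a worker node is a plain TCP probe. The following is a minimal sketch, not a Trident or ONTAP tool: the function names `can_connect` and `probe_nfs` are made up for illustration, and it assumes the standard ports (2049 for NFS, 111 for rpcbind, which NFSv3 mounts also need).

```python
import socket


def can_connect(host: str, port: int = 2049, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False


def probe_nfs(host: str, ports=(2049, 111), timeout: float = 3.0) -> dict:
    """Probe each port once and return {port: reachable}.

    2049 is the NFS port; 111 is rpcbind (relevant for NFSv3).
    """
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling `probe_nfs("<data LIF address>")` from an affected node returns something like `{2049: True, 111: True}` when both ports are reachable. Since the problem is intermittent, running it in a loop and logging the failures with timestamps is more useful than a single run.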

edgy lynx
#

And as silly as it sounds, please make sure your vserver aggr-list has the correct aggregates listed (vserver show -fields aggr-list). I've seen that manifest in very different ways.

It cannot/should not be an empty list!

abstract jackal
#

The issue is not consistent: roughly 4 times out of 5 it works fine.

crisp shell
#

Hm, that sounds strange. Two things can cause that, in my experience: a) NFS storpool exhaustion. Are you getting any EMS messages about NFS storpool exhaustion on your NetApp? Or b) a duplicate IP address (those should also be logged in ONTAP's EMS system, at least if they're in the same layer 2 broadcast domain).

abstract jackal
#

Thanks @crisp shell @edgy lynx
Since we upgraded OCP to 4.13, a few of the Trident pods have been restarting occasionally.
Here is what I got from the logs:
level=fatal msg="Unable to start the K8S hybrid controller frontend." error="could not initialize Kubernetes client; couldn't retrieve API server's version"
After a restart it works fine for a few hours, though.
During init container execution we also see the errors below:
nfs: server xxx.xx.xx.x not responding, still trying
INFO: task python3.11 blocked for more than 122 seconds.
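
To see how often these two kernel messages occur and which server and process they involve, the node's kernel log can be summarized with a small script. This is only a sketch: `summarize` is a hypothetical helper, the patterns are modeled on the two lines quoted above, and it assumes the log text comes from something like `dmesg` or `journalctl -k` on the worker node.

```python
import re
from collections import Counter

# Patterns modeled on the two messages quoted in this thread.
NFS_TIMEOUT = re.compile(r"nfs: server (\S+) not responding")
# The task name may be followed by ":<pid>" in real kernel logs.
HUNG_TASK = re.compile(r"INFO: task ([^\s:]+)(?::\d+)? blocked for more than (\d+) seconds")


def summarize(lines):
    """Count NFS timeout messages per server and hung-task messages per task name."""
    servers = Counter()
    tasks = Counter()
    for line in lines:
        m = NFS_TIMEOUT.search(line)
        if m:
            servers[m.group(1)] += 1
        m = HUNG_TASK.search(line)
        if m:
            tasks[m.group(1)] += 1
    return servers, tasks
```

Comparing the timestamps of the matching lines with the Trident pod restarts should show whether the restarts and the NFS timeouts happen in the same window.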

edgy lynx
#

Are you using certificates or credentials? Are K8s and ONTAP on the same subnet? Is there possibly a firewall in the way?
How soon do you notice the issue? You could kick off a network trace from ONTAP, with options to keep only the last x files at a size of y MB each. If you are able to stop the trace once the issue has occurred, you will have something to open a case with.

#

Anything in the event log on the NetApp?

wheat nimbus
#

Could be a perf issue too.