#Openstack + NetApp NFS disconnects
The error we see from the KVM instance:
```
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#73 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#73 Sense Key : Aborted Command [current]
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#73 Add. Sense: I/O process terminated
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#73 CDB: Read(10) 28 00 00 00 00 03 00 00 01 00
[Tue Jul 11 12:02:17 2023] blk_update_request: I/O error, dev sdb, sector 3 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[Tue Jul 11 12:02:17 2023] XFS (sdb): metadata I/O error in "xfs_alloc_read_agfl+0x7c/0xc0 [xfs]" at daddr 0x3 len 1 error 5
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#80 Sense Key : Aborted Command [current]
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#80 Add. Sense: I/O process terminated
[Tue Jul 11 12:02:17 2023] sd 0:0:0:4: [sdb] tag#80 CDB: Write(10) 2a 00 12 c0 01 87 00 00 02 00
[Tue Jul 11 12:02:17 2023] blk_update_request: I/O error, dev sdb, sector 314573191 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 0
[Tue Jul 11 12:02:17 2023] blk_update_request: I/O error, dev sdb, sector 314573191 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 0
[Tue Jul 11 12:02:17 2023] XFS (sdb): log I/O error -5
[Tue Jul 11 12:02:17 2023] XFS (sdb): xfs_do_force_shutdown(0x2) called from line 1274 of file fs/xfs/xfs_log.c. Return address = 000000001d91854b
[Tue Jul 11 12:02:17 2023] XFS (sdb): Log I/O Error Detected. Shutting down filesystem
[Tue Jul 11 12:02:17 2023] XFS (sdb): Please unmount the filesystem and rectify the problem(s)
```
Error from libvirt:
```
libvirtd: 2023-07-11 12:03:00.002+0000: 4452: error : virNetSocketReadWire:1806 : End of file while reading data: Input/output error
```
Error from Qemu:
```
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: INFO: task qemu-kvm:501245 blocked for more than 120 seconds.
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: qemu-kvm D ffff8adde33d1080 0 501245 1 0x000001a0
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: Call Trace:
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb9d8c3f9>] schedule+0x29/0x70
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb986bbc9>] inode_dio_wait+0xd9/0x100
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb96c7140>] ? wake_bit_function+0x40/0x40
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffc08c9e56>] nfs_getattr+0x1b6/0x250 [nfs]
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb9853d59>] vfs_getattr+0x49/0x80
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb9853dc3>] vfs_fstat+0x33/0x60
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb9854324>] SYSC_newfstat+0x24/0x60
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb973ea04>] ? __audit_syscall_entry+0xb4/0x110
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb98546fe>] SyS_newfstat+0xe/0x10
Jul 11 03:38:47 svlstage2-cfc-b-nova01-001 kernel: [<ffffffffb9d9a226>] tracesys+0xa6/0xcc
```
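A side note on reading the guest-side SCSI errors above: the `CDB:` bytes can be decoded to confirm which block each failed command targeted. The sketch below is only an illustration, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); the function name `decode_rw10` is made up for this example.

```python
def decode_rw10(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"unsupported opcode 0x{cdb[0]:02x}")
    return {
        "op": opcodes[cdb[0]],
        # Bytes 2-5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7-8: transfer length in logical blocks, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDBs copied from the tag#73 and tag#80 log lines.
print(decode_rw10("28 00 00 00 00 03 00 00 01 00"))
print(decode_rw10("2a 00 12 c0 01 87 00 00 02 00"))
```

The decoded LBAs (3 and 314573191) match the `sector` values in the `blk_update_request` lines, so both a read and a write were aborted at the same timestamp. That is consistent with the whole backing device stalling rather than one bad region on disk.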
But when you check, is everything OK? Or does the OS show the NFS path as inaccessible? Is the NFS traffic routed or local?
@rough remnant NFS traffic is routed, and the underlying hypervisor sometimes logs "NFS not responding" messages in the kernel log. What would I need to look for on the NetApp side to make sure that the NetApp interfaces themselves are accessible? Which logs would help determine whether it's a NetApp issue or a network issue?
@rough remnant We were able to replicate this: blocking NFS traffic from a test hypervisor with a test VM on it reproduces the same log lines pasted above after 60 seconds of the traffic being blocked, which corresponds to timeo=600 (timeo is in tenths of a second, so 60 seconds).
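To make the timeo arithmetic concrete: the Linux NFS client expresses `timeo` in tenths of a second, so the value can be read straight out of a mount option string. This is a minimal sketch, not a tool from the thread; the helper names are hypothetical and the option string is the one from the mount output posted in this thread, with the elided address fields left out.

```python
def parse_mount_options(opts: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flag-style options without a value (e.g. "hard") map to True.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def timeo_seconds(opts: str) -> float:
    """Return the initial request timeout in seconds (timeo is in deciseconds)."""
    return int(parse_mount_options(opts)["timeo"]) / 10


MOUNT_OPTS = (
    "rw,relatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,hard,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none"
)

print(timeo_seconds(MOUNT_OPTS))                # 60.0
print(parse_mount_options(MOUNT_OPTS)["hard"])  # True
```

Because the mount is `hard`, the client keeps retrying rather than returning an error to the caller, which would fit with the qemu-kvm process sitting in uninterruptible sleep in the hung-task trace.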
Do you have a firewall or any other middlebox between the server and the client? I've seen similar issues with middleboxes that did TCP sequence randomization. Also, double-check the MTU size everywhere; an MTU mismatch can also lead to dropped packets, which can manifest in this way. Other than that, without seeing your mount options and the NFS options on the SVM, it's hard to troubleshoot. Did you try with NFSv3? Do you have NFS delegations enabled? Do you see any entries in the ONTAP event log relating to the IP address of that client? etc.
@umbral monolith No, there is no firewall configured, and the MTU is set to 9000 throughout. Please find the outputs that were requested:
```
lif3:/premium_vol3 on /var/lib/nova/mnt/ed4bd7505fa7defea2d45f4cc3600183 type nfs4 (rw,relatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=x.x.x.x)
```
MTU ping from the client to the NFS server:
```
ping -c 10 -s 8992 10.123.189.31
PING 10.123.189.31 (10.123.189.31) 8992(9020) bytes of data.
9000 bytes from 10.123.189.31: icmp_seq=1 ttl=59 time=0.317 ms
9000 bytes from 10.123.189.31: icmp_seq=2 ttl=59 time=0.317 ms
9000 bytes from 10.123.189.31: icmp_seq=3 ttl=59 time=0.315 ms
9000 bytes from 10.123.189.31: icmp_seq=4 ttl=59 time=0.327 ms
9000 bytes from 10.123.189.31: icmp_seq=5 ttl=59 time=0.930 ms
9000 bytes from 10.123.189.31: icmp_seq=6 ttl=59 time=0.325 ms
9000 bytes from 10.123.189.31: icmp_seq=7 ttl=59 time=0.316 ms
9000 bytes from 10.123.189.31: icmp_seq=8 ttl=59 time=0.313 ms
9000 bytes from 10.123.189.31: icmp_seq=9 ttl=59 time=0.314 ms
9000 bytes from 10.123.189.31: icmp_seq=10 ttl=59 time=0.305 ms
--- 10.123.189.31 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 0.305/0.377/0.930/0.185 ms
```
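One caveat on the ping test above: the `8992(9020)` in its first output line means each echo request was a 9020-byte IP packet, which is larger than a 9000-byte MTU. Without the don't-fragment bit set, such a packet can be fragmented and still get a reply, so this result alone may not prove that 9000-byte frames pass end to end. The sketch below works through the sizes, assuming IPv4 with no IP options and the Linux iputils `ping`, where `-M do` prohibits fragmentation; the helper names are illustrative.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping -s value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ip_packet_size(payload: int) -> int:
    """Total IPv4 packet size produced by ping -s <payload>."""
    return payload + IPV4_HEADER + ICMP_HEADER


print(max_icmp_payload(9000))  # 8972
print(ip_packet_size(8992))    # 9020, matching the "8992(9020)" in the output above

# Suggested re-test with fragmentation prohibited (Linux iputils syntax):
print(f"ping -M do -c 10 -s {max_icmp_payload(9000)} 10.123.189.31")
```

If the `-M do` variant with an 8972-byte payload succeeds, every hop on the routed path accepts 9000-byte packets; if it fails while the original test passes, some hop has a smaller MTU.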
Vserver NFS show
Vserver: VSM
General NFS Access: true
NFS v3: enabled
NFS v4.0: enabled
UDP Protocol: enabled
TCP Protocol: enabled
Default Windows User: -
NFSv4.0 ACL Support: disabled
NFSv4.0 Read Delegation Support: disabled
NFSv4.0 Write Delegation Support: disabled
NFSv4 ID Mapping Domain: v4iddomain.com
NFSv4 Grace Timeout Value (in secs): 45
Preserves and Modifies NFSv4 ACL (and NTFS File Permissions in Unified Security Style): enabled
NFSv4.1 Minor Version Support: enabled
Rquota Enable: disabled
NFSv4.1 Parallel NFS Support: disabled
NFSv4.1 ACL Support: disabled
NFS vStorage Support: disabled
NFSv4 Support for Numeric Owner IDs: enabled
Default Windows Group: -
NFSv4.1 Read Delegation Support: disabled
NFSv4.1 Write Delegation Support: disabled
NFS Mount Root Only: enabled
NFS Root Only: disabled
Permitted Kerberos Encryption Types: des, des3, aes-128, aes-256
Showmount Enabled: enabled
VSM continuation
Set the Protocol Used for Name Services Lookups for Exports: udp
NFSv3 MS-DOS Client Support: disabled
Idle Connection Timeout Value (in seconds): 360
Are Idle NFS Connections Supported: disabled
Hide Snapshot Directory under NFSv3 Mount Point: disabled
Provide Root Path as Showmount State: disabled
RDMA Protocol: enabled
That doesn't look so bad actually. Nothing that's "obviously" wrong I would say.
Any latency in ONTAP?
If you don't have any historical info, we can pull a perf archive. We can also get the counter manager version of netstat: https://kb.netapp.com/onprem/ontap/da/NAS/How_to_use_netstat_to_troubleshoot_network_problems_in_ONTAP_9.5_or_newer
@rapid forge Is this something that is available for NetApp personnel to view from the cluster, or would I have to generate a perf archive manually?
Is there a way from the NetApp nodes/cluster to track any intermittent network drops and point to something obvious in the northbound network?