#ONTAP 9.13.1P1 - ifgrp Ports reported as "degraded l2_reachability" but everything works fine / BUG


lusty finch
#

Has anyone else experienced this issue?

F***-****-01::> port show -health
                                  Health
Node               Port      Link Status   Degraded Reasons
------------------ --------- ---- -------- ----------------
F**-C****-01-01
                   a0a       up   degraded l2_reachability
                   a0a-1550  up   healthy  -
                   a0a-2505  up   healthy  -
                   a0a-3531  up   healthy  -
                   e0M       up   healthy  -
                   e0a       up   healthy  -
                   e0b       up   healthy  -
                   e0c       up   degraded l2_reachability
                   e0d       up   degraded l2_reachability
                   e0e       up   degraded l2_reachability
                   e0f       up   degraded l2_reachability
F***-******-01-02
                   a0a       up   degraded l2_reachability
                   a0a-1550  up   healthy  -
                   a0a-2505  up   healthy  -
                   a0a-3531  up   healthy  -
                   e0M       up   healthy  -
                   e0a       up   healthy  -
                   e0b       up   healthy  -
                   e0c       up   degraded l2_reachability
                   e0d       up   degraded l2_reachability
                   e0e       up   degraded l2_reachability
                   e0f       up   degraded l2_reachability
22 entries were displayed.

sharp hatch
#

This probably means that ARP pings don't get through. ONTAP periodically checks that it can reach all ports in the same broadcast domain through l2ping. If it gets no reply, the port is marked as degraded. However, if you have VLANs on top, usually you put all your LIFs into those VLANs, and those will then work fine.

#

so your switches are basically blocking any communication on the "native" VLANs on those ports
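
A rough way to check for the pattern described above (base port degraded for `l2_reachability` while every VLAN port on top of it is healthy) is to parse the `port show -health` text. This is only a sketch: the sample text, the node name `node-01`, and the function names are made up for illustration, and it assumes the whitespace-separated layout shown in the first post.

```python
# Hypothetical sample in the same layout as the "port show -health" output above.
SAMPLE = """\
node-01
a0a up degraded l2_reachability
a0a-1550 up healthy -
a0a-2505 up healthy -
e0M up healthy -
e0c up degraded l2_reachability
"""

def parse_port_health(text):
    """Return (node, port, link, status, reasons) tuples from the CLI text."""
    rows, node = [], None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:
            # A single token is treated as a node name. A lone "Health" header
            # also matches here but is overwritten by the next node line.
            node = fields[0]
        elif len(fields) >= 4 and fields[1] in ("up", "down"):
            port, link, status = fields[:3]
            rows.append((node, port, link, status, fields[3]))
    return rows

def untagged_only_degraded(rows):
    """Base ports degraded for l2_reachability whose VLAN ports are all healthy."""
    result = []
    for node, port, _link, status, reasons in rows:
        if status != "degraded" or "l2_reachability" not in reasons:
            continue
        vlans = [r for r in rows if r[0] == node and r[1].startswith(port + "-")]
        if vlans and all(r[3] == "healthy" for r in vlans):
            result.append((node, port))
    return result

if __name__ == "__main__":
    for node, port in untagged_only_degraded(parse_port_health(SAMPLE)):
        print(f"{node}:{port} is degraded only on the untagged (native) VLAN")
```

Ports without any VLAN ports on top (such as `e0c` in the sample) are deliberately not reported, since the output alone does not show whether they carry traffic.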

balmy hinge
#

Are you doing any STIGs on the switches? I’ve seen some places turn off GARP on VLANs on the cores. This prevents ONTAP failover from working: when a data LIF fails over to another node, it sends a gratuitous ARP (GARP). If a STIG prohibiting that is in place, the clients will never see the new location until the LIF is reverted to its original location.

#

As soon as we took that line out of the VLAN config on the switches, lifs responded correctly

#

This was before ONTAP 9.8 so we didn’t see the message
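
For anyone unsure what a gratuitous ARP looks like on the wire, here is a minimal sketch of the generic frame layout: an ARP request sent to the Ethernet broadcast address in which the sender IP and target IP are the same. The function name and the example MAC/IP are made up, and this is not taken from ONTAP; the thread does not say which exact GARP variant ONTAP emits.

```python
import socket
import struct

def build_garp(mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP frame (request form) announcing `ip` at `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    eth = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes        # sender hardware / protocol address
    arp += b"\x00" * 6 + ip_bytes      # target hardware zeroed, target IP == sender IP
    return eth + arp

if __name__ == "__main__":
    frame = build_garp("02:a0:98:00:00:01", "192.0.2.10")  # example values only
    print(len(frame), frame.hex())
```

Because the frame is a broadcast on the VLAN, a switch feature that filters it leaves client ARP caches pointing at the old port, which matches the behaviour described above.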

latent sedge
#

I also see this on our a0a ports, but not on any of the member ports or the tagged VLANs, and also after upgrading to 9.13. I think I will opt for "set adv; network port modify -ignore-health-status true", but the ports still show up as degraded, which is quite annoying. We don't use "a0a" for anything; we have VLANs tagged on top of it. Is there a nicer way to get rid of this? Hmm, I found this description of the problem: https://kb.netapp.com/onprem/ontap/da/NAS/Native_VLAN_ports_show_degraded_after_upgrade_to_ONTAP_9.13.1. Another bug... nice 🙂

sharp hatch
#

If there are other ports that are not in any broadcast domain, it tries to reach those as well; it doesn't matter whether the ports are hosting LIFs or not. Have you tried simply moving those ports into their own (new) broadcast domain? The bug you noted has been rejected, as it's not really a "bug" but more of a misconfiguration on the switch side.

latent sedge
#

`network port reachability show
Node            Port      Expected Reachability  Reachability Status
--------------- --------- ---------------------- -------------------
fs07-dkaar1-01
                a0a       -                      no-reachability
                a0a-10    Default:Default        ok
                a0a-20    Default:NFS            ok
                a0a-24    Default:IC             ok
                e0M       Default:Default        ok
                e0a       Cluster:Cluster        ok
                e0b       Cluster:Cluster        ok
                e0c       -                      ok
                e0d       -                      ok
                e0e       -                      ok
                e0f       -                      ok`

#

`broadcast-domain show
  (network port broadcast-domain show)
IPspace Broadcast                                 Update
Name    Domain Name  MTU   Port List              Status Details
------- ------------ ----- ---------------------- --------------
Cluster Cluster      9000
                           fs07-dkaar1-02:e0a     complete
                           fs07-dkaar1-02:e0b     complete
                           fs07-dkaar1-01:e0a     complete
                           fs07-dkaar1-01:e0b     complete
Default Default      1500
                           fs07-dkaar1-02:a0a-10  complete
                           fs07-dkaar1-02:e0M     complete
                           fs07-dkaar1-01:a0a-10  complete
                           fs07-dkaar1-01:e0M     complete
        IC           9000
                           fs07-dkaar1-02:a0a-24  complete
                           fs07-dkaar1-01:a0a-24  complete
        NFS          9000
                           fs07-dkaar1-02:a0a-20  complete
                           fs07-dkaar1-01:a0a-20  complete`
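
To make the two listings above easier to compare, a small script can pull out the ports that have no expected broadcast domain (`-` in the second column) together with their reachability status. This is a sketch that assumes the whitespace-separated layout pasted above; the function names are invented here, and the sample is copied from the `reachability show` output.

```python
SAMPLE = """\
fs07-dkaar1-01
a0a - no-reachability
a0a-10 Default:Default ok
a0a-20 Default:NFS ok
a0a-24 Default:IC ok
e0M Default:Default ok
e0a Cluster:Cluster ok
e0b Cluster:Cluster ok
e0c - ok
e0d - ok
e0e - ok
e0f - ok
"""

def parse_reachability(text):
    """Parse 'network port reachability show' rows into dictionaries."""
    rows, node = [], None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:
            node = fields[0]                      # node name on its own line
        elif len(fields) == 3 and node is not None:
            port, expected, status = fields
            rows.append({
                "node": node,
                "port": port,
                # "-" means the port is not assigned to any broadcast domain
                "expected": None if expected == "-" else expected,
                "status": status,
            })
    return rows

def unassigned_ports(rows):
    """Ports with no expected broadcast domain, with their reachability status."""
    return [(r["node"], r["port"], r["status"]) for r in rows if r["expected"] is None]

if __name__ == "__main__":
    for node, port, status in unassigned_ports(parse_reachability(SAMPLE)):
        print(f"{node}:{port} has no broadcast domain, reachability={status}")
```

On the sample this lists `a0a` and the four ifgrp member ports `e0c` to `e0f`; only `a0a` reports `no-reachability`.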

sharp hatch
#

Is the native VLAN blocked switch-side? If so, you need to put the two a0a ports into different broadcast domains so that it doesn't try to ping them from each other.

latent sedge
#

hmm ok I will try to create a dummy broadcast domain and add a0a into it.. but seems "stupid" to me.. 😉

sharp hatch
#

but in that case, setting health-monitoring to disabled would probably be the simpler solution?

latent sedge
#

`network port reachability repair -node fs07-dkaar1-01 -port a0a

Error: command failed: Reachability repairs are only supported for ports of type "physical" or "vlan".` 😉

sharp hatch
#

But it's strange: I'm pretty sure I had an installation with a customer recently, and there we saw a0a as "okay".

latent sedge
#

The reachability scan can do "a0a" but not the repair ;-). How long until it fixed the "degraded" port then?

balmy hinge
#

I typically create a broadcast domain, which I usually call DoNotUse, just to lump them together, and set the MTU to 9000. It certainly doesn’t need to be done; I just like the ports grouped.

latent sedge
#

haha that dummy broadcast domain didn't do anything good ;-). `network port show

Node: fs07-dkaar1-01
                                                                 Ignore
                                            Speed(Mbps) Health   Health
Port     IPspace Broadcast Domain Link MTU  Admin/Oper  Status   Status
-------- ------- ---------------- ---- ---- ----------- -------- ------
a0a      Default Dummy            up   9000 -/-         degraded true
a0a-10   Default Default          up   1500 -/-         healthy  false
a0a-20   Default NFS              up   9000 -/-         healthy  false
a0a-24   Default IC               up   9000 -/-         healthy  false
e0M      Default Default          up   1500 auto/1000   healthy  false
e0a      Cluster Cluster          up   9000 auto/10000  healthy  false
e0b      Cluster Cluster          up   9000 auto/10000  healthy  false
e0c      Default -                up   9000 auto/10000  degraded false
e0d      Default -                up   9000 auto/10000  degraded false
e0e      Default -                up   9000 auto/10000  degraded false
e0f      Default -                up   9000 auto/10000  degraded false`

#

...now all the member ports are degraded

#

..also with "l2_reachability" as the issue...

sharp hatch
#

I wonder why it shows degraded... it shouldn't

#

this is how it looks in one of our lab systems

#

no-reachability, but still the port is healthy (and ignore-health-status is disabled)

#

I don't think the L2 reachability should set the port to degraded?

latent sedge
#

that is what it states as the reason: Port Health Degraded Reasons: l2_reachability

#

on both a0a and the other ports

sharp hatch
#

yeah, I saw that, I just haven't seen this in our environment 🤔 we should see the same issue then

latent sedge
#

I haven't set up the Cisco part; I just know it is fairly standard Nexus switches with active LACP and VLANs on top...

#

Seems "stupid" having to reconfigure your switches just because NetApp decides that L2 connectivity is an issue all of a sudden.. 🙂

#

I think I will revert to no broadcast domain for a0a.. then pull the Cisco configs and open a case with NetApp..

#

..and just like that.. removed the broadcast domain and did a reachability scan, and all ports but a0a are healthy again

balmy hinge
#

I think it might be useful to see the output from the switches that include the actual port configuration and the port-channel configuration

#

Also, if they are Nexus switches, it would be useful to see the output of “show run vpc” to make sure all the correct items are there

sharp hatch
#

I guess if the VPC was down or misconfigured, all the VLANs on the portchannel would be affected, not just the native VLAN...?