#MCC-IP with Cisco 9336 switches issue with cluster links

night sleet
#

We have just set up a 4-node C800 MCC-IP using the Cisco 9336 switches provided by NetApp. Everything is configured as per the documentation, with the RCF tool etc. We have 100G for everything on the switches.
We have one issue: if we reboot a switch, we sometimes see one of the MetroCluster links report down (no link). Rebooting the switch again doesn't solve it; we have to reboot the NetApp node, after which it reports UP again. We can see this interface as down with "metrocluster interconnect adapter show" and on the switch. We have tried to issue a "metrocluster interconnect adapter reset", but for some reason it doesn't accept this command no matter what we use as the port name. It just tells us: "Error: command failed: invalid operation". But we shouldn't have to issue a reset anyway. So has anyone had this issue before? It doesn't seem to be isolated to just one host; we can replicate it on other nodes as well. We have checked the ports for transmission errors and cannot see any, so it's pretty strange.

left current
#

Switch reboots can cause the link in ONTAP to go into a degraded state. Usually, simply waiting a few minutes (5-15 min) is enough to fix the issue. How long did you wait after noticing that the link was down?

rain badger
#

There are times when the link won't come up because the speed doesn't negotiate. Check whether it's the same ports every time.

Which NX-OS version are you using (did you update the EPLD)?

Which version of the RCF builder are you using?

Are the cables all twinax or optical?
I suspect optical for the site-to-site connections. What about the rest?

I've seen issues using optical connections, but that was the X1151 card to the 9336 (the fix was to use twinax).

night sleet
#

RCF Builder 1.6c; the config used is SASandEthernetStorage_v2.10_MetroclusterIP_L2Direct_C800 (the C800 is actually an ASA C800, not that I think it matters). NX-OS is system version 10.3(5), from nxos64-cs.10.3.5.M.bin. Not sure about the EPLD?

night sleet
#

Just tested another reboot of a switch and got the same result: same port down, for 10 minutes now. Surely this cannot be "by design", as the other port to the other node is up and had no issues. I find it strange that the command "metrocluster interconnect adapter reset" doesn't seem to work, no matter which port name I give it. I've tried "e1b", "t6nex1", "t6nex1_e1b" and "DKDC1-NAMN002_t6nex1_e1b" (mostly because I saw this name in the event logs).

#

But I always get this response "Error: command failed: invalid operation"

#

The "worst" thing is that the "metrocluster check run" returns "successful"..

#

...yet in the webgui it reports an error under MetroCluster health... (hmmm not sure I like that?)

rain badger
#

How is that port connected? With a twinax cable or an optical transceiver?

Show int status

night sleet
#

It's port 1/10 that is the problem...

#

DKDC1-NASW002# sh int status

Port          Name               Status    Vlan      Duplex  Speed   Type
mgmt0         --                 connected routed    full    1000    --

Port          Name               Status    Vlan      Duplex  Speed   Type
Eth1/1        Intra-cluster Node connected 101       full    100G    QSFP-100G-CR4
Eth1/2        Intra-cluster Node connected 101       full    100G    QSFP-100G-CR4
Eth1/3        Unused port        xcvrAbsen 1         auto    auto    --
Eth1/4        Unused port        xcvrAbsen 1         auto    auto    --
Eth1/5        Ethernet Storage P xcvrAbsen trunk     auto    auto    --
Eth1/6        Ethernet Storage P xcvrAbsen trunk     auto    auto    --
Eth1/7        Intra-Cluster ISL  connected trunk     full    100G    QSFP-100G-CR4
Eth1/8        Intra-Cluster ISL  connected trunk     full    100G    QSFP-100G-CR4
Eth1/9        MetroCluster Node  connected 112       full    100G    QSFP-100G-CR4
Eth1/10       MetroCluster Node  notconnec 112       auto    auto    QSFP-100G-CR4
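
For anyone checking this after every switch reboot, here is a minimal Python sketch that scans pasted "show interface status" text for ports that have a transceiver or cable detected but no link, which is the state Eth1/10 is in above. It assumes the seven-column layout shown in this thread and that the Type value contains no spaces; the names PortStatus, parse_status and down_with_transceiver are made up for this illustration and are not part of any NetApp or Cisco tool.

```python
from typing import List, NamedTuple

# A few rows copied from the output in this thread.
SAMPLE = """\
Port Name Status Vlan Duplex Speed Type
Eth1/3 Unused port xcvrAbsen 1 auto auto --
Eth1/8 Intra-Cluster ISL connected trunk full 100G QSFP-100G-CR4
Eth1/9 MetroCluster Node connected 112 full 100G QSFP-100G-CR4
Eth1/10 MetroCluster Node notconnec 112 auto auto QSFP-100G-CR4
"""


class PortStatus(NamedTuple):
    port: str
    name: str
    status: str
    vlan: str
    duplex: str
    speed: str
    type: str


def parse_status(text: str) -> List[PortStatus]:
    """Parse rows by taking the last five fields from the right,
    because the Name column may itself contain spaces."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, separator lines and the repeated header.
        if len(fields) < 6 or fields[0] == "Port":
            continue
        status, vlan, duplex, speed, xcvr = fields[-5:]
        name = " ".join(fields[1:-5])
        rows.append(PortStatus(fields[0], name, status, vlan, duplex, speed, xcvr))
    return rows


def down_with_transceiver(rows: List[PortStatus]) -> List[PortStatus]:
    """Ports that report a transceiver type but are not connected."""
    return [r for r in rows if r.status != "connected" and r.type != "--"]


if __name__ == "__main__":
    for row in down_with_transceiver(parse_status(SAMPLE)):
        print(row.port, row.name, row.status)
```

Empty ports (type "--") are deliberately ignored, so only ports that should have a link are reported.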

left current
#

Try setting the speed to a fixed 100G. Sometimes speed autonegotiation doesn't work correctly.

night sleet
#

Hmm... I was sure I had tested "speed 100000", but apparently it makes a difference. With this set, the port acts like the other ports and comes up within 5-10 seconds, which is good. But why is this only a problem on two specific ports in the MC? Also, if I change the config to force the speed, am I not then running an unsupported configuration? It's not what the RCF tool creates.
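
To keep track of exactly what was changed relative to the RCF-generated config, a small Python sketch like the one below could render the per-port lines for review before they are entered in interface configuration mode on the switch. The "speed 100000" line and port Ethernet1/10 come from this thread; the function name fixed_speed_config is hypothetical, and the output is only text to review, not something this script pushes to a switch.

```python
from typing import Iterable


def fixed_speed_config(ports: Iterable[str], speed: str = "100000") -> str:
    """Render NX-OS interface stanzas that pin the port speed.

    Assumption: 'speed 100000' is the line that fixed the link in this
    thread; check it against your own platform before applying it.
    """
    lines = []
    for port in ports:
        lines.append(f"interface {port}")
        lines.append(f"  speed {speed}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Only Eth1/10 is named in the thread, so only that port is listed here.
    print(fixed_speed_config(["Ethernet1/10"]))
```

Keeping the list of touched ports in one place makes it easy to remove the same lines again later if the config has to match the RCF output.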

left current
#

I would assume that just changing the speed to something fixed would be okay, similar to setting the FEC mode which is required by some platforms... If there's ever a case in which this plays a role you can always just remove the "speed .." line again for the duration of the case 🤷‍♂️

left current
#

Adding it to the RCF would mean you'd have to separate the 40G systems from the 100G systems in the config generator, which would make it more complicated I think

rain badger
#

There are known issues where 100G does not always work with auto speed. The normal (non-MCC) RCF files document setting the speed if needed.

#

Happens on the BES and the Cisco switches
