#Interconnect oddities with 9336 rcf1.8 and TCAM issues.

mighty timber
#

Alright, so I've got a question regarding interconnect switches.

The skinny is this.
This is an out-of-box install with a fresh RCF 1.8 application, followed by moving the existing 10 nodes to a Cisco 9336 interconnect pair.

Should I be getting TCAM resources exhausted spam in the "show logging" when attempting to cable in new nodes to the cluster?
Port goes amber for a split second then shuts down entirely.

When checking the internal resource allocation on the switch itself those report 99% utilization at day 1.

Cisco's advice? Strip the QoS policy applied to the port profile OR move it to the second ASIC.

Netapp initially said it's a Cisco problem.

My mindset is this: if I deviate from the RCF and base config, my neck's on the line when it inevitably goes sideways.

twin bane
#

It unfortunately is. The QoS stuff is there for a reason

#

I assume the Cisco is running a recommended nxos?

mighty timber
#

Absolutely is.

I am not advocating to remove the RCF at all, I get the purpose of it and the QOS policy.

I don't think this is a Cisco issue, it's either a documentation problem or an RCF issue.

Reasoning;
There are two instances on that switch, both have their own L2 TCAM resources.
With 14 nodes on one instance, that instance is at 99.2% utilization of its L2 TCAM resources.

#

If I plug it into 18-34 it will light up and use its secondary instance resources.

#

So

Either the RCF needs to be modified to rob Peter to pay Paul: take from other resource pools and allocate that to the L2 TCAM region that's running out.
Or the documentation needs to be adjusted to state explicitly: "you need to split nodes between ports X-Z to balance between these instances so you don't eat up all of the resources"
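
To make that second option concrete, here's a minimal Python sketch of a cabling check. The port-to-instance split (ports 1-17 on the first instance, 18-34 on the second) is an assumption inferred only from what I saw with ports 18-34; it does not come from Cisco or NetApp documentation, and the function names are hypothetical.

```python
from collections import Counter

# ASSUMPTION: ports 1-17 draw on one instance's TCAM, ports 18-34 on the other.
# Only the 18-34 half was observed in this thread; verify on real hardware.
SECOND_INSTANCE_FIRST_PORT = 18
LAST_NODE_PORT = 34


def instance_for_port(port: int) -> int:
    """Return 0 or 1 for the (assumed) instance backing a node-facing port."""
    if not 1 <= port <= LAST_NODE_PORT:
        raise ValueError(f"port {port} is not a node-facing port")
    return 1 if port >= SECOND_INSTANCE_FIRST_PORT else 0


def nodes_per_instance(cabled_ports: list[int]) -> dict[int, int]:
    """Count how many cabled ports land on each assumed instance."""
    counts = Counter(instance_for_port(p) for p in cabled_ports)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}


def is_balanced(cabled_ports: list[int], tolerance: int = 1) -> bool:
    """True when the two instances differ by at most `tolerance` ports."""
    counts = nodes_per_instance(cabled_ports)
    return abs(counts[0] - counts[1]) <= tolerance
```

Running `nodes_per_instance` over a planned port list before cabling would flag a layout that piles everything onto one instance.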

twin bane
#

Totally understandable. Want to send me the NetApp case number and I’ll see what I can do?

prime nova
#

Was the RCF installed properly? With those switches it usually requires (with the ONTAP cluster ports pointed at the other switch) wiping the switch, reload, basic setup, install RCF, save config, reload, install RCF again. Then let the NetApp at the ports.

#

I suspect there are many out there that do not do the second RCF application and it will cause problems

mighty timber
#

Case# from me will be added shortly AND case# 2009452186 (opened by our installer).

mighty timber
#

Getting that information now 😉

prime nova
#

And are you using the cluster-ha-storage, cluster-ha-breakout, or the storage rcf?

mighty timber
#

Nexus-9336C-RCF-v1.8-Cluster-HA-Breakout.txt

#

If the TCAM properties weren't applied properly, I could possibly verify that while waiting on the logs to be sent over.

When properly applied, what value should the TCAM resources be?

#

Ex:

show hardware capacity qos (truncated)

Ingress L2 QOS 56 200 21.87
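
As a sanity check on that row, a quick Python sketch. It assumes the three numeric columns are used entries, free entries, and percent used; the truncated output doesn't show the header, so that reading is a guess that happens to fit the numbers.

```python
def tcam_utilization(used: int, free: int) -> float:
    """Percent of a TCAM region in use, assuming total = used + free."""
    total = used + free
    if total == 0:
        raise ValueError("region has no entries allocated")
    return 100.0 * used / total


# Row from the truncated output: "Ingress L2 QOS 56 200 21.87"
# ASSUMPTION: 56 = used, 200 = free.
pct = tcam_utilization(56, 200)  # 56 / 256 = 21.875 percent
print(pct)
```

If that column reading is right, the switch is showing the percentage truncated to two decimals.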

My current 10n are cabled in, the rest will be when this is resolved.

prime nova
#

No idea. No access to running 9K

#

You are using the correct ports for the nodes, right? Ports 1-3 are 10G breakout, ports 4-6 are 25G breakout, and ports 7-34 are for 40/100G (with 35/36 as the ISL).
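
That layout can be written down as a small lookup. This is only a sketch: the ranges are the ones listed in this message, not checked against the RCF file itself, and `port_role` is a hypothetical helper name.

```python
def port_role(port: int) -> str:
    """Map a front-panel port number to its role per the layout described above."""
    if 1 <= port <= 3:
        return "10G breakout"
    if 4 <= port <= 6:
        return "25G breakout"
    if 7 <= port <= 34:
        return "40/100G"
    if port in (35, 36):
        return "ISL"
    raise ValueError(f"port {port} is outside 1-36")


# Example: check a planned cabling list before plugging anything in.
for p in (2, 5, 20, 36):
    print(p, port_role(p))
```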

mighty timber
#

Hokay.

Reviewing the RCF and the link here -> https://docs.netapp.com/us-en/ontap-systems-switches/switch-cisco-9336c-fx2-shared/install-nxos-rcf-9336c-shared.html#step-2-configure-ports

I do not find any mention of needing to apply the RCF file twice.
It absolutely is mentioned in RCF1.8a for the 3236Q-V here

# Installation Notes:
# - This RCF utilizes QoS and requires TCAM re-configuration, requiring RCF
#   to be loaded twice with the Cluster Switch rebooted in between.
#
# - Perform the following 4 steps to ensure proper RCF installation:
#
#   (1) Apply RCF first time, expect following messages:
#       - Please save config and reload the system...
#       - ERROR: Failed to write VSH commands
#       - TCAM region is not configured...
#
#   (2) Save running-configuration and reboot Cluster Switch
#
#   (3) After reboot, apply same RCF second time and expect following messages:
#       - % Invalid command at '^' marker
#
#   (4) Save running-configuration again

At the top of the 9336 RCF 1.8 is the messaging on how to dismantle the breakout modes for ports 1-3 and 4-6.

I am absolutely more than happy to apply it a second time in a maintenance window.

#

Reviewing the session logs from our partner, it indeed was not applied twice. The documents are unclear as to whether that's required.
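
For anyone wanting to run the same check on their own install notes, here's a rough Python sketch. It tests whether a list of install events contains the apply / save / reload / apply / save sequence from the 3236Q-V notes quoted above. The event labels are hypothetical names I made up for illustration, not real NX-OS log strings, and whether the 9336C RCF actually requires this sequence is still the open question.

```python
# Sequence from the 3236Q-V RCF 1.8a installation notes.
# ASSUMPTION: event labels are invented for this sketch.
REQUIRED_ORDER = [
    "apply_rcf",
    "save_config",
    "reload",
    "apply_rcf",
    "save_config",
]


def followed_four_steps(events: list[str]) -> bool:
    """True if REQUIRED_ORDER appears in `events` as an ordered subsequence."""
    remaining = iter(events)
    # `step in remaining` advances the iterator until it finds `step`,
    # so each later step must occur after the earlier ones.
    return all(step in remaining for step in REQUIRED_ORDER)


partner_session = ["reload", "basic_setup", "apply_rcf", "save_config", "reload"]
print(followed_four_steps(partner_session))
```

The `partner_session` list is a made-up stand-in for a single-application install, not a transcript of the real session.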

fiery wren
#

Applying it twice may help. Please let us know if that does it. We can update documentation then.

mighty timber
#

I got some interesting results from my case today, I think it is a mix of
"Odd config" and "this could be documented better"

#

But I absolutely will update as it plays out!

fiery wren
#

I believe there may also be an issue ongoing with these. I had a case last week or the week before involving these same 9336 cluster switches.

#

I will dig around a little and see what I can find.

mighty timber
#

If you'd like the case number to look over I'm more than happy to share.
I'm not looking for special attention, simply to add to the pile of information.

fiery wren
#

Appreciate it. If you don't mind sharing the case number, I'd be happy to glance at it. I've got a special interest in cluster switches.

mighty timber
#

They are the heart of it all.

Case# 2009452186
There's a good deal of resources attached to the case currently, including some back and forth between T3, Cisco and Engineering (beyond what I can see).
Come Monday there should be even more according to the current case owner.

mighty timber
#

Solution

#

We ended up recabling after guidance from support, but this is the KB this ticket spawned.