Hi All,
We have some IP MetroCluster switches (Cisco 31xx devices) where, on the CLI, I can see that some interfaces have a significant percentage of "OutDiscards". I have added the devices to Harvest, to try and better understand when/why this is happening, but the Cisco Switch dashboard doesn't include a panel for this metric 🙁 ... How can I best go about adding it? What would be involved?
#Cisco Switch Dashboard: Add OutDiscards?
hi @latent garnet it doesn't look like Harvest is currently collecting this metric https://github.com/NetApp/harvest/blob/main/cmd/collectors/cisco/plugins/networkinterface/networkinterface.go#L17
Are you seeing this metric from the CLI via show interface ?
I'm running the CLI command "show interface counter error non-zero"
Can you check whether the values you see from the CLI match what you get when you run show interface | json-pretty and search for the eth_outdiscard key?
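For anyone doing the same comparison, here is a minimal Python sketch that pulls eth_outdiscard per interface out of saved `show interface | json-pretty` output. The TABLE_interface / ROW_interface nesting and the "interface" key are assumptions about the NX-OS JSON layout. Only the eth_outdiscard key itself comes from this thread.

```python
import json


def out_discards(raw_json: str) -> dict[str, int]:
    """Map interface name -> eth_outdiscard from saved JSON output."""
    doc = json.loads(raw_json)
    # Assumed layout: {"TABLE_interface": {"ROW_interface": [ {...}, ... ]}}
    rows = doc.get("TABLE_interface", {}).get("ROW_interface", [])
    if isinstance(rows, dict):  # a single interface may come back as one object
        rows = [rows]
    result = {}
    for row in rows:
        if "eth_outdiscard" in row:
            # Counter values arrive as strings, so convert before comparing.
            result[row.get("interface", "unknown")] = int(row["eth_outdiscard"])
    return result


# Hypothetical sample, not real switch output.
sample = (
    '{"TABLE_interface": {"ROW_interface": ['
    '{"interface": "Ethernet1/10", "eth_outdiscard": "1234"},'
    '{"interface": "mgmt0"}]}}'
)
print(out_discards(sample))
```

Interfaces without the key (mgmt0 in the sample) are skipped rather than reported as zero, so a missing counter is not mistaken for a clean one.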
thanks for confirming. That's OK, we have a few samples saved in GitHub that we can refer to. Would you mind opening a GitHub request asking for this feature and we'll add the metric and a new dashboard panel under the Traffic row in this open slot
OK - can do ... Is there a quick-hack guide, if I wanted to try and add it right-away myself?
not really, the networkinterface.go link I shared above is the plugin that runs as part of the template. The new counter you're requesting will follow a similar pattern in that file. The Cisco collector uses plugins to gather metrics, unlike the ONTAP collector which is more template driven. You could probably vibecode a solution in a few minutes if you want to pull the repo, point your agent at it, and describe what you need
Humm, there we must agree to differ, I'd rather write my own code 😉
But I'll get right on the Feature request 🙂
same but you said quick, hack, and right-away 🙂
sir, I have no vibe
Very wise!
If there are other counters you see that are missing, feel free to list those too
thanks @latent garnet !
Thank you guys - Harvest is excellent! 👏🏻
Got it - fast work! Arrrgh, nabox has no patch executable 😭
Hi @latent garnet these changes are available in nightly. Let us know how they work for you. https://github.com/NetApp/harvest/releases/tag/nightly
I finally installed Harvest nightly 26.03.05 in nabox, but I don't see any new panels in the Cisco Switch Dashboard, did I miss some step?
( I also did a "dc restart" )
We have to upgrade nabox as well? I only upgraded the Harvest package itself. I thought that would be sufficient?
Currently we have nabox 4.0.9. I used the nabox web UI to update Harvest from 25.11 to the latest nightly.
hi @latent garnet can you try removing the Cisco Switch dashboard from Nabox's Grafana. Nabox will reload it and hopefully it will reload the new dashboard from the nightly build you upgraded to
You can also check VictoriaMetrics for the new metric cisco_interface_eth_out_discards
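If the VictoriaMetrics UI is unfamiliar, the same check can be done over its Prometheus-compatible HTTP API. A minimal Python sketch that only builds the instant-query URL: the host name, port 8428, and the switch label value are placeholders I made up, and on NAbox the endpoint may sit behind a different path.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder host and label value; adjust to your environment.
url = instant_query_url(
    "http://nabox.example.com:8428",
    'cisco_interface_eth_out_discards{switch="s1"}',
)
print(url)
```

Opening the printed URL in a browser, or fetching it with curl, should return JSON with one result per matching interface if the metric is being collected.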
Victoria can look up that metric name, and it seems to show some values (I don't really know how to drive it)
But, a slight concern: if I delete the Dashboard, do we lose all the existing Cisco Switch (interface stats) history?
You won't lose data. Data is stored in VictoriaMetrics.
Good to know - Thanks Rahul!
Humm ... Can you tell me more about "Nabox will reload it and hopefully it will reload the new dashboard ..." Does it take a while? Do I have to trigger the reload somehow?
If you don't see it after a minute, ssh to the nabox and run
dc down
dc up -d
to restart Nabox
dc down + up worked, everything restarted. But still no Cisco Dashboard shown. Is it still directly under "Home > Dashboards" ?
I'll also check the yaml files to see if the cisco switches are still defined, although that doesn't really seem UI related ...
Nope, don't have Harvest-Cisco. Just: Harvest-7mode, Harvest-StorageGrid, Harvest-cDOT, Harvest-cDOT Details and NAbox
(and a couple of test hack Dashboards I created for testing)
@alpine swift can chime in, but I think you will see the Cisco dashboards if you upgrade Nabox. Alternatively, you can import the dashboard into your current Grafana too. You are getting the metrics correctly from the nightly build of Harvest, but don't see the Cisco dashboard
I have somehow the feeling that upgrading nabox could be a bit sketchy, just based on some messages I've seen here ... just my impression. What's the name of the Dashboard file that I should try to import? I'll give that a go
great!
( Shows the same "comb" pattern as the interface traffic )
So that's great. We're collecting the transmit discard info, and it looks reasonable ...
Question though. I'm looking at the Interface Receive Throughput graph. It's showing me the top traffic receiver as being Ethernet1/10, apparently a 40 Gbit interface. Currently the graph shows me three peaks of 2TB/s
In the Traffic on Switch table (above the graph) the same interface is shown as loaded/hot/red with a value of 1.45 GB/s
When I inspect the data for the graph, around that time I see a series of 12 identical 15 second samples/values of 539.13 MB/s
I think the 2TB/s values must be incorrect, well, impossible?
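A quick sanity check of that suspicion, as a minimal Python sketch. It assumes SI units (1 GB = 1e9 bytes) and ignores framing overhead, so the ceiling is an upper bound.

```python
def max_bytes_per_sec(link_gbit: float) -> float:
    """Theoretical ceiling of a link in bytes per second (8 bits per byte)."""
    return link_gbit * 1e9 / 8


ceiling = max_bytes_per_sec(40)   # 40 Gbit/s link
observed_peak = 2e12              # 2 TB/s as shown on the graph
table_value = 1.45e9              # 1.45 GB/s from the Traffic on Switch table
sample_value = 539.13e6           # 539.13 MB/s from the inspected samples

print(f"ceiling: {ceiling / 1e9:.1f} GB/s")
print(f"peak is {observed_peak / ceiling:.0f}x the ceiling")
print("table value plausible:", table_value <= ceiling)
print("sample value plausible:", sample_value <= ceiling)
```

A 40 Gbit/s interface tops out at 5 GB/s, so the 2 TB/s peaks are about 400 times what the link can carry, while the table and sample values are both physically possible.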
Do you mean these panels? If possible, include screenshots - they help understand what you mean
Yes, exactly those panels. But for sure, sending a few screenshots is a good idea. I'll have to do that via email though
those values match for me - you're saying they don't match for you? the table shows instant values which will match the timeseries panel's "last" value
Will take me a few minutes to "get around to it"
fwiw, Harvest isn't doing any processing of these metrics - we're basically exporting what the Cisco API returns
can you check the unit for the receive throughput time-series panel and confirm that it is bytes/sec(SI)?
I have sent a couple of email messages with some screenshots, including the panel options ... Unit is "bytes/sec(SI)"
thanks, we'll take a look
Some ideas - maybe this counter is overflowing? In Prometheus can you look at the switch in question with a query like this cisco_interface_receive_bytes{switch="your-switch-name",interface="Ethernet1/10"} and see if the graph ever decreases?
Another idea, the panel in question uses sum by (switch, interface, mac) which would cause double counting if multiple macs are on the same interface (maybe an uplink or trunk??) Does that seem possible in your env?
I also thought about the counters overflowing. The comb appearance is, I think, a classic symptom of this. But surely counter overflow is something from the past - I had assumed it didn't happen any more. Also, I don't immediately see how that would relate to the 2TB/s values that were being graphed. I will try to investigate further.
Could you try running the query below for this switch and interface to see if the value is being reset? Run it for a time range in Prometheus that covers those 2TB bumps.
delta(cisco_interface_receive_bytes{switch=~"s1",interface=~"Ethernet1/26"}[4m])
I think I did that right, fwd via email ...
@latent garnet Thanks, yes. Delta shows that the value is not monotonically increasing, and most likely it is overflowing. We'll need to handle it in Harvest.
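For illustration, a minimal Python sketch of what "not monotonically increasing" looks like in raw samples, and one way a wrap could be compensated. The sample values are made up, and the 2**32 modulus is an assumption, since the thread has not established where the counter wraps. This is not how Harvest handles it.

```python
def find_decreases(samples):
    """Return (index, previous, current) wherever the counter went down."""
    return [
        (i, samples[i - 1], samples[i])
        for i in range(1, len(samples))
        if samples[i] < samples[i - 1]
    ]


def wrap_aware_deltas(samples, modulus):
    """Per-interval increase, assuming at most one wrap between samples."""
    return [(cur - prev) % modulus for prev, cur in zip(samples, samples[1:])]


# Hypothetical byte counter that wraps at 2**32 between the 2nd and 3rd sample.
samples = [4_294_000_000, 4_294_900_000, 832_704, 1_732_704]

print(find_decreases(samples))
print(wrap_aware_deltas(samples, 2**32))
```

The modulo trick only recovers the true increase if the counter wraps at most once per interval. On a fast link with a slow polling interval that condition may not hold, in which case the lost wraps cannot be reconstructed from the samples alone.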
Cool - thanks. FYI, the same "comb" pattern (normal value, low value, normal, low, normal, low, ...) is also apparent on other Cisco graphs e.g. "Top 5 Interface Send Throughput"
Apart from the overflow, we thought the comb pattern was normal. We discussed this in message #1473609804479205471
see if this helps
So ... Do you think there is an overflow problem, one that needs to be handled in Harvest? If it would help I could try collecting some "raw data" packet/octet counts via the cli. Perhaps that would help define the problem?
Yes that would help.
@latent garnet What is your switch software version? Like, is it using a 32-bit counter, causing it to overflow this fast?
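To put "this fast" into numbers, a minimal Python sketch of how long a byte counter takes to wrap at the 539.13 MB/s rate mentioned earlier. The 180 second polling interval is an assumed value for illustration.

```python
def seconds_to_wrap(bits: int, bytes_per_sec: float) -> float:
    """Time for a counter of the given width to go from 0 to overflow."""
    return 2**bits / bytes_per_sec


rate = 539.13e6          # bytes per second, from the inspected samples
poll_interval = 180      # seconds, assumed polling interval

wrap32 = seconds_to_wrap(32, rate)
wrap64_years = seconds_to_wrap(64, rate) / (365 * 24 * 3600)

print(f"32-bit counter wraps every {wrap32:.2f} s")
print(f"wraps per poll interval: {poll_interval / wrap32:.1f}")
print(f"64-bit counter wraps after about {wrap64_years:.0f} years")
```

At that rate a 32-bit byte counter would wrap roughly every 8 seconds, so many times between polls, while a 64-bit counter would take over a thousand years. That is why a genuine 64-bit counter overflowing on this link would be surprising.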
I just started using the Cisco collector and also see the comb lines. I think they are related to the interval setting. Default interval in Grafana is 1m but the default collection interval of the cisco collector is only 3m.
If I change the Min interval or modify the queries from "[4m]" to "[4m:3m]", the comb lines are gone.
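One way to see why the step matters, as a minimal Python sketch: count how many 3m-spaced samples land in a 4m window at each evaluation time. This is a simplification that treats the window as closed on both ends and ignores staleness and extrapolation, so it is not a model of what Prometheus or VictoriaMetrics actually compute.

```python
def samples_in_window(eval_time: int, window: int, scrape_interval: int) -> int:
    """Count scrapes at t = 0, scrape_interval, ... inside [eval_time - window, eval_time]."""
    start = max(eval_time - window, 0)
    first = -(-start // scrape_interval)   # ceil division: first scrape at or after start
    last = eval_time // scrape_interval    # last scrape at or before eval_time
    return max(last - first + 1, 0)


WINDOW = 240   # [4m]
SCRAPE = 180   # 3m collection interval

# Evaluate every 1m (Grafana default step) versus every 3m (min interval = 3m).
every_1m = [samples_in_window(t, WINDOW, SCRAPE) for t in range(240, 960, 60)]
every_3m = [samples_in_window(t, WINDOW, SCRAPE) for t in range(360, 960, 180)]

print("1m step:", every_1m)
print("3m step:", every_3m)
```

With a 1m step the window alternates between holding one and two samples, and a rate-style function needs at least two, so the result changes from step to step. With a 3m step every window holds the same number of samples, which would be consistent with the comb disappearing.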
@rustic mauve What do you think about my finding?
hi @steady granite I think you're onto something about the comb lines. Given the choice between setting the Grafana min step and using a Prometheus subquery, I think min step is slightly better, since subqueries are slightly more expensive (CPU-wise) for the Prometheus server. Both achieve the same result.
RobbW's issue being discussed above is different. They appear to have overflowing counters on a high bandwidth interface. Some web searching confirms that this is a common problem for some switches but I haven't been able to nail down the exact hardware/software versions yet.
@steady granite the natural sort order fix is in the latest nightly if you want to try it out
I tried to find out if "show interface" shows 32 bit or 64 bit counters, but found no public records. There is a command that gives back the 64 bit counters. Maybe Robb could have a look if these look any different? Could be an option for future implementation changes.
`# show interface Ethernet1/15 counters detailed all
Ethernet1/15
64 bit counters:
rxHCTotalPkts = 750468813296
txHCTotalPks = 807950796352
rxHCUnicastPkts = 750466156028
rxHCMulticastPkts = 2622852
rxHCBroadcastPkts = 34416
rxHCOctets = 2734122892481045
txHCUnicastPkts = 807942544927
txHCMulticastPkts = 8245269
txHCBroadcastPkts = 6156
txHCOctets = 3305176170729475
rxTxHCPkts64Octets = 459515993237
rxTxHCpkts65to127Octets = 222370981269
rxTxHCpkts128to255Octets = 18010968873
rxTxHCpkts256to511Octets = 98609473634
rxTxHCpkts512to1023Octets = 13907164415
rxTxHCpkts1024to1518Octets = 6519078919
rxTxHCpkts1519to1548Octets = 6679830591
rxHCTrunkFrames = 0
txHCTrunkFrames = 0
rxHCDropEvents = 0
InLayer3Unicast = 0
InLayer3UnicastOctets = 0
InLayer3Multicast = 0
InLayer3MulticastOctets = 0
OutLayer3Unicast = 0
OutLayer3UnicastOctets = 0
OutLayer3Multicast = 0
OutLayer3MulticastOctets = 0
InLayer3Routed = 0
InLayer3RoutedOctets = 0
OutLayer3Routed = 0
OutLayer3RoutedOctets = 0
InLayer3AverageOctets = 0
InLayer3AveragePackets = 0
OutLayer3AverageOctets = 0
OutLayer3AveragePackets = 0
...`
FWIW on the switch I tried - the command you pasted gives the same results as what Harvest uses
show interface port-channel1 counters detailed all | grep "tx bytes"
25. tx bytes: = 1954410324776480
vs what Harvest uses
show interface port-channel1 | json-pretty | grep eth_outbytes
"eth_outbytes": "1954410324776480",
You have to check the first table that is labeled "64 bit counters:"
The values in the first table are the same as the values in the 2nd table
all 64 bit
OK. Then I have no clue where the quick rollover should come from. 😁
it's also interesting that if you do this show interface port-channel1 counters detailed all | json-pretty instead of show interface port-channel1 counters detailed all you only get one table with no mention of 64 bit
but the values are 64 bit
Yes that's what happens. Some values are shown with different labels but same values.
"tx_octets": "8282487798",
"eth_outbytes": "8282487798",
all in one table
yep, since Harvest is using show interface | json (via RPC) the values should be 64 bit for versions of NX-OS that support 64 bit counters. I'll be curious what Robb sees on his switch since the PromQL delta query he shared clearly showed the values were overflowing. The most likely cause of overflow is 32 bit counters. It will be interesting to see if his tables have different values for 32 and 64 bit
Hi Switching fans, I will check here and let you know what I see. I had thought that 32 bit counters were a thing from the past, but to be honest I don't know what exactly we have here ... it will be whatever is recommended by NetApp for MetroCluster with ONTAP 9.16.x
I sent the result via email (I can't access Discord from there). I interpret it to be pretty much the same as what you see (above). I don't see any references to 32 bit counters, just two tables: "64 bit counters" and "All Port Counters"
thanks RobbW, that's interesting, we'll take a look. If your counters are 64 bit, I don't understand some of the earlier graphs you shared
No problem, glad to contribute - even if it is a problem :-). AFAICR another "classic" issue that I have seen is somewhere in the data path a value being treated as signed rather than unsigned i.e. monotonically increasing (well, until it wraps). Dunno if that could be relevant here ...
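To make the signed-versus-unsigned idea concrete, a minimal Python sketch of what happens when an unsigned counter value is reinterpreted as a two's-complement signed integer. Whether anything in the data path actually does this is pure speculation at this point.

```python
def as_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned value of the given width as two's-complement signed."""
    value &= (1 << bits) - 1            # keep only the low `bits` bits
    if value >= 1 << (bits - 1):        # top bit set means negative when signed
        return value - (1 << bits)
    return value


# The eth_outbytes value from this thread is far below 2**63, so it survives.
print(as_signed(1954410324776480, 64))

# A 32-bit value just past the halfway point turns negative.
print(as_signed(2**31, 32))
```

A signed interpretation would make a counter appear to drop once it crosses half of its range, which looks like a reset in a graph even though the unsigned value is still increasing.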