#Cisco Switch Dashboard: Add OutDiscards?


latent garnet
#

Hi All,
We have some IP MetroCluster switches (Cisco 31xx devices) where, on the CLI, I can see that some interfaces have a significant percentage of "OutDiscards". I have added the devices to Harvest, to try and better understand when/why this is happening, but the Cisco Switch dashboard doesn't include a panel for this metric 🙁 ... How can I best go about adding it? What would be involved?

past swallow
latent garnet
#

I'm running the CLI command "show interface counter error non-zero"

past swallow
#

Can you check whether the values you see from the CLI match the values you get when you run `show interface | json-pretty` and search for the `eth_outdiscard` key?
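
For anyone who wants to script that comparison, here is a minimal sketch that searches saved JSON output for every `eth_outdiscard` value. The sample document is a hypothetical, heavily trimmed stand-in and not real switch output; the search is recursive so that it does not depend on the exact nesting NX-OS uses.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical stand-in for saved `show interface | json-pretty` output.
# The outer key names here are made up; only `eth_outdiscard` comes from the thread.
sample = json.loads("""
{
  "interfaces": [
    {"interface": "Ethernet1/1", "eth_outdiscard": "0"},
    {"interface": "Ethernet1/10", "eth_outdiscard": "123456"}
  ]
}
""")

# Counter values arrive as strings in the JSON, so convert before comparing.
discards = [int(v) for v in find_key(sample, "eth_outdiscard")]
print(discards)
```

The resulting list can then be compared by eye against the non-zero rows from the CLI error counters.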

latent garnet
#

There seems to be an exact match

#

I could email you the output ...

past swallow
#

thanks for confirming. That's OK, we have a few samples saved in GitHub that we can refer to. Would you mind opening a GitHub request asking for this feature? We'll add the metric and a new dashboard panel under the Traffic row in this open slot

latent garnet
#

OK - can do ... Is there a quick-hack guide, if I wanted to try and add it right-away myself?

past swallow
#

not really, the networkinterface.go link I shared above is the plugin that runs as part of the template. The new counter you're requesting will follow a similar pattern in that file. The Cisco collector uses plugins to gather metrics, unlike the ONTAP collector which is more template driven. You could probably vibecode a solution in a few minutes if you want to pull the repo, point your agent at it, and describe what you need

latent garnet
#

Humm, there we must agree to differ, I'd rather write my own code 😉

#

But I'll get right on the Feature request 🙂

past swallow
#

same but you said quick, hack, and right-away 🙂

latent garnet
#

Why are you still talking to me when you should be vibing? 😇

#

🤣

past swallow
#

sir, I have no vibe

latent garnet
#

Very wise!

past swallow
#

If there are other counters you see that are missing, feel free to list those too

latent garnet
past swallow
#

thanks @latent garnet !

latent garnet
#

Thank you guys - Harvest is excellent! 👏🏻

latent garnet
#

Got it - fast work! Arrrgh, nabox has no patch executable 😭

past swallow
latent garnet
#

I finally installed Harvest nightly 26.03.05 in nabox, but I don't see any new panels in the Cisco Switch Dashboard, did I miss some step?

#

( I also did a "dc restart" )

rustic mauve
#

What is your NABox version? It was added to NABox 4.0.11 onwards

latent garnet
#

We have to upgrade nabox as well? I only upgraded the Harvest package itself. I thought that would be sufficient?

#

Currently we have nabox 4.0.9. I used the nabox web UI to update Harvest from 25.11 to the latest nightly.

past swallow
#

hi @latent garnet can you try removing the Cisco Switch dashboard from Nabox's Grafana? Nabox will reload it, and hopefully it will pick up the new dashboard from the nightly build you upgraded to

#

You can also check VictoriaMetrics for the new metric cisco_interface_eth_out_discards

latent garnet
#

Victoria can look up that metric name, and it seems to show some values (I don't really know how to drive it)

#

But, a slight concern: if I delete the Dashboard, do we lose all the existing Cisco Switch (interface stats) history?

rustic mauve
#

You won't lose data. Data is stored in VictoriaMetrics.

latent garnet
#

Good to know - Thanks Rahul!

#

Humm ... Can you tell me more about "Nabox will reload it and hopefully it will reload the new dashboard ..." Does it take a while? Do I have to trigger the reload somehow?

past swallow
#

If you don't see it after a minute, ssh to the nabox and run

dc down
dc up -d

to restart Nabox

latent garnet
#

dc down + up worked, everything restarted. But still no Cisco Dashboard shown. Is it still directly under "Home > Dashboards"?

#

I'll also check the yaml files to see if the Cisco switches are still defined, although that doesn't really seem UI related ...

past swallow
latent garnet
#

Nope, don't have Harvest-Cisco. Just: Harvest-7mode, Harvest-StorageGrid, Harvest-cDOT, Harvest-cDOT Details and NAbox

#

(and a couple of test hack Dashboards I created for testing)

past swallow
#

@alpine swift can chime in, but I think you will see the Cisco dashboards if you upgrade Nabox. Alternatively, you can import the dashboard into your current Grafana too. You are getting the metrics correctly from the nightly build of Harvest, but don't see the Cisco dashboard

latent garnet
#

I have somehow the feeling that upgrading nabox could be a bit sketchy, just based on some messages I've seen here ... just my impression. What's the name of the Dashboard file that I should try to import? I'll give that a go

past swallow
latent garnet
#

OK! That worked - phew

#

Ethernet Out Discards are now displayed!

past swallow
#

great!

latent garnet
#

( Shows the same "comb" pattern as the interface traffic )

#

So that's great. We're collecting the transmit discard info, and it looks reasonable ...

#

Question though. I'm looking at the Interface Receive Throughput graph. It's showing me the top traffic receiver as being Ethernet1/10, apparently a 40 Gbit interface. Currently the graph shows me three peaks of 2TB/s

#

In the Traffic on Switch table (above the graph) the same interface is shown as loaded/hot/red with a value of 1.45 GB/s

#

When I inspect the data for the graph, around that time I see a series of 12 identical 15 second samples/values of 539.13 MB/s

#

I think the 2TB/s values must be incorrect, well, impossible?
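
A quick sanity check of that reasoning, as a sketch. It assumes the interface really is a 40 Gbit/s port and ignores framing overhead, so the ceiling is only an upper bound; the three values are the ones quoted in the messages above.

```python
def line_rate_bytes_per_sec(gbit_per_sec: float) -> float:
    """Upper bound on one-direction throughput for a link, in bytes per second."""
    return gbit_per_sec * 1e9 / 8


ceiling = line_rate_bytes_per_sec(40)  # assumed 40 Gbit/s port

observed = {
    "graph peak (2 TB/s)": 2e12,
    "table value (1.45 GB/s)": 1.45e9,
    "inspected sample (539.13 MB/s)": 539.13e6,
}

for label, value in observed.items():
    verdict = "plausible" if value <= ceiling else "exceeds line rate"
    print(f"{label}: {verdict}")
```

Only the graphed peak is above the ceiling, which suggests the problem is in how that value is derived rather than in the traffic itself.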

past swallow
#

Do you mean these panels? If possible, include screenshots - they help understand what you mean

latent garnet
#

Yes, exactly those panels. But for sure, sending a few screenshots is a good idea. I'll have to do that via email though

past swallow
#

those values match for me - you're saying they don't match for you? the table shows instant values which will match the timeseries panel's "last" value

latent garnet
#

Will take me a few minutes to "get around to it"

past swallow
#

fwiw, Harvest isn't doing any processing of these metrics - we're basically exporting what the Cisco API returns

past swallow
#

can you check the unit for the receive throughput time-series panel and confirm that it is bytes/sec(SI)?

latent garnet
#

I have sent a couple of email messages with some screenshots, including the panel options ... Unit is "bytes/sec(SI)"

past swallow
#

thanks, we'll take a look

past swallow
#

Some ideas - maybe this counter is overflowing? In Prometheus can you look at the switch in question with a query like this `cisco_interface_receive_bytes{switch="your-switch-name",interface="Ethernet1/10"}` and see if the graph ever decreases?

Another idea, the panel in question uses sum by (switch, interface, mac) which would cause double counting if multiple macs are on the same interface (maybe an uplink or trunk??) Does that seem possible in your env?

latent garnet
#

I also thought about the counters overflowing. The comb appearance is, I think, a classic symptom of this. But surely counter overflow is something from the past - I had assumed it didn't happen any more. Also, I don't immediately see how that would relate to the 2TB/s values that were being graphed. I will try to investigate further.

rustic mauve
#

Could you try running the query below for this switch and interface to see if the value is being reset? Run it for a time range in Prometheus that covers those 2TB bumps.

delta(cisco_interface_receive_bytes{switch=~"s1",interface=~"Ethernet1/26"}[4m])
latent garnet
#

I think I did that right, fwd via email ...

rustic mauve
#

@latent garnet Thanks, yes. Delta shows that the value is not monotonically increasing, and most likely it is overflowing. We'll need to handle it in Harvest.
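
The same check can be done offline on exported samples. This is a minimal sketch with made-up sample values (the real data was shared by email and is not in this thread): a counter that only ever increases should produce no hits.

```python
def find_decreases(samples):
    """Return (index, previous, current) for every sample lower than its predecessor."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


# Hypothetical raw counter readings, oldest first.
samples = [4_294_000_000, 4_294_900_000, 1_200_000, 2_100_000, 3_000_000]

for index, prev, cur in find_decreases(samples):
    print(f"sample {index}: counter went from {prev} to {cur}")
```

A decrease only shows that the series is not monotonic. Whether the cause is a wrap, a counter clear, or a reboot has to be worked out separately.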

latent garnet
#

Cool - thanks. FYI, the same "comb" pattern (normal value, low value, normal, low, normal, low, ...) is also apparent on other Cisco graphs e.g. "Top 5 Interface Send Throughput"

rustic mauve
latent garnet
#

So ... Do you think there is an overflow problem, one that needs to be handled in Harvest? If it would help I could try collecting some "raw data" packet/octet counts via the cli. Perhaps that would help define the problem?

rustic mauve
#

Yes that would help.

rustic mauve
#

@latent garnet What is your switch software version? Like, is it using a 32-bit counter, causing it to overflow this fast?
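
To put "this fast" into numbers, here is a sketch of how long a byte counter takes to wrap at a steady rate. The 539.13 MB/s figure is the sample value quoted earlier in the thread; the 5 GB/s figure assumes a 40 Gbit/s port running at line rate.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_to_wrap(counter_bits: int, bytes_per_sec: float) -> float:
    """Time for a byte counter of the given width to go from 0 back to 0."""
    return 2 ** counter_bits / bytes_per_sec


observed_rate = 539.13e6  # bytes/s, from the inspected samples
line_rate = 40e9 / 8      # bytes/s, assumed 40 Gbit/s port

wrap32_observed = seconds_to_wrap(32, observed_rate)  # roughly 8 seconds
wrap32_line = seconds_to_wrap(32, line_rate)          # under 1 second
wrap64_line_years = seconds_to_wrap(64, line_rate) / SECONDS_PER_YEAR  # over a century

print(f"32-bit at observed rate: {wrap32_observed:.1f} s")
print(f"32-bit at line rate:     {wrap32_line:.2f} s")
print(f"64-bit at line rate:     {wrap64_line_years:.0f} years")
```

With a 3 minute collection interval, a 32-bit byte counter at these rates would wrap many times between samples, while a 64-bit counter would not wrap in practice.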

steady granite
#

I just started using the Cisco collector and also see the comb lines. I think they are related to the interval setting. Default interval in Grafana is 1m but the default collection interval of the cisco collector is only 3m.

#

If I change the Min interval or modify the queries from "[4m]" to "[4m:3m]", the comb lines are gone.
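
One way to see why the step matters is a simplified model of how many collected samples fall inside each 4 minute window. This sketch assumes samples land exactly on multiples of the 3 minute collection interval and that the window is left-open and right-closed. Real Prometheus or VictoriaMetrics evaluation has more nuance, so treat it as an illustration only.

```python
def samples_in_window(eval_time: int, window: int, scrape_interval: int) -> int:
    """Count scrape timestamps t with eval_time - window < t <= eval_time.

    Scrapes are assumed to happen at 0, scrape_interval, 2 * scrape_interval, ...
    All arguments are in seconds.
    """
    first = (eval_time - window) // scrape_interval + 1  # first scrape after window start
    last = eval_time // scrape_interval                  # last scrape at or before eval_time
    return max(0, last - first + 1)


WINDOW = 4 * 60   # the [4m] range used by the panel queries
SCRAPE = 3 * 60   # collection interval mentioned above

# Evaluate every 1 minute versus every 3 minutes over the same span.
counts_1m = [samples_in_window(t, WINDOW, SCRAPE) for t in range(720, 1081, 60)]
counts_3m = [samples_in_window(t, WINDOW, SCRAPE) for t in range(720, 1081, 180)]

print("1m step:", counts_1m)
print("3m step:", counts_3m)
```

In this model the 1 minute step alternates between windows holding two samples and windows holding one, while the 3 minute step always sees two, which fits the idea that the comb shape comes from the interval mismatch.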

steady granite
#

@rustic mauve What do you think about my finding?

past swallow
#

hi @steady granite I think you're onto something about the comb lines. Given the choice between setting the Grafana min step and using a Prometheus subquery, I think min step is slightly better, since subqueries are slightly more expensive (CPU wise) for the Prometheus server. Both achieve the same results.

#

RobbW's issue being discussed above is different. They appear to have overflowing counters on a high bandwidth interface. Some web searching confirms that this is a common problem for some switches but I haven't been able to nail down the exact hardware/software versions yet.

#

@steady granite the natural sort order fix is in the latest nightly if you want to try it out

steady granite
#

I tried to find out if "show interface" shows 32 bit or 64 bit counters, but found no public records. There is a command that gives back the 64 bit counters. Maybe Robb could have a look if these look any different? Could be an option for future implementation changes.

#

`# show interface Ethernet1/15 counters detailed all
Ethernet1/15
64 bit counters:
  0.                  rxHCTotalPkts = 750468813296
  1.                   txHCTotalPks = 807950796352
  2.                rxHCUnicastPkts = 750466156028
  3.              rxHCMulticastPkts = 2622852
  4.              rxHCBroadcastPkts = 34416
  5.                     rxHCOctets = 2734122892481045
  6.                txHCUnicastPkts = 807942544927
  7.              txHCMulticastPkts = 8245269
  8.              txHCBroadcastPkts = 6156
  9.                     txHCOctets = 3305176170729475
  10.             rxTxHCPkts64Octets = 459515993237
  11.        rxTxHCpkts65to127Octets = 222370981269
  12.       rxTxHCpkts128to255Octets = 18010968873
  13.       rxTxHCpkts256to511Octets = 98609473634
  14.      rxTxHCpkts512to1023Octets = 13907164415
  15.     rxTxHCpkts1024to1518Octets = 6519078919
  16.     rxTxHCpkts1519to1548Octets = 6679830591
  17.                rxHCTrunkFrames = 0
  18.                txHCTrunkFrames = 0
  19.                 rxHCDropEvents = 0
  20.                InLayer3Unicast = 0
  21.          InLayer3UnicastOctets = 0
  22.              InLayer3Multicast = 0
  23.        InLayer3MulticastOctets = 0
  24.               OutLayer3Unicast = 0
  25.         OutLayer3UnicastOctets = 0
  26.             OutLayer3Multicast = 0
  27.       OutLayer3MulticastOctets = 0
  28.                 InLayer3Routed = 0
  29.           InLayer3RoutedOctets = 0
  30.                OutLayer3Routed = 0
  31.          OutLayer3RoutedOctets = 0
  32.          InLayer3AverageOctets = 0
  33.         InLayer3AveragePackets = 0
  34.         OutLayer3AverageOctets = 0
  35.        OutLayer3AveragePackets = 0
...`

past swallow
#

FWIW on the switch I tried - the command you pasted gives the same results as what Harvest uses

show interface port-channel1 counters detailed all | grep "tx bytes"
  25.                          tx bytes: = 1954410324776480

vs what Harvest uses

show interface port-channel1 | json-pretty | grep eth_outbytes
            "eth_outbytes": "1954410324776480",
steady granite
#

You have to check the first table that is labeled "64 bit counters:"

past swallow
#

The values in the first table are the same as the values in the 2nd table

#

all 64 bit

steady granite
#

OK. Then I have no clue where the quick rollover should come from. 😁

past swallow
#

it's also interesting that if you run `show interface port-channel1 counters detailed all | json-pretty` instead of `show interface port-channel1 counters detailed all`, you only get one table with no mention of 64 bit

#

but the values are 64 bit

steady granite
#

Yes that's what happens. Some values are shown with different labels but same values.

#

"tx_octets": "8282487798",

#

"eth_outbytes": "8282487798",

#

all in one table

past swallow
#

yep, since Harvest is using show interface | json (via RPC) the values should be 64 bit for versions of NX-OS that support 64 bit counters. I'll be curious what Robb sees on his switch since the PromQL delta query he shared clearly showed the values were overflowing. The most likely cause of overflow is 32 bit counters. It will be interesting to see if his tables have different values for 32 and 64 bit

latent garnet
#

Hi Switching fans, I will check here and let you know what I see. I had thought that 32 bit counters were a thing from the past, but to be honest I don't know what exactly we have here ... it will be whatever is recommended by NetApp for MetroCluster with ONTAP 9.16.x

latent garnet
#

I sent the result via email (I can't access Discord from there). I interpret it to be pretty much the same as what you see (above). I don't see any references to 32 bit counters, just two tables: "64 bit counters" and "All Port Counters"

past swallow
#

thanks RobbW, that's interesting, we'll take a look. If your counters are 64 bit, I don't understand some of the earlier graphs you shared

latent garnet
#

No problem, glad to contribute - even if it is a problem :-). AFAICR another "classic" issue that I have seen is somewhere in the data path a value being treated as signed rather than unsigned i.e. monotonically increasing (well, until it wraps). Dunno if that could be relevant here ...
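
The signed versus unsigned idea can be illustrated with a small sketch. It shows what happens when the bit pattern of an unsigned 64-bit counter is read as a signed 64-bit integer; it says nothing about whether any component in this data path actually does that. The last value is `txHCOctets` from the output pasted earlier in the thread.

```python
import struct


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer's bit pattern as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


just_below = 2 ** 63 - 1           # largest value that survives the reinterpretation
just_above = 2 ** 63               # first value that flips negative
tx_hc_octets = 3305176170729475    # txHCOctets from the pasted counters

print(as_signed64(just_below))
print(as_signed64(just_above))
print(as_signed64(tx_hc_octets) == tx_hc_octets)
```

A counter would have to exceed 2^63 before a signed 64-bit reading went negative, and the values pasted in this thread are far below that, so this failure mode would only matter if a narrower signed type were involved somewhere.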