#Harvest collecting network port stats for a non-existent port?


ripe sluice
#

I got an alert (because I set one up a while ago) for a port that is showing errors. When I go and look, the node doesn't show that port in the NA OS command line, and it doesn't appear when I drop to a shell in the OS layer either.

(network port show)

Node: cfsXXnYYb
                                                                   Ignore
                                             Speed(Mbps)  Health   Health
Port  IPspace  Broadcast Domain  Link  MTU   Admin/Oper   Status   Status
a0a   Default  10.YYY.XXX.0/22   up    1500  -/-          healthy  false
e0M   Default  Default           up    1500  auto/1000    healthy  false
e0c   Default  -                 up    1500  auto/100000  healthy  false
e0d   Default  -                 up    1500  auto/100000  healthy  false
e0e   Cluster  Cluster           up    9000  auto/10000   healthy  false
e0f   Default  -                 down  1500  auto/-       -        false
e0g   Default  -                 down  1500  auto/-       -        false
e0h   Cluster  Cluster           up    9000  auto/10000   healthy  false
e3a   Default  -                 down  9000  auto/-       -        false
e3b   Default  -                 down  9000  auto/-       -        false
e5a   Default  -                 down  9000  auto/-       -        false
e5b   Default  -                 down  1500  auto/-       -        false
e5c   Default  -                 down  9000  auto/-       -        false
e5d   Default  -                 down  1500  auto/-       -        false
14 entries were displayed.

I will add the rest to a comment, clearly 2k isn't enough for this post.

#

bash-5.0$ ifconfig -a | grep e0
e0M: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
e0g: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
e0f: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
e0c: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
e0d: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
laggport: e0c flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
laggport: e0d flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
bash-5.0$ exit

I am beyond confused about what could be causing this. Two nodes in the cluster do have an e0a port (this node isn't one of them), and neither shows any issues on its ports. Is this a virtual port that is identified in a different way that I haven't seen before?

Here is the alert I set up some time ago; it has worked well until this point.

  - alert: filer_nic_errors
    expr: sum by(node,nic,cluster) (netapp_nic_rx_total_errors) > 100
    for: 6m

OS is 9.14.1P5 with all firmware current to that OS level, which was updated a few days ago. The stats for this port changed after an extended downtime event where the power was off. When the nodes booted, this issue appeared and started getting worse (for a non-existent port), even before the OS upgrade.

Harvest version is 24.05.0 and has worked well to this point.

Confused.

ruby dune
#

@ripe sluice This metric comes from the nic_common object.

https://netapp.github.io/harvest/25.05/ontap-metrics/#nic_rx_total_errors

Could you check whether this port exists in the output of the ZAPI call below? instance_name in the output is the port name.
Please replace USER, PASSWORD, and CLUSTER_IP with the appropriate values.

curl -s --connect-timeout 30 --user USER:PASSWORD --insecure --data-ascii '<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.130">
  <perf-object-get-instances>
    <objectname>nic_common</objectname>
    <instances>
      <instance>*</instance>
    </instances>
  </perf-object-get-instances>
</netapp>' -H "Content-Type: text/xml" 'https://CLUSTER_IP/servlets/netapp.servlets.admin.XMLrequest_filer' 2>&1 | tee zapi_nic_common.xml

You can also check via REST if ZAPI is disabled. There, the id field contains the port name.

curl -sk -u USER:PASSWORD 'https://CLUSTER_IP/api/cluster/counter/tables/nic_common/rows?fields=*' 2>&1 | tee  rest_nic_common.json
#

These endpoints only provide physical port stats.
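
Once the REST output is saved, the port names can be pulled out of the id field with a few lines of Python. This is a minimal sketch, not anything Harvest ships: the sample payload, the node names, and the "node:port" layout of id are assumptions for illustration, so check them against the real rest_nic_common.json before relying on it.

```python
import json

# Hypothetical sample shaped like a counter-table rows response.
# The node names and the "node:port" id layout are assumptions.
SAMPLE = """
{
  "records": [
    {"id": "node-a:e0a", "properties": [{"name": "type", "value": "nic_mce"}]},
    {"id": "node-a:e0c", "properties": [{"name": "type", "value": "nic_mce"}]},
    {"id": "node-b:e0M", "properties": []}
  ]
}
"""


def ports_by_node(payload):
    """Group port names by node, splitting each row id on its last colon.

    An id without a colon ends up under an empty node name.
    """
    result = {}
    for record in payload.get("records", []):
        node, _, port = record.get("id", "").rpartition(":")
        result.setdefault(node, set()).add(port)
    return result


if __name__ == "__main__":
    # To use a saved file instead: payload = json.load(open("rest_nic_common.json"))
    for node, ports in sorted(ports_by_node(json.loads(SAMPLE)).items()):
        print(node, sorted(ports))
```

A port that shows up here for a node but not in net port show for that node is the kind of mismatch being asked about.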

ripe sluice
#

So the ports DO exist in nic_common, but my question is: why would the OS not show them when they appear in Harvest (or, for that matter, in nic_common)? Also, these ports are reporting error statistics for something whose use and location I am not clear about.

scarlet gate
#

Not sure why the CLI of ONTAP is not showing them, but the REST/ZAPI interface of ONTAP is. Harvest isn't really in question here - we're just calling the REST/ZAPI API provided by ONTAP and exporting the results.

Do you see these ports in the CLI if you set diag first? Might be worth asking why the REST API returns ports that the CLI does not in the #1063542514780475493 or #1062049169520476220 channel

ripe sluice
#

zapi, I haven't tried rest to see if it does the same or not

scarlet gate
#

that would be a good check too

ripe sluice
#

returns it on both

scarlet gate
#

does the type returned give you any hint?

ripe sluice
#

even with set d, nothing from NA OS other than the two nodes that DO have those ports

#

nic_mce is the type

#

e0c is of the same type and is connected for production use

scarlet gate
#

Do you see the same mysterious port if you use this REST endpoint? curl -sk -u USER:PASSWORD 'https://CLUSTER_IP/api/network/ethernet/ports?fields=**' 2>&1 | tee rest_ports.json

ripe sluice
#

yep

#

normally when a port is unused/used, it appears in the NA OS under net port show

#

here's the system shell that shows the port from dmesg

scarlet gate
#

from the CLI, try these

ripe sluice
#

[1] nvf_probe platform_port_info port_name:e0a psid_valid:0, device_id:0x1017, vendor_id:0x15b3
[1] platform_port_info: mlx5 port e0a, psid:, valid:0, did:0x1017, vid:0x15b3
[1] nvf_probe platform_port_cluster_info port:e0a vendor_id:0x15b3, device_id:0x1017, subdevice_id:0x511
[1] platform_port_cluster_info: mlx5 port:e0a, vid:0x15b3, did:0x1017, subdid:0x511
[17] nvf_probe platform_port_info port_name:e0a psid_valid:0, device_id:0x1017, vendor_id:0x15b3
[17] platform_port_info: mlx5 port e0a, psid:, valid:0, did:0x1017, vid:0x15b3
[17] nvf_probe platform_port_cluster_info port:e0a vendor_id:0x15b3, device_id:0x1017, subdevice_id:0x511
[17] platform_port_cluster_info: mlx5 port:e0a, vid:0x15b3, did:0x1017, subdid:0x511
[17] changing device name from mlx5_core0 to e0a
[18] e0a: mlx5_enable_msix original nvec = 23
[18] e0a: mlx5_enable_msix requested nvec = 23
[18] e0a: mlx5_enable_msix adjusted nvec = 23
[18] e0a: mlx5_enable_msix
[18] e0a: mlx5_main::init_one called for
[22] e0a: mlx5e_set_board_type port_name = e0a port_psid = NAP0000000006, device_id = 0x1017, vendor_id = 0x15b3
[22] platform_port_info: mlx5 port e0a, psid:NAP0000000006, valid:1, did:0x1017, vid:0x15b3
[23] e0a: get_eeprom_data eeprom buf=Molex, size_read=20, xbuf=Molex
[23] e0a: get_eeprom_data eeprom buf=1111455002, size_read=20, xbuf=1111455002
[23] e0a: get_eeprom_data eeprom buf=93A2130520609, size_read=20, xbuf=93A2130520609
[23] e0a: mlx5e_get_cable_info size_read: 1, cable length: 1
[48] e0a: mlx5e_open_locked start
[113] e0a: mlx5e_open_locked start
[113] e0a: mlx5e_open_locked start
[220] e0a: mlx5_fw_version_check CX5 ioctl fw update disabled (e0a)

#

Per those:

ifgrp: a0a port d2:39:ea:17:ac:95 full e0c, e0d
net port show is as above in the beginning of the post
net port vlan show -node cfsXXnYYb
(network port vlan show)
There are no entries matching your query.

scarlet gate
#

in the json output from network/ethernet/ports what is the enabled field for this mysterious port?

ripe sluice
#

{
  "name": "type",
  "value": "nic_mce"
},
{
  "name": "link_current_state",
  "value": "up"
},
{
  "name": "link_media_state",
  "value": "active"
},
{
  "name": "link_duplex",
  "value": "full"
},
{
  "name": "link_speed",
  "value": "10000M"
},
{
  "name": "link_flowcontrol",
  "value": "full"
},
{
  "name": "rss_enabled",
  "value": "yes"
}

#

I would guess "yes"

scarlet gate
#

That does not look like the output from /api/network/ethernet/ports?fields=**. Are you sure you checked the correct json?

ripe sluice
#

Sorry, I had used this:

curl -sk -u USER:PASSWORD 'https://CLUSTER_IP/api/cluster/counter/tables/nic_common/rows?fields=*' 2>&1 | tee rest_nic_common.json

ripe sluice
#

/api/network/ethernet/ports?fields=** doesn't return the port

#

only the two nodes that have one are listed
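
The mismatch between the two endpoints can be checked mechanically by diffing the (node, port) pairs from each. A rough sketch follows, assuming the ports records carry name and node.name fields and that the counter-table pairs have already been extracted; all sample values are made up.

```python
def configured_from_records(records):
    """Extract (node, port) pairs from network/ethernet/ports style records.

    Assumes each record has a "name" field and a nested "node" object with
    its own "name" field.
    """
    return {(r["node"]["name"], r["name"]) for r in records}


def phantom_ports(counter_ports, configured_ports):
    """Return sorted (node, port) pairs that have counters but no configured port."""
    return sorted(set(counter_ports) - set(configured_ports))


if __name__ == "__main__":
    # Hypothetical data: node-b reports counters for e0a but has no such port.
    counters = {("node-a", "e0a"), ("node-b", "e0a"), ("node-b", "e0c")}
    ports = [
        {"name": "e0a", "node": {"name": "node-a"}},
        {"name": "e0c", "node": {"name": "node-b"}},
    ]
    print(phantom_ports(counters, configured_from_records(ports)))
```

Anything this prints is a port that nic_common reports on but the port inventory does not list.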

#

is that something available under ZAPI as well? I could use that as a qualifier for my query so that it only alerts with active ports

#

I think I found it

#

net_port_status

#

Ok, so I think I found a work around for this issue:

Whether or not the port exists in nic_rx_total_errors, I can also filter on net_port_status.

#

sum by(node,nic,cluster) (nic_rx_total_errors) > 10 and on (node,nic,cluster) net_port_status

#

Using that, it filters out the port that isn't really there
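
For anyone unsure why this works, here is a small Python model of what "and on (...)" does in PromQL: it keeps a left-hand sample only if some right-hand sample has the same values for the listed labels. It is an illustration rather than how Prometheus is implemented, it assumes both metrics carry the same node, nic, and cluster labels as in the query above, and every sample value is invented.

```python
def and_on(left, right, labels):
    """Keep left samples whose values for `labels` match some right sample.

    Each sample is a (labels_dict, value) pair. Right-hand values are
    ignored, as with PromQL's `and`; only the label match matters.
    """
    def key(sample_labels):
        return tuple(sample_labels.get(name, "") for name in labels)

    right_keys = {key(lbls) for lbls, _ in right}
    return [(lbls, value) for lbls, value in left if key(lbls) in right_keys]


def above(samples, threshold):
    """Model the `> threshold` comparison filter on an instant vector."""
    return [(lbls, value) for lbls, value in samples if value > threshold]


if __name__ == "__main__":
    def port(nic):
        # Hypothetical label set for one port on one node.
        return {"cluster": "c1", "node": "n1", "nic": nic}

    errors = [(port("e0a"), 500.0), (port("e0c"), 3.0), (port("e0d"), 40.0)]
    status = [(port("e0c"), 1.0), (port("e0d"), 1.0)]  # no series for e0a
    print(and_on(above(errors, 10), status, ["node", "nic", "cluster"]))
```

In this made-up data only e0d survives: e0a is dropped because it has no net_port_status series, and e0c is dropped by the threshold. Since the right-hand value is ignored, a down port with a status of 0 would still pass the filter; comparing the right-hand side with == 1 is one way to restrict the alert to up ports, if that is wanted.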

#

seems weird to me that nic_rx_total_errors and net_port_status differ in any way when it comes to ports

scarlet gate
#

Maybe an ONTAP bug then. Like I mentioned earlier, you have all the REST queries to ask the question in the ontap channel.

Yes, net_port_status will have a value of 0 for down ports and a value of 1 for up ports. In your case, it sounds like the mystery port won't show up in that metric

ripe sluice
#

We are not yet on REST but plan to be eventually. I know that is a bad thing as we move forward, but we are just now getting to 9.14, which is well supported by REST, versus the 9.10 that we were on before.