#Any way to show cluster/node uptime via Harvest/NAbox?
hi @civic ravine does this work? https://netapp.github.io/harvest/nightly/ontap-metrics/#node_uptime
also shown on ONTAP: Node dashboard
Hi Chris, thanks for the fast reply!
But this only shows the current uptime of the node right?
Basically what you get with:
cl3::> node show -fields uptime
node uptime
----- ------------
cl3n1 3 days 02:39
cl3n2 3 days 03:03
2 entries were displayed.
So if, for example, the first node already had two downtimes in February and June before crashing again 13 days ago, that would not be visible. You only see how long it has been up since the last downtime.
From the last post here (https://community.netapp.com/t5/Active-IQ-Unified-Manager-Discussions/Uptime-report-using-OCI-or-OCUM/td-p/148264) it looks like you can also get that as a graph.
But it would be much better to see how often Harvest polls a NetApp cluster for data and does not get anything back. So for example: during 2023 Harvest polled the NetApp every 5 min (not sure what the default interval currently is) and 0.1% of those polls failed.
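To put numbers on that example, here is a minimal Python sketch of the arithmetic. The 5-minute interval and the 0.1% failure rate are the hypothetical figures from the message above, not confirmed Harvest defaults, and the function names are made up for illustration.

```python
def expected_polls(days: int, interval_minutes: int) -> int:
    """Number of polls expected over `days` at a fixed polling interval."""
    return days * 24 * 60 // interval_minutes


def availability_percent(expected: int, failed: int) -> float:
    """Share of polls that returned data, as a percentage."""
    return 100.0 * (expected - failed) / expected


# Hypothetical figures from the example: one year, 5-minute polls, 0.1% failures.
polls_2023 = expected_polls(days=365, interval_minutes=5)
failed_polls = round(polls_2023 * 0.001)

print(polls_2023, failed_polls, availability_percent(polls_2023, failed_polls))
```

With these assumed figures, a 0.1% failure rate corresponds to roughly a hundred failed polls over the year.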
Just like every monitoring software does it, like PRTG for example: https://www.paessler.com/uptime_monitoring#visualisation
If it's not possible, that's also ok. I guess the customer needs something like Cloud Insights, etc. (AIQUM unfortunately does not provide this info...)
They were hoping to get that info for the last year since AIQUM and NAbox have been running for years already and the NetApp clusters are not in any other monitoring software (they rely on EMS and ASUPs).
yes, that's right. There's also node_new_status and cluster_new_status but they're not quite right either since they're tied to "health". Then there is metadata_target_ping and metadata_target_status which are similar but not quite right either.
You could also use node_uptime since it's monotonically increasing, but that doesn't quite account for times that you restart Harvest itself. For example, the first circled chunk on the left is when the pollers were restarted. The circled area on the right is when a node was restarted.
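As a rough illustration of the node_uptime idea, the Python sketch below counts how often an exported series of uptime samples drops, which is what a node restart looks like in a value that otherwise only increases. The sample list and the function name are hypothetical, and this is only a sketch under that assumption. A Harvest restart shows up as a gap in the samples rather than a drop, so it is not counted here. PromQL's resets() function applies similar logic to a range vector.

```python
from typing import Sequence


def count_restarts(uptime_samples: Sequence[float]) -> int:
    """Count how many times the uptime value decreases between consecutive samples.

    A decrease means the uptime counter started over, i.e. the node restarted
    between the two samples. Missing samples (gaps) are not detected.
    """
    restarts = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:
            restarts += 1
    return restarts


# Hypothetical samples in seconds, one per minute; the node restarts once.
samples = [100, 160, 220, 30, 90, 150]
print(count_restarts(samples))
```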
Let me think about it and get back to you
Thanks a lot! No hurry though
Hi @civic ravine, we discussed this today and came up with the following idea.
What about a Prometheus query like this:
100 - 100*(sum by (cluster, datacenter, node) (count_over_time(health_node_alerts[120d])) / (120 * 24 * 60 * 60 / 60))
At a high level, the query computes an uptime percentage: it counts the number of times a node was reported down, divides that by the number of times the metric should have been scraped, and subtracts the resulting percentage from 100.
In detail:
The numerator looks back 120 days (this can be changed, but if you do, also make sure you change the 120 constant later in the expression) and counts the number of times the health_node_alerts metric was recorded. This metric is recorded when Harvest polls ONTAP for node show -health false. That CLI command will return a record whenever a cluster's node is unhealthy.
The denominator calculates the number of times that Prometheus should scrape Harvest in 120 days. This assumes that your scrape_interval is 1m. Adjust the final 60 if you have a different scrape_interval.
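Because the look-back window appears twice in the query and the scrape interval once, it is easy to change one constant and forget the other. The Python sketch below mirrors the same arithmetic and builds the query string from two parameters so that they stay in sync. The function names are hypothetical helpers, not part of Harvest or Prometheus, and the defaults simply repeat the 120 days and 1m scrape_interval assumed above.

```python
def expected_scrapes(window_days: int, scrape_interval_seconds: int) -> float:
    """Denominator: how many scrapes should happen within the window."""
    return window_days * 24 * 60 * 60 / scrape_interval_seconds


def uptime_percent(alert_samples: int, window_days: int = 120,
                   scrape_interval_seconds: int = 60) -> float:
    """Same arithmetic as the query: 100 - 100 * (alert samples / expected scrapes)."""
    return 100 - 100 * (alert_samples / expected_scrapes(window_days, scrape_interval_seconds))


def build_query(window_days: int = 120, scrape_interval_seconds: int = 60) -> str:
    """Build the PromQL expression with the window used consistently in both places."""
    return (
        "100 - 100*(sum by (cluster, datacenter, node) "
        f"(count_over_time(health_node_alerts[{window_days}d])) / "
        f"({window_days} * 24 * 60 * 60 / {scrape_interval_seconds}))"
    )


# Hypothetical case: 1728 unhealthy samples in 120 days at a 1m scrape_interval.
print(uptime_percent(1728))
print(build_query(window_days=30, scrape_interval_seconds=60))
```

With the defaults, build_query() reproduces the query as posted; passing a different window or interval changes every affected constant at once.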
This Prometheus query still has the limitation that you will miss data when Harvest is not running.
Perhaps a more accurate way to calculate this would be to analyze ONTAP logs that include node restart/up/down messages? Not sure if AIQ does that already or not. It might be worth asking in the AIQ channel too.