#Any way to show cluster/node uptime via Harvest/NAbox?

civic ravine
#

Customer asks for a yearly report of system uptime. Ideally it would be expressed in % or in minutes/seconds, so they can check whether a given availability SLA was met.

Since Harvest already collects data from the cluster constantly I was hoping for some sort of availability monitoring of the monitored endpoint (cluster-mgmt LIF)

valid plume
civic ravine
#

Hi Chris, thanks for the fast reply!

But this only shows the current uptime of the node right?
Basically what you get with:

cl3::> node show -fields uptime
node  uptime
----- ------------
cl3n1 3 days 02:39
cl3n2 3 days 03:03
2 entries were displayed.

So if, for example, the first node already had two downtimes in February and June before crashing again 13 days ago, that would not be visible. You only see how long it has been up since the last downtime.

#

From the last post here (https://community.netapp.com/t5/Active-IQ-Unified-Manager-Discussions/Uptime-report-using-OCI-or-OCUM/td-p/148264) it looks like you can also get that as a graph.
But it would be much better to see how often Harvest polls a NetApp cluster for data and does not get anything back. For example: during 2023 Harvest polled the NetApp every 5 min (not sure what the default interval currently is) and 0.1% of those polls failed.
Just like every monitoring software does it, like PRTG for example: https://www.paessler.com/uptime_monitoring#visualisation
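To make the poll-based idea concrete, here is a minimal arithmetic sketch in Python. It is not something Harvest provides: the function name `availability_from_polls` is hypothetical, and the 5-minute interval and the 0.1% failure rate are just the example figures from the message above.

```python
from datetime import timedelta


def availability_from_polls(total_polls, failed_polls, interval):
    """Return (availability in percent, estimated downtime).

    Assumption: every failed poll stands for one full poll interval of
    downtime. That is a coarse approximation, not a measured outage length.
    """
    if total_polls <= 0:
        raise ValueError("total_polls must be positive")
    if not 0 <= failed_polls <= total_polls:
        raise ValueError("failed_polls must be between 0 and total_polls")
    availability = 100.0 * (total_polls - failed_polls) / total_polls
    downtime = failed_polls * interval
    return availability, downtime


# Example figures from the thread: one poll every 5 minutes during 2023,
# with 0.1% of the polls failing.
interval = timedelta(minutes=5)
polls_in_2023 = int(timedelta(days=365) / interval)
failed = round(polls_in_2023 * 0.001)

pct, downtime = availability_from_polls(polls_in_2023, failed, interval)
print(f"{polls_in_2023} polls, {failed} failed: {pct:.3f}% available, ~{downtime} down")
```

With those figures a 0.1% failure rate corresponds to roughly 8 to 9 hours of downtime per year, which is the kind of number an SLA report would need.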

#

If it's not possible that's also ok. I guess customer needs something like Cloud Insights, etc. (AIQUM unfortunately does not provide this info...)
They were hoping to get that info for the last year since AIQUM and NAbox have been running for years already and the NetApp clusters are not in any other monitoring software (they rely on EMS and ASUPs).

valid plume
#

yes, that's right. There's also node_new_status and cluster_new_status but they're not quite right either since they're tied to "health". Then there is metadata_target_ping and metadata_target_status which are similar but not quite right either.

You could also use node_uptime since it's monotonically increasing, but that doesn't quite account for times when you restart Harvest itself. For example, the first circled chunk on the left is when the pollers were restarted. The circled area on the right is when a node was restarted.
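As a rough illustration of telling those two cases apart, here is a small Python sketch that works on exported samples. The function name `classify_uptime_samples`, the `(timestamp, uptime)` tuple layout, and the gap threshold are all assumptions made for the example; this is not a Harvest or Prometheus API.

```python
def classify_uptime_samples(samples, scrape_interval=60.0, gap_factor=2.0):
    """Count node restarts and collection gaps in a node_uptime series.

    samples: list of (timestamp_seconds, uptime_seconds), sorted by timestamp.
    A drop in uptime is treated as a node restart. A long pause between two
    samples while uptime kept growing is treated as a collection gap, for
    example a poller restart.
    """
    restarts = 0
    gaps = 0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        if u1 < u0:
            restarts += 1  # uptime went backwards: the node rebooted
        elif t1 - t0 > gap_factor * scrape_interval:
            gaps += 1  # samples are missing but the node stayed up
    return restarts, gaps


# Hypothetical data: a collection gap between t=120 and t=600,
# then a node reboot between t=660 and t=720.
samples = [
    (0, 1000), (60, 1060), (120, 1120),
    (600, 1600), (660, 1660),
    (720, 30), (780, 90),
]
print(classify_uptime_samples(samples))
```

One limitation of this sketch: a reboot that happens inside a long collection gap is only noticed if uptime is lower after the gap than before it.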

Let me think about it and get back to you.

civic ravine
#

Thanks a lot! No hurry though.

valid plume
#

Hi @civic ravine, we discussed this today and came up with the following idea.

What about a Prometheus query like this:

100 - 100*(sum by (cluster, datacenter, node) (count_over_time(health_node_alerts[120d])) / (120 * 24 * 60 * 60 / 60))

At a high level, the query computes an uptime percentage by dividing the number of times a node was reported down by the number of times the metric should have been scraped, and subtracting that share from 100.

In detail:
The query looks back 120 days (this can be changed, but if you do, make sure you also change the 120 constant later in the expression) and counts the number of times the health_node_alerts metric was recorded. This metric is recorded when Harvest polls ONTAP for node show -health false. That CLI command returns a record whenever one of the cluster's nodes is unhealthy.

The denominator calculates the number of times that Prometheus should have scraped Harvest in 120 days. This assumes that your scrape_interval is 1m. If you use a different scrape_interval, adjust the final divisor (60, the scrape interval in seconds) to match.
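The same arithmetic can be checked outside Prometheus. The Python sketch below mirrors the expression above; the function names and the sample count of 1728 are made up for illustration.

```python
def expected_scrapes(window_days, scrape_interval_seconds=60):
    """Number of scrapes Prometheus should have made in the window."""
    return window_days * 24 * 60 * 60 / scrape_interval_seconds


def uptime_percent(unhealthy_samples, window_days, scrape_interval_seconds=60):
    """Mirror of 100 - 100 * (unhealthy samples / expected scrapes)."""
    total = expected_scrapes(window_days, scrape_interval_seconds)
    return 100 - 100 * unhealthy_samples / total


# 120 days at a 1m scrape_interval gives 172800 expected scrapes.
print(expected_scrapes(120))

# Hypothetical: health_node_alerts was recorded in 1728 of those scrapes.
print(uptime_percent(1728, 120))

# With a 30s scrape_interval the denominator doubles.
print(expected_scrapes(120, scrape_interval_seconds=30))
```

Keeping the window and the scrape interval as parameters makes it harder to change one of them and forget the other, which is the risk with the hard-coded constants in the query.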

This Prometheus query still has the limitation that it misses data from periods when Harvest is not running, so any downtime during those periods is not counted.

Perhaps a more accurate way to calculate this would be to analyze ONTAP logs that include node restart/up/down messages? Not sure if AIQ does that already or not. It might be worth asking in the AIQ channel too.