#Metrics for check status of PSU and if any node is in takeover state

blazing breach
#

Is there a metric that shows whether a PSU has power or has failed? I would also like to know the same for a takeover.

livid wasp
blazing breach
#

Hello @livid wasp
Yes, I have seen the metric "node_failed_power", but I was wondering whether there is more specific information, since I found more in the metrics "health_support_metrics" and "environment_sensor_status". These metrics include information about the threshold, sensor, etc., but I don't really know how to tell whether there is a problem.
In the example in my screenshot, discrete_value says MULTIFAULT but discrete_state is normal; admittedly, it was a managed power outage.
If I use type = fru as a filter, I get information about PSUs and fans (which could be good). But in the end my biggest doubt is: what filter do I need to detect a failure using those metrics?
In addition, I have to map the information I see in Prometheus to InfluxDB, since our alerting stack uses InfluxDB as its database.

Thanks for your invaluable help! 😉

livid wasp
#

Hello @blazing breach is the goal to alert on node status or PSU status or both? We have this issue that we're picking up to alert on ha pair going down. https://github.com/NetApp/harvest/issues/2315 Would that address your ask?

blazing breach
#

We are trying to define new alerts from different perspectives but mainly using the information that we have in our DB.

  • Hardware -> Power and fan (we have disk alerts implemented too)
  • Service -> Takeover, panic of a node, or other issues that can impact the service (they do not necessarily impact it, but it can happen).

Thanks chris, the issue you shared is interesting for me, so I subscribed. Also, I am not sure whether I should open the new request you mentioned earlier about node takeover, because it seems it will be covered by the issue for alerting on the status of the HA pair (maybe that is enough).

livid wasp
#

thanks for the details. Excellent, yeah I think we can try addressing these in 2315 and later create a new issue if there are any gaps

blazing breach
native shoal
#

@blazing breach I have updated the issue with an approach for the HA pair
https://github.com/NetApp/harvest/issues/2315#issuecomment-1829942815
Please let us know your feedback if it covers your use case.

Regarding sensor issues, Harvest shows a panel in the Power dashboard that displays problematic sensors, as in the attached screenshot. The Prometheus query we utilize for this is environment_sensor_threshold_value{threshold_state != "normal"}. Alternatively, we can also use the query environment_sensor_status{threshold_state!="normal"} for the same.
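
For anyone who wants to run the second query outside Grafana, here is a minimal sketch against the Prometheus HTTP API. It only builds the instant-query URL and reads the label sets out of a response. The host, the sample response, and the label names in it (node, sensor) are assumptions for illustration, not actual Harvest output.

```python
import json
from urllib.parse import urlencode

# Query quoted in the message above.
QUERY = 'environment_sensor_status{threshold_state!="normal"}'


def build_query_url(base_url, query=QUERY):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


def problem_sensors(response_body):
    """Return the label set of every series in an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    return [series["metric"] for series in payload["data"]["result"]]


if __name__ == "__main__":
    # Hand-written sample in the Prometheus response shape; label values are made up.
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {
                        "node": "node-01",
                        "sensor": "PSU1 FAULT",
                        "threshold_state": "warn_low",
                    },
                    "value": [1700000000, "0"],
                }
            ],
        },
    })
    print(build_query_url("http://localhost:9090"))
    for labels in problem_sensors(sample):
        print(labels)
```

Any series that comes back is a sensor whose threshold_state is not "normal", so an empty result means nothing to report.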

blazing breach
#

@native shoal Thanks for sharing 😀 , I was following the thread too. From my perspective the scenario is correct (checking whether takeover is possible and the status of all nodes), but I would like the measurement to carry more information than takeover possible != true. I don't know if you could add, for example, the partner of a node and a description.

Regarding the power and fan information, I will try to explain our use case: for monitoring we use the Prometheus DB, but for alerting we must use InfluxDB, since the corporate alerting stack uses that DB. So we are inserting the same information into both databases, and the way we query them is different. Having said that, I have checked different dashboards and the information in the Prometheus DB, but I still need to understand how it correlates with the information in InfluxDB. For now I am using status_code < 1 as the alert condition, but I am not sure whether this is correct, since I can't test it. In Influx, a variable used as a condition has to be a field.

native shoal
#

@blazing breach Yes, it's possible to include partner node information. There is an edge case where this information is unavailable: when both nodes of the HA pair are down. This is because ONTAP does not provide this information through the storage failover show command. We can add the description as well. I'll add these in the PR.

Regarding Influx, your check for status_code < 1 is correct and aligns with the Prometheus query environment_sensor_threshold_value{threshold_state != "normal"} check related to health of a sensor.
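
As a rough illustration of that check on the InfluxDB v1 side, the sketch below builds an InfluxQL query for it and applies the same condition to points that were already fetched. The measurement name environment_sensor and the field status_code come from this thread; the 5m window and the function names are assumptions to adapt to your setup.

```python
def build_sensor_alert_query(measurement="environment_sensor",
                             field="status_code",
                             window="5m"):
    """Return an InfluxQL query selecting recent points where the field is below 1."""
    return (
        f'SELECT "{field}" FROM "{measurement}" '
        f'WHERE "{field}" < 1 AND time > now() - {window} '
        "GROUP BY *"
    )


def failing_points(points, field="status_code"):
    """Apply the same condition to already-fetched points (a list of dicts)."""
    return [p for p in points if p.get(field) is not None and p[field] < 1]


if __name__ == "__main__":
    print(build_sensor_alert_query())
    # Made-up points: only the one with status_code below 1 is kept.
    print(failing_points([{"status_code": 0}, {"status_code": 1}]))
```

GROUP BY * keeps one series per tag set, so each matching row can still be traced back to its sensor.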

native shoal
blazing breach
#

Hello,

We have been using environment_sensor (a measurement in InfluxDB v1) for a while to alert on failed PSUs. Some days ago we got alerts from some systems but not from others that had the same event. When I checked which clusters appear in the database, some of them do and others don't. I then reviewed the Harvest configuration, and all the clusters are included in the YAML file and in its InfluxDB section. Could you help me find out why the information isn't being inserted?

native shoal
#

@blazing breach Could you please check if there are any errors in Harvest logs?

blazing breach
native shoal
blazing breach
native shoal
livid wasp
#

hi @blazing breach earlier you mentioned that you would share via email which clusters are missing. When you get a chance, please do

livid wasp
#

environment_sensor metrics are collected from the Sensor template and five of your 18 clusters are not running the Sensor collector. I'm guessing that my list of five matches your list of clusters not in the database. I assume all of your clusters are running from the same Harvest install? Can you run bin/harvest doctor and send us the output at ng-harvest-files@netapp.com?

Let's capture debug logs and upload them to https://upload.nabox.org/dafu-sigi-noqi by doing this:
  • cd to the Harvest install directory
  • replace $name with the name of a poller that is not reporting environment_sensor metrics
  • replace $port with a free port

bin/poller --poller $name --promPort $port --loglevel 1 2>&1 | tee debug.log

After that runs for 10 minutes or so, press Ctrl-C then gzip debug.log and upload debug.log.gz

blazing breach
#

hello @livid wasp
This morning I sent you by email the output of the command:

harvest doctor

In addition, I have now uploaded the debug.log from one of the pollers that has the problem gathering information. If you need me to share the logs from all the affected pollers, let me know.
Thanks.

blazing breach
#

Hello,
We had a takeover in the last few days and it reminded me of the metric health_ha. This metric only appears when a takeover happens, but it doesn't have a recovery metric. Do you plan to implement something to show when a recovery occurs?

native shoal
blazing breach
native shoal
#

It is enabled by default. Once the issue resolves, there should be a metric health_ha_alerts == 0 in the DB.

blazing breach
#

I checked both of my DBs (Prometheus and InfluxDB) and yes, I found health_ha_alerts == 1 in the Prometheus DB when the panic happened, but I don't fully understand the behaviour. In my screenshot, the orange, blue and red lines show that state 1 was reported for the node opposite to the one that restarted; about 10 minutes later the yellow line, which corresponds to the node that restarted, also appears with state 1.
Does that make sense? To me it isn't clear on which node the event occurred unless I check the events on the storage array.

#

In InfluxDB I have seen that both nodes appear with health_ha == 1 at the same time, and about 5 minutes later the node whose name ends in '2' recovers (this node did not panic). Then, 8 minutes after that, both nodes have state 0.
I am not seeing the same information in both DBs, and I am not sure how to interpret it in order to define a correct alert that indicates which node of the HA pair was down.

native shoal
#

@blazing breach Value 1 is published for nodes for which takeover is not possible. Therefore, it makes sense that an alert is generated for nodes that are not down. Harvest uses the API below to detect nodes where the value of the possible field is not true. Once the down nodes come back up, value 0 is published, indicating that the issue is resolved. I believe the possible value for both nodes may change while they reach a stable state, which would mean that both are online but the possible flag is false for both, resulting in health_ha_alerts == 1 for both nodes.

Your alerting logic should be based on health_ha_alerts == 1 for a relevant node, and once health_ha_alerts == 0 is published for that node, the issue is considered resolved.

REST Call:

api/private/cli/storage/failover?fields=possible,partner_name,state_description,partner_state&possible=!true

ONTAP CLI:

storage failover show -fields possible,partner-name,state-description,partner-state 
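
To make the firing/resolved logic above concrete, here is a minimal sketch of how an alerting layer could turn a stream of health_ha_alerts samples into events. The (node, value) input shape, the node names, and the function name are assumptions for illustration; this is not how Harvest itself is implemented.

```python
def ha_alert_transitions(samples):
    """Turn time-ordered (node, value) samples into firing/resolved events.

    A node starts firing on its first sample with value 1 and is resolved
    on the first later sample with value 0. Repeated values produce no event.
    """
    firing = set()
    events = []
    for node, value in samples:
        if value == 1 and node not in firing:
            firing.add(node)
            events.append((node, "firing"))
        elif value == 0 and node in firing:
            firing.discard(node)
            events.append((node, "resolved"))
    return events


if __name__ == "__main__":
    # Made-up sequence: both nodes report 1, node-02 clears first, node-01 later.
    samples = [
        ("node-01", 1),
        ("node-02", 1),
        ("node-02", 0),
        ("node-01", 1),
        ("node-01", 0),
    ]
    for event in ha_alert_transitions(samples):
        print(event)
```
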
blazing breach
#

If I understand you correctly, we will always receive a false positive alert, because at first both nodes appear as DOWN, which may not be correct, and only some time later does one of the nodes recover, inserting a 0 in health_ha_status. I will think about whether there is a way, from the alerting side, to avoid the false positive for the partner node that did not undergo a takeover or panic.
I understand that you designed it this way because there isn't a better way to do it. At the moment a failover occurs there is probably no information about the node that was not taken over, which is why a 1 is inserted.
Thanks for the information @native shoal

native shoal
#

The health_ha_alerts metric is dependent on the field value of Takeover Possible in the ONTAP CLI command:

storage failover show

If the value of this field is false or empty for any node, we raise an alert. This means that a "takeover not possible" alert is raised for both nodes whenever ONTAP reports it for both, so these are not false positives. The alert is published as health_ha_alerts == 1.

I have observed that momentarily, both nodes in an HA pair are shown as unhealthy for a few seconds, as demonstrated below. Are you referring to these as false positives? Harvest will raise an alert for both these nodes, resulting in health_node_alerts == 1. This seems to be an ONTAP issue where both nodes are reported as unhealthy for a few seconds, even though only one node has gone down.
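
If those short-lived alerts on the healthy node are the concern, one option on the alerting side is to fire only when the value stays at 1 for a minimum hold time. The sketch below shows the idea under stated assumptions: the (timestamp, node, value) input shape, the 60-second default, and the function name are all hypothetical and would need tuning against real data.

```python
def sustained_alerts(samples, hold_seconds=60):
    """Return (node, timestamp) pairs for alerts that stayed at 1 long enough.

    samples: iterable of (timestamp_seconds, node, value), sorted by time.
    A node is reported once per uninterrupted run of 1s, at the first sample
    where the run has lasted at least hold_seconds. Any other value resets it.
    """
    run_start = {}     # node -> timestamp where the current run of 1s began
    confirmed = set()  # nodes already reported for their current run
    alerts = []
    for ts, node, value in samples:
        if value != 1:
            run_start.pop(node, None)
            confirmed.discard(node)
            continue
        start = run_start.setdefault(node, ts)
        if node not in confirmed and ts - start >= hold_seconds:
            confirmed.add(node)
            alerts.append((node, ts))
    return alerts


if __name__ == "__main__":
    # Made-up data: node-b flips back to 0 after one sample, node-a stays at 1.
    samples = [
        (0, "node-a", 1), (0, "node-b", 1),
        (30, "node-a", 1), (30, "node-b", 0),
        (60, "node-a", 1), (60, "node-b", 0),
    ]
    print(sustained_alerts(samples))
```

The trade-off is a delayed alert for the node that is really affected, so the hold time should be no longer than the transient you want to ignore.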