#HA events not surfacing
Hi @lucid quarry, do you mean the callhome.hainterconnect.down EMS event? https://github.com/NetApp/harvest/blob/0433bdc01d656f08dd38e37b015f9c0ca1b914df/conf/ems/9.6.0/ems.yaml#L256
If so, make sure you have the Ems collector included in the list of collectors for the poller in question. If not, can you clarify which HA event you mean?
I have -Ems enabled in my /etc/nabox/harvest/harvest.yml. I'm trying to figure out what else I need to enable to get EMS events into NABox and Harvest so I can use them in my dashboard.
For example, last night I had a node failover, but the cluster health did not show it this morning.
That should be all you need. Harvest will only publish events detected since it was restarted. Let's check your log files and make sure the collector is running. SSH into your NABox and run: dc logs havrest | grep Ems:Ems
Also check VictoriaMetrics for the ems_events. You may need to look across a longer range
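For anyone following along, the "longer range" check can also be done outside Grafana through the Prometheus-compatible query API that VictoriaMetrics exposes. The sketch below only builds the request URL. The base URL (localhost:8428, the VictoriaMetrics single-node default), the 72-hour window, and the function name are assumptions for illustration, and a NABox install may expose VictoriaMetrics at a different address.

```python
import time
from urllib.parse import urlencode


def ems_range_url(base_url, hours=72, step="5m", now=None):
    """Build a Prometheus-compatible query_range URL for the ems_events metric.

    base_url, hours, and step are illustrative assumptions; adjust to your setup.
    """
    end = int(now if now is not None else time.time())
    start = end - hours * 3600  # look back `hours` hours from `end`
    params = {"query": "ems_events", "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Assumed address; paste the printed URL into curl or a browser.
print(ems_range_url("http://localhost:8428", hours=72))
```

An empty "result" list in the response would suggest no EMS events were published in that window, which fits the note above that Harvest only publishes events detected since it was restarted.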
When you say "the cluster health", are you referring to a metric, a dashboard panel, or an EMS event?
Do you mean this dashboard?
The collector appears to be running; I see collector=Ems:Ems output from the above command. I was thinking that cluster_new_status{datacenter=~"$Datacenter",cluster=~"$Cluster"}[24h] would show an unhealthy event for a cluster within the last 24 hours, but it didn't.
Yes, I'm not getting most of those events in the ONTAP Health dashboard
I see Network ports down; everything else is 0.
what do these show?
dc logs havrest | grep Rest:Health
dc logs havrest | grep numNodeAlerts | grep -v 'numNodeAlerts=0'
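If the log output has been saved to a file or transcribed, the second grep pipeline can be approximated in a few lines of Python. This is a rough sketch that assumes the logs carry a numNodeAlerts=<number> field, as the grep above implies. The function name and the sample lines are hypothetical and not real Harvest output.

```python
import re


def nonzero_node_alerts(lines):
    """Return lines whose numNodeAlerts field is present and not zero.

    Roughly mirrors: grep numNodeAlerts | grep -v 'numNodeAlerts=0'
    """
    pattern = re.compile(r"numNodeAlerts=(\d+)")
    hits = []
    for line in lines:
        match = pattern.search(line)
        if match and int(match.group(1)) != 0:
            hits.append(line)
    return hits


# Hypothetical sample lines, for illustration only.
sample = [
    "collector=Rest:Health numNodeAlerts=0",
    "collector=Rest:Health numNodeAlerts=2",
    "collector=Ems:Ems",
]
for line in nonzero_node_alerts(sample):
    print(line)
```

No matching lines would mean the Health collector did not count any node alerts during the period the logs cover.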
Filtering Rest:Health, I see an error level event for the cluster in question at about the right time, but then I ran a grep on the cluster and 'error'. I am seeing 403, permission denied... I wonder if I left something out of my RBAC
sounds like that's probably the case - if you copy/paste I can confirm or you can reapply the RBAC from https://netapp.github.io/harvest/25.11/prepare-cdot-clusters/#rest-least-privilege-role
But, I'm not getting any stats for any clusters... Let me check RBAC, thanks, I'll get back to you
Am I remembering correctly that you can not share logs?
I can transcribe, but I can't download them from the dark side
Transcribe means read and type
😉
It looks like I'm collecting now. I will monitor to see what events I get; thanks for the help, although I may get back to you. 😉
I'm about to head out for a week, but when you see this and have time to respond, that would be great. I enabled the rest-role, and I can see EMS events being collected. But my ONTAP Health dashboard still does not seem right. I disabled HA on a cluster, re-enabled it, and actually forced a takeover. I don't see any "Node Down" on the Health panel. I also used a query, cluster_new_status{datacenter,cluster}[24h], to see if I could catch the event, but it isn't coming up.
Have a great gobble day turkeys...
What if you try health_node_alerts? You have a good Thanksgiving too!
That didn't work either. However, after another storage failover on a test box, I noticed that the ONTAP Health dashboard showed Node Down "1" and HA Down "2". What I don't understand is that after I brought the node back into the cluster, I saw HA Down "4" and Node Down "1". A few minutes after re-enabling HA, I see HA Down "2" and Node Down "0". So I used this dashboard as a guide to set up my summary wallboard, using:
I created three queries: count(health_ha_alerts{severity="error"}), count(health_node_alerts{severity="error"}), and count(environment_sensor_threshold_value{threshold_state!="normal"}). In the value mappings, I am substituting 'ok' for 0 and 'nok' for 1.
What I am not sure of are the values to monitor for. On one cluster, "ha,node,env" showed nok, ok, ok; it turned out that auto-giveback after takeover was disabled. Now my status is 'ok', 'ok', 'ok', but I don't understand where the numbers are coming from.
@lucid quarry An issue related to HA on single-node systems was fixed recently. See if you are hitting that, and please upgrade to 25.11.
https://github.com/NetApp/harvest/issues/3875
You want to check for a value of 1 for health_ha_alerts: https://netapp.github.io/harvest/latest/ontap-metrics/#health_ha_alerts
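As a sketch of how that check could drive an ok/nok wallboard tile, the function below inspects a Prometheus-style instant query response for health_ha_alerts and reports 'nok' when any sample has the value 1. The function name, the label names, and the sample payload are hypothetical; only the response layout (data.result[].value as a [timestamp, "value"] pair) follows the standard Prometheus HTTP API format.

```python
def ha_status(response):
    """Return 'nok' if any health_ha_alerts sample equals 1, otherwise 'ok'."""
    samples = response.get("data", {}).get("result", [])
    for sample in samples:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) == 1.0:
            return "nok"
    # No samples, or none equal to 1, is treated as healthy in this sketch.
    return "ok"


# Hypothetical response for illustration; label names are assumptions.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "cluster-a", "node": "node-1"},
             "value": [1700000000, "1"]},
        ],
    },
}
print(ha_status(example))
```

One thing to keep in mind with the count(...) approach: in PromQL, count() over a selector that matches no series returns an empty result rather than 0, so a panel may show "No data" instead of the mapped 'ok' unless the empty case is handled.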