#HA events not surfacing

lucid quarry
#

Trying to figure out how to get ONTAP HA events to show up in NABox v4.0.1.0 with Harvest 25.05.1; not sure what I'm not enabling... 😉

lucid sundial
#

Hi @lucid quarry, do you mean the callhome.hainterconnect.down EMS event? https://github.com/NetApp/harvest/blob/0433bdc01d656f08dd38e37b015f9c0ca1b914df/conf/ems/9.6.0/ems.yaml#L256

If so, make sure you have the Ems collector included in the list of collectors for the poller in question. If not, can you clarify which HA event you mean?

lucid quarry
#

I have -Ems enabled in my /etc/nabox/harvest/harvest.yml; I'm trying to figure out what else I need to enable to get EMS events into NABox and Harvest so I can use them in my dashboard.

#

For example, last night I had a node failover, but the cluster health did not show it this morning.

lucid sundial
#

That should be all you need. Harvest will only publish events detected since it was restarted. Let's check your log files and make sure the collector is running. SSH into your NABox and run dc logs harvest | grep Ems:Ems

Also check VictoriaMetrics for the ems_events metric. You may need to look across a longer time range.
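
A minimal sketch of the "longer range" check, assuming VictoriaMetrics is reachable over its Prometheus-compatible query API. The `VM_URL` value and the helper names `range_count_query` and `query_vm` are illustrative assumptions, not part of NABox or Harvest; adjust the URL to wherever your instance listens.

```shell
# Assumption: base URL of the VictoriaMetrics instance (8428 is the
# single-node default; your NABox setup may expose it elsewhere).
VM_URL="${VM_URL:-http://localhost:8428}"

# Hypothetical helper: build a PromQL expression that counts the samples
# of a metric over a lookback window, e.g. ems_events over 7d.
range_count_query() {
  printf 'count_over_time(%s[%s])' "$1" "$2"
}

# Hypothetical helper: send an instant query to /api/v1/query.
# -G turns the url-encoded data into a GET query string.
query_vm() {
  curl -sG "$VM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage (needs network access to VictoriaMetrics):
#   query_vm "$(range_count_query ems_events 7d)"
```

An empty `result` array in the response would mean no ems_events samples were stored in that window.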

#

When you say "the cluster health", are you referring to a metric, a dashboard panel, or an EMS event?

#

Do you mean this dashboard?

lucid quarry
#
  1. The collector appears to be running; I see collector=Ems:Ems output from the above command. I was thinking that cluster_new_status{datacenter=~"$Datacenter",cluster=~"$Cluster"}[24h] would show an unhealthy event for a cluster within the last 24 hours, but it didn't.
#

Yes, I'm not getting most of those events in the ONTAP Health dashboard

#

I see Network ports down; everything else is 0.

lucid sundial
#

What do these show?
dc logs harvest | grep Rest:Health
dc logs harvest | grep numNodeAlerts | grep -v 'numNodeAlerts=0'

lucid quarry
#

Filtering Rest:Health, I see an error level event for the cluster in question at about the right time, but then I ran a grep on the cluster and 'error'. I am seeing 403, permission denied... I wonder if I left something out of my RBAC
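
The two-step filter described here (cluster name, then 'error') can be sketched as a small shell helper. The function name `harvest_errors` and the cluster name in the usage line are hypothetical; the helper only reads log lines from standard input, so it works with whatever produces the Harvest logs.

```shell
# Hypothetical helper: keep only log lines that mention the given cluster
# (fixed-string match) and also contain "error" (case-insensitive).
harvest_errors() {
  grep -F "$1" | grep -i 'error'
}

# Usage on NABox (cluster-01 is a placeholder name):
#   dc logs harvest | harvest_errors cluster-01
```

Lines that show a 403 or "permission denied" for the cluster point at the ONTAP role used by the poller rather than at the collector configuration.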

lucid quarry
#

But, I'm not getting any stats for any clusters... Let me check RBAC, thanks, I'll get back to you

lucid sundial
#

Am I remembering correctly that you cannot share logs?

lucid quarry
#

I can transcribe, but I can't download them from the dark side

#

Transcribe means read and type

#

😉

lucid quarry
#

It looks like I'm collecting now. I will monitor to see what events I get; thanks for the help, although I may get back to you. 😉

lucid quarry
#

GM

#

Would you have time for a question?

lucid quarry
#

I'm about to head out for a week, but when you see this and have time to respond, that would be great. I enabled the rest-role, and I can see EMS events being collected. But my ONTAP Health dashboard still does not seem right. I disabled HA on a cluster, re-enabled it, and actually forced a takeover. I don't see any "Node Down" on the Health panel; I also used a query, cluster_new_status{datacenter,cluster}[24h], to see if I could catch the event, but it isn't coming up.

lucid quarry
#

Have a great gobble day turkeys...

lucid sundial
#

What if you try health_node_alerts? You have a good Thanksgiving too!

lucid quarry
#

That didn't work either. However, I noticed that after another storage failover on a test box, the ONTAP Health dashboard showed Node Down "1" and HA Down "2". What I don't understand is that after I brought the node back into the cluster, I am seeing HA Down "4" and Node Down "1". After re-enabling HA, and a few minutes thereafter, I see HA Down "2" and Node Down "0". So I used this dashboard as a guide to set up my summary wallboard.
I created three queries: count(health_ha_alerts{severity="error"}), count(health_node_alerts{severity="error"}), and count(environment_sensor_threshold_value{threshold_state!="normal"}). In the value mappings, I map 0 to 'ok' and 1 to 'nok'.

#

What I am not sure of are the values to monitor for; on one cluster I showed "ha, node, env" as nok, ok, ok. It turns out that auto-giveback after takeover was disabled. Now my status is 'ok', 'ok', 'ok', but I don't understand where the numbers are coming from.
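
On where the numbers come from: in PromQL, count() returns how many time series match the selector, not how many events occurred, so a "2" means two series currently match (for example, two separate alert series for the same cluster). When nothing matches, count() returns no data rather than 0. A small sketch of building the wallboard queries with an explicit zero fallback follows; the helper name `count_or_zero` is hypothetical, and the metric and label names are the ones quoted in this thread.

```shell
# Hypothetical helper: build a PromQL expression that counts the series
# matching a label filter and falls back to 0 when none match.
#   $1 = metric name, $2 = label matcher(s) placed inside the braces
count_or_zero() {
  printf 'count(%s{%s}) or vector(0)' "$1" "$2"
}

# The three wallboard queries from this thread, with the fallback added:
count_or_zero health_ha_alerts 'severity="error"'; echo
count_or_zero health_node_alerts 'severity="error"'; echo
count_or_zero environment_sensor_threshold_value 'threshold_state!="normal"'; echo
```

Because the result can be greater than 1, a value mapping that covers only 0 and 1 leaves higher counts unmapped; a range mapping from 1 upward to 'nok' may fit better.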

distant wagon