#I am trying the new Ems collector in the
hi @livid slate let's try with curl to remove Harvest from the equation, let me dig that up for you
can you try this, replacing the username, password, and IP?
curl -v --insecure --user admin:password 'https://10.193.48.11/api/support/ems/events'
this collector polls data every 3m by default
perhaps you had no new ems events in that 3m window?
that worked. received many events
"records": [
  {
    "node": {
      "name": "flc1-02-noprod-ash-storage",
      "uuid": "12c70b18-bf3a-11ec-bc33-00a098f218d0",
      "_links": {
        "self": {
          "href": "/api/cluster/nodes/12c70b18-bf3a-11ec-bc33-00a098f218d0"
        }
      }
    },
    "index": 7126896,
    "time": "2022-09-28T09:05:00-07:00",
    "message": {
      "severity": "informational",
      "name": "wafl.scan.ownblocks.done"
    },
    "log_message": "wafl.scan.ownblocks.done: Completed block ownership calculation on volume rootvol(4)@vserver:606bdda8-7930-11ec-8c61-00a0986e013f. The scanner took 2 ms.",
    "_links": {
      "self": {
        "href": "/api/support/ems/events/flc1-02-noprod-ash-storage/7126896"
      }
    }
  },
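For anyone who wants to skim a response like the one above without reading raw JSON, here is a minimal Python sketch. It assumes the full response is an object with a top-level "records" list shaped like the fragment above; the helper name summarize_events and the trimmed SAMPLE payload are made up for illustration and are not part of Harvest or ONTAP.

```python
import json

# Trimmed sample shaped like the fragment above; only the fields used here are kept.
SAMPLE = """
{
  "records": [
    {
      "node": {"name": "flc1-02-noprod-ash-storage"},
      "index": 7126896,
      "time": "2022-09-28T09:05:00-07:00",
      "message": {"severity": "informational", "name": "wafl.scan.ownblocks.done"}
    }
  ]
}
"""


def summarize_events(payload: str) -> list:
    """Return (node, index, severity, message name) for each EMS record."""
    records = json.loads(payload).get("records", [])
    return [
        (r["node"]["name"], r["index"], r["message"]["severity"], r["message"]["name"])
        for r in records
    ]


if __name__ == "__main__":
    for row in summarize_events(SAMPLE):
        print(row)
```

Piping the curl output into a script like this makes it easier to check whether a specific message name showed up.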
i have a custom zapi Ems template and it collects events too. i did disable that before trying the new Ems collector
The new EMS collector collects the last 3 minutes of EMS events on each 3-minute poll. You may not have had any new events since you started the new EMS collector.
ahh. any way to trigger one event to test?
there's a way to tell ONTAP to synthesize an event - let me dig it up
Could you share this custom zapi template, if possible?
name: Ems
query: ems-message-get-iter
object: ems
collect_only_labels: true
counters:
  ems-message-info:
    - ^ems-severity => severity
    - ^message-name => name
    - ^^node => node
    - ^time => time
    - ^^seq-num => seq_num
plugins:
  - LabelAgent:
      # metric label zapi_value rest_value `default_value`
      value_to_num:
        - non_encrypted encryption_state none none `0`
export_options:
  instance_keys:
    - node
    - seq_num
  instance_labels:
    - severity
    - name
    - time
Thanks. Looks like it collects all events on every poll.
yeah
give this a shot to create an object.store.unavailable event, replacing the username, password, and IP:
curl -sk -uadmin:pass -X POST 'https://10.193.48.11/api/private/cli/event/generate' -d '{ "message-name": "object.store.unavailable", "values": [1,2,3] }'
very slow to query
we were experiencing a low WAFL memory issue and tried to catch the EMS warning before it went from the verylow level to veryverylow
triggered EMS with curl
if the ems poller is running you should see it within 3m
ok
got it
curl -s localhost:12991/metrics | egrep '^ems'
ems_events{datacenter="ASH",cluster="flc1-noprod-ash-storage",cluster_uuid="b61b746e-ca01-11e5-aa1c-00a0986e012c",message="object.store.unavailable",node="flc1-01-noprod-ash-storage",node_uuid="2",severity="emergency",index="6713145",config_name="1"} 1
ems_events{cluster="flc1-noprod-ash-storage",cluster_uuid="b61b746e-ca01-11e5-aa1c-00a0986e012c",datacenter="ASH",index="6713146",app="",node="flc1-01-noprod-ash-storage",severity="informational",volume="AU2_1664380392",vol_ident="@vserver:a03a978a-ecd6-11e9-bc47-00a0986e013f",message="wafl.vvol.offline",node_uuid="5b8ad0df-bf3a-11ec-9f89-00a098f21d80"} 1
even one extra!
nice!
These events will not come again in the next poll, but you should see them in Prometheus's past data. Basically, these events vanish at the next poll, and only events from the last 3 minutes are available on port 12991.
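To make the 3-minute window concrete, here is a simplified Python model of the behavior described in this thread. It is not Harvest's actual implementation: the function name exported_at and the exact window boundaries (open at the start, closed at the poll time) are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Default poll interval mentioned in this thread.
POLL_INTERVAL = timedelta(minutes=3)


def exported_at(poll_time: datetime, event_times: list) -> list:
    """Return the event timestamps that fall in (poll_time - 3m, poll_time].

    Simplified model: each poll only exposes events raised since the
    previous poll, so older events drop out of the metrics endpoint.
    """
    window_start = poll_time - POLL_INTERVAL
    return [t for t in event_times if window_start < t <= poll_time]


if __name__ == "__main__":
    event = datetime(2022, 9, 28, 9, 5, 0)
    # Poll one minute after the event: it is exported.
    print(exported_at(datetime(2022, 9, 28, 9, 6, 0), [event]))
    # Next poll, three minutes later: the event is no longer exported.
    print(exported_at(datetime(2022, 9, 28, 9, 9, 0), [event]))
```

Prometheus still has the samples it scraped while the event was visible, which is why the event remains queryable in past data.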
you bet!
@livid slate Harvest also ships with some EMS Prometheus rules, in case that helps for your use case: https://github.com/NetApp/harvest/blob/main/docker/prometheus/ems_alert_rules.yml
Thanks. They are really useful. We need to add the events that are not in the stock ems.yaml to custom_ems.yaml, right?
we are monitoring 2 events and will add more.
ems_events{message="callhome.spares.low"}
sum_over_time(ems_events{name="wafl.memory.statusLowMemory"}[10m]) > 1
yes, you'll need to add them to your custom
Great! Also check whether this is the expression you want: sum_over_time(ems_events{name="wafl.memory.statusLowMemory"}[10m]) > 1. Say your Prometheus scrape interval is 15 seconds. Harvest keeps this event in its cache for 3 minutes, so Prometheus will record the entry 12 times (180s / 15s). In that case the expression will be greater than 1 even if the EMS event was raised only once.
Most of our EMS alert examples use last_over_time, and all active EMS events are published with value 1.
good point. we were told by support that if wafl.memory.statusLowMemory happens more than once in 10 minutes, a reboot may be required. maybe i should use count_over_time?
It won't help. sum and count would be the same given the metric value is 1 🙂
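A small Python sketch of that arithmetic, using the 15-second scrape interval and 3-minute cache from the example above. The functions only mimic what sum_over_time and count_over_time do over a list of samples; they are stand-ins for illustration, not Prometheus code.

```python
def samples_per_event(cache_seconds: int, scrape_interval_seconds: int) -> int:
    """How many times Prometheus scrapes one event while Harvest still exports it."""
    return cache_seconds // scrape_interval_seconds


def sum_over_time(samples: list) -> int:
    """Stand-in for PromQL sum_over_time: add up the sample values in the range."""
    return sum(samples)


def count_over_time(samples: list) -> int:
    """Stand-in for PromQL count_over_time: count the samples in the range."""
    return len(samples)


if __name__ == "__main__":
    # One EMS event, exported with value 1 for 3 minutes, scraped every 15 seconds.
    samples = [1] * samples_per_event(180, 15)
    print(len(samples))              # number of samples for a single event
    print(sum_over_time(samples))    # same as the count, because every value is 1
    print(count_over_time(samples))
    print(sum_over_time(samples) > 1)  # the alert condition holds for one event
```

Because every sample is 1, both functions return the number of scrapes rather than the number of events, so neither can tell one occurrence from two.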
I think we need to handle this use case in the current EMS infrastructure. Could you please open a GitHub request for it?
Today we mostly publish 1 when an event happens and 0 when it is resolved by a resolving event.
will do.
Thanks!
with our zapi ems template, we collect ems timestamp as value. then use this alert condition
count (time() - ems_time{name="wafl.memory.statusLowMemory"} < 600) by (node) > 1
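Here is a rough Python emulation of what that expression is meant to compute, assuming each event is its own series (the template keys instances by node and seq_num) whose value is the event's Unix timestamp. The function name nodes_with_repeated_events and the sample data are hypothetical.

```python
from collections import Counter


def nodes_with_repeated_events(now: float, events: list,
                               window: float = 600, threshold: int = 1) -> dict:
    """Emulate: count(time() - ems_time < window) by (node) > threshold.

    events is a list of (node, unix_timestamp) pairs, one per EMS event.
    Returns {node: count} for nodes with more than `threshold` recent events.
    """
    recent = Counter(node for node, ts in events if now - ts < window)
    return {node: n for node, n in recent.items() if n > threshold}


if __name__ == "__main__":
    now = 1000.0
    events = [
        ("node-a", 900.0),  # 100s ago: inside the 10-minute window
        ("node-a", 500.0),  # 500s ago: inside the window
        ("node-a", 100.0),  # 900s ago: outside the window
        ("node-b", 950.0),  # only one recent event on this node
    ]
    print(nodes_with_repeated_events(now, events))
```

Counting distinct events by timestamp avoids the scrape-multiplication problem, since each event contributes one series no matter how often it is scraped.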
not sure if it worked because prometheus did not report any 😉
Thanks. We'll see if we can simplify this use case!