I am trying the new Ems collector in the | NetApp | Page 1

indigo phoenix Sep 28, 2022, 4:01 PM

#

hi @livid slate let's try with curl to remove Harvest from the equation, let me dig that up for you

#

can you try this, replacing username, password, and ip?
curl -v --insecure --user admin:password 'https://10.193.48.11/api/support/ems/events'

#

this collector polls data every 3m by default

#

perhaps you had no new ems events in that 3m window?

livid slate Sep 28, 2022, 4:06 PM

#

that worked. received many events
"records": [
  {
    "node": {
      "name": "flc1-02-noprod-ash-storage",
      "uuid": "12c70b18-bf3a-11ec-bc33-00a098f218d0",
      "_links": {
        "self": {
          "href": "/api/cluster/nodes/12c70b18-bf3a-11ec-bc33-00a098f218d0"
        }
      }
    },
    "index": 7126896,
    "time": "2022-09-28T09:05:00-07:00",
    "message": {
      "severity": "informational",
      "name": "wafl.scan.ownblocks.done"
    },
    "log_message": "wafl.scan.ownblocks.done: Completed block ownership calculation on volume rootvol(4)@vserver:606bdda8-7930-11ec-8c61-00a0986e013f. The scanner took 2 ms.",
    "_links": {
      "self": {
        "href": "/api/support/ems/events/flc1-02-noprod-ash-storage/7126896"
      }
    }
  },

#

i have a custom zapi Ems template and it collects events too. i did disable that before trying the new Ems collector

tawny sable Sep 28, 2022, 4:10 PM

#

The new ems collector collects the last 3 minutes of ems events on each poll, and it polls every 3 minutes. You may not have had any new events since you started the new ems collector.
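
To make the 3-minute window concrete, here is a minimal Python sketch that filters a response shaped like the `records` output above down to events from the last 3 minutes. The function name `recent_events`, the client-side filtering, and the second sample record are illustrative assumptions; this is not how Harvest itself implements the collector.

```python
from datetime import datetime, timedelta, timezone

def recent_events(records, now, window=timedelta(minutes=3)):
    """Return records whose "time" falls within `window` before `now`."""
    cutoff = now - window
    # datetime.fromisoformat understands offsets such as -07:00
    return [r for r in records
            if cutoff <= datetime.fromisoformat(r["time"]) <= now]

records = [
    # taken from the curl output in this thread
    {"index": 7126896, "time": "2022-09-28T09:05:00-07:00",
     "message": {"name": "wafl.scan.ownblocks.done"}},
    # hypothetical older event, added only to show the filter dropping it
    {"index": 7126000, "time": "2022-09-28T08:00:00-07:00",
     "message": {"name": "hypothetical.older.event"}},
]

now = datetime(2022, 9, 28, 16, 6, tzinfo=timezone.utc)  # 09:06 at -07:00
print([r["index"] for r in recent_events(records, now)])
```

An event older than the window is simply not reported, which matches the behavior described here: a poll with no new events publishes nothing.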

livid slate Sep 28, 2022, 4:11 PM

#

ahh. any way to trigger one event to test?

indigo phoenix Sep 28, 2022, 4:12 PM

#

there's a way to tell ONTAP to synthesize an event - let me dig it up

tawny sable Sep 28, 2022, 4:18 PM

#

livid slate i have a custom zapi Ems template and it collects events too. i did disable that...

Could you share this custom zapi template if possible.

livid slate Sep 28, 2022, 4:19 PM

#

name: Ems
query: ems-message-get-iter
object: ems

collect_only_labels: true

counters:
  ems-message-info:
    - ^ems-severity => severity
    - ^message-name => name
    - ^^node => node
    - ^time => time
    - ^^seq-num => seq_num

plugins:
  LabelAgent:
    # metric label zapi_value rest_value default_value
    value_to_num:
      - non_encrypted encryption_state none none 0

export_options:
  instance_keys:
    - node
    - seq_num
  instance_labels:
    - severity
    - name
    - time

tawny sable Sep 28, 2022, 4:20 PM

#

Thanks. Looks like it collects all events on every poll.

livid slate Sep 28, 2022, 4:20 PM

#

yeah

indigo phoenix Sep 28, 2022, 4:20 PM

#

give this a shot to create an object.store.unavailable event, replacing username, password, and IP:
curl -sk -uadmin:pass -X POST 'https://10.193.48.11/api/private/cli/event/generate' -d '{ "message-name": "object.store.unavailable", "values": [1,2,3] }'
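
For anyone scripting this instead of using curl, a rough Python equivalent might look like the sketch below. It only builds the request and does not send it. The endpoint and JSON body come from the curl command above; the helper name `build_generate_request` is hypothetical, and the host and credentials are the same placeholders used in this thread.

```python
import base64
import json
import urllib.request

def build_generate_request(host, user, password, message_name, values):
    """Build (but do not send) the POST that asks ONTAP to generate an EMS event."""
    url = f"https://{host}/api/private/cli/event/generate"
    body = json.dumps({"message-name": message_name, "values": values}).encode()
    # Same effect as curl's -u user:password (HTTP basic auth)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    return req

req = build_generate_request("10.193.48.11", "admin", "pass",
                             "object.store.unavailable", [1, 2, 3])
print(req.get_method(), req.full_url)
# Sending is left out on purpose. curl's -k skips certificate verification,
# so urllib.request.urlopen would need an SSL context configured the same way.
```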

livid slate Sep 28, 2022, 4:20 PM

#

very slow to query

#

we were experiencing a low WAFL memory issue and tried to catch the EMS warning before it goes from the verylow level to veryverylow

#

triggered EMS with curl

indigo phoenix Sep 28, 2022, 4:25 PM

#

if the ems poller is running you should see it within 3m

livid slate Sep 28, 2022, 4:25 PM

#

ok

#

got it
curl -s localhost:12991/metrics | egrep '^ems'
ems_events{datacenter="ASH",cluster="flc1-noprod-ash-storage",cluster_uuid="b61b746e-ca01-11e5-aa1c-00a0986e012c",message="object.store.unavailable",node="flc1-01-noprod-ash-storage",node_uuid="2",severity="emergency",index="6713145",config_name="1"} 1
ems_events{cluster="flc1-noprod-ash-storage",cluster_uuid="b61b746e-ca01-11e5-aa1c-00a0986e012c",datacenter="ASH",index="6713146",app="",node="flc1-01-noprod-ash-storage",severity="informational",volume="AU2_1664380392",vol_ident="@vserver:a03a978a-ecd6-11e9-bc47-00a0986e013f",message="wafl.vvol.offline",node_uuid="5b8ad0df-bf3a-11ec-9f89-00a098f21d80"} 1
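
If you want to check for a specific event in a script rather than by eye, a small parser along these lines could work. This is a minimal sketch: `parse_ems_line` is a hypothetical helper, the sample line is a shortened copy of the first metric above, and the regex assumes label values contain no escaped quotes.

```python
import re

# Matches key="value" pairs; does not handle escaped quotes inside values
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_ems_line(line):
    """Split one `ems_events{...} value` line into (labels, value), else None."""
    labels_part, _, value = line.rpartition("} ")
    name, _, raw_labels = labels_part.partition("{")
    if name != "ems_events":
        return None
    return dict(LABEL_RE.findall(raw_labels)), float(value)

line = ('ems_events{datacenter="ASH",cluster="flc1-noprod-ash-storage",'
        'message="object.store.unavailable",node="flc1-01-noprod-ash-storage",'
        'severity="emergency",index="6713145"} 1')

labels, value = parse_ems_line(line)
print(labels["message"], labels["severity"], value)
```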

#

even one extra!

indigo phoenix Sep 28, 2022, 4:26 PM

#

nice!

tawny sable Sep 28, 2022, 4:28 PM

#

These events will not show up again in the next poll, but you should still see them in Prometheus's past data. Basically, these events will vanish on the next poll, and only events from the last 3 minutes will be available on port 12991.

livid slate Sep 28, 2022, 4:33 PM

#

yep they are gone

#

Thank you very much Raul and Chris!

indigo phoenix Sep 28, 2022, 4:35 PM

#

you bet!

tawny sable Sep 29, 2022, 9:20 AM

#

@livid slate Harvest also ships with some ems prometheus rules, in case that helps for your use case: https://github.com/NetApp/harvest/blob/main/docker/prometheus/ems_alert_rules.yml

livid slate Sep 29, 2022, 4:28 PM

#

tawny sable @livid slate Harvest also ships with some ems prometheus rules https:/...

Thanks. They are really useful. We need to add the events that are not in the stock ems.yaml to custom_ems.yaml, right?
we are monitoring 2 events and will add more.
ems_events{message="callhome.spares.low"}
sum_over_time(ems_events{name="wafl.memory.statusLowMemory"}[10m]) > 1

indigo phoenix Sep 29, 2022, 4:35 PM

#

yes, you'll need to add them to your custom

tawny sable Sep 29, 2022, 4:45 PM

#

livid slate Thanks. They are really useful. We need add the events that not in stock ems.yam...

Great! Also, do check whether this is the expression you want: sum_over_time(ems_events{name="wafl.memory.statusLowMemory"}[10m]) > 1. Let's say your prometheus scrape interval is 15 sec. Since harvest keeps this event in its cache for 3 minutes, you'll have this entry 12 times in prometheus (180 s / 15 s). In that case, the expression will be greater than 1 even if the ems was raised only once.

#

Most of our ems alert examples use last_over_time, and all active ems are published with the value 1

livid slate Sep 29, 2022, 4:58 PM

#

tawny sable Most of our ems alert examples uses `last_over_time` and all active ems are publ...

good point. we were told by support that if wafl.memory.statusLowMemory happens more than once in 10 minutes, a reboot may be required. maybe i should use count_over_time?

tawny sable Sep 29, 2022, 4:59 PM

#

It won't help. sum and count would be the same, given the metric value is 1 🙂
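
The arithmetic behind this can be sketched in a few lines of Python, using the 15 sec scrape interval and 3 minute cache from the example earlier in the thread. It assumes scrapes land evenly and the whole 3 minutes falls inside the 10m range; the names are illustrative only.

```python
def samples_for_one_event(cache_seconds=180, scrape_interval_seconds=15):
    """Number of scrapes that see an event Harvest exposes for cache_seconds."""
    return cache_seconds // scrape_interval_seconds

n = samples_for_one_event()
samples = [1] * n                 # every scrape records the value 1
sum_result = sum(samples)         # what sum_over_time(...[10m]) adds up
count_result = len(samples)       # what count_over_time(...[10m]) counts
print(n, sum_result, count_result, sum_result > 1)
```

Because every sample is 1, the sum and the count are the same number, and a single event already pushes both above 1.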

#

I think we need to handle this use case in the current ems infrastructure. Could you pls open a GitHub request for this use case?

#

Today we mostly publish 1 when an event happens and 0 when it is resolved via a resolving event.

livid slate Sep 29, 2022, 5:12 PM

#

will do.

tawny sable Sep 29, 2022, 5:17 PM

#

Thanks!

livid slate Sep 29, 2022, 5:20 PM

#

with our zapi ems template, we collect the ems timestamp as the value, then use this alert condition:
count (time() - ems_time{name="wafl.memory.statusLowMemory"} < 600) by (node) > 1
not sure if it worked because prometheus did not report any 😉
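
The logic of that alert condition, written out as a minimal Python sketch: count the events per node that are newer than 600 seconds and keep the nodes with more than one. The function name, node names, and timestamps are made up for illustration.

```python
from collections import Counter

def nodes_with_repeated_event(events, now, window_seconds=600, threshold=1):
    """Nodes with more than `threshold` events newer than `window_seconds`.

    `events` is a list of (node, unix_timestamp) pairs, one per event.
    """
    counts = Counter(node for node, ts in events if now - ts < window_seconds)
    return sorted(node for node, c in counts.items() if c > threshold)

now = 1_664_500_000  # hypothetical Unix time
events = [
    ("node-a", now - 30),
    ("node-a", now - 400),
    ("node-a", now - 900),   # older than 10 minutes, ignored
    ("node-b", now - 100),
]
print(nodes_with_repeated_event(events, now))
```

Unlike the sum_over_time approach, this counts distinct events rather than scrape samples, so one event that stays visible for several scrapes is still counted once.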

tawny sable Sep 29, 2022, 5:21 PM

#

Thanks. We'll see if we can simplify this use case!

livid slate Sep 29, 2022, 5:39 PM

#

https://github.com/NetApp/harvest/issues/1324 Thank you!
