We have had a lot of issues with the Netdata agents sending alarms about our servers being unresponsive whenever the agent updates. This results in more than 50 alarms trickling in over almost half an hour. This is annoying in itself, and I would love to know if there is a way to delay alarms so that they do not activate until the agent has been down for, let's say, 2-5 minutes.
But the main problem is that none of the opened alarms are closed after the servers become "reachable" again. This means that a lot of alarms flood our alarm system and have to be closed manually. We are using the Opsgenie integration for alarms.
For now, we have disabled automatic updating of the agents because of this problem.
So my question is threefold:
- Is there some setting that we missed when setting up alarms and that needs to be changed to make sure that ALL alerts are automatically closed once the status is restored?
- Is it possible to make sure that unreachable alerts only trigger when the agent has been unreachable for a set amount of time, to limit false alerts (in our case, a lot of them)?
- Is it possible to tell the agent not to trigger alerts when the server or the agent has been set to restart, either manually or as part of an agent update?
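
To make the first two points concrete, here is a rough sketch of the behaviour I am hoping for. It is purely illustrative Python, not Netdata or Opsgenie code: the class name `ReachabilityAlert`, the `observe` method, and the 5-minute grace period are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ReachabilityAlert:
    """Hypothetical debounce logic; not Netdata or Opsgenie code."""
    grace_seconds: float = 300.0          # assumed 5-minute grace period
    down_since: Optional[float] = None    # when the agent was first seen down
    is_open: bool = False                 # whether an alert is currently open
    events: List[Tuple[float, str]] = field(default_factory=list)

    def observe(self, now: float, reachable: bool) -> None:
        """Record one reachability check taken at time `now` (seconds)."""
        if reachable:
            # Recovery: close the alert if one was opened, then reset.
            if self.is_open:
                self.events.append((now, "close"))
                self.is_open = False
            self.down_since = None
            return
        if self.down_since is None:
            self.down_since = now
        # Open only once the outage has outlasted the grace period.
        if not self.is_open and now - self.down_since >= self.grace_seconds:
            self.events.append((now, "open"))
            self.is_open = True


alert = ReachabilityAlert(grace_seconds=300)
# A 60-second restart during an agent update: no alert at all.
alert.observe(0, reachable=False)
alert.observe(60, reachable=True)
# A real 10-minute outage: one "open", then one "close" on recovery.
alert.observe(100, reachable=False)
alert.observe(400, reachable=False)
alert.observe(700, reachable=True)
print(alert.events)
```

In other words, a short restart should produce nothing, and a real outage should produce exactly one alert that is closed automatically when the node comes back.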