#Alarms not closing after server has been unreachable + alarm spam


frank trench
#

We have had a lot of issues with the Netdata agents sending alarms about our servers being unresponsive whenever the agent updates. This results in more than 50 alarms trickling in over almost half an hour. This in itself is annoying, and I would love to know if there is a way to keep alarms from activating until the agent has been down for, let's say, 2-5 minutes.
But the main problem is that none of the opened alarms are being closed after the servers become "reachable" again. This means that a lot of alarms are just flooding our alarm system and have to be closed manually. We are using the Opsgenie integration for alarms.
We have now chosen to disable automatic updating of the agents because of this problem.

So my question is threefold:

  1. Is there some setting we missed when setting up alarms that needs to be changed to make sure that ALL alerts are automatically closed once the status is restored?
  2. Is it possible to make sure unreachable alerts only trigger when the agent has been unreachable for a set amount of time, to limit (in our case a lot of) false alerts?
  3. Is it possible to keep the agent from triggering alerts when the server or agent has been restarted deliberately (manually, or as part of an agent update)?
native shell
#

Hi @frank trench, sorry to hear you had to disable updates due to this issue.
Let me try to help with questions 1 and 2.

  1. They should automatically clear once the agent is reachable again. Could you provide me your space_id so I can take a closer look? (You can find it in your Space settings.)
  2. Unfortunately that is not possible ... yet. The good news is that multiple users have requested it, we have added this feature to our list, and we hope to deliver it soon (1-2 weeks).
frank trench
#
  1. I have forwarded the space_id directly to you in a private message 🙂
  2. Yeah, I thought as much, and I have seen some talk about it on the forum, but those threads are very old, so I wanted to give my own reason for why it would be an extremely great feature, as it is almost always a false positive in our system.
native shell
#

thank you, I will get back to you with my findings
indeed, you're absolutely right

native shell
#

@frank trench I found that we introduced a bug when adding the reachability notification aggregation feature.
this sometimes leads to incidents not being closed on incident management integrations (Opsgenie, PagerDuty, etc).
we will work on addressing this as soon as possible.
also, during my investigation I found your child nodes are also claimed to the Cloud, this is the root cause of the problem. although we support it, it's not the best setup.
https://learn.netdata.cloud/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents

Netdata supports building Parent clusters of 2+ nodes. Clustering and high availability works like this:

frank trench
#

Ahh nice to hear you found the bug 🙂 will be looking forward to seeing a fix for it 🙂

oh, then we have introduced that by mistake. Have to look into how to change that, as that was not meant to happen.
might return for some help on this, but have to talk with the person who originally made the setup, and we are currently looking into how to fix this 😄

#

so thanks for bringing this to our attention 😄

frank trench
#

ah we think we found the mistake.
We had enabled stream on the parent node as well as on the child nodes 😄
so a quick parent stream.conf update and a reload should make the correct claim, I guess?

native shell
#

the streaming configuration is all correct
the only improvement is not stream related but actually how Netdata was installed
only parents should be installed with the claiming arguments AKA Connect Agent to Cloud

#

let's take this simple example, one parent with one child and both claimed/connected to the cloud
it will look like this

node id | stream id | hops (distance to cloud)
---------------------------------------------
parent1 | stream 1 | 0
child1  | stream 1 | 1
child1  | stream 2 | 0
#

if only parent is claimed/connected to the cloud
it will look like this

node id | stream id | hops (distance to cloud)
---------------------------------------------
parent1 | stream 1 | 0
child1  | stream 1 | 1
#

this is hidden from the user but it's part of the Cloud business logic
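
To make the difference between the two tables concrete, here is a minimal sketch that reads rows in the same `node id | stream id | hops` layout and prints every node that appears on more than one stream. This is only an illustration: the Cloud does not expose this table, so the input would have to be written by hand, and `multi_path_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list nodes that reach the Cloud through more than one stream.
# Input (hypothetical, hand-written): "node id | stream id | hops", one row per line.
multi_path_nodes() {
    awk -F'|' '
        NF == 3 {
            gsub(/^[ \t]+|[ \t]+$/, "", $1)   # trim whitespace around the node id
            count[$1]++                       # one more stream seen for this node
        }
        END {
            for (node in count)
                if (count[node] > 1) print node
        }
    ' | sort
}
```

Feeding it the three rows of the first table would print `child1`, the node that is both streamed through the parent and connected directly; the second table would print nothing.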

#

of course this is just an improvement and it's not actually wrong since we support these setups as well

frank trench
#

that makes sense.
We have just copied the value from within the dashboard when adding new agents.
So I guess instead of
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh --stable-channel --claim-token --claim-rooms --claim-url

we only need to use
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh --stable-channel --claim-token --claim-rooms
?

native shell
#

almost 😉
it would be without any claim argument

#

they will appear automatically on cloud since they are streaming to a parent (at least as long as the parent is claimed/connected to the cloud)
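
As a rough sketch of what a child install without claim arguments could look like: the URL and the `--stable-channel` flag are the ones quoted earlier in this thread, while `has_claim_args` and `install_child` are hypothetical wrapper names, not part of kickstart.sh. The guard simply refuses to run if a `--claim-*` argument is passed by mistake.

```shell
#!/bin/sh
# Sketch: install a child node WITHOUT any --claim-* arguments.
KICKSTART_URL="https://get.netdata.cloud/kickstart.sh"

# Hypothetical guard: succeeds if any argument starts with --claim-
has_claim_args() {
    for arg in "$@"; do
        case "$arg" in
            --claim-*) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical wrapper: download kickstart.sh and run it with the given flags,
# but only if no claim argument was passed.
install_child() {
    if has_claim_args "$@"; then
        echo "refusing: child nodes should be installed without --claim-* arguments" >&2
        return 1
    fi
    wget -O /tmp/netdata-kickstart.sh "$KICKSTART_URL" &&
        sh /tmp/netdata-kickstart.sh "$@"
}

# Usage (not run automatically):
#   install_child --stable-channel
```

Only the parent would then be installed with the claim arguments copied from the dashboard.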

frank trench
#

ahh fair enough.
I guess it would be fixed by just running the wget call again, as, if I remember correctly, that is the easiest way to "update" such settings

native shell
#

I think you have to remove/delete the cloud folder, let me find it in the docs

frank trench
#

By the way, a small idea for the dashboard. In the past I have accidentally generated a new token, as there is no confirmation box before it yeets the old one 🙂 It might be an idea to add an extra validation step when hitting the regenerate token button 😄

#

well that is doable 😄
remove a folder, run a wget script and Bob's ur uncle 😄

native shell
#

indeed, that's a nice suggestion, let me tag product people

frank trench
#

this optimization might actually fix some issues with both memory and disk space we have had, which seemed to stem from netdata. For now, though, it just seems to come down to everything being connected multiple times.

native shell
#

it may actually be the case

#
To remove a node from your Space in Netdata Cloud, delete the cloud.d/ directory in your Netdata library directory.

cd /var/lib/netdata   # Replace with your Netdata library directory, if not /var/lib/netdata/
sudo rm -rf cloud.d/

IMPORTANT:
Keep in mind that the Agent will be re-claimed automatically if the environment variables or claim.conf exist when the agent is restarted.

This node no longer has access to the credentials it used when connecting to Netdata Cloud via the ACLK. You will still be able to see this node in your Rooms in an unreachable state.
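
The removal step and the IMPORTANT note above can be combined into one small check. This is a sketch under assumptions: `unclaim_node` is a hypothetical helper, and the `/etc/netdata/claim.conf` location and the `NETDATA_CLAIM_TOKEN` variable name are guesses that should be verified against your installation.

```shell
#!/bin/sh
# Sketch: remove a node's Cloud credentials and warn about anything
# that would make the agent re-claim itself on the next restart.
unclaim_node() {
    lib_dir="${1:-/var/lib/netdata}"   # Netdata library directory
    conf_dir="${2:-/etc/netdata}"      # ASSUMED location of claim.conf

    if [ -d "$lib_dir/cloud.d" ]; then
        rm -rf "$lib_dir/cloud.d"
        echo "removed $lib_dir/cloud.d"
    else
        echo "no cloud.d directory in $lib_dir"
    fi

    # A leftover claim.conf would trigger an automatic re-claim.
    if [ -f "$conf_dir/claim.conf" ]; then
        echo "warning: $conf_dir/claim.conf exists; the agent will re-claim on restart"
    fi
    # ASSUMED variable name for the claim token in the environment.
    if [ -n "${NETDATA_CLAIM_TOKEN:-}" ]; then
        echo "warning: NETDATA_CLAIM_TOKEN is set; the agent will re-claim on restart"
    fi
}
```

Run as root, `unclaim_node` with no arguments would act on `/var/lib/netdata`; pass other directories as the first and second argument if your layout differs.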
frank trench
#

two of our primary parent servers are each using 20-25% of their 8GB memory running the netdata agent. so it might help us there 😄

native shell
#

since your agent is streaming to a parent, that last statement doesn't fully apply: your node will still be visible & online on the cloud

#

as it has another stream to reach the cloud

frank trench
#

this is truly amazing support 😄

native shell
#

thanks 🙇‍♂️
we like to keep users like you who help us make Netdata better

frank trench
#

I might not have the time to try this out until next week, but I will definitely get around to it then, and I will report back when I have tried it out 🙂
Is there an easy way to see the streams in use, or failing that, a simple way to check whether the connection has been made?

native shell
#

it's ok, feel free to try it out and come back to us 🙂
not yet, but we are working on a feature like that, where we draw a streaming/path graph to visualize your infrastructure ... stay tuned

frank trench
#

ohh nice, will be looking forward to that:)
Then i might ask for help confirming the connection after i have made the update.

Is it best if i write here, or just directly to you or a third place?

#

have to run, but thanks for the help today 🙂

native shell
#

of course, either here or a DM works fine