Alarms not closing after server has been unreachable + alarm spam | Netdata

frank trench Sep 11, 2024, 8:07 AM

#

We have had a lot of issues with the Netdata agents sending alarms about our servers being unresponsive whenever the agent updates. This results in more than 50 alarms trickling in over almost half an hour. This in itself is annoying, and I would love to know if there is a way to set limits so that alarms do not activate before the agent has been down for, let's say, 2-5 minutes.
But the main problem is that none of the opened alarms are being closed after the servers become "reachable" again. This means that a lot of alarms are just flooding our alarm system and have to be closed manually. We are using the Opsgenie integration for alarms.
We have now chosen to disable automatic updating of the agents because of this problem.

So my question is three-fold:

1. Is there some setting that was missed when setting up alarms and that needs to be changed to make sure that ALL alerts are automatically closed once the status is restored?
2. Is it possible to make sure it only triggers unreachable alerts when the agent has been unreachable for a set amount of time, to limit (in our case a lot of) false alerts?
3. Is it possible to keep the agent from triggering alerts when the server or agent has been restarted manually (or is restarting as part of an agent update)?

native shell Sep 11, 2024, 10:23 AM

#

Hi @frank trench, sorry to hear you had to disable updates due to this issue.
Let me try to help with questions 1 & 2.

1. They should automatically clear once the agent is reachable again. Could you provide me your space_id so I can take a closer look? (You can find it in your space settings.)
2. Unfortunately that is not possible ... yet. The good news is that it has been requested by multiple users, so we have added this feature to our list and hope to deliver it soon (1-2 weeks).

frank trench Sep 11, 2024, 10:49 AM

#

I have forwarded the space_id directly to you in a PM 🙂
Yeah, I thought as much, and I have seen some talk about it on the forum, but those threads are very old, so I wanted to give my own reason for why it would be an extremely great feature, as it is almost always a false positive from our system.

native shell Sep 11, 2024, 10:51 AM

#

thank you, I will get back to you with my findings
indeed, you're absolutely right

native shell Sep 11, 2024, 12:52 PM

#

@frank trench I found that we introduced a bug when adding the reachability notification aggregation feature.
this sometimes leads to incidents not being closed on incident management integrations (Opsgenie, PagerDuty, etc).
we will work on addressing this as soon as possible.
also, during my investigation I found that your child nodes are also claimed to the Cloud, and this is the root cause of the problem. although we support this, it's not the best setup.
https://learn.netdata.cloud/docs/observability-centralization-points/metrics-centralization-points/clustering-and-high-availability-of-netdata-parents

Clustering and High Availability of Netdata Parents | Learn Netdata

Netdata supports building Parent clusters of 2+ nodes. Clustering and high availability works like this:

frank trench Sep 11, 2024, 1:01 PM

#

Ahh nice to hear you found the bug 🙂 will be looking forward to seeing a fix for it 🙂

oh, then we have introduced that by mistake. Have to look into how to change that, as that was not meant to happen.
might return for some help on this, but have to talk with the person who originally made the setup, and we are currently looking into how to fix this 😄

#

so thanks for bringing this to our attention 😄

frank trench Sep 11, 2024, 1:17 PM

#

ah we think we found the mistake.
We had enabled stream on the parent node as well as on the child nodes 😄
so a quick parent stream.conf update and a reload should make the correct claim, I guess?

#

nope, missed the part about active-active parents.
so we actually don't know what is wrong, as it seems to be set up like the following
https://learn.netdata.cloud/docs/deployment-guides/deployment-examples#activeactive-parents

Deployment Examples | Learn Netdata

Deployment Options Overview

native shell Sep 11, 2024, 1:41 PM

#

the streaming configuration is all correct
the only improvement is not stream related but actually how Netdata was installed
only parents should be installed with the claiming arguments AKA Connect Agent to Cloud

#

only parents should be claimed/connected to the cloud
https://learn.netdata.cloud/docs/netdata-cloud/connect-agent-to-cloud#automatically-via-environment-variables

Connect Agent to Cloud | Learn Netdata

This section guides you through installing and securely connecting a new Netdata Agent to Netdata Cloud via the

#

let's take this simple example, one parent with one child and both claimed/connected to the cloud
it will look like this

node id | stream id | hops (distance to cloud)
---------------------------------------------
parent1 | stream 1 | 0
child1  | stream 1 | 1
child1  | stream 2 | 0

#

if only parent is claimed/connected to the cloud
it will look like this

node id | stream id | hops (distance to cloud)
---------------------------------------------
parent1 | stream 1 | 0
child1  | stream 1 | 1

#

this is hidden for the user but it's part of Cloud business logic

#

of course this is just an improvement and it's not actually wrong, since we support these setups as well
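
The difference between the two tables can be checked mechanically. As a rough sketch (not the Cloud's actual business logic, which is not described in this thread), the hypothetical helper `find_multi_path_nodes` reads rows in the same `node id | stream id | hops` layout and prints every node that appears with more than one distinct stream, i.e. a node that is both streamed through a parent and claimed on its own:

```shell
# Hypothetical helper, for illustration only: list nodes that reach the cloud
# over more than one stream. Input rows look like: node id | stream id | hops
find_multi_path_nodes() {
    awk -F'|' '
        NF == 3 {
            node = $1; stream = $2
            gsub(/^[ \t]+|[ \t]+$/, "", node)     # trim padding around the node id
            gsub(/^[ \t]+|[ \t]+$/, "", stream)   # trim padding around the stream id
            key = node SUBSEP stream
            if (node != "" && !(key in seen)) {   # count each (node, stream) pair once
                seen[key] = 1
                paths[node]++
            }
        }
        END { for (n in paths) if (paths[n] > 1) print n }
    ' | sort
}

# First table from the example above: parent and child both claimed.
find_multi_path_nodes <<'EOF'
parent1 | stream 1 | 0
child1  | stream 1 | 1
child1  | stream 2 | 0
EOF
# prints: child1
```

Fed the second table (only the parent claimed), the helper prints nothing, because every node has a single path to the cloud.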

frank trench Sep 11, 2024, 1:48 PM

#

that makes sense.
We have just copied the value from within the dashboard when adding new agents.
So I guess instead of
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh --stable-channel --claim-token --claim-rooms --claim-url

we only need to use
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh --stable-channel --claim-token --claim-rooms
?

native shell Sep 11, 2024, 1:49 PM

#

almost 😉
it would be without any claim argument

#

they will appear automatically on cloud since they are streaming to a parent (as long as the parent is claimed/connected to the cloud)
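
Following that advice, a child node would be installed with the same kickstart call as above, but with every `--claim-*` flag dropped. A minimal sketch, assuming streaming from the child to the parent is configured separately; the function name `install_child` is made up for illustration and is not part of Netdata:

```shell
# Sketch only: install a child agent WITHOUT claiming it to Netdata Cloud.
# This is the kickstart command from the message above minus the --claim-* flags.
install_child() {
    wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh &&
        sh /tmp/netdata-kickstart.sh --stable-channel
}

# Run this on child nodes only; parents keep their claim arguments.
# install_child
```

The call is left commented out so that pasting the block only defines the function and does not start an install by accident.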

frank trench Sep 11, 2024, 1:50 PM

#

ahh fair enough.
I guess it would be fixed by just running the wget call again, since, if I remember correctly, that is the easiest way to "update" such settings.

native shell Sep 11, 2024, 1:53 PM

#

I think you have to remove/delete the cloud folder, let me find it in the docs

frank trench Sep 11, 2024, 1:53 PM

#

By the way, a small idea for the dashboard: in the past I have accidentally generated a new token, as there is no confirmation box before it yeets the old one 🙂 An extra confirmation when hitting the regenerate token button might be an idea 😄

#

well, that is doable 😄
remove a folder, run a wget script and Bob's ur uncle 😄

native shell Sep 11, 2024, 1:54 PM

#

indeed, that's a nice suggestion, let me tag product people

native shell Sep 11, 2024, 1:54 PM

#

cc: <@&1200345604870656062>

frank trench Sep 11, 2024, 1:55 PM

#

this optimization might actually fix some issues with both memory and disk space that we have had and that seemed to stem from Netdata. For now it looks like they simply come from everything being connected multiple times.

native shell Sep 11, 2024, 1:56 PM

#

it may actually be the case

#

found it
https://learn.netdata.cloud/docs/netdata-cloud/connect-agent-to-cloud#linux-based-installations

#

To remove a node from your Space in Netdata Cloud, delete the cloud.d/ directory in your Netdata library directory.

cd /var/lib/netdata   # Replace with your Netdata library directory, if not /var/lib/netdata/
sudo rm -rf cloud.d/

IMPORTANT:
Keep in mind that the Agent will be re-claimed automatically if the environment variables or claim.conf exist when the agent is restarted.

This node no longer has access to the credentials it used when connecting to Netdata Cloud via the ACLK. You will still be able to see this node in your Rooms in an unreachable state.
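
The steps quoted above could be wrapped in a small script. This is only a sketch: the helper name `unclaim_child` is made up, and `/etc/netdata` as the location of `claim.conf` is an assumption that may not match your install, so both directories can be passed explicitly.

```shell
# Hypothetical helper: drop the Cloud claim state of a child node.
#   $1 = Netdata library directory (defaults to /var/lib/netdata, as in the docs quote)
#   $2 = Netdata config directory  (defaults to /etc/netdata; this is an assumption)
unclaim_child() {
    lib_dir="${1:-/var/lib/netdata}"
    conf_dir="${2:-/etc/netdata}"

    if [ -d "$lib_dir/cloud.d" ]; then
        rm -rf "$lib_dir/cloud.d"
        echo "removed $lib_dir/cloud.d"
    else
        echo "no cloud.d directory found in $lib_dir"
    fi

    # Per the IMPORTANT note: a leftover claim.conf re-claims the agent on restart.
    if [ -f "$conf_dir/claim.conf" ]; then
        echo "warning: $conf_dir/claim.conf exists, the agent will be re-claimed on restart"
    fi
}

# Example (needs root on a real host):
# unclaim_child /var/lib/netdata /etc/netdata
```

The helper only warns about `claim.conf` instead of deleting it, so the decision to remove it stays with whoever runs the script. Claim-related environment variables, which the note also mentions, are not checked here.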

frank trench Sep 11, 2024, 1:58 PM

#

two of our primary parent servers are each using 20-25% of their 8GB memory running the Netdata agent, so it might help us there 😄

native shell Sep 11, 2024, 1:58 PM

#

since your agent is streaming to a parent, the whole statement is not true, as your node will still be visible & online on the cloud

#

as it has another stream to reach the cloud

frank trench Sep 11, 2024, 1:59 PM

#

this is truly amazing support 😄

native shell Sep 11, 2024, 2:02 PM

#

thanks 🙇‍♂️
we like to keep users like you who help us make Netdata better

frank trench Sep 11, 2024, 2:03 PM

#

I might not have the time to try this out until next week, but I will definitely get around to it then and report back once I have tried it 🙂
Is there an easy way to see the streams in use or, if not, just a simple way to check that the connection has been made?

native shell Sep 11, 2024, 2:06 PM

#

it's ok, feel free to try it out and come back to us 🙂
not yet, but we are working on a feature like that, where we draw a streaming/path graph to visualize your infrastructure ... stay tuned

frank trench Sep 11, 2024, 2:07 PM

#

ohh nice, will be looking forward to that:)
Then i might ask for help confirming the connection after i have made the update.

Is it best if i write here, or just directly to you or a third place?

#

have to run, but thanks for the help today 🙂

native shell Sep 11, 2024, 6:39 PM

#

of course, either here or as a DM works good
