Recently we ran into NABox disabling collection on two of our clusters. It didn't happen to both at the same time; they stopped several days apart. We are trying to troubleshoot what would cause this. We are running NABox version 4.0.10 and Harvest 25.05.1. Our other 3 clusters plus our StorageGRID all continued to collect without issue. Our data partition is only about 50% full, so we didn't run out of space. NABox is deployed with 4 vCPUs and 8 GB of RAM. Looking back at the NABox Health dashboard, I don't see anything that stands out as a resource issue. Any ideas where else we can look to see what caused these to be disabled? Also, is there a way to configure an alert if a cluster stops collecting?
#NABox stopped collecting several clusters.
Hi @fiery tree, can you grab a support bundle and upload it to https://upload.nabox.org/gixa-japu-feyo? Which cluster/poller stopped? Hopefully the log files shed light on what happened. Depending on how long ago this happened, though, the log files may have rolled and obscured the reason.
will do
Uploaded. The clusters that stopped were fas8300-02 and c400-01; c400-01 was the most recent one to stop.
thanks!
Hi @fiery tree, unfortunately the log files did roll, and all I see are your pollers happily running. If this happens again, please try to grab a support bundle closer to when the pollers die. In terms of creating an alert when the pollers stop, try using the poller_status metric (1 = up, 0 = down).
@echo forum no worries, appreciate you taking the time to take a look! As for the poller_status option, does this go in one of the Harvest configuration files? I looked for some documentation on it and only found it mentioned in the release notes for Harvest 24.08.0.
poller_status is one of the Prometheus metrics that Harvest exports. You would use this metric to set up a Prometheus alert in your alert_rules.yml. Since you're using NABox, another way to handle the poller up/down alert would be to use Alertmanager (available via http://nabox-ip/am) with the existing Prometheus metric named up. @distant ingot, do you have any documentation or examples on how to set up an alert for a down poller?
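In the meantime, here is a rough sketch of how you could check the same condition by hand through the Prometheus HTTP API, which is also a quick way to try out the expression you would later put in an alert rule. It uses the standard `/api/v1/query` endpoint with the expression `poller_status == 0`. The `poller` label name and the base URL in the usage note are assumptions, not something confirmed in this thread, so check the labels on your own series first. It is also not confirmed here whether a poller that has exited still reports poller_status at all, which is one reason to try `up == 0` as well.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    params = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_pollers(payload: dict, label: str = "poller") -> list:
    """Return the sorted label values of every series in an instant-query response.

    `label` defaults to "poller", which is an assumed label name; pass a
    different one (for example "instance" when querying `up`) if needed.
    """
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    names = []
    for sample in payload["data"]["result"]:
        # Each sample carries its labels under "metric".
        names.append(sample.get("metric", {}).get(label, "<unknown>"))
    return sorted(names)
```

To use it, call something like `down_pollers(query_prometheus(BASE_URL, "poller_status == 0"))`, where `BASE_URL` is wherever Prometheus is reachable on your NABox (a placeholder here, I don't know the exact path). An empty list means no series currently matches the expression. For the `up` metric, `down_pollers(query_prometheus(BASE_URL, "up == 0"), label="instance")` lists the scrape targets Prometheus could not reach.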