#Enhance quota alerting
hi @steep lily how did you install Harvest? That file will be in your conf directory which we can help you find after we know how you installed. e.g.
conf/rest/9.12.0/qtree.yaml
conf/zapi/cdot/9.8.0/qtree.yaml
Hi Chris, thanks for replying. I installed by using the CloudFormation template for EC2 instances ( https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/monitoring-harvest-grafana.html ), and each of my systems is a docker container.
NetApp Harvest is an open source tool for gathering performance and capacity metrics from ONTAP systems, and is compatible with FSx for ONTAP. You can use Harvest with Grafana for an open source monitoring solution.
I have just used the "find" command to see that there is a file for each of the containers
probably I should change one of them to see how it impacts my Grafana dashboard, and then apply that change to all of them.
the cloud formation install does not mount the conf directory that contains the templates you need to edit to uncomment these lines in qtree.yaml
Depending on your needs, you could switch to the docker compose workflow, which copies and mounts the conf directory locally and sets up Grafana and Prometheus like the cloud formation script does. Or, if you want, I can tell you how to copy and mount the conf directory with your current cloud formation setup.
That will be very useful Chris - "how to copy and mount conf directory with your current cloud formation setup". I was able to successfully test this by "finding" and modifying the file for one of the containers, but we have around 200 containers on this Harvest and we create new FSX instances every week. So we would like to have the ability to do this at the time of creating the docker instance on the Harvest - or alternately have a global setting that applies to all the instances
if you want to try the docker compose workflow, this should get you started https://netapp.github.io/harvest/nightly/install/containers/#docker-compose
if you want to copy and mount, give this a try https://github.com/NetApp/harvest/discussions/2648
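The linked discussion is the procedure to follow. As a rough sketch of what "copy and mount" generally looks like with plain docker commands: copy conf out of one container once, edit it on the host, then bind-mount that single copy into every poller container when it is created. The container path /opt/harvest/conf and both helper names are assumptions made for illustration, not taken from the discussion.

```shell
# Sketch only. The container-side path is an assumption; confirm it against
# the linked discussion or your image before relying on it.
HARVEST_CONF_IN_CONTAINER="${HARVEST_CONF_IN_CONTAINER:-/opt/harvest/conf}"

# Hypothetical helper: copy the conf directory out of one poller container.
# If host_dir does not exist yet, docker cp creates it and fills it with the
# directory's contents; if it already exists, conf is copied inside it.
# Usage: copy_harvest_conf <container-name> <host-directory>
copy_harvest_conf() {
  container="$1"
  host_dir="$2"
  docker cp "${container}:${HARVEST_CONF_IN_CONTAINER}" "$host_dir"
}

# Hypothetical helper: print the bind-mount flag to add wherever the poller
# containers are created, so every poller shares the one edited copy.
# Usage: conf_mount_flag <host-directory>
conf_mount_flag() {
  printf '%s\n' "-v $1:${HARVEST_CONF_IN_CONTAINER}"
}
```

With one shared host copy, a qtree.yaml edit applies to every poller that mounts it, including ones created later, which is the "global setting" asked about above.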
Hi @hollow storm , the "mount and copy" has worked ok. I can now see the user and group quotas for all my clusters. Plus the bigger thing is that I now have the config in my hands. Many thanks !
Hi @hollow storm , I need your help with an investigation and I have just sent you logs from a container for help with that. (on email)
I made the change that we discussed above ( with regard to changing the qtree.yaml file ) to include user and group quotas in grafana at about 10:00 - this seemed to work OK after I monitored for a few hours, but then broke abruptly at 12:20 - which you will see in the logs. I have resurrected the service a few minutes ago - but needed your help in identifying the issues as we lost about a day's data
The other change we made around the time was to modify the grafana admin password - this seems to be unrelated and has been made to the other systems ( but I just wanted to cross all the variables and actions past you )
looks like the Zapi:Qtree collector is chugging along nicely, collecting every three minutes. numQuotas=740 and collector=Zapi:Qtree instances=37.
grep "collector=Zapi:Qtree" docker-logs.txt | wc -l
29
Then at 2024-02-13T12:22:28Z the poller was killed.
2024-02-13T12:22:28Z INF poller/poller.go:538 > caught signal [terminated] Poller=cluster-31
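To line up the last successful poll with that termination time, a small helper can print the poll count plus the first and last timestamps for one collector. This is a sketch: the function name is made up, and it assumes every log line starts with a timestamp, as in the line above.

```shell
# Hypothetical helper: summarize one collector's log lines in a saved log file.
# Assumes the first whitespace-separated field of each line is the timestamp.
# Usage: collector_summary Zapi:Qtree docker-logs.txt
collector_summary() {
  collector="$1"
  logfile="$2"
  # -F matches the collector name literally rather than as a regex.
  grep -F "collector=${collector}" "$logfile" | awk '
    NR == 1 { first = $1 }
    { last = $1 }
    END { printf "polls=%d first=%s last=%s\n", NR, first, last }'
}
```

If the last timestamp sits just before the "caught signal" line, the collector was healthy right up to the point the poller was stopped.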
The question is who killed the container? Can you run these on the machine hosting these containers?
sudo dmesg -T | egrep -i 'killed process'
or
grep -i 'killed process' /var/log/messages
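Docker also records how a container last stopped, which may help if the kernel logs show nothing. A minimal sketch, using a made-up function name and only the State fields that docker inspect exposes:

```shell
# Hypothetical helper: print how a container last stopped.
# Usage: container_exit_info <container-name>
container_exit_info() {
  docker inspect \
    -f 'oom_killed={{ .State.OOMKilled }} exit_code={{ .State.ExitCode }} finished_at={{ .State.FinishedAt }}' \
    "$1"
}
```

As a general rule of thumb, an exit code of 137 commonly means the process died from SIGKILL (128 + 9) and 143 from SIGTERM (128 + 15). A process that catches the signal and shuts down cleanly can report a different code.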
These commands did not return any output, Chris. I have just sent the output of /var/log/messages from the time of interest, but I am not hopeful this will give us any information.
looks like the machine was rebooted around that time, and that's what killed the poller? If your system is using systemd, you should be able to see the reboots by running journalctl --list-boots
ok I get that now. Is there a better way to reboot the system ? Is it expected for the pollers to hang when a reboot happens ?
The poller didn't hang since the OS killed it. Sounds like it did not restart when the machine did though. The Ansible script uses restart_policy: unless-stopped which should do what you want, which is, restart all the running containers when the machine is rebooted. You can verify that your container has that set by running
docker inspect -f "{{ .HostConfig.RestartPolicy }}" harvest_cluster-1
Error: No such object: harvest_cluster-1
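The "No such object" error most likely just means no container on this host has that exact name. A sketch that lists every container, running or stopped, with its restart policy, so the real names do not need to be known up front (the function name is made up for illustration):

```shell
# Hypothetical helper: print "<container-name><TAB><restart-policy>" for every
# container on the host, including stopped ones.
restart_policies() {
  docker ps -a --format '{{.Names}}' | while read -r name; do
    policy=$(docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' "$name")
    printf '%s\t%s\n' "$name" "$policy"
  done
}
```

Any poller that does not show unless-stopped (or always) would not be expected to come back on its own after a reboot.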