Hello NetApp team, I have been having trouble with the Prometheus database since enabling quota reporting for users and groups (by modifying /opt/harvest/conf/rest/9.12.0/qtree.yaml). Prometheus keeps restarting every few minutes because memory fills up before the WAL replay completes. I have expanded the memory from 24 GB to 36 GB, but this 50% increase does not seem to be enough to introduce quota reporting. The WAL now replays for longer, but Prometheus still restarts after replaying for around 30 minutes.
Please suggest what can be done to introduce this reporting functionality into the system. Happy to provide prometheus docker logs or any other logs that might be needed. All the NetApp systems are set up as docker instances.
#High memory consumption with quota reporting
@distant carbon Could you share the Harvest logs by emailing them to ng-harvest-files@netapp.com?
https://netapp.github.io/harvest/24.02/help/log-collection/
Hi @rapid rampart, I have just emailed the logs. I was able to make some progress last evening, but to do so I had to increase the memory to 48 GB. So in essence, a memory upgrade from 24 GB to 48 GB, doubling the system memory, only to accommodate user and group quotas. I was able to get through the night, but saw that Prometheus restarted this morning. We are better placed than yesterday, when the WAL replay would never complete and the database would restart every 30 minutes or so.
Please let me know what your observations or recommendations are.
@distant carbon Logs which you have sent are from prometheus. Could you share Harvest logs?
Hi Rahul, these are Harvest logs. The container ID that I mentioned in "docker logs <container_id>" is that of the Prometheus database. Are you looking for some other logs?
We need the Harvest container logs. What is your installation type? Container or NAbox?
container
Okay. Can you share the Harvest poller container logs?
Thank you, Madaan. Received the logs. For the poller logs, the following objects have a high metric count. However, as you mentioned, the problem started only after turning on Qtree. Therefore, it's possible that there might be another poller with a high number of qtrees. Could you share the logs of that poller? If you're unsure which one it is, you could share the logs of all pollers, and we can check which poller might be causing a high number of metrics.
The poller whose logs you shared has 3761 quotas.
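For anyone wanting to check this themselves: one rough way to see which objects produce the most series is to count samples per metric name in a poller's Prometheus exporter output. A minimal Python sketch, assuming the exporter text has already been fetched or saved; the sample metric names and labels are hypothetical and not taken from the logs in this thread:

```python
from collections import Counter


def count_series(exposition_text):
    """Count samples per metric name in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Hypothetical sample; real metric names and labels may differ.
sample = """\
# HELP quota_disk_used hypothetical help text
quota_disk_used{qtree="q1",user="alice"} 10
quota_disk_used{qtree="q1",user="bob"} 20
volume_size_used{volume="vol1"} 5
"""

# Print metrics from the highest series count to the lowest.
for name, n in count_series(sample).most_common():
    print(name, n)
```

Running this against each poller's output and comparing the top entries should show whether quota metrics dominate the series count on a given poller.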
Hi Rahul, the logs are too big to be sent on email. Is there a place where I can upload these ?
There are quite a few containers on the system, resulting in a huge log file. Over the last 2 days, we have been forced to increase the RAM on the host from 24 GB to 64 GB only to accommodate the user and group quota reporting. This doesn't seem quite right to me, as this functionality had been working until it was broken by mistake and re-instated. Since its re-introduction, the memory has had to be increased nearly 3-fold to prevent Prometheus from restarting time and again.
@distant carbon The issue is most likely caused by a very high number of objects on a poller. I've noticed some spam log messages in your logs, which are fixed in nightly builds. We only need the logs from the last couple of hours for the pollers to check this. Does that reduce the size of the logs? Also, it's okay to send multiple emails if the files are still large.
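To keep the files small, the logs can be limited to a recent window with the `--since` option of `docker logs`, which accepts a relative duration such as `2h`. A minimal Python sketch, assuming Docker is on the PATH; the container name `poller1`, the output file name, and both helper function names are hypothetical:

```python
import subprocess


def recent_logs_command(container, hours=2):
    """Build a 'docker logs' command limited to the last `hours` hours."""
    # 'docker logs --since' accepts a relative duration such as '2h'.
    return ["docker", "logs", "--since", f"{hours}h", container]


def save_recent_logs(container, out_path, hours=2):
    """Write the container's recent stdout and stderr to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(
            recent_logs_command(container, hours),
            stdout=out,
            stderr=subprocess.STDOUT,  # docker logs replays the container's stderr separately
            check=True,
        )


# Example with a hypothetical container name:
# save_recent_logs("poller1", "poller1-last-2h.log")
```

Repeating this per poller container gives one small file per poller, which can then be sent in separate emails if needed.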
Thanks @distant carbon . Received logs.
Cheers Rahul.