#How to disable metrics in harvest?

tight island
#

A user is getting the error "template variable service failed Bad gateway".
We restarted the service; it works for 3 minutes and then returns the same error.
Memory and CPU usage are normal.
We are not sure what is causing the error.
This data source monitors over 50 clusters.
We want to disable some templates or metrics to reduce the load and see whether that fixes the error.
For example, how can I disable metrics like "node_disk_busy" or "node_cifs_open_files"?
I have checked the node.yaml file but didn't find a related entry.
Is there any recommendation for disabling unused templates?

glacial robin
#

@tight island Are you using NABox?

tight island
#

Using harvest

glacial robin
#

Okay, which installation type: RPM, Debian, containers, or native?

#

How much memory and CPU are available on the machine? Are Prometheus and Grafana installed on the same machine?

glacial robin
#

Thanks

glacial robin
#

@tight island Based on the logs, there are 57 pollers, and only 2 of them have a high metric count. These pollers have volumes, qtrees, and LUNs in the range of 3.5k, and workloads in the range of 6k, which should be OK. Disabling metrics like node_disk_busy will not help much, as these don't generate many metrics. There are some high collection times, but they are still within the collector's schedule limit.

Could you share the memory and CPU on the machine where the Harvest stack is installed? Are Prometheus and Grafana installed on the same machine? Also, could you check if there are any disk space issues? Do you see any errors in the Prometheus logs? It appears from your first message that Prometheus might be crashing.
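
One way to check which metrics actually carry the most time series is to ask Prometheus directly. The following is a minimal sketch, not a Harvest feature: the function names and the default `PROM_URL` of `http://localhost:9090` are assumptions for illustration, and the query uses the standard Prometheus HTTP API endpoint `/api/v1/query`. The query can be expensive on an instance that is already under load.

```shell
# Assumed Prometheus address; override by exporting PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# PromQL: the ten metric names with the most time series.
top_metrics_query() {
  echo 'topk(10, count by (__name__) ({__name__=~".+"}))'
}

# Send an instant query to the Prometheus HTTP API and print the JSON reply.
# -G turns the url-encoded data into a GET query string.
run_prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage (requires a reachable Prometheus):
#   run_prom_query "$(top_metrics_query)"
```

If the top entries are not the metrics being considered for removal, disabling those metrics is unlikely to change the load much.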

tight island
#

The user would like to disable unused metrics; "node_disk_busy" is just an example. Is there any recommendation for disabling default metrics?
Another department is federating this user's Prometheus (Harvest), and scrape_timeout had to be set to 90s for the federation to succeed. Is the amount of data polled from Harvest too large?

glacial robin
tight island
#

Thanks @glacial robin
Taking the metric "node_disk_busy" as an example, what should I comment out in node.yaml to disable it?
https://github.com/NetApp/harvest/blob/main/conf/zapi/cdot/9.8.0/node.yaml
I didn't find a counter related to this metric.

Or, as another example, how can I disable the metric "fcp_read_ops"?

Thanks!!

tight island
glacial robin
tight island
#

How can I get the Prometheus logs?

glacial robin
#

The method for collecting logs for Prometheus depends on the type of installation. You may want to refer to the Prometheus documentation for specific details.

  • For a containerized installation, you would use the docker logs command.
  • For an installation using systemd, you can use journalctl -u prometheus.
  • If Prometheus is configured to write logs to a file, you should check that specific file.
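
As a rough sketch of the options above, the helper below keeps only error- and warning-level lines, which is usually what is worth sharing. The function name is hypothetical, and `prometheus` as the container or unit name is an assumption that may differ on your system.

```shell
# Keep only error- and warning-level lines from Prometheus logs read on stdin.
# Prometheus logfmt output tags each line with level=<severity>.
filter_prom_errors() {
  grep -E 'level=(error|warn)'
}

# Usage, depending on the installation type (names are assumptions):
#   docker logs --since 1h prometheus 2>&1 | filter_prom_errors
#   journalctl -u prometheus --since '1 hour ago' --no-pager | filter_prom_errors
```
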
tight island
# Replying to glacial robin: Based on the logs, there are 57 pollers, and only 2 of the...

Here is the memory and CPU on the machine:
root@f18tksnamon02:/opt/harvest# top | grep %Cpu
%Cpu(s): 96.8 us, 1.0 sy, 0.0 ni, 2.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 84.2 us, 3.2 sy, 0.0 ni, 12.5 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 66.1 us, 4.8 sy, 0.0 ni, 29.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 67.4 us, 4.0 sy, 0.0 ni, 28.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 90.9 us, 1.6 sy, 0.0 ni, 7.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 99.3 us, 0.3 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 92.6 us, 1.5 sy, 0.0 ni, 5.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 65.4 us, 5.2 sy, 0.0 ni, 29.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 68.9 us, 5.0 sy, 0.0 ni, 26.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 72.8 us, 3.5 sy, 0.0 ni, 23.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st

Tasks: 901 total, 2 running, 899 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.1 us, 1.4 sy, 0.0 ni, 0.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257628.9 total, 17186.2 free, 193858.7 used, 46584.0 buff/cache
MiB Swap: 986.0 total, 336.8 free, 649.2 used. 61954.7 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1420512 prometh+ 20 0 408.4g 181.8g 5.9g S 6306 72.3 17162,07 prometheus
3258162 harvest 20 0 1243176 20272 10700 S 117.6 0.0 0:17.07 poller
3261438 root 20 0 10184 4636 3180 R 11.8 0.0 0:00.04 top

root@f18tksnamon02:/opt/harvest# free -m
total used free shared buff/cache available
Mem: 257628 197617 13363 5 46647 58195
Swap: 985 649 336

glacial robin
#

Thanks @tight island. Prometheus is clearly consuming a high amount of CPU. Could you share the Prometheus logs and the Prometheus configuration?

#

Also, how about disk space?

df -h
glacial robin
#

There are several errors in the Prometheus logs, such as the one below:

prometheus[2812514]: ts=2024-05-10T21:02:55.767Z caller=stdlib.go:105 level=error component=web caller="http: Accept error: accept tcp [::]:9090" msg="accept4: too many open files; retrying in 5ms"

You need to increase the file limit for Prometheus. If you are using systemd, then add the following to the prometheus.service file:

[Service]
LimitNOFILE=65536

Then, reload the systemd configuration and restart Prometheus:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
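
After the restart, it may be worth confirming that the new limit actually applies to the running process. This is a minimal sketch for Linux that reads `/proc`; the function names are hypothetical, and looking up the PID with `pidof prometheus` assumes the binary is named `prometheus`.

```shell
# Soft "Max open files" limit of a running process (Linux /proc only).
# In /proc/<pid>/limits the fourth field of that row is the soft limit.
max_open_files() {
  awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Number of file descriptors the process currently holds.
# Reading another user's process usually requires root.
open_fd_count() {
  ls "/proc/$1/fd" | wc -l
}

# Usage (assumes the process is named "prometheus"):
#   pid=$(pidof prometheus)
#   max_open_files "$pid"    # limit in effect for the process
#   open_fd_count "$pid"     # descriptors in use right now
#   systemctl show prometheus -p LimitNOFILE
```

If the descriptor count stays close to the limit even after raising it, the limit is masking a load problem rather than fixing it.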