A user is getting the error "template variable service failed Bad gateway".
We tried restarting the service; it works for about 3 minutes and then returns the same error.
Memory and CPU usage are normal.
We are not sure what is causing the error.
And this data source is monitoring over 50 clusters.
We would like to disable some templates or metrics to reduce the load and see whether that fixes the error.
For example, to disable metric like "node_disk_busy" or "node_cifs_open_files", how can I disable them?
I have checked the node.yaml file, but didn't find a related one.
Is there any recommendation for disabling unused templates?
#How to disable metrics in harvest?
@tight island Are you using NABox?
Using harvest
Okay which installation type? rpm, debian, containers, native?
How much memory and CPU are available on the machine? Are Prometheus and Grafana installed on the same machine?
Could you also share the logs at ng-harvest-files@netapp.com? We can check whether any spammy metrics in the system may be causing this issue.
https://netapp.github.io/harvest/24.02/help/log-collection/
Debian
Sent the log to you, thanks
Thanks
@tight island Based on the logs, there are 57 pollers, and only 2 of them have a high metric count. These pollers have volumes, qtrees, and LUNs in the range of 3.5k, and workloads in the range of 6k, which should be OK. Disabling metrics like node_disk_busy will not help much, as these don't generate many metrics. There are some high collection times, but they are still within the schedule limit of the collector.
Could you share the memory and CPU on the machine where the Harvest stack is installed? Are Prometheus and Grafana installed on the same machine? Also, could you check if there are any disk space issues? Do you see any errors in the Prometheus logs? It appears from your first message that Prometheus might be crashing.
The user would like to disable unused metrics; "node_disk_busy" is just an example. Is there any recommendation on disabling default metrics?
Another department is federating this user's Prometheus (the one scraping Harvest), and scrape_timeout had to be set to 90s for the federation to succeed. Could the amount of data polled from Harvest be too large?
So, the error template variable service failed Bad gateway is from Grafana, which is connected to the federation Prometheus? We need to check the Prometheus logs to identify the issue.
You can disable the collection of a particular metric from Prometheus itself by using metric_relabel_configs. More information can be found here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs.
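As a rough sketch of the metric_relabel_configs approach, the snippet below writes a sample scrape job to a scratch file and checks it with promtool when that tool is installed. The job name, target address, and file path are assumptions for illustration, not values taken from this thread. Note that this drops the series on the Prometheus side after each scrape; Harvest still collects and exports them.

```shell
# Write a sample scrape job to a scratch file for review.
# Assumptions: job name "harvest", exporter at localhost:12990, scratch path in /tmp.
cat > /tmp/harvest-scrape-sample.yml <<'EOF'
scrape_configs:
  - job_name: harvest
    static_configs:
      - targets: ['localhost:12990']
    metric_relabel_configs:
      # Drop the listed series after the scrape, before they are stored.
      - source_labels: [__name__]
        regex: 'node_disk_busy|fcp_read_ops'
        action: drop
EOF

# Validate the syntax if promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check config /tmp/harvest-scrape-sample.yml
fi
```

The metric_relabel_configs stanza would then be merged into the existing Harvest scrape job in prometheus.yml rather than used as a standalone file.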
On the Harvest side, if you need to exclude a metric from a Zapi template, it's best to copy the Harvest templates into a different configuration path. Then, modify the templates as needed and pass that configuration path when starting Harvest.
https://netapp.github.io/harvest/24.05/configure-templates/#conf-path
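The copy step could look something like the sketch below. copy_templates is a hypothetical helper, not a Harvest command, and the example paths assume Harvest lives under /opt/harvest; adjust them to your installation. How to point Harvest at the copied directory is described on the conf-path page linked above.

```shell
# copy_templates SRC DST: copy a Harvest conf directory to a separate location
# so edits there are kept apart from the shipped templates.
copy_templates() {
  src="$1"
  dst="$2"
  if [ ! -d "$src" ]; then
    echo "source directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$dst" && cp -r "$src/." "$dst/"
}

# Example call (assumed paths, commented out so nothing is copied by accident):
# copy_templates /opt/harvest/conf /opt/harvest/customconf
```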
Thanks @glacial robin
Taking the metric "node_disk_busy" as an example, which line in node.yaml should I comment out to disable it?
https://github.com/NetApp/harvest/blob/main/conf/zapi/cdot/9.8.0/node.yaml
I didn't find any counters related to this metric.
Or for another example: metric "fcp_read_ops", how can I disable this metric?
Thanks!!
@tight island You can find the mapping from these metrics to their Harvest templates here:
https://netapp.github.io/harvest/24.05/ontap-metrics/#node_disk_busy
https://netapp.github.io/harvest/24.05/ontap-metrics/#fcp_read_ops
You can disable each metric at these lines, respectively:
https://github.com/NetApp/harvest/blob/main/conf/zapiperf/cdot/9.8.0/disk.yaml#L10
https://github.com/NetApp/harvest/blob/main/conf/zapiperf/cdot/9.8.0/fcp.yaml#L30
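Commenting out a counter can also be scripted. The helper below is a hypothetical sketch (comment_out_counter is not part of Harvest) that prefixes the matching list entry with "#". It assumes GNU sed and that the counter appears as a "- name" list item; confirm the exact counter name in the linked template before running it, and run it against a copied template rather than the shipped one.

```shell
# comment_out_counter FILE COUNTER: comment out one "- COUNTER" list entry in a template.
# Only an exact name match is changed; longer names sharing the prefix are left alone.
comment_out_counter() {
  sed -i -E "s/^([[:space:]]*)-[[:space:]]*($2)([[:space:]].*)?\$/\1# - \2\3/" "$1"
}

# Example call (placeholder path and counter name, commented out on purpose):
# comment_out_counter path/to/customconf/zapiperf/cdot/9.8.0/disk.yaml COUNTER_NAME
```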
So it is not recommended to modify the counters directly in the Harvest templates under conf/zapi?
They will be overwritten when you upgrade Harvest. It is best to use conf_path, as suggested earlier.
https://netapp.github.io/harvest/24.05/configure-templates/#conf-path
Yes, Prometheus and harvest are installed on the same machine.
I had asked them to collect Prometheus log
How can I get Prometheus log?
The method for collecting logs for Prometheus depends on the type of installation. You may want to refer to the Prometheus documentation for specific details.
- For a containerized installation, you would use the docker logs command.
- For an installation using systemd, you can use journalctl -u prometheus.
- If Prometheus is configured to write logs to a file, you should check that specific file.
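To narrow a large log down to the lines that matter, a small filter along these lines may help. filter_prom_errors is a hypothetical helper name, and it assumes Prometheus is logging in its default logfmt style with level=... fields.

```shell
# filter_prom_errors: keep only error- and warn-level lines from Prometheus logs on stdin.
# "|| true" keeps the exit status zero when no lines match.
filter_prom_errors() {
  grep -E 'level=(error|warn)' || true
}

# Example with a systemd install (unit name "prometheus", as in the command above):
# journalctl -u prometheus --since "1 hour ago" --no-pager | filter_prom_errors
```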
Here is the memory and CPU on the machine:
root@f18tksnamon02:/opt/harvest# top | grep %Cpu
%Cpu(s): 96.8 us, 1.0 sy, 0.0 ni, 2.2 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 84.2 us, 3.2 sy, 0.0 ni, 12.5 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 66.1 us, 4.8 sy, 0.0 ni, 29.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 67.4 us, 4.0 sy, 0.0 ni, 28.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 90.9 us, 1.6 sy, 0.0 ni, 7.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 99.3 us, 0.3 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 92.6 us, 1.5 sy, 0.0 ni, 5.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 65.4 us, 5.2 sy, 0.0 ni, 29.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 68.9 us, 5.0 sy, 0.0 ni, 26.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 72.8 us, 3.5 sy, 0.0 ni, 23.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
Tasks: 901 total, 2 running, 899 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.1 us, 1.4 sy, 0.0 ni, 0.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257628.9 total, 17186.2 free, 193858.7 used, 46584.0 buff/cache
MiB Swap: 986.0 total, 336.8 free, 649.2 used. 61954.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1420512 prometh+ 20 0 408.4g 181.8g 5.9g S 6306 72.3 17162,07 prometheus
3258162 harvest 20 0 1243176 20272 10700 S 117.6 0.0 0:17.07 poller
3261438 root 20 0 10184 4636 3180 R 11.8 0.0 0:00.04 top
root@f18tksnamon02:/opt/harvest# free -m
total used free shared buff/cache available
Mem: 257628 197617 13363 5 46647 58195
Swap: 985 649 336
Thanks @tight island. Prometheus is clearly consuming a high amount of CPU. Could you share the Prometheus logs and the Prometheus configuration?
Also how about disk space?
df -h
There are several errors in the Prometheus logs, such as the one below:
prometheus[2812514]: ts=2024-05-10T21:02:55.767Z caller=stdlib.go:105 level=error component=web caller="http: Accept error: accept tcp [::]:9090" msg="accept4: too many open files; retrying in 5ms"
You need to increase the file limit for Prometheus. If you are using systemd, then add the following to the prometheus.service file:
[Service]
LimitNOFILE=65536
Then, reload the systemd configuration and restart Prometheus:
sudo systemctl daemon-reload
sudo systemctl restart prometheus
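After the restart, it may be worth confirming that the new limit took effect and watching how close Prometheus gets to it. The sketch below assumes a Linux host with /proc and a systemd unit named prometheus; open_fd_count is a hypothetical helper, and reading another user's /proc entries may require sudo.

```shell
# open_fd_count PID: number of file descriptors a process currently holds (Linux /proc).
open_fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Example checks (commented out; they need a running Prometheus):
# systemctl show prometheus -p LimitNOFILE            # limit configured in the unit
# grep 'open files' "/proc/$(pidof prometheus)/limits" # limit applied to the process
# open_fd_count "$(pidof prometheus)"                  # descriptors currently in use
```

If the count keeps climbing toward the limit, raising LimitNOFILE only postpones the error, and the source of the connections (for example, the federation scrapes) is worth a closer look.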