#Prometheus federation timeout issue


green crag
#

Hi all,
We've hit a Prometheus federation timeout issue.
When we turn on three high-load Harvest jobs, the federation scrape times out.
After turning off these three Harvest jobs, it works well: the Scrape Duration is around 44s, and we have set the timeout to 90s.

To keep federation working, three of our biggest clusters are currently not being monitored.

Also, would disabling all the unused metrics with conf_path help solve this issue?

frigid parcel
#

hi @green crag Prometheus federation timeouts tend to happen when you're pulling too much data. Are you seeing broken pipe messages in your Prometheus logs or something else?

A quick search indicates that a lot has been written on this topic - some folks have had better luck with Thanos as a federation solution or VictoriaMetrics via remote_write. Increasing timeouts, reducing the number of metrics collected, or collecting the metrics at different intervals are other solutions/workarounds I saw. The consensus seems to be that federation is not a good option if you want to pull all metrics from a set of high-metric Prometheus servers. Let me dig up some links for you.
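
To illustrate the "reducing the number of metrics collected" workaround: Prometheus's `/federate` endpoint only returns series that match the `match[]` selectors passed in the query string, so narrower selectors mean a smaller payload per scrape. A minimal sketch, assuming a hypothetical host name and hypothetical metric-name prefixes (substitute whatever your servers actually expose):

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL that requests only series matching `selectors`."""
    # doseq=True emits one match[] parameter per selector.
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical selectors: pull two metric families instead of everything.
selectors = ['{__name__=~"volume_.*"}', '{__name__=~"aggr_.*"}']

# Hypothetical Prometheus address.
url = federate_url("http://prometheus.example:9090/", selectors)
print(url)

# Round-trip check: decoding the query string gives back the selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
print(decoded == selectors)
```

Fetching a URL like this by hand and noting the response size and time can show whether a given selector set fits comfortably inside the scrape timeout.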

frigid parcel
green crag
#

@frigid parcel Thank you so much for providing these documents!
We will study them carefully.

green crag
#

@frigid parcel Yes, we are seeing "write: broken pipe" messages in Prometheus logs

green crag
#

Is it possible that we are hitting a memory leak or CPU overuse?
We are monitoring around 70 clusters.
The Harvest documentation recommends:

"When monitoring 10 clusters, we recommend:

CPU: 2 cores
Memory: 1 GB
Disk: 500 MB (mostly used by log files)"

How should we estimate the hardware requirements for these 70 clusters?

regal totem
#

@green crag The numbers above are just for Harvest.
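
For the Harvest side only, a rough first guess can be had by scaling the quoted 10-cluster recommendation up to 70 clusters. This sketch assumes resource use grows linearly with cluster count, which the quoted recommendation does not state, and it says nothing about what Prometheus itself needs. The function name is made up for illustration.

```python
import math

# Figures quoted above from the Harvest documentation (for 10 clusters).
BASELINE_CLUSTERS = 10
BASELINE = {"cpu_cores": 2, "memory_gb": 1, "disk_mb": 500}


def estimate_harvest_resources(clusters):
    """Scale the 10-cluster baseline linearly (an assumption, not a guarantee)."""
    factor = clusters / BASELINE_CLUSTERS
    return {name: math.ceil(value * factor) for name, value in BASELINE.items()}


print(estimate_harvest_resources(70))
# {'cpu_cores': 14, 'memory_gb': 7, 'disk_mb': 3500}
```

Treat the result as a starting point and compare it with the CPU and memory the Harvest pollers actually use on the host.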

green crag