#Harvest->Prometheus: Duplicate samples for timestamp


agile breach
#

We are using Harvest 24.02 with zapi and zapiperf to gather metric data from CMODE to Prometheus. We are using Harvest's Prometheus service discovery to facilitate this.
When we turned on debug logging on the Prometheus side, we got warnings from the Prometheus scraper of the poller's /metrics endpoint about duplicate samples for timestamp: msg="Duplicate sample for timestamp"

I do see multiple metrics with the exact same series and labels reported.

$ curl -s http://harvest.domain:13008/metrics |grep 'fabricpool_cloud_bin_op_latency_average{datacenter="AAA",cluster="bbb",cloud_target="ccc",svm="ddd",volume="vvv",metric="PUT"}'
fabricpool_cloud_bin_op_latency_average{datacenter="AAA",cluster="bbb",cloud_target="ccc",svm="ddd",volume="vvv",metric="PUT"} 116
fabricpool_cloud_bin_op_latency_average{datacenter="AAA",cluster="bbb",cloud_target="ccc",svm="ddd",volume="vvv",metric="PUT"} 57
fabricpool_cloud_bin_op_latency_average{datacenter="AAA",cluster="bbb",cloud_target="ccc",svm="ddd",volume="vvv",metric="PUT"} 115
fabricpool_cloud_bin_op_latency_average{datacenter="AAA",cluster="bbb",cloud_target="ccc",svm="ddd",volume="vvv",metric="PUT"} 62
#

In the docs, I see the query I should be using to pull these metrics from ZAPI myself, but when I try querying the poller myself, I can't find an instance of that volume. So where is it coming from?

# ./bin/harvest zapi -p bbb show data -o wafl_comp_aggr_vol_bin -c uuid -c cloud_bin_op_latency_average -c instance_name -c object_store_name -c vol_name -c vserver_name -c cloud_bin_operation |grep vvv
connected to AAA (NetApp Release XXX: Fri Nov 24 00:10:13 UTC 2023)

The Harvest logs don't seem to report any issues though. All metrics are reported fine according to Harvest, but I'm confused why I can't pull the data from ZAPI myself to see what's going on.

{"level":"info","Poller":"bbb","collector":"ZapiPerf:WAFLCompBin","skips":0,"apiMs":3518,"calcMs":3,"pluginMs":5,"metrics":23232,"zBegin":1726003192813,"pollMs":3710,"parseMs":167,"instances":1936,"exportMs":34,"instancesExported":1936,"metricsExported":23232,"caller":"collector/collector.go:585","time":"2024-09-10T14:19:56-07:00","message":"Collected"}
cyan bone
#

@agile breach Can you run this CLI command to get all of the wafl_comp_aggr_vol_bin instances, and send the resulting fabricpool.txt file to us at ng-harvest-files@netapp.com?
./bin/harvest zapi -p POLLER_NAME show data -o wafl_comp_aggr_vol_bin > fabricpool.txt

spice zinc
#

We can help you track down the duplicate samples - most often it's a misconfiguration problem. A common one is polling the same cluster multiple times.

I'm curious if there was a different problem that caused you to turn on debug in Prometheus though? I want to make sure we're solving your real problem.

Did the Prometheus log tell you which series has the duplicate?

This link has some good tips on debugging duplicate samples https://promlabs.com/blog/2022/12/15/understanding-duplicate-samples-and-out-of-order-timestamp-errors-in-prometheus/

I don't see a problem with your bin/harvest zapi command and it works correctly for me. I usually pipe the XML into dasel to convert to JSON to make it easier to parse with jq and gron. E.g.
bin/harvest zapi -p sar show data -o wafl_comp_aggr_vol_bin --config cbg/harvest.openlab.yml -c uuid -c cloud_bin_op_latency_average -c instance_name -c object_store_name -c vol_name -c vserver_name -c cloud_bin_operation | dasel -r xml -w json


agile breach
#

I'll take a look at the blog post. The reason I didn't suspect a Prometheus misconfiguration is that the duplicates are already present in the data coming from the poller's endpoint on Harvest (harvest:port/metrics). Maybe Prometheus service discovery is contributing to that, though? I'll start doing some more research on the Prometheus side.

"I usually pipe the XML into dasel to convert to JSON to make it easier to parse with jq and gron"
A plain grep of the zapi command's output should still match the volume I'm looking for, though, right? Regardless, I couldn't find that particular volume when using dasel+jq either. The volume label from fabricpool_cloud_bin_op_latency_average doesn't show up at all.
Could this be related to volume constituents in any way?

spice zinc
#

The misconfiguration we've seen in the past is when a poller is added to Prometheus multiple times under different names - you can check Prometheus targets to help confirm. And yes, service discovery could make it easier to accidentally do that. For example, if you have the same cluster listed multiple times in your harvest.yml file.
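
Both checks mentioned above can be roughed out in shell. This is only a sketch: the function names dup_addrs and dup_scrape_urls are made up for illustration, the first assumes every poller in harvest.yml has its address on its own indented "addr: VALUE" line, and the second assumes the JSON returned by Prometheus's /api/v1/targets endpoint contains one "scrapeUrl" field per target.

```shell
#!/bin/sh
# Sketch: print poller addresses that appear more than once in a Harvest config.
# Assumption: each poller has an indented "addr: VALUE" line.
dup_addrs() {
  # $1: path to harvest.yml
  grep -E '^[[:space:]]+addr:' "$1" |
    awk '{print $2}' |
    sort | uniq -d
}

# Sketch: print scrape URLs that occur more than once in the JSON read from
# stdin (the response of Prometheus's /api/v1/targets endpoint).
# Assumption: each target carries a "scrapeUrl" string field.
dup_scrape_urls() {
  grep -o '"scrapeUrl":[[:space:]]*"[^"]*"' | sort | uniq -d
}
```

For example, dup_addrs harvest.yml, or curl -s http://PROMETHEUS_HOST:9090/api/v1/targets | dup_scrape_urls, where PROMETHEUS_HOST is a placeholder for your own server. No output from either means no duplicates were found by that check.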

Yes, grep is fine for what you want. One issue you are hitting with your bin/harvest zapi is that by default that command will query for 100 instances. Looks like you have 1936 instances and I can tell from your fabricpool.txt that you only retrieved 100. Try this and see if your grep works.

./bin/harvest zapi -p POLLER_NAME show data -o wafl_comp_aggr_vol_bin --max-records 99999999999

agile breach
#

I'm seeing the same target on multiple, un-related prometheus servers, so I will have to check if that is what is causing the original problem.
Also thank you for pointing out the max-records argument. I don't know how I missed that before, but now I can find that volume with jq or grep.

agile breach
#

Our Prometheus expert said that only matters when the duplicate targets are on the same Prometheus server, and that's not the case here. All the targets are unique per Prometheus server.

I am starting to suspect Harvest again, because if this were a duplicate-target problem for the same poller on the same Harvest server, wouldn't I see all the metric series multiple times?
For example, for the same poller and datacenter, the only real difference between these two series is the volume label. One shows up once (expected), the other four times. I had to dump the output to a file to keep the labels from re-arranging.

$ curl -s http://harvest:13008/metrics > /tmp/metrics.txt

$ cat /tmp/metrics.txt |grep 'fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_x",metric="GET"}'
fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_x",metric="GET"} 11
fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_x",metric="GET"} 29
fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_x",metric="GET"} 13
fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_x",metric="GET"} 27

$ cat /tmp/metrics.txt |grep 'fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_y",metric="GET"}'
fabricpool_cloud_bin_op_latency_average{datacenter="X",cluster="cmode01",cloud_target="cmode02v-s3",svm="cmode01v",volume="vol_y",metric="GET"} 1
spice zinc
#

Agreed, your curl that shows four identical time-series should not happen.

Can you run the following curl pipe? Set addr and port to your poller's address and port first. It curls Harvest for all metrics, keeps only the fabricpool series, drops the metric value, sorts and counts the unique series, and then hides every series that appears only once.
curl -s "http://$addr:$port/metrics" | grep -Ev '^#|metadata' | grep -E '^fabricpool' | grep -Eo '^.*}' | sort | uniq -c | awk '$1 > 1'

When I run this on our pollers monitoring clusters with fabricpools, I get zero duplicates. Do you see duplicates?
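
Since the labels were reported to re-arrange between requests, a variant that sorts the labels inside each series before counting may be more robust. The following is only a sketch, not part of Harvest: dup_series is a hypothetical helper name, and it assumes label values contain no commas or closing braces.

```shell
#!/bin/sh
# Sketch: count fabricpool series that occur more than once in the metrics text
# read from stdin, treating label sets as equal regardless of label order.
# Assumption: label values contain no "," or "}" characters.
dup_series() {
  grep -E '^fabricpool' |
    awk '{
      lb = index($0, "{"); rb = index($0, "}")
      if (lb == 0 || rb == 0) { print $1; next }   # series without labels
      name = substr($0, 1, lb - 1)
      n = split(substr($0, lb + 1, rb - lb - 1), labels, ",")
      # insertion sort so that label order no longer matters
      for (i = 2; i <= n; i++) {
        v = labels[i]
        for (j = i - 1; j >= 1 && labels[j] > v; j--) labels[j + 1] = labels[j]
        labels[j + 1] = v
      }
      out = name "{"
      for (i = 1; i <= n; i++) out = out (i > 1 ? "," : "") labels[i]
      print out "}"                                # metric value is dropped
    }' |
    sort | uniq -c | awk '$1 > 1'
}
```

It can be fed from a saved dump, e.g. dup_series < /tmp/metrics.txt. Each output line is a count followed by the normalized series, and no output means no duplicates were found.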

If so, can you email us the unedited zapi response you captured yesterday with a larger max-records so we can replay your response in Harvest to debug the issue? I understand your concern about names in the response, but only the Harvest team has access to that email.

I also tried recreating duplicate fabricpool metrics on 24.02. No duplicates using 24.02 or git head.

Have you made any changes to your conf/zapiperf/cdot/9.8.0/wafl_comp_aggr_vol_bin.yaml template? What is the value of include_constituents on line 17?

agile breach