#Architecture Advice for Harvest

novel oak
#

Here are some pain points users are facing:

• Each site (PV with Trident) needs its own Harvest server. Currently, there are a total of 40 primary and standby Harvest servers, and version upgrades, configuration changes, and metric adjustments all take a significant amount of time.
-> Maybe Ansible can help.

• Although metric reduction has been applied to Harvest (using conf_path with customconf), both Harvest and AIQUM still consume 6–7% of NetApp’s performance resources (based on the A250 model).
-> Is there a way to check the CPU usage caused by Harvest, or any related documentation?

• For instance, at one of the Harvest sites, the number of monitored clusters has reached 78. When performance issues arise, the clusters need to be split and monitored by two separate Harvest servers.

• The process of joining a cluster monitor to Harvest is manual, which increases the risk of missing monitors.
-> We might be able to integrate an Ansible automation process.

• Due to the large number of cluster volumes, more complex PromQL queries cannot retrieve long-term data effectively.
-> They combine Harvest metrics with K8s metrics and do some calculations. The calculation is complex and can only show data for the last 2 hours; anything beyond that range shows "no data".

Here is some advice for them:

  1. Change the Harvest performance collector interval from 1 minute to 3 minutes.
  2. Disable metrics that are not used (keep reducing).
  3. Enable jitter

Any other recommendations?
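
To illustrate why items 1 and 3 can reduce load: moving from a 1-minute to a 3-minute interval cuts the polls per collector from 60 to 20 per hour, and jitter spreads collector start times so requests do not all reach ONTAP at the same instant. The sketch below only illustrates the idea and is not Harvest's implementation; the function names are hypothetical.

```python
import random
from typing import Dict, List, Optional


def polls_per_hour(interval_seconds: int) -> float:
    """Number of polls one collector issues per hour at a fixed interval."""
    return 3600 / interval_seconds


def jittered_start_offsets(collectors: List[str], jitter_seconds: float,
                           seed: Optional[int] = None) -> Dict[str, float]:
    """Give each collector a random start delay between 0 and jitter_seconds.

    Illustration only: spreading the start times keeps collectors from
    sending their first requests at the same moment.
    """
    rng = random.Random(seed)
    return {name: rng.uniform(0, jitter_seconds) for name in collectors}


if __name__ == "__main__":
    print(polls_per_hour(60), polls_per_hour(180))
    print(jittered_start_offsets(["volume", "lun", "qtree"], 60))
```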

cerulean basin
#

Hi @novel oak we'll discuss this in more detail and get back to you.

Some other ideas to explore:

  1. Try VictoriaMetrics and see if it helps with the slow PromQL queries.
  2. Try one of the open-metrics servers that supports downsampling so you can query over longer time ranges
  3. Let us know if you have any suggestions on how we can make upgrading easier. We've tried to make upgrading straightforward, but maybe there is more we can do in your environment. Do you have any specifics on the upgrade pain points? It sounds like you are already using conf_path, which can help with custom templates.

novel oak
#

@cerulean basin I will try to set up VictoriaMetrics and integrate it with the existing Prometheus.
Yes, they are using conf_path with customconf templates.

novel oak
#

@cerulean basin Do we have any automation examples for upgrading Harvest, changing the config, or adding a new cluster to harvest.yaml?

uneven cedar
#

Hi Tony, what kind of Harvest installation are you using?

novel oak
#

@uneven cedar Debian

uneven cedar
#

There are no specific examples, but these tasks can be achieved through scripting. For example, upgrading is straightforward using the command listed at NetApp Harvest Upgrade.

Changing the configuration, such as adding or removing a cluster, involves modifying the YAML file via a script. You could accomplish this using languages like Go, Python, etc. NABox performs these tasks. @oak wraith can provide more details.
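
As a rough illustration of the scripting approach, here is a minimal sketch that adds a cluster to an already-parsed harvest.yaml and reports clusters that are not yet monitored. It assumes the file has a top-level `Pollers` map whose entries carry `addr` and `datacenter`; authentication fields are omitted, and the function names and sample values are hypothetical. Reading and writing the YAML (for example with PyYAML's `yaml.safe_load` and `yaml.safe_dump`) is left out, and note that such a round trip drops comments.

```python
from typing import Any, Dict, Iterable, List


def add_poller(config: Dict[str, Any], name: str, addr: str,
               datacenter: str) -> bool:
    """Add a poller entry if it is missing. Returns True when added."""
    pollers = config.get("Pollers") or {}
    config["Pollers"] = pollers
    if name in pollers:
        return False  # already monitored, leave the existing entry alone
    # Assumed minimal entry; real entries also need credentials.
    pollers[name] = {"datacenter": datacenter, "addr": addr}
    return True


def missing_clusters(config: Dict[str, Any],
                     inventory: Iterable[str]) -> List[str]:
    """Return inventory names that have no poller entry yet."""
    pollers = config.get("Pollers") or {}
    return sorted(set(inventory) - set(pollers))


if __name__ == "__main__":
    cfg = {"Pollers": {"cluster-a": {"datacenter": "dc1", "addr": "10.0.0.1"}}}
    print(missing_clusters(cfg, ["cluster-a", "cluster-b"]))
    print(add_poller(cfg, "cluster-b", "10.0.0.2", "dc1"))
```

Because `add_poller` is idempotent, the same script can be run repeatedly (for example from an Ansible task) without duplicating entries.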

novel oak
#

@cerulean basin From my testing, the performance of VictoriaMetrics might not be better than Prometheus.
Port 8428 is VictoriaMetrics and port 9090 is Prometheus.

oak wraith
#

Can you run it on an actual long-running query?

novel oak
#

We have done the following for Harvest and NABox:

  1. Keep only the templates and objects that are in use. (Disable all unused ones.)
  2. Set jitter to 1m.
  3. Change the data interval to 3m for performance data.

In user testing, the CPU usage has dropped by around 5%.

novel oak
#

It seems that on some sites, the following changes haven't reduced CPU usage much:

  1. Keep only the templates and objects that are in use. (Disable all unused ones.)
  2. Set jitter to 3m.
  3. Change the data interval to 3m for performance data.

Do you have any other suggestions on what we can do?

uneven cedar
#

@novel oak How much CPU usage are we seeing due to Harvest polling? Is it affecting ONTAP's data layer? Also, have you checked with ONTAP to see if the CPU usage is causing any issues?

novel oak
#

@uneven cedar Is the only way to check the CPU usage of Harvest polling to turn off Harvest and compare the CPU usage?

uneven cedar
#

Sounds reasonable, but please do check with ONTAP on this.

novel oak
#

We have a complex time series panel that uses multiple group_left and sum functions.
In Grafana, we can only see data within the past 12 hours; if the time range exceeds 12 hours, no data is displayed.
We tried creating a recording rule for this complex metric and set the interval to 1 hour to reduce the number of data points, which worked.
But the panel only shows data from after we set up the recording rule.
Does anyone know how to integrate historical data with the new metric generated by the recording rule?
Is it possible to use the API /api/v1/query_range?query=XXX&start=2025-03-01T00:00:00Z&end=2025-04-22T00:00:00Z&step=1h
to export the data, convert it into a .prom file, and then import it back into the TSDB?

uneven cedar
#

Consider downsampling your data to improve query performance without relying on recording rules. If your Harvest poll interval is 3 minutes, you could also increase Prometheus' scrape interval to 3 minutes to prevent duplicate data scrapes.
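
To make the idea concrete: downsampling replaces many raw samples with one aggregate per time bucket, so a long-range query touches far fewer points (with a 3-minute interval, a 1-hour bucket holds about 20 raw samples). The sketch below only illustrates averaging into buckets; in practice the storage backend does this, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample_avg(samples: Iterable[Tuple[float, float]],
                   bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average (timestamp_seconds, value) samples into fixed-size buckets.

    Each output tuple is (bucket_start_seconds, mean_value).
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [(0, 1.0), (180, 2.0), (360, 3.0), (3600, 10.0)]
    print(downsample_avg(raw, 3600))
```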

novel oak
#

@uneven cedar Wouldn't recording rules help with complex calculations?

uneven cedar
#

Both options can be considered depending on the use case.

novel oak
#

We changed the scrape interval from 2m to 3m.
Also, is it possible to integrate historical data with the new metric generated by the recording rule?

uneven cedar
#

I haven't tried backfilling via /api/v1/query_range, but it should be possible.

#

You should try that in a non-production environment.

novel oak
#

Yes, but I'm not sure what the next step is once I have queried the API and converted the result to a .prom file like the one below:
volume_total_ops{aggr="DATA_SSD_N1",cluster="a150",datacenter="tplab",instance="localhost:12991",job="harvest",node="a150-01",style="flexvol",svm="infra",volume="fsa_test"} 5.016298006249962 1740787200000
volume_total_ops{aggr="DATA_SSD_N1",cluster="a150",datacenter="tplab",instance="localhost:12991",job="harvest",node="a150-01",style="flexvol",svm="infra",volume="fsa_test"} 5.812429217691095 1740790800000

Should I write data to /var/lib/prometheus/?

uneven cedar
#

I'm not sure whether Prometheus supports backfilling.

#

VictoriaMetrics supports that.

#

I doubt recording rules will run against backfilled data.

novel oak
#

Recording rules only apply to data ingested after their creation, which means the panel displays metrics starting from when the rules were set up.
I would like to backfill historical data so it is included in the panel visualization.
While I can successfully query the historical data, I am not sure how to ingest or write this data back into the TSDB under a specific metric name.

uneven cedar
#

VictoriaMetrics may help.

novel oak
#

@uneven cedar After setting up VictoriaMetrics and vmalert, we can successfully write the recording rule data into the VictoriaMetrics DB.
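
For the manual route discussed earlier in the thread (export the history, then write it back under a new metric name), a minimal sketch of the import side is shown below. It assumes a single-node VictoriaMetrics listening on port 8428 and its `/api/v1/import/prometheus` endpoint, which accepts text exposition lines with optional millisecond timestamps. The function name and sample line are hypothetical, and this should be verified against the VictoriaMetrics version in use before relying on it.

```python
import urllib.request
from typing import Iterable


def build_import_request(base_url: str,
                         lines: Iterable[str]) -> urllib.request.Request:
    """Build a POST request carrying text exposition lines.

    base_url is assumed to look like "http://localhost:8428".
    """
    body = ("\n".join(lines) + "\n").encode("utf-8")
    url = base_url.rstrip("/") + "/api/v1/import/prometheus"
    return urllib.request.Request(url, data=body, method="POST")


if __name__ == "__main__":
    sample = ['volume_total_ops:1h{cluster="a150"} 5.016298006249962 1740787200000']
    req = build_import_request("http://localhost:8428", sample)
    # Sending is left commented out so the sketch has no side effects:
    # with urllib.request.urlopen(req, timeout=30) as resp:
    #     print(resp.status)
    print(req.get_method(), req.full_url)
```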