Here are some pain points users are facing:
• Each site (PVs with Trident) needs its own Harvest server. There are currently 40 primary and standby Harvest servers in total, so version upgrades, configuration changes, and metric adjustments all take a significant amount of time.
-> Ansible might help here.
• Although metric reduction has been applied to Harvest (using conf_path with customconf), both Harvest and AIQUM still consume 6–7% of NetApp’s performance resources (based on the A250 model).
-> Can we check Harvest's CPU usage, or is there any related documentation?
• For instance, at one of the Harvest sites, the number of cluster monitors has reached 78. When performance issues arise, the clusters need to be split and monitored by two separate Harvest servers.
• The process of joining a cluster monitor to Harvest is manual, which increases the risk of missing monitors.
-> We might be able to integrate an Ansible automation process.
• Due to the large number of cluster volumes, more complex PromQL queries cannot retrieve long-term data effectively.
-> They combine Harvest metrics with K8s metrics and do some calculations. The calculations are complex and can only show data within a 2-hour range; anything beyond that period shows "no data".
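To make the "missing monitors" and "split across two Harvest servers" points above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the cluster names, the server names (harvest-a, harvest-b), and both helper functions are hypothetical, and it does not read harvest.yml or call any Harvest API. It compares an inventory list against the list of clusters already monitored, and spreads clusters round-robin over the available servers.

```python
from typing import Dict, Iterable, List


def find_missing_monitors(inventory: Iterable[str], monitored: Iterable[str]) -> List[str]:
    """Return clusters that are in the inventory but not monitored anywhere."""
    return sorted(set(inventory) - set(monitored))


def split_clusters(clusters: Iterable[str], servers: List[str]) -> Dict[str, List[str]]:
    """Distribute clusters round-robin across the given Harvest servers."""
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for index, cluster in enumerate(sorted(clusters)):
        assignment[servers[index % len(servers)]].append(cluster)
    return assignment


if __name__ == "__main__":
    # Hypothetical data: 78 clusters, the last two never added to Harvest.
    inventory = [f"cluster-{n:02d}" for n in range(1, 79)]
    monitored = inventory[:-2]

    print("Not monitored:", find_missing_monitors(inventory, monitored))

    plan = split_clusters(inventory, ["harvest-a", "harvest-b"])
    for server, names in plan.items():
        print(server, len(names), "clusters")
```

The same check could run as a step in an Ansible playbook or a scheduled job; round-robin is just the simplest split and ignores how busy each cluster is.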
Here is some advice for them:
- Change the Harvest performance collector interval from 1 minute to 3 minutes.
- Disable metrics that are not used (keep reducing).
- Enable jitter
Any other recommendations?