#Harvest: Performance Metrics Collection Issue


hallow cypress
#

We are observing gaps in latency metrics graphs.

Errors from Harvest:
poller_netapp-cmode-01.log:461:{"level":"info","Poller":"netapp-cmode-01","collector":"ZapiPerf:CIFSvserver","error":"API request rejected => Instance name was not set for given object. errNum="13001" statusCode="0"","task":"data","caller":"collector/collector.go:415","time":"2025-12-30T01:26:56Z","message":"Entering standby mode"}
poller_netapp-cmode-01.log:1830:{"level":"info","Poller":"netapp-cmode-01","collector":"ZapiPerf:NFSv3","error":"API request rejected => Instance name was not set for given object. errNum="13001" statusCode="0"","task":"data","caller":"collector/collector.go:415","time":"2025-12-30T01:58:12Z","message":"Entering standby mode"}

NetApp Support was engaged (2010573610) and provided the analysis below, then redirected to community support:

-- Graphs are not plotted correctly by Harvest in either Grafana or Prometheus for NFS and CIFS latency counters
-- We checked the statistics from the command line, and they looked fine
-- The latency graphs in System Manager were also continuous
-- This looks to be an issue with Harvest, which is managed by the NetApp community

grand prairie
hallow cypress
#

Hi @grand prairie the poller has been restarted and logs uploaded for before and after the restart.

grand prairie
#

thanks for the upload - I noticed you are using an old version of Harvest (24.08.16-nightly (commit f95fef19) (build date 2024-08-16T00:23:05-0400)) - can you upgrade to 25.11 and retry? We've made a lot of changes since then.

I see the gaps in the ZapiPerf:CIFSvserver collector from the after-log file you uploaded. Many of the polls succeed and then a few fail, hence the gaps. Here's a visualization of that collector's polls. The thin vertical lines are successful polls of the two instances and 36 metrics. The up-arrow chevrons are the errors, all of which are caused by ONTAP returning the API request rejected => Instance name was not set for given object. errNum="13001" error.
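
For anyone who wants to tally these rejections without a visualization, here is a minimal sketch that counts errNum 13001 log lines per collector. The function name `count_rejections` is hypothetical and not part of Harvest. It uses regular expressions rather than a JSON parser because the log lines pasted in this thread contain unescaped quotes inside the error field.

```python
import re
from collections import Counter

# Regexes instead of json.loads: the lines pasted in this thread have
# unescaped quotes inside the "error" field, so they are not valid JSON.
COLLECTOR_RE = re.compile(r'"collector":"([^"]+)"')
# Accept both errNum="13001" and the escaped form errNum=\"13001\".
ERRNUM_RE = re.compile(r'errNum=\\?"(\d+)')


def count_rejections(lines, err_num="13001"):
    """Count log lines per collector that carry the given ONTAP errNum."""
    counts = Counter()
    for line in lines:
        collector = COLLECTOR_RE.search(line)
        err = ERRNUM_RE.search(line)
        if collector and err and err.group(1) == err_num:
            counts[collector.group(1)] += 1
    return counts
```

Passing the lines of a poller log (for example poller_netapp-cmode-01.log) to `count_rejections` gives a per-collector count that can be compared before and after a change.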

#

Have you made any changes to the conf/zapiperf/cdot/9.8.0/cifs_vserver.yaml template?

#

The other thing I noticed is that it's taking ~36 minutes to collect the 47K Qtrees on this cluster. It's possible that collecting that many Qtrees from ONTAP is causing other APIs to intermittently fail. We've seen this before with a large number of Qtrees. It might be worth checking that by temporarily disabling that collector and seeing whether the gaps for ZapiPerf:CIFSvserver and ZapiPerf:NFSv3 are reduced.

crude imp
#

Hi Chris,

I'm Mani from Nvidia. Thank you for your reply. We have some workflows and automations that rely on qtree level statistics, so disabling that collector isn't an option for us. Are there any other alternatives? I'm also seeing similar gaps with another NetApp cluster, so I'll check the qtree count for that specific cluster.

grand prairie
#

Hi @crude imp ONTAP made some improvements to Qtree stats collection, but those changes are REST-only and in a much later version of ONTAP. Looks like this cluster is on 9.9.1. Not sure how viable that is.

It's also worth trying a more recent version of Harvest - the version that you're using is quite old. You could set up the new instance in a different directory or machine to keep your current version intact. Disabling Qtree collection would only be temporary, to help determine whether the number of qtree instances is what is causing the intermittent gaps with ZapiPerf:CIFSvserver and ZapiPerf:NFSv3, but I understand if that's not possible.

crude imp
#

Chris,

I just verified the qtree count of the mtv-cdot01 cluster. There are fewer than 20 qtrees. Where did you observe the 47K qtrees?

grand prairie
#

Here is an example log line, pretty-printed. The instances count is the number of instances returned from the ZAPI call.

{
  "level": "info",
  "Poller": "mtv-cdot01",
  "collector": "ZapiPerf:Qtree",
  "calcMs": 107,
  "metrics": 189212,
  "numPartials": 0,
  "apiMs": 1642673,
  "parseMs": 6132,
  "pluginMs": 43,
  "skips": 384,
  "pollMs": 43,
  "numCalls": 95,
  "pluginInstances": 0,
  "zBegin": 1767693728828,
  "exportMs": 325,
  "instances": 47303,
  "bytesRx": 53696856,
  "instancesExported": 47289,
  "metricsExported": 188966,
  "caller": "collector/collector.go:595",
  "time": "2026-01-06T10:29:38Z",
  "message": "Collected"
}
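
A quick way to read a "Collected" line like this one is to convert the timing field to minutes and divide instances by calls. The sketch below assumes apiMs is the time in milliseconds spent on API calls (as the field name suggests); `summarize_collected` is a hypothetical helper, not a Harvest function.

```python
import json


def summarize_collected(log_line: str) -> dict:
    """Summarize a Harvest 'Collected' log line (valid JSON expected)."""
    entry = json.loads(log_line)
    return {
        "collector": entry["collector"],
        "instances": entry["instances"],
        # Assumption: apiMs is milliseconds spent on API calls.
        "api_minutes": entry["apiMs"] / 60_000,
        "instances_per_call": entry["instances"] / entry["numCalls"],
    }
```

For the line above this works out to roughly 27 minutes of API time and about 498 instances per call.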
crude imp
#

Interesting. To be precise, there are only 5 qtrees on this cluster.

grand prairie
#

there seem to be more than that 😄 If I look at the Zapi:Qtree collector logging - it says there are 19 instances, which is still a far cry from 47K. Have you made any modifications to either of those templates?

crude imp
#

Nope

grand prairie
#

Can you try the latest version of Harvest while I look through the code for the older version you are using?

crude imp
#

I will set up the new Harvest version, test it on a separate development server, and keep you updated. I've also collected and uploaded the logs for another cluster that is experiencing a similar issue. Please have a look.

grand prairie
#

thanks! will do

grand prairie
#

the ZapiPerf:Qtree collector is collecting 51,697 instances from cluster tha

crude imp
#

There are around 4,600 qtrees on the tha-cdot01 cluster.

grand prairie
#

yes, I see the same. Zapi:Qtree is collecting 4,693 instances and ZapiPerf:Qtree is collecting 51K
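
To put the mismatch in numbers, here is a small sketch that compares the instance counts reported so far in this thread by the two collectors. `instance_ratio` and the `COUNTS` table are illustrative names only; the figures are the ones quoted in the messages above.

```python
def instance_ratio(config_instances: int, perf_instances: int) -> float:
    """Return how many perf instances exist per config (Zapi) instance."""
    if config_instances <= 0:
        raise ValueError("config_instances must be positive")
    return perf_instances / config_instances


# (Zapi:Qtree instances, ZapiPerf:Qtree instances) as reported in this thread.
COUNTS = {
    "mtv-cdot01": (19, 47303),
    "tha-cdot01": (4693, 51697),
}

for cluster, (zapi, zapiperf) in COUNTS.items():
    print(f"{cluster}: {instance_ratio(zapi, zapiperf):.1f}x")
```

A ratio far above 1 on a cluster is the pattern being investigated here.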

crude imp
#

We are not observing this metrics collection issue with other clusters, even those with over 5000 qtrees, using the same Harvest version "24.08.16-nightly". Any idea why polling for the "ZapiPerf:CIFSvserver" and "ZapiPerf:NFSv3" collectors is impacted only on these two clusters?

grand prairie
#

We would need to look at the log files for one of those pollers to compare. I'm still going through the tha cluster logs and there are ONTAP counter manager messages saying that the calls were rejected due to timeouts on the ONTAP side. I'm going to talk with the counter manager team tomorrow about these error messages and will update here

crude imp
#

Okay, Thanks.

grand prairie
#

You're welcome!

grand prairie
#

Hi @crude imp can you run the following curl against tha-cdot01 and upload the results to the same share as yesterday? Replace $user, $pass, and $ip with their respective values

curl --connect-timeout 30 --user $user:$pass --insecure --data-ascii '<?xml version="1.0" encoding="UTF-8"?> <netapp xmlns="http://www.netapp.com/filer/admin" version="1.130"><perf-object-instance-list-info-iter><objectname>qtree</objectname><max-records>500</max-records></perf-object-instance-list-info-iter></netapp>' -H "Content-Type: text/xml" 'https://$ip/servlets/netapp.servlets.admin.XMLrequest_filer'
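
If it helps to count what comes back before uploading, the saved curl output can be tallied with a few lines of Python. This is a sketch under an assumption: that each returned instance appears as an `instance-info` element in the response, which should be checked against the actual output. `count_instances` is a hypothetical helper. Because the request sets max-records to 500, a single response covers at most one page of results.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}name'."""
    return tag.rsplit("}", 1)[-1]


def count_instances(xml_text: str) -> int:
    """Count elements assumed to represent one perf instance each."""
    root = ET.fromstring(xml_text)
    # Assumption: one <instance-info> element per returned instance.
    return sum(1 for el in root.iter() if local_name(el.tag) == "instance-info")
```

For example, redirect the curl output to a file, read it as text, and pass the contents to `count_instances`.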

crude imp
grand prairie
#

not surprisingly the results of the curl you shared show many qtree instances, just like Harvest's ZapiPerf does. I'd like to understand why qtree-list-iter returns 19 instances, but ONTAP counter manager returns 47K instances. And from what you said earlier, it sounds like you expect there to only be 19 qtrees.

It's possible that if we understand the difference in the number of instances returned by those two APIs, that we may be able to add a filter to reduce the number returned from ONTAP counter manager. Reducing that number would leave counter manager more capacity to return the performance metrics that currently have intermittent gaps. I've asked the counter manager team to take a look at the ASUPs for these clusters that were attached to the case you opened.

Temporarily disabling ZapiPerf:Qtree collection would also confirm whether reducing the number of qtree performance metrics fixes the intermittent gaps.

crude imp
#

Ok, thanks. I'm working on setting up the latest version of Harvest on one of the dev servers. I'll add these two clusters, monitor the results for a few hours, and then share the findings with you.

grand prairie
#

thanks! A heads up that if the original Harvest instance continues to monitor these same clusters, you may still see gaps with the latest version of Harvest since the issue seems to be with ONTAP counter manager becoming overwhelmed when collecting statistics for the high number of qtrees

crude imp
#

Yeah, got it. I will disable the ZapiPerf:Qtree collector and monitor

grand prairie
#

thanks!

crude imp
#

Hi Chris,

I've set up Harvest (same version) on a dev server and disabled the "ZapiPerf:Qtree" collector. But I'm still observing the gaps in the graphs, with the same error messages in the Harvest logs.

grand prairie
#

hi @crude imp thanks for the update. Is the production instance of Harvest also monitoring mtv-cdot01?

crude imp
#

yes, the production instance is also monitoring both clusters

grand prairie
#

thanks for the clarification. that's what I suspected would happen when I gave the heads-up [here](#1458119320012984479 message). If ONTAP counter manager is overwhelmed when collecting qtrees, you will still see gaps when monitoring from dev, even if you aren't collecting qtrees from that instance.

Would it be possible to temporarily stop collecting qtrees from all instances? That would confirm the qtree theory.

I spoke with the counter manager team and they've asked that you create a case to investigate why so many qtree instances are being returned from counter manager. At this point, I believe this is an ONTAP bug. Please paste the case number so I can follow along with the investigation and answer any Harvest-related questions.

crude imp
#

Here is the Case Number #2010573610

#

NetApp ONTAP Version Details:
tha-cdot01: NetApp Release 9.12.1P5
mtv-cdot01: NetApp Release 9.9.1P16

#

We can't stop the monitoring on the prod instance for these clusters.

grand prairie
#

thanks @Mani I'll loop Garrett Klassen back into the conversation to move this case forward to the counter manager team

crude imp
#

Okay, Thanks

grand prairie
#

You're welcome!

crude imp
#

Hi Chris

grand prairie
#

Hi Mani

crude imp
#

I have a few questions regarding FlexCache monitoring. Are you available for a Zoom meeting today?

grand prairie
#

can you post the questions here? I'm not available today

crude imp
#

okay, I will post the questions in some time. Please check and respond when you are available

grand prairie
#

will do

crude imp
#

Thanks