#Some metrics are missing post upgrade to harvest version 24.08.0-1

calm beacon
#

Hi
Harvest was upgraded to version 24.08 a few months back.
Since then, some metrics appear to be missing; one example is node_nfs_access_avg_latency.
I am not sure which version was in use before, but the Prometheus data shows that this metric was available earlier.
I tried switching to the latest version, 24.11.1-1, and to REST, but I still can't see node_nfs_access_avg_latency.
The odd thing is that it doesn't happen every time; some instances do show this data some of the time.

Are there any known issues with this metric in newer versions?

umbral pagoda
calm beacon
#

@umbral pagoda Thanks for the response, and excuse the delay in replying.
I have uploaded logs for two clusters: one uses REST and the other uses ZAPI.

umbral pagoda
#

Thanks for the logs @calm beacon
According to the REST logs, the line below shows that 102 metrics were exported, so you should be receiving node_nfs_access_avg_latency via RestPerf.

time=2025-01-07T07:35:23.722Z level=INFO source=collector.go:601 msg=Collected Poller=xxxxx collector=RestPerf:NFSv3Node apiMs=133 bytesRx=9929 calcMs=0 exportMs=0 instances=2 instancesExported=2 metrics=108 metricsExported=102 numCalls=1 numPartials=0 parseMs=0 pluginInstances=0 pluginMs=0 pollMs=133 renderedBytes=15407 skips=0 zBegin=1736235323589
#

Could you try running the curl command below and see whether you get any data?

curl -s http://HARVEST_MACHINE_IP:POLLER_PROM_PORT/metrics | grep node_nfs_access_avg_latency

Replace HARVEST_MACHINE_IP and POLLER_PROM_PORT as applicable.
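
The same check can be scripted when several metrics or pollers need comparing. The following is a minimal sketch, assuming the poller serves the standard Prometheus text format at `/metrics`; `filter_metric` and `fetch_metrics` are hypothetical helper names, not part of Harvest.

```python
import urllib.request


def filter_metric(exposition_text, metric_name):
    """Return the sample lines for metric_name from Prometheus text output."""
    matches = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, followed by '{' or a space.
        if line.startswith(metric_name) and line[len(metric_name):][:1] in ("{", " "):
            matches.append(line)
    return matches


def fetch_metrics(host, port, timeout=10):
    """Fetch the raw /metrics page from a poller's Prometheus exporter."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `filter_metric(fetch_metrics(host, port), "node_nfs_access_avg_latency")` returns only the samples for that exact metric name, whereas plain `grep` would also match longer names that share the prefix.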

calm beacon
#

@umbral pagoda The metric is showing up there, but the value is 0

umbral pagoda
#

Okay. It is likely that there is no NFS traffic.

calm beacon
#

Ah but no

#

I have pretty busy clusters that show 0 as the value

#

and in the ONTAP GUI I can see latency

umbral pagoda
#

Okay. Let's check metric node_nfs_latency

#

Does that show any traffic?

calm beacon
#

Yes it does show traffic

umbral pagoda
#

I believe that node_nfs_access_avg_latency represents a subset of the overall latency, tracking a particular component of the total latency.

calm beacon
#

Ok

#

But the metric suddenly stopped reporting values at a certain point

#

If you look at the graph above, which is from one cluster, you can see it had values between 0 and 100 until the end of October, and then it dropped to 0

#

That's when they updated to the newer Harvest

umbral pagoda
#

Okay. Are metrics being collected now?

calm beacon
#

Metrics are being collected, but the values are zero almost all the time

umbral pagoda
#

Which metric does this panel use?

calm beacon
#

node_nfs_access_avg_latency

umbral pagoda
#

Okay. We can verify the value via the ONTAP CLI; it may simply have been 0 all this time. Run the command below in diag mode in ONTAP and compare the result with Harvest:

set d
statistics show-periodic -interval 60 -iterations 10 -object nfsv3:node -counter access_avg_latency  -instance umeng-aff300-01
#

Depending on your use case, you might want to check whether node_nfs_access_avg_latency or node_nfs_latency is more relevant for your needs.

calm beacon
#

Thanks
I will check this and share the result

calm beacon
#

This is the result I got from the same cluster as in the graph

      avg    Complete    Number of
  latency Aggregation Constituents
 -------- ----------- ------------
     42us      Yes      2
     56us      Yes      2
     51us      Yes      2
     56us      Yes      2
     51us      Yes      2
     50us      Yes      2
     66us      Yes      2
     55us      Yes      2
     47us      Yes      2
     53us      Yes      2
umbral pagoda
#

Okay. Let's also check access_total alongside the latency. Could you run the command below and share the output?

set d
statistics show-periodic -interval 60 -iterations 10 -object nfsv3:node -counter access_avg_latency|access_total  -instance umeng-aff300-01
calm beacon
#
      avg   access    Complete    Number of
  latency    total Aggregation Constituents
 -------- -------- ----------- ------------
     60us      425      Yes      2
     49us      397      Yes      2
     47us      384      Yes      2
     55us      404      Yes      2
     65us      394      Yes      2
     52us      436      Yes      2
     61us      384      Yes      2
     45us      397      Yes      2
     44us      402      Yes      2
     58us      396      Yes      2
nstore03: nfsv3:node.nstore03a: 1/7/2025 11:46:16
   access
      avg   access    Complete    Number of
  latency    total Aggregation Constituents
 -------- -------- ----------- ------------
Minimums:
     44us      384       -       - 
Averages for 10 samples:
     53us      401       -       - 
Maximums:
     65us      436       -       -
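
To compare these samples with Harvest, the data rows can be reduced to a single average per column. This is a minimal sketch, assuming rows look exactly like the ones pasted above (latency in `us`, then access_total, then `Yes`); `parse_rows` and `averages` are hypothetical helpers, and the integer truncation is only a guess at how the CLI summary rounds.

```python
import re

# Matches data rows such as "     60us      425      Yes      2".
ROW = re.compile(r"^\s*(\d+)us\s+(\d+)\s+Yes\s+\d+\s*$")


def parse_rows(cli_output):
    """Return (latency_us, access_total) pairs from show-periodic data rows."""
    rows = []
    for line in cli_output.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2))))
    return rows


def averages(rows):
    """Integer averages of latency and access_total (truncated, not rounded)."""
    n = len(rows)
    return sum(r[0] for r in rows) // n, sum(r[1] for r in rows) // n
```

Applied to the ten rows above, this gives 53us and 401, which agrees with the "Averages for 10 samples" line in the CLI output.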
umbral pagoda
#

Are these access total numbers similar to node_nfs_access_total for this node?

node_nfs_access_total{node="umeng-aff300-01"}

calm beacon
#

No, they are quite different

#

If I plot node_nfs_access_total, it ranges between 6.67 and 7.22

umbral pagoda
#

Okay. Let me check

#

I believe the issue with node_nfs_access_total is that we override the property of these counters, which causes them to be divided by 60. In your case, since the value ranges from 6 to 7, multiplying it by 60 should give a value similar to what you see in the CLI. Could you comment out the lines highlighted here:
https://github.com/NetApp/harvest/blob/main/conf/restperf/9.12.0/nfsv3_node.yaml#L64-L86
in the file conf/restperf/9.12.0/nfsv3_node.yaml and restart Harvest? You can then check whether access_total is correct. If this works, a code change will be needed to fix the latency issue.
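
The "multiply by 60" reasoning can be checked with quick arithmetic. This is a sketch only, assuming the Harvest value is a per-second rate and the CLI column is a count over the 60-second `-interval`; `rate_to_interval_count` is a hypothetical helper.

```python
INTERVAL_S = 60  # matches "-interval 60" in the CLI command used earlier


def rate_to_interval_count(per_second_rate, interval_s=INTERVAL_S):
    """Convert a per-second rate into an approximate count per interval."""
    return per_second_rate * interval_s


for rate in (6.67, 7.22):
    print(f"{rate}/s -> about {rate_to_interval_count(rate):.0f} per {INTERVAL_S}s")
```

This yields roughly 400 and 433 operations per interval, which falls inside the 384 to 436 range reported by the CLI.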

calm beacon
#

Let me do that and share the result

umbral pagoda
#

Thanks

#

Could you confirm whether this used to work fine with ZapiPerf?

calm beacon
#

No, actually our setup is still on ZAPI

#

I just added one node using REST to see whether the problem is specific to ZAPI

umbral pagoda
#

Understood. We are doing the same for ZAPI as well.

calm beacon
#

Yes, commenting out that section makes the values similar to what we got from ONTAP

umbral pagoda
#

Thanks. We'll fix latency as well and provide you with a build.

calm beacon
#

Sure thanks a lot Rahul

umbral pagoda
#

@calm beacon I believe the override implemented by Harvest is correct. Could you please revert the template changes we made earlier and restart Harvest? If you then run the following queries in Prometheus, they should return similar results:

sum(node_nfs_access_total+node_nfs_commit_total+node_nfs_create_total+node_nfs_fsinfo_total+node_nfs_fsstat_total+node_nfs_getattr_total+node_nfs_link_total+node_nfs_lookup_total+node_nfs_mkdir_total+node_nfs_mknod_total+node_nfs_null_total+node_nfs_pathconf_total+node_nfs_read_symlink_total+node_nfs_read_total+node_nfs_readdir_total+node_nfs_readdirplus_total+node_nfs_remove_total+node_nfs_rename_total+node_nfs_rmdir_total+node_nfs_setattr_total+node_nfs_symlink_total+node_nfs_write_total) by (cluster)
sum(volume_nfs_read_ops) by (cluster) + sum(volume_nfs_write_ops) by (cluster)
sum(svm_nfs_access_total+svm_nfs_commit_total+svm_nfs_create_total+svm_nfs_fsinfo_total+svm_nfs_fsstat_total+svm_nfs_getattr_total+svm_nfs_link_total+svm_nfs_lookup_total+svm_nfs_mkdir_total+svm_nfs_mknod_total+svm_nfs_null_total+svm_nfs_pathconf_total+svm_nfs_read_symlink_total+svm_nfs_read_total+svm_nfs_readdir_total+svm_nfs_readdirplus_total+svm_nfs_remove_total+svm_nfs_rename_total+svm_nfs_rmdir_total+svm_nfs_setattr_total+svm_nfs_symlink_total+svm_nfs_write_total) by (cluster)
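
The node and SVM queries differ only in their metric prefix, so they can be generated rather than typed by hand. This is a minimal sketch; `NFS_OPS` is copied from the queries above, and `total_ops_query` is a hypothetical helper name.

```python
# NFSv3 operation names, copied from the queries above.
NFS_OPS = [
    "access", "commit", "create", "fsinfo", "fsstat", "getattr", "link",
    "lookup", "mkdir", "mknod", "null", "pathconf", "read_symlink", "read",
    "readdir", "readdirplus", "remove", "rename", "rmdir", "setattr",
    "symlink", "write",
]


def total_ops_query(prefix, group_by="cluster"):
    """Build 'sum(<prefix>_<op>_total+...) by (<group_by>)'."""
    terms = "+".join(f"{prefix}_{op}_total" for op in NFS_OPS)
    return f"sum({terms}) by ({group_by})"


print(total_ops_query("node_nfs"))
print(total_ops_query("svm_nfs"))
```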

Regarding latency, Harvest reports latency as 0 when IOPS are very low. access_total should be a rate only, and since its value is 6 or 7, which is less than 10, you see the latency as 0. This behavior is expected: at low IOPS, a handful of slow operations can produce high latency readings, which would be false positives.
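
The effect of such a cutoff can be illustrated as follows. This is a sketch under stated assumptions, not Harvest's actual code: the cutoff of 10 operations per second is taken from the message above, and `reported_latency` and its parameters are hypothetical.

```python
MIN_OPS_PER_SEC = 10  # cutoff quoted in the thread; treated here as an assumption


def reported_latency(latency_delta_us, ops_delta, elapsed_s,
                     min_ops_per_sec=MIN_OPS_PER_SEC):
    """Average latency per operation, forced to 0 when the op rate is too low."""
    if elapsed_s <= 0 or ops_delta <= 0:
        return 0.0
    ops_per_sec = ops_delta / elapsed_s
    if ops_per_sec < min_ops_per_sec:
        # Too few operations for the average to be meaningful.
        return 0.0
    return latency_delta_us / ops_delta
```

With about 401 access operations in 60 seconds (under 7 per second), the function returns 0 even though ONTAP itself reports around 53us; at 1200 operations in the same window it would return the real average.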