#Node NFS read latency 10x+ higher than volume NFS read latency in Harvest (not in ONTAP)

jagged sand
#

We recently ran into this scenario:

  1. a node (AFF-A400) has only one production volume
  2. that volume is accessed via NFSv3 for reads and writes, mostly reads
  3. that volume is origin of 3 other flexcache volumes (load sharing)
  4. Node NFS read latency in both Harvest and ONTAP was about 30-100ms
  5. However, volume NFS read latency was about 2-3ms in Harvest but about 30-100ms in ONTAP (Harvest did not report volume NFS read latency correctly)
  6. High Node latency was due to CPU d-blade
  7. There was a significant amount of remote retrieve traffic causing a CPU bottleneck (at STRIPE - again, only one volume on the node, CPU affinity)

Question to the Harvest team - #5 is troublesome, as Volume NFS read latency was showing 2-3ms (within the acceptable range) when in actuality it was 10x+ higher.
Harvest was reporting the correct number of NFS read ops, but significantly lower NFS read latency!

What could cause Harvest to mis-calculate this latency?

Thank you.

karmic sinew
#

Hi @jagged sand ,

Could you please share the Harvest version you are using? Are you comparing the Harvest metrics node_nfs_read_avg_latency and node_vol_nfs_read_latency?

Also, could you provide the commands you have used to compare Harvest metrics with ONTAP? I have been comparing the relevant mappings of these metrics with the ONTAP CLI while running only NFSv3 traffic, and I have found that the Harvest metrics match the ONTAP CLI outputs.

However, I am encountering discrepancies when comparing node_nfs_read_avg_latency with node_vol_nfs_read_latency, as shown in the screenshots. Are you experiencing the same issue? If so, it might be an ONTAP-related question regarding why the ONTAP CLI outputs do not match.

To diagnose further, please run the following commands in diagnostic mode in two separate terminals simultaneously to compare Harvest metrics with ONTAP:

Harvest Metric: node_vol_nfs_read_latency

# ONTAP CLI
set d
statistics show-periodic -interval 60 -iterations 20 -object volume:node -counter read_latency -instance NODENAME

Harvest Metric: node_nfs_read_avg_latency{nfsv="v3"}

# ONTAP CLI
set d
statistics show-periodic -interval 60 -iterations 20 -object nfsv3:node -counter read_avg_latency -instance NODENAME
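
Once both terminals have produced samples for the same intervals, the comparison itself can be sketched roughly as below. This is only an illustrative Python helper, not part of Harvest or ONTAP: the function name `compare_latency`, the 20% tolerance, and the sample values (picked from the ranges quoted in this thread) are all assumptions.

```python
from statistics import mean


def compare_latency(harvest_ms, ontap_ms, tolerance=0.2):
    """Compare two equally long lists of per-interval latency samples (ms).

    harvest_ms: values read from the Harvest metric (hypothetical input)
    ontap_ms:   values read from the ONTAP CLI for the same intervals
    tolerance:  relative difference still treated as a match (assumed 20%)
    """
    if len(harvest_ms) != len(ontap_ms):
        raise ValueError("sample lists must have the same length")
    if not harvest_ms:
        raise ValueError("no samples to compare")

    harvest_avg = mean(harvest_ms)
    ontap_avg = mean(ontap_ms)
    # Ratio of ONTAP to Harvest; infinite if Harvest reports zero latency.
    ratio = ontap_avg / harvest_avg if harvest_avg else float("inf")
    matches = abs(ontap_avg - harvest_avg) <= tolerance * max(ontap_avg, harvest_avg)
    return {
        "harvest_avg_ms": harvest_avg,
        "ontap_avg_ms": ontap_avg,
        "ratio": ratio,
        "matches": matches,
    }


if __name__ == "__main__":
    # Illustrative values only, not measured data.
    print(compare_latency([2.0, 3.0], [30.0, 100.0]))
```

A ratio near 1 with `matches` set to true would suggest Harvest and the CLI agree; a ratio of 10 or more would reproduce the discrepancy described above.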

jagged sand
#

We are using Harvest version 24.02.0-1 (commit 8f9201cc) (build date 2024-02-21T10:01:50-0500)

We are comparing Node NFS read latency metric (node_nfs_read_avg_latency) with Volume NFS read latency (volume_avg_latency or volume_nfs_read_latency; these two volume metrics seem to have identical data).

The issue with volume_nfs_read_latency not matching ONTAP was a corner case. Normally volume_nfs_read_latency does match ONTAP's CLI statistics show-periodic -object volume -instance $volume -counter nfs_read_latency.

We believe something caused Harvest to mis-calculate volume_nfs_read_latency in that special corner case for as long as the condition lasted (several days). What we know about the corner condition: high node latency was due to CPU d-blade, and there was a significant amount of remote retrieve traffic causing a CPU bottleneck. We have a case open with NetApp and we uploaded PAs, so the data is there. Please let me know if you'd like the case number.

Thank you.

#

(and we are on 9.10.1P12 on that cluster)

#

here are Grafana graphs for 1h (2024-09-19 13:00:00 - 2024-09-19 14:00:00)
Node NFS read latency
and
Volume read and other avg latency

karmic sinew
#

Thanks, @jagged sand .

So during this corner case, the volume_nfs_read_latency metric was showing values between 2-3ms, while the CLI command statistics show-periodic -object volume -instance $volume -counter nfs_read_latency reported values between 30-100ms. Is that correct? I want to confirm whether Harvest deviated from what the equivalent ONTAP CLI command showed for the same time frame.

If you have workload metrics enabled, could you check the value of the qos_read_latency metric during this time frame? Also, could you share the Harvest logs for that period and email them to ng-harvest-files@netapp.com? More details on log collection can be found here.

Since Harvest 24.02, we have fixed an issue related to high latency reported in Harvest, as mentioned here. I don't believe we are encountering this issue in your case. Please upgrade to the latest version of Harvest as well.
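
For background, an average-latency metric of this kind is generally derived from two cumulative raw counters (total latency and operation count) by dividing the deltas between two polls. The Python sketch below illustrates that idea only; it is not Harvest's actual code, and the names `average_latency`, `latency_total`, `ops`, and the `min_ops` cutoff are assumptions made for illustration.

```python
def average_latency(prev, curr, min_ops=10):
    """Average latency per operation between two polls.

    prev, curr: dicts holding cumulative raw counters
                'latency_total' and 'ops' (hypothetical field names).
    min_ops:    assumed cutoff; with fewer ops in the interval the
                average is treated as unreliable and reported as 0.
    Returns None when a counter went backwards (e.g. a counter reset).
    The result is in the same unit as 'latency_total'.
    """
    delta_ops = curr["ops"] - prev["ops"]
    delta_latency = curr["latency_total"] - prev["latency_total"]

    if delta_ops < 0 or delta_latency < 0:
        return None  # counters were reset between polls
    if delta_ops < min_ops:
        return 0.0  # too few operations for a meaningful average
    return delta_latency / delta_ops


if __name__ == "__main__":
    # Illustrative counter values only, not measured data.
    previous = {"latency_total": 1_000, "ops": 100}
    current = {"latency_total": 251_000, "ops": 200}
    print(average_latency(previous, current))
```

Because the result is a ratio of two deltas, the ops figure on its own can look correct while the average is still off if the latency total it is paired with is not.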

sonic nova
#

Is this a Harvest issue? There are situations where latency will be different between nblade and dblade.

#

For example, take a workload that normally does >100k IOPS with thousands of NFS clients and set a QoS policy of 10 IOPS: the CPU for the nblade (network exempt) goes to 100% * the number of nblade cores, yet volume latency may be 1 ms or less.
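
A simplified way to picture this, sketched in Python: treat end-to-end latency as a sum of per-component delays, and a volume-level counter as seeing only the components below the network layer. The component names, the choice of which components the volume view excludes, and every number here are assumptions for illustration, not ONTAP output.

```python
def end_to_end_ms(components):
    """Total latency a client would observe: the sum of all components."""
    return sum(components.values())


def volume_view_ms(components, excluded=("network", "qos_throttle")):
    """Latency as seen by a counter that does not include the excluded
    components (assumed here to be the nblade and QoS throttling delays)."""
    return sum(value for name, value in components.items() if name not in excluded)


if __name__ == "__main__":
    # Made-up per-operation delays in milliseconds.
    sample = {"network": 40.0, "qos_throttle": 20.0, "data": 1.5, "disk": 0.5}
    print("end to end:", end_to_end_ms(sample))
    print("volume view:", volume_view_ms(sample))
```

With these made-up values the two views differ by more than an order of magnitude without either one being miscalculated.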

jagged sand
#

thank you Rahul and Paul
when NetApp loaded the PAs into their tool, they could see high volume latency, consistent with overall node latency, and the CLI command reported high latency as well - but the volume latency data collected/calculated by Harvest was an order of magnitude (10x+) lower

#

i'll try to find/upload Harvest logs
and we do not have qos metrics enabled (we would like to enable them, but every time we tried, it overwhelmed Prom server so we had to back out)

jagged sand
#

@karmic sinew
sent email with oldest logs i could find (unfortunately logs for the period in question already rolled off)

#

If you have a way to reference the NetApp case (#2010167751), you should be able to see the PAs and look at the data. We can send you the data we have in Harvest for volume_nfs_read_latency for the same period (if that will help).

sonic nova
#

That does.

#

Or I can reach out on the actual case thread as a splinter thread.

jagged sand
#

thank you @sonic nova , email sent

karmic sinew
#

Thanks, @jagged sand . The case details were very helpful. From my understanding, the PA files highlighted high latency in the QoS stats for that volume. I observed that the volume latency in PA is similar to what was reported in Harvest.

We should compare the node NFS latency in your case with the QoS latency, because issues related to nblade are captured in QoS latency and not in volume latency, as Paul mentioned above. See https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/What_is_the_difference_in_latency_of_the_same_volume_between_statistics_volume_show_and_qos_statistics_volume_latency_show_commands
I'll share additional details via email.

Regarding the overwhelmed Prometheus server due to QoS stats, I suspect that you might have enabled all four QoS templates as specified in the default configuration. If you enable only two templates, as shown below, that should not impact Prometheus. The detailed templates (WorkloadDetail and WorkloadDetailVolume) can generate a lot of metrics, so it's advisable to leave them disabled.

# Uncomment to collect workload/QOS counters.
# Workload:             workload.yaml
# WorkloadVolume:       workload_volume.yaml