#qos_detail_service_time_latency & qos_detail_resource_latency raw point in time values
@mellow flame To obtain the raw values, you can execute the following commands in ONTAP CLI in diag mode:
statistics show -object workload_detail -raw
statistics show -object workload_detail_volume -raw
Cooked values calculated in Harvest:
qos_detail_service_time_latency = delta(service_time)/delta(qos_ops)
qos_detail_resource_latency = delta(service_time+wait_time)/delta(qos_ops)
The deltas for these metrics are computed based on their polling interval, which is set to a default of three minutes in Harvest.
For more details on calculating cooked values of performance counters, refer to the Harvest documentation at: https://netapp.github.io/harvest/23.11/configure-zapi/#metrics_1.
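As a rough illustration of the two formulas above, here is a minimal Python sketch that turns two consecutive raw polls into cooked values. The dictionary layout, the function name `cooked_latencies`, and the sample numbers are assumptions made up for this example; they are not Harvest code or real ONTAP output.

```python
def cooked_latencies(prev, curr):
    """Compute cooked latencies from two raw polls (hypothetical helper).

    prev and curr are dicts holding the raw, monotonically increasing
    counters service_time, wait_time and qos_ops from consecutive polls.
    Returns None when no ops were recorded between the polls.
    """
    d_ops = curr["qos_ops"] - prev["qos_ops"]
    if d_ops <= 0:
        # No new operations (or a counter reset): no average can be computed.
        return None
    d_service = curr["service_time"] - prev["service_time"]
    d_wait = curr["wait_time"] - prev["wait_time"]
    return {
        "qos_detail_service_time_latency": d_service / d_ops,
        "qos_detail_resource_latency": (d_service + d_wait) / d_ops,
    }


# Made-up raw values for two polls taken one polling interval apart.
prev = {"service_time": 1_000_000, "wait_time": 400_000, "qos_ops": 5_000}
curr = {"service_time": 1_090_000, "wait_time": 430_000, "qos_ops": 5_300}
result = cooked_latencies(prev, curr)
print(result)
```

The result is in the same unit as the raw time counters, per operation, averaged over the interval between the two polls.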
Legend thanks. I'll review.
@rocky oyster based on the documentation, am I correct in thinking that I should be able to edit my Harvest config somehow to change the 'property' and/or 'base-counters' from 'average' to 'raw'? If that's a correct assumption, is there an example somewhere I can use? I'm interested in doing this for the data from: /opt/harvest/conf/zapiperf/cdot/9.8.0/workload_detail_volume.yaml
@mellow flame ZapiPerf does have a special override parameter, but it is mainly used to work around ZAPI bugs, not for the use case you describe above. You can find the details here: https://netapp.github.io/harvest/23.11/configure-zapi/#object-configuration-file_1 One example of override usage is link_speed, which is a string field but is reported as a raw metric counter in the ZapiPerf response: https://github.com/NetApp/harvest/blob/812012edf2390129b24133d3c9d6c94201796278/conf/zapiperf/cdot/9.8.0/fcp.yaml#L40
Also, we cannot change the base-counter; the base-counter name is always taken from the ZapiPerf response, like this:
{
"base-counter": "path_cache_misses",
"desc": "Average latency for a path cache miss",
"is-deprecated": "false",
"name": "path_cache_miss_latency",
"privilege-level": "diag",
"properties": "average",
"unit": "microsec"
},
Thanks @fierce sand .
Is there any way to retrieve the output of:
statistics show -object workload_detail_volume -raw
from /api/private/cli?
(I'm hacking in the lab but no luck so far)
Taking a couple steps back.. the outcome I'm looking for is a point in time per volume latency breakdown attributed to each delay centre similar to the 'Latency - Cluster Component View' breakdown provided by the 'Cluster Components' view on the Latency for Volume dashboard in AIQUM.
@mellow flame We do expose qos_detail_service_time_latency and qos_detail_resource_latency via the Workload and Volume dashboards in Harvest. Does that help?
Kinda.. another way to describe the outcome is that for a given volume, at a point in time, if the latency is say 2ms, I'd like to know how much of the latency at that point in time is because of tiering, or because of data processing and so on.
I don't think that's possible if the metrics coming out of harvest are averages between the two data points though
This average is a cooked value computed from the raw values, and it matches the ONTAP CLI. Are you suggesting that it is not correct?
See if this page helps; we did this comparison there:
https://github.com/NetApp/harvest/wiki/Volume-latency-for-flexgroup
You can use the private CLI call below to get raw values. Replace user, pass, and cluster_ip as applicable.
curl -s -k -u user:pass 'https://cluster_ip/api/private/cli' \
--header 'Content-Type: application/json' \
--data '{ "input": "set -showseparator \"!\" -showallfields true -rows 0 diagnostic -confirmations off; statistics show -object workload_detail_volume -raw -counter resource_name|service_time|instance_name|wait_time" }'
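For scripting, the curl call above could be mirrored in Python roughly as follows. This is only a sketch using the standard library: the helper name `build_private_cli_request` and the placeholder address and credentials are assumptions, and the block only builds the request. It does not send it or make any claim about the response format.

```python
import base64
import json
import urllib.request

# Same CLI settings as in the curl example above.
CLI_PREFIX = (
    'set -showseparator "!" -showallfields true -rows 0 '
    "diagnostic -confirmations off; "
)


def build_private_cli_request(cluster_ip, user, password, command):
    """Build (but do not send) a POST request for /api/private/cli."""
    payload = json.dumps({"input": CLI_PREFIX + command}).encode("utf-8")
    # HTTP basic auth, equivalent to curl's -u user:pass.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"https://{cluster_ip}/api/private/cli",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder values; replace with your own cluster address and credentials.
req = build_private_cli_request(
    "cluster_ip",
    "user",
    "pass",
    "statistics show -object workload_detail_volume -raw "
    "-counter resource_name|service_time|instance_name|wait_time",
)
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` together with an `ssl` context; disabling certificate verification in that context would correspond to curl's `-k` and is only sensible in a lab.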
That's awesome, thanks Rahul!
Nope, I'm not suggesting they are different - perhaps I misunderstood. To confirm, even the:
statistics show -object workload_detail -raw
statistics show -object workload_detail_volume -raw
on the CLI are averaged? So the output from the ONTAP CLI would match the metrics in Harvest, correct?
I'll do some more lab work - thanks for this.
@mellow flame The command statistics show -object workload_detail -raw in ONTAP retrieves raw values, which are not averaged and are monotonically increasing. These values change over time. To calculate an average metric, you need samples from two different timestamps. The formula is x = (x_i - x_{i-1}) / (y_i - y_{i-1}), where x_i and x_{i-1} are the current and previous raw values, and y_i and y_{i-1} are the current and previous base counter values.
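The formula above can be applied across a whole series of polls, producing one average per interval. The Python sketch below is an illustration under stated assumptions: the function name `average_series`, the tuple layout, and the sample numbers are invented here, and returning `None` for an interval with no base-counter growth or a decreasing raw value is a design choice of this sketch, not something confirmed for Harvest.

```python
def average_series(samples):
    """Apply x = (x_i - x_{i-1}) / (y_i - y_{i-1}) to consecutive polls.

    samples is a list of (raw_value, base_value) tuples in polling order.
    Returns one entry per interval, so len(samples) - 1 entries in total.
    """
    averages = []
    for (x_prev, y_prev), (x_curr, y_curr) in zip(samples, samples[1:]):
        dx = x_curr - x_prev
        dy = y_curr - y_prev
        if dx < 0 or dy <= 0:
            # Counter reset or no activity: skip instead of dividing by zero.
            averages.append(None)
        else:
            averages.append(dx / dy)
    return averages


# Made-up (raw, base) pairs from four consecutive polls.
samples = [(1000, 10), (1600, 40), (1600, 40), (2500, 70)]
series = average_series(samples)
print(series)
```

Each entry describes only the interval between two polls, which is why a single raw sample on its own cannot yield a latency value.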
Okay, all clear now. 🙂
This is great - thanks again!
What you're looking for is a breakdown of the latency, not modifying the average. I wouldn't recommend changing the calculations. There are delay center breakdown views in Workload view and I think volume view. You have to enable workload collection.
Hey Guys.. hope 2024 is treating you well.
Quick follow up.. in the context of WorkloadDetailVolume..
Is it fair to say that I'd get more accuracy across the current to previous latency diff calculations if I lower the 'data' config below? Also, what does counter and instance do exactly? I've read the docs but still don't understand.
schedule:
- counter: 20m
- instance: 10m
- data: 3m
Thanks!
@mellow flame
The counter configuration is the time interval used by Harvest to check for changes in counter names within ONTAP. This is necessary to identify any changes in counter names or deprecations that may occur during ONTAP upgrades. In our upcoming release, this duration will be extended from 30 minutes to 24 hours.
The instance configuration is used to detect new instances. For example, if a new volume is created in ONTAP, Harvest may take up to 10 minutes to begin collecting performance statistics for that object.
The data configuration, set to 3 minutes in this case, determines how frequently Harvest calculates performance metrics for WorkloadDetailVolume. If you want more granular data, you can reduce this interval. Be aware, however, that reducing the data interval for the workload detail templates may slow down data collection because of the increased load on ONTAP and the high number of metrics. It's advisable to check the number of metrics being collected and the time taken to collect data for the workload detail template. The data interval should also not be shorter than the scrape interval configured in Prometheus.
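The last point, that the data interval should not be shorter than the Prometheus scrape interval, can be sanity-checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, the schedule is assumed to be already loaded into a dict, and only simple durations such as 30s, 3m, or 1h are handled.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(text):
    """Convert a simple duration such as '3m' or '30s' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def data_interval_ok(schedule, scrape_interval):
    """True if the 'data' interval is at least the scrape interval."""
    return to_seconds(schedule["data"]) >= to_seconds(scrape_interval)


# Schedule values from the template above; the 1m scrape interval is an
# assumed example value, not taken from any real Prometheus config.
schedule = {"counter": "20m", "instance": "10m", "data": "3m"}
print(data_interval_ok(schedule, "1m"))
```

With a data interval below the scrape interval, Prometheus would scrape less often than Harvest updates, so some of the extra samples would never be seen.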