#Any way to show 95th percentile for IOP/s, TPut, Latency


tender parrot
#

I'm working on a table that displays perf metrics by QoS policy. The customer prefers 95th percentile rather than averages for these metrics.
Looks like the only way this is possible is through adjusting summary metrics via Prometheus - https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/

Wondering if there is another way to do this. Appreciate any inputs
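
For context on why a 95th percentile and an average can tell different stories, here is a minimal Python sketch (not from the thread). The sample values are made up, and the `percentile` helper is a hypothetical illustration that interpolates linearly between the two closest ranks; a monitoring backend may define quantiles slightly differently.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Return the q-quantile (0 <= q <= 1) using linear interpolation
    between the two closest ranks. Hypothetical helper for illustration."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Made-up IOP/s samples: mostly quiet, with a short burst.
iops_samples = [100] * 90 + [2000] * 10

average = mean(iops_samples)          # pulled down by the quiet periods
p95 = percentile(iops_samples, 0.95)  # reflects the burst level
```

With data like this, the average sits close to the quiet level while the 95th percentile sits at the burst level, which is presumably why the customer prefers it for sizing against a QoS limit.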


native agate
tender parrot
#

Thanks Chris! This looks great. One thing we are trying to do though is sort workload by Policy Group.

Here's what this looks like in my table... JSON for the same panel

#

To clarify, each policy group can contain multiple workloads. In our case, we are using OpenShift so there can be multiple PVs (vol, qtree or LUN) so each one of those would have its own workload.

native agate
#

Multiple workloads per policy group makes sense. And looks like sorting by policy group is working for you in your screenshot. Is that correct?

tender parrot
#

True. I sort/filter by policy in my table

#

I actually see this in the table already provided. Moving this to the front of my table for our purposes.

Thanks!

tender parrot
#

After breaking this down a bit further, I see that there is a % Used per workload. This might be misleading as the entire policy might be consumed if the workload constituents all total >100% of the policy.

IMO, would be good if we could sum up all workloads per policy to see if their total is approaching 100%. I like what's there so possibly a separate panel as a 'per policy' breakdown if possible

weary latch
#

@tender parrot we have flexgroup level percentages along with their constituents for fixed QoS panels also. Constituents are only shown in QoS fixed Utilized% panels. They are not calculated for adaptive policy panels.
Do you think we should remove the constituent level calculation for QoS % in fixed QoS policies panels?

tender parrot
#

@weary latch I think it helps to have both. It is good to know whether you are utilizing all of your policy (or flexgroup in your case) as well as any imbalance across constituent volumes. Both data points are useful.

I would like to see a single data point for one entire QoS policy. That is something that is missing if we are only seeing the constituent/workload level within the policy.

  • We can of course pull this into a workbook from one of the tables, but that is not as easy.
weary latch
#

@tender parrot When you mention a single data point for the entire QoS policy, are you referring to aggregating the used QoS percentages across all workloads within the policy? For example, if the QoS policy has a maximum of 1000 IOPS, and workload 1 uses 100 IOPS (10%) while workload 2 uses 200 IOPS (20%), then the aggregated used percentage at the QoS policy level would be (10% + 20%) / 2 = 15%.

tender parrot
#

@weary latch If the policy is shared, then all workloads within that policy consume a part of the 1000 IOP/s total. So for your example, it would be workload 1 at 100 IOP/s (10%) + workload 2 at 200 IOP/s (20%) = 30% of the policy consumed.

Here's an example of a shared policy group with 64 workloads sharing the same maximum of 600MB/s -

tpavcpax96a800::qos> qos policy show -policy-group svm_pod4_monitoring_tpa-PG

          Policy Group Name: svm_pod4_monitoring_tpa-PG
                    Vserver: svm_pod4_monitoring_tpa
                       Uuid: f70ebb35-560e-11ed-948e-00a098f3e613
         Policy Group Class: user-defined
            Policy Group ID: 5281
         Maximum Throughput: 600MB/s
         Minimum Throughput: 0
        Number of Workloads: 64
          Throughput Policy: 0-600MB/s
                  Is Shared: true
   Is Policy Auto Generated: -

When the policy is not 'shared', each workload has its own max or min equal to what is set on the policy.
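
The arithmetic above can be sketched in Python. This is a minimal illustration, assuming a shared policy's limit is one pool that all of its workloads draw from, while a non-shared policy applies the limit to each workload separately. The function names and sample numbers are hypothetical and are not Harvest or ONTAP APIs.

```python
def shared_policy_used_pct(workload_usage, policy_limit):
    """Percent of a shared policy consumed: all workloads draw from one
    limit, so their usage is summed before dividing (not averaged)."""
    if policy_limit <= 0:
        raise ValueError("policy_limit must be positive")
    return sum(workload_usage.values()) * 100 / policy_limit


def per_workload_used_pct(workload_usage, policy_limit):
    """Percent used per workload: with a non-shared policy, each workload
    gets the full limit to itself."""
    if policy_limit <= 0:
        raise ValueError("policy_limit must be positive")
    return {name: used * 100 / policy_limit
            for name, used in workload_usage.items()}


# Numbers from the discussion: a 1000 IOP/s policy with two workloads.
usage = {"workload1": 100, "workload2": 200}

shared_pct = shared_policy_used_pct(usage, 1000)
per_workload = per_workload_used_pct(usage, 1000)
```

Here `shared_pct` comes out to 30% for the policy as a whole, while `per_workload` keeps the 10% and 20% figures that the existing workload-level tables show.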

weary latch
#

Thanks @tender parrot for the context. Let me check if we can add it to workload dashboard for fixed QOS policies.

weary latch
tender parrot
#

@weary latch Thank you!... Assuming we will have a 'shared' and 'not shared' row so that this is clear for the end user, this approach will be perfect.

weary latch
#

Yes, the new panels are for shared policies only; hence, the table doesn't have an additional column indicating whether they are shared or not.
Also, existing panels/rows for workload levels have a new column added to make this distinction clear at the workload level, indicating whether the policy is shared or not, as shown below:

tender parrot
#

@weary latch This is perfect! Thanks!

#

@weary latch One further request... The new/shared table looks like it does not show the infinity symbol when there is no policy -

weary latch
#

@tender parrot

  • Do you mean that you are getting rows with max IOPS as 0 when you have not defined any IOPS for this policy? Probably only throughput is defined?
  • Could you share the Harvest version?
  • Also, please provide the output of the command qos policy-group show -policy-group for this policy.
tender parrot
#

@weary latch
"Do you mean that you are getting rows with max IOPS as 0 when you have not defined any IOPS for this policy? Probably only throughput is defined?"

  • Correct. This policy is throughput-based only.
  • We are running an older Harvest version (from 2023). This is customer-managed in a lab that I don't have access to; I've just imported this dashboard JSON.
  • NOTE: The original tables show an infinity symbol instead of 100%, so this appears to be working there. The problem seems to be with the new/shared table only.

tpavcpax96a800::qos> qos policy show -policy-group svm_pod4_monitoring_tpa-PG

          Policy Group Name: svm_pod4_monitoring_tpa-PG
                    Vserver: svm_pod4_monitoring_tpa
                       Uuid: f70ebb35-560e-11ed-948e-00a098f3e613
         Policy Group Class: user-defined
            Policy Group ID: 5281
         Maximum Throughput: 600MB/s
         Minimum Throughput: 0
        Number of Workloads: 64
          Throughput Policy: 0-600MB/s
                  Is Shared: true
   Is Policy Auto Generated: -
#

@weary latch It seems the math is off for the shared table. Focusing on the svm_pod4_monitoring_PG policy in both tables from the previous screenshots: the shared table shows only 4.84% of the 600MB/s consumed, but the original table shows 14.93% for the top workload alone in that policy.

The shared table should add all the workloads within the policy to determine % used of the policy.

weary latch
#

@tender parrot The dashboard I have shared is compatible with the latest release, 24.05, of Harvest. Since last year, we have made some fixes in the metrics published for QoS, which is why you are seeing max IOPS as 0 in the panels. Once you upgrade to the latest release, that problem will be fixed. We are going to release a patch today for 24.05 with a fix for a different issue, so you may want to wait for that.

Regarding the Shared % not matching, once you upgrade, we'll need to debug why that is the case. Below is the Prometheus query for Harvest 24.05 which should give us an idea of what the relevant percentage for this policy is:

    (
      qos_total_data{workload!~".*__[0-9]{4}-.*"} * 100 / (1000 * 1000)
      / on(datacenter, cluster, policy_group)
      group_left
      label_replace(
        qos_policy_fixed_max_throughput_mbps{},
        "policy_group",
        "$1",
        "name",
        "(.*)"
      )
    )
    * on(datacenter, cluster, policy_group)
    group_left(capacity_shared,object_count,max_throughput_iops,max_throughput_mbps,min_throughput_iops,min_throughput_mbps)
    label_replace(
      qos_policy_fixed_labels{name="svm_pod4_monitoring_PG", max_throughput_mbps!="", capacity_shared="true"},
      "policy_group",
      "$1",
      "name",
      "(.*)"
    )

Below is the output for me in the dashboard. Please note that constituents are excluded from the shared table percentage. Also, check whether it is a panel refresh issue where the percentages are out of sync between panels.

weary latch
tender parrot
#

@weary latch Regarding this -

"Regarding the Shared % not matching, once you upgrade, we'll need to debug why that is the case."

I updated the "Fixed QoS Bandwidth Utilization (%)" dashboard to have 'avg_over_time'. This looks right now. FYI

#

Same for "Fixed QoS Utilized IOPs (%)" -

    avg_over_time(
      qos_ops{datacenter=~"$Datacenter", cluster=~"$Cluster", workload=~"$Workload"}[$__range]
    )
    * on(datacenter, cluster, policy_group)
    group_left(max_throughput_iops, max_throughput_mbps, min_throughput_iops, min_throughput_mbps)
    label_replace(
      qos_policy_fixed_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", max_throughput_iops!=""},
      "policy_group",
      "$1",
      "name",
      "(.*)"
    )

weary latch
#

@tender parrot But Fixed QoS Workload IOPs Utilization (%) is not for shared percentages. It is just for each workload level data. We currently show instant data (current record) in that table. Do you think avg_over_time is better for the table? Also, are you getting the shared % correctly now post-upgrade?

tender parrot
#

@weary latch avg_over_time is better. The last poll of data is not super helpful IMO. It can be wildly different from the average.

I'm waiting on my customer to update their lab with the latest. Should be done fairly soon.
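
To illustrate the point about the last poll versus the average, here is a tiny Python sketch with made-up utilization samples. The values are hypothetical and only show how a single instant reading can diverge from the average over the selected time range.

```python
from statistics import mean

# Made-up utilization samples (%) across a dashboard time range, oldest first.
utilization_samples = [12.0, 15.0, 14.0, 13.0, 2.0]

last_poll = utilization_samples[-1]        # what an instant (last value) table shows
range_average = mean(utilization_samples)  # what an average over the range shows
```

In this made-up case the last poll happens to land in a quiet moment, so an instant table would understate how busy the workload was over the range.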

weary latch
#

Sure. Yes that makes sense. Thanks. We'll change in Harvest as well.

tender parrot
#

Thanks!

tender parrot
#

@weary latch We have updated to Harvest v24.05.2-1. However, I'm not seeing the "Fixed QoS Shared Policy Utilization" panel. Can you confirm if this was omitted?

weary latch
#

@tender parrot It appears there is an issue with the JSON for version 24.05.2-1. I am also experiencing the same problem with these panels not displaying. Could you please import the workload dashboard JSON from the following link into Grafana and check?

https://raw.githubusercontent.com/NetApp/harvest/6da6b2e681484e23e64822dcda0f9457e75f42d1/grafana/dashboards/cmode/workload.json

Note: You may encounter some links in this dashboard. These can be ignored for now as they are from a nightly build.

#

Since our last discussion, we have fixed a bug in the workload dashboard where workload classes of type autovolumes were not displaying in the 24.05.2 release. This issue has been resolved in the nightly build. If you want to see these workload classes in the dashboard, please refer to the nightly build available at: https://github.com/NetApp/harvest/releases/tag/nightly

weary latch
#

@tender parrot Actually, in version 24.05.2-1, these panels are present, but the rows/panels have been renamed, which has caused some confusion for me as well. Could you please check if you see the following two rows?

tender parrot
#

@weary latch I do see the panels, but no reference to 'shared' or 'not-shared' is indicated

#

Also, looks like the 'avg_over_time' was omitted in the latest release

weary latch
#

You should see the rows below.

tender parrot
#

@weary latch Confirmed... The nightly from 7-24-2024 has the changes requested

#

Should we see this included in the August release then?

native agate
#

yes, it will be

tender parrot
#

thx!