#Feature request: CPU utilization graph removing background i/o usage for failure/disaster planning


quasi bear
#

Hey everyone, I've had a very long-running ticket with NetApp support to work out how to keep our systems appropriately utilized without affecting capability during a failover event (planned or otherwise).
It took a very long time, but they eventually came across this article and passed it along to us: https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/CPU_utilization_above_50_percent_before_upgrading_ONTAP
Step 1 says to look at ActiveIQ, but that seems to show exactly the same CPU output as NABox. Step 2 references a deprecated capability for AIQUM (node failover planning guide). But step 3 is where the magic is!
Step 3 is to confirm how much of the workload is background IO (background IO yields to foreground operations, so it would be halted in a CPU-bottlenecked situation such as a failover).
https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/How_to_check_for_background_CPU_utilization_in_ONTAP_9

The basis of this seems to be using the qos statistics workload resource cpu show -node <Node> command to provide a breakdown of the CPU usage.

The request: Would it be possible to get a CPU graph (or even just a line-item that can be looked at) for the foreground processes to help ensure that we keep the essential CPU usage below 50%, so we're not bound up in the event of a failover situation...
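To make the request concrete, here is a rough sketch of the arithmetic I have in mind. It is not something Harvest or NABox does today. The workload names, the percentages, and the background flag are all made up for illustration, and it assumes (as a simplification) that the foreground CPU of both HA partners simply adds up on the surviving node after a failover.

```python
from dataclasses import dataclass


@dataclass
class WorkloadCpu:
    """One row of a per-workload CPU breakdown (hypothetical shape)."""
    name: str
    cpu_pct: float
    background: bool  # assumption: we already know which workloads are background


def foreground_cpu(workloads):
    """Sum CPU for workloads that would NOT yield during a failover."""
    return sum(w.cpu_pct for w in workloads if not w.background)


def failover_headroom_ok(node_a, node_b, limit_pct=100.0):
    """Simplified check: can one node carry both partners' foreground CPU?"""
    return foreground_cpu(node_a) + foreground_cpu(node_b) <= limit_pct


# Made-up sample data, not real ONTAP output or real workload names.
node_a = [
    WorkloadCpu("user-data-1", 30.0, False),
    WorkloadCpu("user-data-2", 5.0, False),
    WorkloadCpu("example-background-job", 45.0, True),
]
node_b = [
    WorkloadCpu("user-data-3", 40.0, False),
    WorkloadCpu("user-data-4", 10.0, False),
    WorkloadCpu("example-background-job", 35.0, True),
]

if __name__ == "__main__":
    print("node A foreground %:", foreground_cpu(node_a))
    print("node B foreground %:", foreground_cpu(node_b))
    print("survives failover:", failover_headroom_ok(node_a, node_b))
```

In this made-up case both nodes look 80-85% busy in total, but the foreground share is at or below 50% on each, which is the number I would like to see graphed.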

I'm quite happy to have conversations or provide info about what I mean if my rambling wasn't clear, but I feel like this is something that is very much missing from standard NetApp monitoring, so maybe it's yet another place for NABox to shine?

Thanks again for all your work!

lucid pendant
#

Hi @quasi bear, that's a good suggestion. Harvest can collect resource cpu today, but unfortunately when we publish that metric we don't do it at the node level. That's an area we're looking to improve; you can track progress via https://github.com/NetApp/harvest/issues/2427

In the meantime, does the CPU Busy Domains panel on the Node dashboard help?

quasi bear
#

Awesome, thanks for re-opening that area of research, I hope something comes out of it.

As for the CPU domains... kind of? But there's very little documentation on what is good/bad/ignorable for those entries. I came across this article on what they all basically mean (https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/What_are_CPU_as_a_compute_resource_and_the_CPU_domains_in_ONTAP_9).

However, in my specific instance, we have (had?) SnapMirror causing what looked like 80-90% CPU usage on nodes, but support looked into it and said that enough of it was background I/O caused by SnapMirror that we were OK. When I disabled SnapMirror to test this, the only things that went down significantly were wafl_exempt and exempt, which don't seem to be listed as background I/O in this article (https://kb.netapp.com/on-prem/ontap/Perf/Perf-KBs/How_to_check_for_background_CPU_utilization_in_ONTAP_9).

So, I do see data in the CPU Busy Domains, but I don't see, or understand, how to translate that into something that helps me know my system capabilities...

naive violet
#

@reef basin Could you please provide your input on the CPU domains question above? Thanks.

quasi bear
#

Hello @lucid pendant and @naive violet, I was just checking back on this request. Looking at the issue Chris linked above, the last activity appears to have been last month, but I don't see any status. Was it deemed to be not possible, or did it get rolled into a larger request effort or something?