#vol_latency data in AIQUM and harvest

1 messages · Page 1 of 1 (latest)

signal cairn
#

Hi All, as shown in the images, I want to get the volume latency.

The value in AIQUM is half of the value in harvest!!
The metric I am using in harvest is "qos_volume_latency"

Is the gap between two values caused by the collection interval setting?
The AIQUM is 5 minutes and harvest is 1 min?

tropic anchor
#

@signal cairn It may be possible that you are hitting issue as highlighted in KB https://kb.netapp.com/onprem/ontap/Performance/High_ONTAP_latency_reported_but_not_in_qos_statistics_volume_latency_show
Harvest qos latencies are validated against ONTAP command qos statistics volume latency show as mentioned here
https://github.com/NetApp/harvest/wiki/Volume-latency-for-flexgroup#qos-volume-read-latency-metric

Please note since Harvest 23.08 qos_volume_latency metrics are tracked/renamed as qos_latency. More details here https://github.com/NetApp/harvest/discussions/2265

signal cairn
#

@tropic anchor The value of qos_volume_latency is higher than the value in AIQUM, which is just the opposite of what the KB says. Which is quite strange.
So I should follow the guide line in KB: "Trust the delay/service center delays from qos statistics volume latency show, instead of ActiveIQ Unified Manager when there is a significant gap. "?

tropic anchor
#

I think best is to monitor the output of ONTAP CLI qos statistics volume latency show and see if Harvest value matches with that.

signal cairn
#

Okay! I tried with that ! But thanks

tropic anchor
#

Do you see any differences in what ONTAP CLI and Harvest shows?

signal cairn
#

@tropic anchor I'll ask for the value, to be confirmed.

signal cairn
#

@tropic anchor We found some peak value around 3 minutes happen regularly. What could be the problem? The peak doesn't show in AIQUM.

tropic anchor
#

@signal cairn Do you see the same in ONTAP when you run qos statistics volume latency show? We have validated Harvest metrics with this ONTAP CLI.

wise locust
#

AIQUM averages at 5 minutes, so it may be getting averaged out. Insert standard perf troubleshooting here...

tropic anchor
signal cairn
#

Hi @tropic anchor
There are still some spike of qos_latency

tropic anchor
#

okay graph is bit unclear. Green line is qos_ops?

signal cairn
#

Yes, the green line is qos_ops*180

tropic anchor
#

Okay. Green line is mostly at 0.. it probably got normalised due to high latency values. Can you share separate graphs for qos_ops and qos_latency so that qos_ops values are clear. You can run them in prometheus UI as well.

signal cairn
tropic anchor
#

Thanks Tony. Can you share separate graphs for qos_ops and qos_latency from prometheus UI. Also if you can share for different workloads having such spikes that will be helpful.

tropic anchor
#

@signal cairn What is the Harvest installation type? NABox, container, rpm, deb,native? Can you also check Harvest version? Pls share output of bin/harvest --version output if any of rpm, deb, native systems.

signal cairn
#

The previous images are from nabox(harvest version=23.12.18-nighly)

signal cairn
#

open an issue for this latency spike #2597

tropic anchor
tropic anchor
#

@signal cairn Can you also share output of below command from NABox root login
dc exec nabox-harvest2 /netapp-harvest/bin/harvest --version

signal cairn
#

@tropic anchor User replying that when using these commands, it shows "docker compose not found"

#

Is there other ways to get logs that you need from NAbox?

tropic anchor
#

@mint hearth can help with this

signal cairn
#

@mint hearth If any method that can help! Thanks

mint hearth
#

What version of NAbox is that ? Are you logging in as root or admin ?

signal cairn
#

with root, and the version is 3.4

mint hearth
#

Can you share a screenshot of the terminal showing the error ?

signal cairn
#

@mint hearth s14gvnamon02:~# dc logs nabox-api > nabox-api.log; dc logs nabox-harvest2 > nabox-harvest2.log;\

-ash: docker-compose: not found

#

dc exec nabox-harvest2 /netapp-harvest/bin/harvest –version
-ash: docker-compose: not found

mint hearth
#

Do you have time now ?

signal cairn
#

@tropic anchor

tropic anchor
#

Thanks

kind harness
#

Hey @signal cairn did you use the ng-harvest-files@netapp.com address? We haven't received anything yet, maybe it's large or things are moving slowing this morning

signal cairn
#

@kind harness, I just sent with the subject "qos_latency spike issue Ticket #2597"

kind harness
#

received

tropic anchor
#

Thanks @signal cairn Have these QoS latency spikes been noticed in just one cluster, or in multiple clusters?

tropic anchor
#

@signal cairn The cluster screenshot, which shows spikes and has been shared, appears to be very slow for the collection of objects below. Do you know if customer is using WorkloadDetailVolume and WorkloadDetail metrics? If not, we can try disabling these templates to reduce some load on Cluster.

signal cairn
#

Their workloadDetails are already disabled

#

And the custom.yml is like this

#

But will can still find metrics from workloadDetail template such as "qos_detail_resource_latency" in prometheus

tropic anchor
#

@mint hearth

tropic anchor
#

@signal cairn Let's try to comment out lines in custom_volume_details.yaml and see if it disables workload details

signal cairn
#

Yes, I do delete them, but still the same

tropic anchor
#

What do you have in cat custom.yaml?

signal cairn
#

custom.yml

tropic anchor
#

Okay. What does below command returns?

grep -i "workloaddetail" /opt/harvest2-conf/conf/zapiperf/*
signal cairn
#

/opt/harvest2-conf/conf/zapiperf/custom_volume_details.disabled: WorkloadDetail:/opt/harvest2-conf/conf/zapiperf/custom_volume_details.disabled: WorkloadDetailVolume:/opt/harvest2-conf/conf/zapiperf/default.yaml:# WorkloadDetail: workload_detail.yaml/opt/harvest2-conf/conf/zapiperf/default.yaml:# WorkloadDetailVolume: workload_detail_volume.yaml
@tropic anchor

tropic anchor
#

Thanks Tony. Not sure if custom_volume_details.disabled is causing an issue
@mint hearth We discussed a possible issue around this. Is it the same case here?

#

@signal cairn We can comment out lines in custom_volume_details.disabled and check.

mint hearth
#

Yep please do a dc restart nabox-harvest2

signal cairn
#

After commenting out lines in custom_volume_details.disabled, there are no more workload detail metrics!

tropic anchor
#

nice!

tropic anchor
#

@signal cairn I have shared Harvest debug build with you via DM.

tropic anchor
#

@signal cairn Are there any updates on whether the issue is resolved with the debug build?

signal cairn
#

@tropic anchor Since we are going to celebrate our Chinese new year, customer will update the debug build after the holiday, expected to be updated on 16th Feb,

tropic anchor
#

Sure Thanks!

tropic anchor
signal cairn
#

Hi @tropic anchor Still seeing spike value of qos_latency after upgrading to harvest 24.02

tropic anchor
#

Thanks @signal cairn . Let's enable debug logs on this. Can you share output of below command.

docker exec -it nabox-harvest2 sh -c "cat /app/harvest2/__init__.py"

tropic anchor
#

Yes

signal cairn
tropic anchor
#

You can email this file to us as well.

signal cairn
#

@tropic anchor I'll email the file to you, thanks

tropic anchor
#

sure thanks

#

Let's run below command to enable debug logs in NABox

docker exec -it nabox-harvest2 sh -c "sed -i \"s/#'-l','0',/'-l','1',/g\" /app/harvest2/__init__.py"
dc restart

Please share Harvest logs with us once you see spike after enabling debug logs.

signal cairn
#

After dc restart, keeps showing this error?

tropic anchor
#

services must be restarting

#

wait for 3-4 mins

#

what does dc ps show?

signal cairn
#

yes, it start running

#

I'll share logs with spike value with you through email

tropic anchor
#

Thanks

#

Let's verify once if debug logs are enabled

#

If you run below command

dc logs --tail 100 -f nabox-harvest2

you should see few lines with DBG as in screenshot

signal cairn
#

Also, user saying that after upgrading to harvest 24.02, when time range set to 7 days, it will runs timeout error, but in old version 23.11 of harvest, this error doesn't show

tropic anchor
#

Can you share screenshot of error?

tropic anchor
#

looks good. Thanks

signal cairn
tropic anchor
#

Is it just this panel? or all panels are timing out?

#

Also see for any errors in prometheus logs

dc logs --tail 200 -f prometheus
signal cairn
tropic anchor
#

Let's check prometheus logs and also verify if similiar query is executing fine in prometheus UI

signal cairn
tropic anchor
#

Okay. It sound like a problem with Prometheus having to fetch a lot of data. How big is the time range?

signal cairn
#

If set to 7 days, I think the query time is over 30second, and timeout?

tropic anchor
#

default timeout is 2m

#

Can you share your grafana panel query which is timing out?

signal cairn
tropic anchor
#

This query could retrieve a large amount of data, depending on the number of clusters you have. It would be more efficient to use topk. See if the following query is helpful.

avg by (cluster) (qos_latency{datacenter=~"$Datacenter",cluster=~"$Cluster"})
and
topk($TopResources, avg by (cluster) (avg_over_time(qos_latency{datacenter=~"$Datacenter",cluster=~"$Cluster"}[$__range] @ end())))
#

Regarding the latency spikes, To clarify: did these spikes originate from the 24.02 version, or are they old data from 23.11? We are only interested in spikes that have occurred since the 24.02 version. Any spikes prior to 24.02 upgrade this will remain as it is.

tropic anchor
#

@signal cairn Are there any updates on spikes with 24.02?

signal cairn
tropic anchor
#

Sure thanks

signal cairn
#

See only this volume with spike value

#

And from AIQUM, it seems normal

#

harvest version 24.02
NAbox version 3.4

tropic anchor
signal cairn
#

@tropic anchor Sent

tropic anchor
#

Thanks @signal cairn
1: It appears that the logs do not have the debug level enabled, but are set at the info level instead. Have there been any changes to the Harvest installation since we switched the Harvest logging to the debug level?
2: Is there only one workload experiencing spikes now, or are there more? Have the occurrences of spikes decreased since 24.02?
3: Is the graph showing qos_latency metrics? If yes, Could you share the qos_ops for the same time range for this workload?
4: Could you enable the debug logs again as previously suggested? Once enabled, please share the logs if a spike occurs after the debug log has been enabled.

signal cairn
#
  1. No, we just do the following command and get the logs:
    docker exec -it nabox-harvest2 sh -c "sed -i "s/#'-l','0',/'-l','1',/g" /app/harvest2/init.py"
    dc restart
  2. The occurrences of spikes do decreased since 24.02. Now only occur in the busiest workload cluster. There are around 1-3 workloads in the that cluster experiencing spike.
  3. Yes the graph is using "qos_latency" metric.
  4. Should I follow the same command as previous to enable debug logs?
tropic anchor
signal cairn
tropic anchor
#

Thank you. These logs only cover the last 30 minutes and don't show any spikes. The logs will be helpful once spikes occur. Let's wait for these spikes, after which you can share the logs. Could you also share screenshots from the Prometheus UI? I am looking for separate charts for qos_ops and qos_metrics for the workloads where there are spikes, covering the same time range of spikes. We aim to capture the ops for the workload at the time of the spike.

#

NABox prometheus is available at https://NABOX_URL/prometheus

signal cairn
tropic anchor
mint hearth
#

Yes you can change the log setting and do service docker restart

# cat /etc/docker/daemon.json
{
    "data-root": "/opt/docker",
    "insecure-registries": [
      "registry.lab"
    ],
    "log-driver": "local",
    "log-opts": {
        "max-size": "200m",
        "max-file": "5"
    },
    "bridge": "none"
}
service docker restart
signal cairn
signal cairn
#

@tropic anchor Once I do service docker restart, do I have to enable to debig mode again?

tropic anchor
#

@signal cairn In the Docker configuration, setting the number of days is not available in daemon.json out of the box. Given the current log load, 200MB may only provide us with ~6 hours of logs. You do not need to enable the debug log again with this restart, it will remain as is.

mint hearth
#

As long as you have the capacity you can push the log settings as much as you want. I believe the data partition is used for logs.

signal cairn
#

@tropic anchor
As the image show, the latency has exceptional peak value, but the ops looks normal

#

We have increase the logging limit size, and will wait till the next peak shows and collect the logs.

#

Also, we seeing some strange value in qos_ops as the image show

tropic anchor
#

@signal cairn Thanks. It appears that the LUN workload has spiked. Interestingly, the ops value for the LUN workload has changed within a minute. QoS templates, by default, refresh data every three minutes.

What is the current logging limit size?
Could you clarify what you mean by 'strange value of qos_ops?

signal cairn
tropic anchor
signal cairn
tropic anchor
tropic anchor
#

Thanks @signal cairn for logs. I see logs are still for last 30 minutes. Can you share output of below

docker inspect nabox-harvest2 | jq '.[0].HostConfig.LogConfig'
signal cairn
tropic anchor
#

Okay. We need to restart pollers for them to get this change. can you run below commands

dc down
dc up -d

and then check output of

docker inspect nabox-harvest2 | jq '.[0].HostConfig.LogConfig'
signal cairn
tropic anchor
#

Not needed. But as I mentioned earlier, 200m can get us last 6 hours of logs only. Is it possible to bump it to 1000m?

tropic anchor
#

Thanks @signal cairn . Our last steps reset the log level to INFO again in NABox. Could you please run the steps below to enable debug logs again?

docker exec -it nabox-harvest2 sh -c "sed -i \"s/#'-l','0',/'-l','1',/g\" /app/harvest2/__init__.py"
dc restart

You can validate if the logs are set to debug by running the following command:

dc logs --tail 100 -f nabox-harvest2

Also, please ensure that the log configuration matches the configuration defined in daemon.json by running this command:

docker inspect nabox-harvest2 | jq '.[0].HostConfig.LogConfig'
tropic anchor
#

@signal cairn You can also run below command for longer duration to capture all latency spikes over 5 secs from logs

docker logs -f nabox-harvest2 2>&1 | grep -i 'Detected high latency' | grep -iE 'collector=ZapiPerf:Workload|collector=ZapiPerf:WorkloadVolume'
signal cairn
tropic anchor
#

Thanks. Please share full logs as well.

tropic anchor
#

@signal cairn I have shared relevant cluster and spike time details via DM. Could you provide the perf archive files for the specified time range from this cluster? Since we don't have the workload names in the logs, could you also share information on which workloads have experienced spikes, as indicated by the data from Prometheus/Grafana qos_latency for the mentioned time? We want match these spikes with the values in the archive to perform a comparison.

tropic anchor
signal cairn
tropic anchor
kind harness
signal cairn
#

Thanks @kind harness

signal cairn
#

@tropic anchor Thanks for opening an issue #2769 regarding this issue.
We also see abnormal spike value when using "qos_ops"
Is this also regarding to this issue?

tropic anchor
signal cairn
tropic anchor
#

look for lines with using latency_io_reqd if debug logs are enabled.

signal cairn
#

Use the following command,
docker logs -f nabox-harvest2 2>&1 | grep -i 'latency_io_reqd'?

tropic anchor
#

Yes

signal cairn
#

@tropic anchor After setting latency_io_reqd value to 50, no more "qos_ops" and "qos_latency" data coming in
So we change it back to 10

signal cairn
tropic anchor
tropic anchor
#

We can tune this value if needed after fixing #2769

tropic anchor
signal cairn
tropic anchor
#

You can try below

docker logs -f nabox-harvest2 2>&1 | grep -i 'delay' | grep -i 'schedule' | grep -i 'task'

Value of delay is jitter value

2024-03-27T05:04:51Z DBG collector/collector.go:185 > Poller=A250-41-42-43 collector=ZapiPerf:CIFSNode delay=31.773100438s schedule=24h task=counter
2024-03-27T05:04:51Z DBG collector/collector.go:185 > Poller=A250-41-42-43 collector=ZapiPerf:CIFSNode delay=31.773100438s schedule=10m task=instance
2024-03-27T05:04:51Z DBG collector/collector.go:185 > Poller=A250-41-42-43 collector=ZapiPerf:CIFSNode delay=31.773100438s schedule=1m task=data

signal cairn
#

Update: user has upgraded to harvest-24.03.29-nightly and set the jitter to 1m,
Will keep monitoring the abnormal peak value of "qos_latency" and "qos_ops"

signal cairn
#

After upgrading to harvest-24.03.29-nightly build, everything seems good now!!
"qos_ops" and "qos_latency" metrics don't show abnormal spike value now.
Will keep updating here if any abnormal value happen.
@tropic anchor

tropic anchor
#

Thanks @signal cairn . Could you also share harvest logs. It will help in comparing before and after.

tropic anchor
#

@signal cairn How are things now in relation to latency and OPS spikes? Also, have you had a chance to collect the latest logs?

signal cairn
#

@tropic anchor After upgrading to latest nightly build version, user doesn't see any abnormal spike value. Will arrange time to collect latest logs

tropic anchor
#

Thanks

tropic anchor
#

@signal cairn Do we have latest logs from system?

signal cairn
#

@tropic anchor Sent the latest nabox log to you, thanks!

tropic anchor
#

Thanks