When viewing node backend latency in Grafana (using NABox 2.6.4), I get avg read latency coming back as 25ms, which matches what I see in ActiveIQ. However, with the latest NABox deployment (v3.1.2) I am seeing avg read latency coming back as 2ms for the same time periods. Why would this be so vastly different? As a result, Grafana alerts are not being triggered when they hit thresholds.
#NABOX 2 vs 3 - Metrics difference?
@timid epoch Could you share the Harvest versions on these 2 boxes?
The older one is Harvest 1.6, and in the newer setup we have Harvest 22.08.0-1
Thanks. Could you share the metric name in harvest 2 which is not matching?
volume_read_latency
okay, do you see differences only for this metric or for others also?
okay. We have made a few fixes related to perf metrics in our latest release.
https://github.com/NetApp/harvest/releases/tag/v22.11.0
Could you upgrade to this version and see.
We did test this metric last year against ONTAP CLI. See if this page helps
https://github.com/NetApp/harvest/wiki/Volume-latency-for-flexgroup
Sorry, should this auto-update, or is there a process for updating this? First time updating Harvest like this
See steps here
https://nabox.org/documentation/upgrade/
You just need to upload the Harvest binary to NABox
Does it take a while to upgrade? I grabbed the harvest-22.11.0.tar.gz file and installed it, which was successful, but on the maintenance page the NetApp Harvest version has completely dropped off
The main dashboard is also saying no clusters are actively reporting anymore
is a restart required?
Usually it happens pretty fast and restart happens internally in NABox
Is there any way to check its status in more detail over SSH?
yes, let me dig those up
Let's log in to NABox as the root user through SSH https://nabox.org/documentation/configuration/
yep done
if you run dc ps what does it show?
Could you confirm if this is the binary you have uploaded?
https://github.com/NetApp/harvest/releases/download/v22.11.1/harvest-22.11.1-1_linux_amd64.tar.gz
no it wasn't, it seems a smaller one was downloaded
I'll upload from the link you've sent
it wasn't clear which one to download from GitHub, but I guess I should have asked first
nice!
okay could you compare it with ontap cli metrics for a volume
https://github.com/NetApp/harvest/wiki/Volume-latency-for-flexgroup
We should match with ontap
this is for a flexvol not flexgroup if that matters
ok no problem. That page will be valid for flexvol also
For comparison, I am looking at the Latency panel under Backend Drilldown in the Node dashboard on both old and new. As I say, ActiveIQ matches the Harvest 1.6 metrics
I think this disparity applies to many more metrics between the two
the difference being harvest version
throughput and IOPs for that same dashboard look to be very close
so that seems ok
okay. It could be the way the volume summary is being calculated in these 2. Let me check
thanks
Just checked another panel in the same dashboard, "Protocol Backend IOPs", and this looks ok too. Hopefully you can find something different in the latency figures; that one seems to jump out as having the biggest difference that I have spotted.
sure
In addition to volume_read_latency, I think the write and other latency metrics in the same panel may also be suffering from similar differences
yeah, I suspect that in Harvest 1.6 it was a weighted average, while in Harvest 2 we are just taking a plain average in the panels
would AIQ also work in the same way? as it shows the same figures as H1.6
I am not sure about AIQ. But it looks like the individual metric, i.e. volume_read_latency, is fine, while its summary at node level is not matching due to weighted average vs plain average
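To illustrate why a plain average and a weighted average can disagree this much, here is a minimal Python sketch. It assumes the weight is each volume's read operation count, which the thread does not state explicitly; the function names and the sample numbers are hypothetical and are not taken from Harvest or from the systems discussed here.

```python
def plain_average(latencies):
    """Average per-volume latency, treating every volume equally."""
    return sum(latencies) / len(latencies)


def weighted_average(latencies, ops):
    """Average per-volume latency, weighted by each volume's operation count.

    Assumption: ops is the weight; busy volumes dominate the node-level figure.
    """
    total_ops = sum(ops)
    if total_ops == 0:
        return 0.0  # no I/O at all, so no meaningful latency
    return sum(lat * n for lat, n in zip(latencies, ops)) / total_ops


# Hypothetical volumes on one node: one busy, slow volume and three idle, fast ones.
read_latency_ms = [40.0, 1.0, 0.5, 0.5]
read_ops = [900, 50, 30, 20]

print(plain_average(read_latency_ms))               # idle volumes pull this down
print(weighted_average(read_latency_ms, read_ops))  # the busy volume dominates
```

With numbers like these, the plain average is much lower than the weighted one, which is the same direction as the 2ms vs 25ms gap reported above.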
@timid epoch Thanks for raising this. I have opened an issue for this. https://github.com/NetApp/harvest/issues/1666
Cool, thanks for looking into it. How do bug fixes work these days? Just wondering what the typical time to fix is once a bug has been raised. We may need to leave the old system in place, is all
I have added it to the next release, which is due mid to late next month.
We'll let you know as soon as the fix is ready and available through our nightly builds, so that you can try it without waiting for the official release.
@timid epoch This issue is now fixed in our nightly builds. https://github.com/NetApp/harvest/releases/tag/nightly
If you get a chance, let us know if this fixes the issue.
excellent, thanks, will give it a go and let you know
out of interest, at what point does this build become classed as a stable build?
We release Harvest every 3 months, so the next stable release is due sometime in Feb.
When you try this nightly build, do update harvest dashboards as well.
Ah OK. How do I do that, sorry?
After upgrading harvest as mentioned here https://nabox.org/documentation/upgrade/#ugrading-netapp-harvest
You can upgrade the existing Harvest dashboards with the Reset button (as in the screenshot), which will overwrite the default Harvest dashboards.
applied the nightly build and reset the dashboards, but it doesn't look to have changed at all
okay. Let's verify whether the dashboard is updated or not. If you open the latency panel under backend drilldown, what queries is it pointing to?
Also verify that the Harvest version is updated in the NABox UI
Yep Harvest defo updated to nightly build version
Screenie of backend drilldown latency panel
that looks correct
which means the dashboard also updated fine
do you mean the values still don't match Harvest 1.6?
perhaps compare the last 5-10 mins of data
as older values from before the upgrade would still be wrong
I think I will need to monitor it. They do look a lot closer now when cross-checking different nodes in the cluster
great. sure do let us know!
Also had to remember to update all my alert rule configs to use the new queries, i.e. node_volume_read_latency. Worth noting.
Still looking good on this front. However, I have noticed in the same Backend Drilldown section that, under the system utilisation panel, CPU Busy metrics are showing different figures between Harvest 1.6 and 2. Could the same be happening with the way averages are worked out? For example, the Harvest 1.6 system shows avg 65% CPU busy and Harvest 2 shows 74% for the same time period.
Wondering how many other panels work out averages differently between the Harvest versions. These are just the ones I have spotted.
Thanks @timid epoch for the update. About node_cpu_busy: yes, there are differences, and the Harvest 2 metric is correct. More information here https://github.com/NetApp/harvest/discussions/1668
There shouldn't be any other metric with this weighted average problem, but we'll take note of this and verify all of them before the next release.
Added an issue for verification https://github.com/NetApp/harvest/issues/1680
Yeh, actually if I view the CPU Layer Drilldown, this is the same on both systems, so as the link says, the sys util panel is now including other processor-related activities