#NABOX 2 vs 3 - Metrics difference?

timid epoch
#

When viewing node backend latency as an example in Grafana (using Nabox 2.6.4), I get avg read latency coming back as 25ms, which matches what I see in ActiveIQ. However, with the latest NABOX deployment (v3.1.2) I am seeing avg read latency coming back as 2ms for the same time periods. Why would this be so vastly different? As a result, Grafana alerts are not being triggered when they hit thresholds.

kindred reef
#

@timid epoch Could you share the Harvest versions on these 2 boxes?

timid epoch
#

older one is Harvest 1.6 and in the newer setup we have Harvest 22.08.0-1

kindred reef
#

Thanks. Could you share the metric name in harvest 2 which is not matching?

timid epoch
#

volume_read_latency

kindred reef
#

okay, do you see differences only for this metric or for others also?

timid epoch
#

volume_write_latency also looks lower

#

same node, same time period

kindred reef
kindred reef
timid epoch
#

Sorry, should this auto update, or is there a process for updating this? First time updating Harvest like this

kindred reef
timid epoch
#

Does it take a while to upgrade? I grabbed the harvest-22.11.0.tar.gz file and installed it, which was successful, but under the maintenance page the NetApp Harvest version has completely dropped off

#

also saying no clusters actively reporting anymore on the main dashboard

#

is a restart required?

kindred reef
#

Usually it happens pretty fast and restart happens internally in NABox

timid epoch
#

any way to check the status of it from SSH a bit more?

kindred reef
#

yes let me dig those up

timid epoch
#

yep done

kindred reef
#

if you run dc ps what does it show?

timid epoch
#

guessing its stuck

kindred reef
#

yes

#

let's try restarting
dc down

timid epoch
#

everything was up but now harvest3 is in a restarting mode again

#

sorry harvest2

kindred reef
timid epoch
#

no it wasn't, see the smaller one which was downloaded

#

ill upload from that link youve sent

#

it wasn't clear which to download from git but I guess I should have asked first

kindred reef
#

that smaller one looks wrong

#

yes let's upload the one which I have shared

timid epoch
#

uploading.....

#

cool that worked

kindred reef
#

nice!

timid epoch
#

back up and running

#

metrics still different by the looks

kindred reef
#

We should match with ontap

timid epoch
#

this is for a flexvol not flexgroup if that matters

kindred reef
#

ok no problem. That page will be valid for flexvol also

timid epoch
#

For comparison, I am looking at the Latency panel under Backend Drilldown on the Node dashboard on both old and new. As I say, ActiveIQ matches the Harvest 1.6 metrics

#

I think this disparity applies to many more metrics between the two

#

the difference being harvest version

#

throughput and IOPs for that same dashboard look to be very close

#

so that seems ok

kindred reef
#

okay, it could be that the volume summary is calculated differently between these 2. Let me check

timid epoch
#

thanks

#

just checking another panel in the same dashboard, "Protocol Backend IOPs", and this looks ok too. Hopefully you can find something different in the latency figures, that one seems to jump out as having the biggest difference that I have spotted.

kindred reef
#

sure

timid epoch
#

In addition to volume_read_latency I think the write and other latency metrics in the same panel may also be suffering similar differences

kindred reef
#

yeah, I suspect in Harvest 1.6 it was a weighted average, while in Harvest 2 we are just averaging in panels

timid epoch
#

would AIQ also work in the same way? as it shows the same figures as H1.6

kindred reef
#

I am not sure about AIQ. But it looks like the individual metric, i.e. volume_read_latency, is fine, while its summary at node level is not matching due to weighted average vs plain average
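
To illustrate why the two summaries can diverge so much, here is a minimal sketch. It assumes the weighted average is weighted by each volume's read ops, which is an assumption for illustration and not confirmed above; the per-volume numbers and function names are hypothetical too.

```python
# Hypothetical per-volume samples for one node: (read latency in ms, read ops/s).
# These numbers are made up purely to illustrate the effect.
volumes = [
    (50.0, 1000.0),  # one busy volume with high latency
    (1.0, 10.0),     # three nearly idle volumes with low latency
    (1.0, 10.0),
    (1.0, 10.0),
]


def plain_average(samples):
    """Average the per-volume latencies, treating every volume equally."""
    if not samples:
        return 0.0
    return sum(latency for latency, _ in samples) / len(samples)


def weighted_average(samples):
    """Average the per-volume latencies, weighted by each volume's ops."""
    total_ops = sum(ops for _, ops in samples)
    if total_ops == 0:
        return 0.0
    return sum(latency * ops for latency, ops in samples) / total_ops


print("plain average (ms):   ", plain_average(volumes))
print("weighted average (ms):", weighted_average(volumes))
```

With these made-up numbers the idle volumes drag the plain average well below the weighted one, because the plain average ignores that almost all of the I/O is hitting the slow volume. That is the same direction as the 2ms vs 25ms gap reported here.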

timid epoch
#

Cool. thanks for looking into. How do bug fixes work these days, just wondering typically what the time to fix normally is once a bug has been raised? we may need to leave the old system in place is all

kindred reef
#

I have added it to the next release, which is due around the middle of next month or later.

#

We'll update you for you to try as soon as fix is ready and available through our nightly builds so that you don't have to wait for official release.

kindred reef
timid epoch
#

excellent, thanks will give it a go and let you know

timid epoch
#

out of interest at what point does this build become classed as a stable build?

kindred reef
#

We release harvest every 3 months. So next stable release is due sometime in feb.

kindred reef
#

When you try this nightly build, do update harvest dashboards as well.

timid epoch
#

Ah OK. How do I do that sorry?

kindred reef
timid epoch
#

applied the nightly build and reset dashboards but it doesn't look to have changed at all

kindred reef
#

okay. Let's verify if dashboard is updated or not. If you open latency panel under backend drilldown. What are the queries it is pointing to?

#

Also verify once if harvest version is updated on nabox UI

timid epoch
#

Yep Harvest defo updated to nightly build version

#

Screenie of backend drilldown latency panel

kindred reef
#

that looks correct

#

means dashboard also updated fine

#

do u mean values still dont match with harvest 1.6?

#

perhaps compare last 5-10 mins of data

#

as older values before upgrade would still be wrong

timid epoch
#

I think I will need to monitor it, they do look a lot closer now when cross checking different nodes in the cluster

kindred reef
#

great. sure do let us know!

timid epoch
#

Also had to remember to update all my alert rules config to use the new queries i.e. node_volume_read_latency, worth noting.

timid epoch
#

Still looking good on this front, however I have noticed in the same Backend drill down section, that under system utilisation panel, that CPU Busy metrics are showing different figures between Harvest 1.6 and 2. Could the same be happening with the way averages are worked out? For example Harvest 1.6 system showing avg 65% CPU busy and Harvest 2 showing 74% for the same time period.

#

wondering how many other panels work out avgs different between the Harvest versions. These are just ones I have spotted.

kindred reef
#

There shouldn't be any other metric having this weighted average problem but we'll take a note of this and verify all of them before next release.

timid epoch
#

Yeh actually if I view the CPU Layer Drilldown, this is the same on both systems, so as the link says the sys util panel is now including other processor related activities