#Error with Volume Time Range > 6Months (Datacenter Variable)


potent flower
#

Nabox Version: 3.3 (Same Error with 3.5.X)
Harvest Version: 24.04.02-nightly

Querying the Datacenter Variable brings the same error (see image)

potent flower
#

Error with Volume Time Range > 6Months (Datacenter Variable)

bronze raft
#

Hi @potent flower, can you check your JavaScript console for errors and see if there is anything useful there? The Grafana log file should have something useful too. SSH to Nabox and adjust the duration (4 hours below) to include the time this error happened:
dc logs grafana --since 4h

potent flower
#

Hello @bronze raft, --since doesn't work

bronze raft
#

ah, guess that was introduced in a later version. What does dc logs --help show?

potent flower
#

docker logs --since 4h grafana
Worked

#

Sent you a PM. Please check whether it is what you need. There is not much in this log.

bronze raft
#

thanks! agreed not much there - all info level messages and nothing about timeouts or warnings. Let's check the browser's console log like the Grafana error says

potent flower
#

Tried to export it. Sent you a PM again

bronze raft
#

are you using Chrome, Firefox or something else? Looks like timeouts, perhaps because you are requesting a year's worth of data. What if you go to the dashboard in question again, switch to the variables page, open DevTools, switch to the network tab, and refresh the variables page so the network requests happen again? I'd like to see how long the requests take before timing out

potent flower
#

I am using Chrome

bronze raft
potent flower
#

You mean this page?

bronze raft
#

either way, a timeout is a timeout. The network tab should show us how long these requests are taking. Are your Grafana and Prometheus geographically close to each other or far away?

Yes, the page you pasted at the top with the error. Basically I want to recreate that error with the DevTools network tab open to see how long the request takes before timing out and which part of the network connection is timing out

potent flower
#

Ok.
Grafana and Prometheus are on the same Server (Nabox)

#

The Filers are here in the same network too

bronze raft
#

that's good, shouldn't be much speed of light latency then

potent flower
#

Hope everything is there

bronze raft
#

received thanks!

bronze raft
#

there are eight requests in your HAR file: one svg, four series queries, and three v1/query. Two of the v1 queries are timing out after 30s. We can increase Grafana's timeout so it doesn't do that. Let's see if we can get an idea what we should increase the timeout to by going to Prometheus, pasting this query, and executing it to see how long it takes Prometheus to return the data:
topk(5,sum by (volume)(avg_over_time(fabricpool_cloud_bin_operation{datacenter=~".*",cluster=~".*",svm=~".*",volume=~".*"}[31622400s])))

Prometheus will show how long it took to execute the query as "Load time:"
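
The same measurement can also be taken outside the Prometheus UI. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (an assumption; the address on a Nabox host may differ) and using Prometheus' /api/v1/query HTTP endpoint. The helper names build_query_url and time_query are hypothetical and not part of Nabox or Harvest. Note that the 31622400s range in the query is 366 days.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: Prometheus address; adjust for your installation.
PROMETHEUS_URL = "http://localhost:9090"

# The query from the thread, unchanged.
QUERY = (
    "topk(5,sum by (volume)(avg_over_time(fabricpool_cloud_bin_operation"
    '{datacenter=~".*",cluster=~".*",svm=~".*",volume=~".*"}[31622400s])))'
)

# 31622400 seconds expressed in days (366 days, i.e. one leap year).
RANGE_DAYS = 31622400 // 86400


def build_query_url(base_url, query):
    """Build the instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def time_query(base_url, query, timeout_s=300):
    """Run the query; return (elapsed seconds, number of result series)."""
    start = time.monotonic()
    url = build_query_url(base_url, query)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return elapsed, len(body["data"]["result"])
```

Calling time_query(PROMETHEUS_URL, QUERY) gives a rough number to compare against Grafana's 30s limit. If the call itself hits the 300s client timeout chosen here, the query is too heavy for a timeout increase alone to help.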

potent flower
#

This query runs for a long time and is eating my memory

potent flower
bronze raft
#

that's good and bad news 🙂 now we know why Grafana is timing out. Did that query ever finish? I wonder if more RAM would help? Prometheus may not be up to the task of asking for a year's worth of data. Is this your production or dev instance? One idea would be to upgrade to Nabox v4 (beta). Nabox 4 uses VictoriaMetrics instead of Prometheus and my reading has been that it may be faster on long range queries

potent flower
#

We never had problems with queries over 1 year, and we need them for analyzing aggregate or volume growth over longer periods.
The problem can be reproduced even with 6 months' worth of data.
The query never finished.
This is our prod instance, which is why we can't upgrade to a beta version.

#

In the last month a large number of volumes were moved to different aggregates and then back. Can this cause inconsistencies in the DB?

bronze raft
#

thanks for the additional details. It was not clear that this is a new problem and that one year queries worked in the past. Moving volumes to different aggregates will increase the number of records that Prometheus needs to scan since, from Prometheus POV, those new labels create a different time-series
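
To illustrate why the moves add work for Prometheus: a time series is identified by its complete label set, so a volume that reports a different aggregate label becomes a different series. The sketch below is a toy model, not Harvest or Prometheus code, and the label names volume and aggr are hypothetical.

```python
def series_key(labels):
    """A series is identified by its full set of label name/value pairs."""
    return frozenset(labels.items())


def count_series(label_sets):
    """Count the distinct series produced by a list of scraped label sets."""
    return len({series_key(labels) for labels in label_sets})


# One volume that never moves: every scrape carries the same labels.
stable = [{"volume": "vol1", "aggr": "aggr1"}] * 3

# The same volume moved to another aggregate and then back again.
moved = [
    {"volume": "vol1", "aggr": "aggr1"},
    {"volume": "vol1", "aggr": "aggr2"},
    {"volume": "vol1", "aggr": "aggr1"},
]

print(count_series(stable))  # 1
print(count_series(moved))   # 2
```

In this model every moved volume doubles its series count for the affected period, and a long-range query has to read all of them.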

potent flower
#

Do we have a chance to fix this?

bronze raft
#

maybe you could delete the transient moves? Assuming that you have a way of identifying those records and you're OK with losing that data. I'll try to think of other options too

#

More RAM might help too, like I mentioned earlier, if it's straightforward to add more, that would be a quick check

potent flower
#

Ok, I will try to double the RAM tomorrow and test again

#

I don't know how to identify the data, but I do know which aggregates and filer it was

bronze raft
#

that's good to know. Do you care about the fabricpool_cloud_bin_operation metric? That's the one that is taking so long and is used by the Top Volume Object Storage row in the volume dashboard. That metric is a histogram which means there are more series than usual. If you don't care about it - maybe the other queries are fast enough if we remove that one?

potent flower
#

It is always present if I try to update the Datacenter variable

bronze raft
#

if you go to Prometheus and query volume_labels does it take a long time? That query should be fast and if it isn't maybe this system is in a bad state from earlier queries. Will be good to see how long that query is now and then tomorrow after more RAM

potent flower
#

Is 100 ms fast? 😅

#

Should we try that datacenter query?

bronze raft
#

good enough! That means when you see a slow response time when updating the datacenter variable, it's probably because other variable queries are running too. You can confirm that theory by opening dev tools first and finding the datacenter query in the network tab. I suspect it will be fast there and other queries are slow

potent flower
#

Yeah, I understand. We did this yesterday already, and I sent you the HAR file from the dev tools network tab.
Which queries do the volume and aggregate dashboards have in common?

potent flower
#

@bronze raft Tried Query with 32G RAM (Doubled from 16G) and the same result:

bronze raft
potent flower
#

First tests show that everything is much faster. And it seems we no longer have timeouts.
Digging deeper.

bronze raft
#

Excellent!

potent flower
#

In Volume Dashboard

bronze raft
#

can you click on that panel and press e to edit and then paste the query from the Metrics browser?

potent flower
#

If I set 12h

#

volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}
and
topk($TopResources, avg_over_time(volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}[3h] @ end()))

bronze raft
#

ok, so it works with 12h and does not work with which range?

potent flower
#

works with 12h but only shows 1/2 an hour?

#

Ok it gets strange

potent flower
#

And with 90d, no data too

bronze raft
#

that panel was changed recently to fix https://github.com/NetApp/harvest/issues/2787. The fix uses volume_total_data instead of summing volume_write_data and volume_read_data in the panel, the way it was done previously. Summing the values in Grafana caused problems when a volume's node changed. The current fix does not have that problem

potent flower
#

As soon as I select 90d I have no data, even if I select only one filer. So it is not the query running long

bronze raft
#

only for the Top 5 Volumes by Average Throughput panel or other panels on the Volume dashboard?

potent flower
#

Other panels too. One moment

#

Same panel in the Aggregate dashboard too

bronze raft
potent flower
#

And the Aggregate dashboard shows the timeout errors that I had with the Volume dashboard.
I think we are missing Total Throughput data in the Volume dashboard too. (See screens)

#

It uses the same Metric it seems:

sum(topk($TopResources, volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}))

bronze raft
#

yes

#

i understand the issue - the fix for 2787 changed several of the dashboards to use volume_total_data for the panels, which is correct. But volume_total_data was introduced in this commit on Feb 12. That means that metric does not exist before Feb 12 and is why you have no data when selecting a larger time range

potent flower
#

Ok. Got it.
But maybe it would be better to leave both for a few months

bronze raft
#

that might work, but if we do that you will hit the problem that caused us to move to volume_total_data which was this error

potent flower
#

Oh ok

#

Will keep that in mind

bronze raft
#

let me try backing out the volume_total_data change but keep the faster query change and see if that works for you

potent flower
#

If it is extra work just for me, don't bother. We can live with that

#

Just give me the old query so that I can build my own panel if we need that old data
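
As a starting point for such a custom panel, here is a hedged sketch of an expression in the spirit of the earlier approach described in this thread (summing volume_read_data and volume_write_data). The exact pre-2787 panel query is not shown in the thread, so the label selector and the helper name legacy_total_expr are assumptions. The known caveat also applies: the sum can misbehave when a volume's node changes.

```python
# Assumption: same label selector as the current volume_total_data panels.
SELECTOR = (
    '{datacenter=~"$Datacenter",cluster=~"$Cluster",'
    'svm=~"$SVM",volume=~"$Volume"}'
)


def legacy_total_expr(selector=SELECTOR):
    """Build a PromQL expression that sums read and write throughput.

    This only approximates the older panel logic; it is not copied
    from any Harvest dashboard.
    """
    return f"volume_read_data{selector} + volume_write_data{selector}"


print(legacy_total_expr())
```

The printed string can be pasted into a new Grafana panel to chart the period before volume_total_data existed.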

bronze raft
potent flower
#

The new Vol Dashboard is really fast. What have you changed?