# Error with Volume Time Range > 6 Months (Datacenter Variable)
Hi @potent flower, can you check your JavaScript console for errors and see if there is anything useful there? The Grafana log file should have something useful too. ssh to nabox and adjust the duration (4 hours below) to include the time this error happened
dc logs grafana --since 4h
Hello @bronze raft, --since doesn't work
ah, guess that was introduced in a later version. What does dc logs --help show?
docker logs --since 4h grafana
Worked
Sent you a PM. Check whether it looks right. There is not much in this log
thanks! agreed not much there - all info level messages and nothing about timeouts or warnings. Let's check the browser's console log like the Grafana error says
Tried to export it. Sent you a PM again
are you using Chrome, Firefox or something else? Looks like timeouts, perhaps because you are requesting a year's worth of data. What if you go to the dashboard in question again, switch to the variables page again, open DevTools, switch to the network tab and refresh the variables page so the network requests happen again. I'd like to see how long they take before timing out
I am using Chrome
By default, Grafana waits 30s before timing out https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#timeout
Checking to see how to increase that
This happens even with 6 months. Sometimes I had this with 90 days too.
You mean this page?
either way, a timeout is a timeout. The network tab should show us how long these requests are taking. Are your Grafana and Prometheus geographically close to each other or far away?
Yes, the page you pasted at the top with the error. Basically I want to recreate that error with the DevTools network tab open to see how long the request takes before timing out and which part of the network connection is timing out
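Editor's sketch: Chrome DevTools can export the network tab as a HAR file, which is plain JSON, so the request durations asked about here can be listed with a few lines of Python. This is only a minimal illustration. It assumes the standard HAR layout (`log.entries[]`, each with `time` in milliseconds, `request.url` and `response.status`). The sample entries and URLs are made up for the example and are not taken from the user's export.

```python
import json


def slow_requests(har: dict, threshold_ms: float = 1000.0) -> list[tuple[float, int, str]]:
    """Return (duration_ms, status, url) for entries at or above the threshold, slowest first."""
    rows = []
    for entry in har["log"]["entries"]:
        duration_ms = entry.get("time", 0.0)  # total elapsed time of the request in ms
        if duration_ms >= threshold_ms:
            rows.append((duration_ms, entry["response"]["status"], entry["request"]["url"]))
    return sorted(rows, reverse=True)


def load_har(path: str) -> dict:
    """Load a HAR file exported from the browser's network tab."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical sample data standing in for a real export.
sample_har = {
    "log": {
        "entries": [
            {"time": 30004.2, "request": {"url": "http://nabox.example/api/v1/query"},
             "response": {"status": 504}},
            {"time": 182.5, "request": {"url": "http://nabox.example/api/v1/series"},
             "response": {"status": 200}},
        ]
    }
}

for ms, status, url in slow_requests(sample_har):
    print(f"{ms / 1000:6.1f}s  status={status}  {url}")
```

With a real export, `slow_requests(load_har("export.har"))` would show which requests sit near the 30s limit. The file name is a placeholder.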
Ok.
Grafana and Prometheus are on the same Server (Nabox)
The Filers are here in the same network too
that's good, shouldn't be much speed of light latency then
received thanks!
there are eight requests in your har file, one svg, four series queries, and three v1/query. Two of the v1 queries are timing out after 30s. We can increase Grafana's timeout so it doesn't do that. Let's see if we can get an idea what we should increase the timeout to by going to Prometheus, pasting this query, and executing to see how long it takes Prometheus to return the data. topk(5,sum by (volume)(avg_over_time(fabricpool_cloud_bin_operation{datacenter=~".*",cluster=~".*",svm=~".*",volume=~".*"}[31622400s])))
Prometheus will show how long it took to execute the query as "Load time:"
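Editor's sketch: besides reading "Load time:" in the Prometheus UI, the same measurement can be taken from a script by calling the Prometheus HTTP API endpoint `/api/v1/query` and timing the round trip. The base URL below (`http://localhost:9090`) and the generous client timeout are assumptions for illustration. On a NAbox system Prometheus may be reachable under a different address or path.

```python
import json
import time
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def time_query(base_url: str, promql: str, client_timeout_s: float = 300.0) -> tuple[float, str]:
    """Run the query and return (elapsed_seconds, api_status).

    client_timeout_s only limits how long this script waits; it does not
    change any timeout configured in Prometheus or Grafana.
    """
    url = build_query_url(base_url, promql)
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=client_timeout_s) as resp:
        body = json.load(resp)
    return time.monotonic() - start, body.get("status", "unknown")


# Example call (commented out because it needs a reachable Prometheus):
# elapsed, status = time_query("http://localhost:9090", "volume_labels")
# print(f"{status} in {elapsed:.1f}s")
```

Timing a cheap query such as `volume_labels` first gives a baseline to compare the one-year query against.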
that's good and bad news 🙂 now we know why Grafana is timing out. Did that query ever finish? I wonder if more RAM would help? Prometheus may not be up to the task of asking for a year's worth of data. Is this your production or dev instance? One idea would be to upgrade to Nabox v4 (beta). Nabox 4 uses VictoriaMetrics instead of Prometheus and my reading has been that it may be faster on long range queries
We never had problems with one-year queries, and we need them to analyze aggregate or volume growth over longer periods.
The problem can be reproduced even with 6 months' worth of data.
The Query never finished.
This is our Prod instance, and this is the reason why we can't upgrade to a beta version.
In the last month a large number of volumes were moved to different aggregates and then back. Can this cause inconsistencies in the DB?
thanks for the additional details. It was not clear that this is a new problem and that one year queries worked in the past. Moving volumes to different aggregates will increase the number of records that Prometheus needs to scan since, from Prometheus POV, those new labels create a different time-series
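Editor's sketch: the point about labels can be made concrete. Prometheus identifies a time series by its metric name plus its full label set, so the same volume reported with a different aggregate label counts as a separate series. The snippet below only models that identity rule in plain Python. The label names and values (`aggr`, `vol1`, and so on) are hypothetical and chosen for illustration.

```python
def count_series(samples: list[dict[str, str]]) -> int:
    """Count distinct series, where a series is identified by its full label set."""
    return len({frozenset(labels.items()) for labels in samples})


# One volume that stays on the same aggregate: a single series.
stable = [
    {"__name__": "volume_read_data", "volume": "vol1", "aggr": "aggr1"},
    {"__name__": "volume_read_data", "volume": "vol1", "aggr": "aggr1"},
]

# The same volume moved to another aggregate and back again.
# The move back reuses the original label set, so this yields two series.
moved = [
    {"__name__": "volume_read_data", "volume": "vol1", "aggr": "aggr1"},
    {"__name__": "volume_read_data", "volume": "vol1", "aggr": "aggr2"},
    {"__name__": "volume_read_data", "volume": "vol1", "aggr": "aggr1"},
]

print(count_series(stable), count_series(moved))
```

Under this model, moving many volumes temporarily roughly doubles the number of series a long-range query has to touch for those volumes.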
Do we have a chance to fix this?
maybe you could delete the transient moves? Assuming that you have a way of identifying those records and you're OK with losing that data. I'll try to think of other options too
More RAM might help too, like I mentioned earlier, if it's straightforward to add more, that would be a quick check
OK, I will try to double the RAM tomorrow and test again
I don't know how to identify the data, but I do know which aggregates and filer it was
that's good to know. Do you care about the fabricpool_cloud_bin_operation metric? That's the one that is taking so long and is used by the Top Volume Object Storage row in the volume dashboard. That metric is a histogram which means there are more series than usual. If you don't care about it - maybe the other queries are fast enough if we remove that one?
I would like to test that too. I need to delete this part of the dashboard, right?
If this really helps, I will ask my colleagues whether it is important.
But we have the same problems on the Aggregate dashboard too... and maybe many more that we don't know about yet?
It always happens when I try to update the Datacenter variable
if you go to Prometheus and query volume_labels does it take a long time? That query should be fast and if it isn't maybe this system is in a bad state from earlier queries. Will be good to see how long that query is now and then tomorrow after more RAM
good enough! That means when you see a slow response time when updating the datacenter variable, it's probably because other variable queries are running too. You can confirm that theory by opening dev tools first and finding the datacenter query in the network tab. I suspect it will be fast there and other queries are slow
Yeah, I understand. We did this yesterday already and I sent you the har file of the dev tools network tab.
Which queries do volume and aggregate dashboard have in common?
@bronze raft Tried Query with 32G RAM (Doubled from 16G) and the same result:
hi @potent flower today's nightly build contains the dashboard improvements we discussed last week. When you get a chance, can you give it a try and let us know how they work for you? https://github.com/NetApp/harvest/releases/tag/nightly
First tests show that everything is much faster, and it seems we no longer have timeouts.
Digging deeper.
Excellent!
In Volume Dashboard
can you click on that panel and press e to edit and then paste the query from the Metrics browser?
If i set 12h
volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}
and
topk($TopResources, avg_over_time(volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}[3h] @ end()))
ok works with 12h and does not work with?
There was no data with 6M
And with 90D, no data too
that panel was changed recently to fix https://github.com/NetApp/harvest/issues/2787. The fix uses volume_total_data instead of summing volume_write_data and volume_read_data in the panel, the way it was done previously. Summing the values in Grafana caused problems when a volume's node changed. The current fix does not have that problem
As soon as I select 90D I have no data, even if I select only one filer. So it is not the query running long
only for the Top 5 Volumes by Average Throughput panel or other panels on the Volume dashboard?
thanks. Yes, the aggregate dashboard Volume by Average Throughput panel had the same bug and fix as mentioned above in 2787. Each of the dashboards listed in https://github.com/NetApp/harvest/pull/2786/files was updated to reflect accurate throughput
And the Aggregate dashboard shows the timeout errors that I had with the Volume dashboard.
I think we are also missing Total Throughput data in the Volume dashboard. (See screens)
It uses the same Metric it seems:
sum(topk($TopResources, volume_total_data{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume"}))
yes
i understand the issue - the fix for 2787 changed several of the dashboards to use volume_total_data for the panels, which is correct. But volume_total_data was introduced in this commit on Feb 12. That means that metric does not exist before Feb 12 and is why you have no data when selecting a larger time range
Ok. Got it.
But maybe it would be better to leave both for a few months
that might work, but if we do that you will hit the problem that caused us to move to volume_total_data which was this error
let me try backing out the volume_total_data change but keep the faster query change and see if that works for you
If it is extra work just for me, don't bother. We will have to live with that
Just give me the old query so that I can build my own panel if we need that old data
give this a try https://gist.github.com/cgrinds/6c2c14163fab27676cd2ea0345d955e3
The new Vol Dashboard is really fast. What have you changed?