Hello, just installed nabox4 and wanted to add storagegrid. This did not work because we're running it on a different port behind our loadbalancer. Is there a current way to support this situation ?
I've tried adding :xxx with portnumber behind the server but this is not allowed.
It would be nice to be able to change the portnumber (and also http/https options. we have https and a different port but I could imagine a situation that someone is not using https).
Is this something that can be considered as feature for a next update ?
#different port
@fierce ridge is this NABox related?
Very likely 😄
That said, https is pretty standard
Shame, I just released 4.0.8 😄
Allowing ":" for port number is a quick change. But it's still going to speak https
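A minimal sketch of what accepting `host:port` while still always speaking https could look like. This is not NAbox code: the function name `https_base_url`, the default port of 443 and the example host are assumptions made up for illustration.

```python
from urllib.parse import urlsplit


def https_base_url(address: str, default_port: int = 443) -> str:
    """Turn 'host' or 'host:port' into an https base URL (hypothetical helper)."""
    # Prefixing '//' makes urlsplit treat the input as a network location.
    parts = urlsplit("//" + address.strip())
    if not parts.hostname:
        raise ValueError(f"no host in {address!r}")
    # .port raises ValueError when the port is not a number or is out of range.
    port = parts.port or default_port
    # IPv6 literals need their brackets back when rebuilding the URL.
    host = f"[{parts.hostname}]" if ":" in parts.hostname else parts.hostname
    return f"https://{host}:{port}"


with_port = https_base_url("sg.example.com:8443")
without_port = https_base_url("sg.example.com")
```

The scheme is fixed on purpose, matching the remark that the connection will still use https whatever port is entered.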
https is fine
Man, you're blazing fast !
😆
Thanks ! Downloading as we speak.
We're still migrating from NAbox 3 to 4, which takes some time because there's around 1 TB of data. So I will have to wait until that finishes before installing the patch.
That sounds wise !
I wish I had added some CPUs to the NABox4 instance. 2 isn't enough. (NABox3 was running 4 vCPU and NABox4 is pinned at 100% on 2 vCPU during migration.) But we're halfway there. Doubling and restarting would take the same time as keeping it running, I think.
Probably, I should make a note of it. But that would also require modifying the args for the VM migration; it has arguments for the number of threads.
Ah. Ok. I assumed 2 threads was because of the 2 vCPU on destination. Well it's just a one-time deal.
@fierce ridge To confirm. The update works like a charm. I have the storagegrid added to the NABox. Now we have to figure out if we can get growth per tenant out of some graph.
There is a panel in the Tenants dashboard, "Tenants by Capacity", which may help.
There is no such panel in my Tenants dashboard. (just Tenant Quota and Buckets) However in the Overview dashboard I can see tenants by logical space used and quota used.
Do you see the panel below in the Tenants dashboard?
This and the logical space used per tenant in the Overview dashboard should be the same.
You can use either of these.
Could you share Harvest version?
These panels were only added in the latest release.
Could you upgrade to 24.11.1?
The default that came with nabox4. 24.08.0. I will upgrade.
Ok. That did it. I see that now.
Great!
Unfortunately the Tenant dashboard is empty when I checked today. The regular dashboard is showing data.
Okay. Could you share Harvest logs @ https://upload.nabox.org/daxe-majo-myje
Ok. Uploaded.
Okay. Tenant collection failed during init itself
Could you try restarting Harvest container and see if it fixes the issue.
ok. after docker restart the graph has entries again.
this ranks very high on my weird-shit-o-meter ... nabox3 shows like something stopped november...
when I import the same (default volume dashboard) from nabox3 into nabox4 .... the graph picks up where the nabox3 one drops off ?
The data from nabox3 was migrated to that nabox4 ....
hi @upper sundial what versions of Harvest are you running in nabox3 and nabox4? Recent versions of Harvest do not include a panel with that title in the volume dashboard. I suspect what you're observing is a difference in how topK is displayed. When you say, "the graph picks up where the nabox3 one drops off", are you referring to the graphs between the two red lines here? Are the yellow and orange lines for the same volume or different ones?
nabox3 has harvest 23.05.0-1 and the nabox4 has 24.11.1 The volume dashboard from nabox3 was exported and imported in nabox4 to have the exact same graphs. This is a selection of one volume. So yes, the green and yellow lines are the same volume as is the orange and yellow or green and orange on the nabox4 graph.
(The migration was done this year, so all data from last year was collected on Harvest 23/nabox3 and migrated to nabox4. I could imagine the difference showing up around the new year, going from old to new Harvest, but it's all in 2024.)
maybe a clue... Oct 27th we upgraded ONTAP on the cluster, and maybe the other change was a previous update.
ok, March 31st was also an ONTAP upgrade.
maybe. Did you create that panel yourself? Looking at Harvest version 23.05.0-1, the dashboard panel you shared above is not from that version. Version 22.11.0 is the last version of Harvest that included that panel title https://github.com/NetApp/harvest/blob/release/22.11.0/grafana/dashboards/cmode/volume.json#L460
In 23.02.0 we changed all titles with topk to indicate that in their titles
https://github.com/NetApp/harvest/blob/release/23.02.0/grafana/dashboards/cmode/volume.json#L461
The panel was not created myself. It was a default panel. Maybe it stayed after upgrading harvest ? Still the question is why this behaviour ? I could imagine that a volume would get a different UUID and drop out of the graph but not that it would come back after some time or after changing harvest/nabox.
@upper sundial Could you share the Grafana versions from NABox3 and NABox4? I suspect that the differences you are observing are due to the large time range selection and the way Grafana is selecting points to plot. Let's try zooming into one week of the month and then compare the results. Do they look similar?
month on nabox3
month on nabox4. it misses a part where harvest was not running due to some init error. restarting container fixed that
This data looks similar, right?
grafana 9.5.14 on nabox3 and grafana 11.4.0 on nabox4. Yes this looks similar so I guess you're on to something.
The specific pattern on that volume is a 2 hour spike in traffic for an hour or so. This is creating this graph.
We have observed instances where selecting larger time ranges in Grafana may not display all spikes, if any. In this case, we are comparing two different Grafana versions. It is possible that the queries sent from Grafana differ slightly between the versions. You can verify this by checking the network tools in your browser.
(2 hour I mean, every 2 hour the traffic/throughput/latency goes up for around an hour. looks like the specific virtual machine is doing some task scheduled every 2 hours)
well that does not explain why an ontap upgrade changes it I think ?
That is during the same grafana if you look at the nabox graph.
The changes on 31st March and 27th October match ONTAP upgrades. So it looks like that triggered a change.
I don't believe the ONTAP upgrade is causing the difference here. Let's zoom into the relevant time period you mentioned, focusing on approximately 15 days of data. I expect the data should be the same
Looking at the month graph the data is indeed the same. However it's really a shame that the year graph is not helping. What I wanted to find out is since when this traffic behaviour has been going on. It was longer than a month. Going further back using the year graph it is not really helping. It looked like the traffic suddenly stopped and suddenly started.
The longer period is normally causing data becoming more spread out, or flatter due to averaging. This is not that.
I believe it's not even averaging. Grafana selects a single point from each step, and the step size changes as you adjust the time range. If you enable the Prometheus query log, you will see the step size change as you modify the time range in Grafana. With a step of n, one point is picked from each interval of n. Even the size of the panel matters: a larger panel lets Grafana display more data points, which can also change the appearance of the graph.
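A small simulation of the point-picking effect described above, as a sketch under stated assumptions: the signal (20 ms for the first hour of every 2-hour cycle, otherwise 1 ms) and the function names are made up to mimic the traffic pattern in this thread, and the sampler only imitates a range query that returns one raw value per step.

```python
def latency_ms(minute: int) -> float:
    """Hypothetical signal: 20 ms for the first hour of every 2-hour cycle, else 1 ms."""
    return 20.0 if minute % 120 < 60 else 1.0


def sample_points(step_min: int, offset_min: int, total_min: int) -> list[float]:
    """Pick one raw value per step, with no averaging."""
    return [latency_ms(t) for t in range(offset_min, total_min, step_min)]


def average_windows(step_min: int, offset_min: int, total_min: int) -> list[float]:
    """Average the step_min minutes that end at each evaluation time."""
    out = []
    for t in range(offset_min + step_min, total_min, step_min):
        window = [latency_ms(m) for m in range(t - step_min, t)]
        out.append(sum(window) / len(window))
    return out


WEEK = 7 * 24 * 60  # minutes
STEP = 6 * 60       # a 6-hour step, an exact multiple of the 2-hour cycle

always_high = set(sample_points(STEP, 30, WEEK))   # every sample lands inside a spike
always_low = set(sample_points(STEP, 90, WEEK))    # every sample lands between spikes
averaged = set(average_windows(STEP, 30, WEEK))    # the mean of three full cycles
```

Because the step is a multiple of the cycle length, every sample hits the same phase of the cycle, so the graph looks permanently high or permanently low depending only on where the sampling grid happens to start. Averaging over the step gives the same value regardless of the starting offset.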
The longer the time range, the more problematic the data representation can become. It might be better for the panel to perform a different type of query that averages the data for longer duration graphs, providing more of a summary.
Yes. That explains why it's doing this. It would make more sense if it was averaging. That graph would be more helpful. We have the specific behaviour in the vm that is not working well if you pick the wrong timestamps. A regular graph would not make a difference but this specific behaviour is now messing the graphs up.
Could you share the Prometheus query for this panel?
topk($TopResources, volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$TopVolumeAvgReadLatency"})
Expr: topk(5, volume_avg_latency{datacenter=~"PDC2",cluster=~"prev-aff-cl1-pdc2",svm=~"svm-previder-02",volume=~"cl2_winvol1502"})
Step: 6h0m0s
Thanks.
Expr: topk(5, volume_avg_latency{datacenter=~"PDC2",cluster=~"prev-aff-cl1-pdc2",svm=~"svm-previder-02",volume=~"cl2_winvol1502"})
Step: 12h0m0s
looks like the newer query is doing 12 hour step instead of 6
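Rough arithmetic on the two logged steps, assuming a one-year range of 365 days and one evaluation at the start plus one per full step after it. The 2-hour period is the traffic cycle described earlier in the thread; the function name is made up for illustration.

```python
from datetime import timedelta


def points_per_series(time_range: timedelta, step: timedelta) -> int:
    """Number of evaluation timestamps: start, start+step, ... up to the end."""
    return time_range // step + 1


YEAR = timedelta(days=365)
CYCLE = timedelta(hours=2)  # the spike repeats every 2 hours

points_6h = points_per_series(YEAR, timedelta(hours=6))
points_12h = points_per_series(YEAR, timedelta(hours=12))

# Both steps are whole multiples of the 2-hour cycle.
cycles_per_6h_step = timedelta(hours=6) / CYCLE
cycles_per_12h_step = timedelta(hours=12) / CYCLE
```

Going from a 6h to a 12h step halves the number of plotted points per series, and in both cases each step spans a whole number of traffic cycles.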
For what I'm looking for, some kind of averaging would be better than randomly picking a measuring point. With gradually changing values, randomly picking a point is not an issue. With more dynamic values it can really wrong-foot you.
Yes, we have discussed this in the past. Given that this value has to be static for averaging in the query, it may not be suitable for shorter time ranges. For shorter time ranges, we want to capture peaks accurately, as averaging them out would miss those peaks. However, for longer durations, we need smoothing to get a more meaningful summary.
For example, the query below may perform better for queries spanning months, but it will not be as effective for queries covering just one day or a few hours.
avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[3h])
and on(datacenter, cluster, svm, volume)
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[3h]) * on(datacenter, cluster, svm, volume) group_left() volume_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", tags=~".*$Tag.*"})
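To illustrate why a fixed 3-hour average is a poor fit for short time ranges, here is a sketch with made-up numbers: one day of 1-minute samples at a 1 ms baseline with a single 5-minute spike to 100 ms. The helper `rolling_average` is a plain trailing mean written for this example, not the Prometheus implementation of `avg_over_time`.

```python
def rolling_average(values: list[float], window: int) -> list[float]:
    """Trailing mean over `window` samples; uses fewer samples at the very start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical day of 1-minute samples with one short spike.
samples = [1.0] * (24 * 60)
for minute in range(600, 605):
    samples[minute] = 100.0

smoothed = rolling_average(samples, window=180)  # 180 minutes = 3 hours

raw_peak = max(samples)
smoothed_peak = max(smoothed)
```

The 100 ms peak shrinks to under 4 ms once it is averaged over three hours, which is the trade-off mentioned above: smoothing helps a one-year view but hides exactly the peaks a one-day view is meant to show.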
query is not working unfortunately
What is the error?
We can also try using the $__interval variable, which adjusts its value as the time range changes. Could you try the following query to see if it works for both shorter and longer time ranges?
avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[$__interval])
and on(datacenter, cluster, svm, volume)
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[$__interval]) * on(datacenter, cluster, svm, volume) group_left() volume_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", tags=~".*$Tag.*"})
Let's try with the query you have shared. I have modified it as below
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$TopVolumeAvgReadLatency"}[$__interval]))
Yes, I suspect the interval value is quite large for a time range of one year. When we experimented with this previously, the results appeared unusual.
As a result, we have left it to Grafana to automatically select points for the time series, rather than manually guiding it to pick specific points. If we hardcode the value to, say, 3 hours, the graph will probably be fine for a year's worth of data but not for shorter time ranges.
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$TopVolumeAvgReadLatency"}[3h]))
I think the selected volume is not used in the last query and not acting on the relevant values ?
is denied by the document's Content Security Policy.
got error while trying to save copy of dashboard. it does save it but when editing panel it again shows error.
That is strange. I suspect it's some other issue. Maybe try reverting to an older version of this dashboard or deleting and re-importing it.
yeah. something is broken, cannot edit the original volume panel either without getting the error.
Okay. Better to delete and reimport.
Deleted the copy dashboards. Seems to be working again.
When looking at the volume dashboard from harvest 24.11.1 and having 1 year as time it gives an error saying duplicate time series.
When selecting a smaller time window it is ok.
30 days for example works
even 90 days is fine
OK, it is possible that the aggregate has changed for the selected volume in the last year, leading to duplicate results.
Could you update the query to the one below? Does it work for a 1-year range?
volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
and on(datacenter, cluster, svm, aggr, volume)
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}[3h]) * on(datacenter, cluster, svm, aggr, volume) group_left() volume_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", tags=~".*$Tag.*"})
Cool, we'll fix this.
However I'm unable to save a copy dashboard.
I think that’s not allowed in NABox4. @fierce ridge
have created a different folder to avoid contaminating the original ones but still this error.
That’s the error in dashboard created in new folder?
yes, when trying to create copy of the volume dashboard in the new folder
looks like some permission related issue.
Do you see any error related to this in grafana logs?
After ssh to nabox machine, docker logs grafana?
I have opened an issue for the duplicate time series here: https://github.com/NetApp/harvest/issues/3416
still when looking again there is a lower portion of months. so partly fixed. it has data now but the original issue still there.
Okay do you mean that it is missing data for some months?
nabox3 shows this. high on the part where nabox4 shows low
Okay got it. Yeah that is due to the time range issue we discussed above.
Querying on a downsampled dataset may give better results. However, we first need to downsample the dataset before we can perform such queries.
I think you should be able to save as
Nope, save as copy is giving the error.
but it does create something because with the same name it says that the dashboard exists.
Have to delete it to avoid issues.
If I don't delete them I get the same error when opening a different dashboard.
not finding anything related I think.
Could you please share full grafana logs @ https://upload.nabox.org/gury-yuku-qapa
you mean this ? Yes. Uploaded it. (fyi my ip ends at .58, I saw a .104 connecting but that's a colleague)
I get the operation is insecure error as well. But the dashboard is saved, don't try to save it in Harvest folder though I guess. I did a different folder
Looks like an issue with the new Content Security Policies
Ok, issue fixed, let's get you a build
https://upload.nabox.org/xepy-xexa-zazu Can you update to this ?
I'd appreciate it if you could upgrade to the version I sent earlier for the SG port, validate it, and then upgrade to this one for the dashboard thing.
updated but the volume year graphs show no data. the 90 days do show statistics. Do I have to manually restart nabox ? Or would the upgrade restart all necessary services ?
I can confirm that save a copy of dashboard now works without error.
What is the error if you hover over or click on these red marks?
duplicate time series
Okay. I'll share fix for this shortly.
Could you try importing json from this link and see if it fixes the problem?
https://raw.githubusercontent.com/NetApp/harvest/8d09c2586d4265f0445cc1501fc029fe0f60e295/grafana/dashboards/cmode/volume.json
not working
Okay. I see that node is different which is causing duplicate time series.
We do regularly move volumes. Either to free up space on that aggregate or because we're replacing the HA-pair because the support contract is ending. So that could be the case.
Understood. Yeah, we need to handle that in the dashboard.
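A hedged illustration of how a moved volume ends up as more than one series for the same volume. It only groups label sets by the matching labels used in the queries in this thread and reports groups with more than one member; it is not the Prometheus matching algorithm or the Harvest fix, and all label values are made up.

```python
from collections import defaultdict

# Labels the dashboard queries match on.
MATCH_LABELS = ("datacenter", "cluster", "svm", "volume")


def groups_with_multiple_series(series, match_labels=MATCH_LABELS):
    """Return match groups that contain more than one series."""
    groups = defaultdict(list)
    for labels in series:
        key = tuple(labels.get(name, "") for name in match_labels)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical series seen over one year: vol1 was moved from node-a to node-b.
series_over_a_year = [
    {"datacenter": "dc1", "cluster": "cl1", "svm": "svm1", "volume": "vol1",
     "node": "node-a", "aggr": "aggr1"},
    {"datacenter": "dc1", "cluster": "cl1", "svm": "svm1", "volume": "vol1",
     "node": "node-b", "aggr": "aggr2"},
    {"datacenter": "dc1", "cluster": "cl1", "svm": "svm1", "volume": "vol2",
     "node": "node-a", "aggr": "aggr1"},
]

collisions = groups_with_multiple_series(series_over_a_year)
```

Over 30 or 90 days the move may fall outside the window, so only one series per volume exists; over a year both the pre-move and post-move series are in range and share the same matching labels.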
Let's focus on this Average Latency panel and try to fix its query. Could you try the query below and see if it works for this panel?
volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",style!="flexgroup_constituent"}
and on(datacenter,cluster,svm,volume)
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",style!="flexgroup_constituent"}[3h] @ end())
and on(datacenter,cluster,svm,volume)
volume_labels{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",tags=~".*$Tag.*"})
yes. that works
cool. I'll fix for remaining panels and give you updated dashboard.
Could you import dashboard from here and see if works?
it seems to work partly. the middle graph is showing data for past year but left and right graphs just start from the moment nabox4 was the harvester it seems.
when specifically selecting the svm and volume there is the remarkable drop we have discussed earlier in the middle graph and outer graphs are missing data.
Could you edit the average latency panel and then inspect -> query as shown in the screenshot? Then, could you share the query that was executed?
Does this missing data occur in the volume latency panel when SVM and volume are selected, or does it happen with the 'All' filter as well?
in both cases
Okay, let's take the Volume Average Latency panel again and try the query below to see if it shows all 1-year data correctly without any duplicate series errors for the panel.
volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",style!="flexgroup_constituent"}
and on (datacenter, cluster, svm, volume)
topk($TopResources, avg_over_time(volume_avg_latency{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",style!="flexgroup_constituent"}[3h] @ end()) * on (datacenter, cluster, svm, volume) group_left(node) volume_labels{datacenter=~"$Datacenter",cluster=~"$Cluster",svm=~"$SVM",volume=~"$Volume",tags=~".*$Tag.*"})
Okay so data is back but step size issue exists.
yes.
Okay we don't have a good way yet to handle step size issue.
I have made similar changes to the other panels for the duplicate time series issue. Could you import the dashboard from here and check?
https://raw.githubusercontent.com/NetApp/harvest/4377522a48833b7bf608978c8b4891d8f24f3c7b/grafana/dashboards/cmode/volume.json
the 1 year for all has data now on the 3 panels.
The middle top panel is still having an issue
there is the duplicate time series error
Could you check if below query works for middle top panel
sum(
  topk(
    $TopResources,
    volume_read_data{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
    * on(datacenter, cluster, svm, volume) group_left(node)
    volume_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", tags=~".*$Tag.*"}
  )
)
+
sum(
  topk(
    $TopResources,
    volume_write_data{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", style!="flexgroup_constituent"}
    * on(datacenter, cluster, svm, volume) group_left(node)
    volume_labels{datacenter=~"$Datacenter", cluster=~"$Cluster", svm=~"$SVM", volume=~"$Volume", tags=~".*$Tag.*"}
  )
)
yes
Thanks for confirming!
@upper sundial I'll share a volume dashboard with you tomorrow, designed to work over longer time ranges where the data didn't make sense in the Grafana panels.
@upper sundial Could you please send us a test email at ng-harvest-files@netapp.com? We would like to share a new dashboard with you for testing via email. This new dashboard handles long time ranges by presenting a summary of the data over time.
@upper sundial Gentle reminder.
Sorry, been away from discord for a while. Will send a mail.
Thanks. I have shared updated Volume dashboard via email for your feedback.
Hi Yann. Just discovered that the new 4.0.9 broke the port fix. Can you provide an alternative that supports a different portnumber again ?
@fierce ridge