I noticed I wasn't able to load any data in Grafana. When I logged in to NAbox, I saw that disk capacity was at 95% used. I increased the disk size and issued a reboot via NAbox >> Maintenance. The system is back up and I see new data in the recent timeline, but nothing historical. Disk usage is at 57% after the increase, so that suggests the data is still there. How do I fix this?
#Lost historical data after disk capacity increase
@quaint pewter
OOH! I think I lost historical data after a CPU/RAM increase, I kinda just shrugged it off because I was dealing with an upgrade, adding new collectors, and a few other support things, so 'functional' was the requirement and I left the historical thing alone...
Well, you shouldn't lose historical data after a CPU or RAM upgrade. Maybe we should investigate this.
I would love to, can I provide some info to go off of?
If you simply reboot, do you lose anything ?
I've run the dc restart harvest command a few times today. Do you want me to reboot the whole appliance? I would lose a little of the monitoring for that period, but it wouldn't be an issue.
That was supposed to be a reply... 😉
Either a reboot or restart Victoria metrics service.
Oh, that’s v3 sorry
Restart Prometheus then. But migrate to v4 too 🙂
I piggybacked on @livid minnow's thread, sorry about that. I have v4
Ok to just use this restart?
lol, ok. Yes restart here
After the reboot, and refreshing the page, I'm still not seeing historical data for volumes. You recently helped me get the REST API active, but I'm not sure if that's the right timeline for it.
But did you have more data before the reboot ?
It looks like it took some time to show historical data.
I did reboot at that time, but it took about a couple of days to show all the data.
Yeah, we've been running it for a long time. Looking again, the timeline coincides with you helping me to get REST API active, not with the CPU/RAM change... They just happened close together.
but start showing the recent states right away
what would be the best way to HA this
@quaint pewter it appears to only be certain metrics... the Node CPU metric seems to have data back further, but the volume data doesn't. Perhaps enabling REST changed the collector and now it is not looking at the ZAPI data from before?
I can clone this and keep it running just in case but not sure how much stress it will put on the filers
agree. this is in my list to do. I should do that now
wondering if I should migrate to v4 before going over enabling REST API?
The migration was pretty painless for me! I had a disk capacity issue, but once we resolved that, the whole documented process was well written out.
Now I wonder. What if you select a time range before the 10/11
huh, that works. Makes it tough for correlation of data though...
@pearl field changing to rest would change the label and break continuity ?
Is there a way to stick these together to prevent the disconnect for prettier graphing? If not, I can just stitch two graphs together, at least I have the data. but would be nice to see if it would do that automatically somehow
Node>CPU seems to just view them as different nodes, but shows them in the same graph with just different colored lines
The upgrade requires at least 3.5.2. Does that mean I am two versions behind?
nabox:/$ cat /etc/nabox-release
3.5
Yes indeed !
Yeah I think @pearl field will be able to explain, unfortunately I don’t see an easy fix. Let’s see.
This scenario likely involves a change in the poller port. There are two cases to consider:
- Panels without Join Queries:
  - You will observe a change in color as the port changes over a time range. All data will be available.
- Panels with Join Queries:
  - All of our topk panel queries use join queries. In this case, you will notice a loss of data. This happens because the joined results attempt to match with the latest port. If older data has a different port, those results will be excluded.
Possible Reasons for Port Change:
- The poller was deleted and re-added.
- During migration, the pre- and post-migration ports are different. @quaint pewter to confirm.
I did document this issue https://github.com/NetApp/harvest/issues/2782 under Case 3: Poller Port Change. At that time, we decided not to make any changes in this area. For more details, refer to the discussion here.
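The join behaviour described above can be illustrated with a small Python model. This is only a sketch of the mechanism, not Harvest's or Grafana's actual query code: the `inner_join` helper, the sample values, and the `host:port` form of the `instance` label are all hypothetical.

```python
def inner_join(left, right, on):
    """Keep left samples whose values for the `on` labels also appear in right."""
    right_keys = {tuple(labels[name] for name in on) for labels in right}
    return [
        (labels, value)
        for labels, value in left
        if tuple(labels[name] for name in on) in right_keys
    ]

# Hypothetical samples: the same cluster exported on two different ports over time.
history = [
    ({"cluster": "c1", "instance": "host:12993"}, 10.0),  # before the port change
    ({"cluster": "c1", "instance": "host:12004"}, 12.0),  # after the port change
]
# The right-hand side of the join only knows the latest port.
latest = [{"cluster": "c1", "instance": "host:12004"}]

# Matching on the port-carrying label drops the older sample.
joined = inner_join(history, latest, on=("cluster", "instance"))
# Matching on the cluster alone keeps both samples.
by_cluster_only = inner_join(history, latest, on=("cluster",))

print(len(joined), len(by_cluster_only))
```

In this model the older sample disappears only when the port-carrying label takes part in the match, which is the data loss seen in the joined panels.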
In theory, the migration tool asks for the same port when creating the cluster. If the systems were added manually that would probably break continuity.
I need to check that modifying a cluster tries to keep the same port but I think it does.
So the switch from ZAPI to REST poller means that I have to search those specific timelines to see the different data?
@abstract lodge Switching between ZAPI and REST will not cause this issue. It is related to the port value in harvest.yaml for a poller. If that changes, the problem you mentioned will occur.
Do you remember what happened at that point in time ? Migration or change to REST by modifying the system ? If you re-created the system that would cause this
I don't think I ever recreated the system... and it's all of them, I definitely didn't recreate all of them... A few times I recreated the username/role in ONTAP, but I never deleted the entry from NABox...
The only thing I can think of is that might have been when I manually edited the order of pollers in the harvest.yaml file? I changed all the arrays and the default from
```
- restperf
- zapi
- zapiperf
```
to
```
- rest
- zapi
- zapiperf
- restperf
```
but I didn't modify any other portion of the yaml file
I had the same issue as Ryan after migrating to v4. I switched the 'Preferred ONTAP API' method to 'REST' on the migrated systems (BEFORE the migration had finished transferring data, and also switched the user to a new one with updated ONTAP permissions) and could only see new data. I tried Yann's suggestion, and changed the date to an older point in time to be able to view the migrated 'ZAPI' data.
No fix other than just adjusting your view to get older data?
Correct, that's all I have found so far. We can live with that in our environment.
@abstract lodge We can check whether any renaming has occurred within the data. Could you please run the following query in Nabox Victoriametrics and share the output? You can email that to us @ ng-harvest-files@netapp.com
sum(sum_over_time(aggr_space_physical_used[1y])) by (datacenter, cluster, instance)
Rahul, I've never used VictoriaMetrics. How do I go about running this query? Do I start in Grafana, or from the NABox splash screen?
Hi @abstract lodge try this https://$ip/vm/vmui/ replacing $ip with your nabox ip and then paste the query in the query box. Execute Query and then switch to table in the results
thanks @abstract lodge received
Hey Chris, I didn't even realize the VMUI was on this NABox installation. That's phenomenally helpful, but now I don't know where to start in finding out how to write my queries against the DB to get the info I want. Is there some sort of documentation on NABox specific Victoriametrics queries that I can use, similar to the aggr_space_physical_used, etc... I wouldn't even know where to begin on guessing what I want. Nothing seems to show up related to volume when I use the autocomplete option, so I must not be putting it in correctly...
@abstract lodge, the Prom query that Rahul asked you to run shows what we suspected, which is that your pollers are using different prom ports now. That's what's causing the problem. Nabox manages those ports. @quaint pewter do you know why the ports changed? Here's a screenshot, the boxes are around clusters with the same name
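The same check can be done on the query output without reading a screenshot. The following is a minimal sketch, assuming the result has been exported in the Prometheus-style JSON shape (a list of series, each with a `metric` label map). The sample data, the datacenter name, and the `host:port` form of the `instance` label are hypothetical.

```python
from collections import defaultdict

def instances_per_cluster(result):
    """Group the distinct `instance` labels seen for each (datacenter, cluster)."""
    seen = defaultdict(set)
    for series in result:
        metric = series["metric"]
        key = (metric.get("datacenter"), metric.get("cluster"))
        seen[key].add(metric.get("instance"))
    return seen

def clusters_with_changed_instance(result):
    """Return clusters whose data is split across more than one instance label."""
    return {
        key: sorted(instances)
        for key, instances in instances_per_cluster(result).items()
        if len(instances) > 1
    }

# Hypothetical sample shaped like a Prometheus-style `data.result` list.
sample = [
    {"metric": {"datacenter": "dc1", "cluster": "c1", "instance": "host:12993"}, "value": [0, "1"]},
    {"metric": {"datacenter": "dc1", "cluster": "c1", "instance": "host:12004"}, "value": [0, "2"]},
    {"metric": {"datacenter": "dc1", "cluster": "h1", "instance": "host:12003"}, "value": [0, "3"]},
]

changed = clusters_with_changed_instance(sample)
print(changed)
```

Any cluster that appears in `changed` has its history stored under more than one `instance` value, which matches the port-change pattern discussed here.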
That's what I assumed the issue was... It's not the worst thing in the world. sort of a "would be nice" situation. If it helps for future issues, I don't mind working with you to resolve it. If it is something that you don't want to put time into, I can just manually select the time as necessary to see the historical data.
Thanks, I appreciate the willingness to help troubleshoot! I'd like to figure out why it happened so we can improve the situation for folks that haven't upgraded yet and because I don't like mysteries, at least not this kind 😃 I believe it's an Naboxism, which is why I've asked Yann to take a look. Let's see what he says about those port groupings.
When no port is provided, NAbox 4 starts from 12000 and continues until it finds an available port. High ports like 12993 are definitely not expected.
Oh... NAbox 3 actually started with 12990
That would happen if migration happened on NAbox < 4.0.3
Or if the systems were manually created in NAbox 4 when deploying, instead of using the migrate tool to create
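The allocation rule described above (start at a base port and take the first one that is free) can be sketched as follows. This is not NAbox's actual code: the function names are hypothetical, "available" is modelled simply as "not yet assigned", and the poller ordering is an assumption chosen to reproduce the ports reported in this thread.

```python
def next_free_port(start, used):
    """Return the first port >= start that is not already in `used`."""
    port = start
    while port in used:
        port += 1
    return port

def assign_ports(pollers, start):
    """Assign each poller the first free port at or above `start`, in order."""
    assigned = {}
    for name in pollers:
        assigned[name] = next_free_port(start, set(assigned.values()))
    return assigned

# NAbox 4 style: counting up from 12000 (ordering is an assumption).
v4_style = assign_ports(["unix", "s1", "i1", "h1", "c1"], start=12000)
# NAbox 3 style: counting up from 12990 (ordering is an assumption).
v3_style = assign_ports(["h1", "s1", "i1", "c1"], start=12990)

print(v4_style)
print(v3_style)
```

Under these assumptions, re-creating pollers with a different base port is enough to give every cluster a new port, even though nothing else about the cluster changed.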
I used the migration tool, you helped me find a space issue for that migration. It shouldn't have recreated anything manually...
Do you remember which NAbox 4 release you were on when you migrated?
I don't believe I've upgraded since then, so that would make it 4.0.5, but I'm looking back to verify that for sure
It was either 4.0.4 or 4.0.5.
migration happened on 10/10 ?
No no, NABox 3>4 upgrade was August 29th, completed on the 30th
Enabling REST API was around 10/10, started the conversation around 9/28, and on 10/10 we dug into an issue with the users created for API, and I had to delete and recreate those users (though I stuck with the same username and password and only rebuilt them on the NetApp arrays)
@abstract lodge Let's try to get the time when this has happened. Could you share result of below query to us via email @ ng-harvest-files@netapp.com
max_over_time(
timestamp(
sum(aggr_space_physical_used) by (datacenter, cluster, instance)
)[1y:5m])
It errors out with a "too many points for the given start" message with 1y:5m, but if I change that to 49w:4m it works. I'm emailing that result now
Thanks for the email Ryan. I've obfuscated your cluster names. Using the data from your screenshot, we can see that cluster c1 was last using port 12993 on 2024-10-10T16:00-04:00 and now it uses port 12004. There is a similar pattern for the other pollers too. I'd still like to understand why this happened, could you upload an NABox support bundle https://nabox.org/documentation/troubleshooting/ to https://upload.nabox.org/fuwo-ruxy-wypa so we can try to figure out which version you had before upgrading?
One way to work around this issue would be to change your poller ports back to what they were before the upgrade. That would create a gap of 19 days of data between now and Oct 10, but after making the change, you would see the historical data prior to Oct 10 in your graphs. Let us know if you want to do that and I can tell you what port changes need to be made.
Thanks Chris, I've been working a lot with you, so I have lots of nabox logs recently, I've just uploaded all of them to the link provided
thanks!
I think the 19 day gap would be preferable at this point, so if that's something we can do, that'd be amazing!
Give this a try.
ssh into nabox by running ssh admin@$ip.
Once you auth, run:
sudo vim /etc/nabox/harvest/harvest.yml
Change the port: for each poller using this mapping:
cluster  before port  after port
-------  -----------  ----------
c1       12004        12993
h1       12003        12990
i1       12002        12992
s1       12001        12991
After making these changes, save them
Restart everything via dc restart
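For anyone who would rather script the edit than make it by hand in vim, here is a minimal sketch in Python. It assumes each poller's port sits on its own `port: N` line, and the sample fragment is a hypothetical layout, not a copy of a real NAbox harvest.yml. Back up the file before trying anything like this.

```python
import re

# Mapping from the table above: current port -> port to restore.
PORT_MAP = {12004: 12993, 12003: 12990, 12002: 12992, 12001: 12991}

def remap_ports(yaml_text, port_map):
    """Rewrite `port: N` lines in a single pass so no value is remapped twice."""
    pattern = re.compile(r"^([ \t]*port:[ \t]*)(\d+)[ \t]*$", re.MULTILINE)

    def swap(match):
        old = int(match.group(2))
        # Ports that are not in the mapping are left unchanged.
        return f"{match.group(1)}{port_map.get(old, old)}"

    return pattern.sub(swap, yaml_text)

# Hypothetical fragment; the real file contains more keys per poller.
SAMPLE = """\
Pollers:
  c1:
    port: 12004
  s1:
    port: 12001
  unix:
    port: 12000
"""

updated = remap_ports(SAMPLE, PORT_MAP)
print(updated)
```

The unix poller's port (12000) is not in the mapping, so it is left alone. As with the manual steps, a restart is still needed after the file changes.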
Shot in the dark. Can you try :
journalctl -u nabox --since "2024-10-10 00:00:00" --until "2024-10-11 00:00:00"
do this before or after @winter depot's commands? and is this in SSH or is it a VM command?
(looks like ssh...)
no entries
Are we trying to determine when the change happened?
that's easy, it's visible in the graphs
I was trying to get all the logs from that day
@quaint pewter did you check the support bundles that @abstract lodge uploaded?
If they're in the diag collection, I have some from that day that I've uploaded... there was one from 10/10 @ 2:50pm ish
Right I didn't 🙂 I would have been there. I'll take a look though
oh we have logs from back then
ok made changes, saved, and restarted
There's also one from 9/9, and I think I uploaded some from when I was having issues with the migration, but I've deleted those on my end, so unless you keep a copy, they're gone.
We have it, but I don't see any PATCH requests
Thanks for providing the log files Ryan. Your harvest.yml is identical between 10-09 and 10-10. 10-16 is when I see the ports change in your harvest.yml file. The file was changed Oct 16 14:36, of course, it may have changed anytime in between those support bundles. On 10-10 all pollers, except for the unix poller, are running on promPort 129.., the unix poller is using promPort 12000. As you mentioned, each poller's list of collectors changed in the 10-16 harvest.yml. Previously they were [Zapi, ZapiPerf, Rest, Ems], then on 10-16 they changed to [Rest, RestPerf, Zapi, ZapiPerf].
You did not make those edits by hand, I assume? Did you set up email or make any other nabox changes during that time-frame?
The 10-10 logs include a bunch of 401 Unauthorized access log messages coming from the Zapi:Sensor plugin. I'm guessing REST was not enabled for your clusters at that point? Either way, I see some log improvements we will make.
Looks like the kernel's out of memory killer was getting a good workout, killing pollers and victoria-metrics. Looks like your nabox instance had 3.79 GB of memory. Ah, looks like you upgraded it to 11.6 GB according to the 10-22 logs. Nice. I also noticed you have enabled ZapiPerf:WorkloadDetailVolume which will push your memory footprint higher. Poller c1 is using a lot of memory due to the following templates: Rest:CIFSSession, Rest:NFSClients, ZapiPerf:WorkloadDetailVolume, and ZapiPerf:WorkloadVolume. If you don't need those counters, you can save memory by disabling them. They account for the majority of metrics collected by poller c1 and s1.
There has not been an oom-killer event since 10-16, so it looks like you're good there.
Between the 10/10 log and the 10/16 log is when we finally got rest enabled and working. The issue was we didn't have http set in the role permissions, so user accounts and roles were recreated on the NetApp filers, but we didn't delete the arrays from NABox.
I did not manually adjust the order of collectors between 10/10 and 10/16, so that must have happened when I finally got REST collections working. I did change the collectors order with guidance from this thread (https://discord.com/channels/855068651522490400/1297961456326672447), but that happened on 10/22.
I have tried (unsuccessfully) to get email set up, but I'm not sure what day/time that was.
The kernel OOM-killer events were also a support event. I didn't realize I was out of memory, so we bumped it quite a bit. Now I know what to look for, so if we run out again I can bump it up or reduce the poller templates.
CIFSSession and NFSClients are important for our usage, but the WorkloadDetailVolume and WorkloadVolume templates are not likely something we'll need. I believe those are on by default, as I don't believe I've enabled them directly. I will go modify those once things have stabilized a little now.
(forgot to hit reply again...)
Thanks Ryan. Looks like @quaint pewter moved workload collection to here
it also looks like that will collect them via ZapiPerf regardless of which API you have picked for your poller
It's using the stock template from harvest and putting it in custom.
Any better way to do it ?
In Ryan's case, the part that surprised me is they have their poller's collectors ordered like so: [Rest, RestPerf, Zapi, ZapiPerf]. When "Enable workload collection" is enabled, I would have expected to see RestPerf collecting workloads instead of ZapiPerf
The checkbox works with the ability to merge templates, which is only available in Zapi
I'm guessing that you want merging so you can use the comma-separated notation like this?
WorkloadDetail: workload_detail.yaml,exclude_transient_volumes.yaml