#Lost historical data after disk capacity increase


livid minnow
#

I noticed I couldn't load any data in Grafana. When I logged in to NAbox, I saw that disk capacity was at 95% used. I increased the disk size and rebooted via NAbox >> Maintenance. The system is back up and I see new data in the recent timeline, but nothing historical. The disk shows 57% used after the increase, so the data should still be there. How do I fix this?

pearl field
#

@quaint pewter

abstract lodge
#

OOH! I think I lost historical data after a CPU/RAM increase, I kinda just shrugged it off because I was dealing with an upgrade, adding new collectors, and a few other support things, so 'functional' was the requirement and I left the historical thing alone...

quaint pewter
#

Well, you shouldn't lose historical data after a CPU or RAM upgrade. Maybe we should investigate this.

abstract lodge
quaint pewter
#

If you simply reboot, do you lose anything ?

abstract lodge
#

I've run the dc restart harvest command a few times today. Do you want me to reboot the whole appliance? I would lose a little of the monitoring for that period, but it wouldn't be an issue.

abstract lodge
quaint pewter
#

Either a reboot or restart Victoria metrics service.

#

Oh, that’s v3 sorry

#

Restart Prometheus then. But migrate to v4 too 🙂

abstract lodge
quaint pewter
#

lol, ok. Yes restart here

abstract lodge
#

After the reboot, and refreshing the page, I'm still not seeing historical data for volumes. You recently helped me get the REST API active, but I'm not sure if that's the right timeline for it.

quaint pewter
livid minnow
#

It looks like it took some time to show the historical data.

#

I did reboot at that time, but it took about a couple of days to show all the data.

abstract lodge
livid minnow
#

But it started showing the recent stats right away.

#

What would be the best way to make this highly available (HA)?

abstract lodge
#

@quaint pewter it appears to only be certain metrics... the Node CPU metric seems to have data back further, but the volume data doesn't. Perhaps enabling REST changed the collector and now it is not looking at the ZAPI data from before?

livid minnow
#

I can clone this and keep it running just in case but not sure how much stress it will put on the filers

livid minnow
#

I'm wondering if I should migrate to v4 before enabling the REST API?

abstract lodge
quaint pewter
abstract lodge
quaint pewter
abstract lodge
#

Node>CPU seems to just view them as different nodes, but shows them in the same graph with just different colored lines

livid minnow
#

The upgrade requires at least 3.5.2. Does that mean I am two versions behind?
nabox:/$ cat /etc/nabox-release
3.5

quaint pewter
pearl field
#

This scenario likely involves a change in the poller port. There are two cases to consider:

  1. Panels without Join Queries:

    • You will observe a change in color as the port changes over a time range. All data will be available.
  2. Panels with Join Queries:

    • All of our topk panel queries use join queries. In this case, you will notice a loss of data. This happens because the joined results attempt to match with the latest port. If older data has a different port, those results will be excluded.

Possible Reasons for Port Change:

  1. The poller was deleted and re-added.
  2. During migration, the pre- and post-migration ports are different. @quaint pewter to confirm.

I did document this issue https://github.com/NetApp/harvest/issues/2782 under Case 3: Poller Port Change. At that time, we decided not to make any changes in this area. For more details, refer to the discussion here.
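
To make the join-query case concrete, here is a toy model in Python. It is not Harvest or PromQL code, only a sketch of label matching. The `instance` label format (`localhost:<port>`) and the sample values are assumptions for illustration.

```python
from typing import Dict, Tuple

# A label set is modelled as a sorted tuple of (name, value) pairs so it can be a dict key.
Labels = Tuple[Tuple[str, str], ...]


def labels(**kv: str) -> Labels:
    """Build a hashable, order-independent label set."""
    return tuple(sorted(kv.items()))


def join_on_all_labels(left: Dict[Labels, float], right: Dict[Labels, float]) -> Dict[Labels, float]:
    """Keep only left-hand samples whose label set also appears on the right."""
    return {key: value for key, value in left.items() if key in right}


# Hypothetical series: the same cluster scraped from the old and the new poller port.
old = labels(cluster="c1", instance="localhost:12993")
new = labels(cluster="c1", instance="localhost:12004")

volume_ops = {old: 120.0, new: 95.0}  # data exists under both ports
latest_labels = {new: 1.0}            # the joined side only carries the latest port

joined = join_on_all_labels(volume_ops, latest_labels)
print(len(volume_ops), "series without a join,", len(joined), "series after the join")
```

A panel without a join sees both series (hence the color change), while the joined result keeps only the series whose labels match the latest port.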

quaint pewter
#

In theory, the migration tool asks for the same port when creating the cluster. If the systems were added manually that would probably break continuity.
I need to check that modifying a cluster tries to keep the same port but I think it does.

abstract lodge
#

So the switch from ZAPI to REST poller means that I have to search those specific timelines to see the different data?

pearl field
#

@abstract lodge Switching between ZAPI and REST will not cause this issue. It is related to the port value in harvest.yml for a poller. If that changes, the problem you mentioned will occur.

quaint pewter
#

Do you remember what happened at that point in time ? Migration or change to REST by modifying the system ? If you re-created the system that would cause this

abstract lodge
#

I don't think I ever recreated the system... and it's all of them, I definitely didn't recreate all of them... A few times I recreated the username/role in ONTAP, but I never deleted the entry from NABox...

#

The only thing I can think of is that might have been when I manually edited the order of pollers in the harvest.yml file? I changed all the arrays and the default from

```
- restperf
- zapi
- zapiperf
```
to
```
- rest
- zapi
- zapiperf
- restperf
```
but I didn't modify any other portion of the yaml file
hot pecan
#

I had the same issue as Ryan after migrating to v4. I switched the 'Preferred ONTAP API' method to 'REST' on the migrated systems (BEFORE the migration had finished transferring data, and I also switched the user to a new one with updated ONTAP permissions) and could only see new data. I tried Yann's suggestion and changed the date to an older point in time to be able to view the migrated 'ZAPI' data.

abstract lodge
hot pecan
pearl field
#

@abstract lodge We can check whether any renaming has occurred within the data. Could you please run the following query in NAbox VictoriaMetrics and share the output? You can email it to us at ng-harvest-files@netapp.com

sum(sum_over_time(aggr_space_physical_used[1y])) by (datacenter, cluster, instance)
abstract lodge
winter depot
#

Hi @abstract lodge, try https://$ip/vm/vmui/ (replace $ip with your NAbox IP), then paste the query in the query box. Click Execute Query and then switch to the Table view in the results.

#

thanks @abstract lodge received

abstract lodge
#

Hey Chris, I didn't even realize the VMUI was on this NABox installation. That's phenomenally helpful, but now I don't know where to start in finding out how to write my queries against the DB to get the info I want. Is there some sort of documentation on NABox specific Victoriametrics queries that I can use, similar to the aggr_space_physical_used, etc... I wouldn't even know where to begin on guessing what I want. Nothing seems to show up related to volume when I use the autocomplete option, so I must not be putting it in correctly...

winter depot
#

@abstract lodge, the Prom query that Rahul asked you to run shows what we suspected, which is that your pollers are using different prom ports now. That's what's causing the problem. Nabox manages those ports. @quaint pewter do you know why the ports changed? Here's a screenshot, the boxes are around clusters with the same name
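
For anyone checking their own query output, a minimal Python sketch of the same check follows: group the result rows by cluster and flag any cluster that shows up under more than one `instance`. The row layout, the `dc1`/`u1` names, and the `localhost:<port>` instance format are assumptions, not actual VMUI output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def clusters_with_multiple_instances(rows: Iterable[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Return {(datacenter, cluster): [instances]} for clusters seen under more than one instance."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["datacenter"], row["cluster"])].add(row["instance"])
    return {key: sorted(instances) for key, instances in seen.items() if len(instances) > 1}


# Hypothetical rows, shaped like the label sets returned by the query above.
rows = [
    {"datacenter": "dc1", "cluster": "c1", "instance": "localhost:12993"},
    {"datacenter": "dc1", "cluster": "c1", "instance": "localhost:12004"},
    {"datacenter": "dc1", "cluster": "u1", "instance": "localhost:12000"},
]

changed = clusters_with_multiple_instances(rows)
for (datacenter, cluster), instances in changed.items():
    print(f"{datacenter}/{cluster} appears under several instances: {instances}")
```

A cluster listed here has data split across ports, which is the pattern the boxes in the screenshot highlight.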

abstract lodge
winter depot
#

Thanks, I appreciate the willingness to help troubleshoot! I'd like to figure out why it happened so we can improve the situation for folks that haven't upgraded yet and because I don't like mysteries, at least not this kind 😃 I believe it's an Naboxism, which is why I've asked Yann to take a look. Let's see what he says about those port groupings.

quaint pewter
#

When no port is provided, NAbox 4 starts from 12000 and counts up until it finds an available port. High ports like 12993 are definitely not expected.

#

Oh... NAbox 3 actually started with 12990
That would happen if migration happened on NAbox < 4.0.3

#

Or if the systems were manually created in NAbox 4 when deploying, instead of using the migrate tool to create
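
As a rough illustration of why the ports differ, here is a first-free-port sketch in Python. It is not NAbox code. The function names, the order in which systems are added, and the assumption that port 12000 is already taken are all hypothetical. Only the starting points (12000 for NAbox 4, 12990 for NAbox 3) come from the messages above.

```python
from typing import Dict, Iterable, Set


def first_free_port(used: Set[int], start: int, limit: int = 65535) -> int:
    """Return the lowest port >= start that is not in `used`."""
    for port in range(start, limit + 1):
        if port not in used:
            return port
    raise RuntimeError("no free port available")


def assign_ports(names: Iterable[str], used: Set[int], start: int) -> Dict[str, int]:
    """Assign each system the next free port, in the order the systems are added."""
    taken = set(used)
    assigned = {}
    for name in names:
        port = first_free_port(taken, start)
        taken.add(port)
        assigned[name] = port
    return assigned


# Hypothetical: the same four systems allocated with each starting point.
v3_style = assign_ports(["h1", "s1", "i1", "c1"], used=set(), start=12990)
v4_style = assign_ports(["s1", "i1", "h1", "c1"], used={12000}, start=12000)
print(v3_style)
print(v4_style)
```

If systems are recreated without an explicit port, the v4-style allocation hands out low ports, so the series no longer line up with the 129xx ports used before.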

abstract lodge
quaint pewter
#

Do you remember which NAbox 4 release you were on when you migrated?

abstract lodge
#

I don't believe I've upgraded since then, so that would make it 4.0.5, but I'm looking back to verify that for sure

#

It was either 4.0.4 or 4.0.5.

quaint pewter
#

migration happened on 10/10 ?

abstract lodge
#

Enabling REST API was around 10/10, started the conversation around 9/28, and on 10/10 we dug into an issue with the users created for API, and I had to delete and recreate those users (though I stuck with the same username and password and only rebuilt them on the NetApp arrays)

pearl field
#

@abstract lodge Let's try to find the time when this happened. Could you email the result of the query below to us at ng-harvest-files@netapp.com

max_over_time( 
  timestamp(
    sum(aggr_space_physical_used) by (datacenter, cluster, instance)
  )[1y:5m])
abstract lodge
#

It errors out with a "too many points for the given start" message with 1y:5m, but if I change that to 49w:4m it works. I'm emailing that result now

winter depot
#

Thanks for the email, Ryan. I've obfuscated your cluster names. Using the data from your screenshot, we can see that cluster c1 was last using port 12993 on 2024-10-10T16:50:00-04:00 and now uses port 12004. There is a similar pattern for the other pollers too. I'd still like to understand why this happened. Could you upload a NAbox support bundle (https://nabox.org/documentation/troubleshooting/) to https://upload.nabox.org/fuwo-ruxy-wypa so we can try to figure out which version you had before upgrading?

One way to work around this issue would be to change your poller ports back to what they were before the upgrade. That would create a gap of 19 days of data between now and Oct 10, but after making the change, you would see the historical data prior to Oct 10 in your graphs. Let us know if you want to do that and I can tell you what port changes need to be made.

abstract lodge
winter depot
#

thanks!

abstract lodge
#

I think the 19 day gap would be preferable at this point, so if that's something we can do, that'd be amazing!

winter depot
#

Give this a try.

ssh into nabox by running ssh admin@$ip.
Once you auth, run:
sudo vim /etc/nabox/harvest/harvest.yml

Change the port: for each poller using this mapping:

cluster  port before  port after
-------  -----------  ----------
c1       12004        12993
h1       12003        12990
i1       12002        12992
s1       12001        12991

After making these changes, save them
Restart everything via dc restart
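
Before editing the file by hand, the mapping can be sanity-checked with a small Python sketch like the one below. It works on a plain in-memory dict, not on the real harvest.yml. The dict layout and the `port` key name are assumptions and may not match the actual file schema.

```python
from typing import Dict, Tuple

# cluster -> (port before the edit, port after the edit), taken from the table above.
PORT_MAP: Dict[str, Tuple[int, int]] = {
    "c1": (12004, 12993),
    "h1": (12003, 12990),
    "i1": (12002, 12992),
    "s1": (12001, 12991),
}


def restore_ports(pollers: Dict[str, Dict[str, int]], port_map: Dict[str, Tuple[int, int]]) -> Dict[str, Dict[str, int]]:
    """Return a copy of `pollers` with ports switched back, refusing unexpected input."""
    updated = {name: dict(cfg) for name, cfg in pollers.items()}
    for name, (current, previous) in port_map.items():
        if updated[name]["port"] != current:
            raise ValueError(f"{name}: expected port {current}, found {updated[name]['port']}")
        updated[name]["port"] = previous
    ports = [cfg["port"] for cfg in updated.values()]
    if len(ports) != len(set(ports)):
        raise ValueError("two pollers would end up on the same port")
    return updated


# Hypothetical current state, using the 'port before' column.
current = {name: {"port": before} for name, (before, _) in PORT_MAP.items()}
restored = restore_ports(current, PORT_MAP)
print(restored)
```

The check fails loudly if a poller is not on the expected port or if two pollers would collide, which is cheaper to catch here than after a restart.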

quaint pewter
#

Shot in the dark. Can you try :

journalctl -u nabox --since "2024-10-10 00:00:00" --until "2024-10-11 00:00:00"
abstract lodge
#

(looks like ssh...)

quaint pewter
#

from within NAbox ssh

#

but it's probably way too far. Before is better

abstract lodge
#

Are we trying to determine when the change happened?

#

that's easy, it's visible in the graphs

quaint pewter
#

I was trying to get all the logs from that day

winter depot
#

@quaint pewter did you check the support bundles that @abstract lodge uploaded?

abstract lodge
#

If they're in the diag collection, I have some from that day that I've uploaded... there was one from 10/10 @ 2:50pm ish

quaint pewter
#

oh we have logs from back then

abstract lodge
abstract lodge
quaint pewter
#

We have it, but I don't see any PATCH requests

winter depot
#

Thanks for providing the log files, Ryan. Your harvest.yml is identical between 10-09 and 10-10. 10-16 is when I see the ports change in your harvest.yml file. The file was changed Oct 16 14:36, though of course it may have changed at any time between those support bundles. On 10-10 all pollers, except for the unix poller, are running on promPort 129.., the unix poller is using promPort 12000. As you mentioned, each poller's list of collectors changed in the 10-16 harvest.yml. Previously they were [Zapi, ZapiPerf, Rest, Ems], then on 10-16 they changed to [Rest, RestPerf, Zapi, ZapiPerf].

You did not make those edits by hand, I assume? Did you set up email or make any other NAbox changes during that time frame?

The 10-10 logs include a bunch of 401 Unauthorized access log messages coming from the Zapi:Sensor plugin. I'm guessing REST was not enabled for your clusters at that point? Either way, I see some log improvements we will make.

Looks like the kernel's out of memory killer was getting a good workout, killing pollers and victoria-metrics. Looks like your nabox instance had 3.79 GB of memory. Ah, looks like you upgraded it to 11.6 GB according to the 10-22 logs. Nice. I also noticed you have enabled ZapiPerf:WorkloadDetailVolume which will push your memory footprint higher. Poller c1 is using a lot of memory due to the following templates: Rest:CIFSSession, Rest:NFSClients, ZapiPerf:WorkloadDetailVolume, and ZapiPerf:WorkloadVolume. If you don't need those counters, you can save memory by disabling them. They account for the majority of metrics collected by poller c1 and s1.

There has not been an oom-killer event since 10-16, so it looks like you're good there.

abstract lodge
#

Between the 10/10 log and the 10/16 log is when we finally got rest enabled and working. The issue was we didn't have http set in the role permissions, so user accounts and roles were recreated on the NetApp filers, but we didn't delete the arrays from NABox.

I did not manually adjust the order of collectors between 10/10 and 10/16, so that must have happened when I finally got REST collections working. I did change the collectors order with guidance from this thread (https://discord.com/channels/855068651522490400/1297961456326672447), but that happened on 10/22.
I have tried (unsuccessfully) to get email set up, but I'm not sure what day/time that was.

The kernel OOM-killer events were also a support event. I didn't realize I was out of memory, so we bumped it quite a bit. Now I know what to look for, so if we run out again I can bump it up or reduce the poller templates.
CIFSSession and NFSClients are important for our usage, but the WorkloadDetailVolume and WorkloadVolume templates are not likely something we'll need. I believe those are on by default, as I don't believe I've enabled them directly. I will go modify those once things have stabilized a little now.

abstract lodge
winter depot
#

Thanks Ryan. Looks like @quaint pewter moved workload collection to here

#

it also looks like that will collect them via ZapiPerf regardless of which API you have picked for your poller

quaint pewter
#

It's using the stock template from harvest and putting it in custom.
Any better way to do it ?

winter depot
#

In Ryan's case, the part that surprised me is they have their poller's collectors ordered like so: [Rest, RestPerf, Zapi, ZapiPerf]. When "Enable workload collection" is enabled, I would have expected to see RestPerf collecting workloads instead of ZapiPerf

quaint pewter
#

The checkbox works with the ability to merge templates, which is only available in Zapi

winter depot
#

I'm guessing that you want merging so you can use the comma-separated notation like this?
WorkloadDetail: workload_detail.yaml,exclude_transient_volumes.yaml