#ONTAP Major Update (9.13.1 -> 9.14.1P2) requires harvest restart
@agile geyser Could you share your Harvest version?
24.02.0 on rhel 8.9
Could you send the Harvest logs to ng-harvest-files@netapp.com? Collection steps are here: https://netapp.github.io/harvest/24.02/help/log-collection/#rpm-deb-and-native-installations
sure, do you also need the poller_unix.log?
That is not needed. It would help if the Harvest logs cover the period of the ONTAP upgrade.
oh, the logfiles have already been rotated. So I will observe if the behaviour is the same during the next update (scheduled for 2024-04-27) and then send logfiles...
OK, we keep 5 rotated files of up to 10 MB each, or a maximum of 7 days of logs (whichever limit is hit first). Perhaps those still have the data?
@agile geyser we're happy to look at whatever log files you have. Hopefully they captured the upgrade
I updated 2 other systems to 9.14.1 and now I have logfiles for this behaviour. Do you need the data from all clusters? (The whole /var/log/harvest is 88 MB as a tar.gz, not sure if your SMTP server accepts that much.)
@agile geyser Logs from any one cluster will work.
received files @agile geyser thanks!
thanks for the log files @agile geyser, they helped root cause the issue. The RestPerf:SystemNode collector panicked while collecting counters during upgrade. This happened because the collector failed to get new instances during the upgrade. Downstream code then tried to work with those missing instances and panicked. Since the collector panicked, it stopped collecting metrics until you restarted.
This is an issue we discovered in March and fixed in https://github.com/NetApp/harvest/pull/2743 If you have more clusters to upgrade and want to try the nightly build, it has that fix: https://github.com/NetApp/harvest/releases/tag/nightly Otherwise you can pick it up when we release 24.05.
I also confirmed that the other collectors continued collecting metrics without panics. It was only the RestPerf:SystemNode collector that panicked. Some of the other collectors hit timeouts during the upgrade, likely due to ONTAP restarting nodes.
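For anyone reading later, the failure mode described above (an instance poll that fails, followed by code that uses the missing instances) can be sketched in Go roughly as below. This is only an illustration of the general guard pattern, not the actual change in the linked pull request; the `Instance` type and the `collectCounters` function are hypothetical names and are not part of Harvest.

```go
package main

import "fmt"

// Instance is a hypothetical stand-in for one monitored object (e.g. a node).
// It is not a Harvest type.
type Instance struct {
	Name string
}

// collectCounters walks the keys returned by a counter poll and looks each
// one up in the instance cache. Keys with no cached instance are skipped and
// reported instead of being dereferenced, which is what would cause a panic.
func collectCounters(cache map[string]*Instance, keys []string) (collected, skipped []string) {
	for _, key := range keys {
		inst := cache[key] // nil when the instance poll never stored this key
		if inst == nil {
			skipped = append(skipped, key)
			continue
		}
		collected = append(collected, inst.Name)
	}
	return collected, skipped
}

func main() {
	// Assumed scenario: the instance poll only succeeded for node-01
	// before the upgrade restarted the other node.
	cache := map[string]*Instance{"node-01": {Name: "node-01"}}
	collected, skipped := collectCounters(cache, []string{"node-01", "node-02"})
	fmt.Println("collected:", collected)
	fmt.Println("skipped:", skipped)
}
```

With a guard like this, a poll that runs while instances are missing loses some data points for that interval but keeps running, so no restart is needed once the cluster is back.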
ah @delicate gust thank you very much for explaining. Unfortunately, that was my last upgrade to 9.14, so I cannot test the new code...
ah well, thanks for reporting! Good to document for others
would this issue also be triggered when updating from 9.14.1P<x> to the new 9.14.1P4? Or do I have to do a major update to trigger this behaviour?