#ONTAP Major Update (9.13.1 -> 9.14.1P2) requires harvest restart

1 messages · Page 1 of 1 (latest)

agile geyser
#

Every time I do a major update of ONTAP, I have to restart the harvest process... can anybody explain why?
The process is still running, log files are looking normal, but it doesn't work...
as you can see from the 2day screenshot, we are missing a lot of values in the middle, that's why there is only a straight line...

frank merlin
#

@agile geyser Could you share harvest version?

agile geyser
#

24.02.0 on rhel 8.9

agile geyser
#

sure, do you also need the poller_unix.log?

frank merlin
#

That is not needed. It will be helpful if harvest log contains data during the duration of ONTAP upgrade.

agile geyser
#

oh, the logfiles have already been rotated. So I will observe if the behaviour is the same during the next update (scheduled for 2024-04-27) and then send logfiles...

frank merlin
#

Ok, We do keep 5 rotated files upto 10MB each or max 7 days of logs (whichever hits first). Probably those have the data?

delicate gust
#

@agile geyser we're happy to look at whatever log files you have. Hopefully they captured the upgrade

agile geyser
#

I updated 2 other systems to 9.14.1 and now I have logfiles for this behaviour.... do you need the whole data from all clusters? (the whole /var/log/harvest is 88MByte as a tar.gz, not sure if your smtp accepts that much...)

frank merlin
#

@agile geyser Any 1 cluster logs will also work.

delicate gust
#

received files @agile geyser thanks!

delicate gust
#

thanks for the log files @agile geyser, they helped root cause the issue. The RestPerf:SystemNode collector panicked while collecting counters during upgrade. This happened because the collector failed to get new instances during the upgrade. Downstream code then tried to work with those missing instances and panicked. Since the collector panicked, it stopped collecting metrics until you restarted.

This is an issue we discovered in March and fixed https://github.com/NetApp/harvest/pull/2743 If you have more clusters to upgrade, and want to try https://github.com/NetApp/harvest/releases/tag/nightly it has that fix, otherwise you can pick it up when we release 24.05.

I also confirmed that the other collectors continued collecting metrics without panics. It was only the RestPerf:SystemNode collector that panicked. Some of the other collectors hit timeouts during upgrade, likley due to ONTAP restarting nodes

GitHub

Nightly builds may include bugs and other issues. You might want to use the stable releases instead.

agile geyser
#

ah @delicate gust thank you very much for explaining. Unfortunately, that's my last upgrade to 9.14, so I can not test the new code...

delicate gust
#

ah well, thanks for reporting! Good to document for others

agile geyser
#

would this issue also be triggered when updating from 9.14.1P<x> to the new 9.14.1P4? Or do I have to do a major update to trigger this behaviour?