#FAS9500's and ONTAP 9.14.1P12
Note that 9.14.1P14 is probably still 3 months or more away (~2 months between P releases).
You might want to update directly to 9.15.1P8 (or later) which also gets you some nice new features.
That is, unless the bug you encountered is also present in those releases. Do you have a BURT ID that you can share?
It’s a new bug. This is a dark site. Won’t be recommending 9.15 for another 2-3 months. We are totally shocked something like this happened
CONTAP-366340
There’s nothing there yet
The info is internal-only, at least for now.
ok but then it's specific to the 9500? Because we have quite a few systems on 9.14.1P12 already without any problems
okay, seems like only 1 impacted system so far... apparently some nasty edge case
panic prod/common/wafl/buf.c:3422: assertion failure in sk process wafl_exempt26 on release 9.14.1p12
The wafl exempt number changes a bit but that’s the panic. Happened on two switch-connected nodes. Both nodes are an HA pair.
the number is just the thread it happened in (wafl_exempt is the multithreaded part of WAFL, there's one wafl_exempt thread per CPU so it's basically a random number depending on where the crash happens)
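Since the thread number is effectively random, it helps to strip it before comparing panic strings from the two nodes. A minimal sketch in Python, assuming the panic line always looks like the one quoted above; the regex and the helper name `panic_signature` are my own and not any NetApp tooling, and the second sample line (thread 07) is made up for illustration:

```python
import re

# Pattern modeled only on the single panic line quoted in this thread (assumption).
PANIC_RE = re.compile(
    r"panic\s+(?P<file>\S+):(?P<line>\d+):\s+(?P<reason>.+?)\s+in sk process\s+"
    r"(?P<process>[A-Za-z_]+?)(?P<thread>\d+)\s+on release\s+(?P<release>\S+)",
    re.IGNORECASE,
)


def panic_signature(message):
    """Return a tuple identifying the panic, ignoring the per-CPU thread number."""
    m = PANIC_RE.search(message)
    if m is None:
        return None
    # The thread number (m["thread"]) is deliberately left out of the signature.
    return (m["file"], int(m["line"]), m["reason"], m["process"], m["release"].upper())


node_a = "panic prod/common/wafl/buf.c:3422: assertion failure in sk process wafl_exempt26 on release 9.14.1p12"
# Hypothetical second line: same panic, different thread number.
node_b = "panic prod/common/wafl/buf.c:3422: assertion failure in sk process wafl_exempt07 on release 9.14.1p12"

same_panic = panic_signature(node_a) == panic_signature(node_b)
print("same panic on both nodes:", same_panic)
```

If the signatures match, both nodes most likely tripped over the same assertion, which fits the takeover scenario described in the next message.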
buf.c is part of the buftree code... could be anything, but since this code is pretty well-exercised I guess it is some edge case. One filer trips over something, does a takeover, and the other one trips over the same issue. result is that both nodes panic.
Yup
Hi, why won't you recommend 9.15.1? We have 9 switchless clusters on 9.13.1P9 (A250, C250 and FAS2820). We are planning to upgrade them to the latest 9.15.1 P release in 2 weeks. Is this release not stable? Or should we stay on the latest 9.13.1 P release?
9.15.1 is very stable, we have dozens of systems running 9.15.1 without issues. It just comes down to personal preference how long you want to wait, i.e. how conservative you are. If, for example, any ONTAP upgrade requires you to schedule a maintenance window, notify a dozen other groups and have them possibly migrate stuff away or something like that, you want to make sure to update no more than say once a year. In that case you want the latest P release you can get your hands on
If, on the other hand, you can upgrade relatively easily without any advance scheduling over a weekend or even during a workday, you might be more eager (and open to) running earlier P releases
I work with a lot of secure customers. Air gapped. No asup. No logs. No cores. I want as many things fixed as possible.
I wait for 9.16.1P4. After that is published, the next version of 9.15 is where I look. I’m looking for some stability on 9.16 with the expectation that the “exotic” bugs that are discovered by those on 9.16 are fixed in 9.15.
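That gating rule can be written down roughly like this. It is only a sketch of the idea, assuming release names of the form `9.15.1P8`; the function names and the example release list are hypothetical. It is also simplified: once the newer family has reached P4 it returns the newest known P release of the older family, and it ignores publication dates, so it does not check that the patch actually came out after the gate release.

```python
import re

RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:P(\d+))?$", re.IGNORECASE)


def parse_release(name):
    """Turn '9.15.1P8' into (9, 15, 1, 8); a release without a P suffix gets patch 0."""
    m = RELEASE_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognized release name: {name!r}")
    major, minor, maint, patch = m.groups()
    return (int(major), int(minor), int(maint), int(patch or 0))


def pick_release(published, target_family, next_family, min_next_patch=4):
    """Return the newest P release of target_family, but only once next_family
    has reached min_next_patch; otherwise return None (keep waiting)."""
    target = parse_release(target_family)[:3]
    gate = parse_release(next_family)[:3]
    parsed = [(parse_release(r), r) for r in published]

    gate_patches = [p[3] for p, _ in parsed if p[:3] == gate]
    if not gate_patches or max(gate_patches) < min_next_patch:
        return None

    candidates = [(p, r) for p, r in parsed if p[:3] == target]
    return max(candidates)[1] if candidates else None


# Hypothetical list of published releases, for illustration only.
published = ["9.15.1P7", "9.15.1P8", "9.16.1P3"]
print(pick_release(published, "9.15.1", "9.16.1"))
print(pick_release(published + ["9.16.1P4"], "9.15.1", "9.16.1"))
```

The first call returns `None` because the gate (9.16.1P4) is not in the list yet; the second returns the newest 9.15.1 patch in the list.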
Will use 9.16 if absolutely necessary. Prefer not to.
I also work with NetApp, looking and waiting for some feedback as customer use increases on 9.15.
With ASUP, NetApp has insights and can suggest when to upgrade. These customers do not.
I don't think you'll find too many P releases that aren't, at some point, updated with yet another panic fix. The idea that there is a supreme good release with no problems isn't really an attainable goal. Some new releases will have some embarrassing shortcomings, but after that, edge cases will catch you at some point. I upgrade relatively often and luckily our "Change Regime" has shaken off the idea that others get to approve when we do things. We're just expected to fix things when they break, more like modern CI/CD approaches. Failovers/givebacks aren't as disruptive as they once were either.
so I've probably used most of the P releases in the last few years (after P2 or P3) without any serious skew towards instability (far over 5 9's)
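For scale, "five nines" leaves very little room per year. A quick back-of-the-envelope calculation in Python (plain arithmetic, nothing ONTAP-specific; the function name is made up):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_budget_minutes(nines):
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Five nines works out to a bit over 5 minutes of downtime per year, so "far over" that means even less.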
while it's true that there won't be a bug-free P-release ever, the statistics work in your favor, i.e. bugs more likely to affect many systems are found in earlier releases, while in the later P releases, the bugs tend to be somewhat more obscure (like the one in 9.14.1P12 mentioned above only affecting 1 or 2 systems currently). So the chances of you hitting a critical bug are much smaller when you're on a later P release
History. History shows that around P4 is when the significant/obvious fixes are done. The recent exceptions were a couple of issues that weren’t discovered/reported until P4 and fixed in P7-9-ish
And as @rough wren said, not looking for bug free as we all know that’s impossible. Looking for best chance at stability on the newest release possible…because these customers typically cannot send logs or core files.
new functionality and new hardware always increase the risk for bugs, but sometimes it's just your day. The matrix of usage is too big to test and the number of undiscovered and unfixed bugs is probably daunting. The saving grace is always failing without losing data (i.e. panic when necessary). I feel like the long wait periods into using a new release are a bit like the "uptime" idea one used to tout in the Unix world where one would keep systems running until they were nearly impossible to upgrade. The SAN world is probably less forgiving than NAS, so I could be convinced to differentiate more there. I probably wouldn't be using 9.16 yet if it wasn't for a few features and newer hardware and I've encountered minor bugs, but nothing to lose sleep over yet.
And you are probably using ASUP, so if you run into unknown issues they can be resolved relatively easily with logs and core files.
I’ve been doing secure sites my whole career. They can’t live on the bleeding or even leading edge unless a feature or hardware require it.
The DISA approved products list doesn’t even include 9.16 yet. (Although I hear it may be dropping this week).
With the secure sites there are plenty of extra things we need to consider. New code is nice but if older/stabler code works, that’s what we do.
We certainly have fewer issues this way, and if someone does hit an issue we can usually look at the signs and figure out the resolution (since it’s usually been discovered/fixed). This is nearly impossible to do with bleeding edge code.
There's an old saying in manufacturing: "We don't know why things work, only when they stop working". Unless one really knows that bugs that one would hit aren't already in the "safe" code one is running (and certainly there are both new and old bugs), it seems like a bit of cognitive bias. Anyway, you probably aren't going to make any adjustments to your procedures if they seem like they work and I don't have more time.
@rough wren @pastel fulcrum @plucky narwhal hi guys, thanks for the feedback, and sorry for the late reply: Discord has some serious issues in my country (channels not loading etc.). We usually have only one upgrade window per year, so we plan to go as high as we can using an N-1 strategy, which is why we were aiming for 9.15.1P8 (it also meets the "at least P3" rule that we follow). We have used this approach for many years (SAN + CIFS/NFS environments) and didn't hit any problems, except for some BMC bugs that were fixed very quickly and applied independently of the ONTAP upgrade. I did a check on our systems (C250, A250 and 2820/2720) and didn't find any bugs. They are switchless clusters, and the FC switches are only G720s from Brocade. They don't even have shelves for now. Before that we used 2620 and A200 and followed the same strategy (even the multi-hop upgrade from 9.3 to 9.7 was smooth).
By the way, are you still sticking to the CLI during upgrades, or are you now using System Manager as NetApp recommends? 🙂 I've always used the CLI, but it's tempting to use System Manager since NetApp recommends it everywhere.
Personally, I only use system manager for uploading the image when I don't have an FTP ready. The actual update I do via CLI always
same here for years to be honest
but uploading from client was gamechanger for me
to be honest
it was always the annoying part to find/set up an FTP/HTTP server to upload the image
we had the updates in an S3 bucket that you could just download via https if you knew the exact URL... worked pretty well in all but the most restricted sites
It really does not matter how the upgrade starts. I do both, CLI and GUI. If you start in the GUI, you can follow on the CLI. If you start in the CLI, you can follow on the GUI.