#AIQUM 9.16+ reliability

gray vector
#

Hello everyone, I am getting a bit concerned about AIQUM's reliability with recent releases. We upgraded 9.14 instances (with about 30 nodes each) to 9.16, and one to 9.18, and are getting a metric ton of trouble each time.

First one was https://kb.netapp.com/data-mgmt/AIQUM/AIQUM_Kbs/AIQUM_OCIE_service_fails_to_start_due_to_mysql_native_password_error , with the official solution being to go to debug builds of the appropriate version, and then on the 9.18 one we seem to ALSO hit https://kb.netapp.com/data-mgmt/AIQUM/AIQUM_Kbs/AIQUM_becomes_unresponsive_frequently_after_upgrade_to_9.18 , asking to use another debug build.

Basically, one is asking for 9.18D7 and the other for 9.18D11. I am quite puzzled at having to choose the right debug build to upgrade an AIQUM instance...

Is anyone here having similar experiences, or is it just bad luck?

restive turret
#

had the same issues with going to 9.16 on all of our instances, have not tried to upgrade to 9.18 yet, not really wanting to with some of the issues present

worldly herald
#

I think the (unspoken) future will be NetApp Console. I wouldn't get my hopes up about much more AIQUM development (at least for new features).

restive turret
#

kind of sucks, aiq is really the only 'free' tool that collects the information our management team looks at.
The online/hosted version only keeps a short amount of data, which is rather useless for us.
NABox is too difficult for several people to dig around and find the info they want, etc.
Oh well, time to start looking at options

worldly herald
#

Rumors are NetApp Console might get some updates to include parts of the AIQUM stuff.

lyric drum
#

The D patches are not cumulative. If you are running into a scenario where you need multiple D patches, a support ticket will need to be raised so that a combined D patch can be generated.

gray vector
#

Did not know that, thanks @lyric drum . I thought the sum of all D patches was cumulative and formed the P releases afterwards.

#

To be honest, if I am running into a scenario where I need multiple D patches, I would rather roll back the upgrade and wait for the next patch release 🙂

#

But the point is, I do not remember having so many issues upgrading until AIQUM 9.14+

gray vector
#

Point is, I have on one hand Nabox / Harvest, which are really painless to maintain, well polished and use modern runtimes, and on the other AIQUM, which seems to suffer from worsening QA and has barely evolved since OCUM, runtime and UI-wise.

That is why I opened this thread. I'm trying to understand if this is a temporary fluke, or if we need to prepare for a transition to NetApp Console eventually as AIQUM is "passively" sunsetting.

lyric drum
gray vector
#

D-patches are builds to test a specific remediation for an issue, and P-releases are a collection of tested and "proven" patches in a bundle.

teal wyvern
#

FWIW, I'm in a bit of the same boat. The upgrade to 9.18 resulted in an AIQUM that had to be restarted multiple times until it finally succeeded after a few days of downtime. There was apparently some bug in the legacy (and deprecated, iirc) method AIQUM uses to connect to the mysql database. I also have a D patch I'm supposed to try, but I haven't had the time to play with this yet. It's working right now, so this hasn't made it to the top of the priority list.
I also find that Harvest is generally a better product and has much better support (those guys are just on the ball all the time), but AIQUM has a better approach to data reduction (reducing resolution over time), even if you only get 13 months of history. This is just how Prometheus works, unfortunately.
AIQUM also has better warning information about things like "noisy neighbors" and generally tracks aberrations more automagically. Being able to "drill down" the layers to find problems is quite useful at times (at least time-saving). Some of the "automagic", like automatically generating endless nearly identical adaptive-qos-policies, was a bit much, of course.
Good monitoring tools would ideally warn about impending problems (event convergence, both "load" and "storage capacity" maximum timelines, better integration with ARP) and perhaps have a few more channels for warning than email and SNMP (integrated webhook support, for example).
The "cloud" offerings aren't really an option currently.
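
For anyone unfamiliar with the "reducing resolution over time" idea mentioned above, here is a minimal sketch of what that kind of downsampling looks like. This is NOT how AIQUM or Harvest implement it; the function name `downsample` and the `(timestamp, value)` sample format are made up purely for illustration.

```python
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Hypothetical helper for illustration only: older data can be kept at a
    coarser resolution by replacing many raw points with one average per bucket.
    """
    buckets = {}
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        start = ts - (ts % bucket_seconds)
        buckets.setdefault(start, []).append(value)
    # One averaged point per bucket, ordered by bucket start time.
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Three raw samples collapsed into 60-second buckets.
raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
coarse = downsample(raw, 60)
```

Running the same function with progressively larger buckets on older data is one way to keep long history without storing every raw point.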

gray vector
#

Thanks for your feedback, that's appreciated mate 🙂 I agree with you, we do use AIQUM for this specific reason too actually.

#

In our case, it is possible we are hitting another "unpatched" (AFAIK) case: we have been recommended to add an -XX:CompileCommand JVM argument to bypass a JIT optimization issue. I asked if a patch release is scheduled, since the GA seems to accumulate quite a lot of issues...

keen remnant
# gray vector In our case, it is possible we are hitting another "unpatched" (AFAIK) case, we ...

the fix for that java crash is actually a patch that is already available, you should not have been asked to add the JVM argument. Would you mind DM'ing me your case?
a GA patch is scheduled that should consolidate many of these fixes, but i don't have the release date. just that it will be sometime this summer.

The patch is linked on the bug report (second link)
https://kb.netapp.com/data-mgmt/AIQUM/AIQUM_Kbs/AIQUM_becomes_unresponsive_frequently_after_upgrade_to_9.18

https://kb.netapp.com/data-mgmt/AIQUM/AIQUM-Issues/CAIQUM-8338

gray vector
#

@keen remnant Problem is, we seem to hit both the JVM / JIT optimization issues and the MySQL connection ones, and if I'm not mistaken, since they are basically different releases of the ocum / ocie / ... apt packages in the appliance, they are not cumulative. Switching to D7 would remove the D11 fix, so I am actually a bit more comfortable with keeping D7 + the arg patch if that is the case.

gray vector
#

Oh, the D7 patch is a simple replacement of a war file, that's actually quite cumulative... 😄

keen remnant
#

ah, unfortunately the D patches aren't cumulative, and the D11 patch is a full build, so the D7 patch would need to be respun to apply it on top of D11. that explains the route they were going.
i'm sorry for this, the whole thing is very frustrating.

worldly herald