# Cannot add clusters in AIQUM 9.16P2


worn ginkgo
#

I have been struggling with this for a long time now and would really appreciate it if someone could help me find the actual reason why the clusters won't add in AIQUM. I'm getting the attached error; the timezone is the same and in sync. The AIQUM HTTPS certificate start date is June 13 and the cluster's date is current, the cluster only has 2 destinations configured, port 443 is open, and I have given the user admin permission for the required roles as well, but it still doesn't work.

alpine dagger
#

aye, join the club.
I have a clean install of AIQ doing the exact same thing.

I have a ticket open on this; once I get anything useful I'll report back here.

worn ginkgo
#

that would be very helpful; this is also a fresh installation

alpine dagger
#

I have a Zoom call set up for tomorrow morning; hopefully it goes well and I can figure out wth is going on.
This system had AIQ installed. I tried to upgrade, and it was 'successful', yet it wasn't.
uninstall failed, re-install failed, repair failed, etc.
manually uninstalled
moved all the data to another drive
clean install
using same name, ip, cert, etc. everything.
This error comes up on almost every system we try to add, but one somehow managed to get added and is working correctly.
go figure

worn ginkgo
#

wow, same as mine: the first cluster got added, but the others keep giving this error

modern lantern
#

hmmm ok, looks like I'll wait for P3 🙄

alpine dagger
#

no luck yet.
Uploaded a full support bundle after the call.

It's really weird.
One cluster was added without issue, none of the other 13 will get added, same error on all of them.

While running the diag collection under Maintenance, it threw a 'permission denied' error on a file, which is really weird.
So I fixed that, ran the collection again and it worked.
But now, the application isn't running properly.
When you go to the URL, it either spins forever or instantly comes back with a 404 Not Found.

Good times

I might just install an older version and wait to figure out wtf is up with this one.

#

hard to imagine we're the only two having this issue on clean installs or after an upgrade.
But they couldn't find any listed bugs/issues/etc.

storm chasm
#

To be honest, I have the feeling that AIQUM is like the unloved child 🙈 it's here and needed, but not as important as other stuff

covert heart
#

Unfortunately, the clocks error is a generic error, and usually isn't what it says unless you've added that cluster to multiple AIQUM servers and/or DII.

The best place to look for cluster-add errors is server_acq.log, but we might need to branch out to other logs.

Starting with AIQUM 9.14, ONTAP 9.14 and up will default to using the cloud agent, so you'd need those ports opened. If you've already disabled the cloud agent (please open a case before doing so, we're trying to track down and report all the issues), then you should be using mTLS.

@alpine dagger if you want to DM me your case number, i can take a quick peek today.

alpine dagger
#

The weird thing is... this system and all the clusters were working fine before I tried to upgrade from 9.16 to 9.16P2,
then I had to manually remove everything and install fresh.
#2010420892

#

Thanks

covert heart
#

@alpine dagger What I'm seeing in your logs is that the cluster you tried adding today doesn't like a self-signed cert from AIQUM.

This is what we have in the server_acq.log on aiqum:

2025-06-20 10:43:16,652 ERROR [default task-752] c.n.s.a.s.r.s.i.AcquisitionFacadeSessionServiceImpl (AcquisitionFacadeSessionServiceImpl.java:905) - Fail to add data source for acquisitionUnitName: local, dataSourceName: <cluster name>, vendor:NetApp OCI Essentials, model: NetApp OCI Essentials, datasourceTypeId: 91, manual: 1, isActive: false, attrs: -- listing properties --

: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Failed to establish connection for cloud agent instance UnifiedManager_aiqum_sysid. Reason: Certificate error: self-signed certificate in certificate chain.
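To pull these failures out of a long server_acq.log without scrolling, here is a minimal Python sketch. It assumes only the layout visible in the excerpt above: an ERROR header line starting with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, followed later by an exception line ending in `Reason: ...`. Not every failure carries a `Reason:` text, so treat this as a first-pass filter rather than a complete parser; `find_add_failures` is a hypothetical helper name.

```python
import re
import sys
from typing import Iterable, Iterator, Optional, Tuple

# ERROR entry header, e.g. "2025-06-20 10:43:16,652 ERROR [default task-752] ..."
ERROR_HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR\b")
# The readable cause at the end of the (possibly nested) RuntimeException message.
REASON = re.compile(r"Reason:\s*(.+?)\s*$")


def find_add_failures(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (timestamp of the most recent ERROR header, reason text) pairs."""
    last_error_ts: Optional[str] = None
    for line in lines:
        header = ERROR_HEADER.match(line)
        if header:
            last_error_ts = header.group(1)
        reason = REASON.search(line)
        if reason:
            yield last_error_ts, reason.group(1)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_acq.py /path/to/server_acq.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for ts, reason in find_add_failures(log):
            print(f"{ts}  {reason}")
```

On a Linux install the thread points at /var/log/ocie/server_acq.log; other platforms will keep it elsewhere.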

And I can see this in your cluster autosupport mgwd.log:

00000020.01d7e458 030830e9 Fri Jun 20 2025 14:42:11 +00:00 [kern_mgwd:info:4039] [2025/06/20 14:42:11:9539] E: HTTP OPENSSL_PERFORM_SERVER_CERT_VERIFICATION for client UnifiedManager_aiqum_sysid: Error: self-signed certificate in certificate chain

Is everything on both ends CA-signed by the same certificate authority?
When I saw this before, I had to add the root CA from AIQUM's chain into the cluster's trusted certificate authorities.
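One way to reproduce the cluster's view of AIQUM's chain from a workstation is to attempt a TLS handshake that trusts only the root CA you installed (or plan to install) on the cluster. The Python sketch below does that with the standard `ssl` module: passing `cafile` to `ssl.create_default_context` loads only that file instead of the system trust store. It is an approximation under assumptions: the host name/port you pass (e.g. AIQUM on 443) and the PEM file are yours to supply, hostname checking is left on, and the cluster's own verification may differ in details. `check_chain` and `is_chain_trust_problem` are hypothetical helper names.

```python
import socket
import ssl
from typing import Tuple


def check_chain(host: str, port: int, root_ca_pem: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Handshake with host:port trusting ONLY the CA certificate(s) in root_ca_pem.

    Returns (True, "ok") if the chain verifies, else (False, OpenSSL's verify message).
    Non-verification TLS errors (protocol, connection) are left to propagate.
    """
    # With cafile set, create_default_context does not load the system CA store.
    context = ssl.create_default_context(cafile=root_ca_pem)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True, "ok"
    except ssl.SSLCertVerificationError as err:
        return False, err.verify_message


def is_chain_trust_problem(message: str) -> bool:
    """True for verify messages that point at a missing or untrusted root CA."""
    # OpenSSL 1.1 says "self signed", OpenSSL 3 says "self-signed"; normalize both.
    normalized = message.lower().replace("-", " ")
    return ("self signed certificate" in normalized
            or "unable to get local issuer certificate" in normalized)
```

If `check_chain` fails with "self-signed certificate in certificate chain" against the root CA on the cluster, that matches the mgwd.log error above and suggests the cluster is missing the root that actually signed AIQUM's current certificate.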

https://kb.netapp.com/data-mgmt/AIQUM/AIQUM_Kbs/Adding_ONTAP_Cluster_with_CA_signed_certificate_on_AIQUM_with_CLAG_fails_with_certificate_chain_error

https://kb.netapp.com/data-mgmt/AIQUM/AIQUM_Kbs/Clusters_addition%2F%2Fdiscovery_fail_in_AIQUM_9.14__with_CA_signed_certificate_and_cloud_agent_enabled

I'm afraid this may require a detailed examination of all the certs involved if that's not enough of a hint, so I'll leave a note for your case owner on some things to check next.

alpine dagger
#

thanks, I'll look into it, but as I said, this was all working before I had to reinstall.

#

After changing the permissions on the Java keystore that was throwing the permission error and restarting the AIQ services, it isn't coming up, so that's another thing I have to deal with.

covert heart
#

yeah, that will throw a bigger wrench in the works

#

If you did a restore after the reinstall, I'd expect it to have the same certs as before, so I don't have a good answer there other than that it doesn't like what is being sent now.

#

Otherwise, if you redid the certs from scratch, the root certificate may have been updated and no longer match what is on the cluster.

alpine dagger
#

yea, I'll double-check, but all the certs are the ones that were originally used.
Then again, who knows if our domain admins did anything with the cert servers in the past couple of months.

#

thanks for looking

covert heart
#

if you have a server.log with the keystore errors, please upload or attach it to the case for your case owner.

#

And since we've hijacked this thread: for the others with cluster-add issues that aren't easily resolved by what I've mentioned here, please open a case as well.

alpine dagger
#

yea, case owner is aware of the permission error

worn ginkgo
#

Thank you for the healthy discussion, guys. I will go through the KB articles shared here and see if they help; otherwise I'll try installing a lower version of AIQUM.

vale linden
#

Most of the time, the solution is basically just to disable the "cloud agent".

alpine dagger
#

so, is there a way to disable the cloud agent?

vale linden
#

yeah, but I can't remember the file at the moment; you should be able to search for it. The "solution" comes up if you dig deep enough into errors on adding clusters.

alpine dagger
#

netapp/essentials/conf/server.properties
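A minimal Python sketch for flipping the setting in that file, assuming the property is `enable.cloudagent` as reported later in this thread and that it uses a plain `key=value` line (Java properties also allow `:` separators, which this sketch does not handle). It keeps a `.bak` copy, replaces an existing uncommented line, or appends one if none exists. Note covert heart's request to open a case before disabling the cloud agent. Whether AIQUM needs a service restart to pick this up is an assumption here, not something stated in the thread.

```python
import re
import shutil
import sys
from pathlib import Path

KEY = "enable.cloudagent"  # property name as reported in this thread


def set_property(text: str, key: str, value: str) -> str:
    """Return text with `key` set to `value`, appending the line if it is absent.

    Only handles simple `key=value` / `key = value` lines; commented lines are left alone.
    """
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    if pattern.search(text):
        return pattern.sub(lambda m: f"{m.group(1)}{key}={value}", text)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + f"{key}={value}\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python set_cloudagent.py <path to netapp/essentials/conf/server.properties>
    conf = Path(sys.argv[1])
    shutil.copy2(conf, conf.with_name(conf.name + ".bak"))  # keep a backup first
    updated = set_property(conf.read_text(encoding="utf-8"), KEY, "false")
    conf.write_text(updated, encoding="utf-8")
    # Assumption: AIQUM services likely need a restart before this takes effect.
    print(f"{KEY}=false written to {conf}")
```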

worn ginkgo
#

Update: I was able to add other clusters after disabling cloud agent.

alpine dagger
#

all of my systems are CVO, all firewalls are set to allow full access to/from our AIQ machine, and it still fails with the cloud agent enabled

tribal sparrow
#

As Dawn mentioned, we usually check the server_acq log for errors related to the failure to add. If that doesn't give enough detail, we can usually check the Apache and audit logs on the cluster. The user also needs to have the amqp application assigned if you are not using the default admin user for monitoring.

modern lantern
#

Why is that not documented anywhere? I'm talking about the docs page with the official documentation, not some KB article (a new one appears every day; I can't keep up with which KB to check for a correct AIQUM configuration...).

Here it still says: "This account must have the admin role with Application access set to ontapi, console, and http."
The AMQP application, which is apparently needed now, is completely missing.
https://docs.netapp.com/us-en/active-iq-unified-manager/config/task_add_clusters.html
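For a dedicated local monitoring user, the documented applications (ontapi, console, http) plus amqp can be granted with ONTAP's `security login create`, one application per command. The Python sketch below just prints those commands; it does not talk to the cluster. The user name is hypothetical, and the exact parameters (`-user-or-group-name`, `-application`, `-authentication-method`, `-role`, optional `-vserver`) and whether `amqp` accepts `password` on your release are assumptions to check against the `security login create` man page for your ONTAP version. Per tribal sparrow above, remote (domain, LDAP, SAML) users won't work for this.

```python
from typing import List

# Applications from the docs page (ontapi, console, http) plus amqp, per this thread.
APPLICATIONS = ("ontapi", "console", "http", "amqp")


def login_commands(user: str, role: str = "admin", vserver: str = "") -> List[str]:
    """Build one `security login create` command per application for a local user."""
    vs = f" -vserver {vserver}" if vserver else ""
    return [
        f"security login create{vs} -user-or-group-name {user} "
        f"-application {app} -authentication-method password -role {role}"
        for app in APPLICATIONS
    ]


if __name__ == "__main__":
    for cmd in login_commands("aiqum_monitor"):  # hypothetical user name
        print(cmd)
```

Afterwards, `security login show -user-or-group-name <user>` should list one row per application.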

We always use dedicated users for each application that connects to an ONTAP cluster. So software like OTV, SCV, AIQUM, BlueXP Connector, Veeam, etc. all get their own user (otherwise changing a password becomes a nightmare). None of these users have the AMQP application, because it's not documented.
The same happened, by the way, with BlueXP B&R, which after an automatic update of the BlueXP Connector suddenly started using AMQP for job history without documenting it anywhere. The devs simply expected everyone to use the default admin user for everything.

tribal sparrow
#

We are working to get the documentation updated.

alpine dagger
#

the really weird thing at my site...
One cluster was added with CA enabled.
None of the other 12 would join.
All of the systems are configured nearly identically.

alpine dagger
#

so... domain users can't be granted console access, but according to that document it's required for AIQ

#

or disable mTLS

tribal sparrow
#

A remote (domain, LDAP, SAML) user will not work, because an NDMP password can't be generated for the account on the cluster.

#

I will update those two KBs. If you have additional feedback on KBs, feel free to submit feedback on them.

vale linden
#

i'd still love to see tcpdump back in the vmware image

#

You could also update the port usage docs with "source" and "destination" for each of the TCP ports. I found the documentation too generic: it doesn't tell you which side initiates the connections, which makes firewall openings more complex and too much of a guessing game without tcpdump. Running tcpdump in the system shell on the remote filer is sort of usable, depending on how you have set up admin interfaces and such.

#

I spent several days digging into everything I could find. First we had the mTLS expiration, which AIQUM itself really should simply tell you straight out: some things just keep working, other things don't. Then the AMQP stuff, the Cloud Agent, and REST polling. It got to be a bit of a mess very fast.

#

AIQUM also doesn't seem to clean up after itself when new versions start logging to new locations or in new formats. I had lots of old logs from earlier versions lying around.

alpine dagger
#

the logging functions are horrid.
I have to manually go in and stop the service, move files and then restart because our C: drives are locked to a very specific size.
I'm sure you can edit a server.properties or some other conf file to change that, but it really should be documented or available via the GUI or CLI.

#

the other issue for us is the SSL/cipher options. The GUI has a pretty limited set of options, and we have to deal with our secvuln team bugging us about findings like these:

Certificate issued with sha512WithRSAEncryption
At least one of the following should be enabled:
ecdsa-with-SHA256
ecdsa-with-SHA384
ecdsa-with-SHA512
sha256WithRSAEncryption

TLS curves {'X448', 'secp521r1'} should be rejected

sage yarrow
#

Gosh, AIQUM 9.16, 9.16P1 and 9.16P2 are a dog's breakfast! Just reinstalled our AIQUM server, as it never worked and ONTAP 9.16.1 clusters would never add. 9.16P2 still does not add the clusters. Because we are using SAML and can't add them otherwise, I have set the "rest outbound version" option value to 9.20.1 so the API check goes through ZAPI instead of REST. I have also changed netapp/essentials/conf/server.properties to set enable.cloudagent=false, then tried to add the cluster... Failed. Time to log another support case.

alpine dagger
#

my main problem is finding a way to recover data from our old install, from before I tried the upgrade.
No clue if there is a way to pull the data from the MySQL DB and import it or anything, but it is going to suck losing 12+ months of history.
But that's a problem for tomorrow me to deal with.

vale linden
#

you should be able to restore a backup, but I've never had to do it

tribal sparrow
#

You can VM snapshot the machine prior to performing the upgrade or you can create a UM backup. Reverting with a VM snapshot will be faster than restoring from a UM backup. UM backups are version and platform specific.

near mirage
#

Just upgraded a 9.14 instance to 9.16P2. I have a single cluster that is getting "license key field 'serial_number' may not be null." in au.log.
Anyone hit this one yet?

#

forgot to mention, this cluster just had nodes added. The licenses don't match, but AIQUM 9.14 didn't mind.
edit: by "not matching" i mean legacy keys vs ONTAP ONE keys.

#

this was nearly my first 100% successful upgrade in my long history of DFM/UM/AQUM/AIQUM upgrades 😅

tribal sparrow
#

That is due to CAIQUM-7125

near mirage
#

it is a public KB though

near mirage
#

figured out another one...

java.lang.NullPointerException: disk_aggregate_relationship key field 'aggregate_uuid' may not be null.

This was caused by a failed ADP disk, and it wouldn't poll again until I assigned an owner to the replacement's ADP partitions.
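Both polling failures reported here share one message shape, `<object> key field '<field>' may not be null`, which tells you which object (license, disk_aggregate_relationship) is missing which key. A small Python sketch that lists each distinct pair found in a log such as au.log; the regex is built only from the two messages quoted in this thread, so other variants may slip past it, and `null_key_fields` is a hypothetical helper name.

```python
import re
import sys
from typing import Iterable, List, Tuple

# e.g. "license key field 'serial_number' may not be null"
NULL_KEY = re.compile(r"(\w+) key field '(\w+)' may not be null")


def null_key_fields(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return distinct (object, field) pairs in first-seen order."""
    seen: List[Tuple[str, str]] = []
    for line in lines:
        for match in NULL_KEY.finditer(line):
            pair = (match.group(1), match.group(2))
            if pair not in seen:
                seen.append(pair)
    return seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python null_keys.py /path/to/au.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for obj, field in null_key_fields(log):
            print(f"{obj}: {field} is null")
```

In the two cases above, the fix was on the cluster side (license keys, ADP partition ownership), so the output is a pointer to what to inspect rather than a fix in itself.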

charred wren
#

has anybody come up with a solution for adding the 9.16P2 cluster to AIQ?

alpine dagger
#

disable the cloud agent

charred wren
#

the cloud agent is already set to "false"

alpine dagger
#

you'll need to look into the logs to see what the exact error message is

charred wren
#

looking at /var/log/ocie/server_acq.log