#Partial collector fail with credential script


steel berry
#

I am using a credential script and I just noticed a partial collector failure after updating Harvest. It looks like the credential API was not responding well to the restart of 23 collectors at once. 🙂
I ended up with a StorageGrid collector running with 1 of 2 templates. To me it looks like every collector template does a unique credential lookup. Is that correct?

ionic glen
steel berry
#

This happened during a restart of the poller. I can deliver log output via mail.

ionic glen
#

can you paste your poller's credentials_script: block too?

steel berry
#

credentials_script:
  path: "/opt/ccp_read.pl"
  schedule: 24h
  timeout: 10s

ionic glen
#

thanks! That looks good. Which version of Harvest are you using? I verified locally with 24.05 that the script is only called once per poller restart

steel berry
#

I just updated to 24.05

ionic glen
#

perfect! How many pollers are restarting at once? Is the issue that you have 23 pollers, all restarting at the same time, and even though Harvest only makes one request per poller per 24h, that is still too many for your cred service?

steel berry
#

It may be a coincidence, or the cred service could have been overwhelmed. My main concern is that the poller is not failing when the credential script returns an error.

ionic glen
#

ah ok, that was not clear. I thought you were asking how many cred requests per poller should be expected. Yes, the log files will help

steel berry
#

Well, both actually. I want to avoid overwhelming the cred service for sure. But a half-working poller is worse. 🙂

ionic glen
steel berry
#

I sent the mail! I will look into the jitter

#

Jitter is only mentioned for the perf pollers and the documentation only talks about performance query startup times. Will this really be helpful for poller startup?

ionic glen
#

that's misleading documentation, we'll fix that. Jitter works for every collector where you specify it, e.g.
Poller=sar collector=Zapi:Security jitter=4.223234577s
Poller=sar collector=ZapiPerf:NFSv4 jitter=56.714268116s
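The log lines above show each collector receiving a different jitter value. A minimal sketch of the idea, assuming jitter is a random delay between zero and a configured maximum that is applied before a collector starts polling. The function name `jittered_start_delays` and the 60-second maximum are hypothetical and are not taken from Harvest's code.

```python
import random


def jittered_start_delays(collectors, max_jitter_s, seed=None):
    """Return a random start delay between 0 and max_jitter_s for each collector.

    Hypothetical helper for illustration only; not part of Harvest.
    """
    rng = random.Random(seed)
    return {name: rng.uniform(0, max_jitter_s) for name in collectors}


# Spread two collectors over an assumed 60-second window.
delays = jittered_start_delays(
    ["Zapi:Security", "ZapiPerf:NFSv4"], max_jitter_s=60, seed=1
)
for name, delay in delays.items():
    print(f"collector={name} jitter={delay:.3f}s")
```

Because each collector draws its own delay, requests that would otherwise arrive at the same instant are spread across the window, which is what reduces the burst seen by a shared backend.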

ionic glen
#

Jitter documentation improved at https://github.com/NetApp/harvest/discussions/2856 and in this PR https://github.com/NetApp/harvest/pull/2920

Thanks for the log files. Did you happen to notice if this affects other collectors too or only StorageGRID? I think I understand what's happening. Harvest's auth code called your Perl script, and your script returned an HTML page to Harvest with text saying "403 Forbidden". That error percolated up the Harvest stack and ultimately that collector fails to start.

This behavior is similar to what would happen if you were NOT using a credentials_script and mistyped credentials. Except in the mistyped credentials case, all the collectors would fail to start and the entire poller would exit since no collectors are running. With your script, as you noted, some collectors may start and others may not since your script sometimes returns valid credentials and sometimes doesn't. The StorageGrid:Prometheus collector failed and the StorageGrid:Tenant collector succeeded.

steel berry
#

This comes back to my initial question. So there are different credential lookups for StorageGrid:Prometheus and StorageGrid:Tenant? Why? They are just different templates in the same collector configuration. If the same happens for ONTAP I have a lot more than 23 cred requests.

kindred axle
#

@steel berry The cred script is invoked only once per poller and is shared by each collector. However, if any collector fails due to an error, the cred service will be retried by another collector. Therefore, if all collectors fail while executing the cred service, that's when you'll observe the cred service being called by each collector individually.

Below is the code for reference
https://github.com/NetApp/harvest/blob/main/pkg/auth/auth.go#L85-L95
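The behavior described above, one shared lookup per poller with a retry by the next collector after a failure, can be sketched as a cache that stores only successful results. This is a simplified, single-threaded illustration in Python, not a translation of Harvest's Go code. The class `CachedCredentials`, the function `flaky_fetch`, and the example secret are all hypothetical.

```python
import time


class CachedCredentials:
    """Cache a successful lookup for ttl_s seconds; never cache a failure."""

    def __init__(self, fetch, ttl_s, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is not None and now < self._expires_at:
            return self._value  # shared by every later caller until expiry
        value = self._fetch()  # may raise; in that case nothing is cached
        self._value = value
        self._expires_at = now + self._ttl_s
        return value


calls = []


def flaky_fetch():
    """Stand-in for the credential script: fails once, then succeeds."""
    calls.append(1)
    if len(calls) == 1:
        raise RuntimeError("403 Forbidden")
    return "s3cret"


creds = CachedCredentials(flaky_fetch, ttl_s=24 * 3600)
results = []
for collector in ["StorageGrid:Prometheus", "StorageGrid:Tenant", "StorageGrid:Other"]:
    try:
        results.append((collector, creds.get()))
    except RuntimeError as exc:
        results.append((collector, f"failed: {exc}"))

print(results)
print("fetch calls:", len(calls))
```

In this run the first collector fails, the second triggers a fresh lookup that succeeds, and the third reuses the cached value, so the stand-in script is called twice rather than three times. If every lookup failed, each collector would trigger its own call.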

steel berry
#

That makes sense, thanks for the explanation. I will configure jitter to reduce the load on the cred service.

#

Is there any way to ensure all collectors are running correctly?

kindred axle
steel berry
#

It would be great to have a panel for failed collectors in the Metadata dashboard 😉

kindred axle
#

It is available

steel berry
#

Great, a stats panel in the Highlights section would be helpful.

kindred axle
#

Sure. Could you please open a GitHub request for that?

steel berry
#

Sure, will do