I am using a credential script and I just noticed a partial collector failure after updating Harvest. It looks like the credential API did not respond well to the restart of 23 collectors at once. 🙂
I ended up with a StorageGrid collector running with 1 of 2 templates. To me it looks like every collector template does a unique credential lookup. Is that correct?
#Partial collector fail with credential script
hi @steel berry have you specified a schedule for your credentials-script block? If not, the default schedule is to cache the credentials for 24h, in which case, the credential script should only be called once per poller per 24h, regardless of the number of templates you have. https://netapp.github.io/harvest/24.05/configure-harvest-basic/#credentials-script
This happened during a restart of the poller. I can send the log output by mail.
can you paste your poller's credentials_script: block too?
credentials_script:
  path: "/opt/ccp_read.pl"
  schedule: 24h
  timeout: 10s
thanks! looks good and which version of Harvest are you using? I verified locally with 24.05 that the script is only called once per poller restart
I just updated to 24.05
perfect! How many pollers are restarting at once? Is the issue that you have 23 pollers, all restarting at the same time, and even though Harvest only makes one request per poller per 24h, that is still too many for your cred service?
It may be a coincidence, or the cred service could have been overwhelmed. My main concern is that the poller does not fail when the credential script returns an error.
ah ok, that was not clear. I thought you were asking how many cred requests per poller should be expected. Yes, the log files will help
Well both actually. I want to avoid overwhelming the cred service for sure. But half-working poller is worse. 🙂
As a possible workaround to overwhelming, you could try smearing the poller restarts over a larger window by using jitter. https://netapp.github.io/harvest/24.05/configure-zapi/#zapiperf-configuration-file This will randomly distribute collector startup across the jitter duration.
And on the half-working problem, we'll take a look at your logs and root cause
I sent the mail! I will look into the jitter
Jitter is only mentioned for the perf pollers and the documentation only talks about performance query startup times. Will this really be helpful for poller startup?
that's misleading documentation, we'll fix that. Jitter works for any collector you configure it on, e.g.
Poller=sar collector=Zapi:Security jitter=4.223234577s
Poller=sar collector=ZapiPerf:NFSv4 jitter=56.714268116s
Jitter documentation improved at https://github.com/NetApp/harvest/discussions/2856 and in this PR https://github.com/NetApp/harvest/pull/2920
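To make the jitter idea concrete, here is a minimal Go sketch of spreading startups over a window. It is an illustration only, not Harvest's implementation: `startDelay` and `demo` are made-up names, and the one-minute window and the count of 23 are just example values taken from this thread.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// startDelay picks a random delay in [0, jitter).
// A zero or negative jitter means "start immediately".
// Hypothetical helper, not part of Harvest.
func startDelay(r *rand.Rand, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return 0
	}
	return time.Duration(r.Int63n(int64(jitter)))
}

// demo returns one random delay for each of 23 collector starts,
// using an assumed one-minute jitter window.
func demo() []time.Duration {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	delays := make([]time.Duration, 23)
	for i := range delays {
		delays[i] = startDelay(r, time.Minute)
	}
	return delays
}

func main() {
	for i, d := range demo() {
		fmt.Printf("collector %02d starts after %s\n", i+1, d.Round(time.Millisecond))
	}
}
```

With delays drawn independently like this, the first credential lookups are spread over the window instead of all arriving in the same instant.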
Thanks for the log files. Did you happen to notice if this affects other collectors too or only StorageGRID? I think I understand what's happening. Harvest's auth code called your Perl script, and your script returned an HTML page to Harvest with text saying "403 Forbidden". That error percolated up the Harvest stack and ultimately that collector fails to start.
This behavior is similar to what would happen if you were NOT using a credentials_script and mistyped credentials. Except in the mistyped credentials case, all the collectors would fail to start and the entire poller would exit since no collectors are running. With your script, as you noted, some collectors may start and others may not since your script sometimes returns valid credentials and sometimes doesn't. The StorageGrid:Prometheus collector failed and the StorageGrid:Tenant collector succeeded.
This comes back to my initial question. So there are different credential lookups for StorageGrid:Prometheus and StorageGrid:Tenant? Why? They are just different templates in the same collector configuration. If the same happens for ONTAP, I have a lot more than 23 cred requests.
@steel berry The cred script is invoked only once per poller and is shared by each collector. However, if any collector fails due to an error, the cred service will be retried by another collector. Therefore, if all collectors fail while executing the cred service, that's when you'll observe the cred service being called by each collector individually.
Below is the code for reference
https://github.com/NetApp/harvest/blob/main/pkg/auth/auth.go#L85-L95
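For readers who don't want to dig through the linked file, the behavior described above can be sketched roughly as follows. This is a simplified model, not the actual code in auth.go: `credCache`, `get`, and `simulate` are hypothetical names, and the failing first call just imitates the "403 Forbidden" response from the logs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// credCache is a hypothetical per-poller cache; it is not Harvest's real type.
type credCache struct {
	mu      sync.Mutex
	fetch   func() (string, error) // stands in for running the credential script
	ttl     time.Duration          // e.g. the 24h schedule
	value   string
	expires time.Time
	calls   int // how many times the script was actually invoked
}

// get returns the cached credential, or calls fetch when nothing valid is cached.
// A failed fetch caches nothing, so the next caller tries the script again.
func (c *credCache) get(now time.Time) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value != "" && now.Before(c.expires) {
		return c.value, nil
	}
	c.calls++
	v, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.value, c.expires = v, now.Add(c.ttl)
	return v, nil
}

// simulate starts three collectors against a script whose first call fails.
// It reports whether each collector got credentials and the total script calls.
func simulate() ([]bool, int) {
	attempt := 0
	cache := &credCache{
		ttl: 24 * time.Hour,
		fetch: func() (string, error) {
			attempt++
			if attempt == 1 {
				return "", errors.New("403 Forbidden")
			}
			return "example-password", nil
		},
	}
	now := time.Now()
	var ok []bool
	for i := 0; i < 3; i++ {
		_, err := cache.get(now)
		ok = append(ok, err == nil)
	}
	return ok, cache.calls
}

func main() {
	ok, calls := simulate()
	fmt.Println(ok, calls) // prints: [false true true] 2
}
```

In this model the first collector fails, the second triggers a fresh script call and succeeds, and the third is served from the cache, which matches the "one collector failed, one succeeded" pattern seen here.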
That makes sense, thanks for the explanation. I will configure jitter to reduce the load on the cred service.
Is there any way to ensure all collectors are running correctly?
Yes, you can track the metric metadata_component_status
https://netapp.github.io/harvest/24.05/ontap-metrics/#metadata_component_status
status of the collector - 0 means running, 1 means standby, 2 means failed
metadata_component_status{type="collector"} > 0
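If you post-process these values outside Prometheus, the mapping above is easy to encode. A small Go sketch, assuming you already have the samples in hand: the `sample` struct and its field names are made up for illustration (they are not the metric's real label set), and the poller name and data in `main` are invented.

```go
package main

import "fmt"

// sample is a hypothetical, simplified view of one metadata_component_status series.
type sample struct {
	Poller    string
	Collector string
	Status    int
}

// statusName translates the numeric status as described above.
func statusName(v int) string {
	switch v {
	case 0:
		return "running"
	case 1:
		return "standby"
	case 2:
		return "failed"
	default:
		return "unknown"
	}
}

// notRunning mirrors the "> 0" filter: it keeps standby and failed collectors.
func notRunning(samples []sample) []sample {
	var out []sample
	for _, s := range samples {
		if s.Status > 0 {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Made-up data resembling the situation in this thread.
	samples := []sample{
		{Poller: "sg1", Collector: "StorageGrid:Prometheus", Status: 2},
		{Poller: "sg1", Collector: "StorageGrid:Tenant", Status: 0},
	}
	for _, s := range notRunning(samples) {
		fmt.Printf("%s %s is %s\n", s.Poller, s.Collector, statusName(s.Status))
	}
}
```

Note that `> 0` also matches standby collectors; filter on `== 2` if only failures matter.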
It would be great to have a panel for failed collectors in the Metadata dashboard 😉
It is available
Great, a stats panel in the Highlights section would be helpful.
Sure. Could you please open a GitHub request for that?
Sure, will do