#Harvest Data collection problem


tropic vapor
#

Hi all, I've got a problem with Harvest data collection. The pollers start, and for the first period everything on the Grafana dashboard is green for both the pollers and the collectors. After some time, the status of the Zapi collectors goes to standby and they no longer push data to InfluxDB. Sometimes the pollers also behave weirdly: they usually start, but in some cases a couple of them don't, and I need to restart the service until all the pollers are running. I've attached some snippets from the poller log showing the errors. We have different kinds of errors and our log file is totally flooded with them. Last time I restarted the services, the Zapi collectors worked fine for a couple of hours and then they just stopped again.

Environment details:
Host systems OS - Red Hat Enterprise Linux release 8.7 (Ootpa)
Harvest version - harvest-22.08.0-1.x86_64
ONTAP - 9.9.1P13
InfluxDB 2.6.1
Grafana 9.2.10

Changes to the environment within the last 2 weeks:
OS update from 8.5 to 8.7
Harvest upgrade from harvest-22.02.0-4.x86_64 to harvest-22.08.0-1.x86_64
Switched the login method from a self-signed certificate to user/pw authentication; the user has read-only permission on ONTAP.

I think the Harvest configuration files are good, as the services start and operate for a while, but please correct me if I'm wrong.

Your support will be appreciated.

small cedar
#

@tropic vapor Could you share the full log file with us? I see there are some permission issues related to aggregates in the above logs:
{"level":"info","Poller":"name of the poller","collector":"Zapi:Aggregate","task":"data","caller":"collector/collector.go:361","time":"2023-02-22T15:59:28+01:00","message":"API request rejected => not authorized for that command"}

#

Also, we have just released Harvest 23.02. If possible, see if you can upgrade to this version.

#

You can also email logs to us at ng-harvest-files@netapp.com

tropic vapor
#

Hi, thanks for the quick response. I will send the log to you via email; the subject will be the name of this post.

small cedar
#

@tropic vapor We have received the logs. Are these logs from the state when the pollers are on standby? When you say standby, is the poller process getting stopped automatically, or is it running but not collecting any data?

tropic vapor
#

Yes, this is the current state of the pollers. Basically, at the moment the pollers are up and running, but they are not collecting any Zapi-related data.

small cedar
#

Got it. Yes, I see in the logs that the collected metrics are 0:
{"level":"info","Poller":"n01891001","collector":"Zapi:Volume","instances":79,"metrics":0,"apiD":"126ms","parseD":"6ms","caller":"collector/zapi.go:443","time":"2023-02-23T12:42:48+01:00","message":"Collected"}
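
A quick way to spot this pattern across a large poller log is to scan for "Collected" entries whose `metrics` field is 0. The following is a minimal sketch, assuming the log is one JSON object per line with the field names seen in the entry above; the helper name `zero_metric_collections` is hypothetical and not part of Harvest.

```python
import json


def zero_metric_collections(lines):
    """Return (poller, collector, instances) for 'Collected' entries with metrics == 0."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("message") == "Collected" and entry.get("metrics") == 0:
            hits.append((entry.get("Poller"), entry.get("collector"), entry.get("instances")))
    return hits


# The log entry quoted above, used as sample input.
sample = [
    '{"level":"info","Poller":"n01891001","collector":"Zapi:Volume","instances":79,'
    '"metrics":0,"apiD":"126ms","parseD":"6ms","caller":"collector/zapi.go:443",'
    '"time":"2023-02-23T12:42:48+01:00","message":"Collected"}'
]
print(zero_metric_collections(sample))
```

To run it against a real file, pass an open file handle instead of `sample`, since iterating over a file yields its lines.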

#

Performance metrics, which are ZapiPerf related, seem fine

tropic vapor
#

Yesterday I restarted the services, and both the Zapi and the ZapiPerf parts were green for at least 3 hours

#

I'm totally confused by this behavior

small cedar
#

Did this problem start only after the upgrade to 22.08?

tropic vapor
#

Actually, before that we had not restarted any services and everything was running as originally configured, so I can't say for sure that it is related to the version upgrade, but we didn't have this issue before.

small cedar
#

okay

#

is upgrading to 23.02 an option for you?

tropic vapor
#

If it will help us solve the current issue, then I think we can do that

#

but it's a bit tricky, as we have 7 Harvest instances on different host machines and we have the same problem everywhere

#

but we can test it for sure; I'm able to roll back if something goes totally wrong

small cedar
#

Sure. We have made several improvements in this area since 22.08, with better logging and optimisation. Let's try upgrading 1 machine to Harvest 23.02? It will also be easier for us to debug with the latest version.

tropic vapor
#

Okay, I need to bring this topic to the architectural board first and I will get back to you, but we can probably test that.

small cedar
#

Sure. In the meantime we'll try to see if we can find what's wrong with 22.08.

#

btw i don't see this error in the shared logs from email
{"level":"info","Poller":"name of the poller","collector":"Zapi:Aggregate","task":"data","caller":"collector/collector.go:361","time":"2023-02-22T15:59:28+01:00","message":"API request rejected => not authorized for that command"}

tropic vapor
#

Based on the information I shared, are we good to go to 23.02? I mean, are our OS, Influx and Grafana versions fine?

small cedar
#

Could you check whether the user running Harvest has the proper permissions?

split ember
#

hi @tropic vapor also make sure that the username/password you are using to talk to the cluster has the appropriate permissions. As Rahul mentioned above, it doesn't look like the user has permission to collect aggregate info.

split ember
#

Read-only is recommended, but that user is unable to read aggregate resources, which is unexpected and typically due to misconfigured permissions

tropic vapor
#

Okay, I will double-check the user

small cedar
#

@tropic vapor The logs you shared seem to cover only the last 2.5 hours. It looks like the problem started occurring before that. Could you share the full log for this service from systemd with us?

tropic vapor
#

I will send it soon.

small cedar
#

Thanks

tropic vapor
#

User checked, and the read-only role is configured.

tropic vapor
#

Just a quick question: can I go directly to 23.02 from 22.08?

small cedar
#

yes

split ember
#

thanks for checking - ONTAP is returning "API request rejected => not authorized for that command" when trying to collect aggregate info. We've only seen that when there are user permission problems.

tropic vapor
#

Did you receive my latest email? I'm not sure it reached you guys, as I noticed a delivery failure.

split ember
#

we did not

tropic vapor
#

Okay, I will resend it.

split ember
#

thanks that worked

tropic vapor
#

Hi, just a quick update from my end: I will update the test Harvest instance to 23.02, and after that I will have the green light to update the production system to the latest Harvest as well.

small cedar
#

Okay, but as Chris has shared over email, the problem is currently due to ONTAP permissions

#

Updating to 23.02 will not solve the issue

tropic vapor
#

Ah, okay, maybe I overlooked it. I will check again, thanks.

tropic vapor
#

Hi guys, I've sent the latest information and log files to you. The situation has improved since the upgrade. I only have a problem with one Harvest instance and one specific poller at the moment.

#

It looks like the collectors are flapping, so to speak: the collection is rejected with "API request rejected", but after a while the collection is green again.

#

but only one poller is affected at the moment, not all of them

#

The Harvest service has been up and running for 3 hours and 8 minutes. In that time I've noticed the collectors go off 2 times, but now everything is green again.

#

Just now another poller is behaving the same way as described above: its collectors are down with the API rejection.... 😦
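
To see when each poller flaps, the rejection entries can be grouped by poller and collector with their first and last timestamps. This is a minimal sketch, assuming one JSON object per log line and the exact rejection message quoted earlier in this thread; the function name `rejection_windows` and the second sample timestamp are hypothetical, added only for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime

REJECT = "API request rejected => not authorized for that command"


def rejection_windows(lines):
    """Map (poller, collector) -> (count, first_seen, last_seen) for rejected requests."""
    seen = defaultdict(list)
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON or truncated lines
        if not isinstance(entry, dict) or entry.get("message") != REJECT:
            continue
        if "time" not in entry:
            continue
        stamp = datetime.fromisoformat(entry["time"])
        seen[(entry.get("Poller"), entry.get("collector"))].append(stamp)
    return {key: (len(ts), min(ts), max(ts)) for key, ts in seen.items()}


# First entry is the one quoted in this thread; the second timestamp is made up.
sample = [
    '{"level":"info","Poller":"name of the poller","collector":"Zapi:Aggregate","task":"data",'
    '"caller":"collector/collector.go:361","time":"2023-02-22T15:59:28+01:00",'
    '"message":"API request rejected => not authorized for that command"}',
    '{"level":"info","Poller":"name of the poller","collector":"Zapi:Aggregate","task":"data",'
    '"caller":"collector/collector.go:361","time":"2023-02-22T16:02:28+01:00",'
    '"message":"API request rejected => not authorized for that command"}',
]
for key, (count, first, last) in rejection_windows(sample).items():
    print(key, count, first.isoformat(), last.isoformat())
```

Comparing these windows with the ONTAP audit log for the same time range may show whether something changed for the user around the moments the rejections start and stop.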

small cedar
#

Thanks @tropic vapor, we have received the files. Yes, the logs clearly indicate an authorisation issue. Could you try the curl command that Chris shared? We want you to run the curl while the Harvest pollers are failing. The curl command should fail as well.
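
The curl command shared over email is not reproduced in this thread. As a rough equivalent, here is a minimal sketch of the same kind of probe: it builds a ZAPI request for aggregate info using the Harvest user's credentials, outside of Harvest. The endpoint path, the XML envelope, the `aggr-get-iter` call, and all helper names are assumptions for illustration and should be checked against your ONTAP version before use.

```python
import base64
import urllib.request

# Assumed ZAPI endpoint path on the cluster management interface.
ZAPI_PATH = "/servlet/netapp.servlets.admin.XMLrequest_filer"


def build_zapi_request(host, user, password, api="aggr-get-iter"):
    """Build (but do not send) a basic-auth ZAPI POST request for the given API call."""
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<netapp xmlns="http://www.netapp.com/filer/admin" version="1.3">'
        f"<{api}/></netapp>"
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"https://{host}{ZAPI_PATH}",
        data=body,
        headers={"Content-Type": "text/xml", "Authorization": f"Basic {token}"},
        method="POST",
    )


def is_rejected(response_text):
    """True if the response carries the rejection message seen in the poller logs."""
    return "not authorized for that command" in response_text


# Placeholder host and credentials; nothing is sent over the network here.
req = build_zapi_request("cluster.example", "harvest-user", "secret")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req, timeout=30)` and passing the decoded body to `is_rejected` would complete the check; a cluster with a self-signed certificate would also need a suitable `ssl` context. If this probe is rejected at the same time the pollers fail, the problem sits on the ONTAP side rather than in Harvest.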

tropic vapor
#

I initiated the curl command when the pollers were failing

small cedar
#

Okay and what is the response?

tropic vapor
#

I've just tried the curl for the problematic one and now I get "not authorized for that command"

#

so now it is not working.

small cedar
#

Okay, so the problem is not with Harvest

#

it is related to ONTAP permissions

tropic vapor
#

I don't really understand that. How can it happen that sometimes the permissions are fine and sometimes not? 😦

small cedar
#

Maybe try a different user? I am guessing that user is shared across teams and is being changed.

#

audit logs are the only way to know

split ember
#

the current theory is something else is changing the permissions for your Harvest user

tropic vapor
#

I will try using another user then; the current user is used for other purposes as well.

#

Thanks for the hint, guys!

split ember
#

we'll take a look at the audit log and see if we can find who changed the permissions and how, to support that theory

tropic vapor
#

Thanks a lot, your time is highly appreciated!

tropic vapor
#

I've prepared everything, and Harvest is now reconfigured to use a local ONTAP user for data collection. The user was created based on the guide. Everything looks green to me so far. I will keep you updated!

split ember
#

awesome!

tropic vapor
#

Hi, just a quick update: since we started using the local ID for data collection instead of the domain ID, everything is green.

#

I have already applied the same setting on 2 other Harvest instances, and both of them look green so far.

split ember
#

that sounds good! - do you believe there is a Harvest problem with domainIDs we need to fix or you think it's unrelated?

tropic vapor
#

I will do the same for the rest of the instances

#

I'm not sure at the moment. I have an ongoing investigation with the AD team regarding the problematic user ID

#

I will probably have more information later on

split ember
#

cool, thanks for the update

tropic vapor
#

I will keep you updated!

#

and please allow me to thank you for the help.

split ember
#

you're welcome!

split ember
#

we don't have a lot of insight into that beyond what customers share. I do know of some other customers using domainIDs too without issue. I doubt the issue you've run into is related to domainIDs since originally your pollers ran for a few hours before failing. That implies it is not an issue with the user id but that something is changing during those two hours. Is it possible that the reason everything is green now is because you created a new user/password that isn't being modified in some way at runtime?

tropic vapor
#

Yes, it is possible, as we are using a new local user at the moment.

tropic vapor
#

Based on our syslog server entries, it seems we are getting some "Error: 500 Internal Server Error" messages, for which there is a KB: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Protection_and_Security/SnapMirror/ONTAP_reports__"Error%3A_500_Internal_Server_Error"_due_to_a_large_number_of_zapi_call

Do you think the default Zapi & ZapiPerf schedules don't fit? Maybe if one poller/collector hasn't finished when the next one starts, they clash? IDK, any ideas?
We use these default yml configs.

collector: Zapi
# Order here matters!
schedule:
  - data: 180s
=============================
collector: ZapiPerf
# Order here matters!
schedule:
  - counter: 1200s
  - instance: 600s
  - data: 60s
=============================

Let me note that since we moved to the local user ID, everything is working properly, so maybe the above theory is wrong. We are just trying to understand what happened in the background.
I'd appreciate your reply on this.
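
One way to test the "clash" idea is to compare how long each collection actually takes with its schedule interval. This is a minimal sketch, assuming the `apiD` and `parseD` fields in the "Collected" log entries are Go-style duration strings (as in the Zapi:Volume entry quoted earlier) and that their sum approximates one collection run; the helper names are hypothetical.

```python
import re

# Seconds per unit for Go-style duration strings such as "126ms" or "1m3s".
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def go_duration_seconds(text):
    """Convert a Go-style duration string to seconds."""
    parts = _PART.findall(text)
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


def collection_seconds(entry):
    """API time plus parse time for one 'Collected' log entry (a dict)."""
    return go_duration_seconds(entry["apiD"]) + go_duration_seconds(entry["parseD"])


def overruns(entry, interval):
    """True if one collection took longer than its schedule interval."""
    return collection_seconds(entry) > go_duration_seconds(interval)


# Values taken from the Zapi:Volume entry in this thread and the Zapi data schedule.
entry = {"collector": "Zapi:Volume", "apiD": "126ms", "parseD": "6ms"}
print(collection_seconds(entry), overruns(entry, "180s"))
```

With the values quoted in this thread, a collection takes a fraction of a second against a 180s interval, so that particular entry gives no sign of runs overlapping.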
small cedar
#

@tropic vapor Could you check whether these 500 server error entries are caused by Harvest? We have not seen such 500 errors from Harvest with customers or in any internal setups.