#Harvest (?) timeouts on Ontap S3 requests


fervent moat
#

Hi.
I keep getting an incorrect S3 bucket count in Prometheus (and Grafana, of course) from my NetApp via Harvest.
Could you please point me to some additional troubleshooting options / possible solutions / more ideas what to tune?

Ontap Version 9.8P16

Docker container images in use:
prom/prometheus:v2.44.0
grafana/grafana:9.5.2
rahulguptajss/harvest:23.05.0-1
prom/alertmanager:v0.25.0

curl -k -X GET "https://<ontap_ip>/api/protocols/s3/buckets?return_records=true&return_timeout=15" -H "accept: application/json" -H "authorization: Basic <token>" | grep "num_records"
"num_records": 33,

curl -k -X GET "https://<ontap_ip>/api/protocols/s3/buckets?return_records=true&return_timeout=60" -H "accept: application/json" -H "authorization: Basic <token>" | grep "num_records"
"num_records": 103,
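
A hedged sketch of why the two calls can disagree: as far as I understand the ONTAP REST API, when return_timeout expires before the whole collection has been walked, the response carries only the records gathered so far plus a _links.next.href continuation link. The Go snippet below checks a response body for that link. The collectionPage struct, the isPartial helper, and the sample JSON are illustrative assumptions, not Harvest code and not captured output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collectionPage models only the fields of a REST collection response that
// matter here: "num_records" (seen in the thread) and the "_links.next.href"
// continuation link (an assumption about the response shape).
type collectionPage struct {
	NumRecords int `json:"num_records"`
	Links      struct {
		Next struct {
			Href string `json:"href"`
		} `json:"next"`
	} `json:"_links"`
}

// isPartial reports whether the page advertises a continuation link,
// meaning num_records covers only part of the collection.
func isPartial(body []byte) (partial bool, numRecords int, err error) {
	var page collectionPage
	if err := json.Unmarshal(body, &page); err != nil {
		return false, 0, err
	}
	return page.Links.Next.Href != "", page.NumRecords, nil
}

func main() {
	// Hypothetical response body, trimmed to the relevant fields.
	sample := []byte(`{"num_records": 33, "_links": {"next": {"href": "/hypothetical/next/page"}}}`)
	partial, n, err := isPartial(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("records=%d partial=%t\n", n, partial)
}
```

If the short-timeout response turns out to contain such a link, the count of 33 is just the first page rather than a wrong total.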

vi harvest/conf/zapiperf/default.yaml
...
client_timeout: 80
...

vi harvest/conf/zapiperf/cdot/9.8.0/ontap_s3.yaml
...
client_timeout: 80
...

Tried all the options from here:
https://github.com/NetApp/harvest/wiki/Troubleshooting-Harvest#some-of-my-clusters-are-not-showing-up-in-grafana

Thanks in advance.

upper sigil
#

It seems the direct REST call timed out before ONTAP had finished returning all the data. Could you try this from the Harvest CLI and see whether it helps: ./bin/harvest rest -p POLLER_NAME show data --api protocols/s3/buckets

fervent moat
#

The 2nd curl, with the timeout=60, shows the real data.
I cannot enter into the harvest cli, it's containerized with Docker 😕

sudo docker exec -it $(sudo docker ps -a | grep harvest | awk '{print $1}') /bin/sh
OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
sudo docker exec -it $(sudo docker ps -a | grep harvest | awk '{print $1}') /bin/bash
OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown

fervent moat
upper sigil
#

I am trying to find any recommendation from the ONTAP side for invoking heavy REST calls. The default return_timeout is 15s for all REST calls to ONTAP.

fervent moat
#

Yep. I've changed it to 80 already; it seems it's either not being applied or not enough.
And as far as I see, 120s is the maximum timeout acceptable by the API.

upper sigil
#

There are a few things I would like to add:
1. You mentioned changing harvest/conf/zapiperf/cdot/9.8.0/ontap_s3.yaml, but 23.05 does not ship that file: https://github.com/NetApp/harvest/tree/release/23.05.0/conf/zapiperf/cdot/9.8.0
2. The OntapS3 object is collected via REST only, not via Zapi/ZapiPerf. So the right change for a custom timeout is:
a. Enable the Rest collector.
b. Add the line return_timeout: 80 to this file: https://github.com/NetApp/harvest/blob/release/23.05.0/conf/rest/9.7.0/ontap_s3.yaml
Rest has return_timeout, whereas Zapi has client_timeout.

fervent moat
#

Yep, forgot that I've already added the "client_timeout: 80" in the harvest/conf/rest/9.7.0/ontap_s3.yaml as well.
Changed it to the "return_timeout =120" now. Let's see if it helps.
P.S. Rest collector has been already enabled in the harvest/conf/harvest.yml

upper sigil
#

It's right that full REST parity comes with 9.12+. OntapS3 is a special case: no Zapi exists for it, and it has been available via REST since 9.7. Attaching a screenshot of the REST docs.
cluster-1x-x::> show-ontapi -command "vserver object-store-server bucket show"
(security login role show-ontapi)
There are no entries matching your query.

cluster-1x-x::> show-ontapi -command "vserver object-store-server bucket policy statement show"
(security login role show-ontapi)
There are no entries matching your query.

fervent moat
#

With any return_timeout between 60 and 120s, I get no data at all, and the Harvest container logs now show timeout errors:
... ERR collector/collector.go:367 > error="failed to fetch data: error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=<cut> collector=Rest:OntapS3 task=data.

stone isle
#

@fervent moat Thanks for reporting the issue. It helped identify a Harvest bug, which is now fixed as part of https://github.com/NetApp/harvest/issues/2109. You can try the fix in the nightly builds: https://github.com/NetApp/harvest/releases/tag/nightly
If you are using the Docker workflow, you can follow the steps here to upgrade:
https://netapp.github.io/harvest/23.05/install/containers/#upgrade-harvest

You can revert all changes related to return_timeout from templates. They are not needed after this fix.

fervent moat
#

Well, the joy didn't last long. Every 5 to 10 minutes or so, Prometheus now seems to get nothing at all:

2023-05-28T06:46:18Z ERR collectors/commonutils.go:22 > Failed to fetch data error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar href=api/protocols/s3/services?return_records=true&fields=svm.name,name,is_http_enabled,is_https_enabled,secure_port,port object=OntapS3 plugin=Rest:OntapS3Service
2023-05-28T06:46:18Z ERR collector/collector.go:397 > error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar collector=Rest:OntapS3 plugin=OntapS3Service
2023-05-28T06:56:44Z ERR collectors/commonutils.go:22 > Failed to fetch data error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar href=api/protocols/s3/services?return_records=true&fields=svm.name,name,is_http_enabled,is_https_enabled,secure_port,port object=OntapS3 plugin=Rest:OntapS3Service
2023-05-28T06:56:44Z ERR collector/collector.go:397 > error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar collector=Rest:OntapS3 plugin=OntapS3Service

Though when the data does arrive, it's accurate.

#

That's with the latest nightly:
version="harvest version 23.05.26-nightly (commit 8d22d74f) (build date 2023-05-26T04:44:23+0000) linux/amd64\n"

stone isle
#

@fervent moat Thank you for providing the update. It appears that there is a delay in the API response on the ONTAP side. Could you provide us with the full logs?

I have a couple of additional questions:

1: Are these error logs specific to this particular API, or are timeouts occurring for other APIs as well?
2: Is this issue limited to one cluster, or is it affecting all clusters?

fervent moat
#

That's what makes me curious too, but I haven't yet connected our other clusters to this monitoring set [with the newest Prom, Grafana & Harvest].
I'm planning to do it this week and then I'll update.

#

Could you please point me to a how-to on the ONTAP API logs? I'm not much of a NetApp guy 😬

stone isle
#

We can try increasing client_timeout for this template. By default it is 30s for Rest Collector. Let's set it to 1m and see.
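
To illustrate the distinction with a hedged sketch: as the Client.Timeout wording in the log suggests, client_timeout appears to be the deadline enforced by the HTTP client on the Harvest side, so a response that takes longer than that surfaces as a deadline error regardless of what the server was asked to do. The Go program below reproduces the effect against a local test server. slowServer and fetchWithTimeout are hypothetical helpers written for this example, not Harvest functions, and the local server is a stand-in, not ONTAP.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchWithTimeout issues a GET with the given client-side timeout and
// reports whether the request failed because that timeout expired.
func fetchWithTimeout(url string, timeout time.Duration) (timedOut bool, err error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		var netErr net.Error
		return errors.As(err, &netErr) && netErr.Timeout(), err
	}
	defer resp.Body.Close()
	return false, nil
}

// slowServer starts a local stand-in for a slow API that waits for delay
// before answering.
func slowServer(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay)
		fmt.Fprint(w, `{"num_records": 0}`)
	}))
}

func main() {
	srv := slowServer(200 * time.Millisecond)
	defer srv.Close()

	// Client deadline shorter than the server's response time: the call fails.
	timedOut, err := fetchWithTimeout(srv.URL, 50*time.Millisecond)
	fmt.Println("short timeout: timedOut =", timedOut, "err =", err)

	// Client deadline longer than the server's response time: the call succeeds.
	timedOut, err = fetchWithTimeout(srv.URL, 2*time.Second)
	fmt.Println("long timeout:  timedOut =", timedOut, "err =", err)
}
```

The takeaway is that the client-side deadline has to exceed the time the server actually needs, otherwise the request is cut off on the client even if the server would have answered.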

fervent moat
#

You mean in the harvest/conf/rest/9.7.0/ontap_s3.yaml, right?

stone isle
fervent moat
#

NetApp Ontap Version 9.8P16
Docker Compose with container images in use:
prom/prometheus:v2.44.0
grafana/grafana:9.5.2
harvest version 23.05.26-nightly (commit 8d22d74f)
prom/alertmanager:v0.25.0

stone isle
#

Okay, you can collect the Harvest poller logs as below:

docker logs CONTAINER_NAME >& poller.log

fervent moat
#

Oh, I thought you wanted me to collect the logs from the ONTAP API side.
The errors I've provided earlier are exactly from the harvest container log and are the only ones in its log.

2023-05-29T08:28:17Z ERR collectors/commonutils.go:22 > Failed to fetch data error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar href=api/protocols/s3/services?return_records=true&fields=svm.name,name,is_http_enabled,is_https_enabled,secure_port,port object=OntapS3 plugin=Rest:OntapS3Service
2023-05-29T08:28:17Z ERR collector/collector.go:397 > error="error making request context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=ragnar collector=Rest:OntapS3 plugin=OntapS3Service
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "health" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:Status task=data
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "anti_ransomware.state" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:Volume task=data
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "anti_ransomware_default_volume_state" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:SVM task=data

I've changed the client_timeout in the S3 yaml to 120; the errors still happen.

I'm trying to start polling an additional cluster from this Harvest now.

#

Hm... 2023-05-29T09:00:32Z INF rest/rest.go:146 > Using default timeout Poller=ragnar collector=Rest:OntapS3 timeout=30s
I've already put "client_timeout: 120" into harvest/conf/rest/default.yaml, and it still seems to ignore this config:
rest/rest.go:146 > Using default timeout Poller=ragnar collector=Rest:ClusterPeer timeout=30s

stone isle
#

Okay, let me check. Could you share harvest/conf/rest/default.yaml?

fervent moat
stone isle
#

Okay, the value is incorrect. It needs to follow the Go duration format.

#

For example, in your case it should be either 120s or 2m.
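
A minimal sketch of that rule, assuming the value is parsed with Go's standard time.ParseDuration (I have not checked how Harvest parses it internally; parseTimeout is a hypothetical wrapper for this example). A bare number has no unit and is rejected, which would match the "Using default timeout" fallback seen in the log.

```go
package main

import (
	"fmt"
	"time"
)

// parseTimeout accepts Go duration strings with a unit, such as "120s"
// or "2m", and returns an error for anything else.
func parseTimeout(value string) (time.Duration, error) {
	return time.ParseDuration(value)
}

func main() {
	for _, v := range []string{"120", "120s", "2m"} {
		d, err := parseTimeout(v)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", v, err)
			continue
		}
		fmt.Printf("%q accepted: %v\n", v, d)
	}
}
```

Here "120" is rejected for its missing unit, while "120s" and "2m" parse to the same two-minute duration.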

fervent moat
stone isle
#

Thanks. Corrected

fervent moat
#

The timeout errors in the log are now replaced with the warning:
2023-05-29T09:15:57Z WRN collector/collector.go:456 > lagging behind schedule Poller=ragnar collector=Rest:OntapS3 lag=1m33.789760339s
2023-05-29T09:24:25Z WRN collector/collector.go:456 > lagging behind schedule Poller=ragnar collector=Rest:OntapS3 lag=5m28.146181909s
The "holes" of data absence are still there.

stone isle
#

Yes, it means ONTAP is slow to respond to this API.

#

There are no issues due to Harvest in this case.

fervent moat
#

I've added another cluster, but it doesn't even appear in the log:

Exporters:
  prometheus:
    exporter: Prometheus
    local_http_addr: prometheus
    port: 12990
  prom_exp_01:
    exporter: Prometheus
    port_range: 13000-13100
  prom_exp_02:
    exporter: Prometheus
    port_range: 13101-13200
Defaults:
  collectors: # Order here matters!
    - Rest
    - Zapi
    - ZapiPerf

Pollers:
  ragnar:
    datacenter: TST
    addr: <cut>
    username: <cut>
    password: <cut>
    use_insecure_tls: true
    exporters:
      - prom_exp_01

  loadvantagex:
    datacenter: LOAD
    addr: <cut>
    username: <cut>
    password: <cut>
    use_insecure_tls: true
    exporters:
      - prom_exp_02

stone isle
fervent moat
#

Tnx, I forgot there's a procedure to build that yaml 👍 I will.

#

Just BTW, these look like an actual bug. If it's not a misconfiguration on my side, of course.
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "health" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:Status task=data
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "anti_ransomware.state" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:Volume task=data
2023-05-29T09:00:45Z ERR collector/collector.go:367 > error="failed to fetch data: error making request 262197: The value "anti_ransomware_default_volume_state" is invalid for field "fields" (<field,...>)" Poller=ragnar collector=Rest:SVM task=data

stone isle
#

If you are using clusters running ONTAP older than 9.12, consider changing the priority order of the collectors as below:

Exporters:
  prometheus1:
    exporter: Prometheus
    port_range: 12990-14000
    add_meta_tags: false
Defaults:
  collectors:
    - Zapi
    - ZapiPerf
    - Rest
Pollers:
  u2:
    datacenter: u2
    addr: -REDACTED-
    username: -REDACTED-
    password: -REDACTED-
    exporters:
      - prometheus1
  dc1:
    datacenter: dc1
    addr: -REDACTED-
    username: -REDACTED-
    password: -REDACTED-
    exporters:
      - prometheus1

stone isle
fervent moat
#

Ah, I get it.

stone isle
#

It is advisable to utilize Zapi/ZapiPerf prior to 9.12. However, Rest can still be employed if desired. When applying precedence order as mentioned above, the Rest Collector will only load templates that are unavailable in Zapi.
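
The precedence rule described above can be sketched as follows. This is an illustration of the idea only, not Harvest's implementation: assignCollectors is a hypothetical function, and the template inventory is invented for the example.

```go
package main

import "fmt"

// assignCollectors walks the collectors in priority order and gives each
// object to the first collector that has a template for it.
func assignCollectors(order []string, templates map[string][]string) map[string]string {
	owner := make(map[string]string)
	for _, collector := range order {
		for _, object := range templates[collector] {
			if _, taken := owner[object]; !taken {
				owner[object] = collector
			}
		}
	}
	return owner
}

func main() {
	// Hypothetical template inventory; a real install has many more objects.
	templates := map[string][]string{
		"Zapi": {"Volume", "SVM"},
		"Rest": {"Volume", "SVM", "OntapS3"},
	}
	owner := assignCollectors([]string{"Zapi", "ZapiPerf", "Rest"}, templates)
	for _, object := range []string{"Volume", "SVM", "OntapS3"} {
		fmt.Printf("%s -> %s\n", object, owner[object])
	}
}
```

With Zapi listed first, Volume and SVM go to Zapi, and only OntapS3, which has no Zapi template, falls through to Rest.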

#

Furthermore, if you have already configured the port-range for Prometheus in the Harvest configuration, there is no need to repeat it under the Exporters section. I have provided an example of this in the previous configuration.