Hello team. I have deployed new monitoring for NetApp systems and am getting RestPerf errors.
Some of the NetApp systems run ONTAP 9.14.1P4+.
All pollers run in containers. Sometimes restarting the container helps, sometimes it doesn't. Can you help investigate what the problem is?
I get this error:
time=2024-12-19T09:17:19.739Z level=ERROR source=collector.go:434 msg="" Poller= collector=RestPerf:NFSv41 error="configuration error => missing timestamp metric" task=data
time=2024-12-19T09:17:19.739Z level=ERROR source=collector.go:434 msg="" Poller= collector=RestPerf:NFSv41Node error="configuration error => missing timestamp metric" task=data
time=2024-12-19T09:17:19.739Z level=ERROR source=collector.go:434 msg="" Poller= collector=RestPerf:NVMfLif error="configuration error => missing timestamp metric" task=data
time=2024-12-19T09:17:19.739Z level=ERROR source=collector.go:434 msg="" Poller= collector=RestPerf:HostAdapter error="configuration error => missing timestamp metric" task=data
#Harvest. RestPerf Error missing timestamp metric
@bronze quarry Could you share the Harvest version?
Hi.
harvest version 2.0-nightly (commit 9cad88ee) (build date 2024-12-19T08:52:01+0000) linux/amd64
But I got the same errors on the latest version.
Okay. I suspect these errors are side effects of other errors, such as timeouts. Could you share the Harvest logs at https://upload.nabox.org/maxu-coqo-fixa
I see that these errors are side effects of timeouts that happened earlier:
time=2024-12-19T10:23:42.165Z level=ERROR source=collector.go:434 msg="" Poller=xxxx collector=RestPerf:Volume error="failed to fetch data. href=[api/cluster/counter/tables/volume?return_records=true&max_records=500] err: error making request connection error Get \"https://a402-dbaas-ds/api/cluster/counter/tables/volume?max_records=500&return_records=true\": dial tcp :443: i/o timeout" task=counter
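To check whether the "missing timestamp metric" errors follow earlier timeouts, the ERROR lines can be grouped per collector. A minimal Python sketch, assuming the key=value log layout shown above; the function names and the labels "timeout" and "missing_timestamp" are made up for this illustration and are not part of Harvest:

```python
import re
from collections import defaultdict

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a run of non-space characters.
PAIR_RE = re.compile(r'(\w+)=("(?:\\.|[^"\\])*"|\S*)')


def parse_line(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def classify(error_text):
    """Label an error message; the labels are this sketch's own."""
    if "i/o timeout" in error_text:
        return "timeout"
    if "missing timestamp metric" in error_text:
        return "missing_timestamp"
    return "other"


def summarize(lines):
    """Return {collector: [(time, task, label), ...]} for ERROR lines only."""
    summary = defaultdict(list)
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") != "ERROR":
            continue
        label = classify(fields.get("error", ""))
        summary[fields.get("collector", "")].append(
            (fields.get("time", ""), fields.get("task", ""), label)
        )
    return dict(summary)
```

If a collector shows a timeout on task=counter followed by missing_timestamp on task=data, that would be consistent with the pattern described here.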
It seems like the cluster is unable to process requests.
Yes, I saw that. But it's strange, because other metrics are collected fine.
Yeah, I suspect the cluster is unable to respond to many requests at once.
Let's try enabling jitter for the RestPerf collector. Could you set it to 1m? The steps are listed at https://github.com/NetApp/harvest/discussions/2856
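To illustrate why jitter can help, here is a small Python simulation. It assumes that jitter delays each collector's request by a random amount within the jitter window. That is a simplification for illustration, not Harvest's actual scheduler, and the collector count of 40 is made up:

```python
import random


def start_offsets(num_collectors, jitter_seconds, seed=0):
    """Return sorted start offsets (seconds) for one poll cycle.

    Assumption of this sketch: each collector waits a uniformly random
    time between 0 and jitter_seconds before sending its request.
    """
    rng = random.Random(seed)
    if jitter_seconds <= 0:
        return [0.0] * num_collectors
    return sorted(rng.uniform(0, jitter_seconds) for _ in range(num_collectors))


def max_in_window(offsets, window_seconds):
    """Largest number of requests that start within any window of this length."""
    best = 0
    left = 0
    for right in range(len(offsets)):
        # Advance the left edge until the span fits inside the window.
        while offsets[right] - offsets[left] > window_seconds:
            left += 1
        best = max(best, right - left + 1)
    return best


if __name__ == "__main__":
    for jitter in (0, 60, 180):
        offsets = start_offsets(num_collectors=40, jitter_seconds=jitter)
        print(f"jitter={jitter:>3}s  max requests starting within 1s: "
              f"{max_in_window(offsets, 1.0)}")
```

Under this assumption, a wider jitter window spreads the same number of requests over more time, so fewer of them reach the cluster at the same moment.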
I'll check and come back with the results.
sure
collector=RestPerf:CIFSvserver error="configuration error => missing timestamp metric" task=data
collector=RestPerf:SMB2 error="configuration error => missing timestamp metric" task=data
collector=RestPerf:Lun error="configuration error => missing timestamp metric" task=data
collector=RestPerf:NFSv4 error="configuration error => missing timestamp metric" task=data
collector=RestPerf:NFSv4Node error="configuration error => missing timestamp metric" task=data
collector=RestPerf:NicCommon error="configuration error => missing timestamp metric" task=data
collector=RestPerf:Path error="configuration error => missing timestamp metric" task=data
collector=RestPerf:VolumeSvm error="configuration error => missing timestamp metric" task=data
This is the file limit_60s.yaml:
root@575d57a25911:/opt/harvest# cat conf/restperf/limit_60s.yaml
jitter: 1m
collector: RestPerf
# Order here matters!
schedule:
  - counter: 60m
  - data: 1m
objects:
  CIFSNode: cifs_node.yaml
  Disk: disk.yaml
  ExtCacheObj: ext_cache_obj.yaml
  FCVI: fcvi.yaml
  FcpPort: fcp.yaml
  HeadroomAggr: resource_headroom_aggr.yaml
  HeadroomCPU: resource_headroom_cpu.yaml
Could you share the logs again at the same location?
Done.
It's the same timeouts in logs. Let's curl the endpoint to isolate the issue from Harvest. Do you get any response back if you curl endpoint api/cluster/counter/tables/lun?return_records=true&max_records=500?
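The same check can also be scripted. A minimal Python sketch using only the standard library, assuming HTTP basic authentication; the host, user, and password arguments are placeholders, and the function names are this sketch's own:

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request


def counter_table_url(host, table, max_records=500):
    """Build the counter-table URL that appears in the Harvest log."""
    query = urllib.parse.urlencode(
        {"return_records": "true", "max_records": max_records}
    )
    return f"https://{host}/api/cluster/counter/tables/{table}?{query}"


def fetch_counter_table(host, table, user, password, timeout=30, verify_tls=True):
    """GET the counter table and return the decoded JSON body.

    A connection timeout here, outside Harvest, would suggest the problem
    lies between the client and the cluster rather than in Harvest.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        counter_table_url(host, table),
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for clusters with self-signed certificates: this skips
        # certificate verification entirely.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return json.load(resp)
```

For example, fetch_counter_table("cluster.example", "lun", "user", "secret") would request the lun table from a hypothetical host named cluster.example.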
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
but 443 port [tcp/https] succeeded!
Could you try below
great
{
  "name": "lun",
  "description": "This table contains LUN-level SAN counters which are shared between 7-mode and C-mode. These counters are available for every mapped logical unit. The alias name for lun:node is lun_node.",
  "counter_schemas": [
    {
      "name": "average_read_latency",
      "description": "Average read latency in microseconds for all operations on the LUN",
      "type": "average",
      "unit": "microsec",
      "denominator": {
        "name": "read_ops"
      }
    },
    {
      "name": "average_write_latency",
      "description": "Average write latency in microseconds for all operations on the LUN",
      "type": "average",
      "unit": "microsec",
      "denominator": {
        "name": "write_ops"
      }
    },
    {
      "name": "average_xcopy_latency",
      "description": "Average latency in microseconds for xcopy requests",
      "type": "average",
      "unit": "microsec",
      "denominator": {
        "name": "xcopy_requests"
      }
    },
    {
      "name": "caw_requests",
      "description": "Number of compare and write requests",
      "type": "delta",
      "unit": "none"
    },
Great. I noticed that even with a jitter of 1 minute, the requests are still very close together. It's odd that ONTAP is failing under these conditions. We could try increasing the jitter value, but I doubt it will completely resolve the issue.
We could try increasing the polling schedule, if you are okay with that? RestPerf currently polls every 1m.
We can try jitter: 3m and also change the data: value at https://github.com/NetApp/harvest/blob/main/conf/restperf/default.yaml#L7 to 3m, and see.
Okay, I'll come back with the results.
This is not a vsim, right?
Okay
What is the output of the command below in the ONTAP CLI, in diag mode?
system services web show -fields per-address-limit, wait-queue-capacity
Hi Rahul. jitter: 3m really helped. Thank you so much for your help.
Great! Did you change data to 3m as well?
No, I left data at 1m with jitter at 3m, and it's working perfectly.
Cool. Do you have other API load on this cluster as well, apart from Harvest?
Hm, yes, it does. But that is only ZAPI, not REST.
Okay. Yeah, I suspect that the cluster is unable to handle many requests at the same time, leading to timeouts.
Yes, and sometimes I see an error like this: time=2024-12-20T09:07:24.819Z level=ERROR source=collector.go:436 msg="" Poller=a800 collector=Rest:QosPolicyFixed error="failed to fetch data: error making request connection error Get "https://a800/api/private/cli/qos/policy-group?class=user_defined&fields=class%2Cis_shared%2Cnum_workloads%2Cpolicy_group%2Cthroughput_policy%2Cuuid%2Cvserver&ignore_unknown_fields=true&max_records=500&return_records=true\": dial tcp : i/o timeout" task=dat
I believe opening an ONTAP support case would be beneficial, as they can examine the logs on the ONTAP side to determine the root cause of this issue.
per-address-limit wait-queue-capacity
80                192
That looks correct