#RPC: Remote system error

kindred grotto
#

Hi, we see many errors from one cluster

rod-lon-storage collector=ZapiPerf:WAFL request=perf-object-instance-list-info-iter
2024-09-18T15:17:59Z INF zapiperf/zapiperf.go:1563 > API request rejected => RPC: Remote system error [from mgwd on node "flc2-04-prod-lon-storage" (VSID: -1) to cm at 127.0.0.1] errNum="13001" statusCode="0" Poller=flc2-prod-lon-storage collector=ZapiPerf:ExtCacheObj request=perf-object-instance-list-info-iter
2024-09-18T15:17:59Z INF zapiperf/zapiperf.go:1563 > API request rejected => RPC: Couldn't make connection [from cmd on node "flc2-03-prod-lon-storage" (VSID: -1) to cm at 169.254.42.37] errNum="13001" statusCode="0" Poller=flc2-prod-lon-storage collector=ZapiPerf:FCVI request=perf-object-instance-list-info-iter
2024-09-18T15:18:00Z INF collector/collector.go:414 > Entering standby mode error="polling href=[api/cluster/counter/tables/svm_nfs_v42?return_records=true] err: API request rejected => error making request StatusCode: 404, Message: API not found, Code: 3, API: /api/cluster/counter/tables/svm_nfs_v42?return_records=true" Poller=flc2-prod-lon-storage collector=RestPerf:NFSv42 task=counter

The cluster runs ONTAP 9.10.1P18. Harvest is 24.08. The errors started a day ago. We did many volume moves.
inland geyser
#

hi @kindred grotto The first three are errors from ONTAP that Harvest is logging. 13001 errors from ONTAP are logged for a variety of reasons. The most common is when an object is requested that does not exist. For example, if you ask for nvme metrics but the cluster has none, ONTAP will return a 13001. That's why Harvest logs them at INFO level instead of ERROR: it's not really an error to ask for something that doesn't exist.

What you pasted looks truncated though so I can't say for sure.

The last log entry you pasted is because you are trying to use the RestPerf collector with a version of ONTAP that does not support RestPerf. https://netapp.github.io/harvest/nightly/architecture/rest-strategy/
The first version of ONTAP with REST performance metrics is 9.11.1, although it's better to wait until 9.12.1 for full ZapiPerf parity.

kindred grotto
#

Thanks Chris. Harvest is only collecting very limited metrics on this cluster. Maybe we have some network issue. After filtering out the noise, I see many errors like this:
2024-09-18T16:03:31Z ERR collectors/commonutils.go:76 > Failed to fetch data error="error making request connection error Get \"https://flc2.prod.lon.storage.tmcs/api/storage/volumes?fields=anti_ransomware.dry_run_start_time%2Canti_ransomware.state%2Cclone.parent_snapshot.name%2Cclone.split_estimate%2Cis_object_store%2Cname%2Csvm.name&is_constituent=false&return_records=true\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" Poller=flc2-prod-lon-storage href=api/storage/volumes?return_records=true&fields=anti_ransomware.dry_run_start_time,anti_ransomware.state,clone.parent_snapshot.name,clone.split_estimate,is_object_store,name,svm.name&is_constituent=false hrefLength=202 object=Volume plugin=Rest:Volume

inland geyser
#

That one is from the Rest collector - perhaps Rest is not enabled on your cluster? We generally don't recommend using the Rest collector until your cluster is 9.12+

kindred grotto
#

we were supposed to be all on 9.14 by now but ran into the double panic issue. will upgrade to 9.14.1P8 soon 😀

inland geyser
#

if you want, you could enable Zapi and ZapiPerf and disable Rest and RestPerf for this cluster in the meantime
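
A minimal sketch of what that could look like in `harvest.yml`, assuming the poller is named `flc2-prod-lon-storage` as in the logs above. Listing `collectors` on the poller is the standard Harvest way to choose collectors per cluster; the rest of the poller's settings are not shown in this thread, so they are left out here rather than guessed.

```yaml
# harvest.yml (fragment, illustrative)
Pollers:
  flc2-prod-lon-storage:
    # existing addr, auth, and exporter settings stay as they are
    collectors:
      - Zapi        # config and capacity metrics over ZAPI
      - ZapiPerf    # performance metrics over ZAPI
      # Rest and RestPerf are intentionally omitted until the cluster is upgraded
```

The poller has to be restarted for the change to take effect.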

kindred grotto
#

rest seems enabled


```
                        Vserver: flc2-prod-lon-storage
                   Service Name: rest
                Type of Vserver: admin
         Version of Web Service: 1.0.0
     Description of Web Service: Remote Administrative REST API Support
Long Description of Web Service: This service supports a RESTful Interface that can be used to remotely manage all elements of the cluster infrastructure.
           Service Requirements: -
       Default Authorized Roles: admin, readonly, vsadmin, vsadmin-protocol,
                                 vsadmin-readonly, vsadmin-volume
                        Enabled: true
                       SSL Only: false
```
inland geyser
#

agreed, looks like Rest is enabled. In that case, the Rest volume ERR may be due to a timeout while retrieving the volumes. As you said, it could be a network issue, or ONTAP taking too long to return a large number of volumes on this cluster. You can increase the amount of time that Harvest waits before giving up by adding a line like this to your volume.yaml template: client_timeout: 2m
Given that you are running 9.10, the template in conf/rest/9.12.0/volume.yaml will be the one used, so edit that file and add the line client_timeout: 2m at line 4
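
As a sketch, the added line would look like the fragment below. Only `client_timeout` is new; the file's other keys are not reproduced here and should stay unchanged. The `2m` value is the one suggested above, not a tuned recommendation.

```yaml
# conf/rest/9.12.0/volume.yaml (fragment, illustrative)
# Add this top-level key next to the template's existing keys.
client_timeout: 2m   # how long Harvest waits for ONTAP before giving up on the request
```

Restart the poller afterwards so the template is reloaded.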

kindred grotto
#

so the problem was "flc2-04-prod-lon-storagedebug nodewatchdog.monitor.history: secd null[secd]". after bouncing that node, harvest works again. BTW that nodewatchdog issue also prevented us from deleting an SVM.

```
       A required service (secd) is not yet available; try again later
       Wait a few minutes, and then try the Vserver delete operation again. If the error persists, contact technical support for assistance.
```
timber python
#

That's odd secd died. You might want to open a case with our NAS team to investigate.

#

Did you get a core file?

kindred grotto
#

no. not in the past week

```
Node:Type Core Name                                   Saved Panic Time
--------- ------------------------------------------- ----- -----------------
flc2-01-prod-lon-storage:kernel
          core.537135698.2018-11-13.15_20_13.nz      true  11/13/2018 15:20:13
              Size: 4096259753 bytes (3.81) GB
              MD5 checksum: 563645e5f37462afebd9dd9ce95f4507
flc2-04-prod-lon-storage:application
          vserverdr.42965.539030377.2024-09-09.19_19_24.ucore.bz2 true 9/9/2024 19:19:24
          vifmgr.42619.539030377.2024-09-09.19_19_25.ucore.bz2 true 9/9/2024 19:19:25
          cmd.42948.539030377.2024-09-09.19_19_26.ucore.bz2 true 9/9/2024 19:19:26
          ndo_manager.43785.539030377.2024-09-09.19_19_27.ucore.bz2 true 9/9/2024 19:19:27
          vifmgr.43792.539030377.2024-09-09.19_19_28.ucore.bz2 true 9/9/2024 19:19:28
          perfstatd.42967.539030377.2024-09-09.19_19_29.ucore.bz2 true 9/9/2024 19:19:29
          vifmgr.44437.539030377.2024-09-09.19_19_30.ucore.bz2 true 9/9/2024 19:19:30
          hashd.42937.539030377.2024-09-09.19_19_31.ucore.bz2 true 9/9/2024 19:19:31
          hashd.44933.539030377.2024-09-09.19_19_32.ucore.bz2 true 9/9/2024 19:19:32
          perfstatd.45101.539030377.2024-09-09.19_19_33.ucore.bz2 true 9/9/2024 19:19:33
          vifmgr.45050.539030377.2024-09-09.19_19_34.ucore.bz2 true 9/9/2024 19:19:34
12 entries were displayed.
```
inland geyser
#

thanks for the follow-up @kindred grotto good to know for troubleshooting!