#no environment sensor metrics since ontap update to 9.11.1
@harsh violet Could you send the log files for this poller to ng-harvest-files@netapp.com?
can you tell me where to find the logs? thank you
@harsh violet Which type of Harvest installation are you using? rpm, deb, container, native, nabox?
See "Collecting logs" here: https://nabox.org/documentation/troubleshooting/
hi @harsh violet thanks for sending along your log file. One thing that jumped out was this error:
2023-08-01T12:42:15Z ERR collector/collector.go:367 > error="context deadline exceeded (Client.Timeout or context cancellation while reading body)" Poller=A300_filer collector=Zapi:Sensor task=data
That means Harvest is timing out while collecting the sensor resources, which would explain why you are no longer seeing your environment_sensor_* metrics. What isn't clear is why ONTAP takes longer in 9.11.1 than it did in 9.9.1. It might be worth increasing the client timeout for that object. By default, Harvest uses a timeout of 30s, but it can be increased as described at https://github.com/NetApp/harvest/wiki/Troubleshooting-Harvest#client_timeout
hi @pure oar / @tidal plover, I checked the client_timeout parameter in my /opt/harvest2-conf/conf/zapi/cdot/9.8.0/volume.yaml and it's 2m (I never changed that). Should I increase it even more?
It's an FMC (Fabric MetroCluster) problem! I have this issue on the FMC A300 & A700 systems. The HA and MCIP systems do show the environment_sensor parameters. All clusters run the same ONTAP version, 9.11.1P8.
So it's clear: the FMC systems haven't shown the parameters since the ONTAP update to 9.11.1P8 🤔
@harsh violet You need to increase client_timeout in /opt/harvest2-conf/conf/zapi/cdot/9.8.0/sensor.yaml
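Because the value checked earlier was in volume.yaml rather than sensor.yaml, it may help to list which templates in that directory set client_timeout at all. A minimal sketch, assuming client_timeout is written as a top-level key at the start of a line; show_client_timeouts is a hypothetical helper, not part of Harvest:

```shell
#!/bin/sh
# Hypothetical helper: print each template in a directory together with its
# client_timeout value, or "not set" when the key is absent.
# Assumption: client_timeout is a top-level key starting in column 0.
show_client_timeouts() {
  # $1 = template directory
  for f in "$1"/*.yaml; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
    v=$(sed -n 's/^client_timeout:[[:space:]]*//p' "$f" | head -n 1)
    printf '%s %s\n' "$(basename "$f")" "${v:-not set}"
  done
}

# Usage: show_client_timeouts /opt/harvest2-conf/conf/zapi/cdot/9.8.0
```

A template reported as "not set" would fall back to the 30s default mentioned above.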
thank you @tidal plover! still the same problem after updating the sensor.yaml file with client_timeout: 2m
Are you sure that it is a timeout problem and not a Fabric MetroCluster problem?
Based on the logs, it is a timeout issue. To investigate further, let's call the ONTAP system directly to see whether the ZAPI returns a response. Please run the following curl command on the affected poller, replacing USER, PASS, and CLUSTER_IP with the appropriate values:
curl --connect-timeout 30 --user USER:PASS --insecure --data-ascii '<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.130">
<environment-sensors-get-iter/>
</netapp>' -H "Content-Type: text/xml" 'https://CLUSTER_IP/servlets/netapp.servlets.admin.XMLrequest_filer'
Amazing! You're right, it is a timeout problem.
PART1: # curl --connect-timeout 30 --user harvest2 --insecure --data-ascii '<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.130">
<environment-sensors-get-iter/>
</netapp>' -H "Content-Type: text/xml" 'https://IP-OF-CLUSTER/servlets/netapp.servlets.admin.XMLrequest_filer'
Enter host password for user 'harvest2':
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE netapp SYSTEM 'file:/etc/netapp_gx.dtd'>
<netapp version='1.211' xmlns='http://www.netapp.com/filer/admin'>
<results status="passed"><attributes-list><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>GOOD</discrete-sensor-value><node-name>NODE</node-name><sensor-name>PSU2</sensor-name><sensor-type>fru</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>GOOD</discrete-sensor-value><node-name>NODE</node-name><sensor-name>PSU1</sensor-name><sensor-type>fru</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>GOOD</discrete-sensor-value><node-name>NODE</node-name><sensor-name>Fan3</sensor-name><sensor-type>fru</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>GOOD</discrete-sensor-value><node-name>NODE</node-name><sensor-name>Fan2</sensor-name><sensor-type>fru</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state>
PART2: <discrete-sensor-value>GOOD</discrete-sensor-value><node-name>NODE</node-name><sensor-name>Fan1</sensor-name><sensor-type>fru</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>IPMI_HB_OK</discrete-sensor-value><node-name>NODE</node-name><sensor-name>SP Status</sensor-name><sensor-type>discrete</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>OK</discrete-sensor-value><node-name>NODE</node-name><sensor-name>mSATA Status</sensor-name><sensor-type>discrete</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info><environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>PRESENT</discrete-sensor-value><node-name>NODE</node-name><sensor-name>mSATA Pres</sensor-name><sensor-type>discrete</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info></attributes-list><next-tag><environment-sensors-get-iter-key-td>
<key-0>NODE</key-0>
<key-1>7</key-1>
</environment-sensors-get-iter-key-td>
</next-tag><num-records>8</num-records></results></netapp>
Sorry, the text was too long. The output on the FMC A300 looks completely different from that of another HA system. (I had to hide the cluster IP and node name above.)
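The response above ends with num-records 8 and a next-tag element, which in ZAPI iterators indicates that more records remain to be fetched with a follow-up request. A rough sketch for summarizing a saved response, assuming the curl output was redirected to a file and that grep supports -o (GNU and BSD grep do); summarize_zapi_page is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: report how many sensor records one saved ZAPI page
# contains and whether the response carries a <next-tag> (more pages follow).
summarize_zapi_page() {
  # $1 = file containing the raw XML response saved from curl
  records=$( { grep -o '<environment-sensors-info>' "$1" || true; } | wc -l | tr -d ' ')
  if grep -q '<next-tag>' "$1"; then more=yes; else more=no; fi
  printf 'records=%s more_pages=%s\n' "$records" "$more"
}

# Usage: summarize_zapi_page response.xml
```

If more pages follow, the total collection time is the sum of all pages, so a slow first page alone may already exceed the configured timeout.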
I see that you got the ZAPI response. Where did you hit the timeout in this command?
It takes 120 seconds between </threshold-sensor-state></environment-sensors-info> and <environment-sensors-info>
<discrete-sensor-value>OK</discrete-sensor-value><node-name>ramfas11</node-name><sensor-name>mSATA Status</sensor-name><sensor-type>discrete</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info>
The 120-second pause occurs here, between the element above and the one below:
<environment-sensors-info><discrete-sensor-state>normal</discrete-sensor-state><discrete-sensor-value>PRESENT</discrete-sensor-value><node-name>ramfas11</node-name><sensor-name>mSATA Pres</sensor-name><sensor-type>discrete</sensor-type><threshold-sensor-state>normal</threshold-sensor-state></environment-sensors-info></attributes-list><next-tag><environment-sensors-get-iter-key-td>
<key-0>ramfas11</key-0>
<key-1>7</key-1>
</environment-sensors-get-iter-key-td>
Okay, thanks. It's best to contact ONTAP support to find out why this ZAPI is slow. On the Harvest side, we can only increase client_timeout.
Case is open. I'll give you an update when I know more. Thank you so much for your support!
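To put numbers on the delay, for example for a support case, curl's own timers can be used instead of watching the output by hand. A sketch under assumptions: zapi_body and time_zapi are hypothetical helper names, and the 300-second overall limit is an arbitrary choice. The curl options and write-out variables themselves are standard:

```shell
#!/bin/sh
# Hypothetical helper: build the same ZAPI request body as in the curl command above.
zapi_body() {
  # $1 = ZAPI name, e.g. environment-sensors-get-iter
  printf '%s\n' \
    '<?xml version="1.0" encoding="UTF-8"?>' \
    '<netapp xmlns="http://www.netapp.com/filer/admin" version="1.130">' \
    "<$1/>" \
    '</netapp>'
}

# Hypothetical helper: send the request, discard the body, and print timings.
time_zapi() {
  # $1 = cluster address, $2 = user (curl prompts for the password), $3 = ZAPI name
  curl --silent --insecure --user "$2" \
    --connect-timeout 30 --max-time 300 \
    --header 'Content-Type: text/xml' \
    --data-ascii "$(zapi_body "$3")" \
    --output /dev/null \
    --write-out 'http=%{http_code} first_byte=%{time_starttransfer}s total=%{time_total}s\n' \
    "https://$1/servlets/netapp.servlets.admin.XMLrequest_filer"
}

# Usage: time_zapi CLUSTER_IP USER environment-sensors-get-iter
```

A small first_byte value combined with a large total would match the pause seen in the middle of the response. Note that --connect-timeout only limits connection setup, so --max-time is what bounds the whole request here.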
Thanks!
problem:
::*> system node environment sensors show -node <node>
Error: show failed on node "<node>": The Service Processor on node "<node>" is not reachable. Verify that the SP or BMC is online, verify that api-service is enabled on the SP or BMC,
verify that the partner node is running, check if pings from SP or BMC to partner node work, check if hw-assist keep-alives are normal, check that network ports are configured
correctly and are functional (up). Then, try the command again.
Syslog:
Aug 31 02:26:29 (none) spcs[5730]: TSimpleServer client died: SSL_accept: sslv3 alert unsupported certificate
solution:
::*> system service-processor api-service renew-internal-certificates