#Client.Timeout exceeded while awaiting headers


gaunt flint
#

One of my pollers is experiencing a client timeout. I have 17 clusters total getting polled and only this one has this error. It was noticed because our Grafana dashboard is showing "No data" for this cluster when looking at perf metrics.

This is a Linux deployment using Harvest 25.05.1...additionally, I no longer have poller logging to /var/log/harvest/poller*.log

#

I am seeing that it may be more specific to volume latency/throughput/iops metrics

hardy crest
#

@gaunt flint Could you update to the nightly version?

https://github.com/NetApp/harvest/releases/tag/nightly

We can then increase the timeout for the relevant objects (nightly includes a fix where the timeout setting was not taking effect). After upgrading, we can investigate why the logs aren’t coming through.

gaunt flint
#

Sure I can work on this.

gaunt flint
#

@hardy crest I now have the nightly build in place:

./harvest/bin/harvest version
harvest version 26.03.10-nightly (commit 6e8f9404) (build date 2026-03-10T10:11:30-0400) linux/amd64
checking GitHub for latest... you have the latest ✓

gaunt flint
#

Also, I suspect that I will need to update my Grafana dashboards since jumping from 25.05.1 to 26.03.10?

leaden cargo
#

hi @gaunt flint yes, it would be a good idea to reimport the dashboards from the nightly build you installed

gaunt flint
#

Ha...okay, I just have to edit them to prepend netapp_ to all metric names

leaden cargo
#

did you use a bin/grafana import CLI option to do that before?

gaunt flint
#

no

#

I'm not familiar but if it makes it easier then I am for it

leaden cargo
#

let me check

gaunt flint
#

example: "expr": "avg(svm_vol_avg_latency{cluster=~\"$Cluster\",datacenter=~\"$Datacenter\",svm=~\"$SVM\"})",
To this:
"expr": "avg(netapp_svm_vol_avg_latency{cluster=~\"$Cluster\",datacenter=~\"$Datacenter\",svm=~\"$SVM\"})",

leaden cargo
#

Try using --prefix "netapp" when importing or you can use customize if you want to rewrite all the dashboards and compare before importing

gaunt flint
#

I may not have the access needed to push the import to the Grafana server as I do not own that instance. I don't suspect I can just --directory to --directory?

#

Is there a test mode?

leaden cargo
#

that's what I meant by customize. customize has the same options as import except customize modifies the dashboards and writes them to a local directory so you can diff/modify as you want. import modifies the dashboards and talks directly to a Grafana server to import them. customize is like a dry run flag

#

e.g. bin/harvest grafana customize --output-dir /tmp/harvest-dashboards --prefix netapp --directory grafana/dashboards/cmode
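
Once customize has written its copies to a local directory, one way to review the result is to diff only the query expressions rather than the whole JSON. The following is a minimal sketch, not part of Harvest: `collect_exprs` and `diff_exprs` are hypothetical helper names, and the sketch assumes queries live under keys named `expr`, as in the examples quoted in this thread.

```python
import difflib


def collect_exprs(node):
    """Return every "expr" string found anywhere in a loaded dashboard."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "expr" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_exprs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_exprs(item))
    return found


def diff_exprs(original, customized):
    """Unified diff of the query expressions in two copies of one dashboard."""
    return list(
        difflib.unified_diff(
            collect_exprs(original),
            collect_exprs(customized),
            "original",
            "customized",
            lineterm="",
        )
    )
```

To use it, load the source dashboard from grafana/dashboards/cmode and its counterpart from /tmp/harvest-dashboards with `json.loads(Path(...).read_text())` and pass both to `diff_exprs`. This assumes the customized file keeps the same name as the source file, which the thread does not confirm.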

gaunt flint
#

You did indeed say that and I think I just need to slow down my reading...

#

Okay, that is pretty slick. It gets nearly all of them but here is an example of a missed one: "expr": "sum(qos_read_data + netapp_qos_write_data{cluster=~\"$Cluster\",datacenter=~\"$Datacenter\",svm=~\"$SVM\"})",

#

But one or two here and there is manageable.

leaden cargo
#

phooey! that's a bug that we'll fix

gaunt flint
#

lol

leaden cargo
#

UNACCEPTABLE

gaunt flint
#

Well it got me 99% of the way there so 🤷‍♂️

leaden cargo
#

I'm glad for that, but we'll still fix the bug 😄

gaunt flint
#

In case you need it, also expr for __name__ got missed

#

"expr": "sum by (__name__) (\n {__name__=~\"svm_nfs_.+_total\",cluster=~\"$Cluster\",datacenter=~\"$Datacenter\",nfsv=\"v3\",svm=~\"$SVM\"}\n)",

#

I figure you'll go through it but I thought I would present specifics to find easier
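
For anyone checking customized dashboards by hand, missed names like the two above can be flagged with a rough heuristic. The sketch below is an assumption-laden illustration, not how Harvest rewrites expressions: `unprefixed_metrics` is a hypothetical helper, it does not parse PromQL properly, and its list of non-metric words is intentionally short.

```python
import re

# PromQL words that look like identifiers but are not metric names.
# Deliberately incomplete; extend it for the functions your dashboards use.
NOT_METRICS = {
    "sum", "avg", "min", "max", "count", "topk", "bottomk", "rate", "irate",
    "increase", "delta", "label_replace", "label_join", "and", "or", "unless",
    "bool", "offset", "group_left", "group_right",
}

NAME_MATCHER = re.compile(r'__name__\s*=~?\s*"([^"]*)"')
GROUPING = re.compile(
    r"\b(?:by|without|on|ignoring|group_left|group_right)\s*\([^)]*\)"
)
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
VARIABLE = re.compile(r"\$\{\w+\}|\$\w+")  # Grafana variables such as ${Interval}
SELECTOR = re.compile(r"\{[^{}]*\}|\[[^\]]*\]")  # label matchers and ranges
IDENT = re.compile(r"\b[A-Za-z_]\w*")


def unprefixed_metrics(expr, prefix="netapp_"):
    """Return names in expr that look like metrics but lack the prefix."""
    # Metric names hidden inside {__name__=~"..."} matchers.
    missed = [m for m in NAME_MATCHER.findall(expr) if not m.startswith(prefix)]
    # Strip everything that cannot be a bare metric name, then scan the rest.
    text = GROUPING.sub(" ", expr)
    text = STRING.sub("", text)
    text = VARIABLE.sub("", text)
    text = SELECTOR.sub(" ", text)
    missed += [
        name
        for name in IDENT.findall(text)
        if name not in NOT_METRICS and not name.startswith(prefix)
    ]
    return missed


qos = 'sum(qos_read_data + netapp_qos_write_data{cluster=~"$Cluster",svm=~"$SVM"})'
nfs = 'sum by (__name__) (\n {__name__=~"svm_nfs_.+_total",nfsv="v3",svm=~"$SVM"}\n)'
print(unprefixed_metrics(qos))
print(unprefixed_metrics(nfs))
```

An empty list means the heuristic found nothing to fix. Anything it does report still needs a human look, since a function name missing from `NOT_METRICS` would show up as a false positive.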

leaden cargo
#

appreciate it, that helps!

gaunt flint
#

Our Grafana instance is doing the dumbest thing...upon importing, I select the datasource, but when it is done importing it is using the --prometheus-- default and subsequently wipes all the queries.

leaden cargo
#

Don't think I've seen that one before. I wanted to mention that the issues you reported yesterday are fixed in the latest nightly build.

gaunt flint
#

The issues as in bin/grafana customize ?

gaunt flint
#

Okay, so regarding the logging issue (while I wait), I found that all the logs are dumping to /var/log/messages whereas before they were going to their (poller) respective log in /var/log/harvest/. Not sure what changed

#

I guess more context, I have the pollers set up as systemd services so they fire up upon reboot

leaden cargo
#

can you share your systemd service file? It should control the log location

gaunt flint
#

[Unit]
Description="NetApp Harvest Poller instance %I"
PartOf=harvest.target
After=network-online.target
Wants=network-online.target

[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/opt/harvest
ExecStart=/opt/harvest/bin/harvest --config /opt/harvest/harvest.yml start -f %i

[Install]
WantedBy=harvest.target

leaden cargo
#

Looks good. Which OS and version is this running on? Do you see the poller logs via journalctl? Something like journalctl -fu poller@cluster-01 (replace cluster-01 with the name of your poller).

gaunt flint
leaden cargo
#

bin/harvest start -f runs the poller in the foreground and Harvest writes to stdout, which is the recommendation when running as a SystemD service since writing to stdout allows SystemD to manage the logs. Looks like journalctl is correctly picking up the logs

#

is it possible that you saw the logs in /var/log/harvest/... before you set up a SystemD service? The rpm and native installs log to /var/log/harvest/..., but when run under SystemD the poller only logs to stdout so journalctl can manage the logs

gaunt flint
#

That is what I believe is happening. I suspect the logs I have are from before the systemd setup

#

Now all pollers dump to messages

leaden cargo
#

I suspect you are seeing them in /var/log/messages because of something else. SystemD itself does not log to /var/log/messages. Maybe your system has a syslog daemon running that reads systemd's journal and writes the entries to /var/log/messages?

gaunt flint
#

rsyslogd is in fact running...I know that with my Aspera environment I have to define where to dump those logs in the rsyslogd.conf but I am not sure how to do that with the harvest pollers

#

fortunately I can still tail -f /var/log/messages | grep fas03-as to get what I need but it would be nice to keep the original method

#

Also, I added the client_timeout 1m to the restperf default.yaml (just under "collector:") but I still see the client timeout for:
Poller=fas03-as collector=RestPerf:Volume error="error making request connection error
Poller=fas03-as collector=RestPerf:NicCommon error="error making request connection error

#

Is this correct? ```collector: RestPerf
client_timeout: 1m

# Order here matters!
schedule:
  - counter: 24h
  - instance: 720h # The instance schedule is only used by the workload and workload_volume templates
  - data: 1m```

leaden cargo
#

yes, that is correct. Can you paste more of the error line? If you want the pollers to log to a file instead of journalctl, try running this from the cmdline, make sure that it does what you want, and if it does, update your systemd service file to use the logtofile CLI arg.

/opt/harvest/bin/harvest --config /opt/harvest/harvest.yml start -f fas03-as --logtofile

You can get the same thing via journalctl, though, by using journalctl -u poller@fas03-as. You don't need to grep /var/log/messages; you can search and grep the journalctl output.

gaunt flint
#

the bin/harvest worked, so now I guess I need to figure out how to format the logtofile CLI arg to ensure each poller logs to their respective file.
Error output:
Mar 12 18:00:42 sds-harvest-854010 harvest[1133276]: time=2026-03-12T18:00:42.713Z level=ERROR source=collector.go:449 msg="" Poller=fas03-as collector=RestPerf:Volume error="error making request connection error: Get \"https://fas03-as.cable.comcast.com/api/cluster/counter/tables/volume/rows?counters.name=average_latency%7Cbytes_read%7Cbytes_written%7Cnfs.access_latency%7Cnfs.access_ops%7Cnfs.getattr_latency%7Cnfs.getattr_ops%7Cnfs.lookup_latency%7Cnfs.lookup_ops%7Cnfs.other_latency%7Cnfs.other_ops%7Cnfs.punch_hole_latency%7Cnfs.punch_hole_ops%7Cnfs.read_latency%7Cnfs.read_ops%7Cnfs.setattr_latency%7Cnfs.setattr_ops%7Cnfs.total_ops%7Cnfs.write_latency%7Cnfs.write_ops%7Cother_latency%7Cread_latency%7Ctotal_ops%7Ctotal_other_ops%7Ctotal_read_ops%7Ctotal_write_ops%7Cwrite_latency&fields=%2A&max_records=500&return_records=true\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" task=data
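
Error lines like this one are long key=value records, so the useful parts (poller, collector, and the counters in the failing request) can be pulled out mechanically. The following is a minimal sketch under stated assumptions: `parse_fields` and `requested_counters` are hypothetical helpers, the sample line is a shortened copy of the one above, and `cluster.example.com` is a placeholder host.

```python
import re
from urllib.parse import parse_qs, urlsplit

# key=value pairs where the value is either a quoted string (which may
# contain backslash-escaped quotes) or a run of non-space characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Split one poller log line into a dict of its key=value fields."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def requested_counters(error_text):
    """List the counters named in the URL embedded in an error message."""
    match = re.search(r'https?://[^\s"]+', error_text)
    if not match:
        return []
    query = parse_qs(urlsplit(match.group(0)).query)
    names = query.get("counters.name", [""])[0]
    return names.split("|") if names else []


# Shortened sample with a placeholder host and only two counters.
LINE = (
    r'time=2026-03-12T18:00:42.713Z level=ERROR source=collector.go:449 msg="" '
    r'Poller=fas03-as collector=RestPerf:Volume '
    r'error="error making request connection error: Get '
    r'\"https://cluster.example.com/api/cluster/counter/tables/volume/rows'
    r'?counters.name=read_latency%7Cwrite_latency&fields=%2A&max_records=500'
    r'&return_records=true\": context deadline exceeded '
    r'(Client.Timeout exceeded while awaiting headers)" task=data'
)

fields = parse_fields(LINE)
print(fields["Poller"], fields["collector"], requested_counters(fields["error"]))
```

The same parsing works on journalctl output, because the syslog prefix before `time=` contains no key=value pairs.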

true leaf
#

looks like the import custom also doesn't do headroom_aggr_ewma_ expressions in the update

#

headroom.json: "expr": "headroom_cpu_ewma_${Interval}{cluster=~\"$Cluster\",datacenter=~\"$Datacenter\",metric=\"optimal_point_utilization\"}",

#

it would be awesome if we could have it also edit the title as we prefix not only the expressions but the filenames as well

#

that way, when we search our dashboard list for 'netapp', they all appear

#

sorry for getting off topic

#

I guess if this is fixed as of today, I missed that, sorry

gaunt flint
#

@leaden cargo Here's my updated service config: ```[Unit]
Description="NetApp Harvest Poller instance %I"
PartOf=harvest.target
After=network-online.target
Wants=network-online.target

[Service]
User=root
Group=root
Type=simple
Restart=on-failure
WorkingDirectory=/opt/harvest
ExecStart=/opt/harvest/bin/harvest --config /opt/harvest/harvest.yml start -f %i
LogsDirectory=harvest
StandardOutput=append:/var/log/harvest/poller_%i.log
StandardError=append:/var/log/harvest/poller_%i.log

[Install]
WantedBy=harvest.target```

#

Logging is good now

leaden cargo
#

@true leaf thanks for reporting the headroom issue. That is fixed in the latest nightly build.

prefixing titles was added in 26.02

#

Glad you got it sorted @gaunt flint your changes look good

gaunt flint
#

@leaden cargo So following up on the client timeout, I pared the API call down to just one volume counter using the following:

```curl \
--cert /opt/harvest/cert/harvest_ro_user.crt \
--key /opt/harvest/cert/harvest_ro_user.key \
-w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
-X GET "https://<cluster url>/api/cluster/counter/tables/volume/rows?counters.name=write_latency&fields=%2A&max_records=500&return_records=true"```

and got the following results:
```DNS: 0.000302s
Connect: 0.073385s
TLS: 0.166814s
TTFB: 149.585088s
Total: 149.585163s```

All my other clusters are snappy and have no issues, so I believe this is a cluster issue and not a Harvest issue. Thoughts?

#

I just ran the full API request and got:

DNS: 0.000318s
Connect: 0.073350s
TLS: 0.166637s
TTFB: 137.154044s
Total: 137.154247s

#

So it's not the content being requested but serving the request in general
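
The curl timers are cumulative from the start of the request, so subtracting neighbouring values shows where the time went. The sketch below is only an illustration using the first set of numbers posted above: `parse_timings` and `phase_breakdown` are hypothetical helper names, and the labels match the `-w` format string used in this thread.

```python
import re


def parse_timings(output):
    """Map each 'Label: 1.234s' entry of the curl -w output to seconds."""
    pairs = re.findall(r"(\w+):\s*([0-9.]+)s", output)
    return {label: float(value) for label, value in pairs}


def phase_breakdown(t):
    """Turn curl's cumulative timers into per-phase durations in seconds."""
    return {
        "dns": t["DNS"],
        "tcp_connect": t["Connect"] - t["DNS"],
        "tls_handshake": t["TLS"] - t["Connect"],
        # Time between the end of the TLS handshake and the first response byte.
        "server_wait": t["TTFB"] - t["TLS"],
        "body_transfer": t["Total"] - t["TTFB"],
    }


SAMPLE = """DNS: 0.000302s
Connect: 0.073385s
TLS: 0.166814s
TTFB: 149.585088s
Total: 149.585163s"""

phases = phase_breakdown(parse_timings(SAMPLE))
for name, seconds in phases.items():
    print(f"{name:>14}: {seconds:.6f}s")
```

For the single-counter request, almost the entire 149.6 seconds falls in the server_wait phase, which is well beyond the 1m client_timeout set earlier in the thread. That is consistent with the poller giving up while awaiting headers.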

leaden cargo
#

yes that looks odd for time_starttransfer to be so long

gaunt flint
#

I'll dig into the cluster side. I'll update the thread if/when I find something out. Other than that I learned a lot of good stuff in this discussion. Thanks @leaden cargo

leaden cargo
#

yes, please let us know if you uncover anything! You're welcome

gaunt flint
#

Sorry, real quick, I noticed that in the metadata dashboard there is a panel for % CPU Used, and mine has no data. The query is looking for poller_cpu_percent, but when I curl all my metrics ports I see nothing being collected for it. I am getting poller_memory_percent. Do I need to enable this in one of the yaml configs?

leaden cargo
#

poller_cpu_percent comes from the Unix collector. Do you have that specified in your harvest.yml?
e.g.

    unix:
        datacenter: local
        collectors:
            - Unix
        exporters:
            - exporter: Prometheus

and then make sure the exporter you specified is being scraped by Prometheus.

poller_memory_percent comes from the poller itself which explains why you see those metrics but not cpu
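
To confirm which of the two metrics an exporter actually serves, the text returned by its metrics endpoint can be reduced to a set of metric names. This is a minimal sketch, not a Harvest tool: `metric_names` is a hypothetical helper, and the sample text, including its label and value, is made up for illustration.

```python
import re


def metric_names(exposition_text):
    """Return the set of metric names in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first '{' (labels) or whitespace (value).
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


# Made-up sample; real output comes from the poller's metrics port.
SAMPLE = """# TYPE poller_memory_percent gauge
poller_memory_percent{poller="fas03-as"} 1.5
"""

names = metric_names(SAMPLE)
for wanted in ("poller_cpu_percent", "poller_memory_percent"):
    print(wanted, "present" if wanted in names else "missing")
```

Against a live poller, the text would come from the same URL already being curled, for example via `urllib.request.urlopen(...).read().decode()`. The port depends on the Prometheus exporter settings in harvest.yml.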