#Weird node latency numbers after upgrading to 25.11.0 (and nabox 4.1.0)


silent locust
#

After upgrading to 25.11.0 along with nabox 4.1.0, I get intermittent but recurring spikes of extreme node latency (>500ms) reported for one node. At the same time, no corresponding latency increase is reported on the volume or disk graphs, and we have not seen a similar slowdown on clients.
The latency spikes appeared just after the Harvest upgrade. It could obviously be a coincidence and the latency could be real, but I am at a loss at the moment and trying to cover all bases.

Is there a chance that this is a Harvest/Grafana problem, or should I really dig into it?
I am on a FAS8300 and ONTAP 9.16P3.

orchid tiger
#

@silent locust Could you share the name of the metric that is reporting high latency? Is it node_volume_avg_latency?

silent locust
#

@orchid tiger Yes, node_volume_avg_latency, and if I check volume_avg_latency for the volumes on that particular node, nothing even comes close to the reported node latency.
Nothing jumps out with the statistics volume show command either. That is why I am confused.
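
The comparison described above (node-level latency versus the busiest volume on that node) can be sketched roughly as follows. This is only an illustration under assumptions: the function name, the `ratio` parameter, and the data layout are hypothetical, and the series are assumed to have been exported from Prometheus beforehand, in the same unit and with aligned timestamps.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: unix timestamp -> latency value (same unit everywhere).
Series = Dict[int, float]


def unexplained_node_latency(
    node: Series,
    volumes: Dict[str, Series],
    ratio: float = 2.0,
) -> List[Tuple[int, float, float]]:
    """Return (timestamp, node_latency, max_volume_latency) for every sample
    where the node latency is more than `ratio` times the highest latency
    reported by any volume at that timestamp."""
    flagged = []
    for ts in sorted(node):
        node_lat = node[ts]
        # Highest per-volume latency at this timestamp (0.0 if none reported).
        vol_max = max(
            (series[ts] for series in volumes.values() if ts in series),
            default=0.0,
        )
        if node_lat > ratio * vol_max:
            flagged.append((ts, node_lat, vol_max))
    return flagged


# Made-up sample data: one node series and two volume series.
node_series = {0: 1.0, 60: 500.0, 120: 2.0}
volume_series = {
    "vol1": {0: 0.8, 60: 1.2, 120: 1.5},
    "vol2": {0: 1.0, 60: 0.9},
}
print(unexplained_node_latency(node_series, volume_series))
```

Samples flagged this way are the ones where the node-level number cannot be explained by any volume that is actually being reported, which is the situation described here.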

orchid tiger
#

Are these spikes repeating?

silent locust
#

Yes, but if there is a pattern, I am missing it completely.
There are times when it stops completely for 1-2h and then starts again.

Trying to hunt down client-side applications that might be causing this, but with more than 70 volumes, multiple apps, and nothing to go on, it is difficult.
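
One way to look for a pattern in spikes like these is to group the above-threshold samples into episodes and compare the episode start times with known schedules. The following is a minimal sketch, assuming the node latency samples have already been exported as (timestamp, value) pairs; the function name, the threshold, and the gap setting are hypothetical choices for illustration.

```python
from typing import List, Tuple


def spike_episodes(
    samples: List[Tuple[int, float]],
    threshold: float,
    max_gap: int,
) -> List[Tuple[int, int, float]]:
    """Group samples above `threshold` into episodes.

    Returns (start_ts, end_ts, peak_value) tuples. Two spikes belong to the
    same episode when they are at most `max_gap` seconds apart."""
    episodes: List[Tuple[int, int, float]] = []
    for ts, value in sorted(samples):
        if value <= threshold:
            continue  # not a spike
        if episodes and ts - episodes[-1][1] <= max_gap:
            # Extend the current episode and track its peak.
            start, _, peak = episodes[-1]
            episodes[-1] = (start, ts, max(peak, value))
        else:
            # Too far from the previous spike: start a new episode.
            episodes.append((ts, ts, value))
    return episodes


# Made-up samples: two spikes close together, then another one two hours later.
samples = [(0, 3.0), (60, 600.0), (120, 700.0), (180, 4.0), (7200, 650.0)]
for start, end, peak in spike_episodes(samples, threshold=500.0, max_gap=120):
    print(start, end, peak)
```

The episode start and end times can then be lined up against client job schedules or other events to narrow down candidates.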

orchid tiger
#

In the Volume dashboard, is there any spike for volumes in the Top Volumes by Average Latency panel?

silent locust
#

Nothing that even approaches 500ms and nothing with the same duration either.

silent locust
#

OK, sorry for the delay, but I think we figured it out.
We started testing ONTAP S3, and a (most probably) misconfigured Thanos compactor pod for Prometheus was causing the problem.

The thing is that the underlying S3 volumes are excluded from the volume stats (quite understandable), and it took us some time to figure out that they were the actual cause. It did not help that the pod was misbehaving and crashing at random: each crash appeared to solve the problem, only for the pod to be restarted after some time.

We have disabled the pod until we track down the problem.
Thanks for all the help @orchid tiger

orchid tiger
#

Thanks for the update @silent locust !

silent locust
#

Need to configure Harvest for S3 buckets and all the related stats now 😄