Hello,
We recently added a new cluster to NAbox, but we are unable to see its metrics.
We are running NAbox 4.2 and Harvest 26.02.
After reviewing the container logs, we saw frequent connection events where NAbox attempts to communicate with the cluster but the connection is lost. Specifically, the logs show "connection reset by peer" events.
Could you please help us determine what the problem is?
Thank you!
#NAbox 4.2 - Unable to see cluster metrics
@vague wasp Could you upload the NAbox support bundle at https://upload.nabox.org/kiji-camu-zyqi
I can't, because we are working on a dark site.
Does the line connection reset by peer appear in the Harvest logs, or is it from a different NAbox container log?
When I downloaded the NAbox support bundle, there was a separate containers log file.
Okay. Could you share the full log line of this error if possible? I need to understand whether the issue is on the NAbox side or the Harvest side.
Yes one moment
May 19 10:10:58 nabox_hostname :havrest | time = ...... level=ERROR source=cache.go:161 msg="write metrics" poller= cluster_name export=nabox_victoriametrics error="write tcp ip->ip: write: connection reset by peer"
Thanks. It seems like there is an issue with Victoria Metrics data ingestion. How many clusters do you have? Have you tried restarting containers in NABox?
We have only one cluster. We tried deleting it and adding it again, but it didn't work.
Try restarting all containers once and see?
I restarted the container just now, and it showed metrics for 3 minutes. After that the metrics disappeared.
Ok, do you see any errors if you run docker logs havrest in the NAbox CLI?
i will check it
There are a lot of these: level=ERROR source=cache.go:** msg="write metrics" poller= cluster_name export=nabox_victoriametrics error="write tcp ip->ip: write: broken pipe"
Any other errors apart from this?
1. level=error source=response.go msg="error response" status=500 response="GET \ https://.....: dial tcp lookup .... on ip no such host"
2. level=ERROR source=volumeAnalytics.go msg="failed to collect analytic data" poller=cluster object=volumeAnalytics error="error making request connection error: Get \ ....: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
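When the log contains several kinds of errors like the ones quoted above, tallying them by message can show which failure dominates. A minimal sketch, assuming the log text is piped in on stdin and that messages appear as msg="..." with quotes (single-word, unquoted messages are not counted); summarize_errors is a hypothetical helper name, not part of NAbox or Harvest:

```shell
# Tally ERROR-level log lines by their msg="..." field, most frequent first.
# Reads log text on stdin. Only quoted msg values are extracted.
summarize_errors() {
  grep -i 'level=error' |
    sed -n 's/.*msg="\([^"]*\)".*/\1/p' |
    sort | uniq -c | sort -rn
}

# Possible usage on the NAbox host (container name as used in this thread):
# docker logs havrest --since 30m 2>&1 | summarize_errors
```

Each output line is a count followed by the message text, so a line starting with a large number next to "write metrics" would point at the exporter side rather than at collection.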
Could you check if there are any errors on the page below?
https://NABox_IP/vm/targets
Also, please check whether metrics collection is working by opening a shell in the container with the steps below:
docker exec -it havrest sh
ps -ef | grep poller
# Check poller promPort from cli and update 12001 to that
wget http://localhost:12001/metrics
cat metrics
# This cat command should show metrics collected from cluster
I see here that the second Harvest endpoint is down, and the error is: "exceeds -promscrape.maxScrapeSize or max_scrape_size in the scrape config. Possible solutions are: reduce the response size for the target, increase -promscrape.maxScrapeSize command-line flag, increase max_scrape_size value in scrape config for the given target". The page also shows 9046 errors and "never scraped".
Great!
I think in NABox, it is already 200MB
Is this cluster heavily loaded with many objects, such as a large number of volumes or LUNs?
We have 2331 volumes and 3679 LUNs.
Understood. Reviewing the logs would have been helpful, but to resolve this issue we should increase the maxScrapeSize setting in VictoriaMetrics.
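Before changing the limit, it may help to see how large a poller's response actually is. A rough sketch, assuming the /metrics response was saved to a file named metrics as in the wget step earlier, and that the limit is passed in bytes; check_scrape_size is a hypothetical helper, and the 200MB figure in the usage line is only the value guessed at in this thread, not a confirmed setting:

```shell
# Compare the size of a saved /metrics response with a scrape size limit.
# Arguments: path to the saved response, limit in bytes (both supplied by the caller).
check_scrape_size() {
  file=$1
  limit_bytes=$2
  # wc -c counts bytes; tr strips the padding some wc implementations add.
  size_bytes=$(wc -c < "$file" | tr -d ' ')
  if [ "$size_bytes" -gt "$limit_bytes" ]; then
    echo "over limit: $size_bytes > $limit_bytes bytes"
  else
    echo "within limit: $size_bytes <= $limit_bytes bytes"
  fi
}

# Possible usage, assuming a 200MB limit:
# check_scrape_size metrics $((200 * 1024 * 1024))
```

If the saved response is well under the assumed limit, the effective limit on the scraper is probably lower than expected and worth checking directly.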
Have you customized any of the Harvest templates in NAbox?
@stable barn to help with this
We never did, but I will wait for a response tomorrow. Thank you very much!
Sure. I'll share a command you can run to see which object is producing this much data.
Could you run the command below in NAbox and share the output?
docker logs havrest 2>&1 | \
grep 'msg=Collected' | \
grep 'instancesExported=' | \
awk '
{
c=""; ie=0; me=0
for(i=1;i<=NF;i++){
split($i,kv,"=")
if(kv[1]=="collector") c=kv[2]
if(kv[1]=="instancesExported") ie=kv[2]+0
if(kv[1]=="metricsExported") me=kv[2]+0
}
if(c!=""){last_ie[c]=ie; last_me[c]=me}
}
END{
for(c in last_ie) print last_ie[c], last_me[c], c
}' | \
sort -rn | \
awk '{printf "%-40s instancesExported=%-6d metricsExported=%d\n", "collector="$3, $1, $2}'
yes just a moment
You can also use the command below, which will be faster if you have a large volume of Docker logs.
docker logs havrest --since 30m 2>&1 | \
grep 'msg=Collected' | \
grep 'instancesExported=' | \
awk '
{
c=""; ie=0; me=0
for(i=1;i<=NF;i++){
split($i,kv,"=")
if(kv[1]=="collector") c=kv[2]
if(kv[1]=="instancesExported") ie=kv[2]+0
if(kv[1]=="metricsExported") me=kv[2]+0
}
if(c!=""){last_ie[c]=ie; last_me[c]=me}
}
END{
for(c in last_ie) print last_ie[c], last_me[c], c
}' | \
sort -rn | \
awk '{printf "%-40s instancesExported=%-6d metricsExported=%d\n", "collector="$3, $1, $2}'
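A complementary view, if a /metrics response has already been saved to a file as in the earlier wget step, is to group its lines by metric-name prefix. This is only a sketch: size_by_prefix is a hypothetical helper, it assumes the Prometheus text format, and it treats the part of the metric name before the first underscore as a rough proxy for the object, which will not be exact for every metric.

```shell
# Summarize a Prometheus text-format file by metric-name prefix.
# Prints approximate bytes, sample count, and prefix, largest first.
size_by_prefix() {
  awk '
    /^#/ || NF == 0 { next }            # skip comments and blank lines
    {
      name = $0
      sub(/[{ ].*/, "", name)           # keep only the metric name
      prefix = name
      sub(/_.*/, "", prefix)            # e.g. volume_read_ops -> volume
      bytes[prefix] += length($0) + 1   # +1 for the newline
      count[prefix]++
    }
    END { for (p in bytes) print bytes[p], count[p], p }
  ' "$1" | sort -rn
}

# Possible usage:
# size_by_prefix metrics | head
```

The prefixes at the top of the output account for most of the response size.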
I see no output.
Hmm. How about the command below?
docker logs havrest --since 30m 2>&1 | grep 'msg=Collected' | grep 'instancesExported='
Still no output.
if you just run
docker logs havrest
Do you see any line with Collected text in it?
I see an error: unable to merge config error=open nabox: no such file or directory, with a few warnings, and another error: poller exited abnormally, fork/exec bin/poller: no such file or directory
No "Collected" lines.
Okay, so that means the poller isn't even starting?
Looks like something is wrong with the NAbox file structure. @stable barn Could you please check?
What version is it?
mkdir /etc/nabox/harvest/nabox
Forgive me for disappearing.
Do you mean which version of NAbox?
yes