#disappearing cluster


outer moon
#

Running NAbox 3.2 with Harvest 23.05.0-1. I've seen this issue in previous versions:
I have a Primary and a Secondary Data Center.
All of my primary clusters are on Primary and the Secondary Cluster is under Secondary.

After running reports for a while, my largest Primary cluster disappears from the drop-downs. The only way I am able to bring it back is by rebooting.

I am viewing with Firefox LTS, and/or Microsoft Edge.

Has anyone seen this before?

balmy ice
outer moon
#

Hey Chris, nice to chat with you again. I normally am not allowed since we are a dark site, but let me have a look at the collection, I may be able to transfer it

outer moon
#

@balmy ice It does not appear that there is any classified info in the logs. However, I have to go through it, and the nalog is a big one. I probably won't be able to upload until tomorrow; would that be okay?

balmy ice
#

of course, whenever you get a chance

outer moon
#

Just emailed it Chris. Thank you

lethal holly
#

We're seeing it too. Go into the admin page and we see the cluster is unchecked. Turn it back on and it starts collecting again.

#

I think this started after we upgraded from 3.3beta to 3.3.

balmy ice
#

thanks for chiming in Ed. We'll investigate

#

spoke with @long copper about this - in NAbox, when the cluster becomes unchecked, that means the poller has died. When you get a chance, can you ssh into NAbox and check if the oom-killer is reaping processes by running: dmesg | grep -i kill
Please share your logs too and we'll make sure there's not anything happening there that causes the poller to die. Instructions for collecting them are at https://nabox.org/documentation/troubleshooting/#collecting-logs and you can email them to ng-harvest-files@netapp.com

lethal holly
#

Confirmed - it's the oom-killer
[8401823.620144] poller invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[8401823.620198] oom_kill_process.cold+0xb/0x10
[8401823.620492] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=ca4bcd987f09f92d6cf477d794380caaead08d71fd3869b978dd92a00b5ef7df,mems_allowed=0,global_oom,task_memcg=/docker/ca4bcd987f09f92d6cf477d794380caaead08d71fd3869b978dd92a00b5ef7df,task=poller,pid=27604,uid=0
[8401823.620529] Out of memory: Killed process 27604 (poller) total-vm:7547364kB, anon-rss:5171416kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:13520kB oom_score_adj:0
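
For anyone checking their own dmesg output, the "Out of memory: Killed process" line above can be picked apart with a short script. This is only a sketch: the regular expression is written against the single line quoted in this thread, and the names `OOM_RE` and `parse_oom_kills` are made up for illustration. Other kernel versions may format the line differently.

```python
import re

# Matches the kernel summary line as it appears in this thread.
# Assumption: fields keep this order and the "kB" suffix.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kills(dmesg_text):
    """Return one dict per OOM kill found in the given dmesg text."""
    kills = []
    for line in dmesg_text.splitlines():
        m = OOM_RE.search(line)
        if m:
            kills.append({
                "pid": int(m["pid"]),
                "name": m["name"],
                # Convert kB to GiB for easier comparison with VM memory size.
                "total_vm_gib": int(m["total_vm_kb"]) / 1024**2,
                "anon_rss_gib": int(m["anon_rss_kb"]) / 1024**2,
            })
    return kills


sample = (
    "[8401823.620529] Out of memory: Killed process 27604 (poller) "
    "total-vm:7547364kB, anon-rss:5171416kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:0 pgtables:13520kB oom_score_adj:0"
)

kills = parse_oom_kills(sample)
for k in kills:
    print(f"{k['name']} (pid {k['pid']}): "
          f"anon-rss {k['anon_rss_gib']:.2f} GiB, "
          f"total-vm {k['total_vm_gib']:.2f} GiB")
```

In practice the input would be the captured output of `dmesg` rather than a hard-coded sample. The anon-rss figure is the resident memory the poller held when it was killed, which is the number to compare against the memory given to the VM.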

balmy ice
#

thanks Ed

lethal holly
#

Is the fix as simple as giving the VM more memory?

balmy ice
#

yes and we're also making a few improvements on the Harvest side. We're going to disable a few templates that are not being used by any dashboards, but have high metric counts. We would love to take a look at your log files and see if there are any other issues there, if you can send them

lethal holly
#

Thanks. Memory bumped from 12GB to 16GB, /data increased while I was at it, and logs emailed.

balmy ice
#

still going through your logs Ed, but commenting out NetConnections in your conf/rest/default.yaml will make a difference. That's a change that will be in 23.08
nabox-harvest2 | 2023-08-11T13:28:47Z INF collector/collector.go:483 > Collected Poller=poller1 apiMs=64720 calcMs=0 collector=Rest:NetConnections instances=411630 metrics=4527930 parseMs=44733 pluginMs=0
shows 411K instances exporting 4,527,930 metrics
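
A quick way to find other heavy collectors in the same logs is to pull the key=value pairs out of each "Collected" line and sort by metric count. The sketch below assumes every such line looks like the one quoted above; `parse_collected` and `KV_RE` are illustrative names, not part of Harvest.

```python
import re

# key=value pairs as they appear in the "Collected" log line above.
KV_RE = re.compile(r"(\w+)=(\S+)")


def parse_collected(line):
    """Return the key=value fields of a 'Collected' line, ints where possible."""
    fields = {}
    for key, value in KV_RE.findall(line):
        fields[key] = int(value) if value.isdigit() else value
    return fields


sample = (
    "nabox-harvest2 | 2023-08-11T13:28:47Z INF collector/collector.go:483 > "
    "Collected Poller=poller1 apiMs=64720 calcMs=0 "
    "collector=Rest:NetConnections instances=411630 metrics=4527930 "
    "parseMs=44733 pluginMs=0"
)

fields = parse_collected(sample)
per_instance = fields["metrics"] // fields["instances"]
print(fields["collector"], fields["instances"], fields["metrics"], per_instance)
```

For this line the division is exact: 4,527,930 metrics over 411,630 instances is 11 metrics per instance, so the volume comes from the number of connections rather than from an unusually wide template.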

lethal holly
#

Thanks @balmy ice . What's the downside of commenting out NetConnections?

south karma
#

@lethal holly These metrics are not utilised in any of the Harvest default dashboards. The only drawback would be if you were using these metrics in some other way.