#CDOT system seems to randomly drop off nabox.
@reef obsidian
Do you have capacity available? It could also be bad credentials. Can you stop all pollers and run only that one while running dc logs -f nabox-harvest2 over SSH?
Capacity on the filer or the nabox? Same creds as all the other filers, same domain etc. Do you have a netapp email I could send the logs over to?
How do I stop the pollers?
Through the web interface, you un-check them
You can send the logs at help@nabox.org if you prefer
Perfect thank you
I saw this on the last major nabox release but haven't yet on the current one. Rolling cluster upgrades would frequently trigger it and I'd have to go into nabox and re-enable the cluster.
Yes, that would make sense I believe. As far as I know, Harvest stops if it can't connect to the cluster. When the LIF migrates, there is a chance it won't be reachable right at the moment Harvest polls.
No upgrades, all stable, been that version of CDOT for a while
I've just sent the logs over now
2023-01-19T09:25:24Z INF ./poller.go:552 > caught signal [terminated] Poller=xxxxxxxxxx
Did you manually stop the poller, or does that match the issue you're describing?
Did a reboot first, mate, and then stopped them
It couldn't open the systems page, so I rebooted and then stopped the pollers
Let's have a remote session. How is your afternoon ?
free all afternoon
@hexed creek did a quick troubleshooting session. It seems the poller randomly stops, which I have never seen before except when Harvest crashes. We didn't have the time of the crash in the logs, so @stable mauve will keep an eye on it and send me the logs, but I think we might be facing a bug
This is 22.11.1-1
thanks, we'll take a look at @stable mauve's crash logs when they come
Is it worth upgrading to a nightly harvest release?
should be fine to do so, but we haven't made any fixes between 22.11 and now around crashing
which is to say we aren't aware of any bugs around crashing 🙂 The logs from one of the crashes will help
Ah, I will leave it on the stable harvest version for now
By the way https://nabox.org/downloads/ 3.2.1b1 is online and should add screen cli to ease troubleshooting down the road
If harvest can't connect, it shouldn't die forever until there is manual intervention. I'd be okay with a few missed samples, but then it should self-recover.
the release notes link for the 3.2.1b1 results in a 404.
@bold finch harvest does self-recover if there are connectivity issues with the cluster in between. It will not recover if there is a crash in harvest code. As @visual mica mentioned, we are not aware of any such crash bugs.
Thanks. As I said, I haven't seen it in the current code release and hopefully I won't. If I do, I'll try to capture some details.
Yes, that's me being lazy, there are no release notes 😄 I just added the screen package
Morning/Afternoon. Just checked and the missing cluster is missing again. Unfortunately my PC rebooted and killed my PuTTY session, and I didn't upgrade to the new version. Are there any logs you would like collected, Yann? Then I will upgrade
Hello all,
we have this problem too it seems.
Randomly Nabox stops collecting from some of our systems and shows them as off in the admin overview.
To collect again I have to enable them manually.
can you grab the logs for one of the pollers that stops working and send them to us via https://github.com/NetApp/harvest/wiki/FAQ#how-do-i-share-sensitive-log-files-with-netapp
How can I collect logs for the poller?
For how long?
To collect logs from nabox, run dc logs nabox-harvest2 > nabox-harvest2.log, then zip that file and email it. In terms of how long: if the poller has restarted, the logs should already include information about why that happened, so the current logs should suffice
More info on nabox log collection here https://nabox.org/documentation/troubleshooting/
My logs sent over now too
thanks, we'll take a look
Is there any way of collecting older logs? The instructions above only capture today's log, and mine broke sometime between Friday and this morning
nabox is using docker compose under the hood, which means you should be able to use the --since flag as outlined here https://docs.docker.com/engine/reference/commandline/compose_logs/
checking if that works
no such luck - the version of docker compose on nabox does not support that flag
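Since --since isn't available here, one workaround is to filter an already-captured log file by timestamp. A minimal sketch, assuming each line starts with a UTC timestamp shaped like the 2023-01-19T09:25:24Z example earlier in the thread; the helper names parse_ts and lines_since are made up for illustration and are not part of Harvest or NAbox. It only narrows down what dc logs returned, it cannot bring back lines that are already gone.

```python
from datetime import datetime, timezone


def parse_ts(line):
    """Return the leading UTC timestamp of a log line, or None if absent.

    Assumes the format seen in the thread: 2023-01-19T09:25:24Z ...
    """
    token = line.split(" ", 1)[0]
    try:
        parsed = datetime.strptime(token, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def lines_since(lines, since):
    """Keep lines stamped at or after `since`; unstamped lines are dropped."""
    kept = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and ts >= since:
            kept.append(line)
    return kept
```

Usage would be something like lines_since(open("nabox-harvest2.log"), datetime(2023, 1, 20, tzinfo=timezone.utc)). If your dc logs output prefixes each line with the container name, strip that prefix first.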
ah damn, I will install the new patch version that has screen
No log probably means that the container restarted, which seems odd ?
To use screen you would have to stop that poller from the UI, then run screen -R after logging in, and run dc exec nabox-harvest2 bash. That'll get you into the harvest container, and you can manually run the poller from there
cd /netapp-harvest;/netapp-harvest/bin/poller -p <poller name from harvest.yaml> --promPort <port number from harvest.yaml>
@visual mica are the logs enough or do you need more?
@torpid coyote not sure yet, @reef obsidian and I are still digging through them. We'll ping you if we need more data. Thanks!
is your issue just one poller? and the same poller every time?
I don't know, for me it seems random which system stops collecting.
We monitor a lot of NetApps with Nabox
Mmm, what logs ? I got those from @stable mauve
i sent them to you via email
Indeed the only stop I see is 2023-01-23T13:13:42 GMT and seems to be caused by disabling the poller in the UI
@reef obsidian you can see here that those systems didn't collect until being re-enabled in the admin GUI.
This has already happened 4 times in the last 3-4 months.
The trick is to catch the poller logs when it crashes, we didn't see that moment in the capture
How many controllers and volumes are we talking about ? It's possible NAbox is running out of memory.
Hmm i can check this.
Indeed today again the fashap07 was not collecting.
I will expand memory to 16G now and look further.
I have sent new logs from today @visual mica @reef obsidian
Awesome. Take a look at dmesg too, with the recent crash you might see an OOM kill
I think u are right
@torpid coyote Have you added new pollers recently? or any other changes?
No we havent
okay any harvest version upgrade?
It was 1-2 months ago, I think. When I updated to Nabox 3.2 I also updated to 22.11.1-1
okay and crashes have been happening since then or a different timeline?
we had crashes before the updates as well, just not as often as in the last month
We added Storage Grid with the last update
okay thanks for the information.
@stable mauve when you get a chance, can you also take a look at dmesg and see if you have any OOM killer msgs there for your erratic pollers?
fwiw - following along here because I have the same issue - just checked my splunk logs and yep I'm seeing the same thing - so at least I can get an alert when it happens - I'll try adding more memory to my nabox vm as well
Jan 19 20:22:09 myhost.name kernel: [5483302.209948] Out of memory: Killed process 24718 (poller) total-vm:4347592kB, anon-rss:2016716kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:5612kB oom_score_adj:0
Jan 19 20:22:09 myhost.name kernel: [5483302.209902] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/f01a0c8b7a93ca48bda89e7665b837f28e57925f5fbd07755a8c2956523bf2a2,task=poller,pid=24718,uid=0
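For anyone alerting on this from syslog or Splunk, the "Out of memory: Killed process" line above can be picked apart with a regex. A minimal sketch, assuming the kernel message keeps the shape shown in that line; parse_oom_kill and its return keys are hypothetical names chosen for this example.

```python
import re

# Matches the kernel OOM line quoted above, e.g.
# "Out of memory: Killed process 24718 (poller) ... anon-rss:2016716kB, ..."
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, process name and anon-rss (kB) from an OOM kill line.

    Returns None when the line is not an OOM kill message.
    """
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "anon_rss_kb": int(match.group("anon_rss_kb")),
    }
```

Run against the line above, this reports process poller, pid 24718, with anon-rss of 2016716 kB, which is a little over 1.9 GiB resident at the time it was killed.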
Morning. So I haven't had any issues since the upgrade of harvest or NAbox
Just checked, Chris: I did have OOM errors too before I upgraded
thanks @stable mauve , let's keep an eye on it. If you see an OOM error again, might be worth bumping the memory if you can
I'm hoping it's not harvest leaking memory for example
unlikely - we have 10+ instances that have been running for months without issue
just got my RAM doubled - and yeah it seems odd. We only have 7 and it's usually the same two that do this: one's a FAS8200 (that gets really loaded at night) and the other is an AFF300 that's not really doing much of anything, so not even similar hardware or workloads
Docker stats might be able to give us some insights after running it for a while before it crashes.
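To compare docker stats samples over time, something like the sketch below could help spot which container is growing. It assumes each sample was captured with docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}", giving tab-separated rows such as name<TAB>1.5GiB / 16GiB. The function names and the container names in the usage note are made up for illustration.

```python
import re

# Binary units as printed in the MemUsage column (assumed: B, KiB, MiB, GiB).
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}
SIZE_RE = re.compile(r"^([\d.]+)\s*(B|KiB|MiB|GiB)$")


def to_bytes(text):
    """Convert a size such as '1.5GiB' to a number of bytes."""
    match = SIZE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def parse_sample(output):
    """Map container name to used bytes for one 'name<TAB>used / limit' sample."""
    usage = {}
    for row in output.strip().splitlines():
        name, mem = row.split("\t")
        used = mem.split("/")[0]  # text before the slash is current usage
        usage[name] = to_bytes(used)
    return usage


def growth(first, last):
    """Bytes gained per container between two samples (names present in both)."""
    return {name: last[name] - first[name] for name in first if name in last}
```

Take one sample shortly after boot and another a day or so later, then sort growth(first, last) by value. A container that keeps climbing between samples (say a hypothetical container-a going from 200MiB to 800MiB) is the one to look at.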
I’ll also work on improving nabox dashboard with regards to memory usage.
Well that's interesting. Container-exporter seems to be leaking
You might want to give that a try : https://gist.github.com/ybizeul/9e7e0d7de3705f47f0ba2d7e47752c67
Whipping up an update for NAbox, ditching container-exporter, which is deprecated and seems to be leaking pretty badly. b3 is on the way
😀
Yay. Beta 4 is available for download
Beta 4 installed 🤞
all right, keep an eye on memory
After 7 days our 16 GB of RAM was full 😄
@torpid coyote hopefully that is with a version earlier than beta 4?
We will schedule a reboot every night until it's fixed in a stable branch
Yes it is earlier
will look into that
Where can I find the beta 4 version?
hi @peak flame @reef obsidian can chime in but given when he posted the msg, I'm guessing it is 3.2.1b4 on this page https://nabox.org/downloads/ and maybe beta4 is superseded by 3.3b1 (2023-02-01) the changelog seems to match