#CDOT system seems to randomly drop off nabox.

stable mauve
#

Hi all,
I have a strange issue. I am monitoring about 8 different CDOT boxes, all on 9.8, with the same username and password, but one just seems to randomly drop off. When I check NAbox, its checkbox is grey instead of green.

#

Version of NAbox is 3.2

hexed creek
#

@reef obsidian

reef obsidian
#

Do you have capacity available? It could also be bad credentials. Can you stop all pollers, then run only that one while running dc logs -f nabox-harvest2 over SSH?

stable mauve
#

Capacity on the filer or the nabox? Same creds as all the other filers, same domain etc. Do you have a netapp email I could send the logs over to?

#

How do I stop the pollers?

reef obsidian
#

Through the web interface, you un-check them

stable mauve
#

Perfect thank you

bold finch
#

I saw this on the last major nabox release but haven't yet on the current one. Rolling cluster upgrades would frequently trigger it and I'd have to go into nabox and re-enable the cluster.

reef obsidian
#

Yes, that would make sense, I believe. As far as I know, Harvest stops if it can't connect to the cluster. When the LIF migrates, there is a chance it won't be reachable right at the moment Harvest polls.

stable mauve
#

No upgrades, all stable, been that version of CDOT for a while

#

I've just sent the logs over now

reef obsidian
#

2023-01-19T09:25:24Z INF ./poller.go:552 > caught signal [terminated] Poller=xxxxxxxxxx

Did you manually stop the poller, or does that match the issue you're describing?
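
To find every stop like this in a captured log, one option is to filter for the "caught signal" message and print only the timestamp, signal, and poller name. This is a minimal sketch that assumes all such lines follow the format of the single line quoted above; poller_stops is a hypothetical helper name, not part of Harvest or NAbox.

```shell
# List poller stop events from Harvest log text read on stdin.
# Assumption: lines look like the one quoted in this thread, e.g.
#   <ISO timestamp> INF ./poller.go:552 > caught signal [terminated] Poller=<name>
# poller_stops is a hypothetical helper, not a Harvest or NAbox command.
poller_stops() {
  awk '/caught signal/ {
    ts = ""; sig = ""; poller = ""
    for (i = 1; i <= NF; i++) {
      # The first field that looks like an ISO date is taken as the timestamp,
      # so a container-name prefix in front of the line does not matter.
      if (ts == "" && $i ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/) ts = $i
      if ($i ~ /^\[.*\]$/) sig = $i               # bracketed signal name
      if ($i ~ /^Poller=/) poller = substr($i, 8) # text after "Poller="
    }
    print ts, sig, poller
  }'
}

# Usage on a saved log file:
#   poller_stops < nabox-harvest2.log
```

A stop that shows [terminated] right after someone unchecked the poller in the web interface is expected; one with no matching manual action is the interesting case.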

stable mauve
#

Did a reboot first, mate, and then stopped them

#

It couldn't open the systems page, so I rebooted and then stopped the pollers

reef obsidian
#

Let's have a remote session. How is your afternoon ?

stable mauve
#

free all afternoon

reef obsidian
#

@hexed creek did a quick troubleshooting session; it seems that the poller randomly stops, which I have never seen before except when Harvest crashes. We didn't have the time of the crash in the logs, so @stable mauve will keep an eye on it and send me the logs, but I think we might be facing a bug
This is 22.11.1-1

visual mica
#

thanks, we'll take a look at @stable mauve's crash logs when they come in

stable mauve
#

Is it worth upgrading to a nightly harvest release?

visual mica
#

should be fine to do so, but we haven't made any fixes between 22.11 and now around crashing

#

which is to say we aren't aware of any bugs around crashing 🙂 The logs from one of the crashes will help

stable mauve
#

Ah, I will leave it on the stable harvest version for now

reef obsidian
bold finch
bold finch
hexed creek
#

@bold finch Harvest does self-recover if there are connectivity issues with the cluster in between. It will not recover if there is a crash in the Harvest code. As @visual mica mentioned, we are not aware of any such crash bugs.

bold finch
reef obsidian
stable mauve
#

Morning/afternoon. Just checked, and the missing cluster is missing again. Unfortunately my PC rebooted and killed my PuTTY session, and I didn't upgrade to the new version. Are there any logs you would like collected, Yann? Then I will upgrade.

torpid coyote
#

Hello all,

We seem to have this problem too.
NAbox randomly stops collecting from some of our systems and shows them as off in the admin overview.
To collect again, I have to enable them manually.

visual mica
torpid coyote
#

How can I collect logs for the poller?
And for how long?

visual mica
#

To collect logs from NAbox, run dc logs nabox-harvest2 > nabox-harvest2.log, then zip that file and email it. As for how long: if the poller has restarted, the logs should already include information about why that happened, so the current logs should suffice.

torpid coyote
#

Ok i will try it now

#

Sent the logs @visual mica

stable mauve
#

My logs sent over now too

visual mica
#

thanks, we'll take a look

stable mauve
#

Is there any way of collecting older logs? The instructions above only capture today's log; mine broke sometime between Friday and this morning.

visual mica
#

checking if that works

#

no such luck - the version of docker compose on nabox does not support that version

stable mauve
#

Ah damn, I will install the new patch version that has screen

reef obsidian
#

No log probably means that the container restarted, which seems odd ?

#

To use screen, you would have to stop that poller from the UI, then run screen -R after logging in, and run dc exec nabox-harvest2 bash. That'll get you into the Harvest container, where you can manually run the poller:

#

cd /netapp-harvest;/netapp-harvest/bin/poller -p <poller name from harvest.yaml> --promPort <port number from harvest.yaml>

torpid coyote
#

@visual mica are the logs enough or do you need more?

visual mica
#

@torpid coyote not sure yet, @reef obsidian and I are still digging through them. We'll ping you if we need more data. Thanks!

stable mauve
torpid coyote
reef obsidian
#

Mmm, what logs ? I got those from @stable mauve

visual mica
#

i sent them to you via email

reef obsidian
#

Indeed the only stop I see is 2023-01-23T13:13:42 GMT and seems to be caused by disabling the poller in the UI

torpid coyote
#

@reef obsidian you can see here that those systems didn't collect until they were re-enabled in the admin GUI.
This has already happened 4 times in the last 3-4 months.

reef obsidian
#

The trick is to catch the poller logs when it crashes, we didn't see that moment in the capture

#

How many controllers and volumes are we talking about ? It's possible NAbox is running out of memory.

torpid coyote
#

Hmm, I can check this.
Indeed, fashap07 was again not collecting today.

#

I will expand memory to 16G now and look further.
I have sent new logs from today @visual mica @reef obsidian

reef obsidian
#

Awesome. Take a look at dmesg too; with the recent crash you might see an OOM kill

torpid coyote
#

I think u are right

hexed creek
#

@torpid coyote Have you added new pollers recently? or any other changes?

torpid coyote
#

No we havent

hexed creek
#

okay any harvest version upgrade?

torpid coyote
#

It was 1-2 months ago, I think. When I updated to NAbox 3.2, I also updated to 22.11.1-1

hexed creek
#

okay and crashes have been happening since then or a different timeline?

torpid coyote
#

We had crashes before the updates as well, just not as often as last month

#

We added StorageGRID with the last update

hexed creek
#

okay thanks for the information.

visual mica
#

@stable mauve when you get a chance, can you also take a look at dmesg and see if you have any OOM killer msgs there for your erratic pollers?

timber halo
#

fwiw - following along here because I have the same issue - just checked my splunk logs and yep I'm seeing the same thing - so at least I can get an alert when it happens - I'll try adding more memory to my nabox vm as well

Jan 19 20:22:09 myhost.name kernel: [5483302.209948] Out of memory: Killed process 24718 (poller) total-vm:4347592kB, anon-rss:2016716kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:5612kB oom_score_adj:0
Jan 19 20:22:09 myhost.name kernel: [5483302.209902] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/f01a0c8b7a93ca48bda89e7665b837f28e57925f5fbd07755a8c2956523bf2a2,task=poller,pid=24718,uid=0
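
Lines like these can be condensed into one short record per kill, which is convenient for alerting or for comparing hosts. This is a minimal sketch that assumes the kernel message format shown in the two lines above; oom_summary is a hypothetical helper name, and the MiB figure is simply anon-rss divided by 1024 and rounded down.

```shell
# Summarize kernel OOM-killer messages read on stdin.
# Assumption: "Out of memory: Killed process <pid> (<name>) ... anon-rss:<n>kB"
# as in the lines quoted in this thread. oom_summary is a hypothetical helper.
oom_summary() {
  # Keep only the kill lines and reduce each to "<pid> <name> <anon-rss in kB>"
  sed -n 's/.*Out of memory: Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p' |
    while read -r pid name rss_kb; do
      # Integer division: kB to MiB, rounded down
      echo "pid=$pid process=$name anon_rss_mib=$((rss_kb / 1024))"
    done
}

# Usage on the NAbox VM (reading the kernel log may need root):
#   dmesg | oom_summary
```

If the process reported is poller, the kill lines up with a poller disappearing from NAbox without any "caught signal" entry in the Harvest log.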

stable mauve
#

Morning. So I haven't had any issues since the upgrade of Harvest or NAbox

#

Just checked, Chris: I did have OOM errors too before I upgraded

visual mica
#

thanks @stable mauve , let's keep an eye on it. If you see an OOM error again, might be worth bumping the memory if you can

reef obsidian
#

I'm hoping it's not harvest leaking memory for example

visual mica
#

unlikely - we have 10+ instances that have been running for months without issue

timber halo
#

Just got my RAM doubled, and yeah, it seems odd. We only have 7, and it's usually the same two that do this: one's a FAS8200 (that gets really loaded at night) and the other is an AFF300 that's not really doing much of anything, so not even similar hardware or workloads

reef obsidian
#

Docker stats might be able to give us some insights after running it for a while before it crashes.
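
One way to gather that data is to record a timestamped snapshot of per-container memory at intervals, so the trend is still on disk after a crash. This is a minimal sketch, assuming the docker CLI can be called directly from the NAbox shell (the thread itself only shows the dc wrapper); mem_snapshot and the mem.log file name are hypothetical.

```shell
# Print one timestamped line per running container with its memory usage.
# Assumption: the docker CLI is available in the NAbox shell.
# mem_snapshot is a hypothetical helper, not a NAbox command.
mem_snapshot() {
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # --no-stream takes a single sample instead of refreshing continuously
  docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' |
    while read -r line; do
      echo "$ts $line"
    done
}

# Usage, for example inside screen so it survives a dropped SSH session:
#   while sleep 300; do mem_snapshot >> mem.log; done
```

Comparing the first and last samples for each container name should show which one is growing, without relying on the Harvest log itself.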

#

I’ll also work on improving nabox dashboard with regards to memory usage.

reef obsidian
#

Well that's interesting. Container-exporter seems to be leaking

reef obsidian
#

Whipping up an update for NAbox, ditching container-exporter, which is deprecated and seems to be leaking pretty badly. b3 is on the way

timber halo
#

😀

reef obsidian
#

Yay. Beta 4 is available for download

timber halo
#

Beta 4 installed 🤞

reef obsidian
#

all right, keep an eye on memory

torpid coyote
#

After 7 days our 16 GB of RAM was full 😄

visual mica
#

@torpid coyote hopefully that is with a version earlier than beta 4?

torpid coyote
#

We will schedule a reboot every night until it's fixed in a stable branch

reef obsidian
#

also it's worth doing a dc restart container-exporter

#

see if it frees resources

torpid coyote
#

will look into that

peak flame
visual mica
#

Hi @peak flame, @reef obsidian can chime in, but given when he posted the message, I'm guessing it is 3.2.1b4 on this page: https://nabox.org/downloads/ and maybe beta 4 has been superseded by 3.3b1 (2023-02-01), since the changelog seems to match