#Got an OOM-killed message, shutting down collection for one of my clusters
@slender tinsel Can you share the results of running these commands for sizing? Feel free to send them by email (ng-harvest-files@netapp.com) if the output would be large
ps -o pid,time,rss,comm,args
free
Hi @slender tinsel you might hit a size limit with email. Please upload the support bundle here https://upload.nabox.org/cane-tyve-qeyu
We may want to grab a heap dump, I'll let you know after looking at your support bundle
bundle uploaded.
free
PID TIME RSS COMMAND COMMAND
44166 00:00:00 2404 sudo sudo su root
44167 00:00:00 3968 su su root
44168 00:00:00 3712 bash bash
44170 00:00:00 4096 ps ps -o pid,time,rss,comm,args
total used free shared buff/cache available
Mem: 12215828 5253516 4703700 9920 2552880 6962312
Swap: 0 0 0
I just logged in this morning and ran that command, was I supposed to do something different?
I just gathered the bundle this morning too, and uploaded that. Not sure if closer to the OOM-killed message, or closer to current is a better pull.
thanks for uploading Ryan, looking now. Closer to OOM is better but let me take a look at what you've uploaded
I apparently didn't catch it very quickly, the graphs look like it killed the process on 11/5
it was only 1 of the clusters though, the other 3 were ok.
that matches what I see in the logs, oom-kill happened at Nov 05 21:16:26
If you take a look at the Metadata dashboard, Poller RSS memory panel, you should see one poller that is much larger than the others, ~1.8 GB
the panel that looks like this
Yeah, I see it, mine looks a lot cleaner than yours, but that's likely because the sizes are larger, making the variations smaller...
It seems to have gotten all the way to 8g before being killed... Is that a memory leak or something? Yesterday you can see all of them drop when I restarted NABox for poller modifications (I was trying to turn off the worker processes as we don't use them)
that does look like a leak. The green line, after restart, looks like it is slowly growing. You are on 24.08.0-1. Would you be able to upgrade to 24.11? There are memory improvements in 24.11 and we added the ability to capture a heap dump in that version. I'm hopeful that a heap dump from the green poller would help root cause any memory leak
Sure. the green line does definitely climb faster? more? than the others. I'm not sure I'll be able to upgrade this week, we're on a change freeze. I'll ask since it's just a monitoring tool, but it may have to wait until Monday.
understood. I'll keep digging through the log files and see if anything else jumps out
The support bundle shows that the s* poller is using a lot of memory monitoring an ONTAP cluster at version 9.13.1. Is that the green poller in your screenshot? Looks like you have enabled some of the disabled-by-default templates that are expensive: Rest:CIFSSession, Rest:CIFSShare, and Rest:NFSClients. Rest:CIFSSession is collecting 9,795 instances and 137,130 metrics, of which, 19,590 are exported per poll.
If you don't need those templates, you will definitely save memory by disabling them. That said, I don't think those templates fully explain green's growth. Let's try chasing that growth once you've had a chance to upgrade. Something like, grab a heap dump for the largest poller after upgrade, then grab another heap dump from the same poller a day later. Upload both the heap dumps to the share above.
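As a rough cross-check of the Rest:CIFSSession numbers quoted above, the per-instance counts can be worked out with shell integer arithmetic. The variable names are made up for this sketch; only the three totals come from the thread.

```shell
# Totals quoted for Rest:CIFSSession in the message above
instances=9795
metrics=137130
exported=19590

# Integer division: metrics collected and exported per instance
per_instance=$((metrics / instances))
exported_per_instance=$((exported / instances))

echo "metrics collected per instance: $per_instance"
echo "metrics exported per instance: $exported_per_instance"
```

Both divisions come out exact, so that works out to 14 metrics collected and 2 exported for each session instance.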
To grab the heap dumps in nabox4 do this after upgrading to 24.11:
0. wait at least 10 minutes after upgrade
- ssh into nabox
- docker exec -it havrest wget -O heap1.pprof 'http://localhost:12001/debug/pprof/heap'
- wait one day
- docker exec -it havrest wget -O heap2.pprof 'http://localhost:12001/debug/pprof/heap'
- docker cp havrest:/harvest/heap1.pprof .
- docker cp havrest:/harvest/heap2.pprof .
- upload those two files to the share
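Once both files are copied out, one possible way to see what grew between them is Go's pprof tool with the first dump as the baseline. This is only a sketch: it assumes a Go toolchain is available wherever the files are inspected, which the thread does not state. When Go or the files are missing it just prints the command instead of running it.

```shell
# File names match the steps above
base=heap1.pprof
later=heap2.pprof

# -base subtracts the first profile from the second;
# -sample_index=inuse_space looks at live memory rather than total allocations
cmd="go tool pprof -sample_index=inuse_space -top -base $base $later"

if command -v go >/dev/null 2>&1 && [ -f "$base" ] && [ -f "$later" ]; then
  $cmd
else
  echo "would run: $cmd"
fi
```

Entries that stay large in the diff are the ones that grew over the day between the two dumps.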
Green is the C* poller, I* is red, S* is blue, and H* is yellow.
These systems are NAS only, so NFS and CIFS are the important templates, we're trying to get info about how to track usage on CIFS and NFSv3 (we have no v4) and what info we can get out of it. I oversized (12g RAM) the NABox system because I expected to utilize quite a bit for these templates across the board, but if I need to go bigger I can. however, with the growth of green, it doesn't look like it's releasing it as necessary, so I think you're right there's some shenanigans going on with that.
I got declined for upgrade this week, so I'll put it in for Monday or Tuesday next week. Should I do Harvest and NABox, or just Harvest for the upgrade.
Do you want a heap dump now, or just right after the upgrade next week?
upgrading only Harvest is sufficient. In 24.08 it's not possible to gather a heap dump without restarting with additional args. I can figure those out if you want to try that this week, or we can wait until next week when the steps I provided above will work out-of-the-box. In terms of when to gather them, I've added a step 0 to the notes above to make that clearer
Doubtful they'd be happy about a restart without a change, they're pretty twitchy this week, so let's just wait until the upgrade.
Thanks for the color decoder. From the logs, the S* cluster was using 2.3 GB of memory on 2024-11-11T01:33:52Z. C* cluster is using a svelte 32MB, but that's because that poller was restarted right before the RSS was printed, so not very useful
Just finished the upgrade earlier this morning, but I'm not sure if I'm doing something wrong here...
Connecting to localhost:12001 (127.0.0.1:12001)
wget: can't connect to remote host (127.0.0.1): Connection refused
can you do a docker ps -a and paste the results?
replace the port 12001 with the promPort from one of your pollers that has been using a lot of memory
I don't have a prom port? I don't think any of this is a security issue, but let me know if I should edit...
15455069f5f7 havrest:latest "/havrest" 39 minutes ago Up 38 minutes 8080/tcp havrest
a6c4aab594ac grafana-oss "/run.sh" 39 minutes ago Up 38 minutes 127.0.0.1:3000->3000/tcp grafana
383f9b3351ae vmalert "/vmalert-prod -rule…" 39 minutes ago Up 38 minutes 8880/tcp vmalert
143ac9031736 vmagent "/vmagent-prod -remo…" 39 minutes ago Up 38 minutes 8429/tcp vmagent
a0f735305213 victoria-metrics "/victoria-metrics-p…" 6 days ago Up 38 minutes 127.0.0.1:8428->8428/tcp victoria-metrics
dcb0491b2f23 cadvisor "/usr/bin/cadvisor -…" 2 months ago Up 38 minutes (healthy) 8080/tcp cadvisor
d83c1654d660 node-exporter "/bin/node_exporter …" 2 months ago Up 38 minutes 9100/tcp node-exporter
b1522d5167bc traefik:v3.0 "/entrypoint.sh trae…" 3 months ago Up 38 minutes 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp traefik
c0349504e5ac alertmanager "/bin/alertmanager -…" 3 months ago Up 38 minutes 9093/tcp alertmanager
Now I'm really confused! I went to look at the harvest.yml file, but...
bash: cd/etc/nabox: No such file or directory
thanks Ryan - yes, none of that is a security concern. There are several ways to determine your poller's promPort. Let's try this one
ps aux | grep poller
You should see something like this that shows the poller name and promPort
So in this example, the poller A250-41-42-43 has promPort 12001
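For reference, a sketch of pulling the poller name and promPort out of one such process line. The sample line is a made-up stand-in, since the real `ps aux` output from the thread is not reproduced here. It also assumes the poller is started with `--poller` and `--promPort` arguments.

```shell
# Hypothetical process line; only the poller name and port come from the thread
sample='harvest 1234 0.5 1.2 bin/poller --poller A250-41-42-43 --loglevel 2 --promPort 12001'

# Capture the word that follows each flag
poller=$(printf '%s\n' "$sample" | sed -n 's/.*--poller \([^ ]*\).*/\1/p')
port=$(printf '%s\n' "$sample" | sed -n 's/.*--promPort \([^ ]*\).*/\1/p')

echo "$poller -> http://localhost:$port/debug/pprof/heap"
```

The URL it prints is the one to pass to wget in the heap dump steps.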
oh, right. for some reason mine moved to port 1299X
sudo less /etc/nabox/harvest/harvest.yml can also be used
can you retry the command, replacing 12001 with the promPort of one of your pollers that is growing very large?
docker exec -it havrest wget -O heap1.pprof 'http://localhost:12001/debug/pprof/heap'
the HEAP collection is working with the correct port
any idea why the /etc/nabox directory is gone?
I used to be able to browse it, but not after the upgrade of Harvest and NABox
Are you logged in as admin? If so, the issue is that root owns /etc/nabox. You should be able to "see" it via sudo ls -la /etc/nabox/harvest/
uh. nevermind? i just exited root and came back. i usually use sudo su root when I log in, so when I exited, and then did that again it worked. maybe it was still doing something when I used it the first time?
yeah it's possible the fd of your current directory changed out from under the process when you upgraded
oh man. Nope. I'm just a complete idiot. Sorry for that! 🤣
No space...
no worries, it happens - I noticed that but assumed you mistyped in Discord, but typed it with a space at the terminal 🙂
Please upload the heap to the share when you get a chance
I also wanted to let you know that I spent last week trying to reproduce this issue, reviewing code and your logs. I have a theory on what's causing the problem. It's not a leak, but a problem with how Harvest is making REST calls.
ONTAP paginates REST responses and today Harvest walks those pagination links via recursion. That's fine for most calls, but when an endpoint has lots of pagination (which some of your servers do), Harvest uses more memory than it should. That's because as the pagination links are walked, each of the recursive calls retains the previous REST response on the stack and that adds up.
I'm working on a fix that we'll review and run through CI. I'll ping you when it's available. We're also working on another complementary change to stream the REST results back from ONTAP instead of reading all the paginated results at once.
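The retention effect described above can be sketched with a toy model. This is not Harvest's code: `fetch`, `walk_recursive`, `walk_iterative`, and the five-page "server" are all invented here, purely to count how many responses are held at the same time under each approach.

```shell
# Toy "server" with PAGES pages; each response names the next page (empty on the last)
PAGES=5
fetch() {
  local page=$1 next=""
  if [ "$page" -lt "$PAGES" ]; then next=$((page + 1)); fi
  echo "records-of-page-$page|$next"
}

live=0            # responses currently held in some variable
peak_recursive=0  # most responses held at once by the recursive walk
peak_iterative=0  # most responses held at once by the loop

# Recursive walk: the response for page N stays in a local variable
# until every later page has been walked and the call returns.
walk_recursive() {
  local resp next
  resp=$(fetch "$1")
  live=$((live + 1))
  if [ "$live" -gt "$peak_recursive" ]; then peak_recursive=$live; fi
  next=${resp#*|}
  if [ -n "$next" ]; then walk_recursive "$next"; fi
  live=$((live - 1))
}

# Iterative walk: each new response overwrites the previous one,
# so only one is held at any time.
walk_iterative() {
  local page=$1 resp
  while [ -n "$page" ]; do
    resp=$(fetch "$page")
    live=1
    if [ "$live" -gt "$peak_iterative" ]; then peak_iterative=$live; fi
    page=${resp#*|}
  done
  live=0
}

walk_recursive 1
walk_iterative 1
echo "recursive peak: $peak_recursive, iterative peak: $peak_iterative"
```

In this model the recursive walk peaks at one held response per page, while the loop never holds more than one, which is the general shape of the problem on endpoints with many pages.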
Do you want me to wait until I have both HEAP files, or toss the first one now, and the next one tomorrow once I get it?
if it's not too much trouble, upload the first now and then the next one tomorrow
Ok, uploaded heap1.pprof. I can't validate the file, but it should be there for you!
I also grabbed support bundles from before and after the upgrade, let me know if those are interesting to you.
great! yes, those are useful too
heap upload works, thanks! - your inuse_space shows the recursive calls I was talking about
heap2 uploaded
Before and After upgrade bundles are also uploaded
Hey @slender tinsel thanks so much. the two performance improvements I mentioned yesterday are in the latest nightly if you want to try it out. This nightly includes the pagination recursion and rest streaming improvements.
nightly build installed
It is still climbing, the top yellow is the c* poller, the one that's been the high mem usage one.
still under 1g, and it's around 7g usage that it dies.
thanks for the update Ryan! We have another pull request out that should make a bigger difference: perf: improve memory and cpu performance of rest collector
the nightly build will be done in roughly 28 minutes if you want to try it. I'll post when it has completed. From the poller RSS panel you shared, it looks like we should be able to tell after an hour or so if this change improves memory for the c* poller
Nightly is up with the change for perf: improve memory and cpu performance of rest collector
Updated. Will check again tomorrow
This is a happy update! 🙂 top yellow is the c* poller
(and top green from before...)
Awesome! Thanks for the update and trying the nightlies
how much memory is the yellow line showing now, looks like something between 100-200 MB?
For that period of time after the nightly
Thanks again Chris, it looks to be holding steady for nearly 7 days at this point:
Not sure why the colors change every time, but green is the problematic C* poller
awesome! Roughly a 40x improvement! \o/
In terms of why the colors changed, my guess is your promPort changed when you restarted nabox after upgrading. You can confirm this by going to VictoriaMetrics's UI (https://$ip/vm/vmui/) and querying for something like this (replace poller= with your poller name):
sum(poller_memory{poller="umeng-aff300-01-02",metric="rss"}) without (pid)
If I change the promPort manually I can reproduce multiple time series. Notice two rows in the table below and two colors. Do you see something like this? You may need to increase the time range at the top to include your restart
Just bumping back to say thanks again! Rock solid after this fix!!