#Got an OOM-killed message, shutting down collection for one of my clusters

slender tinsel
#

I'm only monitoring 4 clusters, each one is only a single HA-Pair. I know I had workload collection enabled, and I have since disabled this, but with 12g of RAM I'm not sure why I ran out. Can I get someone to help determine what used up all the RAM for this system please? I just pulled the support bundle...

radiant cobalt
#

@slender tinsel Can you share the results of running these commands for sizing? Feel free to send them by email (ng-harvest-files@netapp.com) if the output is large

ps -o pid,time,rss,comm,args
free
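
If the output is long, a rough sketch like the one below can condense it before sharing. The helper names `top_rss` and `rss_total_mib` are made up for illustration, and the sketch assumes RSS is the third column and is reported in KiB, as in the `ps` format above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that read `ps -o pid,time,rss,comm,args` output on stdin.

# Print the N processes with the largest RSS (third column), largest first.
top_rss() {
  local n="${1:-5}"
  # Drop the header line, sort numerically on column 3 in reverse, keep N rows.
  tail -n +2 | sort -k3,3 -n -r | head -n "$n"
}

# Print the sum of the RSS column in whole MiB (ps reports RSS in KiB).
rss_total_mib() {
  awk 'NR > 1 { total += $3 } END { printf "%d\n", total / 1024 }'
}
```

For example, `ps -e -o pid,time,rss,comm,args | top_rss 10` lists the ten largest processes on the box; the `-e` is added here so that every process is included, not just the ones attached to the current terminal.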
crisp bridge
#

Hi @slender tinsel you might hit a size limit with email. Please upload the support bundle here https://upload.nabox.org/cane-tyve-qeyu
We may want to grab a heap dump, I'll let you know after looking at your support bundle

slender tinsel
#

I just logged in this morning and ran that command, was I supposed to do something different?

slender tinsel
crisp bridge
#

thanks for uploading Ryan, looking now. Closer to OOM is better but let me take a look at what you've uploaded

slender tinsel
#

it was only 1 of the clusters though, the other 3 were ok.

crisp bridge
#

that matches what I see in the logs, oom-kill happened at Nov 05 21:16:26

#

If you take a look at the Metadata dashboard, Poller RSS memory panel, you should see one poller that is much larger than the others, ~1.8 GB

#

the panel that looks like this

slender tinsel
#

Yeah, I see it, mine looks a lot cleaner than yours, but that's likely because the sizes are larger, making the variations smaller...
It seems to have gotten all the way to 8g before being killed... Is that a memory leak or something? Yesterday you can see all of them drop when I restarted NABox for poller modifications (I was trying to turn off the worker processes as we don't use them)

crisp bridge
#

that does look like a leak. The green line, after restart, looks like it is slowly growing. You are on 24.08.0-1. Would you be able to upgrade to 24.11? There are memory improvements in 24.11 and we added the ability to capture a heap dump in that version. I'm hopeful that a heap dump from the green poller would help root cause any memory leak

slender tinsel
#

Sure. the green line does definitely climb faster? more? than the others. I'm not sure I'll be able to upgrade this week, we're on a change freeze. I'll ask since it's just a monitoring tool, but it may have to wait until Monday.

crisp bridge
#

understood. I'll keep digging through the log files and see if anything else jumps out

crisp bridge
#

The support bundle shows that the s* poller is using a lot of memory monitoring an ONTAP cluster at version 9.13.1. Is that the green poller in your screenshot? Looks like you have enabled some of the disabled-by-default templates that are expensive: Rest:CIFSSession, Rest:CIFSShare, and Rest:NFSClients. Rest:CIFSSession is collecting 9,795 instances and 137,130 metrics, of which, 19,590 are exported per poll.

#

If you don't need those templates, you will definitely save memory by disabling them. That said, I don't think those templates fully explain green's growth. Let's try chasing that growth once you've had a chance to upgrade. Something like, grab a heap dump for the largest poller after upgrade, then grab another heap dump from the same poller a day later. Upload both the heap dumps to the share above.

#

To grab the heap dumps in nabox4, do this after upgrading to 24.11:

  0. wait at least 10 minutes after upgrade
  1. ssh into nabox
  2. docker exec -it havrest wget -O heap1.pprof 'http://localhost:12001/debug/pprof/heap'
  3. wait one day
  4. docker exec -it havrest wget -O heap2.pprof 'http://localhost:12001/debug/pprof/heap'
  5. docker cp havrest:/harvest/heap1.pprof .
  6. docker cp havrest:/harvest/heap2.pprof .
  7. upload those two files to the share
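
If you'd rather script the steps above, here is a minimal sketch. The function names `heap_url` and `grab_heap` are hypothetical; the container name `havrest`, the `/harvest` path, and the default port 12001 are taken from the steps above. The `-it` flags are left out on the assumption that the function may run without a terminal attached.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the manual heap-dump steps.

# Build the heap-profile URL for a poller's promPort (defaults to 12001).
heap_url() {
  local port="${1:-12001}"
  printf 'http://localhost:%s/debug/pprof/heap\n' "$port"
}

# Capture one heap profile inside the havrest container, then copy it
# to the current directory on the nabox host.
#   $1 = output file name, $2 = promPort of the poller to profile
grab_heap() {
  local out="$1" port="${2:-12001}"
  docker exec havrest wget -O "$out" "$(heap_url "$port")" &&
    docker cp "havrest:/harvest/$out" .
}
```

Usage would be `grab_heap heap1.pprof 12001`, then `grab_heap heap2.pprof 12001` a day later. If the Go toolchain is installed on your workstation, `go tool pprof -sample_index=inuse_space -base heap1.pprof heap2.pprof` shows what grew between the two dumps.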
slender tinsel
# crisp bridge The support bundle shows that the s* poller is using a lot of memory monitoring ...

Green is the C* poller, I* is red, S* is blue, and H* is yellow.
These systems are NAS only, so NFS and CIFS are the important templates, we're trying to get info about how to track usage on CIFS and NFSv3 (we have no v4) and what info we can get out of it. I oversized (12g RAM) the NABox system because I expected to utilize quite a bit for these templates across the board, but if I need to go bigger I can. however, with the growth of green, it doesn't look like it's releasing it as necessary, so I think you're right there's some shenanigans going on with that.

#

I got declined for upgrade this week, so I'll put it in for Monday or Tuesday next week. Should I do Harvest and NABox, or just Harvest for the upgrade.
Do you want a heap dump now, or just right after the upgrade next week?

crisp bridge
#

upgrading only Harvest is sufficient. In 24.08 it's not possible to gather a heap dump without restarting with additional args. I can figure those out if you want to try that this week, or we can wait until next week when the steps I provided above will work out-of-the-box. In terms of when to gather them, I've added a step 0 to the notes above to make that clearer

slender tinsel
crisp bridge
#

Thanks for the color decoder. From the logs, the S* cluster was using 2.3 GB of memory on 2024-11-11T01:33:52Z. C* cluster is using a svelte 32MB, but that's because that poller was restarted right before the RSS was printed, so not very useful

slender tinsel
crisp bridge
#

can you do a docker ps -a and paste the results?

#

replace the port 12001 with the promPort from one of your pollers that has been using a lot of memory

slender tinsel
#

I don't have a prom port? I don't think any of this is a security issue, but let me know if I should edit...

15455069f5f7   havrest:latest     "/havrest"               39 minutes ago   Up 38 minutes             8080/tcp                                                                   havrest
a6c4aab594ac   grafana-oss        "/run.sh"                39 minutes ago   Up 38 minutes             127.0.0.1:3000->3000/tcp                                                   grafana
383f9b3351ae   vmalert            "/vmalert-prod -rule…"   39 minutes ago   Up 38 minutes             8880/tcp                                                                   vmalert
143ac9031736   vmagent            "/vmagent-prod -remo…"   39 minutes ago   Up 38 minutes             8429/tcp                                                                   vmagent
a0f735305213   victoria-metrics   "/victoria-metrics-p…"   6 days ago       Up 38 minutes             127.0.0.1:8428->8428/tcp                                                   victoria-metrics
dcb0491b2f23   cadvisor           "/usr/bin/cadvisor -…"   2 months ago     Up 38 minutes (healthy)   8080/tcp                                                                   cadvisor
d83c1654d660   node-exporter      "/bin/node_exporter …"   2 months ago     Up 38 minutes             9100/tcp                                                                   node-exporter
b1522d5167bc   traefik:v3.0       "/entrypoint.sh trae…"   3 months ago     Up 38 minutes             0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   traefik
c0349504e5ac   alertmanager       "/bin/alertmanager -…"   3 months ago     Up 38 minutes             9093/tcp                                                                   alertmanager
#

Now I'm really confused! I went to look at the harvest.yml file, but...

bash: cd/etc/nabox: No such file or directory
crisp bridge
#

thanks Ryan - yes, none of that is a security concern. There are several ways to determine your poller's promPort. Let's try this one

ps aux | grep poller

You should see something like this that shows the poller name and promPort

So in this example, the poller A250-41-42-43 has promPort 12001
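
To pull just the name and port out of that output, something like this sketch could work. It assumes each poller's command line carries `--poller <name>` and `--promPort <port>` arguments, which is an assumption about what your `ps` output looks like; `poller_ports` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps aux` lines on stdin and print
# "<poller-name> <promPort>" for every line that carries both flags.
# Assumes the flags appear as "--poller NAME" and "--promPort PORT".
poller_ports() {
  awk '{
    name = ""; port = ""
    # Stop one field early so that $(i + 1) always exists.
    for (i = 1; i < NF; i++) {
      if ($i == "--poller")   name = $(i + 1)
      if ($i == "--promPort") port = $(i + 1)
    }
    if (name != "" && port != "") print name, port
  }'
}
```

Run it as `ps aux | grep poller | poller_ports`. The `grep` process itself is skipped because its arguments contain neither flag.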

slender tinsel
#

oh, right. for some reason mine moved to port 1299X

crisp bridge
#

sudo less /etc/nabox/harvest/harvest.yml can also be used
can you retry the command, replacing 12001 with the promPort of one of your pollers that is growing very large?
docker exec -it havrest wget -O heap1.pprof 'http://localhost:12001/debug/pprof/heap'

slender tinsel
#

any idea why the /etc/nabox directory is gone?

#

I used to be able to browse it, but not after the upgrade of Harvest and NABox

crisp bridge
#

Are you logged in as admin? If so, the issue is that root owns /etc/nabox. You should be able to "see" it via sudo ls -la /etc/nabox/harvest/

slender tinsel
#

uh. nevermind? i just exited root and came back. i usually use sudo su root when I log in, so when I exited, and then did that again it worked. maybe it was still doing something when I used it the first time?

crisp bridge
#

yeah it's possible the fd of your current directory changed out from under the process when you upgraded

slender tinsel
#

oh man. Nope. I'm just a complete idiot. Sorry for that! 🤣
No space...

crisp bridge
#

no worries, it happens - I noticed that but assumed you mistyped in Discord, but typed it with a space at the terminal 🙂

#

Please upload the heap to the share when you get a chance

#

I also wanted to let you know that I spent last week trying to reproduce this issue, reviewing code and your logs. I have a theory on what's causing the problem. It's not a leak, but a problem with how Harvest is making REST calls.

ONTAP paginates REST responses and today Harvest walks those pagination links via recursion. That's fine for most calls, but when an endpoint has lots of pagination (which some of your servers do), Harvest uses more memory than it should. That's because as the pagination links are walked, each of the recursive calls retains the previous REST response on the stack and that adds up.

I'm working on a fix that we'll review and run through CI. I'll ping you when it's available. We're also working on another complementary change to stream the REST results back from ONTAP instead of reading all the paginated results at once.
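
To make the recursion point concrete, here is a toy illustration in shell. It is not Harvest's code: the `fetch_page` function fakes a paginated endpoint, and the `HELD` / `MAX_HELD` counters are bookkeeping added only to show how many responses are alive at the same time under each approach.

```shell
#!/usr/bin/env bash
# Toy model of walking pagination links; every name here is illustrative.
TOTAL_PAGES=5

# Fake endpoint: prints "<next-page>|<records>"; next-page is empty on the last page.
fetch_page() {
  local n="$1" next=""
  (( n < TOTAL_PAGES )) && next=$(( n + 1 ))
  printf '%s|records-from-page-%s\n' "$next" "$n"
}

HELD=0      # responses currently held in live variables
MAX_HELD=0  # high-water mark of HELD

# Recursive walk: each call keeps its own response until the deeper calls return.
walk_recursive() {
  local resp next
  resp="$(fetch_page "$1")"
  HELD=$(( HELD + 1 )); (( HELD > MAX_HELD )) && MAX_HELD=$HELD
  next="${resp%%|*}"
  [ -n "$next" ] && walk_recursive "$next"
  HELD=$(( HELD - 1 ))
}

# Iterative walk: one variable is overwritten per page, so only one response is held.
walk_iterative() {
  local next="$1" resp
  while [ -n "$next" ]; do
    resp="$(fetch_page "$next")"
    HELD=1; (( HELD > MAX_HELD )) && MAX_HELD=$HELD
    next="${resp%%|*}"
  done
  HELD=0
}
```

With five pages, `walk_recursive 1` peaks at five held responses while `walk_iterative 1` never holds more than one. Scale that up to an endpoint with many pages of large responses and the difference adds up quickly.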

slender tinsel
#

Do you want me to wait until I have both HEAP files, or toss the first one now, and the next one tomorrow once I get it?

crisp bridge
#

if it's not too much trouble, upload the first now and then the next one tomorrow

slender tinsel
#

I also grabbed support bundles from before and after the upgrade, let me know if those are interesting to you.

crisp bridge
#

great! yes, those are useful too

#

heap upload works, thanks! - your inuse_space shows the recursive calls I was talking about

slender tinsel
#

Before and After upgrade bundles are also uploaded

crisp bridge
#

Hey @slender tinsel thanks so much. the two performance improvements I mentioned yesterday are in the latest nightly if you want to try it out. This nightly includes the pagination recursion and rest streaming improvements.

crisp bridge
#

nice!

#

let's check the memory of the pollers tomorrow and see how they're doing

slender tinsel
#

It is still climbing, the top yellow is the c* poller, the one that's been the high mem usage one.

#

still under 1g, and it's around 7g usage that it dies.

crisp bridge
#

the nightly build will be done in roughly 28 minutes if you want to try it. I'll post when it has completed. From the poller RSS panel you shared, it looks like we should be able to tell after an hour or so if this change improves memory for the c* poller

crisp bridge
#

Nightly is up with the change for perf: improve memory and cpu performance of rest collector

slender tinsel
#

Updated. Will check again tomorrow

slender tinsel
#

This is a happy update! 🙂 top yellow is the c* poller

#

(and top green from before...)

crisp bridge
#

Awesome! Thanks for the update and trying the nightlies

#

how much memory is the yellow line showing now, looks like something between 100-200 MB?

slender tinsel
#

Thanks again Chris, it looks to be holding steady for nearly 7 days at this point:

#

Not sure why the colors change every time, but green is the problematic C* poller

crisp bridge
#

awesome! Roughly a 40x improvement! \o/
In terms of why the colors changed, my guess is your promPort changed when you restarted nabox after upgrading. You can confirm this by going to VictoriaMetrics's UI (https://$ip/vm/vmui/) and querying for something like this (replace poller= with your poller name):

sum(poller_memory{poller="umeng-aff300-01-02",metric="rss"}) without (pid)

If I change the promPort manually I can reproduce multiple time series. Notice two rows in the table below and two colors. Do you see something like this? You may need to increase the time range at the top to include your restart
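
If you prefer the command line over the UI, the same query can be sketched with curl. Two assumptions here: that the Prometheus-compatible `/api/v1/query` endpoint is reachable under the same `/vm/` prefix as the UI, and that no extra authentication is needed. Both may differ on your nabox. `rss_query` and `vm_rss` are made-up helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for running the RSS query outside the UI.

# Build the query string for one poller name.
rss_query() {
  printf 'sum(poller_memory{poller="%s",metric="rss"}) without (pid)\n' "$1"
}

# Send the query to VictoriaMetrics.
#   $1 = nabox IP or hostname, $2 = poller name
# ASSUMPTION: the query API is exposed at https://<ip>/vm/api/v1/query.
# -k skips TLS verification, on the assumption of a self-signed certificate.
vm_rss() {
  local ip="$1" poller="$2"
  curl -s -k --get "https://${ip}/vm/api/v1/query" \
    --data-urlencode "query=$(rss_query "$poller")"
}
```

More than one entry in the JSON `result` array for the same poller would point to the multiple time series described above.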

slender tinsel
#

Just bumping back to say thanks again! Rock solid after this fix!!