Hi Yann8373 we have an interesting | NetApp | Page 1

upper coral Sep 2, 2022, 11:41 AM

#

Yes, looked like the exact same issue as another user, Prometheus is crashing

#

anything in dmesg ? Like Kernel OOM ?

#

oh wait, you are the user in question 😄

white flare Sep 2, 2022, 12:49 PM

#

Yes looks like a memory issue
[86684.876436] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/c51f43fe444eb668ce6d695a1895bd3b8ddfe80a4735398049d75858fefa3c12,task=prometheus,pid=24274,uid=65534
[86684.876476] Out of memory: Killed process 24274 (prometheus) total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB, shmem-rss:0kB, UID:65534 pgtables:33068kB oom_score_adj:0
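Editor's note: the kill line reports sizes in kB, which is hard to compare with the VM size at a glance. A minimal Python sketch for pulling the numbers out, assuming the kernel message has exactly the format shown in the log above; the helper name `parse_oom_kill` is made up for illustration.

```python
import re

# Matches the "Out of memory: Killed process ..." line format shown in the log.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<task>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, task name, and memory sizes in GiB, or None if no match."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    kib_per_gib = 1024 * 1024  # the kernel's "kB" here are 1024-byte units
    return {
        "pid": int(m.group("pid")),
        "task": m.group("task"),
        "total_vm_gib": int(m.group("total_vm")) / kib_per_gib,
        "anon_rss_gib": int(m.group("anon_rss")) / kib_per_gib,
    }


line = (
    "[86684.876476] Out of memory: Killed process 24274 (prometheus) "
    "total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB"
)
info = parse_oom_kill(line)
print(f"{info['task']} (pid {info['pid']}): anon-rss {info['anon_rss_gib']:.1f} GiB")
```

For the line in this thread that works out to roughly 13.5 GiB resident, which is consistent with Prometheus exhausting a 16 GB VM.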

#

Interestingly, after the reboot the systems screen is still empty

#

I'm now going from 16GB RAM to 24GB

#

done but still looks like that

upper coral Sep 2, 2022, 1:08 PM

#

How is prom doing now, does it start correctly ?

white flare Sep 2, 2022, 6:05 PM

#

yes now prom is back online and working well

#

#

Here you can see the memory usage over the last 30 days. I think the container crashed once the 16 GB was fully used

#

Here is a screenshot of what is currently configured, if you are interested 🙂

upper coral Sep 2, 2022, 7:02 PM

#

Damn... How many controllers is that ?

#

So you're saying it's all good now or NAbox home screen is still empty ?

white flare Sep 2, 2022, 8:46 PM

#

If you are interested I can give you some more insights

#

Yes, it is all good now, but I will keep an eye on the memory consumption from time to time

upper coral Sep 2, 2022, 10:03 PM

#

It seems Prometheus might become a problem for big installations honestly. Especially with 2y retention

white flare Sep 3, 2022, 11:19 AM

#

Let's see how prom does with 24 GB RAM 🙂 and I will give you the number of nodes being collected

white flare Sep 5, 2022, 7:19 AM

#

Hi Yann, currently we are collecting 48 nodes across 19 clusters

white flare Sep 6, 2022, 6:54 AM

#

Hi Yann, unfortunately Prometheus ran out of memory again 🙁 Looking at vCenter, the memory consumption increased extremely quickly within a short time

#

I will now increase the memory to 32GB

upper coral Sep 7, 2022, 7:31 AM

#

Wow, that's getting out of control...

#

Can you check https://nabox/prometheus/tsdb-status ?

#

Interesting thoughts here : https://source.coveo.com/2021/03/03/prometheus-memory/ I'd be curious about cardinality. If we're lucky, maybe some labels can be dropped ?

source.coveo.com

Prometheus - Investigation on high memory consumption

Our technical blog.
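Editor's note: the TSDB status page is backed by the Prometheus HTTP API endpoint `/api/v1/status/tsdb`, so the cardinality check suggested here can also be scripted. A minimal Python sketch, not a definitive tool: the base URL `https://nabox/prometheus` is an assumption derived from the link in this thread, the helper names are made up, and the sample values are invented purely to show the response shape (`user` and `user_id` are the labels discussed in this thread, `example_metric_*` are placeholders).

```python
import json
import urllib.request


def fetch_tsdb_status(base_url):
    """Fetch TSDB statistics from Prometheus' /api/v1/status/tsdb endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)["data"]


def top_entries(stats, key, n=5):
    """Return the n largest (name, value) pairs from one section of the stats."""
    entries = stats.get(key) or []
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


def print_cardinality_report(stats, n=5):
    """Print the heaviest metrics and labels, the usual cardinality suspects."""
    for key in (
        "seriesCountByMetricName",
        "labelValueCountByLabelName",
        "memoryInBytesByLabelName",
    ):
        print(key)
        for name, value in top_entries(stats, key, n):
            print(f"  {name}: {value}")


# Invented sample in the shape of the API's "data" object (not real measurements).
sample = {
    "seriesCountByMetricName": [
        {"name": "example_metric_a", "value": 2},
        {"name": "example_metric_b", "value": 5},
    ],
    "labelValueCountByLabelName": [
        {"name": "user", "value": 2},
        {"name": "user_id", "value": 3},
    ],
}
print_cardinality_report(sample)

# Against a live instance (assumed URL; a self-signed certificate would need
# extra TLS handling):
# print_cardinality_report(fetch_tsdb_status("https://nabox/prometheus"))
```

Labels with a very large value count in `labelValueCountByLabelName` are the first candidates to check when deciding whether a label can be dropped.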

white flare Sep 7, 2022, 7:51 AM

#

Currently it looks better than yesterday

#

upper coral Sep 7, 2022, 7:53 AM

#

@quick robin do we actually use user/user_id in dashboards ? Could that be optional ? I'm not sure that's the issue here but...

#

Those memory footprints look ridiculous but maybe that's not the actual memory used

quick robin Sep 8, 2022, 7:04 AM

#

user and user_id are related to the quota dashboard. We only use user in the dashboard and can disable user_id from export to reduce memory load. @white flare How many quotas do you have across clusters?

white flare Sep 8, 2022, 11:25 AM

#

@quick robin currently over the whole farm we have 65790 quotas
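Editor's note: in Prometheus every unique combination of metric name and label values is its own time series, so the quota count multiplies with the number of metrics exported per quota. A rough Python sketch of that arithmetic; the per-quota metric counts below are hypothetical placeholders, since the actual number depends on the Harvest quota template, which is not shown in this thread.

```python
def estimate_series(quotas, metrics_per_quota):
    """One series per (quota, metric) pair, assuming each quota has a unique label set."""
    return quotas * metrics_per_quota


quotas = 65790  # figure reported in this thread

# Hypothetical per-quota metric counts, not taken from Harvest.
for metrics_per_quota in (5, 10, 20):
    series = estimate_series(quotas, metrics_per_quota)
    print(f"{metrics_per_quota} metrics per quota -> {series} series")
```

The point is only the order of magnitude: even a handful of metrics per quota puts the quota data alone into the hundreds of thousands of series.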

#

What is also interesting is that the line was pretty flat, then consumption rose sharply until the process got killed, and since I extended the RAM to 32GB the line is flat again

upper coral Sep 12, 2022, 5:40 PM

#

I got another user whose Prometheus is stopping unexpectedly, might be related

rose breach Sep 12, 2022, 5:43 PM

#

hey @upper coral know what version of Harvest they're using and how many quotas they have? If it's the same issue and they are on 22.08, you can suggest they disable user and group collection by commenting those out in qtree.yaml per https://github.com/NetApp/harvest/pull/1271/files

tiny scarab Sep 12, 2022, 8:01 PM

#

@rose breach @upper coral We were running 21.11 but upgraded to 22.02 to see if it would fix the Prometheus issue but nothing so far.

upper coral Sep 12, 2022, 8:30 PM

#

Ok so that should be a different issue then.

white flare Sep 13, 2022, 6:56 PM

#

Hi @rose breach, we are currently using the nightly build with the quota fix for the duplicated entry issue we had

#

@upper coral what do you think, would it be possible to create something like an opt-in/opt-out page in the admin menu to select what gets collected?

rose breach Sep 13, 2022, 7:07 PM

#

that's great @white flare, hopefully the cmds I shared about collecting a poller trace worked for you

white flare Sep 13, 2022, 8:10 PM

#

rose breach that's great @white flare, hopefully the cmds I shared about collectin...

I'm probably getting old but which cmds?😅

upper coral Sep 14, 2022, 8:43 AM

#

@rose breach you're referring to the email chain ?
