#Hi Yann8373 we have an interesting


upper coral
#

Yes, looked like the exact same issue as another user, Prometheus is crashing

#

anything in dmesg ? Like Kernel OOM ?

#

oh wait, you are the user in question 😄

white flare
#

Yes looks like a memory issue
[86684.876436] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/c51f43fe444eb668ce6d695a1895bd3b8ddfe80a4735398049d75858fefa3c12,task=prometheus,pid=24274,uid=65534
[86684.876476] Out of memory: Killed process 24274 (prometheus) total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB, shmem-rss:0kB, UID:65534 pgtables:33068kB oom_score_adj:0
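To put the kernel numbers in perspective, here is a minimal sketch that pulls the counters out of the "Killed process" line above and converts the resident size to GiB. The helper names `parse_oom_kill` and `kb_to_gib` are made up for illustration, and it assumes the kernel's "kB" means 1024 bytes.

```python
import re

# Line copied from the dmesg output above (timestamp dropped).
OOM_LINE = (
    "Out of memory: Killed process 24274 (prometheus) "
    "total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:65534 pgtables:33068kB oom_score_adj:0"
)


def parse_oom_kill(line):
    """Return pid, process name and all kB counters from an OOM kill line."""
    head = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if head is None:
        raise ValueError("not an OOM kill line")
    # Only fields of the form name:<digits>kB are collected; UID and
    # oom_score_adj carry no kB suffix and are skipped.
    counters = {k: int(v) for k, v in re.findall(r"([\w-]+):(\d+)kB", line)}
    return {"pid": int(head.group(1)), "name": head.group(2), "kB": counters}


def kb_to_gib(kb):
    """Convert kernel kB (assumed to be 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


info = parse_oom_kill(OOM_LINE)
rss_gib = kb_to_gib(info["kB"]["anon-rss"])
print(f"{info['name']} (pid {info['pid']}): anon-rss = {rss_gib:.1f} GiB")
```

With the line above, the resident size comes out at roughly 13.5 GiB, which is consistent with Prometheus alone nearly filling a 16GB VM.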

#

Interestingly, after a reboot the systems screen is still empty

#

I'm now going from 16GB RAM to 24GB

#

done but still looks like that

upper coral
#

How is prom doing now, does it start correctly ?

white flare
#

yes now prom is back online and working well

#

here you can see the memory usage for the last 30 days. I think the container crashed when the 16GB was fully used

#

here is a screenshot of what is currently configured, if you are interested 🙂

upper coral
#

Damn... How many controllers is that ?

#

So you're saying it's all good now or NAbox home screen is still empty ?

white flare
#

If you are interested I can give you some more insights

#

Yes, it is all good now, but I will keep an eye on the memory consumption from time to time

upper coral
#

It seems Prometheus might become a problem for big installations honestly. Especially with 2y retention

white flare
#

Let's see how prom does with 24GB RAM 🙂 and I will give you the number of nodes being collected

white flare
#

Hi Yann, we are currently collecting from 48 nodes across 19 clusters

white flare
#

Hi Yann, unfortunately Prometheus ran out of memory again 🙁 When you look at it in vCenter, the memory consumption increased extremely quickly in a short time

#

I will now increase the memory to 32GB

upper coral
#

Wow, that's getting out of control...

#

Can you check https://nabox/prometheus/tsdb-status ?
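The same numbers shown on the tsdb-status page are also exposed by the Prometheus HTTP API at `/api/v1/status/tsdb`. Below is a rough sketch for ranking the labels it reports. The URL assumes NAbox proxies the API under the same `/prometheus` prefix as the page above, the function names are made up, and the sample payload is hypothetical (it only mimics the shape of the response, it is not data from this installation).

```python
import json
import urllib.request

# Assumption: the Prometheus API is reachable under the same /prometheus
# prefix as the tsdb-status page; adjust the host to your NAbox.
TSDB_STATUS_URL = "https://nabox/prometheus/api/v1/status/tsdb"


def fetch_tsdb_status(url=TSDB_STATUS_URL):
    """Download the TSDB status document (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def top_entries(status, section, limit=5):
    """Return the largest (name, value) pairs of one section of the response."""
    entries = status.get("data", {}).get(section, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]


# Hypothetical sample with made-up values, shaped like the API response.
SAMPLE = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "qtree", "value": 2},
            {"name": "user_id", "value": 3},
            {"name": "cluster", "value": 1},
        ],
        "memoryInBytesByLabelName": [
            {"name": "user_id", "value": 300},
            {"name": "qtree", "value": 120},
        ],
    },
}

for section in ("labelValueCountByLabelName", "memoryInBytesByLabelName"):
    print(section, top_entries(SAMPLE, section, limit=2))
```

Against a live instance, `top_entries(fetch_tsdb_status(), "memoryInBytesByLabelName")` would show which labels dominate. A self-signed certificate on the appliance may need extra TLS handling that this sketch leaves out.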

white flare
#

Currently it looks better than yesterday

upper coral
#

@quick robin do we actually use user/user_id in dashboards ? Could that be optional ? I'm not sure that's the issue here but...

#

Those memory footprints look ridiculous but maybe that's not the actual memory used

quick robin
#

user and user_id are related to the quota dashboard. We only use user in the dashboard and can disable user_id from the export to reduce memory load. @white flare How many quotas do you have across clusters?

white flare
#

@quick robin across the whole farm we currently have 65790 quotas

#

What is also interesting is that the line was pretty flat, then there was a huge spike in consumption until the process got killed, and since I extended the RAM to 32GB the line is flat again

upper coral
#

I got another user whose Prometheus is stopping unexpectedly, might be related

rose breach
#

hey @upper coral know what version of Harvest they're using and how many quotas they have? If it's the same issue and they are on 22.08, you can suggest they disable user and group collection by commenting those out in qtree.yaml per https://github.com/NetApp/harvest/pull/1271/files

tiny scarab
#

@rose breach @upper coral We were running 21.11 but upgraded to 22.02 to see if it would fix the Prometheus issue but nothing so far.

upper coral
#

Ok so that should be a different issue then.

white flare
#

Hi @rose breach we are currently using the nightly build with the quota fix for the duplicated entry issue we had

#

@upper coral what do you think, would it be possible to create something like an opt-in/opt-out page in the admin menu, to select what will be collected?

rose breach
#

that's great @white flare, hopefully the cmds I shared about collecting a poller trace worked for you

white flare
upper coral
#

@rose breach you're referring to the email chain ?