#Hi Yann8373 we have an interesting
Yes, looked like the exact same issue as another user, Prometheus is crashing
anything in dmesg ? Like Kernel OOM ?
oh wait, you are the user in question 😄
Yes looks like a memory issue
[86684.876436] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/c51f43fe444eb668ce6d695a1895bd3b8ddfe80a4735398049d75858fefa3c12,task=prometheus,pid=24274,uid=65534
[86684.876476] Out of memory: Killed process 24274 (prometheus) total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB, shmem-rss:0kB, UID:65534 pgtables:33068kB oom_score_adj:0
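For anyone reading those two dmesg lines later: the memory counters are in kB, so they are easier to judge once converted. A minimal sketch, assuming the line format shown above; `parse_oom_kill` and `kb_to_gib` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import re

# Kernel OOM-killer line copied from the dmesg output above.
OOM_LINE = (
    "[86684.876476] Out of memory: Killed process 24274 (prometheus) "
    "total-vm:104350756kB, anon-rss:14121444kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:65534 pgtables:33068kB oom_score_adj:0"
)


def parse_oom_kill(line):
    """Return task name, pid and the kB counters from an 'Out of memory' line."""
    head = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if head is None:
        return None
    # Only fields of the form name:<digits>kB are collected (total-vm, anon-rss, ...).
    counters = {name: int(value) for name, value in re.findall(r"([a-z-]+):(\d+)kB", line)}
    return {"pid": int(head.group(1)), "task": head.group(2), **counters}


def kb_to_gib(kb):
    """Convert kB (as printed by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


info = parse_oom_kill(OOM_LINE)
print(
    f"{info['task']} (pid {info['pid']}): "
    f"anon-rss {kb_to_gib(info['anon-rss']):.1f} GiB, "
    f"total-vm {kb_to_gib(info['total-vm']):.1f} GiB"
)
```

The anon-rss value is the resident memory the process actually held when it was killed, which is the number to compare against the 16 GB the VM had at that point.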
Interesting: after the reboot the Systems screen is still empty
I'm now going from 16GB RAM to 24GB
done but still looks like that
How is prom doing now, does it start correctly ?
yes now prom is back online and working well
Here you can see the memory usage over the last 30 days. I think the container crashed once the 16 GB was fully used
Here is a screenshot of what is currently configured, if you are interested 🙂
Damn... How many controllers is that ?
So you're saying it's all good now or NAbox home screen is still empty ?
If you are interested I can give you some more insights
Yes, it is all good now, but I will keep an eye on the memory consumption from time to time
It seems Prometheus might become a problem for big installations honestly. Especially with 2y retention
Let's see how prom does with 24 GB RAM 🙂 and I will give you the number of nodes being collected
Hi Yann, currently we are collecting 48 nodes across 19 clusters
Hi Yann, unfortunately Prometheus ran out of memory again 🙁 If you look at vCenter, the memory consumption increased extremely quickly in a short time
I will now increase the memory to 32 GB
Wow, that's getting out of control...
Can you check https://nabox/prometheus/tsdb-status ?
Interesting thoughts here : https://source.coveo.com/2021/03/03/prometheus-memory/ I'd be curious about cardinality. If we're lucky, maybe some labels can be dropped ?
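The same numbers shown on the tsdb-status page are also exposed as JSON by Prometheus' `/api/v1/status/tsdb` endpoint, which makes the cardinality check scriptable. A rough sketch: the base URL you pass to `fetch_tsdb_status` is an assumption (on NAbox it is presumably something under `https://nabox/prometheus`, and a self-signed certificate would need extra handling), and the `SAMPLE` payload below is made up purely to show the shape of the response, not real data from this installation.

```python
import json
import urllib.request


def fetch_tsdb_status(base_url):
    """Fetch TSDB statistics from a Prometheus server (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)["data"]


def top_entries(status, key, n=5):
    """Return the n largest (name, value) pairs from one list in the status payload."""
    rows = status.get(key) or []
    pairs = ((row["name"], row["value"]) for row in rows)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up sample in the shape of the "data" object; all names and values are hypothetical.
SAMPLE = {
    "headStats": {"numSeries": 1200000},
    "seriesCountByMetricName": [
        {"name": "example_other_metric", "value": 90000},
        {"name": "example_quota_metric", "value": 400000},
    ],
    "labelValueCountByLabelName": [
        {"name": "user_id", "value": 60000},
        {"name": "user", "value": 60000},
        {"name": "cluster", "value": 19},
    ],
}

print("series in head:", SAMPLE["headStats"]["numSeries"])
for key in ("seriesCountByMetricName", "labelValueCountByLabelName"):
    print(key)
    for name, value in top_entries(SAMPLE, key, n=3):
        print(f"  {name}: {value}")
```

Against a live server, replace `SAMPLE` with `fetch_tsdb_status(base_url)`. Labels with a very high value count are the first candidates to look at when deciding what could be dropped.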
@quick robin do we actually use user/user_id in dashboards ? Could that be optional ? I'm not sure that's the issue here but...
Those memory footprints look ridiculous but maybe that's not the actual memory used
user and user_id are related to the quota dashboard. We only use user in the dashboard and can disable user_id from the export to reduce memory load. @white flare How many quotas do you have across clusters?
@quick robin currently over the whole farm we have 65790 quotas
What is also interesting: the line was pretty flat, then consumption climbed steeply until the process got killed, and since I extended the RAM to 32 GB the line is flat again
I got another user whose Prometheus is stopping unexpectedly, might be related
hey @upper coral know what version of Harvest they're using and how many quotas they have? If it's the same issue and they are on 22.08, you can suggest they disable user and group collection by commenting those out in qtree.yaml per https://github.com/NetApp/harvest/pull/1271/files
@rose breach @upper coral We were running 21.11 but upgraded to 22.02 to see if it would fix the Prometheus issue but nothing so far.
Ok so that should be a different issue then.
Hi @rose breach we are currently using the nightly build with the quota fix for the duplicated entry issue we had
@upper coral what do you think, would it be possible to create something like an opt-in/opt-out page in the admin menu to select what gets collected?
that's great @white flare, hopefully the cmds I shared about collecting a poller trace worked for you
I'm probably getting old but which cmds?😅
@rose breach you're referring to the email chain ?