#Prometheus Consuming Huge Space / chunks

remote knoll
#

We are monitoring 38 clusters, mainly 2-node, plus several 6-node clusters.

I have the Prometheus retention set to 180 days but I'm constantly running out of space. We are now at 600GB for the data drive and at 93% as I just increased it from 580GB.

I'm looking at the /data/prometheus/data directory, which is consuming 538GB. Running du -h /data/prometheus/data, I see there are roughly 30+ 01HXXXXX directories. Shouldn't Prometheus be compressing that data? The WAL is only 3.8GB currently.

Just to give an example:
36G /data/prometheus/data/01HE6B4XWAWWEA41FHM1V4YF45/chunks
38G /data/prometheus/data/01HE6B4XWAWWEA41FHM1V4YF45
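
For a rough sense of whether 538GB is plausible at 180 days of retention, the Prometheus storage documentation gives a capacity rule of thumb: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample after compression. Below is a minimal sketch that inverts that formula to estimate the ingestion rate implied by the numbers above. The 1.5 bytes/sample default is an assumption, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=1.5):
    """Rule-of-thumb disk usage: retention * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def implied_samples_per_second(disk_bytes, retention_days, bytes_per_sample=1.5):
    """Invert the rule of thumb to get the ingestion rate a given disk usage implies."""
    return disk_bytes / (retention_days * SECONDS_PER_DAY * bytes_per_sample)


if __name__ == "__main__":
    # 538GB of blocks and 180 days of retention, as quoted in the thread.
    rate = implied_samples_per_second(538e9, 180)
    print(f"implied ingestion rate: {rate:,.0f} samples/s")
```

With these inputs the estimate lands in the low tens of thousands of samples per second, so lowering either the retention or the number of series scraped is what moves the total.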

Any suggestions?

tropic dew
#

That is actually really good IMO.

#

If you consider that a standard perf archive for support purposes is about 100 MB compressed, times 68+ nodes... it sounds about right.

#

PAs have a lot more data overall...

fierce blaze
#

You might want to disable detailed metrics collection, there will be a checkbox for this in 3.4

remote knoll
#

@fierce blaze How would I go about doing that? Also, we don't really need to monitor quotas and possibly even qtrees, is there a simple way to prevent polling those and delete the metrics it's captured?

fierce blaze
#

I would recommend you update to 3.4b. Unfortunately, it's not available at the moment.

#

back up in a few days probably

quiet dragon
remote knoll
#

@quiet dragon I've emailed logs to the address you provided.

quiet dragon
#

Thanks, received.

smoky tusk
fierce blaze
#

It’s generated with every start, unfortunately. You can add things but not remove anything from the defaults.

smoky tusk
#

Would a temporary workaround be to make the changes in /opt/packages/harvest2/conf? Those will persist across restarts until you upgrade Harvest.

remote knoll
#

I'm having an issue with Prometheus replaying the WAL: it times out, then restarts the replay. I have free space and increased the RAM to 64GB.

quiet dragon
#

@remote knoll Based on the logs, there are 2 pollers where Qtree and CIFSSession have high object counts. We can disable the Zapi Qtree and ZapiPerf Qtree templates, as well as the Zapi CIFSSession template, if you are not using them.
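
As an illustration only: assuming the collector's default.yaml lists its templates under an `objects:` section as `Name: template.yaml` pairs, disabling an object amounts to commenting out its line. The sketch below does that on a text sample. The sample content, including the `cifs_session.yaml` file name, is hypothetical and not copied from a real install, and the helper name is made up. Per the earlier messages, NABox may regenerate the defaults on start, so where the edit has to live is not settled here.

```python
import re


def disable_objects(yaml_text, names):
    """Comment out `Name: template.yaml` entries for the given object names."""
    wanted = set(names)
    out = []
    for line in yaml_text.splitlines():
        # Match an uncommented "key: value" line and capture its indentation.
        m = re.match(r"^(\s*)([A-Za-z0-9_]+):\s*\S+", line)
        if m and m.group(2) in wanted:
            out.append(f"{m.group(1)}# {line.strip()}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample of a collector default.yaml (assumed layout).
SAMPLE = """\
collector: Zapi
objects:
  Node: node.yaml
  Qtree: qtree.yaml
  CIFSSession: cifs_session.yaml
"""

if __name__ == "__main__":
    print(disable_objects(SAMPLE, ["Qtree", "CIFSSession"]))
```

Lines that are already commented out are left alone, so running it twice gives the same result.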

remote knoll
#

We are not, so I would imagine that would help with performance.

#

And those clusters are actually the same. The 01 was renamed to the -GEN. In NABox Admin I updated the cluster name but kept the 01 address so we wouldn't lose historical data, but I think we ended up losing it anyway.

#

@quiet dragon Will you be sending an email on disabling those?

quiet dragon
#

Do you mean the steps to disable them in NABox?

remote knoll
#

Yes

quiet dragon
#

Sure I'll share in a moment.

remote knoll
#

Thank you

quiet dragon
remote knoll
#

Thank you kindly

#

I have implemented the changes. Is there a way to delete the stats for these to free up space?

smoky tusk
remote knoll
#

I have seen that. Do I need to do it for all qtrees/CIFS sessions or is there a wildcard I could use?
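
On the wildcard question: the Prometheus TSDB admin API accepts series selectors in `match[]`, so a regex on `__name__` can cover a whole family of metrics in one call. A minimal sketch that only builds the requests, assuming Prometheus listens on localhost:9090 and was started with `--web.enable-admin-api`. The `qtree_` and `cifs_session_` name prefixes are assumptions about how the metrics are named and should be checked against a real query first.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def delete_series_request(base_url, matchers):
    """Build a POST to delete_series with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/delete_series?{query}"
    return Request(url, method="POST")


def clean_tombstones_request(base_url):
    """Build a POST to clean_tombstones, which removes deleted data from disk."""
    url = f"{base_url.rstrip('/')}/api/v1/admin/tsdb/clean_tombstones"
    return Request(url, method="POST")


if __name__ == "__main__":
    base = "http://localhost:9090"  # assumed address
    # Assumed metric name prefixes; verify them before deleting anything.
    matchers = ['{__name__=~"qtree_.*"}', '{__name__=~"cifs_session_.*"}']
    for req in (delete_series_request(base, matchers), clean_tombstones_request(base)):
        print(req.get_method(), req.full_url)
        # urlopen(req)  # uncomment to actually send the request
```

Deleting series only marks them with tombstones; the space comes back after a later compaction or an explicit clean_tombstones call.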

smoky tusk
remote knoll
#

@smoky tusk Thank you very much! I was just looking at updating that metric and looking for the correct syntax so your response was very helpful.

#

The issue I have at the moment, however, is that Prometheus will not come up. It keeps replaying the WAL, so I'm unable to work out the correct query.

smoky tusk
#

It may take a while to start. Another person was facing a similarly long startup time in this thread (not sure you saw it): #1156924701621366794 message

remote knoll
#

It'll run, then eventually show "Error: Error fetching Runtime Information: Bad Gateway". If I refresh the Prometheus web interface, it restarts the replay.

#

Looks like they experienced the same thing I am. I can wait and see if it finishes, but if not I'll do what was recommended and delete the WAL files.
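
For the "wait and see" option, Prometheus exposes a `/-/ready` management endpoint that returns 200 only once it is ready to serve traffic, so it can be polled without reloading the web UI. A minimal sketch, assuming direct access to Prometheus on localhost:9090 rather than through the NABox proxy; the function names and the retry numbers are arbitrary choices for illustration.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def is_ready(base_url, timeout=5.0):
    """Return True if GET /-/ready answers 200; errors and timeouts count as not ready."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/-/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Non-2xx responses raise HTTPError (a URLError subclass) and land here.
        return False


def wait_until_ready(probe, attempts, delay_seconds, sleep=time.sleep):
    """Call probe() up to `attempts` times; return the attempt that succeeded, else None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None


# Example usage (assumed address; checks once a minute for up to two hours):
# wait_until_ready(lambda: is_ready("http://localhost:9090"), attempts=120, delay_seconds=60)
```

The probe is passed in as a callable so the retry loop can be exercised without a running server.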