#NaBox strange Prometheus issue
hi @dark rover can you ssh to your nabox instance and take a look at the Prometheus logs by running docker logs prometheus
Please find the log file attached. The server becomes very slow and almost unresponsive.
@dark rover The logs you've provided indicate that Prometheus is in the process of loading its Write-Ahead Log (WAL). The WAL records recent samples that are still held in memory and have not yet been compacted into a persistent block on disk, so they are not lost in the event of a crash.
It means that Prometheus is still in the process of starting up. Depending on the amount of data stored in the WAL, this process can take some time. Once all the segments up to maxSegment are loaded, Prometheus will be ready to serve queries.
If you're experiencing issues with Prometheus while these logs are being generated (like the "error fetching runtime information: failed to fetch" error you mentioned earlier), it could be because Prometheus is not yet fully started.
ts=2023-09-28T13:34:16.322Z caller=head.go:755 level=info component=tsdb msg="WAL segment loaded" segment=1091 maxSegment=7089
We need to wait until all WAL segments are loaded.
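To see how far along the replay is without reading the whole log, something like the sketch below could help. It only assumes the log line format shown above (`segment=` and `maxSegment=` fields); `wal_progress` is a made-up helper name, not part of Prometheus or NaBox.

```shell
# Hypothetical helper: print WAL replay progress from Prometheus log
# lines read on stdin, using the most recent "WAL segment loaded" line.
wal_progress() {
  grep 'msg="WAL segment loaded"' | tail -n 1 |
    sed -n 's/.*segment=\([0-9][0-9]*\) maxSegment=\([0-9][0-9]*\).*/\1 \2/p' |
    LC_ALL=C awk '{
      pct = 0
      if ($2 > 0) pct = 100 * $1 / $2
      printf "%d/%d segments (%.1f%%)\n", $1, $2, pct
    }'
}
```

Usage on the NaBox host would be `docker logs prometheus 2>&1 | wal_progress`. For the line above it prints `1091/7089 segments (15.4%)`. It prints nothing if no matching line is found.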
also check how much free disk space you have, could be that Prometheus is taking a long time to reply due to low disk space
And finally, it's possible that Prometheus runs out of memory during the replay and ends up in a crash loop. One solution in that case is to prune the WAL files.
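A quick way to check both points (disk space and a possible out-of-memory kill) might look like the sketch below. `disk_used_pct` is a made-up helper name, and the `/data/prometheus/data` path and `prometheus` container name are taken from this thread, so adjust them if your setup differs.

```shell
# Hypothetical helper: print the used-space percentage (digits only)
# of the filesystem that holds the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Example checks on the NaBox host (not run here):
#   disk_used_pct /data/prometheus/data
#   free -h
#   docker inspect --format '{{.State.OOMKilled}} {{.RestartCount}}' prometheus
```

If the `docker inspect` line reports `true` for `OOMKilled`, the container was most likely killed for exceeding available memory rather than failing for some other reason.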
Seems like it never finishes loading the WAL files... disk space is ok. Added extra memory and rebooted
Server runs fine. But after rebooting, it will load the WAL again and it needs the 64 GB again. I will attach new logs.
New logs
@dark rover The duration of the WAL replay is directly proportional to the volume of changes recorded in it. A high rate of data ingestion or frequent modifications can lead to a larger WAL, which in turn increases the time required for replay.
Could you please send the Harvest logs to ng-harvest-files@netapp.com? Instructions for collecting these logs are at the following link: https://netapp.github.io/harvest/23.08/help/log-collection/#nabox.
It's good news the replay is successful, but it's abnormal the replay happens with every reboot.
I would like to test a dc restart prometheus to see if it replays as well
Will do!
Hi, here is the output.
dc restart prometheus
Restarting prometheus ... done
This also starts the WAL replay.
wow.
Sorry.....
(probably) one solution would be to stop prometheus and delete (or move) the WAL files.
But I don't know how much history you would lose
Please do
ls -l /data/prometheus/data
ls -l /data/prometheus/data/wal
ls -l /data/prometheus/data/chunks_head
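If the full `ls -l` listings are too long to paste, a short summary of the WAL directory could be enough. This is just a sketch: `wal_summary` is a made-up helper name, and it defaults to the WAL path used in this thread.

```shell
# Hypothetical helper: summarize a WAL directory as
# "files=<count> size=<du total> oldest=<oldest entry by mtime>".
wal_summary() {
  dir=${1:-/data/prometheus/data/wal}
  # Count regular files directly inside the directory.
  count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
  # Human-readable total size of the directory.
  size=$(du -sh "$dir" | cut -f1)
  # Oldest entry by modification time (may be a file or a subdirectory).
  oldest=$(ls -tr "$dir" | head -n 1)
  echo "files=$count size=$size oldest=$oldest"
}
```

Running `wal_summary` with no argument checks `/data/prometheus/data/wal`. Adding `ls -ld` on the reported oldest entry shows its date.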
Here is the output
Damn, looks like it goes back as far as May 9th
You're sure it doesn't clean up after being up for a while?
Pretty sure. It has been running before this restart for about 10 days
Ok, at this point I only see the following (probable) solution:
dc stop prometheus
rm -rf /data/prometheus/data/wal/*
dc start prometheus
I can't guarantee you're not going to lose any data, and it's possible you won't 😄
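A more cautious variant of the same idea is to move the WAL files aside instead of deleting them, so they can be put back if needed. The sketch below assumes `dc` behaves as used above. `move_wal_aside` and the `/data/prometheus/wal-backup` directory are made-up names for illustration. Keeping the backup on the same filesystem should make the move a rename, so it needs no extra disk space.

```shell
# Hypothetical helper: move every entry in a WAL directory into a
# backup directory, creating the backup directory if it is missing.
move_wal_aside() {
  src=$1
  dest=$2
  mkdir -p "$dest" || return 1
  find "$src" -mindepth 1 -maxdepth 1 -exec mv {} "$dest"/ \;
}

# Intended sequence on the NaBox host (not run here; backup path is an assumption):
#   dc stop prometheus
#   move_wal_aside /data/prometheus/data/wal /data/prometheus/wal-backup
#   dc start prometheus
```

The same caveat applies as with `rm -rf`: samples that only exist in the WAL will not be loaded on the next start, but with this variant the files are still around to inspect or restore.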