#NaBox strange Prometheus issue


dark rover
#

Hi,
We see an issue with Prometheus not starting up correctly. dc ps shows it as running, but when we go to /prometheus we see a counter with the text "starting up". The maintenance GUI reports it as unavailable.
We updated to the latest NaBox and Harvest, rebooted, and waited 2 hours.
Any help?
Thanks.

pulsar prawn
#

Hi @dark rover, can you SSH to your NaBox instance and take a look at the Prometheus logs by running docker logs prometheus?

dark rover
hazy schooner
#

@dark rover The logs you've provided indicate that Prometheus is replaying its Write-Ahead Log (WAL). The WAL records recent samples that have not yet been compacted into persistent blocks, so that they can be recovered after a crash or restart.
This means that Prometheus is still starting up. Depending on the amount of data stored in the WAL, the replay can take some time. Once all the segments up to maxSegment are loaded, Prometheus will be ready to serve queries.
If you're experiencing issues with Prometheus while these logs are being generated (like the "error fetching runtime information: failed to fetch" error you mentioned earlier), it is probably because Prometheus has not yet fully started.

ts=2023-09-28T13:34:16.322Z caller=head.go:755 level=info component=tsdb msg="WAL segment loaded" segment=1091 maxSegment=7089

We need to wait until all WAL segments are loaded.
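
To keep an eye on the replay, the segment and maxSegment fields of the log line above can be turned into a rough progress figure. This is only a sketch: wal_progress is a hypothetical helper name, it assumes the log format shown above, and the percentage is approximate because the replay does not necessarily start at segment 0.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus log lines on stdin and report the
# most recent "WAL segment loaded" position as segment/maxSegment.
wal_progress() {
    sed -n 's/.*msg="WAL segment loaded" segment=\([0-9][0-9]*\) maxSegment=\([0-9][0-9]*\).*/\1 \2/p' |
    tail -n 1 |
    while read -r seg max; do
        # Skip the percentage when maxSegment is 0 to avoid dividing by zero.
        [ "$max" -gt 0 ] || { echo "$seg/$max segments"; continue; }
        echo "$seg/$max segments ($((seg * 100 / max))%)"
    done
}

# Possible usage on the NaBox host (stderr is merged because the container
# logs may arrive on either stream):
#   docker logs prometheus 2>&1 | wal_progress
```

Running it a few minutes apart shows whether the segment number is still advancing or whether the replay keeps starting over.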

pulsar prawn
#

Also check how much free disk space you have. It could be that Prometheus is taking a long time to respond because of low disk space.
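
A quick way to check this from the shell is sketched below. disk_use_pct is a hypothetical helper name, and the /data/prometheus path in the usage comment is an assumption about where the Prometheus data lives on the NaBox host.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage of used space on the
# filesystem that holds the given path.
disk_use_pct() {
    # -P forces the POSIX layout (one line per filesystem), so the
    # "capacity" value is reliably the fifth column of the second line.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Possible usage (path is an assumption):
#   disk_use_pct /data/prometheus
```

A value close to 100 would point at disk space as the likely culprit.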

dusk bobcat
#

And finally, it's not impossible that prom runs out of memory while doing this and ends up in a crash loop. One solution in that case is to prune the WAL files.

dark rover
#

It seems like it never finishes loading the WAL files... Disk space is OK. I added extra memory and rebooted.

dark rover
#

The server runs fine. But after rebooting, it loads the WAL again and needs the 64 GB again. I will attach new logs.

hazy schooner
dusk bobcat
#

It's good news that the replay is successful, but it's abnormal that the replay happens on every reboot.
I would like to test a dc restart prometheus to see if it replays as well.

dark rover
#

Hi, here is the output.

dc restart prometheus

Restarting prometheus ... done
This also starts the WAL replay.

dusk bobcat
#

wow.

dark rover
dusk bobcat
#

One possible solution would be to stop Prometheus and delete (or move) the WAL files.
But I don't know how much history you would lose.

#

Please do

ls -l /data/prometheus/data
ls -l /data/prometheus/data/wal
ls -l /data/prometheus/data/chunks_head
dark rover
dusk bobcat
#

Damn, looks like it goes back as far as May 9th

dusk bobcat
#

Are you sure it doesn't clean up after being up for a while?

dark rover
#

Pretty sure. It had been running for about 10 days before this restart.

dusk bobcat
#

OK, at this point the only possible solution I see is the following:

dc stop prometheus
rm -rf /data/prometheus/data/wal/*
dc start prometheus

I can't guarantee you're not going to lose any data, but it's possible you won't 😄
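
Since moving the WAL files was mentioned earlier as an alternative to deleting them, here is a minimal sketch of that variant. move_wal_aside is a hypothetical helper name and the backup location in the usage comment is an assumption. The WAL directory itself is left in place (only its contents are moved) so that its ownership and permissions stay as they were. The same caveat about possible data loss applies.

```shell
#!/bin/sh
# Hypothetical helper: move everything inside DATA_DIR/wal into BACKUP_DIR
# instead of deleting it. Prometheus must be stopped before calling this.
move_wal_aside() {
    data_dir=$1
    backup_dir=$2
    wal_dir="$data_dir/wal"

    [ -d "$wal_dir" ] || { echo "no WAL directory at $wal_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1

    # Move segments and checkpoint directories one by one; an empty WAL
    # directory simply results in nothing being moved.
    for entry in "$wal_dir"/*; do
        [ -e "$entry" ] || continue
        mv "$entry" "$backup_dir"/ || return 1
    done
    echo "WAL contents moved to $backup_dir"
}

# Possible usage on the NaBox host (backup path is an assumption):
#   dc stop prometheus
#   move_wal_aside /data/prometheus/data /data/prometheus/wal-backup
#   dc start prometheus
```

If Prometheus comes up cleanly and the history looks acceptable, the backup directory can be removed later to reclaim the disk space.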