#Error 503 Grafana
@worn pendant Is Prometheus up and running, and scraping data?
OK, let's check the data source connection from Grafana. Could you check whether Grafana is able to reach Prometheus?
Could you share the Prometheus targets page?
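The "is Prometheus up?" question can also be answered from the command line via Prometheus's readiness endpoint, /-/ready. A minimal sketch, assuming curl is installed and Prometheus is reachable at http://localhost:9090 (an assumption; adjust the URL to wherever your setup exposes it). prom_ready is a hypothetical helper name.

```shell
# Hypothetical helper: report whether Prometheus answers its readiness probe.
# $1 is the Prometheus base URL; the default is an assumption, not a NAbox fact.
prom_ready() {
  base_url=${1:-http://localhost:9090}
  # -f: treat HTTP errors as failure, -sS: quiet but still print errors,
  # --max-time: give up after 5 seconds instead of hanging
  if curl -fsS --max-time 5 -o /dev/null "$base_url/-/ready"; then
    echo "ready"
  else
    echo "not ready"
    return 1
  fi
}
```

Running prom_ready with no argument probes the assumed default URL. A "not ready" result only says the probe failed; the Prometheus logs are still needed to find out why.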
okay yes this is the problem. Let's check prometheus logs first.
How do I collect them mate?
If you SSH to NAbox as root, running the command dc logs prometheus should give you the Prometheus logs.
Have you got an email address I can send them to please?
should be the current dir
@worn pendant Checking in: were you able to get the logs?
The Zip file is 89GB for some reason
Could you share the command you ran?
dc logs nabox-api > nabox-api.log; dc logs nabox-harvest2 > nabox-harvest2.log;
tar -czf nabox-logs-$(date +%Y-%m-%d_%H:%M:%S).tgz *
If you just run dc logs -f --tail 1000 nabox-harvest2, do you notice any errors?
None, all looks clean.
OK, please share the Prometheus logs via email or here, whichever you prefer.
How do I get them logs? Just the same command but use prometheus?
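For collecting the Prometheus logs without ending up with a huge archive, one option is to save a bounded tail per service into its own directory and archive only that directory, rather than everything matched by * in the current directory. A minimal sketch, assuming dc is the compose wrapper used elsewhere in this thread and that it accepts --tail as in the command above. collect_logs is a hypothetical helper name.

```shell
# Hypothetical helper: save a bounded tail of each service's logs,
# then archive only the directory that holds them.
# Usage: collect_logs OUT_DIR SERVICE [SERVICE...]
collect_logs() {
  out_dir=$1; shift
  mkdir -p "$out_dir" || return 1
  for svc in "$@"; do
    # --tail caps how much history is written, as in "dc logs -f --tail 1000"
    dc logs --tail 1000 "$svc" > "$out_dir/$svc.log" 2>&1
  done
  # Archive just the collected directory, not the whole working directory
  tar -czf "$out_dir.tgz" -C "$(dirname "$out_dir")" "$(basename "$out_dir")"
}
```

For example: collect_logs "/tmp/nabox-logs-$(date +%Y-%m-%d_%H-%M-%S)" prometheus nabox-harvest2. The /tmp location is only an illustration.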
The Harvest logs look good except for some permission issues for REST objects. It appears the problem lies with Prometheus.
Could you share your prometheus data retention configuration?
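One way to read the retention settings from a running Prometheus is its /api/v1/status/flags endpoint, which returns the effective command-line flags as JSON. A minimal sketch that filters that JSON for the storage.tsdb.retention.* flags; the URL in the comment is an assumption, and retention_flags is a hypothetical helper name. It only works while Prometheus is up long enough to answer.

```shell
# Hypothetical helper: print the retention flags found in the JSON
# returned by /api/v1/status/flags (the JSON is read from stdin).
retention_flags() {
  # Match each "storage.tsdb.retention...":"value" pair up to the next , or }
  grep -o '"storage\.tsdb\.retention[^,}]*'
}

# Example (URL is an assumption, adjust to your setup):
#   curl -s http://localhost:9090/api/v1/status/flags | retention_flags
```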
Also check whether there are any disk issues, as we discussed previously in message #1090549586281648148.
I expanded the disk yesterday so that's fixed, but this has been happening for a few days.
Were there any recent upgrades to NAbox?
Updated to a nightly not long ago
Okay, that would be a Harvest upgrade. That should not be an issue.
There are some of these logs in prometheus
I've just done a dc down and a dc up and it seems to be up now.
Yes, but as you said, it may crash again.
Could you share the output of the commands below?
ps -eo pid,ppid,cmd,comm,%mem,%cpu --sort=-%mem | head -10
dmesg -e | grep -i kill
free
ah that's a busybox. Let me see
It's crashed again.
Okay. What is the output of dmesg | grep -i kill?
Yes, free also shows a lack of memory.
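To see which process the kernel's OOM killer has been terminating, the dmesg output can be summarized per process name. A minimal sketch, assuming the kernel log lines contain the usual "Killed process PID (name)" text; oom_victims is a hypothetical helper name. It reads from stdin, so it avoids the ps options that busybox does not support.

```shell
# Hypothetical helper: count OOM kills per process name.
# Reads kernel log text on stdin, e.g.:  dmesg | oom_victims
oom_victims() {
  # Extract the name inside the parentheses after "Killed process <pid>",
  # then count occurrences, most frequent first.
  sed -n -E 's/.*Killed process [0-9]+ \(([^)]*)\).*/\1/p' \
    | sort | uniq -c | sort -rn
}
```

If prometheus tops the list, that is consistent with memory exhaustion, though it does not by itself show what triggered the memory growth.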
I will bump it up to 16GB and maybe ask for another CPU core too.
Thank you again for your help mate
np
How do I kill the restapi requests? The harvest.yml file?
Sorry, which REST API?
You said there was restapi permissions denied in the log files. How do I turn off restapi and just use ZAPI for now?
got it. Let me check
Getting extra memory may take a few days, so I think turning that off, along with stopping pollers I don't need running, may get me through the next few days.
Yes, that should be in harvest.yml only. You can disable those pollers through the NAbox UI as well.
Yes, Prometheus needs to recover from the crash.
Is there any way of optimising the WAL replays? It gets so far and crashes, even if I down the other containers.
@worn pendant Ideally the replay only happens once, and afterwards there are no more crashes or out-of-space issues to make it happen again. It sounds like that isn't the case for you, since the replay does not finish without crashing? What error does Prometheus give you when it crashes? It might be worth searching their GitHub for the error text you're seeing.
To summarize: Prometheus is trying to recover the DB, which causes huge memory utilization, and it is then terminated by the kernel?
Do we have a hint about the original issue?
Can you confirm that we currently have 8GB to work with, and that this seems to be too little for Prometheus to recover?
No space issue on /data either?
Maybe something to try:
dc down
dc up -d prometheus
See if running Prometheus without live targets helps it settle down.
As a last resort, it seems clearing the wal directory might do the trick, at the expense of losing some data.
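If it comes to that last resort, moving the wal directory aside is a slightly safer variant than deleting it outright, since it can be moved back. A minimal sketch, assuming Prometheus is stopped first and that you pass the Prometheus data directory yourself (its location on NAbox is not stated in this thread, so the path in the comment is a placeholder). set_aside_wal is a hypothetical helper name.

```shell
# Hypothetical helper: move the WAL aside instead of deleting it.
# $1 is the Prometheus data directory (the one that contains "wal").
set_aside_wal() {
  data_dir=$1
  if [ ! -d "$data_dir/wal" ]; then
    echo "no wal directory under $data_dir" >&2
    return 1
  fi
  backup="$data_dir/wal.bak-$(date +%Y%m%d-%H%M%S)"
  # Print the backup location so it can be restored or removed later
  mv "$data_dir/wal" "$backup" && echo "$backup"
}

# Intended order of use (path is a placeholder, not a NAbox fact):
#   dc stop prometheus
#   set_aside_wal /path/to/prometheus/data
#   dc up -d prometheus
```

Samples that were only in the WAL are lost while it is set aside, and the backup keeps using disk space until it is removed.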
So I've downed all containers and it still seems to crash. I have cleaned tombstones and all data, and it still crashes. I've added an entry saying no lock file and it seems to be stable so far.
Yeah, 8GB, and now 500GB of free space.
Morning, I left all the containers down last night apart from Prometheus and it's still been crashing through the night.
How about dmesg | grep -i kill? Is it still an OOM error?
Yeah, seems to be using the memory during the WAL replays
Okay, Prometheus probably needs more memory to recover from a crashed state.
16GB should be enough from 8?
I am not sure, but 16GB is a good starting point to see if it recovers.
I will try for 32GB
Is Prometheus up now?
Yeah
Okay, good. So are the logs shown in the screenshot above still occurring?
@worn pendant Have you configured Alertmanager in Prometheus? It's possible that the issue is related to the Alertmanager service not being set up correctly. You may want to verify that the Alertmanager service is running.
Whatever the default setting was mate, never really messed about with Alertmanager. It's not a big issue.
Okay, if you are not using Alertmanager then you can safely ignore these logs. @valid gull Could you check this?
Yeah, that should be commented out, my bad. Please edit /usr/local/nabox/files/prometheus/prometheus.yml and comment out the following block:
# Alert Manager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
Cheers mate
Ah, it is already commented out.