#Error 503 Grafana

worn pendant
#

Hi all, I seem to be getting 503 errors on Grafana; however, the collectors are still collecting.

meager kayak
#

@worn pendant Is Prometheus up and running, and scraping data?

worn pendant
#

Yeah, it's up. The issue only seems to go away if I bounce the box.

meager kayak
#

OK, let's check the data source connection from Grafana. Could you check whether Grafana is able to reach Prometheus?
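
One way to run that check from a shell is to query Prometheus's readiness endpoint directly. Prometheus serves `/-/healthy` and `/-/ready` management endpoints. The host name `prometheus`, port 9090, and the presence of `curl` on the box are assumptions here (the name may only resolve inside the container network), and `describe_ready` is a hypothetical helper, not part of NAbox:

```shell
# Translate the HTTP status code returned by Prometheus's /-/ready
# endpoint into a short description. 000 is what curl reports when it
# could not connect at all.
describe_ready() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not ready" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage (the address is an assumption; adjust it to your setup):
# code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://prometheus:9090/-/ready)
# describe_ready "$code"
```

A "not ready" or "unreachable" result would point at Prometheus itself rather than at Grafana.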

meager kayak
#

Okay, yes, this is the problem. Let's check the Prometheus logs first.

worn pendant
#

How do I collect them mate?

meager kayak
#

If you SSH to NAbox as root, running the command dc logs prometheus should give you the Prometheus logs.

worn pendant
#

Have you got an email address I can send them to please?

meager kayak
#

Also, if you could share the Harvest logs, that would be helpful.

worn pendant
#

Thank you

#

Where does the zip file of logs get saved?

meager kayak
#

It should be in the current directory.

meager kayak
#

@worn pendant Checking if you were able to get logs?

worn pendant
#

The Zip file is 89GB for some reason

meager kayak
#

Could you share the command you ran?

worn pendant
#

dc logs nabox-api > nabox-api.log; dc logs nabox-harvest2 > nabox-harvest2.log;
tar -czf nabox-logs-$(date +%Y-%m-%d_%H:%M:%S).tgz *
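
A note on the archive size: the `*` glob in the tar command packs everything in the current directory, not just the two log files, which is one plausible reason for an 89GB archive (the thread never confirms the cause). A minimal sketch that archives only the `*.log` files from a dedicated directory follows. The function name `archive_logs` and the `/tmp/nabox-logs` path are assumptions for illustration:

```shell
# Archive only the *.log files found in a directory.
# Usage: archive_logs <log_dir> <absolute_path_to_output_archive>
archive_logs() {
  log_dir="$1"
  out="$2"
  # The subshell keeps the caller's working directory unchanged.
  # Because of the cd, $out should be an absolute path.
  ( cd "$log_dir" && tar -czf "$out" ./*.log )
}

# Usage (paths are hypothetical; --tail bounds the amount of log output):
# mkdir -p /tmp/nabox-logs
# dc logs --tail 5000 nabox-harvest2 > /tmp/nabox-logs/nabox-harvest2.log
# archive_logs /tmp/nabox-logs "/tmp/nabox-logs-$(date +%Y-%m-%d_%H-%M-%S).tgz"
```

Writing the logs into their own directory also keeps the archive from picking up unrelated files.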

meager kayak
#

If you just run dc logs -f --tail 1000 nabox-harvest2, do you notice any errors?

worn pendant
#

None, all looks clean

meager kayak
#

OK, please share the Prometheus logs via email or here, as you prefer.

worn pendant
#

How do I get them logs? Just the same command but use prometheus?

meager kayak
#

Yes: dc logs prometheus

#

To save to a file: dc logs prometheus > prometheus.log

worn pendant
#

All sent over mate thank you

meager kayak
#

Harvest logs look good except for some permission issues for REST objects. It appears the problem lies with Prometheus.

#

Could you share your prometheus data retention configuration?
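
For reference, Prometheus retention is normally set with the command-line flags `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size`. A small sketch for pulling those flags out of whatever file defines the Prometheus command line is shown below. Where NAbox keeps that file is not stated in this thread, so the file argument is left to the caller, and `show_retention_flags` is a hypothetical helper:

```shell
# Print any Prometheus retention flags found in a file
# (for example a compose file or a saved process command line).
# Returns non-zero when no retention flag is present.
show_retention_flags() {
  grep -o -e "--storage\.tsdb\.retention\.[a-z]*[= ][^ \"']*" "$1"
}

# Usage (the path is a placeholder, not a known NAbox location):
# show_retention_flags /path/to/compose-file.yml
```

On a running instance, the effective flag values are also reported by the `/api/v1/status/flags` HTTP endpoint.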

worn pendant
#

I expanded the disk yesterday so that's fixed, but this has been happening for a few days

meager kayak
#

Were there any recent upgrades to NAbox?

worn pendant
#

Updated to a nightly not long ago

meager kayak
#

Okay, that would be a Harvest upgrade. That should not be an issue.

#

There are some of these logs in prometheus

worn pendant
#

I've just done a dc down and a dc up and it seems to be up now

meager kayak
#

Yes. As you said, it may crash again.

#

Could you share the output of the commands below?
ps -eo pid,ppid,cmd,comm,%mem,%cpu --sort=-%mem | head -10
dmesg -e | grep -i kill
free

worn pendant
#

ps -eo pid,ppid,cmd,comm,%mem,%cpu --sort=-%mem | head -10

meager kayak
#

Ah, that's BusyBox. Let me see.
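
Where the `ps -eo ... --sort` form is not available, a similar memory-sorted listing can be approximated by reading `/proc` directly. This is a sketch only: it assumes `awk`, `sort`, and `head` are present on the box (not verified for NAbox), and `top_rss` is a hypothetical helper name:

```shell
# Print "RSS_kB PID NAME" for the top N processes by resident memory,
# using only the status files under /proc.
# Usage: top_rss [proc_root] [count]
top_rss() {
  proc_root="${1:-/proc}"
  count="${2:-10}"
  for status in "$proc_root"/[0-9]*/status; do
    [ -r "$status" ] || continue
    # Kernel threads have no VmRSS line and are skipped.
    # Only the first word of the process name is kept.
    awk '
      /^Name:/  { name = $2 }
      /^Pid:/   { pid = $2 }
      /^VmRSS:/ { rss = $2 }
      END { if (rss != "") print rss, pid, name }
    ' "$status"
  done | sort -rn | head -n "$count"
}

# Usage on a live system:
# top_rss /proc 10
```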

worn pendant
#

It's crashed again

meager kayak
#

Okay. What is the output of dmesg | grep -i kill?

worn pendant
#

I think we have our answer

meager kayak
#

Yes, free also shows a lack of memory.
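
To see at a glance which processes the kernel's OOM killer terminated, the dmesg output can be filtered down to PID and process name. A minimal sketch, assuming the common "Killed process <pid> (<name>)" wording in the kernel log (the exact message format varies between kernel versions); `oom_victims` is a hypothetical helper:

```shell
# Read kernel log text on stdin and print "PID NAME" for every
# process the OOM killer reported killing.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on a live system:
# dmesg | oom_victims
```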

worn pendant
#

I will bump it up to 16GB and maybe ask for another CPU core too

#

Thank you again for your help mate

meager kayak
#

np

worn pendant
#

How do I kill the restapi requests? The harvest.yml file?

meager kayak
#

Sorry, which restapi?

worn pendant
#

You said there was restapi permissions denied in the log files. How do I turn off restapi and just use ZAPI for now?

meager kayak
#

Got it. Let me check.

worn pendant
#

Trying to get extra memory may take a few days, so I think that, along with stopping pollers I don't need running, may get me through the next few days

worn pendant
#

Looks like it's trying to rebuild a DB, and it's using so much memory that it crashes

meager kayak
#

Yes, Prometheus needs to recover from the crash.

worn pendant
#

Is there any way of optimising the WAL replays? It gets so far and crashes, even if I down the other containers

ashen briar
#

@worn pendant Ideally, replay only happens once, and afterwards no more crashes or out-of-space issues cause it to happen again. It sounds like that isn't the case for you, since replay does not finish without crashing? What error does Prometheus give you when it crashes? It might be worth searching their GitHub for the error text you're seeing.

valid gull
#

To summarize, Prometheus is trying to recover the DB, which causes huge memory utilization, and it is then terminated by the kernel?
Do we have a hint about the original issue?
Can you confirm that we currently have 8GB to work with, and that this seems to be too little for Prometheus to recover?
No space issue on /data either?

#

As a last resort, it seems clearing the WAL directory might do the trick, at the expense of losing some data.
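
If it comes to that, moving the WAL aside is a more cautious variant than deleting it, since the directory can be put back. The sketch below assumes the standard Prometheus data layout, where the write-ahead log lives in a `wal` subdirectory of the data directory. The data directory path on NAbox is not given in this thread, so it is passed in by the caller, and `set_aside_wal` is a hypothetical helper. Recent samples that exist only in the WAL and have not yet been written to a block are lost either way:

```shell
# Move the Prometheus WAL directory next to the data directory instead
# of deleting it. Run this only while Prometheus is stopped.
# Prints the backup location on success.
# Usage: set_aside_wal <prometheus_data_dir>
set_aside_wal() {
  data_dir="${1%/}"
  if [ ! -d "$data_dir/wal" ]; then
    echo "no wal directory in $data_dir" >&2
    return 1
  fi
  backup="$data_dir-wal-$(date +%Y%m%d-%H%M%S)"
  mv "$data_dir/wal" "$backup" && echo "$backup"
}

# Usage (the path is a placeholder, not a known NAbox location):
# dc down
# set_aside_wal /path/to/prometheus/data
# dc up
```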

worn pendant
#

So I've downed all containers and it still seems to crash. I have cleaned tombstones and all data, and it still crashes. I've added an entry saying no lock file and it seems to be stable so far

#

Yeah, 8GB, and now 500GB of free space

worn pendant
#

Morning, I left all the containers down last night apart from Prometheus and it's still been crashing through the night

meager kayak
#

How about dmesg | grep -i kill? Is it still an OOM error?

worn pendant
#

Yeah, seems to be using the memory during the WAL replays

meager kayak
#

Okay, Prometheus probably needs more memory to recover from a crashed state.

worn pendant
#

16GB should be enough from 8?

meager kayak
#

I am not sure, but 16GB would be a good start to see if it recovers.

worn pendant
#

I will try for 32GB

meager kayak
#

Is Prometheus up now?

worn pendant
#

Yeah

meager kayak
#

Okay, good. So are the logs shared in the screenshot above still appearing?

meager kayak
#

@worn pendant Have you configured Alertmanager in Prometheus? It's possible that the issue is related to the Alertmanager service not being set up correctly. You may want to verify that the Alertmanager service is running.

worn pendant
#

Whatever the default setting was mate, never really messed about with Alertmanager. It's not a big issue

meager kayak
#

Okay, if you are not using Alertmanager then you can safely ignore these logs. @valid gull Could you check this?

valid gull
#

Yeah, that should be commented out, my bad. Please edit /usr/local/nabox/files/prometheus/prometheus.yml and comment out the following block:

# Alert Manager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - 'alertmanager:9093'

worn pendant
#

Cheers mate

worn pendant
#

Ah, it's already commented out
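
A quick way to double-check the file is to look for an uncommented top-level `alerting:` key. This is a sketch only, and `alerting_enabled` is a hypothetical helper:

```shell
# Exit 0 if the given prometheus.yml still has an active (uncommented)
# top-level alerting block, non-zero otherwise.
alerting_enabled() {
  grep -q '^alerting:' "$1"
}

# Usage:
# if alerting_enabled /usr/local/nabox/files/prometheus/prometheus.yml; then
#   echo "alerting block is still active"
# else
#   echo "alerting block is commented out or absent"
# fi
```

Prometheus reads this file at startup or on a configuration reload, so an edit only takes effect after one of those.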