Error 503 Grafana | NetApp

worn pendant Apr 28, 2023, 9:46 AM

#

Hi all, I seem to be getting 503 errors on Grafana; however, the collectors are still collecting.

meager kayak Apr 28, 2023, 9:50 AM

#

@worn pendant Is Prometheus up and running, and scraping data?

worn pendant Apr 28, 2023, 9:51 AM

#

Yeah, it's up. The issue only seems to get fixed if I bounce the box.

#

meager kayak Apr 28, 2023, 9:53 AM

#

OK, let's check the data source connection from Grafana. Could you check whether Grafana is able to reach Prometheus?
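One quick way to see whether Prometheus itself is answering is to probe its readiness endpoint from the NAbox shell. The sketch below is illustrative only: /-/ready is a standard Prometheus endpoint, but the address in the usage comment and the function names are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative only.

# Turn an HTTP status code into a short verdict.
interpret_code() {
    case "$1" in
        200) echo "ok" ;;
        503) echo "not ready" ;;
        000) echo "unreachable" ;;   # curl reports 000 when it gets no HTTP response
        *)   echo "unexpected: $1" ;;
    esac
}

# Probe a URL and print the verdict. Needs curl on the host.
probe() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    interpret_code "$code"
}

# Usage (host and port are assumptions and may differ on NAbox):
#   probe http://localhost:9090/-/ready
```

A "not ready" or "unreachable" verdict would point at Prometheus rather than Grafana, which is where this thread ends up looking.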

meager kayak Apr 28, 2023, 9:55 AM

#

Could you share the Prometheus targets page?

worn pendant Apr 28, 2023, 9:55 AM

#

meager kayak Apr 28, 2023, 9:56 AM

#

Okay, yes, this is the problem. Let's check the Prometheus logs first.

worn pendant Apr 28, 2023, 9:57 AM

#

How do I collect them mate?

meager kayak Apr 28, 2023, 9:58 AM

#

If you ssh to NAbox as root, running the command dc logs prometheus should give you the Prometheus logs.

worn pendant Apr 28, 2023, 9:59 AM

#

Have you got an email address I can send them to please?

meager kayak Apr 28, 2023, 10:00 AM

#

Yes, see here: https://github.com/NetApp/harvest/wiki/FAQ#how-do-i-share-sensitive-log-files-with-netapp

#

Also, if you could share the Harvest logs, that would be helpful.

#

https://nabox.org/documentation/troubleshooting/#collecting-logs

worn pendant Apr 28, 2023, 10:02 AM

#

Thank you

#

Where does the zip file of logs get saved?

meager kayak Apr 28, 2023, 10:12 AM

#

It should be in the current directory.

meager kayak Apr 28, 2023, 12:46 PM

#

@worn pendant Just checking: were you able to get the logs?

worn pendant Apr 28, 2023, 12:54 PM

#

The zip file is 89GB for some reason.

meager kayak Apr 28, 2023, 12:55 PM

#

Could you share the command you ran?

worn pendant Apr 28, 2023, 12:55 PM

#

dc logs nabox-api > nabox-api.log; dc logs nabox-harvest2 > nabox-harvest2.log;
tar -czf nabox-logs-`date +%Y-%m-%d_%H:%M:%S`.tgz *
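Because the tar command above ends in *, it archives everything in the current directory, not only the two log files. That is one possible reason for an unexpectedly large archive, though the thread does not confirm the cause. A minimal sketch that avoids it by archiving only the *.log files in a dedicated scratch directory; bundle_logs and the /tmp/nabox-logs path are hypothetical.

```shell
#!/bin/sh
# bundle_logs DIR: archive only the *.log files found in DIR and print
# the path of the archive. Hypothetical helper, not part of NAbox.
bundle_logs() {
    dir=$1
    # Hyphens instead of colons in the timestamp: GNU tar can treat a
    # colon in an archive name as a remote host specification.
    name="nabox-logs-$(date +%Y-%m-%d_%H-%M-%S).tgz"
    # Subshell, so the caller's working directory is left unchanged.
    ( cd "$dir" && tar -czf "$name" ./*.log ) || return 1
    echo "$dir/$name"
}

# Usage (scratch directory is an assumption):
#   mkdir /tmp/nabox-logs
#   dc logs nabox-api      > /tmp/nabox-logs/nabox-api.log
#   dc logs nabox-harvest2 > /tmp/nabox-logs/nabox-harvest2.log
#   bundle_logs /tmp/nabox-logs
```

Writing the logs into an otherwise empty directory keeps unrelated files out of the archive even if the glob is widened later.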

meager kayak Apr 28, 2023, 12:57 PM

#

If you just run dc logs -f --tail 1000 nabox-harvest2, do you notice any errors?

worn pendant Apr 28, 2023, 12:59 PM

#

None, it all looks clean.

meager kayak Apr 28, 2023, 12:59 PM

#

OK, please share the Prometheus logs via email or here, whichever you prefer.

worn pendant Apr 28, 2023, 1:00 PM

#

How do I get those logs? Just the same command but with prometheus?

meager kayak Apr 28, 2023, 1:01 PM

#

Yes: dc logs prometheus

#

To save to a file: dc logs prometheus > prometheus.log
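Once the log is in a file, it can help to pull out just the warnings and errors before reading or sharing it. A small sketch, assuming Prometheus's usual key=value log lines with a level= field; prom_problems is a made-up name.

```shell
#!/bin/sh
# prom_problems FILE: print only the warn/error lines from a saved
# Prometheus log. Hypothetical helper for illustration.
prom_problems() {
    grep -E 'level=(warn|error)' "$1"
}

# Usage:
#   dc logs prometheus > prometheus.log
#   prom_problems prometheus.log | tail -n 50
```

grep exits non-zero when nothing matches, so an empty result with a failing exit status simply means no warnings or errors were found.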

worn pendant Apr 28, 2023, 1:07 PM

#

All sent over mate thank you

#

meager kayak Apr 28, 2023, 1:20 PM

#

The Harvest logs look good, except for some permission issues with REST objects. It appears the problem lies with Prometheus.

#

Could you share your Prometheus data retention configuration?
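Retention is controlled by Prometheus command-line flags (--storage.tsdb.retention.time and --storage.tsdb.retention.size), and a running server reports its flags at /api/v1/status/flags. The sketch below filters that JSON for the retention entries with plain grep. The address in the usage comment is an assumption and may differ on NAbox, and retention_flags is a hypothetical name.

```shell
#!/bin/sh
# retention_flags: read the /api/v1/status/flags JSON on stdin and print
# the storage.tsdb.retention* entries, one per line.
retention_flags() {
    grep -o '"storage\.tsdb\.retention[^,}]*'
}

# Usage (address is an assumption):
#   curl -s http://localhost:9090/api/v1/status/flags | retention_flags
```

This only works while Prometheus is up; when it is crash-looping, the same flags would have to be read from wherever the container's command line is defined.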

#

Also check whether there are any disk issues, as we discussed in the past here: #1090549586281648148 message

worn pendant Apr 28, 2023, 1:24 PM

#

I expanded the disk yesterday so that's fixed, but this has been happening for a few days.

#

#

meager kayak Apr 28, 2023, 1:28 PM

#

Were there any recent upgrades to NAbox?

worn pendant Apr 28, 2023, 1:28 PM

#

Updated to a nightly not long ago

meager kayak Apr 28, 2023, 1:29 PM

#

Okay, that would be a Harvest upgrade. That should not be an issue.

#

There are some of these logs in prometheus

worn pendant Apr 28, 2023, 1:42 PM

#

I've just done a dc down and a dc up and it seems to be up now.

meager kayak Apr 28, 2023, 1:42 PM

#

Yes. As you said, it may crash again.

#

Could you share the output of the commands below?
ps -eo pid,ppid,cmd,comm,%mem,%cpu --sort=-%mem | head -10
dmesg -e | grep -i kill
free

worn pendant Apr 28, 2023, 2:03 PM

#

ps -eo pid,ppid,cmd,comm,%mem,%cpu --sort=-%mem | head -10

#

meager kayak Apr 28, 2023, 2:05 PM

#

Ah, that's BusyBox. Let me see.
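If ps on the box rejects those options, as BusyBox builds typically do, the per-process status files under /proc can be read directly instead. This sketch prints resident memory in kB, PID, and process name for the largest consumers. top_rss is a made-up name, and it assumes a Linux /proc plus the awk, sort, and head applets that BusyBox normally ships.

```shell
#!/bin/sh
# top_rss [PROC_ROOT] [COUNT]: list the COUNT processes with the largest
# resident set size, as "rss_kB pid name". Hypothetical helper.
top_rss() {
    proc_root=${1:-/proc}
    count=${2:-10}
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Derive the PID from the path, e.g. /proc/123/status -> 123.
        pid=${status%/status}
        pid=${pid##*/}
        # Kernel threads have no VmRSS line and are skipped.
        awk -v pid="$pid" '
            /^Name:/  { name = $2 }
            /^VmRSS:/ { rss = $2 }
            END       { if (rss != "") print rss, pid, name }
        ' "$status" 2>/dev/null
    done | sort -rn | head -n "$count"
}

# Usage:
#   top_rss            # top 10 from /proc
#   top_rss /proc 5    # top 5
```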

worn pendant Apr 28, 2023, 2:09 PM

#

It's crashed again.

meager kayak Apr 28, 2023, 2:16 PM

#

Okay. What is the output of dmesg | grep -i kill?

worn pendant Apr 28, 2023, 2:18 PM

#

I think we have our answer

#

meager kayak Apr 28, 2023, 2:19 PM

#

Yes, free also shows a lack of memory.
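To see at a glance which processes the kernel's OOM killer has been terminating, and how often, the dmesg output can be summarised. A sketch, assuming the common kernel wording "Killed process <pid> (<name>)"; the exact text can vary between kernel versions, and oom_victims is a hypothetical name.

```shell
#!/bin/sh
# oom_victims: read kernel log lines on stdin and print a count per
# process name that the OOM killer terminated, most frequent first.
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage:
#   dmesg | oom_victims
```

A single name dominating the list, together with low free memory, fits the pattern of one process being killed repeatedly while it tries to recover.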

worn pendant Apr 28, 2023, 2:19 PM

#

I will bump it up to 16GB and maybe ask for another CPU core too.

#

Thank you again for your help mate

meager kayak Apr 28, 2023, 2:20 PM

#

np

worn pendant Apr 28, 2023, 2:21 PM

#

How do I kill the restapi requests? The harvest.yml file?

meager kayak Apr 28, 2023, 2:21 PM

#

sorry which restapi?

worn pendant Apr 28, 2023, 2:21 PM

#

You said there were REST API permission-denied errors in the log files. How do I turn off the REST API and just use ZAPI for now?

meager kayak Apr 28, 2023, 2:22 PM

#

got it. Let me check

worn pendant Apr 28, 2023, 2:23 PM

#

Getting extra memory may take a few days, so I think that, along with stopping pollers I don't need running, may get me through the next few days.

meager kayak Apr 28, 2023, 2:25 PM

#

Yes, that should be in harvest.yml only. You can also disable those pollers through the NAbox UI.

worn pendant Apr 28, 2023, 5:13 PM

#

Looks like it's trying to rebuild a DB, and it's using so much memory doing so that it crashes.

#

meager kayak Apr 28, 2023, 5:14 PM

#

Yes, Prometheus needs to recover from the crash.

worn pendant May 2, 2023, 3:42 PM

#

Is there any way of optimising the WAL replays? It gets so far and crashes, even if I down the other containers.

ashen briar May 2, 2023, 3:45 PM

#

@worn pendant ideally replay only happens once and afterwards no more crashes or out of space issues cause it to happen again. Sounds like that isn't the case for you since replay does not finish without crashing? What error does Prometheus give you when it crashes? Might be worth searching on their GitHub for the error text you're seeing

valid gull May 3, 2023, 1:40 AM

#

To summarize: Prometheus is trying to recover the DB, which causes huge memory utilization, and it is then terminated by the kernel?
Do we have a hint about the original issue?
Can you confirm that we currently have 8GB to work with, and that this seems to be too little for Prometheus to recover?
No space issue on /data either?

#

Maybe something to try:
dc down
dc up -d prometheus

See if having prometheus running without live targets helps it settle down

https://github.com/prometheus/prometheus/issues/4609

#

As a last resort, it seems that clearing the wal directory might do the trick, at the expense of losing some data.
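If it comes to that, moving the WAL directory aside instead of deleting it keeps a way back. This is a cautious sketch only. It assumes Prometheus is stopped first, that the data directory has the standard wal subdirectory, and that you supply the real paths yourself, since the thread does not give them. set_aside_wal is a hypothetical helper, and samples that had not yet been compacted into blocks would still be missing once Prometheus restarts.

```shell
#!/bin/sh
# set_aside_wal DATA_DIR DEST_DIR: move DATA_DIR/wal into DEST_DIR under
# a timestamped name and print the new location. Hypothetical helper;
# stop Prometheus before running anything like this.
set_aside_wal() {
    data_dir=$1
    dest_dir=$2
    [ -d "$data_dir/wal" ] || { echo "no wal directory in $data_dir" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "destination $dest_dir does not exist" >&2; return 1; }
    backup="$dest_dir/wal-$(date +%Y%m%d-%H%M%S)"
    mv "$data_dir/wal" "$backup" && echo "$backup"
}

# Usage (both paths are placeholders, not real NAbox paths):
#   dc down
#   set_aside_wal /path/to/prometheus/data /path/to/backup/area
#   dc up -d prometheus
```

If the destination is on a different filesystem, mv has to copy the data, so it needs enough free space there and may take a while.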

worn pendant May 3, 2023, 10:00 PM

#

So I've downed all containers and it still seems to crash. I have cleaned tombstones and all data, and it still crashes. I've added an entry saying no lock file and it seems to be stable so far.

#

Yeah, 8GB of memory, and now 500GB of free space.

worn pendant May 4, 2023, 8:11 AM

#

Morning, I left all the containers down last night apart from Prometheus and it's still been crashing through the night.

meager kayak May 4, 2023, 8:12 AM

#

How about dmesg | grep -i kill? Is it still an OOM error?

worn pendant May 4, 2023, 8:13 AM

#

Yeah, seems to be using the memory during the WAL replays

meager kayak May 4, 2023, 8:13 AM

#

Okay, Prometheus probably needs more memory to recover from the crashed state.

worn pendant May 4, 2023, 8:15 AM

#

16GB should be enough from 8?

meager kayak May 4, 2023, 8:15 AM

#

I am not sure, but 16GB is a good starting point to see whether it recovers.

worn pendant May 4, 2023, 8:18 AM

#

I will try for 32GB

worn pendant May 4, 2023, 10:30 AM

#

meager kayak May 4, 2023, 11:07 AM

#

Is Prometheus up now?

worn pendant May 4, 2023, 11:07 AM

#

Yeah

meager kayak May 4, 2023, 11:08 AM

#

Okay, good. So are the log messages shared in the screenshot above still appearing?

worn pendant May 4, 2023, 11:09 AM

#

meager kayak May 4, 2023, 11:27 AM

#

@worn pendant Have you configured Alertmanager in Prometheus? It's possible that the issue is related to the Alertmanager service not being set up correctly. You may want to verify that the Alertmanager service is running.

worn pendant May 4, 2023, 11:40 AM

#

Whatever the default setting was mate, I never really messed about with Alertmanager. It's not a big issue.

meager kayak May 4, 2023, 11:42 AM

#

Okay, if you are not using Alertmanager then you can safely ignore these logs. @valid gull Could you check this?

valid gull May 5, 2023, 11:36 AM

#

Yeah, that should be commented out, my bad. Please edit /usr/local/nabox/files/prometheus/prometheus.yml and comment out the following block:

# Alert Manager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - 'alertmanager:9093'

worn pendant May 5, 2023, 12:17 PM

#

Cheers mate

worn pendant May 5, 2023, 12:45 PM

#

Ah, it is already commented out.
