We have had our SG shut down for about a month, and now it will not start up... Before we shut it down, we placed all storage nodes into maintenance mode (the admin node is virtual).
We have logged into the storage nodes and rebooted them into "normal" mode...
We think the issue is the admin node, which seems to have a problem with the NTP service: as it starts it reports "not locked" and ends up with a status of "PLL,UNSYNC, NANO"... All other services seem OK, but we cannot get the NTP service to start normally, and according to the support pages that looks to be the issue... Does anyone have a "quick" fix for this? 😉 Not sure why it has happened. Just because of a month of downtime? This is on 11.8.0.7 and "storagegrid-status" shows everything OK... but when we try to log in to the web GUI we get "Waiting for services..."
#Waiting for services to start...
I think storage nodes can only be down for a maximum of 15 days
you might have to initiate some recovery procedure to integrate the node back into the cluster
I will look into this and see if I can get it up... is it the admin node that cannot be down, or the storage nodes?
ah ok I misread; I didn't realize you shut down the whole grid
Not sure of the implications there; I'd contact support
Yeah, check-cassandra-rebuild after more than 14 days, that's the solution
I thought that the admin node would always come up and that I would be able to access the web GUI... but apparently it doesn't work like that... it is dependent on the storage nodes... The documentation isn't the best, because I wasn't able to find "nodetool" anywhere... there is a "nodestatus" etc. I also think it should be documented better where the commands are to be run, because it can be a bit confusing... But I managed to get onto one of the storage nodes, and "storagegrid-info" shows that cassandra is down and acct, idnt, kstn, stat are down... so I just ran "check-cassandra-rebuild", which apparently takes a while 😉 The problem is that I didn't know that, and had I known I would have run it in a screen session. Question: do I need to do this on all storage nodes, or will one of them be OK?
I guess this message answers the last question: ATTENTION: Do not rebuild more than a single node within a 15 day period.
Rebuilding 2 or more nodes within 15 days of each other may result in data loss.
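For anyone who wants to sanity-check dates against that warning before rebuilding a second node, here is a minimal sketch. It only encodes the 15-day wording from the message above; the function name, the dates, and the choice to treat exactly 15 days as still too close are my own assumptions, not anything StorageGRID provides.

```python
from datetime import date, timedelta

# Window taken from the warning text; counting exactly 15 days as
# "still within the window" is a conservative assumption.
REBUILD_WINDOW = timedelta(days=15)


def safe_to_rebuild(previous_rebuilds, candidate, window=REBUILD_WINDOW):
    """Return True if no earlier rebuild is within `window` of `candidate`."""
    return all(abs(candidate - earlier) > window for earlier in previous_rebuilds)


# Hypothetical dates, for illustration only.
history = [date(2024, 3, 1)]
print(safe_to_rebuild(history, date(2024, 3, 10)))  # 9 days apart
print(safe_to_rebuild(history, date(2024, 3, 20)))  # 19 days apart
```

This is just date arithmetic on records you keep yourself; it does not query the grid.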
Good thing this is not that important data 🙂
Hmm not sure if this is good? `Enter 'y' to rebuild the Cassandra database for this Storage Node. [y/N]? y
Rebuilding may take 12-24 hours. Do not stop or pause the rebuild.
If the rebuild was stopped or paused, re-run this command.
Starting ntp service.
Starting nginx service.
Starting dynip service.
Cleaning Cassandra directories for node.
Warning: Unable to determine what address to replace. Using eth0 IP 10.0.200.10 instead.
Adding replace_address_first_boot flag.
Starting cassandra service.
Warning: Command "/etc/init.d/cassandra start" failed with result "starting cassandra ... failed to start: entered error state." and exit code "1". Failed to start cassandra service
Starting cassandra service.
Warning: Command "/etc/init.d/cassandra start" failed with result "starting cassandra ... failed to start: entered error state." and exit code "1". Failed to start cassandra service
Starting cassandra service.`
Seems to just be retrying...
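If you capture that output to a file, a rough way to see how many start attempts have failed is to count the warning lines. This is only a sketch that matches the exact warning text pasted above; the function name and the sample string are illustrative assumptions, and the wording may differ on other versions.

```python
import re

# Matches the warning line as it appears in the pasted output and
# captures the reported exit code.
FAILED_START = re.compile(
    r'Command "/etc/init\.d/cassandra start" failed .* exit code "(\d+)"'
)


def count_failed_starts(log_text):
    """Count lines that report a failed cassandra start."""
    return sum(1 for line in log_text.splitlines() if FAILED_START.search(line))


# Sample built from the lines pasted above.
sample = """Starting cassandra service.
Warning: Command "/etc/init.d/cassandra start" failed with result "starting cassandra ... failed to start: entered error state." and exit code "1". Failed to start cassandra service
Starting cassandra service.
Warning: Command "/etc/init.d/cassandra start" failed with result "starting cassandra ... failed to start: entered error state." and exit code "1". Failed to start cassandra service
Starting cassandra service."""

print(count_failed_starts(sample))
```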
It just keeps on going... I guess the next question is how I "nuke" the storage nodes into submission if I choose to create a new grid? I guess I will have to install a new admin node... but how do I force the storage nodes into maintenance mode?
I only have management port 1 connected... and serial for both the E-Series and the SG node (these are SG5712 nodes)
I guess "sgareinstall" will do the trick 😉 Sad that the rebuild didn't work...
If the whole grid was shut down, you can bring the nodes up again. No recovery is needed; there is just a kind of "block" that keeps cassandra from starting up automatically. Support would have been able to help with that. But now that you have tried to rebuild cassandra, I'm not sure what the way forward is. It might still be that the rebuild did not actually happen because startup is blocked.
I strongly advise you to log a support case.