#HA Cluster *ucked after power-outage...

elfin osprey
#

The cluster is acting strange after a total power outage... (this is a simple FAS2720 in an HA setup), it reports strange things like "cf.fm.versionMismatch:ALERT]: Failover monitor: CF monitor version mismatch detected: 2/0" which somehow stops all the cluster services from starting up... take a look at this...
`DKHED-NETAPP02::*> cluster show
Node               Health  Eligibility  Epsilon

DKHED-NETAPP02-01  true    true         false
DKHED-NETAPP02-02  true    true         false
2 entries were displayed.

DKHED-NETAPP02::*> storage failover show
                                Takeover
Node               Partner      Possible  State Description

DKHED-NETAPP02-01
                   DKHED-       false     Connected to DKHED-NETAPP02-02,
                   NETAPP02-02            Partial giveback, Takeover is not
                                          possible: NVRAM log not synchronized
DKHED-NETAPP02-02
                   -            true      Connected to partner. Waiting for
                                          partner lock synchronization.
2 entries were displayed.`

I have tried to power off one node, boot up the other, then force a takeover of the downed node... works OK.. but when I boot up the other node it will only do the giveback to the point shown above... looks like they have a different CF monitor version, whatever that is? I haven't had an issue like this before... any ideas what the cause could be? and how to fix it?

#

If you run "storage failover show" over and over, its message changes to: `DKHED-NETAPP02::*> storage failover show
                                Takeover
Node               Partner      Possible  State Description

DKHED-NETAPP02-01
                   DKHED-       true      Connected to DKHED-NETAPP02-02,
                   NETAPP02-02            Partial giveback
DKHED-NETAPP02-02
                   -            false     Connected to partner. Waiting for
                                          partner lock synchronization.
                                          Takeover is not possible: Storage
                                          failover mailbox version mismatch,
                                          The version of software running on
                                          each node of the SFO pair is
                                          incompatible`

plush tulip
#

storage failover takeover -ofnode cluster-01 -bypass-optimization true -option allow-version-mismatch

bypass the version mismatch, no idea about the rest of it

weak aurora
#

perhaps check that both nodes are actually running the same OS version as well

elfin osprey
#

`DKHED-NETAPP02::storage*> version -node *

DKHED-NETAPP02-01:
NetApp Release 9.15.1P1: Tue Jul 30 05:15:49 UTC 2024

DKHED-NETAPP02-02:
NetApp Release 9.15.1P1: Tue Jul 30 05:15:49 UTC 2024

2 entries were displayed.`
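
For anyone who wants to script the "same OS version on both nodes" check rather than eyeball it, here is a minimal Python sketch. It only parses text pasted from `version -node *` as shown above; the names `releases_by_node` and `VERSION_OUTPUT` are made up for this example, and the parsing assumes the output layout seen in this thread.

```python
import re

# Pasted from the thread: output of `version -node *`.
VERSION_OUTPUT = """\
DKHED-NETAPP02-01:
NetApp Release 9.15.1P1: Tue Jul 30 05:15:49 UTC 2024

DKHED-NETAPP02-02:
NetApp Release 9.15.1P1: Tue Jul 30 05:15:49 UTC 2024

2 entries were displayed.
"""

# Matches e.g. "NetApp Release 9.15.1P1: Tue Jul 30 ..." and captures "9.15.1P1".
RELEASE_RE = re.compile(r"^NetApp Release (\S+):")


def releases_by_node(text):
    """Map each node name to the release string reported for it."""
    releases = {}
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        match = RELEASE_RE.match(line)
        if match and node is not None:
            releases[node] = match.group(1)
        elif line.endswith(":") and " " not in line:
            # A bare "NODENAME:" line introduces the next release line.
            node = line[:-1]
    return releases


releases = releases_by_node(VERSION_OUTPUT)
all_same = len(set(releases.values())) == 1
print(releases, "-> identical" if all_same else "-> MISMATCH")
```

Here both nodes report the same release, which is why the "version mismatch" in the failover messages had to be about something other than the ONTAP version.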

steel valve
#

What does cluster ring show say?
Login to both node-mgmt IPs

elfin osprey
# steel valve What does `cluster ring show` say? Login to both node-mgmt IPs

`DKHED-NETAPP02::storage*> cluster ring show
Node               UnitName  Epoch  DB Epoch  DB Trnxs  Master             Online

DKHED-NETAPP02-01  mgmt      15     15        1044      DKHED-NETAPP02-01  master
DKHED-NETAPP02-01  vldb      15     15        139       DKHED-NETAPP02-01  master
DKHED-NETAPP02-01  vifmgr    15     15        218       DKHED-NETAPP02-01  master
DKHED-NETAPP02-01  bcomd     15     15        71        DKHED-NETAPP02-01  master
DKHED-NETAPP02-01  crs       15     15        1         DKHED-NETAPP02-01  master
DKHED-NETAPP02-02  mgmt      15     15        1046      DKHED-NETAPP02-01  secondary
DKHED-NETAPP02-02  vldb      15     15        139       DKHED-NETAPP02-01  secondary
DKHED-NETAPP02-02  vifmgr    15     15        218       DKHED-NETAPP02-01  secondary
DKHED-NETAPP02-02  bcomd     15     15        71        DKHED-NETAPP02-01  secondary
DKHED-NETAPP02-02  crs       15     15        1         DKHED-NETAPP02-01  secondary
10 entries were displayed.`

#

(same on the other node)

elfin osprey
#

1/14/2025 16:31:13 DKHED-NETAPP02-01 NOTICE cf.fsm.takeoverOfPartnerEnabled: Failover monitor: takeover of DKHED-NETAPP02-02 enabled
1/14/2025 16:30:37 DKHED-NETAPP02-01 NOTICE cf.fsm.takeoverByPartnerEnabled: Failover monitor: takeover of DKHED-NETAPP02-01 by DKHED-NETAPP02-02 enabled
1/14/2025 16:30:36 DKHED-NETAPP02-01 ERROR cf.fsm.takeoverByPartnerDisabled: Failover monitor: takeover of DKHED-NETAPP02-01 by DKHED-NETAPP02-02 disabled (version mismatch).
1/14/2025 16:30:36 DKHED-NETAPP02-01 ERROR cf.fsm.takeoverOfPartnerDisabled: Failover monitor: takeover of DKHED-NETAPP02-02 disabled (version mismatch).
1/14/2025 16:30:15 DKHED-NETAPP02-02 ALERT rdb.ha.mboxError: Bidirectional failover under the 'cluster HA' configuration is not currently functional due to problem with the on-disk mailboxes.

#

also strange messages in the event log...
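
If the event log gets pasted as one long run of text like the lines above, a small Python sketch can split it back into events and pull out the ERROR/ALERT ones. This is only an illustration: the regex assumes the `date time node SEVERITY event.name: message` layout seen in this thread, and `EVENT_RE`, `parse_events` and `PASTED_LOG` are names invented for the example.

```python
import re

# The five events from the thread, joined into one string as they were pasted.
PASTED_LOG = (
    "1/14/2025 16:31:13 DKHED-NETAPP02-01 NOTICE cf.fsm.takeoverOfPartnerEnabled: "
    "Failover monitor: takeover of DKHED-NETAPP02-02 enabled "
    "1/14/2025 16:30:37 DKHED-NETAPP02-01 NOTICE cf.fsm.takeoverByPartnerEnabled: "
    "Failover monitor: takeover of DKHED-NETAPP02-01 by DKHED-NETAPP02-02 enabled "
    "1/14/2025 16:30:36 DKHED-NETAPP02-01 ERROR cf.fsm.takeoverByPartnerDisabled: "
    "Failover monitor: takeover of DKHED-NETAPP02-01 by DKHED-NETAPP02-02 disabled "
    "(version mismatch). "
    "1/14/2025 16:30:36 DKHED-NETAPP02-01 ERROR cf.fsm.takeoverOfPartnerDisabled: "
    "Failover monitor: takeover of DKHED-NETAPP02-02 disabled (version mismatch). "
    "1/14/2025 16:30:15 DKHED-NETAPP02-02 ALERT rdb.ha.mboxError: "
    "Bidirectional failover under the 'cluster HA' configuration is not currently "
    "functional due to problem with the on-disk mailboxes."
)

# Timestamp layout assumed from the paste: M/D/YYYY HH:MM:SS
TS = r"\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}"

# One event runs from its timestamp up to the next timestamp or the end of text.
EVENT_RE = re.compile(
    r"(?P<time>" + TS + r") (?P<node>\S+) (?P<severity>[A-Z]+) "
    r"(?P<name>[\w.]+): (?P<message>.*?)(?=\s+" + TS + r" |\s*$)",
    re.DOTALL,
)


def parse_events(text):
    """Return a list of dicts with time, node, severity, name and message."""
    return [m.groupdict() for m in EVENT_RE.finditer(text)]


events = parse_events(PASTED_LOG)
serious = [e for e in events if e["severity"] in {"ERROR", "ALERT"}]
for e in serious:
    print(e["time"], e["node"], e["severity"], e["name"])
```

Filtered this way, the oldest serious event is the `rdb.ha.mboxError` alert from node 02, followed by the two "version mismatch" errors on node 01.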

plush tulip
#

no disk with errors?
:>cf monitor all

#

diag mode

#

other than that you'd have to disable HA, then enable it again.

cluster ha modify -configured false

#

cluster ha show, anything reported there?

elfin osprey
#

got these messages on node 02..:
Reason Takeover not Possible: Storage failover mailbox version mismatch The version of software running on each node of the SFO pair is incompatible
Reason Takeover not Possible (Enum value): fm_version_mismatch version
Interconnect Up: true
Interconnect Links: RDMA Interconnect is up (Link up)
Interconnect Type: GOP (PLX PEX8725 NTB)
State Description: Connected to partner, Takeover is not possible: Storage failover mailbox version mismatch, The version of software running on each node of the SFO pair is incompatible
Partner State: Up
Time Until Takeover: -
Reason Takeover not Possible by Partner: The version of software running on each node of the SFO pair is incompatible NVRAM log not synchronized
Reason Takeover not Possible by Partner (Enum value): version log_unsync

plush tulip
#

yea, looks like something is up with the mailbox disks

elfin osprey
#

I can disable and enable the HA...
DKHED-NETAPP02::storage*> cluster ha show
High-Availability Configured: true
High-Availability Backend Configured (MBX): true

#

...no failed disks

plush tulip
#

might want to verify with support but doing the ha -false and then ha -true is the only way i know of to fix a mailbox issue if there are no broken disks
however, might break something else if all the VLDBs are not sync'd correctly

elfin osprey
#

yeah it's a strange one... thankfully this is just a backup system... and it's actually online with all the volumes...

plush tulip
#

what did cf monitor all come back with

weak aurora
#

one can delete stale mailboxes in the maint boot, but i'd definitely have someone in Support look at it before I went that route

drowsy urchin
#

Verify both cluster ports are online. I’ve seen systems where they were down. Forcing another power cycle was the only way to get them online again

jagged kraken
#

you could try to boot into maint. mode and destroy the mailbox disks.... they will be re-created during boot. if the nodes were in takeover you will lose that information though, and both nodes will come up with their own data/nvram. Which means you might lose up to 10s of data (the most recent CP) if the nvrams were not in sync during the outage

elfin osprey
#

This looks strange? `DKHED-NETAPP02::cf monitor*> storage failover mailbox-disk show
                   Mailbox
Node               Owner    Disk Name  Disk UUID

DKHED-NETAPP02-01
                   local    1.0.1.P2   6000C500:A7427F17:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   local    1.0.3.P2   6000C500:A6F39FBB:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   local    1.0.5.P2   6000C500:A74CD127:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   partner  1.0.0.P2   6000C500:A74259B3:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   partner  1.0.2.P2   6000C500:A74CBFF3:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   partner  1.0.4.P2   6000C500:A73F88AF:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
DKHED-NETAPP02-02
                   local    1.0.0.P2   6000C500:A74259B3:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   local    1.0.2.P2   6000C500:A74CBFF3:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   local    1.0.4.P2   6000C500:A73F88AF:500A0981:00000002:00000000:00000000:00000000:00000000:00000000:00000000
                   partner  1.2.5      5000C500:957C86EB:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000
                   partner  1.2.1      5000C500:957C4AF3:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000
11 entries were displayed.`

#

Isn't there supposed to be 6 on each node?

plush tulip
#

looks like 1.2.3 is missing

elfin osprey
#

Looks like node 01 knows about node 02's MB disks, but not the other way around?...
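
The asymmetry is easier to see when checked mechanically. Below is a rough Python sketch that takes the data rows from the `storage failover mailbox-disk show` paste (headers and UUID column trimmed by hand for brevity) and tests whether each node's "partner" mailbox disks match the other node's "local" ones. The function names and the trimmed `ROWS` text are assumptions of this example, not anything ONTAP provides, and the check only makes sense for a two-node HA pair.

```python
# Data rows from the thread's mailbox-disk output: node name on its own line,
# then "owner disk" rows (UUID column dropped for readability).
ROWS = """\
DKHED-NETAPP02-01
local 1.0.1.P2
local 1.0.3.P2
local 1.0.5.P2
partner 1.0.0.P2
partner 1.0.2.P2
partner 1.0.4.P2
DKHED-NETAPP02-02
local 1.0.0.P2
local 1.0.2.P2
local 1.0.4.P2
partner 1.2.5
partner 1.2.1
"""


def parse_mailbox_disks(text):
    """Return {node: {"local": set_of_disks, "partner": set_of_disks}}."""
    view = {}
    node = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("local", "partner") and len(fields) >= 2 and node:
            view[node][fields[0]].add(fields[1])
        elif len(fields) == 1:
            # A single token on a line is treated as a node name.
            node = fields[0]
            view[node] = {"local": set(), "partner": set()}
    return view


def asymmetries(view):
    """List (node, partner) pairs where node's 'partner' disks differ from
    the disks the partner itself reports as 'local'."""
    problems = []
    for node, disks in view.items():
        for other, other_disks in view.items():
            if other != node and disks["partner"] != other_disks["local"]:
                problems.append((node, other))
    return problems


view = parse_mailbox_disks(ROWS)
for node, disks in view.items():
    print(node, "local:", len(disks["local"]), "partner:", len(disks["partner"]))
print("mismatched views:", asymmetries(view))
```

On the pasted data this flags only node 02's view of node 01: it lists two whole disks (1.2.5, 1.2.1) as partner mailboxes, while node 01 reports three partitions (1.0.1.P2, 1.0.3.P2, 1.0.5.P2) as its local ones.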

jagged kraken
#

yeah it looks like the mailbox disks are somehow borked...

#

I would probably go down the mailbox recreation route but if you want to be extra safe you should probably open a case

elfin osprey
#

yeah since this is a backup system with only snapmirrors, I think the MB-disks will just have to be nuked 😉

#

so both nodes into maint mode and then nuke the mbox disks and then reboot node 01... wait a bit and then node 02 ?

#

but very strange that one of the nodes has a totally different view of the world... 🙂

#

I'll update on my progress...

elfin osprey
#

ha-config looks ok: "ha" on both nodes...

#

can't remember the mailbox disk commands 😉

#

why are the commands hidden 😉 just for ref. the command is "mailbox" 😉 and "mailbox destroy local" is the one I did on both nodes...

#

`DKHED-NETAPP02::> storage failover show
                                Takeover
Node               Partner      Possible  State Description

DKHED-NETAPP02-01
                   DKHED-       true      Connected to DKHED-NETAPP02-02
                   NETAPP02-02
DKHED-NETAPP02-02
                   DKHED-       true      Connected to DKHED-NETAPP02-01
                   NETAPP02-01
2 entries were displayed.

DKHED-NETAPP02::> cluster show
Node               Health  Eligibility

DKHED-NETAPP02-01  true    true
DKHED-NETAPP02-02  true    true
2 entries were displayed.`

#

Yep, after I nuked the mbox disks on both nodes from maint mode, then halted them, and booted them up again, it "fixed" itself...

#

but a bit strange how a power-failure can nuke the mbx disks like that...

steel valve
#

What do you see here?
system ha interconnect status show

elfin osprey
#

Yeah, they were also OK before I nuked the mbx...

#

`DKHED-NETAPP02::*> system ha interconnect status show

                   Node: DKHED-NETAPP02-01
            Link Status: up
     IC RDMA Connection: up

                   Node: DKHED-NETAPP02-02
            Link Status: up
     IC RDMA Connection: up

2 entries were displayed.`

#

I think I should do a pair of failovers for good measure? 😉

steel valve
#

yea, I would too

#

after power-issues you also sometimes need to reseat controllers/IOMs/PSUs

#

at least with SAS-shelves you can't remove some errors otherwise (like orange LED all the time even though everything looks fine)

jagged kraken
#

mailbox destroy [local|partner] is the WD40 of unexplained cluster HA issues 😄

elfin osprey
#

Seems so... thanks for the input... it seems to have helped in this case too

weak aurora
#

it was the hammer one needed for headswaps "back in the day"

elfin osprey
# weak aurora it was the hammer one needed for headswaps "back in the day"

OK, I have never used it when doing headswap... and I did a few 😉 not that many anymore... but I have had to nuke the mbx before a few times... and also had to recover cluster configurations and "unset" bootargs in order to get the cluster up and running 😉 you learn a few things over 25 years... (and forget a few too) 😉

#

...when I started working with NetApp we booted a new system off a set of 7 floppy disks 😉 good times! 😉

drowsy urchin
#

I started at NetApp when it was one floppy disk. NetApp appliance release 5.3 tripled the size (atm fore-ip/spans was added and tripled the code base!)

#

It was so much better when the flash devices were used

elfin osprey
#

Yeah, can't remember how many we used for the first ones... I seem to remember F4xx and F5xx models and of course a shit load of F720's (the ones with the blue bezel) I regret not having saved a bezel from each model 😉

drowsy urchin
#

That F700 series….i think that was the shortest lived platform in NetApp history. Manufacturing defect caused an issue on every motherboard

#

I remember NetApp doing some crazy 75% discounts to upgrade

elfin osprey
#

OK, that's not how I remember it... I installed a lot of the "blue" ones and I think it was the F720... and then the Green ones 😉