#How can I tell the aggregates were really temporarily taken over by the other node


split wagon
#

We just tested the failover and giveback functions by taking over node2 from node1 via System Manager. We only tested node2. Node2 did reboot.
On node1's SP screen I saw only this message, nothing else: Feb 19 16:28:00 [node-01:monitor.globalStatus.critical:EMERGENCY]: This node has taken over node-02.

On node2's SP:
Waiting for giveback...(Press Ctrl-C to abort wait)
Continuing boot...
Feb 19 16:31:40 [node-02:cf.fm.discardNvram:notice]: Failover monitor: node was previously taken over, nvram may be discarded...

During the process, I expected the aggregates on node2 to be temporarily owned by node1 and then given back to node2. How can I find out after the fact that this really happened? I ran "storage failover show" during the process, but I didn't see it happen.

I seem to remember there is a command that shows the historical records of failover and giveback?

ancient pilot
#

Run "aggr show" during a takeover.

#

you'll see that ownership of the aggr has changed to the other node.

split wagon
#

I know. But now it's already done. I would like to find out after the fact whether it really happened. I had stepped away at the time and didn't have the chance to run the command.

ancient pilot
#

just tested in my lab. it's in the event log

#

but mid takeover (this is actually right after I ran the giveback command), you can see that they're all owned by node 2

#

`WOPR::*> aggr show

Aggregate      Size     Available  Used%  State    #Vols  Nodes    RAID Status
N1_aggr1       5.48TB   2.98TB     46%    online   35     WOPR-02  raid_dp, normal
N2_aggr1       5.48TB   3.59TB     35%    online   22     WOPR-02  raid_dp, normal
root_aggr0_N1  -        -          -      unknown  -      WOPR-02  -
root_aggr0_N2  368.4GB  15.83GB    96%    online   2      WOPR-02  raid_dp, normal
4 entries were displayed.`

split wagon
#

I checked my "event log show" output, and I don't see the "SFO Phase of Takeover" keyword that appears in yours. That doesn't necessarily mean the aggregate takeover didn't happen, does it?
Also, now that failover / giveback has completed, is there anything I can check with "net int show"? I checked, and all LIFs now show "Is Home" as true. Does that sound right? Shouldn't some of them show as "false"?
Thanks for working on it for me!

ancient pilot
#

do you see any Debug, Notice or info alerts? if not, you need to add -severity * at the end.

#

event log show -severity *

split wagon
#

Yeah, now I can see that takeover / giveback happened for me. Could you please verify? It was between node-09 and node-10.
Do me another favor, please: can you tell me how long the failover and the giveback each took?

ancient pilot
#

2/19/2023 16:27:07 node-09 INFORMATIONAL cf.fm.takeoverDuration: Failover monitor: takeover duration time is 4 seconds.
2/19/2023 16:27:07 node-09 NOTICE cf.fm.takeoverComplete: Failover monitor: takeover completed
2/19/2023 16:27:07 node-09 NOTICE callhome.reboot.takeover: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)
2/19/2023 16:27:07 node-09 NOTICE callhome.sfo.takeover.m: Call home for CONTROLLER TAKEOVER COMPLETE MANUAL
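If you'd rather not eyeball a long log, here is a rough Python sketch that filters saved "event log show -severity *" output for takeover/giveback entries. It assumes one event per line in the same "date time node SEVERITY event.name: message" layout as the lines quoted here; the regex, the function names and the SAMPLE variable are illustrative only and not part of ONTAP.

```python
import re
from datetime import datetime

# Assumed layout: "M/D/YYYY HH:MM:SS node SEVERITY event.name: message"
LINE_RE = re.compile(
    r"(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<node>\S+)\s+(?P<severity>[A-Z]+)\s+"
    r"(?P<event>[\w.]+):\s*(?P<message>.*)"
)

def parse_events(text):
    """Parse saved log text into a list of dicts; lines that don't match are skipped."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ev = m.groupdict()
            ev["ts"] = datetime.strptime(ev["ts"], "%m/%d/%Y %H:%M:%S")
            events.append(ev)
    return events

def failover_events(events):
    """Keep events whose name or message mentions takeover or giveback."""
    pattern = re.compile(r"takeover|giveback", re.IGNORECASE)
    return [e for e in events if pattern.search(e["event"] + " " + e["message"])]

# The four lines quoted in this thread, one per line.
SAMPLE = """\
2/19/2023 16:27:07 node-09 INFORMATIONAL cf.fm.takeoverDuration: Failover monitor: takeover duration time is 4 seconds.
2/19/2023 16:27:07 node-09 NOTICE cf.fm.takeoverComplete: Failover monitor: takeover completed
2/19/2023 16:27:07 node-09 NOTICE callhome.reboot.takeover: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)
2/19/2023 16:27:07 node-09 NOTICE callhome.sfo.takeover.m: Call home for CONTROLLER TAKEOVER COMPLETE MANUAL
"""

for ev in failover_events(parse_events(SAMPLE)):
    print(ev["ts"], ev["node"], ev["severity"], ev["event"])
```

Point it at your own saved log text instead of SAMPLE; if no giveback lines come back, the severity filter or the log time range is the first thing to check.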

#

i'd also clear a few of the lines out of that doc related to username/domain/email

#

i just see the takeover in there too, not the giveback.

split wagon
#

Did you see giveback/Giveback in your "event log"? I didn't see it in yours either.
"aggr show" shows everything normal now, with node_10_sas_aggr1 owned by its own node again.
Did you see that node_10_sas_aggr1 was indeed owned by node-09 for a while? If yes, the giveback must already have happened; otherwise node_10_sas_aggr1 wouldn't be owned by node-10 now.
Make sense?

The reason I am so persistent is that I ran "aggr show" multiple times and never saw the aggregate owned by the other node. I might have missed it. That's why I wanted to trace back and make sure it really happened.
Thanks for cleaning up for me.

north swallow
#

Not to tell you what to do, but imo you're doing failover tests the "wrong way". It's ok and nice to know how ONTAP behaves during a takeover. But the important part is: what happened to your workloads/VMs/applications during the takeover event? Did they notice anything? If yes, for how long? If not, well, great. Did the latency go up for some seconds? Etc.
If you lose the connection only for some seconds you pretty much know that the ownership of your aggr has been taken over... otherwise you would lose your connection for much longer.

It's not important how long ONTAP takes for the takeover/giveback. Sometimes when you do the giveback, ONTAP says "partial giveback" for maybe 5 minutes because some internal services still need to come up after the node boots again. But if your applications are still fine, that's not important.

So for failover-tests which actually mean something:

  1. create at least one LIF per node
  2. create at least one volume per aggr (owned by each node)
  3. connect your applications/VMs via the protocol of your choice
  4. let some test-workload run
  5. on your VMs ping the data LIFs (with a timestamp and write that to a txt-file)
  6. do your takeover (either negotiated via ONTAP cmds or unnegotiated by simply powering off one node via SP/BMC)
  7. measure the time your applications lose the connection or check for latency-spikes
  8. report to management: everything looks great 🙂
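
Steps 5 and 7 could be sketched roughly like this in Python. Loud assumptions: it uses a TCP connect instead of a real ICMP ping (raw ICMP usually needs elevated privileges), port 2049 (NFS) is only a placeholder for whatever your data LIF serves, and all function names are made up for illustration.

```python
import socket
import time
from datetime import datetime, timedelta

def probe(host, port=2049, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(host, logfile, port=2049, interval=1.0, count=600):
    """Probe the LIF `count` times, appending one timestamped line per probe to logfile."""
    samples = []
    with open(logfile, "a") as fh:
        for _ in range(count):
            ts = datetime.now()
            ok = probe(host, port)
            fh.write(f"{ts.isoformat()} {host} {'ok' if ok else 'FAIL'}\n")
            fh.flush()
            samples.append((ts, ok))
            time.sleep(interval)
    return samples

def outage_windows(samples):
    """Turn time-ordered (timestamp, ok) samples into (start, end, duration) outage tuples.

    An outage still open at the end of the samples is reported with end and
    duration set to None.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe of a run
        elif ok and start is not None:
            windows.append((start, ts, ts - start))
            start = None                    # connection is back
    if start is not None:
        windows.append((start, None, None))
    return windows

def total_downtime(windows):
    """Sum the durations of all closed outage windows."""
    return sum((d for _, _, d in windows if d is not None), timedelta())
```

Something like `outage_windows(monitor("<data LIF address>", "lif_probe.txt"))` started before the takeover then gives you the number for step 7. The resolution is limited by the probe interval plus the connect timeout, so treat the result as approximate.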

split wagon
#

Yes, we have the same understanding. 🙂
The workloads/VMs/applications all looked OK. That's why I wanted to go back and make sure the aggregate failover/giveback really happened, because that could make a big difference in evaluating the test.

north swallow
#

If your applications looked OK and node2 rebooted, your aggrs have been taken over. Otherwise your applications wouldn't look OK. 😉

#

(as long as you have volumes running on the aggrs of node2)

split wagon
#

That's the thing, all I am trying to do is make sure the aggrs on node2 had really been taken over

#

And given back

north swallow
#

Did you lose the connection for like 10 minutes?
yes --> the ownership of the aggrs stayed with node2
no, only for some seconds --> the ownership changed to node1

There is no other way. 🙂

#

You can't access an aggr if the ownership stays with node2 and node2 is down. 100%

#

You can go through the logs to find the correct lines but I don't see the value in that...