#FAS8040 Controller replacement - now a phantom node?


spare aurora
#

We have a 2-node 1-chassis FAS8040 that we wanted to replace a node on. I'll cut to the chase - we followed this procedure: https://library.netapp.com/ecm/ecm_download_file/ECMP1199896
and are unable to perform the giveback to the replacement node at the end, because the healthy node now thinks the replacement is called something else. Our healthy node is called 'ourarray-02', so we would expect to see 'ourarray-01' when it comes up for giveback. However, it's been renamed to NETAPP-8040-01, so the healthy node will not let us give back to it, and we can't remove it either because the cluster is not healthy.
We've had no luck restoring the cluster from backup configs, etc. So 2 questions:

  1. How did this happen?
  2. How do we fix it?
daring oar
#

maybe you forgot to swap over the boot device of the old controller?

#

is this a 7-mode system or cDOT? Did you do the disk assignment properly?

spare aurora
#

cdot, and the disk ownership was done automatically as part of the procedure. Did definitely move boot media over

#

I was curious if "NETAPP-8040-01" looked like some factory default value to anyone else

spare aurora
#

the only part of the procedure that wasn't followed exactly was moving the NVRAM DIMM over from the failed to the replacement node, but this is a volatile DIMM / DRAM anyway, so that shouldn't matter, should it?

daring oar
#

well the "NV" in "NVRAM" stands for non-volatile 😉 but yes as soon as you remove it, it loses its data. it only needs to be moved over if the replacement system was shipped without one (which usually happens on a mainboard replacement)
as for why your system still had an old node name, I have no idea. I've only ever seen that happen if either the boot media wasn't moved over or the procedure (especially the "(6) restore boot media from disk" step) was not properly followed

daring oar
#

this basically restores the data (filesystem) that is stored on the NVRAM from the last copy that was saved to disk. You normally do that whenever you change hardware, i.e. boot to the ONTAP boot menu and select option (6)

#

so if you didn't move the NVRAM over (or clear the NVRAM), the NVRAM of the replacement node retains its data, and that data contains (among other things) the IP addresses of all cluster nodes and its own node name

spare aurora
#

and credentials, which explains why we couldn't log into the replacement node. But I did move the NVRAM battery from the failed to the replacement node. Should that not have cleared any stale NVRAM data on the replacement, even being momentarily disconnected from power? Or is there some other means by which it holds onto that data?

daring oar
#

it takes a surprisingly long time until data in DRAM is actually lost (which is an attack vector on e.g. encrypted laptops), so a few seconds without power can easily be bridged. On top of that, the DRAM is ECC, which can repair single-bit errors, and it is rather large while the actual data pulled from the database is rather small, so the chance that a double-bit flip occurred in the password or node name fields is even slimmer
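
A back-of-the-envelope sketch of that last point, in Python. Every number here is an assumption picked for illustration (the DIMM size, the field size, and the count of decayed bits are not FAS8040 specifications), and the model treats bit flips as independent and uniformly spread, which real DRAM decay is not. It also ignores ECC, which would only lower the odds further.

```python
def p_field_hit(total_bits: int, field_bits: int, flipped_bits: int) -> float:
    """Probability that at least one of `flipped_bits` independent, uniformly
    placed bit flips lands inside a field that is `field_bits` bits wide."""
    p_miss_one = 1.0 - field_bits / total_bits
    return 1.0 - p_miss_one ** flipped_bits


# Assumed numbers, for illustration only (hypothetical, not from the thread):
TOTAL_BITS = 8 * 2**30 * 8   # an 8 GiB DIMM
FIELD_BITS = 64 * 8          # a 64-byte node name or credential field
FLIPPED_BITS = 1_000         # bits that decayed while power was removed

if __name__ == "__main__":
    p = p_field_hit(TOTAL_BITS, FIELD_BITS, FLIPPED_BITS)
    print(f"chance that a decayed bit lands in the field: {p:.2e}")
```

Even with a thousand decayed bits, the chance of hitting one small field under these assumptions is well below one in a hundred thousand, which fits the node name and credentials surviving a brief power loss.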

spare aurora
#

ok THAT is telling. very very interesting

#

so it's not that it's secretly some NVDIMM or has some proprietary storage hidden out of sight somewhere, it's just the nature of DRAM itself that allows it to hold onto that data

#

do you know of any way to definitively wipe it out? or is waiting the only way?

daring oar
#

using boot menu option (4) or the wipeconfig command can clear the NVRAM. In newer ONTAP versions, the NVRAM is also encrypted so that there's no real need to wipe it anymore (since the key is stored in the TPM)

spare aurora
#

I'm just thinking of potential cases where a replacement controller doesn't come with an NVDIMM, but then when the original node's NVDIMM gets moved over, the only thing it could possibly have is data that the system expects to see

#

ooooh ok so as of say, ONTAP 9.x, it's unable to read that NVRAM after a migration to a new node, and will begin overwriting it? Do we know the precise version where that became a thing?

spare aurora
#

ahh and with a TPM, that's another key point. I wonder which models started shipping with TPMs, and if the 8040 was prior to this

daring oar
#

google "cold boot attack", that's the name used for extracting crypto keys etc. from the RAM of a powered-down laptop. I think there are also some videos on YouTube that show how slowly DRAM decays by storing an image and reading it back after 1, 2, 3, 5, 10... seconds

dusty tulip
#

@mtexter What happened to the original Node-2 that warranted replacement? As in, did the node panic/fail? Or did you manually halt the node at the time?

#

Reason that I ask: The files that get restored by SBM option 6 only get written to disk during a clean shutdown. If there was a panic or a takeover, it will not write those files to disk.

#

So you may have restored a really old copy in the process.
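
A toy Python model of that reasoning, assuming (as stated above) that the on-disk copy is only refreshed on a clean shutdown. The `Node` class and its methods are hypothetical stand-ins, not ONTAP APIs, and the sequence of events is invented purely to show how a restore can return stale settings.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model: live config in memory, plus the last copy saved to disk."""
    live_config: dict
    disk_backup: dict = field(default_factory=dict)

    def change(self, **settings) -> None:
        self.live_config.update(settings)

    def clean_shutdown(self) -> None:
        # Assumption from the thread: only a clean shutdown refreshes the backup.
        self.disk_backup = dict(self.live_config)

    def panic(self) -> None:
        # Unclean stop: nothing is written, so the backup stays as it was.
        pass

    def restore_from_disk(self) -> dict:
        # Hypothetical stand-in for boot menu option (6).
        return dict(self.disk_backup)


# Hypothetical sequence of events:
node = Node(live_config={"name": "NETAPP-8040-01"})
node.clean_shutdown()              # backup taken while still on the old name
node.change(name="ourarray-01")    # renamed later, never cleanly halted since
node.panic()
print(node.restore_from_disk())    # {'name': 'NETAPP-8040-01'}
```

In this model the restore brings back whatever was current at the last clean halt, not what the node was running when it failed.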

#

And in the case of the FAS80x0, the NVRAM controller is on-board. There is a dedicated DIMM for NVRAM on those systems, though. That is why your SystemID changed when you replaced the PCM.