#AFF-A20 2 node switchless cluster failing to create during setup

sour furnace
#

Hey All

Just trying to set up our brand new 2 node switchless cluster.

As NetApp removed the onboard ports for Cluster Networking, our NetApp Sales Team added 2 quad port 10/25 cards into each node.

When I plug in E4a and E4b back to back, the ports light up but the Cluster Setup does not detect the partner node

E4a from node 1 to E4a node 2
E4b from node 1 to E4b node 2

I also tried

E2a from node 1 to E2a node 2
E4a from node 1 to E4a node 2

Looking at HWU it seems to only list Dual NIC for switchless cluster

A20 came with 9.16.1RC1

This was the console output

<snip>
Otherwise, press Enter to complete cluster setup using the command line
interface:

Do you want to create a new cluster or join an existing cluster? {create, join}:
create

Error: command failed: Creation of auto-cluster LIFs is not complete. Verify
that this node is cabled correctly, and then try the operation again.

<snip>

I have logged a case with NetApp and our account team; just thought this might be faster if someone has played with the A20s

rocky light
#

You should be able to exit the cluster setup, add the ports manually to the Cluster broadcast-domain in the Cluster ipspace (you might even need to create it first) and make sure you use MTU 9000. Also make sure no other ports are in there. Then start the cluster setup again.

But if that exact installed card is not listed in HWU I would say it's most likely not supported to be used as Cluster Interconnect.

sour furnace
#

the cards we got were 2 x X60132A-C IO Module,4PT,10/25GbE,-C

#

This was from a quick poke around on the nodes

::> network port show

Node: localhost
                                             Speed(Mbps)  Health
Port  IPspace  Broadcast Domain  Link  MTU   Admin/Oper   Status
----  -------  ----------------  ----  ----  -----------  ------
e0M   Default  -                 up    1500  auto/1000    -
e2a   Cluster  -                 -     9000  auto/-       -
e2b   Default  -                 down  9000  auto/auto    -
e2c   Default  -                 down  9000  auto/auto    -
e2d   Default  -                 down  9000  auto/auto    -
e4a   Cluster  -                 -     9000  auto/-       -
e4b   Default  -                 down  9000  auto/auto    -
e4c   Default  -                 down  9000  auto/auto    -
e4d   Default  -                 down  9000  auto/auto    -
9 entries were displayed.

::> network interface show
         Logical    Status      Network            Current    Current  Is
Vserver  Interface  Admin/Oper  Address/Mask       Node       Port     Home
-------  ---------  ----------  -----------------  ---------  -------  ----
Default
         mgmt1      up/up       10.200.200.170/24  localhost  e0M      true

::> version
NetApp Release 9.16.1RC1: Mon Nov 04 10:39:46 UTC 2024

Notice: Showing the version for the local node; the cluster-wide version could
not be determined.

::>

#

looking at the docs it kinda lists 2 ports

#

we ordered these pretty much on day 1, and I was watching a video and picked up that NetApp removed the onboard cluster ports from the controllers. The NetApp team just added in another quad. I'll see what support comes back with, and if we need new Dual Port Cards I'm sure NetApp will fix it up given it's so new 🙂..

sly island
#

So since you have two of those NICs, the system defaulted to the 2 NIC config, which is fine. So you should be using the e2a and e4a ports for cluster connection.
Are you sure you are using the right cables? 10G DAC cables and 25G DAC cables are different and only support their respective speeds (no down-negotiation). If you're using SFPs, make sure they're the correct ones and you're using the same speed on both the controller and the switch side.

#

also, when HWU contradicts other documentation, you can assume that HWU is correct and the other documentation isn't

sour furnace
#

We did not get any DACs in the box (which surprised me) and the cluster was shipped with 16 x X65404-N-C (SFP2825GbESR-C)

sly island
#

are you using cluster switches ? or is it a switchless cluster?

sour furnace
#

2 node switchless cluster

sly island
#

it's strange that the system shows the ports as neither "up" nor "down", just with a "-" ...

sour furnace
#

P2 ticket has now gone to core engineering so let's see what they come back with.. but gut feel was the NetApp quote tool might have been missing the requirements for cluster cards at the time, but it might have been updated now (2 months later)..

sly island
#

well, the X60132 cards are listed as supported for the cluster ports, so that should be fine 🤷‍♂️

sour furnace
#

I wonder if it is the 9.16.1RC1 and I might need the GA release..

sly island
#

ah, good point, I was looking at the GA config

#

but that usually shouldn't matter

#

yeah it should not matter

#

please keep us updated 🙂

sour furnace
#

Will do.. we have 2 of these clusters just arrived from Singapore and I think we are very close to being the first in Australia with this new gen of hardware

civic hearth
#

@sour furnace If you have these in your Netapp: X65404, specifically in ports e2a and e4a, just run a fiber from e2a <-> e2a and another from e4a <-> e4a.

#

Whoever sold this was lazy and did not mark the "0.5 cluster interconnect cables" for the order. Any OM3/OM4 rated fiber (usually aqua in color) with LC/LC ends will do

#

After you see link, you should be able to go to the CLI and run "cluster setup" on one node, and when it finishes, repeat on the second node

#

if you want, before you run cluster setup, run this command:
net port show -ip Cluster

That will show just the cluster ports and you can verify the link is up before continuing. If it fails, post the output back here
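
If you end up checking this on several nodes, the "is the link up" check can be scripted against pasted output. The following is only a rough sketch in Python, not a NetApp tool: it assumes every data row of `network port show` has exactly seven whitespace-separated fields (Port, IPspace, Broadcast Domain, Link, MTU, Admin/Oper, Status), and the function names are made up for illustration. The sample rows are the ones posted earlier in this thread.

```python
def parse_port_rows(text):
    """Parse data rows of pasted `network port show` output into dicts.

    Assumption: a data row has exactly 7 whitespace-separated fields and a
    numeric MTU in the 5th field. Headers, prompts and the "N entries"
    footer do not match that shape and are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[4].isdigit():
            port, ipspace, _domain, link, mtu, speed, _health = fields
            rows.append({"port": port, "ipspace": ipspace, "link": link,
                         "mtu": int(mtu), "speed": speed})
    return rows


def cluster_ports_not_ready(text, expected_mtu=9000):
    """Return Cluster-IPspace ports whose link is not 'up' or whose MTU differs."""
    return [r["port"] for r in parse_port_rows(text)
            if r["ipspace"] == "Cluster"
            and (r["link"] != "up" or r["mtu"] != expected_mtu)]


SAMPLE = """\
Port IPspace Broadcast Domain Link MTU Admin/Oper Status
e0M Default - up 1500 auto/1000 -
e2a Cluster - - 9000 auto/- -
e2b Default - down 9000 auto/auto -
e4a Cluster - - 9000 auto/- -
9 entries were displayed.
"""

print(cluster_ports_not_ready(SAMPLE))  # ['e2a', 'e4a']
```

An empty list would mean every Cluster port reports link "up" at MTU 9000; anything listed is worth looking at before running cluster setup.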

#

I would talk to your NetApp sales team. They have a process to get "missing cables" rectified. They should be able to order 2 x X66240A-05 (that is a 0.5M twinax cable, get 2!)

sour furnace
#

we installed the second cluster today and same issues..

#

speaking to NetApp they said the Quad cards are supported for cluster interconnects.

#

We have a webex catchup next week to work through it

#

::> run -node localhost -command sysconfig -ac
sysconfig: slot 2 OK: X60132A: 4p, 10G/25G Ethernet Controller CX7
sysconfig: slot 4 OK: X60132A: 4p, 10G/25G Ethernet Controller CX7
sysconfig: No shelf configuration errors detected.
sysconfig: There are no configuration errors.

::>

#

slot 0: Intel USB XHCI Adapter u0b (0x0000200ffff60000)
slot 2: Quad 10G/25G Ethernet Controller CX7
        e2a MAC Address: d0:39:ea:c6:c7:55 (auto-25g_sr-fd-up)
        e2b MAC Address: d0:39:ea:c6:c7:56 (auto-unknown-fd-down)
slot 2: Quad 10G/25G Ethernet Controller CX7
        e2c MAC Address: d0:39:ea:c6:c7:57 (auto-unknown-fd-down)
        e2d MAC Address: d0:39:ea:c6:c7:58 (auto-unknown-fd-down)
slot 4: Quad 10G/25G Ethernet Controller CX7
        e4a MAC Address: d0:39:ea:c6:42:01 (auto-10g_sr-fd-up)
        e4b MAC Address: d0:39:ea:c6:42:02 (auto-unknown-fd-down)
slot 4: Quad 10G/25G Ethernet Controller CX7
        e4c MAC Address: d0:39:ea:c6:42:03 (auto-unknown-fd-down)
        e4d MAC Address: d0:39:ea:c6:42:04 (auto-unknown-fd-down)

#

it's funny that e2a <-> e2a has negotiated 25Gb.. yet e4a is 10Gb
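
For anyone wanting to spot this kind of mismatch quickly, here is a small Python sketch that pulls the media string out of port lines like the ones above and flags ports that disagree. It is an illustration only: the regex assumes the `(auto-<media>-fd-<up|down>)` shape seen in this one output, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   e2a MAC Address: d0:39:ea:c6:c7:55 (auto-25g_sr-fd-up)
# Assumption: the parenthesised part always looks like auto-<media>-fd-<state>.
PORT_LINE = re.compile(
    r"^\s*(e\w+)\s+MAC Address:\s+\S+\s+\(auto-([^-]+)-fd-(up|down)\)"
)


def link_media(text):
    """Map port name -> (media, state) for every matching port line."""
    found = {}
    for line in text.splitlines():
        m = PORT_LINE.match(line)
        if m:
            found[m.group(1)] = (m.group(2), m.group(3))
    return found


def mismatched(text, ports):
    """True if the named ports do not all report the same media string.

    Raises KeyError if a named port is missing from the pasted output.
    """
    media = link_media(text)
    return len({media[p][0] for p in ports}) > 1


SAMPLE = """\
e2a MAC Address: d0:39:ea:c6:c7:55 (auto-25g_sr-fd-up)
e2b MAC Address: d0:39:ea:c6:c7:56 (auto-unknown-fd-down)
e4a MAC Address: d0:39:ea:c6:42:01 (auto-10g_sr-fd-up)
"""

print(link_media(SAMPLE)["e4a"])           # ('10g_sr', 'up')
print(mismatched(SAMPLE, ["e2a", "e4a"]))  # True
```

Here `True` just means e2a and e4a report different media (25g_sr vs 10g_sr), which is the oddity noted above.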

#

we will work it through with NetApp, and if there is tweaking to the boot params for these cards we will sort it out, and hopefully the next customers and NetApp PSE installers don't run into this

sour furnace
#

::> net port show -ip Cluster
(network port show)

Node: localhost
                                             Speed(Mbps)  Health
Port  IPspace  Broadcast Domain  Link  MTU   Admin/Oper   Status
----  -------  ----------------  ----  ----  -----------  ------
e2a   Cluster  -                 -     9000  auto/-       -
e4a   Cluster  -                 -     9000  auto/-       -
2 entries were displayed.

NODE 2

::> net port show -ip Cluster
(network port show)

Node: localhost
                                             Speed(Mbps)  Health
Port  IPspace  Broadcast Domain  Link  MTU   Admin/Oper   Status
----  -------  ----------------  ----  ----  -----------  ------
e2a   Cluster  -                 -     9000  auto/-       -
e4a   Cluster  -                 -     9000  auto/-       -
2 entries were displayed.

::>

sly island
#

While we have a couple AFF A20 2-node clusters here in preparation for customer delivery, none of them is in a config with two 4-port cards (they all have a 2-port card for HA and a 4-port card for networking) so I cannot check if they exhibit the same issues.

civic hearth
#

Just curious on a couple things

What happens if you delete the cluster LIFs and modify the ports to be default? Does the link come up?

If I had a little time, I would do a full reinit
From loader: set-defaults, saveenv, NetBoot 9.16.1
At the menu, install new software (this wipes and reformats the boot media after NetBoot; don’t restore, and then reboot on the new image)
Finally do the 9a/9b

topaz wigeon
#

For what it is worth, boot parameters don't have anything to do with the negotiated port speed. There is an EEPROM inside of the SFP28 transceiver that tells the port at which speed to run. At first blush, you have a 10GbE (SFP+) cable inserted into 25GbE (SFP28) port e4a.

#

Since you have identical 25GbE adapters in slots 2 & 4, this is the correct behavior: cluster & HA will use e2a & e4a. In the case of a mismatched adapter in slots 2 & 4, the cluster & HA ports will be e4a & e4b.
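
Written out as a tiny Python sketch, the rule as described in this message looks like the following. This only restates what is said here and is not taken from NetApp documentation; comparing part-number strings for equality is this sketch's stand-in for "identical adapters", and `OTHER-CARD` is a made-up placeholder.

```python
def cluster_ha_ports(slot2_adapter, slot4_adapter):
    """Return the cluster & HA ports per the rule described in this thread.

    Identical adapters in slots 2 and 4 -> e2a and e4a.
    Mismatched adapters                 -> e4a and e4b.
    """
    if slot2_adapter == slot4_adapter:
        return ("e2a", "e4a")
    return ("e4a", "e4b")


print(cluster_ha_ports("X60132A", "X60132A"))     # ('e2a', 'e4a')
print(cluster_ha_ports("OTHER-CARD", "X60132A"))  # ('e4a', 'e4b')
```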

civic hearth
#

I agree Scott, but I’ve also seen some very odd loader variables on fresh installations. The set-defaults will clear them. When possible, I always re-init/clean install.

#

Heck, I’ve seen new disks on a new install with ownership that wasn’t in the cluster and had engineering names

sour furnace
#

BTW this was our BOM

sly island
#

I mean the fibers in each cable, not the cables between the ports

civic hearth
#

I’m thinking the same thing. One end needs to be “rolled”

#

Flip the tx/rx at one end

#

That may be why the link status is odd.

stable sluice
#

i think 25GE qsfp can tx/rx on both ports

sly island
#

still, the link showing up as "-" instead of "down" is kinda weird

sour furnace
#

We had an A250 during Covid without DAC and NetApp supplied sfp+ and optic cables and we never rolled them and it worked fine.. DAC cables aren’t rolled are they ?

civic hearth
#

DAC does auto detection

main crown
#

This is a recently discovered issue with recently shipped AFF A20 systems. An environment variable (bootarg) controlling HA behavior was incorrectly set in Manufacturing. It is currently tracked internally by issue CONTAP-383633.

I recommend following this workaround...

On each node, go directly to the LOADER prompt and check the following environment variable:
LOADER> printenv bootarg.ha_port_sharing_disabled

If this returns "true", execute the following:

LOADER> setenv bootarg.ha_port_sharing_disabled false
LOADER> saveenv

Then boot both nodes and follow normal cluster setup.
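
For anyone turning this into a per-node checklist, the decision above can be sketched in Python as below. This is only a restatement of the steps in this message: the helper name is hypothetical, and how LOADER actually prints the value (quoted or not, upper or lower case) is an assumption, so the comparison is deliberately forgiving.

```python
BOOTARG = "bootarg.ha_port_sharing_disabled"


def loader_fix_commands(printed_value):
    """Return the LOADER commands to type, given the value printenv showed.

    Assumption: the value may come back with quotes or different casing,
    so both are stripped/normalised before comparing against "true".
    """
    value = printed_value.strip().strip('"').lower()
    if value == "true":
        return [f"setenv {BOOTARG} false", "saveenv"]
    # Any other value: nothing to change according to the workaround.
    return []


print(loader_fix_commands("true"))
# ['setenv bootarg.ha_port_sharing_disabled false', 'saveenv']
print(loader_fix_commands("false"))
# []
```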

sour furnace
#

I’ll give it a crack first thing Monday morning and keep you posted

#

😃

civic hearth
#

@main crown just curious
Would running “set-defaults” also “clear” or “unset” that arg?

sour furnace
#

boom.. that will do it

#

just halted both nodes and ran set-defaults on each and then boot_ontap .. @main crown @civic hearth

sour furnace
#

so it looks like NetApp have given us the wrong SFPs in the second quad card

#

when I google the 2 SFP part numbers, the SFPs in the quad card in slot 2 (which have negotiated a 25Gb CI link) are 25G, but the SFPs in the quad card in slot 4 (which have negotiated a 10Gb CI link) are listed as 10G.. argh

#

and our BOM said 16 x 25g

#

This is for both our A20 clusters

sly island
#

did you check in all the shipping boxes? Sometimes they throw in a couple of SFPs separately and you have to replace them yourself

sour furnace
#

yeah nothing in them.. All SFPs were inserted into the PCI cards when unboxing

sly island
#

yeah, they are always plugged in, but we had instances where the wrong ones were plugged in and they just threw the correct ones into one of the boxes instead of replacing them in the cards... Thought maybe that happened to you as well. But if not then yeah, that sucks 😦 I would probably match the SFPs for cluster and HA, i.e. use the 25g SFPs for those and use the 10g for client connectivity. At least until the correct ones are shipped

civic hearth
#

If they are all plugged in why aren’t all of them in your output?

If you look at them, most (not all unfortunately) indicate the speed on them

civic hearth
#

@sour furnace I would at least flip the sfp from e2b and move it to e4a on both nodes and then worry about getting the correct sfps

sly island
#

yeah that's what I suggested as well. running a cluster with mismatched speed in the backend is not a good idea (although I doubt it would make much of a difference at these speeds)

sour furnace
#

after a bit of SFP shuffling (until we get the extra 25G SFPs)

sour furnace
#

thank you all for the advice and assistance !

#

now onto Openshift/Trident/Trident-Protect 🙂