#MetroCluster failover


worthy surge
#

Hello, dear all.
We have set up Trident with MetroCluster, following the Trident documentation. When we ran a MetroCluster switchover test, we discovered that pods with PVCs on that MetroCluster were restarted and were down for the duration of the switchover (20-30 seconds). The question is: is that expected behaviour (that pods will be restarted in any case)? Or should a switchover be transparent, and we have configured something wrong?

sick cradle
#

This KB might help: Trident_backend_definition_update_on_MetroCluster_switchover_and_-back_made_easy

sick cradle
#

Also recommend Trident v22.07 or higher for MC

worthy surge
mild cairn
#

@worthy surge, this should be a transparent switchover with recent versions of Trident.

worthy surge
#

Hi, @mild cairn and @lyric iris. We upgraded Trident to the latest version, followed the provided KB and ran a failover test. The pods were restarted. Is there something we're doing wrong? 🙂 Anything else we can check maybe?

cyan zealot
#

can you paste your backend definition?

worthy surge
#

I can see that the DataLIF and SVM properties have values, but we did remove them when we applied the backend YAML.

#

Originally, the DataLIF and SVM properties had values in the backend, which was managed by tridentctl. To follow the provided KB, we removed those values from the tridentbackendconfig YAML and applied it. The kubectl apply... was successful, but it looks like it didn't remove the values from the backend.

cyan zealot
#

update needs to be done from tridentctl, something like:
tridentctl update backend backend -f tridentbackendconfig.yaml -n trident

worthy surge
#

Hi, @cyan zealot. Should the YAML contain only the values I want to update (add / remove)?

worthy surge
#

And I am getting this error: Cannot update backend created using TridentBackendConfig CR; please update the TridentBackendConfig CR instead. I have a feeling that once the TBC is created, the backend must be managed via kubectl only.

#

If I run kubectl apply -f tridentbackendconfig.yaml, it is applied, but the values I want to remove (like dataLIF) are still there.

cyan zealot
worthy surge
#

Got it. Let me try to find it

worthy surge
#

Hi, @cyan zealot.
I wasn't able to find the original YAML, so I dumped the current setup by running ./tridentctl get backend -n trident backend -o yaml. I removed only the dataLIF and svm values and applied the updated YAML by running ./tridentctl -n trident update backend backend -f backend.yaml.

This is the error I got: Error: could not update backend backend: cannot update backend 'backend' created using TridentBackendConfig CR; please update the TridentBackendConfig CR (400 Bad Request)

cyan zealot
#

Tried to repro, but works here
[root@rhel3 ~]# tridentctl -n trident update backend BackendForNAS -f BackendForNAS.yaml
+---------------+----------------+--------------------------------------+--------+---------+
| NAME          | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+---------------+----------------+--------------------------------------+--------+---------+
| BackendForNAS | ontap-nas      | 586b1cd5-8cf8-428d-a76c-2872713612c1 | online |       0 |
+---------------+----------------+--------------------------------------+--------+---------+

can you look if your CRDs are there?
k get crd -n trident

you should have:
tridentbackendconfigs.trident.netapp.io
tridentbackends.trident.netapp.io

k get tbe -n trident -> should give all your backends.
k get tbc -n trident -> should be empty

worthy surge
#

Hi, @cyan zealot
I have the CRDs you have mentioned. And I have TBCs.

NAME                     BACKEND NAME             BACKEND UUID                           PHASE   STATUS
backend                  backend                  akf21b59-a3df-41b1-810e-f59be4bc20d4   Bound   Success

Maybe I should give a bit more context 🙂. We started to use Trident a few years ago. It was configured with tridentctl. A few months ago we moved to managing it with TBC. A few weeks ago we were asked to set up MetroCluster with Trident. The KB provided above asks us to make sure the dataLIF and SVM properties are empty. The original backend configuration does include those properties and their values. Now we need to remove those values. And the question is: how do we do that if the backend is managed by TBC/TBE now?

cyan zealot
#

ok, a TBC, sorry confusion on my end, tried that:
k get tbc backend-tbc-ontap-nas -n trident
NAME                    BACKEND NAME        BACKEND UUID                           PHASE   STATUS
backend-tbc-ontap-nas   ontap-nas-backend   7491fe89-2ff5-4480-a721-7fecc6e17bee   Bound   Success

k describe tbc backend-tbc-ontap-nas -n trident
...
Spec:
  Auto Export CID Rs:
    192.168.0.0/24
  Backend Name: ontap-nas-backend
  Credentials:
    Name: tbcsecret
  Data LIF: 192.168.0.132
  Management LIF: 192.168.0.135
  Storage Driver Name: ontap-nas
  Svm: nfs_svm
  Version: 1
...

k edit tbc backend-tbc-ontap-nas -n trident
<removed the dataLIF and the svm line>
tridentbackendconfig.trident.netapp.io/backend-tbc-ontap-nas edited

k describe tbc backend-tbc-ontap-nas -n trident
...
Spec:
  Auto Export CID Rs:
    192.168.0.0/24
  Backend Name: ontap-nas-backend
  Credentials:
    Name: tbcsecret
  Management LIF: 192.168.0.135
  Storage Driver Name: ontap-nas
  Version: 1
Status:
  Backend Info:
    Backend Name: ontap-nas-backend
    Backend UUID: 7491fe89-2ff5-4480-a721-7fecc6e17bee
  Deletion Policy: delete
  Last Operation Status: Success
  Message: Backend 'ontap-nas-backend' updated
  Phase: Bound
Events:
  Type    Reason   Age    From                    Message
  Normal  Success  2m52s  trident-crd-controller  Backend 'ontap-nas-backend' created
  Normal  Success  13s    trident-crd-controller  Backend 'ontap-nas-backend' updated

Seems to work.... but change is not reflected in tridentctl....
Will investigate some more...
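
For reference, a minimal sketch of a non-interactive equivalent of the kubectl edit step above, assuming the TBC name and namespace from that output. It uses a JSON patch with "remove" operations so the keys are deleted from the spec rather than left untouched. The KUBECTL, TBC_NAME and NAMESPACE variables are hypothetical helpers for illustration, and by default the script only prints the command instead of running it.

```shell
#!/bin/sh
# Assumed names: TBC "backend-tbc-ontap-nas" in namespace "trident" (from the output above).
TBC_NAME="${TBC_NAME:-backend-tbc-ontap-nas}"
NAMESPACE="${NAMESPACE:-trident}"
# Dry run by default: only print the command. Set KUBECTL=kubectl to execute it.
KUBECTL="${KUBECTL:-echo kubectl}"

# JSON patch "remove" deletes the keys outright; it fails if a key is already absent.
PATCH='[{"op":"remove","path":"/spec/dataLIF"},{"op":"remove","path":"/spec/svm"}]'

remove_lif_and_svm() {
  $KUBECTL patch tbc "$TBC_NAME" -n "$NAMESPACE" --type json -p "$PATCH"
}

out="$(remove_lif_and_svm)"
printf '%s\n' "$out"
```

Running it with KUBECTL=kubectl applies the patch; checking the result with k describe tbc as above should then show the spec without the Data LIF and Svm lines.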

cyan zealot
#

ok figured it out: without dataLIF and svm in the TBC, Trident auto-populates them with the actual values it discovers through the managementLIF. However, that re-inquiry only happens when a change is made to the TBC. So changing a property such as debugTraceFlags (and changing it back), or any other property, will update the backend automagically 🙂
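
A rough sketch of that "change a property and change it back" trick, assuming a TBC named backend in the trident namespace (as in the output earlier in the thread). It flips the api key of debugTraceFlags with two merge patches. The KUBECTL, TBC_NAME and NAMESPACE variables are hypothetical helpers, and by default the commands are only printed, not executed. If your backend already sets debugTraceFlags, pick values that restore your original setting.

```shell
#!/bin/sh
# Assumed names: TBC "backend" in namespace "trident".
TBC_NAME="${TBC_NAME:-backend}"
NAMESPACE="${NAMESPACE:-trident}"
# Dry run by default: only print the commands. Set KUBECTL=kubectl to execute them.
KUBECTL="${KUBECTL:-echo kubectl}"

# Change a harmless property and change it back; each change to the TBC
# makes the controller update the backend.
toggle_tbc() {
  $KUBECTL patch tbc "$TBC_NAME" -n "$NAMESPACE" --type merge \
    -p '{"spec":{"debugTraceFlags":{"api":true}}}'
  $KUBECTL patch tbc "$TBC_NAME" -n "$NAMESPACE" --type merge \
    -p '{"spec":{"debugTraceFlags":{"api":false}}}'
}

out="$(toggle_tbc)"
printf '%s\n' "$out"
```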

cyan zealot
#

BTW this is different for tridentctl update backend: if you control your backend with tridentctl, updating the backend with the same file, without dataLIF and svm defined, will trigger a backend update through the managementLIF.

worthy surge
cyan zealot
#

I updated the KB mentioned earlier, could you have a look at that?

worthy surge
#

Sure. Let me see

#

Ok. So.. If i got it properly, the kubectl edit tbe ... (the word tbe is missing in the documentation if i am getting it properly 🙂 ) + removing the values should do the job, right?

cyan zealot
#

yep that will work

#

will add the tbe

worthy surge
#

that would be great. going to test it and will keep you posted. thank you so much for your help 🙂 🙏🏻

worthy surge
#

Hi, @cyan zealot .
So, we made a failover test now, and both pods went down.

cyan zealot
#

because host lost connection to the storage?

worthy surge
#

yes

#

those were 2 pods, sitting on two different k8s worker nodes, both with mounted PVCs from MC

cyan zealot
#

so something needs changing in your NFS setup... maybe open a case for that
But, the storage backend is working, right, you can still get a new PVC

worthy surge
#

yeah. the backend is ok and a few seconds after, when the failover task is completed, both pods were up and running. so, no issue there. just during the failover itself...

worthy surge
#

@cyan zealot, should we open a regular support ticket and explain the problem? Is Trident supported directly by NetApp?

cyan zealot
#

yes it is (on both), but I think this is an NFS problem, not Trident

#

Trident only does the creation, mapping, grow, snapshot, etc... the host itself maintains the NFS connection

#

so opening a case to investigate would help

wraith tiger
#

Hi, colleague of @worthy surge here. This is an expensive MetroCluster, so the whole point is that the NFS client does not lose the connection. At least this is what your sales guys told us 😆. In our setup, no other NFS clients have ever complained when we do MC switchovers or switchbacks. So I would assume the MC does what it's paid for. To me it appears that Trident is overly aggressive in failing a connection that's maybe down a few msecs too long? Or is there no health check implemented in Trident at all? If that is the case, I am totally with you and will start investigating on the k8s host side.

cyan zealot
#

Trident does not have a health check like that.
You can try setting nfsOptions to see if timeout/retry settings help
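
A hedged sketch of what that could look like. As far as I know, the field in the ONTAP NAS backend options is called nfsMountOptions. The TBC name, LIF address and secret name are taken from the repro earlier in the thread, and the hard,timeo=600,retrans=2 values are purely illustrative assumptions, not a recommendation. The script only writes the manifest to a temporary file and prints it. Mount options set in a StorageClass may take precedence over the backend setting, so check both.

```shell
#!/bin/sh
# Write an example TridentBackendConfig with NFS mount options to a temp file.
# All values below are illustrative assumptions; adjust them to your environment.
TBC_FILE="$(mktemp)"

cat > "$TBC_FILE" <<'EOF'
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-tbc-ontap-nas
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas
  backendName: ontap-nas-backend
  managementLIF: 192.168.0.135
  credentials:
    name: tbcsecret
  # Hard mount with explicit timeout (tenths of a second) and retry count.
  nfsMountOptions: "hard,timeo=600,retrans=2"
EOF

cat "$TBC_FILE"
# To apply it for real: kubectl apply -f "$TBC_FILE"
```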

wraith tiger
#

OK, then we'll investigate on the k8s host side, thanks!

neon wagon
#

NFS does not use "connections"; it is a stateless protocol. When the IP moves somewhere else, the client picks up exactly where it left off as soon as NFS packets start getting through. Usually the propagation of the gratuitous ARPs takes a few seconds (I've seen up to 30 seconds in production systems), and the switchover adds a few more seconds (up to 15 if it's unplanned). As long as the NFS mount is a hard mount, it will block and resume as soon as the IP is reachable again.
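
One way to check the hard-mount condition on a k8s worker node is to look at the mount options the kernel reports. The sketch below assumes a Linux host where /proc/mounts lists hard or soft among the NFS options. classify_nfs_mounts is a hypothetical helper name, and it takes an optional file argument so it can be tried against a saved copy.

```shell
#!/bin/sh
# Print "<mountpoint> <fstype> <hard|soft|unknown>" for every NFS mount
# listed in the given mounts file (default: /proc/mounts).
classify_nfs_mounts() {
  awk '$3 ~ /^nfs/ {
    mode = "unknown"
    n = split($4, opts, ",")
    for (i = 1; i <= n; i++) {
      if (opts[i] == "hard" || opts[i] == "soft") mode = opts[i]
    }
    print $2, $3, mode
  }' "${1:-/proc/mounts}"
}

# Only run against the live table when it exists (Linux hosts).
if [ -r /proc/mounts ]; then
  classify_nfs_mounts
fi
```

Any PVC mount reported as soft would return I/O errors to the pod after its retries run out instead of blocking, which would be a candidate explanation for pods failing during the switchover window.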

fluid tide
neon wagon
fluid tide
#

hmm, okay, didn't know that. There is also no such distinction in e.g. TR-4067 and TR-3580. They always say: NFSv4.x is stateful while NFSv3 is stateless.