Hello, dear all.
We have set up Trident + MetroCluster following the Trident documentation. When we ran a MetroCluster switchover test, we discovered that pods with PVCs on that MetroCluster were restarted and were down for the duration of the switchover (20-30 seconds). The question is: is that expected behaviour (that pods will be restarted in any case)? Or should a switchover be transparent, and we configured something wrong?
#MetroCluster failover
This KB might help: Trident_backend_definition_update_on_MetroCluster_switchover_and_-back_made_easy
Also recommend Trident v22.07 or higher for MC
Yeah. Thanks. This KB was implemented. But do you maybe know the answer to my original question: should the failover be transparent to the pods, or is a small downtime expected and normal behavior?
@worthy surge, this should be a transparent switchover with recent versions of Trident.
Hi, @mild cairn and @lyric iris . We upgraded Trident to the latest version, followed the provided KB, and ran a failover test. The pods were restarted. Is there something we're doing wrong? 🙂 Anything else we can check, maybe?
can you paste your backend definition?
Please
I can see that the dataLIF and SVM properties have values, but we did remove them when we applied the backend YAML.
Originally, the dataLIF and SVM properties had values in the backend, which was managed by tridentctl. But in order to follow the provided KB, we removed those values from the tridentbackendconfig YAML and applied it. The kubectl apply... was successful, but it looks like it didn't remove the values from the backend.
update needs to be done from tridentctl, something like:
tridentctl update backend backend -f tridentbackendconfig.yaml -n trident
Hi, @cyan zealot . Should the YAML contain only the values I want to update (add / remove)?
And I am getting this error: Cannot update backend created using TridentBackendConfig CR; please update the TridentBackendConfig CR instead. I have a feeling that once the TBC is created, it must be managed via kubectl only.
If i run kubectl apply -f tridentbackendconfig.yaml this is applied, but the values i want to remove (like dataLIF) are still there
No, all values should be there.
I think you are editing the wrong YAML file. It should be the file that you used to create the backends, e.g. the ones you get when running tridentctl get backend -n trident
Got it. Let me try to find it
Hi, @cyan zealot .
I wasn't able to find the original YAML, so I made a dump of the current setup by running ./tridentctl get backend -n trident backend -o yaml. I removed only the dataLIF and svm values and applied the updated YAML by running ./tridentctl -n trident update backend backend -f backend.yaml.
This is the error I got: Error: could not update backend backend: cannot update backend 'backend' created using TridentBackendConfig CR; please update the TridentBackendConfig CR (400 Bad Request)
Tried to repro, but works here
[root@rhel3 ~]# tridentctl -n trident update backend BackendForNAS -f BackendForNAS.yaml
+---------------+----------------+--------------------------------------+--------+---------+
| NAME | STORAGE DRIVER | UUID | STATE | VOLUMES |
+---------------+----------------+--------------------------------------+--------+---------+
| BackendForNAS | ontap-nas | 586b1cd5-8cf8-428d-a76c-2872713612c1 | online | 0 |
+---------------+----------------+--------------------------------------+--------+---------+
can you look if your CRDs are there?
k get crd -n trident
you should have:
tridentbackendconfigs.trident.netapp.io
tridentbackends.trident.netapp.io
k get tbe -n trident -> should give all your backends.
k get tbc -n trident -> should be empty
Hi, @cyan zealot
I have the CRDs you have mentioned. And I have TBCs.
NAME BACKEND NAME BACKEND UUID PHASE STATUS
backend backend akf21b59-a3df-41b1-810e-f59be4bc20d4 Bound Success
Maybe I should give a bit more context 🙂 . We started to use Trident a few years ago. It was configured with tridentctl. A few months ago we moved to managing it with TBC. A few weeks ago we were asked to set up MetroCluster with Trident. The KB provided above asks us to make sure the dataLIF and SVM properties are empty. The original backend configuration does include those properties and their values. Now we need to remove those values. And the question is: how do we do it if the backend is managed by TBC/TBE now?
ok, a TBC, sorry confusion on my end, tried that:
k get tbc backend-tbc-ontap-nas -n trident
NAME BACKEND NAME BACKEND UUID PHASE STATUS
backend-tbc-ontap-nas ontap-nas-backend 7491fe89-2ff5-4480-a721-7fecc6e17bee Bound Success
k describe tbc backend-tbc-ontap-nas -n trident
...
Spec:
  Auto Export CID Rs:
    192.168.0.0/24
  Backend Name: ontap-nas-backend
  Credentials:
    Name: tbcsecret
  Data LIF: 192.168.0.132
  Management LIF: 192.168.0.135
  Storage Driver Name: ontap-nas
  Svm: nfs_svm
  Version: 1
...
k edit tbc backend-tbc-ontap-nas -n trident
<removed the dataLIF and the svm line>
tridentbackendconfig.trident.netapp.io/backend-tbc-ontap-nas edited
k describe tbc backend-tbc-ontap-nas -n trident
...
Spec:
  Auto Export CID Rs:
    192.168.0.0/24
  Backend Name: ontap-nas-backend
  Credentials:
    Name: tbcsecret
  Management LIF: 192.168.0.135
  Storage Driver Name: ontap-nas
  Version: 1
Status:
  Backend Info:
    Backend Name: ontap-nas-backend
    Backend UUID: 7491fe89-2ff5-4480-a721-7fecc6e17bee
  Deletion Policy: delete
  Last Operation Status: Success
  Message: Backend 'ontap-nas-backend' updated
  Phase: Bound
Events:
  Type    Reason   Age    From                    Message
  Normal  Success  2m52s  trident-crd-controller  Backend 'ontap-nas-backend' created
  Normal  Success  13s    trident-crd-controller  Backend 'ontap-nas-backend' updated
Seems to work.... but change is not reflected in tridentctl....
Will investigate some more...
OK, figured it out: without dataLIF and svm in the TBC, Trident auto-populates them with the actual values it discovers through the managementLIF. However, that re-query only happens when a change is made to the TBC, so changing a property such as debugTraceFlags (and changing it back), or any other property, will update the backend automagically 🙂
BTW, this is different for tridentctl update backend: if you control your backend with tridentctl, updating the backend with the same file, without dataLIF and svm defined, will trigger a backend update through the managementLIF.
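A minimal sketch of the "change a property and change it back" trick described above, assuming the TBC name and namespace from the example output earlier in this thread. The helper name `tbc_patch` is hypothetical, and using the `method` key of debugTraceFlags is just one possible choice. The script only prints the kubectl commands, so nothing is changed on a cluster until you run them yourself.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON merge patch that sets a single
# debugTraceFlags key ($1) to a boolean value ($2) in the TBC spec.
tbc_patch() {
  printf '{"spec":{"debugTraceFlags":{"%s":%s}}}' "$1" "$2"
}

# Assumed names, taken from the example output above; adjust to your setup.
TBC_NAME="backend-tbc-ontap-nas"
NAMESPACE="trident"

# Dry run: print the two patch commands (flip the flag, then flip it back).
echo "kubectl patch tbc $TBC_NAME -n $NAMESPACE --type merge -p '$(tbc_patch method true)'"
echo "kubectl patch tbc $TBC_NAME -n $NAMESPACE --type merge -p '$(tbc_patch method false)'"
```

After running the printed commands, k describe tbc should show a fresh "updated" event, and tridentctl get backend should reflect the re-discovered values. Note that the second patch leaves the flag set to false rather than removing it.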
Hi, @cyan zealot
I really appreciate your time on this. But unfortunately, I am not 100% sure I understood what I should do to remove the dataLIF and SVM values 🙂 The TBC doesn't have the dataLIF and SVM properties... Only the TBE has them.
Is it possible i can ask you to provide a step by step solution what exactly should i edit / do? 🙏🏻
I updated the KB mentioned earlier, could you have a look at that?
Sure. Let me see
Ok. So.. If i got it properly, the kubectl edit tbe ... (the word tbe is missing in the documentation if i am getting it properly 🙂 ) + removing the values should do the job, right?
that would be great. going to test it and will keep you posted. thank you so much for your help 🙂 🙏🏻
Hi, @cyan zealot .
So, we made a failover test now, and both pods went down.
because host lost connection to the storage?
yes
those were 2 pods, sitting on two different k8s worker nodes, both with mounted PVCs from MC
so something needs changing in your NFS setup... maybe open a case for that
But the storage backend is working, right? You can still get a new PVC?
yeah. the backend is ok and a few seconds after, when the failover task is completed, both pods were up and running. so, no issue there. just during the failover itself...
@cyan zealot , should we open a regular support ticket and explain the problem? Is Trident supported directly by NetApp?
yes it is (on both), but I think this is an NFS problem, not Trident
Trident only does the creation, mapping, grow, snapshot, etc... the host itself maintains the NFS connection
so opening a case to investigate would help
Hi, colleague of @worthy surge here. This is an expensive MetroCluster, so the whole point is that the NFS client does not lose the connection. At least this is what your sales guys told us 😆 . In our setup, no other NFS clients have ever complained when we do MC switchovers or switchbacks. So I would assume the MC does what it's paid for. To me it appears that Trident is overly aggressive in failing a connection that's maybe down a few msecs too long? Or is there no health check implemented in Trident at all? If that is the case, I am totally with you and will start investigating on the k8s host side.
Trident does not have a health check like that.
You can try setting nfsOptions to see if timeout/retry will help
OK, then we'll investigate on the k8s host side, thanks!
NFS does not use "connections", it is a stateless protocol. When the IP moves somewhere else, it will pick up exactly where it left off as soon as NFS packets start getting through. Usually the propagation of the Gratuitous ARPs takes a few seconds (I've seen up to 30 seconds in production systems), and the switchover adds a few more seconds (up to 15 if it's unplanned). As long as the NFS mount is a hard mount, it will block and resume as soon as the IP is reachable again.
That's true for NFSv3 but not for NFSv4
edit: oh, I just realized I'm quite late to the party 😇
actually, it's only untrue for NFS 4.1 and newer, 4.0 didn't use sessions 😉
Hmm, okay, didn't know that. There is also no differentiation in e.g. TR-4067 and TR-3580. They always state: NFSv4.x is stateful while NFSv3 is stateless.
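Regarding the hard-mount point made earlier in this thread: a minimal sketch for checking on a k8s worker node whether its NFS mounts are hard mounts, assuming a Linux host where /proc/mounts lists the effective mount options. The function name `is_hard_mount` is hypothetical; this only reads mount options and does not test switchover behaviour itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) if the comma-separated
# mount option string in $1 contains the "hard" option.
is_hard_mount() {
  case ",$1," in
    *,hard,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report hard / not hard for every NFS mount on this host (Linux only).
if [ -r /proc/mounts ]; then
  while read -r _dev mnt fstype opts _rest; do
    case "$fstype" in
      nfs|nfs4)
        if is_hard_mount "$opts"; then
          echo "$mnt: hard"
        else
          echo "$mnt: NOT hard (I/O may fail during a switchover)"
        fi
        ;;
    esac
  done < /proc/mounts
fi
```

If a PVC mount shows up as not hard, the nfsOptions / mountOptions used for those volumes would be the first thing to review.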