Chuck Fouts0462 yes that s exactly the | NetApp | Page 1

tiny jackal Sep 29, 2022, 7:06 PM

#

@hollow radish in your clusters, are you deleting the Node object after the Spot instance terminates?

hollow radish Sep 29, 2022, 7:07 PM

#

Yes

tiny jackal Sep 29, 2022, 7:08 PM

#

This is built into the automation so it happens fairly quickly after the instance disappears?

#

Just reviewing the Kubernetes documentation regarding deleting StatefulSet pods. https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

hollow radish Sep 29, 2022, 7:11 PM

#

The k8s AWS cloud provider is responsible for deleting the nodes. We do run the AWS node termination handler https://github.com/aws/aws-node-termination-handler that tries to taint and drain nodes before AWS terminates them.

Our clusters are orchestrated using kOps, and we see similar behavior when executing a rolling replace of nodes using kOps. kOps does drain nodes before executing the termination command to the AWS API.

So this might very well have to do with timing, but unfortunately this is how it behaves in AWS.

#

So the multi attach error is kind of expected. The problem is that attaching the PVC on a new node doesn't proceed once the default 6-minute timeout has passed.

#

Even with Rook Ceph PVCs we see multi attach errors for block volumes (RBD), but after being "stuck" for 6 minutes in multi attach error the attach process succeeds.
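
One way to see whether an attachment is still pinned to a node that has already gone away is to compare VolumeAttachment objects against the current Node list. The following is a minimal sketch, not something taken from this thread: it assumes the JSON was exported with `kubectl get volumeattachments -o json` and `kubectl get nodes -o json`, and the function name `stale_volume_attachments` is hypothetical.

```python
import json


def stale_volume_attachments(attachments: dict, nodes: dict) -> list:
    """Return VolumeAttachments whose spec.nodeName is not a current Node.

    Both arguments are parsed `kubectl get ... -o json` list documents.
    """
    live_nodes = {n["metadata"]["name"] for n in nodes.get("items", [])}
    stale = []
    for va in attachments.get("items", []):
        node = va["spec"]["nodeName"]
        if node in live_nodes:
            continue
        stale.append({
            "name": va["metadata"]["name"],
            "node": node,
            # Inline-volume attachments have no persistentVolumeName.
            "pv": va["spec"].get("source", {}).get("persistentVolumeName"),
            "attached": va.get("status", {}).get("attached", False),
        })
    return stale


def load(path: str) -> dict:
    """Read one exported JSON document from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Example (file names are placeholders):
#   stale = stale_volume_attachments(load("va.json"), load("nodes.json"))
```

An entry reported here with `attached` still true would be a candidate explanation for a volume that cannot attach elsewhere, but it is only a hint; the controller and node logs remain the authoritative source.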

tiny jackal Sep 29, 2022, 7:15 PM

#

I believe it probably is a timing issue. There has been focus in that area of the code recently to improve detach performance.

#

Are you seeing any errors in the Trident logs?

hollow radish Sep 29, 2022, 7:16 PM

#

I can easily reproduce the error and fetch some logs for you. Which Trident logs are you most interested in?

tiny jackal Sep 29, 2022, 7:17 PM

#

The Trident controller logs and the node logs for the node that is being terminated

hollow radish Sep 29, 2022, 7:18 PM

#

The main issue is that a PVC needs to reattach after being stuck in the multi attach error for 6 minutes. There is always a risk of a node crashing hard, in which case no detach process will be able to run.
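
As a rough model of the behaviour described here, the wait can be treated as a simple deadline: once a detach has been pending for the timeout, the volume is expected to become attachable elsewhere. This sketch only encodes the 6-minute figure quoted in this thread; the names `MAX_WAIT`, `force_detach_due` and `seconds_until_force_detach` are hypothetical, and the real controller logic has more conditions than this.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 6-minute default mentioned in the thread.
MAX_WAIT = timedelta(minutes=6)


def force_detach_due(pending_since: datetime, now: datetime,
                     max_wait: timedelta = MAX_WAIT) -> bool:
    """True once the detach has been pending for at least max_wait."""
    return now - pending_since >= max_wait


def seconds_until_force_detach(pending_since: datetime, now: datetime,
                               max_wait: timedelta = MAX_WAIT) -> float:
    """Seconds left before the deadline, never negative."""
    remaining = (pending_since + max_wait) - now
    return max(0.0, remaining.total_seconds())


# Example with fixed timestamps so the result is reproducible.
start = datetime(2022, 9, 29, 19, 0, tzinfo=timezone.utc)
print(force_detach_due(start, start + timedelta(minutes=5)))
print(seconds_until_force_detach(start, start + timedelta(minutes=5)))
```

A pod that is still reporting the multi attach error well past this deadline is the case worth capturing logs for.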

#

OK, I'll collect some logs and will come back to you.

tiny jackal Sep 29, 2022, 7:19 PM

#

Thanks! That will help

hollow radish Sep 29, 2022, 7:45 PM

#

I would also like to point out that this doesn't only apply to StatefulSets. A Deployment or any other pod can also mount a Trident PVC. Currently working on collecting logs...

#

Hm, with my current test using the 22.07 release, it looks like the attach process continues after the multi attach error, and I then hit some other iSCSI problems.
Events:
  Type     Reason                  Age                  From                     Message
  Normal   Scheduled               16m                  default-scheduler        Successfully assigned gdonline-echo-test/netapp-block-csi-7b88c59bb5-2ptzp to ip-10-26-206-12.eu-north-1.compute.internal
  Warning  FailedAttachVolume      16m                  attachdetach-controller  Multi-Attach error for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd" Volume is already used by pod(s) netapp-block-csi-7b88c59bb5-qnmxh
  Normal   SuccessfulAttachVolume  16m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd"
  Warning  FailedMount             5m52s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[kube-api-access-m8f4s pvc]: timed out waiting for the condition
  Warning  FailedMount             111s (x9 over 14m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd" : rpc error: code = Internal desc = failed to stage volume: no devices present yet

Please see the attached logs. I will run some more tests.
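
To compare runs like this one, the event lines can be split into fields so that attach failures and mount failures are easy to tell apart. The following is a minimal sketch that assumes the five-column `kubectl describe` layout shown above (Type, Reason, Age, From, Message) and a single-token From column; `age_to_seconds` and `parse_event` are hypothetical helper names.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

_EVENT = re.compile(
    r"^(?P<type>Normal|Warning)\s+(?P<reason>\S+)\s+(?P<age>\S+)"
    r"(?:\s+\(x(?P<count>\d+)\s+over\s+(?P<span>\S+)\))?"
    r"\s+(?P<source>\S+)\s+(?P<message>.*)$"
)


def age_to_seconds(age: str) -> int:
    """Convert an age such as '5m52s' or '16m' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def parse_event(line: str) -> dict:
    """Split one event line into its fields."""
    m = _EVENT.match(line.strip())
    if m is None:
        raise ValueError(f"not an event line: {line!r}")
    return {
        "type": m["type"],
        "reason": m["reason"],
        "last_seen_s": age_to_seconds(m["age"]),
        # Repeated events carry '(xN over SPAN)'; single events do not.
        "count": int(m["count"]) if m["count"] else 1,
        "span_s": age_to_seconds(m["span"]) if m["span"] else None,
        "source": m["source"],
        "message": m["message"],
    }
```

Applied to the events above, this separates the single `FailedAttachVolume` from the attachdetach-controller from the repeated `FailedMount` events reported by the kubelet.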

#

📎 controller-trident-main.log 📎 node-trident-main.log

#

Now I manually terminated a node to simulate a node failure. Trident version 22.07 managed to proceed from the multi attach error after a little while. But I still got the same iSCSI problem as described in my previous message.

#

I might have misinterpreted the problem we see with the 22.07 driver as being identical to the multi attach error. I'm now starting to believe it's a different problem related to iSCSI.

tiny jackal Sep 29, 2022, 8:16 PM

#

Okay, that could be possible. We have been looking at making sure the volume attachment can be removed quickly and successfully when the volume is being unpublished/unstaged.

#

We're always balancing doing things quickly against not properly flushing I/O.

#

I need to step away for a while. It might be a good idea to open a NetApp Support case. They can take the full logs and also provide more frequent updates.

hollow radish Sep 29, 2022, 8:22 PM

#

OK, thanks, and sorry for the inconvenience. I would really like to understand and solve the iSCSI problem we experience with the 22.07 driver. With 22.01.1 the PVCs mount fine on the same nodes with the same configuration. How do I open a support ticket with NetApp? Our backend is AWS FSx.
