#Chuck Fouts0462 yes that s exactly the
@hollow radish in your clusters, are you deleting the Node object after the Spot instance terminates?
Yes
This is built into the automation so it happens fairly quickly after the instance disappears?
Just reviewing the Kubernetes documentation regarding deleting StatefulSet pods. https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
The k8s AWS cloud provider is responsible for deleting the nodes. We do run the AWS node termination handler https://github.com/aws/aws-node-termination-handler that tries to taint and drain nodes before AWS terminates them.
Our clusters are orchestrated using kOps, and we see similar behavior when executing a rolling replace of nodes using kOps. kOps does drain nodes before executing the termination command to the AWS API.
So this might very well have to do with timing, but unfortunately this is how it behaves in AWS.
So the multi attach error is kind of expected. The problem is that the process of attaching a PVC on a new node doesn't proceed after the default 6-minute timeout.
Even with Rook Ceph PVCs we see multi attach errors for block volumes (RBD), but after being "stuck" for 6 minutes in multi attach error the attach process succeeds.
I believe it probably is a timing issue. There has been focus in that area of the code recently to improve detach performance.
Are you seeing any errors in the Trident logs?
I can easily reproduce the error and fetch some logs for you. Which Trident logs are you most interested in?
The Trident controller logs and the node logs for the node that is being terminated
The main issue is that a PVC needs to reattach after being stuck in multi attach error for 6 minutes. There is always a risk of a node crashing hard, in which case no detach process will be able to run.
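A minimal sketch of one way to spot the stuck state described above: list the VolumeAttachment objects that still point at a node which no longer exists. It assumes the JSON from `kubectl get volumeattachments -o json` and `kubectl get nodes -o json` has already been loaded; the function name and the sample object names are hypothetical, and only the standard `metadata.name`, `spec.nodeName`, and `spec.source.persistentVolumeName` fields are read.

```python
from typing import Iterable, List, Set, Tuple


def stale_volume_attachments(
    attachments: Iterable[dict], live_nodes: Set[str]
) -> List[Tuple[str, str, str]]:
    """Return (attachment name, PV name, node name) for every
    VolumeAttachment whose node is not in live_nodes."""
    stale = []
    for va in attachments:
        spec = va.get("spec", {})
        node = spec.get("nodeName", "")
        if node and node not in live_nodes:
            pv = spec.get("source", {}).get("persistentVolumeName", "")
            name = va.get("metadata", {}).get("name", "")
            stale.append((name, pv, node))
    return stale


if __name__ == "__main__":
    # Hypothetical sample data shaped like the "items" list of
    # `kubectl get volumeattachments -o json`.
    sample = [
        {"metadata": {"name": "va-1"},
         "spec": {"nodeName": "node-a",
                  "source": {"persistentVolumeName": "pv-1"}}},
        {"metadata": {"name": "va-2"},
         "spec": {"nodeName": "node-b",
                  "source": {"persistentVolumeName": "pv-2"}}},
    ]
    # node-a was terminated; only node-b is still registered.
    for name, pv, node in stale_volume_attachments(sample, {"node-b"}):
        print(f"{name}: {pv} still attached to missing node {node}")
```

If this reports an attachment for the volume named in the multi attach event, the old attachment has not been cleaned up yet, which would be consistent with the timing issue discussed here.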
OK, I'll collect some logs and will come back to you.
Thanks! That will help
I would also like to point out that this doesn't only apply to StatefulSets. A Deployment or any other pod can also mount a Trident PVC. Currently working on collecting logs...
Hm, with my current test using the 22.07 release, it looks like the attach process continues after the multi attach error and I then hit some other iSCSI problems.
Events:
  Type     Reason                  Age                  From                     Message
  Normal   Scheduled               16m                  default-scheduler        Successfully assigned gdonline-echo-test/netapp-block-csi-7b88c59bb5-2ptzp to ip-10-26-206-12.eu-north-1.compute.internal
  Warning  FailedAttachVolume      16m                  attachdetach-controller  Multi-Attach error for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd" Volume is already used by pod(s) netapp-block-csi-7b88c59bb5-qnmxh
  Normal   SuccessfulAttachVolume  16m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd"
  Warning  FailedMount             5m52s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[pvc], unattached volumes=[kube-api-access-m8f4s pvc]: timed out waiting for the condition
  Warning  FailedMount             111s (x9 over 14m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-50aad985-1f5f-45da-8762-26ae500890dd" : rpc error: code = Internal desc = failed to stage volume: no devices present yet
Please see the attached logs. I will execute some more tests..
Now I manually terminated a node to simulate a node failure. Trident version 22.07 managed to proceed from the multi attach error after a little while. But I still got the same iSCSI problem as described in my previous message.
I might have misinterpreted the problem we see with the 22.07 driver as being identical to the multi attach error. I'm now starting to believe it's a different problem related to iSCSI.
Okay, that could be possible. We have been looking at making sure the volume attachment can be removed quickly and successfully when the volume is being unpublished/unstaged.
We're always balancing doing things quickly against not properly flushing I/O.
I need to step away for a while. It might be a good idea to open a NetApp Support case. They can take the full logs and also provide more frequent updates
OK, thanks, and sorry for the inconvenience. I would really like to understand and solve the iSCSI problem we experience with the 22.07 driver. With 22.01.1 the PVCs mount fine on the same nodes with the same configuration. How do I open a support ticket with NetApp? Our backend is AWS FSx.