I'm currently seeing a failure when running "./tridentctl install -n trident", where the installation gets stuck on "Waiting for Trident pod to start". When I do "oc describe pod ..." I'm seeing this in the events: "Warning FailedScheduling 2m33s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..". Currently running OpenShift 4.15.3 on bare metal with 64 GB of RAM.
#Trident installation with RedHat OpenShift failing.
It looks like you need to do a custom Operator install (https://docs.netapp.com/us-en/trident/trident-get-started/kubernetes-customize-deploy.html#understanding-controller-pods-and-node-pods) using controllerPluginTolerations and nodePluginTolerations to handle the taints that are on your nodes.
gotcha. Will try this out thank you @snow flicker
Having some trouble figuring out how to use controllerPluginTolerations and nodePluginTolerations. Should these be used in one of the yaml files, and if so, which one? I'm currently trying to use tridentctl to generate custom install files.
The parameters on that page go into the trident-installer/deploy/crds/tridentorchestrator_cr.yaml
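For anyone landing here later, a minimal sketch of what those two attributes might look like in the TridentOrchestrator CR. It is written in Python and prints the manifest as JSON, which "oc apply -f" accepts just like YAML. Assumptions: the apiVersion and the metadata name "trident" follow the usual sample CR, the toleration uses the standard Kubernetes toleration fields, and build_orchestrator_cr is a hypothetical helper name. Tolerating disk-pressure only gets the pods scheduled; it does not fix the full disk behind the taint.

```python
import json

# Toleration for the taint reported in the FailedScheduling event.
# operator "Exists" matches the taint whatever its value is; leaving
# "effect" out means the toleration matches any effect for this key.
disk_pressure_toleration = {
    "key": "node.kubernetes.io/disk-pressure",
    "operator": "Exists",
}


def build_orchestrator_cr(namespace, tolerations):
    """Return a TridentOrchestrator manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "trident.netapp.io/v1",  # assumed from the sample CR
        "kind": "TridentOrchestrator",
        "metadata": {"name": "trident"},       # assumed name
        "spec": {
            "namespace": namespace,
            # Attribute names as given in this thread and the linked docs page.
            "controllerPluginTolerations": tolerations,
            "nodePluginTolerations": tolerations,
        },
    }


cr = build_orchestrator_cr("trident", [disk_pressure_toleration])
print(json.dumps(cr, indent=2))
```

The printed JSON can be saved to a file and applied in place of tridentorchestrator_cr.yaml, or the same keys can be copied by hand into the YAML spec.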
got it.
I tried doing a custom install yesterday and saw the same disk pressure failure. When I looked at the pods in the trident namespace, I ended up seeing trident-controller pods being constantly created and failing.
The top of the page tells you this with the sentence "The Trident operator allows you to customize Astra Trident installation using the attributes in the TridentOrchestrator spec." But it took me a conversation with another NetApp tech and some digging to figure it out. 🙄 So don't feel bad about not knowing. I'm only telling you now so it jogs your memory in 6 months when you go to upgrade and forget.
Thank you I appreciate that yeah took me a bit to figure out some of that lol
It was probably the attempt to create them, but k8s couldn't schedule them because of a taint & toleration limitation.
FWIW that's not a taint I've seen before, you might want to check out the OpenShift documentation to see why it was put there.
ah ok, and just for my knowledge, would this (trident-controller creations) still occur even if I ran the trident uninstaller? Because that's what we were seeing in our setup.
just spoke to a team member and he mentioned that he was able to taint those tolerations and also increase memory space in our master node (OpenShift) so we will try another attempt at installation. Again thanks for your help, hopefully the installation goes a bit more smoothly this time
Hey man, I hope you got it figured out. Disk pressure warning is when the local disk on the nodes is crossing a fullness threshold. This happens when a node has a large number of pods with pod images (cached locally) and is running out of local disk space. Scaling the cluster with additional worker nodes to allow the pods to rebalance can help with this.
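A rough way to see how close a node's disk is to that threshold, as a hedged sketch. The 10% free-space limit is an assumption that mirrors the upstream kubelet default hard-eviction setting for the node filesystem; an OpenShift cluster may be configured differently, and the kubelet also watches the image filesystem and inodes, which this does not check. free_fraction and under_disk_pressure are hypothetical helper names.

```python
import shutil


def free_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def under_disk_pressure(free, threshold=0.10):
    """True when the free fraction is below the (assumed) eviction threshold."""
    return free < threshold


frac = free_fraction("/")
print(f"free: {frac:.1%}, below threshold: {under_disk_pressure(frac)}")
```

Run on the node itself (for example from a debug shell), pointing the path at wherever container images and pod data live on that host.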
yup we were able to alleviate it by increasing storage and memory upon startup of our OpenShift cluster. Appreciate your help!
unfortunately we are running into pod creation issues with PVC volumes not mounting. They are trying to mount the volume on a directory that doesn't even exist... so trying to debug that now
The underlying directory should be created (by either k8s or Trident, not sure) for Trident to mount the NFS or iSCSI device. I've never seen it not exist.
Doing a describe on the pod should give a bit of a clue as to why it isn't working.
Looks like for iSCSI you need to have certain tools installed on RHEL CoreOS, since that's where the OpenShift cluster resides. So got that to work, but I'm seeing CrashLoopBackOff on my test pod now. It seems to get created the first time, then just keeps restarting/crashing. My test pod yaml uses a very simple docker image that it pulls from my registry successfully (I see it worked from the <oc describe> command). Not sure what the issue could be here. NFS still has mount issues. I'm not even able to install nfs-utils on RHEL CoreOS, so still working on figuring that out to be able to even mount the volume.
Do logs on the docker-test container show anything?
unfortunately not, the logs all spit out similar information to the events I see on the describe for the pod. Somehow we were able to get an iSCSI pod running. We didn't change anything, so kinda weird and unsure how it worked. We are still having issues with NFS where the volume won't even mount. Curious if it's because we cannot install nfs-utils on CoreOS where the cluster lives?
so I tried manually mounting the NFS share onto RHCOS and I'm getting a permissions issue even though the storage GUI is showing full export permissions...
Looks like there is a nfs-utils-coreos package specifically for RHCOS. Is that installed? or can it be?
What's the exact error you get when trying to mount?
The "access denied by server" usually means there is an issue with the NFS export policy on the storage controller.
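One common cause is that the client address the storage sees (the node's IP, not the pod's) does not fall inside any rule of the export policy. A minimal sketch of that check, assuming the rules are plain IPs or CIDR ranges; real export policies can also match hostnames and netgroups, which this ignores. The addresses and the client_allowed helper are made up for illustration.

```python
import ipaddress


def client_allowed(client_ip, export_rules):
    """True if client_ip falls inside at least one IP/CIDR export rule."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(rule, strict=False)
        for rule in export_rules
    )


rules = ["10.0.0.0/24"]  # hypothetical export policy client matches
print(client_allowed("10.0.0.15", rules))     # True: inside the range
print(client_allowed("192.168.1.20", rules))  # False: would be denied
```

If the node's address comes back False against the rules shown on the storage side, the mount will be refused no matter how permissive the rule's read/write settings look.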
ahh that's it! thank you, can deploy the pod now perfectly fine.
Good to hear.
ok home stretch here... I promise lol. So I was able to configure a pod with a 3Gi PVC mounted to it. I'm running the dd command to fill up the volume for a test and seeing that right before filling up it errors out saying "read only file system error". Every time I run dd to the mount/path/testfile it gives me that error. The way I'm writing to the volume is I'm using kubectl exec to get into the pod, then running dd. Seems that RW permissions are fine on the storage system side...
https://kb.netapp.com/Cloud/Astra/Trident/Application_pods_crashing_after_Trident_import I found this article to help bring the volume back to rw permission, but why did it go into read only in the first place, and is this the only way for us to bring it back into rw? Even after bringing it back to RW, running 'ls' on the location where the volume is mounted gives a "bad message" and it becomes unusable
NFS... not sure. iSCSI.... It's probably because the fractional reserve is set to 0 and there was a snapshot taken at some point.
right so NFS works fine as soon as I delete my test file that I wrote random bytes to on the PVC. It also deletes it on the ONTAP side, but for some reason iSCSI doesn't work like that. Before filling up I get a read only error, unlike NFS which gives a "No space left on disk"
it didn't look like any snapshot was taken for iSCSI so not sure what's going on
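To tell those two failures apart in a repeatable fill test, here is a small sketch of a dd-like writer that reports which error stopped it. It only uses the standard errno values for "no space left" (ENOSPC) and "read-only file system" (EROFS); fill_file and classify are hypothetical names, and the sketch writes zeros rather than random bytes.

```python
import errno
import os


def classify(exc):
    """Map the errno of a failed write to the two cases seen in this thread."""
    if exc.errno == errno.ENOSPC:
        return "no space left on device"
    if exc.errno == errno.EROFS:
        return "read-only file system"
    return f"other error: {exc}"


def fill_file(path, max_bytes, chunk_size=1024 * 1024):
    """Write zero-filled chunks until max_bytes are written or a write fails.

    Returns (bytes_written, status); status is "ok" or classify()'s result.
    """
    chunk = b"\0" * chunk_size
    written = 0
    try:
        with open(path, "wb") as f:
            while written < max_bytes:
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())  # force data out so errors surface here
                written += chunk_size
    except OSError as exc:
        return written, classify(exc)
    return written, "ok"
```

Inside the pod, calling fill_file on a file under the PVC mount with max_bytes a little above the PVC size should end with one of the two statuses, along with how many bytes made it to the volume first.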
The NFS filesystem is using ONTAP's native filesystem directly. iSCSI has a filesystem in a LUN (which is just a file) in the same filesystem. So deletes inside a LUN don't happen with the same speed as they do in NFS. In most circumstances this isn't a problem.
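A toy model of that difference, as a sketch only: it is not how ONTAP is implemented, and the class names are invented. The point it illustrates is that a delete inside the LUN's filesystem only changes host-side metadata, so the storage still counts the blocks as used until the host explicitly tells it they are free (a discard/UNMAP-style step). Whether and when a host sends that depends on mount options and LUN settings not covered in this thread.

```python
class BackingVolume:
    """Toy stand-in for the storage-side volume that holds the LUN."""

    def __init__(self):
        self.allocated = set()  # block numbers the storage thinks are in use


class LunFilesystem:
    """Toy host filesystem living inside the LUN."""

    def __init__(self, backing):
        self.backing = backing
        self.files = {}        # file name -> set of block numbers
        self.next_block = 0

    def write(self, name, nblocks):
        blocks = set(range(self.next_block, self.next_block + nblocks))
        self.next_block += nblocks
        self.files[name] = blocks
        self.backing.allocated |= blocks  # storage sees the blocks written

    def delete(self, name):
        # Only host metadata changes; the storage is not told anything.
        return self.files.pop(name)

    def discard(self, blocks):
        # The host reports freed blocks, so the storage can reclaim them.
        self.backing.allocated -= blocks


vol = BackingVolume()
fs = LunFilesystem(vol)
fs.write("testfile", 100)
freed = fs.delete("testfile")
print(len(vol.allocated))  # 100: the delete alone reclaimed nothing
fs.discard(freed)
print(len(vol.allocated))  # 0: space returns only after the discard
```

With NFS there is no inner filesystem, so the delete itself is the operation the storage sees, which matches the space coming back right away in the NFS test.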
I should have added that this has everything to do with ONTAP and very little to do with Trident.
sure, I understand. Just want to know how long the latency is between the SCSI UNMAP commands going back to ONTAP and the space being reclaimed...
About three items up on the left is an ontap discussion board. Ask there and you'll get a better answer.