I'm worried about the fact that my Trident pods don't set resource requests for all containers. In particular, the trident-node-linux DaemonSet does not set resources at all for the pods it creates. I can't find a Helm chart value for this, because I think the DaemonSet is created by the operator. Not having resource requests may lead to erratic behavior, so I want to fix this. How can I configure resource requests for the Trident pods?
#Setting Trident resource requests
The development team is looking into it per https://github.com/NetApp/trident/issues/853.
I missed that. Thank you for the link!
This is pretty bad… Without resource requests, containers may behave pretty unpredictably, for example restarting at random, getting 0 seconds of CPU time during a network handshake, or any number of other random things that may mess up behavior. Any chance this could get some prio? There doesn't seem to have been much movement on the issue for a long while now…
We have a couple dozen Trident installs with our customers but we never encountered any issues like this. Generally, trident getting 0 CPU time is not an issue anyways, since it's not in the data path in any way. It's only a management layer on top of k8s
But if you are actively seeing these issues, you can always open a support case with NetApp to get it addressed
Generally, trident getting 0 CPU time is not an issue anyways
Why deploy it if it's not supposed to do anything? 😉
It's hard to tell if our issues are because of this, since programs may behave very strangely when choked. But I'll mention it. Thanks for the input.
Trident only does something when it needs to satisfy a PVC (by creating a volume/export/LUN on your NetApp cluster). After it has created the volume and mounted it, there's nothing more to do. You can even uninstall it without impacting the running pods (in fact, uninstalling and re-installing is the official way to upgrade a Trident installation).
Hej Andreas, what issues are you seeing specifically?
My issue turned out to be this one: https://kb.netapp.com/Cloud/Astra/Trident/Unable_to_provision_PVCs_when_REST_is_enabled
But before we got to the bottom of that, we had to try a few different things to debug. And I never trust the behavior of a container without resource requests configured. I've fallen into that trap too many times.
It's perfectly fine if you want to allocate very little memory and CPU. But leaving it unset is dangerous. The only kind of workload where one can safely do that is one whose functionality is utterly irrelevant and which only needs to run when no other load requires resources from the node.
With no memory requests set, you are instructing Kubernetes that it's fine to kill off the container at any time if another pod requests memory. According to Murphy's law, that will happen just when the container tries to do something important. 🙂
And since you have not set any CPU requests, the pod may be scheduled but never get to execute. If that happens, the pod looks healthy but the functionality doesn't work. Theoretically, that pod could just sit there and be useless forever.
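To put rough numbers on the CPU side: as far as I understand it, on cgroup v1 the kubelet turns a CPU request in millicores into `cpu.shares` at 1024 shares per core, with a floor of 2 shares when nothing is requested. The sketch below assumes that conversion and a flat comparison between containers (the real cgroup hierarchy is nested per QoS class, so treat the fraction as an approximation). Shares only matter when the node's CPUs are fully busy; on an idle node the container can use whatever is free.

```python
MIN_SHARES = 2          # floor used when no CPU request is set (assumed cgroup v1 behavior)
SHARES_PER_CPU = 1024   # shares corresponding to one full core


def cpu_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to cgroup v1 cpu.shares."""
    if milli_cpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def contended_fraction(own_milli: int, others_milli: list[int]) -> float:
    """Approximate CPU fraction a container gets when every container is busy.

    Simplification: all containers are compared in one flat group.
    """
    own = cpu_shares(own_milli)
    total = own + sum(cpu_shares(m) for m in others_milli)
    return own / total


# A container with no request next to two busy containers requesting 1 core each.
print(cpu_shares(0), cpu_shares(100), cpu_shares(1000))
print(contended_fraction(0, [1000, 1000]))
```

Under this model an unset request does not mean literally zero CPU under contention, but the share is very small compared with even a 100m request.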
I've seen many random race conditions that are masked on "normal" processors because of how fast they are, but that appear when processors take seconds instead of milliseconds to perform operations. If you do anything network related with the container, the TCP connection may break and/or TCP handshakes may fail if the container doesn't get time to execute.
As long as the node is not under heavy use, it's usually fine. The CPU will scale up as far as it can go, and allocating memory is fine as long as nobody else requires it. But as soon as other workloads get scheduled to the node and start requesting resources, they will take precedence and start messing with the containers that don't have any resource requests.
So TL;DR: resources: {} for a container means "my container may never execute and may get killed off at any time". That is rarely a valid scenario for a workload.
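For reference, Kubernetes derives a QoS class for each pod from its containers' requests and limits, and a pod where nothing is set ends up as `BestEffort`, which puts it among the first candidates for eviction under node memory pressure. Here is a minimal sketch of that classification, assuming only `cpu` and `memory` are considered and that quantities are compared as plain strings (so `"100m"` and `"0.1"` would not be treated as equal here). The function name and input shape are made up for illustration.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as BestEffort, Burstable, or Guaranteed.

    Each container is a dict with optional "requests" and "limits" dicts,
    mirroring the `resources` field of a container spec.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits") or {}
        # A missing request defaults to the limit for that resource.
        requests = {**limits, **(c.get("requests") or {})}
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))  # resources: {}
print(qos_class([{"requests": {"cpu": "10m", "memory": "32Mi"}}]))
print(qos_class([{"limits": {"cpu": "100m", "memory": "64Mi"}}]))
```

Even a tiny request moves the pod out of `BestEffort` into `Burstable`, which is the point being made above about setting small values rather than none.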
If your worker nodes are so CPU-contended that they starve other processes down to zero, or so memory-contended that they OOM-kill random processes, I think you just need more resources in your worker nodes. The core Linux scheduler is pretty good at not starving processes, even if other processes have a much higher priority, as long as enough resources are available.