#Astra Trident operator fails to recognise version, too short pod deployment timeout

1 messages · Page 1 of 1 (latest)

south heart
#

We installed the astra trident operator on our dev cluster using the Helm chart. We noticed we installed an older chart version so we upgraded to the latest release. Once the new operator pod started it wanted to pull the installation version via some version docker container. It starts a pod with this container and waits for it to come online so it can run a command in the pod to export version details in yaml format.

The problem is it doesn't wait longer than 30 seconds for this pod to show up. After the 30 seconds it concludes that the pod failed to start and it triggers a pod termination process + it queues a retry job. In our environment this pod can take longer than 30 second to start up. There can be many reasons for it to take a bit longer than 30 seconds, the CNI can take a bit before it hands out a pod IP, the container image needs to be pulled via a slow HTTP proxy, etc...

Luckily after retrying many times the version is determined at some point and the install / upgrade procedure continues. This is fine in the dev environment but if we want to run this operator in production it would be nice not to depend on a race condition.

Can this logic be updated so the timeout can be configured?

Attached is a snapshot of the logs. These logs we see repeatedly for every retry attempt.

#

Astra Trident operator fails to recognise version, too short pod deployment timeout

molten niche