#Dagger hangs when calling terraform function multiple times


shy pilot
#

I have a workflow that runs terraform multiple times. The first invocation runs quickly, but every subsequent run gets significantly slower.

This is my top level Dagger definition

async def run_workflow(apply):
    runner_image_tag = os.getenv('INFRA_RUNNER_IMAGE_TAG') #https://github.com/Wi3ard/docker-terraform-gcloud-aws/blob/master/Dockerfile
    config = dagger.Config(log_output=sys.stderr, execute_timeout=60)
    async with dagger.Connection(config) as client: 

        runner_image = await infra_runner.build_image_from_tag(runner_image_tag, client) 
        remote_state_image = await terraform.run_module("remote-state/gcp", runner_image, runner_image_tag, False, client)
        project_image = await terraform.run_module("project/gcp", remote_state_image, runner_image_tag, apply, client)
        network_image = await terraform.run_module("network/gcp", project_image, runner_image_tag, apply, client)
        await terraform.run_module("serverless/gcp/cloudrun", network_image, runner_image_tag, apply, client)

        print("All tasks have finished")

async def build_image_from_tag(tag: str, client):
    
        base_image = ( 
            client.container()
            .from_(tag)  
            .exec(["mkdir", "-p", "/output"])
            .exec(["mkdir", "-p", "/.terraform.d/plugin-cache"])
        )

        print(f"Building Docker Image for tag {tag}")

        await base_image.exit_code()
        return base_image

async def run_module(module: str, image, runner_tag: str, apply: bool, client):

    terraform_action  = (
        client.container()
        .from_(runner_tag)
        .with_directory("/input", image.directory("/output"))
        .with_directory(plugin_cache_dir, image.directory(plugin_cache_dir))
        .with_mounted_directory("/src",local_src)

  # some code here that runs terraform init/plan/apply
    )
    await terraform_action.exit_code()
    return terraform_action


#

Here are the Dagger logs. You can see that it does some work and then just stops and healthchecks over and over.

Stdout is weird too.

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 0.4s

#2 mkdir /meta
#2 DONE 0.0s

#3 docker-image://docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#3 resolve docker.io/wi3ard/docker-terraform-gcloud-aws:latest 0.1s done
#3 DONE 0.1s
Starting terraform action for module remote-state/gcp

Resolves image config once.

Then it's called again, it resolves twice

Starting terraform action for module project/gcp
#20 0.123 done
#20 DONE 0.2s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 0.9s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 1.0s

Then again it's called, it resolves twice again


finished running terraform module project/gcp
Starting terraform action for module network/gcp

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 1.2s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 1.4s

Finally it resolves 4x, and then just hangs on the last call ("serverless/gcp/cloudrun")


#46 echo done
#46 0.115 done
#46 DONE 0.1s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 4.2s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 4.5s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 4.8s

#1 resolve image config for docker.io/wi3ard/docker-terraform-gcloud-aws:latest
#1 DONE 5.0s

river nimbus
#

Hey @shy pilot 👋 Can you tell me more about the context where this is being executed? Just locally on your workstation? I'm also wondering if there's a reason for the await base_image.exit_code() from the base image function. Is there more happening there that's omitted?

shy pilot
#

Yes, just locally on my machine right now. Nothing else is happening that's omitted currently before await base_image.exit_code(). It's just a pattern I put in there in case I wanted to do more stuff in the future. Can I be rid of it?

river nimbus
#

Yeah in the code right now it's not doing anything, since the definition of the container is being rebuilt in the other function anyway basically. As for resolving that image tag multiple times, that seems to make sense since you're creating a new image from that tag each time you call the run_module function

shy pilot
#

Ok, so it's expected behavior to see it once (first invocation), twice (subsequent invocation), and then 4x (subsequent invocation)?

river nimbus
#

if I understand the code correctly (just mapping it out in my head), it should:

  • resolve once when you call build_image_from_tag
  • resolve twice in the first run_module
  • resolve twice in the second run_module
  • resolve four times in the third run_module
  • resolve eight? times in the fourth run_module
#

because each run_module creates an image from that tag, and then creates another image for base_image which also uses that tag. And then the last 2 run_modules are nested from previous ones
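As a rough illustration of why the count grows with each call, here is a toy model in plain Python. It is not the Dagger SDK and does not reproduce the engine's real behavior: `ToyContainer` and `count_resolves` are made-up names, and the model assumes nothing is deduplicated or cached. The counts logged in this thread are smaller, so this only shows the shape of the growth when each run_module builds on the previous one.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyContainer:
    """Hypothetical stand-in for a lazily defined container (not a Dagger type)."""

    name: str
    # Other containers whose directories this one copies in.
    inputs: list[ToyContainer] = field(default_factory=list)


def count_resolves(container: ToyContainer) -> int:
    """Count from_() resolutions if every reference is re-evaluated from scratch."""
    # One resolve for this container's own from_(tag), plus everything upstream.
    return 1 + sum(count_resolves(parent) for parent in container.inputs)


base = ToyContainer("base")
chain = [base]
for module in ["remote-state", "project", "network", "cloudrun"]:
    previous = chain[-1]
    # The original run_module references `image` twice:
    # once for /output and once for the plugin cache directory.
    chain.append(ToyContainer(module, inputs=[previous, previous]))

counts = [count_resolves(c) for c in chain]
print(counts)
```

Under these assumptions the count roughly doubles at every step, which is the pattern being described: each new stage drags the whole definition of the previous stage along with it.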

shy pilot
#

Ah, i figured it was something that was growing like that. It seems like I'm Doing it Wrong

#

What I actually want to do is have a clean image of that runner, and copy in a file into it every time that was produced from the previous run_module. Do I need to create the image outside of run_module to do that?

river nimbus
#

Not necessarily, but the way it's setup right now you could use base_image instead of creating a new image in each function call. So run_module would look more like:

async def run_module(module: str, image, runner_tag: str, apply: bool, client):

    terraform_action  = (
        image
        .with_directory(plugin_cache_dir, image.directory(plugin_cache_dir))
        .with_mounted_directory("/src",local_src)

  # some code here that runs terraform init/plan/apply
    )
    await terraform_action.exit_code()
    return terraform_action
#

because image is just

client.container()
            .from_(tag)  
            .exec(["mkdir", "-p", "/output"])
            .exec(["mkdir", "-p", "/.terraform.d/plugin-cache"])

so you're just shorthanding having that part in run_module

#

it's not actually getting reused between each run. I'm not sure if that was the intention or not

shy pilot
#

It makes sense. I'm gonna update that now and see if that was related or orthogonal to the issue mentioned above. I guess I was thinking if I reused the same image that was getting passed in I would also be reusing all the stuff that happened in the run as well

river nimbus
#

it is possible to set it up like that, but right now the code has a fresh image each time

shy pilot
#

To be clear, my original intention was

  1. Take the outputs of terraform and write them to a json file
  2. Copy that json to a totally clean version of that infra-runner image and then use the clean image + the json file to run the next terraform
river nimbus
#

got it! yeah what might make sense for that case is to pass around a Directory object. So instead of calling terraform_action.exit_code(), you could return outputs.with_directory('project/gcp/output', terraform_action.directory('/output')), where outputs is a Directory (that may not be 100% accurate syntax, just typing it out off the top of my head)

#

and with dagger, getting an output directory from an image is something that triggers an execution, so exit_code is not needed

shy pilot
#

Oh interesting, I thought waiting for an exit code was my only option here. Are there docs that explain this a bit more in depth?

river nimbus
#

We definitely have an issue for that, let me find it

shy pilot
#

Thanks, that was helpful.

I made some changes per your recommendations. A couple of findings:

  • I now copy the directory over as you recommended. works fine

  • However, if I remove await terraform_action.exit_code(), dagger doesn't actually execute anything.

  • If I leave terraform_action.exit_code() in, everything will run, but after the last terraform action finishes Dagger just sits there and waits after the final DONE

#

New code


async def run_workflow(apply):
    runner_image_tag = os.getenv('INFRA_RUNNER_IMAGE_TAG') #https://github.com/Wi3ard/docker-terraform-gcloud-aws/blob/master/Dockerfile
    config = dagger.Config(log_output=sys.stderr, execute_timeout=60)
    async with dagger.Connection(config) as client: 

        remote_state_output = await terraform.run_module("remote-state/gcp", runner_image, None, False, client)
        project_output = await terraform.run_module("project/gcp", runner_image, remote_state_output, apply, client)
        network_output = await terraform.run_module("network/gcp", runner_image, project_output, apply, client)
        await terraform.run_module("serverless/gcp/cloudrun", runner_image, network_output, apply, client)

        print("All tasks have finished")


async def run_module(module: str, image, prev_runner_output, apply: bool, client):

    terraform_action  = (
        image
        .with_directory("/input", prev_runner_output if prev_runner_output else client.directory())
        .with_mounted_directory("/src",local_src)

  # some code here that runs terraform init/plan/apply
    )
    await terraform_action.exit_code() # hangs here at the final call. If commented out will not actually invoke the action
    print(f"finished running terraform module {module}")
    # create empty directory to hold build outputs
    outputs = client.directory()
    return outputs.with_directory("/", terraform_action.directory("/output"))

river nimbus
#

Looks good! I think the outputs.with_directory needs an await, which is why it's not working without the exit_code(). Let me double check that

#

oh wait, no, I missed a step (facepalm). If you're passing the directory forward like this, you need an await on something like outputs.entries() after the last run_module, for example:

gcp = await terraform.run_module("serverless/gcp/cloudrun", runner_image, network_output, apply, client)
await gcp.entries()
#

or you could even await gcp.export('./') to save the outputs to the host
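The laziness described here can be pictured with a small stand-in class. This is a sketch in plain Python, not the Dagger API: `LazyPipeline` and its methods are hypothetical and only mimic the idea that chained calls build a definition, while awaiting a leaf value such as entries() is what runs it.

```python
import asyncio


class LazyPipeline:
    """Hypothetical stand-in for a lazy pipeline object; not the Dagger API."""

    def __init__(self, steps=(), log=None):
        self.steps = tuple(steps)
        self.log = log if log is not None else []

    def with_exec(self, args):
        # Only records the step and returns a new definition; nothing runs here.
        return LazyPipeline(self.steps + (tuple(args),), self.log)

    async def entries(self):
        # Awaiting a leaf value is what finally runs every recorded step.
        for step in self.steps:
            self.log.append(" ".join(step))
            await asyncio.sleep(0)  # yield control, as a real request would
        return ["prev_output.json"]


async def main():
    pipeline = (
        LazyPipeline()
        .with_exec(["terraform", "init"])
        .with_exec(["terraform", "plan", "-input=false"])
    )
    ran_before_await = list(pipeline.log)  # still empty: only a definition so far
    files = await pipeline.entries()
    return ran_before_await, list(pipeline.log), files


ran_before_await, ran_after_await, files = asyncio.run(main())
print(ran_before_await, ran_after_await, files)
```

In this model, dropping the final await means the recorded steps never run, which matches the observation that nothing executed once exit_code() was removed and no other leaf value was awaited.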

shy pilot
#

Hmm, so that did work, but still never finishes. New code now:

New code


async def run_workflow(apply):
    runner_image_tag = os.getenv('INFRA_RUNNER_IMAGE_TAG') #https://github.com/Wi3ard/docker-terraform-gcloud-aws/blob/master/Dockerfile
    config = dagger.Config(log_output=sys.stderr, execute_timeout=60)
    async with dagger.Connection(config) as client: 

        remote_state_output = await terraform.run_module("remote-state/gcp", runner_image, None, False, client) # create bucket in mgmt project to store tfstate. 
        await remote_state_output.entries()
        project_output = await terraform.run_module("project/gcp", runner_image, remote_state_output, apply, client) # create new project to hold the env
        await project_output.entries()
        network_output = await terraform.run_module("network/gcp", runner_image, project_output, apply, client) # create network in project
        await network_output.entries()
        cloudrun_output = await terraform.run_module("serverless/gcp/cloudrun", runner_image, network_output, apply, client)
        await cloudrun_output.entries()

        print("All tasks have finished")


async def run_module(module: str, image, prev_runner_output, apply: bool, client):

    terraform_action  = (
        image
        .with_directory("/input", prev_runner_output if prev_runner_output else client.directory())
        .with_mounted_directory("/src",local_src)

  # some code here that runs terraform init/plan/apply
    )
    # THIS NOW WORKS WITHOUT EXIT CODE
    print(f"finished running terraform module {module}")
    # create empty directory to hold build outputs
    outputs = client.directory()
    return outputs.with_directory("/", terraform_action.directory("/output"))

#

basically sits here and never exits

#

^^ That's the last code in terraform_action

#

Engine logs show that it runs the last line in terraform_action then just chills and healthchecks forever

#

Actually, wait! It is doing something else. It resolved image config again after a long time.

river nimbus
#

Nice yeah I was wondering if the terraform was actually still running? or maybe getting rate limited by dockerhub although that seems unlikely to happen at the same point every time
@spiral musk any other ideas?

shy pilot
#

How can I find out what is happening here in between these two calls? It says they take 3.2 and 2.5 seconds, but in reality it was about 3 minutes of Dagger just healthchecking in between them

#

I also think your hypothesis about docker hub rate limiting might have merit, because it does seem to freeze up at different spots in the workflow at different times

river nimbus
#

two options to test that: 1) authenticate to dockerhub, 2) use a tag other than :latest. Could use the current latest manifest sha if there aren't other tags that will work

shy pilot
#

I just made sure I'm logged in (docker login succeeded), and it did not help. I'll try locking to a specific tag now

#

Using a different tag other than latest still causes it to hang. So Terraform appears to finish (all tasks in the action actually finish executing) but now I wondered if this is some Terraform BS because of what you said.

My next hypothesis was to run it without running terraform and see if it finishes. So I commented out the exec that runs terraform plan and voila, it finished. So there is some weirdness happening related to my call to terraform plan that basically is causing dagger to hang and not exit

#

How can I debug why dagger hangs at the end when I call terraform plan? Engine logs don't reveal much, as demonstrated earlier

river nimbus
#

interesting, do you get any stdout from the terraform plans in the previous steps? maybe that module is missing a var or something and it's waiting for input?

shy pilot
#

yeah, i get stdout, I'm actually writing that stdout to json. Maybe now my "hidden" terraform code is actually relevant to the issue

#

here's my stdout, and the code:

#

#48 9.349 google_cloud_run_service_iam_policy.noauth: Refreshing state... [id=v1/projects/dev-ab-anc-proj/locations/us-west3/services/maintainers-ui]
#48 14.21
#48 14.21 │ Warning: Values for undeclared variables
#48 14.21 │
#48 14.21 │ In addition to the other similar warnings shown, 3 other variable(s)
#48 14.21 │ defined without being declared.
#48 14.21 ╵
#48 DONE 14.4s

#49 echo Skipping apply because --apply flag was not supplied.
#0 0.117 Skipping apply because --apply flag was not supplied.
#49 DONE 0.1s

#50 /bin/bash -c terraform output -json | jq 'with_entries(.value |= .value)'
#50 2.938 {}
#50 DONE 3.0s

And the python impl


async def run_module(module: str, image, prev_runner_output, apply: bool, client):

    terraform_action  = (
        image
        .with_directory("/input", prev_runner_output if prev_runner_output else client.directory())
        .with_mounted_directory("/src",local_src)

        .exec(["terraform", "init"])
        .exec(["terraform", "plan", "-var-file", "/input/prev_output.json", "-input=false"])
        .exec(["/bin/bash", "-c", "terraform output -json | jq 'with_entries(.value |= .value)'"], redirect_stdout="/output/prev_output.json")

    )
    # THIS NOW WORKS WITHOUT EXIT CODE
    print(f"finished running terraform module {module}")
    # create empty directory to hold build outputs
    outputs = client.directory()
    return outputs.with_directory("/", terraform_action.directory("/output"))

#

note my use of redirect_stdout here. I wonder if that is possibly related ?

#

as you can see, the terraform runs successfully and the /bin/bash with redirect_stdout actually runs just fine, but then it hangs after that

#

You can also note that I pass in the -input=false option to plan, meaning it shouldn't be waiting for any input

river nimbus
#

and just to clarify, that google_cloud_run_service_iam_policy.noauth is in the "serverless/gcp/cloudrun" module, right?

shy pilot
#

correct

#

i run 3 other modules before that in the same way and they all seem to finish planning, write the stdout to json, and copy into the new container just fine. But also when I don't run terraform plan I also don't get dagger trying to sit there and resolve another container after the last one

river nimbus
#

and if you let it hang for some amount of time it'll eventually resolve the image again?

shy pilot
#

exactly

#

and if i comment out the terraform plan it doesn't try to resolve again

#

(and finishes as expected)

river nimbus
#

I can't figure out why it would need to resolve it again after that /bin/bash -c is complete...

shy pilot
#

me neither. How can I look at what is happening in Dagger itself besides engine logs? Something has to be happening that is observable somewhere no?

#

It definitely still has to do with the container itself, because I have print(f"finished running terraform module {module}") right after the action finishes and that never actually happens, it just goes straight to resolving again
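One generic way to at least localize a hang like this, independent of Dagger, is to wrap each awaited call in asyncio.wait_for so it fails with a label instead of sitting forever. The sketch below uses only the standard library; `await_with_deadline` and `fake_stage` are hypothetical helpers, and `fake_stage` merely stands in for an awaited pipeline call. It does not explain the hang, it only reports which await never returned.

```python
import asyncio


async def await_with_deadline(label, awaitable, seconds):
    """Await something, but give up and report the label if it takes too long."""
    try:
        return label, "ok", await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return label, "timed out", None


async def fake_stage(delay, result):
    # Hypothetical stand-in for an awaited pipeline call such as `.entries()`.
    await asyncio.sleep(delay)
    return result


async def main():
    fast = await await_with_deadline("network/gcp", fake_stage(0.01, ["out.json"]), 1.0)
    slow = await await_with_deadline("serverless/gcp/cloudrun", fake_stage(5, []), 0.05)
    return fast, slow


fast, slow = asyncio.run(main())
print(fast)
print(slow)
```

Because only awaited calls do any work, this only narrows things down if each stage is awaited separately while debugging.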

river nimbus
#

so one thing which is probably not a fix but will maybe make the code more clear now - with the await cloudrun_output.entries() being the thing that actually triggers the whole execution now, the run_module function no longer needs to be async

#

oh I also just noticed you're calling entries() after each one. If it's wired together properly, you should only need it after the last one

shy pilot
#

Ha, I had this question typed: "Should I only be awaiting the cloudrun_output.entries() and not the other output entries? "

#

let me try that

#

maybe it's because I'm having dagger execute itself 4 times when it should really only be doing it once

river nimbus
#

yeah that might be a problem

shy pilot
#

Removing every entries() call except cloudrun_output.entries() did not help

river nimbus
#

deleted wrong chat 😂

shy pilot
#
        runner_image = await infra_runner.build_image_from_tag(runner_image_tag, client)
        remote_state_output = terraform.run_module("remote-state/gcp", runner_image, None, False, client) # create bucket in mgmt project to store tfstate. 
        project_output = terraform.run_module("project/gcp", runner_image, remote_state_output, apply, client) # create new project to hold the env
        network_output = terraform.run_module("network/gcp", runner_image, project_output, apply, client) # create network in project
        cloudrun_output = terraform.run_module("serverless/gcp/cloudrun", runner_image, network_output, apply, client)
        await cloudrun_output.entries()
#

that's what it looks like now, I had to keep the await for the runner image too. I guess I can try removing that as well

river nimbus
#

just mapping it out in my head again

#

one other potential optimization, you could replace this

outputs = client.directory()
return outputs.with_directory("/", terraform_action.directory("/output"))

with

return terraform_action.directory("/output")

I had initially thought all the directories would get merged together but I see in this flow that's not necessary

#

are you able to run the terraform plan directly on the "serverless/gcp/cloudrun" module with the appropriate inputs to verify it's behaving as expected?

shy pilot
#

Ok I did a lot more debugging on my end

Running all the way up to network_output works fine... dagger exits as expected

Running cloudrun_output, or any other modules I have (I also have a GKE module, bastion module, and load balancer module) as the 4th module all hang

#

Not running redirect_stdout doesn't affect results... only running terraform plan does

#

all the modules work fine when running them locally on my machine

#

I got serverless/cloudrun to finish by running it as the second module. Then I ran two modules after that and the 4th one hung as before. Something is causing the 4th call to hang, no matter what module it is

Modules that succeeded before when they are the second or third modules now don't finish

#


shy pilot
#

Last update:

This works:


    async with dagger.Connection(config) as client: 

        runner_image = infra_runner.build_image_from_tag(runner_image_tag, client)
        project_output = terraform.run_module("project/gcp", runner_image, None, apply, client) 
        terraform.run_module("serverless/gcp/cloudrun", runner_image, project_output, apply, client)
        network_output = terraform.run_module("network/gcp", runner_image, project_output, apply, client)
        terraform.run_module("loadbalancer/gcp/serverless", runner_image, network_output, apply, client)
        terraform.run_module("cluster/k8s/gcp",runner_image, network_output, apply, client)
        bastion_output = terraform.run_module("bastion/gcp", runner_image, network_output, apply, client)
        await bastion_output.entries()
        

Note! We are not chaining outputs into inputs more than three levels deep in this approach: runner image -> project_output -> third module

This hangs, no matter what the fourth module is, provided that the 4th module takes in network_output as its input.


    async with dagger.Connection(config) as client: 

        runner_image = infra_runner.build_image_from_tag(runner_image_tag, client)
        remote_state_output = terraform.run_module("remote-state/gcp", runner_image, None, False, client) 
        project_output = terraform.run_module("project/gcp", runner_image, remote_state_output, apply, client) 
        network_output = terraform.run_module("network/gcp", runner_image, project_output, apply, client)
        cloudrun_output = terraform.run_module("serverless/gcp/cloudrun", runner_image, network_output, apply, client)
        await cloudrun_output.entries()
        

So there's something about passing image in that causes it to hang on the fourth time. I found the workaround for now (which will work exactly until I need to use outputs of one of these modules as inputs for another, unfortunately). Still no root cause though.

river nimbus
#

Interesting! So it could be a stack size thing with how big the 4th execution ends up being. Let me see if there's anything we can do to debug that better