#.export only partially transfers files, gets stuck


idle hornet
#

While porting our application build scripts, I ran into an issue where the Dagger process gets stuck while exporting the node_modules folder back to the host filesystem (the folder contains more than 150,000 files).

While troubleshooting the larger job, I created the simplest scenario that describes what I’m trying to do: install aws-cli, download a tarball from S3, extract it, and copy it to the host FS. This still got stuck on the export step.
I simplified the job further by skipping the whole S3 part and using the client.host().directory() method to provide the node_modules.tar.gz test file. That worked! The export finished within a short time.

I’ve now added fallocate -l 536870912 ./dagger_workaround/512MB.zip to the script that executes dagger, and with this hack I’m able to bypass the issue.

import os
import sys

import anyio
import dagger


async def main():
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        out = await (
            client.container()
            .from_("alpine:latest")
            # Mounting the host directory that holds the 512MB file is the workaround
            .with_directory("/host", client.host().directory("./dagger_workaround/"))
            .with_env_variable("AWS_ACCESS_KEY_ID", os.environ["AWS_ACCESS_KEY_ID"])
            .with_env_variable("AWS_SECRET_ACCESS_KEY", os.environ["AWS_SECRET_ACCESS_KEY"])
            .with_env_variable("AWS_SESSION_TOKEN", os.environ["AWS_SESSION_TOKEN"])
            .with_env_variable("AWS_DEFAULT_REGION", "eu-west-1")
            .with_exec(["apk", "add", "--update", "--no-cache", "aws-cli"])
            .with_exec(["aws", "s3", "cp", "s3://location-of-my-bucket/node_modules.tar.gz", "/node_modules.tar.gz", "--only-show-errors"])
            .with_exec(["tar", "-zxf", "/node_modules.tar.gz"])
            .directory("/node_modules")
            # This export back to the host is the step that gets stuck
            .export("./node_modules_out")
        )
    print(out)


anyio.run(main)
#

I'm running this on Kubernetes (EKS) and connecting through unix:// or kube-pod:// does not seem to make a difference

north oar
#

hey @idle hornet! welcome to the Dagger community. I'm about to sign off so I'll probably take a look tomorrow. Maybe @civic panther or @cerulean herald have some bandwidth to take a look 👀

#

cc @shut spindle also since this might be related to a gRPC limit somehow?

idle hornet
#

thanks @north oar! Please let me know if more details are required.

cerulean herald
#

Hey @idle hornet great to meet you, would you be open to hooking this up to Dagger Cloud to make it easier for us to see exactly what is happening in these steps? I can set up a free trial for your team.

idle hornet
#

hi @cerulean herald, likewise! Hooking up Dagger Cloud is no issue at all. I'm signing off for today, feel free to drop some instructions!

cerulean herald
#

Thanks! I'll DM you to get an email and go from there.

lapis quail
#

Just to add some more context, I think I've seen this upstream in buildkit here: https://github.com/moby/buildkit/issues/2950
I know we've also seen this in an internal channel (searching for that link turns it up).

I think we could track this on our side though, with the upstream/buildkit label.

#

@idle hornet does the last comment from the above issue seem to track with what you're seeing?

It seems like this is a problem only when many files are exported
If so, it looks like that's likely what's going on here.

#

If not, we might have a new one 😄

idle hornet
#

would trying another buildkit version just be a matter of replacing the buildctl binary inside the engine container?

lapis quail
#

i think this issue is still present on buildkit's master though, so some cleverer solutions are required

lapis quail
#

buildctl here is just a helper client-side component, not the full buildkit dependency - that version is defined in go.mod, and would need to be updated

idle hornet
#

gotcha

north oar
#

🤔 I'm actually curious how the fallocate allows bypassing this behavior

#

also curious why the client.host() approach works as well.. since it should be the same behavior as downloading the .tar.gz from s3. Maybe it just works by chance?

idle hornet
#

@north oar I'm wondering the same thing: why does it work? I've created a CI workflow to run the same job a few times, and I get repeatable output. I'm working on reproducing the issue without using our private assets, so I can share the scenario at least 🙂
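
A rough sketch of how such a repeatability check could look, assuming the Dagger job lives in a standalone script. The script name export_job.py, the helper run_repeatedly, and the timeout value are all hypothetical; a run that exceeds the timeout is simply labelled "stuck".

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=3, timeout_s=600):
    """Run cmd several times and classify each run as ok, failed, or stuck."""
    results = []
    for _ in range(runs):
        try:
            # subprocess.run kills the child process when the timeout expires
            proc = subprocess.run(cmd, timeout=timeout_s)
            results.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            results.append("stuck")
    return results


if __name__ == "__main__":
    # export_job.py is a placeholder name for the script containing the Dagger job
    print(run_repeatedly([sys.executable, "export_job.py"], runs=5))
```

Treating a timeout as "stuck" is only a heuristic: a slow but healthy export would be misclassified if the timeout is set too low.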

idle hornet
#
#!/usr/bin/env python3

import sys
import anyio
import dagger

async def main():
    async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
        out = await (
            client.container()
            .from_("node:18-alpine3.19")
            .with_directory("/host", client.host().directory("/tmp/empty/"))
            .with_exec(["npm", "install", "gulp-cli", "yo", "@microsoft/generator-sharepoint", "--global"])
            .directory("/usr/local/lib/node_modules")
            .export("/tmp/sandbox_new_out")
        )
    print(out)
anyio.run(main)

The code above tries to export enough files to trigger the issue consistently. If I leave /tmp/empty as an empty directory, the process gets stuck. If I create a file with fallocate -l 536870912 /tmp/empty/512MB.zip before starting the job, everything works.
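
For anyone who wants to set up the workaround from Python instead of the shell, here is a minimal sketch. It writes real zero bytes in chunks rather than calling fallocate, so it is not an exact equivalent of the command above; the helper name create_padding_file and the chunk size are assumptions made for illustration.

```python
import os


def create_padding_file(path, size_bytes=536870912, chunk=1024 * 1024):
    """Write size_bytes of zeros to path and return the resulting file size."""
    # Make sure the target directory (for example /tmp/empty) exists
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    zeros = bytes(chunk)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
    return os.path.getsize(path)
```

Calling create_padding_file("/tmp/empty/512MB.zip") before anyio.run(main) would mirror the order of steps described above.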

lapis quail
#

ok, so one thing i know is that fallocate creates sparse files

#

i wonder if that's possibly got an impact here - what happens if you do something that's more like dd if=/dev/zero or something

#

maybe irrelevant - it could just be that this large file causes the file upload to complete in a slightly less racy way
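
One way to check whether a given test file is actually sparse is to compare its apparent size with the blocks allocated on disk. This is only a heuristic sketch: it assumes a Unix-like system where st_blocks is reported in 512-byte units, and the helper names are made up for illustration.

```python
import os


def looks_sparse(size_bytes, blocks_512):
    """Return True when fewer bytes are allocated on disk than the apparent size."""
    return blocks_512 * 512 < size_bytes


def file_looks_sparse(path):
    """Apply the heuristic to a real file (st_blocks is not available on Windows)."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```

Running file_looks_sparse on /tmp/empty/512MB.zip and on the dd-created file would show whether the two differ on a given filesystem; results can vary with filesystem features such as compression.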

idle hornet
#

creating the file as dd if=/dev/zero of=/tmp/empty/zero.file bs=512k count=1000 gives the same behavior as with the fallocate command

lapis quail
#

ok phew

#

that's good at least 😄

idle hornet
#

I've added the example with a small description of my environment, just let me know if you think anything else would be useful to add