#Can dagger play nice with large volumes of data?

weak ice
#

Hey guys, I've made a few tangential posts about this and I'm really hoping there's a workaround, but I don't think dagger can really work with large amounts of data?

Here's the tl;dr of my situation.

  • I'm working on a new build pipeline that requires about 500GB of source code data to be present at build time.
  • I can't seem to find a call in either API (modules or classic) that lets me mount this directory without sending it over the wire somehow. This is basically a dealbreaker: transferring this much data over the wire takes about 6 hours.

Here's my current code.

    // Host directory holding the ~500GB of sources.
    yoctoWorkspace := "/home/ubuntu/firmware-vendor"
    yoctoWorkspaceDir := client.Host().Directory(yoctoWorkspace)
    out, err := client.Container().From("public.ecr.aws/firmware-vendor-20.04").
        WithExec([]string{"/bin/create-user.sh", "1000", "1000"}).
        WithUser("turbox").WithWorkdir("/home/user/workspace").
        WithMountedDirectory("/home/user/workspace", yoctoWorkspaceDir).
        WithExec([]string{"./firmware-vendor_bitbake_build.sh", "-a", "-l"}).Stdout(ctx)
    if err != nil {
        // err was previously assigned but never checked, which Go rejects at compile time.
        panic(err)
    }
    fmt.Println(out)

Which results in the daemon basically hanging on "uploading" the files to the daemon.

This build would work fine in Jenkins, GHA, or similar with just a docker container and a mounted volume, no sweat. Hoping that I'm just missing something simple and can get the same kind of functionality out of dagger.

I guess I'm just posting a last ditch effort here to keep dagger before I have to abandon ship and pick a new solution for this build. Is there anything I can do to run builds this large via dagger?

vapid marlin
#

@weak ice I think it should work fine. There's probably a more efficient way to get the files into the runtime, and after that life should be good. The other thing to watch out for is cache eviction: you may want to tweak the settings there to allow your cache to grow bigger, otherwise you may get premature cache eviction, which will lead to thrashing.

#

cc @maiden crow

#

@weak ice one way to optimize is to add include/exclude filters to that Host().Directory() call, to avoid uploading stuff you don't need

weak ice
#

no but that's exactly the thing, I need all 500GB 😅

maybe I'm misinterpreting the build output? It's hanging for a significant amount of time (> 5 min) when I'm just trying to mount this directory, and the build output says "uploading"

vapid marlin
#

Well there's an initial copy that takes place, from your client machine into the buildkit runtime. That happens over the wire. Depending on your setup, it might be local to your machine, or to a remote server.

maiden crow
#

Yeah Dagger has the same limitations here as buildkit/dockerfiles, which don't allow direct bind mounts from the host either. Basically, this comes from the fact that dagger/buildkit never assumes that the client+server are on the same machine (which is the only time that bind mounts are possible).

If you need all 500GB, then the initial load of the dir is indeed probably going to take a while, but the filesync is smart enough to reuse previous imports (provided they haven't been pruned; see @vapid marlin's point above about tweaking cache eviction). It should only reimport files that have changed since the last import, same idea as rsync.

#

I can imagine ways of improving this, but it would get into things like FUSE filesystems to allow truly lazy loading, which would be great but probably is too complicated to say "coming soon" in terms of features 😅

weak ice
#

dang, for sure

#

I'll probably have to go another route for this then, appreciate the help though!

vapid marlin
#

@weak ice could your dagger function download the 500GB directly from its source, rather than uploading it from the client system?

#

That might be more efficient

weak ice
#

Maybe, but I'm still in the same situation, that download and zip/unzip take a huge amount of time

#

and then I probably have some cache bomb waiting to go off

vapid marlin
#

Right but it will be cached; and depending on your workflow, you may not need to download it to your host machine at all

weak ice
#

Right now I'm using block level snapshots to ensure the data is where it needs to be, and my build hosts are all ephemeral to save on cost

#

I mean I guess I could try to persist the dagger cache on EBS disk somehow as well.... and then just do some s3 copy and unzip - then hope it caches. This is about a 3 hr command to try to run though lol

#

many many little files, not few big ones

vapid marlin
#

This seems like a great use case for our planned prod-readiness work cc @wraith peak @acoustic crest @tacit swift .

  • Cache persistence
  • Control over cache eviction
  • Measurable performance bottleneck

Just saying 🙂

open drum
#

@weak ice @vapid marlin @maiden crow is a "very ugly" hack allowed here by any chance?

#

If you're using a linux host I think there's a very ugly way you can make this work

#

as long as your engine is sharing the same kernel where the underlying block device lies

open drum
#

you can basically run mknod and then mount your device in your step (requires insecure root capabilities) before the command that requires that is executed.

just gave it a try and seems to work 😬

weak ice
#

not blockchain - this is for an embedded linux build.

@open drum - very ugly is (unfortunately) welcome here, anything to avoid going back to jenkins. Docs on GHA around mounting devices in this manner seem to be slim. Couldn't quite figure out the right pathing to mount a host vol using kubernetes executor.

Is this method basically passing just the block device as a file, then handling the mount in dagger? That actually doesn't sound too bad to me

#

the device is actually available on /dev/nvmeXnXp1 - do I need to do any mknod type stuff in this case?

open drum
#

keep in mind that you need to use the InsecureRootCapabilities flag in the with_exec, and that every step that requires that drive to be present will have to mount it, given that each step in your Dagger pipeline creates a brand new container which won't have that drive mounted

vapid marlin
#

This will only work if your client OS is also running the dagger engine container. So it won't work with Docker for Mac, for example

#

Note to self: maybe enabling privileged containers was not a good idea (cc @maiden crow )

vapid marlin
#

Once 10% or more of Daggerverse has a README like "make sure you follow these 10 steps on your host before running this module, or it won't work!" then the ecosystem is broken

maiden crow
#

I mean it's either A) configurable somehow or B) we don't support it and:

  1. Rule out certain legit/non-hacky use cases where the kernel strictly requires "true" root (use of loopback mounts comes to mind, needed in order to create filesystem images and not possible inside rootless user namespaces). The only way to support those would be to add core API functionality that allows the privileged engine to set things up for containers, which is going to be a long game of whack-a-mole and fill the API with obscure Linux things (which may themselves end up becoming kernel/platform specific)
  2. Require everything else that can run rootlessly to do so, which itself makes Dagger extremely dependent on kernel versions since what's possible to do rootlessly still changes very frequently
vapid marlin
#

I wish you were wrong

maiden crow
#

There may be a way out here with the use of microVMs, which let you be true root in the VM, though that still comes with an enormous amount of caveats and questions

#

i.e. you can't run VMs on EC2 unless they are bare metal 😂

weak ice
#

"This will only work if your client OS is also running the dagger engine container. So it won't work with Docker for Mac, for example" (quoting @vapid marlin)

~Hmm so this all depends on running docker from the sound of it?~
Better question: Will this not work with buildkit?

I'm running in k8s following this pattern.

I've confirmed the following.

  1. The dagger engine pod can see my target device /dev/nvme1n1p1
  2. The worker pod (the one actually calling client.Connect ) can also see the device at /dev/nvme1n1p1
  3. Now (in dagger) I try to find my device and run the mknod and get nothing.

Deployment with Helm

#

Here's my run output and code sample


18: exec ls -al /dev/ DONE
18: [0.07s] total 0
18: [0.07s] drwxr-xr-x 6 root root     640 May  1 01:46 .
18: [0.07s] drwxr-xr-x 1 root 1001      32 May  1 01:46 ..
18: [0.07s] lrwxrwxrwx 1 root root      11 May  1 01:46 core -> /proc/kcore
18: [0.07s] crw-rw-rw- 1 root root 10, 203 May  1 01:46 cuse
18: [0.07s] lrwxrwxrwx 1 root root      13 May  1 01:46 fd -> /proc/self/fd
....
18: [0.07s] srw-rw-rw- 1 root root       0 May  1 01:37 otel-grpc.sock
18: [0.07s] lrwxrwxrwx 1 root root       8 May  1 01:46 ptmx -> pts/ptmx
18: [0.07s] drwxr-xr-x 2 root root       0 May  1 01:46 pts
18: [0.07s] crw-rw-rw- 1 root root  1,   8 May  1 01:46 random
18: [0.07s] drwxrwxrwt 2 root root      40 May  1 01:46 shm
18: [0.07s] lrwxrwxrwx 1 root root      15 May  1 01:46 stderr -> /proc/self/fd/2
18: [0.07s] lrwxrwxrwx 1 root root      15 May  1 01:46 stdin -> /proc/self/fd/0
18: [0.07s] lrwxrwxrwx 1 root root      15 May  1 01:46 stdout -> /proc/self/fd/1
...

17: exec mknod /dev/sdbz b 259 4
17: exec mknod /dev/sdbz b 259 4 DONE
...
18: exec ls -al /dev/ DONE
[same output as above]

15: exec mount /dev/sdbz /mnt
15: [0.07s] mount: /mnt: special device /dev/sdbz does not exist.
15: exec mount /dev/sdbz /mnt ERROR: process "mount /dev/sdbz /mnt" did not complete successfully: exit code: 32

        WithExec([]string{"ls", "-al", "/dev/"}, dagger.ContainerWithExecOpts{InsecureRootCapabilities: true}).
        WithExec([]string{"mknod", "/dev/sdbz", "b", "259", "4"}, dagger.ContainerWithExecOpts{InsecureRootCapabilities: true}).
        WithExec([]string{"ls", "-al", "/dev/"}, dagger.ContainerWithExecOpts{InsecureRootCapabilities: true}).
        WithExec([]string{"mount", "/dev/sdbz", "/mnt"}, dagger.ContainerWithExecOpts{InsecureRootCapabilities: true}).
        WithExec([]string{"ls", "-alr", "/mnt"}).Stdout(ctx)
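
The `mknod` call above needs the device's major and minor numbers (259 and 4 in this run). As a hedged sketch, assuming a Linux host and the glibc-style encoding of `st_rdev`, they could be read from the existing device node like this. The helper names `major`, `minor`, and `deviceNumbers` are hypothetical, and `/dev/null` is used only because it exists on every Linux system; the real target would be something like the `/dev/nvme1n1p1` path mentioned earlier.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// major extracts the major number from a Linux dev_t (glibc encoding).
func major(rdev uint64) uint64 {
	return ((rdev >> 32) & 0xfffff000) | ((rdev >> 8) & 0x00000fff)
}

// minor extracts the minor number from a Linux dev_t (glibc encoding).
func minor(rdev uint64) uint64 {
	return ((rdev >> 12) & 0xffffff00) | (rdev & 0x000000ff)
}

// deviceNumbers stats a device node and returns its major and minor numbers.
func deviceNumbers(path string) (uint64, uint64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, 0, fmt.Errorf("no raw stat data for %s", path)
	}
	rdev := uint64(st.Rdev)
	return major(rdev), minor(rdev), nil
}

func main() {
	// Swap in the real block device path on the build host.
	maj, min, err := deviceNumbers("/dev/null")
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Printf("mknod <node> <b|c> %d %d\n", maj, min)
}
```

This has to run somewhere the real device node is visible (the host or the engine pod), not inside a pipeline step, since the step's `/dev` listing above does not include the device. The same numbers also show up in `ls -l` output for the device node.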
open drum
#

so you can do WithExec([]string{"sh", "-c", "mknod .... && mount ... && ls -la /mnt"})

#

and you'll have to do it for each WithExec that you want your drive to be present
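
A minimal sketch of that pattern in Go, assuming the node path, device numbers, and mount point from the run output above. The `BlockDevice` type and `withMount` helper are hypothetical names, and the helper does no shell quoting, so it assumes paths without spaces. It only builds the argument list; the result would be passed to `WithExec` together with `dagger.ContainerWithExecOpts{InsecureRootCapabilities: true}`, as in the earlier code sample.

```go
package main

import (
	"fmt"
	"strings"
)

// BlockDevice describes the node to recreate inside each container
// (hypothetical type, for illustration).
type BlockDevice struct {
	Node       string // device node to create, e.g. /dev/sdbz
	Major      int
	Minor      int
	MountPoint string
}

// withMount returns an argv that creates the node, mounts it, and then runs
// buildCmd, all inside a single shell so they share one container.
func withMount(dev BlockDevice, buildCmd string) []string {
	script := strings.Join([]string{
		fmt.Sprintf("mknod %s b %d %d", dev.Node, dev.Major, dev.Minor),
		fmt.Sprintf("mount %s %s", dev.Node, dev.MountPoint),
		buildCmd,
	}, " && ")
	return []string{"sh", "-c", script}
}

func main() {
	dev := BlockDevice{Node: "/dev/sdbz", Major: 259, Minor: 4, MountPoint: "/mnt"}
	args := withMount(dev, "ls -la /mnt")
	fmt.Printf("%q\n", args)
}
```

Chaining with `&&` matters here: the separate `mknod` and `mount` steps in the failed run each started from a fresh container, so the node created in one step was gone by the next. Every step that needs the drive would call the helper again with its own command.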

weak ice
#

oh boy ok

weak ice
#

It'll be worth it if I can get it going!

open drum
#

you can wrap all the mounting dance in a wrapper script.. doesn't make it less ugly though but you can put some make up on it 💄

weak ice
#

holy cow - you guys are wizards

#

ok so lemme see if I can understand what's happening here....

  • My buildkit daemon is running as root, so technically it has block device access, regardless of the mounts I've configured via k8s.
  • I'm issuing some explicitly insecure commands via WithExec which therefore inherit this root level access.
  • For the duration of this instruction (??) the mount is available.

I guess I'm still left wondering why I can't directly access the device? This feels somewhat similar to GPU access that I've heard is possible here, what makes a block device special?

open drum
# weak ice ok so lemme see if I can understand what's happening here.... * My buildkit dae...

"I guess I'm still left wondering why I can't directly access the device? This feels somewhat similar to GPU access that I've heard is possible here, what makes a block device special?"

mostly design decisions. Buildkit is designed as a build system with the property that the client and the server could live on different hosts. That's why it optimises context uploads and cache de-duping to make this as performant as possible. Having said that, for some scenarios, this is a heavy trade-off. Adding support for additional devices and platform specific configurations is just very complex and it's hard to justify the effort given the 80/20 rule

#

I'd say that your use case, @weak ice, lands in that 20%, from what I've seen in my experience

weak ice
#

Makes sense.

FWIW - not implying I need a feature request or anything here, I was just as surprised by this use case. Just trying to learn a bit more about internals

#

Really appreciate the help guys thanks!

vapid marlin
#

Note: we need better architecture diagrams

#

Thanks for being patient with us @weak ice

mystic remnant
#

I so want a video tutorial on dagger kubernetes

weak ice
# maiden crow Yeah Dagger has the same limitations here as buildkit/dockerfiles, which doesn't...

Sorry to dredge up old news here, but I came across bind mounts in some other docker-specific stuff I'm doing.

While I could probably understand the design decision to exclude things like block volumes from the API, would it not make sense for a WithBindVolume type func to exist? We already seem to have cache and secret volumes. Adding an explicit func in this way would let the user accept the trade-off and give up the caching optimizations present with other methods of passing directories.

GitHub

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit - moby/buildkit

#

I guess I ended up making a feature request after all 😅

maiden crow
#

That's not to say your feature request is wrong :-), it's just the same situation as before, where it's very non-trivial work to get something that behaves similarly to true bind mounts. There are a lot of use cases that call for it (yours of course, but also development environments where you want to mount in source code) so I think we will do it, it's just not on the immediate roadmap (currently anyways)

weak ice
#

Hmm yeah maybe I'm mixing up the bind mount docs in docker with buildkit here

maiden crow