#Run code on the host
No, Dagger doesn't allow this. The properties that make Dagger's graph useful require all operations to be containerized. But you can do this from your own code, intermixed with calls to the Dagger API.
One "escape hatch" that is available is accessing host services. Dagger can orchestrate containers such that they have access to specific network endpoints running on your host network. You could set up an SSH service, then SSH into it from the containers. Or, if the software you need to run on the host has a server mode, just expose that instead.
But if you're using Dagger, containerizing should be the rule, and "escaping" the containers the exception. Otherwise you're going against the grain of the platform.
I'm not Linux ninja enough to know, but is it possible to run containers in a way that a process will outlive the container? I'm still trying to find ways to make stateful Bazel work with Dagger. There is no advertised way to start the Bazel server and then configure the client to talk to a specific server. But I'll see if I can find some hidden flags for it; there are plenty of them in Bazel.
You can have a container connect to another container as a service, but it will only last for the duration of your Dagger session - so the Bazel server will be wiped after each run, which is not what you want.
One great solution would be to persist the state of the Bazel server each time; then you can use cache volumes like we discussed in the beginning. But I don't know enough about Bazel to tell you if that's possible or not.
I think you mentioned long start times
Yeah, it's not really possible to persist the state since it's just stored in the memory of the Bazel process. I think I'll just have to leave Dagger out of the Bazel part of the pipeline for now.
Could I mount a named pipe into the Dagger container and then pipe commands to the host? That does feel very hacky.
If the only thing you need to run on the host is the Bazel server, I would just run it as a "regular" service, without involving Dagger, and forward the TCP port via Dagger's container-to-host (C2H) networking feature.
I don't think it's worth rigging a way for your Dagger logic to run commands on the host, if the only command you'll run is that bazel server
Reading a bit more about the Bazel client/server architecture (https://bazel.build/run/client-server), it seems like you'll also have to take into account the user ID and the path of the base workspace directory.
Is the Bazel server currently running on your laptop?
During local dev the client and server run on my laptop
And in CI they run on the CI host. AFAIK there is no way to start a remote Bazel server and then issue commands to it with the client. I at least have not found a way to do so. It might also not make sense to do so since the client and the server need access to the same set of source files.
That's the part that's not clear to me. How do you start the Bazel server today? If 1) your bazel server is long lived, and 2) your CI jobs which use that bazel server are short-lived, doesn't that require "starting a remote bazel server and then issuing commands to it with the client"?
I've skimmed the Bazel docs and it seems like the server is automatically started by the client if it's not running (similarly to what Dagger does). Since the CI machines are stateful, the server always uses the same caching directories when it starts.
Additionally, Bazel starts different servers depending on the user ID and workspace directory of the build. Apparently that's how they support multi-tenancy across multiple projects and users.
@formal ledge let me check if I can find anything on how to make a client connect to an existing server
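To make the "one server per user and workspace" point concrete, here is a minimal sketch of how the default output base could be derived. It assumes the layout described in Bazel's output directory documentation (a `_bazel_$USER` directory containing a subdirectory named after the MD5 hash of the workspace path) and a Linux-style `~/.cache/bazel` root. The function name and the exact string encoding are assumptions for illustration; `bazel info output_base` is the authoritative answer on a real machine.

```python
import hashlib
from pathlib import PurePosixPath


def default_output_base(user: str, workspace: str,
                        output_root: str = "~/.cache/bazel") -> str:
    """Approximate the default Bazel output base for a user and workspace.

    Assumption: the directory name is the MD5 hex digest of the workspace
    path, placed under <output_root>/_bazel_<user>.
    """
    digest = hashlib.md5(workspace.encode("utf-8")).hexdigest()
    return str(PurePosixPath(output_root) / f"_bazel_{user}" / digest)


# Two users, or two checkouts, end up with different output bases,
# and therefore with different Bazel servers.
print(default_output_base("alice", "/home/alice/src/app"))
print(default_output_base("bob", "/home/bob/src/app"))
```

This is also why a containerized client would need the same user and workspace path on every run to land on the same state directory.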
fking bazel
that's what happens when you want your software to wrap everything, and never the other way around. Headaches.
If that's true, then cache volumes should be viable
specifically a cache volume with shared option
@unreal rampart what docs page are you looking at?
now skimming through the code to see how the client <> server communication works
FYI @burnt grove
Yeah, this is pretty much correct @unreal rampart
The issue is not really the caches that Bazel uses. They can either be on disk or on remote servers, both of which work fine with Dagger.
The issue is that Bazel also keeps in-memory state to make incremental builds faster. Doing the initial build might add anywhere from a few seconds to minutes to your build, depending on how many Bazel targets you have in your repo.
Right, I now remember that you explained that earlier
So what happens when the first client transparently spawns the server, then exits - it just keeps running in the background, detached from the initial process group?
a sort of auto-daemon mode in the user's session I guess
and @formal ledge do you know how the client connects to that server? A named pipe or unix socket at a specific location in the workspace?
Ironically this is quite similar to how Dagger handles its engine container
here's the code that connects to the server: https://github.com/bazelbuild/bazel/blob/351b9a079d528b073d1f1405597d5bc11f6a8dbd/src/main/cpp/blaze.cc#L1593
seems like gRPC over TCP
seems like the architecture is very "same sandbox" oriented. I can see from the code that the connect function does some pid level checking to make sure the server is running, etc.
it also expects to have access to the path where the server writes some files to read the info: https://github.com/bazelbuild/bazel/blob/351b9a079d528b073d1f1405597d5bc11f6a8dbd/src/main/cpp/blaze.cc#L1596
there might be a way to hack this by running the dagger engine in non-sandbox mode (no pid namespace basically) and then sending the server_info.rawproto file to the bazel client container. However, it's very very hacky and will probably require some very custom ugly setup
I strongly discourage doing that
Thanks for really digging into this. This is all very interesting, I'm curious if you've ever had other stateful processes that haven't really fit into the Dagger model?
Would it make sense for Dagger to have some kind of persistent containers that can live through many invocations?
Hadn't encountered this particular problem before, it definitely opens the question
Alrighty, do you want me to create a GH issue to track this?
would definitely be worth it
Share the execroot between runs, you'll only have to rebuild the analysis graph that way
Which got a lot faster in bazel 7 with skymeld btw
And yes, there is no official way to separate the "server" and the CLI; the server is actually shipped inside the CLI. The server exists only to keep the state in memory and to watch the workspace. Same with buck2.
thx for sharing! good to know all this
Thank you Steeve! Does this mean re-running a server each time would be less overhead that way?
yeah, starting the server is like subsecond, what can take a long time is fetching artifacts and the graph building phase (called the analysis phase)
fun part, the server is actually a jar file that lives in the install base:
~/c/g/z/zml (master)> ls $(bazel info install_base)
total 238328
drwxr-xr-x 14 steeve wheel 448B Dec 11 23:25 ./
drwxr-xr-x 5 steeve wheel 160B Dec 11 22:58 ../
-rwxr-xr-x 1 steeve wheel 116M Dec 8 2033 A-server.jar* <----------- BAZEL SERVER
-rwxr-xr-x 1 steeve wheel 5B Dec 8 2033 build-label.txt*
-rwxr-xr-x 1 steeve wheel 72K Dec 8 2033 build-runfiles*
-rwxr-xr-x 1 steeve wheel 50K Dec 8 2033 daemonize*
drwxr-xr-x 8 steeve wheel 256B Dec 11 22:58 embedded_tools/
-rwxr-xr-x 1 steeve wheel 32B Dec 8 2033 install_base_key*
-rwxr-xr-x 1 steeve wheel 51K Dec 8 2033 libcpu_profiler.dylib*
-rwxr-xr-x 1 steeve wheel 16K Dec 8 2033 linux-sandbox*
drwxr-xr-x 6 steeve wheel 192B Dec 11 22:58 platforms/
-rwxr-xr-x 1 steeve wheel 106K Dec 8 2033 process-wrapper*
drwxr-xr-x 9 steeve wheel 288B Dec 11 22:58 rules_java/
-rwxr-xr-x 1 steeve wheel 150K Dec 8 2033 xcode-locator*
and reusing the execroot allows skipping the artifact fetching phase?
indeed
in theory you can also leverage --repository_cache and --disk_cache; they are made mostly for cross-WORKSPACE sharing, but they work for (somewhat) stateless builds
but for a truly stateless system a remote cache is the way to go, since you only download the invalidated leaves in the build graph (Google's "build without the bytes")
Now we're getting somewhere
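As a rough illustration of the caching options mentioned above, the sketch below assembles the corresponding Bazel build flags. `--disk_cache`, `--repository_cache`, `--remote_cache` and `--remote_download_minimal` are real Bazel flags (spelled with underscores); the helper name, the paths and the cache URL are hypothetical, and whether this combination suits a given build is an assumption to verify.

```python
from typing import List, Optional


def bazel_cache_args(disk_cache: Optional[str] = None,
                     repository_cache: Optional[str] = None,
                     remote_cache: Optional[str] = None,
                     download_minimal: bool = False) -> List[str]:
    """Return the Bazel build flags for the selected caching strategy."""
    args: List[str] = []
    if disk_cache:
        # Local directory holding action outputs, reusable across workspaces.
        args.append(f"--disk_cache={disk_cache}")
    if repository_cache:
        # Local directory holding downloaded external repository archives.
        args.append(f"--repository_cache={repository_cache}")
    if remote_cache:
        # Always-running content-addressable store (CAS) endpoint.
        args.append(f"--remote_cache={remote_cache}")
        if download_minimal:
            # "Build without the bytes": skip downloading intermediate outputs.
            args.append("--remote_download_minimal")
    return args


# Hypothetical paths; in a container these would sit on a persisted mount.
flags = bazel_cache_args(disk_cache="/cache/disk", repository_cache="/cache/repo")
print(["bazel", "build", *flags, "//..."])
```

For the remote variant, something like `bazel_cache_args(remote_cache="grpcs://cache.example.com", download_minimal=True)` would apply, with the URL replaced by a real CAS endpoint.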
"Remote cache" being the long-running server (one per workspace per user) that was discussed earlier, or something else?
no, something running remotely: an actual, always-running CAS server
you can use GCS itself as a bazel cache, too, but it's rather slow
i wrote a GCS-backed bazel cache at Zenly (https://x.com/steeve/status/1367530898598592526?s=20) which was open sourced and supported build without the bytes (it leverages a merkle tree), but since Snap closed the company, the repo vanished (asshole move...)
actually, since bazel has dynamic scheduling (meaning it'll race the local build with the download), this may not be that big of a deal nowadays, but you do lose a bit in GCS latency
Well, in the case of running Bazel in Dagger, the execroot directory will be persisted on a Dagger cache volume, which can be distributed across nodes just like the rest of the Dagger cache. So the end result, I think, will be stateless Bazel, or at least stateless enough
yeah, although the problem is that cache is huge, so snapshotting/restoring is something to watch out for
Zenly iOS that was like 5 to 8GB for a cold build
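The cache-volume idea discussed above could look roughly like the sketch below. It assumes the Dagger Python SDK (`dagger-io`) with its `Connection`, `cache_volume`, `with_mounted_cache` and `CacheSharingMode` names as I understand them; check them against your SDK version. The mount path, volume name and helper names are hypothetical, and the image is left as a parameter because no Bazel image is named in this thread. Pointing `--output_base` at the mounted volume is one assumed way to keep the execroot on it.

```python
from typing import List

CACHE_MOUNT = "/bazel-cache"  # hypothetical mount point inside the container


def bazel_args(targets: List[str]) -> List[str]:
    """Build a Bazel command whose output base lives on the cache mount."""
    # --output_base is a startup option, so it must precede the command.
    return ["bazel", f"--output_base={CACHE_MOUNT}/output_base", "build", *targets]


async def build_with_cache_volume(image: str, targets: List[str]) -> str:
    """Run `bazel build` in a container with a persistent cache volume."""
    import dagger  # requires the dagger-io package and a reachable engine

    async with dagger.Connection(dagger.Config()) as client:
        cache = client.cache_volume("bazel-output-base")  # hypothetical name
        return await (
            client.container()
            .from_(image)  # any image that ships Bazel (assumption)
            .with_mounted_cache(
                CACHE_MOUNT, cache,
                sharing=dagger.CacheSharingMode.SHARED,
            )
            .with_directory("/src", client.host().directory("."))
            .with_workdir("/src")
            .with_exec(bazel_args(targets))
            .stdout()
        )
```

Calling it would be something like `asyncio.run(build_with_cache_volume("<your-bazel-image>", ["//..."]))`. If concurrent runs contend for the same output base, `CacheSharingMode.LOCKED` may be the safer choice.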
@formal ledge I believe we have a possible solution to your problem, which will not require a long-running Bazel running on the host after all! Thanks to @burnt grove
I'm going to create a "Dagger and Bazel" channel to celebrate
honestly, i'd just start with the cache in the volume strategy, i'm curious!
it's just that on Github Actions snapshot/restore is painfully slow
We don't use the GitHub Actions cache
With Dagger Cloud distributed caching, the snapshots go to the closest storage bucket. We have an edge in each AWS and GCP region, so for self-hosted GHA runners in either of those clouds it should be very, very fast. If running in managed GitHub Actions (Azure), it will fall back to Cloudflare R2, which is quite good too
how does that work? is it a blob?
For layer cache it's just the buildkit state directory, so a bunch of layers. Cache volumes (just a bunch of bind mounts) are just directories that we snapshot. I don't know what the snapshotting method is, @midnight tinsel will know. I'm guessing something straightforward like one tarball per volume maybe?
so actually it does look like you could get away with good performance and low maintenance