#Engine container communication


hoary yew
#

Or a variation: no network service. The CLI executes a special client, and the exec's stdio is the transport. The same compute driver that owns getting access to a containerd also owns the communication to containerd, so auth, tunneling, etc. become an implementation detail of the driver. The CLI just worries about the containerd API.
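
To make the "exec's stdio is the transport" idea concrete, here is a minimal sketch. It is an illustration only, not how Dagger or containerd do it: the child script, the line-delimited JSON framing, and the names `StdioTransport`, `CHILD_SOURCE`, and `demo` are all assumptions made up for this example. The point is that the caller never opens a socket; it only talks to the pipes of a process it spawned.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the "special client": it reads one JSON request
# per line on stdin and writes one JSON response per line on stdout.
CHILD_SOURCE = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"id": req["id"], "result": req["method"].upper()}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
"""


class StdioTransport:
    """Sends line-delimited JSON requests over a child process's stdio."""

    def __init__(self, argv):
        # No network service: the only channel is the child's stdin/stdout.
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )
        self.next_id = 0

    def call(self, method):
        self.next_id += 1
        request = {"id": self.next_id, "method": method}
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        # Closing stdin ends the child's read loop, so it exits on its own.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


def demo():
    # In the proposal, a driver would decide what argv is (for example a
    # command that execs into the engine container); here it is a local
    # Python child process.
    transport = StdioTransport([sys.executable, "-c", CHILD_SOURCE])
    try:
        return transport.call("ping")
    finally:
        transport.close()
```

Because the driver chooses the command line, things like auth and tunneling stay hidden behind whatever process it launches, and the lifetime of the connection is tied to the lifetime of that process.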

high gazelle
#

This would mean that there can be no deduped jobs against a dagger instance.
So if CI job 1 does a withExec, and CI job 2 does the same one (maybe they use common code), they both have to do it from scratch (assuming that one of the jobs hasn't finished and uploaded cache)
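
A rough sketch of the optimization being discussed, for readers who haven't seen it: when two clients share one engine, identical operations that are in flight at the same time can be run once and the result handed to both. This is not the engine's actual implementation; `DedupingEngine`, its `with_exec` method, the string key, and `demo` are hypothetical names invented for this example.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class DedupingEngine:
    """Hypothetical shared engine: identical in-flight operations run once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # operation key -> Future shared by all callers
        self.executions = 0   # how many times work was actually run
        self.joined = 0       # how many callers reused an in-flight operation

    def with_exec(self, key, run):
        with self._lock:
            future = self._in_flight.get(key)
            owner = future is None
            if owner:
                future = Future()
                self._in_flight[key] = future
                self.executions += 1
            else:
                self.joined += 1
        if not owner:
            return future.result()  # wait for the first caller's result
        try:
            future.set_result(run())
        except Exception as exc:
            future.set_exception(exc)
        finally:
            with self._lock:
                del self._in_flight[key]
        return future.result()


def demo():
    engine = DedupingEngine()
    release = threading.Event()

    def slow_step():
        # Stand-in for an expensive exec; blocks until released.
        release.wait(timeout=5)
        return "layer-abc"

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Two "CI jobs" submit the same operation to the same engine.
        job1 = pool.submit(engine.with_exec, "go test ./...", slow_step)
        job2 = pool.submit(engine.with_exec, "go test ./...", slow_step)
        # Wait until the second job has attached to the first one's work.
        deadline = time.monotonic() + 5
        while engine.joined == 0 and time.monotonic() < deadline:
            time.sleep(0.001)
        release.set()
        return job1.result(), job2.result(), engine.executions
```

With one engine per CLI, each job would have its own `DedupingEngine` and the step would run twice; sharing only helps while the jobs overlap, which matches the caveat above about one job not having finished yet.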

#

It does hugely simplify engine lifetimes though, so that would be nice - but it would throw out a potentially nice performance optimization, and users may notice its removal.

high gazelle
# hoary yew Or a variation: no network service, the CLI executes a special client and the ex...

This is pretty close to https://github.com/dagger/dagger/pull/6288
We could almost do this today with that PR (non containerd part, it would speak the dagger API directly)

GitHub

This patch starts to simplify the connhelpers and docker-image helper into specific engine drivers (see #5583) - adding new drivers now becomes easier.
Each driver registers itself under a name, an...

hoary yew
high gazelle
#

the future we configured the image such that the daemon is only reachable by the CLI that provisioned it
If so, it can't connect to an existing one created by another CLI (as is the case in the 2 CI jobs above).

hoary yew
#

👋 @wraith barn 🙂

#

I don't have a good answer @high gazelle, but @wraith barn was just telling me that sharing a dagger engine with multiple CI runners is hurting performance. So we have a conundrum

high gazelle
#

That's probably true sometimes, but a couple points:

  1. I'm not sure our own CI is actually a great example here; we've been seeing a lot of slow jobs due to unrelated reasons that we need to clean up. We also actually don't understand why our CI is slower than we want.
  2. We should try and collect data on these cases (not just from us) - the analytics we're getting are going to be useful here.

I actually really like the implications of the architecture you suggested, but I don't think we should jump into it without some data to back it up (imo perf concerns should trump architectural ones - but if there really are no perf differences, we should go for it!).

#

One really good point of having an engine per client is that if the engine crashes, it only brings down one client, which also makes debugging easier. I really like that actually.

wraith barn
#

One of our first steps was to start capturing metrics for all pipelines, including what happens even before we invoke Dagger.

Now that we have this in place, we can run a few alternative configurations in parallel, including using 1 CI runner per Engine, and see if that does away with these two errors that we've been seeing frequently:

I am excited to try a few experiments and see how they improve our pipeline reliability & consistency.

high gazelle
#

Agreed that those bugs are real issues we should address - but also once I manage to clear my emergency backlog of issues, these are right at the top of my list to investigate

wraith barn
#

I know, we are all doing the best we can to juggle everything. As long as we see this improve, we know we are doing the right thing:

high gazelle
#

They're complicated logic bugs, but I would like to hunt them down, so that users can run dagger in this config, even if we choose not to