#Engine container communication
Or a variation: no network service. The CLI executes a special client, and the exec's stdio is the transport. The same compute driver that owns getting access to containerd also owns the communication with containerd, so auth, tunneling, etc. become implementation details of the driver. The CLI only worries about the containerd API.
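To make the "exec's stdio is the transport" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the child process is a stand-in for the special client, the newline-delimited JSON framing is invented, and `StdioTransport` is a hypothetical name. In a real driver the argv would be whatever that driver uses to exec into the engine container.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the "special client" the driver would exec.
# It answers each newline-delimited JSON request on stdout.
CHILD_SOURCE = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    print(json.dumps({"id": req["id"], "echo": req["method"]}), flush=True)
"""


class StdioTransport:
    """Request/response channel over a child process's stdin and stdout."""

    def __init__(self, argv):
        # No socket and no listening port: the pipes are the only channel.
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )
        self.next_id = 0

    def call(self, method):
        self.next_id += 1
        request = {"id": self.next_id, "method": method}
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        # Closing stdin ends the child's read loop, so it exits on its own.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


def demo():
    transport = StdioTransport([sys.executable, "-c", CHILD_SOURCE])
    try:
        return transport.call("version")
    finally:
        transport.close()
```

The caller only sees `call()`. How the child gets started (local process, exec into a container, ssh) stays inside whatever builds the argv, which is the part the driver would own.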
This would mean that there can be no deduped jobs against a dagger instance.
So if CI job 1 does a withExec, and CI job 2 does the same one (maybe they use common code), they both have to do it from scratch (assuming that neither job has already finished and uploaded its cache).
It does hugely simplify engine lifetimes though, so that would be nice - but it would throw out a potentially nice performance optimization whose removal users may notice.
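A toy sketch of the tradeoff described above, assuming dedup works by keying in-flight operations on their inputs. The `Engine` class, `with_exec` method, and `run_jobs` helper are hypothetical and are not the real Dagger engine or API. The point is only that two jobs attached to one engine can share a run, while two private engines cannot.

```python
import hashlib
import threading
from concurrent.futures import Future


class Engine:
    """Toy engine: runs each distinct exec once, even under concurrent requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # operation key -> Future holding the result
        self.runs = 0        # how many times work was actually executed

    def with_exec(self, image, args):
        key = hashlib.sha256(repr((image, tuple(args))).encode()).hexdigest()
        with self._lock:
            fut = self._inflight.get(key)
            owner = fut is None
            if owner:
                fut = self._inflight[key] = Future()
                self.runs += 1
        if owner:
            # Only the first requester executes; later ones wait on the Future.
            fut.set_result(f"ran {' '.join(args)} in {image}")
        return fut.result()


def run_jobs(engines):
    """Run the same exec once per entry, each in its own thread (one per CI job)."""
    threads = [
        threading.Thread(target=e.with_exec, args=("alpine", ["echo", "hi"]))
        for e in engines
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Count real executions across the distinct engines involved.
    return sum(e.runs for e in set(engines))
```

With `shared = Engine()`, `run_jobs([shared, shared])` executes the work once, while `run_jobs([Engine(), Engine()])` executes it twice. That is the optimization an engine-per-CLI design gives up unless a remote cache fills the gap.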
This is pretty close to https://github.com/dagger/dagger/pull/6288
We could almost do this today with that PR (minus the containerd part; it would speak the Dagger API directly)
But the CLI can still decide to run a new container or connect to an existing one.
If in the future we configured the image such that the daemon is only reachable by the CLI that provisioned it,
then it can't connect to an existing one created by another CLI (as is the case in the 2 CI jobs above).
@wraith barn
I don't have a good answer @high gazelle, but @wraith barn was just telling me that sharing a dagger engine with multiple CI runners is hurting performance. So we have a conundrum
That's probably true sometimes, but a couple points:
- I'm not sure our own CI is actually a great example here: we've been seeing a lot of slow jobs due to unrelated reasons that we need to clean up. We also don't actually understand why our CI is slower than we want.
- We should try and collect data on these cases (not just from us) - the analytics we're getting are going to be useful here.
I actually really like the implications of the architecture you suggested, but I don't think we should jump into it without some data to back it up (imo perf concerns should trump architectural ones - but if there really is no perf difference, we should go for it!).
One really good point of having an engine per client is that if the engine crashes, it only brings down one client, which also makes debugging easier. I really like that actually.
One of our first steps was to start capturing metrics for all pipelines, including what happens even before we invoke Dagger.
Now that we have this in place, we can run a few alternative configurations in parallel, including using 1 CI runner per Engine, and see if that does away with these two errors that we've been seeing frequently:
I am excited to try a few experiments and see how they improve our pipeline reliability & consistency.
Agreed that those bugs are real issues we should address - once I manage to clear my emergency backlog of issues, these are right at the top of my list to investigate
I know, we are all doing the best we can to juggle everything. As long as we see this improve, we know we are doing the right thing:
They're complicated logic bugs, but I would like to hunt them down, so that users can run dagger in this config, even if we choose not to