#Concerns about having a local engine
I would rank that as number 1, but there's some other tricky stuff we'd have to port into each language
The support for gateway containers has a lot of custom go code on top of the underlying grpc api
Yeah I know, just food for thought
No totally I agree with the general sentiment
But yeah if it was the case. Then each sdk would just spin up the dagger docker image
Nothing else
No dependency on the binary
What problem are we talking about?
No bootstrapping of dagger-buildkitd with dagger
Yes the dependency on the binary is a product of lack of resources to reimplement lots of code in each SDK, which can always be addressed by supplying more resources, but we'll need to go through carefully and make sure we know how much we're taking on
It's worth more thought though I agree
A bunch. Sorry, at a doc appt on my phone, can't type a lot
Ok, going on a call in 5mn but will catch up after. At first glance that list looks like a list of design decisions, rather than a list of problems
^ Yeah it ties into our architecture (we spin up the local engine as binary which just talks to the separate daemon binary, currently buildkitd, soon to be our buildkitd wrapper). It also thus ties into multitenant support, everything we were talking about in terms of session-state yesterday, etc. etc.
For instance, SDKs depend on the binary. How to package distribute etc
If cloak could fully run as a docker container then the problem goes away
Or the fact that we have a weird arch (see my comment on buildkit embedding PR)
but if we packaged it as a container, wouldn't we lose host { } ?
Another thing we'd have to reimplement in each SDK: provisioners (just thinking out loud)
Exactly. Because of the lack of filesync
It's not the case for buildkit because the SDK has filesync
It's technically possible to reimplement the equivalent of host in each SDK, it's just probably not straightforward at all
(probably)
If the SDK were to upload local, then cloak could run anywhere (container, hosted, etc). Doesn't matter anymore
have you been talking to Olivier recently Andrea?
Nope
Haha
It ties into the virtualized host stuff, sessions, local dirs on the playground
With filesync all of that doesn't matter anymore
that's a pretty major design change that would affect a lot of things. I think we should talk about the actual problem part a bit more before going too deep down this rabbit hole
It's all coming to a head in multiple areas (cloud, embedding buildkit, etc.) so we should have that discussion soon, but I agree there's no time to dive fully at this precise moment
is the problem "packaging a binary in a SDK is too hard"?
That's one, but also just that the architecture becomes much more complicated with multiple daemons talking to one another. It would be nice to simplify
I also don't see the relationship to packaging buildkit (wouldn't we have to do that inside the container anyway?)
Comment from yesterday on the buildkit PR explains a bit
There's no more buildkit, just graphql, with this approach
Whereas now there's an sdk -> localbin -> container over 2 different protocols
Concerns about having a local engine
- Concerns about having a local engine AND a containerized engine at the same time
To illustrate the issue
Two engines
Because one must run on the host, the other in a container
Right I understand (but that's too long for a discord channel name)
Either we make buildkit run on the host (not possible), or we make cloak run in a container (limitation: filesync)
Anyway
It's incredibly hard to do so it's just daydreaming at this point
If the problem was filesync alone I'd actually categorize it at the edge of possible (many possibilities besides just literally reimplementing the filesync api), but I think the random pile of other assorted issues probably puts it over the edge of realistic in the medium term.
But worth double-checking those assumptions
And thinking through what the shortest possible path to getting something working would be
Trying to finish dagger.json stuff asap so not much detail, but made a stub for further discussion: https://github.com/dagger/dagger/discussions/3280
if the engine is always remote, then we would have to either 1) add file streaming to the graphql API or 2) expose parts of the buildkit/grpc API to all clients, right?
- in both cases, all SDKs would have to implement it
we would also have to change the secrets API to support sending secret value in the api (similar options 1 or 2 to do that)
Yes that, among other things.
I do wonder if there could be a shortcut though. If there's a file transfer API with native support for the languages we support and we can just expose that over a service from the engine, maybe there's a relatively cheap path to accomplishing this
That's a lot of "ifs" obviously
we could ship the engine with an ssh server
make ssh part of the api
gql + ssh
rsync or sftp for sync
regular ssh session for attach
We need native ssh clients in each language then. Unless we shell out to ssh, but shelling out to ssh is not conceptually different than shelling out to local engine
v1 just a black box container with duct tape. we could bundle it in a binary later but less urgent now
I donât know for sure where the pain is today, so hard to say if forkexecing ssh is meaningfully different
(I'd also suggest replacing ssh with websockets if we are just going to run rsync through it, but implementation detail)
Yeah it's a different kind of pain, if it's better or worse I don't know yet
oh I didn't mean anything as clean as implementing the ssh protocol. I meant literally running openssh in our container with a custom configuration and hooks. actually rsync files to an actual dir in the container then plug that into buildkit client somehow
duct tape basically
Oh totally, but if we want to connect to the openssh server in the engine from a script in, e.g., javascript, then we either need a native javascript ssh client or to shell out to ssh. I was just thinking websockets because that's what we are using for service otherwise. But doesn't matter, point is we need a transport.
Curiosity got the best of me and I took a brief detour to google native lang rsync implementations:
- https://github.com/gokrazy/rsync
- https://github.com/isislovecruft/pyrsync
- https://github.com/WebDeltaSync/WebRsync
I don't really trust the python/js ones, but having given them a very brief glance I am surprised how simple they are. It makes me wonder if it would be easier to implement the rsync protocol in python/js than implementing the buildkit filesync stuff. I truly have no idea, it just is an intriguing line of thought.
I'm sure this all occurred to Tonis and he made his decision to implement the filesync the way he did for some reason, so I expect it's more complicated than it seems
there's also sftp
my head hurts
Not to stress you out, but once we ship a design built around a concept of local engine, it will be very hard to rip it out later
OK another idea: how about a fork-exec helper of reduced scope, designed as a stopgap for a future where we reimplement it all in each SDK?
for example a dagger-sync-workdir tool?
I'm thinking something along these lines too. Like implement our SDKs such that all the functionality like this (filesync, connecting to service containers, etc.) can be implemented via shelling out right now and then later swapped for native impls
Router goes into our buildkitd wrapper
Yeah this needs more thought
But it is probably right. I gotta finish up the other stuff at this exact moment obviously but this can't wait long either
We would still need to bundle buildkit though. At best we get the luxury of bundling it in a container image for now, instead of a binary. But we still need the bundling for a bunch of reasons.
Yes 100%, the bundling becomes more important than ever here
In this scenario would we still want to bundle both the engine + client utility in the dagger binary?
Another issue is the provisioners, it seems wrong to reimplement them in every SDK, even long term
My default would be to just bundle everything into one binary still to start because it makes it easier to think about and change. And once we have solidified designs we could consider breaking up if there's a need too
catching up
and this
it's not an issue of hard engineering (like eg. implementing filesync) it's the fragmentation that is a permanent problem
Erik you know the internals of bk better than me, but I think rsync or whatever would be a big downgrade compared to native right?
AFAIK a ton of work went into filesync to be snappy for that particular use case
@timber osprey tally so far:
- Problem solved by removing local engine
  - ? unclear for me at the moment
- Implications of removing local engine
  - Filesync
  - Secrets
  - Provisioners
  - Host API (workdir, env)
  - API is no longer 100% graphql?
(although it doesn't feel like that at times)
Honestly I don't know, I think the filesync service is built around similar concepts in terms of diff copying. I know rsync has heuristics around block based diffing too. But I have never looked into either in depth to have an informed opinion
provisioners less so I believe?
even in that world, there would be a "dagger" CLI (only, at this point, more like a daggerctl than a daggerd)
I would be biased to agree with what you're saying though because I doubt all the work went into filesync for nothing
API is already not 100% graphql today due to services (expose streams over websockets). That's not a bad thing either, I think it's fine as long as it doesn't keep growing with more and more protocols/endpoints/etc.
Yeah. I would assume, in a naive way, that there MUST be a reason why they're not rsyncing stuff around
But then we'd still be fork-execing to dagger from the SDK, isn't that what you wanted to avoid?
Agreed, that's where I'm at too. But at the same time rsync is widely used and "okay enough" IME so I don't know what the reasoning was
so that was my question
does the sdk need to provision?
Something needs to provision. I think the fact that if the engine isn't running it will be spun up for you is extremely extremely nice
with the current assumptions (avoid managing stateful daemons) yes
Yeah that too
part of that requires illusion/polyfill, but it's worth it because fundamentally there's no reason for buildkit daemons to be pets, should be cattle
by provisioning we mean spinning up the container? or more complex provisioning (e.g. k8s etc)?
unclear
Starting with just container, but maybe someday more complex, nice to leave "more" complex open
@timber osprey what do you think about the idea of retaining the local binary, but don't run it as an engine, just shell out to it for some utilities we don't want to rewrite in every language (mentioned above somewhere I think)
the former -- it's really nice to have. however:
- if we don't, not that weird (that's what you do with virtually all tooling)
- and if we do, not rocket science for every SDK to do, right?
e.g. not rocket science --> compared to the freaking codegen and xdx
the bar is already high
Ideally it would all be a single, idempotent action that can be called on demand by the SDK
ie. the first engine.Start may do a lot. Subsequent calls may be faster because images are already downloaded, kub services deployed, etc. But conceptually it's all bundled together
yeah, like a helper binary
that's why I was asking simple container vs k8s deployments etc
to get an answer?
for the latter, I don't know honestly. Would I feel comfy giving root access to a tool to do stuff on my k8s cluster?
vs having an official helm chart, CF template, etc
same feeling as "use the language you're familiar with" -- "use the native tooling you're already familiar with"
but debatable
Some shops give access to kub namespaces. So you could use that
If your kub admin won't let you deploy stuff on kub, then you don't use the kub provisioner
in that case "provision" may just be "connect to this known URL"
maybe provisioner is the wrong term?
with the assumption that there's not a one-size fits all deployment, we'd have to replicate what 3rd party tools are doing
like, ok, deploy to my k8s cluster. but I happen to use istio or whatever service mesh, so i need those extra labels. Oh and rbac is configured this way so blah blah
I would consider setting BUILDKITD_HOST env to be the dumbest possible provisioner implementation. Agree the term provisioner is not totally accurate
(not complaining about k8s fragmentation -- just wondering if complex "provisioning" is better handled outside of SDKs, into native tooling like cloudformation, terraform, helm, etc etc -- in which case we only need "basic default provisioning" which is doable in every SDK, I think)
but anyway -- doesn't change anything for filesync, secrets, etc
yeah agreed
but we either 1) need something pluggable in the client to invoke stateless engines remotely (possibly pre-provisioned) or 2) everyone needs to manage pet daemons
on kubernetes I think correct invocation is actually with a one-off exec (as opposed to declarative kubectl apply)
on a remote server: ssh + exec
I mentioned the shell out utility idea here: https://github.com/dagger/dagger/discussions/3280#discussioncomment-3819142
Think that would be a step up from current state if nothing else. Could also be stepping stone to native clients or just stay like that forever
would be DAGGER_HOST though right?
yes once Guillaume's PR is merged
this helper binary would need to be a local proxy for the duration of the session. Since it needs to handle callbacks from the server
so the http2/grpc plumbing needs to flow through the helper or too much work left to the sdk
this actually solves the provisioning problem too
gives the helper a hook to handle that too
same helper can wrap your shell script to solve session management
host API could remain: callbacks would flow back from remote engine to local helper
oh and socket forwarding too (adding that to tally)
dagger client ?
or dagger-client
Yes to all the above, dagger client is fine, in my current thinking this would be a hidden command from end users, just meant to be shelled out by sdks
If we continue to use buildkit's session management (as we do today), then we'll need to either A) fix session sharing upstream or B) have dagger client be long-running for full duration of session and accept commands over pipes (brings us back to two daemons technically I guess, but in slightly different form)
Or we can make our own session concept and not expose buildkit's from daggerd. Usual tradeoff of more work for more customizability.
Just thinking out loud
doesnât the problem go away if we do everything in one buildkit client connection?
Yes, but that also means that the helper binary needs to run for the duration of the session. So it's not shelling out for individual functionality, it's running continuously and just accepting commands for individual pieces of functionality over pipes. Which is all totally fine, just thinking about it
that's what I'm saying, we need that anyway because of the callback-driven nature of filesync (not to mention socket forwarding)
no way around it imo
it's basically exactly the same as today except instead of running the engine locally you're running a local proxy to the engine
Yes this has somehow come full circle... I think there's probably cleanups in the details to be made, but our current architecture is probably not very far off from reasonable even though it seems weird
it's suddenly way less weird if you fork-exec a helper proxy rather than a server talking to another server
Now actually feel that we could keep the fork-exec forever (but keep the option open to change our minds later)
remaining part is the API now having a non-gql component
Yeah I think it may just be more a change in terminology rather than any huge changes in implementation
still would be a pretty major change not running a server locally, plus lots of new grpc plumbing
but smaller diff than I feared at the beginning of this thread
Yeah and I'm now questioning whether even that is all that beneficial actually.
If we are sending commands over pipes to a local proxy running for the duration of the session, then why not just do what we're doing now and run the router in that local proxy and send graphql through it?
idk I need to sleep on it, brain is sputtering
That's where making a list of current pain would be helpful
then we can weigh cost & benefit
Good point. Need to think more about it. I guess the major difference is it's scoped to a much smaller functionality rather than being the full engine
Basically, SDKs would use that binary as an "implementation stopgap", with the end goal of eventually supporting it natively (and the SDK could actually support it natively from day 1)
That point of view changes things quite a bit. For instance it would be ok for the node sdk to just embed the proxy binary inside the npm package itself (I've seen a few packages doing that). Since it's more of a ".dll" than an engine.
Embedding the full engine has bigger implications. Every script has its own instance, no multi tenancy. Services die off as soon as the script is done. We talked about how cloak attach was weird because it wanted to attach to an existing server rather than spawning its own
On the operations side, taking playground as an example:
Marcos will probably run dagger on a container, and mount the docker socket inside so that dagger can launch itself again as a daggerd container and communicate through stdio, which is plenty weird (although we could probably make that better regardless for the use case of running dagger in a container)
But then if it runs as a container: what to do with local files etc? (problem we do have in playground)
Yes
Related to point above
Could potentially be a much smaller binary embedded in the SDK. Basically a dagger.so library-ish, but over fork exec, to stopgap whatever the SDKs can't implement on their own
I have to call it a night but now that I have a slight bit of breathing room plan is to think about this so we can draw out options some more. But yes first step will be to just write out the problems explicitly, ensure they are real, etc. I'll update this discussion (which is currently just my dumping grounds, but feel free to dump thoughts too, I'll clean it up once thoughts are formed): https://github.com/dagger/dagger/discussions/3280
Could potentially be a much smaller binary embedded in the SDK. Basically a dagger.so library-ish, but over fork exec, to stopgap whatever the SDKs can't implement on their own
I'd like this a lot; there are devils in the details related to session crap but I have been thinking about it on and off today and have some vague ideas on shortcuts we can look into. The local dirs specifically have an insanely complicated caching scheme, there's a small chance we might be able to use them independent of sessions, which will simplify things greatly I think
(Hope I'm not adding to confusion here, just brainstorming, not convinced either way)
No the architecture is complicated and we need to make sure if we continue going down this route it's at least consciously. But even better would be simplifying it. Worthy exercise to go through
I've been thinking about filesync/rsync etc, and what would be the simplest possible thing to do (relatively speaking).
Turns out, you can "wrap" gRPC over websockets (found an example repo on GitHub). And you can "proxy" buildkit (Erik you implemented that in the early days of the typescript SDK when TS was talking gRPC directly).
So in theory: engine could proxy buildkit's filesync as is. Wrap it in WS.
Go SDK natively imports bk, wraps it in WS, exposed as SDK functions
Small binary uses the Go SDK to do just that. It's embedded by SDKs that can't implement the session stuff
Someday: we replace gRPC over WS by our own thing over WS, SDKs use that directly, no more binary
(e.g. /ws/v1/filesync is just a proxy for the filesync bits in gRPC over WS. /ws/v2/filesync can be a simpler protocol in the future that can be implemented natively in TS etc)
Err: by "exposed as SDK functions" I didn't mean we expose bk. I meant the filesync bits work under the hood in the Go SDK because internally it uses bk client, but that's an implementation detail.
Then in theory if this could be compiled as a .so and linked to JS, it would work too. Since it's not possible, it's embedded bin+fork/exec
100%, let's make sure the problems are real before diving into a costly implementation
Have a great weekend everyone!
You too!
Yes to this, sync-as-a-service (or similar) in the very long term feels like a nice approach. Interestingly, we might actually replace local source llb ops with cache mounts running on gateway containers if we do that (maybe).
Slight tangent, but I was thinking a few days ago about how you'd implement "hot reload" of generated code clients whenever a change is made to your code first extension schema. One possibility would be to run mutagen (two way remote syncing service) over a cloak service. The HLB authors actually told me about that idea, they ran it directly over bk ssh sockets. So then you sync local extension code changes to the remote service over websocket, it generates client code and syncs that back to you.
I guess you could enable general hot reload use cases with this approach too? Like frontend dev tools and stuff. Or hot reload of extensions into the cloak engine too I suppose.
Either way, not high priority but kind of interesting and semi related in that it also involved file syncing. But mutagen doesn't have native clients in non-go (I think, didn't check though), so doesn't really answer any of those questions.
So, I understand the concern about having a complicated architecture with one local engine and possibly more remote engines. I find the potential solution (local engine is replaced by a helper proxy) elegant.
But. I have a concern of my own on the UX side: no more local engine means there is always a stateful daemon that needs to be installed and managed out of band. No more "install SDK, it just works!". This leads to a UX similar to docker engine: soon you need to manage infrastructure on the side before you can do anything. Enter Docker Machine, boot2docker, Docker Toolbox, Vagrant, and of course podman, kubernetes... That is to say, a horrible and fragmented UX. Everyone endlessly tending to pet-like stateful engines, and arguing over the best tool to do so. Turning everyone into grumpy sysadmins.
How do we avoid that?
Unlike docker engine, our engine can get away with being stateless because it is based on buildkit which only has a cache to worry about. It would be a shame if we wasted that opportunity with a pet management UX a-la docker machine
I agree that's one of the things we need to figure out and that we should find any way we can to hide the requirement of persistently running daemons. But I also think it's mostly orthogonal to whether the local engine exists or not.
In today's current state w/ the local engine, there is still a persistent stateful daemon (buildkitd). We could in theory fix that by making our buildkitd wrapper ephemeral while still retaining the local engine.
In a world where we get rid of the local engine (either have a helper binary or put everything in the client SDKs), we will still need to solve the problem of running buildkitd functionality ephemerally.
So it's not that the problems have no interaction with one another, it's more that making buildkitd ephemeral and cattle-like will require its own independent set of solutions.