I had a question about IDs. You can | Dagger

hoary mortar Jan 21, 2024, 6:03 PM

#

Basically, at some point after retrieving an ID, assuming the engine has received other operations, the resources behind that ID will disappear, the ID will become unusable, and the original query tree that produced it will need to be re-run. Surely it never comes up in practice, at least in the withDirectory implementation, because the queries are milliseconds apart and GC runs are seconds apart, and you'll typically only have one tenant per engine, so it takes a while before whatever LRU or other heuristic the GC uses actually invalidates an ID.

#

So, if it never happens, why do I care about it in theory? Well, one, it's an interesting software architecture question. But also, I've been working on my own Dagger SDK, reading through the Node one for reference, and noticed ID fields are never actually set upon retrieval - there's little to no caching done on the client side. Makes sense, would be a micro-optimization. Except, until it potentially isn't. I'm looking into doing some highly granular integration with Dagger that may result in hundreds of simultaneous queries with unpredictable and potentially far apart output re-use. One of my main concerns is actually having my SDK run the exact same query 10 times at the same time, instead of having the 10 dependent trees await the same query coroutine, and the easiest way to avoid that is caching IDs at the object level.
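The "await the same query coroutine" idea above can be sketched on the client side as a single-flight map. This is a minimal sketch in Python with asyncio and is not taken from any Dagger SDK: SingleFlight, fake_query and the key string are hypothetical names, and fake_query only stands in for a real GraphQL round trip.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class SingleFlight:
    """Share one task among all callers asking for the same key."""

    def __init__(self) -> None:
        self._tasks: Dict[str, "asyncio.Task[str]"] = {}

    async def run(self, key: str, fetch: Callable[[], Awaitable[str]]) -> str:
        task = self._tasks.get(key)
        if task is None:
            # First caller for this key: start the request and remember the task.
            task = asyncio.ensure_future(fetch())
            self._tasks[key] = task
        try:
            # Every caller, first or not, awaits the same task.
            return await task
        except Exception:
            # Drop failed requests so a later caller can try again.
            self._tasks.pop(key, None)
            raise


async def demo() -> Tuple[int, List[str]]:
    calls = 0

    async def fake_query() -> str:
        # Hypothetical stand-in for a query that returns an object ID.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return "id-abc"

    flights = SingleFlight()
    results = await asyncio.gather(
        *(flights.run("some-query-key", fake_query) for _ in range(10))
    )
    return calls, list(results)
```

In this shape the ten dependent trees share one request, and the finished task doubles as the object-level ID cache for as long as that ID remains valid.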

#

Thus my interest in race "correctness" - is there a way to achieve strong guarantees in that regard from Dagger? Retrying when a sub-query fails is the obvious approach, but parsing out "this ID was missing" from a failure result sounds kind of messy, compared to, e.g., explicit minimum guaranteed expiry timestamps, keepalive handles, or the ability to establish a sort of "session frame" for a series of queries.

hoary mortar Jan 21, 2024, 6:48 PM

#

Oh, the JS SDK does actually cache certain IDs indefinitely. In computeQueryTree, BaseClient instances inside of the args array are replaced with ID strings on the canonical instance of the query tree.

kindred spruce Jan 21, 2024, 6:48 PM

#

cc @brave turtle who is working on a major overhaul of how IDs work

hoary mortar Jan 21, 2024, 6:49 PM

#

Thanks! (I hope that doesn't imply a major overhaul to how SDKs need to work, though)

kindred spruce Jan 21, 2024, 6:49 PM

#

and @visual steppe who is intimately familiar with the internals of bk caching

kindred spruce Jan 21, 2024, 6:49 PM

#

hoary mortar Thanks! (I hope that doesn't imply a major overhaul to how SDKs need to work, t...

not at all

brave turtle Jan 21, 2024, 7:06 PM

#

@hoary mortar so there's a bit to cover here:

1. Not sure if you're aware of it, but there's a concept of a "Dagger session" that corresponds 1:1 to e.g. dagger.Connect from an SDK. Right now IDs are only valid within the lifetime of the session that created them, and this session is what keeps all values referenced by IDs from being garbage-collected. It's expected that you run many queries against the same session.
2. There's technically one extra trick we have to employ to make sure non-reproducible blob:// sources don't get GC'd between an SDK returning it from a module and the 'parent session': https://github.com/dagger/dagger/blob/f7d1805f812b4ff58a08611eef63d6bffa3e5cee/core/modfunc.go#L297 - but this is to handle a more specific case involving nested Dagger sessions with blobs uploaded from the nested session.
3. The very latest release (v0.9.7) introduces a new architecture whereby IDs are effectively encodings of the query that created them, so it's now technically possible to re-use them across sessions, but this isn't an advertised feature yet (will probably be tied to some sort of "saving/publishing/sharing/loading pipelines" DX in the future). The main difference is they no longer contain encoded Buildkit pb.Definitions; they're more like recipes, rather than actual values.
4. As part of that new architecture we also cache queries within a session (structurally, by ID, not syntactically), so for your use case of potentially running the same query repeatedly, you could consider just running them all against the same session; the engine will coordinate to make sure it doesn't do any redundant work.

hoary mortar Jan 21, 2024, 7:16 PM

#

Thanks for the detailed explanation! Of course, I hadn't thought of the session.

brave turtle Jan 21, 2024, 7:23 PM

#

happy to help! It's a great question

hoary mortar Jan 21, 2024, 7:26 PM

#

For point 4, I figured that was the case - my concern is not being wasteful with open HTTP requests & coroutine objects on the client side. Maybe premature optimization, but I want to at least consider what direction I might need to go.

#

For point 3... are you effectively talking about consolidating queries & IDs into a single concept?

kindred spruce Jan 21, 2024, 7:58 PM

#

Here's an old issue where we discussed some possible future use cases of stable IDs: https://github.com/dagger/dagger/issues/3923

Like @brave turtle said, we're not actively working on those features, but now we have the foundation to build them.

GitHub

Save & load pipelines · Issue #3923 · dagger/dagger

Overview I propose adding a new feature to the Dagger API: saving & loading of pipelines. Save: snapshot the state of a pipeline, encode it as text, send it to the client Load: restore the stat...

brave turtle Jan 21, 2024, 7:59 PM

#

hoary mortar For point 3... are you effectively talking about consolidating queries & IDs int...

Kinda-sorta - IDs are like a query that always resolves to a single value and knows what type it expects to return, as opposed to a regular GraphQL query which can have many sub-selections and doesn't include any extra info about the expected result.

They also have a bit of extra metadata derived from the schema (meta / tainted) which is used to influence caching, plus some extra info required to keep the ID self-contained (module).

Here's their definition (currently protobuf): https://github.com/dagger/dagger/blob/f7d1805f812b4ff58a08611eef63d6bffa3e5cee/dagql/idproto/id.proto

As the engine parses and executes a query, each selection internally constructs an ID and uses it as a cache key (unless it's tainted), and that ID is automatically assigned to any newly created objects, and the id field just returns that ID. Technically this applies to scalar values too, you just never see those IDs externally because there's no way to query them.

I don't know if I'd frame it as consolidating the two, since it's more like they work in tandem, but it's true that we're applying extra leverage to how Dagger uses GraphQL in order to define how these new IDs work and automate them as much as possible. More on that here: https://github.com/dagger/dagger/tree/main/dagql#axioms

#

And yeah the issue Solomon linked above has some interesting context from before the implementation work began, I think it's nearly all still accurate
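As a rough illustration of "each selection internally constructs an ID and uses it as a cache key (unless it's tainted)", here is a toy model in Python. It is not the idproto definition linked above: ToyID, SessionCache, the JSON-plus-SHA-256 digest, and the rule that a child inherits its parent's tainted flag are all assumptions made for this sketch.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyID:
    """A chain of selections that resolves to exactly one typed value."""

    parent: Optional["ToyID"]
    field: str
    args: Tuple[Tuple[str, Any], ...]
    return_type: str
    tainted: bool = False

    def select(self, field: str, return_type: str,
               tainted: bool = False, **arguments: Any) -> "ToyID":
        # Sorting makes the ID structural: argument order does not matter.
        # Assumption of this sketch: taint propagates from parent to child.
        return ToyID(self, field, tuple(sorted(arguments.items())),
                     return_type, tainted or self.tainted)

    def digest(self) -> str:
        parent = self.parent.digest() if self.parent else ""
        payload = json.dumps([parent, self.field, self.args, self.return_type])
        return hashlib.sha256(payload.encode()).hexdigest()


class SessionCache:
    """Per-session cache keyed by ID digest; tainted IDs always re-run."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self.executions = 0

    def resolve(self, id_: ToyID, compute: Callable[[], Any]) -> Any:
        if id_.tainted:
            self.executions += 1
            return compute()
        key = id_.digest()
        if key not in self._values:
            self.executions += 1
            self._values[key] = compute()
        return self._values[key]
```

Two queries written differently but producing the same selection chain land on the same digest, so the second one is served from the cache; a tainted selection such as a host directory lookup bypasses it every time.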

kindred spruce Jan 21, 2024, 8:01 PM

#

By the way @brave turtle what's your current thought on supporting source pinning / locking in IDsv2?

#

I find myself often wishing I could "lock down" every external source in my project, without the knowledge or cooperation of the modules I'm using

#

For example before I board an airplane 😛

#

Do you see those kinds of features as tightly coupled to IDs, or decoupled?

brave turtle Jan 21, 2024, 8:10 PM

#

hmmmmm... will have to think about it some more. the biggest connection I can think of is with the "impure resolver returning pure value" pattern, e.g. container.from(address: "golang") returning an ID that actually contains container.from(address: "golang@sha256:..."). I guess what you'd want with source pinning is to specifically cache those impure resolvers which are not normally cached? 🤔

#

(also I cheated in that example, since even the from(address: "golang@sha256:...") ID would still be considered impure because it uses the same container.from field; this works better for host.directory => blob)

#

will have to dig up the previous explorations, I remember it being another rabbit hole, but haven't looked at it through the lens of IDs v2 yet. I could see it being decoupled, since this is more to do with how queries are submitted, and IDs are just an outcome of queries. But I could see us also choosing to couple it, if it ends up being convenient or expressive

kindred spruce Jan 21, 2024, 9:07 PM

#

I do like decoupled. Plenty of time to explore this topic later...

#

I'm excited that we got the plumbing in 😁

hoary mortar Jan 21, 2024, 11:12 PM

#

The knowledge that IDs stay alive within their session is enough for me to move forward with my SDK.

This idea of pinning certain objects to live beyond their enclosing session, though, is really intriguing.

The first thing that came to mind, regarding that question about "impure resolver returning pure value", is that "package-lock" a la NPM & Go is sort of an existing solution pattern for that problem pattern - snapshotting a particular mapping of an ambiguous "address" to a deterministic one. I'm not sure that host.directory => blob is different from the container @sha256 in that regard. You can look at the database of a Docker Registry as a big package-lock mapping. Whether we're talking about a container or a filesystem snapshot, both can be summarized (cache-addressed) by a sha256 of their contents. If the cache is dropped (which can happen with a container registry - particular @sha256 digests disappear from Docker.io unless they're pinned by a tag), by nature there's no guarantee that it can be re-created perfectly, because the "unambiguous address" in both cases isn't the sha256 but rather the combination of file-path + host filesystem (mine vs yours) + timestamp (now vs later after I changed the files), and that timestamp part is fleeting. That's specifically why you have this abstraction of intent-conveying named pins (tags) over the abstraction of cache-significant content-based addressing.
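The package-lock analogy can be sketched as a small pin table that records the first resolution of an ambiguous address and replays it afterwards. This is a hypothetical Python sketch: PinStore, the resolver callback, the offline flag and the fake registry are invented for illustration and are not Dagger APIs.

```python
import hashlib
from typing import Callable, Dict


class PinStore:
    """Lock-file style mapping from an ambiguous address to a pinned digest."""

    def __init__(self, resolve: Callable[[str], str]) -> None:
        self._resolve = resolve         # impure: may answer differently over time
        self.pins: Dict[str, str] = {}  # what a lock file would persist

    def lookup(self, address: str, offline: bool = False) -> str:
        if address in self.pins:
            # Already pinned: replay the recorded answer, no network needed.
            return self.pins[address]
        if offline:
            raise LookupError(f"{address!r} is not pinned and cannot be resolved offline")
        digest = self._resolve(address)
        self.pins[address] = digest
        return digest


def make_fake_registry() -> Callable[[str], str]:
    """Hypothetical resolver whose answer for a tag changes on every call."""
    counter = 0

    def resolve(address: str) -> str:
        nonlocal counter
        counter += 1
        content = f"{address}#{counter}".encode()
        return f"{address}@sha256:{hashlib.sha256(content).hexdigest()}"

    return resolve
```

Once an address has been looked up while online, the same digest comes back with offline=True, which is roughly the "before I board an airplane" case; an address that was never pinned fails loudly instead of silently resolving to something new.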

#

You could say that publish is that "pinning" feature in Dagger... but obviously it's less flexible. You can only publish containers. Saving an identifier for later reuse without the need to fully upload/copy it to a different storage mechanism is appealing. If you have a way to alias + pin a Dagger Object, then you basically have a more flexible superset of container "addresses", which makes me think, gee, what if I could transfer an object by ID from one Engine to another. If I can do that, well, what do I need a Docker Registry for anymore? For Compose, I suppose, but I have already been thinking about trying to replace Compose with Dagger. Partly because of annoying restrictions with YAML + shell scripts on that side that I had already escaped on the build-side with Dagger. Mostly because moving end-result containers from Dagger to Compose is a pain (requires registry or export/import blob or hacks) and I can't help but think wouldn't it be easier to just keep it running and serve to the host? It's kind of like, Dagger's API is just turning out to be a better abstraction of OCI than Docker's API, and the ability to pin and transfer object IDs opens up several opportunities to consolidate/shed a lot of the accumulated layers of orchestration and the friction of getting them to work together.
