#eager git ls-remote pulls *all refs*
1 messages ยท Page 1 of 1 (latest)
my take? we should drop this behavior - clone with --no-tags
this avoids the host of issues with that behavior (and i don't have a bunch of bugs to go fix in the future), and also means the pinning can entirely drop those.
all we'd need to keep now would be a Name and a SHA that we resolved to - instead of everything
In terms of pure lockfile size, the main issue seems to be that github servers include all pull requests, ever, in their list of refs
fyi - i think this is safe behavior to remove - no one actually noticed we were only returning it for branches + tags and not commits, until i stumbled across it just a couple weeks ago
so it seems that it's actually just accidental behavior
@smoky swift sorry I didn't understand what change you're proposing
sorry, suggesting removing the behavior where if you say git(url).branch(main).tree(depth=0) you'll get a tags that on the main branch
getting those tags is the default behavior of git clone and such
oops, see https://github.com/dagger/dagger/pull/11073 for a proper writeup of where i discovered this was a thing to be aware of
my suggestion is to officially remove this behavior
i think this would simplify the lockfile
i do have a proposal though about lockfiles ๐
i would really like to implement this through dagql. my idea would be that we could add a new "pin" field to return as part of IDs.
the server could record all IDs that have a pin, and write them to a file. the key of the pin would be a human readable version of the ID (e.g. container.from("alpine") or http("url"), the value would be specific to each handler.
a handler that returns a pin would be expected to read the pin back in (and handle that)
but then, a lockfile is just an id rewriter! it annotates ids with pin values it finds.
- this feels super powerful - not just for locking, we could even one day use this kind of logic to dynamically rewrite IDs - i could write a lock file that replaced my
container.from("alpine")with a host container. think "vendoring". - adding new things to the lockfile is just a matter of modifying the dagql handler - there's no special handling in
lockfile.go - it means we could eventually extend pinning to module functions ๐ e.g. i could write an s3 module, that allows pinning to a sha.
oh and of course:
4. better visualization in cloud! if it's part of an id, we can add that to the trace viewer, and show that it was pinned.
also this fits better with caching - otherwise, everywhere we use the lockfile session attachable we'll need to give per-client caching, or we'll inevitably be mixing up the lockfiles from different clients
adding the per-client cache keys could potentially make performance worse in some cases.
(fyi id rewriting is on my mind because i was doing something similar to this for live-dev work - see https://github.com/dagger/dagger/pull/11158/files#diff-de4bfdb139ab0adecbc1594294aaf84b5a1f3d19fb991fcf40f5e55f1d2584e7R94)
@smoky swift I'm interested in learning more. It's not immediately obvious how your proposal relates to mine.
adding new things to the lockfile is just a matter of modifying the dagql handler - there's no special handling in lockfile.go
This is already the case.
it means we could eventually extend pinning to module functions ๐ e.g. i could write an s3 module, that allows pinning to a sha.
Also planned in my proposal (note the module field, currently always set to core)
better visualization in cloud! if it's part of an id, we can add that to the trace viewer, and show that it was pinned.
I don't really understand the fundamental difference. In the case of container.from, the pinning should already show up in the trace (because container.from already recursively called itself with the new arguments. Maybe that's what you mean by "ID rewriting"?
I didn't do it this way for git and http, but it should be trivial to do.
So perhaps we're already talking about the same thing?
Re-reading your explanation, it looks like a slight generalization of my prototype, where my "hooks" could be simpler because they could rely on a low-level facility where you can attach a pin field to the ID. In which case: sure why not!
--> Since either way we'll need a stable file format and session attachable to actually read to and read from the lockfile, perhaps we can start with my less ambitious version, then seamlessly "upgrade" to the more powerful version based on ID pinning, with no breaking change in wire protocol or file format?
it would let us remove all the ls-remote refs that are currently making the file huge
i think yes, continue as if we hadn't had this discussion?
I'm gonna give my little low level hooks a spin later in my day today, since I now think we would need it for an initial prototype - without it, I'm worried we'd actually make perf worse because of the need to mixin the client cache keys everywhere.
we could make it show up in the trace as a pin, e.g. with a little pin icon if we wanted (not just showing it in the trace)
maybe I can have two demos for later ๐
On that part: as a short-term fix, would it make sense to keep your structure for git (which looks super clean) and just make the ref lookups lazy? Ie. only list remote tags when calling GitRepository.tags(), branches on .branches() etc?
That actually sounds like a robust long-term fix - just independent from any pinning system (which to be clear also sounds super cool)
yeah we could make ref lookups lazy - but the point of the original PR I did was to make it so we didn't do that ๐ข we used to do that
the problem was, that meant:
- we were sometimes doing lots of listing remote tags
- in between getting remote tags, the remote tags could change
but wouldn't that lookup be cached?
If so, then it's best of both worlds
you get a reliable snapshot for the duration of the session (same as eager fetch) but also, benefits of laziness
not really, we were just calling it everywhere, not through dagql
seems fixable though no?
yeahhh, I just don't see it as an easy fix, it's essentially a revert
we could do it, but there were other nice things about it, like better cache key computation
and simpler checking of is private (it used to be worse)
if it's necessary for the pinning? id just exclude the ls-remote from the pin, just put the git sha+ref for the git ref pin.
If we keep it the way it is, a viable mitigation is to hardcode an exception where we exclude the github PR refs from the list. At least for our repo (and most typical github repos) that should remove 99% of the file size. But it will still be an unreasonably large upfront download for eg. large enterprise repos
I don't think they're used for anything in GitRef logic ATM.
Even excluding the lock file, it's really hard to justify an upfront download of so much data
Yes it's necessary (as far as I can tel). Whatever data is not stored in the lockfile, will not be available at all in offline operations, even if it's in the cache. Every ref I remove from the list in the lockfile, is a ref that as far as the engine knows, does not exist (when loading from the lockfile)
Unless you can make that data a layer, so it can be retrieved from the cache
then I only need to store the digest of that file in the lockfile
that would completely decouple my lockfile size problem, from your "eager vs lazy" dilemma
Given its size, it kind of makes sense to make it an actual file
but then you can't share a lockfile between different engines
unless we do persistent objects
lemme have a think about this one
ok ๐
actually maybe even the lockfile can be removed from this equation
If you can make that ls-remote offlineable (eg. by making it a cached operation) then that's maybe all we need
or is it already, and I did my experiment wrong?
basically I turn off my wifi, then see what blows up, and investigate that part of the code ๐
that's how I found the "is it public" check