eager git ls-remote pulls *all refs* | Dagger | Page 1

glad granite Sep 30, 2025, 4:21 PM

#

Starting a thread 🧵

#

<@&946480760016207902>

smoky swift Sep 30, 2025, 4:22 PM

#

my take? we should drop this behavior - clone with --no-tags

#

this avoids the host of issues with that behavior (and i don't have a bunch of bugs to go fix in the future), and also means the pinning can entirely drop those.

glad granite Sep 30, 2025, 4:23 PM

#

(lockfile for reference https://gist.githubusercontent.com/shykes/8a7bb8f564072f7f59462c666fef2fc0/raw/abce16f31bad588da3c7be72255db237eea4592d/gistfile1.txt)

smoky swift Sep 30, 2025, 4:23 PM

#

smoky swift this avoids the host of issues with that behavior (and i don't have a bunch of b...

all we'd need to keep now would be a Name and a SHA that we resolved to - instead of everything

glad granite Sep 30, 2025, 4:24 PM

#

In terms of pure lockfile size, the main issue seems to be that github servers include all pull requests, ever, in their list of refs

smoky swift Sep 30, 2025, 4:24 PM

#

smoky swift all we'd need to keep now would be a `Name` and a `SHA` that we resolved to - in...

fyi - i think this is safe behavior to remove - no one actually noticed we were only returning it for branches + tags and not commits, until i stumbled across it just a couple weeks ago

#

so it seems that it's actually just accidental behavior

glad granite Sep 30, 2025, 4:27 PM

#

@smoky swift sorry I didn't understand what change you're proposing

smoky swift Sep 30, 2025, 4:57 PM

#

sorry, suggesting removing the behavior where if you say git(url).branch(main).tree(depth=0) you'll get a tags that on the main branch

#

getting those tags is the default behavior of git clone and such

smoky swift Oct 1, 2025, 11:15 AM

#

oops, see https://github.com/dagger/dagger/pull/11073 for a proper writeup of where i discovered this was a thing to be aware of

#

my suggestion is to officially remove this behavior

#

i think this would simplify the lockfile

#

i do have a proposal though about lockfiles 🙂
i would really like to implement this through dagql. my idea would be that we could add a new "pin" field to return as part of IDs.

the server could record all IDs that have a pin, and write them to a file. the key of the pin would be a human readable version of the ID (e.g. container.from("alpine") or http("url"), the value would be specific to each handler.

a handler that returns a pin would be expected to read the pin back in (and handle that)

#

but then, a lockfile is just an id rewriter! it annotates ids with pin values it finds.

#

this feels super powerful - not just for locking, we could even one day use this kind of logic to dynamically rewrite IDs - i could write a lock file that replaced my container.from("alpine") with a host container. think "vendoring".
adding new things to the lockfile is just a matter of modifying the dagql handler - there's no special handling in lockfile.go
it means we could eventually extend pinning to module functions 🙂 e.g. i could write an s3 module, that allows pinning to a sha.

#

oh and of course:
4. better visualization in cloud! if it's part of an id, we can add that to the trace viewer, and show that it was pinned.

#

also this fits better with caching - otherwise, everywhere we use the lockfile session attachable we'll need to give per-client caching, or we'll inevitably be mixing up the lockfiles from different clients
adding the per-client cache keys could potentially make performance worse in some cases.

#

(fyi id rewriting is on my mind because i was doing something similar to this for live-dev work - see https://github.com/dagger/dagger/pull/11158/files#diff-de4bfdb139ab0adecbc1594294aaf84b5a1f3d19fb991fcf40f5e55f1d2584e7R94)

glad granite Oct 2, 2025, 12:16 AM

#

@smoky swift I'm interested in learning more. It's not immediately obvious how your proposal relates to mine.

#

adding new things to the lockfile is just a matter of modifying the dagql handler - there's no special handling in lockfile.go

This is already the case.

#

it means we could eventually extend pinning to module functions 🙂 e.g. i could write an s3 module, that allows pinning to a sha.

Also planned in my proposal (note the module field, currently always set to core)

#

better visualization in cloud! if it's part of an id, we can add that to the trace viewer, and show that it was pinned.

I don't really understand the fundamental difference. In the case of container.from, the pinning should already show up in the trace (because container.from already recursively called itself with the new arguments. Maybe that's what you mean by "ID rewriting"?

I didn't do it this way for git and http, but it should be trivial to do.

So perhaps we're already talking about the same thing?

#

Re-reading your explanation, it looks like a slight generalization of my prototype, where my "hooks" could be simpler because they could rely on a low-level facility where you can attach a pin field to the ID. In which case: sure why not!

--> Since either way we'll need a stable file format and session attachable to actually read to and read from the lockfile, perhaps we can start with my less ambitious version, then seamlessly "upgrade" to the more powerful version based on ID pinning, with no breaking change in wire protocol or file format?

smoky swift Oct 2, 2025, 6:28 AM

#

glad granite <@488718750690967563> I'm interested in learning more. It's not immediately obvi...

it would let us remove all the ls-remote refs that are currently making the file huge

smoky swift Oct 2, 2025, 6:30 AM

#

glad granite Re-reading your explanation, it looks like a slight generalization of my prototy...

i think yes, continue as if we hadn't had this discussion?

I'm gonna give my little low level hooks a spin later in my day today, since I now think we would need it for an initial prototype - without it, I'm worried we'd actually make perf worse because of the need to mixin the client cache keys everywhere.

smoky swift Oct 2, 2025, 6:31 AM

#

glad granite > better visualization in cloud! if it's part of an id, we can add that to the t...

we could make it show up in the trace as a pin, e.g. with a little pin icon if we wanted (not just showing it in the trace)

#

maybe I can have two demos for later 😛

glad granite Oct 2, 2025, 6:32 AM

#

smoky swift it would let us remove *all* the ls-remote refs that are currently making the fi...

On that part: as a short-term fix, would it make sense to keep your structure for git (which looks super clean) and just make the ref lookups lazy? Ie. only list remote tags when calling GitRepository.tags(), branches on .branches() etc?

#

That actually sounds like a robust long-term fix - just independent from any pinning system (which to be clear also sounds super cool)

smoky swift Oct 2, 2025, 6:34 AM

#

yeah we could make ref lookups lazy - but the point of the original PR I did was to make it so we didn't do that 😢 we used to do that

#

the problem was, that meant:

we were sometimes doing lots of listing remote tags
in between getting remote tags, the remote tags could change

glad granite Oct 2, 2025, 6:35 AM

#

smoky swift the problem was, that meant: 1. we were sometimes doing *lots* of listing remote...

but wouldn't that lookup be cached?

#

If so, then it's best of both worlds

#

you get a reliable snapshot for the duration of the session (same as eager fetch) but also, benefits of laziness

smoky swift Oct 2, 2025, 6:36 AM

#

not really, we were just calling it everywhere, not through dagql

glad granite Oct 2, 2025, 6:37 AM

#

seems fixable though no?

smoky swift Oct 2, 2025, 6:37 AM

#

yeahhh, I just don't see it as an easy fix, it's essentially a revert

#

we could do it, but there were other nice things about it, like better cache key computation

#

and simpler checking of is private (it used to be worse)

#

if it's necessary for the pinning? id just exclude the ls-remote from the pin, just put the git sha+ref for the git ref pin.

glad granite Oct 2, 2025, 6:39 AM

#

If we keep it the way it is, a viable mitigation is to hardcode an exception where we exclude the github PR refs from the list. At least for our repo (and most typical github repos) that should remove 99% of the file size. But it will still be an unreasonably large upfront download for eg. large enterprise repos

smoky swift Oct 2, 2025, 6:40 AM

#

smoky swift if it's necessary for the pinning? id just exclude the ls-remote from the pin, j...

I don't think they're used for anything in GitRef logic ATM.

glad granite Oct 2, 2025, 6:40 AM

#

Even excluding the lock file, it's really hard to justify an upfront download of so much data

glad granite Oct 2, 2025, 6:41 AM

#

smoky swift if it's necessary for the pinning? id just exclude the ls-remote from the pin, j...

Yes it's necessary (as far as I can tel). Whatever data is not stored in the lockfile, will not be available at all in offline operations, even if it's in the cache. Every ref I remove from the list in the lockfile, is a ref that as far as the engine knows, does not exist (when loading from the lockfile)

smoky swift Oct 2, 2025, 6:43 AM

#

mm

#

fuck

#

kinda wish we'd had this discussion two weeks ago

glad granite Oct 2, 2025, 6:43 AM

#

Unless you can make that data a layer, so it can be retrieved from the cache

#

then I only need to store the digest of that file in the lockfile

#

that would completely decouple my lockfile size problem, from your "eager vs lazy" dilemma

#

Given its size, it kind of makes sense to make it an actual file

smoky swift Oct 2, 2025, 6:44 AM

#

but then you can't share a lockfile between different engines

#

unless we do persistent objects

#

lemme have a think about this one

glad granite Oct 2, 2025, 6:45 AM

#

ok 🙂

#

actually maybe even the lockfile can be removed from this equation

#

If you can make that ls-remote offlineable (eg. by making it a cached operation) then that's maybe all we need

#

or is it already, and I did my experiment wrong?

#

basically I turn off my wifi, then see what blows up, and investigate that part of the code 🙂

#

that's how I found the "is it public" check

smoky swift Oct 2, 2025, 6:47 AM

#

either - we kinda fundamentally rework how git works and fetches data, or we find a better way to pin it, orrrr, we say git can't be offlineable

#

off to bed with you btw

#eager git ls-remote pulls *all refs*

#eager git ls-remote pulls all refs