#@justin do you have time this week to

1 messages · Page 1 of 1 (latest)

rain lantern
#

In terms of maybe quick wins, when I was looking into the perf of smartevent module loading w/ @zinc magnet the other day I saw that half the total execution time was spent calling git ls-remote --symref https://github.com/dagger/dagger 4 times in the same session (each call taking ~1s). This was in the case where everything turned out to be cached, but still needed to call ls-remote to check everything and see the cache hit.

Seems like the result of ls-remote should be cached for a given URL within the session

verbal birch
#

Oh yeah that would definitely help

#

Separately, we discussed the possibility of making ls-remote lazy, while still avoiding the issues that required making it eager in the first place (sorry my recollection of the details is vague).

I remember it was related to auth - calling ls-remote eagerly allows determining right away whether a repo is auth-protected, which in turn has implications for caching of future operations. But, I believe we discussed this in the cab in Oceanside, and concluded that there was a clean way to make ls-remote lazy after all

zinc magnet
#

It's possible that merging Justin's https://github.com/dagger/dagger/pull/11221 could help by allowing users to pin gitrefs while specifying a branch/tag which greatly improves the git fetching performance because if you only specify git commit, if it's not in cache, it has to fetch all the branches, on bigger repos it adds up. This is to improve cold start performance, which is very different from what Erik was saying about ls-remote on warm performance.

verbal birch
#

ls-remote is expensive by itself regardless of other fetching, because it includes metadata about all PRs ever

zinc magnet
#

@rain lantern wdyt of simply lazifying ls-remote which currently runs on each git() operation ? With justin's PR 112211 mentioned above, a well specified gitref (branch+commit), should never require an ls-remote at all.

EDIT: sorry, i just realized it's precisely what Solomon already said above barring the auth issues.

rain lantern
#

@timid stump can remind us of what the details were there

timid stump
#

so there's a couple reasons it was nice to eagerly evaluate this (just track where repo.Remote is used in gitSchema.git):

  • we can use this as a recomputed ID of the return value - this means that we'll cache the results of tags/branches/etc. it also means that if we do multiple calls to tags/branches, we won't need to redo them because we have a stable object. if we lazify the ls-remote, then we'd want to keep the property of not re-running ls-remote if you run multiple calls to it. i think this will mean doing manual GetOrInitialize calls into the dagql cache, like we do for module loading.
  • if you specify a ref github.com/dagger/dagger, we need to dynamically discover if we should access this over SSH or HTTPS - see https://github.com/dagger/dagger/pull/10960. the "cheapest" operation to discover if a repo is accessible seems to be ls-remote. so in this case, we need to make sure we're not making excess ls-remote calls if we've done this path and discovered it exists. making this case lazy is fiddly, but maybe possible? we would need to lazily get the GitRepository.URL which is a pain.
  • the "auth" thing. if the repo is private, we want to avoid another client from getting access to any info about that repo (mainly the contents, but any kind of info like tags could also be an issue). this is made significantly more complicated because of dynamic credential loading from the client.

however, i think there should be good tests for all of this... if nothing breaks with a refactor it should be good

timid stump
rain lantern
# timid stump hm, interesting. currently it *should* be cached per client?

The dagger git API itself is but that can be called with different args which all result in the exact same ls-remote call being made. e.g. these all occur in the same trace (while loading every module in the dagger repo):

so while it does make sense that git re-runs for each of those since they are technically different args, it does feel wasteful that the exact same ls-remote cmd is run repeatedly in the same session

timid stump
#

mmm that does move towards just manually using the GetOrInitialize call then to me

#

then we can cache through that

#

instead of relying purely on the graphql call chain

rain lantern
timid stump
#

ngl, my suggestion is going to again be that i should not pick this up

#

like the internal api feels like a quick fix, but i feel like if doing that, it might be worth just lazying the whole thing

#

but that feels longer than the couple days i have

verbal birch
#

Yeah it seems tight. A context dump is more realistic, and appreciated.

timid stump
#

in terms of a context dump