#SDK initialization performance 🧵

1 messages · Page 1 of 1 (latest)

unique patrol
#

👋 we've noticed that SDK initialization (TS and Python mostly) generally take a while when during the inner dev loop of a pipeline. Seems like most of the time is going in re-installing cached dependencies and running the codegen process.

Typscript seems to be particularly bad as each re-evaluation generally takes around ~20s on a modern machine with good internet connection. Python is generally around 6-10s.

This has also been raised by some our community members as well https://discord.com/channels/707636530424053791/1274170967815356468

Starting this thread as a joint collaboration between @random meadow @gaunt tinsel and @unique jungle to assess our options to improve this as it's huring our user's DX 🙏

lament parrot
#

small tangent: i'm also working on https://github.com/dagger/dagger/pull/8201
i think this is more of an internal implementation detail (that fixes some bugs), but potentially if you're looking into optimizations, might be interesting to just be aware of
(note, this new SDK.init step won't do dependency fetching, it's essentially just the bootstrapping of the template)

random meadow
#

Hey! Thanks for this thread!

I'll do some tests tomorrow to try to improve the cache hit, starting with what you suggested (I repost the convo here to group everything)

Marcos' question:
not sure if you were aware but checking some SDK init performance things today with Andrea we realized that the packageManager typescript feature seems to be adding quite some overhead to the inner dev loop
as it seems like corepack enable and all that doesn't seem to be cached
so it's downloading yarn every time
https://dagger.cloud/marcos-test/traces/9f3af4a7f923e36271937fbd72591067?span=b585895bfb0474b8
---- My answer:
It's strange that corepack enable isn't cache though, it's one of the first step done by the init
Normally it should be done only once, however there not much workaround there because the feature needs corepack to support both yarn & pnpm
11.6s is also pretty fast compared to what we had before btw!
But yeah, I would love to cache this operation too! I don't understand why it's not because there's not much steps before, maybe it's because I add the src directory to get the packageManager field, I'll see if there's a way to add the sources later maybe and only select the package.json

lament parrot
#

it does definitely seem that even for go we have a lot of issues around caching during init, not quite sure why specifically we see those

unique jungle
random meadow
unique jungle
#

@random meadow with @unique patrol we also noticed the runtime being called multiple times for a single call — but that’s unrelated to TS in particular I think

random meadow
unique jungle
#

@random meadow but it looks like we’re calling codegen even when the code didn’t change … I think

#

Or it looks like it’s not properly cached

random meadow
#

I think it's not properly cached yes

#

Once I finish splitting Test Module, I'll check on it

random meadow
#

@unique patrol Any news or PR to share related to performances? I can have a look if it changes TS runtime

unique patrol
unique jungle
#

@random meadow 👋

Any idea why we're calling /runtime twice in asModule? The first one takes 5s ... is it transpiling or something?

unique jungle
#

@random meadow Ok here is a worse problem:

https://dagger.cloud/dagger/traces/fefb70456e44a55fad89198adf9ef3ac?span=845b4c6062bbfaba

Initialize is taking 11s

installing module / initialize / exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts --> 1.4s

analyzing module / initialize / exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts --> 1.4s

Actually, the overall problem is: "installing module" and "analyzing module" are both calling .asModule + .initialize. Initialize is taking 2.1s and 1.6s -- I don't understand why it's not cached?

/cc @buoyant latch @unique patrol

#

Also, it looks like it's the exact same parent -- not sure why it's not entirely cached?

mod := modConf.Source.AsModule().Initialize()

serveCtx, serveSpan := Tracer().Start(ctx, "installing module", telemetry.Encapsulate())
err = mod.Serve(serveCtx)
...

ctx, loadSpan := Tracer().Start(ctx, "analyzing module", telemetry.Encapsulate())
...
name, err := mod.Name(ctx)
#

This is what happens -- the second modConf.Source.AsModule().Initialize() call takes 500ms (AsModule) + 1.6s (Initialize)

random meadow
#

Normally the 2nd call should be cached since the source didn't changed.. Or maybe the cache is invalidated because of the codegen...

#

Like installing module -> codegen -> analyzing module (use the user's source + codegen which burst the cache)

random meadow
#

Btw if you want, we can work in pair together on the TS runtime, might be better in a call

unique jungle
random meadow
#

Oh okay!

unique jungle
#

I think there's multiple problems: 1) TS specific problem of corepack not being cached + not using cache volumes + 2) generic caching problem with modules, making the TS problems worse (but affecting all SDKs)

#

@random meadow but yeah happy to sync up! About to get lunch, later today works for you?

unique jungle
random meadow
unique jungle
#

trying to automate having some charts of init time per SDK so we can check regressions etc

random meadow
unique jungle
#

@random meadow got some time for a quick call?

random meadow
#

Sure!

unique jungle
lament parrot
unique jungle
#

@lament parrot 🙏

#

@lament parrot hmmm ... looks like:

  1. there's a bunch of logic tied to the digest (including how we name the container)
  2. there's a side effect in checking the digest, which is using that for analytics

I'll add instrumentation so we'll have the data and can later decide

@modern halo @worn snow how dependent are we on the "registry IPs" for analytics? Turns out (to be verified) we're slowing down every dagger call by ~1s because of the registry ping

lament parrot
#

yeah, i think removing/caching the digest feels scary - ideally we should bake it in during release, but yeah, might need to change how we analytics it

#

although - essentially right now, every dagger call/functions/query is turning into a "pull"

#

so that's not super useful

unique jungle
#

@lament parrot there's a chicken and egg problem there

lament parrot
#

yeah 😦 cli in engine makes things harder

#

but not impossible 😄 (the digests already don't match, since the outer one uses go-releaser, and the inner one doesn't - which is it's own problem, but not one super easily solvable right now until we decide to rip out goreleaser)

unique jungle
#

wondering if a "simple" solution would be:

  • use tags rather than digests
  • do a registry lookup in the background to verify the local and remote shas match. Issue a warning/error/whatever if they don't. Serves 2 purposes:
  1. adds a (soft) safety check for tag mismatch
  2. keeps analytycs flowing
modern halo
lament parrot
#

There's also the security argument for having the absolute digest - without that, a bad actor could potentially hijack our registry and upload bad images - but if we bake the digest to the cli, that entire issue can be avoided and protected against

#

But yes - this would affect the pulls to only occur once per machine, instead of each call, so potentially might screw up analytics there (but also probably to look more accurate, and align with use in kubernetes and similar environments imo)

unique jungle
#

@modern halo @random meadow @unique patrol @lament parrot Ok I just created https://github.com/dagger/dagger/pull/8250 to automatically (nightly and on workflow dispatch) run a benchmark on each SDK (dagger init, functions, change some code, functions again)

#

I created a dashboard on Honeycomb to keep track of performance

buoyant latch
# unique jungle <@281874480651829250> Ok here is a worse problem: https://dagger.cloud/dagger/t...

It should be cached on the second call. Are we 100% sure it's actually running twice though? If you expand out the traces then in the first one (installing module) there's a bunch of spans under "initialize". If you expand out the second one (analyzing module) then there's just a single one (exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts) with the exact same runtime as the previous call

#

Just double checking it's actually running twice as opposed to showing the runtime of the cached call that ran before

unique jungle
buoyant latch
#

If it is actually running twice then yeah something non-deterministic is happening

unique jungle
buoyant latch
#

Could also just change the code locally to make one of those calls many times in a loop and see if it's actually taking real-world clock time or not

#

But yeah, if it's actually executing "expensively" each time then I bet something non-deterministic is going into the cache key. In that case, I would be curious if the Go SDK has the same behavior or not. The Go SDK is simpler, so wondering if it avoids this

woven sage
#

the little invader icons are different between the two calls, so it doesn't look like it's running twice to me - they're different somehow

#

initialize 1 ID: ChV4eGgzOjRiMGE3ZmQ0MTg2ZDRjMDgSXwoVeHhoMzo0YjBhN2ZkNDE4NmQ0YzA4EkYKFXh4aDM6ZGM4MDFhMjUxYjAwZjdhMRIKCgZNb2R1bGUYARoKaW5pdGlhbGl6ZUoVeHhoMzo0YjBhN2ZkNDE4NmQ0YzA4EoIBChV4eGgzOmRjODAxYTI1MWIwMGY3YTESaQoVeHhoMzo4NjNiMzg4MjcxNTg4NGQyEgoKBk1vZHVsZRgBGgp3aXRoU291cmNlIiEKBnNvdXJjZRIXChV4eGgzOjBjM2QwNTBjNjBjOTFhNjJKFXh4aDM6ZGM4MDFhMjUxYjAwZjdhMRJEChV4eGgzOjg2M2IzODgyNzE1ODg0ZDISKxIKCgZNb2R1bGUYARoGbW9kdWxlShV4eGgzOjg2M2IzODgyNzE1ODg0ZDISjwEKFXh4aDM6MGMzZDA1MGM2MGM5MWE2MhJ2ChV4eGgzOmE0MTllZjBjODIxMDQyOWISEAoMTW9kdWxlU291cmNlGAEaFHdpdGhDb250ZXh0RGlyZWN0b3J5Ih4KA2RpchIXChV4eGgzOmNkMDE4MGQ1YjdlMGM4NTZKFXh4aDM6MGMzZDA1MGM2MGM5MWE2MhJkChV4eGgzOmE0MTllZjBjODIxMDQyOWISSxIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIhIKCXJlZlN0cmluZxIFOgMuLy5KFXh4aDM6YTQxOWVmMGM4MjEwNDI5YhLAAgoVeHhoMzpjZDAxODBkNWI3ZTBjODU2EqYCEg0KCURpcmVjdG9yeRgBGgRibG9iIlMKBmRpZ2VzdBJJOkdzaGEyNTY6YjBhOTcyYTA4ODlkMTBlYjA3OGY4MTJmNDVmMDczNTQ2NWZkOGExNDkzOWY2ZDk4ZmNmOTc4MjlkZDFmNjQ3OSI6CgltZWRpYVR5cGUSLTorYXBwbGljYXRpb24vdm5kLm9jaS5pbWFnZS5sYXllci52MS50YXIrenN0ZCIMCgRzaXplEgQoy4kFIlkKDHVuY29tcHJlc3NlZBJJOkdzaGEyNTY6ZTUyYWI5MTVmZWNmNDcyNTAyMzU3MjJkODE1YzZjZDEwMDIyYWE3YjQzYjFjMGI4OGNjODIwZjgzYWZhNGM4YkoVeHhoMzpjZDAxODBkNWI3ZTBjODU2
initialize 2 ID: ChV4eGgzOjM0YmQ1Zjg4MjllZDg0OTMSXwoVeHhoMzozNGJkNWY4ODI5ZWQ4NDkzEkYKFXh4aDM6MWIzZmM3YWNhZjUyODlhMxIKCgZNb2R1bGUYARoKaW5pdGlhbGl6ZUoVeHhoMzozNGJkNWY4ODI5ZWQ4NDkzEoIBChV4eGgzOjFiM2ZjN2FjYWY1Mjg5YTMSaQoVeHhoMzo4NjNiMzg4MjcxNTg4NGQyEgoKBk1vZHVsZRgBGgp3aXRoU291cmNlIiEKBnNvdXJjZRIXChV4eGgzOjhmMTAyMTk3ZTFhYmMxMWNKFXh4aDM6MWIzZmM3YWNhZjUyODlhMxJEChV4eGgzOjg2M2IzODgyNzE1ODg0ZDISKxIKCgZNb2R1bGUYARoGbW9kdWxlShV4eGgzOjg2M2IzODgyNzE1ODg0ZDISjwEKFXh4aDM6OGYxMDIxOTdlMWFiYzExYxJ2ChV4eGgzOmE0MTllZjBjODIxMDQyOWISEAoMTW9kdWxlU291cmNlGAEaFHdpdGhDb250ZXh0RGlyZWN0b3J5Ih4KA2RpchIXChV4eGgzOjk2YWNkNjcyZDEzYTU1ZDZKFXh4aDM6OGYxMDIxOTdlMWFiYzExYxJkChV4eGgzOmE0MTllZjBjODIxMDQyOWISSxIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIhIKCXJlZlN0cmluZxIFOgMuLy5KFXh4aDM6YTQxOWVmMGM4MjEwNDI5YhLAAgoVeHhoMzo5NmFjZDY3MmQxM2E1NWQ2EqYCEg0KCURpcmVjdG9yeRgBGgRibG9iIlMKBmRpZ2VzdBJJOkdzaGEyNTY6MTBmMDRjOTNhZDJjYThjOWYyYjNmMGUwNDVkOWQ0NzlhZWUwZWNkNGIwYjcxYWM2MTIxNjFlMDM2YTBlMTJjZCI6CgltZWRpYVR5cGUSLTorYXBwbGljYXRpb24vdm5kLm9jaS5pbWFnZS5sYXllci52MS50YXIrenN0ZCIMCgRzaXplEgQoy4kFIlkKDHVuY29tcHJlc3NlZBJJOkdzaGEyNTY6YWI4NmU2MWJjNzQyZDRjZWQ0YjQyYzg4YmM3YWY4MWZhNWRmNjU0YjlhMjY5MjUxZmViMTkwM2NmZjEwZmUxM0oVeHhoMzo5NmFjZDY3MmQxM2E1NWQ2

unique jungle
#

Pretty late here, I'll do some digging tomorrow

@buoyant latch I added some automated benchmarking (see above) to get a baseline

Did notice a few weird things already, @random meadow opened a PR (https://github.com/dagger/dagger/pull/8236) but I think it's more widespread than just the TS SDK

woven sage
buoyant latch
woven sage
#

this might be because resolveFromCaller (or something related) is marked impure, so it runs twice, and we actually sync files multiple times - it's believable that something changes in the meantime, or that it's not 100% deterministic

#

(btw you can click the invaders to copy the full ID to your clipboard, which is how I debugged that)

buoyant latch
#

I'd suspect non-determinism

woven sage
#

it might be worth reconsidering whether these should even be marked impure - it only affects session-local query caching at the moment, maybe keeping those 'pure' is a saner choice for how it's currently used

#

(gets back to the idea of "degrees of impurity" concept we talk about now and then)

buoyant latch
#

Well it would break our integ tests since we have nested execs loading different modules from the same path in the same session.

But like Ive been saying I am becoming more biased towards just moving away from that model since it seems to cause more problems than it solves, so in that case yes we could get away with making it pure

#

I suspect another part of the problem is the fact that the sdks don’t interact with impure apis well. I opened an issue about this a month or two ago, on mobile so will find when back in a few hours

#

But also the blobs should be deterministic

#

So probably multiple levels of issues

buoyant latch
#

Oh apparently that issue I mentioned above about things not working with impure IDs was fixed by Helder: https://github.com/dagger/dagger/issues/7788

I still am surprised that the impure API runs twice (might be another variation on that problem not handled yet), but it's only 31ms second time so doesn't matter right now. The fact that the blob digest is non-deterministic seems like the main overhead. Wouldn't be surprised if that is causing all sorts of slowness all over the place

GitHub

A tangential issue I noticed during some adventures with #7767 At least in the Go SDK, the generated Sync method will just return the query it has built so far, e.g. dagger/sdk/go/dagger.gen.go Lin...

#

Will take a look once I pivot to performance stuff after deflaking

#

If anyone else wants to look at that in the meantime I can try to give some pointers on where to look at what's going wrong

unique jungle
unique jungle
#

@gaunt tinsel btw -- I'm noticing that while Python is faster than Typescript, it looks like it takes roughly the same time whether the source code has changed or not (hinting that maybe there's a caching issue?)

unique jungle
unique jungle
unique jungle
#

@buoyant latch weirdly it's not happening with typescript (and not once but 3 times in a row, same results with typescript and python)

I'm thinking this is a determinstic non-determinstic issue 🙂

unique jungle
#

but I'm looking at a small sample size (N=3), so this might be pure coincidence

buoyant latch
#

I may start poking around at this today. Still working on the flakes but even though the flakes happen first try on main I am going like 5+ runs in a row of everything passing in the PR I'm using to debug (🤬 🤬 🤬 )

So may as well look at this locally while waiting for those to run.

buoyant latch
#

But as a first step I'll just see if there's anything obvious in the filesync that leads to non-determinism in the contents that are loaded. If so, fix that and see how the behavior changes

unique jungle
buoyant latch
unique patrol
lament parrot
#

I can't get it to happen on that pr

#

But it happens consistently on main 😭

buoyant latch
unique jungle
lament parrot
#

I have seen it occasionally in other pr environments, taunting me

buoyant latch
lament parrot
#

Which one are you looking at? The codegen Basic/interface confusion one?

buoyant latch
lament parrot
#

I also wonder if there's something with me doing a bunch of merges at the same time (common morning task for me, merge everything that needs doing)

#

So again, might just be a capacity thing? Mayyyybe?

buoyant latch
#

The engines are single-tenant so I would have thought that was less of a thing

#

But yes, who knows

lament parrot
unique patrol
#

maybe it's a mistake in our telemetry about where the time is effectively being spent on?

gaunt tinsel
gaunt tinsel
lament parrot
#

hm, potentially? i know go build can occassionally be non-deterministic, but what's ideally, the codegen shouldn't be running again at all right? which would imply some input is non-deterministic?

gaunt tinsel
lament parrot
#

does python have an embed where typescript doesn't?

gaunt tinsel
#

Yep, 2 of them.

#

Simple files. Template for main.py and Dockerfile.

lament parrot
#

i know if you have a embed.FS, it captures the file metadata into the binary as well

#

i wouldn't expect a string to be different with this

gaunt tinsel
#

Really missing viz on which inputs cause a cache miss.

lament parrot
#

if you have a dnm merge pr, you could potentially take the inputs, and print out their contents / print .Digests

#

i know @woven sage was looking at resurrecting cached labels, not sure where that is right now (but also it's holiday so not urgent)

unique jungle
gaunt tinsel
unique jungle
unique jungle
unique jungle
lament parrot
#

yeah, because of the networking incident I think this goes up in priority imo

worn snow
#

doesn't answer the question, but pointing out that I reported this in detail 90 days ago and got crickets

#

Losing the telemetry will hurt, but we do have other sources I guess

#

@unique jungle Is there a way to make the lookup async and best effort?

lament parrot
worn snow
unique jungle
unique jungle
#

Currently the result is used to name the container using the image digest. In order to stop blocking, we’d have to name the container against the (immutable) tag. We can still do an asynchronous registry lookup but by then the container will be long created, so the result of the lookup goes into the trash

lament parrot
#

^ with this suggestion - if we're (ab)using the registry lookup for analytics, and not actually giving any functionality for it, it feels like it should be respecting the DO_NOT_TRACK env

#

not a blocker to the suggestion, i think it's good as a quickfix, but i don't think we should leave it there longterm

worn snow
#

Yeah I thought it was used to check for a new version, and trigger an update. If that's the case then the information could be stored locally async, and the uodate done best effort, either in that call or the next one.

woven sage
# gaunt tinsel Really missing viz on which inputs cause a cache miss.

Unless I'm missing a more obvious way of doing this, I think we could implement it by building on the same "stable function digest" idea we're planning to use for analytics/insights. Basically just store which input ID digests are seen for a given stable digest, and mark any who haven't been seen before so the UI can highlight them. That state would have to be kept on the engine side, but it's not much data (just digest relationships)

(edit: linearize'd - https://linear.app/dagger/issue/DEV-2925/why-was-my-pipeline-not-cached#comment-f05bbc31)

#

The other thing that's coming back is the CACHED state, but that doesn't tell you who busted the cache, just whether the cache was busted

unique jungle
unique jungle
#

@random meadow Benchmark ran again ... we're saving about ~7 seconds in typescript when the code changes

random meadow
#

So my changes are nice?

unique jungle
#

yep

lament parrot
#

🎉

unique jungle
#

[note: 16 vs 14 might be random]

before

  • init: 28 seconds
  • functions: 16 seconds
  • functions (modified code): 16 seconds
  • functions (cached): <1s

after:

  • init: 28 seconds
  • functions: 14 seconds
  • functions (modified code): 9 seconds
  • functions (cached): <1s
random meadow
unique jungle
#

tldr: before, first time == modified time. Now, modified < first time, so we're re-using some cache

random meadow
#

That's strange the init is the same though, it should be faster 😮

#

Oh maybe you're only testing with yarn?

#

It's a special case where I actually need to download dependencies to generate the lockfile

unique jungle
#

then 39s the second time

#

but those are one offs

random meadow
#

Could you add some benchmark for pnpm, npm & bun?

#

It would also be useful for our users

#

and yarn v4 is you feel it too 😉

#

(So many runtime & package manager in ts, sorry)

unique jungle
#

how?

#

which is affecting all SDKs, not just TS

unique jungle
random meadow
#

Or set the packageManager to npm|yarn|pnpm field, that should work too

unique jungle
#

@random meadow btw -- I'm seeing 45+ seconds on dagger initfor typescript

random meadow
#

I'll work on a benchmark

unique jungle
#

Does it make sense to default to yarn vs npm?

random meadow
#

We wanna use the faster basically

#

Yarn was supposed to be the fastest but not in our case, because we start from a clean state in our benchmark, it's faster when things are cached though, we could assert that with specific benchmarks

#

That's why I want to add some in your benchmark workflow

random meadow
unique jungle
#

replied

#

Please use honeycomb when looking at that data

#

(and update the dashboard with the new data points)

random meadow
#

PR updated, how can I try this workflow?

random meadow
#

@unique jungle Need your help to run the test, I'm not sure it's possible from my PR directly with the permission I have

random meadow