SDK initialization performance 🧵 | Dagger | Page 1

unique patrol Aug 21, 2024, 3:40 PM

#

👋 we've noticed that SDK initialization (TS and Python mostly) generally takes a while during the inner dev loop of a pipeline. Seems like most of the time is going into re-installing cached dependencies and running the codegen process.

TypeScript seems to be particularly bad, as each re-evaluation generally takes around 20s on a modern machine with a good internet connection. Python is generally around 6-10s.

This has also been raised by some of our community members: https://discord.com/channels/707636530424053791/1274170967815356468

Starting this thread as a joint collaboration between @random meadow @gaunt tinsel and @unique jungle to assess our options to improve this, as it's hurting our users' DX 🙏

lament parrot Aug 21, 2024, 3:51 PM

#

small tangent: i'm also working on https://github.com/dagger/dagger/pull/8201
i think this is more of an internal implementation detail (that fixes some bugs), but potentially if you're looking into optimizations, might be interesting to just be aware of
(note, this new SDK.init step won't do dependency fetching, it's essentially just the bootstrapping of the template)

random meadow Aug 21, 2024, 3:53 PM

#

Hey! Thanks for this thread!

I'll do some tests tomorrow to try to improve the cache hit rate, starting with what you suggested (I'm reposting the convo here to group everything)

Marcos' question:
not sure if you were aware but checking some SDK init performance things today with Andrea we realized that the packageManager typescript feature seems to be adding quite some overhead to the inner dev loop
as it seems like corepack enable and all that doesn't seem to be cached
so it's downloading yarn every time
https://dagger.cloud/marcos-test/traces/9f3af4a7f923e36271937fbd72591067?span=b585895bfb0474b8
---- My answer:
It's strange that corepack enable isn't cached though, it's one of the first steps done by the init
Normally it should be done only once, but there's not much of a workaround because the feature needs corepack to support both yarn & pnpm
11.6s is also pretty fast compared to what we had before btw!
But yeah, I would love to cache this operation too! I don't understand why it isn't, because there aren't many steps before it. Maybe it's because I add the src directory to get the packageManager field; I'll see if there's a way to add the sources later and only select the package.json

lament parrot Aug 21, 2024, 3:54 PM

#

it does definitely seem that even for go we have a lot of issues around caching during init, not quite sure why specifically we see those

unique jungle Aug 21, 2024, 3:59 PM

#

random meadow Hey! Thanks for this thread! I'll do some tests tomorrow to try to improve the ...

1) looks like corepack is not using cache volumes
2) it’s not layer cached at all when changing module source. Probably depends on the full source code?

random meadow Aug 21, 2024, 4:01 PM

#

unique jungle 1) looks like core pack is not using cache volumes 2) it’s not layer cached at...

Yes, that's why I think I can improve that by choosing specific files on init and only including the user's source when it's really necessary
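
A minimal sketch of why selecting specific files could help, assuming the cache key of a step is derived from the contents of whatever files are mounted into it. The `cacheKey` helper, the file map, and the `packageManager` value are illustrative only, not the engine's actual cache logic:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey hashes the selected files in sorted path order, so the key depends
// only on the included paths and their contents (hypothetical helper).
func cacheKey(files map[string]string, include func(path string) bool) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		if include(p) {
			paths = append(paths, p)
		}
	}
	sort.Strings(paths)

	h := sha256.New()
	for _, p := range paths {
		// Length-prefix each field so ("ab","c") and ("a","bc") hash differently.
		fmt.Fprintf(h, "%d:%s%d:%s", len(p), p, len(files[p]), files[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func onlyPackageJSON(path string) bool { return path == "package.json" }
func everything(string) bool           { return true }

func main() {
	before := map[string]string{
		"package.json": `{"packageManager":"yarn@1.22.22"}`, // example value
		"src/index.ts": "export const a = 1",
	}
	after := map[string]string{
		"package.json": before["package.json"],
		"src/index.ts": "export const a = 2", // user edited the module source
	}

	fmt.Println("full-source key changed: ", cacheKey(before, everything) != cacheKey(after, everything))
	fmt.Println("package.json key changed:", cacheKey(before, onlyPackageJSON) != cacheKey(after, onlyPackageJSON))
}
```

With the whole source as input, any edit invalidates the step; with only package.json selected, the key stays stable until the package manager configuration itself changes.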

unique jungle Aug 21, 2024, 6:09 PM

#

random meadow Yes, that why I think I can improve that by choosing specific files on init and ...

Also I think corepack is caching to a different directory than the ones we’re caching — meaning that it’s redownloading everything from scratch each time

#

@random meadow with @unique patrol we also noticed the runtime being called multiple times for a single call — but that’s unrelated to TS in particular I think

random meadow Aug 21, 2024, 9:56 PM

#

unique jungle Also I think corepack is caching to a different directory than the ones we’re ca...

Yes I'm aware of that and no, it's common to all SDKs; that's because we call Codegen again before every actual call to sync the module.
So for TypeScript what happens on every call is:
-> Call to Codegen (which calls CodegenBase)
-> Call to Runtime (which again calls CodegenBase, but the steps are expected to be cached that time)
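
A rough sketch of the expectation described above, assuming the shared base step is memoized by the digest of its input. The `stepCache` type and the digest strings are hypothetical stand-ins, not the TypeScript runtime's real code:

```go
package main

import "fmt"

// stepCache memoizes the shared base step by input digest (hypothetical,
// in-memory stand-in for the engine's cache).
type stepCache struct {
	results map[string]string
	runs    int // how many times the expensive step really executed
}

func (c *stepCache) codegenBase(sourceDigest string) string {
	if out, ok := c.results[sourceDigest]; ok {
		return out // cache hit: nothing is re-executed
	}
	c.runs++
	out := "base(" + sourceDigest + ")"
	c.results[sourceDigest] = out
	return out
}

// Both entry points go through the same base step.
func (c *stepCache) codegen(d string) string { return "codegen:" + c.codegenBase(d) }
func (c *stepCache) runtime(d string) string { return "runtime:" + c.codegenBase(d) }

func main() {
	c := &stepCache{results: map[string]string{}}
	c.codegen("sha256:aaa")
	c.runtime("sha256:aaa") // same digest, expected to hit the cache
	fmt.Println("base runs with a stable digest:", c.runs)

	c.runtime("sha256:bbb") // digest changed between the two calls
	fmt.Println("base runs after the digest changed:", c.runs)
}
```

The second half shows the failure mode: if anything changes the input digest between Codegen and Runtime, the base step runs again.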

unique jungle Aug 22, 2024, 10:25 AM

#

@random meadow but it looks like we’re calling codegen even when the code didn’t change … I think

#

Or it looks like it’s not properly cached

random meadow Aug 22, 2024, 10:35 AM

#

I think it's not properly cached yes

#

Once I finish splitting Test Module, I'll check on it

random meadow Aug 23, 2024, 4:20 PM

#

@unique patrol Any news or PR to share related to performance? I can have a look if it changes the TS runtime

unique patrol Aug 23, 2024, 7:10 PM

#

random meadow <@336241811179962368> Any news or PR to share related to performances? I can hav...

I haven't seen any. Andrea (OOO today) is working on this

unique jungle Aug 26, 2024, 11:23 AM

#

@random meadow 👋

Any idea why we're calling /runtime twice in asModule? The first one takes 5s ... is it transpiling or something?

unique jungle Aug 26, 2024, 11:57 AM

#

@random meadow Ok here is a worse problem:

https://dagger.cloud/dagger/traces/fefb70456e44a55fad89198adf9ef3ac?span=845b4c6062bbfaba

Initialize is taking 11s

installing module / initialize / exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts --> 1.4s

analyzing module / initialize / exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts --> 1.4s

Actually, the overall problem is: "installing module" and "analyzing module" are both calling .asModule + .initialize. Initialize is taking 2.1s and 1.6s -- I don't understand why it's not cached?

/cc @buoyant latch @unique patrol

#

Also, it looks like it's the exact same parent -- not sure why it's not entirely cached?

mod := modConf.Source.AsModule().Initialize()

serveCtx, serveSpan := Tracer().Start(ctx, "installing module", telemetry.Encapsulate())
err = mod.Serve(serveCtx)
...

ctx, loadSpan := Tracer().Start(ctx, "analyzing module", telemetry.Encapsulate())
...
name, err := mod.Name(ctx)

#

This is what happens -- the second modConf.Source.AsModule().Initialize() call takes 500ms (AsModule) + 1.6s (Initialize)

random meadow Aug 26, 2024, 12:10 PM

#

unique jungle <@281874480651829250> Ok here is a worse problem: https://dagger.cloud/dagger/t...

Thanks for the insight!

#

Normally the 2nd call should be cached since the source didn't change... Or maybe the cache is invalidated because of the codegen...

#

Like installing module -> codegen -> analyzing module (uses the user's source + codegen, which busts the cache)

random meadow Aug 26, 2024, 12:12 PM

#

unique jungle <@281874480651829250> 👋 Any idea why we're calling `/runtime` twice in `asMod...

What do you mean by /runtime

#

Btw if you want, we can pair on the TS runtime, might be better in a call

unique jungle Aug 26, 2024, 12:37 PM

#

random meadow Btw if you want, we can work in pair together on the TS runtime, might be better...

I'm looking at the overall performance, not doing anything TS specific right now

random meadow Aug 26, 2024, 12:38 PM

#

Oh okay!

unique jungle Aug 26, 2024, 12:38 PM

#

I think there are multiple problems: 1) a TS-specific problem of corepack not being cached and not using cache volumes, and 2) a generic caching problem with modules, making the TS problems worse (but affecting all SDKs)

#

@random meadow but yeah happy to sync up! About to get lunch, later today works for you?

unique jungle Aug 26, 2024, 12:42 PM

#

random meadow Like `installing module` -> `codegen` -> `analyzing module` (use the user's sour...

any clue on how to verify that? and couldn't we do codegen before so it's cached?

random meadow Aug 26, 2024, 12:45 PM

#

unique jungle <@281874480651829250> but yeah happy to sync up! About to get lunch, later today...

Sure

random meadow Aug 26, 2024, 12:45 PM

#

unique jungle any clue on how to verify that? and couldn't we do codegen before so it's cached...

I think I can reorganize the steps to do the codegen before everything and do the config later.
I need to do some research on that!

unique jungle Aug 26, 2024, 12:45 PM

#

unique jungle I'm looking at the overall performance, not doing anything TS specific right now

also looking at this (but much less important): https://github.com/dagger/dagger/pull/8232

#

@random meadow and this https://github.com/dagger/dagger/pull/8200

#

trying to automate having some charts of init time per SDK so we can check regressions etc

random meadow Aug 26, 2024, 12:54 PM

#

unique jungle <@281874480651829250> and this https://github.com/dagger/dagger/pull/8200

That's a really good idea!

unique jungle Aug 26, 2024, 2:31 PM

#

@random meadow got some time for a quick call?

random meadow Aug 26, 2024, 2:36 PM

#

Sure!

unique jungle Aug 27, 2024, 4:00 PM

#

unique jungle Also, it looks like it's the exact same parent -- not sure why it's not entirely...

👋 @buoyant latch Is this a red herring or is it actually unexpected?

lament parrot Aug 27, 2024, 4:11 PM

#

forgot to x-post here - i did some more investigation, i wonder if this could be the guilty candidate that adds 1 second at the beginning of most dagger commands: https://github.com/dagger/dagger/pull/8232#pullrequestreview-2263097784

unique jungle Aug 27, 2024, 4:34 PM

#

@lament parrot 🙏

#

@lament parrot hmmm ... looks like:

1) there's a bunch of logic tied to the digest (including how we name the container)
2) there's a side effect in checking the digest, which is using that for analytics

I'll add instrumentation so we'll have the data and can later decide

@modern halo @worn snow how dependent are we on the "registry IPs" for analytics? Turns out (to be verified) we're slowing down every dagger call by ~1s because of the registry ping

lament parrot Aug 27, 2024, 4:43 PM

#

yeah, i think removing/caching the digest feels scary - ideally we should bake it in during release, but yeah, might need to change how we analytics it

#

although - essentially right now, every dagger call/functions/query is turning into a "pull"

#

so that's not super useful

unique jungle Aug 27, 2024, 4:44 PM

#

@lament parrot there's a chicken and egg problem there

lament parrot Aug 27, 2024, 4:44 PM

#

yeah 😦 cli in engine makes things harder

#

but not impossible 😄 (the digests already don't match, since the outer one uses go-releaser, and the inner one doesn't - which is its own problem, but not one super easily solvable right now until we decide to rip out goreleaser)

unique jungle Aug 27, 2024, 4:46 PM

#

wondering if a "simple" solution would be:

- use tags rather than digests
- do a registry lookup in the background to verify the local and remote shas match. Issue a warning/error/whatever if they don't. Serves 2 purposes:

  - adds a (soft) safety check for tag mismatch
  - keeps analytics flowing
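
A minimal sketch of that proposal, assuming the container can be named from the tag alone. `startEngine`, the `dagger-engine-` name prefix, the `vX.Y.Z` tag, and the injected `lookup` function are all hypothetical; no real registry client is used here:

```go
package main

import (
	"fmt"
	"time"
)

// startEngine returns the container name immediately (derived from the tag)
// and verifies the digest in the background. lookup stands in for a registry
// call; at most one warning is delivered on the returned channel.
func startEngine(tag, localDigest string, lookup func(tag string) (string, error)) (string, <-chan string) {
	warnings := make(chan string, 1)
	go func() {
		defer close(warnings)
		remote, err := lookup(tag)
		if err != nil {
			warnings <- "registry lookup failed: " + err.Error()
			return
		}
		if remote != localDigest {
			warnings <- "digest mismatch for tag " + tag
		}
	}()
	return "dagger-engine-" + tag, warnings
}

func main() {
	slowLookup := func(tag string) (string, error) {
		time.Sleep(50 * time.Millisecond) // simulated registry round trip
		return "sha256:remote", nil
	}

	name, warnings := startEngine("vX.Y.Z", "sha256:local", slowLookup)
	fmt.Println("container name available without waiting:", name)

	// The caller can drain warnings later, e.g. just before exiting.
	for w := range warnings {
		fmt.Println("warning:", w)
	}
}
```

The registry round trip still happens (so the request is still visible server-side), but it no longer sits on the critical path of every command.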

modern halo Aug 27, 2024, 6:17 PM

#

unique jungle <@488718750690967563> hmmm ... looks like: 1) there's a bunch of logic tied to ...

Very dependent. While responses on the server side vary, we aim to respond within 500ms 99% of the time:

Would the image remain registry.dagger.io/engine:VERSION ?

lament parrot Aug 27, 2024, 6:29 PM

#

modern halo Very dependent. While responses on the server side vary, we aim to respond withi...

Yeah, the ideal end state would be we keep the tags/etc, but we attach the digest to the cli we distribute.
Then, when we already have the image downloaded, we don't need to go to the registry to see if we have a new tag published, since now we've got an absolute digest baked in

#

There's also the security argument for having the absolute digest - without that, a bad actor could potentially hijack our registry and upload bad images - but if we bake the digest to the cli, that entire issue can be avoided and protected against
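
One way the "bake the digest in during release" idea could look, as a sketch only. It assumes the digest is injected with the Go linker's `-X` flag; the `engineDigest` variable and `imageRef` helper are hypothetical names, not the CLI's actual code:

```go
package main

import "fmt"

// engineDigest is meant to be set at release time, for example:
//
//	go build -ldflags "-X main.engineDigest=sha256:..."
//
// It stays empty in development builds.
var engineDigest = ""

// imageRef pins the image by digest when one was baked in, and falls back to
// the tag (which needs a registry lookup to resolve) otherwise.
func imageRef(repo, tag string) (ref string, needsRegistryLookup bool) {
	if engineDigest != "" {
		return repo + "@" + engineDigest, false
	}
	return repo + ":" + tag, true
}

func main() {
	ref, lookup := imageRef("registry.dagger.io/engine", "vX.Y.Z")
	fmt.Println(ref, "needs registry lookup:", lookup)
}
```

With a baked-in digest the reference is absolute, so an already-downloaded image can be used without asking the registry what the tag currently points to.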

#

But yes - this would cause pulls to occur only once per machine instead of on each call, so it might screw up analytics there (but it would also probably look more accurate, and align with use in kubernetes and similar environments imo)

unique jungle Aug 27, 2024, 6:41 PM

#

@modern halo @random meadow @unique patrol @lament parrot Ok I just created https://github.com/dagger/dagger/pull/8250 to automatically (nightly and on workflow dispatch) run a benchmark on each SDK (dagger init, functions, change some code, functions again)

#

I created a dashboard on Honeycomb to keep track of performance

#

https://ui.honeycomb.io/dagger/environments/experimental/board/yUGn6ftYPtX/Dagger-Benchmark

#

e.g.

buoyant latch Aug 27, 2024, 6:42 PM

#

unique jungle <@281874480651829250> Ok here is a worse problem: https://dagger.cloud/dagger/t...

It should be cached on the second call. Are we 100% sure it's actually running twice though? If you expand out the traces then in the first one (installing module) there's a bunch of spans under "initialize". If you expand out the second one (analyzing module) then there's just a single one (exec tsx --no-deprecation --tsconfig /src/tsconfig.json /src/src/__dagger.entrypoint.ts) with the exact same runtime as the previous call

#

Just double checking it's actually running twice as opposed to showing the runtime of the cached call that ran before

unique jungle Aug 27, 2024, 6:43 PM

#

unique jungle Also, it looks like it's the exact same parent -- not sure why it's not entirely...

@buoyant latch this looks like it's the code originating that -- same mod instance and we call Serve and Name, but the underlying Initialize seems to be running twice

buoyant latch Aug 27, 2024, 6:44 PM

#

unique jungle <@949034677610643507> this looks like it's the code originating that -- same `mo...

Right I just want to make sure this isn't a problem with how traces display (i.e. it's showing a cached call taking 1.4s two times even though it only took 1.4s the first time). Not sure if we've fixed every issue around showing cached vs "real" times in the web ui yet

#

If it is actually running twice then yeah something non-deterministic is happening

unique jungle Aug 27, 2024, 6:45 PM

#

buoyant latch It should be cached on the second call. Are we 100% sure it's actually running t...

Oh I see

It does take overall time according to the parent span ...

@woven sage any known issues that could be misleading in this trace?

buoyant latch Aug 27, 2024, 6:45 PM

#

Could also just change the code locally to make one of those calls many times in a loop and see if it's actually taking real-world clock time or not
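
A small sketch of that experiment, assuming the call under test can be wrapped in a plain function. `timeCalls` is a hypothetical helper and the sleeping closure is a stand-in for the real `Initialize` call:

```go
package main

import (
	"fmt"
	"time"
)

// timeCalls invokes call n times and records the wall-clock time of each run.
func timeCalls(n int, call func()) []time.Duration {
	durations := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		call()
		durations = append(durations, time.Since(start))
	}
	return durations
}

func main() {
	// Stand-in for the real call: slow the first time, then served from a cache.
	cached := false
	initialize := func() {
		if !cached {
			time.Sleep(20 * time.Millisecond)
			cached = true
		}
	}

	for i, d := range timeCalls(5, initialize) {
		fmt.Printf("call %d took %v\n", i+1, d)
	}
}
```

If every iteration takes roughly as long as the first, the call is really re-executing; if only the first is slow, the trace is likely showing the cached call's original runtime.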

#

But yeah, if it's actually executing "expensively" each time then I bet something non-deterministic is going into the cache key. In that case, I would be curious if the Go SDK has the same behavior or not. The Go SDK is simpler, so wondering if it avoids this

woven sage Aug 27, 2024, 6:48 PM

#

the little invader icons are different between the two calls, so it doesn't look like it's running twice to me - they're different somehow

#

initialize 1 ID: ChV4eGgzOjRiMGE3ZmQ0MTg2ZDRjMDgSXwoVeHhoMzo0YjBhN2ZkNDE4NmQ0YzA4EkYKFXh4aDM6ZGM4MDFhMjUxYjAwZjdhMRIKCgZNb2R1bGUYARoKaW5pdGlhbGl6ZUoVeHhoMzo0YjBhN2ZkNDE4NmQ0YzA4EoIBChV4eGgzOmRjODAxYTI1MWIwMGY3YTESaQoVeHhoMzo4NjNiMzg4MjcxNTg4NGQyEgoKBk1vZHVsZRgBGgp3aXRoU291cmNlIiEKBnNvdXJjZRIXChV4eGgzOjBjM2QwNTBjNjBjOTFhNjJKFXh4aDM6ZGM4MDFhMjUxYjAwZjdhMRJEChV4eGgzOjg2M2IzODgyNzE1ODg0ZDISKxIKCgZNb2R1bGUYARoGbW9kdWxlShV4eGgzOjg2M2IzODgyNzE1ODg0ZDISjwEKFXh4aDM6MGMzZDA1MGM2MGM5MWE2MhJ2ChV4eGgzOmE0MTllZjBjODIxMDQyOWISEAoMTW9kdWxlU291cmNlGAEaFHdpdGhDb250ZXh0RGlyZWN0b3J5Ih4KA2RpchIXChV4eGgzOmNkMDE4MGQ1YjdlMGM4NTZKFXh4aDM6MGMzZDA1MGM2MGM5MWE2MhJkChV4eGgzOmE0MTllZjBjODIxMDQyOWISSxIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIhIKCXJlZlN0cmluZxIFOgMuLy5KFXh4aDM6YTQxOWVmMGM4MjEwNDI5YhLAAgoVeHhoMzpjZDAxODBkNWI3ZTBjODU2EqYCEg0KCURpcmVjdG9yeRgBGgRibG9iIlMKBmRpZ2VzdBJJOkdzaGEyNTY6YjBhOTcyYTA4ODlkMTBlYjA3OGY4MTJmNDVmMDczNTQ2NWZkOGExNDkzOWY2ZDk4ZmNmOTc4MjlkZDFmNjQ3OSI6CgltZWRpYVR5cGUSLTorYXBwbGljYXRpb24vdm5kLm9jaS5pbWFnZS5sYXllci52MS50YXIrenN0ZCIMCgRzaXplEgQoy4kFIlkKDHVuY29tcHJlc3NlZBJJOkdzaGEyNTY6ZTUyYWI5MTVmZWNmNDcyNTAyMzU3MjJkODE1YzZjZDEwMDIyYWE3YjQzYjFjMGI4OGNjODIwZjgzYWZhNGM4YkoVeHhoMzpjZDAxODBkNWI3ZTBjODU2
initialize 2 ID: ChV4eGgzOjM0YmQ1Zjg4MjllZDg0OTMSXwoVeHhoMzozNGJkNWY4ODI5ZWQ4NDkzEkYKFXh4aDM6MWIzZmM3YWNhZjUyODlhMxIKCgZNb2R1bGUYARoKaW5pdGlhbGl6ZUoVeHhoMzozNGJkNWY4ODI5ZWQ4NDkzEoIBChV4eGgzOjFiM2ZjN2FjYWY1Mjg5YTMSaQoVeHhoMzo4NjNiMzg4MjcxNTg4NGQyEgoKBk1vZHVsZRgBGgp3aXRoU291cmNlIiEKBnNvdXJjZRIXChV4eGgzOjhmMTAyMTk3ZTFhYmMxMWNKFXh4aDM6MWIzZmM3YWNhZjUyODlhMxJEChV4eGgzOjg2M2IzODgyNzE1ODg0ZDISKxIKCgZNb2R1bGUYARoGbW9kdWxlShV4eGgzOjg2M2IzODgyNzE1ODg0ZDISjwEKFXh4aDM6OGYxMDIxOTdlMWFiYzExYxJ2ChV4eGgzOmE0MTllZjBjODIxMDQyOWISEAoMTW9kdWxlU291cmNlGAEaFHdpdGhDb250ZXh0RGlyZWN0b3J5Ih4KA2RpchIXChV4eGgzOjk2YWNkNjcyZDEzYTU1ZDZKFXh4aDM6OGYxMDIxOTdlMWFiYzExYxJkChV4eGgzOmE0MTllZjBjODIxMDQyOWISSxIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIhIKCXJlZlN0cmluZxIFOgMuLy5KFXh4aDM6YTQxOWVmMGM4MjEwNDI5YhLAAgoVeHhoMzo5NmFjZDY3MmQxM2E1NWQ2EqYCEg0KCURpcmVjdG9yeRgBGgRibG9iIlMKBmRpZ2VzdBJJOkdzaGEyNTY6MTBmMDRjOTNhZDJjYThjOWYyYjNmMGUwNDVkOWQ0NzlhZWUwZWNkNGIwYjcxYWM2MTIxNjFlMDM2YTBlMTJjZCI6CgltZWRpYVR5cGUSLTorYXBwbGljYXRpb24vdm5kLm9jaS5pbWFnZS5sYXllci52MS50YXIrenN0ZCIMCgRzaXplEgQoy4kFIlkKDHVuY29tcHJlc3NlZBJJOkdzaGEyNTY6YWI4NmU2MWJjNzQyZDRjZWQ0YjQyYzg4YmM3YWY4MWZhNWRmNjU0YjlhMjY5MjUxZmViMTkwM2NmZjEwZmUxM0oVeHhoMzo5NmFjZDY3MmQxM2E1NWQ2

unique jungle Aug 27, 2024, 6:49 PM

#

Pretty late here, I'll do some digging tomorrow

@buoyant latch I added some automated benchmarking (see above) to get a baseline

Did notice a few weird things already, @random meadow opened a PR (https://github.com/dagger/dagger/pull/8236) but I think it's more widespread than just the TS SDK

woven sage Aug 27, 2024, 6:50 PM

#

woven sage initialize 1 ID: `ChV4eGgzOjRiMGE3ZmQ0MTg2ZDRjMDgSXwoVeHhoMzo0YjBhN2ZkNDE4NmQ0Yz...

they're ending up with different blob args (order here is swapped, sorry)

buoyant latch Aug 27, 2024, 6:52 PM

#

woven sage the little invader icons are different between the two calls, so it doesn't look...

Cool, that's assuredly indicating something wrong then

woven sage Aug 27, 2024, 6:52 PM

#

this might be because resolveFromCaller (or something related) is marked impure, so it runs twice, and we actually sync files multiple times - it's believable that something changes in the meantime, or that it's not 100% deterministic

#

(btw you can click the invaders to copy the full ID to your clipboard, which is how I debugged that)

buoyant latch Aug 27, 2024, 6:52 PM

#

I'd suspect non-determinism

woven sage Aug 27, 2024, 6:56 PM

#

it might be worth reconsidering whether these should even be marked impure - it only affects session-local query caching at the moment, maybe keeping those 'pure' is a saner choice for how it's currently used

#

(gets back to the idea of "degrees of impurity" concept we talk about now and then)

buoyant latch Aug 27, 2024, 7:01 PM

#

Well it would break our integ tests since we have nested execs loading different modules from the same path in the same session.

But like I've been saying, I am becoming more biased towards just moving away from that model since it seems to cause more problems than it solves, so in that case yes we could get away with making it pure

#

I suspect another part of the problem is the fact that the sdks don’t interact with impure apis well. I opened an issue about this a month or two ago, on mobile so will find when back in a few hours

#

But also the blobs should be deterministic

#

So probably multiple levels of issues

buoyant latch Aug 27, 2024, 9:51 PM

#

Oh apparently that issue I mentioned above about things not working with impure IDs was fixed by Helder: https://github.com/dagger/dagger/issues/7788

I still am surprised that the impure API runs twice (might be another variation on that problem not handled yet), but it's only 31ms the second time so doesn't matter right now. The fact that the blob digest is non-deterministic seems like the main overhead. Wouldn't be surprised if that is causing all sorts of slowness all over the place

GitHub

sdks: `Sync` doesn't always work as expected with certain APIs · Is...

A tangential issue I noticed during some adventures with #7767 At least in the Go SDK, the generated Sync method will just return the query it has built so far, e.g. dagger/sdk/go/dagger.gen.go Lin...
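
To illustrate how a blob digest can differ for identical file contents, here is a sketch under stated assumptions. The thread does not establish the actual cause, so walk order and modification times are used purely as hypothetical examples, and `blobDigest` is not the engine's filesync code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"time"
)

type entry struct {
	Path    string
	Content string
	ModTime time.Time
}

// blobDigest hashes entries as given, or, when normalize is true, in sorted
// path order with modification times left out.
func blobDigest(entries []entry, normalize bool) string {
	list := append([]entry(nil), entries...) // copy so the caller's slice is untouched
	if normalize {
		sort.Slice(list, func(i, j int) bool { return list[i].Path < list[j].Path })
	}
	h := sha256.New()
	for _, e := range list {
		mtime := e.ModTime.UTC().Format(time.RFC3339Nano)
		if normalize {
			mtime = ""
		}
		fmt.Fprintf(h, "%q %q %q\n", e.Path, e.Content, mtime)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	t0 := time.Unix(1724770000, 0)
	first := []entry{{"a.ts", "x", t0}, {"b.ts", "y", t0}}
	// Same files synced again: different walk order and one touched mtime.
	second := []entry{{"b.ts", "y", t0.Add(time.Second)}, {"a.ts", "x", t0}}

	fmt.Println("raw digests equal:       ", blobDigest(first, false) == blobDigest(second, false))
	fmt.Println("normalized digests equal:", blobDigest(first, true) == blobDigest(second, true))
}
```

Any input like this that leaks into the digest changes every downstream ID, which would match the two `initialize` IDs differing only in their blob arguments.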

#

Will take a look once I pivot to performance stuff after deflaking

#

If anyone else wants to look at that in the meantime I can try to give some pointers on where to look at what's going wrong

unique jungle Aug 28, 2024, 9:25 AM

#

Logged an issue https://github.com/dagger/dagger/issues/8254

#

@buoyant latch btw -- doesn't seem to happen consistently:

https://dagger.cloud/dagger/traces/e69f9cbfc2bbddfbb24a328d1228976c
https://dagger.cloud/dagger/traces/becca1c89f2fd627b038bbfcd7b26a86

unique jungle Aug 28, 2024, 9:56 AM

#

unique jungle wondering if a "simple" solution would be: - use tags rather than digests - do ...

Added tracing as @lament parrot suggested. Confirming this is what's taking time

#

Screenshot_2024-08-28_at_11.55.49_AM.png

unique jungle Aug 28, 2024, 4:32 PM

#

@gaunt tinsel btw -- I'm noticing that while Python is faster than Typescript, it looks like it takes roughly the same time whether the source code has changed or not (hinting that maybe there's a caching issue?)

unique jungle Aug 28, 2024, 4:34 PM

#

unique jungle <@768585883120173076> btw -- I'm noticing that while Python is faster than Types...

As a reference:

Python: Not cached: 7s // Cached: 5s

Typescript: Not cached: 21s // Cached: 1s

#

This is trace of python when it's supposed to be cached (e.g. running dagger functions twice in a row, this is the trace from the 2nd time): https://dagger.cloud/dagger/traces/305563fe13408b510ce44b6c496e1486

#

unique jungle Aug 28, 2024, 4:36 PM

#

woven sage initialize 1 ID: `ChV4eGgzOjRiMGE3ZmQ0MTg2ZDRjMDgSXwoVeHhoMzo0YjBhN2ZkNDE4NmQ0Yz...

Wondering if it's related to that (e.g. non deterministic init), so basically not specific to Python

buoyant latch Aug 28, 2024, 4:36 PM

#

unique jungle Wondering if it's related to that (e.g. non deterministic init), so basically no...

I'd suspect so yeah

unique jungle Aug 28, 2024, 4:37 PM

#

@buoyant latch weirdly it's not happening with typescript (and not once but 3 times in a row, same results with typescript and python)

I'm thinking this is a deterministic non-deterministic issue 🙂

unique jungle Aug 28, 2024, 4:43 PM

#

unique jungle <@949034677610643507> weirdly it's not happening with typescript (and not once b...

err -- basically sometimes it happens during the same session (installing vs analyzing)

sometimes not in the same session (analyzing is much faster), but in subsequent sessions (e.g. dagger functions twice in a row)

but it SEEMS to be "consistently broken" -- for TS it's always broken in the same session, for Python it's always in subsequent sessions

#

but I'm looking at a small sample size (N=3), so this might be pure coincidence

buoyant latch Aug 28, 2024, 4:46 PM

#

I may start poking around at this today. Still working on the flakes but even though the flakes happen first try on main I am going like 5+ runs in a row of everything passing in the PR I'm using to debug (🤬 🤬 🤬 )

So may as well look at this locally while waiting for those to run.

buoyant latch Aug 28, 2024, 4:48 PM

#

unique jungle err -- basically sometimes it happens during the same session (installing vs ana...

I don't know what to make of it yet. I'll also see what I can do to make this even debuggable in the first place, it's just too complicated right now to make heads or tails of

#

But as a first step I'll just see if there's anything obvious in the filesync that leads to non-determinism in the contents that are loaded. If so, fix that and see how the behavior changes

unique jungle Aug 28, 2024, 4:49 PM

#

buoyant latch I may start poking around at this today. Still working on the flakes but even th...

and the most annoying flake goes to ...

buoyant latch Aug 28, 2024, 4:50 PM

#

unique jungle *and the most annoying flake goes to ...*

We would be medal contenders for the annoying flake olympics at this point

unique patrol Aug 28, 2024, 4:50 PM

#

@buoyant latch is this non-determinism thing related to this? https://github.com/dagger/dagger/issues/7084

GitHub

🐞 Functions local context seems to be randomly uploaded. · Issue #...

What is the issue? As a follow-up of this discord thread: https://discord.com/channels/707636530424053791/1227300888050139146 and this previous issue: #6462, seems like dagger call is randomly uplo...

lament parrot Aug 28, 2024, 4:51 PM

#

buoyant latch We would medal contendors for the annoying flake olympics at this point

Speaking of this, could someone approve and merge the "improve verbosity pr" I have open (sorry on mobile, can't easily find it)

#

I can't get it to happen on that pr

#

But it happens consistently on main 😭

unique patrol Aug 28, 2024, 4:51 PM

#

lament parrot Speaking of this, could someone approve and merge the "improve verbosity pr" I h...

on it

buoyant latch Aug 28, 2024, 4:51 PM

#

lament parrot I can't get it to happen *on that pr*

Same! The infra is the same as far as i can tell though??

unique jungle Aug 28, 2024, 4:51 PM

#

@buoyant latch There's a few traces around if that helps.

There's a brand new GH workflow that runs nightly for each SDK to compare dagger functions for brand new vs cached vs modified: https://github.com/dagger/dagger/actions/workflows/benchmark.yml
There's only a few runs so far, but it's easy to grab cloud URLs for those
They get charted in Honeycomb (per SDK or per type): https://ui.honeycomb.io/dagger/environments/experimental/board/yUGn6ftYPtX/SDK-Benchmark?endTime=0&startTime=-604800

lament parrot Aug 28, 2024, 4:52 PM

#

buoyant latch Same! The infra is the same as far as i can tell though??

I have no idea if it's related - it could just be "today the flake god hates me"

#

I have seen it occasionally in other pr environments, taunting me

buoyant latch Aug 28, 2024, 4:52 PM

#

lament parrot I have no idea if it's related - it could just be "today the flake god hates me"

They have hated me for multiple days in a row now, which is raising my suspicions

buoyant latch Aug 28, 2024, 4:53 PM

#

unique patrol <@949034677610643507> is this non-determinism thing related to this? https://gi...

Highly possible yes

lament parrot Aug 28, 2024, 4:53 PM

#

Which one are you looking at? The codegen Basic/interface confusion one?

buoyant latch Aug 28, 2024, 4:53 PM

#

lament parrot Which one are you looking at? The codegen Basic/interface confusion one?

Yeah that one since it happens so much on main

unique jungle Aug 28, 2024, 4:54 PM

#

buoyant latch They have hated me for multiple days in a row now, which is raising my suspicion...

https://tenor.com/view/frosted-flakes-cereal-breakfast-gif-18369385

lament parrot Aug 28, 2024, 4:54 PM

#

I also wonder if there's something with me doing a bunch of merges at the same time (common morning task for me, merge everything that needs doing)

#

So again, might just be a capacity thing? Mayyyybe?

buoyant latch Aug 28, 2024, 4:55 PM

#

The engines are single-tenant so I would have thought that was less of a thing

#

But yes, who knows

lament parrot Aug 28, 2024, 4:56 PM

#

buoyant latch But yes, who knows

Vibe of the week honestly

unique patrol Aug 30, 2024, 5:02 PM

#

@buoyant latch @lament parrot @unique jungle while I was helping a user wipe his Dagger Cloud cache volumes here https://discord.com/channels/707636530424053791/1278526995885592626, he noticed that SDK initialization was taking a large amount of time (https://dagger.cloud/Tallied-Technologies-Inc/traces/49ff82984e8e84fb2f17553d33782c9c?span=dabdcd12a58eaa30). Any clues as to why that span could be taking so much time?

#

maybe it's a mistake in our telemetry about where the time is actually being spent?

gaunt tinsel Aug 31, 2024, 8:54 AM

#

unique jungle As a reference: **Python**: Not cached: 7s // Cached: 5s **Typescript**: Not c...

I'm not seeing the same thing. This is a trace of a second dagger functions in a row, in a clean (temporary) directory: https://dagger.cloud/Correia/traces/5b14d86ff63303734ae8143f9472760e#c4f2f0470b6d1f17

gaunt tinsel Sep 2, 2024, 4:05 PM

#

@lament parrot, do you think it's possible that using embeds in a go module can create non-determinism in building the module? Looking at the benchmark traces, Python (cached) re-runs the codegen CLI, while Typescript doesn't:

lament parrot Sep 2, 2024, 4:06 PM

#

hm, potentially? i know go build can occasionally be non-deterministic, but ideally the codegen shouldn't be running again at all, right? which would imply some input is non-deterministic?

gaunt tinsel Sep 2, 2024, 4:08 PM

#

However.... locally I still can't replicate this. This is a second run, in Python: https://dagger.cloud/Correia/traces/3056ad023caa2b428063a4c4835478ff#ee5d1a41a222fd44

lament parrot Sep 2, 2024, 4:08 PM

#

does python have an embed where typescript doesn't?

gaunt tinsel Sep 2, 2024, 4:08 PM

#

Yep, 2 of them.

#

Simple files. Template for main.py and Dockerfile.

lament parrot Sep 2, 2024, 4:09 PM

#

i know if you have an embed.FS, it captures the file metadata into the binary as well

#

i wouldn't expect a string to be different with this

gaunt tinsel Sep 2, 2024, 4:16 PM

#

Really missing viz on which inputs cause a cache miss.

lament parrot Sep 2, 2024, 4:29 PM

#

if you have a dnm merge pr, you could potentially take the inputs, and print out their contents / print .Digests

#

i know @woven sage was looking at resurrecting cached labels, not sure where that is right now (but also it's holiday so not urgent)

unique jungle Sep 3, 2024, 10:55 AM

#

gaunt tinsel I'm not seeing the same thing. This is a trace of a second `dagger functions` in...

Weird, I see this consistently

Screenshot_2024-09-03_at_12.54.57_PM.png

#

@gaunt tinsel the commands executed are pretty straightforward: https://github.com/dagger/dagger/blob/main/.github/workflows/benchmark.yml#L29-L67

gaunt tinsel Sep 3, 2024, 11:01 AM

#

unique jungle Weird, I see this consistently

I meant to say that I don't see this happening on my computer, only in CI.

unique jungle Sep 3, 2024, 11:08 AM

#

gaunt tinsel I meant to say that I don't see this happening on my computer, only in CI.

CI starts from a fresh install ... maybe locally it's somehow reusing some previous cache, but for some reason it doesn't get cached immediately?

unique jungle Sep 3, 2024, 11:09 AM

#

unique patrol <@949034677610643507> <@488718750690967563> <@707661669819613324> while helping ...

Looks like a big upload maybe?

unique jungle Sep 3, 2024, 12:55 PM

#

unique jungle <@488718750690967563> hmmm ... looks like: 1) there's a bunch of logic tied to ...

ping on this -- can we make the switch to use native telemetry rather than registry telemetry for reporting?

registry "ping" on every dagger command is costly (~800ms startup time) and breaks the CLI if the registry goes down

lament parrot Sep 3, 2024, 12:58 PM

#

yeah, because of the networking incident I think this goes up in priority imo

worn snow Sep 3, 2024, 2:08 PM

#

unique jungle ping on this -- can we make the switch to use native telemetry rather than regis...

#maintainers message

#

doesn't answer the question, but pointing out that I reported this in detail 90 days ago and got crickets

#

Losing the telemetry will hurt, but we do have other sources I guess

#

@unique jungle Is there a way to make the lookup async and best effort?

lament parrot Sep 3, 2024, 2:14 PM

#

worn snow doesn't answer the question, but pointing out that I reported this in detail 90 ...

sorry, i must have missed this in the noise, my bad 😦

worn snow Sep 3, 2024, 2:28 PM

#

lament parrot sorry, i must have missed this in the noise, my bad 😦

It was summer, lots of people either gone or stretched thin, all good

unique jungle Sep 3, 2024, 2:46 PM

#

worn snow doesn't answer the question, but pointing out that I reported this in detail 90 ...

I added telemetry and traced it back to the registry lookup

unique jungle Sep 3, 2024, 2:52 PM

#

worn snow <@707661669819613324> Is there a way to make the lookup async and best effort?

Kinda. It should be doable, not sure how much refactoring is needed, but then it’s a bizarro lookup whose result is just immediately thrown away

#

Currently the result is used to name the container using the image digest. In order to stop blocking, we’d have to name the container against the (immutable) tag. We can still do an asynchronous registry lookup but by then the container will be long created, so the result of the lookup goes into the trash

lament parrot Sep 3, 2024, 2:58 PM

#

^ with this suggestion - if we're (ab)using the registry lookup for analytics, and not actually giving any functionality for it, it feels like it should be respecting the DO_NOT_TRACK env

#

not a blocker to the suggestion, i think it's good as a quickfix, but i don't think we should leave it there longterm
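
As a rough illustration of the two suggestions above (a non-blocking, best-effort lookup that also honors DO_NOT_TRACK), here is a minimal Python sketch. Everything in it is hypothetical: `resolve_digest`, `container_name_from_tag`, `start_engine`, the `engine-` name prefix, and the exact way DO_NOT_TRACK is interpreted are assumptions for illustration, not the CLI's actual code.

```python
import hashlib
import os
import threading
import time


def resolve_digest(image_ref: str) -> str:
    """Hypothetical stand-in for the tag -> digest registry round trip."""
    time.sleep(0.05)  # simulate network latency
    return "sha256:" + hashlib.sha256(image_ref.encode()).hexdigest()


def container_name_from_tag(image_ref: str) -> str:
    """Name the container from the immutable tag alone (no network call)."""
    tag = image_ref.rsplit(":", 1)[-1]
    return "engine-" + tag.replace(".", "-")


def tracking_disabled(env) -> bool:
    """Assumed reading of DO_NOT_TRACK: anything but empty/0/false opts out."""
    return env.get("DO_NOT_TRACK", "").strip().lower() not in ("", "0", "false")


def start_engine(image_ref, on_digest=None, env=os.environ):
    """Return the container name immediately; resolve the digest in the background."""
    name = container_name_from_tag(image_ref)
    if tracking_disabled(env):
        return name, None  # opted out: skip the registry lookup entirely

    def lookup():
        try:
            digest = resolve_digest(image_ref)
        except Exception:
            return  # best effort: a registry outage must not break the CLI
        if on_digest is not None:
            on_digest(digest)

    worker = threading.Thread(target=lookup, daemon=True)
    worker.start()
    return name, worker
```

The name is available before the lookup finishes, so startup no longer waits on the registry; the digest only reaches the optional `on_digest` callback, which matches the point that the result is no longer needed for naming.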

worn snow Sep 3, 2024, 3:01 PM

#

Yeah I thought it was used to check for a new version, and trigger an update. If that's the case then the information could be stored locally async, and the update done best effort, either in that call or the next one.

woven sage Sep 3, 2024, 3:08 PM

#

gaunt tinsel Really missing viz on which inputs cause a cache miss.

Unless I'm missing a more obvious way of doing this, I think we could implement it by building on the same "stable function digest" idea we're planning to use for analytics/insights. Basically just store which input ID digests are seen for a given stable digest, and mark any that haven't been seen before so the UI can highlight them. That state would have to be kept on the engine side, but it's not much data (just digest relationships)

(edit: linearize'd - https://linear.app/dagger/issue/DEV-2925/why-was-my-pipeline-not-cached#comment-f05bbc31)
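
A minimal sketch of that bookkeeping, assuming each call is identified by a stable function digest plus a mapping of input name to input digest. The class and method names (`InputDigestTracker`, `record_call`) are made up for illustration and are not engine APIs.

```python
from collections import defaultdict


class InputDigestTracker:
    """Remembers which input digests were seen for each stable function digest."""

    def __init__(self):
        # stable digest -> set of (input name, input digest) pairs seen so far
        self._seen = defaultdict(set)

    def record_call(self, stable_digest, inputs):
        """Record a call and return the names of inputs whose digest is new.

        `inputs` maps an input name to its digest. The returned names are the
        likely cache-busters a UI could highlight.
        """
        seen = self._seen[stable_digest]
        unseen = sorted(
            name for name, digest in inputs.items() if (name, digest) not in seen
        )
        seen.update(inputs.items())
        return unseen


tracker = InputDigestTracker()
tracker.record_call("fn", {"source": "a1", "version": "v1"})  # first call: all inputs are new
changed = tracker.record_call("fn", {"source": "b2", "version": "v1"})
print(changed)  # ['source']
```

Only digest relationships are stored, which is in line with the note that the engine-side state stays small.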

#

The other thing that's coming back is the CACHED state, but that doesn't tell you who busted the cache, just whether the cache was busted

unique jungle Sep 3, 2024, 4:03 PM

#

worn snow Yeah I thought it was used to check for a new version, and trigger an update. If...

It's resolving tag->digest, for container naming purposes

unique jungle Sep 6, 2024, 12:53 PM

#

@random meadow Benchmark ran again ... we're saving about 7 seconds in typescript when the code changes

random meadow Sep 6, 2024, 1:02 PM

#

So my changes are nice?

unique jungle Sep 6, 2024, 1:02 PM

#

yep

lament parrot Sep 6, 2024, 1:03 PM

#

🎉

unique jungle Sep 6, 2024, 1:05 PM

#

[note: 16 vs 14 might be random]

before:

init: 28 seconds
functions: 16 seconds
functions (modified code): 16 seconds
functions (cached): <1s

after:

init: 28 seconds
functions: 14 seconds
functions (modified code): 9 seconds
functions (cached): <1s

random meadow Sep 6, 2024, 1:05 PM

#

unique jungle yep

Yeaaaah! I'm super happy

unique jungle Sep 6, 2024, 1:05 PM

#

tldr: before, first time == modified time. Now, modified < first time, so we're re-using some cache
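
For reference, the arithmetic behind that summary, using only the numbers posted above (the "<1s" cached runs are left out because they have no exact value). The helper names are illustrative.

```python
# Timings in seconds, copied from the before/after benchmark numbers above.
before = {"init": 28, "functions": 16, "functions (modified code)": 16}
after = {"init": 28, "functions": 14, "functions (modified code)": 9}


def savings(before, after):
    """Seconds saved per step (positive means the step got faster)."""
    return {step: before[step] - after[step] for step in before}


def reuses_cache(timings):
    """True when a run after a code change beats the first run."""
    return timings["functions (modified code)"] < timings["functions"]


print(savings(before, after))
# {'init': 0, 'functions': 2, 'functions (modified code)': 7}
print(reuses_cache(before), reuses_cache(after))
# False True
```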

random meadow Sep 6, 2024, 1:05 PM

#

That's strange the init is the same though, it should be faster 😮

#

Oh maybe you're only testing with yarn?

#

It's a special case where I actually need to download dependencies to generate the lockfile

unique jungle Sep 6, 2024, 1:06 PM

#

random meadow That's strange the init is the same though, it should be faster 😮

it was actually slower but I thought it was a fluke. It actually took 40s

#

then 39s the second time

#

but those are one offs

random meadow Sep 6, 2024, 1:06 PM

#

Could you add some benchmarks for pnpm, npm & bun?

#

It would also be useful for our users

#

and yarn v4 if you feel like it too 😉

#

(So many runtimes & package managers in TS, sorry)

unique jungle Sep 6, 2024, 1:07 PM

#

how?

#

Also -- some of the slowness is due to https://github.com/dagger/dagger/issues/8254

#

which is affecting all SDKs, not just TS

unique jungle Sep 6, 2024, 1:50 PM

#

random meadow Oh maybe you're only testing with yarn?

not testing with anything in particular, just running init --sdk=typescript

How do you add the other ones?

The source is here: https://github.com/dagger/dagger/blob/69eaf710c5b8e3063fdee0e8ae0ffed4fef91bfe/.github/workflows/benchmark.yml#L4

random meadow Sep 6, 2024, 2:45 PM

#

unique jungle not testing with anything in particular, just running init --sdk=typescript How...

You need to update the package.json for bun. For package managers, you need to run <npm|yarn|pnpm> install first so the runtime can pick the right package manager

#

Or set the packageManager field to npm|yarn|pnpm, that should work too

#

https://docs.dagger.io/manuals/developer/dependencies go to typescript section
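
A small sketch of the second option, setting the packageManager field from a script so a benchmark can switch package managers. It assumes the Corepack-style `name@version` value; `yarn@1.22.22` is the version mentioned later in this thread, and any other version you pass in is your own choice. The helper name `set_package_manager` is hypothetical, and the bun case (a different package.json change) is not covered here.

```python
import json
import tempfile
from pathlib import Path


def set_package_manager(package_json: Path, spec: str) -> dict:
    """Set the packageManager field, keeping every other field intact."""
    data = json.loads(package_json.read_text()) if package_json.exists() else {}
    data["packageManager"] = spec  # assumed format: "<name>@<version>"
    package_json.write_text(json.dumps(data, indent=2) + "\n")
    return data


# Usage: run against a throwaway module directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "package.json"
    pkg.write_text(json.dumps({"dependencies": {"typescript": "^5.5.4"}}))
    result = set_package_manager(pkg, "yarn@1.22.22")
    written = json.loads(pkg.read_text())
```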

unique jungle Sep 17, 2024, 1:47 PM

#

@random meadow btw -- I'm seeing 45+ seconds on dagger init for typescript

random meadow Sep 17, 2024, 1:50 PM

#

unique jungle <@281874480651829250> btw -- I'm seeing 45+ seconds on `dagger init`for typescri...

I know, it's super weird. It's because corepack installs yarn 1.22.22 and takes a huge amount of time to do so...
I want to create benchmarks for other package managers too, to see if it's faster with npm or pnpm

#

I'll work on a benchmark

unique jungle Sep 17, 2024, 1:51 PM

#

Does it make sense to default to yarn vs npm?

random meadow Sep 17, 2024, 1:51 PM

#

We want to use the fastest one, basically

#

Yarn was supposed to be the fastest, but not in our case, because we start from a clean state in our benchmark. It's faster when things are cached though; we could assert that with specific benchmarks

#

That's why I want to add some in your benchmark workflow

random meadow Sep 17, 2024, 2:30 PM

#

@unique jungle Could you give me a review on https://github.com/dagger/dagger/pull/8476 please? /cc @modern halo
This will be super helpful when it's merged to know which setup is the fastest

GitHub

feat: package manager benchmark job by TomChv · Pull Request #8476 ...

unique jungle Sep 17, 2024, 3:09 PM

#

replied

#

Please use honeycomb when looking at that data

#

https://ui.honeycomb.io/dagger/environments/experimental/board/yUGn6ftYPtX/SDK-Benchmark

#

(and update the dashboard with the new data points)

random meadow Sep 17, 2024, 3:34 PM

#

PR updated, how can I try this workflow?

random meadow Sep 17, 2024, 4:02 PM

#

@unique jungle Need your help to run the test, I'm not sure it's possible from my PR directly with the permissions I have

random meadow Sep 18, 2024, 3:48 PM

#

@lament parrot This PR updates the benchmark. CI's green, but I'm not sure how to test it from my branch, do you have any idea? (Andrea is on PTO, so I'm asking you) https://github.com/dagger/dagger/pull/8476
