#Performance degradation after upgrading from Dagger v0.16.3 to v0.18.9


potent oxide
#

Hi,

We are currently using Dagger v0.16.3 in production within our company. We provide internal Dagger modules written in Python to our product teams, some of which have dependencies on other Dagger modules.

We recently tried upgrading to version v0.18.9, but we noticed a significant performance degradation, particularly during module loading.

In both cases, the Dagger engine is using the cache, so we would not expect such a drop in performance.

Could you please confirm if there have been any engine-level changes between these versions that might explain this performance drop?

Here are the links to Dagger Cloud traces for comparison:
• v0.16.3 trace: https://dagger.cloud/betclicgroup/traces/fd6abcfcc76d09de0384588d3f9f87a7?listen=c9b8c0058bdd3b64&listen=f13770b6d1eee0cd&listen=fc8f4e8b661a1126
• v0.18.9 trace: https://dagger.cloud/betclicgroup/traces/66158ef88cba183e1b69adf45f5dd338?listen=ca3eb477f892ba9b&listen=c53cb2f1c9bf65b2&listen=28a0f0a24e3d381f&listen=9ce52c3a0ba0ffa8

Thank you in advance for your help.

floral sigil
#

@potent oxide it seems to me that the v0.18.9 calls are not being executed under the same conditions as the v0.16.3 ones.

For example, if you compare these spans from v0.16.3 and v0.18.9 respectively, you'll see that in the latter a number of module init operations do not appear to be cached:

https://dagger.cloud/betclicgroup/traces/fd6abcfcc76d09de0384588d3f9f87a7?span=2ec5f6951a392f6d

^ 16.3 here, you can see that all the Python module runtime initialization is cached

https://dagger.cloud/betclicgroup/traces/66158ef88cba183e1b69adf45f5dd338?span=45ac94db267e8f66

^ 18.9 here you can see that some of these operations are not being cached

potent oxide
#

Hi @floral sigil

Thanks for your reply.

What we’re trying to understand is precisely why load module operations are not being cached in v0.18.9. As far as we can tell, the tests were executed under the exact same conditions.

We are using GitHub Actions runners that only contain the Dagger CLI. Each command triggers execution against a persistent Dagger engine. The only thing that changes between runs is the Dagger version.

For example, here are two executions of the same Dagger module, run one after the other within the same GitHub Actions workflow. The commands clearly used some cache—otherwise, the full execution would have taken over 5 minutes. However, we noticed that the module loading time remains the same between both runs, suggesting that the module initialization step is not being cached, unlike what we observed with v0.16.3.

Here are the traces (this time we used version 0.18.8):
Run 1
Run 2

Is there any change in the engine or the caching logic in v0.18.X that could explain why module initialization is no longer cached?

iron prairie
#

Poked around a bit at this and tried a few things based on some diffs I noticed in the two runs you shared.

One thing I noticed is that we seem to end up invalidating cache when the SSH_AUTH_SOCK value changes (even if the actual auth being provided is no different).

That invalidation doesn't make a huge difference in my local runs, but could easily vary depending on the module dependencies.

#

Checking to see how quick of a fix that is

#

@keen thistle I forget what the reason for including the CurrentID is here: https://github.com/sipsma/dagger/blob/4116659290b496b2e749973e0484c8097834151a/core/schema/git.go#L625-L625

But I think that's at least one possible cause of the above issue, where the SSH auth sock path changing invalidates the Python SDK cache:

  • Their Run 1, zoomed in: the Python SDK mounts a dir at /src/xxh3:ce91349520610781; if you expand the args of that WithMountDirectory you see "/tmp/ssh-XXXXXXFAb28l/agent.81"
  • Their Run 2, zoomed in: the Python SDK mounts a dir at /src/xxh3:f0855ab3e6858635; if you expand the args of that WithMountDirectory you see "/tmp/ssh-XXXXXXKc3IFo/agent.79"
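
To make the suspected mechanism concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a cache key is a digest over an operation's arguments; `cache_key`, `mount_directory`, and the `SHA256:example` fingerprint are hypothetical and are not Dagger's actual implementation. It only shows that hashing the socket path changes the key whenever the agent restarts, while hashing a stable identity does not.

```python
import hashlib
import json


def cache_key(op: str, args: dict) -> str:
    """Hypothetical cache key: SHA-256 over the operation name and its arguments."""
    payload = json.dumps({"op": op, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def key_with_socket_path(sock_path: str) -> str:
    # The socket path is part of the hashed arguments, so a restarted
    # agent (new temp path) yields a new key even if the keys it serves
    # are unchanged.
    return cache_key("mount_directory", {"source": "/src", "ssh_auth_sock": sock_path})


def key_with_identity(fingerprint: str) -> str:
    # Hash what the agent provides (a made-up fingerprint here),
    # not where it happens to listen.
    return cache_key("mount_directory", {"source": "/src", "ssh_identity": fingerprint})


# Socket paths taken from the two runs above.
run1 = key_with_socket_path("/tmp/ssh-XXXXXXFAb28l/agent.81")
run2 = key_with_socket_path("/tmp/ssh-XXXXXXKc3IFo/agent.79")
print("path-based keys match:", run1 == run2)

same1 = key_with_identity("SHA256:example")
same2 = key_with_identity("SHA256:example")
print("identity-based keys match:", same1 == same2)
```

Under that assumption, the path-based keys differ between the two runs and the identity-based keys stay equal, which is the behavior a content-hash approach would aim for.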

Would the content hashing in the withExec PR help at all?

#

(sorry for the late ping, forgot to do @silent 😬)

potent oxide
#

Hi @iron prairie

Thank you very much for the analysis. Indeed, I was able to reproduce the behavior locally by changing the value of SSH_AUTH_SOCK. I can also confirm that with version 0.16.3, changing the SSH_AUTH_SOCK value does not invalidate the module cache.

On our CI, the SSH agent is started on each run, which causes the SSH_AUTH_SOCK value to be different every time.

For us, this has a significant impact on CI execution time. For example, in the cases shown above, the duration goes from 1m10s to 1m34s, which is a 34% increase. It’s worth noting that in this example, the module being used has two dependencies, which in turn may have their own dependencies as well.
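
As a quick check of that figure, the arithmetic can be sketched as follows; the helper name `to_seconds` is only illustrative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration to whole seconds."""
    return minutes * 60 + seconds


before = to_seconds(1, 10)  # 1m10s, the shorter of the two durations
after = to_seconds(1, 34)   # 1m34s, the longer of the two durations

delta = after - before
increase_pct = delta / before * 100

print(f"{delta}s slower, {increase_pct:.0f}% increase")
```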

Given the current situation, we’re unable to migrate to version 0.18.x, so a fix would be highly appreciated 🙂

keen thistle
#

so the line you point to is just the mixin for the content hash to prevent the weird potential collision where you might be able to get the contents of a private repo

the actual computation of the digest changed though during resolution of ref. see https://github.com/dagger/dagger/pull/10271/files#r2060529944

#

lemme try undoing that, and dive into what's going on
