#Error on 0.19.9 upgrade
Couple questions:
- What's the output of
uname -a
on the machine where this is occurring?
- Does this happen right away with any command or did you successfully run some other commands before this that succeeded?
- If possible, what's the output of
sudo dmesg | grep overlay | grep -v 'meaningless'
on the machine where this is happening?
Darwin N1004907 25.2.0 Darwin Kernel Version 25.2.0: Tue Nov 4 20:46:55 PST 2025; root:xnu-12377.60.50.501.1~2/RELEASE_ARM64_T6030 arm64 arm Darwin
ah okay nevermind about 3) then, if you are on macos
- The only command so far that has succeeded for me is
dagger core version
other commands do not work:
dagger functions
dagger init --sdk go test
dagger develop
^ those all don't work
are you using docker desktop on macos? i.e. that vs orbstack or something else
when you have a sec, what's the output of:
docker run --rm --privileged -it alpine:latest uname -a
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
(basically same as before but now hopefully inside the docker desktop vm)
not sure tbh, depends on implementation details of docker desktop. It could plausibly matter though
Okay I am just running a quick test with Apple Virtualisation Framework
Will let you know if that works or not... then will run those commands
Interesting... it's working now for some reason
Okay will switch back to Docker VMM and see if it breaks again
very interesting indeed!
Okay it's working with Docker VMM now 🙂
Okay tell you what... will get some of my colleagues on this and see if we can reproduce
huh... I mean I'll take it, but that's definitely a bit mysterious
that sounds great, thank you
The first command I actually ran on my machine was dagger develop --recursive on our dagger mono repo.
That was the one that broke it originally. I am gonna try rerunning that again
well it's now working
wow come on
Sorry for the bad upgrade experience @jovial shard !
if it pops up again at any point let us know of course! I am at a loss as to what would have caused it to happen once and then disappear
Yeah who knows could be an absolute flake on my machine that decided to pop up just as I was testing the new release
A little gift to freak you out
I am gonna be upgrading our engines today hopefully so will let you know if anything comes up
Btw here are the results of those commands:
Docker VMM
docker run --rm --privileged -it alpine:latest uname -a
Linux 6e674a40ec10 6.12.54-linuxkit #1 SMP Tue Nov 4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>
Apple Virtualisation Framework
docker run --rm --privileged -it alpine:latest uname -a
Linux 745a02853958 6.12.54-linuxkit #1 SMP Tue Nov 4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>
thanks, that's still helpful! On the off-chance it does ever happen again, that second one with the greps would be useful to see the output of. Unfortunately the kernel puts useful information in the kernel logs while the error itself is just "invalid argument"
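Since the user-facing error is only "invalid argument", the kernel log is where the actual reason shows up. A minimal sketch of a helper that collapses repeated overlayfs messages into one line per distinct message; the function name is made up for illustration and is not part of dagger or docker:

```shell
# Summarise overlayfs kernel messages read from stdin (e.g. piped from dmesg).
# Drops the leading "[ 1585.795731]" timestamp so identical messages collapse,
# then prints each distinct message prefixed by how many times it occurred.
summarize_overlay_errors() {
  grep 'overlayfs:' \
    | grep -v 'meaningless' \
    | sed 's/^\[[^]]*] *//' \
    | sort \
    | uniq -c \
    | sed 's/^ *//'
}
```

One way to use it would be to pipe the dmesg output from the privileged alpine container into the function instead of into the two greps.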
right okay
I'll also try to get you a dump of the dagger logs
I can't share the trace of the very first error that I got due to it containing some company code however the error occurred in an asModule operation and it was:
failed to load dependencies as modules: failed to load module dependencies: failed to initialize module: failed to get type defs json during module sdk codegen: mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1043758443", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/149/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs,volatile,index=off,redirect_dir=off", err: invalid argument
I did actually restart docker desktop, and delete the dagger container after that
Perhaps this has "cleaned the state" so to speak and the issue has gone away as a result
@wise shoal tagging you for brevity
If you hit the same issue, please dump your logs here 🙂
Okay I hit the issue again:
dagger call \
--bao-addr=${BAO_ADDR} \
--bao-token=file://~/.vault-token \
--local-aws-credentials=file://~/.aws/credentials \
--local-aws-profile=${AWS_PROFILE} \
--local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
--ssh=${SSH_AUTH_SOCK} \
check-pull-request
✔ connect 0.3s
✘ load module: . 6.2s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule ›
✘ withExec codegen generate-typedefs --module-source-path /src/modules/dx --module-name dx --introspection-json-path /schema.json --output typedefs.json (
┆ experimentalPrivilegedNesting: true
┆ execMD: "
{\"ClientID\":\"ulpbyf0h7vonq4z0u13zp1s1u\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"offvnz76gswlelo3x58ew1qw9\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjVjNTdkNTc2ZmVhYjk4YzMSvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXQoVeHhoMzo1YzU3ZDU3NmZlYWI5OGMzEkQKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NWM1N2Q1NzZmZWFiOThjMw==\",\"EncodedModuleID\":\"ChV4eGgzOmYyMDg2YTI1ZDkxOThlNDESvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXwoVeHhoMzpmMjA4NmEyNWQ5MTk4ZTQxEkYKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6ZjIwODZhMjVkOTE5OGU0MVgB\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
) 0.1s ERROR
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit301354842", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/worker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err: invalid argument
docker run --rm --privileged -it alpine:latest uname -a
Linux 84c891249553 6.12.54-linuxkit #1 SMP Tue Nov 4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 1585.795731] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.796150] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.816909] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.838332] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.855822] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.881574] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.916250] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.918271] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.940469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.962526] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972130] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972689] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.974413] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.977663] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.984430] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.988866] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.989469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.990043] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.992682] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.996780] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.010497] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.031813] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.034578] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.035641] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037057] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037536] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.039058] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.040479] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.041607] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.043329] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.051474] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
Ah okay, thanks for the info dump. Based on that I suspect there's some codepath where cleanup is supposed to happen that doesn't always happen.
Was there anything obvious you did that triggered it? Or just keep using it and it randomly pops up?
Nothing obvious unfortunately
I took a break from running dagger commands for about 20min
Came back and decided to run our main dagger function for the repo
That triggered it...
ok, I will see if I can track down exactly what's happening, but also in the worst case I can probably add in some kludge to the engine code to see when this error happens and handle it. So we'll get a v0.19.10 out ASAP.
Not sure if this is related but I crashed when trying to replicate the issue. I bumped the version in the repo from 0.19.3 -> 0.19.9 and ran dagger develop --recursive.
! failed to generate code: Post "http://dagger/query": command [docker exec -i dagger-engine-v0.19.9 buildctl dial-stdio] has exited with exit status 137, make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
I've attached the engine logs, pretty sure the failed run begins at L6835. The logs before that line were from a dagger init --sdk go test in another directory.
Lots of:
time="2026-01-09T00:01:28Z" level=warning msg="failed to release network namespace \"qly2sfnlac7v0p0lq9zp6fv0w\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/qly2sfnlac7v0p0lq9zp6fv0w\": ef53"
followed by:
could not load snapshot...
❯ docker run --rm --privileged -it alpine:latest uname -a
Linux 46d36ec3380b 6.11.11-linuxkit #1 SMP Wed Oct 22 09:37:46 UTC 2025 aarch64 Linux
❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
I just ran it again and got an error similar to Chris
✘ withExec codegen generate-typedefs --module-source-path /src/modules/service-catalog --module-name service-catalog --introspection-json-path /schema.json --output typedefs.json (
┆ experimentalPrivilegedNesting: true
┆ execMD: "{\"ClientID\":\"rddzwfw7gadxzk28hu86uxvwq\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"q1ycgxiygqjba8s27okt71lyl\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjMxOTA1NTczYjlkYmZiM2ESXQoVeHhoMzozMTkwNTU3M2I5ZGJmYjNhEkQKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MzE5MDU1NzNiOWRiZmIzYRLCAQoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4EqgBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiWAoJcmVmU3RyaW5nEks6SS9Vc2Vycy9sdWtlLmJyYWtlbC93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9zZXJ2aWNlLWNhdGFsb2ciEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOFgBEn4KFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBJlChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIhkKBG5hbWUSEToPc2VydmljZS1jYXRhbG9nShV4eGgzOmUwYzRmZWE3MzcyYjZkOGQ=\",\"EncodedModuleID\":\"ChV4eGgzOjQ5YmMzNmZlODJkNTg3M2ESXwoVeHhoMzo0OWJjMzZmZTgyZDU4NzNhEkYKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NDliYzM2ZmU4MmQ1ODczYVgBEsIBChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSqAESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJYCglyZWZTdHJpbmcSSzpJL1VzZXJzL2x1a2UuYnJha2VsL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL3NlcnZpY2UtY2F0YWxvZyITCg1kaXNhYmxlRmluZFVwEgIYAUoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4WAESfgoVeHhoMzplMGM0ZmVhNzM3MmI2ZDhkEmUKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOBIQCgxNb2R1bGVTb3VyY2UYARoId2l0aE5hbWUiGQoEbmFtZRIROg9zZXJ2aWNlLWNhdGFsb2dKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZA==\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
) 0.1s ERROR
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1594712920", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/145/fs:/var/lib/dagger/worker/snapshots/snapshots/94/fs:/var/lib/dagger/worker/snapshots/snapshots/86/fs,volatile,index=off,redirect_dir=off", err: invalid argument
trying to repro locally. On my linux working perfectly. Switching to mac rn
Worked fine for me on remote Docker/linux, and remote Dagger-hosted sandbox... Trying to re-install docker-for-mac
Yeah I've been using v0.19.9 (and previously main builds) on Linux running similar sort of commands as what's repro'ing it above and haven't hit it yet... I really can't imagine how it's docker desktop specific, but I guess my imagination could be limited
Engine logs for the second failure
❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[112951.651719] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.711449] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.730308] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.779398] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.781942] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.814648] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.831404] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.876855] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.933803] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.946117] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.949512] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.953730] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.985612] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.015875] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.043365] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.047606] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
@jovial shard @wise shoal do you by any chance run the engine with custom CA certs to your knowledge?
there's an outside possibility of that mattering
Yeah we do
yep
okay, lemme try to repro with that in place...
yeah no bug on my end up to now
Here's the output of my docker inspect for the dagger container:
not sure if that's of any help whatsoever but haha, there you go
yep that confirms you're setting up the ca certs; was about to go through double checking we're on the same page there, so that helps 👍
haven't repro'd your exact error yet but did trigger something suspicious looking
Thanks for looking into this guys on short notice really appreciate!
having custom ca certs triggers some extra setup/teardown execution for each container, which I suspect may be creating a race around mounts. That'd also explain why our CI didn't catch this. We have a number of integ tests for CA certs but they only make up a small chunk of the 1000s, so something like this could have slipped through the cracks
right...
Erik, could it be a cleanup issue on work/incompat/volatile, for subsequent mounts ?
trying to repro now 👀
yes that's for sure what it is based on the kernel logs they sent, but the question is where/why is the cleanup missing since our bases should have been covered. Could be something related to the custom CA cert setup/teardown codepaths
I'd ideally like to find that exact root cause and patch it, but I think also no matter what I'll add some fallback handling where if we try to make a mount and hit this error, we just cleanup the incompat dir at that time. Not super duper ideal in the long term but probably the right move to prevent pain like this, at least in the medium term.
oh you mean the race ok 👍
Mmmh, still unable to hit it --'
I'm on kernel 6.10.14-linuxkit, updating docker desktop atm. i've been trying to hit it pretty hard
ok bumped, now i'm like chris and luke. At least this could be triggered just on the latest docker for mac version ?
docker run --rm alpine:latest uname -r
6.12.54-linuxkit
I did upgrade to the latest docker desktop quite recently
Luke is on an older version though as you can see if you look at his post above
forgot about it 🙏 I was on an even older one personally
you're getting err: invalid argument when creating a mount now? with or without custom ca certs?
time="2026-01-09T02:08:36Z" level=error msg="failed to create cacerts installer, falling back to not installing CA certs: invalid argument" span="dagop.ctr Container.withEx
verifying it's not a "me" error with the ca certs installs
(probably me)
that's not the error they are getting exactly; that's a non-fatal one
the one they are hitting actually results in a hard user-facing error
but could be related
mmh it was due to me (a symlink, f99307f9.0 -> test-repro-ca.crt, in the CA certs directory). Even without the symlink I have this error, so it could be related, but no hard error like theirs in 200+ runs
for the record we do symlink the ca-certificates directory like this:
ln -s "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"
Not sure if this helps at all, but worth looking into
I might try just copying the certs in there and see what happens
I think it's a red herring, it's just from a typed-nil problem hitting this: https://github.com/sipsma/dagger/blob/400ffd3e2a6e9bbff5cbd9476938db61d97681b8/engine/buildkit/containerfs/fs.go#L332 when it should have exited earlier https://github.com/sipsma/dagger/blob/400ffd3e2a6e9bbff5cbd9476938db61d97681b8/engine/buildkit/containerfs/fs.go#L294
which is worth a fix but not harmful and not what they are hitting
Getting my colleagues to do this:
rm "$HOME/Library/Application Support/dagger/ca-certificates"
cp -R "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"
and retest
will be away for 2 hours, will resume after 🙏
I have been running this configuration so far and haven't been able to reproduce the error
Will keep running with it and also will get my colleagues to try it out and see if this was it
Looking at it now
Thanks, yeah I surprisingly wasn't able to repro the same error you all were hitting even with a custom CA installed and then running 100s of integ tests against that engine. So hopefully that squashes it permanently for you.
@silent @jolly steppe (silent ping, for tomorrow), if they hit the problem again, do you think tomorrow you would have time to try implementing the fallback of "whenever mount is made, pre-emptively remove any work/incompat dir"? I have a separate issue I'm looking into for another user that I gotta continue with tomorrow
Unfortunately... we are still getting it:
make check-pull-request
dagger call \
--bao-addr=${BAO_ADDR} \
--bao-token=file://~/.vault-token \
--local-aws-credentials=file://~/.aws/credentials \
--local-aws-profile=${AWS_PROFILE} \
--local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
--ssh=${SSH_AUTH_SOCK} \
check-pull-request
✔ connect 0.3s
✘ load module: . 5.6s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule ›
✘ withExec codegen generate-typedefs --module-source-path /src/modules/aws --module-name aws --introspection-json-path /schema.json --output typedefs.json (
┆ experimentalPrivilegedNesting: true
┆ execMD: "
{\"ClientID\":\"86cwwdrgtnzajs4nhgqz5k204\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"ihdmopo3idqgxuo9q04p1gaf8\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjExNDZiOGU5NjcxOTQ1ZDASXQoVeHhoMzoxMTQ2YjhlOTY3MTk0NWQwEkQKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MTE0NmI4ZTk2NzE5NDVkMBK9AQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EqMBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiUwoJcmVmU3RyaW5nEkY6RC9Vc2Vycy9jaHJpc3RvcGhlci5wYWxtZXIvd29ya3NwYWNlL2xpYnJhcnktY2ktd29ya2Zsb3dzL21vZHVsZXMvYXdzIhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjMwZDI1YjBjOTIwY2FkYzZYARJyChV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDkSWQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSINCgRuYW1lEgU6A2F3c0oVeHhoMzo4ZThlNGIxN2U0OWFhZmQ5\",\"EncodedModuleID\":\"ChV4eGgzOjFmYjZmNGM1YzRmZTA1NzISXwoVeHhoMzoxZmI2ZjRjNWM0ZmUwNTcyEkYKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MWZiNmY0YzVjNGZlMDU3MlgBEr0BChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSowESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJTCglyZWZTdHJpbmcSRjpEL1VzZXJzL2NocmlzdG9waGVyLnBhbG1lci93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9hd3MiEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6MzBkMjViMGM5MjBjYWRjNlgBEnIKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORJZChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIg0KBG5hbWUSBToDYXdzShV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDk=\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
) 0.2s ERROR
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1689234798", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs:/var/lib/dagger/worker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err: invalid argument
The issue actually follows after an engine crash though...
- Engine crashes and I lose my first working session
- I run another dagger command.
- New dagger engine up
- Start getting the error
However weirdly the new engine does not reuse the old docker volume
So I have no idea why this would be occurring since it doesn't share any state with the previously running engine
docker run --rm --privileged -it alpine:latest uname -a
Linux d6a10ba7dbb2 6.12.54-linuxkit #1 SMP Tue Nov 4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 2401.461405] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.464349] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.467592] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.469147] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
Oh I missed that there was a crash. That explains this way more… the cleanup of the mount isn't happening because there was an actual crash. The cleanup is of on-disk state created by the kernel, so that state would persist across an engine restart.
We still need better handling of that case but fixing the crash is obviously even more important. I am afk atm but will look at the engine log output when back
Right, so is this error:
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit2284938268", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/163/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/163/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/157/fs:/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs,volatile,index=off,redirect_dir=off", err: invalid argument
caused by these not being cleaned up?:
# stat -f -c %T /var/lib/dagger/worker/snapshots/snapshots/163/work/work/incompat/volatile/dirty
ext2/ext3
Yes that's correct!
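The path checked above is the workdir from the mount error with work/incompat/volatile appended. A small sketch of deriving it from the error text, assuming the error is on a single line; the function name is hypothetical:

```shell
# Read a mount error on stdin and print the overlayfs "volatile" marker
# directory that belongs to its workdir= option, i.e.
#   <workdir>/work/incompat/volatile
# Prints nothing if the input has no workdir= option.
volatile_marker_from_error() {
  sed -n 's|.*workdir=\([^,"]*\).*|\1/work/incompat/volatile|p'
}
```

The printed directory can then be inspected inside the engine container, for example with ls or stat as in the message above.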
they should get cleaned up, but after a hard crash they won't. But theoretically the engine should have been discarding those mounts entirely after the crash, but it's not
So that's one bug. But more importantly I want to see why the engine is restarting in the first place
in the engine logs I didn't actually see a panic or anything
is it just getting manually restarted?
or is the crash stack trace just not showing up there?
I didn't see one, I'll nuke it all and see what I get
is it just getting manually restarted?
Nah it's crashing
Ah, unfortunately after the crash the container goes down with the logs so we don't get to see why it did that
Is there a way to tell dagger not to put --rm on the engine docker container it makes?
There's ways, but possibly simpler would be to remove the container, start it with just
dagger core version
and then run in a separate terminal
docker logs -f dagger-engine-v0.19.9 2>&1 | tee ~/engine.log
Then do the stuff that triggers the crash and you'll have the output in that file
[ 5315.992327] Out of memory: Killed process 1109 (dagger-engine) total-vm:2403788kB, anon-rss:371748kB, file-rss:716kB, shmem-rss:0kB, UID:0 pgtables:2472kB oom_score_adj:0
That's from dmesg inside the engine. The repo failing is a library that has over 20 module dependencies, each with their own test module.
Huh… RSS is only like 372MB which isn’t a lot. Is there some docker desktop limit being applied?
Here are the engine stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
9a6a83c20a7d dagger-engine-v0.19.9 0.00% 211.9MiB / 5.786GiB 3.58% 1.17kB / 126B 191MB / 512kB 20
Yes of course 😇
ps: I'm back, will dig the logs 🙏
I manually deleted the incompat overlays and it ran successfully using whatever was cached from the first run.
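A rough sketch of what that manual cleanup could look like. The snapshots root is taken from the paths in the errors above; the function names are invented here and this is not the engine's own cleanup code. Note that the kernel leaves the marker as a hint that the upper/work dirs may be inconsistent after an unclean shutdown, so removing it is only reasonable for disposable cache state like this, and with the engine stopped.

```shell
# List overlayfs "volatile" marker directories under a snapshots root.
# $1: snapshots root (assumed: /var/lib/dagger/worker/snapshots/snapshots)
list_volatile_markers() {
  find "$1" -type d -path '*/work/incompat/volatile' 2>/dev/null | sort
}

# Remove every marker found and report each removal.
# Assumption: nothing is currently mounting these snapshots.
remove_volatile_markers() {
  list_volatile_markers "$1" | while IFS= read -r dir; do
    rm -rf "$dir" && printf 'removed %s\n' "$dir"
  done
}
```

Running list_volatile_markers first gives a dry run of what remove_volatile_markers would delete.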
It's definitely using more memory than that. I crashed it again and it climbed to around the limit before ooming.
/ # dmesg | grep oom
[ 7909.167555] dagger-engine invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[ 7909.167634] oom_kill_process+0x144/0x360
[ 7909.167690] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 7909.168079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init,mems_allowed=0,global_oom,task_memcg=/docker/032e6c36b5e726638fa46d78b218b68555386fa899d18fe418b8f408b4e1bca9/init,task=dagger-engine,pid=31178,uid=0
[ 7909.168231] Out of memory: Killed process 31178 (dagger-engine) total-vm:2341756kB, anon-rss:461572kB, file-rss:9636kB, shmem-rss:0kB, UID:0 pgtables:2344kB oom_score_adj:0
It's a global oom, it seems that it's the entire VM running out of memory
constraint=CONSTRAINT_NONE
global_oom
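The two fields quoted above are what separate a VM-wide OOM from a container hitting its own cgroup limit. A minimal sketch that classifies oom-kill lines from dmesg; the function name is made up, and the cgroup branch relies on the kernel printing CONSTRAINT_MEMCG / oom_memcg= for memcg-limited kills:

```shell
# Read dmesg output on stdin and print, for each oom-kill summary line,
# whether the kill came from a cgroup memory limit or from the whole
# system (here: the Docker Desktop VM) running out of memory.
oom_scope() {
  awk '
    /oom-kill:/ {
      if ($0 ~ /CONSTRAINT_MEMCG/ || $0 ~ /oom_memcg=/) print "cgroup"
      else if ($0 ~ /global_oom/) print "global"
      else print "unknown"
    }'
}
```

A "global" result points at the VM's memory setting rather than at any per-container limit.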
Questions (if answered above sorry, i checked but might have missed messages):
- What's your Docker Desktop memory allocation?
  - Docker Desktop → Settings → Resources → Memory
  - (This sets the LinuxKit VM's total memory)
- Inside the engine, what do these show?
docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.max
docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.current
- What's the VM's total memory?
docker exec dagger-engine-v0.19.9 cat /proc/meminfo | head -5
Is your number lower than the 5.7gb of the engine that we set ?
If you can increase it for testing purposes, can you please try and make a run after the cleanup
I have this on my machine:
here, the limit being 5.7gb ? or your docker limit ?
What surprises me the most is that, as part of our release process, we bump recursively Dagger's internal components too (without CAs though), with the same command. And it's around the same amount of modules (i'll try with even more tomorrow to confirm).
It's totally possible that we have a memory leak or that we introduced something that consumes more memory since last release
By the way, do you have the same problem with 0.19.8 ?
5.7 GB is what docker stats reported back to me as the limit. I bumped it to 8GB and it still crashed. I'm going to push it to 12GB and 2GB swap and try again.
- Previous run was 8GB / 1GB Swap. The following will be with the updated resource values.
- Fresh engine, only using
dagger core version
/ # cat /sys/fs/cgroup/memory.max
max
/ # cat /sys/fs/cgroup/memory.current
93048832
- Also pre-run
/ # cat /proc/meminfo | head -5
MemTotal: 12235596 kB
MemFree: 11203196 kB
MemAvailable: 11612708 kB
Buffers: 169800 kB
Cached: 397192 kB
dagger develop --recursive w/ 12GB mem & 2GB swap
📉 Min Available: 1895 MB
📱 Max App RAM: 7500 MB (Active Anon)
🚨 Max Dirty: 1101 MB
🧱 Max Slab: 797 MB
🗄️ Max Cache: 4508 MB
🗺️ Max PageTbl: 52 MB
(I got gemini to write me a script to track it)
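The tracking script itself was not shared in the thread. A minimal sketch of what such a tracker could look like is below, assuming it samples `/proc/meminfo` and that the labels above map to `MemAvailable`, `Active(anon)`, `Dirty`, `Slab`, `Cached` and `PageTables` (that mapping is a guess, and all function names are hypothetical).

```python
import time

# Assumed mapping from the labels in the summary to /proc/meminfo fields.
FIELDS = ("MemAvailable", "Active(anon)", "Dirty", "Slab", "Cached", "PageTables")


def parse_meminfo(text: str) -> dict:
    """Return {field: value} for each 'Name:   value kB' line of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def update_peaks(peaks: dict, sample: dict) -> dict:
    """Track the minimum MemAvailable and the maximum of every other field."""
    for field in FIELDS:
        if field not in sample:
            continue
        pick = min if field == "MemAvailable" else max
        peaks[field] = pick(peaks.get(field, sample[field]), sample[field])
    return peaks


def watch(path: str = "/proc/meminfo", interval: float = 1.0) -> None:
    """Sample until interrupted with Ctrl-C, then print the peaks in MB."""
    peaks = {}
    try:
        while True:
            with open(path) as f:
                update_peaks(peaks, parse_meminfo(f.read()))
            time.sleep(interval)
    except KeyboardInterrupt:
        pass
    for field, kb in peaks.items():
        print(f"{field}: {kb // 1024} MB")
```

Running `watch()` inside the engine container (or the Docker Desktop VM) while `dagger develop --recursive` is in progress would give numbers comparable to the summary above. A one-second interval can miss short spikes.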
It ran successfully.
Thanks Luke 🙏
This seems to confirm that the OOM kills the engine, and leaves it in a weird state
Hypothesis: it crashed due to memory, then you guys restarted the engine, and I suppose that the volume still has the volatile directory (hard kill). Then, if you run any command since that point in time, it just crashes (due to that volatile dir).
So, I'll triple check tomorrow, but the cleanup fix that Erik suggested is actually useful to handle that kind of OOM dirty state
Now, the real question is: why is it OOMing right now for such a small scale ?
I suppose that you were already doing the recursive develop in the past? I'll track the perf between releases tomorrow
Can you please confirm what dagger version you're jumping from ?
Will follow up tomorrow, thank you very much for taking the time with Chris 😍, I should be able to repro🙏
It's the weekend for us tomorrow, I'll check back on Monday.
Now, the real question is: why is it OOMing right now for such a small scale ?
I'm not sure what your scale is, but we have a single "library" module with about 28 dependent modules (each with a test module). The main module is essentially the CI layer to generate/test/validate all the others.
I suppose that you were already doing the recursive develop in the past?
Correct
Can you please confirm what dagger version you're jumping from ?
0.19.3
Here are the stats from a run on 0.19.3 (Docker: 12GB + 2GB)
📉 Min Available: 4106 MB
📱 Max App RAM: 6888 MB (Active Anon)
🚨 Max Dirty: 606 MB
🧱 Max Slab: 734 MB
🗄️ Max Cache: 6480 MB
🗺️ Max PageTbl: 38 MB
And with less resources (Docker: 6GB + 1GB)
📉 Min Available: 445 MB
📱 Max App RAM: 3616 MB (Active Anon)
🚨 Max Dirty: 494 MB
🧱 Max Slab: 512 MB
🗄️ Max Cache: 3475 MB
🗺️ Max PageTbl: 35 MB
Thanks for the info, I am seeing if I can repro any super high memory usage/leaks. So far running stuff (develop --recursive and some expensive dagger calls) in our repo w/ quite a few dagger module dependencies has not replicated anything like what you're seeing, but will keep trying
Found something interesting.
dagger develop --recursive on your repo at different versions (found traces in your org):
You do indeed have a ton of modules, a few times more than us!
The fact that the newer engine version is quite a bit faster honestly might explain why you started hitting OOMs sometimes. I've seen in the past this sort of thing happen where unblocking CPU/IO bottlenecks increases peak memory usage since you can just allocate more in parallel faster than before. I'm suspecting it's that, given I can't replicate any sort of memory leak with develop --recursive (engine RSS always goes back down to baseline after forcing a gc cycle). The fact that the OOMs are inconsistent also suggests that the problem is sort of "borderline" rather than just some absurd memory usage bug.
We should work on improving the memory usage of course, but that might be a piecemeal effort over time. For the shorter term I think we should:
- Add a parallelism limit to dagger develop --recursive (maybe just num CPUs). That should stop the peak memory usage from going crazy in repos with tons of deps like you have. I suspect it won't actually slow it down very much if at all since the CPU is probably getting maxed out anyways with that much parallelism
- Fix the problem with the mounts getting cleaned up after a hard crash (which is what caused the original error that started this whole thread)
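To illustrate the first item: capping the number of in-flight tasks at the CPU count bounds peak memory because at most that many tasks hold allocations at once. The sketch below shows the general pattern only. It is not the engine's implementation (the engine is written in Go), and `run_bounded` is a hypothetical helper name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, limit=None):
    """Run zero-argument callables with at most `limit` in flight.

    Results come back in the same order as `tasks`. The default limit is the
    CPU count, mirroring the "maybe just num CPUs" suggestion.
    """
    limit = limit or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=limit) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]
```

With around 28 modules plus their test modules, an unbounded fan-out starts everything at once, while `run_bounded(tasks)` would keep only one task per CPU active and queue the rest.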
cc @jolly steppe
LOL @weary oar you made it too fast
Implementing those as follow up, thanks Erik 🙏
lemme know if I can give any pointers
Erik appreciate that analysis. I have seen things like that before, when you make improvements to apps that were previously blocked by I/O, they can now utilise the CPU much more and it's possible for them to run out of resources.
We are an interesting case study because we have a mono repo with a lot of modules. We actually have parallelism limits in some of our dagger functions for that repo, because on some machines, there was too much in parallel, progress would just grind to a halt!
We actually had to implement the same thing
I have noticed in the jump from 0.19.3, the engine has gotten so much zippier, so very likely it's just able to do more in parallel.
But the tricky part is: with auto-scale-out, that parallelism limit no longer makes sense: if we just hardcode it in the module, we're potentially leaving acceleration on the table
Perhaps it should be a client setting?
Yeah maybe. But right now it's just custom module logic. Not something the engine is exposed to. We could change that
System-wide parallelization throttle could make sense
Actually it might make sense even in a cluster, to keep cost under control 🙂
Right
We used to have that but it creates some really subtle and tricky deadlock scenarios. If you can keep scaling out forever (for some approximation of forever) that changes it of course though.
I guess what we could do is find some way of limiting “top-level” parallelism (ie number of checks, number of modules generating not including their deps, etc). That’d avoid the deadlock issues
And in mean time one off whack a moles will help 😄
Seems like a good idea
Thanks for jumping on this and not only looking at the first issue with mounts but also considering the reasons for the OOM kills 🙂
Hopefully the feedback is helping you guys discover ways we are using the product, and therefore improve it!
We love it ! 😍
Can't wait to get these fixes in and upgrade next week. We have plans to use checks and improvements to .env file to start building a very strong local experience for developers
Updated with your suggested implem Erik, resuming the dirty state cleanup PR
For the cleanup, how would you test it ? I'm having a hard time repro-ing ? Shall I manually lower my ram, try it until an OOM ? It's just that i'm gonna have a hard time making an integration test out of that 🤔
I guess I could lower my allocated ram and generate a 100 dependency module locally
You could manually test by just sigkilling the engine and then starting it again. Integ test would be a bit involved but you could probably do it with an engine as a service that you force kill in the middle of an operation. I would consider that to be nice to have but not necessary given how tough it might be to make consistent
Looking good guys 🙂
We're going to rollout dagger 0.19.9 to our team's runners and leave everyone else on 0.19.3 for now. Do you guys reckon the fixes would be out some time this week or would you be looking at batching them in the next release cycle?
👋 we're waiting for my fix for the cleanup of the weird state + potentially a few other bugfixes (and I'll let Erik confirm) but the idea is to release asap (probably this week)
cool sounds good
Erik just merged the second PR (youhou ! 😍 )
Awesome, currently planning to do the release our tomorrow morning!
Hey guys. We have deployed 0.19.9 to our CI runners and we are experiencing some weird issues where the engine starts failing its kube health check and becomes unresponsive.
Here is the output from a client trying to connect to the engine:
Run exec dagger call \
1 : connect
1 : [0.0s] | cloud url=https://dagger.cloud/nine/traces/a345ff2dd8ed5a6a55694a664246d778
2 : ┆ starting engine
2 : ┆ starting engine DONE [0.0s]
3 : ┆ connecting to engine
3 : ┆ [0.0s] | 23:09:23 INF connected name=dagger-platform-engineering-dagger-helm-engine-dc62z client-version=v0.19.9 server-version=v0.19.9
3 : ┆ connecting to engine DONE [0.0s]
4 : ┆ starting session
0.19.10 or 0.19.9 ?
In progress at this very moment, I guess try it out once it's out ahah 😇
right okay
just posting the engine logs for brevity
when 0.19.10 comes out, we'll update to that and see if it goes away
when the engine is in this state by the way, we can't terminate the engine pod in k8s.. it's like it's stuck
the way I got it to close last time, was execing into the pod and doing a killall on everything in the container.
sorry I realise those logs are ordered newest first
^ oldest first here
you can see in our logs we see a spike in errors around 10:07:25
in kube we see the pod is failing its healthcheck:
Events:
Type     Reason     Age                    From     Message
----     ------     ----                   ----     -------
Warning  Unhealthy  3m58s (x163 over 43m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s
Running ps aux in the container:
/ # ps aux
PID USER TIME COMMAND
1 root 1h43 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml
232 root 0:05 /usr/sbin/dnsmasq --keep-in-foreground --log-facility=- --log-debug -u root --conf-file=/var/run/containers/cni/dnsname/dagger/dnsmasq.conf
199708 root 0:00 [git]
199730 root 0:00 [git]
661657 root 0:00 [git]
661679 root 0:00 [git]
1232164 root 0:00 [runc:[2:INIT]]
1583779 root 0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/tribqpgdkgmxu25sbtc01ojl6 --keep tribqpgdkgmxu25sbt
1583835 root 0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1583995 root 0:04 bao server -dev -dev-root-token-id dev-only-token
1586579 root 0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/ob80b4d2f2ldlmcvyhy32am30 --keep ob80b4d2f2ldlmcvyh
1586706 root 0:00 /.init bao server -dev -dev-root-token-id bao-dev-token
1587208 root 0:04 bao server -dev -dev-root-token-id bao-dev-token
1587232 root 0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/z7gv71hnxyssw0gwtesfkrwl6 --keep z7gv71hnxyssw0gwte
1587356 root 0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1587600 root 0:04 bao server -dev -dev-root-token-id dev-only-token
1595301 root 0:00 dagger core version
1595320 root 0:00 sh
1595326 root 0:00 ps aux
I have it in a broken state right now, is there anything I could run on the pod to help you guys diagnose this one?
kernel logs could help, sudo dmesg
that's a very odd failure scenario
it feels like something deeply wrong like on the kernel level
e.g. 1232164 root 0:00 [runc:[2:INIT]], that's an intermediate state of a separate runc process that should be super short lived, so the fact that it's seemingly sitting there is extra odd
cat /proc/meminfo (or otherwise machine wide memory usage could help too), to check its not in swap hell or something
/ # cat /proc/meminfo
MemTotal: 64777188 kB
MemFree: 2363604 kB
MemAvailable: 52168964 kB
Buffers: 1584 kB
Cached: 45458216 kB
SwapCached: 0 kB
Active: 11956156 kB
Inactive: 38439244 kB
Active(anon): 15764 kB
Inactive(anon): 5217864 kB
Active(file): 11940392 kB
Inactive(file): 33221380 kB
Unevictable: 199448 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 1680 kB
Writeback: 0 kB
AnonPages: 5135376 kB
Mapped: 2024580 kB
Shmem: 298028 kB
KReclaimable: 5361856 kB
Slab: 9227904 kB
SReclaimable: 5361856 kB
SUnreclaim: 3866048 kB
KernelStack: 17584 kB
PageTables: 39860 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 32388592 kB
Committed_AS: 15087008 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 351864 kB
VmallocChunk: 0 kB
Percpu: 25696 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2048 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 30720 kB
FilePmdMapped: 2048 kB
Unaccepted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 2107820 kB
DirectMap2M: 62867456 kB
DirectMap1G: 1048576 kB
oh and df -h to check disks
/ # df -h
Filesystem Size Used Available Use% Mounted on
overlay 49.9G 8.7G 41.2G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 12.4G 5.1M 12.3G 0% /run/dagger
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/hosts
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /dev/termination-log
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/hostname
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/resolv.conf
shm 64.0M 0 64.0M 0% /dev/shm
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/dagger/engine.toml
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/dagger/engine.json
/dev/nvme1n1 2.3T 79.1G 2.2T 3% /var/lib/dagger
tmpfs 58.3G 12.0K 58.3G 0% /var/run/secrets/kubernetes.io/serviceaccount
/dev/nvme0n1p1 49.9G 8.7G 41.2G 17% /etc/dnsmasq-resolv.conf
overlay 49.9G 8.7G 41.2G 17% /etc/resolv.conf
overlay 2.3T 79.1G 2.2T 3% /tmp/rootfs1299463981
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs1299463981/etc/resolv.conf
/dev/nvme1n1 2.3T 79.1G 2.2T 3% /tmp/rootfs1299463981/etc/hosts
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs1299463981/.init
overlay 2.3T 79.1G 2.2T 3% /tmp/rootfs2552938690
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs2552938690/etc/resolv.conf
/dev/nvme1n1 2.3T 79.1G 2.2T 3% /tmp/rootfs2552938690/etc/hosts
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs2552938690/.init
overlay 2.3T 79.1G 2.2T 3% /tmp/rootfs3808259598
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs3808259598/etc/resolv.conf
/dev/nvme1n1 2.3T 79.1G 2.2T 3% /tmp/rootfs3808259598/etc/hosts
overlay 49.9G 8.7G 41.2G 17% /tmp/rootfs3808259598/.init
yep yesterday after upgrading we encountered this about 2 times
Is k8s trying to shut it down or something? The engine logs are all context canceled, which is what would happen when the engine is trying to shutdown after a SIGTERM for example
the health check has been failing for a while
I will just check
well it's a readiness probe not a liveness probe
so theoretically k8s shouldn't have tried to shut it down
These are the only events we have:
Events:
Type     Reason     Age                  From     Message
----     ------     ----                 ----     -------
Warning  Unhealthy  77s (x244 over 61m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s
the engine logs don't really show anything except the engine seemingly trying to shutdown, I'm guessing they just got cut off due to a log size limit?
I only took the engine logs from around when the health check started failing
I can grab everything
I don't remember k8s semantics enough, if readiness probes failed does that mean it never entered the "ready" state? so like it just is failing to start successfully vs. it was running fine and then later went into some unhealthy state
here we go
sorry the previous logs also contained other engine's logs because I pulled it from our logging infra
This is just the failing engine's pod logs
I believe we don't have debug logs enabled on our engines anymore unfortunately...
https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#readiness-probe
Readiness probes determine when a container is ready to accept traffic. This is useful when waiting for an application to perform time-consuming initial tasks that depend on its backing services; for example: establishing network connections, loading files, and warming caches. Readiness probes can also be useful later in the container’s lifecycle, for example, when recovering from temporary faults or overloads.
If the readiness probe returns a failed state, Kubernetes removes the pod from all matching service endpoints.
Readiness probes run on the container during its whole lifecycle.
We actually connect the client to the engine via a host mounted unix socket, so I doubt readiness probes are having any effect here at all
Are there any debug endpoints I can hit on the engine to get the current state out of it?
it's possible to enable debug endpoints but for this sort of thing they won't have any more information than the logs (they have cpu/memory/etc. profile type of info)
i'm still looking through the logs you sent
okay no worries
I'll leave it in the current state for another hour
then will restart the pod
yesterday when I restarted the pod (which required me to exec into the container and do a killall runc, killall dagger-engine), it came back up with no problems
I assume the engine is in some kind of borked state right now
When I exec into the container and run:
dagger core version
It hangs indefinitely
not sure if this is useful or not but the engine stopped emitting cache metrics at 10:07 ~AEST~ AEDT
sorry I'll give UTC
UTC 2026-01-14T23:07
yeah that's where I was looking at the logs, everything started becoming canceled like the engine was shutting down at 2026-01-14T23:07:22Z
And last log line is written at 2026-01-14T23:08:59Z
But looks like the process has continued to stick around long after...
Yeah its still there if I run ps aux
1 root 1h47 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml
do you happen to have metrics (memory, cpu, etc.) data from while it was still running? just curious if it like was consuming a bunch and then got requested to shutdown at 23:07
the node cpu spiked around that time
the node memory without cache metric is calculated as:
node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)
the memory the node has is 64GB so nowhere near the limit
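As a sanity check of that formula, here it is applied to the `/proc/meminfo` dump posted earlier in the thread. This is a rough sketch: that dump was taken while the engine was already stuck, not at the 23:07 spike, and `memory_without_cache_kb` is a hypothetical helper name.

```python
def memory_without_cache_kb(meminfo: dict) -> int:
    """MemTotal - (MemFree + Buffers + Cached), all values in kB."""
    return meminfo["MemTotal"] - (
        meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
    )


# Values copied from the /proc/meminfo output posted earlier in the thread.
sample = {
    "MemTotal": 64777188,
    "MemFree": 2363604,
    "Buffers": 1584,
    "Cached": 45458216,
}

used_kb = memory_without_cache_kb(sample)
print(used_kb)                        # 16953784
print(round(used_kb / 1024**2, 1))    # 16.2 (GiB)
```

That is roughly 16 GiB in use out of about 62 GiB, which is consistent with the node not being under memory pressure at the time of the dump.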
hm okay interesting...
Grafana only reports things after the fact and sometimes misses spikes
so its possible this is not representative
this is the node memory metric though so it should be accurate
Just an FYI we have sysdig running on this node as well
This could potentially be complicating our setup
it does sort of seem like there may have been a huge burst of containers trying to start right before everything went bad, based on the logs, though admittedly I have to infer that indirectly since the logs are not great.
One follow-up for this no matter what is to go do a pass on the logs we write and the levels for them; I spend so much time looking at debug logs I didn't realize how utterly useless the non-debug ones have become
did someone try to run dagger develop --recursive without the fix that's in v0.19.10 at 23:07 UTC? 🙂
haha no this is a CI pod
so its not getting used for that sort of thing
me wishes
hahaha
Do you have any knowledge of sysdig?
I haven't used it in a long time
Just noticing a few entries in our sysdig log around 23:07 UTC
yeah I'd take a look for sure
oh wait! I totally missed something crucial in the dmesg you sent earlier:
[68804.239285] dr=syscall_sins invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
...
[68804.240571] memory: usage 2764800kB, limit 2764800kB, failcnt 334918
[68804.262006] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[68804.262815] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice:
...
[68804.291303] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[68804.292791] [ 4471] 65535 4471 255 141 0 141 0 36864 0 -998 pause
[68804.316926] [1507141] 0 1507141 21051 12106 1312 10794 0 196608 0 -997 dr-monitor
[68804.318342] [1553644] 0 1553644 1255711 645279 608273 37006 0 6287360 0 -997 dr-agent
[68804.319823] [1553645] 0 1553645 21947 3189 2111 1078 0 143360 0 -997 dr-mounted_fs_r
[68804.321409] [1553646] 0 1553646 323142 15931 5471 10460 0 258048 0 -997 cointerface
[68804.322932] [1553647] 0 1553647 547238 10182 2322 7860 0 319488 0 -997 responder
[68804.326798] [1553648] 0 1553648 321385 11772 2837 8935 0 229376 0 -997 kspm-analyzer
[68804.328265] [1553649] 0 1553649 337601 12262 3719 8543 0 389120 0 -997 host-scanner
[68804.329788] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[68804.336127] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[68804.338327] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[68804.341454] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[68804.351550] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
OOM kill happened, but the engine wasn't killed
dr-agent, dr-monitor and dr=syscall_sins got killed
not sure what those are
or exactly how that would cause the dagger engine to start freaking out, but possible the engine was just trying to shutdown and got in a weird state during that
tbh I can't say this is for sure related, since the kernel logs include everything on the whole node, just seemed worth noting
unfortunately it's just a count since the machine started (I think seconds? but could be misremembering), so you have to find when the machine started to convert to absolute time
okay
Start Time: Wed, 14 Jan 2026 15:59:11 +1100
Okay so the machine was up at 2026-01-14T04:59
2026-01-15T12:05:55
so seems unrelated
no, that doesn't make sense because that's the future
Yeah Idk what to make of it.
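The conversion being attempted here is boot time plus the bracketed seconds value. A small sketch is below, with two caveats. First, the `Start Time` quoted above looks like the pod's start rather than the node's boot, which would explain the impossible result. Second, the boot time used here is back-computed from the `dmesg -T` output posted later in the thread, so it is an assumption (in whatever timezone `dmesg -T` printed), not a measured value.

```python
from datetime import datetime, timedelta


def dmesg_to_wallclock(boot_time: datetime, seconds_since_boot: float) -> datetime:
    """Convert a '[68804.239285]' style dmesg stamp to wall-clock time."""
    return boot_time + timedelta(seconds=seconds_since_boot)


# Assumption: node boot time, derived as 22:32:27 minus 68804 s (19h 6m 44s).
boot = datetime(2026, 1, 14, 3, 25, 43)

# Lands at about 2026-01-14 22:32:27, matching the `dmesg -T` stamp.
print(dmesg_to_wallclock(boot, 68804.239285))
```

In practice `dmesg -T` does this conversion directly, though its timestamps can drift if the machine has been suspended and resumed.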
The only other thing I can think of that might be interesting is current CPU/memory stats for the dagger engine process specifically
I think that is sysdig
Theory:
The kernel says the memory cgroup limit was 2,764,800 kB (~2.64 GiB) and it hit it repeatedly (failcnt 334918).
Then it killed dr-agent first (big RSS), and because memory.oom.group is set, it killed the rest of the processes in that same cgroup as a group.
Sysdig’s agent sits on the syscall path (driver/eBPF) and does heavy analysis. If it’s overloaded/restarting (OOM loop), you can get node-level jitter ?
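The numbers in that theory can be checked against the kernel log directly. The task table reports memory in pages while the "Killed process" lines report kB, so the sketch below converts between them, assuming 4 KiB pages (the matching figures bear that out). Helper names are hypothetical.

```python
PAGE_KB = 4  # assumption: 4 KiB pages


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM task table to kB."""
    return pages * PAGE_KB


def kb_to_gib(kb: int) -> float:
    """Convert kB (as printed by the kernel) to GiB."""
    return kb / 1024**2


limit_kb = 2764800                    # "limit 2764800kB" from the memcg report
dr_agent_anon = pages_to_kb(608273)   # rss_anon column for dr-agent
dr_agent_file = pages_to_kb(37006)    # rss_file column for dr-agent

print(round(kb_to_gib(limit_kb), 2))  # 2.64
print(limit_kb // 1024)               # 2700 (MiB)
print(dr_agent_anon, dr_agent_file)   # 2433092 148024
```

The last line reproduces the `anon-rss:2433092kB, file-rss:148024kB` figures in the "Killed process 1553644 (dr-agent)" entry, so dr-agent alone accounted for most of the 2700 MiB cgroup limit.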
I'd say that in general, things you could do to help debug this if it happens again:
- enable pprof metrics endpoints for the engine, which requires starting the dagger-engine process with e.g. --debugaddr=0.0.0.0:6060 (or whatever port would make sense for you)
- enable debug logs (--debug)
On our end, I'll try to clean up the logs we print on levels above debug so they can be actually useful again
Thanks Erik
Sorry for bothering you guys about it, could very well be an issue on our side
btw got the right times
[Wed Jan 14 22:32:27 2026] Tasks state (memory values in pages):
[Wed Jan 14 22:32:27 2026] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[Wed Jan 14 22:32:27 2026] [ 4471] 65535 4471 255 141 0 141 0 36864 0 -998 pause
[Wed Jan 14 22:32:27 2026] [1507141] 0 1507141 21051 12106 1312 10794 0 196608 0 -997 dr-monitor
[Wed Jan 14 22:32:27 2026] [1553644] 0 1553644 1255711 645279 608273 37006 0 6287360 0 -997 dr-agent
[Wed Jan 14 22:32:27 2026] [1553645] 0 1553645 21947 3189 2111 1078 0 143360 0 -997 dr-mounted_fs_r
[Wed Jan 14 22:32:27 2026] [1553646] 0 1553646 323142 15931 5471 10460 0 258048 0 -997 cointerface
[Wed Jan 14 22:32:27 2026] [1553647] 0 1553647 547238 10182 2322 7860 0 319488 0 -997 responder
[Wed Jan 14 22:32:27 2026] [1553648] 0 1553648 321385 11772 2837 8935 0 229376 0 -997 kspm-analyzer
[Wed Jan 14 22:32:27 2026] [1553649] 0 1553649 337601 12262 3719 8543 0 389120 0 -997 host-scanner
[Wed Jan 14 22:32:27 2026] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] dagger0: port 18(veth005471b0) entered disabled state
so it's happening at 22:32 which is about 35m before our weird spike in logs at 2026-01-14T23:07:22Z
probably not relevant
thanks guys.. I'll quit bothering you for now
and yeah we will do a few things on our end:
- upgrade to 0.19.10
- turn on debug logs
- turn on pprof
Okay we have done all of this now and I tested that I can hit pprof all good and get traces
So next time this happens, any particular pprof commands you are interested in?
Perhaps these?
Or to look at the goroutine blocking profile, after calling runtime.SetBlockProfileRate in your program:
go tool pprof http://localhost:6060/debug/pprof/block
Or to look at the holders of contended mutexes, after calling runtime.SetMutexProfileFraction in your program:
go tool pprof http://localhost:6060/debug/pprof/mutex
perhaps the pprof/trace is the most useful as it will show where the app is I guess
curl '<ip>:6060/debug/pprof/goroutine' > gr.pprof (or however you can best hit the endpoint and save the output) will dump goroutines, which would be very useful.
/debug/pprof/heap also may be helpful
BTW we got another report of a user hitting this, so almost certainly not something wrong in your infra or similar
Also, if the engine is unresponsive to anything including on those debug endpoints, a helpful last resort would be to just manually send SIGQUIT to it (kill -s QUIT <pid>), which should dump the goroutine stacks to its output (unless things are so bad the go runtime is also borked)
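For collecting those dumps in one go when the engine next hangs, a small helper along these lines might save time. It is only a sketch: `pprof_url` and `save_profile` are hypothetical names, port 6060 is taken from the `--debugaddr` example above, and the `debug=2` query parameter (which asks Go's standard pprof handler for plain-text goroutine stacks) is worth verifying against the engine's endpoint.

```python
import urllib.request
from typing import Optional


def pprof_url(host: str, profile: str, port: int = 6060,
              debug: Optional[int] = None) -> str:
    """Build a URL such as http://host:6060/debug/pprof/goroutine?debug=2."""
    url = f"http://{host}:{port}/debug/pprof/{profile}"
    return f"{url}?debug={debug}" if debug is not None else url


def save_profile(url: str, dest: str, timeout: float = 30.0) -> int:
    """Download one profile to `dest` and return the number of bytes written."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        data = resp.read()
        out.write(data)
    return len(data)
```

For example, `save_profile(pprof_url("<ip>", "goroutine"), "gr.pprof")` mirrors the curl command above, and the same call with `"heap"` grabs the heap profile. If the endpoint itself hangs, the timeout fires and SIGQUIT remains the fallback.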
Thanks Erik will stay on the lookout, we should be giving our 0.19.10 engine a heavy workout today, so will report back if we hit the bug
thanks! appreciate you working with us on tracking this down!
Yeah easy