#Error on 0.19.9 upgrade


weary oar
#

Couple questions:

  1. What's the output of uname -a on the machine where this is occurring?
  2. Does this happen right away with any command or did you successfully run some other commands before this that succeeded?
  3. If possible, what's the output of sudo dmesg | grep overlay | grep -v 'meaningless' on the machine where this is happening?
jovial shard
#
Darwin N1004907 25.2.0 Darwin Kernel Version 25.2.0: Tue Nov  4 20:46:55 PST 2025; root:xnu-12377.60.50.501.1~2/RELEASE_ARM64_T6030 arm64 arm Darwin
jovial shard
#
  1. The only command so far that has succeeded for me is dagger core version; other commands do not work:
  • dagger functions
  • dagger init --sdk go test
#

dagger develop

#

^ those all don't work

weary oar
#

are you using docker desktop on macos? i.e. that vs orbstack or something else

jovial shard
#

Docker Desktop on MacOS yes

#

I will try the other VMM option and report back

weary oar
# jovial shard Docker Desktop on MacOS yes

when you have a sec, what's the output of:

  1. docker run --rm --privileged -it alpine:latest uname -a
  2. docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
#

(basically same as before but now hopefully inside the docker desktop vm)

jovial shard
#

Does it matter which VMM I am on?

#

I just switched to Apple Virtualization framework

jovial shard
#

Okay I am just running a quick test with Apple Virtualisation framework

#

Will let you know if that works or not... then will run those commands

#

Interesting... it's working now for some reason

#

Okay will switch back to Docker VMM and see if it breaks again

jovial shard
#

Okay it's working on Docker VMM now 🙂

#

Okay tell you what... will get some of my colleagues on this and see if we can reproduce

jovial shard
#

The first command I actually ran on my machine was dagger develop --recursive on our dagger mono repo.

#

That was the one that broke it originally. I am gonna try rerunning that again

#

well it's now working

#

wow common

uncut roost
#

Sorry for the bad upgrade experience @jovial shard !

jovial shard
#

hahahaha

#

it's okay, there's a lot in flux right now, I understand

weary oar
#

if it pops up again at any point let us know of course! I am at a loss as to what would have caused it to happen once and then disappear

jovial shard
#

Yeah who knows could be an absolute flake on my machine that decided to pop up just as I was testing the new release

#

A little gift to freak you out

#

I am gonna be upgrading our engines today hopefully so will let you know if anything comes up

#

Btw here are the results of those commands:

Docker VMM

docker run --rm --privileged -it alpine:latest uname -a
Linux 6e674a40ec10 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>

Apple Virtualisation Framework

docker run --rm --privileged -it alpine:latest uname -a
Linux 745a02853958 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>
jovial shard
#

right okay

#

I'll also try to get you a dump of the dagger logs

#

I can't share the trace of the very first error that I got because it contains some company code; however, the error occurred in an asModule operation and it was:

failed to load dependencies as modules: failed to load module dependencies: failed to initialize module: failed to get type defs json during module sdk codegen: mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1043758443", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/149/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs,volatile,index=off,redirect_dir=off", err: invalid argument
#

I did actually restart docker desktop, and delete the dagger container after that

#

Perhaps this has "cleaned the state" so to speak and the issue has gone away as a result

jovial shard
#

@wise shoal tagging you for brevity

#

If you hit the same issue, please dump your logs here 🙂

#

Okay I hit the issue again:

dagger call \
        --bao-addr=${BAO_ADDR} \
        --bao-token=file://~/.vault-token \
        --local-aws-credentials=file://~/.aws/credentials \
        --local-aws-profile=${AWS_PROFILE} \
        --local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
        --ssh=${SSH_AUTH_SOCK} \
check-pull-request

✔ connect 0.3s
✘ load module: . 6.2s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule › 
✘ withExec codegen generate-typedefs --module-source-path /src/modules/dx --module-name dx --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true
  ┆ execMD: "
#
{\"ClientID\":\"ulpbyf0h7vonq4z0u13zp1s1u\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"offvnz76gswlelo3x58ew1qw9\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjVjNTdkNTc2ZmVhYjk4YzMSvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXQoVeHhoMzo1YzU3ZDU3NmZlYWI5OGMzEkQKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NWM1N2Q1NzZmZWFiOThjMw==\",\"EncodedModuleID\":\"ChV4eGgzOmYyMDg2YTI1ZDkxOThlNDESvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXwoVeHhoMzpmMjA4NmEyNWQ5MTk4ZTQxEkYKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6ZjIwODZhMjVkOTE5OGU0MVgB\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
  ) 0.1s ERROR
#
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit301354842", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/va
  r/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/sn
  apshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/w
  orker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err: invalid argument
#
docker run --rm --privileged -it alpine:latest uname -a
Linux 84c891249553 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 1585.795731] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.796150] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.816909] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.838332] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.855822] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.881574] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.916250] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.918271] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.940469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
#
[ 1585.962526] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972130] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972689] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.974413] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.977663] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.984430] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.988866] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.989469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.990043] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.992682] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.996780] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.010497] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.031813] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.034578] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.035641] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037057] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037536] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.039058] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.040479] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.041607] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.043329] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.051474] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
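For anyone collecting the same evidence later, here is a minimal sketch of that dmesg filter as reusable shell functions. The names `overlay_errors` and `count_volatile_errors` are made up for illustration and are not part of any tool. Both only read kernel log text on stdin, so they can be fed from `dmesg` inside the Docker Desktop VM or from a saved log file.

```shell
# overlay_errors: read kernel log text on stdin and print only the overlayfs
# lines, dropping the noisy "meaningless" warnings (same filter as the
# one-liner used in this thread). Hypothetical helper name.
overlay_errors() {
  grep overlay | grep -v 'meaningless'
}

# count_volatile_errors: count the lines where the kernel refused a mount
# because of the 'volatile' incompat feature. Prints 0 when there are none.
count_volatile_errors() {
  grep -c "incompat feature 'volatile'" || true
}
```

Usage would look something like `dmesg | overlay_errors | count_volatile_errors`, run in the same privileged alpine container as above.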
weary oar
#

Ah okay, thanks for the info dump. Based on that I suspect there's some codepath where cleanup is supposed to happen that doesn't always happen.

Was there anything obvious you did that triggered it? Or just keep using it and it randomly pops up?

jovial shard
#

Nothing obvious unfortunately

#

I took a break from running dagger commands for about 20min

#

Came back and decided to run our main dagger function for the repo

#

That triggered it...

weary oar
#

ok, I will see if I can track down exactly what's happening, but also in the worst case I can probably add in some kludge to the engine code to see when this error happens and handle it. So we'll get a v0.19.10 out ASAP.

jovial shard
#

yep cool

#

happy to test for you off the main branch as well

wise shoal
#

Not sure if this is related but I crashed when trying to replicate the issue. I bumped the version in the repo from 0.19.3 -> 0.19.9 and ran dagger develop --recursive.

! failed to generate code: Post "http://dagger/query": command [docker exec -i dagger-engine-v0.19.9 buildctl dial-stdio] has exited with exit status 137, make
  sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

I've attached the engine logs, pretty sure the failed run begins at L6835. The logs before that line were from a dagger init --sdk go test in another directory.

Lots of:

time="2026-01-09T00:01:28Z" level=warning msg="failed to release network namespace \"qly2sfnlac7v0p0lq9zp6fv0w\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/qly2sfnlac7v0p0lq9zp6fv0w\": ef53"

followed by:

could not load snapshot...
❯ docker run --rm --privileged -it alpine:latest uname -a
Linux 46d36ec3380b 6.11.11-linuxkit #1 SMP Wed Oct 22 09:37:46 UTC 2025 aarch64 Linux

❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
#

I just ran it again and got an error similar to Chris

#
✘ withExec codegen generate-typedefs --module-source-path /src/modules/service-catalog --module-name service-catalog --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true
#
  ┆ execMD: "{\"ClientID\":\"rddzwfw7gadxzk28hu86uxvwq\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"q1ycgxiygqjba8s27okt71lyl\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjMxOTA1NTczYjlkYmZiM2ESXQoVeHhoMzozMTkwNTU3M2I5ZGJmYjNhEkQKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MzE5MDU1NzNiOWRiZmIzYRLCAQoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4EqgBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiWAoJcmVmU3RyaW5nEks6SS9Vc2Vycy9sdWtlLmJyYWtlbC93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9zZXJ2aWNlLWNhdGFsb2ciEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOFgBEn4KFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBJlChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIhkKBG5hbWUSEToPc2VydmljZS1jYXRhbG9nShV4eGgzOmUwYzRmZWE3MzcyYjZkOGQ=\",\"EncodedModuleID\":\"ChV4eGgzOjQ5YmMzNmZlODJkNTg3M2ESXwoVeHhoMzo0OWJjMzZmZTgyZDU4NzNhEkYKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NDliYzM2ZmU4MmQ1ODczYVgBEsIBChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSqAESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJYCglyZWZTdHJpbmcSSzpJL1VzZXJzL2x1a2UuYnJha2VsL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL3NlcnZpY2UtY2F0YWxvZyITCg1kaXNhYmxlRmluZFVwEgIYAUoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4WAESfgoVeHhoMzplMGM0ZmVhNzM3MmI2ZDhkEmUKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOBIQCgxNb2R1bGVTb3VyY2UYARoId2l0aE5hbWUiGQoEbmFtZRIROg9zZXJ2aWNlLWNhdGFsb2dKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZA==\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
#
  ) 0.1s ERROR
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1594712920", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots
  /snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/
  153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/145/fs:/var
  /lib/dagger/worker/snapshots/snapshots/94/fs:/var/lib/dagger/worker/snapshots/snapshots/86/fs,volatile,index=off,redirect_dir=off", err: invalid argument
jolly steppe
#

trying to repro locally. It works perfectly on my Linux machine. Switching to mac rn

uncut roost
#

Worked fine for me on remote Docker/linux, and remote Dagger-hosted sandbox... Trying to re-install docker-for-mac

weary oar
#

Yeah I've been using v0.19.9 (and previously main builds) on Linux running similar sort of commands as what's repro'ing it above and haven't hit it yet... I really can't imagine how it's docker desktop specific, but I guess my imagination could be limited

wise shoal
#
❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[112951.651719] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.711449] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.730308] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.779398] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.781942] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.814648] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.831404] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.876855] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.933803] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.946117] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.949512] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.953730] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.985612] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.015875] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.043365] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.047606] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
weary oar
#

@jovial shard @wise shoal do you by any chance run the engine with custom CA certs to your knowledge?

#

there's an outside possibility of that mattering

wise shoal
#

Yeah we do

jovial shard
#

yep

weary oar
#

okay, lemme try to repro with that in place...

jolly steppe
#

yeah no bug on my end up to now

jovial shard
#

Here's the output of my docker inspect for the dagger container:

#

not sure if that's of any help whatsoever but haha, there you go

jovial shard
#

Thanks for looking into this on short notice guys, really appreciate it!

weary oar
#

having custom ca certs triggers some extra setup/teardown execution for each container, which I suspect may be creating a race around mounts. That'd also explain why our CI didn't catch this. We have a number of integ tests for CA certs but they only make up a small chunk of the 1000s, so something like this could have slipped through the cracks

uncut roost
#

Error on 0.19.9 upgrade

jovial shard
#

right...

jolly steppe
#

Erik, could it be a cleanup issue on work/incompat/volatile, for subsequent mounts ?

#

trying to repro now 👀

weary oar
#

I'd ideally like to find that exact root cause and patch it, but I think also no matter what I'll add some fallback handling where if we try to make a mount and hit this error, we just cleanup the incompat dir at that time. Not super duper ideal in the long term but probably the right move to prevent pain like this, at least in the medium term.
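To make the fallback idea concrete, here is a rough shell sketch of "try the mount, and if it fails with a stale incompat dir present, remove it and retry once". This is illustration only: the real fix would live in the engine's Go code, and the function name and calling convention are invented here. The `<workdir>/work/incompat` location is how the kernel's overlayfs lays out its volatile marker. Removing it assumes the upperdir contents can still be trusted.

```shell
# mount_with_fallback WORKDIR CMD [ARGS...]
# Run the given mount command. If it fails and a stale overlayfs "incompat"
# directory exists under WORKDIR, remove that directory and retry once.
# Hypothetical helper, not engine code.
mount_with_fallback() {
  workdir=$1
  shift
  # First attempt: nothing else to do if the mount succeeds.
  "$@" && return 0
  # Only retry when the stale marker directory is actually present.
  [ -d "$workdir/work/incompat" ] || return 1
  rm -rf -- "$workdir/work/incompat"
  # Second and last attempt; its exit status is the function's result.
  "$@"
}
```

CMD would be whatever performs the overlay mount. The retry is deliberately limited to a single attempt so a mount that fails for another reason still surfaces its error.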

jolly steppe
#

Mmmh, still unable to hit it --'

#

I'm on kernel 6.10.14-linuxkit, updating docker desktop atm. i've been trying to hit it pretty hard

#

ok bumped, now I'm like Chris and Luke. Could this be triggered only on the latest Docker for Mac version?

docker run --rm alpine:latest uname -r
6.12.54-linuxkit
jovial shard
#

I did upgrade to the latest docker desktop quite recently

#

Luke is on an older version though as you can see if you look at his post above

jolly steppe
#

time="2026-01-09T02:08:36Z" level=error msg="failed to create cacerts installer, falling back to not installing CA certs: invalid argument" span="dagop.ctr Container.withEx

verifying it's not a "me" error with the ca certs installs

#

(probably me)

weary oar
#

the one they are hitting actually results in a hard user-facing error

#

but could be related

jovial shard
#

for the record we do symlink the ca-certificates directory like this:

ln -s "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"
#

Not sure if this helps at all, but worth looking into

#

I might try just copying the certs in there and see what happens

jovial shard
#

Getting my colleagues to do this:

rm "$HOME/Library/Application Support/dagger/ca-certificates"
cp -R "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"

and retest
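The same swap can be wrapped in a small function that is safe to run twice. This is only a sketch: `replace_symlink_with_copy` is a made-up name, and it assumes the link points at a directory, as in the setup described above.

```shell
# replace_symlink_with_copy LINK
# If LINK is a symlink to a directory, replace it with a real copy of the
# directory it points to. Does nothing when LINK is not a symlink.
# Hypothetical helper for illustration.
replace_symlink_with_copy() {
  link=$1
  [ -L "$link" ] || return 0
  # Resolve the physical directory behind the link before removing it.
  target=$(cd -P "$link" && pwd) || return 1
  rm -- "$link"
  cp -R -- "$target" "$link"
}
```

It would be called with the ca-certificates path from the commands above as its only argument.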

jolly steppe
#

will be away for 2 hours, will resume after 🙏

jovial shard
#

if you need to sleep / kids please do that

#

there is always tomorrow

jovial shard
#

I have been running this configuration so far and haven't been able to reproduce the error

#

Will keep running with it and also will get my colleagues to try it out and see if this was it

wise shoal
#

Looking at it now

weary oar
# jovial shard I have been running this configuration so far and haven't been able to reproduce...

Thanks, yeah I surprisingly wasn't able to repro the same error you all were hitting even with a custom CA installed and then running 100s of integ tests against that engine. So hopefully that squashes it permanently for you.

@silent @jolly steppe (silent ping, for tomorrow), if they hit the problem again, do you think tomorrow you would have time to try implementing the fallback of "whenever mount is made, pre-emptively remove any work/incompat dir"? I have a separate issue I'm looking into for another user that I gotta continue with tomorrow

jovial shard
#

Unfortunately... we are still getting it:

make check-pull-request
dagger call \
    --bao-addr=${BAO_ADDR} \
    --bao-token=file://~/.vault-token \
    --local-aws-credentials=file://~/.aws/credentials \
    --local-aws-profile=${AWS_PROFILE} \
    --local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
    --ssh=${SSH_AUTH_SOCK} \
check-pull-request
✔ connect 0.3s
✘ load module: . 5.6s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule › 
✘ withExec codegen generate-typedefs --module-source-path /src/modules/aws --module-name aws --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true
  ┆ execMD: "
#
{\"ClientID\":\"86cwwdrgtnzajs4nhgqz5k204\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"ihdmopo3idqgxuo9q04p1gaf8\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjExNDZiOGU5NjcxOTQ1ZDASXQoVeHhoMzoxMTQ2YjhlOTY3MTk0NWQwEkQKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MTE0NmI4ZTk2NzE5NDVkMBK9AQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EqMBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiUwoJcmVmU3RyaW5nEkY6RC9Vc2Vycy9jaHJpc3RvcGhlci5wYWxtZXIvd29ya3NwYWNlL2xpYnJhcnktY2ktd29ya2Zsb3dzL21vZHVsZXMvYXdzIhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjMwZDI1YjBjOTIwY2FkYzZYARJyChV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDkSWQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSINCgRuYW1lEgU6A2F3c0oVeHhoMzo4ZThlNGIxN2U0OWFhZmQ5\",\"EncodedModuleID\":\"ChV4eGgzOjFmYjZmNGM1YzRmZTA1NzISXwoVeHhoMzoxZmI2ZjRjNWM0ZmUwNTcyEkYKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MWZiNmY0YzVjNGZlMDU3MlgBEr0BChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSowESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJTCglyZWZTdHJpbmcSRjpEL1VzZXJzL2NocmlzdG9waGVyLnBhbG1lci93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9hd3MiEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6MzBkMjViMGM5MjBjYWRjNlgBEnIKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORJZChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIg0KBG5hbWUSBToDYXdzShV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDk=\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
  ) 0.2s ERROR
#
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1689234798", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapsh
  ots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snap
  shots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs:/var/lib/dagger/worker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err:
  invalid argument
#

The issue actually follows after an engine crash though...

#
  1. Engine crashes and I lose my first working session
  2. I run another dagger command.
  3. New dagger engine up
  4. Start getting the error
#

However weirdly the new engine does not reuse the old docker volume

#

So I have no idea why this would be occurring since it doesn't share any state with the previously running engine

#
docker run --rm --privileged -it alpine:latest uname -a
Linux d6a10ba7dbb2 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 2401.461405] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.464349] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.467592] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.469147] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
weary oar
# jovial shard 1. Engine crashes and I lose my first working session 2. I run another dagger co...

Oh I missed that there was a crash. That explains this way more… the cleanup of the mount isn’t happening because there was an actual crash. What needs cleaning up is on-disk state created by the kernel, so it persists across engine restarts.

We still need better handling of that case but fixing the crash is obviously even more important. I am afk atm but will look at the engine log output when back

wise shoal
#

Right, so is this error:

! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit2284938268", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/163/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/163/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/157/fs:/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs,volatile,index=off,redirect_dir=off", err: invalid argument

caused by these not being cleaned up?:

# stat -f -c %T /var/lib/dagger/worker/snapshots/snapshots/163/work/work/incompat/volatile/dirty
ext2/ext3
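As a sketch of how to get from the error text to the directory checked above: the marker sits at `<workdir>/work/incompat/volatile`, where `<workdir>` is the `workdir=` option in the failing mount. The helper name below is hypothetical and it only parses text, so it is safe to run anywhere.

```shell
# volatile_marker_path: read a mount error message on stdin and print the
# path of the overlayfs "volatile" marker directory under its workdir.
# Returns non-zero when no workdir= option is found. Hypothetical helper.
volatile_marker_path() {
  # Pull out the value of the workdir= mount option (up to the next comma).
  workdir=$(sed -n 's/.*workdir=\([^,"]*\).*/\1/p' | head -n 1)
  [ -n "$workdir" ] || return 1
  # The kernel keeps the marker at <workdir>/work/incompat/volatile.
  printf '%s\n' "$workdir/work/incompat/volatile"
}
```

Piping the error above into it gives the parent directory of the `dirty` entry that was stat'ed.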
weary oar
#

they should get cleaned up, but after a hard crash they won't be. Theoretically the engine should be discarding those mounts entirely after the crash, but it's not

#

So that's one bug. But more importantly I want to see why the engine is restarting in the first place

#

in the engine logs I didn't actually see a panic or anything

#

is it just getting manually restarted?

#

or is the crash stack trace just not showing up there?

wise shoal
#

I didn't see one, I'll nuke it all and see what I get

#

is it just getting manually restarted?
Nah it's crashing

jovial shard
#

Ah, unfortunately after the crash the container goes down with the logs so we don't get to see why it did that

#

Is there a way to tell dagger not to put --rm on the engine docker container it makes?

wise shoal
#
[ 5315.992327] Out of memory: Killed process 1109 (dagger-engine) total-vm:2403788kB, anon-rss:371748kB, file-rss:716kB, shmem-rss:0kB, UID:0 pgtables:2472kB oom_score_adj:0
#

That's from dmesg inside the engine. The repo failing is a library that has over 20 module dependencies, each with their own test module.
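A small sketch for pulling the interesting numbers out of lines like that one: which process was killed and how much anonymous memory it held. `oom_killed_summary` is a hypothetical name and it only parses log text from stdin.

```shell
# oom_killed_summary: read kernel log text on stdin and, for each
# "Out of memory: Killed process" line, print "<name> <anon-rss in MiB>".
# Hypothetical helper for illustration.
oom_killed_summary() {
  sed -n 's/.*Out of memory: Killed process [0-9]* (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2/p' |
    while read -r name kb; do
      # Integer division: kB to MiB, rounded down.
      printf '%s %s\n' "$name" "$((kb / 1024))"
    done
}
```

For the line above this shows the engine process itself holding only a few hundred MiB of anonymous memory when it was picked by the OOM killer.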

wise shoal
#

Here are the engine stats

CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O       PIDS
9a6a83c20a7d   dagger-engine-v0.19.9   0.00%     211.9MiB / 5.786GiB   3.58%     1.17kB / 126B   191MB / 512kB   20
wise shoal
#

I manually deleted the incompat overlays and it ran successfully using whatever was cached from the first run.
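A minimal sketch for locating those leftovers before deleting anything. The function name is made up. The root to search is an assumption taken from the paths in the mount errors (`/var/lib/dagger/worker/snapshots/snapshots` inside the engine container).

```shell
# find_volatile_markers ROOT: list every leftover overlayfs volatile marker
# directory below ROOT. Prints nothing when the tree is clean.
# Hypothetical helper for illustration; it does not delete anything.
find_volatile_markers() {
  find "$1" -type d -path '*/work/incompat/volatile' 2>/dev/null
}
```

Listing first and deleting by hand keeps this cautious, since removing a marker assumes the snapshot contents survived the crash intact.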

#

It's definitely using more memory than that. I crashed it again and it climbed to around the limit before ooming.

/ # dmesg | grep oom
[ 7909.167555] dagger-engine invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[ 7909.167634]  oom_kill_process+0x144/0x360
[ 7909.167690] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 7909.168079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init,mems_allowed=0,global_oom,task_memcg=/docker/032e6c36b5e726638fa46d78b218b68555386fa899d18fe418b8f408b4e1bca9/init,task=dagger-engine,pid=31178,uid=0
[ 7909.168231] Out of memory: Killed process 31178 (dagger-engine) total-vm:2341756kB, anon-rss:461572kB, file-rss:9636kB, shmem-rss:0kB, UID:0 pgtables:2344kB oom_score_adj:0
jolly steppe
# wise shoal It's definitely using more memory than that. I crashed it again and it climbed t...

It's a global oom, it seems that it's the entire VM running out of memory

  constraint=CONSTRAINT_NONE
  global_oom
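A sketch of that check as a function, in case it helps when reading other OOM reports. `oom_scope` is a hypothetical name. It relies on the kernel's `constraint=` field, where `CONSTRAINT_NONE` means a system-wide OOM and `CONSTRAINT_MEMCG` means a cgroup (container) memory limit was hit.

```shell
# oom_scope: read an "oom-kill:" kernel log line on stdin and print
# "global" when the whole VM ran out of memory, "cgroup" when a cgroup
# memory limit was hit, or "unknown" otherwise. Hypothetical helper.
oom_scope() {
  line=$(cat)
  case $line in
    *constraint=CONSTRAINT_MEMCG*) echo cgroup ;;
    *constraint=CONSTRAINT_NONE*)  echo global ;;
    *) echo unknown ;;
  esac
}
```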

Questions (if answered above sorry, i checked but might have missed messages):

  1. What's your Docker Desktop memory allocation?

    • Docker Desktop → Settings → Resources → Memory
    • (This sets the LinuxKit VM's total memory)
  2. Inside the engine, what do these show?

docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.max
docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.current
  3. What's the VM's total memory?
docker exec dagger-engine-v0.19.9 cat /proc/meminfo | head -5

Is your number lower than the 5.7gb of the engine that we set ?

If you can increase it for testing purposes, can you please try and make a run after the cleanup

jolly steppe
# jolly steppe here, the limit being 5.7gb ? or your docker limit ?

What surprises me the most is that, as part of our release process, we bump recursively Dagger's internal components too (without CAs though), with the same command. And it's around the same amount of modules (i'll try with even more tomorrow to confirm).

It's totally possible that we have a memory leak or that we introduced something that consumes more memory since last release

By the way, do you have the same problem with 0.19.8 ?

wise shoal
#

5.7 GB is what docker stats reported back to me as the limit. I bumped it to 8GB and it still crashed. I'm going to push it to 12GB and 2GB swap and try again.

  1. Previous run was 8GB / 1GB Swap. The following will be with the updated resource values.

  2. Fresh engine, only using dagger core version

/ # cat /sys/fs/cgroup/memory.max
max
/ # cat /sys/fs/cgroup/memory.current
93048832
  3. Also pre-run
/ # cat /proc/meminfo | head -5
MemTotal:       12235596 kB
MemFree:        11203196 kB
MemAvailable:   11612708 kB
Buffers:          169800 kB
Cached:           397192 kB
  4. dagger develop --recursive w/ 12GB mem & 2GB swap
📉 Min Available:     1895 MB
📱 Max App RAM:       7500 MB (Active Anon)
🚨 Max Dirty:         1101 MB
🧱 Max Slab:          797 MB
🗄️ Max Cache:         4508 MB
🗺️ Max PageTbl:       52 MB

(I got gemini to write me a script to track it)

It ran successfully.
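
For anyone who wants to collect similar numbers, here is a minimal sketch of such a tracker. It is not the script used above: the mapping from the labels in the summary to `/proc/meminfo` fields (`MemAvailable`, `Active(anon)`, `Dirty`, `Slab`, `Cached`, `PageTables`) is an assumption, and the function names are made up for illustration. It assumes it runs somewhere `/proc/meminfo` reflects the Docker VM, e.g. inside the engine container.

```python
import time

# Field names are from /proc/meminfo; mapping them to the labels in the
# summary above is an assumption about what the original script tracked.
PEAK_FIELDS = ["Active(anon)", "Dirty", "Slab", "Cached", "PageTables"]


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def update_stats(stats, sample):
    """Fold one sample into the running min/max figures."""
    avail = sample.get("MemAvailable")
    if avail is not None:
        stats["min_available"] = min(stats.get("min_available", avail), avail)
    for field in PEAK_FIELDS:
        if field in sample:
            stats[field] = max(stats.get(field, 0), sample[field])
    return stats


def track(duration_s=60.0, interval_s=1.0, path="/proc/meminfo"):
    """Sample the file until duration_s elapses, then return the stats."""
    stats = {}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open(path) as f:
            update_stats(stats, parse_meminfo(f.read()))
        time.sleep(interval_s)
    return stats
```

Calling `track(duration_s=600)` in one shell while the develop run happens in another would return the lowest `MemAvailable` and the peaks of the other fields, all in kB.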

jolly steppe
# wise shoal 5.7 GB is what `docker stats` reported back to me as the limit. I bumped it to 8...

Thanks Luke 🙏

This seems to confirm that the OOM kills the engine, and leaves it in a weird state

Hypothesis: it crashed due to memory, then you guys restarted the engine, and I suppose that the volume still has the volatile directory (hard kill). Then, if you run any command since that point in time, it just crashes (due to that volatile dir).

So, I'll triple check tomorrow, but the cleanup fix that Erik suggested is actually useful to handle those kinds of OOM dirty states

Now, the real question is: why is it OOMing right now for such a small scale?

I suppose you were already doing the recursive develop in the past? I'll track the perf between releases tomorrow

Can you please confirm which dagger version you're jumping from?

Will follow up tomorrow, thank you very much for taking the time with Chris 😍, I should be able to repro🙏

wise shoal
#

It's the weekend for us tomorrow, I'll check back on Monday.

Now, the real question is: why is it OOMing right now for such a small scale?
I'm not sure what your scale is, but we have a single "library" module with about 28 dependent modules (each with a test module). The main module is essentially the CI layer to generate/test/validate all the others.

I suppose you were already doing the recursive develop in the past?
Correct

Can you please confirm which dagger version you're jumping from?
0.19.3

Here are the stats from a run on 0.19.3 (Docker: 12GB + 2GB)

📉 Min Available:     4106 MB
📱 Max App RAM:       6888 MB (Active Anon)
🚨 Max Dirty:         606 MB
🧱 Max Slab:          734 MB
🗄️ Max Cache:         6480 MB
🗺️ Max PageTbl:       38 MB

And with less resources (Docker: 6GB + 1GB)

📉 Min Available:     445 MB
📱 Max App RAM:       3616 MB (Active Anon)
🚨 Max Dirty:         494 MB
🧱 Max Slab:          512 MB
🗄️ Max Cache:         3475 MB
🗺️ Max PageTbl:       35 MB
weary oar
#

Thanks for the info, I am seeing if I can repro any super high memory usage/leaks. So far running stuff (develop --recursive and some expensive dagger calls) in our repo w/ quite a few dagger module dependencies has not replicated anything like what you're seeing, but will keep trying

weary oar
#

Found something interesting.

dagger develop --recursive on your repo at different versions (found traces in your org):

You do indeed have a ton of modules, a few times more than us!

The fact that the newer engine version is quite a bit faster honestly might explain why you started hitting OOMs sometimes. I've seen in the past this sort of thing happen where unblocking CPU/IO bottlenecks increases peak memory usage since you can just allocate more in parallel faster than before. I'm suspecting it's that, given I can't replicate any sort of memory leak with develop --recursive (engine RSS always goes back down to baseline after forcing a gc cycle). The fact that the OOMs are inconsistent also suggests that the problem is sort of "borderline" rather than just some absurd memory usage bug.

We should work on improving the memory usage of course, but that might be a piecemeal effort over time. For the shorter term I think we should:

  1. Add a parallelism limit to dagger develop --recursive (maybe just num CPUs). That should stop the peak memory usage from going crazy in repos with tons of deps like you have. I suspect it won't actually slow it down very much if at all since the CPU is probably getting maxed out anyways with that much parallelism
  2. Fix the problem with the mounts getting cleaned up after a hard crash (which is what caused the original error that started this whole thread)
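
As a generic illustration of item 1 (not the engine's actual implementation, which lives in Dagger's own codebase), capping in-flight work at the CPU count might look like the sketch below. `run_bounded` and the `worker` callback are hypothetical names standing in for "generate one module".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, worker, limit=None):
    """Run worker(task) for every task with at most `limit` in flight.

    Defaults to the number of CPUs, mirroring the "maybe just num CPUs"
    suggestion above. Results come back in input order.
    """
    limit = limit or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(worker, tasks))
```

With a cap like this, peak memory scales with `limit` rather than with the number of modules in the repo, which is the point of the suggestion.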

cc @jolly steppe

uncut roost
#

LOL @weary oar you made it too fast

jolly steppe
weary oar
jolly steppe
#

followup on the cleanup incoming

jovial shard
#

Erik, appreciate that analysis. I have seen things like that before: when you make improvements to apps that were previously blocked by I/O, they can now utilise the CPU much more and it's possible for them to run out of resources.

#

We are an interesting case study because we have a mono repo with a lot of modules. We actually have parallelism limits in some of our dagger functions for that repo, because on some machines, there was too much in parallel, and progress would just grind to a halt!

uncut roost
jovial shard
#

I have noticed in the jump from 0.19.3, the engine has gotten so much zippier, so very likely it's just able to do more in parallel.

uncut roost
#

But the tricky part is: with auto-scale-out, that parallelism limit no longer makes sense: if we just hardcode it in the module, we're potentially leaving acceleration on the table

jovial shard
#

Perhaps it should be a client setting?

uncut roost
#

Yeah maybe. But right now it's just custom module logic. Not something the engine is exposed to. We could change that

#

System-wide parallelization throttle could make sense

#

Actually it might make sense even in a cluster, to keep cost under control 🙂

jovial shard
#

Right

weary oar
# uncut roost System-wide parallelization throttle could make sense

We used to have that but it creates some really subtle and tricky deadlock scenarios. If you can keep scaling out forever (for some approximation of forever) that changes it of course though.

I guess what we could do is find some way of limiting “top-level” parallelism (ie number of checks, number of modules generating not including their deps, etc). That’d avoid the deadlock issues

#

And in the meantime one-off whack-a-moles will help 😄

jovial shard
#

Seems like a good idea

#

Thanks for jumping on this and not only looking at the first issue with mounts but also considering the reasons for the OOM kills 🙂

#

Hopefully the feedback is helping you guys discover ways we are using the product, and therefore improve it!

jovial shard
#

Can't wait to get these fixes in and get upgraded next week. We have plans to use checks and the improvements to .env files to start building a very strong local experience for developers

jolly steppe
jolly steppe
weary oar
jovial shard
#

Looking good guys 🙂

jovial shard
#

We're going to roll out dagger 0.19.9 to our team's runners and leave everyone else on 0.19.3 for now. Do you guys reckon the fixes would be out some time this week or would you be looking at batching them in the next release cycle?

jolly steppe
jovial shard
#

cool sounds good

jolly steppe
jovial shard
#

Great!

#

I will let my team loose on your main branch for testing

wise shoal
#

lgtm

#

Cleanup is working, and the parallelism is preventing OOMs

weary oar
#

Awesome, currently planning to do the release our tomorrow morning!

jovial shard
#

Hey guys. We have deployed 0.19.9 to our CI runners and we are experiencing some weird issues where the engine starts failing its kube health check and becomes unresponsive.

Here is the output from a client trying to connect to the engine:

Run exec dagger call \
1   : connect
1   : [0.0s] | cloud url=https://dagger.cloud/nine/traces/a345ff2dd8ed5a6a55694a664246d778
2   : ┆ starting engine
2   : ┆ starting engine DONE [0.0s]
3   : ┆ connecting to engine
3   : ┆ [0.0s] | 23:09:23 INF connected name=dagger-platform-engineering-dagger-helm-engine-dc62z client-version=v0.19.9 server-version=v0.19.9
3   : ┆ connecting to engine DONE [0.0s]
4   : ┆ starting session
jovial shard
#

0.19.9

#

is 0.19.10 released?

jolly steppe
jovial shard
#

right okay

#

just posting the engine logs for brevity

#

when 0.19.10 comes out, we'll update to that and see if it goes away

#

when the engine is in this state by the way, we can't terminate the engine pod in k8s.. it's like it's stuck

#

the way I got it to close last time, was execing into the pod and doing a killall on everything in the container.

#

sorry I realise those logs are ordered newest first

#

^ oldest first here

#

you can see in our logs we see a spike in errors around 10:07:25

#

in kube we see the pod is failing its healthcheck:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  3m58s (x163 over 43m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s
#

Running ps aux in the container:

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      1h43 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml
  232 root      0:05 /usr/sbin/dnsmasq --keep-in-foreground --log-facility=- --log-debug -u root --conf-file=/var/run/containers/cni/dnsname/dagger/dnsmasq.conf
199708 root      0:00 [git]
199730 root      0:00 [git]
661657 root      0:00 [git]
661679 root      0:00 [git]
1232164 root      0:00 [runc:[2:INIT]]
1583779 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/tribqpgdkgmxu25sbtc01ojl6 --keep tribqpgdkgmxu25sbt
1583835 root      0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1583995 root      0:04 bao server -dev -dev-root-token-id dev-only-token
1586579 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/ob80b4d2f2ldlmcvyhy32am30 --keep ob80b4d2f2ldlmcvyh
1586706 root      0:00 /.init bao server -dev -dev-root-token-id bao-dev-token
1587208 root      0:04 bao server -dev -dev-root-token-id bao-dev-token
1587232 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/z7gv71hnxyssw0gwtesfkrwl6 --keep z7gv71hnxyssw0gwte
1587356 root      0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1587600 root      0:04 bao server -dev -dev-root-token-id dev-only-token
1595301 root      0:00 dagger core version
1595320 root      0:00 sh
1595326 root      0:00 ps aux
#

I have it in a broken state right now, is there anything I could run on the pod to help you guys diagnose this one?

weary oar
#

that's a very odd failure scenario

#

it feels like something deeply wrong like on the kernel level

#

e.g. 1232164 root 0:00 [runc:[2:INIT]], that's an intermediate state of a separate runc process that should be super short lived, so the fact that it's seemingly sitting there is extra odd

jovial shard
weary oar
#

cat /proc/meminfo (or otherwise machine wide memory usage could help too), to check it's not in swap hell or something

jovial shard
#
/ # cat /proc/meminfo
MemTotal:       64777188 kB
MemFree:         2363604 kB
MemAvailable:   52168964 kB
Buffers:            1584 kB
Cached:         45458216 kB
SwapCached:            0 kB
Active:         11956156 kB
Inactive:       38439244 kB
Active(anon):      15764 kB
Inactive(anon):  5217864 kB
Active(file):   11940392 kB
Inactive(file): 33221380 kB
Unevictable:      199448 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              1680 kB
Writeback:             0 kB
AnonPages:       5135376 kB
Mapped:          2024580 kB
Shmem:            298028 kB
KReclaimable:    5361856 kB
Slab:            9227904 kB
SReclaimable:    5361856 kB
SUnreclaim:      3866048 kB
KernelStack:       17584 kB
PageTables:        39860 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32388592 kB
Committed_AS:   15087008 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      351864 kB
VmallocChunk:          0 kB
Percpu:            25696 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:     30720 kB
FilePmdMapped:      2048 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     2107820 kB
DirectMap2M:    62867456 kB
DirectMap1G:     1048576 kB
weary oar
#

oh and df -h to check disks

jovial shard
#
/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  49.9G      8.7G     41.2G  17% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    12.4G      5.1M     12.3G   0% /run/dagger
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/hosts
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /dev/termination-log
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/hostname
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/resolv.conf
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dagger/engine.toml
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dagger/engine.json
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /var/lib/dagger
tmpfs                    58.3G     12.0K     58.3G   0% /var/run/secrets/kubernetes.io/serviceaccount
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dnsmasq-resolv.conf
overlay                  49.9G      8.7G     41.2G  17% /etc/resolv.conf
#
overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs1299463981
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs1299463981/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs1299463981/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs1299463981/.init
overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs2552938690
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs2552938690/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs2552938690/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs2552938690/.init
overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs3808259598
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs3808259598/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs3808259598/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs3808259598/.init
weary oar
#

hm okay that all looks okay

#

has this happened before?

jovial shard
#

yep yesterday after upgrading we encountered this about 2 times

weary oar
#

Is k8s trying to shut it down or something? The engine logs are all context canceled, which is what would happen when the engine is trying to shutdown after a SIGTERM for example

jovial shard
#

the health check has been failing for a while

#

I will just check

#

well it's a readiness probe, not a liveness probe

#

so theoretically k8s shouldn't have tried to shut it down

#

These are the only events we have:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  77s (x244 over 61m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s
weary oar
#

the engine logs don't really show anything except the engine seemingly trying to shutdown, I'm guessing they just got cut off due to a log size limit?

jovial shard
#

I only took the engine logs from around when the health check started failing

#

I can grab everything

weary oar
#

I don't remember k8s semantics enough, if readiness probes failed does that mean it never entered the "ready" state? so like it just is failing to start successfully vs. it was running fine and then later went into some unhealthy state

jovial shard
#

here we go

#

sorry the previous logs also contained other engine's logs because I pulled it from our logging infra

#

This is just the failing engine's pod logs

#

I believe we don't have debug logs enabled on our engines anymore unfortunately...

#

https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#readiness-probe

Readiness probes determine when a container is ready to accept traffic. This is useful when waiting for an application to perform time-consuming initial tasks that depend on its backing services; for example: establishing network connections, loading files, and warming caches. Readiness probes can also be useful later in the container’s lifecycle, for example, when recovering from temporary faults or overloads.

If the readiness probe returns a failed state, Kubernetes removes the pod from all matching service endpoints.

Readiness probes run on the container during its whole lifecycle.

#

We actually connect the client to the engine via a host mounted unix socket, so I doubt readiness probes are having any effect here at all

#

Are there any debug endpoints I can hit on the engine to get the current state out of it?

weary oar
#

i'm still looking through the logs you sent

jovial shard
#

okay no worries

#

I'll leave it in the current state for another hour

#

then will restart the pod

#

yesterday when I restarted the pod (which required me to exec into the container and do a killall runc, killall dagger-engine), it came back up with no problems

#

I assume the engine is in some kind of borked state right now

#

When I exec into the container and run:

dagger core version

It hangs indefinitely
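
When checking by hand, wrapping the command in a timeout avoids a stuck shell. A minimal sketch, roughly what the readiness probe above does with its 14s limit; `probe` is a made-up helper name, and it only assumes the `dagger` binary is on the PATH inside the container:

```python
import subprocess


def probe(cmd, timeout_s=14.0):
    """Return True if cmd exits 0 within timeout_s, False otherwise.

    A hung engine shows up as a timeout instead of blocking forever.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return False
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

For example, `probe(["dagger", "core", "version"])` would come back `False` after 14 seconds while the engine is in this state.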

#

not sure if this is useful or not but the engine stopped emitting cache metrics at 10:07 ~AEST~ AEDT

#

sorry I'll give UTC

#

UTC 2026-01-14T23:07

weary oar
#

And last log line is written at 2026-01-14T23:08:59Z

#

But looks like the process has continued to stick around long after...

jovial shard
#

Yeah its still there if I run ps aux

#

1 root 1h47 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml

weary oar
#

do you happen to have metrics (memory, cpu, etc.) data from while it was still running? just curious if it like was consuming a bunch and then got requested to shutdown at 23:07

jovial shard
#

the node cpu spiked around that time

#

the node memory without cache metric is calculated as:

node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)
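
As a sanity check, here is the same formula applied to the `/proc/meminfo` snapshot posted earlier in the thread. That snapshot was taken while the engine was already stuck, so it is not the value at 23:07, and the function name is only illustrative.

```python
def memory_without_cache_kb(mem_total, mem_free, buffers, cached):
    """MemTotal - (MemFree + Buffers + Cached), all in kB."""
    return mem_total - (mem_free + buffers + cached)


# Values from the /proc/meminfo output earlier in this thread.
used_kb = memory_without_cache_kb(
    mem_total=64777188,
    mem_free=2363604,
    buffers=1584,
    cached=45458216,
)
print(f"{used_kb} kB ~ {used_kb / 1024 / 1024:.1f} GiB")  # 16953784 kB ~ 16.2 GiB
```

Note that this figure still counts reclaimable slab (SReclaimable was 5361856 kB in the same snapshot), since the formula does not subtract it.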
#

the memory the node has is 64GB so nowhere near the limit

weary oar
#

hm okay interesting...

jovial shard
#

Grafana only reports things after the fact and sometimes misses spikes

#

so it's possible this is not representative

#

this is the node memory metric though so it should be accurate

#

Just an FYI we have sysdig running on this node as well

#

This could potentially be complicating our setup

weary oar
#

it does sort of seem like there may have been a huge burst of containers trying to start right before everything went bad, based on the logs, though admittedly I have to infer that indirectly since the logs are not great.

One follow-up for this no matter what is to go do a pass on the logs we write and the levels for them; I spend so much time looking at debug logs I didn't realize how utterly useless the non-debug ones have become

#

did someone try to run dagger develop --recursive without the fix that's in v0.19.10 at 23:07 UTC? 🙂

jovial shard
#

haha no this is a CI pod

#

so it's not getting used for that sort of thing

#

me wishes

#

hahaha

#

Do you have any knowledge of sysdig?

weary oar
jovial shard
#

Just noticing a few entries in our sysdig log around 23:07 UTC

weary oar
jovial shard
#

Okay will send you through a bit of the log

weary oar
#

oh wait! I totally missed something crucial in the dmesg you sent earlier:

[68804.239285] dr=syscall_sins invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
...
[68804.240571] memory: usage 2764800kB, limit 2764800kB, failcnt 334918
[68804.262006] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[68804.262815] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice:
...
[68804.291303] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[68804.292791] [   4471] 65535  4471      255      141        0      141         0    36864        0          -998 pause
[68804.316926] [1507141]     0 1507141    21051    12106     1312    10794         0   196608        0          -997 dr-monitor
[68804.318342] [1553644]     0 1553644  1255711   645279   608273    37006         0  6287360        0          -997 dr-agent
[68804.319823] [1553645]     0 1553645    21947     3189     2111     1078         0   143360        0          -997 dr-mounted_fs_r
[68804.321409] [1553646]     0 1553646   323142    15931     5471    10460         0   258048        0          -997 cointerface
[68804.322932] [1553647]     0 1553647   547238    10182     2322     7860         0   319488        0          -997 responder
[68804.326798] [1553648]     0 1553648   321385    11772     2837     8935         0   229376        0          -997 kspm-analyzer
[68804.328265] [1553649]     0 1553649   337601    12262     3719     8543         0   389120        0          -997 host-scanner
[68804.329788] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[68804.336127] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[68804.338327] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[68804.341454] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[68804.351550] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
#

OOM kill happened, but the engine wasn't killed

#

dr-agent, dr-monitor and dr=syscall_sins got killed

#

not sure what those are

#

or exactly how that would cause the dagger engine to start freaking out, but possible the engine was just trying to shutdown and got in a weird state during that

#

tbh I can't say this is for sure related, since the kernel logs include everything on the whole node, just seemed worth noting

jovial shard
#

cool

#

how do you read those timestamps on the left?

weary oar
jovial shard
#

okay

#
 Start Time:       Wed, 14 Jan 2026 15:59:11 +1100
#

Okay so the machine was up at 2026-01-14T04:59

#

2026-01-15T12:05:55

#

so seems unrelated

#

no, that doesn't make sense because that's the future

weary oar
#

Yeah Idk what to make of it.

The only other thing I can think of that might be interesting is current CPU/memory stats for the dagger engine process specifically

jovial shard
#

I think that is sysdig

jolly steppe
# jovial shard no, that doesn't make sense because that's the future

Theory:

The kernel says the memory cgroup limit was 2,764,800 kB (~2.64 GiB) and it hit it repeatedly (failcnt 334918).

Then it killed dr-agent first (big RSS), and because memory.oom.group is set, it killed the rest of the processes in that same cgroup as a group.

Sysdig’s agent sits on the syscall path (driver/eBPF) and does heavy analysis. If it’s overloaded/restarting (OOM loop), you can get node-level jitter?
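
The figures in the dmesg excerpt can be cross-checked with a little arithmetic. The task table is in pages; assuming 4 KiB pages (an assumption, though it is consistent with the kill line), dr-agent's `rss_anon` and `rss_file` columns match the `anon-rss` and `file-rss` values the kernel printed.

```python
PAGE_KB = 4  # assumed 4 KiB page size; the task table is reported in pages


def pages_to_kb(pages, page_kb=PAGE_KB):
    return pages * page_kb


def kb_to_gib(kb):
    return kb / (1024 * 1024)


# dr-agent row from the task table: rss, rss_anon, rss_file (in pages).
rss_kb = pages_to_kb(645279)
anon_kb = pages_to_kb(608273)   # kill line reports anon-rss:2433092kB
file_kb = pages_to_kb(37006)    # kill line reports file-rss:148024kB

limit_kb = 2764800              # "memory: usage 2764800kB, limit 2764800kB"
print(anon_kb, file_kb)                              # 2433092 148024
print(f"limit = {kb_to_gib(limit_kb):.2f} GiB")      # limit = 2.64 GiB
print(f"dr-agent share = {rss_kb / limit_kb:.0%}")   # dr-agent share = 93%
```

So dr-agent alone accounted for most of that cgroup's limit, which fits it being picked first.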

jovial shard
#

I will try to work out exactly what time that was

#

hahah

weary oar
#

I'd say that in general, things you could do to help debug this if it happens again:

  1. enable pprof metrics endpoints for the engine, which requires starting the dagger-engine process with e.g. --debugaddr=0.0.0.0:6060 (or whatever port would make sense for you)
  2. enable debug logs (--debug)

On our end, I'll try to clean up the logs we print on levels above debug so they can be actually useful again
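
Once the debug address is enabled, a goroutine dump can be grabbed with a few lines. This is a sketch under assumptions: it presumes the debug address serves Go's standard `net/http/pprof` routes under `/debug/pprof/` (the usual default for Go services, not confirmed in this thread), and `pprof_url` / `dump_goroutines` are made-up helper names.

```python
import urllib.request


def pprof_url(addr, profile="goroutine", debug=2):
    """Build a URL for Go's standard net/http/pprof handler.

    Assumes the engine's --debugaddr serves the default /debug/pprof/
    routes; adjust if your build exposes them elsewhere.
    """
    return f"http://{addr}/debug/pprof/{profile}?debug={debug}"


def dump_goroutines(addr, out_path, timeout_s=10.0):
    """Save a full goroutine stack dump; raises if the engine doesn't answer."""
    with urllib.request.urlopen(pprof_url(addr), timeout=timeout_s) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, `dump_goroutines("localhost:6060", "goroutines.txt")` after port-forwarding to the pod. The timeout matters here, since a stuck engine may never answer.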

jovial shard
#

Thanks Erik

#

Sorry for bothering you guys about it, could very well be an issue on our side

#

btw got the right times

#
[Wed Jan 14 22:32:27 2026] Tasks state (memory values in pages):
[Wed Jan 14 22:32:27 2026] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[Wed Jan 14 22:32:27 2026] [   4471] 65535  4471      255      141        0      141         0    36864        0          -998 pause
[Wed Jan 14 22:32:27 2026] [1507141]     0 1507141    21051    12106     1312    10794         0   196608        0          -997 dr-monitor
[Wed Jan 14 22:32:27 2026] [1553644]     0 1553644  1255711   645279   608273    37006         0  6287360        0          -997 dr-agent
[Wed Jan 14 22:32:27 2026] [1553645]     0 1553645    21947     3189     2111     1078         0   143360        0          -997 dr-mounted_fs_r
[Wed Jan 14 22:32:27 2026] [1553646]     0 1553646   323142    15931     5471    10460         0   258048        0          -997 cointerface
[Wed Jan 14 22:32:27 2026] [1553647]     0 1553647   547238    10182     2322     7860         0   319488        0          -997 responder
[Wed Jan 14 22:32:27 2026] [1553648]     0 1553648   321385    11772     2837     8935         0   229376        0          -997 kspm-analyzer
[Wed Jan 14 22:32:27 2026] [1553649]     0 1553649   337601    12262     3719     8543         0   389120        0          -997 host-scanner
#
[Wed Jan 14 22:32:27 2026] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] dagger0: port 18(veth005471b0) entered disabled state
#

so it's happening at 22:32, which is about 35m before our weird spike in logs at 2026-01-14T23:07:22Z
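
To tie the two timestamp formats together: the raw bracketed dmesg numbers are seconds since the kernel booted, so one matching pair of readings gives the boot time. A small sketch, assuming the human-readable timestamps above are in UTC (an assumption; they could be in the container's local timezone):

```python
from datetime import datetime, timedelta, timezone

# Assumption: the bracketed wall-clock times above are UTC.
oom_wall = datetime(2026, 1, 14, 22, 32, 27, tzinfo=timezone.utc)
oom_since_boot = timedelta(seconds=68804.336127)  # raw stamp on the same "Killed process" line
log_spike = datetime(2026, 1, 14, 23, 7, 22, tzinfo=timezone.utc)

boot_time = oom_wall - oom_since_boot
gap = log_spike - oom_wall

print("estimated boot:", boot_time.replace(microsecond=0).isoformat())
print("OOM kill to log spike:", gap)  # OOM kill to log spike: 0:34:55
```

If the UTC assumption holds, the node booted before the pod's start time, which would explain why adding the offset to the pod start time landed in the future.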

#

probably not relevant

#

thanks guys.. I'll quit bothering you for now

#

and yeah we will do a few things on our end:

  • upgrade to 0.19.10
  • turn on debug logs
  • turn on pprof
jovial shard
#

Okay we have done all of this now and I tested that I can hit pprof all good and get traces

#

So next time this happens, are there any particular pprof commands you are interested in?

#

perhaps the pprof/trace is the most useful as it will show where the app is I guess

weary oar
#

BTW we got another report of a user hitting this, so almost certainly not something wrong in your infra or similar

weary oar
#

Also, if the engine is unresponsive to anything including on those debug endpoints, a helpful last resort would be to just manually send SIGQUIT to it (kill -s QUIT <pid>), which should dump the goroutine stacks to its output (unless things are so bad the go runtime is also borked)

jovial shard
#

Thanks Erik will stay on the lookout, we should be giving our 0.19.10 engine a heavy workout today, so will report back if we hit the bug

weary oar
jovial shard
#

Yeah easy

jovial shard
#

Hi Erik we have not seen this error since the upgrade to 0.19.10

#

Did you guys get to the bottom of it in the end? Was it something that was fixed in 0.19.10?