Error on 0.19.9 upgrade | Dagger | Page 1

weary oar Jan 8, 2026, 11:22 PM

#

Couple questions:

1. What's the output of uname -a on the machine where this is occurring?
2. Does this happen right away with any command or did you successfully run some other commands before this that succeeded?
3. If possible, what's the output of sudo dmesg | grep overlay | grep -v 'meaningless' on the machine where this is happening?

jovial shard Jan 8, 2026, 11:22 PM

#

Darwin N1004907 25.2.0 Darwin Kernel Version 25.2.0: Tue Nov  4 20:46:55 PST 2025; root:xnu-12377.60.50.501.1~2/RELEASE_ARM64_T6030 arm64 arm Darwin

weary oar Jan 8, 2026, 11:23 PM

#

jovial shard 1. ``` Darwin N1004907 25.2.0 Darwin Kernel Version 25.2.0: Tue Nov 4 20:46:55 ...

ah okay nevermind about 3) then, if you are on macos

jovial shard Jan 8, 2026, 11:23 PM

#

The only command so far that has succeeded for me is dagger core version; other commands do not work:

dagger functions
dagger init --sdk go test

#

dagger develop

#

^ those all don't work

weary oar Jan 8, 2026, 11:23 PM

#

are you using docker desktop on macos? i.e. that vs orbstack or something else

jovial shard Jan 8, 2026, 11:23 PM

#

Docker Desktop on MacOS yes

#

I will try the other VMM option and report back

weary oar Jan 8, 2026, 11:26 PM

#

jovial shard Docker Desktop on MacOS yes

when you have a sec, what's the output of:

docker run --rm --privileged -it alpine:latest uname -a
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'

#

(basically same as before but now hopefully inside the docker desktop vm)

jovial shard Jan 8, 2026, 11:27 PM

#

Does it matter which VMM I am on?

#

I just switched to Apple Virtualization framework

weary oar Jan 8, 2026, 11:27 PM

#

jovial shard Does it matter which VMM I am on?

not sure tbh, depends on implementation details of docker desktop. It could plausibly matter though

jovial shard Jan 8, 2026, 11:28 PM

#

Okay I am just running a quick test with Apple Virtualisation framework

#

Will let you know if that works or not... then will run those commands

#

Interesting... its working now for some reason

#

Okay will switch back to Docker VMM and see if it breaks again

weary oar Jan 8, 2026, 11:29 PM

#

jovial shard Interesting... its working now for some reason

very interesting indeed!

uncut roost Jan 8, 2026, 11:29 PM

#

https://tenor.com/view/twijfel-nee-ja-smile-i-totaly-believe-you-gif-17013282

#

https://tenor.com/view/notebook-jotdown-takedownnotes-gif-6133490

jovial shard Jan 8, 2026, 11:31 PM

#

Okay it's working on Docker VMM now 🙂

#

Okay tell you what... will get some of my colleagues on this and see if we can reproduce

weary oar Jan 8, 2026, 11:31 PM

#

jovial shard Okay it's working on Docker VMM now 🙂

huh... I mean I'll take it, but that's definitely a bit mysterious

weary oar Jan 8, 2026, 11:32 PM

#

jovial shard Okay tell you what... will get some of my colleagues on this and see if we can r...

that sounds great, thank you

jovial shard Jan 8, 2026, 11:36 PM

#

The first command I actually ran on my machine was dagger develop --recursive on our dagger mono repo.

#

That was the one that broke it originally. I am gonna try rerunning that again

#

well its now working

#

wow come on

uncut roost Jan 8, 2026, 11:41 PM

#

Sorry for the bad upgrade experience @jovial shard !

jovial shard Jan 8, 2026, 11:41 PM

#

hahahaha

#

its okay, theres a lot in flux right now I understand

weary oar Jan 8, 2026, 11:43 PM

#

if it pops up again at any point let us know of course! I am at a loss as to what would have caused it to happen once and then disappear

jovial shard Jan 8, 2026, 11:44 PM

#

Yeah who knows could be an absolute flake on my machine that decided to pop up just as I was testing the new release

#

A little gift to freak you out

#

I am gonna be upgrading our engines today hopefully so will let you know if anything comes up

#

Btw here are the results of those commands:

Docker VMM

docker run --rm --privileged -it alpine:latest uname -a
Linux 6e674a40ec10 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>

Apple Virtualisation Framework

docker run --rm --privileged -it alpine:latest uname -a
Linux 745a02853958 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
<no-output>

weary oar Jan 8, 2026, 11:47 PM

#

jovial shard Btw here are the results of those commands: Docker VMM ``` docker run --rm --pr...

thanks, that's still helpful! On the off-chance it does ever happen again, that second one with the greps would be useful to see the output of. Unfortunately the kernel puts useful information in the kernel logs while the error itself is just "invalid argument"

jovial shard Jan 8, 2026, 11:48 PM

#

right okay

#

I'll also try to get you a dump of the dagger logs

#

I can't share the trace of the very first error that I got due to it containing some company code however the error occurred in an asModule operation and it was:

failed to load dependencies as modules: failed to load module dependencies: failed to initialize module: failed to get type defs json during module sdk codegen: mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1043758443", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/149/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs,volatile,index=off,redirect_dir=off", err: invalid argument

#

I did actually restart docker desktop, and delete the dagger container after that

#

Perhaps this has "cleaned the state" so to speak and the issue has gone away as a result

jovial shard Jan 9, 2026, 12:08 AM

#

@wise shoal tagging you for brevity

#

If you hit the same issue, please dump your logs here 🙂

#

Okay I hit the issue again:

dagger call \
        --bao-addr=${BAO_ADDR} \
        --bao-token=file://~/.vault-token \
        --local-aws-credentials=file://~/.aws/credentials \
        --local-aws-profile=${AWS_PROFILE} \
        --local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
        --ssh=${SSH_AUTH_SOCK} \
check-pull-request

✔ connect 0.3s
✘ load module: . 6.2s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule › 
✘ withExec codegen generate-typedefs --module-source-path /src/modules/dx --module-name dx --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true
  ┆ execMD: "

#

{\"ClientID\":\"ulpbyf0h7vonq4z0u13zp1s1u\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"offvnz76gswlelo3x58ew1qw9\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjVjNTdkNTc2ZmVhYjk4YzMSvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXQoVeHhoMzo1YzU3ZDU3NmZlYWI5OGMzEkQKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NWM1N2Q1NzZmZWFiOThjMw==\",\"EncodedModuleID\":\"ChV4eGgzOmYyMDg2YTI1ZDkxOThlNDESvAEKFXh4aDM6MmUzNGIzOGE5NTZiZjkzZhKiARIQCgxNb2R1bGVTb3VyY2UYARoMbW9kdWxlU291cmNlIlIKCXJlZlN0cmluZxJFOkMvVXNlcnMvY2hyaXN0b3BoZXIucGFsbWVyL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL2R4IhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjJlMzRiMzhhOTU2YmY5M2ZYARJxChV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESWAoVeHhoMzoyZTM0YjM4YTk1NmJmOTNmEhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSIMCgRuYW1lEgQ6AmR4ShV4eGgzOjMzMzA3M2IyZDQ4Mjg0ZTESXwoVeHhoMzpmMjA4NmEyNWQ5MTk4ZTQxEkYKFXh4aDM6MzMzMDczYjJkNDgyODRlMRIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6ZjIwODZhMjVkOTE5OGU0MVgB\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
  ) 0.1s ERROR

#

! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit301354842", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/va
  r/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/sn
  apshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/w
  orker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err: invalid argument

#

docker run --rm --privileged -it alpine:latest uname -a
Linux 84c891249553 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux

docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 1585.795731] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.796150] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.816909] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.838332] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.855822] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.881574] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.916250] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.918271] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.940469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted

#

[ 1585.962526] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972130] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.972689] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.974413] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.977663] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.984430] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.988866] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.989469] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.990043] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.992682] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1585.996780] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.010497] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.031813] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.034578] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.035641] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037057] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.037536] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.039058] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.040479] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.041607] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.043329] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 1586.051474] overlayfs: overlay with incompat feature 'volatile' cannot be mounted

#

📎 engine-logs.txt
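The repeated kernel lines above all name the same overlayfs feature. A minimal sketch for condensing such output, assuming the message format shown in this thread; the function name `incompat_features` is made up for illustration:

```shell
# Summarise overlayfs "incompat feature" messages read from stdin.
# Prints one "feature: count" line per distinct feature name.
incompat_features() {
  sed -n "s/.*overlayfs: overlay with incompat feature '\([^']*\)' cannot be mounted.*/\1/p" |
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

It could be fed from the same privileged container used above, for example `docker run --rm --privileged alpine:latest dmesg | incompat_features`.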

weary oar Jan 9, 2026, 12:19 AM

#

Ah okay, thanks for the info dump. Based on that I suspect there's some codepath where cleanup is supposed to happen that doesn't always happen.

Was there anything obvious you did that triggered it? Or just keep using it and it randomly pops up?

jovial shard Jan 9, 2026, 12:19 AM

#

Nothing obvious unfortunately

#

I took a break from running dagger commands for about 20min

#

Came back and decided to run our main dagger function for the repo

#

That triggered it...

weary oar Jan 9, 2026, 12:22 AM

#

ok, I will see if I can track down exactly what's happening, but also in the worst case I can probably add in some kludge to the engine code to see when this error happens and handle it. So we'll get a v0.19.10 out ASAP.

jovial shard Jan 9, 2026, 12:23 AM

#

yep cool

#

happy to test for you off the main branch as well

wise shoal Jan 9, 2026, 12:28 AM

#

Not sure if this is related but I crashed when trying to replicate the issue. I bumped the version in the repo from 0.19.3 -> 0.19.9 and ran dagger develop --recursive.

! failed to generate code: Post "http://dagger/query": command [docker exec -i dagger-engine-v0.19.9 buildctl dial-stdio] has exited with exit status 137, make
  sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

I've attached the engine logs, pretty sure the failed run begins at L6835. The logs before that line were from a dagger init --sdk go test in another directory.

Lots of:

time="2026-01-09T00:01:28Z" level=warning msg="failed to release network namespace \"qly2sfnlac7v0p0lq9zp6fv0w\" left over from previous run: plugin type=\"loopback\" failed (delete): unknown FS magic on \"/var/lib/dagger/net/cni/qly2sfnlac7v0p0lq9zp6fv0w\": ef53"

followed by:

could not load snapshot...

❯ docker run --rm --privileged -it alpine:latest uname -a
Linux 46d36ec3380b 6.11.11-linuxkit #1 SMP Wed Oct 22 09:37:46 UTC 2025 aarch64 Linux

❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'

📎 logs
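A note on the "exit status 137" in the error above: by shell convention an exit status above 128 is 128 plus the number of the signal that ended the process, so 137 corresponds to signal 9 (SIGKILL). That would be consistent with the engine process being killed from outside rather than exiting on its own. A small sketch of the arithmetic; the helper name `exit_signal` is hypothetical:

```shell
# Print the signal number implied by an exit status, or 0 if the
# status does not follow the 128+N convention.
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}
```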

#

I just ran it again and got an error similar to Chris

#

✘ withExec codegen generate-typedefs --module-source-path /src/modules/service-catalog --module-name service-catalog --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true

#

  ┆ execMD: "{\"ClientID\":\"rddzwfw7gadxzk28hu86uxvwq\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"q1ycgxiygqjba8s27okt71lyl\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjMxOTA1NTczYjlkYmZiM2ESXQoVeHhoMzozMTkwNTU3M2I5ZGJmYjNhEkQKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MzE5MDU1NzNiOWRiZmIzYRLCAQoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4EqgBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiWAoJcmVmU3RyaW5nEks6SS9Vc2Vycy9sdWtlLmJyYWtlbC93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9zZXJ2aWNlLWNhdGFsb2ciEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOFgBEn4KFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBJlChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIhkKBG5hbWUSEToPc2VydmljZS1jYXRhbG9nShV4eGgzOmUwYzRmZWE3MzcyYjZkOGQ=\",\"EncodedModuleID\":\"ChV4eGgzOjQ5YmMzNmZlODJkNTg3M2ESXwoVeHhoMzo0OWJjMzZmZTgyZDU4NzNhEkYKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZBIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6NDliYzM2ZmU4MmQ1ODczYVgBEsIBChV4eGgzOjY1YTVhNzc0ODE2ZjZjYzgSqAESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJYCglyZWZTdHJpbmcSSzpJL1VzZXJzL2x1a2UuYnJha2VsL3dvcmtzcGFjZS9saWJyYXJ5LWNpLXdvcmtmbG93cy9tb2R1bGVzL3NlcnZpY2UtY2F0YWxvZyITCg1kaXNhYmxlRmluZFVwEgIYAUoVeHhoMzo2NWE1YTc3NDgxNmY2Y2M4WAESfgoVeHhoMzplMGM0ZmVhNzM3MmI2ZDhkEmUKFXh4aDM6NjVhNWE3NzQ4MTZmNmNjOBIQCgxNb2R1bGVTb3VyY2UYARoId2l0aE5hbWUiGQoEbmFtZRIROg9zZXJ2aWNlLWNhdGFsb2dKFXh4aDM6ZTBjNGZlYTczNzJiNmQ4ZA==\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"

#

  ) 0.1s ERROR
! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1594712920", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots
  /snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/
  153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/145/fs:/var
  /lib/dagger/worker/snapshots/snapshots/94/fs:/var/lib/dagger/worker/snapshots/snapshots/86/fs,volatile,index=off,redirect_dir=off", err: invalid argument

jolly steppe Jan 9, 2026, 12:34 AM

#

trying to repro locally. On my linux working perfectly. Switching to mac rn

uncut roost Jan 9, 2026, 12:35 AM

#

Worked fine for me on remote Docker/linux, and remote Dagger-hosted sandbox... Trying to re-install docker-for-mac

weary oar Jan 9, 2026, 12:36 AM

#

Yeah I've been using v0.19.9 (and previously main builds) on Linux running similar sort of commands as what's repro'ing it above and haven't hit it yet... I really can't imagine how it's docker desktop specific, but I guess my imagination could be limited

wise shoal Jan 9, 2026, 12:37 AM

#

Engine logs for the second failure

📎 logs2

#

❯ docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[112951.651719] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.711449] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.730308] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.779398] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.781942] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.814648] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.831404] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.876855] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.933803] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.946117] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.949512] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.953730] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112951.985612] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.015875] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.043365] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[112952.047606] overlayfs: overlay with incompat feature 'volatile' cannot be mounted

weary oar Jan 9, 2026, 12:51 AM

#

@jovial shard @wise shoal do you by any chance run the engine with custom CA certs to your knowledge?

#

there's an outside possibility of that mattering

wise shoal Jan 9, 2026, 12:55 AM

#

Yeah we do

jovial shard Jan 9, 2026, 12:55 AM

#

yep

weary oar Jan 9, 2026, 12:56 AM

#

okay, lemme try to repro with that in place...

jolly steppe Jan 9, 2026, 12:57 AM

#

yeah no bug on my end up to now

jovial shard Jan 9, 2026, 1:12 AM

#

Here's the output of my docker inspect for the dagger container:

📎 message.txt

#

not sure if that's of any help whatsoever but haha, there you go

weary oar Jan 9, 2026, 1:13 AM

#

jovial shard not sure if that's of any help whatsoever but haha, there you go

yep that confirms you're setting up the ca certs; was about to go through double checking we're on the same page there, so that helps 👍

haven't repro'd your exact error yet but did trigger something suspicious looking

jovial shard Jan 9, 2026, 1:14 AM

#

Thanks for looking into this guys on short notice really appreciate!

weary oar Jan 9, 2026, 1:16 AM

#

having custom ca certs triggers some extra setup/teardown execution for each container, which I suspect may be creating a race around mounts. That'd also explain why our CI didn't catch this. We have a number of integ tests for CA certs but they only make up a small chunk of the 1000s, so something like this could have slipped through the cracks

uncut roost Jan 9, 2026, 1:16 AM

#

Error on 0.19.9 upgrade

jovial shard Jan 9, 2026, 1:16 AM

#

right...

jolly steppe Jan 9, 2026, 1:27 AM

#

Erik, could it be a cleanup issue on work/incompat/volatile, for subsequent mounts ?

#

trying to repro now 👀

weary oar Jan 9, 2026, 1:28 AM

#

jolly steppe Erik, could it be a cleanup issue on `work/incompat/volatile`, for subsequent mo...

yes that's for sure what it is based on the kernel logs they sent, but the question is where/why is the cleanup missing since our bases should have been covered. Could be something related to the custom CA cert setup/teardown codepaths

#

I'd ideally like to find that exact root cause and patch it, but I think also no matter what I'll add some fallback handling where if we try to make a mount and hit this error, we just cleanup the incompat dir at that time. Not super duper ideal in the long term but probably the right move to prevent pain like this, at least in the medium term.
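The fallback described above can be sketched roughly as follows. This is not the engine's actual code (which is Go); it is a shell illustration that assumes the leftover marker lives under the overlay workdir as `work/incompat`, and it retries on any failure where a real implementation would presumably check for the specific "invalid argument" error. The name `mount_with_fallback` is hypothetical:

```shell
# mount_with_fallback WORKDIR CMD [ARGS...]
# Runs CMD; if it fails, removes WORKDIR/work/incompat and retries once.
mount_with_fallback() {
  mwf_workdir=$1
  shift
  if "$@"; then
    return 0
  fi
  # Assumed cleanup step: drop the leftover marker directory, then retry.
  rm -rf -- "$mwf_workdir/work/incompat"
  "$@"
}
```

A call might look like `mount_with_fallback "$workdir" mount -t overlay overlay -o "$data" "$target"`, where `workdir`, `data`, and `target` are placeholder variables.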

jolly steppe Jan 9, 2026, 1:31 AM

#

weary oar I'd ideally like to find that exact root cause and patch it, but I think also no...

oh you mean the race ok 👍

jolly steppe Jan 9, 2026, 1:53 AM

#

MMmmh, still unable to hit it --'

#

I'm on kernel 6.10.14-linuxkit, updating docker desktop atm. i've been trying to hit it pretty hard

#

ok bumped, now i'm like chris and luke. At least this could be triggered just on the latest docker for mac version ?

docker run --rm alpine:latest uname -r
6.12.54-linuxkit

jovial shard Jan 9, 2026, 2:04 AM

#

I did upgrade to the latest docker desktop quite recently

#

Luke is on an older version though as you can see if you look at his post above

jolly steppe Jan 9, 2026, 2:08 AM

#

jovial shard Luke is on an older version though as you can see if you look at his post above

forgot about it 🙏 I was on an even older one personally

weary oar Jan 9, 2026, 2:12 AM

#

jolly steppe ok bumped, now i'm like chris and luke. *At least this could be triggered just o...

you're getting err: invalid argument when creating a mount now? with or without custom ca certs?

jolly steppe Jan 9, 2026, 2:17 AM

#

time="2026-01-09T02:08:36Z" level=error msg="failed to create cacerts installer, falling back to not installing CA certs: invalid argument" span="dagop.ctr Container.withEx

verifying it's not a "me" error with the ca certs installs

#

(probably me)

weary oar Jan 9, 2026, 2:19 AM

#

jolly steppe `time="2026-01-09T02:08:36Z" level=error msg="failed to create cacerts installer...

that's not the error they are getting exactly; that's a non-fatal one

#

the one they are hitting actually results in a hard user-facing error

#

but could be related

jolly steppe Jan 9, 2026, 2:22 AM

#

weary oar that's not the error they are getting exactly; that's a non-fatal one

~~mmh it was due to me (symlink (f99307f9.0 -> test-repro-ca.crt) in the CA certs directory)~~ Even without the symlink I have this error, could be related, no hard error as they have though, across 200+ runs

jovial shard Jan 9, 2026, 3:05 AM

#

for the record we do symlink the ca-certificates directory like this:

ln -s "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"

#

Not sure if this helps at all, but worth looking into

#

I might try just copying the certs in there and see what happens

weary oar Jan 9, 2026, 3:05 AM

#

jolly steppe ~~mmh it was due to me (symlink (f99307f9.0 -> test-repro-ca.crt) in the CA cert...

I think it's a red herring, it's just from a typed-nil problem hitting this: https://github.com/sipsma/dagger/blob/400ffd3e2a6e9bbff5cbd9476938db61d97681b8/engine/buildkit/containerfs/fs.go#L332 when it should have exited earlier https://github.com/sipsma/dagger/blob/400ffd3e2a6e9bbff5cbd9476938db61d97681b8/engine/buildkit/containerfs/fs.go#L294

which is worth a fix but not harmful and not what they are hitting

jovial shard Jan 9, 2026, 3:25 AM

#

Getting my colleagues to do this:

rm "$HOME/Library/Application Support/dagger/ca-certificates"
cp -R "$HOME/workspace/ssl" "$HOME/Library/Application Support/dagger/ca-certificates"

and retest

jolly steppe Jan 9, 2026, 3:29 AM

#

will be away for 2 hours, will resume after 🙏

jovial shard Jan 9, 2026, 3:29 AM

#

if you need to sleep / kids please do that

#

there is always tomorrow

jovial shard Jan 9, 2026, 3:45 AM

#

I have been running this configuration so far and haven't been able to reproduce the error

#

Will keep running with it and also will get my colleagues to try it out and see if this was it

wise shoal Jan 9, 2026, 3:50 AM

#

Looking at it now

weary oar Jan 9, 2026, 3:50 AM

#

jovial shard I have been running this configuration so far and haven't been able to reproduce...

Thanks, yeah I surprisingly wasn't able to repro the same error you all were hitting even with a custom CA installed and then running 100s of integ tests against that engine. So hopefully that squashes it permanently for you.

@silent @jolly steppe (silent ping, for tomorrow), if they hit the problem again, do you think tomorrow you would have time to try implementing the fallback of "whenever mount is made, pre-emptively remove any work/incompat dir"? I have a separate issue I'm looking into for another user that I gotta continue with tomorrow

jovial shard Jan 9, 2026, 4:18 AM

#

Unfortunately... we are still getting it:

make check-pull-request
dagger call \
    --bao-addr=${BAO_ADDR} \
    --bao-token=file://~/.vault-token \
    --local-aws-credentials=file://~/.aws/credentials \
    --local-aws-profile=${AWS_PROFILE} \
    --local-gcp-credentials=file://~/.config/gcloud/application_default_credentials.json \
    --ssh=${SSH_AUTH_SOCK} \
check-pull-request
✔ connect 0.3s
✘ load module: . 5.6s ERROR
┇ initializing module › ModuleSource.asModule › load dep modules › ModuleSource.asModule › load dep modules › ModuleSource.asModule › 
✘ withExec codegen generate-typedefs --module-source-path /src/modules/aws --module-name aws --introspection-json-path /schema.json --output typedefs.json (
  ┆ experimentalPrivilegedNesting: true
  ┆ execMD: "

#

{\"ClientID\":\"86cwwdrgtnzajs4nhgqz5k204\",\"SessionID\":\"\",\"SecretToken\":\"\",\"Hostname\":\"\",\"ClientStableID\":\"\",\"ExecID\":\"ihdmopo3idqgxuo9q04p1gaf8\",\"Internal\":true,\"CallID\":\"ChV4eGgzOjExNDZiOGU5NjcxOTQ1ZDASXQoVeHhoMzoxMTQ2YjhlOTY3MTk0NWQwEkQKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MTE0NmI4ZTk2NzE5NDVkMBK9AQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EqMBEhAKDE1vZHVsZVNvdXJjZRgBGgxtb2R1bGVTb3VyY2UiUwoJcmVmU3RyaW5nEkY6RC9Vc2Vycy9jaHJpc3RvcGhlci5wYWxtZXIvd29ya3NwYWNlL2xpYnJhcnktY2ktd29ya2Zsb3dzL21vZHVsZXMvYXdzIhMKDWRpc2FibGVGaW5kVXASAhgBShV4eGgzOjMwZDI1YjBjOTIwY2FkYzZYARJyChV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDkSWQoVeHhoMzozMGQyNWIwYzkyMGNhZGM2EhAKDE1vZHVsZVNvdXJjZRgBGgh3aXRoTmFtZSINCgRuYW1lEgU6A2F3c0oVeHhoMzo4ZThlNGIxN2U0OWFhZmQ5\",\"EncodedModuleID\":\"ChV4eGgzOjFmYjZmNGM1YzRmZTA1NzISXwoVeHhoMzoxZmI2ZjRjNWM0ZmUwNTcyEkYKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORIKCgZNb2R1bGUYARoIYXNNb2R1bGVKFXh4aDM6MWZiNmY0YzVjNGZlMDU3MlgBEr0BChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSowESEAoMTW9kdWxlU291cmNlGAEaDG1vZHVsZVNvdXJjZSJTCglyZWZTdHJpbmcSRjpEL1VzZXJzL2NocmlzdG9waGVyLnBhbG1lci93b3Jrc3BhY2UvbGlicmFyeS1jaS13b3JrZmxvd3MvbW9kdWxlcy9hd3MiEwoNZGlzYWJsZUZpbmRVcBICGAFKFXh4aDM6MzBkMjViMGM5MjBjYWRjNlgBEnIKFXh4aDM6OGU4ZTRiMTdlNDlhYWZkORJZChV4eGgzOjMwZDI1YjBjOTIwY2FkYzYSEAoMTW9kdWxlU291cmNlGAEaCHdpdGhOYW1lIg0KBG5hbWUSBToDYXdzShV4eGgzOjhlOGU0YjE3ZTQ5YWFmZDk=\",\"EncodedFunctionCall\":null,\"CallerClientID\":\"\",\"ParentIDs\":null,\"CacheMixin\":\"\",\"HostAliases\":null,\"ExtraSearchDomains\":null,\"RedirectStdinPath\":\"\",\"RedirectStdoutPath\":\"\",\"RedirectStderrPath\":\"\",\"SecretEnvNames\":null,\"SecretFilePaths\":null,\"SystemEnvNames\":null,\"EnabledGPUs\":null,\"SSHAuthSocketPath\":\"\",\"NoInit\":false,\"AllowedLLMModules\":null,\"ClientVersionOverride\":\"\"}"
  ) 0.2s ERROR

#

! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit1689234798", fstype: overlay, flags: 0, data:
  "workdir=/var/lib/dagger/worker/snapshots/snapshots/162/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/162/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapsh
  ots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snap
  shots/151/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs:/var/lib/dagger/worker/snapshots/snapshots/146/fs,volatile,index=off,redirect_dir=off", err:
  invalid argument

#

The issue actually follows after an engine crash though...

#

1. Engine crashes and I lose my first working session
2. I run another dagger command.
3. New dagger engine up
4. Start getting the error

#

However weirdly the new engine does not reuse the old docker volume

#

So I have no idea why this would be occurring since it doesn't share any state with the previously running engine

#

📎 engine-logs2.txt

#

docker run --rm --privileged -it alpine:latest uname -a
Linux d6a10ba7dbb2 6.12.54-linuxkit #1 SMP Tue Nov  4 21:21:47 UTC 2025 aarch64 Linux
docker run --rm --privileged -it alpine:latest sh -c 'dmesg | grep overlay | grep -v meaningless'
[ 2401.461405] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.464349] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.467592] overlayfs: overlay with incompat feature 'volatile' cannot be mounted
[ 2401.469147] overlayfs: overlay with incompat feature 'volatile' cannot be mounted

weary oar Jan 9, 2026, 5:08 AM

#

jovial shard 1. Engine crashes and I lose my first working session 2. I run another dagger co...

Oh I missed that there was a crash. That explains this way more… the cleanup of the mount isn’t happening because there was an actual crash. What needs cleaning up is on-disk state created by the kernel, so it persists across an engine restart.

We still need better handling of that case but fixing the crash is obviously even more important. I am afk atm but will look at the engine log output when back

wise shoal Jan 9, 2026, 5:15 AM

#

Right, so is this error:

! mount source: "overlay", target: "/var/lib/dagger/worker/cachemounts/buildkit2284938268", fstype: overlay, flags: 0, data: "workdir=/var/lib/dagger/worker/snapshots/snapshots/163/work,upperdir=/var/lib/dagger/worker/snapshots/snapshots/163/fs,lowerdir=/var/lib/dagger/worker/snapshots/snapshots/157/fs:/var/lib/dagger/worker/snapshots/snapshots/156/fs:/var/lib/dagger/worker/snapshots/snapshots/155/fs:/var/lib/dagger/worker/snapshots/snapshots/154/fs:/var/lib/dagger/worker/snapshots/snapshots/153/fs:/var/lib/dagger/worker/snapshots/snapshots/152/fs:/var/lib/dagger/worker/snapshots/snapshots/150/fs:/var/lib/dagger/worker/snapshots/snapshots/148/fs:/var/lib/dagger/worker/snapshots/snapshots/147/fs,volatile,index=off,redirect_dir=off", err: invalid argument

caused by these not being cleaned up?:

# stat -f -c %T /var/lib/dagger/worker/snapshots/snapshots/163/work/work/incompat/volatile/dirty
ext2/ext3

weary oar Jan 9, 2026, 5:16 AM

#

wise shoal Right, so is this error: ``` ! mount source: "overlay", target: "/var/lib/dagger...

Yes that's correct!

#

they should get cleaned up, but after a hard crash they won't be. Theoretically the engine should have been discarding those mounts entirely after the crash, but it's not
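
For reference, the kernel's overlayfs documentation describes the marker behind this error: a `volatile` mount creates `$workdir/work/incompat/volatile`, and a later mount is refused while that directory exists. A minimal Python sketch for spotting leftover markers, assuming the `<root>/<id>/work` layout seen in the mount error above (the function name is hypothetical, and this is not the engine's actual cleanup code):

```python
from pathlib import Path


def find_stale_volatile_markers(snapshots_root):
    """Return the overlayfs 'volatile' marker directories left under a snapshots root.

    Assumes each snapshot lives at <root>/<id>/ and uses <root>/<id>/work as its
    overlay workdir, so the kernel marker sits at <id>/work/work/incompat/volatile.
    """
    root = Path(snapshots_root)
    return sorted(
        marker
        for marker in root.glob("*/work/work/incompat/volatile")
        if marker.is_dir()
    )
```

Pointing it at `/var/lib/dagger/worker/snapshots/snapshots` inside the engine container would list the snapshots that would fail to mount. Whether a marker can simply be deleted depends on whether the upper layer is still trustworthy after the crash.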

#

So that's one bug. But more importantly I want to see why the engine is restarting in the first place

#

in the engine logs I didn't actually see a panic or anything

#

is it just getting manually restarted?

#

or is the crash stack trace just not showing up there?

wise shoal Jan 9, 2026, 5:30 AM

#

I didn't see one, I'll nuke it all and see what I get

#

is it just getting manually restarted?
Nah it's crashing

jovial shard Jan 9, 2026, 5:35 AM

#

Ah, unfortunately after the crash the container goes down with the logs so we don't get to see why it did that

#

Is there a way to tell dagger not to put --rm on the engine docker container it makes?

weary oar Jan 9, 2026, 5:47 AM

#

jovial shard Is there a way to tell dagger not to put `--rm` on the engine docker container i...

There’s ways, but possibly simpler would be to remove the container, start it simply with just “dagger core version”, and then run in a separate terminal “docker logs -f dagger-engine-v0-19-9 2>&1 | tee ~/engine.log”. Then do the stuff that triggers the crash and you’ll have the output in that file

wise shoal Jan 9, 2026, 5:51 AM

#

[ 5315.992327] Out of memory: Killed process 1109 (dagger-engine) total-vm:2403788kB, anon-rss:371748kB, file-rss:716kB, shmem-rss:0kB, UID:0 pgtables:2472kB oom_score_adj:0

#

That's from dmesg inside the engine. The repo failing is a library that has over 20 module dependencies, each with their own test module.

weary oar Jan 9, 2026, 6:01 AM

#

wise shoal ``` [ 5315.992327] Out of memory: Killed process 1109 (dagger-engine) total-vm:2...

Huh… RSS is only like 372MB which isn’t a lot. Is there some docker desktop limit being applied?
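
The "only like 372MB" figure is the sum of the RSS fields in the kill line. A small Python sketch of that arithmetic (the function name is hypothetical, and the regex assumes the exact field layout of the line quoted above):

```python
import re

# Matches the "Out of memory: Killed process ..." summary line from dmesg.
KILL_LINE = re.compile(
    r"Killed process (\d+) \(([^)]+)\).*?"
    r"anon-rss:(\d+)kB, file-rss:(\d+)kB, shmem-rss:(\d+)kB"
)


def killed_process_rss(line):
    """Return pid, name and total resident memory (kB) of the killed process."""
    match = KILL_LINE.search(line)
    if match is None:
        return None
    pid, name, anon, file_rss, shmem = match.groups()
    return {
        "pid": int(pid),
        "name": name,
        "rss_kb": int(anon) + int(file_rss) + int(shmem),
    }


line = (
    "[ 5315.992327] Out of memory: Killed process 1109 (dagger-engine) "
    "total-vm:2403788kB, anon-rss:371748kB, file-rss:716kB, shmem-rss:0kB, "
    "UID:0 pgtables:2472kB oom_score_adj:0"
)
print(killed_process_rss(line))
# {'pid': 1109, 'name': 'dagger-engine', 'rss_kb': 372464}
```

So the engine process itself held about 372 MB resident when it was killed, which is why the kill looks out of proportion to the engine's own usage.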

wise shoal Jan 9, 2026, 6:23 AM

#

Here are the engine stats

CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O       PIDS
9a6a83c20a7d   dagger-engine-v0.19.9   0.00%     211.9MiB / 5.786GiB   3.58%     1.17kB / 126B   191MB / 512kB   20

jolly steppe Jan 9, 2026, 6:25 AM

#

weary oar Thanks, yeah I surprisingly wasn't able to repro the same error you all were hit...

Yes of course 😇

ps: I'm back, will dig the logs 🙏

wise shoal Jan 9, 2026, 6:28 AM

#

I manually deleted the incompat overlays and it ran successfully using whatever was cached from the first run.

#

It's definitely using more memory than that. I crashed it again and it climbed to around the limit before ooming.

/ # dmesg | grep oom
[ 7909.167555] dagger-engine invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[ 7909.167634]  oom_kill_process+0x144/0x360
[ 7909.167690] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 7909.168079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init,mems_allowed=0,global_oom,task_memcg=/docker/032e6c36b5e726638fa46d78b218b68555386fa899d18fe418b8f408b4e1bca9/init,task=dagger-engine,pid=31178,uid=0
[ 7909.168231] Out of memory: Killed process 31178 (dagger-engine) total-vm:2341756kB, anon-rss:461572kB, file-rss:9636kB, shmem-rss:0kB, UID:0 pgtables:2344kB oom_score_adj:0

jolly steppe Jan 9, 2026, 7:01 AM

#

wise shoal It's definitely using more memory than that. I crashed it again and it climbed t...

It's a global oom, it seems that it's the entire VM running out of memory

  constraint=CONSTRAINT_NONE
  global_oom

Questions (if answered above sorry, i checked but might have missed messages):

What's your Docker Desktop memory allocation?
- Docker Desktop → Settings → Resources → Memory
- (This sets the LinuxKit VM's total memory)
Inside the engine, what do these show?

docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.max
docker exec dagger-engine-v0.19.9 cat /sys/fs/cgroup/memory.current

What's the VM's total memory?

docker exec dagger-engine-v0.19.9 cat /proc/meminfo | head -5

Is your number lower than the 5.7gb of the engine that we set ?

If you can increase it for testing purposes, can you please try and make a run after the cleanup
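
The reading above relies on the `constraint=` field of the `oom-kill:` line: `CONSTRAINT_NONE` means the whole machine (here the Docker Desktop VM) ran out of memory, while `CONSTRAINT_MEMCG` would point at a cgroup limit. A hedged Python sketch of that check (the function name is hypothetical and the sample lines are shortened):

```python
import re


def oom_scope(oom_kill_line):
    """Classify a dmesg 'oom-kill:' line by its constraint field."""
    match = re.search(r"constraint=(CONSTRAINT_[A-Z_]+)", oom_kill_line)
    if match is None:
        return "unknown"
    constraint = match.group(1)
    if constraint == "CONSTRAINT_NONE":
        return "global"  # no cgroup/cpuset/mempolicy limit involved
    if constraint == "CONSTRAINT_MEMCG":
        return "cgroup"  # a memory cgroup hit its own limit
    return "other"  # cpuset or memory-policy constrained


print(oom_scope("oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom"))
# global
```

A "global" result is what motivates checking the VM's total memory rather than the container's limit.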

jolly steppe Jan 9, 2026, 7:04 AM

#

jolly steppe It's a global oom, it seems that it's the entire VM running out of memory ``` ...

I have this on my machine:

jolly steppe Jan 9, 2026, 7:05 AM

#

wise shoal It's definitely using more memory than that. I crashed it again and it climbed t...

here, the limit being 5.7gb ? or your docker limit ?

jolly steppe Jan 9, 2026, 7:16 AM

#

jolly steppe here, the limit being 5.7gb ? or your docker limit ?

What surprises me the most is that, as part of our release process, we bump recursively Dagger's internal components too (without CAs though), with the same command. And it's around the same amount of modules (i'll try with even more tomorrow to confirm).

It's totally possible that we have a memory leak or that we introduced something that consumes more memory since last release

By the way, do you have the same problem with 0.19.8 ?

wise shoal Jan 9, 2026, 7:23 AM

#

5.7 GB is what docker stats reported back to me as the limit. I bumped it to 8GB and it still crashed. I'm going to push it to 12GB and 2GB swap and try again.

Previous run was 8GB / 1GB Swap. The following will be with the updated resource values.
Fresh engine, only using dagger core version

/ # cat /sys/fs/cgroup/memory.max
max
/ # cat /sys/fs/cgroup/memory.current
93048832

Also pre-run

/ # cat /proc/meminfo | head -5
MemTotal:       12235596 kB
MemFree:        11203196 kB
MemAvailable:   11612708 kB
Buffers:          169800 kB
Cached:           397192 kB

dagger develop --recursive w/ 12GB mem & 2GB swap

📉 Min Available:     1895 MB
📱 Max App RAM:       7500 MB (Active Anon)
🚨 Max Dirty:         1101 MB
🧱 Max Slab:          797 MB
🗄️ Max Cache:         4508 MB
🗺️ Max PageTbl:       52 MB

(I got gemini to write me a script to track it)

It ran successfully.
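
The tracking script itself was not shared. A minimal Python sketch of how such peaks could be collected from `/proc/meminfo`, assuming the labels above map to `MemAvailable`, `Active(anon)`, `Dirty`, `Slab`, `Cached` and `PageTables` (that mapping and all names below are guesses, not the script used here):

```python
# Fields whose maximum is tracked; MemAvailable is tracked as a minimum instead.
TRACKED_MAX = ("Active(anon)", "Dirty", "Slab", "Cached", "PageTables")


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def update_peaks(peaks, sample):
    """Fold one parsed sample into the running minimum/maximum values."""
    if "MemAvailable" in sample:
        current = sample["MemAvailable"]
        peaks["min_MemAvailable"] = min(peaks.get("min_MemAvailable", current), current)
    for key in TRACKED_MAX:
        if key in sample:
            peaks["max_" + key] = max(peaks.get("max_" + key, 0), sample[key])
    return peaks
```

In practice this would be called in a loop, for example once per second on the contents of `/proc/meminfo`, while `dagger develop --recursive` runs. Sampling can miss short spikes, so the reported peaks are lower bounds.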

jolly steppe Jan 9, 2026, 7:30 AM

#

wise shoal 5.7 GB is what `docker stats` reported back to me as the limit. I bumped it to 8...

Thanks Luke 🙏

This seems to confirm that the OOM kills the engine, and leaves it in a weird state

Hypothesis: it crashed due to memory, then you guys restarted the engine, and I suppose that the volume still has the volatile directory (hard kill). Then, if you run any command since that point in time, it just crashes (due to that volatile dir).

So, I'll triple check tomorrow, but the cleanup fix that Erik suggested is actually useful to handle those kinds of OOM dirty states

Now, the real question is: why is it OOMing right now for such a small scale ?

I suppose that you were already doing the recursive develop in the past ? I'll track the perf between releases tomorrow

Can you please confirm from what dagger version you're jumping from ?

Will follow up tomorrow, thank you very much for taking the time with Chris 😍, I should be able to repro🙏

wise shoal Jan 9, 2026, 8:13 AM

#

It's the weekend for us tomorrow, I'll check back on Monday.

Now, the real question is: why is it OOMing right now for such a small scale ?
I'm not sure what your scale is, but we have a single "library" module with about 28 dependent modules (each with a test module). The main module is essentially the CI layer to generate/test/validate all the others.

I suppose that you were already doing the recursive develop in the past ?
Correct

Can you please confirm that from what dagger version you're jumping from ?
0.19.3

Here are the stats from a run on 0.19.3 (Docker: 12GB + 2GB)

📉 Min Available:     4106 MB
📱 Max App RAM:       6888 MB (Active Anon)
🚨 Max Dirty:         606 MB
🧱 Max Slab:          734 MB
🗄️ Max Cache:         6480 MB
🗺️ Max PageTbl:       38 MB

And with less resources (Docker: 6GB + 1GB)

📉 Min Available:     445 MB
📱 Max App RAM:       3616 MB (Active Anon)
🚨 Max Dirty:         494 MB
🧱 Max Slab:          512 MB
🗄️ Max Cache:         3475 MB
🗺️ Max PageTbl:       35 MB

weary oar Jan 9, 2026, 6:27 PM

#

Thanks for the info, I am seeing if I can repro any super high memory usage/leaks. So far running stuff (develop --recursive and some expensive dagger calls) in our repo w/ quite a few dagger module dependencies has not replicated anything like what you're seeing, but will keep trying

weary oar Jan 9, 2026, 6:52 PM

#

Found something interesting.

dagger develop --recursive on your repo at different versions (found traces in your org):

v0.19.9 - 3m15s
v0.19.3 - 4m50s

You do indeed have a ton of modules, a few times more than us!

The fact that the newer engine version is quite a bit faster honestly might explain why you started hitting OOMs sometimes. I've seen in the past this sort of thing happen where unblocking CPU/IO bottlenecks increases peak memory usage since you can just allocate more in parallel faster than before. I'm suspecting it's that, given I can't replicate any sort of memory leak with develop --recursive (engine RSS always goes back down to baseline after forcing a gc cycle). The fact that the OOMs are inconsistent also suggests that the problem is sort of "borderline" rather than just some absurd memory usage bug.

We should work on improving the memory usage of course, but that might be a piecemeal effort over time. For the shorter term I think we should:

1. Add a parallelism limit to dagger develop --recursive (maybe just num CPUs). That should stop the peak memory usage from going crazy in repos with tons of deps like you have. I suspect it won't actually slow it down very much if at all since the CPU is probably getting maxed out anyways with that much parallelism
2. Fix the problem with the mounts not getting cleaned up after a hard crash (which is what caused the original error that started this whole thread)

cc @jolly steppe
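
The parallelism limit proposed above is the usual bounded-worker-pool pattern. A generic Python sketch, assuming a hypothetical `worker` callable that processes one module; this is not the Dagger implementation, only an illustration of capping concurrency at the CPU count:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_bounded(items, worker, limit=None):
    """Run worker(item) for every item with at most `limit` calls in flight.

    `limit` defaults to the number of CPUs, mirroring the "maybe just num CPUs"
    suggestion; results come back in input order.
    """
    limit = limit or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # map() preserves input order and re-raises the first worker exception.
        return list(pool.map(worker, items))
```

With a cap like this, peak memory scales with `limit` rather than with the number of modules, which is the property that matters for a repo with dozens of dependencies.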

Dagger Cloud

Browse and visualize Dagger traces.

uncut roost Jan 9, 2026, 7:04 PM

#

LOL @weary oar you made it too fast

jolly steppe Jan 9, 2026, 8:01 PM

#

weary oar Found something interesting. `dagger develop --recursive` on your repo at diffe...

Implementing those as follow up, thanks Erik 🙏

weary oar Jan 9, 2026, 11:43 PM

#

jolly steppe Implementing those as follow up, thanks Erik 🙏

lemme know if I can give any pointers

jolly steppe Jan 10, 2026, 12:07 AM

#

weary oar lemme know if I can give any pointers

PR n°1: https://github.com/dagger/dagger/pull/11659

#

followup on the cleanup incoming

jovial shard Jan 10, 2026, 12:31 AM

#

Erik appreciate that analysis. I have seen things like that before: when you make improvements to apps that were previously blocked by I/O, they can utilise the CPU much more and it's possible for them to run out of resources.

#

We are an interesting case study because we have a mono repo with a lot of modules. We actually have parallelism limits in some of our dagger functions for that repo, because on some machines there was too much running in parallel and progress would just grind to a halt!

uncut roost Jan 10, 2026, 12:34 AM

#

jovial shard We are an interesting case study because we have a mono repo with a lot of modul...

We actually had to implement the same thing

jovial shard Jan 10, 2026, 12:35 AM

#

I have noticed in the jump from 0.19.3, the engine has gotten so much zippier, so very likely it's just able to do more in parallel.

uncut roost Jan 10, 2026, 12:35 AM

#

But the tricky part is: with auto-scale-out, that parallelism limit no longer makes sense: if we just hardcode it in the module, we're potentially leaving acceleration on the table

jovial shard Jan 10, 2026, 12:35 AM

#

Perhaps it should be a client setting?

uncut roost Jan 10, 2026, 12:36 AM

#

Yeah maybe. But right now it's just custom module logic. Not something the engine is exposed to. We could change that

#

System-wide parallelization throttle could make sense

#

Actually it might make sense even in a cluster, to keep cost under control 🙂

jovial shard Jan 10, 2026, 12:36 AM

#

Right

weary oar Jan 10, 2026, 12:39 AM

#

uncut roost System-wide parallelization throttle could make sense

We used to have that but it creates some really subtle and tricky deadlock scenarios. If you can keep scaling out forever (for some approximation of forever) that changes it of course though.

I guess what we could do is find some way of limiting “top-level” parallelism (ie number of checks, number of modules generating not including their deps, etc). That’d avoid the deadlock issues

#

And in the meantime one-off whack-a-moles will help 😄

jovial shard Jan 10, 2026, 12:42 AM

#

Seems like a good idea

#

Thanks for jumping on this and not only looking at the first issue with mounts but also considering the reasons for the OOM kills 🙂

#

Hopefully the feedback is helping you guys discover ways we are using the product, and therefore improve it!

jolly steppe Jan 10, 2026, 12:54 AM

#

jovial shard Hopefully the feedback is helping you guys discover ways we are using the produc...

We love it ! 😍

jovial shard Jan 10, 2026, 1:22 AM

#

Can't wait to get these fixes in and upgrade next week. We have plans to use checks and improvements to .env file to start building a very strong local experience for developers

jolly steppe Jan 10, 2026, 1:23 AM

#

jolly steppe PR n°1: https://github.com/dagger/dagger/pull/11659

Updated with your suggested implem Erik, resuming the dirty state cleanup PR

jolly steppe Jan 10, 2026, 3:27 AM

#

weary oar Found something interesting. `dagger develop --recursive` on your repo at diffe...

For the cleanup, how would you test it ? I'm having a hard time repro-ing ? Shall I manually lower my ram, try it until an OOM ? It's just that i'm gonna have a hard time making an integration test out of that 🤔

I guess I could lower my allocated ram and generate a 100 dependency module locally

weary oar Jan 10, 2026, 7:03 PM

#

jolly steppe For the cleanup, how would you test it ? I'm having a hard time repro-ing ? Shal...

You could manually test by just sigkilling the engine and then starting it again. Integ test would be a bit involved but you could probably do it with an engine as a service that you force kill in the middle of an operation. I would consider that to be nice to have but not necessary given how tough it might be to make consistent

jovial shard Jan 11, 2026, 1:34 AM

#

Looking good guys 🙂

jovial shard Jan 12, 2026, 9:35 PM

#

We're going to roll out dagger 0.19.9 to our team's runners and leave everyone else on 0.19.3 for now. Do you guys reckon the fixes would be out some time this week or would you be looking at batching them in the next release cycle?

jolly steppe Jan 12, 2026, 10:21 PM

#

jovial shard We're going to rollout dagger 0.19.9 to our team's runners and leave everyone el...

👋 we're waiting for my fix for the cleanup of the weird state + potentially a few other bugfixes (and I'll let Erik confirm) but the idea is to release asap (probably this week)

jovial shard Jan 12, 2026, 11:16 PM

#

cool sounds good

jolly steppe Jan 13, 2026, 5:49 PM

#

jovial shard cool sounds good

Erik just merged the second PR (youhou ! 😍 )

uncut roost Jan 13, 2026, 5:53 PM

#

https://tenor.com/view/yoo-hoo-waving-you-who-yoo-who-you-hoo-gif-14690612

jovial shard Jan 13, 2026, 9:56 PM

#

Great!

#

I will let my team loose on your main branch for testing

wise shoal Jan 14, 2026, 12:21 AM

#

lgtm

#

Cleanup is working, and the parallelism is preventing OOMs

weary oar Jan 14, 2026, 12:23 AM

#

Awesome, currently planning to do the release tomorrow morning!

jovial shard Jan 14, 2026, 11:43 PM

#

Hey guys. We have deployed 0.19.9 to our CI runners and we are experiencing some weird issues where the engine starts failing its kube health check and becomes unresponsive.

Here is the output from a client trying to connect to the engine:

Run exec dagger call \
1   : connect
1   : [0.0s] | cloud url=https://dagger.cloud/nine/traces/a345ff2dd8ed5a6a55694a664246d778
2   : ┆ starting engine
2   : ┆ starting engine DONE [0.0s]
3   : ┆ connecting to engine
3   : ┆ [0.0s] | 23:09:23 INF connected name=dagger-platform-engineering-dagger-helm-engine-dc62z client-version=v0.19.9 server-version=v0.19.9
3   : ┆ connecting to engine DONE [0.0s]
4   : ┆ starting session

jolly steppe Jan 14, 2026, 11:45 PM

#

jovial shard Hey guys. We have deploy 0.19.9 to our CI runners and we are experience some wei...

0.19.10 or 0.19.9 ?

jovial shard Jan 14, 2026, 11:46 PM

#

0.19.9

#

is 0.19.10 released?

jolly steppe Jan 14, 2026, 11:46 PM

#

jovial shard is 0.19.10 released?

In progress at this very moment, I guess try it out once it's out ahah 😇

jovial shard Jan 14, 2026, 11:46 PM

#

right okay

#

just posting the engine logs for brevity

#

when 0.19.10 comes out, we'll update to that and see if it goes away

#

when the engine is in this state by the way, we can't terminate the engine pod in k8s.. it's like it's stuck

#

the way I got it to close last time, was execing into the pod and doing a killall on everything in the container.

#

sorry I realise those logs are ordered newest first

#

📎 engine-logs-2026-01-15.txt

#

^ oldest first here

#

#

you can see in our logs we see a spike in errors around 10:07:25

#

in kube we see the pod is failing its healthcheck:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  3m58s (x163 over 43m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s

#

Running ps aux in the container:

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      1h43 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml
  232 root      0:05 /usr/sbin/dnsmasq --keep-in-foreground --log-facility=- --log-debug -u root --conf-file=/var/run/containers/cni/dnsname/dagger/dnsmasq.conf
199708 root      0:00 [git]
199730 root      0:00 [git]
661657 root      0:00 [git]
661679 root      0:00 [git]
1232164 root      0:00 [runc:[2:INIT]]
1583779 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/tribqpgdkgmxu25sbtc01ojl6 --keep tribqpgdkgmxu25sbt
1583835 root      0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1583995 root      0:04 bao server -dev -dev-root-token-id dev-only-token
1586579 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/ob80b4d2f2ldlmcvyhy32am30 --keep ob80b4d2f2ldlmcvyh
1586706 root      0:00 /.init bao server -dev -dev-root-token-id bao-dev-token
1587208 root      0:04 bao server -dev -dev-root-token-id bao-dev-token
1587232 root      0:00 /usr/local/bin/runc --log /var/lib/dagger/worker/executor/runc-log.json --log-format json run --bundle /var/lib/dagger/worker/executor/z7gv71hnxyssw0gwtesfkrwl6 --keep z7gv71hnxyssw0gwte
1587356 root      0:00 /.init bao server -dev -dev-root-token-id dev-only-token
1587600 root      0:04 bao server -dev -dev-root-token-id dev-only-token
1595301 root      0:00 dagger core version
1595320 root      0:00 sh
1595326 root      0:00 ps aux

#

I have it in a broken state right now, is there anything I could run on the pod to help you guys diagnose this one?

weary oar Jan 15, 2026, 12:00 AM

#

jovial shard I have it in a broken state right now, is there anything I could run on the pod ...

kernel logs could help, sudo dmesg

#

that's a very odd failure scenario

#

it feels like something deeply wrong like on the kernel level

#

e.g. 1232164 root 0:00 [runc:[2:INIT]], that's an intermediate state of a separate runc process that should be super short lived, so the fact that it's seemingly sitting there is extra odd

jovial shard Jan 15, 2026, 12:02 AM

#

📎 dmesg-2026-01-15.yml

weary oar Jan 15, 2026, 12:02 AM

#

cat /proc/meminfo (or otherwise machine wide memory usage could help too), to check its not in swap hell or something

jovial shard Jan 15, 2026, 12:03 AM

#

/ # cat /proc/meminfo
MemTotal:       64777188 kB
MemFree:         2363604 kB
MemAvailable:   52168964 kB
Buffers:            1584 kB
Cached:         45458216 kB
SwapCached:            0 kB
Active:         11956156 kB
Inactive:       38439244 kB
Active(anon):      15764 kB
Inactive(anon):  5217864 kB
Active(file):   11940392 kB
Inactive(file): 33221380 kB
Unevictable:      199448 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              1680 kB
Writeback:             0 kB
AnonPages:       5135376 kB
Mapped:          2024580 kB
Shmem:            298028 kB
KReclaimable:    5361856 kB
Slab:            9227904 kB
SReclaimable:    5361856 kB
SUnreclaim:      3866048 kB
KernelStack:       17584 kB
PageTables:        39860 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32388592 kB
Committed_AS:   15087008 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      351864 kB
VmallocChunk:          0 kB
Percpu:            25696 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:     30720 kB
FilePmdMapped:      2048 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     2107820 kB
DirectMap2M:    62867456 kB
DirectMap1G:     1048576 kB

weary oar Jan 15, 2026, 12:03 AM

#

oh and df -h to check disks

jovial shard Jan 15, 2026, 12:04 AM

#

/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  49.9G      8.7G     41.2G  17% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                    12.4G      5.1M     12.3G   0% /run/dagger
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/hosts
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /dev/termination-log
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/hostname
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/resolv.conf
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dagger/engine.toml
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dagger/engine.json
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /var/lib/dagger
tmpfs                    58.3G     12.0K     58.3G   0% /var/run/secrets/kubernetes.io/serviceaccount
/dev/nvme0n1p1           49.9G      8.7G     41.2G  17% /etc/dnsmasq-resolv.conf
overlay                  49.9G      8.7G     41.2G  17% /etc/resolv.conf

#

overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs1299463981
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs1299463981/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs1299463981/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs1299463981/.init
overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs2552938690
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs2552938690/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs2552938690/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs2552938690/.init
overlay                   2.3T     79.1G      2.2T   3% /tmp/rootfs3808259598
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs3808259598/etc/resolv.conf
/dev/nvme1n1              2.3T     79.1G      2.2T   3% /tmp/rootfs3808259598/etc/hosts
overlay                  49.9G      8.7G     41.2G  17% /tmp/rootfs3808259598/.init

weary oar Jan 15, 2026, 12:04 AM

#

hm okay that all looks okay

#

has this happened before?

jovial shard Jan 15, 2026, 12:05 AM

#

yep yesterday after upgrading we encountered this about 2 times

weary oar Jan 15, 2026, 12:07 AM

#

Is k8s trying to shut it down or something? The engine logs are all context canceled, which is what would happen when the engine is trying to shutdown after a SIGTERM for example

jovial shard Jan 15, 2026, 12:07 AM

#

the health check has been failing for a while

#

I will just check

#

well it's a readiness probe not a liveness probe

#

so theoretically k8s shouldn't have tried to shut it down

#

These are the only events we have:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  77s (x244 over 61m)  kubelet  Readiness probe failed: command timed out: "sh -exc dagger core version" timed out after 14s

weary oar Jan 15, 2026, 12:10 AM

#

the engine logs don't really show anything except the engine seemingly trying to shutdown, I'm guessing they just got cut off due to a log size limit?

jovial shard Jan 15, 2026, 12:10 AM

#

I only took the engine logs from around when the health check started failing

#

I can grab everything

weary oar Jan 15, 2026, 12:11 AM

#

I don't remember k8s semantics enough, if readiness probes failed does that mean it never entered the "ready" state? so like it just is failing to start successfully vs. it was running fine and then later went into some unhealthy state

jovial shard Jan 15, 2026, 12:12 AM

#

here we go

#

📎 2026-01-15-all-engine-logs.txt

#

sorry the previous logs also contained other engine's logs because I pulled it from our logging infra

#

This is just the failing engine's pod logs

#

I believe we don't have debug logs enabled on our engines anymore unfortunately...

#

https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#readiness-probe

Readiness probes determine when a container is ready to accept traffic. This is useful when waiting for an application to perform time-consuming initial tasks that depend on its backing services; for example: establishing network connections, loading files, and warming caches. Readiness probes can also be useful later in the container’s lifecycle, for example, when recovering from temporary faults or overloads.

If the readiness probe returns a failed state, Kubernetes removes the pod from all matching service endpoints.

Readiness probes run on the container during its whole lifecycle.

#

We actually connect the client to the engine via a host mounted unix socket, so I doubt readiness probes are having any effect here at all

#

Are there any debug endpoints I can hit on the engine to get the current state out of it?

weary oar Jan 15, 2026, 12:28 AM

#

jovial shard Are there any debug endpoints I can hit on the engine to get the current state o...

it's possible to enable debug endpoints but for this sort of thing they won't have any more information than the logs (they have cpu/memory/etc. profile type of info)

#

i'm still looking through the logs you sent

jovial shard Jan 15, 2026, 12:29 AM

#

okay no worries

#

I'll leave it in the current state for another hour

#

then will restart the pod

#

yesterday when I restarted the pod (which required me to exec into the container and do a killall runc, killall dagger-engine), it came back up with no problems

#

I assume the engine is in some kind of borked state right now

#

When I exec into the container and run:

dagger core version

It hangs indefinitely

#

#

not sure if this is useful or not but the engine stopped emitting cache metrics at 10:07 ~AEST~ AEDT

#

sorry I'll give UTC

#

UTC 2026-01-14T23:07

weary oar Jan 15, 2026, 12:47 AM

#

jovial shard UTC 2026-01-14T23:07

yeah that's where I was looking at the logs, everything started becoming canceled like the engine was shutting down at 2026-01-14T23:07:22Z

#

And last log line is written at 2026-01-14T23:08:59Z

#

But looks like the process has continued to stick around long after...

jovial shard Jan 15, 2026, 12:48 AM

#

Yeah it's still there if I run ps aux

#

1 root 1h47 /usr/local/bin/dagger-engine --config /etc/dagger/engine.toml

weary oar Jan 15, 2026, 12:49 AM

#

do you happen to have metrics (memory, cpu, etc.) data from while it was still running? just curious if it like was consuming a bunch and then got requested to shutdown at 23:07

jovial shard Jan 15, 2026, 12:50 AM

#

#

the node cpu spiked around that time

#

the node memory without cache metric is calculated as:

node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)

#

the memory the node has is 64GB so nowhere near the limit
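
Plugging numbers into the "memory without cache" formula above: a small Python sketch using the `MemTotal`, `MemFree`, `Buffers` and `Cached` values from the `/proc/meminfo` output posted earlier in the thread (values in kB; that snapshot was taken after the incident, so it only illustrates the arithmetic, and the function name is hypothetical):

```python
def memory_without_cache_kb(mem_total, mem_free, buffers, cached):
    """MemTotal - (MemFree + Buffers + Cached), all values in kB."""
    return mem_total - (mem_free + buffers + cached)


used_kb = memory_without_cache_kb(
    mem_total=64777188,
    mem_free=2363604,
    buffers=1584,
    cached=45458216,
)
print(used_kb)  # 16953784
```

That is roughly 16 GiB in use out of the 64 GB node. Note that this formula still counts kernel slab memory as used, so it can overstate what processes themselves hold.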

weary oar Jan 15, 2026, 12:51 AM

#

hm okay interesting...

jovial shard Jan 15, 2026, 12:52 AM

#

Grafana only reports things after the fact and sometimes misses spikes

#

so it's possible this is not representative

#

this is the node memory metric though so it should be accurate

#

Just an FYI we have sysdig running on this node as well

#

This could potentially be complicating our setup

weary oar Jan 15, 2026, 12:55 AM

#

it does sort of seem like there may have been a huge burst of containers trying to start right before everything went bad, based on the logs, though admittedly I have to infer that indirectly since the logs are not great.

One follow-up for this no matter what is to go do a pass on the logs we write and the levels for them; I spend so much time looking at debug logs I didn't realize how utterly useless the non-debug ones have become

#

did someone try to run dagger develop --recursive without the fix that's in v0.19.10 at 23:07 UTC? 🙂

jovial shard Jan 15, 2026, 12:56 AM

#

haha no this is a CI pod

#

so it's not getting used for that sort of thing

#

me wishes

#

hahaha

#

Do you have any knowledge of sysdig?

weary oar Jan 15, 2026, 1:01 AM

#

jovial shard Do you have any knowledge of sysdig?

I haven't used it in a long time

jovial shard Jan 15, 2026, 1:01 AM

#

Just noticing a few entries in our sysdig log around 23:07 UTC

weary oar Jan 15, 2026, 1:01 AM

#

jovial shard Just noticing a few entries in our sysdig log around 23:07 UTC

yeah I'd take a look for sure

jovial shard Jan 15, 2026, 1:01 AM

#

Okay will send you through a bit of the log

#

📎 message.txt

weary oar Jan 15, 2026, 1:05 AM

#

oh wait! I totally missed something crucial in the dmesg you sent earlier:

[68804.239285] dr=syscall_sins invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
...
[68804.240571] memory: usage 2764800kB, limit 2764800kB, failcnt 334918
[68804.262006] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[68804.262815] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice:
...
[68804.291303] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[68804.292791] [   4471] 65535  4471      255      141        0      141         0    36864        0          -998 pause
[68804.316926] [1507141]     0 1507141    21051    12106     1312    10794         0   196608        0          -997 dr-monitor
[68804.318342] [1553644]     0 1553644  1255711   645279   608273    37006         0  6287360        0          -997 dr-agent
[68804.319823] [1553645]     0 1553645    21947     3189     2111     1078         0   143360        0          -997 dr-mounted_fs_r
[68804.321409] [1553646]     0 1553646   323142    15931     5471    10460         0   258048        0          -997 cointerface
[68804.322932] [1553647]     0 1553647   547238    10182     2322     7860         0   319488        0          -997 responder
[68804.326798] [1553648]     0 1553648   321385    11772     2837     8935         0   229376        0          -997 kspm-analyzer
[68804.328265] [1553649]     0 1553649   337601    12262     3719     8543         0   389120        0          -997 host-scanner
[68804.329788] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[68804.336127] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[68804.338327] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[68804.341454] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[68804.351550] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997

#

OOM kill happened, but the engine wasn't killed

#

dr-agent, dr-monitor and dr=syscall_sins got killed

#

not sure what those are

#

or exactly how that would cause the dagger engine to start freaking out, but it's possible the engine was just trying to shut down and got into a weird state during that

#

tbh I can't say this is for sure related, since the kernel logs include everything on the whole node, just seemed worth noting

jovial shard Jan 15, 2026, 1:11 AM

#

cool

#

how do you read those timestamps on the left?

weary oar Jan 15, 2026, 1:12 AM

#

jovial shard how do you read those timestamps on the left?

unfortunately it's just a count since the machine started (I think seconds? but could be misremembering), so you have to find when the machine started to convert to absolute time
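The bracketed values in dmesg output are indeed seconds since boot, so the conversion is just boot time plus offset. A minimal sketch, assuming you know the machine's boot time (for example derived from `/proc/uptime` on the node itself). The boot time below is hypothetical, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def dmesg_to_wallclock(boot_time: datetime, seconds_since_boot: float) -> datetime:
    """Convert a dmesg [seconds.micros] stamp to an absolute time."""
    return boot_time + timedelta(seconds=seconds_since_boot)


def boot_time_from_uptime(now: datetime, uptime_text: str) -> datetime:
    """Derive boot time from the first field of /proc/uptime (seconds up)."""
    uptime_seconds = float(uptime_text.split()[0])
    return now - timedelta(seconds=uptime_seconds)


# Hypothetical boot time; use the node's real boot time, not a pod start time.
boot = datetime(2026, 1, 14, 3, 0, 0, tzinfo=timezone.utc)
stamp = dmesg_to_wallclock(boot, 68804.336127)
print(stamp.isoformat())
```

`dmesg -T` does this conversion for you and prints human-readable timestamps, though they can drift if the machine has been suspended and resumed since boot.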

jovial shard Jan 15, 2026, 1:12 AM

#

okay

#

 Start Time:       Wed, 14 Jan 2026 15:59:11 +1100

#

Okay so the machine was up at 2026-01-14T04:59

#

Using this date calculator


#

2026-01-15T12:05:55

#

so seems unrelated

#

no, that doesn't make sense because that's the future

weary oar Jan 15, 2026, 1:25 AM

#

Yeah Idk what to make of it.

The only other thing I can think of that might be interesting is current CPU/memory stats for the dagger engine process specifically

jovial shard Jan 15, 2026, 1:25 AM

#

I think that is sysdig

jolly steppe Jan 15, 2026, 1:25 AM

#

jovial shard no, that doesn't make sense because that's the future

Theory:

The kernel says the memory cgroup limit was 2,764,800 kB (~2.64 GiB) and it hit it repeatedly (failcnt 334918).

Then it killed dr-agent first (big RSS), and because memory.oom.group is set, it killed the rest of the processes in that same cgroup as a group.

Sysdig’s agent sits on the syscall path (driver/eBPF) and does heavy analysis. If it’s overloaded/restarting (OOM loop), you can get node-level jitter ?
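A rough sketch of the arithmetic behind this reading of the OOM report, using the numbers from the dmesg excerpt above. The task table is in pages. The 4 kB page size is an assumption, but it is consistent with the kB figures on the "Killed process" lines.

```python
PAGE_KB = 4          # assumption: 4 kB pages (matches the kill-line totals)
LIMIT_KB = 2764800   # from "memory: usage 2764800kB, limit 2764800kB"

# name -> (rss, rss_anon, rss_file) in pages, copied from the task table
TASKS = {
    "pause": (141, 0, 141),
    "dr-monitor": (12106, 1312, 10794),
    "dr-agent": (645279, 608273, 37006),
    "dr-mounted_fs_r": (3189, 2111, 1078),
    "cointerface": (15931, 5471, 10460),
    "responder": (10182, 2322, 7860),
    "kspm-analyzer": (11772, 2837, 8935),
    "host-scanner": (12262, 3719, 8543),
}


def pages_to_kb(pages: int) -> int:
    return pages * PAGE_KB


print(f"cgroup limit: {LIMIT_KB / 1024:.0f} MiB ({LIMIT_KB / 1024**2:.2f} GiB)")

# List tasks from largest to smallest resident set.
for name, (rss, _anon, _file) in sorted(TASKS.items(), key=lambda kv: -kv[1][0]):
    rss_kb = pages_to_kb(rss)
    share = 100 * rss_kb / LIMIT_KB
    print(f"{name:16s} {rss_kb:>9d} kB  {share:5.1f}% of limit")
```

On these numbers dr-agent alone holds about 93% of the cgroup's limit, which fits it being the process the kernel picked first.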

jovial shard Jan 15, 2026, 1:26 AM

#

I will try to work out exactly what time that was

#

hahah

weary oar Jan 15, 2026, 1:28 AM

#

I'd say that in general, things you could do to help debug this if it happens again:

enable pprof metrics endpoints for the engine, which requires starting the dagger-engine process with e.g. --debugaddr=0.0.0.0:6060 (or whatever port would make sense for you)
enable debug logs (--debug)

On our end, I'll try to clean up the logs we print on levels above debug so they can be actually useful again

jovial shard Jan 15, 2026, 1:32 AM

#

Thanks Erik

#

Sorry for bothering you guys about it, could very well be an issue on our side

#

btw got the right times

#

[Wed Jan 14 22:32:27 2026] Tasks state (memory values in pages):
[Wed Jan 14 22:32:27 2026] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[Wed Jan 14 22:32:27 2026] [   4471] 65535  4471      255      141        0      141         0    36864        0          -998 pause
[Wed Jan 14 22:32:27 2026] [1507141]     0 1507141    21051    12106     1312    10794         0   196608        0          -997 dr-monitor
[Wed Jan 14 22:32:27 2026] [1553644]     0 1553644  1255711   645279   608273    37006         0  6287360        0          -997 dr-agent
[Wed Jan 14 22:32:27 2026] [1553645]     0 1553645    21947     3189     2111     1078         0   143360        0          -997 dr-mounted_fs_r
[Wed Jan 14 22:32:27 2026] [1553646]     0 1553646   323142    15931     5471    10460         0   258048        0          -997 cointerface
[Wed Jan 14 22:32:27 2026] [1553647]     0 1553647   547238    10182     2322     7860         0   319488        0          -997 responder
[Wed Jan 14 22:32:27 2026] [1553648]     0 1553648   321385    11772     2837     8935         0   229376        0          -997 kspm-analyzer
[Wed Jan 14 22:32:27 2026] [1553649]     0 1553649   337601    12262     3719     8543         0   389120        0          -997 host-scanner

#

[Wed Jan 14 22:32:27 2026] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope,task=dr-agent,pid=1553644,uid=0
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1553644 (dr-agent) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda7f87ab7_cbda_4837_83de_3f16a8c8419f.slice/cri-containerd-369272a83e73df5e022fa226e22ba7f3aa8b43f6ffec7160eba78ed4251b7173.scope are going to be killed due to memory.oom.group set
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1507141 (dr-monitor) total-vm:84204kB, anon-rss:5248kB, file-rss:43176kB, shmem-rss:0kB, UID:0 pgtables:192kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] Memory cgroup out of memory: Killed process 1556720 (dr=syscall_sins) total-vm:5022844kB, anon-rss:2433092kB, file-rss:148024kB, shmem-rss:0kB, UID:0 pgtables:6140kB oom_score_adj:-997
[Wed Jan 14 22:32:27 2026] dagger0: port 18(veth005471b0) entered disabled state

#

so it's happening at 22:32, which is about 35m before our weird spike in logs at 2026-01-14T23:07:22Z

#

probably not relevant

#

thanks guys.. I'll quit bothering you for now

#

and yeah we will do a few things on our end:

upgrade to 0.19.10
turn on debug logs
turn on pprof

jovial shard Jan 15, 2026, 3:30 AM

#

Okay we have done all of this now and I tested that I can hit pprof all good and get traces

#

So next time this happens, any particular pprof commands you are interested in?

#

Perhaps these?

Or to look at the goroutine blocking profile, after calling runtime.SetBlockProfileRate in your program:

go tool pprof http://localhost:6060/debug/pprof/block
Or to look at the holders of contended mutexes, after calling runtime.SetMutexProfileFraction in your program:

go tool pprof http://localhost:6060/debug/pprof/mutex

#

perhaps the pprof/trace is the most useful as it will show where the app is I guess

weary oar Jan 15, 2026, 9:07 PM

#

jovial shard So next time this happens, any particular pprof commands you are interested in?

curl '<ip>:6060/debug/pprof/goroutine' > gr.pprof (or however you can best hit the endpoint and save the output) will dump goroutines, which would be very useful.

/debug/pprof/heap also may be helpful
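If curl isn't handy from where you can reach the engine, the same thing can be sketched in a few lines of Python. This assumes the engine was started with the debug address on port 6060 as above. The helper names `pprof_url` and `save_profile` are made up for this example.

```python
import urllib.request
from typing import Optional


def pprof_url(host: str, profile: str, port: int = 6060,
              debug: Optional[int] = None) -> str:
    """Build the URL for a net/http/pprof profile endpoint."""
    url = f"http://{host}:{port}/debug/pprof/{profile}"
    if debug is not None:
        url += f"?debug={debug}"
    return url


def save_profile(host: str, profile: str, out_path: str, port: int = 6060,
                 debug: Optional[int] = None, timeout: float = 30.0) -> int:
    """Fetch one profile and write it to out_path; returns bytes written."""
    url = pprof_url(host, profile, port, debug)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, `save_profile("<ip>", "goroutine", "gr.pprof")` and `save_profile("<ip>", "heap", "heap.pprof")` would cover the two endpoints mentioned. Passing `debug=2` for the goroutine profile returns full stack traces as plain text instead of the binary format.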

#

BTW we got another report of a user hitting this, so almost certainly not something wrong in your infra or similar

weary oar Jan 15, 2026, 9:48 PM

#

Also, if the engine is unresponsive to anything including on those debug endpoints, a helpful last resort would be to just manually send SIGQUIT to it (kill -s QUIT <pid>), which should dump the goroutine stacks to its output (unless things are so bad the go runtime is also borked)
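A minimal sketch of that last resort from a script running on the same host or in the same PID namespace, equivalent to `kill -s QUIT <pid>`. The function name is hypothetical. Keep in mind that with Go's default signal handling the process prints its goroutine stacks to stderr and then exits, so make sure the engine's output is being captured before sending the signal.

```python
import os
import signal


def request_goroutine_dump(pid: int) -> None:
    """Send SIGQUIT to pid, the same as `kill -s QUIT <pid>`."""
    os.kill(pid, signal.SIGQUIT)
```

In Kubernetes the dump would land in the container's log stream. If the container restarts afterwards, `kubectl logs --previous` should still show it.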

jovial shard Jan 15, 2026, 10:25 PM

#

Thanks Erik will stay on the lookout, we should be giving our 0.19.10 engine a heavy workout today, so will report back if we hit the bug

weary oar Jan 15, 2026, 10:26 PM

#

jovial shard Thanks Erik will stay on the lookout, we should be giving our 0.19.10 engine a h...

thanks! appreciate you working with us on tracking this down!

jovial shard Jan 15, 2026, 10:30 PM

#

Yeah easy

jovial shard Jan 19, 2026, 11:11 PM

#

Hi Erik we have not seen this error since the upgrade to 0.19.10

#

Did you guys get to the bottom of it in the end? Was it something that was fixed in 0.19.10?
