#Cache cleanup seems not to work as expected


fading sequoia
#

Hey guys, I have a huge machine and I'm running the Dagger engine in Docker with some volume mounts. The exec command looks like this:

/usr/bin/docker run --rm --name dagger-engine --security-opt seccomp=unconfined --privileged -v /etc/dagger/engine.json:/etc/dagger/engine.json -v /etc/dagger/engine.toml:/etc/dagger/engine.toml -v /run/dagger-dagger:/run/dagger -v /var/lib/docker/dagger-cache:/var/lib/dagger registry.dagger.io/engine:v0.19.7

Mainly mounting cache, socket, and configurations to control the engine. My mounted configurations:

{
    "registries": {
        "docker.io": {
            "mirrors": ["mirror.gcr.io"]
        }
    },
    "gc": {
        "enabled": true,
        "maxUsedSpace": "300GB"
    }
}

and

[registry."docker.io"]
mirrors = ["mirror.gcr.io"]

[worker.oci]
gc = true
maxUsedSpace = "300GB"

The registry configuration seems to work fine, but GC does not. I have limited the cache to 300GB, but disk usage shows that it consumes whatever disk space is mounted. It just fills the disk and then fails silently, e.g. some container fails to be created or similar.

In the attached picture you can see the disk usage creep. This disk is 1.1 TB in size, and the engine does not respect any of the provided values.

#

Also, sometimes it feels like cache is left hanging after a while: dagger reports one size and the mounted volume reports a different one.

# du -sh /var/lib/docker/*/  2>/dev/null | sort -rh | head -20
643G    /var/lib/docker/dagger-cache/

vs

# dagger core engine local-cache entry-set disk-space-bytes
✔ connect 0.0s
✔ loading type definitions 0.3s
✔ parsing command line arguments 0.0s

✔ engine: Engine! 0.0s
✔ .localCache: EngineCache! 0.0s
✔ .entrySet: EngineCacheEntrySet! 0.0s
✔ .diskSpaceBytes: Int! 0.0s

1.30041712293e+11

which equals around 121 GiB (about 130 GB)
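
For reference, the conversion of that byte count can be checked with a small shell helper. This is only a sketch: `bytes_to_gib` and `bytes_to_gb` are hypothetical helper names (not part of the Dagger CLI), and it assumes `awk` is available on the host.

```shell
# Hypothetical helpers (not part of dagger): convert a raw byte count,
# including scientific notation as printed by the dagger CLI, to GiB or GB.
bytes_to_gib() {
  # Binary units: 1 GiB = 1024^3 bytes
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

bytes_to_gb() {
  # Decimal units: 1 GB = 10^9 bytes
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000000 }'
}

bytes_to_gib 1.30041712293e+11   # prints 121.1
bytes_to_gb  1.30041712293e+11   # prints 130.0
```

The same number reads as roughly 121 in binary units and 130 in decimal units, so it matters which unit each tool reports when comparing against du.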

primal loom
#

Hey @fading sequoia
Taking a look today !

primal loom
#

@primal loom self pinging 😇 🤣

primal loom
# fading sequoia As well sometimes it feels that cache is left hanging after a while, dagger repo...

I confirmed a real root cause, but I'm not yet 100% sure that it is the only cause of your exact 643G vs 121G snapshot.

Confirmed locally:

  • During an active exec writing large data, raw /var/lib/dagger grows.
  • dagger core engine local-cache entry-set disk-space-bytes under-reports it badly.
  • GC/prune does not act on that hidden active mutable state.
  • Reproduces on v0.19.7, v0.20.7, and main.

So the likely root cause is: active mutable BuildKit/Dagger cache state is consuming disk but is not counted as local-cache disk usage while it is active. GC cannot delete active data, but it also should not make the cache limit/accounting look like everything is fine.

#

Could you please run this command against that engine, when du is much larger than Dagger's local-cache number:

  docker exec dagger-engine sh -lc '
  date -Is
  echo RAW
  du -xsh /var/lib/dagger
  echo ACCOUNTED
  dagger -s core engine local-cache entry-set disk-space-bytes
  echo TOP_DIRS
  du -xsh /var/lib/dagger/* /var/lib/dagger/worker/* 2>/dev/null | sort -h | tail -30
  '

Also, side questions:

  1. Were those numbers captured while Dagger jobs were actively running?
  2. If you wait until no Dagger jobs are running, does local-cache entry-set disk-space-bytes jump closer to du?
  3. After an engine restart window, does raw /var/lib/dagger drop?

Could you also please capture:

docker logs dagger-engine 2>&1 | grep -E 'gc cleaned|remove snapshot|starting dagger engine|no space|failed'

fading sequoia
#

  1. No active jobs. That is already the usual state of that machine.
  2. No, the example I showed was from an idle machine with no active jobs.
  3. No, it does not drop.

primal loom
#

Ok, so probably not the root cause. Will keep digging 🙏

fading sequoia
#
docker exec dagger-engine sh -lc '
>   date -Is
>   echo RAW
>   du -xsh /var/lib/dagger
>   echo ACCOUNTED
>   dagger -s core engine local-cache entry-set disk-space-bytes
>   echo TOP_DIRS
>   du -xsh /var/lib/dagger/* /var/lib/dagger/worker/* 2>/dev/null | sort -h | tail -30
>   '
2026-05-05T21:14:23+00:00
RAW
361.4G  /var/lib/dagger
ACCOUNTED

2.49527889885e+11TOP_DIRS
0       /var/lib/dagger/dagger-engine.lock
4.0K    /var/lib/dagger/dagql-cache.db
4.0K    /var/lib/dagger/secret-salt
4.0K    /var/lib/dagger/worker/workerid
8.0K    /var/lib/dagger/net
20.0K   /var/lib/dagger/dagql-cache.db-wal
32.0K   /var/lib/dagger/dagql-cache.db-shm
3.9M    /var/lib/dagger/cache.db
20.9M   /var/lib/dagger/worker/containerdmeta.db
537.3M  /var/lib/dagger/worker/metadata_v2.db
361.4G  /var/lib/dagger/worker

primal loom
#

I can't seem to repro locally. Once the pipeline finishes running, the estimated cache usage gets updated (see the comment above), and if it goes above the threshold, it does seem to trigger the GC.

Also, I might be hitting the limit of my knowledge here. cc @orchid meteor for any guidance on how to isolate the problem 🙏

Could you please run this now (again, when you're in a bad state):

docker exec dagger-engine sh -lc '
  set -eu
  date -Is

  echo RAW
  du -xsh /var/lib/dagger

  echo ACCOUNTED
  dagger -s core engine local-cache entry-set disk-space-bytes

  echo WORKER_LS
  ls -lahA /var/lib/dagger/worker

  echo WORKER_CHILDREN
  for p in /var/lib/dagger/worker/.[!.]* /var/lib/dagger/worker/*; do
    [ -e "$p" ] || continue
    du -xsh "$p" 2>/dev/null || true
  done | sort -h

  echo MOUNTS
  cat /proc/self/mountinfo | grep /var/lib/dagger || true

  echo PROCS
  ps -ef | grep -E "dagger -s|runc|containerd-shim|dd|sleep" | grep -v grep || true
  '

fading sequoia
#

Basically this discrepancy is already the normal state, and with more usage it just gets bigger and bigger until the disks are completely full.

primal loom
#

This is very useful, thanks 🙏

I think we can rule out the active-exec theory for this snapshot: PROCS is empty, so this is idle/cold state (you were correct 😇)

What stands out now is:

  • ACCOUNTED: ~280GB decimal / 261GiB
  • raw /var/lib/dagger/worker/snapshots: ~391GiB

So Dagger/BuildKit sees ~280GB of cache, but the snapshotter has ~391GiB on disk. That leaves roughly 110-130GiB of snapshot data outside local-cache accounting (consistent with your first message --> this is the gap that gets bigger and bigger).

Since your GC policy is maxUsedSpace=300GB, the engine thinks it is still under the limit and does not prune.
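
The reasoning above can be sketched as a comparison between usage and the configured limit. Assumptions: the `300GB` limit is read as decimal gigabytes (how the engine actually parses the suffix is not confirmed in this thread), the byte values are rounded from the approximate figures quoted above, and `would_prune` is a hypothetical helper, not the engine's real GC logic.

```shell
# Hypothetical sketch, not the engine's actual GC code.
# Prints "prune" when usage exceeds the limit, otherwise "keep".
would_prune() {
  awk -v used="$1" -v limit="$2" \
    'BEGIN { if (used > limit) print "prune"; else print "keep" }'
}

LIMIT=300000000000                  # maxUsedSpace = 300GB, assumed decimal
ACCOUNTED=280000000000              # ~280GB, what local-cache reports
RAW=$((391 * 1024 * 1024 * 1024))   # ~391GiB, what du sees in snapshots

would_prune "$ACCOUNTED" "$LIMIT"   # prints keep
would_prune "$RAW" "$LIMIT"         # prints prune
```

Under these assumptions the decision only flips if the raw on-disk size is what gets compared, which matches the observation that the engine believes it is under the limit.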

Current theory

This could be related to volatile overlay. We later added cleanup for dirty volatile snapshots (supposedly only on restarts after a crash), and I don't think v0.19.7 has it.

Did you already try the same cache volume with v0.20.7? If v0.20.7 cleans it up or stops the growth, then this is probably already fixed since above commit? 🤔 If it still happens on v0.20.7, it could be worth testing on Theseus (main) just to avoid exploring some obscure problems (Erik entirely redesigned the caching system, and we have more control over those obscure things) [this will be released as part of 0.21.0]

primal loom
# fading sequoia Output of the command

--
One last thing that would help confirm or rule out the volatile-overlay theory. Could you please run this while the engine is idle and in the bad state?

  docker exec dagger-engine sh -lc '
  set -eu
  SN=/var/lib/dagger/worker/snapshots/snapshots

  echo WHEN
  date -Is

  echo RAW_BYTES
  for p in /var/lib/dagger /var/lib/dagger/worker /var/lib/dagger/worker/snapshots; do
    [ -e "$p" ] || continue
    timeout 300 du -xsb "$p" 2>/dev/null || echo "ERR_OR_TIMEOUT $p"
  done

  echo ACCOUNTED_BYTES
  dagger -s core engine local-cache entry-set disk-space-bytes || true

  echo VOLATILE_TOTAL_BYTES_AND_COUNT
  find "$SN" -mindepth 1 -maxdepth 1 -type d 2>/dev/null |
  while IFS= read -r d; do
    [ -e "$d/work/incompat/volatile" ] || continue
    du -xsb "$d" 2>/dev/null || true
  done |
  awk "{s+=\$1; n++} END{print s+0, n+0}"

  echo VOLATILE_SAMPLE
  find "$SN" -mindepth 1 -maxdepth 1 -type d 2>/dev/null |
  while IFS= read -r d; do
    [ -e "$d/work/incompat/volatile" ] && echo "$d/work/incompat/volatile"
  done |
  head -20
  '

If VOLATILE_TOTAL_BYTES_AND_COUNT is close to the raw/accounted gap, dirty volatile snapshots are very likely the cause. If it is 0 or tiny, volatile is probably not the main culprit and I'm definitely lost ahah 👼😛

fading sequoia
#

I'll check it whenever I have a chance :)

fading sequoia
#
WHEN
2026-05-07T07:02:17+00:00
RAW_BYTES
570724662254    /var/lib/dagger
ERR_OR_TIMEOUT /var/lib/dagger
564983855340    /var/lib/dagger/worker
567768628237    /var/lib/dagger/worker/snapshots
ERR_OR_TIMEOUT /var/lib/dagger/worker/snapshots
ACCOUNTED_BYTES
3.24955763689e+11

A new release of dagger is available: v0.19.7 → v0.20.8
To upgrade, see https://docs.dagger.io/install
https://github.com/dagger/dagger/releases/tag/v0.20.8
VOLATILE_TOTAL_BYTES_AND_COUNT
0 0
VOLATILE_SAMPLE
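
For reference, the raw/accounted gap in this output can be worked out directly from the byte counts printed above. A minimal sketch: `gap_gib` is a hypothetical helper name and only does arithmetic on the numbers shown, assuming `awk` is available.

```shell
# Hypothetical helper: difference between raw on-disk bytes and the bytes
# Dagger accounts for, printed in GiB.
gap_gib() {
  awk -v raw="$1" -v acc="$2" \
    'BEGIN { printf "%.1f\n", (raw - acc) / (1024 * 1024 * 1024) }'
}

RAW_BYTES=570724662254              # du -xsb /var/lib/dagger
ACCOUNTED_BYTES=3.24955763689e+11   # local-cache entry-set disk-space-bytes

gap_gib "$RAW_BYTES" "$ACCOUNTED_BYTES"   # prints 228.9
```

With VOLATILE_TOTAL_BYTES_AND_COUNT at `0 0`, none of that gap is accounted for by dirty volatile snapshots.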

primal loom
#

Ok, another theory. I have a kind-of repro

primal loom
#

But, if i am right, Theseus (dagger v0.21) will probably fix it

primal loom
#

Would it be possible for you to test on the dev main version ?

fading sequoia
#

Not sure. We are running CI loads there, so it might cause issues with old client versions, and upgrading is a bit of a pain.

primal loom
fading sequoia
#

Thanks, will try to upgrade to stable then 🙂

#

Will it be compatible with v0.19.7?

fading sequoia
#

@primal loom funny story: this type of strange behavior only happens with Dagger running inside Docker with mounts, whereas the Kubernetes Helm chart seems to work as expected

primal loom
#

@hidden gale continuing the thread here #general message. Do you have mounts too ?