#Disk grows even as cache is garbage collected


dapper warren
#

We are seeing our Dagger engines in k8s being evicted due to DiskPressure, even though we can see that the disk cache is in fact being garbage collected. On the node, I can see that the disk build-up is indeed coming from the cache folder /var/lib/dagger-dagger-18/worker/snapshots. Even though the cache is being swept, the size of this folder is rarely reduced, and over time it grows until the disk is full. What are we doing wrong?

Our engine config:

{
  "logLevel": "info",
  "gc": {
    "enabled": true,
    "maxUsedSpace": "50%",
    "reservedSpace": "15GB",
    "minFreeSpace": "20%"
  }
}
chilly night
dapper warren
#

The node disk is 250Gi

dapper warren
#

Very grateful for any help as we are really stumped

chilly night
#

@dapper warren are both charts from the same node? According to the chart on the left, which IIUC is coming from the engine metrics, the cache seems to be correctly staying under the 125GB mark per your GC config
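
For anyone following along, a rough sketch of where the 125GB figure comes from. It assumes percentage values are resolved against the total disk size and that "GB" means 10^9 bytes; neither is confirmed in this thread, and `resolve_limit` is a hypothetical helper, not part of Dagger.

```python
def resolve_limit(value: str, disk_bytes: int) -> int:
    """Resolve a GC limit written as '50%', '15GB', or a plain byte count.

    Assumptions (not confirmed in this thread): percentages are relative
    to the total disk size, and 'GB' means 10**9 bytes.
    """
    value = value.strip()
    if value.endswith("%"):
        return disk_bytes * int(value[:-1]) // 100
    if value.upper().endswith("GB"):
        return int(value[:-2]) * 10**9
    return int(value)  # already a byte count


# Node disk from the thread, loosely treated as 250 GB.
disk = 250 * 10**9
gc = {"maxUsedSpace": "50%", "reservedSpace": "15GB", "minFreeSpace": "20%"}

limits = {name: resolve_limit(value, disk) for name, value in gc.items()}
for name, size in limits.items():
    print(f"{name}: {size / 10**9:.0f} GB")
```

Under those assumptions, `maxUsedSpace` resolves to 125 GB and `minFreeSpace` to 50 GB on this node.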

dapper warren
#

Yeah it is the same node (one is from the pod the other from our node exporter).
You are hitting upon the very core of our predicament. Even though the cache is garbage collected, meaning the reported cache size goes down, the actual size of the folder where BuildKit stores the cache keeps increasing.

chilly night
fleet grove
#

the dagger engine is continuously running right?

#

i'm not entirely sure why it wouldn't be collecting this, how are you verifying that the garbage collection is definitely running?

#

what's the output of `echo "{engine{localCache{maxUsedSpace,minFreeSpace,reservedSpace,targetSpace}}}" | dagger query`
curious if for some reason the config isn't being detected correctly

chilly night
dapper warren
# fleet grove what's the output of `echo "{engine{localCache{maxUsedSpace,minFreeSpace,reserve...

I do not have this output for you right now, but I have previously confirmed that yes, it is picking up the configuration, and this aligns perfectly with what we are seeing from the engine metric reporting the cache size.
I should mention that, beyond the engine, there are GHA runners using the dagger-github action running on the node. Even if these were using up some disk space, I would not expect it to be inside /var/lib/dagger-dagger-18/worker/snapshots

dapper warren
fleet grove
#

okay, cool 🙏
just to check, how are you measuring disk saturation? what's the data point you're using?

#

cc @night citrus not sure if you've seen this kind of issue before

dapper warren
#

We are using node_filesystem_size_bytes from the Prometheus node exporter. I have also ssh'ed to the node and observed the folder directly
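
A minimal sketch of how the two views could be compared side by side on the node: the on-disk size of the snapshots folder versus the filesystem-level usage. `dir_disk_usage` and `report` are hypothetical helper names, and the sketch assumes a Unix host where `st_blocks` is available.

```python
import os
import shutil


def dir_disk_usage(path: str) -> int:
    """Sum on-disk bytes under path, similar in spirit to `du -s`.

    Uses st_blocks (512-byte units) so sparse files are not overcounted,
    and counts each inode once so hard links are not double counted.
    """
    seen = set()
    total = 0
    for root, dirs, files in os.walk(path, onerror=lambda err: None):
        for name in dirs + files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # entry vanished mid-walk, e.g. removed by GC
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_blocks * 512
    return total


def report(snapshot_dir: str) -> dict:
    """Return folder usage next to usage of the filesystem that holds it."""
    fs = shutil.disk_usage(snapshot_dir)
    return {
        "dir_bytes": dir_disk_usage(snapshot_dir),
        "fs_used_bytes": fs.used,
        "fs_total_bytes": fs.total,
    }
```

Calling `report("/var/lib/dagger-dagger-18/worker/snapshots")` periodically and plotting `dir_bytes` next to the engine's reported cache size would show whether the folder really lags behind the GC.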

dapper warren
#

We have tried switching to a StatefulSet with a PVC instead, in order to isolate the issue. While the node is no longer killed due to DiskPressure, the underlying issue persists: the cache grows and is regularly garbage collected, but the disk (PVC) utilization is not reduced. Since the node is not killed, we instead end up with jobs that cannot start because the disk is full:

```
! start session: connect buildkit session: unexpected status 200: get or init client: open client DB: ping file:///var/lib/dagger/worker/clientdbs/r4m66jpzi5mxrpduxqxxkq2ja.db?_pragma=foreign_keys%3DON&_pragma=journal_mode%3DWAL&_pragma=synchronous%3DOFF&_pragma=busy_timeout%3D10000&_txlock=immediate: database or disk is full (13)
├─● starting engine 0.0s
├─● connecting to engine 0.0s
╰─✘ starting session 0.0s ERROR
  ! connect buildkit session: unexpected status 200: get or init client: open client DB: ping file:///var/lib/dagger/worker/clientdbs/r4m66jpzi5mxrpduxqxxkq2ja.db?_pragma=foreign_keys%3DON&_pragma=journal_mode%3DWAL&_pragma=synchronous%3DOFF&_pragma=busy_timeout%3D10000&_txlock=immediate: database or disk is full (13)
```

The PVC size is 200GB (set to not expand) and our GC config is:

```
{
    "engine": {
        "localCache": {
            "maxUsedSpace": 147000000000,
            "minFreeSpace": 40000000000,
            "reservedSpace": 10000000000,
            "targetSpace": 0
        }
    }
}
```
dapper warren
#

Adding to this. The disk usage did finally drop after 3 hours. Looking at the cache size graph it could look like a garbage collection that finally finished at that point. Does the garbage collection first evict the cache entries and then delete them? I guess this could be related to having many small cache entries?

dapper warren
#

We had 75K cache entries when we maxed out. Are we caching too much?

dapper warren
#

During this period we are using 2-4 cpu cores (out of 4) while the engine is doing nothing but garbage collecting... presumably

chilly night
dapper warren
#

Thank you very much. This does indeed sound promising. We will test this

(We did actually try it before, but this was before the naming bug was fixed, so it did not work and we forgot all about it)

misty cairn
#

Update: tmab is on a well-deserved summer vacation. We now see the garbage collection finish after changing the setting to 5% instead of the default 50%. It is still quite slow, and we are not currently using NVMe disks, so that is our next step to try. Again, thanks for the help getting us unblocked.

chilly night
misty cairn
#

We have enabled a sweepSize of 5% and have not seen evictions since then 🙏 Once we have moved to NVMe drives, we might be able to increase sweepSize back to the default. Thanks for all the help.

chilly night