We tried switching to a StatefulSet with a PVC in order to isolate the issue. The node is no longer killed due to DiskPressure, but the underlying issue persists: the cache grows and is regularly garbage collected, yet the disk (PVC) utilization is not reduced. Since the node is not killed, we instead end up with jobs that cannot start because the disk is full:
```
! start session: connect buildkit session: unexpected status 200: get or init client: open client DB: ping file:///var/lib/dagger/worker/clientdbs/r4m66jpzi5mxrpduxqxxkq2ja.db?_pragma=foreign_keys%3DON&_pragma=journal_mode%3DWAL&_pragma=synchronous%3DOFF&_pragma=busy_timeout%3D10000&_txlock=immediate: database or disk is full (13)
├── starting engine 0.0s
├── connecting to engine 0.0s
╰── starting session 0.0s ERROR
! connect buildkit session: unexpected status 200: get or init client: open client DB: ping file:///var/lib/dagger/worker/clientdbs/r4m66jpzi5mxrpduxqxxkq2ja.db?_pragma=foreign_keys%3DON&_pragma=journal_mode%3DWAL&_pragma=synchronous%3DOFF&_pragma=busy_timeout%3D10000&_txlock=immediate: database or disk is full (13)
```
The PVC size is 200 GB (set not to expand), and our GC config is:
```
{
  "engine": {
    "localCache": {
      "maxUsedSpace": 147000000000,
      "minFreeSpace": 40000000000,
      "reservedSpace": 10000000000,
      "targetSpace": 0
    }
  }
}
```
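
As a rough sanity check on these numbers, here is a small Python sketch of the headroom we would expect if GC were actually freeing disk space. It rests on two assumptions that are not confirmed above: that `maxUsedSpace` caps the cache size and `minFreeSpace` is the free space GC tries to preserve on the volume, and that the 200 GB PVC is exactly 200e9 bytes. The helper name `implied_cache_ceiling` is hypothetical, and `reservedSpace` and `targetSpace` are deliberately left out of the calculation.

```python
import json

# Values copied from the GC config above.
CONFIG = json.loads("""
{
  "engine": {
    "localCache": {
      "maxUsedSpace": 147000000000,
      "minFreeSpace": 40000000000,
      "reservedSpace": 10000000000,
      "targetSpace": 0
    }
  }
}
""")

# Assumption: the 200 GB PVC is treated as 200e9 bytes.
PVC_BYTES = 200_000_000_000


def implied_cache_ceiling(local_cache: dict, volume_bytes: int) -> int:
    """Largest cache size that satisfies both limits under the assumed reading:
    cache <= maxUsedSpace, and free space on the volume >= minFreeSpace."""
    by_max_used = local_cache["maxUsedSpace"]
    by_min_free = volume_bytes - local_cache["minFreeSpace"]
    return min(by_max_used, by_min_free)


ceiling = implied_cache_ceiling(CONFIG["engine"]["localCache"], PVC_BYTES)
headroom = PVC_BYTES - ceiling
print(f"cache ceiling: {ceiling / 1e9:.0f} GB, expected headroom: {headroom / 1e9:.0f} GB")
```

Under that reading the cache should top out at 147 GB and leave roughly 53 GB free, so a "disk is full" error suggests that space is being held by something GC does not account for or does not release.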