Cache filling disk | Dagger

mortal vortex Sep 2, 2025, 5:51 AM

#

Hi Chris, normally there are safeguards in place, perhaps you have hit an edge case? Sorry about that.

#

@fervent pasture @urban saddle @outer sonnet will have more context

trim scaffold Sep 2, 2025, 5:52 AM

#

All good

#

I don't believe the cache should really even be 1TB in size. I would have thought that many cache items would be deduplicated or cleaned up.

#

I understand dagger has a GC which should deal with this

#

We currently don't have metrics set up to capture the cache size, so unfortunately we can't tell how this happened over time
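One low-effort way to capture growth over time is to sample filesystem usage on the volume that holds the engine state. The following is a minimal sketch, not a Dagger feature: the function name `disk_usage_report` is hypothetical, and `/var/lib/dagger` is only an assumed state directory that should be checked against the actual node setup. Note that `shutil.disk_usage` reports on the whole filesystem containing the path, not on the directory alone.

```python
import shutil


def disk_usage_report(path: str) -> dict:
    """Return usage figures for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_gb": usage.total / 1e9,
        "used_gb": usage.used / 1e9,
        "free_gb": usage.free / 1e9,
        "used_pct": 100 * usage.used / usage.total,
    }


# "." is used so the sketch runs anywhere; on an engine node the path would
# be the engine state directory (assumed here to be /var/lib/dagger).
report = disk_usage_report(".")
print(f"used: {report['used_gb']:.1f} GB ({report['used_pct']:.1f}%)")
```

Running this on a schedule and shipping the numbers to an existing metrics system would show whether usage climbs steadily or jumps.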

mortal vortex Sep 2, 2025, 5:54 AM

#

yes the GC should have caught this

trim scaffold Sep 2, 2025, 5:54 AM

#

But the culprit could be something in our node setup prohibiting the GC from doing its job

#

We'll keep looking and let you know if we find something interesting

mortal vortex Sep 2, 2025, 5:55 AM

#

it might be worth searching the discord archives

trim scaffold Sep 2, 2025, 5:57 AM

#

okay

fervent pasture Sep 2, 2025, 7:25 PM

#

by any chance do you have a custom engine.toml or engine.json GC setting?

the default buildkit GC pruning trigger has always been a bit obscure to me and AFAIK there are barely any logs to understand what's happening there. Since I know @outer sonnet has been recently-ish touching some things around GC, I'm wondering if there are some logs we could include here to understand when and what policy buildkit is using to run GC

#

@trim scaffold does manually running `dagger core engine local-cache prune` release anything?

trim scaffold Sep 2, 2025, 10:18 PM

#

We ended up nuking the node because running the above command was slow. We waited over an hour. The cache was coming down but it was very slow.

#

For reference we are using i7i.2xlarge

#

Should have plenty of disk iops

fervent pasture Sep 2, 2025, 10:49 PM

#

thx and sorry for the slowness 🙏 . Re-surfacing my question above: do you happen to have any custom config for the GC policy?

trim scaffold Sep 2, 2025, 10:52 PM

#

just asked the team. no we don't have any in place

#

all good. it could be something on our end quite possibly. We do have sysdig running on the node. may have done something weird

trim scaffold Sep 2, 2025, 11:20 PM

#

Just a question on how the cache works..

1. if say i have 3 parallel processes which execute the same container steps, and there is nothing in cache, i presume all 3 cannot take advantage of cache.
2. Now the results of all 3 will be saved to cache. Will a process deduplicate the 3 results so that only 1 is saved on disk? Is this the GC?

#

Just interested in the cache internals

#

Just took a look at the storage metrics for our node over 90 days

#

The storage slowly climbs up

mortal vortex Sep 3, 2025, 12:16 AM

#

1. normally yes
2. yes it will dedup. That's distinct from the gc I believe

#

dedup at the operation level not content level
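To illustrate the difference between operation-level and content-level dedup, here is a toy model. It is not the engine's implementation: the `OperationCache` class, the digest scheme, and the operation dictionaries are all hypothetical. Entries are keyed by a hash of the operation definition, so three runs of the same operation share one entry, while two different operations that happen to produce identical bytes still occupy two entries.

```python
import hashlib
import json


class OperationCache:
    """Toy cache keyed by a digest of the operation, not of its output."""

    def __init__(self) -> None:
        self.entries: dict[str, bytes] = {}

    @staticmethod
    def digest(operation: dict) -> str:
        # Canonical JSON so equal operations always hash to the same key.
        canonical = json.dumps(operation, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def store(self, operation: dict, result: bytes) -> None:
        # Storing the same operation again overwrites the same key.
        self.entries[self.digest(operation)] = result


cache = OperationCache()

# Three "parallel" runs of the same operation: one entry.
build = {"op": "exec", "args": ["make", "build"], "image": "alpine"}
for _ in range(3):
    cache.store(build, b"artifact")

# A different operation with byte-identical output: a second entry,
# because nothing compares the stored content.
other = {"op": "exec", "args": ["make", "all"], "image": "alpine"}
cache.store(other, b"artifact")

print(len(cache.entries))  # 2
```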

trim scaffold Sep 3, 2025, 1:16 AM

#

thanks Solomon

trim scaffold Sep 3, 2025, 5:44 AM

#

So when we checked the cache value for maxUsedSpace we found it was very high... ~1800GB

#

Believe this is a result of the default cache policy which uses 75% of the available disk.

#

Because we have a very large node ( mainly for higher IOPS ) it also comes with a very large SSD disk.

#

Most of which we don't need probably
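As a quick check of the arithmetic, the sketch below applies the 75% default mentioned above to the roughly 2.5TB disk reported later in this thread. The function name is hypothetical, and the real engine may round or cap the value differently.

```python
def default_max_used_space_gb(disk_gb: float, fraction: float = 0.75) -> float:
    """Cache ceiling implied by a percentage-of-disk default."""
    return disk_gb * fraction


print(default_max_used_space_gb(2500))  # 1875.0
```

That lands close to the ~1800GB value observed on the node, which is consistent with the default policy being in effect rather than a custom setting.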

#

My guess is that at a certain size of cache, running a GC doesn't perform very well anymore

#

When we checked the size of the cache, it was only a little higher than the maxUsedSpace value. I would guess the GC attempted to run but couldn't succeed.

#

We are going to try tweaking the maxUsedSpace parameter to a lower value and see if it improves things

trim scaffold Sep 3, 2025, 6:03 AM

#

Is there a metric for observing "cache misses" from the dagger engine?

#

Might help to find the right value for maxUsedSpace going forward

fervent pasture Sep 3, 2025, 1:55 PM

#

if GC is taking too much time, we added an option called sweepSize (https://docs.dagger.io/reference/configuration/engine#garbage-collection) which helps by defining the max space to be collected in a single pass of the GC.
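To get a feel for what a per-pass limit implies, here is a deliberately simplified model. It assumes each pass frees at most the sweep size until the cache is back under its ceiling; the engine's real scheduling and policies may differ. The function name and the numbers are hypothetical.

```python
import math


def gc_passes(cache_gb: float, target_gb: float, sweep_gb: float) -> int:
    """Passes needed if each pass frees at most `sweep_gb` (toy model)."""
    if sweep_gb <= 0:
        raise ValueError("sweep_gb must be positive")
    excess = max(0.0, cache_gb - target_gb)
    return math.ceil(excess / sweep_gb)


# 100 GB over the ceiling, freed 20 GB at a time.
print(gc_passes(cache_gb=1900, target_gb=1800, sweep_gb=20))  # 5
```

A smaller sweep size spreads the work over more, shorter passes instead of one long one.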

trim scaffold Sep 4, 2025, 12:03 AM

#

Okay thanks I'll take a look at that

trim scaffold Sep 4, 2025, 4:58 AM

#

Things seem to be stable since we set maxUsedSpace last night. The moral of the story is that we needed to tweak these parameters given the huge available disk that there is on our machines

fervent pasture Sep 4, 2025, 5:45 PM

#

hmm that's a bit odd given that it should have set maxUsedSpace automatically to the total amount of the disk size. We'll take this one to see what might be happening

outer sonnet Sep 4, 2025, 5:53 PM

#

i think it's still less than the available size

#

but it's that the disk is huge

trim scaffold Sep 4, 2025, 9:50 PM

#

Yes it was less than the available disk which was 2.5TB, but was huge

#

I made an error in my initial statement... the cache wasn't filling the disk, it was actually around the correct size based on the GC defaults.

#

GC cycles were probably running in fact

#

I think however that at a certain size the cache performance degrades somehow

#

We set a new maxUsedSpace of 90GB and performance is better.

#

We are going to up that and keep an eye on performance 🙂
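For anyone wanting to reproduce this kind of change, the sketch below renders an `engine.json` fragment with a lower ceiling. The option names `maxUsedSpace` and `sweepSize` come from this thread, but nesting them under a `gc` key is an assumption that should be verified against the engine configuration reference linked above. The helper `gc_config` is hypothetical.

```python
import json
from typing import Optional


def gc_config(max_used_space: str, sweep_size: Optional[str] = None) -> str:
    """Render a GC section for engine.json (layout is an assumption)."""
    gc = {"maxUsedSpace": max_used_space}
    if sweep_size is not None:
        # Only emitted when explicitly requested; no default is implied.
        gc["sweepSize"] = sweep_size
    return json.dumps({"gc": gc}, indent=2)


print(gc_config("90GB"))
```

Generating the file from a script makes it easier to raise the value step by step while watching performance.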

mortal vortex Sep 4, 2025, 9:56 PM

#

Is it possible that it was the GC process itself which was becoming slower and slower, and perhaps that's what swamped the machine?

#

(as opposed to regular engine operations being slowed down by a big cache)

trim scaffold Sep 4, 2025, 10:49 PM

#

that's possible yes

#

In your opinion is that the more probable outcome?

mortal vortex Sep 4, 2025, 11:31 PM

#

it's just an educated guess, based on:

- a message in this thread indicating that gc has been known to have performance issues on large caches
- no report that I can remember of the engine itself slowing down with a large cache.

Note that I am not authoritative on this, just an educated guess.
