#Cache filling disk
Hi Chris, normally there are safeguards in place, perhaps you have hit an edge case? Sorry about that.
@fervent pasture @urban saddle @outer sonnet will have more context
All good
I don't believe the cache should really even be 1TB in size.. I would have thought that many cache items would be deduplicated / cleaned up.
I understand dagger has a GC which should deal with this
We currently don't have metrics set up to capture the cache size, so unfortunately we can't tell how this happened over time
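For anyone in the same spot without cache-size metrics, here is a minimal sketch of sampling disk usage for the volume that holds the engine state, using only the Python standard library. The function name `sample_disk_usage` is hypothetical. It measures the whole filesystem, not the cache alone, so it is only a rough proxy.

```python
import shutil
import time


def sample_disk_usage(path):
    """Return a snapshot of filesystem usage for the volume holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "timestamp": time.time(),
        "total_gb": usage.total / 1e9,
        "used_gb": usage.used / 1e9,
        "used_pct": 100 * usage.used / usage.total,
    }
```

Calling `sample_disk_usage(...)` on the engine's state directory from a cron job or sidecar, and shipping the result to whatever metrics system is already in place, would be enough to see a slow climb over weeks. Which directory that is depends on your setup.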
yes the GC should have caught this
But the culprit could be something in our node setup prohibiting the GC from doing its job
We'll keep looking and let you know if we find something interesting
it might be worth searching the discord archives
okay
by any chance do you have a custom engine.toml or engine.json GC setting?
the default buildkit GC pruning trigger has always been a bit obscure to me and AFAIK there are barely any logs to understand what's happening there. Since I know @outer sonnet has been recently-ish touching some things around GC, I'm wondering if there are some logs we could include here to understand when and what policy buildkit is using to run GC
@trim scaffold does manually running dagger core engine local-cache prune release anything?
We ended up nuking the node because running the above command was slow. We waited over an hour. The cache was coming down but it was very slow.
For reference we are using i7i.2xlarge
Should have plenty of disk iops
thx and sorry for the slowness. Re-surfacing my question above: do you happen to have any custom config for the GC policy?
just asked the team. no we don't have any in place
all good. it could be something on our end quite possibly. We do have sysdig running on the node. may have done something weird
Just a question on how the cache works..
- if, say, I have 3 parallel processes which execute the same container steps, and there is nothing in cache, I presume all 3 cannot take advantage of cache.
- Now the results of all 3 will be saved to cache, will a process deduplicate the 3 results so that only 1 is saved on disk? Is this the GC?
Just interested in the cache internals
Just took a look at the storage metrics for our node over 90 days
The storage slowly climbs up
- normally yes
- yes it will dedup. That's distinct from the gc I believe
dedup at the operation level not content level
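To illustrate the "operation level, not content level" distinction, here is a toy sketch. It is not how the engine is implemented, and `operation_key` and `OperationCache` are hypothetical names. The cache is keyed by a digest of the operation definition, so three identical operations share one entry, while two different operations that happen to produce identical bytes still get separate entries.

```python
import hashlib
import json


def operation_key(op):
    """Digest of the operation definition (what to run, on which inputs)."""
    encoded = json.dumps(op, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


class OperationCache:
    def __init__(self):
        self.entries = {}

    def store(self, op, output):
        # Keyed by the operation, not by the bytes it produced.
        self.entries.setdefault(operation_key(op), output)


cache = OperationCache()
build = {"image": "alpine", "exec": ["make", "build"]}
for _ in range(3):
    cache.store(build, b"artifact")  # same operation three times: one entry

# A different operation with byte-identical output still gets its own entry.
cache.store({"image": "alpine", "exec": ["make", "all"]}, b"artifact")

entry_count = len(cache.entries)
print(entry_count)  # 2
```

A content-level scheme would instead hash `output` and end up with a single entry here.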
thanks Solomon
So when we checked the cache value for maxUsedSpace we found it was very high: ~1800GB
Believe this is a result of the default cache policy which uses 75% of the available disk.
Because we have a very large node ( mainly for higher IOPS ) it also comes with a very large SSD disk.
Most of which we don't need probably
My guess is that at a certain size of cache, running a GC doesn't perform very well anymore
When we checked the size of the cache, it was only a little higher than the maxUsedSpace value. I would guess the GC attempted to run but couldn't succeed.
We are going to try tweaking the maxUsedSpace parameter to a lower value and see if it improves things
Is there a metric for observing "cache misses" from the dagger engine?
Might help to find the right value for maxUsedSpace going forward
if GC is taking too much time, we added an option called sweepSize (https://docs.dagger.io/reference/configuration/engine#garbage-collection) which helps by defining the max space to be collected in a single pass of the GC.
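A rough way to picture what a per-pass cap does, as a toy model only: the eviction order (least recently used first) and the function `gc_pass` are assumptions made for illustration, not the engine's actual algorithm. Each pass stops once usage is back under the limit or once it has freed the sweep size, so a large backlog is worked off over several short passes instead of one long one.

```python
def gc_pass(entries, max_used, sweep_size):
    """Evict least-recently-used entries; return (kept, freed).

    entries: list of (name, size, last_used) tuples.
    Stops evicting when usage is at or under max_used, or when this
    pass has already freed at least sweep_size.
    """
    used = sum(size for _, size, _ in entries)
    freed = 0
    kept = []
    for name, size, last_used in sorted(entries, key=lambda e: e[2]):
        if used - freed > max_used and freed < sweep_size:
            freed += size
        else:
            kept.append((name, size, last_used))
    return kept, freed


# Ten entries of 100 units each; limit 400, at most 200 freed per pass.
entries = [(f"layer-{i}", 100, i) for i in range(10)]
passes = 0
while True:
    entries, freed = gc_pass(entries, max_used=400, sweep_size=200)
    if freed == 0:
        break
    passes += 1

print(passes)                       # 3
print([name for name, _, _ in entries])
```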
Okay thanks I'll take a look at that
Things seem to be stable since we set maxUsedSpace last night. The moral of the story is that we needed to tweak these parameters given the huge available disk that there is on our machines
hmm that's a bit odd given that it should have set maxUsedSpace automatically to the total amount of the disk size. We'll take this one to see what might be happening
Yes it was less than the available disk which was 2.5TB, but was huge
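As a quick sanity check on the numbers in this thread, assuming decimal units and that the 75% default mentioned earlier applies to the full 2.5TB disk:

```python
DISK_TOTAL_GB = 2500      # "2.5TB" from the thread, decimal units assumed
DEFAULT_FRACTION = 0.75   # the 75% default mentioned in the thread

default_max_used_gb = DISK_TOTAL_GB * DEFAULT_FRACTION
print(default_max_used_gb)  # 1875.0
```

That lands close to the roughly 1800GB value reported for maxUsedSpace, which fits the reading that the cache was near its configured limit instead of growing without bound.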
I made an error in my initial statement... the cache wasn't filling the disk, it was actually around the correct size based on the GC defaults.
GC cycles were probably running in fact
I think however that at a certain size the cache performance degrades somehow
We set a new maxUsedSpace of 90GB and performance is better.
We are going to up that and keep an eye on performance
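For reference, a small sketch of rendering such an override as JSON for engine.json. The parameter names maxUsedSpace and sweepSize come from this thread. Nesting them under a top-level `gc` object is an assumption on my part that should be checked against the configuration page linked above, and `render_engine_config` is a hypothetical helper.

```python
import json


def render_engine_config(max_used_space, sweep_size=None):
    """Render a GC override as JSON text.

    Assumption: GC settings live under a top-level "gc" object.
    """
    gc = {"maxUsedSpace": max_used_space}
    if sweep_size is not None:
        gc["sweepSize"] = sweep_size
    return json.dumps({"gc": gc}, indent=2)


config_text = render_engine_config("90GB")
print(config_text)
```

Generating the file this way makes it easy to raise the value step by step while watching performance.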
Is it possible that it was the GC process itself which was becoming slower and slower, and perhaps that's what swamped the machine?
(as opposed to regular engine operations being slowed down by a big cache)
it's just an educated guess, based on:
- a message in this thread indicating that gc has been known to have performance issues on large caches
- no report that I can remember of the engine itself slowing down with a large cache.
Note that I am not authoritative on this, just an educated guess.