#Cache filling disk


mortal vortex
#

Hi Chris, normally there are safeguards in place, perhaps you have hit an edge case? Sorry about that.

#

@fervent pasture @urban saddle @outer sonnet will have more context

trim scaffold
#

All good

#

I don't believe the cache should really even be 1TB in size.. I would have thought that many cache items would be deduplicated / cleaned up.

#

I understand dagger has a GC which should deal with this

#

We currently don't have metrics set up to capture the cache size, so unfortunately we can't tell how this happened over time
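A minimal sketch of the kind of sampling that would fill this gap, assuming it is acceptable to measure the whole filesystem that holds the cache rather than the cache directory itself. The function name `disk_usage_gb` and the `"/"` path are placeholders for illustration, not anything provided by dagger; point it at whatever mount holds the engine state on your node.

```python
import shutil


def disk_usage_gb(path: str) -> dict:
    """Return total/used/free space (decimal GB) for the filesystem containing `path`.

    Note: this reports usage of the whole filesystem, not of a single directory.
    """
    usage = shutil.disk_usage(path)
    gb = 1000 ** 3
    return {
        "total_gb": usage.total / gb,
        "used_gb": usage.used / gb,
        "free_gb": usage.free / gb,
        "used_pct": 100 * usage.used / usage.total,
    }


if __name__ == "__main__":
    # Hypothetical path: replace "/" with the mount that holds the engine cache.
    sample = disk_usage_gb("/")
    print(f"{sample['used_gb']:.1f} GB used ({sample['used_pct']:.1f}%)")
```

Run on a schedule and shipped to whatever metrics system is already in place, a series like this would show whether the cache grows steadily or jumps.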

mortal vortex
#

yes the GC should have caught this

trim scaffold
#

But the culprit could be something in our node setup prohibiting the GC from doing its job

#

We'll keep looking and let you know if we find something interesting

mortal vortex
#

it might be worth searching the discord archives

trim scaffold
#

okay

fervent pasture
# trim scaffold okay

by any chance do you have a custom engine.toml or engine.json GC setting?

the default buildkit GC pruning trigger has always been a bit obscure to me, and AFAIK there are barely any logs to understand what's happening there. Since I know @outer sonnet has been touching some things around GC recently-ish, I'm wondering if there are some logs we could include here to understand when, and with what policy, buildkit runs GC

#

@trim scaffold does manually running dagger core engine local-cache prune release anything?

trim scaffold
#

We ended up nuking the node because running the above command was slow. We waited over an hour. The cache was coming down but it was very slow.

#

For reference we are using i7i.2xlarge

#

Should have plenty of disk iops

fervent pasture
trim scaffold
#

just asked the team. no we don't have any in place

#

all good. it could quite possibly be something on our end. We do have sysdig running on the node, which may have done something weird

trim scaffold
#

Just a question on how the cache works..

  1. If, say, I have 3 parallel processes which execute the same container steps, and there is nothing in cache, I presume none of the 3 can take advantage of the cache.

  2. Once the results of all 3 are saved to cache, will some process deduplicate them so that only 1 is kept on disk? Is this what the GC does?

#

Just interested in the cache internals

#

Just took a look at the storage metrics for our node over 90 days

#

The storage slowly climbs up

mortal vortex
#

dedup happens at the operation level, not the content level
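A toy model of what "operation level, not content level" could mean, written as a sketch and not taken from the dagger or buildkit source. `op_key` and `OpCache` are hypothetical names. The cache key is a hash of the operation definition, so repeating the same operation reuses one entry, while a different operation that happens to produce identical output still gets its own entry. The model is sequential, so it says nothing about what happens when three identical steps start at the same moment on a cold cache.

```python
import hashlib
import json


def op_key(op: dict) -> str:
    """Key derived from the operation definition, not from its output."""
    return hashlib.sha256(json.dumps(op, sort_keys=True).encode()).hexdigest()


class OpCache:
    """Toy cache: one stored result per distinct operation definition."""

    def __init__(self):
        self.entries = {}  # op key -> stored result

    def run(self, op: dict, execute):
        key = op_key(op)
        if key not in self.entries:
            # Cache miss: execute the operation and store its result.
            self.entries[key] = execute(op)
        return self.entries[key]


cache = OpCache()
build = {"image": "alpine", "exec": ["echo", "hello"]}
other = {"image": "alpine", "exec": ["cat", "greeting.txt"]}

# Same operation three times: one cache entry.
for _ in range(3):
    cache.run(build, lambda op: "hello\n")

# Different operation with identical output: a second entry, no content dedup.
cache.run(other, lambda op: "hello\n")

print(len(cache.entries))  # 2
```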

trim scaffold
#

thanks Solomon

trim scaffold
#

So when we checked the cache value for maxUsedSpace we found it was very high: ~1800GB

#

I believe this is a result of the default cache policy, which uses 75% of the available disk.

#

Because we have a very large node ( mainly for higher IOPS ) it also comes with a very large SSD disk.

#

Most of which we don't need probably

#

My guess is that at a certain size of cache, running a GC doesn't perform very well anymore

#

When we checked the size of the cache, it was only a little higher than the maxUsedSpace value. I would guess the GC attempted to run but couldn't succeed.

#

We are going to try tweaking the maxUsedSpace parameter to a lower value and see if it improves things

trim scaffold
#

Is there a metric for observing "cache misses" from the dagger engine?

#

Might help to find the right value for maxUsedSpace going forward

fervent pasture
trim scaffold
#

Okay thanks I'll take a look at that

trim scaffold
#

Things seem to be stable since we set maxUsedSpace last night. The moral of the story is that we needed to tweak these parameters given how much disk is available on our machines

fervent pasture
outer sonnet
#

i think it's still less than the available size

#

but it's that the disk is huge

trim scaffold
#

Yes, it was less than the available disk, which was 2.5TB, but it was still huge

#

I made an error in my initial statement... the cache wasn't filling the disk, it was actually around the correct size based on the GC defaults.
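The arithmetic behind that correction, as a quick check using only figures quoted in this thread: a 2.5TB disk and a default policy of 75% of the available disk. The function name is a placeholder, and decimal units (1TB = 1000GB) are assumed.

```python
DISK_GB = 2500           # 2.5TB disk, as reported in the thread (decimal units assumed)
DEFAULT_FRACTION = 0.75  # default cache policy, as described in the thread


def default_cache_ceiling_gb(disk_gb: float, fraction: float = DEFAULT_FRACTION) -> float:
    """Cache size ceiling implied by a 'fraction of disk' policy."""
    return disk_gb * fraction


ceiling = default_cache_ceiling_gb(DISK_GB)
print(ceiling)  # 1875.0
```

That lands close to the roughly 1800GB maxUsedSpace value observed on the node, which fits the conclusion that the cache was sized by the defaults and was not running away.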

#

GC cycles were probably running in fact

#

I think however that at a certain size the cache performance degrades somehow

#

We set a new maxUsedSpace of 90GB and performance is better.
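A sketch of generating such a setting for engine.json. The thread only names the file and the maxUsedSpace parameter; the nesting under a top-level "gc" object and the "90GB" string format are assumptions here, so check them against the engine configuration docs for your version before using this.

```python
import json


def engine_config(max_used_space: str) -> str:
    """Render an engine.json body that caps the cache size.

    Assumption: GC settings live under a top-level "gc" object.
    """
    config = {"gc": {"maxUsedSpace": max_used_space}}
    return json.dumps(config, indent=2)


print(engine_config("90GB"))
```

Keeping the value in one place like this makes it easier to raise it step by step while watching performance.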

#

We are going to up that and keep an eye on performance 🙂

mortal vortex
#

Is it possible that it was the GC process itself which was becoming slower and slower, and perhaps that's what swamped the machine?

#

(as opposed to regular engine operations being slowed down by a big cache)

trim scaffold
#

that's possible yes

#

In your opinion is that the more probable outcome?

mortal vortex
# trim scaffold In your opinion is that the more probable outcome?

it's just an educated guess, based on:

  1. a message in this thread indicating that gc has been known to have performance issues on large caches

  2. no report that I can remember of the engine itself slowing down with a large cache.

Note that I am not authoritative on this, just an educated guess.