# Are you referring to Dagger's own cache?
Yeah, /var/lib/dagger.
We have a number of GPU nodes and would want to make sure to scale those to zero when there aren't workflows running, which would mean that each morning we'd see longer run times until the caches were rebuilt.
Makes sense! Are these nodes that scale to zero removed and new nodes are provisioned, or are you shutting down and turning on the same set of nodes?
They would be removed and new nodes would be provisioned, since this is handled by Karpenter.
There is ongoing debate/work around the idea of "Storage Drivers" (https://github.com/dagger/dagger/issues/5583). At the moment, Dagger's cache is designed to be local to a node and accessed by a single engine at the same time. With the idea of storage drivers we could introduce a native way for engines to sync their state to different locations in a way that guarantees no corruption.
In a dynamic approach, such as the one explained in the blog post, where you can have any number of nodes spawn up and down, building a setup that persists the caches for later re-use in a way that guarantees no corruption and concurrent access is a bit messy. You could have a lifecycle hook that stores the contents of /var/lib/dagger to some object storage and then restore it locally on boot, but since you have multiple runners and no coordination between them you could very easily corrupt the cache and then have to do constant cleanups, rendering the use of the cache kind of pointless.
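For reference, the save-on-shutdown / restore-on-boot hook mentioned above could look roughly like the sketch below. It only covers the local pack and unpack steps. The function names `pack_cache` and `restore_cache` are hypothetical, the upload to and download from object storage is left out, and it assumes the engine is stopped and nothing else writes to the directory while it runs. Whether a plain tarball preserves everything the engine needs under /var/lib/dagger has not been tested here, and this does nothing to coordinate multiple runners.

```python
import os
import tarfile
import tempfile


def pack_cache(cache_dir: str, archive_path: str) -> str:
    """Archive cache_dir into a gzip tarball, written atomically."""
    # Write to a temp file next to the target and rename it afterwards,
    # so a reader never picks up a half-written archive.
    target_dir = os.path.dirname(os.path.abspath(archive_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".partial")
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w:gz") as tar:
            # arcname="." stores paths relative to the cache root.
            tar.add(cache_dir, arcname=".")
        os.replace(tmp_path, archive_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return archive_path


def restore_cache(archive_path: str, cache_dir: str) -> bool:
    """Unpack the archive into cache_dir; return False if no archive exists."""
    if not os.path.exists(archive_path):
        return False  # cold start: nothing to restore
    os.makedirs(cache_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(cache_dir)
    return True
```

A shutdown hook would call `pack_cache("/var/lib/dagger", ...)` and then upload the archive; a boot hook would download it and call `restore_cache` before the engine starts. If several nodes upload to the same object key, the last writer wins, which is the coordination problem described above.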
We have an experimental service in Dagger Cloud that provides a distributed cache (hosted in our cloud), but I don't know much about its current state or whether we recommend that users try it out (adding @cerulean summit since he knows this stuff!)
Are you currently running this setup and seeing really big execution times due to this?
> building a setup that persists the caches for later re-use in a way that guarantees no corruption and concurrent access is a bit messy.
Yeah, this was my concern. But for what it's worth, in this situation I wouldn't need concurrent access to read/write to the cache; it would likely just be concurrent pulls of it from S3 or whatever.
No we aren't running this setup, I'm thinking about where we may want to take our dagger setup in the next 3-6 months and this is something I keep coming back to.
Also thanks for linking that issue, I guess I already subscribed to it and just forgot
I took a short look at Dagger Cloud last week and it's interesting, but we're way too early to benefit from that. I'm keeping an eye on it though 🙂
@urban sphinx we haven't tried this but it should be possible if you use EBS volumes for your dagger cache and then take snapshots before your VM shuts down
it's also important to measure build performance on provisioned IOPS disks, given that Dagger's cache is quite disk intensive, so some trade-offs will have to be made there
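As a rough illustration of the EBS snapshot idea, here is a minimal sketch that has not been tried against a real setup. It assumes `ec2` is a boto3 EC2 client (`boto3.client("ec2")`), and the tag key `dagger-cache`, the `pool` label, and both function names are made up for this example rather than being any Dagger or Karpenter convention. It also ignores pagination and does not wait for the snapshot to complete.

```python
from typing import Any, Optional

# Hypothetical tag key used to find cache snapshots again later.
CACHE_TAG_KEY = "dagger-cache"


def snapshot_cache_volume(ec2: Any, volume_id: str, pool: str) -> str:
    """Start a snapshot of the cache volume and return its snapshot ID."""
    resp = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Dagger cache for pool {pool}",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": CACHE_TAG_KEY, "Value": pool}],
        }],
    )
    return resp["SnapshotId"]


def latest_cache_snapshot(ec2: Any, pool: str) -> Optional[str]:
    """Return the newest completed cache snapshot for a pool, if any."""
    resp = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": f"tag:{CACHE_TAG_KEY}", "Values": [pool]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = resp.get("Snapshots", [])
    if not snapshots:
        return None  # no snapshot yet: the node starts with a cold cache
    return max(snapshots, key=lambda s: s["StartTime"])["SnapshotId"]
```

The engine would need to be stopped before the snapshot is taken, and a new node would create its cache volume from the ID returned by `latest_cache_snapshot`. Volumes restored from a snapshot load their blocks lazily, so the first reads can be slow, which is one more reason to measure build times as suggested above.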