#No longer able to clean dagger cloud caches

1 messages ยท Page 1 of 1 (latest)

next tulip
#

Hi! Several weeks ago I implemented a dagger function that mounts our shared caches, deletes the contents, and gives the engine plenty of time to shut down. This was working quite well, we ran it on saturday nights. Well, the last two saturdays it has ran it did not persist the empty directories back to dagger cloud. Our team is still running on dagger 0.13.1 but even when running our cache clean function after upgrading to 0.13.3 and then 0.13.5, the problem remains. After digging in and pulling engine logs, I see this for each cache we're trying to clean, on engine shutdown: skipped pushing cache mount. Our cache clean logic hasn't changed, the only thing that has changed is dagger engine versions. I'm also seeing a ton of namespace worker failed errors.

This is becoming a big issue, as it is now taking several minutes to pull down caches at startup for every single dagger run in our CI.

I'll share the clean implementation in the thread.

#

Interesting engine logs prior to the skipped pushing cache mount logs:

 time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"golangci-lint-cache\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"lambda-binaries\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"modgobuildcache\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"golangci-lint-build\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"service-tests\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"pulumi-plugin-cache\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"go-mod\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"modgomodcache\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"go-build\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"system-tests\" already committed"
time="2024-10-21T00:37:19Z" level=debug msg="cache mount \"unit-tests\" already committed"
#

Just did a search of old threads, I noticed @forest thorn you mentioned you know what is happening with the skipped pushing cache mount logs: #1278526995885592626 message

forest thorn
next tulip
#

I'm seeing this in dagger source, but i'm not sure how the 'skip' field in the getcachemountuploadurl is determined:

                    if err := contentWriter.Commit(ctx, 0, ""); err != nil {
                        if errors.Is(err, errdefs.ErrAlreadyExists) {
                            // we should be releasing these, but if it was already there, that's weird but fine
                            bklog.G(ctx).Debugf("cache mount %q already committed", cacheMountName)
                        } else {
                            return fmt.Errorf("failed to commit content: %w", err)
                        }
                    }
                    contentDigest := contentWriter.Digest()

                    // now that we have the digest we can upload from the content store to the url
                    contentReaderAt, err := m.Worker.ContentStore().ReaderAt(ctx, ocispecs.Descriptor{
                        Digest: contentDigest,
                    })
                    if err != nil {
                        return fmt.Errorf("failed to create content reader: %w", err)
                    }
                    defer contentReaderAt.Close()
                    contentLength := contentReaderAt.Size()
                    getURLResp, err := m.cacheClient.GetCacheMountUploadURL(ctx, GetCacheMountUploadURLRequest{
                        CacheName: cacheMountName,
                        Digest:    contentDigest,
                        Size:      contentLength,
                    })
                    if err != nil {
                        return fmt.Errorf("failed to get cache mount upload url: %w", err)
                    }

                    if getURLResp.Skip {
                        bklog.G(ctx).Debugf("skipped pushing cache mount %s", cacheMountName)
                        return nil
                    }
#

thanks @forest thorn !

#

much appreciated

next tulip
#

Unfortunately we're seeing this again; I'm able to clean the dagger caches and we're back up to 2+ minute engine start time :/

#

Semi-related, I'm also seeing that cache volumes aren't working as they were before. we build a few 'base' images for our functions, From().WithExec()... to setup dependencies for functions like Build(), Test(), etc. One of them installs something that takes ~45s, and we're seeing this happen from scratch every time. The behavior I observed for some time was that it would be loaded from magicache.

#

We've invested a lot into dagger on our team and we absolutely love it, but the cache struggles for ephemeral runners are a significant burden. At this point, for every run we're paying ~2 minutes to download caches and now 1+ minutes to rebuild base images. With caching on an expected no-op, these functions should take a dozen seconds or so.

forest thorn
forest thorn
forest thorn
#

@next tulip I've found why your wiping stopped working. The reality is that it never completely worked form the start. It's actually an edge-case that we need to support in our caching service which is related to the skipped syncing mount remotely message that you've shared before.

The current temporary "fix" you can do to overcome this edge, is that instead of only doing rm -rf /cache/* to your volume's paths to remove them, please do rm -rf /cache/* && touch /cache/$EPOCH so this forces the empty cache volume to be re-synced again and effectively updating the pointer in our cloud service so on the next sync you get an empty volume.

Does that makes sense?

#

There's an improvement that we need to do from our end which is moving the pointer to the correct volume even if we skip to sync it.

forest thorn
next tulip
#

Thanks or the quick response @forest thorn , much appreciated. I'll give the workaround a shot and report back!

next tulip
#

Thanks for cleaning the caches! I noticed their sizes were as expected after a clean (4KB), but I'm still seeing exactly 1 minute engine starts. I dove into the logs and it looks like the culprit is:

time="2024-10-28T00:24:57Z" level=debug msg="importing cache"
time="2024-10-28T00:24:57Z" level=debug msg="calling import cache"
time="2024-10-28T00:25:57Z" level=debug msg="finished importing cache in 1m0.052667271s"
time="2024-10-28T00:25:57Z" level=error msg="failed to import cache at startup" error="Get \"https://api.dagger.cloud/magicache/import\": context deadline exceeded"

I'm also seeing exactly 3 minute engine shutdowns, even with docker stop -t 3600 and found the culprit in the logs as well:

time="2024-10-28T00:26:51Z" level=debug msg="starting cache export"
...
time="2024-10-28T00:29:52Z" level=debug msg="finished cache export in 3m1.107326634s"
time="2024-10-28T00:29:52Z" level=error msg="failed to export cache" error="unexpected status code: 504: <html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n</body>\r\n</html>\r\n"
time="2024-10-28T00:29:52Z" level=info msg="stopping server"
time="2024-10-28T00:29:52Z" level=debug msg="cleaning up cni pool"
#

Thats the bad news, but I have good news: the workaround you suggested does in fact force the emptying of our cache volumes. Thank you very much ๐Ÿ™Œ

#

Regarding the above issues, they smell like they are part of the layers cache problems we're having. Do you have a mechanism on your side to blow away cached layers @forest thorn ? I'm not sure that doing so would fix anything, but perhaps it'd get our magicache back in working order in the short term.

forest thorn
forest thorn
next tulip
#

Just re-ran the first CI run since you reported the layers were cleaned. Same logs as above; unable to import or export magicache @forest thorn .

forest thorn
forest thorn
#

I've also deleted layers and cache mounts for non us-west-2 regions as well. LMK if you're still seeing errors please ๐Ÿ™

next tulip
#

This is in the CI runners in AWS us-west-2.

#

A teammate had a run ~1 hour ago and the engine logged the same cache import/export failures.

#

A shot in the dark, but i'm going to rebuild our dagger 0.12.6 AMI.

#

Another thing I'll try is a vanilla AMI and let the dagger GHA start the engine, to rule out our AMI.

#

An aside, it would be great (for a higher price no doubt) to have a dagger cloud VPC Endpoint ๐Ÿ˜„

next tulip
next tulip
#

@forest thorn is there a way to run the engine with higher log verbosity? It'd also be nice to enable OTEL on the engine itself, if thats even a thing

#

I don't know if this could be causing the issues, probably not, but worth sharing just in case: I accidentally created a cache volume with an empty string cache key. I've disabled it in dagger cloud, but roughly around the time I accidentally created that, we started having issues. Would it be possible to straight up destroy all of our caches entirely?

forest thorn
#

lol. Let me delete that cache volume really quick

#

@next tulip done, deleted

next tulip
#

No dice @forest thorn

#

thats a run on github default arm64 ubuntu runners, with the dagger GHA starting the engine

forest thorn
#

I'm troubleshooting this now. I gimme a bit ๐Ÿ™

forest thorn
#

@next tulip can you check if it's behaving better now?

next tulip
#

kk

forest thorn
#

just recycled our cache API instances since they hold some in-memory cache which I think could have caused some issues ๐Ÿ™

#

@next tulip the cache volumes seem to be quite heavy.. trying this in a VM in the cloud and it's filling my 40gb disk to start the engine using your volume config ๐Ÿ™

next tulip
#

That looks like it fixed The import/export cache timeouts*

#

Yeah, we've got a variety of shared cache volumes. This is what they look like after a single PR (PR validation job, Merge Queue Job, and Release job):

4.4G    go-build
1.9G    go-mod
6.5G    golangci-lint-build
121.8M    golangci-lint-cache
4.1G    lambda-binaries
4.0K    modgobuildcache
4.0K    modgomodcache
935.4M    pulumi-plugin-cache
3.0G    service-tests
1.6G    system-tests
109.2M    tallctl
19.1M    tools
4.8G    unit-tests
forest thorn
#

๐Ÿ‘

next tulip
#

I'd love to not have so much of it though...

forest thorn
#

it's always a trade-off between re-computing and re-using the cache

#

it's hard to get right. Fortunatenly since Dagger has a lot of metadata about your pipelines we'll be able to do some intelligent caching in the future ๐Ÿ™

next tulip
#

That's great news, very excited about the future!

forest thorn
#

just need to wrap up my current task. ETA today/tomorrow

#

Feel free to ping me if you're having issues Matt. Shuld be behaving now ๐Ÿ™

next tulip
#

Thanks @forest thorn ! Caches are clean and import/exports are behaving now confirmed. I'm still seeing our base images rebuilt on every run though, is this expected? this is the base

dag.Container(dagger.ContainerOpts{
        Platform: "linux/arm64",
    }).
        From("golang:"+version+"-alpine").
        WithExec([]string{"apk", "add", "--no-cache", "git", "curl", "bash", "openssh"}).
        WithEnvVariable("GOOS", "linux").
        WithEnvVariable("GOARCH", "arm64").
        WithEnvVariable("CGO_ENABLED", "0").
        WithEnvVariable("GO111MODULE", "on").
        WithEnvVariable("GOPRIVATE", "github.com/Tallied-Technologies").
        WithLabel("name", "go-base")

It gets rebuilt on every CI ephemeral runner run, but we also have a few images that extend this base with additional WithExec commands and the like. They too are being fully rebuilt. My expectation would be that the layers are cached in dagger cloud.

forest thorn
next tulip
#

the main module span shows the base image being built, and in the function span child spans show the extended base images being built, notably 'Test Service' span

next tulip
#

Oddly I'm also not seeing the module's initialize phase being fully cached. resolving module ref and resolveFromCaller pulls from magicache, but not asModule

#

Could be related, i'm not seeing the dagger module cache volumes change size even after a dozen runs.

4.0K    modgobuildcache
4.0K    modgomodcache
forest thorn
#

@next tulip I'd assume you can run this locally in your machine, right?

next tulip
forest thorn
#

is there a chance you could re-run a CI pipeline and share cloud URL's of both? The first run and the re-run

next tulip
#

yep, on it

forest thorn
#

since the run will be of the same commit, that'll give me more insights to understand what's being invalidated ๐Ÿ™

next tulip
#

I made sure to wait for the engine shutdown from the first run before re-running.

forest thorn
#

is there something that you do in your Github Actions workflows that could modify some files in the dagger call context directory that invalides that?