#Using cache volumes from multiple sessions seems to break dagger


wary monolith
#

Posting details

#

I run multiple builds on 3 nodes.
Each node has N-1 "executors" (N=cpu count)

On each executor, Jenkins starts a new go run main.go

In that main.go, I connect the client with dagger.Connect()

When I mount these caches:

    caches := map[string]*dagger.CacheVolume{
        consts.NpmCacheDir:       client.CacheVolume(consts.GlobalNpmCacheKey),
        consts.ElectronCacheDir:  client.CacheVolume(consts.GlobalElectronCacheKey),
        consts.PreCommitCacheDir: client.CacheVolume(consts.GlobalPreCommitCacheKey),
        consts.NextImageCacheDir: client.CacheVolume(consts.GlobalNextImageCacheKey),
    }

    // Attach caches to the container
    cacheOpts := dagger.ContainerWithMountedCacheOpts{Sharing: dagger.Shared}
    for k, v := range caches {
        c = c.WithMountedCache(k, v, cacheOpts)
    }

I quite frequently get:

failed to run build: returned error 502 Bad Gateway: http do: Post "http://dagger/query": command [docker exec -i dagger-engine-v0.13.5 buildctl dial-stdio] has exited with exit status 1, make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=context canceled

Once I comment the above out, it works fine.

#

I even tried setting each node to 2 executors with the caches mounted.
Same result, though less frequently.

#

An alternative I thought of:
Set the Jenkins executors to 1 on each node and do the loop on the Go side.

The issue there is that Jenkins (I think) doesn't know how many CPUs each node has, so I can't just send X jobs to node Y, as I don't know the numbers.
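For the "loop on the Go side" idea, here is a minimal sketch using only the standard library. It assumes the Go process itself can ask the node for its CPU count via runtime.NumCPU(), so Jenkins would not need to know it. The names workerCount and runAll and the build names are hypothetical, and a real job would run the Dagger pipeline instead of printing.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// workerCount mirrors the "N-1 executors" rule from this thread,
// never going below one worker.
func workerCount(cpus int) int {
	if cpus <= 1 {
		return 1
	}
	return cpus - 1
}

// runAll runs every job with at most `workers` jobs in flight at once
// and returns one error slot per job, in input order.
func runAll(jobs []func() error, workers int) []error {
	if workers < 1 {
		workers = 1
	}
	errs := make([]error, len(jobs))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` jobs are running
		go func(i int, job func() error) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the job ends
			errs[i] = job()
		}(i, job)
	}
	wg.Wait()
	return errs
}

func main() {
	// Hypothetical build names, for illustration only.
	names := []string{"app-a", "app-b", "app-c"}
	jobs := make([]func() error, len(names))
	for i, name := range names {
		name := name
		jobs[i] = func() error {
			fmt.Println("building", name)
			return nil
		}
	}
	workers := workerCount(runtime.NumCPU())
	for i, err := range runAll(jobs, workers) {
		if err != nil {
			fmt.Println(names[i], "failed:", err)
		}
	}
}
```

With this shape, Jenkins would only need one executor per node, and the concurrency limit would live next to the code that knows the machine size.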

wary monolith
#

Using dagger.Locked seems to solve the issue as well

fast patrol
#

@wary monolith I assume you have 1 dagger engine per node, correct?

wary monolith
#

Correct

fast patrol
#

And the nodes seem to be stateful also, right?

wary monolith
#

Normal ubuntu VMs under XCP-ng.
Stateful yes

fast patrol
#

Could you try starting the engine in a TCP port and telling the client to use TCP instead of the standard connection mode?

#

I have an assumption that having many workers over stdio might be causing issues

wary monolith
#

Hmm, so docker run ...
and set the EXPERIMENTAL_blabla?

Do you happen to have the default docker run command?

fast patrol
wary monolith
#

Cool. I'll do that tomorrow morning; it's a bit late here and I've been fighting all day to find out what's going on 😄

But I'd say that when I used 1 node (now 3) with 14 executors (16 threads) and Shared, it worked fine for a few days. Though those were mostly test runs, so...
I think in the meantime, while I was also provisioning nodes, I updated to 0.13.4 and 0.13.5 (from 0.13.3).

So there's too much mixed up here to draw a conclusion.

#

Oh well, it's probably not the cache volumes.
I just got the same error with Locked on one of the runs. Locked was probably just delaying things, so the engine wasn't stressed as much.

Anyway, thanks for the suggestion!
Bedtime, will test tomorrow.

fair nexus
#

It may not be much, but I was hitting that error yesterday and it turned out my engine was panicking (but only because I had made some incorrect changes to the engine during dev). Could you please look at the engine logs and check whether there are any errors there?

wary monolith
#

I do see a lot of this:

time="2024-10-20T16:36:39Z" level=error msg="namespace worker failed" error="cleanup container" span="[internal] exec npm run metatags" spanID=3fdae9248848fddd traceID=273451de2b4a87638e9b45a3f1596906

A few of these as well

time="2024-10-20T17:37:35Z" level=error msg="failed to GC client DBs" error="map[error:readdir /var/lib/dagger/worker/clientdbs: open /var/lib/dagger/worker/clientdbs: no such file or directory kind:*fmt.wrapError stack:<nil>]"

Now I'm trying to make it "break" again so I can switch to TCP and test.

wary monolith
#

So using Shared caused it to fail pretty quickly once I busted the cache with CACHE_BUST=time.Now()

Setting up dagger with

    jenkins@jenkins2:~$ export _EXPERIMENTAL_DAGGER_RUNNER_HOST=tcp://127.0.0.1:8080
    jenkins@jenkins2:~$ docker run --name dagger-eng --privileged -p 8080:8080 -v dagger-engine:/var/lib/dagger registry.dagger.io/engine:v0.13.5 --addr tcp://0.0.0.0:8080

wary monolith
#

Well, TCP seems to be a lot better. It actually reaches the build stage now (where the build then fails due to OOM :P).
But with stdio I wasn't even getting past container creation.
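For anyone repeating the TCP test, a small pre-flight check can confirm the engine port is actually open before a build starts. This is only a sketch under assumptions: it reads the same _EXPERIMENTAL_DAGGER_RUNNER_HOST variable exported above, falls back to the tcp://127.0.0.1:8080 address used in this thread, and only tests plain TCP reachability, not engine health. tcpAddr and engineReachable are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
	"time"
)

// Environment variable used in this thread to point the client at the engine.
const runnerHostEnv = "_EXPERIMENTAL_DAGGER_RUNNER_HOST"

// tcpAddr extracts "host:port" from a tcp:// runner host value.
func tcpAddr(runnerHost string) (string, error) {
	u, err := url.Parse(runnerHost)
	if err != nil {
		return "", err
	}
	if u.Scheme != "tcp" || u.Host == "" {
		return "", fmt.Errorf("not a tcp:// runner host: %q", runnerHost)
	}
	return u.Host, nil
}

// engineReachable reports whether a TCP connection to addr can be
// opened within the timeout. It does not verify the engine is healthy.
func engineReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	host := os.Getenv(runnerHostEnv)
	if host == "" {
		// Fallback taken from the commands in this thread.
		host = "tcp://127.0.0.1:8080"
	}
	addr, err := tcpAddr(host)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !engineReachable(addr, 2*time.Second) {
		fmt.Fprintln(os.Stderr, "engine not reachable at", addr)
		os.Exit(1)
	}
	fmt.Println("engine reachable at", addr)
}
```

Running it at the start of the Jenkins job would make a dead or unstarted engine fail fast with a clear message instead of surfacing later as a connection error.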

#

After adjusting the executors a bit, builds now finish with dagger.Shared and CACHE_BUST.

#

Yep, 3 runs, no 502 now.
After my walk I'll still try to capture some logs when the 502 happens after reverting to stdio.

wary monolith
#

Okay, I think it is mostly a false alarm.
The one node that was running out of memory on builds has the most executors (14, equal to its vCPU count) and 24 GB of RAM.

The other 2 had far fewer executors, so their runs usually finished faster than the runs scheduled on the bigger node.
Jenkins does not make it easy to see what runs on which node, which made it hard to figure out what was going on.
I was seeing the majority of tasks fail and only very few complete, but it never occurred to me that all the failures came from the same node.

While digging in as I set up dagger to use TCP, I did see the OOMs (which weren't obvious before).

I lowered the executors to 12 and reverted to stdio, and a few runs with the cache busted and dagger.Shared were handled just fine.

TLDR: it was OOM that was killing dagger (among other things).
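Since the root cause was memory rather than CPU, one way to pick an executor count is to cap it by both. This is a rough sketch: the 14 vCPUs and 24 GB come from this thread, but the 3 GiB per build is a made-up placeholder that would have to be measured, and maxExecutors is a hypothetical helper.

```go
package main

import "fmt"

// maxExecutors caps the executor count by CPU and by memory.
// cpus-1 follows the "N-1" rule from this thread. perBuildGiB is an
// ASSUMED peak memory per build; measure it before relying on this.
func maxExecutors(cpus int, ramGiB, perBuildGiB float64) int {
	byCPU := cpus - 1
	if byCPU < 1 {
		byCPU = 1
	}
	if perBuildGiB <= 0 {
		return byCPU // no memory estimate available, fall back to CPU only
	}
	byMem := int(ramGiB / perBuildGiB)
	if byMem < 1 {
		byMem = 1
	}
	if byMem < byCPU {
		return byMem
	}
	return byCPU
}

func main() {
	// 14 vCPUs and 24 GiB are from the thread; 3 GiB per build is a placeholder.
	fmt.Println("executors:", maxExecutors(14, 24, 3))
}
```

The point is only that the smaller of the two limits wins, so a node with many cores but modest RAM ends up with fewer executors than its CPU count suggests.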

fast patrol