#It happened once but we shrugged it off
Does it happen to multiple people? Something weird must be going on since I can't repro it by just having v0.18.12 CLI installed, engine running and then doing a go run of:
package main

import (
	"context"
	"fmt"

	"dagger.io/dagger/dag"
)

func main() {
	ctx := context.Background()
	out, err := dag.Container().From("alpine:latest").WithExec([]string{"echo", "Hello, Dagger!"}).Stdout(ctx)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Output:", out)
}
where the go.mod specifies v0.18.11
If you have a cloud trace where it happened there could be something useful there in the early steps where it's connecting to the engine
I hit it once in the past -- haven't bumped yet to 18.2
yeah, i can't repro either.
one thing that would be good to know is what the env looks like when this happens?
actually wait the original error shared here #p-envelope message is interesting
it feels like it's not the sdk provisioning code
@left otter @paper sigil this happened before to one of you I think?
Happened to me
I'm honestly not sure why either, I PRd the upgrade to the dagger deps to container-use
if anyone has a full trace, logs, etc, would be a lot easier to debug 👀
one thought - you wouldn't have upgraded your local dagger cli and also been using the older container-use without the bump?
and been running things at the same time
this is the error/when I reported: #p-envelope message
Going to look in cloud under mccallister-dev org for a trace @azure vortex
ok, I believe it's one of these (sorry I can't be more specific):
https://dagger.cloud/mccallister-dev/traces/69b8db6f68c6428ee5bc5378a71f0182
https://dagger.cloud/mccallister-dev/traces/58c62d4e7e82d2ed5de540bf37b02b0d
(timeout due to inactivity)
So, I've been looking into the "No such container" error recently, and here's a quick update:
The error seems to be coming from BuildKit’s connection helper library when it runs this:
docker exec -i dagger-engine-v0.18.11 buildctl dial-stdio
I couldn't replicate the exact root cause Andrea encountered yesterday, but I did reproduce a similar issue by intentionally running the command against a non-existent container.
Simple Repro Steps
Create a container with a mismatched name
docker run -d --name dagger-engine-2e6d64d8564a25d1 \
--privileged registry.dagger.io/engine:v0.18.11
Attempt to exec using the expected (but incorrect) name
docker exec -i dagger-engine-v0.18.11 buildctl dial-stdio
This gives exactly the error we're seeing:
Error response from daemon: No such container: dagger-engine-v0.18.11
Underlying Flow
- SDK provisions containers using a docker driver docker.go
- SDK returns a connection helper URL with the expected container name docker.go
- Connection helper (from BuildKit library) uses docker exec internally to establish the connection.
If container names don't match, we get the error above
Where GC Logs Go
The logs in the garbage collection function, which could explain what's happening, are sent via OpenTelemetry and don't show up in stderr by default
Current theory
- User has container dagger-engine-v0.18.12 running.
- SDK v0.18.11 expects a different container (dagger-engine-v0.18.11).
- SDK tries (but possibly fails silently) to garbage collect the incorrect container (v0.18.12).
- Something prevents the correct container (v0.18.11) from starting properly.
- Connection helper tries to connect, finds nothing, and throws "No such container."
hm interesting. these kind of look as expected? one thing that's weird is that we don't actually have any way of seeing which version of the engine we actually connected to
i'll work on adding that log in
i don't really understand this, but sure, yeah, the container somehow stops existing or wasn't created in the first place
so not a fix at all, but this should at least help produce some more informative traces: https://github.com/dagger/dagger/pull/10709
Oh, it's just a checkpoint on the understanding -- some, like me at first, might not know 😇
I dug a bit more, there is a potential Time‑of‑check vs. time‑of‑use (TOCTOU) race that i'm trying to reproduce: the docker engine doesn't release the name right away -- if the engine is big, it can take a few secs / minutes for the docker daemon to release the name -- then, we enter into the edge case of "in-use" which silently fails, which could explain everything -- I think 🤔
yeah, but removing the container is done as a gc thing at the very end? it can take a while to release, but we've already selected which engine to use at that point
I guess this is explained a bit if you're running both a new dagger cli v0.18.12 and an older cu binary with v0.18.11
@paper sigil @deep relic could this have been the case for you?
if so, there are some things we could do here - we could "namespace" these container names such that they don't use the same resources at all.
alternatively. we could play a fun game of rebuild the garbage collection to just "work better somehow" 😭
That was the case for Andrea yes. And I think I inadvertently encountered it today -- trying to extract a repro from it; but getting more and more confident that's the issue
mmm I wonder if we should avoid fixing this "obviously" and just work out a better way to do gc
I had an idea, been playing in containerd code a lot recently, I wonder if we could use its concept of "leases"
essentially, when you use an engine you "lease" it for a period of time - it can't be gc-ed while it's under lease
the lease expires some amount of time after you've used it
that would mean that you could have a lease set so that old engines would get cleaned up... eventually
if set to like 24 hours or so
but part of the problem is we'd need to change who runs the GC code. I wonder if we could have engines just "delete themselves"? then each engine garbage collects itself when its leases expire
Like it 💯
maybe need to investigate if this is possible (I guess otherwise we could maybe just have all the containers watch each other????? or have some sort of watchdog container)
I don't like those alternatives though
All of this is kind of related to another tangent issue we have on SDK / Dagger CLI cohabitation -- cu uses the dagger cli to jump into the terminal, and the SDK sometimes is not up-to-date with the latest CLI version installed (which Andrea, Connor or me usually update right away) -- meaning that we encounter cache busts between terminal jumps on the same container.
The current DAGGER_LEAVE_OLD_ENGINE creates a new container so we don't benefit from the warm cache ...
Maybe your "engine lifecycle" could indeed unlock a better foundation / UX around that ?
maybe unless-stopped works. the engine can stop itself when it's run out of work to do. and we could gc sweep just stopped containers. will have a poke around tomorrow.
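For the "gc sweep just stopped containers" part, something like this is what I have in mind. It's a sketch over a toy `container` type (hypothetical, not the docker API), assuming engines keep the `dagger-engine-` name prefix and stop themselves when idle.

```go
package main

import (
	"fmt"
	"strings"
)

// container is a toy view of one entry in a container listing.
type container struct {
	name    string
	running bool
}

// sweepCandidates returns engine containers that have already stopped.
// Running engines and unrelated containers are never touched.
func sweepCandidates(all []container) []string {
	var out []string
	for _, c := range all {
		if strings.HasPrefix(c.name, "dagger-engine-") && !c.running {
			out = append(out, c.name)
		}
	}
	return out
}

func main() {
	all := []container{
		{"dagger-engine-v0.18.11", false}, // stopped itself, safe to sweep
		{"dagger-engine-v0.18.12", true},  // still in use, left alone
		{"postgres", false},               // not an engine, left alone
	}
	fmt.Println(sweepCandidates(all))
}
```

The nice property is that the sweeper never has to decide whether a running engine is "old", which is where the current GC seems to go wrong.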
yeah exactly! would be cool to support more dagger versions coexisting at the same time
cc @deep relic -- just to make sure you see it: #1392339595009458227 message ⏫
we can't support cache shared between multiple versions (maybe one day)
My aim is to isolate a good repro, then I'll fire an issue with the findings, Justin, would that work ? Happy to explore your idea(s) too, I don't know how busy you are on other stuff -- and how relevant I am in that context ahah :-p (at least i can do the repro ahahah)
that would be perfect - I'm not reallllllly still understanding of the root cause of the issue, but having better gc behavior does feel like part of the problem
I'll have a poke around with docker tomorrow and see if I can work out what we would need to do - I can write up an issue proposal as well
this was the case for me,
there's already an issue here 😄 https://github.com/dagger/dagger/issues/3849
trying to gauge vibes on just... removing the auto-gc step - make it a breaking change, sorry, now you need to do docker system prune occasionally. maybe work a bit with upstream, to see if we can get some better mechanisms in place for long term.
cc @trim bolt? i know DAGGER_LEAVE_OLD_ENGINE is kind of a workaround for a lot of this