#It happened once but we shrugged it off


kind roost
#

Does it happen to multiple people? Something weird must be going on, since I can't repro it by just having the v0.18.12 CLI installed, the engine running, and then doing a go run of:

package main

import (
    "context"
    "fmt"

    "dagger.io/dagger/dag"
)

func main() {
    ctx := context.Background()

    out, err := dag.Container().From("alpine:latest").WithExec([]string{"echo", "Hello, Dagger!"}).Stdout(ctx)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Println("Output:", out)
}

where the go.mod specifies v0.18.11

#

If you have a cloud trace where it happened there could be something useful there in the early steps where it's connecting to the engine

woven leaf
azure vortex
#

yeah, i can't repro either.

one thing that would be good to know is what the env looks like when this happens?

#

actually wait the original error shared here #p-envelope message is interesting
it feels like it's not the sdk provisioning code

deep relic
#

@left otter @paper sigil this happened before to one of you I think?

paper sigil
#

Happened to me

#

I'm honestly not sure why either, I PRd the upgrade to the dagger deps to container-use

azure vortex
#

if anyone has a full trace, logs, etc, would be a lot easier to debug 👀

#

one thought - you wouldn't have upgraded your local dagger cli and also been using the older container-use without the bump?

#

and been running things at the same time

paper sigil
#

Going to look in cloud under mccallister-dev org for a trace @azure vortex

paper sigil
woven leaf
#

So, I've been looking into the "No such container" error recently, and here's a quick update:

The error seems to be coming from BuildKit’s connection helper library when it runs this:

docker exec -i dagger-engine-v0.18.11 buildctl dial-stdio

I couldn't replicate the exact root cause Andrea encountered yesterday, but I did reproduce a similar issue by intentionally running the command against a non-existent container.

Simple Repro Steps
Create a container with a mismatched name

docker run -d --name dagger-engine-2e6d64d8564a25d1 \
  --privileged registry.dagger.io/engine:v0.18.11

Attempt to exec using the expected (but incorrect) name

docker exec -i dagger-engine-v0.18.11 buildctl dial-stdio

This gives exactly the error we're seeing:

Error response from daemon: No such container: dagger-engine-v0.18.11

Underlying Flow

  • SDK provisions containers using a docker driver (docker.go)
  • SDK returns a connection helper URL with the expected container name (docker.go)
  • Connection helper (from the BuildKit library) uses docker exec internally to establish the connection

If container names don't match, we get the error above
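
A minimal sketch of that flow, assuming the connection helper address uses the docker-container:// scheme and that the container name is simply dagger-engine-<version>. The helper names (engineContainerName, connHelperURL, dialArgs) are hypothetical and are not the SDK's or BuildKit's actual functions. The point is only that the name is derived from the version rather than looked up.

```go
package main

import (
	"fmt"
	"net/url"
)

// engineContainerName mirrors the naming seen in this thread
// (dagger-engine-v0.18.11). Hypothetical helper.
func engineContainerName(version string) string {
	return "dagger-engine-" + version
}

// connHelperURL builds the address handed to the connection helper.
// The docker-container:// scheme is an assumption of this sketch.
func connHelperURL(version string) string {
	return "docker-container://" + engineContainerName(version)
}

// dialArgs returns the docker arguments that would be run for addr,
// matching the `docker exec -i <name> buildctl dial-stdio` command above.
func dialArgs(addr string) ([]string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "docker-container" || u.Host == "" {
		return nil, fmt.Errorf("unsupported address %q", addr)
	}
	return []string{"exec", "-i", u.Host, "buildctl", "dial-stdio"}, nil
}

func main() {
	args, err := dialArgs(connHelperURL("v0.18.11"))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("docker", args)
}
```

Nothing in this path checks that the container exists, so a name mismatch only surfaces when docker exec runs.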

Where GC Logs Go

The logs in the garbage collection function, which could explain what's happening, are sent via OpenTelemetry and don't show up in stderr by default

Current theory

  • User has container dagger-engine-v0.18.12 running.
  • SDK v0.18.11 expects a different container (dagger-engine-v0.18.11).
  • SDK tries (but possibly fails silently) to garbage collect the incorrect container (v0.18.12).
  • Something prevents the correct container (v0.18.11) from starting properly.

Connection helper tries to connect, finds nothing, and throws "No such container."
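
To check this theory on an affected machine, something like the sketch below could compare the container the SDK expects with the engine containers that actually exist. It assumes the names were collected beforehand (for example from docker ps); the diagnose helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const enginePrefix = "dagger-engine-"

// diagnose reports the container name an SDK of the given version would
// expect, whether it is present, and which other engine containers exist.
func diagnose(sdkVersion string, running []string) (expected string, found bool, others []string) {
	expected = enginePrefix + sdkVersion
	for _, name := range running {
		switch {
		case name == expected:
			found = true
		case strings.HasPrefix(name, enginePrefix):
			others = append(others, name)
		}
	}
	return expected, found, others
}

func main() {
	// The situation from the theory: only the v0.18.12 engine is running.
	running := []string{"dagger-engine-v0.18.12"}
	expected, found, others := diagnose("v0.18.11", running)
	if !found {
		fmt.Printf("No such container: %s (other engines present: %v)\n", expected, others)
	}
}
```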

azure vortex
#

i'll work on adding that log in

azure vortex
azure vortex
woven leaf
# azure vortex i don't really understand this, but sure, yeah, the container somehow stops exis...

Oh, it's just a checkpoint on the understanding -- some, like me at first, might not know 😇

I dug a bit more. There's a potential time-of-check vs. time-of-use (TOCTOU) race that I'm trying to reproduce: the docker daemon doesn't release the container name right away. If the engine is big, it can take anywhere from a few seconds to a few minutes. During that window we hit the "in-use" edge case, which fails silently, and that could explain everything -- I think 🤔
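
A rough sketch of one way to surface that edge case instead of swallowing it: retry the create while the name is still held, and return an error if it never frees up. This is not what the SDK does today. errNameInUse and startWithRetry are hypothetical, and the create function is a stand-in for whatever actually starts the engine container.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNameInUse stands in for the daemon's name-conflict error. Hypothetical.
var errNameInUse = errors.New("container name already in use")

// startWithRetry calls create until it succeeds, fails with a different
// error, or the name is still taken after the given number of attempts.
func startWithRetry(create func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = create()
		if err == nil || !errors.Is(err, errNameInUse) {
			return err
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("name still in use after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Fake create: the name is released after two conflicts.
	create := func() error {
		calls++
		if calls <= 2 {
			return errNameInUse
		}
		return nil
	}
	err := startWithRetry(create, 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```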

azure vortex
#

yeah, but removing the container is done as a gc thing at the very end? it can take a while to release, but we've already selected which engine to use at that point

#

I guess this is explained a bit if you're running both a new dagger cli v0.18.12 and an older cu binary with v0.18.11
@paper sigil @deep relic could this have been the case for you?

#

if so, there are some things we could do here - we could "namespace" these container names such that they don't use the same resources at all.
alternatively, we could play a fun game of rebuild the garbage collection to just "work better somehow" 😭
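
For the namespacing option, a minimal sketch of what such names could look like, assuming each tool passes its own namespace. The dagger-engine-<namespace>-<version> format and the namespacedEngineName helper are made up for illustration and are not an existing convention.

```go
package main

import (
	"errors"
	"fmt"
)

// namespacedEngineName puts the calling tool's namespace into the container
// name so two tools never provision or collect the same container.
func namespacedEngineName(namespace, version string) (string, error) {
	if namespace == "" || version == "" {
		return "", errors.New("namespace and version are required")
	}
	return fmt.Sprintf("dagger-engine-%s-%s", namespace, version), nil
}

func main() {
	// Two tools on the same machine, as in the cli + cu case above.
	for _, ns := range []string{"cli", "cu"} {
		name, err := namespacedEngineName(ns, "v0.18.11")
		if err != nil {
			fmt.Println("Error:", err)
			return
		}
		fmt.Println(name)
	}
}
```

The trade-off is that each namespace gets its own engine, so nothing is shared between them.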

woven leaf
azure vortex
#

mmm I wonder if we should avoid fixing this "obviously" and just work out a better way to do gc

#

I had an idea, been playing in containerd code a lot recently, I wonder if we could use its concept of "leases"

#

essentially, when you use an engine you "lease" it for a period of time - it can't be gc-ed while it's under lease

#

the lease expires some amount of time after you've used it

#

that would mean that you could have a lease set so that old engines would get cleaned up... eventually

#

if set to like 24 hours or so

#

but part of the problem is we'd need to change who runs the GC code. I wonder if we could have engines just "delete themselves"? then each engine garbage collects itself when its leases expire
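
A minimal sketch of the lease idea as described here, not containerd's actual leases API: using an engine renews its lease, and an engine only becomes collectable once the lease has run out. leaseTable and its methods are hypothetical names, and the 24 hour TTL is the figure floated above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// leaseTable tracks, per engine container name, when its lease runs out.
type leaseTable struct {
	ttl     time.Duration
	expires map[string]time.Time
}

func newLeaseTable(ttl time.Duration) *leaseTable {
	return &leaseTable{ttl: ttl, expires: make(map[string]time.Time)}
}

// use renews the lease: the engine cannot be collected until now+ttl.
func (t *leaseTable) use(engine string, now time.Time) {
	t.expires[engine] = now.Add(t.ttl)
}

// collectable lists engines whose lease has expired at the given time.
func (t *leaseTable) collectable(now time.Time) []string {
	var out []string
	for engine, exp := range t.expires {
		if !now.Before(exp) {
			out = append(out, engine)
		}
	}
	sort.Strings(out)
	return out
}

// demo uses two engines twelve hours apart, then asks what can be collected.
func demo() (early, late []string) {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	t := newLeaseTable(24 * time.Hour)
	t.use("dagger-engine-v0.18.11", start)
	t.use("dagger-engine-v0.18.12", start.Add(12*time.Hour))
	early = t.collectable(start.Add(13 * time.Hour)) // both leases still active
	late = t.collectable(start.Add(25 * time.Hour))  // only the older lease has expired
	return early, late
}

func main() {
	early, late := demo()
	fmt.Println("collectable after 13h:", early)
	fmt.Println("collectable after 25h:", late)
}
```

In the "engines delete themselves" variant, each engine would only need to track its own expiry rather than a shared table.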

woven leaf
#

Like it 💯

azure vortex
#

I don't like those alternatives though

woven leaf
#

All of this is kind of related to another tangential issue we have with SDK / Dagger CLI cohabitation -- cu uses the dagger cli to jump into the terminal, and the SDK is sometimes not up to date with the latest CLI version installed (which Andrea, Connor or I usually update right away) -- meaning that we hit cache busts between terminal jumps on the same container.

The current DAGGER_LEAVE_OLD_ENGINE creates a new container so we don't benefit from the warm cache ...

Maybe your "engine lifecycle" could indeed unlock a better foundation / UX around that ?

azure vortex
azure vortex
woven leaf
azure vortex
#

we can't support cache shared between multiple versions (maybe one day)

woven leaf
#

My aim is to isolate a good repro, then I'll fire an issue with the findings, Justin, would that work ? Happy to explore your idea(s) too, I don't know how busy you are on other stuff -- and how relevant I am in that context ahah :-p (at least i can do the repro ahahah)

azure vortex
#

that would be perfect - I still don't reallllllly understand the root cause of the issue, but having better gc behavior does feel like part of the fix

#

I'll have a poke around with docker tomorrow and see if I can work out what we would need to do - I can write up an issue proposal as well

azure vortex
#

trying to gauge vibes on just... removing the auto-gc step - make it a breaking change, sorry, now you need to do docker system prune occasionally. maybe work a bit with upstream, to see if we can get some better mechanisms in place for long term.

#

cc @trim bolt? i know DAGGER_LEAVE_OLD_ENGINE is kind of a workaround for a lot of this