Quick update: I now have a consistent repro on my side for one important part of this.
During nested service startup, an early dependency failure can surface as "lookup ... for hosts file ... no such host" instead of a clearer dependency-startup error (this happens when the dagger code expects no healthcheck).
So in this class of failures, the DNS message can be a secondary symptom (error propagation), not necessarily the root cause.
In your trace, /tidb-server also exits with code 1 in the same failure window. That makes TiDB startup instability a strong lead, but not yet proven as the primary cause.
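To make the masking described above concrete, here is a toy model in plain Python. It is not engine code: every name in it (start_dependency, resolve, connect_unchecked, connect_checked, both exception classes, the address) is hypothetical. It only assumes that a dependency which exits early never gets its hostname registered, so a caller that ignores the exit status sees a lookup failure instead.

```python
class NoSuchHost(Exception):
    """Stand-in for a DNS lookup failure."""


class DependencyStartupError(Exception):
    """Stand-in for a clear dependency-startup error."""


def start_dependency(name, exit_code, hosts):
    """Pretend to start a service; register its hostname only if it stays up."""
    if exit_code == 0:
        hosts[name] = "10.0.0.2"  # arbitrary illustrative address
    return exit_code


def resolve(name, hosts):
    """Look up a hostname in the toy hosts table."""
    if name not in hosts:
        raise NoSuchHost(f"lookup {name}: no such host")
    return hosts[name]


def connect_unchecked(name, exit_code):
    """Ignores the startup result, so the lookup error hides the real failure."""
    hosts = {}
    start_dependency(name, exit_code, hosts)  # exit status is dropped here
    return resolve(name, hosts)


def connect_checked(name, exit_code):
    """Checks the startup result first and reports the root cause."""
    hosts = {}
    code = start_dependency(name, exit_code, hosts)
    if code != 0:
        raise DependencyStartupError(
            f"dependency {name} exited with code {code} during startup"
        )
    return resolve(name, hosts)
```

With an exit code of 1, connect_unchecked raises NoSuchHost while connect_checked raises DependencyStartupError, which is the difference between the message you are seeing and the one we would want.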
Could you run 2 isolating checks?
- TiDB cache OFF
  Temporarily skip with_mounted_cache("/tmp/tidb", ...).
- TiDB cache UNIQUE per CI job
  Use a job-specific cache key, e.g.
  tidb-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}-${GITHUB_JOB}
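import os

from dagger import dag

# TIDB_CACHE_MODE: "shared" (default), "job" (unique per CI job), or "off"
mode = os.getenv("TIDB_CACHE_MODE", "shared")
cache_name = "tidb-cache"
if mode == "job":
    # Job-specific key so concurrent or retried jobs never share the volume
    cache_name = (
        f"tidb-{os.getenv('GITHUB_RUN_ID')}-"
        f"{os.getenv('GITHUB_RUN_ATTEMPT')}-"
        f"{os.getenv('GITHUB_JOB')}"
    )

tidb = dag.container().from_("pingcap/tidb:v7.5.2")
if mode != "off":
    # Skipped entirely when mode is "off", so TiDB starts with no cache mount
    tidb = tidb.with_mounted_cache("/tmp/tidb", dag.cache_volume(cache_name))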
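One caveat when trying the "job" mode outside GitHub Actions: the three GITHUB_* variables are unset there, os.getenv returns None for each, and the key silently becomes tidb-None-None-None, which is shared again. A minimal sketch of a guard, assuming you are fine with failing fast in that case (the helper name tidb_cache_name and its env parameter are mine, not part of dagger):

```python
import os


def tidb_cache_name(mode, env=None):
    """Return the cache volume name for the given mode.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    if mode != "job":
        return "tidb-cache"
    keys = ("GITHUB_RUN_ID", "GITHUB_RUN_ATTEMPT", "GITHUB_JOB")
    missing = [k for k in keys if not env.get(k)]
    if missing:
        # Fail loudly instead of producing a shared "tidb-None-..." key
        raise RuntimeError(f"job cache mode needs: {', '.join(missing)}")
    return "tidb-" + "-".join(env[k] for k in keys)
```

The result can be passed to dag.cache_volume(...) in place of cache_name.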
Also helpful: please attach TiDB logs around /tidb-server exit for a failing run 🙏
In parallel, I’ll keep pushing on the engine side so this failure mode reports a clear dependency-startup error instead of DNS lookup noise.
PS: test/debug branch is here