# Avoiding DNS errors under load

red grove
#

When jobs run concurrently I often get DNS errors like this:

start 6b54cec762hp0 (aliased as backend): lookup 5olmfqllgc9qo for hosts file: lookup 5olmfqllgc9qo on 10.87.0.1:53: no such host
lookup 5olmfqllgc9qo.co3lcrg5bd3e0.h8o6eg46r0m7i.dagger.local on 10.87.0.1:53: no such host
lookup 5olmfqllgc9qo.h8o6eg46r0m7i.dagger.local on 10.87.0.1:53: no such host

How should I avoid this?

I use @bold umbra's proxy module.

red grove
#

any ideas, guys?

pearl yoke
# red grove any ideas, guys?

Quick update: I now have a consistent repro on my side for one important part of this.

I can reproduce the following: during nested service startup, an early dependency failure can surface as "lookup ... for hosts file ... no such host" instead of a clearer dependency-startup error (this happens when the Dagger code does not expect a healthcheck for that dependency).

So in this class of failures, the DNS message can be a secondary symptom (error propagation), not necessarily the root cause.

In your trace, /tidb-server also exits with code 1 in the same failure window. That makes TiDB startup instability a strong lead, but not yet proven as the primary cause.

Could you run 2 isolating checks?

  1. TiDB cache OFF
    Temporarily skip with_mounted_cache("/tmp/tidb", ...).

  2. TiDB cache UNIQUE per CI job
    Use a job-specific cache key, e.g.
    tidb-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}-${GITHUB_JOB}

import os
from dagger import dag

# TIDB_CACHE_MODE selects the experiment: off | shared | job
mode = os.getenv("TIDB_CACHE_MODE", "shared")
cache_name = "tidb-cache"  # shared: one volume reused by every job
if mode == "job":
    # job: a volume name unique to this CI run, attempt, and job
    cache_name = (
        f"tidb-{os.getenv('GITHUB_RUN_ID')}-"
        f"{os.getenv('GITHUB_RUN_ATTEMPT')}-"
        f"{os.getenv('GITHUB_JOB')}"
    )

tidb = dag.container().from_("pingcap/tidb:v7.5.2")
# Mount the cache unless the mode is "off"
if mode != "off":
    tidb = tidb.with_mounted_cache("/tmp/tidb", dag.cache_volume(cache_name))

Also helpful: please attach TiDB logs around /tidb-server exit for a failing run 🙏

In parallel, I’ll keep pushing on the engine side so this failure mode reports a clear dependency-startup error instead of DNS lookup noise.

PS: test/debug branch is here

red grove
#

I was using per-PR caching, which ought to have been safe because you don't run multiple jobs from the same PR at once. Per-job caching would effectively disable caching.

pearl yoke
# red grove I was using per-PR caching, which ought to have been safe because you don't run ...

Thanks, that per-PR cache detail helps.

I dug a bit more into the engine behavior, and the "race" above is not actually a race. It's by design: without exposed ports, readiness is intentionally weak.

  • In core/healthcheck.go (around line 42), healthcheck returns success immediately when there are no healthcheckable ports.
  • Service startup then follows the “ready” path (core/service.go, around lines 570 and 675).
  • If a dependency exits right after, downstream startup can fail while generating host aliases (engine/buildkit/executor_spec.go, around line 292) with lookup ... for hosts file.

So:

  • it is by design that no-port services don’t get strong readiness guarantees,
  • but it is still a UX/debug issue that the surfaced top-level error is DNS-ish noise instead of the dependency exit/root cause.
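
As a rough illustration of the behavior described above, here is a toy model in Python. It is not the engine's code; the function name wait_until_ready, the probe callback, and the attempts parameter are all invented for this sketch. It only shows why a service with no healthcheckable ports is reported ready at once, while a service with a port has to pass a probe first.

```python
from typing import Callable, Sequence


def wait_until_ready(
    ports: Sequence[int],
    probe: Callable[[int], bool],
    attempts: int = 3,
) -> bool:
    """Toy readiness check (hypothetical, for illustration only)."""
    # No ports means there is nothing to probe, so the service counts
    # as ready immediately, even if its process is about to exit.
    if not ports:
        return True
    # With ports, every port must answer the probe within the attempt budget.
    for _ in range(attempts):
        if all(probe(port) for port in ports):
            return True
    return False


def never_up(port: int) -> bool:
    """Stand-in probe for a process that crashed before listening."""
    return False


print(wait_until_ready([], never_up))      # no ports: reported ready anyway
print(wait_until_ready([4000], never_up))  # with a port: the crash is noticed
```

In this model the crashed dependency is only detected in the second call, which mirrors why a no-port service can look "ready" and then fail later in a downstream step.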

Could you confirm whether TiDB/backend are explicitly using with_exposed_port(...) in your Dagger functions (and not skipping healthchecks)? If not, that may explain why this is surfacing this way.

A repro would also really help, because I'm a bit blind here at the moment 🙏

From your trace: proxy has exposed ports, but I don’t see an explicit withExposedPort(...) on the backend service before asService (same question for the other deps). In Dagger, image EXPOSE metadata alone doesn’t provide runtime readiness checks.

red grove
#

Thank you, I will provide you with all the information I can. We do have with_exposed_port for the backend, but not the database. Is the absence of that why an unready database manifests as a DNS failure? We do wait for the backend to be up before hitting it.

red grove
#

Sorry for the late reply. Should I be using with_exposed_port even if I don't want to expose the service externally? TiDB is currently created like so:

@function
async def create_tidb_service(
    self, cache_volume_name: str | None = "tidb-data"
) -> Service:
    """Creates a TiDB service

    :param cache_volume_name: Name of the cache volume for TiDB data. If None, caching is disabled.
    """
    container = dag.container().from_(consts.TIDB_IMAGE)

    if cache_volume_name is not None:
        container = container.with_mounted_cache(
            "/tmp/tidb", dag.cache_volume(cache_volume_name)
        )

    return container.as_service()
red grove
#

I'll try exposing the health check port.
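
A minimal sketch of what that change could look like, assuming the Python SDK's with_exposed_port and assuming TiDB listens on its default SQL port 4000 (the thread does not say which port is meant; TiDB's status port is another candidate). The dag client and the image are passed in as parameters here only so the snippet can be read and exercised without a running engine; in a real module they would come from "from dagger import dag" and consts.TIDB_IMAGE. In Dagger, an exposed port mainly gives the engine something to healthcheck and bind to, so declaring it should not by itself publish the service outside the engine.

```python
from typing import Any, Optional

TIDB_SQL_PORT = 4000  # assumption: TiDB's default SQL port


def create_tidb_service(
    dag: Any,
    image: str,
    cache_volume_name: Optional[str] = "tidb-data",
    port: int = TIDB_SQL_PORT,
) -> Any:
    """Build a TiDB service that declares a port so readiness can be checked.

    `dag` is expected to behave like the Dagger client; it is a parameter
    here (an assumption of this sketch) rather than a module-level import.
    """
    container = dag.container().from_(image)

    # Same optional cache mount as in the original function.
    if cache_volume_name is not None:
        container = container.with_mounted_cache(
            "/tmp/tidb", dag.cache_volume(cache_volume_name)
        )

    # Declaring the port gives the engine a healthcheckable port, so the
    # service is not treated as ready before TiDB is actually listening.
    return container.with_exposed_port(port).as_service()
```

If TiDB exits during startup, the expectation is that the failure then shows up as a service that never became healthy rather than as a later hosts-file lookup error.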

red grove
#

I resolved the problem with cache isolation: each branch now has its own cache volume.
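
For anyone landing here later, one way to derive a branch-scoped cache volume name is sketched below. The thread does not show the actual implementation, so this is only an assumption of how it might be done: the helper name branch_cache_volume_name and the "local" fallback are made up, and the branch is read from the GitHub Actions variables GITHUB_HEAD_REF (set for pull request events) and GITHUB_REF_NAME. Replacing unusual characters is a precaution, not a documented Dagger requirement.

```python
import os
import re
from typing import Mapping, Optional


def branch_cache_volume_name(
    prefix: str = "tidb",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a cache volume name that is unique per branch (hypothetical helper)."""
    env = os.environ if env is None else env
    # Prefer the PR source branch; fall back to the pushed ref, then to "local".
    branch = env.get("GITHUB_HEAD_REF") or env.get("GITHUB_REF_NAME") or "local"
    # Collapse characters such as "/" into "-" to keep the name simple.
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "-", branch).strip("-")
    return f"{prefix}-{safe or 'local'}"


print(branch_cache_volume_name(env={"GITHUB_HEAD_REF": "feature/fix-dns"}))
```

The result can be passed as cache_volume_name to create_tidb_service, so jobs on different branches never share /tmp/tidb while repeated runs on the same branch still reuse it.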
