#DNS and /etc/resolv.conf


sharp otter
#

How does this work? Why is it different from Docker?

It looks like the nameserver entries are replaced, while the search domains are kept and extended

# tony @ dev-tony in ~ [16:59:59] 
$ cat /etc/resolv.conf                                                                                                                                                          
# resolv.conf(5) file generated by tailscale
# For more info, see https://tailscale.com/s/resolvconf-overwrite
# DO NOT EDIT THIS FILE BY HAND -- CHANGES WILL BE OVERWRITTEN

nameserver 100.100.100.100
search tail9ce624.ts.net c.ferrum-dev.internal google.internal

# tony @ dev-tony in ~ [17:00:05] 
$ docker run --rm us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 100.100.100.100
search tail9ce624.ts.net c.ferrum-dev.internal google.internal

# Based on host file: '/etc/resolv.conf' (legacy)
# Overrides: []

# tony @ dev-tony in ~ [17:00:29] 
$ dagger --progress report core container from --address us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine with-exec --args cat --args /etc/resolv.conf stdout
▶ connect 0.2s
● loading type definitions 0.9s
● parsing command line arguments 0.0s

● container: Container! 0.0s
$ .from(address: "us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine"): Container! 1.3s CACHED
$ .withExec(args: ["cat", "/etc/resolv.conf"]): Container! 0.0s CACHED
▶ .stdout: String! 0.0s

nameserver 10.87.0.1
search tail9ce624.ts.net c.ferrum-dev.internal google.internal ttpo3gc8f7bce.dagger.local
#

We had an issue where Dagger was not working but Docker was. The dev was unable to curl the outside world. Eventually things cleared up. The dev commented that it felt delayed, like a DNS TTL, because 10+ minutes after their last change it just started working again

sleek glade
sharp otter
#

It's picking up the Tailscale search domains, but not the nameserver, so it seems more complex than this

sleek glade
sharp otter
#

why aren't nameservers extended like search?

sleek glade
#

extended how?

#

you can't just add nameservers

sharp otter
#

look at the very last search line, it includes the outside + one for dagger, nameserver is completely overwritten
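
To make that comparison concrete, here is a minimal Python sketch that parses two resolv.conf bodies and reports what changed. The function names `parse_resolv_conf` and `diff_resolv_conf` are made up for illustration; the sample values are the host and Dagger outputs pasted earlier in this thread.

```python
def parse_resolv_conf(text):
    """Return the nameserver and search entries from resolv.conf text."""
    conf = {"nameserver": [], "search": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            conf["nameserver"].append(fields[1])
        elif fields[0] == "search":
            # a later search line replaces an earlier one
            conf["search"] = fields[1:]
    return conf


def diff_resolv_conf(host, container):
    """Summarize how the container's view differs from the host's."""
    return {
        "nameservers_replaced": container["nameserver"] != host["nameserver"],
        "search_kept": all(d in container["search"] for d in host["search"]),
        "search_added": [d for d in container["search"] if d not in host["search"]],
    }


HOST = """\
nameserver 100.100.100.100
search tail9ce624.ts.net c.ferrum-dev.internal google.internal
"""

DAGGER = """\
nameserver 10.87.0.1
search tail9ce624.ts.net c.ferrum-dev.internal google.internal ttpo3gc8f7bce.dagger.local
"""

result = diff_resolv_conf(parse_resolv_conf(HOST), parse_resolv_conf(DAGGER))
print(result)
```

With these inputs the nameserver list differs, every host search domain is still present, and the only added search domain is the `.dagger.local` one.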

sleek glade
#

The resolver doesn't fall back to the next nameserver in resolv.conf when the first one answers that a record doesn't exist. It only moves on when a server is down or fails to resolve (returns an error).

#

the nameserver setting doesn't work like that

#

so doing

nameserver $your_ip

won't work
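
A rough Python model of that fallback rule may help. This is a simplified sketch of the behaviour described above, not the code of any real resolver library (glibc, musl and others differ in detail). The `query` callback, the second server `10.0.0.53` and the hostname `db.c.ferrum-dev.internal` are hypothetical.

```python
def resolve(name, nameservers, query):
    """Try nameservers in order, in a simplified model of resolver fallback.

    `query(server, name)` is a hypothetical callback returning one of:
      ("answer", ip), ("nxdomain", None), ("timeout", None), ("servfail", None)
    """
    for server in nameservers:
        status, value = query(server, name)
        if status == "answer":
            return server, value
        if status == "nxdomain":
            # "No such name" is a final answer: later nameservers are never asked.
            return server, None
        # timeout / servfail: this server is unusable, try the next one
    return None, None


# Hypothetical records: only the second server knows the internal name.
RECORDS = {
    "100.100.100.100": {},
    "10.0.0.53": {"db.c.ferrum-dev.internal": "10.1.2.3"},
}


def fake_query(server, name):
    ip = RECORDS[server].get(name)
    return ("answer", ip) if ip else ("nxdomain", None)


def first_server_down(server, name):
    if server == "100.100.100.100":
        return ("timeout", None)
    return fake_query(server, name)


servers = ["100.100.100.100", "10.0.0.53"]
print(resolve("db.c.ferrum-dev.internal", servers, fake_query))
# ('100.100.100.100', None)  -> first server said "no such name", lookup stops
print(resolve("db.c.ferrum-dev.internal", servers, first_server_down))
# ('10.0.0.53', '10.1.2.3')  -> first server timed out, second one was tried
```

So appending your own nameserver only helps when the servers listed before it are failing, not when they answer with "not found".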

sharp otter
#

It seems docker only changes it in one case

# tony @ dev-tony in ~ [17:22:31] C:125
$ docker run --net agc_default --rm us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 127.0.0.11
search tail9ce624.ts.net c.ferrum-dev.internal google.internal
options ndots:0

# Based on host file: '/etc/resolv.conf' (internal resolver)
# ExtServers: [100.100.100.100]
# Overrides: []
# Option ndots from: internal

# tony @ dev-tony in ~ [17:22:44] 
$ docker run --net bridge --rm us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 100.100.100.100
search tail9ce624.ts.net c.ferrum-dev.internal google.internal

# Based on host file: '/etc/resolv.conf' (legacy)
# Overrides: []

# tony @ dev-tony in ~ [17:23:23] 
$ docker run --net host --rm us-central1-docker.pkg.dev/ferrum-dev/eng/google/cloud-sdk:486.0.0-alpine cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 100.100.100.100
search tail9ce624.ts.net c.ferrum-dev.internal google.internal

# Based on host file: '/etc/resolv.conf'
# Overrides: []
#

tl;dr things worked outside of dagger but not inside. We want to understand why

#

why did DNS break and why did it magically resolve?

#

it persisted even after engine restarts

sleek glade
#

listing multiple nameservers is mostly only meant for redundancy

sharp otter
#

This doesn't help me understand why it happened, i need to write an RCA and figure out how to make sure it doesn't happen again

sleek glade
#

many things could be happening here since network routing is usually quite dependent on your setup. Have you observed any behavior that could lead to a more deterministic repro?

sharp otter
#

From the dev's perspective, Dagger was overwriting resolv.conf while Docker was not (regardless of the truth). It looks like Docker notes it in a comment in resolv.conf when it does this (which we would have seen had we used the network flag)
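
As an illustration, those trailing comments can be read mechanically. This is a minimal Python sketch that assumes the comment layout seen in the Docker output pasted in this thread; that layout may not be a stable or documented interface, and `docker_resolver_info` is a made-up helper name.

```python
import re


def docker_resolver_info(text):
    """Extract Docker's trailing comment metadata from resolv.conf text."""
    info = {"host_file": None, "mode": None, "ext_servers": []}
    for line in text.splitlines():
        m = re.match(r"#\s*Based on host file: '([^']*)'(?:\s*\(([^)]*)\))?", line)
        if m:
            # mode is e.g. "legacy" or "internal resolver"; None when absent
            info["host_file"], info["mode"] = m.group(1), m.group(2)
            continue
        m = re.match(r"#\s*ExtServers:\s*\[(.*)\]", line)
        if m:
            info["ext_servers"] = [s for s in re.split(r"[\s,]+", m.group(1)) if s]
    return info


# Sample copied from the `--net agc_default` run in this thread.
SAMPLE = """\
nameserver 127.0.0.11
search tail9ce624.ts.net c.ferrum-dev.internal google.internal
options ndots:0

# Based on host file: '/etc/resolv.conf' (internal resolver)
# ExtServers: [100.100.100.100]
# Overrides: []
# Option ndots from: internal
"""

info = docker_resolver_info(SAMPLE)
print(info)
```

Here `mode` comes back as `internal resolver` and `ext_servers` holds the host's real upstream, which is the information that would have explained the rewritten nameserver line.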

#

so curl worked outside the container but not inside; we narrowed it down to DNS (unable to look up the host)

#

this machine had had iptables changes made to it, and undoing them cleared up most of the issues (before we hit the DNS issue); it was consistent with known working setups at that point

sleek glade
sharp otter
#
# Based on host file: '/etc/resolv.conf' (internal resolver)
# ExtServers: [100.100.100.100]
# Overrides: []
# Option ndots from: internal
#

tl;dr - this is unfamiliar tech for most people, having info like this is super helpful

sleek glade
sharp otter
#

@burnt fractal stepping back from this particular issue, it would be helpful if there were documentation for the casual or occasional user that explains the relationship between docker, dagger, and buildkit: where they align, overlap, and diverge, across concepts and behaviors

Something at this level: https://atproto.com/articles/atproto-for-distsys-engineers with links to details as needed?

#

I'm not sure many people are even aware of the difference between docker build and docker buildx, or how they relate to buildkit.

#

I've tried explaining to them that "going back to docker" is not going to change much now since docker moved build to buildkit

#

some light history in a piece too?

sleek glade
#

@sharp otter until we address the docs and config file comments, if this happens again, I'd ask your teams to check which upstream DNS server the engine is currently configured with (command I've shared above) and check whether they can reach it with telnet $ip 53. If they can't, the cause is probably outside Dagger
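
If telnet isn't installed, the same reachability check can be sketched in Python. This is a minimal example that assumes the upstream server accepts TCP on port 53; like telnet, it says nothing about UDP, which is what most DNS queries use. `can_reach_tcp` is a made-up helper name.

```python
import socket


def can_reach_tcp(host, port=53, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unroutable addresses
        return False


# Example call, using the upstream seen in this thread:
#   can_reach_tcp("100.100.100.100")
```

Running it both on the host and inside the container shows whether the path to the upstream is blocked only from inside.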

#

there are multiple things outside Dagger that can block network traffic. I'm unsure why in your case it just stopped working out of the blue. FWIW I don't think I've ever seen anyone raise this same issue

sharp otter
#

I don't think it stopped out of the blue, but it did start working after some delay following the last change

#

It's definitely something this dev did on the host, because way more was broken when this all started

sleek glade
sharp otter
#

I think my RCA will have a tl;dr - don't play with iptables on your dev-vm, do that on a separate box...

#

and my hunch is that the real RCA is copying output from AI into a terminal without much thought, I've done it myself 🙊

sleek glade
sharp otter
#

dude, they also run k8s directly on their dev-vm hosts, at least it's now been commented that this was a bad choice and they'd use a nested vm setup if they did it again

#

it's amazing how resilient the software house of cards is in practice

#

@sleek glade is there DNS caching going on anywhere in this chain (in the dagger/buildkit context)?

sleek glade
#

so it'll return cached responses until the TTL expires, in the same way DNS servers like systemd-resolved do

#

so it's faster and doesn't hammer the upstream servers
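
To picture why a cached answer can outlive a fix by several minutes, here is a small Python sketch of a TTL cache. It is a generic illustration and not Dagger's or BuildKit's implementation. The `TTLCache` class, the 600 second TTL and the address `192.0.2.10` are made up for the example.

```python
import time


class TTLCache:
    """Tiny TTL cache: entries are served until their TTL expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: caller must ask upstream again
            return None
        return value


# Fake clock so the example runs instantly.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("example.com", "192.0.2.10", ttl=600)  # answer cached at t=0

now[0] = 300
print(cache.get("example.com"))  # 192.0.2.10 (still served 5 minutes later)
now[0] = 601
print(cache.get("example.com"))  # None (expired, next lookup goes upstream)
```

Until the entry expires, changes made upstream or on the host are invisible to anything reading through the cache.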

sharp otter
#

and if they restarted the engine without cleaning up the volume, the cache probably persisted across restarts?

sleek glade
sharp otter
#

the actual order of operations is unclear to me, knowing this helps me sort the real order out