Hanging/flaky service tests | Dagger | Page 1

lucid crane Apr 19, 2023, 4:39 PM

#

simple flower Apr 19, 2023, 5:05 PM

#

These are weird, @lucid crane. I have to imagine it's something to do with the dagger-in-dagger stuff I did. That's a substantial change. Happy to dig in with you, if that'd be helpful.

lucid crane Apr 19, 2023, 5:06 PM

#

That's my suspicion too, since all these tests passed repeatedly locally while I was benchmarking the GraphQL change yesterday, but it just failed on the first run with ./hack/make engine:test, the same way it failed in CI

#

I'm happy to dive in initially since I'm familiar with the services stack, but I'll keep you in the loop!

simple flower Apr 19, 2023, 5:09 PM

#

Sounds good, @lucid crane!

lucid crane Apr 19, 2023, 6:54 PM

#

probably unrelated, just kind of funny: I wonder what permissions this file is ending up with? 😂 (note the lack of octal notation for 755)

simple flower Apr 19, 2023, 6:56 PM

#

lucid crane probably unrelated, just kind of funny: I wonder what permissions this file is e...

Damnit. Don't look at the blame.

lucid crane Apr 19, 2023, 6:56 PM

#

too late

#

I mean, it works

simple flower Apr 19, 2023, 6:56 PM

#

🤦🏻‍♂️

lucid crane Apr 19, 2023, 6:57 PM

#

I remember someone pointing out relying on numbers for this API would be a footgun

#

but ... generally it looks the most familiar

#

maybe we could raise an error for any value greater than 511 (0777)

#

I guess one alternative is rwxr-xr-x but who wants to type that

simple flower Apr 19, 2023, 7:02 PM

#

https://github.com/dagger/dagger/pull/4982

GitHub

Update permissions to use octal by jlongtine · Pull Request #4982 ·...

🙏🏻 @vito

#

@lucid crane Actually... look at this git blame: https://github.com/dagger/dagger/blame/5b0861e5571f827570ed3b0629d7c95fcf417c13/internal/mage/engine.go#L359

lucid crane Apr 19, 2023, 7:04 PM

#

alright you're off the hook

simple flower Apr 19, 2023, 7:04 PM

#

😉

#

I wish I'd caught that.

elder wren Apr 19, 2023, 8:24 PM

#

Yeah I've hit this on occasion locally and in CI a few times, this was just earlier today: https://github.com/dagger/dagger/actions/runs/4746265072/jobs/8429673040#step:6:15207

lucid crane Apr 19, 2023, 8:30 PM

#

I managed to reproduce it that first time, but now it's passed 14 times in a row 🫠

elder wren Apr 19, 2023, 8:44 PM

#

lucid crane I managed to reproduce it that first time, but now it's passed 14 times in a row...

Yeah I haven't noticed any rhyme or reason to it. When it happened to me locally it seemed like either the DNS entry was not found when clients tried connecting or, when it would just hang a long time, the health checks just ran forever.

lucid crane Apr 19, 2023, 8:45 PM

#

I have a suspicion it has to do with the synchronizing to dedupe services, since a lot of these tests use the same test service without a cache buster or anything. A lot of moving parts have to align for that to work as expected with high test concurrency

#

Just a theory at this point

#

It's interesting that the health checks keep going too until the timeout, since it means the service Evaluate call hasn't returned

elder wren Apr 19, 2023, 8:52 PM

#

lucid crane I have a suspicion it has to do with the synchronizing to dedupe services, since...

Yeah, the fact that we were getting those buildkit scheduler errors before adding the 10 second grace period feels suspicious. Could be unrelated but clearly the scheduler is not handling this scenario perfectly, so wouldn't be surprised if something is still going haywire even if it doesn't result in an error anymore

#

Might be interesting to try reducing that timeout to see if it makes it more reproducible, if you haven't already

lucid crane Apr 19, 2023, 8:53 PM

#

Ah good idea, maybe it happens more often in CI because worse/busier hardware = more likely to hit that timeout

#

I also wonder if this is passing on repeated runs because a lot of the service test client calls just get cached, maybe my loop needs a prune in the middle

#

sidenote: would be nice if we had a dagger prune command, I use bass --prune all the time

simple flower Apr 19, 2023, 9:14 PM

#

lucid crane sidenote: would be nice if we had a `dagger prune` command, I use `bass --prune`...

I love this idea.

lucid crane Apr 19, 2023, 9:55 PM

#

hmmm... this looks mighty suspicious:

1871:#15 55.45 time="2023-04-19T21:39:54Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
3646:#15 122.0 time="2023-04-19T21:41:01Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
4707:#15 167.3 time="2023-04-19T21:41:46Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""

#

all of this should be wrapped with a flock already. though maybe it's weird that they're locking the same file that they're atomically replacing thinkies

#

going to try replacing that with locking a separate regular file instead

#

FWIW going down to a 1 second timeout didn't seem to help (passed a few times), but when I bumped the grace period to 1000 seconds instead, it seems to have failed on the first try. Not sure if I learned anything there, or if it was just a coincidence

lucid crane Apr 19, 2023, 10:28 PM

#

two successful runs so far, no more addnhosts errors in the logs. hopefully this is the root of it; it would make a lot of sense

#

addnhosts is the file that contains the mapping from hostname to container IP, so not correctly synchronizing would absolutely result in failed lookups

sleek rose Apr 19, 2023, 10:31 PM

#

lucid crane I remember someone pointing out relying on numbers for this API would be a footg...

🙈 I recall having this conversation with Andrea when implementing that API https://github.com/dagger/dagger/pull/3133/files#r973772676

reason why it works is because strconv.ParseUint("755", 8, 32) works the same for both "0755" and "755" respectively. Back then we decided not implementing a custom FileMode scalar since it wasn't worth it.

Having said that, agree that we should standardize it in the tests 🙏

lucid crane Apr 19, 2023, 10:35 PM

#

sleek rose 🙈 I recall having this conversation with Andrea when implementing that API http...

I don't know if we're even doing string parsing for these right now, as far as I can tell it's just an Int all the way from the SDK through GraphQL and into core/: https://github.com/dagger/dagger/blob/3e5992dbc114789bd6bab77d4bc8d417f9350f24/core/schema/directory.go#L140 (ok technically a fs.FileMode but it doesn't define any custom unmarshaling)

#

so my guess is we literally passed 755 (rather than 493 aka 0755) all the way to wherever the chmod happens, and it worked anyway?

sleek rose Apr 19, 2023, 10:53 PM

#

lucid crane I don't know if we're even doing string parsing for these right now, as far as I...

I see.. seems like that part of the code has been refactored indeed to use FileMode now instead of string as before, but for some reason it works 😄

lucid crane Apr 19, 2023, 10:54 PM

#

I tried to ask chatgpt about this but it was all high and mighty about "making sure to use octal notation"

#

Maybe I'll tell it my organization has standardized on decimal

#

ok, 9 successful runs in a row, opened https://github.com/dagger/dagger/pull/4983 with the (probable) fix

GitHub

fix flock usage for addnhosts by vito · Pull Request #4983 · dagger...

Previously we would lock the addnhosts file itself, which in hindsight doesn't make a lot of sense because we also atomically write to it via renaming. This was causing service related tests to...

sleek rose Apr 19, 2023, 11:04 PM

#

lucid crane I tried to ask chatgpt about this but it was all high and mighty about "making s...

well.. seems to work because 755 decimal casually sets the executable bit to "everyone" (https://go.dev/play/p/U1NLWyk5J7C).. so that makes it work I guess (?). I'll open a PR to fix it 🙏

Go Playground - The Go Programming Language

lucid crane Apr 19, 2023, 11:07 PM

#

lol, nice. so the owner can execute and write to it, but not read it wat (i mean, it's garbage-in garbage-out, just funny to think about)

sleek rose Apr 19, 2023, 11:11 PM

#

lucid crane lol, nice. so the owner can execute and write to it, but not read it <:wat:10298...

https://github.com/dagger/dagger/pull/4985 👀

lucid crane Apr 19, 2023, 11:12 PM

#

@simple flower also had a PR for it but it failed, probably because of the services flake (saw it took 17 mins). Just merged the services fix, so whoever rebases first wins 😛

simple flower Apr 19, 2023, 11:21 PM

#

lucid crane <@412836691720536065> also had a PR for it but it failed, probably because of th...

https://giphy.com/gifs/reaction-jon-stewart-the-daily-show-pEP4CwhG1f1uw

Giphy

#

@sleek rose beat me to it.

elder wren Apr 20, 2023, 1:00 AM

#

lucid crane sidenote: would be nice if we had a `dagger prune` command, I use `bass --prune`...

Yeah totally, that's another thing in the bucket of features that requires an "operator api", which I'm perpetually on the cusp of adding but then realizing there's a simpler way and that it's not worth adding yet. E.g. instead of prune there's docker rm -fv <engine> 😅 We will want official support though someday I'm sure

#

probably dagger engine prune --size <...> ..., alongside things like dagger engine config --disable-privileged-exec and similar

lucid crane Apr 20, 2023, 1:12 AM

#

Yeah, the only kind of annoying thing with having to use docker rm is that the engine name is dynamic, and sometimes there are multiple. I guess it's still a one-liner: docker ps -q --filter name=dagger-engine | xargs docker rm -fv

#

up until now I've been doing docker exec (docker ps --filter name=dagger-engine.dev -q) buildctl prune which only lets you do one at a time

#

I wonder if prune makes sense as user-facing vs. operator-facing though. I mean, ideally you never have to run it because of GC/etc., but the cache will still get pretty big, and at some point you might want to just free up a lot of space. I think it might just be an integral part of the experience, to some extent.

#

Really splitting hairs here because it might just come down to dagger prune vs dagger engine prune, and who cares really, just interested in the discussion. I guess Docker is similar with docker system prune, which I find myself running all the time

elder wren Apr 21, 2023, 9:23 PM

#

@lucid crane I just hit this error, where it looks like dnsmasq is no longer listening? If I'm interpreting it correctly? https://github.com/dagger/dagger/actions/runs/4768865376/jobs/8478663028?pr=5003#step:6:12705

elder wren Apr 21, 2023, 9:32 PM

#

lucid crane Really splitting hairs here because it might just come down to `dagger prune` vs...

Yeah I think it's probably important to keep a clean separation between clients wanting to open a session with the engine and operators who want to make global changes to the engine. One reason being that in the fullness of time we'd most likely want some sort of auth{n,z} to control who can do what.

lucid crane Apr 21, 2023, 10:00 PM

#

elder wren <@108011715077091328> I just hit this error, where it looks like dnsmasq is no l...

Yeah that's really odd. Saw the same thing happen in a recent build. Not sure what would be causing dnsmasq to exit. Unless this is somehow running before it even starts, but that seems unlikely? Haven't looked deeply at the new Dagger-in-Dagger testing flow yet.

lucid crane Apr 21, 2023, 10:01 PM

#

elder wren Yeah I think it's probably important to keep a clean separation between clients ...

That's fair, I'm mainly thinking of the local dev feedback loop case, where it's all just local to your machine and possibly eating up gobs of disk space.

#

I think we'd want the pruning functionality to be easily within reach for that scenario at least. Realistic example: I'm running Dagger in WSL, have been running builds all day, and now I want to install GTA V which needs ~95GB of space. 😛

elder wren Apr 21, 2023, 10:43 PM

#

lucid crane Yeah that's really odd. Saw the same thing happen in a recent build. Not sure wh...

Yeah it looks like it's actual engine inside an exec that's getting it, so I can't think of how that could introduce a new race condition relative to running the engine in a different context. And I think we are setting up all the networking stuff before we listen on the grpc address that clients connect to, so 🤷‍♂️

elder wren Apr 21, 2023, 10:45 PM

#

lucid crane I think we'd want the pruning functionality to be easily within reach for that s...

Oh totally, I think in the default case where we spin up the engine in your local dockerd, then we'd probably treat everyone as an operator. It's more of the self-provisioned "production infra" cases where the distinction probably becomes more important

lucid crane Apr 24, 2023, 5:02 PM

#

@elder wren I've been really tempted to just remove dnsmasq from the picture entirely, we're so close to being able to just shove mappings directly into each container's /etc/hosts, which we're already doing for service ~~bindings~~ aliases. thinkspin

#

I was hoping it'd be a lot more stable than this, had my doubts but kept thinking "it's a specialized tool for this one job" but at this point I think it's just down to how many moving parts there are

#

another option is to have the engine itself serve DNS queries, which would let us keep a bit more of the dynamic behavior and replace a moving part with one we can instrument more easily

lucid crane Apr 24, 2023, 5:29 PM

#

@simple flower in the meantime, I'll put together a PR to run dnsmasq in the foreground (rather than deamonizing), write logs directly to stdout/stderr, and scream loudly if it exists, which would hopefully go a long way towards troubleshooting all this

simple flower Apr 24, 2023, 5:30 PM

#

lucid crane <@412836691720536065> in the meantime, I'll put together a PR to run `dnsmasq` i...

🙏🏻 Sounds good! It'd be helpful to know if that's what is causing the flakiness.

#

Huh. One possible confounder in the k8s environment is that k8s services are in the 10.100.0.0/16 IP range.

#

I'm gonna change my dev network ip ranges...

lucid crane Apr 24, 2023, 5:34 PM

#

hmm that specific address range seems fine but that's definitely worth checking

simple flower Apr 24, 2023, 5:35 PM

#

Actually, I only use that range for the remote cache tests.

lucid crane Apr 24, 2023, 5:35 PM

#

it kind of seems like dnsmasq is blowing up though, since it happens pretty far into each of these test runs

simple flower Apr 24, 2023, 5:35 PM

#

lucid crane it kind of seems like dnsmasq is blowing up though, since it happens pretty far ...

yeah, that makes a good deal of sense to.

lucid crane Apr 24, 2023, 5:37 PM

#

hopefully I can get these failures to happen on the PR that adds the debug logs, but might be good to merge it anyway for max debuggability

simple flower Apr 24, 2023, 5:39 PM

#

I'll probably cherry pick to my fork, as well.

lucid crane Apr 24, 2023, 5:39 PM

#

https://github.com/dagger/dagger/pull/5015 - just waiting on the first builds now to see if it even works

simple flower Apr 24, 2023, 5:40 PM

#

https://giphy.com/gifs/popcorn-television-cartoon-u5BzptR1OTZ04

Giphy

SpongeBob SquarePants

simple flower Apr 24, 2023, 5:56 PM

#

Hmm... well, something broke.

#

I'm getting hangs on my branch with that commit cherry picked...

#

In CI, that is.

#

Looks like it is hanging locally for me, as well.

lucid crane Apr 24, 2023, 6:07 PM

#

welp yeah that didn't work at all

#

I'll kick the tires locally first this time haha

simple flower Apr 24, 2023, 6:07 PM

#

🙂 👍🏻

lucid crane Apr 24, 2023, 6:09 PM

#

oh, maybe dnsmasq decides not to create a pidfile when you run it in the foreground

simple flower Apr 24, 2023, 6:10 PM

#

Looks like it... https://github.com/jpillora/docker-dnsmasq/issues/11

#

Debug mode: don't fork to the background, don't write a pid file, don't change user id, generate a complete cache dump on receipt on SIGUSR1, log to stderr as well as syslog, don't fork new processes to handle TCP queries. Note that this option is for use in debugging only, to stop dnsmasq daemonising in production, use --keep-in-foreground.

#

Maybe --keep-in-foreground is a better bet?

#

-k, --keep-in-foreground
Do not go into the background at startup but otherwise run as normal. This is intended for use when dnsmasq is run under daemontools or launchd.

lucid crane Apr 24, 2023, 6:12 PM

#

nice catch, that seems to prevent it from printing logs though

#

I guess I could just write to the pidfile myself

simple flower Apr 24, 2023, 6:14 PM

#

lucid crane nice catch, that seems to prevent it from printing logs though

Yeah, maybe add?

-8, --log-facility=<facility>
Set the facility to which dnsmasq will send syslog entries, this defaults to DAEMON, and to LOCAL0 when debug mode is in operation. If the facility given contains at least one '/' character, it is taken to be a filename, and dnsmasq logs to the given file, instead of syslog. If the facility is '-' then dnsmasq logs to stderr. (Errors whilst reading configuration will still go to syslog, but all output from a successful startup, and all output whilst running, will go exclusively to the file.) When logging to a file, dnsmasq will close and reopen the file when it receives SIGUSR2. This allows the log file to be rotated without stopping dnsmasq.

lucid crane Apr 24, 2023, 6:14 PM

#

Nice

simple flower Apr 24, 2023, 6:14 PM

#

https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

lucid crane Apr 24, 2023, 6:20 PM

#

nice, seems like that's working

simple flower Apr 24, 2023, 6:21 PM

#

🎉

lucid crane Apr 24, 2023, 6:23 PM

#

going to just keep rerunning this until something fails

simple flower Apr 24, 2023, 6:24 PM

#

I'll try it on my fork, as well.

#

@lucid crane I have some failures here: https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518074478

GitHub

dnsmasq: use --keep-in-foreground rather than --no-daemon · jlongti...

A portable devkit for CI/CD pipelines. Contribute to jlongtine/dagger development by creating an account on GitHub.

#

https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518074478#step:6:15764

GitHub

dnsmasq: use --keep-in-foreground rather than --no-daemon · jlongti...

A portable devkit for CI/CD pipelines. Contribute to jlongtine/dagger development by creating an account on GitHub.

#

I'm not seeing dnsmasq blow up in the logs I've looked through thus far.

#

I'll keep looking, though.

lucid crane Apr 24, 2023, 6:43 PM

#

hm yeah no smoking gun here, some of this looks like things just being slow. health check for the git daemon took 14+ seconds, eventually succeeded, but then the git operation failed anyway

#

(which is odd)

#

I wonder what the load average of this hardware is

simple flower Apr 24, 2023, 6:47 PM

#

I can try to find out.

lucid crane Apr 24, 2023, 6:48 PM

#

Are your builds running with the same setup as mine? Haven't been able to get it to fail yet

simple flower Apr 24, 2023, 6:48 PM

#

It's a pretty hefty box. I'm requesting 16 CPUs + 64Gi per job.

#

My jobs are running on the self-hosted EKS setup.

lucid crane Apr 24, 2023, 6:53 PM

#

Is it interesting that the python SDK tests also failed with timeouts? https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518074910

GitHub

dnsmasq: use --keep-in-foreground rather than --no-daemon · jlongti...

A portable devkit for CI/CD pipelines. Contribute to jlongtine/dagger development by creating an account on GitHub.

#

The way it failed doesn't necessarily seem related, just wondering if it points to broader network/resource issues. (I also feel obliged to say "MTU" just in case.)

simple flower Apr 24, 2023, 6:56 PM

#

Yeah, it is odd... I also don't understand that failure... :/

#

I can see what knobs I have around MTU with EKS.

lucid crane Apr 24, 2023, 6:59 PM

#

elder wren <@108011715077091328> I just hit this error, where it looks like dnsmasq is no l...

I think @elder wren was able to hit the dnsmasq issue on regular hardware with this PR 👆 so I'll keep rerunning

simple flower Apr 24, 2023, 7:01 PM

#

simple flower I can see what knobs I have around MTU with EKS.

AWS_VPC_ENI_MTU is currently set to 9001, which is the default (and max). It can go down to 576.

lucid crane Apr 24, 2023, 7:09 PM

#

yep looks like it's detecting that fine

#14 service 17FRCPMJI2GKK
#14 0.191 Connected to engine dagger-engine-4njls
#14 0.208 time="2023-04-24T18:30:12Z" level=info msg="detected mtu 9001 via interface eth1"

simple flower Apr 24, 2023, 7:10 PM

#

lucid crane I wonder what the load average of this hardware is

I enabled detailed monitoring on the runner node, and it looks like it's peaking at 12.5% CPU utilization (via AWS console)

lucid crane Apr 24, 2023, 7:14 PM

#

finally hit a flake, but it doesn't seem relevant: https://github.com/dagger/dagger/actions/runs/4789691840/jobs/8518865411#step:6:13384

simple flower Apr 24, 2023, 7:14 PM

#

Weird. engine-race-detection finished cleanly for me (hasn't happened in a while): https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518527623

simple flower Apr 24, 2023, 7:15 PM

#

lucid crane finally hit a flake, but it doesn't seem relevant: <https://github.com/dagger/da...

That's an odd one.

lucid crane Apr 24, 2023, 7:17 PM

#

one of your runs also failed to fetch from github.com?: https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518527771#step:6:16219

simple flower Apr 24, 2023, 7:18 PM

#

lucid crane one of your runs also failed to fetch from github.com?: <https://github.com/jlon...

Yup...

lucid crane Apr 24, 2023, 7:24 PM

#

I guess that could also be attributed to dnsmasq issues, since it takes over DNS for everything. @simple flower could be worth completely disabling services and running a bunch of builds to see if all the problems go away. We still have that feature flag

#

_EXPERIMENTAL_DAGGER_SERVICES_DNS=0

simple flower Apr 24, 2023, 7:27 PM

#

lucid crane I guess that could also be attributed to dnsmasq issues, since it takes over DNS...

That'll break any tests that utilize services, right?

#

Happy to try it

lucid crane Apr 24, 2023, 7:28 PM

#

nah they'll detect that env var and skip themselves. just need to set it for both the engine + the tests

simple flower Apr 24, 2023, 7:28 PM

#

Great!

lucid crane Apr 24, 2023, 7:28 PM

#

wouldn't be surprised if a test or two was added that depends on services and doesn't check, though. guess we'll find out!

#

ooo. might have found something

https://github.com/dagger/dagger/actions/runs/4789691840/jobs/8519201073#step:5:8201

#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 18 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="healthcheck completed" actualDuration="710.109µs" spanID=f56e23b38bd39822 timeout=30s traceID=5608a742f82e179db56e38bedb42bee3
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace mfneg6mbzjmgcy4p11o0zxnl4"
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="session started" spanID=72568c54215e28ae traceID=a6fade5dcf0682fc36ef0d4803b9416d
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace zh7tgf0hzfjc86ktod7qm5l2m"
#15 188.0 dnsmasq[52]: failed to load names from /var/run/containers/cni/dnsname/dagger-dev/addnhosts: No such file or directory
#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 16 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=error msg="/moby.buildkit.v1.frontend.LLBBridge/Solve returned error: rpc error: code = Unknown desc = failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128"
#15 188.0 failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128

#

the "addnhosts: No such file or directory" is what seems weird

simple flower Apr 24, 2023, 7:36 PM

#

I see a bunch of those throughout the logs.

lucid crane Apr 24, 2023, 7:37 PM

#

time to investigate the locking/atomic write flow again

#

err..... this seems wrong: it renames the live file to .old, and then writes to the live file path, and then removes the .old file ???

simple flower Apr 24, 2023, 7:40 PM

#

Hmm...

lucid crane Apr 24, 2023, 7:40 PM

#

i would have expected writing to a new file, and then renaming it over the old one

simple flower Apr 24, 2023, 7:41 PM

#

That seems like a cleaner way to me.

lucid crane Apr 24, 2023, 7:41 PM

#

good thing we already vendored all this 😅

simple flower Apr 24, 2023, 7:45 PM

#

😛

lucid crane Apr 24, 2023, 8:01 PM

#

@simple flower pushed a fix, so far so good, just waiting to re-run everything

#

I could see this fixing quite a few flakes, that's a pretty glaring bug

simple flower Apr 24, 2023, 8:11 PM

#

Nice! I'll pull that commit into my branch in a few. Gonna eat some lunch first.

lucid crane Apr 24, 2023, 8:33 PM

#

hit the cache flake again, rerunning: https://github.com/dagger/dagger/actions/runs/4790497346/jobs/8520291839#step:5:16662

simple flower Apr 24, 2023, 8:44 PM

#

Running: https://github.com/jlongtine/dagger/actions/runs/4790897227

#

@lucid crane That cleared it right up: https://github.com/jlongtine/dagger/actions/runs/4790962570/jobs/8520777220

#

Still have the python-sdk failure... but I'd bet that's something else.

lucid crane Apr 24, 2023, 9:34 PM

#

simple flower <@108011715077091328> That cleared it right up: https://github.com/jlongtine/dag...

nice! is it just the one run so far?

#

i guess even just passing the first time is a good sign?

simple flower Apr 24, 2023, 9:34 PM

#

This run failed... for some reason...

#

https://github.com/jlongtine/dagger/actions/runs/4791132585/attempts/1

#

But I haven't had a non-failure in a while... so I'm gonna call it a win. 🙂

lucid crane Apr 24, 2023, 9:36 PM

#

damn, https://github.com/jlongtine/dagger/actions/runs/4791132585/jobs/8521159840 seems like the same type of failures as before 😦

#

3275:2023-04-24T21:17:10.1468342Z #15 73.50 dnsmasq-debug[49]: stopped listening on dagger-dev0(#3): 10.88.0.1 port 53
3276:2023-04-24T21:17:10.1468880Z #15 73.50 dnsmasq-debug[49]: listening on dagger-dev0(#3): 10.88.0.1 port 53

???

#

(from 'view raw logs' => grep dnsmasq)

simple flower Apr 24, 2023, 9:39 PM

#

Yeah... I'm not sure why... I backed out my services disabling commits (which I'd already reverted)... and cherry picked your code again.

#

Weird.

#

Why... did it stop listening?

lucid crane Apr 24, 2023, 9:41 PM

#

exactly lol

simple flower Apr 24, 2023, 9:41 PM

#

https://github.com/jlongtine/dagger/actions/runs/4791132585 attempt #3 worked. 🙂

GitHub

redo atomic host file updates · jlongtine/dagger@5615e11

A portable devkit for CI/CD pipelines. Contribute to jlongtine/dagger development by creating an account on GitHub.

lucid crane Apr 24, 2023, 9:41 PM

#

i know we send it SIGHUP to signal there are changes to the hosts file, but it can't be that, or it'd be happening much more frequently

#

i'm at least not seeing 'addnhosts: No such file or directory' anymore - there's only one on boot which is normal. (we should maybe just create an empty file to quiet that down.)

#

  if (option_bool(OPT_CLEVERBIND))

this does not bode well

#

(poking around https://github.com/imp/dnsmasq/blob/770bce967cfc9967273d0acfb3ea018fb7b17522/src/network.c#L700)

simple flower Apr 24, 2023, 9:49 PM

#

I don't like that comment...

#

Garbage-collect listeners listening on addresses that no longer exist.

lucid crane Apr 24, 2023, 9:53 PM

#

lol, we do enable it in the dnsmasq config

#

it's been there since the initial commit

simple flower Apr 24, 2023, 9:55 PM

#

bind-dynamic?

lucid crane Apr 24, 2023, 9:55 PM

#

yep (sorry should have said that)

simple flower Apr 24, 2023, 9:55 PM

#

Not a problem!

lucid crane Apr 24, 2023, 9:56 PM

#

just gonna... remove that. and audit all these other settings

simple flower Apr 24, 2023, 9:57 PM

#

👍🏻

#

This is interesting, @lucid crane: https://github.com/jlongtine/dagger/actions/runs/4791132585/jobs/8521755702#step:6:12039

#19 294.5 #7 0.248 dnsmasq[45]: warning: interface remote-cache-00 does not currently exist
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 dnsmasq[45]: no servers found in /etc/dnsmasq-resolv.conf, will retry
#19 294.5 #7 0.248 dnsmasq[45]: reading /etc/dnsmasq-resolv.conf
#19 294.5 #7 0.248 dnsmasq[45]: using nameserver 10.88.0.1#53
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 time="2023-04-24T21:57:29Z" level=debug msg="setting up engine tracing"
#19 294.5 #7 0.248 dnsmasq[45]: failed to load names from /var/run/containers/cni/dnsname/remote-cache-0/addnhosts: No such file or directory

lucid crane Apr 24, 2023, 10:03 PM

#

the 'no servers' and 'addnhosts: No such file' should be fixed by a commit I just pushed (just noisy logs due to the order of events)

#

first line is interesting, I guess the interface doesn't exist until the CNI plugin runs? will try to track down where that's created

simple flower Apr 24, 2023, 10:05 PM

#

👍🏻

lucid crane Apr 24, 2023, 10:05 PM

#

pushed two commits; first one tidies up the logs I mentioned, second one removes bind-dynamic. waiting on builds now while i investigate

#

tests passed, linting failed, but I'm going to try another change to just have it listen on 0.0.0.0 - seems a lot simpler than depending on the specific network interface. this just runs in a container anyway.

#

ok that made everything detonate. going back. I wonder how it actually deals with the interface not existing yet. seems like it was working fine before even without bind-dynamic (...at least for that one run)

simple flower Apr 24, 2023, 10:22 PM

#

This run mostly passed: https://github.com/jlongtine/dagger/actions/runs/4791493805/jobs/8521981444

GitHub

dnsmasq: don't configure bind-dynamic · jlongtine/dagger@cba91df

A portable devkit for CI/CD pipelines. Contribute to jlongtine/dagger development by creating an account on GitHub.

#

#19 330.9 === NAME  TestContainerImport/Docker
#19 330.9     container_test.go:2679: 
#19 330.9             Error Trace:    /app/core/integration/container_test.go:2679
#19 330.9             Error:          Received unexpected error:
#19 330.9                             GET https://index.docker.io/v2/library/alpine/manifests/sha256:1304f174557314a7ed9eddb4eab12fed12cb0cd9809e4c28f29af86979a3c870: TOOMANYREQUESTS: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#19 330.9             Test:           TestContainerImport/Docker
#19 330.9 --- FAIL: TestContainerImport (22.39s)
#19 330.9     --- PASS: TestContainerImport/OCI (11.22s)
#19 330.9     --- FAIL: TestContainerImport/Docker (0.30s)

lucid crane Apr 24, 2023, 10:23 PM

#

lol 😭

simple flower Apr 24, 2023, 10:28 PM

#

Ok. Gotta go get ready for Leander's soccer practice.

lucid crane Apr 24, 2023, 10:29 PM

#

sounds good! seems like we made good progress so far

simple flower Apr 24, 2023, 10:33 PM

#

lucid crane sounds good! seems like we made good progress so far

Yeah, I agree! Thank you for your help. I really appreciate it.

lucid crane Apr 24, 2023, 10:36 PM

#

anytime!

simple flower Apr 24, 2023, 10:36 PM

#

🙏🏻

simple flower Apr 25, 2023, 4:43 PM

#

@lucid crane It's looking much more stable for me now: https://github.com/jlongtine/dagger/actions/runs/4791493805/jobs/8540240248

I still have the python failure... gonna dig into that a bit more.

simple flower Apr 25, 2023, 5:47 PM

#

This is a relief: https://github.com/jlongtine/dagger/actions/runs/4800292722

lucid crane Apr 25, 2023, 5:47 PM

#

our long national nightmare is over

simple flower Apr 25, 2023, 5:48 PM

#

https://media.giphy.com/media/McmG2xY1ic369InI6M/giphy.gif

Giphy

#

@lucid crane Should we try putting dnsmasq back into the background? Or leave it in the foreground for now?

lucid crane Apr 25, 2023, 5:51 PM

#

I think it's better off in the foreground, I still don't trust it, and right now the daemonization results in a weird-looking zombie dnsmasq process anyway

#

I'll tidy up my PR and get it ready for merge

simple flower Apr 25, 2023, 5:51 PM

#

lucid crane I'll tidy up my PR and get it ready for merge

🙏🏻

#

I think the python test was spinning up too many simultaneous downloads of the CLI (because it's often on pretty hefty boxes in self-hosted CI).

#

I capped it to 16 connections, and that seems to be stable.

lucid crane Apr 25, 2023, 6:04 PM

#

PR is ready 👍

simple flower Apr 25, 2023, 6:07 PM

#

lucid crane PR is ready 👍

Looks great, @lucid crane

simple flower Apr 25, 2023, 7:04 PM

#

Oh, this is fun... https://github.com/jlongtine/dagger/actions/runs/4801104159/jobs/8542881405#step:8:10130

TestContainerInsecureRootCapabilitesWithService is failing because the docker daemon (which is running in the dev dagger (which is running in the stable dagger...)) is unable to download images, because rate limits...

lucid crane Apr 25, 2023, 8:40 PM

#

maybe @wooden parrot can get us a "DO YOU KNOW WHO I AM?!" token

#Hanging/flaky service tests