#Hanging/flaky service tests

1 messages ยท Page 1 of 1 (latest)

simple flower
#

These are weird, @lucid crane. I have to imagine it's something to do with the dagger-in-dagger stuff I did. That's a substantial change. Happy to dig in with you, if that'd be helpful.

lucid crane
#

That's my suspicion too, since all these tests passed repeatedly locally while I was benchmarking the GraphQL change yesterday, but it just failed on the first run with ./hack/make engine:test, the same way it failed in CI

#

I'm happy to dive in initially since I'm familiar with the services stack, but I'll keep you in the loop!

simple flower
#

Sounds good, @lucid crane!

lucid crane
#

probably unrelated, just kind of funny: I wonder what permissions this file is ending up with? ๐Ÿ˜‚ (note the lack of octal notation for 755)

simple flower
lucid crane
#

too late

#

I mean, it works

simple flower
#

๐Ÿคฆ๐Ÿปโ€โ™‚๏ธ

lucid crane
#

I remember someone pointing out relying on numbers for this API would be a footgun

#

but ... generally it looks the most familiar

#

maybe we could raise an error for any value greater than 511 (0777)

#

I guess one alternative is rwxr-xr-x but who wants to type that

lucid crane
#

alright you're off the hook

simple flower
#

๐Ÿ˜‰

#

I wish I'd caught that.

elder wren
lucid crane
#

I managed to reproduce it that first time, but now it's passed 14 times in a row ๐Ÿซ 

elder wren
lucid crane
#

I have a suspicion it has to do with the synchronizing to dedupe services, since a lot of these tests use the same test service without a cache buster or anything. A lot of moving parts have to align for that to work as expected with high test concurrency

#

Just a theory at this point

#

It's interesting that the health checks keep going too until the timeout, since it means the service Evaluate call hasn't returned

elder wren
#

Might be interesting to try reducing that timeout to see if it makes it more reproducible, if you haven't already

lucid crane
#

Ah good idea, maybe it happens more often in CI because worse/busier hardware = more likely to hit that timeout

#

I also wonder if this is passing on repeated runs because a lot of the service test client calls just get cached, maybe my loop needs a prune in the middle

#

sidenote: would be nice if we had a dagger prune command, I use bass --prune all the time

lucid crane
#

hmmm... this looks mighty suspicious:

1871:#15 55.45 time="2023-04-19T21:39:54Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
3646:#15 122.0 time="2023-04-19T21:41:01Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
4707:#15 167.3 time="2023-04-19T21:41:46Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
#

all of this should be wrapped with a flock already. though maybe it's weird that they're locking the same file that they're atomically replacing thinkies

#

going to try replacing that with locking a separate regular file instead

#

FWIW going down to a 1 second timeout didn't seem to help (passed a few times), but when I bumped the grace period to 1000 seconds instead, it seems to have failed on the first try. Not sure if I learned anything there, or if it was just a coincidence

lucid crane
#

two successful runs so far, no more addnhosts errors in the logs. hopefully this is the root of it; it would make a lot of sense

#

addnhosts is the file that contains the mapping from hostname to container IP, so not correctly synchronizing would absolutely result in failed lookups

sleek rose
# lucid crane I remember someone pointing out relying on numbers for this API would be a footg...

๐Ÿ™ˆ I recall having this conversation with Andrea when implementing that API https://github.com/dagger/dagger/pull/3133/files#r973772676

reason why it works is because strconv.ParseUint("755", 8, 32) works the same for both "0755" and "755" respectively. Back then we decided not implementing a custom FileMode scalar since it wasn't worth it.

Having said that, agree that we should standardize it in the tests ๐Ÿ™

lucid crane
#

so my guess is we literally passed 755 (rather than 493 aka 0755) all the way to wherever the chmod happens, and it worked anyway?

sleek rose
lucid crane
#

I tried to ask chatgpt about this but it was all high and mighty about "making sure to use octal notation"

#

Maybe I'll tell it my organization has standardized on decimal

sleek rose
lucid crane
#

lol, nice. so the owner can execute and write to it, but not read it wat (i mean, it's garbage-in garbage-out, just funny to think about)

lucid crane
#

@simple flower also had a PR for it but it failed, probably because of the services flake (saw it took 17 mins). Just merged the services fix, so whoever rebases first wins ๐Ÿ˜›

elder wren
#

probably dagger engine prune --size <...> ..., alongside things like dagger engine config --disable-privileged-exec and similar

lucid crane
#

Yeah, the only kind of annoying thing with having to use docker rm is that the engine name is dynamic, and sometimes there are multiple. I guess it's still a one-liner: docker ps -q --filter name=dagger-engine | xargs docker rm -fv

#

up until now I've been doing docker exec (docker ps --filter name=dagger-engine.dev -q) buildctl prune which only lets you do one at a time

#

I wonder if prune makes sense as user-facing vs. operator-facing though. I mean, ideally you never have to run it because of GC/etc., but the cache will still get pretty big, and at some point you might want to just free up a lot of space. I think it might just be an integral part of the experience, to some extent.

#

Really splitting hairs here because it might just come down to dagger prune vs dagger engine prune, and who cares really, just interested in the discussion. I guess Docker is similar with docker system prune, which I find myself running all the time

elder wren
elder wren
lucid crane
lucid crane
#

I think we'd want the pruning functionality to be easily within reach for that scenario at least. Realistic example: I'm running Dagger in WSL, have been running builds all day, and now I want to install GTA V which needs ~95GB of space. ๐Ÿ˜›

elder wren
elder wren
lucid crane
#

@elder wren I've been really tempted to just remove dnsmasq from the picture entirely, we're so close to being able to just shove mappings directly into each container's /etc/hosts, which we're already doing for service bindings aliases. thinkspin

#

I was hoping it'd be a lot more stable than this, had my doubts but kept thinking "it's a specialized tool for this one job" but at this point I think it's just down to how many moving parts there are

#

another option is to have the engine itself serve DNS queries, which would let us keep a bit more of the dynamic behavior and replace a moving part with one we can instrument more easily

lucid crane
#

@simple flower in the meantime, I'll put together a PR to run dnsmasq in the foreground (rather than deamonizing), write logs directly to stdout/stderr, and scream loudly if it exists, which would hopefully go a long way towards troubleshooting all this

simple flower
#

Huh. One possible confounder in the k8s environment is that k8s services are in the 10.100.0.0/16 IP range.

#

I'm gonna change my dev network ip ranges...

lucid crane
#

hmm that specific address range seems fine but that's definitely worth checking

simple flower
#

Actually, I only use that range for the remote cache tests.

lucid crane
#

it kind of seems like dnsmasq is blowing up though, since it happens pretty far into each of these test runs

simple flower
lucid crane
#

hopefully I can get these failures to happen on the PR that adds the debug logs, but might be good to merge it anyway for max debuggability

simple flower
#

I'll probably cherry pick to my fork, as well.

lucid crane
simple flower
#

Hmm... well, something broke.

#

I'm getting hangs on my branch with that commit cherry picked...

#

In CI, that is.

#

Looks like it is hanging locally for me, as well.

lucid crane
#

welp yeah that didn't work at all

#

I'll kick the tires locally first this time haha

simple flower
#

๐Ÿ™‚ ๐Ÿ‘๐Ÿป

lucid crane
#

oh, maybe dnsmasq decides not to create a pidfile when you run it in the foreground

simple flower
#

Debug mode: don't fork to the background, don't write a pid file, don't change user id, generate a complete cache dump on receipt on SIGUSR1, log to stderr as well as syslog, don't fork new processes to handle TCP queries. Note that this option is for use in debugging only, to stop dnsmasq daemonising in production, use --keep-in-foreground.

#

Maybe --keep-in-foreground is a better bet?

#

-k, --keep-in-foreground
Do not go into the background at startup but otherwise run as normal. This is intended for use when dnsmasq is run under daemontools or launchd.

lucid crane
#

nice catch, that seems to prevent it from printing logs though

#

I guess I could just write to the pidfile myself

simple flower
# lucid crane nice catch, that seems to prevent it from printing logs though

Yeah, maybe add?

-8, --log-facility=<facility>
Set the facility to which dnsmasq will send syslog entries, this defaults to DAEMON, and to LOCAL0 when debug mode is in operation. If the facility given contains at least one '/' character, it is taken to be a filename, and dnsmasq logs to the given file, instead of syslog. If the facility is '-' then dnsmasq logs to stderr. (Errors whilst reading configuration will still go to syslog, but all output from a successful startup, and all output whilst running, will go exclusively to the file.) When logging to a file, dnsmasq will close and reopen the file when it receives SIGUSR2. This allows the log file to be rotated without stopping dnsmasq.

lucid crane
simple flower
lucid crane
#

nice, seems like that's working

simple flower
#

๐ŸŽ‰

lucid crane
#

going to just keep rerunning this until something fails

simple flower
#

I'll try it on my fork, as well.

#

I'm not seeing dnsmasq blow up in the logs I've looked through thus far.

#

I'll keep looking, though.

lucid crane
#

hm yeah no smoking gun here, some of this looks like things just being slow. health check for the git daemon took 14+ seconds, eventually succeeded, but then the git operation failed anyway

#

(which is odd)

#

I wonder what the load average of this hardware is

simple flower
#

I can try to find out.

lucid crane
#

Are your builds running with the same setup as mine? Haven't been able to get it to fail yet

simple flower
#

It's a pretty hefty box. I'm requesting 16 CPUs + 64Gi per job.

#

My jobs are running on the self-hosted EKS setup.

lucid crane
#

The way it failed doesn't necessarily seem related, just wondering if it points to broader network/resource issues. (I also feel obliged to say "MTU" just in case.)

simple flower
#

Yeah, it is odd... I also don't understand that failure... :/

#

I can see what knobs I have around MTU with EKS.

lucid crane
simple flower
lucid crane
#

yep looks like it's detecting that fine

#14 service 17FRCPMJI2GKK
#14 0.191 Connected to engine dagger-engine-4njls
#14 0.208 time="2023-04-24T18:30:12Z" level=info msg="detected mtu 9001 via interface eth1"
simple flower
lucid crane
simple flower
lucid crane
lucid crane
#

I guess that could also be attributed to dnsmasq issues, since it takes over DNS for everything. @simple flower could be worth completely disabling services and running a bunch of builds to see if all the problems go away. We still have that feature flag

#

_EXPERIMENTAL_DAGGER_SERVICES_DNS=0

simple flower
#

Happy to try it

lucid crane
#

nah they'll detect that env var and skip themselves. just need to set it for both the engine + the tests

simple flower
#

Great!

lucid crane
#

wouldn't be surprised if a test or two was added that depends on services and doesn't check, though. guess we'll find out!

#

ooo. might have found something

https://github.com/dagger/dagger/actions/runs/4789691840/jobs/8519201073#step:5:8201

#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 18 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="healthcheck completed" actualDuration="710.109ยตs" spanID=f56e23b38bd39822 timeout=30s traceID=5608a742f82e179db56e38bedb42bee3
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace mfneg6mbzjmgcy4p11o0zxnl4"
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="session started" spanID=72568c54215e28ae traceID=a6fade5dcf0682fc36ef0d4803b9416d
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace zh7tgf0hzfjc86ktod7qm5l2m"
#15 188.0 dnsmasq[52]: failed to load names from /var/run/containers/cni/dnsname/dagger-dev/addnhosts: No such file or directory
#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 16 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=error msg="/moby.buildkit.v1.frontend.LLBBridge/Solve returned error: rpc error: code = Unknown desc = failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128"
#15 188.0 failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128
#

the "addnhosts: No such file or directory" is what seems weird

simple flower
#

I see a bunch of those throughout the logs.

lucid crane
#

time to investigate the locking/atomic write flow again

#

err..... this seems wrong: it renames the live file to .old, and then writes to the live file path, and then removes the .old file ???

simple flower
#

Hmm...

lucid crane
#

i would have expected writing to a new file, and then renaming it over the old one

simple flower
#

That seems like a cleaner way to me.

lucid crane
#

good thing we already vendored all this ๐Ÿ˜…

simple flower
#

๐Ÿ˜›

lucid crane
#

@simple flower pushed a fix, so far so good, just waiting to re-run everything

#

I could see this fixing quite a few flakes, that's a pretty glaring bug

simple flower
#

Nice! I'll pull that commit into my branch in a few. Gonna eat some lunch first.

lucid crane
simple flower
#

Still have the python-sdk failure... but I'd bet that's something else.

lucid crane
#

i guess even just passing the first time is a good sign?

simple flower
#

This run failed... for some reason...

#

But I haven't had a non-failure in a while... so I'm gonna call it a win. ๐Ÿ™‚

lucid crane
#
3275:2023-04-24T21:17:10.1468342Z #15 73.50 dnsmasq-debug[49]: stopped listening on dagger-dev0(#3): 10.88.0.1 port 53
3276:2023-04-24T21:17:10.1468880Z #15 73.50 dnsmasq-debug[49]: listening on dagger-dev0(#3): 10.88.0.1 port 53

???

#

(from 'view raw logs' => grep dnsmasq)

simple flower
#

Yeah... I'm not sure why... I backed out my services disabling commits (which I'd already reverted)... and cherry picked your code again.

#

Weird.

#

Why... did it stop listening?

lucid crane
#

exactly lol

simple flower
lucid crane
#

i know we send it SIGHUP to signal there are changes to the hosts file, but it can't be that, or it'd be happening much more frequently

#

i'm at least not seeing 'addnhosts: No such file or directory' anymore - there's only one on boot which is normal. (we should maybe just create an empty file to quiet that down.)

#
  if (option_bool(OPT_CLEVERBIND))

this does not bode well

simple flower
#

I don't like that comment...

#

Garbage-collect listeners listening on addresses that no longer exist.

lucid crane
#

lol, we do enable it in the dnsmasq config

#

it's been there since the initial commit

simple flower
#

bind-dynamic?

lucid crane
#

yep (sorry should have said that)

simple flower
#

Not a problem!

lucid crane
#

just gonna... remove that. and audit all these other settings

simple flower
#

๐Ÿ‘๐Ÿป

#

This is interesting, @lucid crane: https://github.com/jlongtine/dagger/actions/runs/4791132585/jobs/8521755702#step:6:12039

#19 294.5 #7 0.248 dnsmasq[45]: warning: interface remote-cache-00 does not currently exist
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 dnsmasq[45]: no servers found in /etc/dnsmasq-resolv.conf, will retry
#19 294.5 #7 0.248 dnsmasq[45]: reading /etc/dnsmasq-resolv.conf
#19 294.5 #7 0.248 dnsmasq[45]: using nameserver 10.88.0.1#53
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 time="2023-04-24T21:57:29Z" level=debug msg="setting up engine tracing"
#19 294.5 #7 0.248 dnsmasq[45]: failed to load names from /var/run/containers/cni/dnsname/remote-cache-0/addnhosts: No such file or directory
lucid crane
#

the 'no servers' and 'addnhosts: No such file' should be fixed by a commit I just pushed (just noisy logs due to the order of events)

#

first line is interesting, I guess the interface doesn't exist until the CNI plugin runs? will try to track down where that's created

simple flower
#

๐Ÿ‘๐Ÿป

lucid crane
#

pushed two commits; first one tidies up the logs I mentioned, second one removes bind-dynamic. waiting on builds now while i investigate

#

tests passed, linting failed, but I'm going to try another change to just have it listen on 0.0.0.0 - seems a lot simpler than depending on the specific network interface. this just runs in a container anyway.

#

ok that made everything detonate. going back. I wonder how it actually deals with the interface not existing yet. seems like it was working fine before even without bind-dynamic (...at least for that one run)

simple flower
#
#19 330.9 === NAME  TestContainerImport/Docker
#19 330.9     container_test.go:2679: 
#19 330.9             Error Trace:    /app/core/integration/container_test.go:2679
#19 330.9             Error:          Received unexpected error:
#19 330.9                             GET https://index.docker.io/v2/library/alpine/manifests/sha256:1304f174557314a7ed9eddb4eab12fed12cb0cd9809e4c28f29af86979a3c870: TOOMANYREQUESTS: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#19 330.9             Test:           TestContainerImport/Docker
#19 330.9 --- FAIL: TestContainerImport (22.39s)
#19 330.9     --- PASS: TestContainerImport/OCI (11.22s)
#19 330.9     --- FAIL: TestContainerImport/Docker (0.30s)
lucid crane
#

lol ๐Ÿ˜ญ

simple flower
#

Ok. Gotta go get ready for Leander's soccer practice.

lucid crane
#

sounds good! seems like we made good progress so far

simple flower
lucid crane
#

anytime!

simple flower
#

๐Ÿ™๐Ÿป

simple flower
simple flower
lucid crane
#

our long national nightmare is over

simple flower
#

@lucid crane Should we try putting dnsmasq back into the background? Or leave it in the foreground for now?

lucid crane
#

I think it's better off in the foreground, I still don't trust it, and right now the daemonization results in a weird-looking zombie dnsmasq process anyway

#

I'll tidy up my PR and get it ready for merge

simple flower
#

I think the python test was spinning up too many simultaneous downloads of the CLI (because it's often on pretty hefty boxes in self-hosted CI).

#

I capped it to 16 connections, and that seems to be stable.

lucid crane
#

PR is ready ๐Ÿ‘

simple flower
simple flower
lucid crane
#

maybe @wooden parrot can get us a "DO YOU KNOW WHO I AM?!" token