#Hanging/flaky service tests
1 messages ยท Page 1 of 1 (latest)
- https://github.com/dagger/dagger/actions/runs/4737450102/jobs/8410261880
- https://github.com/dagger/dagger/actions/runs/4737450102/jobs/8410652380
- https://github.com/dagger/dagger/actions/runs/4737450102/jobs/8411223893
- https://github.com/dagger/dagger/actions/runs/4745232912/jobs/8427248870
- https://github.com/dagger/dagger/actions/runs/4734519287/jobs/8403555343
- https://github.com/dagger/dagger/actions/runs/4724489806/jobs/8381765020
- https://github.com/dagger/dagger/actions/runs/4682202105/jobs/8295745189
- https://github.com/dagger/dagger/actions/runs/4737453452/jobs/8410268881
- https://github.com/dagger/dagger/actions/runs/4737450856/jobs/8410263484
These are weird, @lucid crane. I have to imagine it's something to do with the dagger-in-dagger stuff I did. That's a substantial change. Happy to dig in with you, if that'd be helpful.
That's my suspicion too, since all these tests passed repeatedly locally while I was benchmarking the GraphQL change yesterday, but it just failed on the first run with ./hack/make engine:test, the same way it failed in CI
I'm happy to dive in initially since I'm familiar with the services stack, but I'll keep you in the loop!
Sounds good, @lucid crane!
probably unrelated, just kind of funny: I wonder what permissions this file is ending up with? ๐ (note the lack of octal notation for 755)
Damnit. Don't look at the blame.
๐คฆ๐ปโโ๏ธ
I remember someone pointing out relying on numbers for this API would be a footgun
but ... generally it looks the most familiar
maybe we could raise an error for any value greater than 511 (0777)
I guess one alternative is rwxr-xr-x but who wants to type that
@lucid crane Actually... look at this git blame: https://github.com/dagger/dagger/blame/5b0861e5571f827570ed3b0629d7c95fcf417c13/internal/mage/engine.go#L359
alright you're off the hook
Yeah I've hit this on occasion locally and in CI a few times, this was just earlier today: https://github.com/dagger/dagger/actions/runs/4746265072/jobs/8429673040#step:6:15207
I managed to reproduce it that first time, but now it's passed 14 times in a row ๐ซ
Yeah I haven't noticed any rhyme or reason to it. When it happened to me locally it seemed like either the DNS entry was not found when clients tried connecting or, when it would just hang a long time, the health checks just ran forever.
I have a suspicion it has to do with the synchronizing to dedupe services, since a lot of these tests use the same test service without a cache buster or anything. A lot of moving parts have to align for that to work as expected with high test concurrency
Just a theory at this point
It's interesting that the health checks keep going too until the timeout, since it means the service Evaluate call hasn't returned
Yeah, the fact that we were getting those buildkit scheduler errors before adding the 10 second grace period feels suspicious. Could be unrelated but clearly the scheduler is not handling this scenario perfectly, so wouldn't be surprised if something is still going haywire even if it doesn't result in an error anymore
Might be interesting to try reducing that timeout to see if it makes it more reproducible, if you haven't already
Ah good idea, maybe it happens more often in CI because worse/busier hardware = more likely to hit that timeout
I also wonder if this is passing on repeated runs because a lot of the service test client calls just get cached, maybe my loop needs a prune in the middle
sidenote: would be nice if we had a dagger prune command, I use bass --prune all the time
I love this idea.
hmmm... this looks mighty suspicious:
1871:#15 55.45 time="2023-04-19T21:39:54Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
3646:#15 122.0 time="2023-04-19T21:41:01Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
4707:#15 167.3 time="2023-04-19T21:41:46Z" level=error msg="unable to delete '/var/run/containers/cni/dnsname/dagger-dev/addnhosts.old': \"remove /var/run/containers/cni/dnsname/dagger-dev/addnhosts.old: no such file or directory\""
all of this should be wrapped with a flock already. though maybe it's weird that they're locking the same file that they're atomically replacing 
going to try replacing that with locking a separate regular file instead
FWIW going down to a 1 second timeout didn't seem to help (passed a few times), but when I bumped the grace period to 1000 seconds instead, it seems to have failed on the first try. Not sure if I learned anything there, or if it was just a coincidence
two successful runs so far, no more addnhosts errors in the logs. hopefully this is the root of it; it would make a lot of sense
addnhosts is the file that contains the mapping from hostname to container IP, so not correctly synchronizing would absolutely result in failed lookups
๐ I recall having this conversation with Andrea when implementing that API https://github.com/dagger/dagger/pull/3133/files#r973772676
reason why it works is because strconv.ParseUint("755", 8, 32) works the same for both "0755" and "755" respectively. Back then we decided not implementing a custom FileMode scalar since it wasn't worth it.
Having said that, agree that we should standardize it in the tests ๐
I don't know if we're even doing string parsing for these right now, as far as I can tell it's just an Int all the way from the SDK through GraphQL and into core/: https://github.com/dagger/dagger/blob/3e5992dbc114789bd6bab77d4bc8d417f9350f24/core/schema/directory.go#L140 (ok technically a fs.FileMode but it doesn't define any custom unmarshaling)
so my guess is we literally passed 755 (rather than 493 aka 0755) all the way to wherever the chmod happens, and it worked anyway?
I see.. seems like that part of the code has been refactored indeed to use FileMode now instead of string as before, but for some reason it works ๐
I tried to ask chatgpt about this but it was all high and mighty about "making sure to use octal notation"
Maybe I'll tell it my organization has standardized on decimal
ok, 9 successful runs in a row, opened https://github.com/dagger/dagger/pull/4983 with the (probable) fix
well.. seems to work because 755 decimal casually sets the executable bit to "everyone" (https://go.dev/play/p/U1NLWyk5J7C).. so that makes it work I guess (?). I'll open a PR to fix it ๐
lol, nice. so the owner can execute and write to it, but not read it
(i mean, it's garbage-in garbage-out, just funny to think about)
@simple flower also had a PR for it but it failed, probably because of the services flake (saw it took 17 mins). Just merged the services fix, so whoever rebases first wins ๐
@sleek rose beat me to it.
Yeah totally, that's another thing in the bucket of features that requires an "operator api", which I'm perpetually on the cusp of adding but then realizing there's a simpler way and that it's not worth adding yet. E.g. instead of prune there's docker rm -fv <engine> ๐
We will want official support though someday I'm sure
probably dagger engine prune --size <...> ..., alongside things like dagger engine config --disable-privileged-exec and similar
Yeah, the only kind of annoying thing with having to use docker rm is that the engine name is dynamic, and sometimes there are multiple. I guess it's still a one-liner: docker ps -q --filter name=dagger-engine | xargs docker rm -fv
up until now I've been doing docker exec (docker ps --filter name=dagger-engine.dev -q) buildctl prune which only lets you do one at a time
I wonder if prune makes sense as user-facing vs. operator-facing though. I mean, ideally you never have to run it because of GC/etc., but the cache will still get pretty big, and at some point you might want to just free up a lot of space. I think it might just be an integral part of the experience, to some extent.
Really splitting hairs here because it might just come down to dagger prune vs dagger engine prune, and who cares really, just interested in the discussion. I guess Docker is similar with docker system prune, which I find myself running all the time
@lucid crane I just hit this error, where it looks like dnsmasq is no longer listening? If I'm interpreting it correctly? https://github.com/dagger/dagger/actions/runs/4768865376/jobs/8478663028?pr=5003#step:6:12705
Yeah I think it's probably important to keep a clean separation between clients wanting to open a session with the engine and operators who want to make global changes to the engine. One reason being that in the fullness of time we'd most likely want some sort of auth{n,z} to control who can do what.
Yeah that's really odd. Saw the same thing happen in a recent build. Not sure what would be causing dnsmasq to exit. Unless this is somehow running before it even starts, but that seems unlikely? Haven't looked deeply at the new Dagger-in-Dagger testing flow yet.
That's fair, I'm mainly thinking of the local dev feedback loop case, where it's all just local to your machine and possibly eating up gobs of disk space.
I think we'd want the pruning functionality to be easily within reach for that scenario at least. Realistic example: I'm running Dagger in WSL, have been running builds all day, and now I want to install GTA V which needs ~95GB of space. ๐
Yeah it looks like it's actual engine inside an exec that's getting it, so I can't think of how that could introduce a new race condition relative to running the engine in a different context. And I think we are setting up all the networking stuff before we listen on the grpc address that clients connect to, so ๐คทโโ๏ธ
Oh totally, I think in the default case where we spin up the engine in your local dockerd, then we'd probably treat everyone as an operator. It's more of the self-provisioned "production infra" cases where the distinction probably becomes more important
@elder wren I've been really tempted to just remove dnsmasq from the picture entirely, we're so close to being able to just shove mappings directly into each container's /etc/hosts, which we're already doing for service bindings aliases. 
I was hoping it'd be a lot more stable than this, had my doubts but kept thinking "it's a specialized tool for this one job" but at this point I think it's just down to how many moving parts there are
another option is to have the engine itself serve DNS queries, which would let us keep a bit more of the dynamic behavior and replace a moving part with one we can instrument more easily
@simple flower in the meantime, I'll put together a PR to run dnsmasq in the foreground (rather than deamonizing), write logs directly to stdout/stderr, and scream loudly if it exists, which would hopefully go a long way towards troubleshooting all this
๐๐ป Sounds good! It'd be helpful to know if that's what is causing the flakiness.
Huh. One possible confounder in the k8s environment is that k8s services are in the 10.100.0.0/16 IP range.
I'm gonna change my dev network ip ranges...
hmm that specific address range seems fine but that's definitely worth checking
Actually, I only use that range for the remote cache tests.
it kind of seems like dnsmasq is blowing up though, since it happens pretty far into each of these test runs
yeah, that makes a good deal of sense to.
hopefully I can get these failures to happen on the PR that adds the debug logs, but might be good to merge it anyway for max debuggability
I'll probably cherry pick to my fork, as well.
https://github.com/dagger/dagger/pull/5015 - just waiting on the first builds now to see if it even works
Hmm... well, something broke.
I'm getting hangs on my branch with that commit cherry picked...
In CI, that is.
Looks like it is hanging locally for me, as well.
๐ ๐๐ป
oh, maybe dnsmasq decides not to create a pidfile when you run it in the foreground
Looks like it... https://github.com/jpillora/docker-dnsmasq/issues/11
Debug mode: don't fork to the background, don't write a pid file, don't change user id, generate a complete cache dump on receipt on SIGUSR1, log to stderr as well as syslog, don't fork new processes to handle TCP queries. Note that this option is for use in debugging only, to stop dnsmasq daemonising in production, use --keep-in-foreground.
Maybe --keep-in-foreground is a better bet?
-k, --keep-in-foreground
Do not go into the background at startup but otherwise run as normal. This is intended for use when dnsmasq is run under daemontools or launchd.
nice catch, that seems to prevent it from printing logs though
I guess I could just write to the pidfile myself
Yeah, maybe add?
-8, --log-facility=<facility>
Set the facility to which dnsmasq will send syslog entries, this defaults to DAEMON, and to LOCAL0 when debug mode is in operation. If the facility given contains at least one '/' character, it is taken to be a filename, and dnsmasq logs to the given file, instead of syslog. If the facility is '-' then dnsmasq logs to stderr. (Errors whilst reading configuration will still go to syslog, but all output from a successful startup, and all output whilst running, will go exclusively to the file.) When logging to a file, dnsmasq will close and reopen the file when it receives SIGUSR2. This allows the log file to be rotated without stopping dnsmasq.

nice, seems like that's working
๐
going to just keep rerunning this until something fails
I'll try it on my fork, as well.
@lucid crane I have some failures here: https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518074478
I'm not seeing dnsmasq blow up in the logs I've looked through thus far.
I'll keep looking, though.
hm yeah no smoking gun here, some of this looks like things just being slow. health check for the git daemon took 14+ seconds, eventually succeeded, but then the git operation failed anyway
(which is odd)
I wonder what the load average of this hardware is
I can try to find out.
Are your builds running with the same setup as mine? Haven't been able to get it to fail yet
It's a pretty hefty box. I'm requesting 16 CPUs + 64Gi per job.
My jobs are running on the self-hosted EKS setup.
Is it interesting that the python SDK tests also failed with timeouts? https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518074910
The way it failed doesn't necessarily seem related, just wondering if it points to broader network/resource issues. (I also feel obliged to say "MTU" just in case.)
Yeah, it is odd... I also don't understand that failure... :/
I can see what knobs I have around MTU with EKS.
I think @elder wren was able to hit the dnsmasq issue on regular hardware with this PR ๐ so I'll keep rerunning
AWS_VPC_ENI_MTU is currently set to 9001, which is the default (and max). It can go down to 576.
yep looks like it's detecting that fine
#14 service 17FRCPMJI2GKK
#14 0.191 Connected to engine dagger-engine-4njls
#14 0.208 time="2023-04-24T18:30:12Z" level=info msg="detected mtu 9001 via interface eth1"
I enabled detailed monitoring on the runner node, and it looks like it's peaking at 12.5% CPU utilization (via AWS console)
finally hit a flake, but it doesn't seem relevant: https://github.com/dagger/dagger/actions/runs/4789691840/jobs/8518865411#step:6:13384
Weird. engine-race-detection finished cleanly for me (hasn't happened in a while): https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518527623
That's an odd one.
one of your runs also failed to fetch from github.com?: https://github.com/jlongtine/dagger/actions/runs/4789788929/jobs/8518527771#step:6:16219
Yup...
I guess that could also be attributed to dnsmasq issues, since it takes over DNS for everything. @simple flower could be worth completely disabling services and running a bunch of builds to see if all the problems go away. We still have that feature flag
_EXPERIMENTAL_DAGGER_SERVICES_DNS=0
That'll break any tests that utilize services, right?
Happy to try it
nah they'll detect that env var and skip themselves. just need to set it for both the engine + the tests
Great!
wouldn't be surprised if a test or two was added that depends on services and doesn't check, though. guess we'll find out!
ooo. might have found something
https://github.com/dagger/dagger/actions/runs/4789691840/jobs/8519201073#step:5:8201
#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 18 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="healthcheck completed" actualDuration="710.109ยตs" spanID=f56e23b38bd39822 timeout=30s traceID=5608a742f82e179db56e38bedb42bee3
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace mfneg6mbzjmgcy4p11o0zxnl4"
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="session started" spanID=72568c54215e28ae traceID=a6fade5dcf0682fc36ef0d4803b9416d
#15 188.0 time="2023-04-24T19:29:55Z" level=debug msg="releasing cni network namespace zh7tgf0hzfjc86ktod7qm5l2m"
#15 188.0 dnsmasq[52]: failed to load names from /var/run/containers/cni/dnsname/dagger-dev/addnhosts: No such file or directory
#15 188.0 dnsmasq[52]: read /var/run/containers/cni/dnsname/dagger-dev/addnhosts - 16 addresses
#15 188.0 time="2023-04-24T19:29:55Z" level=error msg="/moby.buildkit.v1.frontend.LLBBridge/Solve returned error: rpc error: code = Unknown desc = failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128"
#15 188.0 failed to load cache key: failed to fetch remote git://1IBEKQRN0F82Q/repo.git: exit status 128
the "addnhosts: No such file or directory" is what seems weird
I see a bunch of those throughout the logs.
time to investigate the locking/atomic write flow again
err..... this seems wrong: it renames the live file to .old, and then writes to the live file path, and then removes the .old file ???
Hmm...
i would have expected writing to a new file, and then renaming it over the old one
That seems like a cleaner way to me.
good thing we already vendored all this ๐
๐
@simple flower pushed a fix, so far so good, just waiting to re-run everything
I could see this fixing quite a few flakes, that's a pretty glaring bug
Nice! I'll pull that commit into my branch in a few. Gonna eat some lunch first.
hit the cache flake again, rerunning: https://github.com/dagger/dagger/actions/runs/4790497346/jobs/8520291839#step:5:16662
@lucid crane That cleared it right up: https://github.com/jlongtine/dagger/actions/runs/4790962570/jobs/8520777220
Still have the python-sdk failure... but I'd bet that's something else.
nice! is it just the one run so far?
i guess even just passing the first time is a good sign?
This run failed... for some reason...
But I haven't had a non-failure in a while... so I'm gonna call it a win. ๐
damn, https://github.com/jlongtine/dagger/actions/runs/4791132585/jobs/8521159840 seems like the same type of failures as before ๐ฆ
3275:2023-04-24T21:17:10.1468342Z #15 73.50 dnsmasq-debug[49]: stopped listening on dagger-dev0(#3): 10.88.0.1 port 53
3276:2023-04-24T21:17:10.1468880Z #15 73.50 dnsmasq-debug[49]: listening on dagger-dev0(#3): 10.88.0.1 port 53
???
(from 'view raw logs' => grep dnsmasq)
Yeah... I'm not sure why... I backed out my services disabling commits (which I'd already reverted)... and cherry picked your code again.
Weird.
Why... did it stop listening?
exactly lol
https://github.com/jlongtine/dagger/actions/runs/4791132585 attempt #3 worked. ๐
i know we send it SIGHUP to signal there are changes to the hosts file, but it can't be that, or it'd be happening much more frequently
i'm at least not seeing 'addnhosts: No such file or directory' anymore - there's only one on boot which is normal. (we should maybe just create an empty file to quiet that down.)
if (option_bool(OPT_CLEVERBIND))
this does not bode well
I don't like that comment...
Garbage-collect listeners listening on addresses that no longer exist.
bind-dynamic?
yep (sorry should have said that)
Not a problem!
just gonna... remove that. and audit all these other settings
๐๐ป
This is interesting, @lucid crane: https://github.com/jlongtine/dagger/actions/runs/4791132585/jobs/8521755702#step:6:12039
#19 294.5 #7 0.248 dnsmasq[45]: warning: interface remote-cache-00 does not currently exist
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 dnsmasq[45]: no servers found in /etc/dnsmasq-resolv.conf, will retry
#19 294.5 #7 0.248 dnsmasq[45]: reading /etc/dnsmasq-resolv.conf
#19 294.5 #7 0.248 dnsmasq[45]: using nameserver 10.88.0.1#53
#19 294.5 #7 0.248 dnsmasq[45]: using only locally-known addresses for remote-cache-0.local
#19 294.5 #7 0.248 time="2023-04-24T21:57:29Z" level=debug msg="setting up engine tracing"
#19 294.5 #7 0.248 dnsmasq[45]: failed to load names from /var/run/containers/cni/dnsname/remote-cache-0/addnhosts: No such file or directory
the 'no servers' and 'addnhosts: No such file' should be fixed by a commit I just pushed (just noisy logs due to the order of events)
first line is interesting, I guess the interface doesn't exist until the CNI plugin runs? will try to track down where that's created
๐๐ป
pushed two commits; first one tidies up the logs I mentioned, second one removes bind-dynamic. waiting on builds now while i investigate
tests passed, linting failed, but I'm going to try another change to just have it listen on 0.0.0.0 - seems a lot simpler than depending on the specific network interface. this just runs in a container anyway.
ok that made everything detonate. going back. I wonder how it actually deals with the interface not existing yet. seems like it was working fine before even without bind-dynamic (...at least for that one run)
This run mostly passed: https://github.com/jlongtine/dagger/actions/runs/4791493805/jobs/8521981444
#19 330.9 === NAME TestContainerImport/Docker
#19 330.9 container_test.go:2679:
#19 330.9 Error Trace: /app/core/integration/container_test.go:2679
#19 330.9 Error: Received unexpected error:
#19 330.9 GET https://index.docker.io/v2/library/alpine/manifests/sha256:1304f174557314a7ed9eddb4eab12fed12cb0cd9809e4c28f29af86979a3c870: TOOMANYREQUESTS: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#19 330.9 Test: TestContainerImport/Docker
#19 330.9 --- FAIL: TestContainerImport (22.39s)
#19 330.9 --- PASS: TestContainerImport/OCI (11.22s)
#19 330.9 --- FAIL: TestContainerImport/Docker (0.30s)
lol ๐ญ
Ok. Gotta go get ready for Leander's soccer practice.
sounds good! seems like we made good progress so far
Yeah, I agree! Thank you for your help. I really appreciate it.
anytime!
๐๐ป
@lucid crane It's looking much more stable for me now: https://github.com/jlongtine/dagger/actions/runs/4791493805/jobs/8540240248
I still have the python failure... gonna dig into that a bit more.
This is a relief: https://github.com/jlongtine/dagger/actions/runs/4800292722
our long national nightmare is over
@lucid crane Should we try putting dnsmasq back into the background? Or leave it in the foreground for now?
I think it's better off in the foreground, I still don't trust it, and right now the daemonization results in a weird-looking zombie dnsmasq process anyway
I'll tidy up my PR and get it ready for merge
๐๐ป
I think the python test was spinning up too many simultaneous downloads of the CLI (because it's often on pretty hefty boxes in self-hosted CI).
I capped it to 16 connections, and that seems to be stable.
PR is ready ๐
Looks great, @lucid crane
Oh, this is fun... https://github.com/jlongtine/dagger/actions/runs/4801104159/jobs/8542881405#step:8:10130
TestContainerInsecureRootCapabilitesWithService is failing because the docker daemon (which is running in the dev dagger (which is running in the stable dagger...)) is unable to download images, because rate limits...
maybe @wooden parrot can get us a "DO YOU KNOW WHO I AM?!" token