Two workflow are timing out constantly | Dagger | Page 1

still hawk Aug 20, 2024, 11:57 PM

#

These failures are all new looking, not any of the flakes we've been dealing with for a while. I'm looking through the recent merges to main to see where they came from

#

I first see these failures on main after https://github.com/dagger/dagger/pull/8191 was merged cc @jaunty hamlet I may try sending out a revert PR quick to confirm it fixes CI

GitHub

tests: try killing subprocesses harder if they misbehave by jedevc ...

Part of #8184, and split out of #8168, since my general approach there turns out to not be possible 😢
Somehow, we see weird timeouts because calls to the dagger cli aren't properly exiting ...

still hawk Aug 21, 2024, 12:51 AM

#

I got a clean CI run of the tests with that reverted: https://github.com/dagger/dagger/pull/8196

Will re-roll a few more times to confirm

GitHub

Revert "tests: try killing subprocesses harder if they misbehave" b...

#8191 may be causing a lot of failures on main and PRs based off main. Not 100% sure yet but they seem to have started with that PR being merged. Confirming by trying a revert here. Example failure...

cloud trail Aug 21, 2024, 1:15 AM

#

Interesting... I am keen to rebase my PR once we find the culprit!

jaunty hamlet Aug 21, 2024, 12:47 PM

#

still hawk I first see these failures on main after https://github.com/dagger/dagger/pull/8...

replied there, the massive number of failures isn't "legitimate", it appears there's a race condition somewhere

#

however, i have no idea, since the stack trace is almost nonsense

#

so approved, but hm, the stack traces actually point more towards a potential flake in https://github.com/dagger/dagger/pull/7708 maybe

still hawk Aug 21, 2024, 9:33 PM

#

cloud trail Interesting... I am keen to rebase my PR once we find the culprit!

Just merged that one, so should be improved once you rebase

still hawk Aug 21, 2024, 9:34 PM

#

jaunty hamlet replied there, the massive number of failures isn't "legitimate", it appears the...

Yeah at first glance I don't get it yet either, will do another pass quick to see if I can figure it out

still hawk Aug 21, 2024, 10:28 PM

#

I think I understand the first stack trace, opened a separate issue for that: https://github.com/dagger/dagger/issues/8204

But 100% agree the second one fundamentally makes no sense. The fact that one of the lines of the stack trace is a field in a struct genuinely seems like it could be a bug in go's stack trace printing logic or something. It's hard to imagine how that could ever be correct. I searched the go github issues for a bit but didn't find anything matching (hard to search for though, so who knows)

cloud trail Aug 21, 2024, 10:29 PM

#

I rebased, the same tests are still failing: https://github.com/dagger/dagger/actions/runs/10497976176/job/29082594179?pr=8171 - I ran it twice to make sure

GitHub

feat: implements --interactive-command for overriding the default c...

An engine to run your pipelines in containers. Contribute to dagger/dagger development by creating an account on GitHub.

still hawk Aug 21, 2024, 10:33 PM

#

https://tenor.com/view/confused-confused-math-gif-26401692

Tenor

#

The failures were consistently there before the revert. After the revert they were consistently gone in the PR for 3 runs. On main all the tests passed too after the revert was merged.

Now they are still there in that PR...

I'm guessing that since the failures are in the race detector, which is best-effort, we just so happen to not hit them in the current state of main (with the revert in place), but changing anything tweaks timing enough to hit them again??

#

I'll see if this happen locally for me on your PR Sam

still hawk Aug 21, 2024, 10:55 PM

#

Doesn't happen locally, do you mind if I push commits to your PR @cloud trail? Since it seems to be the place to consistently repro this at the moment..

#

I just need to test fixes for those race conditions there

cloud trail Aug 21, 2024, 10:55 PM

#

np, feel free

still hawk Aug 22, 2024, 12:10 AM

#

omg okay this finally makes sense to me

@cloud trail there actually was a bug in in your PR that was causing the go race detector to trigger. I pushed the fix here (just deletes unneeded line of code from the test): https://github.com/dagger/dagger/pull/8171/commits/d1078624b6f343e959a4f9d6a8b07b11c85b97e8

The reason this was so confusing is that:

When the race detector triggers, everything gets cancelled so you get tons of errors all over the place in unrelated tests
Thus, the failures looked pretty much the same before (prior to the revert of 8191) and after (with the revert), just tons of random test failures everywhere
It's only once both the revert is in place and the new race condition introduced in the PR's test were fixed, that everything was okay

Also Justin, the incomprehensible stack trace from the race detector was because you and I were looking at the test implementation on main; in the PR that file got changed and the stack trace actually made perfect sense 😄

The biggest overall problem here was that when a test failed due to the race detection, it's almost impossible to tell what's happening in the cloud traces. You just see red everywhere and it's extremely difficult to find the actual race detector stack trace

#

Either way you're all green now Sam, I'll merge it

cloud trail Aug 22, 2024, 12:14 AM

#

shit 🫣

#

thanks @still hawk

still hawk Aug 22, 2024, 12:15 AM

#

No problem at all, it was impossible to tell what was happening. And I think the revert of the earlier commit was still needed (but worth confirmation from Justin tomorrow)

cloud trail Aug 22, 2024, 12:16 AM

#

I was confused, because I could run the test independently just fine. I can say I made that mistake on purpose so we can fix our CI (with something completely unrelated, so smart)

still hawk Aug 22, 2024, 12:18 AM

#

cloud trail I was confused, because I could run the test independently just fine. I can say ...

The other thing I realized is that I couldn't repro the problem locally because I forgot to include -race facepalm

Once I did dagger call --source=.:default test specific --pkg="./core/integration" --run="TestModule/TestDaggerTerminal" --race=true locally, I actually could repro the failure locally. Easy to miss...

cloud trail Aug 22, 2024, 12:18 AM

#

ah ok that makes sense, I did not have race either

jaunty hamlet Aug 22, 2024, 11:01 AM

#

still hawk No problem at all, it was impossible to tell what was happening. And I *think* t...

hm

#

i wonder if the race condition comes from trying to print more logs in cleanup? which is conflicting with the thing you noticed in https://github.com/dagger/dagger/issues/8204

#

i'm not gonna open a revert-revert pr yet, i've not seen that weird timeout again

#

so mayyybe it's all okay

#Two workflow are timing out constantly