#Two workflow are timing out constantly
1 messages · Page 1 of 1 (latest)
These failures are all new looking, not any of the flakes we've been dealing with for a while. I'm looking through the recent merges to main to see where they came from
I first see these failures on main after https://github.com/dagger/dagger/pull/8191 was merged cc @jaunty hamlet I may try sending out a revert PR quick to confirm it fixes CI
Part of #8184, and split out of #8168, since my general approach there turns out to not be possible 😢
Somehow, we see weird timeouts because calls to the dagger cli aren't properly exiting ...
I got a clean CI run of the tests with that reverted: https://github.com/dagger/dagger/pull/8196
Will re-roll a few more times to confirm
Interesting... I am keen to rebase my PR once we find the culprit!
replied there, the massive number of failures isn't "legitimate", it appears there's a race condition somewhere
however, i have no idea, since the stack trace is almost nonsense
so approved, but hm, the stack traces actually point more towards a potential flake in https://github.com/dagger/dagger/pull/7708 maybe
Just merged that one, so should be improved once you rebase
Yeah at first glance I don't get it yet either, will do another pass quick to see if I can figure it out
I think I understand the first stack trace, opened a separate issue for that: https://github.com/dagger/dagger/issues/8204
But 100% agree the second one fundamentally makes no sense. The fact that one of the lines of the stack trace is a field in a struct genuinely seems like it could be a bug in go's stack trace printing logic or something. It's hard to imagine how that could ever be correct. I searched the go github issues for a bit but didn't find anything matching (hard to search for though, so who knows)
I rebased, the same tests are still failing: https://github.com/dagger/dagger/actions/runs/10497976176/job/29082594179?pr=8171 - I ran it twice to make sure
The failures were consistently there before the revert. After the revert they were consistently gone in the PR for 3 runs. On main all the tests passed too after the revert was merged.
Now they are still there in that PR...
I'm guessing that since the failures are in the race detector, which is best-effort, we just so happen to not hit them in the current state of main (with the revert in place), but changing anything tweaks timing enough to hit them again??
I'll see if this happen locally for me on your PR Sam
Doesn't happen locally, do you mind if I push commits to your PR @cloud trail? Since it seems to be the place to consistently repro this at the moment..
I just need to test fixes for those race conditions there
np, feel free
omg okay this finally makes sense to me
@cloud trail there actually was a bug in in your PR that was causing the go race detector to trigger. I pushed the fix here (just deletes unneeded line of code from the test): https://github.com/dagger/dagger/pull/8171/commits/d1078624b6f343e959a4f9d6a8b07b11c85b97e8
The reason this was so confusing is that:
- When the race detector triggers, everything gets cancelled so you get tons of errors all over the place in unrelated tests
- Thus, the failures looked pretty much the same before (prior to the revert of 8191) and after (with the revert), just tons of random test failures everywhere
- It's only once both the revert is in place and the new race condition introduced in the PR's test were fixed, that everything was okay
Also Justin, the incomprehensible stack trace from the race detector was because you and I were looking at the test implementation on main; in the PR that file got changed and the stack trace actually made perfect sense 😄
The biggest overall problem here was that when a test failed due to the race detection, it's almost impossible to tell what's happening in the cloud traces. You just see red everywhere and it's extremely difficult to find the actual race detector stack trace
Either way you're all green now Sam, I'll merge it
No problem at all, it was impossible to tell what was happening. And I think the revert of the earlier commit was still needed (but worth confirmation from Justin tomorrow)
I was confused, because I could run the test independently just fine. I can say I made that mistake on purpose so we can fix our CI (with something completely unrelated, so smart)
The other thing I realized is that I couldn't repro the problem locally because I forgot to include -race 
Once I did dagger call --source=.:default test specific --pkg="./core/integration" --run="TestModule/TestDaggerTerminal" --race=true locally, I actually could repro the failure locally. Easy to miss...
ah ok that makes sense, I did not have race either
hm
i wonder if the race condition comes from trying to print more logs in cleanup? which is conflicting with the thing you noticed in https://github.com/dagger/dagger/issues/8204
i'm not gonna open a revert-revert pr yet, i've not seen that weird timeout again
so mayyybe it's all okay