#Two workflow are timing out constantly

1 messages · Page 1 of 1 (latest)

still hawk
#

These failures are all new looking, not any of the flakes we've been dealing with for a while. I'm looking through the recent merges to main to see where they came from

still hawk
cloud trail
#

Interesting... I am keen to rebase my PR once we find the culprit!

jaunty hamlet
#

however, i have no idea, since the stack trace is almost nonsense

still hawk
still hawk
still hawk
#

I think I understand the first stack trace, opened a separate issue for that: https://github.com/dagger/dagger/issues/8204

But 100% agree the second one fundamentally makes no sense. The fact that one of the lines of the stack trace is a field in a struct genuinely seems like it could be a bug in go's stack trace printing logic or something. It's hard to imagine how that could ever be correct. I searched the go github issues for a bit but didn't find anything matching (hard to search for though, so who knows)

cloud trail
still hawk
#

The failures were consistently there before the revert. After the revert they were consistently gone in the PR for 3 runs. On main all the tests passed too after the revert was merged.

Now they are still there in that PR...

I'm guessing that since the failures are in the race detector, which is best-effort, we just so happen to not hit them in the current state of main (with the revert in place), but changing anything tweaks timing enough to hit them again??

#

I'll see if this happen locally for me on your PR Sam

still hawk
#

Doesn't happen locally, do you mind if I push commits to your PR @cloud trail? Since it seems to be the place to consistently repro this at the moment..

#

I just need to test fixes for those race conditions there

cloud trail
#

np, feel free

still hawk
#

omg okay this finally makes sense to me

@cloud trail there actually was a bug in in your PR that was causing the go race detector to trigger. I pushed the fix here (just deletes unneeded line of code from the test): https://github.com/dagger/dagger/pull/8171/commits/d1078624b6f343e959a4f9d6a8b07b11c85b97e8

The reason this was so confusing is that:

  1. When the race detector triggers, everything gets cancelled so you get tons of errors all over the place in unrelated tests
  2. Thus, the failures looked pretty much the same before (prior to the revert of 8191) and after (with the revert), just tons of random test failures everywhere
  3. It's only once both the revert is in place and the new race condition introduced in the PR's test were fixed, that everything was okay

Also Justin, the incomprehensible stack trace from the race detector was because you and I were looking at the test implementation on main; in the PR that file got changed and the stack trace actually made perfect sense 😄

The biggest overall problem here was that when a test failed due to the race detection, it's almost impossible to tell what's happening in the cloud traces. You just see red everywhere and it's extremely difficult to find the actual race detector stack trace

#

Either way you're all green now Sam, I'll merge it

cloud trail
#

shit 🫣

#

thanks @still hawk

still hawk
#

No problem at all, it was impossible to tell what was happening. And I think the revert of the earlier commit was still needed (but worth confirmation from Justin tomorrow)

cloud trail
#

I was confused, because I could run the test independently just fine. I can say I made that mistake on purpose so we can fix our CI (with something completely unrelated, so smart)

still hawk
cloud trail
#

ah ok that makes sense, I did not have race either

jaunty hamlet
#

i'm not gonna open a revert-revert pr yet, i've not seen that weird timeout again

#

so mayyybe it's all okay