#Diving into some CI failures 🧵
@brittle pike @drowsy junco
I'm still seeing 30min timeout hangs even after our work in https://github.com/dagger/dagger/pull/8155 https://github.com/dagger/dagger/pull/8157
these newer hangs seem to be service related:
- here's one that's in TestStartServices: https://dagger.cloud/dagger/traces/b2c19b96dd8a2033bb183fa68272f258?span=fc65693b09d02491#b7b8f75fe891fa78
- here's another in TestDaggerUp: https://dagger.cloud/dagger/traces/3e12c1a3aa6b40119ab74e136dad9e50?span=1e20b5449908ff06#11150cfb0ad2dc77
hm, TestDaggerUp should have been interrupted, but it's not hanging up
something seems odd in our signal handling
merged https://github.com/dagger/dagger/pull/8167, so at least we're consistently doing the new otel for dagger-in-dagger
EDIT: this doesn't seem to have an effect, still seeing the hang
making some progress in https://github.com/dagger/dagger/pull/8168
test repetitions are possible but tricky
the problem is, without entirely forking go's test framework, we can't really "unfail" a test once it's failed (mayyyybe we could defer the failure? but this feels like it could create some odd issues)
so essentially, if a test would fail but we can repeat it, we hackily skip it instead, and then do another run of it
this looks something like:
I think this is alright? i can attach a skip reason in there, but imo, this approach is the only reasonable way of doing it in a way that doesn't fork the whole test package (which bleh, maybe we'll do one day)
will keep messing away at this, since getting the flakes undone is a pretty high priority item (but also need to timebox this, so that i can do some code reviews/etc)
we also are seeing increased rates of https://github.com/dagger/dagger/issues/8031#issuecomment-2260352601 recently
i'm not quite sure how this happened, it's possible that https://github.com/dagger/dagger/pull/8095 changed a lot of initialization things in unexpected ways - that would be surprising though, but i really can't see anything else in that kind of area
i'll also attempt to dive into `cannot find default codegen DaggerObject interface` in more detail, the error message is not very clear here, and if we're hitting this something is going quite badly wrong
@brittle pike would appreciate early feedback on whether this approach seems fine for analyzing test results through honeycomb/etc? or would there be some issue doing analysis, i'd like to not break how that's being done right now
@unreal lake gotestsum has a --rerun-fails if we want to just switch back to it
This is what I was afraid of haha
Has to be done from the outer runner I think
okay, i'm actually happy with the current impl
it lets each test write t.Retry(<number of attempts>)
so we can just mark each flake as we need to
kinda neat, i'd rather have something more minimal like that, since rerunning all tests feels like it could also hide future flakes
Looking now.
for sure! i like your approach
OK, just finished going over the code changes. The TestCtxSuite looks good too!
As long as `dagger call ... test all` returns a zero exit status, even if tests within it failed, got retried & succeeded, the run will be considered successful. So that part is fine for reporting purposes.
What I don't know is whether a flaking test could still trigger failures in all the other tests. Here is one such example from past runs: https://dagger.cloud/dagger/traces/4405587cb87a0d3af873130c2fe8704f?span=635ba6d56d420c7a
Those are more rare, so the overall reliability should improve. What I expect to get longer and more variable is the overall duration. I think that's fine in the short-term, especially if the more important dimension (reliability) is improving and holding steady.
I am currently running `dagger call --source=.:default test custom --run=TestModule` locally using your PR. I expect it to time out within 30 minutes. At the same time, I am going to provision a remote beefy machine and try this there.
gonna pick this back up on monday, i'm not feeling at 100%, and can't manage to focus