#@Win thanks for looking at the elixir
This is my guess. It's because we use the same cache key across 3 SDKs in the integration tests. Those tests mount the cache in SHARED mode, which is the default. With your PR, everything runs simultaneously, which causes the tests to fail because of the cache behaviour.
So, I don't think the YAML configs on that PR do anything wrong.
Ah I see, i kept the same concurrency grouping logic, but it uses the workflow ID in the group
mmm, but the grouping key also includes the job ID no?
It seems that it uses the workflow name (github.workflow) in the concurrency.group. So no two workflows use the same key.
OK that's the issue then. I will change to workflow + job
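For illustration, here is a minimal sketch of how the two key schemes differ. The `group_key` helper, the ref value, and the key layout are assumptions made for this example, not the actual generator code:

```python
from typing import Optional


def group_key(workflow: str, ref: str, job: Optional[str] = None) -> str:
    """Build a concurrency group key; the job name is included only when given."""
    parts = [workflow] + ([job] if job else []) + [ref]
    return "-".join(parts)


ref = "refs/pull/1/merge"  # hypothetical ref, for illustration only
jobs = ["lint", "test", "test-publish"]

# Workflow-only key: every job of the workflow lands in the same group.
workflow_only = {group_key("SDK", ref) for _ in jobs}

# Workflow + job key: each job gets its own group.
workflow_and_job = {group_key("SDK", ref, job) for job in jobs}

print(len(workflow_only))     # 1
print(len(workflow_and_job))  # 3
```

With one job per workflow, both schemes give one distinct key per job as long as the workflow names differ.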
@coral goblet I'm working on implementation now FYI
I thought I understood the issue, but actually I'm not so sure
- Before: one big workflow called SDK, with 41 jobs (14 SDKs x 3 jobs - 1 exception). Concurrency group is SDK-$REF
- After: 41 workflows with 1 job each. Each workflow has its own concurrency group.
I suspect that the elixir SDK relies on lint,test,test-publish being called in sequence, with the same cache volume being reused. But in my PR they run in parallel, in 3 different machines
So I don't think the concurrency group is the issue
Update: I implemented this fix, and just pushed it. Let's see if it works! As a bonus, it simplifies both the code and the generated yaml. Also brings our job count from 41 to 30 🙂
This error is new to me
https://github.com/dagger/dagger/actions/runs/10692693511/job/29641584369?pr=8241
It happens while starting the engine: the start fails, and the rest of the workflow gets cancelled.
It passes on the next retry. 😭
OK it looks like my simplification fixed it 🙂
The problem was indeed that the elixir SDK needs lint->test->test-publish to be done serially, and not in parallel
Could you make it run in parallel? I think it should work because the steps aren't related. If it doesn't, then I should investigate and eliminate the cause. ✌️
But if everything still works when I run them in parallel, then I really don't understand why it was failing in the first place 😭
Is it possible that the 3 SDKs (Go, Python, Elixir) conflict with each other because of the same cache key? If you look at my screenshot above, those SDKs use the same cache key and mount the cache in SHARED mode.
mmmmmm i have seen this one somewhere before, something internal going quite wrong there
i'll split this to a separate issue
Seen last in https://github.com/dagger/dagger/actions/runs/10692693511/job/29641584369?pr=8241 (but I also do think I've seen this flake out before): Stdout: marshal: json: error calling Marsha...
Thank you.
I don't think so because they are running on different machines (1 job = 1 machine)
So that is weird then. I didn’t use this cache key anywhere other than this test. 🥲