#I am still seeing builds failing wit
I filed this https://github.com/dagger/dagger/issues/8524
Please add any other details you might have
No progress. But I've hit it
I've been hitting it all afternoon, unfortunately, but it's definitely not consistent. It happens to me about 30% of the time.
I thought it was due to a longer-running function and perhaps a timeout, but I haven't built a function with a sleep to attempt a repro yet.
Do you have a sense of the order of magnitude for "longer"?
Here's a very recent example: https://dagger.cloud/levs-test-org/traces/028defc376ba5be1c93145f7f349c6cb
The step that triggered it was only running for about 5 minutes.
It just happened again, after about 7 minutes this time: https://dagger.cloud/levs-test-org/traces/8bff13ba44e63f2d7702b03955c6579a
Oh interesting, I got a stack trace this time
Full trace at https://dagger.cloud/levs-test-org/traces/b4c15419bf49ce5834a877a24f9ac175
Error: response from query: input: medplum.buildMatrix resolve: call function "buildMatrix": process "tsx --no-deprecation --tsconfig /src/.dagger/tsconfig.json /src/.dagger/src/__dagger.entrypoint.ts" did not complete successfully: exit code: 1
Stderr:
/src/.dagger/sdk/api/utils.ts:234
throw new UnknownDaggerError(
^
UnknownDaggerError: Encountered an unknown error while requesting data via graphql
at compute (/src/.dagger/sdk/api/utils.ts:234:11)
at computeQuery (/src/.dagger/sdk/api/utils.ts:158:10)
at Container.stdout (/src/.dagger/sdk/api/client.gen.ts:1831:39) {
cause: TypeError: fetch failed
at node:internal/deps/undici/undici:12502:13
at <anonymous> (/src/.dagger/node_modules/graphql-request/src/legacy/helpers/runRequest.ts:191:10)
at runRequest (/src/.dagger/node_modules/graphql-request/src/legacy/helpers/runRequest.ts:72:25)
at GraphQLClient.request (/src/.dagger/node_modules/graphql-request/src/legacy/classes/GraphQLClient.ts:131:22)
at compute (/src/.dagger/sdk/api/utils.ts:202:20)
at computeQuery (/src/.dagger/sdk/api/utils.ts:158:10)
at Container.stdout (/src/.dagger/sdk/api/client.gen.ts:1831:39) {
[cause]: HeadersTimeoutError: Headers Timeout Error
at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:7569:32)
at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:6659:17)
at listOnTimeout (node:internal/timers:573:17)
at process.processTimers (node:internal/timers:514:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
}
},
code: 'D101'
}
Node.js v20.15.0
OK, so it definitely comes from here: https://github.com/dagger/dagger/blob/f15c2996d26dbbc17bec4b6377629bdf1e934ddf/sdk/typescript/api/utils.ts#L202
So it's during the request
cc @lime inlet
Maybe I could increase the timeout: https://github.com/jasonkuhrt/graffle/issues/103 ??
Hei, would it be possible to support configuring the timeout ?
So maybe back to a timeout with https://github.com/nodejs/undici/?
code: 'UND_ERR_HEADERS_TIMEOUT'
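For anyone triaging similar traces: the useful code is buried two `cause` levels below the `D101` wrapper. Here is a minimal sketch, assuming only the error shape visible in the stack trace above, of walking the `cause` chain to pull out every `code`. The names `collectErrorCodes` and `isHeadersTimeout` are made up for illustration and are not part of the Dagger SDK.

```typescript
// Hypothetical helpers, not part of the Dagger SDK.
// Shape assumed from the trace: each error may carry a `code` and a nested `cause`.
interface CodedError {
  message?: string;
  code?: string;
  cause?: unknown;
}

// Walk the `cause` chain from the outermost error inward and collect every string `code`.
// `maxDepth` guards against accidental cycles in the chain.
function collectErrorCodes(err: unknown, maxDepth = 10): string[] {
  const codes: string[] = [];
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth && current != null && typeof current === "object"; depth++) {
    const { code, cause } = current as CodedError;
    if (typeof code === "string") codes.push(code);
    current = cause;
  }
  return codes;
}

// True when any error in the chain carries undici's headers-timeout code.
function isHeadersTimeout(err: unknown): boolean {
  return collectErrorCodes(err).includes("UND_ERR_HEADERS_TIMEOUT");
}

// Plain object mirroring the nesting in the trace (outer D101, inner undici error).
const example = {
  message: "Encountered an unknown error while requesting data via graphql",
  code: "D101",
  cause: {
    message: "fetch failed",
    cause: { message: "Headers Timeout Error", code: "UND_ERR_HEADERS_TIMEOUT" },
  },
};

console.log(collectErrorCodes(example)); // [ 'D101', 'UND_ERR_HEADERS_TIMEOUT' ]
console.log(isHeadersTimeout(example)); // true
```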
I also see that our graphql-client will have a big update with 8.0.0, so I'll have some work to do to convert to the new code.
I'll open a PR to extend the timeout
@bright venture I wonder what the timeout of the request should be, though
cc @lime inlet if we have an operation that takes 10 minutes to resolve, what should happen? Should I set a 30-minute timeout? I'm not sure that actually makes sense
should work, there's no expected timeouts on requests as a whole
So a 30-minute timeout?
PR is open: https://github.com/dagger/dagger/pull/8549
There are no expected timeouts. If a user exec is running a build that takes several hours (not implausible for certain use cases), then it shouldn't time out
(will comment on the PR)
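A rough sketch of what "no timeout on the request" could look like on the client side. This is not necessarily what the PR does. The assumption is that the fetch in use is undici-based (as the trace suggests) and accepts a `dispatcher` option; an undici `Agent` created with `headersTimeout: 0` and `bodyTimeout: 0` disables both timeouts. The wrapper name `withDispatcher` and the stand-in types are invented here so the snippet runs without extra packages.

```typescript
// Minimal structural types so this sketch compiles without undici or graphql-request installed.
type FetchInit = Record<string, unknown>;
type FetchLike<R> = (url: string, init?: FetchInit) => Promise<R>;

// Hypothetical wrapper: returns a fetch that always passes the given dispatcher,
// while preserving whatever init options the caller supplied.
function withDispatcher<R>(baseFetch: FetchLike<R>, dispatcher: unknown): FetchLike<R> {
  return (url, init) => baseFetch(url, { ...init, dispatcher });
}

// Assumed real-world wiring (left as comments because it needs the undici and
// graphql-request packages, and a type cast may be needed):
//   import { Agent } from "undici";
//   const agent = new Agent({ headersTimeout: 0, bodyTimeout: 0 }); // 0 disables the timeout
//   const client = new GraphQLClient(url, { fetch: withDispatcher(fetch, agent) });

// Demo with a fake fetch that records the options it receives.
const recorded: Array<FetchInit | undefined> = [];
const demoFetch: FetchLike<string> = async (_url, init) => {
  recorded.push(init);
  return "ok";
};
const demoDispatcher = { label: "stand-in for an undici Agent" };

void withDispatcher(demoFetch, demoDispatcher)("http://dagger/query", { method: "POST" });
console.log(recorded[0]?.method, recorded[0]?.dispatcher === demoDispatcher);
```

Injecting the dispatcher through a wrapper keeps the timeout policy in one place instead of at every call site.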
Hey @jaunty frost, sadly I am still seeing this exact same error on the latest dagger
Setup tracing at https://dagger.cloud/traces/setup. To hide: export NOTHANKS=1
Error: response from query: input: medplum.buildMatrix resolve: call function "buildMatrix": process "tsx --no-deprecation --tsconfig /src/.dagger/tsconfig.json /src/.dagger/src/__dagger.entrypoint.ts" did not complete successfully: exit code: 1
Stderr:
/src/.dagger/sdk/api/utils.ts:234
throw new UnknownDaggerError(
^
UnknownDaggerError: Encountered an unknown error while requesting data via graphql
at compute (/src/.dagger/sdk/api/utils.ts:234:11)
at computeQuery (/src/.dagger/sdk/api/utils.ts:158:10)
at Container.stdout (/src/.dagger/sdk/api/client.gen.ts:1843:39) {
cause: TypeError: fetch failed
at node:internal/deps/undici/undici:12502:13
at <anonymous> (/src/.dagger/sdk/graphql/client.ts:19:14)
at <anonymous> (/src/.dagger/node_modules/graphql-request/src/legacy/helpers/runRequest.ts:191:10)
at runRequest (/src/.dagger/node_modules/graphql-request/src/legacy/helpers/runRequest.ts:72:25)
at GraphQLClient.request (/src/.dagger/node_modules/graphql-request/src/legacy/classes/GraphQLClient.ts:131:22)
at compute (/src/.dagger/sdk/api/utils.ts:202:20)
at computeQuery (/src/.dagger/sdk/api/utils.ts:158:10)
at Container.stdout (/src/.dagger/sdk/api/client.gen.ts:1843:39) {
[cause]: HeadersTimeoutError: Headers Timeout Error
at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:7569:32)
at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:6659:17)
at listOnTimeout (node:internal/timers:573:17)
at process.processTimers (node:internal/timers:514:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
}
},
code: 'D101'
}
Node.js v20.15.0
Running with dagger v0.13.4-010101000000-dev-2a8c3f8854b9 (registry.dagger.io/engine:) linux/amd64
This is for a step that failed in under 5 minutes, so I don't think the timeout is actually the issue here.
You can see a trace inside of the terminal tab here: https://dagger.cloud/levs-test-org/traces/60677125f038f95c65c821ce8cec9987
Here's a one-liner to reproduce the issue, but note that it does not happen consistently for me.
dagger -m https://github.com/dagger/dagger call dev with-mounted-directory --path "medplum" --source "https://github.com/levlaz/medplum#daggerize" with-workdir --path "medplum" with-exec --args "dagger,call,build-matrix"
Since this does not happen consistently and the steps do get cached, one way to try to reproduce it is to go into a terminal and pass a different value for node-version:
dagger -m https://github.com/dagger/dagger call dev with-mounted-directory --path "medplum" --source "https://github.com/levlaz/medplum#daggerize" with-workdir --path "medplum" terminal
then
dagger call build --node-version 19
That should bust the cache and get the full build to run
Thanks! I'm checking rn
Currently testing another fix. By the way, I found a quicker way to test with the dagger engine:
./hack/dev
dagger -m https://github.com/levlaz/medplum@daggerize call build --node-version 19
@slow citrus I've hit a different error with your repro: https://dagger.cloud/Quartz/traces/d90f485ff2d457c1544662010d04133a?span=094ea0eeec4eddeb#910af4513f45ddfd
Trying with node 20, I want to see if I can repro it with my latest changes
okay with node 20 it works, I'm trying with node21
I also opened a PR, if you wanna give it a try: https://github.com/dagger/dagger/pull/8576
We'll wait for your tests before merging it this time
I'm not able to repro the bug with my PR's version (node 20, 21 & 22)
PR is green, waiting for your feedback @slow citrus
Thanks for this! I am trying to test this now, but please do note that the problem was intermittent, so it's unfortunately difficult to confirm
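To put a number on "difficult to confirm": if the bug really fails about 30% of the time (the rate mentioned earlier in the thread) and runs are independent, the chance of seeing n clean runs in a row with the bug still present is 0.7^n. A small sketch of that arithmetic follows; the function name `cleanRunsNeeded` is made up, and the independence assumption is a strong one here since cached steps can make runs non-independent.

```typescript
// Hypothetical helper: how many consecutive clean runs are needed before the
// chance of "bug still present but never triggered" drops below (1 - confidence).
// Assumes each run fails independently with probability `failureRate`.
function cleanRunsNeeded(failureRate: number, confidence: number): number {
  if (!(failureRate > 0 && failureRate < 1) || !(confidence > 0 && confidence < 1)) {
    throw new RangeError("failureRate and confidence must be strictly between 0 and 1");
  }
  // Solve (1 - failureRate)^n <= 1 - confidence for the smallest integer n.
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

console.log(cleanRunsNeeded(0.3, 0.95)); // 9
console.log(cleanRunsNeeded(0.3, 0.99)); // 13
```

So with a 30% failure rate, roughly nine uncached clean builds in a row would be needed before calling it fixed with about 95% confidence.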
One other thing to note: I like the other approach because anyone can run it without needing to clone our repo.
For example when Marcos said "install dagger cli from main and test it out"
True!
Hey @jaunty frost !
I am getting a different error that I was not getting before (I think the same one you saw), which feels odd :/
This entire pipeline used to fail intermittently but sometimes succeeded
Now it fails consistently on node 20. It's possible it's not related to dagger, but I am suspicious of that because my code has not changed
https://dagger.cloud/levs-test-org/traces/ff98dd5fba5f43c96f78bcf6c4d76f62
@copper orbit just FYI: in that "nested trace" above, the run fails and shows up as failed at the high level, but the specific failing step appears to be green for some reason
It worked on node 20 last time I tried. By the way, I don't see the failing trace details (as you mentioned to Alex)
it's because it never saw the end of the relevant spans, for some reason
well, the error may have been thrown, but we just never received it
Yeah I got some stuff in terminal
medplum-nextjs-demo:build:
@medplum/graphiql:build: rendering chunks...
medplum-nextjs-demo:build: Creating an optimized production build ...
medplum-websocket-subscriptions-demo:build: vite v5.4.5 building for production...
medplum-websocket-subscriptions-demo:build: transforming...
medplum-task-demo:build: vite v5.4.5 building for production...
medplum-task-demo:build: transforming...
medplum-provider:build: vite v5.4.5 building for production...
medplum-provider:build: transforming...
medplum-live-chat-demo:build: vite v5.4.5 building for production...
medplum-live-chat-demo:build: transforming...
medplum-scheduling-demo:build: ✓ 6960 modules transformed.
Container.withExec(args: ["npm", "run", "lint"]): Container! 34m28.2s
Container.stdout: String! 34m28.2s
Container.sync: ContainerID! 34m48.4s
Full trace at https://dagger.cloud/levs-test-org/traces/ff98dd5fba5f43c96f78bcf6c4d76f62
Error: response from query: Post "http://dagger/query": command [docker exec -i dagger-engine-v0.13.3 buildctl dial-stdio] has exited with exit status 137, make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
Run 'dagger call dev with-mounted-directory with-workdir with-exec --help' for usage.
Hm, actually this once again feels like an error with the engine to me
Engine logs in case it helps
Doh, yeah, sorry, I should have seen that huge "!"
exit status 137 is odd there, is your machine running out of RAM or something? 137 is usually kill -9 so maybe the OOM killer kicking in?
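For reference, the 137 reading follows the common shell convention that an exit status above 128 means "killed by signal (status - 128)", so 137 maps to signal 9, SIGKILL. A tiny sketch of that decoding is below. The helper name `describeExitStatus` and the short signal table are illustrative only, and a SIGKILL does not by itself prove the OOM killer was responsible.

```typescript
// Small illustrative subset of POSIX signal numbers (not a complete table).
const SIGNAL_NAMES: Record<number, string> = {
  1: "SIGHUP",
  2: "SIGINT",
  9: "SIGKILL",
  15: "SIGTERM",
};

// Hypothetical helper: interpret an exit status using the 128 + signal convention.
function describeExitStatus(status: number): string {
  if (status === 0) return "exited successfully";
  if (status > 128 && status < 256) {
    const signal = status - 128;
    return `killed by signal ${signal} (${SIGNAL_NAMES[signal] ?? "unknown"})`;
  }
  return `exited with code ${status}`;
}

console.log(describeExitStatus(137)); // killed by signal 9 (SIGKILL)
console.log(describeExitStatus(1)); // exited with code 1
```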
Yeah, the engine seems to be dying
@jaunty frost something crazy is happening, because the parallel builds that were not working before now seem to be working!
My builds are happening concurrently and I am running into the same type of issues that motion used to complain about, where CPU and memory spike locally for no apparent reason (it's really just running npm build and installing some dependencies)
Lol, so wait, are my fixes working or not? haha
Yup looks like you should kill/restart your engine, not a TS issue there
I did try to restart the engine, but I'm still running into issues
I have no evidence, but I still think this is broken.
The graphql errors really feel like a misdirection to me, and this is finally starting to show the real underlying issue (even though it's not clear what it might be)
I would still like to try a build with the rolled-back version of this library to rule that out as an issue, if that's possible
Okay, will open up a PR tomorrow
@jaunty frost FYI I ran a bunch of builds and was not able to reproduce the issue, so I would feel comfortable merging your most recent PR. It's not any worse than the current state
However, I'll leave it up to you to decide if you want to roll back like you said and wait for the next iteration of that library.
I am having some strange issues with concurrency, but I don't think those have anything to do with this issue.
Hmmm, do we have another way to repro that issue, with another module, to confirm that it actually fixes our issue?
I asked this person to help because they seemed to be running into this issue more consistently: #1288358859190308874 message
@bright venture do you have a project you could test with too?