#Failed to emit telemetry, with DO_NOT_TRACK=1

1 messages · Page 1 of 1 (latest)

white summit
#

Using Dagger v0.13.6, I'm having trouble with a Dagger module that installs the Dagger CLI to container in order to test other Dagger modules.

I'm intermittently receiving these errors

2024/10/29 16:55:34 ERROR failed to emit telemetry error="failed to upload metrics: context deadline exceeded: retry-able request failure"
16:55:50 ERR failed to emit telemetry error="traces export: failed to send to http://127.0.0.1:46755/v1/traces: 500 Internal Server Error"
16:55:55 ERR failed to emit telemetry error="failed to send logs to http://127.0.0.1:46755/v1/logs: 500 Internal Server Error"

I have DO_NOT_TRACK=1 set on the host calling the Dagger CLI, and in the containers which also call Dagger CLI, and I still receive the message. Which eventually causes a timeout, or at the very best severely extends the time it takes to execute the pipeline when compared to v0.13.5.

Please help 🥹

reef remnant
#

Are you able to share your engine logs as well? You can DM me or send an email lev@dagger.io

reef remnant
#

Hey @carmine niche do you have any ideas on what might be causing this? I looked through the logs (happy to share with you) but nothing jumps out at me.

I also don't think that DO_NOT_TRACK makes a difference here because I suspect this isn't necessarily talking about sending metrics to dagger cloud but instead sending them to the built-in OTEL collector (I could be wrong on this part)

carmine niche
#

Hm yeah that looks like it's somehow failing to send to our OTel proxy. Agree DO_NOT_TRACK is unrelated, this is just a little service we run in each withExec which is minimally required so that telemetry run by the exec makes it out to the TUI, even without Cloud configured. So seeing a 500 from it is pretty surprising. Happy to look at engine logs

white summit
#

@carmine niche I was wondering if there was anything helpful in the engine logs. Or if there's anything else I can send to help with troubleshooting?

carmine niche
reef remnant
#

@white summit in the meantime are you able to upgrade to 0.13.7 just to rule out the chance this bug hasn’t already been fixed somehow

white summit
#

I can't locally test my team's modules using the same ci workflow until this issue is resolved. I'm unable to progress past 0.13.5. :/

#

This issue only occurs when using 0.13.6 or 0.13.7 🙂

reef remnant
#

thanks for confirming!

carmine niche
#

well - I really don't think we even need transactions for this, so I've put up a PR that just removes them and we'll see how it goes. I'll let you know when there's an image pushed if you want to try it out and confirm. https://github.com/dagger/dagger/pull/8861

GitHub

Some users are still hitting 'database is locked' errors despite SQLite being tuned properly as far as I can tell. We really don't need transactions for this, so just remove them.

carmine niche
#

@white summit here's an image built from that if you want to give it a try: registry.dagger.io/engine:925cf02d75b452fb0be2bd77f0da10cbd1872105

white summit
# carmine niche <@128260833594572800> here's an image built from that if you want to give it a t...

There's some progress in that I've not yet seen the '500 Internal Server' error. But after a 43min run of a task that normally takes 8mins. I had to terminate as no progress was being made. Seems it slowed right down, and was just logging this over and over:

15:12:39 ERR failed to emit telemetry error="traces export: context deadline exceeded: retry-able request failure"
15:13:09 ERR failed to emit telemetry error="traces export: context deadline exceeded: retry-able request failure"
15:13:09 ERR failed to emit telemetry error="context deadline exceeded: retry-able request failure"
15:13:19 ERR failed to emit telemetry error="failed to upload metrics: context deadline exceeded: retry-able request failure"
15:13:20 ERR failed to emit telemetry error="failed to upload metrics: context deadline exceeded: retry-able request failure"

Version of the binary:

dagger version
dagger v0.13.8-241106154711-925cf02d75b4 (registry.dagger.io/engine:925cf02d75b452fb0be2bd77f0da10cbd1872105) darwin/amd64
carmine niche
#

What kind of deployment is it? Just local Docker, or something more bespoke?

white summit
#

Interesting -- I was able to run this pipeline on the same machine under v0.13.5 without a hitch. I tested today on a Windows AVD using v0.13.7, and the pipeline functioned fine with a few 'failed to emits', but it recovered fine and didn't enter a failed state. It took over an hour to execute, in v0.13.7, and 25mins in v0.13.5.

I'll dig into it a bit further, maybe over the weekend.

carmine niche
#

hmm ok that's interesting, since all of this telemetry SQLite DB stuff was added way back in v0.12.5 and hasn't moved much since. If it's fine in v0.13.5 it must be something else going on, but it's nice that the version range is narrowed down

carmine niche
#

👋 If you want to try again from main, just pushed another fix: registry.dagger.io/engine:e40b6926652a3c38668168ae9d002d820a11e627

reef remnant
#

I think I just ran into this myself - I was at a conference booth in the middle of a demo so I just restarted my whole computer but its the first time I have seen similar symptoms on my own :/

#

This is running 0.14

frosty owl
#

We are seeing also this behaviourm, where ALL pipelines are running slower (much slower) than usual.
From the engine logs, I can see plenty of messages like:

time="2024-12-12T15:15:02Z" level=error msg="failed to emit telemetry" error="map[error:insert metrics: context deadline exceeded kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:09Z" level=error msg="failed to emit telemetry" error="map[error:insert metrics: context deadline exceeded kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:09Z" level=error msg="failed to emit telemetry" error="map[error:insert log: database is locked (5) (SQLITE_BUSY) kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:09Z" level=error msg="failed to emit telemetry" error="map[error:insert span: database is locked (5) (SQLITE_BUSY) kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:10Z" level=error msg="failed to emit telemetry" error="map[error:insert span: database is locked (5) (SQLITE_BUSY) kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:10Z" level=error msg="failed to emit telemetry" error="map[error:insert span: database is locked (5) (SQLITE_BUSY) kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:10Z" level=error msg="failed to emit telemetry" error="map[error:insert span: database is locked (5) (SQLITE_BUSY) kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:10Z" level=error msg="failed to emit telemetry" error="map[error:insert metrics: context deadline exceeded kind:*fmt.wrapError stack:<nil>]"
time="2024-12-12T15:15:10Z" level=error msg="failed to emit telemetry" error="map[error:insert metrics: context deadline exceeded kind:*fmt.wrapError stack:<nil>]"

I can't collate traces of the pipelines that trigger this behaviour though (company policies)

#

I'll do more research to collate as much info as possible (resource usage?)

reef remnant
#

cc @carmine niche once you are back can you take a look at this please?

carmine niche
#

I think that's all fixed in v0.15

frosty owl
#

We'll upgrade to 0.15 and see how that looks. (Likely after christmas...)
I was unable to find anything to point my finger to as a root cause. All the metrics from the disk look quite healthy (latency/throughtput/disk queue depth)...

frosty owl
#

And our pipelines (running engine 0.15.2) are slower than a couple of months ago (way slower). I'll collect all info I can to create a GH issue to track this down, as I may need some help finding out the root cause.
We've had also a case last week where the engine was OOM killed which we need to investigate further
(long running engine in a pod in k8s, with a node of 80cpu's/300gb mem allocated to the engine)

#

re: @candid stag

sacred citrus
# frosty owl 😦 Still seeing some logs with the similar error

And our pipelines (running engine 0.15.2) are slower than a couple of months ago (way slower). I'll collect all info I can to create a GH issue to track this down, as I may need some help finding out the root cause.
@frosty owl if you're still seeing this slowless, we'd love to try to figure it out.

bronze fulcrum
#

we saw a performance regression in v0.15.2, and didn't catch it in time - there's a fix on main now, that should make it into the release that we do this week

#

there is a known leak (which got worse with the perf regression), and we're working on a fix for the root cause of that as well