#Intermittent performance issues and long execution times

1 messages · Page 1 of 1 (latest)

shell basalt
#

Hello,

Yesterday we experienced significant performance degradation during several Dagger executions.

At Betclic we use Dagger extensively across many CI pipelines, so this type of slowdown has a strong impact on our delivery workflows.

For example:

Use case 1
• Slow execution (46 min):
https://dagger.cloud/betclicgroup/traces/fe32031c8a78270343e46759fb7e3daa
• Usual execution time (~5 min):
https://dagger.cloud/betclicgroup/traces/8e159016a8ab80e0f9c8d3325f08ac8c

Use case 2
• Slow execution (31 min):
https://dagger.cloud/betclicgroup/traces/0ca1d1f58f275f78fdeacb214057db25
• Usual execution time (~4 min):
https://dagger.cloud/betclicgroup/traces/f8cf72b07f524094ba5f8a3d033d31be

From the traces it appears that the execution hangs when returning from a module function call.

We also observed the following warnings related to telemetry:

WARN - failures detected while emitting telemetry. trace information incomplete
(failed to upload metrics: context deadline exceeded)

Could these telemetry errors explain the slowdown we observed?

Could you help us understand what might be happening and whether this could be related to telemetry upload failures or another issue?

Thanks in advance for your help.

vital sable
#

@eager token ^ this is our issue as well. Looks like your fix from yesterday did not solve it.

humble birch
#

@shell basalt hey rodrigo!! I looked into your traces. Yesterday we had an outage that ended at around 11.10am UTC-7. I see that the last trace you shared is from 10:44am UTC-7. did you continue observing this issue after that time? If so, could you share a trace with us so that we can take a closer look?

#

@vital sable would it be possible for you to share a few traces so that we can take a closer look?

vital sable
#

@humble birch I've sent you a friend request, will DM you the trace. Already did to @eager token

humble birch
#

Thanks Daniel!!

We just did some adjustments to our connection towards the telemtry provider. Let me know if you continue to see this issue

vital sable
#

Still hanging for me

placid nest
shell basalt
shell basalt
#

So, the root cause was a dagger cloud outage ?

placid nest
# shell basalt So, the root cause was a dagger cloud outage ?

Yes, our ingestion API was degraded yesterday, because of an underlying clickhouse cloud outage that we didn't shield ourselves sufficiently from. The dagger CLI also was missing a timeout for that situation. We have resolved the Dagger Cloud issue yesterday, and addressed the root cause. And the next release of dagger CLI will be more resilient to ingestion API degradation.