Workflow reliability (resume on crash, idempotency, etc.)? | Dapr

past light Oct 22, 2024, 12:58 PM


How does Dapr handle workflow resiliency, resume from crash etc.? For example,

If a workflow service/app/process crashes partway through (for any reason), does the Dapr runtime start it again?
The examples, such as this one or the orderProcessingWorkflow, use yield to run a sequence of activities in a flow. What happens if the flow process crashes in the middle (say, after processing 2 activities)? Would it automatically resume from where it crashed?

If the workflow does not automatically resume from where it crashed, how should that be handled programmatically? There is no example that demonstrates how to resume a workflow after the flow process restarts.

Any help or pointers would be much appreciated. Thank you.

GitHub

js-sdk/examples/workflow/authoring/src/activity-sequence.ts at main...


solemn cave Oct 22, 2024, 1:07 PM


Hey!

Think of Dapr Workflows as having checkpoints: whenever the process crashes or restarts, it will resume from the last checkpoint.

Now you may be thinking, "I see no call to any CheckPoint() method anywhere?" and you are correct.

The checkpointing is essentially implicit. Every time you call context.CallActivity, context.WaitForExternalEvent or context.CreateTimer, the operation gets persisted in the event source history of the workflow. At this point the downstream operation is scheduled asynchronously, and for something like CallActivity, the downstream activity will be retried until it completes successfully. Think of this as similar to the at-least-once guarantee that a message broker may give you.

The point is that the activity will run, eventually, and its result will be returned to the workflow, eventually.


Does that help?
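
Since the guarantee described above is at-least-once, an activity may end up running more than once, so it helps if a repeated run has no extra effect. The following is only a minimal sketch of that idea, not Dapr code: `chargeOrder`, `processedKeys` and `balance` are hypothetical names, and the in-memory `Map` stands in for a durable store that would be needed to survive a real crash.

```typescript
// Hypothetical idempotent activity: the same key applies its effect only once.
// Assumption: a real implementation would keep processedKeys in durable storage.
const processedKeys = new Map<string, number>();
let balance = 100;

function chargeOrder(idempotencyKey: string, amount: number): number {
  const previous = processedKeys.get(idempotencyKey);
  if (previous !== undefined) {
    // A retry of work that already completed: return the recorded result.
    return previous;
  }
  balance -= amount;
  processedKeys.set(idempotencyKey, balance);
  return balance;
}

const first = chargeOrder("order-42", 30); // applies the charge
const retry = chargeOrder("order-42", 30); // repeated delivery, no second charge
```

Both calls return the same balance, and the charge is applied once even though the activity body was entered twice.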

past light Oct 22, 2024, 1:37 PM


Thank you @solemn cave for the detailed explanation. It is very helpful.

Just to confirm my understanding is correct, in the below workflow (from one of the examples):

  const sequence: TWorkflow = async function* (ctx: WorkflowContext): any {
    const cities: string[] = [];

    const result1 = yield ctx.callActivity(hello, "Tokyo");
    cities.push(result1);
    const result2 = yield ctx.callActivity(hello, "Seattle");
    cities.push(result2);
    const result3 = yield ctx.callActivity(hello, "London");
    cities.push(result3);

    return cities;
  };

Say the above code/app/process crashes after successfully completing the Seattle activity (result2). Then the following will happen:

The flow instance would be scheduled again by Dapr, and
It would resume from where it left off earlier, i.e. it would process the third activity, "London" (and would not re-run earlier activities that have already completed).

If the above is correct, then that is awesome. As long as I just use ctx.callActivity() and similar Dapr primitives, I get a degree of reliability for my workflows. That is great. Please confirm.

solemn cave Oct 22, 2024, 1:41 PM


Yep you got it


Just to be super super super clear....

When recovering from a crash, all of the workflow code will execute from the very start of the workflow, but don't worry, it's not rescheduling all those activities and such; it's just checking the event source history to see if a result was already received.

This is known as replaying. When you attach a debugger after a process crash you will see what looks like the workflow re-running from the very start, and this is very weird to see. But don't worry, it's not actually rescheduling anything that's already completed; it's just getting itself back to the same state it was in when it last checkpointed before the crash.
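
The replay behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how the Dapr engine is implemented: `run`, `HistoryEvent`, `executions` and the `crashAfter` parameter are made up for illustration, the history lives in memory, and the workflow is a plain synchronous generator rather than the SDK's `TWorkflow`.

```typescript
// Toy model of history-based replay. NOT Dapr's implementation.
type Activity = (input: string) => string;
type Task = { activity: Activity; input: string };
type HistoryEvent = { input: string; result: string };
type Workflow = () => Generator<Task, string[], string>;

const executions: string[] = []; // every real activity execution is logged here

const hello: Activity = (city) => {
  executions.push(city);
  return `Hello ${city}!`;
};

function* sequence(): Generator<Task, string[], string> {
  const cities: string[] = [];
  cities.push(yield { activity: hello, input: "Tokyo" });
  cities.push(yield { activity: hello, input: "Seattle" });
  cities.push(yield { activity: hello, input: "London" });
  return cities;
}

// Always runs the workflow code from the top. Steps already in `history` get
// their stored result without running the activity; new steps are executed and
// appended. `crashAfter` simulates the process dying at that history length.
function run(
  workflow: Workflow,
  history: HistoryEvent[],
  crashAfter: number = Infinity,
): string[] | undefined {
  const gen = workflow();
  let step = gen.next();
  let index = 0;
  while (!step.done) {
    const task = step.value;
    let result: string;
    if (index < history.length) {
      result = history[index].result; // replayed, activity is not re-run
    } else {
      if (history.length >= crashAfter) return undefined; // simulated crash
      result = task.activity(task.input);
      history.push({ input: task.input, result });
    }
    index++;
    step = gen.next(result);
  }
  return step.value;
}

const history: HistoryEvent[] = [];
const firstAttempt = run(sequence, history, 2); // dies after Tokyo and Seattle
const secondAttempt = run(sequence, history); // replays two steps, runs London
```

In this model the second attempt walks through the Tokyo and Seattle lines again, which is the "re-running from the start" effect seen in a debugger, but `executions` ends up with each city exactly once.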


This is why there is a strong requirement that your workflow code is deterministic. Any non-deterministic behaviour could put the code out of step with the event source history so that things would not match up, and the workflow engine would then fail the workflow instance.
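
To make the determinism requirement concrete, here is another toy sketch, again not the Dapr engine: `replay`, `flaky`, `stable`, `coin` and the step names are hypothetical. The `coin` variable stands in for anything that can differ between the original run and a replay, such as a random number or the wall clock.

```typescript
// Toy model: the history stores the name of each scheduled step, and a replay
// checks that the code schedules the same steps in the same order.
type Step = { name: string };
type Flow = () => Generator<Step, void, void>;

function replay(flow: Flow, history: string[]): void {
  const gen = flow();
  let index = 0;
  for (let step = gen.next(); !step.done; step = gen.next()) {
    const name = step.value.name;
    if (index < history.length) {
      if (history[index] !== name) {
        throw new Error(
          `non-deterministic workflow: history has "${history[index]}" ` +
            `at step ${index}, but the code scheduled "${name}"`,
        );
      }
    } else {
      history.push(name); // first execution of this step: record it
    }
    index++;
  }
}

let coin = 0; // stand-in for a random value, the current time, etc.

function* flaky(): Generator<Step, void, void> {
  // The branch taken depends on state outside the history.
  if (coin === 0) yield { name: "chargeCard" };
  else yield { name: "sendInvoice" };
  yield { name: "notify" };
}

function* stable(): Generator<Step, void, void> {
  yield { name: "chargeCard" };
  yield { name: "notify" };
}

const stableHistory: string[] = [];
replay(stable, stableHistory);
replay(stable, stableHistory); // same path on replay, no error

const flakyHistory: string[] = [];
replay(flaky, flakyHistory);
coin = 1; // the outside value changes before the replay
let failure = "";
try {
  replay(flaky, flakyHistory);
} catch (e) {
  failure = (e as Error).message;
}
```

The stable flow replays cleanly, while the flaky one takes a different branch on replay and no longer lines up with its recorded history.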

past light Oct 22, 2024, 2:12 PM


Excellent. Thank you @solemn cave for the very detailed explanation. It is very helpful.
