Recovering from Failed Workflow Instances

In my previous post, Manipulating Failed Workflow Instances, I demonstrated how to Terminate a failed workflow instance that has been persisted to the Workflow Instance Store. That discovery was actually a by-product of a study on how to recover failed workflow instances.

In this post, I will explain a common scenario: the workflow service fails on the first call, whether because of missed validations, logic errors or data(base) related issues. I have created a simple workflow to illustrate the scenario.



The above workflow shows a Receive activity that exposes a Submit service method to external clients and then executes a custom Submit activity. Below the Submit activity is an If activity that throws a bogus exception if the description of the Expense entity object contains the text "Gimme an Error" (well, we asked for it!).

The Send activity is marked with PersistBeforeSend, but in our case the exception occurs before the instance can be persisted, which will cause a lot of hair-pulling later.

The following will simulate the exception on the first call to the workflow service:


// Create service proxy.
ExpenseWorkflowServiceClient proxy =
    new ExpenseWorkflowServiceClient();

// Create a correlation id for the workflow.
Guid correlationId = Guid.NewGuid();

// Create entity object to pass in.
Expense expense = new Expense();
expense.Amount = 88.00;
expense.Employee = "Serena";
expense.WorkflowID = correlationId;

// Ask for bogus exception.
expense.Description = "Gimme an Error";
proxy.Submit(expense);


Once executed, the workflow will throw an ApplicationException with the message "You asked for it!". If we inspect the instance store, we will be able to see the status of our workflow instance.


Now, if you use the code from my previous post to manipulate the instance, you will be greeted by the following when you attempt to load it into a WorkflowApplication:

InstanceNotReadyException

The execution of an InstancePersistenceCommand was interrupted because the instance '{ Guid }' has not yet been persisted to the instance store.

The workflow instance is actually in a limbo state. You will not be able to Terminate, Abandon, Cancel, Abort, Run or do anything else with it. And if you have not been doing good unit testing on your code, you will probably have a gazillion of these limbo instances inside your WorkflowInstanceStore. (Are you screaming yet?)
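If you want to detect these limbo instances in code rather than by trial and error, something along the following lines might work. This is only a rough sketch under my own assumptions: the LimboCheck class and IsInLimbo method are hypothetical names, the workflow definition and connection string are whatever your own project uses, and I am assuming that calling Unload straight after a successful Load is acceptable for your workflow.

```csharp
using System;
using System.Activities;
using System.Activities.DurableInstancing;
using System.Runtime.DurableInstancing;

// Hypothetical helper, not part of WF4 itself.
static class LimboCheck
{
    // Returns true when loading the instance fails with InstanceNotReadyException,
    // i.e. the store knows about the instance but it was never persisted.
    public static bool IsInLimbo(Activity definition, string connectionString, Guid instanceId)
    {
        var store = new SqlWorkflowInstanceStore(connectionString);
        var application = new WorkflowApplication(definition)
        {
            InstanceStore = store
        };

        try
        {
            application.Load(instanceId);

            // Assumption: the instance loaded fine, so persist it again
            // and release it without running anything.
            application.Unload();
            return false;
        }
        catch (InstanceNotReadyException)
        {
            // Same exception as shown above: nothing persisted to load yet.
            return true;
        }
    }
}
```

Any other exception (for example, an instance locked by another host) is deliberately left uncaught here so that it is not mistaken for the limbo state.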

There are two options to solve this. The first option is to perform a retry. For a retry to be possible, the Correlation Id (from the CorrelatesOn property of the Receive activity) must already have been set before the failure occurs. So, please remember to set your Correlation Id in your workflow services! If you have not set your Correlation Id and your workflow fails, there isn't anything you can do about it any more.


In the example, the Correlation Id is a column named WorkflowID which I have created. When we call the workflow service again with the same Correlation Id (in this case, WorkflowID), the runtime will reuse the previously failed instance record in the instance store.

That means, if we execute something like this, our limbo state instance will be somewhat "recovered" (or overwritten).

// Reuse the WorkflowID of the failed call (replace the placeholder with the actual Guid).
correlationId = new Guid("Guid of Instance");
expense.WorkflowID = correlationId;
expense.Description = "No more errors please!";
proxy.Submit(expense);


If the above code is executed, the instance record will be removed from the instance store because my configuration is set with instanceCompletionAction="DeleteAll".
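For reference, the setting lives on the sqlWorkflowInstanceStore service behavior. The fragment below is only a sketch of what such a configuration might look like; the connection string name is a hypothetical placeholder, and your own behavior section will likely contain other elements as well.

```xml
<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior>
        <!-- "WorkflowInstanceStore" is a made-up connection string name. -->
        <sqlWorkflowInstanceStore
            connectionStringName="WorkflowInstanceStore"
            instanceCompletionAction="DeleteAll" />
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>
```

With instanceCompletionAction="DeleteNothing" instead, the record of a completed instance would stay in the store, so you would see it marked as completed rather than disappear.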

The other option is of course more extreme: delete the limbo instances, as explained in MSDN. I would recommend exercising caution when doing this in a production environment.

I will continue to post more discoveries as I explore more of WF4.
