Workflow Management Service Memory Leak

AppFabric for Windows Server (formerly called Windows Server AppFabric) is an extension to IIS that provides useful application server features such as service monitoring, workflow management and caching. It is common to host long-running workflows on AppFabric because its Auto-Start feature allows workflows to continue running after an IISRESET or a server reboot.

Recently, however, I encountered a problem where the AppFabric Workflow Management Service (WMS) steadily leaks memory over time. The symptom: after a fresh reboot, the memory used by WorkflowManagementService.exe, as shown in Task Manager, slowly grows until it eventually takes up all of the server's memory.

This causes any Windows Communication Foundation (WCF) services or Workflow Services hosted on the server to stop accepting new requests. It also causes any persisted Workflow Services to fail to auto-start.

After some serious troubleshooting with Microsoft, we were lucky enough to find the root cause. If you encounter a similar memory leak, check the error logs in the Event Viewer. AppFabric logs errors to this location:

Applications and Services Logs -> Microsoft -> Windows -> Application Server-System Services -> Admin

We discovered a large number of errors (logged almost every minute) with the following message:

Failed to invoke service management endpoint at 'net.pipe://[server-name]/ServiceManagement.svc' to activate service '/[workflow-service-name].svc'. Exception: 'The message with To 'net.pipe://[server-name]/ServiceManagement.svc' cannot be processed at the receiver, due to an AddressFilter mismatch at the EndpointDispatcher. Check that the sender and receiver's EndpointAddresses agree.'
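When the log contains hundreds of these entries, it helps to know which services are actually failing to activate. The following is a minimal sketch, assuming the event messages have already been exported from the Event Viewer into a list of strings. The function name and the regular expression are my own, and the pattern is based only on the message text quoted above, so the wording may differ on other versions.

```python
import re
from collections import Counter

# Pattern derived from the error text quoted above (an assumption: other
# AppFabric versions or locales may word the message differently).
ACTIVATION_FAILURE = re.compile(
    r"Failed to invoke service management endpoint at '(?P<endpoint>[^']+)' "
    r"to activate service '(?P<service>[^']+)'"
)


def failing_services(messages):
    """Count activation failures per service path across event messages."""
    counts = Counter()
    for message in messages:
        match = ACTIVATION_FAILURE.search(message)
        if match:
            counts[match.group("service")] += 1
    return dict(counts)
```

Each key in the result is a service path such as /[workflow-service-name].svc, which is the value to look for in IIS Manager and in the persistence store.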

There are a few scenarios that can cause the above error.

1. net.pipe has been disabled for the Application in the Web Site that is hosting the Workflow Service. This wasn't the case for me.

2. Workflow Services were deployed on the server with active instances and were then deleted after testing. Their active workflow instances still exist in the AppFabric Persistence Store, so the WMS thinks those instances are still available on the server and continuously tries to activate them.

3. More than one AppFabric installation shares the same AppFabric Persistence Store. Every AppFabric installation registers its Workflow Services in the persistence store, and each WMS tries to activate all of the workflows, including those that belong to other servers. The activation fails and the WMS retries continuously.

4. Developers install AppFabric on their development machines but point their local AppFabric at the server's AppFabric Persistence Store. This causes both scenarios 2 and 3 to happen.

To confirm the issue, open the Internet Information Services (IIS) Manager on the affected server and check whether the erroneous Workflow Service mentioned in the event log exists under the Web Site.

If it doesn't exist, go to the AppFabric Persistence Store and query the System.Activities.DurableInstancing.ServiceDeploymentsTable for the Workflow Service. You can filter by the RelativeServicePath column. Note that the service name is registered as /[workflow-service-name].svc in the database.
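The lookup can be sketched as a small helper that builds the query. This is only an illustration: the function name is hypothetical, the `?` placeholder assumes a DB-API driver that uses the qmark parameter style (pyodbc, for example), and the only schema details taken from the description above are the table name and the RelativeServicePath column.

```python
def build_lookup_query(service_name):
    """Return (sql, params) for finding a workflow service's deployment row.

    The service name is normalised to the /[workflow-service-name].svc form
    that the persistence store uses in the RelativeServicePath column.
    """
    relative_path = "/" + service_name.strip("/")
    if not relative_path.lower().endswith(".svc"):
        relative_path += ".svc"
    sql = (
        "SELECT * "
        "FROM [System.Activities.DurableInstancing].[ServiceDeploymentsTable] "
        "WHERE RelativeServicePath = ?"
    )
    return sql, (relative_path,)
```

Passing the path as a parameter rather than concatenating it into the statement keeps the query safe to reuse for every service name found in the event log.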

If you manage to locate the row, you have an orphaned Workflow Service registered in the persistence store, and that is the cause of the memory leak. To rectify the issue, first delete all the rows in the System.Activities.DurableInstancing.InstancesTable that belong to the ServiceDeploymentId of the orphaned Workflow Service, and then delete the row in the ServiceDeploymentsTable itself.
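The order of the two deletes matters, so here is a rough sketch of the statements in the sequence described above. Treat it as an assumption-laden illustration rather than a tested script: the function name is hypothetical, the `?` placeholder assumes a qmark-style DB-API driver, and I am assuming the key column of the ServiceDeploymentsTable is called `Id`, so verify the column name against your own database first.

```python
def build_cleanup_statements(deployment_id):
    """Return (sql, params) pairs in the order they should be executed."""
    schema = "[System.Activities.DurableInstancing]"
    return [
        # 1. Remove the workflow instances that belong to the orphaned service.
        (
            f"DELETE FROM {schema}.[InstancesTable] "
            "WHERE ServiceDeploymentId = ?",
            (deployment_id,),
        ),
        # 2. Remove the deployment row itself.
        #    Assumption: the key column is named Id.
        (
            f"DELETE FROM {schema}.[ServiceDeploymentsTable] "
            "WHERE Id = ?",
            (deployment_id,),
        ),
    ]
```

Running both statements inside a single transaction, after taking a backup of the persistence store, means a mistake in the deployment id can be rolled back before anything is committed.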

Do this for all orphaned Workflow Services, and the Event Viewer should no longer log any errors for the missing Workflow Services. Once that is done, restart the Workflow Management Service from Services.msc. The WMS should then return to normal behaviour without any memory leak.

Note that the situation is more complicated when you have two or more servers sharing the same AppFabric Persistence Store. You may end up having to sacrifice the other servers to preserve the most critical one, or having to take down the workflows on the other servers, reconfigure those servers to use their own Persistence Store, and bring the workflows up again.

As a guideline, I would recommend the following when using AppFabric Workflow Management Service:

DO THIS


DON'T DO THIS



This problem was found on AppFabric 1.1 for Windows Server with cumulative patches 1 to 4 installed.
