Configure Azure Auto-Healing for your Azure Web Sites

While there are a whole host of great ideas you can apply to monitoring and alerting, one of the key reasons you spend time crafting your operations story is to avoid being interrupted during family time. The auto-healing feature for Azure Web Sites is your family-friendly helper that takes care of minor issues without human intervention.

The story behind this feature is that a rather broad set of problems is solved by simply recycling your worker process. From slow-downs to full-on website-down scenarios, restarting your website will solve the issue quite a lot of the time. This is where Azure’s auto-healing can step in and take action, without disturbing your quality time.

You can enable auto-healing based on a number of factors – and it is as simple as adding a little configuration to your application:

<system.webServer>
    <monitoring>
        <triggers>
            <!-- The cool stuff happens here! -->
        </triggers>
        <actions value="..."/>
    </monitoring>
</system.webServer>

Invalid Child Element

The monitoring element is only really applicable once your application is on Azure, so you may see the error “'system.webServer' has invalid child element 'monitoring'” when you are working locally on this configuration. A common way to avoid this is to keep the element out of your main configuration file and add it as a transformation in a separate file:

<?xml version="1.0" encoding="utf-8"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
    <system.webServer>
        <monitoring xdt:Transform="Insert">
            <triggers>
                <!-- The cool stuff happens here! -->
            </triggers>
            <actions value="..."/>
        </monitoring>
    </system.webServer>
</configuration>

Auto-Healing Triggers

You can trigger auto-healing in a number of different scenarios, based on:

  • The number of requests
  • The number of slow requests
  • The number of requests matching an HTTP status code, or sub-status code
  • The memory usage of a worker process

To implement this effectively, you’ll need to understand what normal looks like for your application. The goal is to eliminate out-of-hours emergencies over time, so overshoot at first, and then gradually bring in the numbers until the phone stops ringing. It may be tempting to just get auto-healing to step in at the drop of a hat, but you don’t want to end up in a situation where you are automatically restarting your application every ten minutes due to false alarms.

Here are the standard examples for you to take a look at…

<system.webServer>
    <monitoring>
        <triggers>
            <!-- Triggers when you have "count" number of requests within "timeInterval" amount of time -->
            <requests count="1000" timeInterval="00:10:00"/>

            <!-- Triggers when you have "count" number of requests that each take longer than "timeTaken" within "timeInterval" amount of time -->
            <slowRequests timeTaken="00:00:45" count="20" timeInterval="00:02:00" />

            <!-- Triggers when your worker process reaches "privateBytesInKB" kilobytes of private bytes (800000 KB is roughly 800 MB) -->
            <memory privateBytesInKB="800000"/>

            <!-- Triggers when you have "count" responses matching the configured status within "timeInterval" amount of time -->
            <statusCode>
                <add statusCode="500" subStatusCode="100" win32StatusCode="0" count="10" timeInterval="00:00:30"/>
            </statusCode>
        </triggers>

        <!-- Performs an overlapping recycle of the worker process when a trigger fires -->
        <actions value="Recycle"/>
    </monitoring>
</system.webServer>
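
To tie the transformation approach to the advice about overshooting at first, here is a minimal sketch of a transform file with deliberately generous thresholds. The file name Web.Release.config and every number in it are assumptions for illustration, not recommended values; base your own starting point on what normal looks like for your site.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Web.Release.config (assumed file name): keeps the monitoring element
     out of the web.config you work with locally. -->
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
    <system.webServer>
        <monitoring xdt:Transform="Insert">
            <triggers>
                <!-- Illustrative starting point only: 50 requests slower than 60 seconds within 5 minutes -->
                <slowRequests timeTaken="00:01:00" count="50" timeInterval="00:05:00" />

                <!-- Illustrative starting point only: worker process reaches 1000000 KB of private bytes -->
                <memory privateBytesInKB="1000000"/>
            </triggers>

            <!-- Recycle the worker process when either trigger fires -->
            <actions value="Recycle"/>
        </monitoring>
    </system.webServer>
</configuration>
```

Once this has run for a while without false alarms, bring the count, timeTaken, and privateBytesInKB values down a little at a time, and add further triggers only when you have a real incident that one of them would have caught.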

Kudu

You can also set up auto-healing in Kudu, by navigating to Kudu -> Tools -> Support, selecting the application you want to configure, and opening the Mitigate tab:

Kudu Auto-Healing Mitigation

You can configure rules for requests, status codes, slow requests, and memory – although you need to select the appropriate heading for each of these before you select “Add new rule”.

Don’t forget to visit the Action heading to select the action you want to take when the triggers are hit.