Dealing with Failure: Failure Escalation Policy in Unmanaged CLR Hosts

May 10, 2022

—

Offensive tooling built upon the .NET framework and its runtime environment, the Common Language Runtime (CLR), is an important part of the red teaming ecosystem. .NET tools offer rapid development times and a low barrier to entry, and are highly extensible through native interoperability. These tools have been and will continue to be used effectively on offensive engagements. Perhaps the cornerstone of continued interest in .NET offensive tooling is the ability to execute .NET assemblies from an unmanaged host using the CLR hosting interfaces, commonly referred to as running an assembly inline. Unfortunately, running arbitrary assemblies inline can be quite dangerous. Let’s see if we can address this problem.

Safer In Process Assembly Execution

By definition, inline assemblies share the same process as their unmanaged host: they run from within the host. For offensive use, sharing the process is the principal motivation for running assemblies inline. However, this approach has known drawbacks. Namely, exceptions originating from within the inline assembly will cause the process to exit unceremoniously if the CLR deems the exception unrecoverable, a decidedly undesirable consequence of sharing process space with arbitrary managed code. Even isolating inline assemblies in individual AppDomains does not help with catastrophic failures.

When a managed exception is thrown and remains unhandled, or cannot be handled by user code, as is the case with particularly nasty exceptions (e.g. StackOverflowException), the default policy for the CLR host is to exit the process. This scenario is not ideal for offensive tooling, especially for an operationally sensitive CLR host such as a long running agent. Fortunately, there are (documented!) ways to configure and extend the failure escalation policy that the CLR uses to determine which actions to take in the event of failures and timeouts.

Before continuing, note that most of this post was facilitated (read: copied nearly verbatim) by the definitive (by virtue of being the sole printed reference available) resource on this subject, Customizing the Microsoft .NET Framework Common Language Runtime, written by Steven Pratschner.

Failure Escalation Policy

As the CLR matured from its basal iterations into the 2.0 release, there was a need to expose functionality for scenarios which require long process lifetimes such as servers and operating system processes. Thus, starting with version 2.0 of the CLR there exists infrastructure that unmanaged hosts can use to remove and therefore isolate exceptional code without affecting the availability of the process itself. First, let’s examine the different types of failures that are exposed to the host, the actions the host may choose to take in response to a failure, and the escalation policy, which directs the execution flow of the aforementioned operations. Then we will implement a custom policy capable of continuing unmanaged code execution after a catastrophic managed exception.

The EClrFailure enumeration describes which types of failures are available to be customized through an escalation policy. I believe the commented fields were introduced with the release of the CoreCLR versions.

typedef enum {  
    FAIL_NonCriticalResource,  
    FAIL_CriticalResource,  
    FAIL_FatalRuntime,  
    FAIL_OrphanedLock,  
    FAIL_StackOverflow  
    // FAIL_AccessViolation  
    // FAIL_CodeContract  
} EClrFailure;

Failure to allocate a resource: A resource typically refers to a thread, block of memory, synchronization primitive, or some other resource managed by the operating system.
Failure to allocate a resource in a critical region: A critical region is defined as any code that might depend on state shared between threads. This is distinguished from the previous failure because a resource that relies on state from other threads cannot be safely cleaned up by terminating only the exceptional thread. The CLR treats any region of code that depends on a synchronization primitive as a critical region.
Orphaned lock: An orphaned lock is an abandoned synchronization primitive that is likely to leave the execution context in an inconsistent state. This can occur when resource allocation fails in code regions that are awaiting a synchronization primitive. Additionally, a thread may be aborted before the synchronization primitive is freed. In both scenarios, the primitive is lost and cannot be freed. This failure is also a resource leak and can eventually lead to resource exhaustion.
Fatal runtime error: If the CLR encounters a fatal internal error and is no longer able to run managed code, the default behavior is to terminate the process, with varying degrees of cleanup. It is possible to override this behavior and continue execution of native code.

The EPolicyAction enumeration describes the actions the host may take when presented with the different types of failures. The CLR provides two flavors of actions: a graceful action and a rude action.

typedef enum {  
    eNoAction,  
    eThrowException,  
    eAbortThread,  
    eRudeAbortThread,  
    eUnloadAppDomain,  
    eRudeUnloadAppDomain,  
    eExitProcess,  
    eFastExitProcess,  
    eRudeExitProcess,  
    eDisableRuntime  
} EPolicyAction;

Graceful Action: Graceful actions attempt to properly free resources by running exception-handling routines and finalizers, freeing associated CLR data structures, and in the case of process exit, finishing processing necessary for a proper shutdown.
Rude Action: Rude actions make no such attempts. The CLR does not guarantee any finalizers are run, with the exception of critical finalizers.

Additionally, the unmanaged host may set timeouts for the specified policy actions to complete. This is especially useful when dealing with unresponsive code, such as an abandoned synchronization primitive or infinite loop. Timeouts are configured by specifying an operation upon which a policy action is taken after a determined interval. The set of operations exposed to the host is documented in the EClrOperation enumeration.

typedef enum {  
    OPR_ThreadAbort,  
    OPR_ThreadRudeAbortInNonCriticalRegion,  
    OPR_ThreadRudeAbortInCriticalRegion,  
    OPR_AppDomainUnload,  
    OPR_AppDomainRudeUnload,  
    OPR_ProcessExit,  
    OPR_FinalizerRun  
} EClrOperation;

Together, the failure types, policy actions, and operations make up the failure escalation policy of a host. It can be put no more succinctly than by Pratschner, “The escalation policy is the host’s expression of how failures in a process should be handled.” Still, it’s easiest to visualize how a custom escalation policy might look when an exception occurs.

The above escalation policy will be expressed in the process as follows:

The CLR first determines if the exceptional code depends on a synchronization primitive, to establish if the exception originates from a non-critical or critical code region.
- If the exception occurs from within a non-critical region, the CLR will attempt to gracefully abort the thread. If the graceful thread abortion times out then the action is escalated to rudely abort the thread. Additionally, if the graceful thread abortion occurs in a critical region, it is escalated to gracefully unload the AppDomain.
- If the exception occurs from a critical region, the policy will escalate to gracefully unload the AppDomain.
If the graceful unloading of the offending AppDomain times out, the action will be escalated to rudely unload the AppDomain.
If rudely unloading the AppDomain times out, the runtime is disabled. No more managed code may be run.
If at any point the CLR encounters a fatal runtime failure, the policy overrides the default action of shutting down the process with disabling the CLR.

Note: Rude thread aborts are not able to be escalated to anything more useful than disabling the runtime. We will come back to this point in a bit.
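The timeout-driven part of this walkthrough can be modelled as a small state machine. The following is only a sketch: the `Action` enum and both function names are assumptions made for illustration (they are not CLR APIs), the action names are redeclared locally so the code builds without mscoree.h, and only escalation on timeout is covered, not the critical-region branch.

```cpp
#include <cassert>
#include <vector>

// Local stand-ins for a few EPolicyAction values, redeclared so the sketch
// compiles without the Windows SDK. These are not the real declarations.
enum class Action {
    AbortThread,
    RudeAbortThread,
    UnloadAppDomain,
    RudeUnloadAppDomain,
    DisableRuntime
};

// Next action when the current one times out, following the walkthrough.
// DisableRuntime is terminal: nothing escalates past it.
Action escalateOnTimeout(Action current) {
    switch (current) {
    case Action::AbortThread:         return Action::RudeAbortThread;
    case Action::UnloadAppDomain:     return Action::RudeUnloadAppDomain;
    case Action::RudeAbortThread:
    case Action::RudeUnloadAppDomain:
    case Action::DisableRuntime:      return Action::DisableRuntime;
    }
    return Action::DisableRuntime;
}

// Full chain of actions starting from `first`, assuming every step times out.
std::vector<Action> escalationChain(Action first) {
    std::vector<Action> chain{first};
    while (chain.back() != Action::DisableRuntime) {
        chain.push_back(escalateOnTimeout(chain.back()));
    }
    return chain;
}
```

Starting from a graceful thread abort or a graceful AppDomain unload, the chain reaches the disabled runtime in at most two timeouts, which is why the process survives even in the worst case.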

Policy Configuration

Now that we know what a failure escalation policy is, at least in the context of CLR hosts, we can dive into customizing a runtime host with the policy pictured in the diagram above. Hopefully, this will remedy some stability issues facing long running native agents.

CLR hosts implement escalation policies using the failure policy manager exposed by the CLR hosting interfaces. The failure policy manager is composed of two interfaces: ICLRPolicyManager and IHostPolicyManager. The ICLRPolicyManager interface is implemented by the CLR and subsequently exposed to the user, whereas the IHostPolicyManager interface is implemented by the host. This design pattern is common throughout the CLR hosting interfaces.

The first step in implementing a custom failure escalation policy is to obtain a pointer to the CLR’s ICLRPolicyManager implementation. The following code shows how to do this.

Note: All following code assumes a pointer to the ICLRRuntimeHost interface has already been obtained, either by calling CorBindToRuntime/CorBindToRuntimeEx or by starting from CLRCreateInstance (ICLRMetaHost::GetRuntime followed by ICLRRuntimeInfo::GetInterface).

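ICLRRuntimeHost* myCustomHost = nullptr;
// myCustomHost is obtained...
// Ask the runtime host for its control interface, then for the policy manager.
ICLRControl* myClrControl = nullptr;
HRESULT hr = myCustomHost->GetCLRControl(&myClrControl);
ICLRPolicyManager* myPolicyManager = nullptr;
if (SUCCEEDED(hr))
{
    hr = myClrControl->GetCLRManager(IID_ICLRPolicyManager, (LPVOID*)&myPolicyManager);
}
// Continue only if SUCCEEDED(hr); myPolicyManager is still null otherwise.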

After obtaining the ICLRPolicyManager implementation, the host sets actions to take on failures.

myPolicyManager->SetActionOnFailure(FAIL_NonCriticalResource, eAbortThread);
myPolicyManager->SetActionOnFailure(FAIL_CriticalResource, eUnloadAppDomain);
myPolicyManager->SetActionOnFailure(FAIL_OrphanedLock, eUnloadAppDomain);
myPolicyManager->SetActionOnFailure(FAIL_StackOverflow, eRudeUnloadAppDomain);
myPolicyManager->SetActionOnFailure(FAIL_FatalRuntime, eDisableRuntime);

Most actions can be taken on most failures, with a few exceptions. A failure associated with an orphaned lock must at least unload the AppDomain gracefully. Similarly, a stack overflow failure demands at least a rude unloading of the AppDomain in which it occurred. When a fatal runtime failure occurs, the only suitable actions are to exit the process or disable the runtime. Some additional cases are not covered here; they can be found in the remarks section of the ICLRPolicyManager::SetActionOnFailure MSDN page.
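A host might want to sanity-check its failure table before handing it to the CLR. The sketch below encodes only the three minimums stated in the paragraph above; the `sketch` namespace and the `meetsStatedMinimum` name are hypothetical, the enumerations are local copies of the ones listed earlier, and the real runtime validation may well be stricter (in particular for fatal runtime failures and for the cases this post does not cover).

```cpp
#include <cassert>

namespace sketch {

// Local copies of the enumerations shown earlier, redeclared so the sketch
// builds without mscoree.h. The comparisons rely on this declaration order.
enum EClrFailure {
    FAIL_NonCriticalResource, FAIL_CriticalResource, FAIL_FatalRuntime,
    FAIL_OrphanedLock, FAIL_StackOverflow
};
enum EPolicyAction {
    eNoAction, eThrowException, eAbortThread, eRudeAbortThread,
    eUnloadAppDomain, eRudeUnloadAppDomain,
    eExitProcess, eFastExitProcess, eRudeExitProcess, eDisableRuntime
};

// Returns true when `action` is at least as severe as the minimum the text
// states for `failure`. Failures without a stated minimum are treated as
// unconstrained, which is an assumption of this sketch, not a CLR guarantee.
bool meetsStatedMinimum(EClrFailure failure, EPolicyAction action) {
    switch (failure) {
    case FAIL_OrphanedLock:  return action >= eUnloadAppDomain;
    case FAIL_StackOverflow: return action >= eRudeUnloadAppDomain;
    case FAIL_FatalRuntime:  return action >= eExitProcess;
    default:                 return true;
    }
}

} // namespace sketch
```

All five pairs passed to SetActionOnFailure in this post satisfy the check, while a pairing such as an orphaned lock with a plain thread abort does not.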

Then, the host sets the timeout period associated with an operation and the subsequent action to take upon exceeding the defined timeout. TIMEOUT below stands for a host-chosen value in milliseconds.

myPolicyManager->SetTimeoutAndAction(OPR_FinalizerRun, TIMEOUT, eAbortThread);
myPolicyManager->SetTimeoutAndAction(OPR_ThreadAbort, TIMEOUT, eRudeAbortThread);
myPolicyManager->SetTimeoutAndAction(OPR_AppDomainUnload, TIMEOUT, eRudeUnloadAppDomain);
myPolicyManager->SetTimeoutAndAction(OPR_ThreadRudeAbortInNonCriticalRegion, TIMEOUT, eDisableRuntime);
myPolicyManager->SetTimeoutAndAction(OPR_ThreadRudeAbortInCriticalRegion, TIMEOUT, eDisableRuntime);
myPolicyManager->SetTimeoutAndAction(OPR_AppDomainRudeUnload, TIMEOUT, eDisableRuntime);

There are some nuances to be aware of. First, the CLR’s default implementation specifies no timeouts (read: infinite) associated with any operation other than OPR_ProcessExit. In this case, if a process does not cleanly exit within 40 seconds, the action is escalated to rudely exit the process. Second, the subset of EClrOperation values upon which a timeout and action can be specified, as described in the MSDN documentation, is inaccurate. Consulting the SSCLI2 source code, we can see the EEPolicy::SetTimeoutAndAction method validates the operation and action by calling the EEPolicy::IsValidActionForTimeout method, shown below. As long as this switch statement is satisfied, the combination of operation and action is valid.

BOOL EEPolicy::IsValidActionForTimeout(EClrOperation operation, EPolicyAction action)
{
    CONTRACTL
    {
        GC_NOTRIGGER;
        NOTHROW;
    }
    CONTRACTL_END;
    
    switch (operation) {
    case OPR_ThreadAbort:
        return action > eAbortThread &&
            action < MaxPolicyAction;
        break;
    case OPR_ThreadRudeAbortInNonCriticalRegion:
    case OPR_ThreadRudeAbortInCriticalRegion:
        return action > eRudeUnloadAppDomain &&
            action < MaxPolicyAction;
        break;
    case OPR_AppDomainUnload:
        return action > eUnloadAppDomain &&
            action < MaxPolicyAction;
        break;
    case OPR_AppDomainRudeUnload:
        return action > eRudeUnloadAppDomain &&
            action < MaxPolicyAction;
        break;
    case OPR_ProcessExit:
        return action > eExitProcess &&
            action < MaxPolicyAction;
        break;
    case OPR_FinalizerRun:
        return action == eNoAction ||
            (action >= eAbortThread &&
             action < MaxPolicyAction);
        break;
    default:
        _ASSERT (!"Do not know valid action for this operation");
        break;
    }
    return FALSE;
}
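To see that the six timeout rules configured earlier pass this check, the switch can be restated in a form that builds without the SSCLI2 sources. This is a sketch under stated assumptions: the `sketch` namespace, `TimeoutRule`, `kConfiguredRules`, and `allConfiguredRulesValid` are hypothetical names, the enumerations are local copies, and `MaxPolicyAction` is modelled as a sentinel one past `eDisableRuntime`.

```cpp
#include <cassert>

namespace sketch {

enum EPolicyAction {
    eNoAction, eThrowException, eAbortThread, eRudeAbortThread,
    eUnloadAppDomain, eRudeUnloadAppDomain,
    eExitProcess, eFastExitProcess, eRudeExitProcess, eDisableRuntime,
    MaxPolicyAction  // sentinel, assumed to sit one past the last real action
};
enum EClrOperation {
    OPR_ThreadAbort, OPR_ThreadRudeAbortInNonCriticalRegion,
    OPR_ThreadRudeAbortInCriticalRegion, OPR_AppDomainUnload,
    OPR_AppDomainRudeUnload, OPR_ProcessExit, OPR_FinalizerRun
};

// Same rule as the SSCLI2 switch: a timeout may only escalate to an action
// strictly more severe than the operation that timed out.
bool isValidActionForTimeout(EClrOperation operation, EPolicyAction action) {
    if (action >= MaxPolicyAction) return false;
    switch (operation) {
    case OPR_ThreadAbort:                        return action > eAbortThread;
    case OPR_ThreadRudeAbortInNonCriticalRegion:
    case OPR_ThreadRudeAbortInCriticalRegion:
    case OPR_AppDomainRudeUnload:                return action > eRudeUnloadAppDomain;
    case OPR_AppDomainUnload:                    return action > eUnloadAppDomain;
    case OPR_ProcessExit:                        return action > eExitProcess;
    case OPR_FinalizerRun:                       return action == eNoAction || action >= eAbortThread;
    }
    return false;
}

struct TimeoutRule { EClrOperation operation; EPolicyAction action; };

// The six operation/action pairs passed to SetTimeoutAndAction in this post.
const TimeoutRule kConfiguredRules[] = {
    {OPR_FinalizerRun,                       eAbortThread},
    {OPR_ThreadAbort,                        eRudeAbortThread},
    {OPR_AppDomainUnload,                    eRudeUnloadAppDomain},
    {OPR_ThreadRudeAbortInNonCriticalRegion, eDisableRuntime},
    {OPR_ThreadRudeAbortInCriticalRegion,    eDisableRuntime},
    {OPR_AppDomainRudeUnload,                eDisableRuntime},
};

bool allConfiguredRulesValid() {
    for (const TimeoutRule& rule : kConfiguredRules) {
        if (!isValidActionForTimeout(rule.operation, rule.action)) return false;
    }
    return true;
}

} // namespace sketch
```

The rude thread abort cases only accept actions above eRudeUnloadAppDomain, that is, the process exit variants or eDisableRuntime. For a host that wants to stay alive, disabling the runtime is the only useful choice there.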

The host now sets default actions to take in response to a given operation. In our case, we only want to override the default action of one operation — OPR_ProcessExit. This way, the host will stop the CLR from shutting down the process.

myPolicyManager->SetDefaultAction(OPR_ProcessExit, eDisableRuntime);

Default actions can only escalate the action taken on failure; they cannot downgrade it. The SSCLI2 implementation contains the complete list of valid failure and action combinations.

Finally, the host must specify that the unhandled exception policy is defined by the host rather than the CLR.

myPolicyManager->SetUnhandledExceptionPolicy(eHostDeterminedPolicy);

The failure escalation policy is now configured and takes effect once the CLR is started.

Receiving Notifications and Host-Implemented Managers

The CLR host may also choose to implement the IHostPolicyManager interface. This allows the host to receive basic notifications resulting from either the default escalation policy or a custom one. Before implementing this interface, let’s take a step back and understand how the CLR host discovers our host-implemented manager.

As previously shown, the CLR host accesses specific CLR-implemented managers by calling ICLRControl::GetCLRManager. The ICLRPolicyManager interface, used above to configure the failure escalation policy, is one of a handful of CLR-implemented managers that may be accessed by the host. Conversely, the CLR hosting interfaces also expose a way for the CLR to discover host-implemented managers. This is accomplished by supplying a host-implemented instance of a class derived from the IHostControl interface to the ICLRRuntimeHost::SetHostControl method. This must be done before the CLR is started. Just as there are a number of CLR-implemented managers, there are many host-implemented managers which can further customize the functionality of the host. One such manager that may be of interest for offensive usage is IHostMemoryManager, which can be used to configure a custom memory manager for the CLR.

To better illustrate how the CLR discovers host-implemented managers, let’s take a look at code which will:

- Implement the IHostPolicyManager interface; this class will receive notifications related to the failure policy
- Create a class which derives from the IHostControl interface
- Register the host-implemented class with the CLR

First, the CLR host creates a class derived from the IHostPolicyManager interface.

#pragma once
#include <mscoree.h>

class CustomHostPolicyManager : public IHostPolicyManager
{
public:
	CustomHostPolicyManager();
	virtual ~CustomHostPolicyManager();
	ULONG STDMETHODCALLTYPE AddRef() override;
	ULONG STDMETHODCALLTYPE Release() override;
	STDMETHODIMP QueryInterface(const IID& iid, void** ppv) override;
	STDMETHODIMP OnDefaultAction(EClrOperation operation, EPolicyAction action) override;
	STDMETHODIMP OnFailure(EClrFailure failure, EPolicyAction action) override;
	STDMETHODIMP OnTimeout(EClrOperation operation, EPolicyAction action) override;
private:
	volatile LONG m_cRef;
};

#include "CustomHostPolicyManager.hpp"

CustomHostPolicyManager::CustomHostPolicyManager() : m_cRef(0) { }
CustomHostPolicyManager::~CustomHostPolicyManager() { }
ULONG CustomHostPolicyManager::AddRef()
{
	return InterlockedIncrement(&m_cRef);
}
ULONG CustomHostPolicyManager::Release()
{
	ULONG ulRef = InterlockedDecrement(&m_cRef);
	
	if (ulRef == 0)
	{
		delete this;
	}
	return ulRef;
}
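HRESULT CustomHostPolicyManager::QueryInterface(const IID& iid, void** ppv)
{
	if (ppv == nullptr)
	{
		return E_POINTER;
	}
	if (iid == IID_IUnknown || iid == IID_IHostPolicyManager)
	{
		*ppv = static_cast<IHostPolicyManager*>(this);
		AddRef();
		return S_OK;
	}
	// COM requires the out-pointer to be cleared when the interface is not supported.
	*ppv = nullptr;
	return E_NOINTERFACE;
}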
HRESULT CustomHostPolicyManager::OnDefaultAction(EClrOperation operation, EPolicyAction action)
{
	return S_OK;
}
HRESULT CustomHostPolicyManager::OnFailure(EClrFailure failure, EPolicyAction action)
{
	return S_OK;
}
HRESULT CustomHostPolicyManager::OnTimeout(EClrOperation operation, EPolicyAction action)
{
	return S_OK;
}

This host-implemented manager receives notifications of the following events: OnDefaultAction, OnFailure, and OnTimeout. The information exposed to the host is not particularly verbose, but nonetheless may be useful for monitoring events related to the failure policy.
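Because the callbacks only carry enumeration values, a host that logs them will probably want readable names. A minimal sketch of such a helper follows; the `sketch` namespace and the `failureName`, `actionName`, and `describeFailure` functions are hypothetical, and the enumerations are local copies so the code builds without mscoree.h. In a real host the same switch statements would operate on the SDK's own types from inside OnFailure.

```cpp
#include <cassert>
#include <string>

namespace sketch {

enum EClrFailure {
    FAIL_NonCriticalResource, FAIL_CriticalResource, FAIL_FatalRuntime,
    FAIL_OrphanedLock, FAIL_StackOverflow
};
enum EPolicyAction {
    eNoAction, eThrowException, eAbortThread, eRudeAbortThread,
    eUnloadAppDomain, eRudeUnloadAppDomain,
    eExitProcess, eFastExitProcess, eRudeExitProcess, eDisableRuntime
};

// Maps a failure value to the identifier used in the enumeration.
const char* failureName(EClrFailure failure) {
    switch (failure) {
    case FAIL_NonCriticalResource: return "FAIL_NonCriticalResource";
    case FAIL_CriticalResource:    return "FAIL_CriticalResource";
    case FAIL_FatalRuntime:        return "FAIL_FatalRuntime";
    case FAIL_OrphanedLock:        return "FAIL_OrphanedLock";
    case FAIL_StackOverflow:       return "FAIL_StackOverflow";
    }
    return "unknown failure";
}

// Maps a policy action value to the identifier used in the enumeration.
const char* actionName(EPolicyAction action) {
    switch (action) {
    case eNoAction:            return "eNoAction";
    case eThrowException:      return "eThrowException";
    case eAbortThread:         return "eAbortThread";
    case eRudeAbortThread:     return "eRudeAbortThread";
    case eUnloadAppDomain:     return "eUnloadAppDomain";
    case eRudeUnloadAppDomain: return "eRudeUnloadAppDomain";
    case eExitProcess:         return "eExitProcess";
    case eFastExitProcess:     return "eFastExitProcess";
    case eRudeExitProcess:     return "eRudeExitProcess";
    case eDisableRuntime:      return "eDisableRuntime";
    }
    return "unknown action";
}

// Builds a one-line description that an OnFailure handler could record.
std::string describeFailure(EClrFailure failure, EPolicyAction action) {
    return std::string(failureName(failure)) + " -> " + actionName(action);
}

} // namespace sketch
```

Any logging done from these callbacks should stay lightweight, since they run while the CLR is already in the middle of handling a failure.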

For the CLR to be notified of the existence of any host-implemented manager, the host must additionally implement the IHostControl interface. The CLR discovers host-implemented managers by calling the IHostControl::GetHostManager method which associates instances of host-implemented managers with their IID.

#pragma once
#include <mscoree.h>

class CustomHostControl : public IHostControl
{
public:
	CustomHostControl();
	virtual ~CustomHostControl();
	STDMETHODIMP QueryInterface(const IID& iid, void** ppv) override;
	ULONG STDMETHODCALLTYPE AddRef() override;
	ULONG STDMETHODCALLTYPE Release() override;
	STDMETHODIMP GetHostManager(REFIID riid, void** ppObject) override;
	STDMETHODIMP SetAppDomainManager(DWORD dwAppDomainID, IUnknown* pUnkAppDomainManager) override { return E_NOTIMPL; }
private:
	volatile LONG m_cRef;
};

#include "CustomHostControl.hpp"
#include "CustomHostPolicyManager.hpp"

CustomHostControl::CustomHostControl() : m_cRef(0) { }
CustomHostControl::~CustomHostControl() { }
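HRESULT CustomHostControl::GetHostManager(REFIID riid, void** ppObject)
{
	if (ppObject == nullptr)
	{
		return E_POINTER;
	}
	if (riid == IID_IHostPolicyManager)
	{
		// The reference count starts at zero, so take a reference on behalf of the CLR,
		// which releases it when it is finished with the manager.
		CustomHostPolicyManager* customHostPolicyManager = new CustomHostPolicyManager();
		customHostPolicyManager->AddRef();
		*ppObject = static_cast<IHostPolicyManager*>(customHostPolicyManager);
		return S_OK;
	}
	
	*ppObject = nullptr;
	return E_NOINTERFACE;
}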
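HRESULT CustomHostControl::QueryInterface(const IID& iid, void** ppv)
{
	if (ppv == nullptr)
	{
		return E_POINTER;
	}
	if (iid == IID_IUnknown || iid == IID_IHostControl)
	{
		*ppv = static_cast<IHostControl*>(this);
		AddRef();
		return S_OK;
	}
	// COM requires the out-pointer to be cleared when the interface is not supported.
	*ppv = nullptr;
	return E_NOINTERFACE;
}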
ULONG CustomHostControl::AddRef()
{
	return InterlockedIncrement(&m_cRef);
}
ULONG CustomHostControl::Release()
{
	ULONG ulRef = InterlockedDecrement(&m_cRef);
	if (ulRef == 0)
	{
		delete this;
	}
	return ulRef;
}
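The lifetime rule behind these classes is the usual COM one: an interface pointer handed out through an out-parameter carries a reference that the receiver is expected to release. A portable model of that convention is sketched below, using std::atomic in place of the Interlocked functions so that it builds without Windows headers. Everything in the `sketch` namespace (`RefCountedManager`, `getManager`, `g_liveManagers`) is hypothetical and exists only to make the object lifetime observable.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

// Counts live objects so the lifetime behaviour can be observed.
std::atomic<int> g_liveManagers{0};

class RefCountedManager {
public:
    // The count starts at 1: whoever receives the new object owns that reference.
    RefCountedManager() : m_refs(1) { ++g_liveManagers; }

    unsigned long AddRef() { return ++m_refs; }

    unsigned long Release() {
        const unsigned long remaining = --m_refs;
        if (remaining == 0) {
            delete this;  // last reference gone
        }
        return remaining;
    }

private:
    // Private destructor: only Release may destroy the object.
    ~RefCountedManager() { --g_liveManagers; }

    std::atomic<unsigned long> m_refs;
};

// Models the shape of GetHostManager: the out-pointer carries one reference
// that the caller (the CLR, in a real host) is expected to release.
bool getManager(void** ppObject) {
    if (ppObject == nullptr) {
        return false;
    }
    *ppObject = new RefCountedManager();
    return true;
}

} // namespace sketch
```

Starting the count at one and starting it at zero followed by an explicit AddRef before the handoff are equivalent; what matters is that the object is never handed out with a count of zero.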

Finally, the host must call ICLRRuntimeHost::SetHostControl before starting the CLR.

ICLRRuntimeHost* myCustomHost = nullptr;
// myCustomHost is obtained...
CustomHostControl* customHostControl = new CustomHostControl();
myCustomHost->SetHostControl(customHostControl);

Parting Thoughts

Implementing a failure escalation policy in unmanaged CLR hosts is a powerful tool for handling the inline execution of arbitrary assemblies. Such customization may help alleviate some operational difficulties associated with process termination due to unhandled exceptions. Additionally, there are a number of other interesting host-implemented managers which have possible offensive uses.

One may also question the usefulness of a process left unterminated yet unable to continue managed code execution. There may be a workaround for this: manually unloading and reloading the CLR. Doing so would be extremely dangerous, unreliable, and undocumented, and would undoubtedly compromise the integrity of the process the host worked so hard to preserve. Nonetheless, it sounds like a worthwhile exercise and area of continued research.

Finally, if you found this post interesting and would like to learn more about customizing the CLR, please check out Customizing the Microsoft .NET Framework Common Language Runtime, by Steven Pratschner.
