winternl

cybersecurity & programming

MemFuck: Bypassing User-Mode Hooks

Preface

Dynamic malware analysis is the preferred way to determine the legitimacy of an application for many AVs/EDRs/MDSs. Unlike static analysis, dynamic analysis can capture and analyze Windows API calls made during the course of execution, which provides far better detection rates than static analysis alone. There are many techniques to capture such calls, perhaps the most popular of which are user-level hooks. These hooks intercept function calls in order to analyze, and potentially alter or block, the requested behavior. Consider the following sequence of API calls in an arbitrary executable:

OpenProcess
VirtualAllocEx
WriteProcessMemory
CreateRemoteThreadEx
QueueUserAPC
NtAlertResumeThread

There’s little ambiguity that the process is up to something of interest from a defensive viewpoint. AVs define malicious behavior based upon such calls or combinations thereof. In the mind of the AV, this specific set of calls indicates code injection, which is most often defined as malicious or unwanted behavior.

User-mode hooks are used in many security products and tools, including AVs and NGAVs, EDRs, sandboxes, anti-cheat, DRM, etc. User-mode hooks are easy to implement, stable, simple, and have minimal performance overhead.

Most user-land hooks are inline hooks, which involve rewriting the first instructions of the target function to redirect control flow to a custom handler. Inside that handler, the parameters are preserved and the handler can decide whether to analyze, execute, or refuse the requested function. This is well documented, so I will assume the reader is at least somewhat familiar with the concept.
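
To make the redirection concrete, here is a minimal sketch of how the five-byte "jmp rel32" patch commonly used for 32-bit inline hooks is encoded. The helper name encode_jmp_rel32 and its parameters are hypothetical and not taken from any particular product. Real hooking engines also copy the overwritten instructions into a trampoline so the original function can still be called, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes "jmp rel32" (opcode 0xE9) to be written at
   address `from`, so that execution continues at address `to`.
   The displacement is relative to the end of the 5-byte instruction. */
void encode_jmp_rel32(uint32_t from, uint32_t to, uint8_t out[5])
{
    uint32_t rel;

    assert(out != NULL);

    /* Unsigned arithmetic wraps, so backward jumps are encoded correctly. */
    rel = to - (from + 5u);

    out[0] = 0xE9;                          /* jmp rel32 opcode */
    out[1] = (uint8_t)(rel & 0xFF);         /* displacement, little-endian */
    out[2] = (uint8_t)((rel >> 8) & 0xFF);
    out[3] = (uint8_t)((rel >> 16) & 0xFF);
    out[4] = (uint8_t)((rel >> 24) & 0xFF);
}
```

A hook written this way is exactly what the rest of this post tries to sidestep: the patch lives in the function prologue, so any call that never reaches that prologue never reaches the handler.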

32-bit Hooks

32-bit user-mode security products usually hook at the deepest possible location, which is usually ntdll for most suspect functions (e.g. NtQueueApcThread). Solution? Use system calls to directly invoke whatever functionality you want. Done, you bypassed all the security products with Ring3 hooks.

Let’s now take a look at a much more common case — 32-bit applications running under WoW64.

32-bit Hooks on WoW64

By far the most common instance of malware in-the-wild is a 32-bit program running under WoW64 (i.e. on a 64-bit machine). NGAVs and EDRs seem to be especially lazy when implementing hooking under this scenario. Most security products place their hooks only in the x86 user-mode space. Full-fledged AVs which utilize user-mode hooking (as opposed to kernel-mode) tend to be better at placing their hooks somewhere in the WoW64 layer, but by no means do all vendors implement it. See MDSec’s post about bypassing Sophos.

The fact that many security products simply do not monitor WoW64 execution in 32-bit processes has long been known and exploited by malware authors and red-teamers alike. There is a significant amount of malware containing rewolf’s wow64ext used to abuse this fact.

Recap: Existing Techniques

All of these are great and work very well. However, they can all be stopped with a hook placed within the 64-bit version of ntdll. I have not seen any public code for hooks at this level, nor any AVs that implement them. I’m sure there may be some; I just haven’t seen them.

Enter MemFuck

MemFuck is meant to be a PoC only and does not resemble production code. That being said, I think it’s about as powerful an anti-analysis technique as is achievable in user-land.

MemFuck initially started out as an experiment with different anti-analysis methods. Essentially, I wanted to create as empty a process as possible, so that no security product could attach without significant modification to its DLL/shellcode. I came across this post from 2008, which I found incredibly interesting. Of course, a lot has changed since 32-bit Windows XP, so a lot of the code doesn’t translate. However, the concept is mostly there.

That was kind of the question I began asking myself… who needs ntdll? I mean, for sure, everything of consequence relies on this very special DLL so… what happens when we really mess things up?

Unmap… Everything!

MemFuck begins by manually unmapping everything it can in the 32-bit address space. Of course, there are some things that cannot be freed, like PEB/PEB64, TEB/TEB64, and KUSER_SHARED_DATA (hey, looks like it finally got documented last year!). There were a couple of ways to go about this but, naturally, I wanted to choose the least-lame way possible. We could allocate some x86 shellcode which calls NtUnmapViewOfSection on everything… but what happens when we try to unmap ntdll like that? Ntdll cannot unmap its own code, so this approach will not work. The next step I took was to examine the possibility of invoking direct syscalls by utilizing code segment switching (Heaven’s Gate). While it is very much possible to execute 64-bit syscalls from 32-bit address space (see existing techniques), this solution was not ideal for a few reasons:

  • We would still have code mapped and executing in 32-bit user address space, easily available for analysis.
  • The 64-bit address space is still very much intact.
  • This technique is effective for bypassing user-mode hooks, but is already documented!

I think that most, if not all, AV/EDR vendors which utilize Ring3 hooks in their products make a single universal assumption: a 32-bit process executing on WoW64 will only have user-defined code below the 4GB limit. So, let’s try to allocate some memory above this limit, where we can place more unmapping code and continue with functionality. According to Alex Ionescu, this shouldn’t be possible. However, after talking with Petr Beneš, he thinks that in recent versions of Windows 10 this restriction has been lifted. If anyone knows why, please drop a comment! I am testing this on Windows 10 Build 19041.508.

I am using rewolf’s wow64ext helper to call the 64-bit version of NtAllocateVirtualMemory. When I first attempted this call, I was met with a confusing result. I passed a BaseAddress of NULL, meaning that the operating system determines where to allocate the memory. Of course, Windows probably doesn’t want us to allocate memory where we shouldn’t be, and the 64-bit call returned memory well within the 32-bit address space. Hmmm. How about messing with the ZeroBits parameter? I found this Stack Overflow post explaining in depth how to request the highest possible address by manipulating ZeroBits. I tried adjusting ZeroBits and adding the MEM_TOP_DOWN flag to my 64-bit NtAllocateVirtualMemory request. Once again, I was met with a 32-bit address, albeit slightly higher in memory. At this point I scrapped the idea of manipulating the parameters: let’s just request the address we want! So I set BaseAddress to a 64-bit address that was out of the way (there’s a lot of memory and only 3 DLLs) and, surprisingly, it accepted that just fine.

64-bit memory page allocated from 32-bit space

Naturally, the very first thing we should do is write some shellcode there to see if any weird behavior is going on. I quickly and easily wrote some 64-bit shellcode using my own ShellcodeStdio. This shellcode, allocated and written in 64-bit space, simply tries to unmap everything in 32-bit user-mode.

Some pseudo-code for the first attempt is as follows:

DEFINE_FUNC_PTR("ntdll.dll", NtUnmapViewOfSection);
DEFINE_FUNC_PTR("ntdll.dll", NtProtectVirtualMemory);

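// Walk the lower 2GB of the address space one page (0x1000 bytes) at a time.
for (DWORD m = 0; m < 0x80000000; m += 0x1000)
{
    PVOID ptrToProtect = (PVOID)(ULONG_PTR)m;
    SIZE_T bytesToProtect = 1; // RegionSize is a PSIZE_T (8 bytes wide in 64-bit code)
    ULONG dwOldProt = 0;
    // Failures are ignored; not every page belongs to a mapped view.
    NtProtectVirtualMemory((HANDLE)-1, &ptrToProtect, &bytesToProtect, PAGE_READWRITE, &dwOldProt);
    NtUnmapViewOfSection((HANDLE)-1, (PVOID)(ULONG_PTR)m);
}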

Yeah… that process looks fucked! But we’re still executing code and operating just fine. Turns out this is a pretty comfortable situation to be in for the purposes of anti-analysis. Essentially, we have complete control of the 32-bit address space now and can securely load and unload anything we want to, with the reverse engineer none the wiser. Many debuggers, such as OllyDbg and x64dbg, will simply crash here; I had to switch to WinDbg to continue my analysis.

An Interesting Intermission

At this point I was excited. I had more or less succeeded in my goal of completely wiping out the 32-bit address space of a process, while still having code executing and functioning as intended. There are lots of possibilities I saw here, with both offensive and defensive implications. Perhaps one of the most interesting revelations I had during this period was when I stumbled across this MSDN page.

WOW64 uses native x64, ia64, or ARM64 exceptions as a transport for x86 exceptions.

Therefore, in a 32-bit application running under WOW64, uncaught exceptions behave like native 64-bit exceptions.

I did NOT know that — how interesting!

In other words, I can write 64-bit shellcode to an address above the 4GB boundary that installs a 64-bit Vectored Exception Handler. A 32-bit exception then triggers it, and control flow is redirected to that 64-bit VEH.

64-bit VEH triggered by 32-bit exception

After toying around with this for a while, I am pretty sure the 32-bit ntdll has to be loaded for this to happen correctly. Even though triggering an exception with everything unmapped is pretty easy, there’s no code in place to transition to the WoW64 layer for it to be handled. However, there seems to be a lot of interesting potential with this idea!

Ntdll No More

Coming back to the original concept of no longer relying on ntdll for any purpose whatsoever, we’re left with the task of unmapping the three remaining DLLs in the WoW64 layer. The 64-bit side of a WoW64 process looks pretty much the same on most systems: wow64.dll, wow64win.dll, and ntdll.dll.

WoW64 address space

Well, since I don’t think we’ll be going back to 32-bit code let’s just go ahead and unmap those “extra” dlls which we don’t need anymore.

DWORD64 addrWoW64 = 0;
DWORD64 addrWoW64Win = 0;
DWORD64 addrNtdll = 0;
PPEB peb64 = getPEB();
LIST_ENTRY* first = peb64->Ldr->InMemoryOrderModuleList.Flink;
LIST_ENTRY* ptr = first;
int cntr = 0;

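// Modules are identified by their position in the list
// (1 = ntdll, 2 = wow64, 3 = wow64win), which assumes the usual layout.
do
{
    LDR_DATA_TABLE_ENTRY* dte = getDataTableEntry(ptr);
    ptr = ptr->Flink;
    if (cntr == 1) {
        addrNtdll = (DWORD64)dte->DllBase;
    }
    else if (cntr == 2) {
        addrWoW64 = (DWORD64)dte->DllBase;
    }
    else if (cntr == 3) {
        addrWoW64Win = (DWORD64)dte->DllBase;
    }
    cntr++;
} while (ptr != first);

// (HANDLE)-1 is the pseudo-handle for the current process.
NtUnmapViewOfSection((HANDLE)-1, (PVOID)addrWoW64);
NtUnmapViewOfSection((HANDLE)-1, (PVOID)addrWoW64Win);

WoW64 with… no WoW64?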
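
Picking modules by their position in the list is fragile if the order ever differs. A more defensive variant would compare each entry’s module name instead. The sketch below is only the comparison routine, written without any C runtime calls on the assumption that none is available inside the shellcode. The name module_name_equals is hypothetical. It also assumes the caller passes the name length in UTF-16 code units (a UNICODE_STRING stores its Length in bytes, so that value would be halved first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lowercases an ASCII letter held in a UTF-16 code unit; other values pass through. */
static uint16_t lower_ascii(uint16_t c)
{
    return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: returns 1 if the UTF-16 name (`wlen` code units, not
   necessarily NUL-terminated) equals the NUL-terminated ASCII string `ascii`,
   ignoring ASCII case; returns 0 otherwise. */
int module_name_equals(const uint16_t *wname, size_t wlen, const char *ascii)
{
    size_t i;

    assert(wname != NULL && ascii != NULL);

    for (i = 0; i < wlen; i++) {
        if (ascii[i] == '\0')
            return 0; /* the ASCII string is shorter than the module name */
        if (lower_ascii(wname[i]) != lower_ascii((uint16_t)(unsigned char)ascii[i]))
            return 0;
    }
    return ascii[i] == '\0'; /* equal only if both strings end together */
}
```

With such a helper, the loop above could test each entry against "wow64.dll" and "wow64win.dll" rather than relying on the counter.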

Well, that is fun! There are certainly no more hooks present in the 32-bit address space, and any hooks that reside in the WoW64 DLLs (e.g. Wow64SystemServiceEx) are now gone as well. All that’s left to do is unmap ntdll. Once again, we are met with the problem of calling NtUnmapViewOfSection on its parent module! What else to do but transition to direct system calls!

Due to some restrictions with the 64-bit MSVC compiler not allowing inline assembly, limitations with function ordering, and internal code placement, I opted to allocate a new memory page in which to write the system call stub. Syscall stubs on Windows 10 64-bit look like this.

mov r10, rcx
mov eax, xxh
syscall
retn

Easy enough. Just call NtAllocateVirtualMemory and then write the code in DWORD by DWORD. Then assign a typedef prototype and voila! You can pretty easily call your Nt* function of choice!

// Stub bytes, packed into little-endian DWORDs:
//   mov r10, rcx ; 0x4c 0x8b 0xd1
//   mov eax, xxh ; 0xb8 xx 00 00 00
//   syscall      ; 0x0f 0x05
//   retn         ; 0xc3
//   nop          ; 0x90 (padding)
DWORD dwCode1 = 0xb8d18b4c; // mov r10, rcx, then the mov eax opcode
DWORD dwCode2 = 0x0000002a; // syscall code for NtUnmapViewOfSection
DWORD dwCode3 = 0x90c3050f; // syscall; retn; nop
*(DWORD*)syscallbase = dwCode1;
*((DWORD*)syscallbase + 1) = dwCode2;
*((DWORD*)syscallbase + 2) = dwCode3;
p_SysUnmapViewOfSection sysUnmap = (p_SysUnmapViewOfSection)syscallbase;
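
The same stub can also be assembled byte by byte, which makes the mapping between the instructions and the three DWORDs above easier to check. This is a minimal sketch: the helper name build_syscall_stub is hypothetical, and the syscall number is a parameter because these numbers differ between Windows builds. It only fills a buffer; copying the bytes into executable memory and calling them is left to the surrounding code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYSCALL_STUB_SIZE 12

/* Hypothetical helper: fills `out` with the 12-byte x64 syscall stub for the
   given system call number (the caller must supply the right number for the
   running Windows build). */
void build_syscall_stub(uint32_t syscall_number, uint8_t out[SYSCALL_STUB_SIZE])
{
    assert(out != NULL);

    out[0] = 0x4C; out[1] = 0x8B; out[2] = 0xD1;        /* mov r10, rcx */
    out[3] = 0xB8;                                      /* mov eax, imm32 */
    out[4] = (uint8_t)(syscall_number & 0xFF);          /* imm32, little-endian */
    out[5] = (uint8_t)((syscall_number >> 8) & 0xFF);
    out[6] = (uint8_t)((syscall_number >> 16) & 0xFF);
    out[7] = (uint8_t)((syscall_number >> 24) & 0xFF);
    out[8] = 0x0F; out[9] = 0x05;                       /* syscall */
    out[10] = 0xC3;                                     /* retn */
    out[11] = 0x90;                                     /* nop (padding) */
}
```

Called with 0x2a, this produces the same twelve bytes as the three DWORD writes above.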

And finally, after a direct system call to NtUnmapViewOfSection is made against the 64-bit ntdll, we have nothing else in our process. We have nothing left in user-mode that any AV could try to hook. More than likely, whatever DLL was injected has long since been unmapped.

No ntdll (32 and 64)!

From here, it’s a matter of implementing whatever code injection or otherwise detected code you have directly via syscalls and without the use of any Rtl* functions. It’s a dangerous environment to execute code in and things can get pretty claustrophobic quickly. I know this is less than practical in many cases, but it was a rabbit hole that I had to go down. Along the way I’ve learned a lot about WoW64 and how it behaves, its limitations and quirks, and have some new areas to explore such as 32 <-> 64 VEH discovery!

Source Code

As always, the source code for a demo project is available on my GitHub. Inside you will find two projects. The first is a 32-bit application which relies on rewolf’s wow64ext for allocating and writing shellcode above the 4GB boundary. This application then transitions execution to the 64-bit shellcode found in the second project. This shellcode will unmap everything possible in the 32-bit address space and then also unmap the two WoW64-related DLLs in the 64-bit address space. It will then transition to direct system calls to unmap the 64-bit version of ntdll. Finally, it will execute a debugbreak, allowing WinDbg to pause.

I do realize this is a ‘carpet-bombing’ approach to unhooking, but I thought it was too cool to not explore.