Preface
The cat-and-mouse game of the AV industry lives on into another decade. All AVs operate on essentially three modes of detection:
- Static Detection
- Dynamic Detection
- Reputation Detection
This post provides a high-level overview of how AVs statically identify anomalous and heuristically incorrect PE files. These types of detections are often labeled with a vague descriptor such as “Gen.Variant.XXX” or “Gen:Heur.XXXX”. NGAVs, which are powered by machine learning, extract the same features (ML jargon for the values obtained by parsing the PE) and feed them to their models as input vectors.
NGAVs detect patterns in certain characteristics of the PE file. Learning which fields a specific model is most sensitive to can aid in evading that model. While conceptually very simple, this can be thought of as a form of adversarial machine learning, wherein we control certain features to achieve a desired result (AV evasion).
PE File Structure
I assume that if you’ve made it this far, you know what a PE file is and what it looks like, but if not, here is a quick picture to remind you. Also, anything that is integral to the validity of the PE file, mainly the DOS and NT signatures, will be skipped.
Okay, great, let’s start from the beginning and see the most common offenders for triggering heuristic detections.
IMAGE_DOS_HEADER
typedef struct _IMAGE_DOS_HEADER {
    WORD e_magic;
    WORD e_cblp;
    WORD e_cp;
    WORD e_crlc;
    WORD e_cparhdr;
    WORD e_minalloc;
    WORD e_maxalloc;
    WORD e_ss;
    WORD e_sp;
    WORD e_csum;
    WORD e_ip;
    WORD e_cs;
    WORD e_lfarlc;
    WORD e_ovno;
    WORD e_res[4];
    WORD e_oemid;
    WORD e_oeminfo;
    WORD e_res2[10];
    LONG e_lfanew;
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;
The IMAGE_DOS_HEADER structure is the very first part of the PE file format. It is pretty much only there for backwards-compatibility reasons and to provide an offset to where the useful information begins.
e_lfanew
This is the offset to the structure IMAGE_NT_HEADERS and must be a multiple of 8.
Rich Signature
This signature is undocumented by Microsoft, though there are a couple of good sources on its history and demystification. A limited number of AVs rely on information in, or signatures derived from, this structure, making it a sneaky but often inconsistent indicator of compromise. Rich signatures are only present in executables built with Microsoft tools, meaning PEs compiled and built with anything other than MSVC (e.g. Borland, gcc) will not have one. The signature embeds version information about the tools used to build the executable, which ends up being largely unhelpful in determining whether a PE is heuristically correct. Overall, it is not a good indicator of compromise on its own, though it may be possible to cross-reference it with other values in the optional header. It can almost always be disregarded, but it is important to be aware of.
IMAGE_FILE_HEADER
typedef struct _IMAGE_FILE_HEADER {
    WORD Machine;
    WORD NumberOfSections;
    DWORD TimeDateStamp;
    DWORD PointerToSymbolTable;
    DWORD NumberOfSymbols;
    WORD SizeOfOptionalHeader;
    WORD Characteristics;
} IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;
Machine
For Intel 386 and AMD64, the magic values are 0x014c and 0x8664, respectively.
NumberOfSections
For heuristics not to be triggered, it is important to have a relatively standard number of sections for the executable. AVs tend to detect extraneous or non-standard sections as they are often a sign of an old-school viral file infection.
TimeDateStamp
This is a UNIX timestamp (i.e. seconds since Jan 1 1970 UTC) giving the date the PE file was built. A heuristically sound value is a timestamp within roughly the past 10 years. It is very important to ensure there are no anachronisms in the PE file. For example, if you import a function that only became available in Windows 8 but date the file as built before the Windows 8 release, the file is heuristically incorrect.
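As a rough illustration of the timestamp heuristic, here is a minimal C sketch. The function name, the approximate year length, and the decision to reject timestamps in the future are all assumptions made for the example; no particular AV is known to use exactly this rule.

```c
#include <stdint.h>

/* Approximate length of a year in seconds (365.25 days). */
#define SECONDS_PER_YEAR 31557600u

/*
 * Hypothetical check: returns 1 if time_date_stamp is not later than
 * `now` and is at most max_age_years old, otherwise 0.
 * Both values are seconds since Jan 1 1970 UTC.
 */
int timestamp_is_plausible(uint32_t time_date_stamp, uint32_t now,
                           uint32_t max_age_years)
{
    uint64_t max_age = (uint64_t)max_age_years * SECONDS_PER_YEAR;

    if (time_date_stamp > now)
        return 0;                         /* built "in the future" */
    return (uint64_t)(now - time_date_stamp) <= max_age;
}
```

A check like this only covers the age of the file. Catching the import anachronism described above would additionally need a table of when each API was introduced.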
Characteristics
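This field is a combination of pre-defined flags bitwise OR’d together. For a 32-bit executable file, a value of 0x10F or 0x102 is hard to identify as anything other than normal. 0x102 is IMAGE_FILE_EXECUTABLE_IMAGE | IMAGE_FILE_32BIT_MACHINE, and 0x10F additionally sets IMAGE_FILE_RELOCS_STRIPPED, IMAGE_FILE_LINE_NUMS_STRIPPED, and IMAGE_FILE_LOCAL_SYMS_STRIPPED.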
IMAGE_OPTIONAL_HEADER
typedef struct _IMAGE_OPTIONAL_HEADER {
    WORD Magic;
    BYTE MajorLinkerVersion;
    BYTE MinorLinkerVersion;
    DWORD SizeOfCode;
    DWORD SizeOfInitializedData;
    DWORD SizeOfUninitializedData;
    DWORD AddressOfEntryPoint;
    DWORD BaseOfCode;
    DWORD BaseOfData;
    DWORD ImageBase;
    DWORD SectionAlignment;
    DWORD FileAlignment;
    WORD MajorOperatingSystemVersion;
    WORD MinorOperatingSystemVersion;
    WORD MajorImageVersion;
    WORD MinorImageVersion;
    WORD MajorSubsystemVersion;
    WORD MinorSubsystemVersion;
    DWORD Win32VersionValue;
    DWORD SizeOfImage;
    DWORD SizeOfHeaders;
    DWORD CheckSum;
    WORD Subsystem;
    WORD DllCharacteristics;
    DWORD SizeOfStackReserve;
    DWORD SizeOfStackCommit;
    DWORD SizeOfHeapReserve;
    DWORD SizeOfHeapCommit;
    DWORD LoaderFlags;
    DWORD NumberOfRvaAndSizes;
    IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES];
} IMAGE_OPTIONAL_HEADER32, *PIMAGE_OPTIONAL_HEADER32;
This structure contains most of the useful information about a PE file. Many of its fields could be used to establish a heuristic signature indicative of a malicious PE file.
Major/Minor LinkerVersion
These values contain version information about the linker used to build the PE file. I don’t think many people are using build tools older than MSVC 6, and developers are definitely not creating executables with linker versions that aren’t out yet, so check the most recent Microsoft linker version and use that as the upper bound. Heuristic anachronisms may occur when cross-referencing other fields such as the Rich Signature or TimeDateStamp, so be aware of such interactions.
SizeOfCode, SizeOfInitializedData, and SizeOfUninitializedData
These three fields are calculated using a very specific formula involving fields from the section headers, which are described later on. An incorrect calculation of these fields is very likely to be flagged by AV.
Pseudo-code for proper calculation is as follows.
#define UPWARDS_ALIGN( VALUE, ALIGNMENT ) (((VALUE) + ((ALIGNMENT) - 1)) & ~((ALIGNMENT) - 1))
/* Sum the file-aligned virtual size of every section, per content type. */
foreach ( section : pe_hdrs.sections )
    if( section.Characteristics & IMAGE_SCN_CNT_CODE )
        SizeOfCode += UPWARDS_ALIGN( section.Misc.VirtualSize, pe_hdrs.FileAlignment )
    if( section.Characteristics & IMAGE_SCN_CNT_INITIALIZED_DATA )
        SizeOfInitializedData += UPWARDS_ALIGN( section.Misc.VirtualSize, pe_hdrs.FileAlignment )
    if( section.Characteristics & IMAGE_SCN_CNT_UNINITIALIZED_DATA )
        SizeOfUninitializedData += UPWARDS_ALIGN( section.Misc.VirtualSize, pe_hdrs.FileAlignment )
AddressOfEntryPoint
This field contains the RVA (relative virtual address, i.e. an offset relative to the image base) of the entry point. While this field can be 0, AV heuristics will most certainly flag a null entry point with a generic detection. Ideally, AddressOfEntryPoint should be an RVA pointing to a valid instruction in a section with executable permissions. Heuristically incorrect values include those that fall outside the VirtualSize of the code section.
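The entry point rule can be sketched in C as follows. The struct is a simplified stand-in for the handful of IMAGE_SECTION_HEADER fields involved, and the struct and function names are made up for the example. It only checks that the RVA lands inside an executable section; it does not verify that a valid instruction lives there.

```c
#include <stddef.h>
#include <stdint.h>

#define SCN_MEM_EXECUTE 0x20000000u   /* value of IMAGE_SCN_MEM_EXECUTE */

/* Simplified stand-in for the IMAGE_SECTION_HEADER fields used here. */
struct section_info {
    uint32_t virtual_address;   /* VirtualAddress */
    uint32_t virtual_size;      /* Misc.VirtualSize */
    uint32_t characteristics;   /* Characteristics */
};

/*
 * Returns 1 if entry_rva is non-zero and lies inside the virtual range
 * of a section flagged executable, otherwise 0.
 */
int entry_point_is_plausible(uint32_t entry_rva,
                             const struct section_info *sections,
                             size_t count)
{
    if (entry_rva == 0)
        return 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t start = sections[i].virtual_address;
        uint64_t end = start + sections[i].virtual_size;   /* exclusive */
        if ((sections[i].characteristics & SCN_MEM_EXECUTE) &&
            entry_rva >= start && entry_rva < end)
            return 1;
    }
    return 0;
}
```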
BaseOfCode, BaseOfData
BaseOfCode should only ever contain the VirtualAddress of the first executable section (i.e. the code section). Heuristically speaking, the code section should always be the first section, and its VirtualAddress should therefore be aligned to the SectionAlignment field. In other words, BaseOfCode can reasonably be expected to equal the value of the SectionAlignment field. If it does not, that is a good sign something dubious is going on.
BaseOfData is so unimportant that the PE32+ (64-bit) optional header got rid of it. Where present, it is the address, relative to the image base, of the beginning of the data section when it is loaded into memory; in other words, the VirtualAddress of the first data section. This is the only acceptable value for heuristically correct PE files.
ImageBase
In my experience, this field often takes the value 0x00400000, and AVs generally need good reasoning for any alternate value. There are some caveats, such as a dynamic image base. If you have a non-standard value, AVs tend to require a valid relocation table. I think it is up for debate whether this field provides any meaningful insight into the heuristics of the PE file.
SectionAlignment, FileAlignment
These are big players in making sure heuristics are correct, as many other field values are calculated from these two fields. There are some standard values: usually SectionAlignment is 0x1000 and FileAlignment is 0x200.
There are some heuristic gotchas here. First, the SectionAlignment value must be greater than or equal to the FileAlignment field. The SectionAlignment value is by default the page size for the given architecture. The value of FileAlignment should be a power of 2 between 512 and 64 K, inclusive. If the SectionAlignment is less than the architecture’s page size, then FileAlignment must match SectionAlignment.
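The rules above translate fairly directly into a small C check. This is a sketch of the rules as written in this post, with a 4 KiB page size assumed (the usual value on x86 and x64); the function names are made up for the example.

```c
#include <stdint.h>

#define PAGE_SIZE_X86 0x1000u   /* assumed 4 KiB page size */

/* Returns 1 if v is a non-zero power of two. */
int is_power_of_two(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Returns 1 if the pair satisfies the rules listed above, otherwise 0. */
int alignments_are_plausible(uint32_t section_alignment,
                             uint32_t file_alignment)
{
    /* FileAlignment: a power of 2 between 512 and 64 K, inclusive. */
    if (!is_power_of_two(file_alignment) ||
        file_alignment < 512u || file_alignment > 65536u)
        return 0;
    /* SectionAlignment must be greater than or equal to FileAlignment. */
    if (section_alignment < file_alignment)
        return 0;
    /* Below the page size, the two alignments must match. */
    if (section_alignment < PAGE_SIZE_X86 &&
        file_alignment != section_alignment)
        return 0;
    return 1;
}
```

The standard pair (0x1000, 0x200) passes, while something like (0x800, 0x200) fails the last rule.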
Here’s the alignment pseudocode. These alignments are used in calculating values ranging from SizeOfImage to most section header fields.
#define UPWARDS_ALIGN( VALUE, ALIGNMENT ) (((VALUE) + ((ALIGNMENT) - 1)) & ~((ALIGNMENT) - 1))
#define DOWNWARDS_ALIGN( VALUE, ALIGNMENT ) ((VALUE) & ~((ALIGNMENT) - 1))
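For a concrete feel of what the two alignment operations do, here they are as plain C functions. The function names are just for this example, and both assume the alignment is a power of two.

```c
#include <stdint.h>

/* Rounds value up to the next multiple of alignment (a power of two). */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + (alignment - 1)) & ~(alignment - 1);
}

/* Rounds value down to the previous multiple of alignment. */
uint32_t align_down(uint32_t value, uint32_t alignment)
{
    return value & ~(alignment - 1);
}
```

With a FileAlignment of 0x200, align_up(0x1234, 0x200) gives 0x1400 and align_down(0x1234, 0x200) gives 0x1200. With a SectionAlignment of 0x1000, align_up(0x1234, 0x1000) gives 0x2000. Values that are already aligned are left unchanged.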
Major/Minor OS, Image, SubSystem
In my opinion, these fields carry no heuristic value, but many AVs would disagree with that. So, let’s discuss why I think a limited number of AVs have used these as indicators. I think everyone can agree that on their own, they are not an indicator of compromise. These values indicate the minimum required versions (e.g. Windows NT4) of the operating system and subsystem, along with the image version, which leaves some room for inconsistencies that can serve as heuristic indicators. Most of the time these are set correctly by the build tools, but sometimes, as is often the case with malicious actors, there is some degree of post-build modification to the PE files. In these cases, I’ve seen packers deliberately change such values as a way to avoid fingerprinting (or who knows why?). Non-standard values here can be cross-referenced with compiler versions, timestamps, etc. to check for things that may not quite add up. Final verdict: still not accurate enough to be useful, so disregard.
SizeOfImage
The size (in bytes) of the image, including all headers, as the image is loaded in memory. If this field isn’t calculated correctly, this could be a good indication of packing or other anti-RE techniques. This value must also be a multiple of the SectionAlignment.
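One way to sanity-check this field is to recompute it from the section headers. The sketch below assumes the usual layout, where the image ends at the end of the highest-mapped section rounded up to SectionAlignment. The struct is a simplified stand-in for IMAGE_SECTION_HEADER, and all names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the IMAGE_SECTION_HEADER fields used here. */
struct section_extent {
    uint32_t virtual_address;   /* VirtualAddress */
    uint32_t virtual_size;      /* Misc.VirtualSize */
};

/* Rounds value up to the next multiple of alignment (a power of two). */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + (alignment - 1)) & ~(alignment - 1);
}

/*
 * Expected SizeOfImage: the end of the highest-mapped section (or of the
 * headers, if there are no sections), rounded up to SectionAlignment.
 */
uint32_t expected_size_of_image(const struct section_extent *sections,
                                size_t count, uint32_t size_of_headers,
                                uint32_t section_alignment)
{
    uint32_t end = size_of_headers;

    for (size_t i = 0; i < count; i++) {
        uint32_t section_end = sections[i].virtual_address +
                               sections[i].virtual_size;
        if (section_end > end)
            end = section_end;
    }
    return align_up(end, section_alignment);
}
```

For example, if the last section sits at RVA 0x3000 with a VirtualSize of 0x10 and SectionAlignment is 0x1000, the expected SizeOfImage is 0x4000. A header value that differs from the recomputed one is the kind of mismatch described above.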
CheckSum
For many heuristic engines, this value is not useful. The checksum is exactly that, a checksum of the executable file. ImageHlp.dll exports a function, CheckSumMappedFile, which can be used to calculate the checksum of a memory-mapped image. Many engines do not verify that this value is correctly calculated, and oftentimes it is simply zero.
IMAGE_DATA_DIRECTORY
typedef struct _IMAGE_DATA_DIRECTORY {
    DWORD VirtualAddress;
    DWORD Size;
} IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;
The final part of the IMAGE_OPTIONAL_HEADER is an array of IMAGE_DATA_DIRECTORY structures. The length of the array is stored in NumberOfRvaAndSizes, which is often equal to 0x10. Different versions of MSVC will produce different default data directory entries. For example, a simple console application, statically linked and built with MSVC 6, will have 3 data directory entries by default: import, IAT, and resource. Building with features such as a dynamic image base, exceptions, or TLS entries will add their respective data directory entries. Since these values simply point to the actual entry information, they are not valuable on their own in an NGAV detection model.
IMAGE_SECTION_HEADER
typedef struct _IMAGE_SECTION_HEADER {
    BYTE Name[IMAGE_SIZEOF_SHORT_NAME];
    union {
        DWORD PhysicalAddress;
        DWORD VirtualSize;
    } Misc;
    DWORD VirtualAddress;
    DWORD SizeOfRawData;
    DWORD PointerToRawData;
    DWORD PointerToRelocations;
    DWORD PointerToLinenumbers;
    WORD NumberOfRelocations;
    WORD NumberOfLinenumbers;
    DWORD Characteristics;
} IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;
This is the meat and potatoes of PE format. Most of the non-superficial (e.g. code analysis, import analysis, etc.) detection techniques work by analyzing the contents of the PE sections. The section header list is an array of IMAGE_SECTION_HEADER structures, with size dictated by the NumberOfSections field in the IMAGE_FILE_HEADER.
Name
Depending on the model, some NGAVs really don’t appreciate creativity when it comes to section names. In order to remain as innocuous as possible, it is recommended to use conventional section names: .text, .rdata, .data, .rsrc, .tls, and .reloc, each for its respective section. Borland-style section naming conventions, with no period prefix and non-MSVC names, can trigger detections in some models. While absolutely not an indication of malicious behavior, section names which are random, null, or otherwise intentionally opaque are almost always flagged by NGAV models. These are often indicative of packers such as UPX, VMProtect, Themida, etc.
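A toy version of the section name feature might look like the following C sketch. The allow-list is just the handful of names mentioned above, and the function name is made up for the example. Real models learn this from data rather than from a hard-coded list.

```c
#include <string.h>

/*
 * Returns 1 if the 8-byte Name field matches one of the conventional
 * section names, otherwise 0. `name` must point to a full 8-byte field.
 */
int section_name_is_conventional(const char name[8])
{
    static const char *const known[] = {
        ".text", ".rdata", ".data", ".rsrc", ".tls", ".reloc"
    };
    char buf[9];

    /* Name is not NUL-terminated when it uses all 8 bytes. */
    memcpy(buf, name, 8);
    buf[8] = '\0';

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(buf, known[i]) == 0)
            return 1;
    }
    return 0;
}
```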
Characteristics
Another feature which both regular AVs and NGAVs will consider is the section characteristics. This field is a DWORD of bitwise OR’d flags which describe the contents of the section and the memory protections applied when it is mapped into memory. In most cases, these flags are constant between executables, and deviation from the standard values will almost surely trigger a generic detection of some sort. A few of the standard section characteristics are as follows. For the meanings of the individual flags, refer to Microsoft’s PE format documentation.
.text equ 0x60000020
.rdata equ 0x40000040
.data equ 0xC0000040
.rsrc equ 0x40000040
.reloc equ 0x42000040
A particularly egregious violation of heuristic norms is to have a section with executable permissions located outside of a singular code section (e.g. RWX vs RX).
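Both conditions are easy to test from the Characteristics values alone. The C sketch below uses the documented flag values; the function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SCN_MEM_EXECUTE 0x20000000u   /* value of IMAGE_SCN_MEM_EXECUTE */
#define SCN_MEM_WRITE   0x80000000u   /* value of IMAGE_SCN_MEM_WRITE */

/* Returns 1 if the section is both writable and executable. */
int section_is_write_execute(uint32_t characteristics)
{
    return (characteristics & SCN_MEM_WRITE) != 0 &&
           (characteristics & SCN_MEM_EXECUTE) != 0;
}

/* Counts how many sections carry the executable flag. */
size_t count_executable_sections(const uint32_t *characteristics,
                                 size_t count)
{
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        if (characteristics[i] & SCN_MEM_EXECUTE)
            n++;
    }
    return n;
}
```

A standard .text section (0x60000020) is executable but not writable, whereas 0xE0000020 is RWX. A count above one, or any section for which section_is_write_execute returns 1, is the kind of deviation described above.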
Imports
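typedef struct _IMAGE_IMPORT_DESCRIPTOR {
    DWORD OriginalFirstThunk; /* RVA of the import lookup table */
    DWORD TimeDateStamp;
    DWORD ForwarderChain;
    DWORD Name;               /* RVA of the DLL name string */
    DWORD FirstThunk;         /* RVA of the import address table */
} IMAGE_IMPORT_DESCRIPTOR, *PIMAGE_IMPORT_DESCRIPTOR;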
Every executable that isn’t packed, or that otherwise has no reason to resolve imported functions dynamically, will have an import directory containing an array of IMAGE_IMPORT_DESCRIPTORs. While import directories contain valuable information for importing and resolving routines from external modules, we really only care about two things: the name of the function and the DLL from which it is imported. From the perspective of AVs, there are few better detection indicators than the import table. It allows insight into the functionality and intention of the program with only cursory reverse engineering. However, some malware resolves functionality at runtime, meaning function and module names will not be visible in the import table, while other malware purposefully manipulates its imported functions to feign benign behavior.
There are hundreds of ways to analyze the import table. One of the most common approaches AVs take is pretty simple, with the caveat that the complexity comes from the statistical analysis of the data. Generally, the scheme is something like this. Find two large datasets, one containing known clean executable files and one containing known malicious files. Iteratively record the contents of each import table, noting the frequency of module and function names. Any number of statistical analysis techniques can then be applied with varying levels of efficacy, where efficacy is defined as having the highest rate of true positives and the lowest occurrence of false positives. A popular methodology is the use of n-gram models, which are not a learning algorithm but rather a method of constructing features. Most well-known ML methods are applicable here, including linear models and linear SVMs, kernel methods and kernel SVMs, decision trees, and neural networks. Experimentation and proper training are imperative for optimizing the efficacy of the model.
Resources
Resources in PE files follow a somewhat convoluted tree-based structure. It is not as necessary to understand the resource directory on a field-by-field basis, as the Windows API exposes functionality for enumerating, updating, and removing resources.
Resources are often used as a storage medium in malware packers and more complex malicious payloads which may require additional information to be stored in the form of a resource. Resources are identified by a type, name (or ID), and a language identifier. Resource types and language identifiers are pre-defined. Names can either be strings or numerical identifiers which slightly affect how the resource is loaded.
Heuristically, there are a few resource types to watch out for. RT_RCDATA is a notorious case. This resource type describes an “application-defined resource (raw data)”, meaning anything. Often this data is innocuous enough, as is the case with some common legitimate installers. More often, though, this resource type is used by malware packers to store encrypted or obfuscated data. In response to the widespread detection of this storage method by AVs, packers have evolved to create legitimate-looking resources of other types, such as icons, bitmaps, and string tables, and interweave malicious information into them. Another common resource to pay attention to is the VERSION_INFO resource (type RT_VERSION). This resource is present in most executable files and briefly describes information pertaining to the publisher, author, version, description, and copyright. Many AVs detect VERSION_INFO resources which imitate a legitimate, well-known app (e.g. Internet Explorer), are randomly generated (and thus have never been seen before), or do not contain dictionary words. Bypassing such detections, however, is usually trivial, oftentimes only requiring a change to the resource’s language identifier.
Entropy
Entropy of individual sections is often a useful indicator to discover whether a file is packed or not. Many malware packers will encrypt the original malicious executable and store it as a resource or additional section. A side effect of (properly) encrypting data is usually a marked increase in the entropy of the data. Entropy can be described as a measure of uncertainty concerning the value of a variable.
A normal range for an unpacked executable file, on the usual scale of 0 to 8 bits per byte, would be from 4.5 on the low end to around 6.5 on the high end. Executables with a lot of compressed or encrypted data have higher entropy values, so files with entropy above 6.5 tend to be packed or compressed in some way. Many legitimate packers that aim to reduce file size will also produce very high entropy values.
Malware packers can employ a number of techniques to reduce this heuristic marker. The entropy of data such as a heavily compressed byte sequence may be decreased by reducing the set of possible values in the data (i.e. encoding). For example, taking a UPX-packed exe and converting it to a base64 string will yield a heuristically relevant decrease in the entropy value, with the side effect of increasing the size. Encrypted or compressed data may also be diluted or padded with bytes of the same value or a small pattern, for example by inserting a zero byte after every nth byte. This adds predictability to the data, thereby decreasing the calculated entropy as a whole. There are many tricks and techniques available to combat entropy detection, so it is good to understand the principles behind them as well.
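The entropy figure quoted for a section is typically the Shannon entropy of its bytes, measured in bits per byte. Below is a minimal C sketch. To keep it free of extra link flags it carries its own small base-2 logarithm; in real code, log2() from <math.h> would be the usual choice. Both function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Base-2 logarithm for x > 0, computed bit by bit so the example has no
 * libm dependency.
 */
double log2_bitwise(double x)
{
    double result = 0.0;
    double bit = 0.5;

    while (x >= 2.0) { x /= 2.0; result += 1.0; }   /* integer part */
    while (x < 1.0)  { x *= 2.0; result -= 1.0; }

    for (int i = 0; i < 52; i++) {                  /* fractional bits */
        x *= x;
        if (x >= 2.0) { x /= 2.0; result += bit; }
        bit /= 2.0;
    }
    return result;
}

/* Shannon entropy of a buffer in bits per byte, from 0.0 to 8.0. */
double shannon_entropy(const uint8_t *data, size_t len)
{
    size_t counts[256] = {0};
    double entropy = 0.0;

    if (len == 0)
        return 0.0;
    for (size_t i = 0; i < len; i++)
        counts[data[i]]++;
    for (int v = 0; v < 256; v++) {
        if (counts[v] == 0)
            continue;
        double p = (double)counts[v] / (double)len;
        entropy -= p * log2_bitwise(p);   /* add -p * log2(p) */
    }
    return entropy;
}
```

A buffer of identical bytes scores 0, while a buffer in which all 256 byte values occur equally often scores the maximum of 8. Running this over each section’s raw data gives the per-section values discussed here.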
Conclusions
By no means is this a comprehensive guide to static detection of malicious executables. Rather, it should serve as an informal reference and examination of common points of interest and techniques used on both sides. Knowing what to pay attention to at a glance is an important skill in quickly identifying points of interest in malware analysis. Evasion techniques are constantly changing and trending towards adversarial machine learning. By knowing what features are extracted by NGAVs, malware authors are able to craft executable files which satisfy such checks, thus bypassing the promises of NGAVs.