Transparent System Introspection in Support of Analyzing Stealthy Malware

The proliferation of malware has increased dramatically and seriously damaged the privacy of users and the integrity of hosts in the past few years. Kaspersky Lab products detected over six billion threats against users and hosts in 2014 consisting of almost two million specific, unique malware samples [51]. McAfee reported that malware has greatly increased during 2014, with over 50 million new threats occurring during the fourth quarter [64] alone. While automated techniques exist for detecting pre-identified malware [85], manual analysis is still necessary to understand new and unknown threats. Malware analysis is thus critical to understanding new infection techniques and to maintaining a strong defense [16].


Introduction
The proliferation of malware has increased dramatically and seriously damaged the privacy of users and the integrity of hosts in the past few years. Kaspersky Lab products detected over six billion threats against users and hosts in 2014 consisting of almost two million specific, unique malware samples [51]. McAfee reported that malware has greatly increased during 2014, with over 50 million new threats occurring during the fourth quarter [64] alone. While automated techniques exist for detecting pre-identified malware [85], manual analysis is still necessary to understand new and unknown threats. Malware analysis is thus critical to understanding new infection techniques and to maintaining a strong defense [16].
Malware analysis typically employs virtualization [25,26,34] and emulation [6,73,92] techniques that enable dynamic analysis of malware behavior. A given malware sample is run in a virtual machine (VM) or emulator. An analysis program (or a VM plugin) introspects the malicious process to help an analyst determine its behavior [17]. Introspection is a technique used to unveil semantic information about a program without access to source code. Initially, conventional wisdom held that the malware sample under test would be incapable of altering or subverting the debugger or VM environment [75,78]. Unfortunately, malware developers can easily escape or subvert these analysis mechanisms using several anti-debugging, anti-virtualization, and anti-emulation techniques [13,18,23,33,39,72,74]. These techniques determine the presence of debuggers and VMs by using artifacts that they expose. We summarize known artifacts in Table 4 in the Appendix. Chen et al. [23] reported that 40% of 6,900 given samples were found to hide or reduce malicious behavior when run in a VM or with a debugger attached. Thus, the analysis and detection of such stealthy malware remains an open research problem, which we call the debugging transparency problem.
There is a diverse array of anti-analysis artifacts. Artifacts come in two broad types: software-based and timing-based. Software artifacts are functional properties that can reveal the presence of analysis, such as the value returned by IsDebuggerPresent or evidence of improperly emulated features. Timing artifacts are introduced by the cost of executing the analysis framework. Debuggers that allow single-stepping through instructions incur a significant amount of overhead, which can be measured by the software under test. Malware samples that are aware of these artifacts can thus heuristically detect when they are being analyzed. Such stealthy malware requires additional effort to understand. Solving the debugging transparency problem is thus concerned with reducing artifacts or permitting reliable analysis in the presence of artifacts.
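As a rough illustration of how a timing artifact could be observed, the following Python sketch times a fixed workload and flags a run that is much slower than a previously recorded baseline. The function names, the workload, and the slowdown factor of 5 are illustrative assumptions, not values or techniques taken from any particular malware sample.

```python
import time


def time_workload(iterations: int) -> int:
    """Return the elapsed wall-clock nanoseconds for a fixed arithmetic loop."""
    start = time.perf_counter_ns()
    acc = 0
    for i in range(iterations):
        acc += i * i
    return time.perf_counter_ns() - start


def looks_instrumented(baseline_ns: int, observed_ns: int, factor: float = 5.0) -> bool:
    """Heuristic check: is the observed run slower than the baseline by `factor`?

    The factor is an assumed threshold; real overheads depend on the tool.
    """
    return observed_ns > baseline_ns * factor
```

A heuristic of this kind is inherently noisy, which is why low-overhead acquisition matters: the smaller the slowdown, the harder it is to separate analysis overhead from ordinary scheduling jitter.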
We choose to focus on three components to help analyze stealthy malware. First, many analysts employ existing analysis tools such as IDA Pro [47] or OllyDbg [94], but such tools introduce slowdowns or other detectable artifacts. Thus, we desire a solution that provides low-overhead and low-artifact debugging capabilities. Second, these debugging capabilities ultimately rely on the ability to gather semantic information from a program. To provide information that the analysis requires, a solution should therefore successfully read and report 1) variable values and 2) dynamic stack traces [32]. Finally, black box testing of malware samples is an important step in understanding their behavior [16]. To promote black box testability, an expressive solution should also be able to insert instructions to specifically control behavior and write variable values.
To achieve these desired properties of a successful solution, we combine several insights to form the basis of a novel system for transparent malware analysis. First, specialized hardware exists that can be used to read a host's memory with extremely low overhead and without injecting artifacts in the platform's software. Second, we can use existing program analysis techniques to reconstruct valuable semantic information from raw memory dumps, including variable values and dynamic stack traces. Third, we can combine program analysis and specialized hardware to control program execution for low-artifact black box testing. We propose a system consisting of three components that use the above insights to solve the debugging transparency problem. First, we propose low-artifact memory acquisition via novel use of the PCI express bus as well as System Management Mode (SMM) on x86 platforms. PCI express can access memory with low overhead (and therefore fewer timing artifacts), and with only one visible software artifact (the DMA access performance counter). SMM on x86 platforms can transparently acquire physical memory associated with a single process. Compared to the PCI express approach, using SMM introduces no software artifacts, but at the cost of increased overhead. Both approaches support a similar interface: the user provides an address, and the system returns the value of physical memory at that address.
Our second proposed component can use these snapshots of system memory and adapt well-known program analysis techniques to reconstruct semantic information about the program. If we can find the locations where variables are stored, we can determine their values from the memory snapshot. Similarly, since activation records of function calls are stored on the stack, we can reconstruct a sequence of function calls if we know where the program's stack is stored. We propose to use program analysis techniques to find this information given a binary and a raw memory dump.
Our third component allows modifying variables, especially those residing in registers. Whereas our introspection components may provide read-only access to variable values in a program, we also desire write access to variables and registers to permit effective black box testing. The insight here is the use of hardware support made for debugging firmware, which we propose to repurpose for inserting instructions. We propose to devise a technique whereby we construct sequences of instructions to accomplish certain behaviors associated with testing without causing significant overhead. This technique can be achieved using CPU interposition, which involves specialized hardware that sits between the platform's mainboard and CPU socket. The research problem at hand is whether we can use CPU interposition to transparently insert instructions into the CPU's execution units in a way that permits changing variables and controlling paths of execution taken by a program. While this component is preliminary, there is promise based upon experience with hardware debugger support in x86 and ARM architectures.

Broader Impact
Malware analysis is an increasingly important area. As malware becomes more complex, the strain on analysis resources escalates. Engineers in industry frequently take as long as one month to manually analyze a single new sample of stealthy malware [82], and many companies simply do not have the resources to analyze all stealthy infections [24]. With millions of new samples appearing every year, this time investment is not feasible, and is certainly no longer cost effective. Further, with the increased prevalence of and reliance upon computing in our everyday lives, large-scale malware such as Stuxnet and Careto [53] has caused significant financial damage. As for individual users, Kaspersky Lab reported 22.9 million attacks involving financial malware that targeted 2.7 million individual users [52] in 2014 alone.

Intellectual Merit
With the increased prevalence of stealthy malware, the debugging transparency problem has become an open and fertile area of research. Our proposed techniques are novel to the best of our knowledge. Solving this problem requires combining insights spanning programming languages, operating systems, hardware development, and security. First, our proposed memory acquisition approach requires strict compliance with the PCI express specification (to avoid software-based artifacts) while providing low-overhead access (to avoid timing-based artifacts) to semantic information. Second, our proposed program analysis technique depends upon finding values of variables, which are in non-obvious locations (i.e., those chosen by an optimizing compiler). Third, we must automatically constrain desired testing behavior into a short sequence of inserted instructions while executing an adversarial program. No currently extant system shows promise in providing the benefits of binary program analysis in a transparent manner.

Thesis Statement
It is possible to develop a transparent debugging system capable of reading and writing variable values, reconstructing stack traces, and black box testing by combining hardware-assisted memory acquisition, process introspection, and CPU interposition. This proposal consists of three research components that, together, form a cohesive system supporting the automated, transparent analysis of software. First, we discuss hardware-assisted system introspection techniques. Hardware assistance provides the necessary basis for analyzing software with low overhead, and therefore low artifacts. Second, we discuss transparent process introspection. This technique facilitates reading variable values and reconstructing dynamic stack traces by using said hardware assistance. Finally, we discuss CPU interposition. Whereas transparent process introspection admits reading variables in a process's memory, we propose to use CPU interposition to allow writing variable values in both memory and registers. Together, these techniques help solve the debugging transparency problem.

Background
In this section, we define terms and vocabulary used in this area.

Stealthy Malware
Recent malware detection and analysis tools rely on virtualization, emulation, and debuggers [13,18,22,33] (also, see the Appendix). Unfortunately, these techniques are becoming less applicable with the growing interest in stealthy malware [23]. Malware is stealthy if it makes an effort to hide its true behavior. This stealth can emerge in several ways. First, malware can simply remain inactive in the presence of an analysis tool. Such malware will use a series of platform-specific tests to determine if certain tools are in use. If no tools are found, then the malware executes its malicious payload.
Second, malware may abort its host's execution. For example, a sample may attempt to execute an esoteric instruction that is not properly emulated by the tool being used. In this case, attempting to emulate the instruction may lead to raising an unhandled exception, crashing the analysis program.
Third, malware may simply disable defenses or tools altogether. For instance, a previous version of OllyDbg would crash when attempting to emulate printf calls with a large number of '%s' tokens [93]. This type of malware may also infect kernel space and then disable defenses by abusing its elevated privilege level.
A firm understanding of stealthy malware is important due to its increasing prevalence and strain on analysis resources.

Artifacts
Stealthy malware evades detection by determining whether an analysis tool is being used to watch its execution and then changing its behavior. This means there must be some piece of evidence available to the malware that it uses to make this determination. This may be anything from execution time (e.g., debuggers make programs run more slowly), to I/O device names (e.g., if a device has a name with 'VMWare' in it), to emulation idiosyncrasies (e.g., QEMU fails to set certain flags when executing obscure corner-case instructions). We coin a novel term for these bits of evidence: artifacts. Ultimately, we seek more transparent instrumentation and measurement of malware by reducing or eliminating the presence of these artifacts.

System Management Mode
Our first proposed research component makes extensive use of System Management Mode (SMM) [48]. SMM is a mode of execution similar to Real and Protected modes available on x86 platforms. It provides a transparent mechanism for implementing platform-specific system control functions such as power management. It is initialized by the BIOS. SMM was originally designed to enable the hardware to save power by monitoring which peripherals were being used and turning off peripherals when idle. SMM is transparent to the operating system, so its use in power management did not require extensive driver support. We make use of this transparency to execute analysis code unbeknownst to code executing in both user and kernel space.
SMM is triggered by asserting the System Management Interrupt (SMI) pin on the CPU. This pin can be asserted in a variety of ways, which include writing to a hardware port or generating Message Signaled Interrupts with a PCI device. Next, the CPU saves its state to a special region of memory called System Management RAM (SMRAM). Then, it atomically executes the SMI handler stored in SMRAM. SMRAM cannot be addressed by the other modes of execution. This restriction therefore allows SMRAM to be used as secure storage. The SMI handler is loaded into SMRAM by the BIOS at boot time. The SMI handler has unrestricted access to the physical address space and can run privileged instructions. SMM is thus a convenient means of storing and executing OS-transparent analysis code.

Threat Model
The proposed system entails malware analysis. We must therefore define the scope of the malware that we propose to analyze. We assume a malware sample can compromise the operating system after executing its very first instruction. We further assume the malware can use unlimited computational resources. We assume that the physical hardware is trusted; hardware trojans are thus out of scope.

Related Work
In this section, we discuss the context in which our proposed work fits.

Malware Analysis
Stealth techniques employed by malware have necessitated the development of increasingly sophisticated techniques to analyze them. For benign or non-stealthy binaries, numerous debuggers exist (e.g., OllyDbg [94], IDA Pro [47], GNU Debugger [42]). However, these debuggers can be trivially detected in most cases (e.g., by using the IsDebuggerPresent() function). Such anti-analysis techniques led researchers to develop more transparent, security-focused analysis frameworks using virtual machines (e.g., [6,25,26,34,80,92]), which typically work by hooking system calls to provide an execution trace that can then be analyzed. System call interposition has its own inherent problems [38] (e.g., performance overhead), which led many researchers to decouple their analysis code even further from the execution path. Virtual-machine introspection (VMI) inspects the system's state without any direct interaction with the control flow of the program under test, thus mitigating much of the performance overhead. VMI has prevailed as the dominant low-artifact technique and has been used by numerous malware analysis systems (e.g., [28,40,44,50,59,79,83]). Jain et al. [49] provide an excellent overview of this area of research. However, introspection techniques have very limited access to the semantic meaning of the data that they are acquiring. This limitation is known as the semantic gap problem. There is significant work on the semantic gap problem as it relates to memory [11,29,36,50,60] and disk [20,55,62] accesses. All VM-based techniques thus far have nevertheless been shown to reveal some detectable artifacts [23,72,74,76] that could be used to subvert analysis [35,71].
The semantic gap problem requires reconstructing useful information from raw data, but this reconstruction process is completely separate from the method of acquisition. Numerous techniques have been discussed which further decouple the analysis code from the software under test by moving the analysis portion into System Management Mode (e.g., [12,86,87,95]) or onto a separate processor altogether (e.g., [15,65,66,69,96]). In contrast to existing work, we consider SMM as a trusted mechanism for transparently acquiring physical memory associated with a particular process.

Debugging
Most popular debuggers, such as the GNU Debugger (GDB), Pin [61], OllyDbg [94], or DynamoRIO [19], work by modifying the instructions of the process being debugged, adding jump or interrupt instructions that periodically call back to the debugging framework to report the requested data. These techniques have obvious performance impacts as they execute multiple instructions in-line within the debugged process, and are known to be trivial to detect. More security-focused dynamic analysis engines either emulate the CPU in software [14,43,90] or execute the instructions on the hardware and monitor memory accesses or system calls to trigger their instrumentation [27,79,81,95]. However, these techniques have also been shown to expose numerous artifacts and be susceptible to subversion [35,71].
Hardware debuggers, such as JTAG, Asset InterTech's PCT [9], or ARM's DStream [8], expose no software artifacts, but still have a significant performance impact as they typically function by iteratively sending a set of instructions to the CPU and analyzing system state upon completion of those instructions. These debugging interfaces are typically orders of magnitude slower than the CPU that they are debugging, and thus introduce inherent performance restrictions when used as a general debugging platform. The aforementioned timing constraints make single-stepping techniques cumbersome for large programs and easily detectable by malicious programs. Further, non-malicious processes dependent on time, such as network servers, may be unstable in such environments. This instability could be treated as a separate artifact that malicious code could use in turn. These techniques have also been largely unexplored in a security context and are typically reserved for debugging the Basic Input/Output System (BIOS) and boot loaders. Our proposed approach explores low-artifact hardware-based debugging techniques for general software programs. Furthermore, we revisit the use of these hardware debuggers as a mechanism for inserting single instructions rather than for general debugging.

Memory Introspection
Memory introspection is by no means a new field of study. Numerous tools have been created to facilitate memory introspection on both physical and virtual machines over the years [21,30,63,68,87], as well as tools that bridge the semantic gap. Jain et al. [49] summarized the research in the field with virtual machines; however, the area of live hardware-based memory introspection is mostly unexplored. In general, these techniques require instrumentation to acquire memory, which is augmented to bridge the semantic gap and extract high-level information from the system being analyzed. Many analysis techniques use expert knowledge to reconstruct features of popular operating systems and processes, as in Volatility [11]. Others use automated techniques to reproduce native process functionality, such as enumerating running processes, via an introspection framework [29,36,50,60]. A third class of techniques achieves isolation by moving security-critical introspection functions to a higher-level hypervisor or external process [41,83]. While a few hardware-based techniques currently exist [69,87,96], [...] or a significant performance overhead. We are unaware of any hardware-based systems capable of analyzing arbitrary programs and providing debugging-like information that includes variable values and stack traces while maintaining transparency from the software under test.

Proposed Research and Metrics
We propose a system consisting of two hosts: 1) a system under test (SUT), and 2) a remote host. The SUT is the platform that contains the code we want to observe (e.g., a stealthy malware sample), while the remote host is used by an analyst to guide introspection and debugging on the SUT. This architecture is favorable because it isolates duties: the SUT executes its code under test transparently. The remote host, in turn, is responsible for debugging and analysis. Figure 1 gives a high-level architecture diagram of our system; in this section we discuss each proposed research activity in detail.

Hardware-Assisted Program Introspection
To support and carry out transparent malware analysis, we require transparent access to a host's memory. We propose two techniques to achieve this end. First, we propose a novel use of the PCI express bus to rapidly acquire memory contents via direct memory access (DMA). This technique has the benefit of producing very little measurable overhead and minimal software artifacts: specifically, software can measure the DMA performance counter. We also propose a novel use of System Management Mode (SMM) to acquire process memory transparently. SMM potentially produces timing artifacts, but it does not produce any software artifacts. Both techniques provide a similar interface for our purposes: a desired address is given as input, and the value of the process's memory at that address is returned as output.
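The shared address-in, value-out interface can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names MemoryReader, SnapshotReader, and read_u32 are hypothetical, and the backend here is an in-memory byte string standing in for data that the PCI express or SMM mechanism would return.

```python
import struct
from typing import Protocol


class MemoryReader(Protocol):
    """Hypothetical interface shared by both acquisition backends."""

    def read(self, address: int, size: int) -> bytes: ...


class SnapshotReader:
    """Stand-in backend that serves reads from a captured byte string."""

    def __init__(self, base: int, data: bytes) -> None:
        self.base = base  # address corresponding to data[0]
        self.data = data

    def read(self, address: int, size: int) -> bytes:
        offset = address - self.base
        if offset < 0 or offset + size > len(self.data):
            raise ValueError("address range outside snapshot")
        return self.data[offset:offset + size]


def read_u32(reader: MemoryReader, address: int) -> int:
    """Read a little-endian 32-bit unsigned value at `address`."""
    return struct.unpack("<I", reader.read(address, 4))[0]
```

Writing the analysis code against the interface rather than a specific backend would let the same introspection logic run over either acquisition mechanism.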

Using PCI express
We propose using a Field-Programmable Gate Array (FPGA) board with a PCI express connector to create a sensor capable of acquiring memory values transparently. PCI express supports very high speed transfers. Customized hardware in the form of an FPGA allows us to take full advantage of this high bandwidth. Thus, we can acquire many samples of memory values with low overhead, and therefore few timing artifacts. This approach is also attractive because we can use commercial off-the-shelf (COTS) components to develop the sensor. As a result, we can potentially drastically lower the effort involved in acquiring memory values.
Specifically, we propose to use a Xilinx ML507 development board. This board has PCI express and gigabit Ethernet connectors. We propose an FPGA design that supports reading memory over PCI express via DMA, then passes the values over Ethernet to a remote host. In our proposed setup, the board communicates with the system under test via PCI express. We enable DMA by setting the Bus Master Enable bit in the card's configuration register. As hardware is typically trusted, there are no enforcement mechanisms to stop a peripheral from reading arbitrary memory locations. This method has been widely studied [21,77,87] and exploited [10,30,31,45,56,77,84]. However, its use for transparent memory acquisition in the legitimate study of stealthy malware is novel. A remote host connects via gigabit Ethernet. An analyst then uses the remote host to orchestrate an introspection session on the system under test.

Using System Management Mode
To use SMM to acquire memory values, we trigger the System Management Interrupt (SMI) handler, which executes logically atomically, by asserting the SMI pin on the CPU. On most Intel platforms, we assert the SMI pin by writing to a port specified by the chipset. By modifying the SMI handler to contain process introspection code, we can provide simple debugging services to an external host.
The SMI handler executes in isolation from the operating system, thus providing transparency. Moreover, the SMI handler code is stored in the BIOS on the system. Upon booting, the SMI handler is loaded from the BIOS EEPROM to a special region of memory called System Management RAM (SMRAM). SMRAM is only addressable when the CPU is executing in SMM. Otherwise, the hardware automatically maps SMRAM addresses to VGA memory. As a result, SMRAM can be used as trusted storage because its isolation is enforced by the hardware, which we assume is trusted. SMM thus provides a transparent execution environment that permits us to acquire memory values even if the underlying operating system has been compromised.
We propose using SMM to provide a memory acquisition interface. In the system described, we must assert interrupts to cause our introspection code to execute. Typical computer chipsets support raising an SMI when a performance counter overflows. Therefore, to trigger an SMI in the immediate future, we propose to manually set the value of a performance counter to its maximum value. When the corresponding event next occurs, the performance counter overflows, raising an SMI. If needed, we can use the retired instruction performance counter to achieve an SMI after every single instruction executed by the CPU. This allows us to account for local timing artifacts by altering hardware counters on every instruction, thus preventing software artifacts from occurring.
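The counter-overflow trigger can be illustrated with a small simulation. This Python sketch only models the arithmetic: the 48-bit counter width, the names preset_for and SimulatedCounter, and the idea of returning an overflow flag in place of a real SMI are all assumptions made for illustration, and no hardware registers are touched.

```python
COUNTER_BITS = 48  # assumed counter width; real widths vary by platform


def preset_for(events_until_smi: int, bits: int = COUNTER_BITS) -> int:
    """Counter value that overflows on the N-th subsequent event.

    For N = 1 this is the maximum counter value, as described in the text.
    """
    if not 1 <= events_until_smi <= (1 << bits):
        raise ValueError("event count out of range for counter width")
    return (1 << bits) - events_until_smi


class SimulatedCounter:
    """Software model of a wrapping performance counter."""

    def __init__(self, value: int, bits: int = COUNTER_BITS) -> None:
        self.bits = bits
        self.value = value

    def tick(self) -> bool:
        """Count one event; return True when the counter wraps to zero."""
        self.value = (self.value + 1) % (1 << self.bits)
        return self.value == 0  # an overflow would raise the SMI


counter = SimulatedCounter(preset_for(3))
overflowed = [counter.tick() for _ in range(3)]  # wraps on the third event
```

Presetting with N = 1 on the retired-instruction counter corresponds to the per-instruction case mentioned above, at the cost of one handler invocation per instruction.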

Transparent Program Memory Introspection
The second component of our proposed transparent malware analysis system constructs useful semantic information from raw memory dumps. In particular, we desire to introspect variable values and runtime stack information from memory dumps acquired from our proposed hardware-assisted memory acquisition component described in Section 4.1. However, these raw physical memory snapshots do not come with meaningful semantic information: we must bridge the semantic gap. Thus, we propose a technique that iteratively inspects snapshots of physical memory (recorded asynchronously via the first proposed component) and recovers semantic information. Using a combination of operating systems and programming languages techniques, we locate local, global, and stack-allocated variables and their values. In addition, we determine the current call stack.
Because we may operate on off-the-shelf optimized code and only assume access to memory (and not to the CPU), our approach may not be able to report the values of some variables (such as those stored in registers) or some stack frames (such as those associated with inlined functions).
We believe this approach will be able to provide a rich set of information for common security analysis tasks in practice.
We envision two use cases for this proposed component. In automated malware triage systems, we want to analyze a large corpus of malware samples as quickly as possible. Unfortunately, existing solutions to this problem depend on virtualization. For example, the popular Anubis [6] framework, which analyzes binaries for malicious behavior, depends on Xen for virtualization. This dependency on virtualization allows stealthy or VM-aware malware to subvert the triage system. In contrast, this proposed component assumes the presence of low-overhead sampling hardware that enables fast access to a host's memory. This proposed component, therefore, is charged with introspecting the process while running in the triage system. For such a triage system, we want to understand a subject program's behavior in part by reasoning about the values of variables and producing a dynamic stack trace during execution. For example, the system might inspect control variables and function calls to determine how a stealthy malware sample detects virtualization, or might inspect snapshots of critical but short-lived buffers for in-memory cryptographic keys. In this use case, the primary metric of concern is accuracy with respect to variable values and stack trace information.
Additionally, we envision this use case as reducing the manual effort involved in reverse engineering state-of-the-art stealthy malware samples. This proposed component enables debugging-like capabilities that are transparent to the sample being analyzed. This power allows analysts to save time when reverse engineering the anti-VM and anti-debugging features employed by current malware so that they can focus on understanding the payload's behavior.
The second use case is for deployments in large enterprises where multiple instances of the same software are running on multiple hosts. In this case, we have vulnerable software running and, if an exploit occurs, we want to know what memory locations are implicated in the exploit. Here, we have a limited amount of time between when malicious data is placed into program memory and when malicious behavior begins. Thus, we want to study which buffers may be implicated by malicious exploits in commercial off-the-shelf software. In this use case, an additional metric is the relationship between the speed and accuracy of our asynchronous debugging approach: for example, we may require that accurate information about potentially malicious data be available quickly enough to admit classification by an anomaly intrusion detection system.
In both use cases, we begin with a binary that we want to study. If the source code to this binary is available (as is likely in the second use case but unlikely in the first), we take advantage of it to formulate hypotheses about variable locations. In any case, we desire to find and report 1) variables of interest in program memory, and 2) a dynamic stack trace of activation records to help understand the semantics of the program.
We assume that we have a binary compiled with unknown optimizations or flags. We also assume access to the source code for the binary. We compile a test binary with symbol information to formulate the runtime memory locations of global and local variables and formal parameters.
We then use this information as a guide for finding the same variables in the original binary. For instance, if global a is stored at offset x from the data segment in the test binary, then we hypothesize a is near the same offset x in the original binary.
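The offset hypothesis can be written down as a small search-window computation. The Python sketch below is illustrative only: the function name candidate_window, the example addresses, and the window size are assumptions, since the text says only that the variable is expected "near" the same offset.

```python
def candidate_window(test_offset: int, original_data_base: int, window: int) -> tuple[int, int]:
    """Address range in the original binary where a global is hypothesized to live.

    test_offset: offset of the global from the data segment in the test binary.
    original_data_base: start address of the data segment in the original binary.
    window: assumed search radius in bytes around the hypothesized address.
    """
    guess = original_data_base + test_offset
    low = max(original_data_base, guess - window)  # do not search before the segment
    high = guess + window
    return low, high


# Hypothetical example: global `a` sits 0x40 bytes into the test binary's data segment.
low, high = candidate_window(0x40, 0x600000, 0x10)
```

A larger window tolerates more layout drift between the two builds but yields more candidate locations to disambiguate.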
We also assume the binary makes use of a standard stack-based calling convention (i.e., continuation passing style [7] is out of scope). Thus, we can reconstruct a stack trace by finding the stack in our memory snapshots and looking for the return address. We can then map the return address to the nearest enclosing entry point in the code segment using the symbol table generated in our test binary. Because the hardware-assisted introspection component is based on polling for memory, reconstructing the stack trace in this manner could fail. If we do not poll frequently enough, we may miss activations altogether (e.g., if a function is called and quickly returns before we can poll). Figure 2 explains this tradeoff visually.
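Both steps described above, mapping return addresses to enclosing functions and the effect of the polling rate, can be sketched in Python. The symbol table, addresses, and time units below are made up for illustration, and the helper names are hypothetical; the sketch assumes polls occur at fixed multiples of the polling period.

```python
from bisect import bisect_right


def enclosing_function(symbols, return_address):
    """Name of the function whose entry point most closely precedes the address.

    symbols: list of (entry_address, name) pairs sorted by entry_address.
    """
    entries = [addr for addr, _ in symbols]
    index = bisect_right(entries, return_address) - 1
    return symbols[index][1] if index >= 0 else None


def stack_trace(symbols, return_addresses):
    """Map return addresses (innermost first) to function names."""
    return [enclosing_function(symbols, ra) for ra in return_addresses]


def observed_activations(activations, period):
    """Names of activations that are live during at least one poll.

    activations: dict of name -> (start, end) lifetime, end exclusive.
    Polls are assumed to occur at times 0, period, 2 * period, and so on.
    """
    seen = []
    for name, (start, end) in activations.items():
        first_poll = -(-start // period) * period  # first poll at or after start
        if first_poll < end:
            seen.append(name)
    return seen


symbols = [(0x400000, "main"), (0x400100, "parse"), (0x400200, "decrypt")]
trace = stack_trace(symbols, [0x400210, 0x400150, 0x400020])

calls = {"f": (0, 100), "g": (12, 18), "h": (25, 45)}
slow = observed_activations(calls, 10)  # short-lived g falls between polls
fast = observed_activations(calls, 5)
```

In this toy example the short-lived activation g is missed at a polling period of 10 but captured at a period of 5, which is the tradeoff between polling frequency and trace completeness.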

Transparent CPU Interposition
The third proposed component is more speculative, but holds promise nonetheless. As such, we propose two optional components as a risk mitigation strategy to ensure successful completion of the overall proposed research. After sufficient preliminary investigation is completed, we will select one of the potential components to take to completion. Both potential components relate to CPU interposition as part of completing the overall solution to the debugging transparency problem.

Option 1: Black Box Testing of Android Malware
To address the growing concern of mobile malware [64] and stealthy Android malware [70], we propose a CPU interposition component to apply to ARM platforms, specifically Android. In particular, the process introspection component introduced in Section 4.2 permits reading of variables and stack traces. While this can help analysts understand a sample's behavior, it may also be useful to change a sample's behavior. For example, we claim that Android malware targeting financial information from users may only activate if a user's bank account has a certain balance. In this example, the balance would be stored in a variable. Thus, we would need to change this variable to observe changes in the execution of the malware sample. Moreover, general debugging tools such as GDB [42] or IDA Pro [47] allow changing variable values when interacting with a piece of software.
We propose a CPU interposition component so that our system offers a complete set of debugging features while remaining transparent.We thus propose to allow the transparent alteration of such a variable to explore different execution paths.
We propose using hardware debuggers available for ARM platforms (e.g., the Asset PCT for ARM [9] or the ARM DStream debugger [8]).This debugging hardware is intended to test manufacturing defects and to test low level firmware.However, based upon our preliminary investigation, these tools can be extended to help transparently analyze stealthy malware.In particular, these hardware debuggers function via interposing the CPU-that is, a special circuit exists between the CPU socket on the platform's mainboard and the CPU itself.The hardware debugger has access to the CPU's pins and therefore can intercept all signals sent between the CPU and the rest of the system.
Hardware debuggers allow two general operations: 1) reading and writing arbitrary memory values in registers, caches, and DRAM banks, and 2) inserting instructions by shifting a new instruction into a special register that feeds to an execution port in the CPU pipeline.Further, hardware debuggers can be invoked at regular intervals (e.g., 20MHz for JTAG-based debuggers).Therefore, by using the program introspection component from section 4.2, we can identify locations of variables in memory, and change them using CPU interposition.Furthermore, we can insert instructions to change program behavior in a way that is difficult for stealthy malware to detect.Together, these powers allow us to complete our proposed solution to the debugging transparency problem.
We propose an approach whereby we alter the outcome of branches in software under test by changing the value of a relevant variable.With this proposed approach, we assume a malware analyst indicates the desired path to be exercised by a particular branch by asserting the value of a variable.Ultimately, this component will be successful if we can change the value of the variable to effect a change in the outcome of the branch.The challenge is thus whether we can change the value without influencing overall performance (i.e., without measurable slowdown) and whether we can change the value in time to make a difference (e.g., if a variable's value is changed after the branch is taken, then the component fails).

Option 2: Realtime Repair of Quadcopter Systems
Realtime software provides another compelling need for reducing timing artifacts. Expensive software analysis techniques may take too long to complete or fail altogether in the context of a realtime or resource-constrained environment. One such area of growing interest is in unmanned aerial vehicles. In particular, we want to increase the dependability of software running on quadcopters despite the heterogeneity of specific hardware and firmware configurations that lead to failures [37]. Such realtime software would benefit from techniques that fix bugs on the fly to prevent mission failures.
Automated program repair is an active area of research concerned with automatically generating patches to fix bugs in software [57,88,89]. Typically, these techniques assume the existence of a comprehensive suite of test cases that help localize bugs in the software. The original software passes some fraction of the test cases. Then, by changing the code localized by the automated repair system, a patch can potentially be generated that causes the software to pass a larger fraction of the test cases. However, automated program repair has not been used on firmware, since it requires changing the program code while it is stored (as opposed to changing the code at runtime). While we do not propose new automated program repair techniques, we seek to apply these techniques in a novel domain.
We propose a component in which we apply automated program repair by using CPU interposition.
As mentioned in Option 1, we can repurpose hardware debuggers to help repair realtime software. Specifically, debugger hardware permits inserting instructions only; it cannot remove instructions from the execution pipeline. Thus, we propose using ARM hardware debuggers to apply program repairs by inserting instructions only. A successful approach for this component requires automatically converting a general software patch into a sequence of instructions to be inserted and then reliably inserting them with minimal overhead. For example, if our patch requires removing an instruction add rax, rax, 5, we can simulate removing this instruction by inserting a subsequent sub rax, rax, 5 immediately (ideally) after the first instruction. Thus, this component is charged with 1) deriving a sequence of instructions to insert that represents the behavior of the desired patch, and 2) reliably inserting the patch in the desired location in a way that the new patch will execute.
We assume that a patch exists, provided as a source code delta for a function or method to be changed. We also assume that we are provided the location for such a patch to be placed in the binary (i.e., the value of the program counter at which the patch should be placed). Therefore, we must automatically convert the provided patch to a sequence of instructions to be inserted via the hardware debugger as soon as the program counter reaches the desired value. This use case is indicative of self-checking malware: we must change the program code before it has a chance to execute, but not so early that it can check itself for discrepancies. Thus, patching quadcopter firmware is indicative of the same technical challenges posed by analyzing self-checking stealthy malware samples.
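The idea of simulating a removal by inserting a compensating instruction can be sketched as a small lookup over invertible opcodes. This is an illustrative sketch only: the textual three-operand instruction format follows the example in the text, while the `INVERSE_OPCODE` table and the `compensating_instruction` helper are hypothetical and cover only `add` and `sub`.

```python
# Hypothetical table of opcodes whose effect can be undone by another opcode.
INVERSE_OPCODE = {"add": "sub", "sub": "add"}


def compensating_instruction(instruction):
    """Return an instruction that undoes `instruction`, or None when this
    sketch knows no inverse for it."""
    opcode, _, operands = instruction.partition(" ")
    parts = [part.strip() for part in operands.split(",")]
    if len(parts) != 3 or opcode not in INVERSE_OPCODE:
        return None
    dest, src, imm = parts
    # If dest differs from src, the old value of dest has been overwritten
    # and cannot be restored by arithmetic on the result alone.
    if dest != src:
        return None
    return f"{INVERSE_OPCODE[opcode]} {dest}, {src}, {imm}"


print(compensating_instruction("add rax, rax, 5"))
```

The sketch restores only the destination register. Side effects such as condition flags, and any instruction that reads the register between the original and the inserted instruction, are not accounted for, which is one reason the text asks for insertion immediately after the first instruction.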

Proposed Experiments
In this section we present and discuss the metrics for evaluating the success of each of our three proposed components. We then present experimental designs for testing our hypotheses.

Hardware-Assisted Program Introspection
This subsection deals with the proposed hardware-assisted introspection technique. Recall that the desired properties of an adequate introspection technique are that it transparently acquires physical memory from a system under test. In that regard, we have two basic metrics for transparency: 1) latency of operation, and 2) freedom from software artifacts as measured by success in analyzing stealthy samples. The latency of the system helps indicate the degree to which it is free from timing artifacts. Together, these two metrics give an overall picture of how transparent the system is.
We will evaluate two hardware-assisted introspection techniques based on 1) System Management Mode (SMM) and 2) PCI Express-based DMA.

SMM-Based Introspection
We measure the latency of the SMM-based introspection approach by measuring the time elapsed during particular operations in the SMI handler. We use the Time Stamp Counter (TSC) to measure the number of CPU cycles elapsed during each operation; we divide the delta in the TSC by the clock frequency to obtain the elapsed time. In particular, we measure the latency of four steps: 1) SMM switch (the time taken to switch to SMM), 2) process identification and reading, 3) configuration for the next SMI, and 4) SMM resume (the time taken to switch from SMM back to the OS).
We also measure the overall performance impact on a test system by comparing the run time of several programs in two situations: 1) when the system under test is not instrumented by our proposed component, and 2) when our system under test is instrumented on each instruction. To this end, a successful solution will be able to analyze more stealthy malware samples than existing techniques such as Ether [26] or Virtuoso [58] while incurring a lower performance overhead.
To measure freedom from software artifacts, we instrument popular stealthy malware samples with our proposed components and compare the results against other state-of-the-art malware analysis techniques. We hypothesize that our technique will be able to instrument execution of more samples than other techniques such as Ether or Virtuoso.
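As a worked illustration of the cycle-to-time conversion, the sketch below turns a TSC delta into microseconds. The function name, the 2.0 GHz clock frequency, and the counter values are assumptions chosen for the example; the sketch also assumes the TSC ticks at a constant rate.

```python
def tsc_delta_to_microseconds(tsc_start, tsc_end, cpu_hz):
    """Convert a Time Stamp Counter delta into elapsed microseconds.

    Elapsed time is cycles divided by frequency (cycles per second).
    """
    cycles = tsc_end - tsc_start
    return cycles / cpu_hz * 1e6


# Hypothetical readings taken before and after one SMI handler step
# on an assumed 2.0 GHz processor.
elapsed_us = tsc_delta_to_microseconds(1_000_000, 1_006_580, 2.0e9)
print(f"{elapsed_us:.2f} us")
```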

PCI Express Hardware Introspection
We measure overall memory throughput for our proposed component. We compare the throughput on the system under test when it is 1) uninstrumented and 2) instrumented by our component. Specifically, we measure the number of MB per second of transactions that we can complete using a popular tool, RAMSpeed [46]. When the instrumented memory throughput statistically matches the uninstrumented memory throughput, we consider the component transparent (i.e., free from timing artifacts) and thus successful.
We then measure its effectiveness in analyzing stealthy malware samples. We consider a corpus of labeled malware samples that serve as ground truth for analysis. Within this labeled corpus, we know which of them contain stealthy checks for artifacts. Thus, we count how many of the stealthy samples this component is capable of analyzing. If we successfully analyze all of them, the component is successful.

Transparent Process Introspection
There are three parts we consider for evaluating the transparent process introspection component. First, we want to know what fraction of local, global, and stack-allocated variable values our component can successfully and correctly report. Second, we measure how accurately this component can recover dynamic stack traces from memory snapshots as a function of sampling frequency. Finally, we apply the component to a case study to show its usefulness in analyzing stealthy malware samples.
To address the first part, we assume two binaries for each program we consider: 1) the original binary, and 2) a test binary. We gather ground truth data by recording every variable's value that is in-scope and available at every point of execution in the test binary by using source-level instrumentation [67]. Then, we execute the original binary. At each time step for which our component records a snapshot, we report the value of the variable. We measure success in terms of the fraction of those variables for which our component reports the correct value as reported in the test binary.
For the second part of this evaluation, we again assume we have two binaries for each subject. We gather ground truth information about function calls during program execution. We then vary how frequently we can acquire memory snapshots while running the original binary. From these snapshots, we find the activation stack and acquire the current function being executed at each snapshot. We then compare the stack trace acquired with memory snapshots to the ground truth. Thus, we evaluate the portion of the dynamic stack trace that our component can accurately report given a memory snapshot every k cycles. We consider function calls only in userspace (e.g., program functions like main and library functions like printf are included, but not actions like system calls).
We introduce a new metric for this experiment: the number of function calls missed in our output stack trace that were present in the ground truth stack trace. For example, if the ground truth sample contains $(1, f_1), (2, f_2), \ldots$ and we sample at cycles $1, 3, 5, \ldots$, then our approach would report a stack trace missing the call to $f_2$ at $t = 2$. Our metric counts the total number of such single omissions. Because the stack trace length differs among test cases and programs, we normalize this value to a percentage. Thus, a stack trace identical to the ground truth corresponds to an accuracy of 100%, while the example above that misses program behavior at $t = 2$ has an accuracy of 66%. In other words, the final value we report is $\frac{f - m}{f}$, where $m$ is the number of misses and $f$ is the number of function calls in the ground truth data. While other evaluation metrics are possible for dynamic stack traces (e.g., longest common subsequence, edit distance), we prefer this metric because of its conservative nature.
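The metric can be sketched in a few lines. This is a simplified reading rather than the evaluation harness: it treats a call as observed only when a snapshot lands on the exact cycle recorded in the ground truth, and the third ground truth entry `(3, "f3")` is an assumption added so that the example has three calls.

```python
def stack_trace_accuracy(ground_truth, sampled_cycles):
    """Return (f - m) / f, where f is the number of ground truth calls
    and m is the number of calls with no snapshot at their cycle."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one call")
    sampled = set(sampled_cycles)
    f = len(ground_truth)
    m = sum(1 for cycle, _ in ground_truth if cycle not in sampled)
    return (f - m) / f


# Hypothetical ground truth of (cycle, function) pairs.
ground_truth = [(1, "f1"), (2, "f2"), (3, "f3")]
accuracy = stack_trace_accuracy(ground_truth, sampled_cycles=[1, 3, 5])
print(f"{accuracy:.1%}")
```

With one miss out of three calls the sketch yields two thirds, in line with the 66% figure in the example above.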
In addition, for the third part of this evaluation we consider pafish v04. Pafish is an open source program for Windows that uses a variety of common artifacts to determine the presence of a debugger or VM, printing out the results of each check. This evaluation requires that we know the ground truth set of artifacts that pafish considers to detect analysis. We obtain this set by manual inspection of the pafish code and comments. In particular, pafish runs through a list of known artifacts and reports whether or not each artifact is detected during execution. Using pafish is amenable to complete ground truth annotation (unlike a wild malware sample for which we could miss a stealthy check and thus have false negatives). Further, a case study of pafish gives confidence that the proposed component is applicable to the analysis of stealthy malware because it contains a large set of indicative artifacts used by samples in the wild.

CPU Interposition
We proposed two options for this component in Section 4.3. As such, we propose experiments that help us fail early so as to inform our decision about which option to take to completion.
In Section 4.2, we proposed a memory introspection technique for x86 platforms. Since the first proposed option for CPU interposition is planned for ARM platforms, a fast experiment is whether we can quickly retarget the process introspection component on an Android platform. If we can, then we gain confidence that the first option is more appropriate.
Secondly, the proposed option for quadcopter repairs is based on automated program repair. Thus, if we can quickly retarget an existing automated repair system such as GenProg [57] to the ARM platform, then we gain confidence that the second option is more appropriate.
Additionally, we propose translating patches to an equivalent sequence of inserted instructions for Option 2. We propose manually making such translations for a candidate set of indicative patches. We then consider the types of operations required to achieve this behavior. We hypothesize that patches will involve a reasonably low number of common operations. For example, a large number of patches may exhibit the need to change behavior by inserting an instruction that 'reverses' the effect of a previous instruction. If patches can be classified into fewer than ten groups, we gain confidence that automating the translation is feasible, and thus that Option 2 is viable.
After selecting an option, we consider further experiments.

Option 1: Android Black Box Testing
Recall from Section 4.3.1 that we want to modify the branch exercised by a sample of software. We assume we have the location of the variable to be changed (based upon analysis provided by the process introspection component in Section 4.2). We will evaluate every branch in a program. Each branch instruction is associated with a value in a register at that point of execution. We would attempt to change the value of that register so that the path taken by the branch is changed. If the fraction of branches whose results are changed in time due to our component exceeds 80%, we succeed.

Option 2: Repairing Quadcopter Firmware
We propose translating known patches into an equivalent representation of instructions to be inserted only (as required by our proposed use of hardware debuggers). Thus, we propose measuring the fraction of patches that we can successfully translate to such a representation. Then, we propose inserting these patches into a quadcopter as it executes its firmware. In particular, we use a hardware debugger to periodically check the program counter, compare it against the desired location of insertion, and determine whether to insert the patching instructions. For each patch, we measure the fraction of insertion attempts that succeed. In particular, we must ensure that the patch is inserted at the desired location before execution reaches that location. We consider this component successful when it is capable of inserting a patch at least 80% of the time.
Figure 3 shows the overall plan from start to graduation of the proposed doctoral research. We will mainly target security venues for submitting original manuscripts. In particular, the highly reputable conferences and symposia in this area are 1) IEEE Symposium on Security and Privacy, 2) USENIX Security Symposium, 3) Network and Distributed System Security Symposium, and 4) ACM Conference on Computer and Communications Security. Additionally, other international conferences include 5) Annual Computer Security Applications Conference, 6) Recent Advances in Intrusion Detection, and 7) ACM Symposium on Information, Computer and Communications Security. The submission dates for these annual conferences and symposia are roughly spread throughout the year, so there is always a nearby submission date to consider as research concludes on a particular research thrust.

Preliminary Results
In this section, we present preliminary results from ongoing research.

Using System Management Mode
Table 1 shows a breakdown of the operations in the SMI handler if the last running process is the target malware in the instruction-by-instruction stepping mode. This experiment shows the mean, standard deviation, and 95% confidence interval of 25 runs. The SMM switching time takes about 3.29 microseconds. Command checking and breakpoint checking take about 2.19 microseconds in total. Configuring performance monitoring registers and SMI status registers for subsequent SMI generation takes about 1.66 microseconds. Lastly, SMM resume takes 4.58 microseconds. Thus, the SMM approach takes about 12 microseconds to execute an instruction without debugging command communication. In contrast, the similar state-of-the-art system Virtuoso [58] requires, in the best case, 450 ms to complete a similar operation in Windows. Thus, this approach shows promise in providing improved timing transparency compared to existing work.
To demonstrate the efficiency of this component, we propose measuring the performance overhead of the four stepping methods on both Windows and Linux platforms. We propose using popular benchmark programs on each platform. We can also take advantage of Cygwin and evaluate performance of several standard Linux utilities. We can then use shell scripts to compute the runtimes of these commands.
Table 2 shows the performance slowdown introduced by the step-by-step debugging for several programs. The first column specifies four different stepping methods; the following five columns show the slowdown on Windows, which is calculated by dividing the current running time by the base running time; and the last five columns show the slowdown on Linux. Far control transfer (e.g., call instruction) stepping only introduces a 2x slowdown on Windows and Linux, which facilitates coarse-grained tracing for malware debugging. As expected, fine-grained stepping methods introduce more overhead. Preliminary evaluation of instruction-by-instruction debugging causes about 973x slowdown on Windows in the worst case. This high runtime overhead is due to the 12-microsecond cost of every instruction (as shown in Table 1) in the instruction-stepping mode. While this overhead is high, our proposed approach is still three times faster than previous work such as Ether [26,92] in single-stepping mode.
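The per-instruction cost quoted above is the sum of the four reported steps, which a short calculation makes explicit. The dictionary keys are informal labels for this sketch, and the final ratio is only a rough comparison that takes the 450 ms Virtuoso figure at face value.

```python
# Mean latency of each SMI handler step, in microseconds, as reported above.
smi_steps_us = {
    "SMM switch": 3.29,
    "command and breakpoint checking": 2.19,
    "configure next SMI": 1.66,
    "SMM resume": 4.58,
}

total_us = sum(smi_steps_us.values())
print(f"total per instruction: {total_us:.2f} us")

# Best-case Virtuoso latency quoted in the text, converted to microseconds.
virtuoso_us = 450 * 1000
ratio = virtuoso_us / total_us
print(f"Virtuoso / SMM ratio: about {ratio:,.0f}x")
```

The total of about 11.72 microseconds rounds to the 12 microseconds cited, and the ratio comes out on the order of tens of thousands.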

Using PCI Express and DMA
We quantify our performance impact by utilizing RAMSpeed, a popular RAM benchmarking application. We ran RAMSpeed with and without our instrumentation. In each case we ran four separate experiments which stressed the INT, MMX, SSE, and FL-POINT instruction sets. Each of these experiments consists of 500 sub-experiments which evaluate and average the performance of copy, scale, sum, and triad operations. To ensure that the memory reads for our instrumentation were not being cached, we had our sensors continuously read the entire memory space, which should also introduce the largest performance impact on the system. The memory polling rate of 14 MB/s was dictated by the available hardware and our particular implementation.
Figure 4 shows the overhead incurred by our PCI-e memory acquisition component as reported by RAMSpeed. At first glance, it may seem to indicate that our instrumentation has a discernible effect on the system. Note, however, that the deviation from the uninstrumented median is only 0.4% in the worst case (the SSE boxes in Figure 4). Despite our best efforts to create a controlled experiment, i.e., running RAMSpeed on a fresh install of Windows with no other processes, we were unable to definitively attribute any deviation to our polling of memory. While our memory polling certainly has some impact on the system, the rates at which we are polling memory do not predictably degrade performance. These results indicate that such a system could poll at significantly higher rates while still remaining undetectable. For example, PCIe implementations can achieve DMA read speeds of 3 GB/s [5], which could permit a novel class of introspection capabilities.
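The deviation from the uninstrumented median can be computed as in the following sketch. The helper name and the throughput samples are invented for illustration and are not measurements from this work.

```python
from statistics import median


def median_deviation_percent(baseline, instrumented):
    """Relative deviation of the instrumented median throughput from the
    uninstrumented (baseline) median, as a percentage."""
    base = median(baseline)
    return abs(median(instrumented) - base) / base * 100.0


# Hypothetical RAMSpeed throughput samples in MB/s.
baseline_mb_s = [5000.0, 5010.0, 4990.0]
instrumented_mb_s = [4985.0, 4980.0, 4975.0]

deviation = median_deviation_percent(baseline_mb_s, instrumented_mb_s)
print(f"{deviation:.1f}%")
```

A median comparison alone does not establish statistical equivalence; it only reproduces the summary figure used in the discussion above.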

Transparent Program Introspection
We evaluate this proposed component using two indicative security-critical systems, nullhttpd 0.5.0 and wu-ftpd 2.6.0, each of which has an associated security vulnerability and publicly-available exploit. Nullhttpd is vulnerable to a remote heap-based buffer overflow [2] while wu-ftpd is vulnerable to a format string vulnerability [1]. We consider indicative non-malicious test cases taken from previous research [57].
Table 5 in the appendix summarizes the test cases used in our experiments. When the source code is available, our approach uses it to construct hypotheses about variable locations, but we do not assume that the compiler flags used in the original binary are known or the same. In these experiments, we simulate that disparity by gathering hypotheses from programs compiled with "-O2," but evaluating against different binaries produced with "-O3".

Variable Reporting Accuracy
We evaluate the accuracy of our introspection component with respect to polled memory snapshots as discussed in Section 5.2. We partition variables in scope into local, global, and stack-allocated variables. Table 3 reports the results of our variable introspection experiment. For each program, the results are averaged over all test inputs (non-malicious indicative tests and one malicious exploit) and all relevant points (all function calls, all function entries, all loop heads). That is, these results help address the question "if an analyst were to ask at a random point to introspect the value of a random variable, what fraction of such queries would our system be able to answer correctly?" These results show over 90% accuracy for global variables. This high introspection accuracy is because many of these variables are available at a constant location described in the subject program's symbol table. However, our approach still produces reasonable introspection accuracy for local and stack-allocated variable queries. For these variables, the values are not necessarily unavailable, but the hypotheses considered by our system do not account for differences caused by dynamic allocation of structures (or, indeed, whether compiler optimizations change the structures or layout altogether). Conversely, some values are not available to our technique based on its design assumptions (e.g., variables that live exclusively in registers). Overall, however, this component answered 84% (4917 of 5873) of variable introspection queries correctly.
Figure 5a reports the results for stack trace introspection for nullhttpd. Current PCI-e DMA hardware can read roughly 1 million pages per second, or 1 page every 3200 cycles. Thus, current hardware approximately corresponds to 3200 cycles between memory snapshots in these figures. Our introspection system loses accuracy when functions execute faster than the chosen inter-sample cycle count. Each test case causes a different execution path to be taken, thus explaining the difference in results between test cases. For nullhttpd, we remain 100% accurate until the inter-sample cycle count reaches approximately 1800 cycles. After this point, the accuracy steadily declines until the inter-sample cycle count exceeds the total execution time of the program; at that point, the accuracy is 0%. Note that with the 3200 cycle sample rate, we observe a stack trace accuracy over 50% for all test cases. Figure 5b reports the accuracy for stack trace introspection for wuftpd. This program, which contains longer-running procedures, admits perfect call stack introspection up to a sampling interval of 4800 cycles. With available PCI-e DMA hardware, this component would report 100% accurate stack traces for all test cases. For such programs and workloads, a faster sampling rate (i.e., a smaller inter-sample cycle count) may allow for even greater introspection transparency.
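The relation between page read rate and inter-sample cycle count can be checked with a short calculation. The 3.2 GHz clock below is not stated in the text; it is inferred from the quoted figures of 1 million pages per second and 3200 cycles per page, and both helper functions are hypothetical.

```python
def cycles_between_snapshots(cpu_hz, pages_per_second):
    """CPU cycles that elapse between two consecutive page reads."""
    return cpu_hz / pages_per_second


def always_observed(function_duration_cycles, inter_sample_cycles):
    """A function is guaranteed to appear in some snapshot only if it
    stays active for at least one full sampling interval."""
    return function_duration_cycles >= inter_sample_cycles


# Assumed 3.2 GHz clock, chosen to match the figures quoted above.
interval = cycles_between_snapshots(3.2e9, 1e6)
print(interval)

# A hypothetical function active for 1800 cycles can fall between samples
# at this interval, while one active for 4800 cycles cannot.
print(always_observed(1800, interval), always_observed(4800, interval))
```

This is a worst-case guarantee only: a shorter function may still be caught if a snapshot happens to land while it is active, which is consistent with accuracy degrading gradually rather than dropping to zero.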

Conclusion
The proliferation of malware has recently increased dramatically, seriously damaging the privacy of users and the integrity of hosts. With the sizeable growth of stealthy malware, novel techniques must be developed that remain transparent to the sample being analyzed. We propose a multi-faceted system comprising three components: hardware-assisted program introspection, transparent process introspection, and CPU interposition. We propose using hardware assistance to rapidly acquire values of memory addresses from a host without introducing functional or timing artifacts. Then, we propose using this access to memory to transparently reconstruct the semantics of programs, specifically to acquire variables and stack traces. Finally, we propose two options for demonstrating CPU interposition as a viable means to transparently change variables and execution paths of Android malware, and to transparently repair realtime firmware. Together, these three components form a cohesive solution to the debugging transparency problem.

Figure 2: Visual representation of (a) an idealized stack trace and (b) a stack trace generated by the proposed component.

Figure 3: High level schedule for proposed research.

Figure 5: Percent of calls reported correctly as a function of the number of machine cycles between memory samples. (a) Call stack introspection accuracy for Nullhttpd. The reference line at 3200 corresponds to current hardware. On all tests sampling every 1200 cycles yields perfect accuracy. (b) Call stack introspection accuracy for Wuftpd. The reference line at 3200 corresponds to current hardware. On all tests sampling every 4800 cycles yields perfect accuracy.
Proposed components: (1) hardware-assisted memory acquisition, (2) transparent program introspection, (3) transparent CPU introspection.

Table 2: Stepping overhead on Windows and Linux (unit: times slowdown).