|  | Design of the Subzero fast code generator | 
|  | ========================================= | 
|  |  | 
|  | Introduction | 
|  | ------------ | 
|  |  | 
|  | The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 
|  | compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the | 
|  | PNaCl toolchain to compile their application to architecture-neutral PNaCl | 
|  | bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 
|  | possible.  The ``.pexe`` file is downloaded to the user's browser where the | 
|  | PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to | 
|  | `sandboxed | 
|  | <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ | 
|  | native code.  The translator uses architecture-specific optimizations as much as | 
|  | practical to generate good native code. | 
|  |  | 
|  | The native code can be cached by the browser to avoid repeating translation on | 
|  | future page loads.  However, first-time user experience is hampered by long | 
|  | translation times.  The LLVM-based PNaCl translator is pretty slow, even when | 
|  | using ``-O0`` to minimize optimizations, so delays are especially noticeable on | 
|  | slow browser platforms such as ARM-based Chromebooks. | 
|  |  | 
|  | Translator slowness can be mitigated or hidden in a number of ways. | 
|  |  | 
|  | - Parallel translation.  However, slow machines where this matters most, e.g. | 
|  | ARM-based Chromebooks, are likely to have fewer cores to parallelize across, | 
|  | and are likely to less memory available for multiple translation threads to | 
|  | use. | 
|  |  | 
|  | - Streaming translation, i.e. start translating as soon as the download starts. | 
|  | This doesn't help much when translation speed is 10× slower than download | 
|  | speed, or the ``.pexe`` file is already cached while the translated binary was | 
|  | flushed from the cache. | 
|  |  | 
|  | - Arrange the web page such that translation is done in parallel with | 
|  | downloading large assets. | 
|  |  | 
|  | - Arrange the web page to distract the user with `cat videos | 
|  | <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in | 
|  | progress. | 
|  |  | 
|  | Or, improve translator performance to something more reasonable. | 
|  |  | 
|  | This document describes Subzero's attempt to improve translation speed by an | 
|  | order of magnitude while rivaling LLVM's code quality.  Subzero does this | 
|  | through minimal IR layering, lean data structures and passes, and a careful | 
|  | selection of fast optimization passes.  It has two optimization recipes: full | 
|  | optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the | 
|  | following (described in more detail below): | 
|  |  | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | O2 recipe                              | Om1 recipe                  | | 
|  | +========================================+=============================+ | 
|  | | Parse .pexe file                       | Parse .pexe file            | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Loop nest analysis                     |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Local common subexpression elimination |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Address mode inference                 |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Read-modify-write (RMW) transform      |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Basic liveness analysis                |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Load optimization                      |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | |                                        | Phi lowering (simple)       | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Target lowering                        | Target lowering             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Full liveness analysis                 |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Register allocation                    | Minimal register allocation | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Phi lowering (advanced)                |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Post-phi register allocation           |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Branch optimization                    |                             | | 
|  | +----------------------------------------+-----------------------------+ | 
|  | | Code emission                          | Code emission               | | 
|  | +----------------------------------------+-----------------------------+ | 
|  |  | 
|  | Goals | 
|  | ===== | 
|  |  | 
|  | Translation speed | 
|  | ----------------- | 
|  |  | 
|  | We'd like to be able to translate a ``.pexe`` file as fast as download speed. | 
|  | Any faster is in a sense wasted effort.  Download speed varies greatly, but | 
|  | we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a | 
|  | slow machine.  We observe a 3× single-thread performance difference between A15 | 
|  | and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a | 
|  | ``.pexe`` file could be compressed to 50% on the web server using gzip transport | 
|  | compression, so we set the translation speed goal to 6 MB/sec on the high-end | 
|  | Xeon workstation. | 
|  |  | 
|  | Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at | 
|  | ⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode. | 
|  |  | 
|  | In other words, Subzero's goal is to improve over LLVM's translation speed by | 
|  | 10×. | 
|  |  | 
|  | Code quality | 
|  | ------------ | 
|  |  | 
|  | Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0`` | 
|  | code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On | 
|  | average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``. | 
|  |  | 
|  | It's important to note that the quality of Subzero-generated code depends on | 
|  | target-neutral optimizations and simplifications being run beforehand in the | 
|  | developer environment.  The ``.pexe`` file reflects these optimizations.  For | 
|  | example, Subzero assumes that the basic blocks are ordered topologically where | 
|  | possible (which makes liveness analysis converge fastest), and Subzero does not | 
|  | do any function inlining because it should already have been done. | 
|  |  | 
|  | Translator size | 
|  | --------------- | 
|  |  | 
|  | The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size. | 
|  | We think 1 MB is a more reasonable size -- especially for such a component that | 
|  | is distributed to a billion Chrome users.  Thus we target a 10× reduction in | 
|  | binary size. | 
|  |  | 
|  | For development, Subzero can be built for all target architectures, and all | 
|  | debugging and diagnostic options enabled.  For a smaller translator, we restrict | 
|  | to a single target architecture, and define a ``MINIMAL`` build where | 
|  | unnecessary features are compiled out. | 
|  |  | 
|  | Subzero leverages some data structures from LLVM's ``ADT`` and ``Support`` | 
|  | include directories, which have little impact on translator size.  It also uses | 
|  | some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again | 
|  | with little size impact.  In non-``MINIMAL`` builds, the translator size is much | 
|  | larger due to including code for parsing text-format bitcode files and forming | 
|  | LLVM IR. | 
|  |  | 
|  | Memory footprint | 
|  | ---------------- | 
|  |  | 
|  | The current LLVM-based translator suffers from an issue in which some | 
|  | function-specific data has to be retained in memory until all translation | 
|  | completes, and therefore the memory footprint grows without bound.  Large | 
|  | ``.pexe`` files can lead to the translator process holding hundreds of MB of | 
|  | memory by the end.  The translator runs in a separate process, so this memory | 
|  | growth doesn't *directly* affect other processes, but it does dirty the physical | 
|  | memory and contributes to a perception of bloat and sometimes a reality of | 
|  | out-of-memory tab killing, especially noticeable on weaker systems. | 
|  |  | 
|  | Subzero should maintain a stable memory footprint throughout translation.  It's | 
|  | not really practical to set a specific limit, because there is not really a | 
|  | practical limit on a single function's size, but the footprint should be | 
|  | "reasonable" and be proportional to the largest input function size, not the | 
|  | total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or | 
|  | inexorable memory growth.  (We use ASAN builds to test for leaks.) | 
|  |  | 
|  | Multithreaded translation | 
|  | ------------------------- | 
|  |  | 
|  | It should be practical to translate different functions concurrently and see | 
|  | good scalability.  Some locking may be needed, such as accessing output buffers | 
|  | or constant pools, but that should be fairly minimal.  In contrast, LLVM was | 
|  | only designed for module-level parallelism, and as such, the PNaCl translator | 
|  | internally splits a ``.pexe`` file into several modules for concurrent | 
|  | translation.  All output needs to be deterministic regardless of the level of | 
|  | multithreading, i.e. functions and data should always be output in the same | 
|  | order. | 
|  |  | 
|  | Target architectures | 
|  | -------------------- | 
|  |  | 
|  | Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future | 
|  | targets include ARM64 and MIPS64, though these targets lack NaCl support | 
|  | including a sandbox model or a validator. | 
|  |  | 
|  | The first implementation is for x86-32, because it was expected to be | 
|  | particularly challenging, and thus more likely to draw out any design problems | 
|  | early: | 
|  |  | 
|  | - There are a number of special cases, asymmetries, and warts in the x86 | 
|  | instruction set. | 
|  |  | 
|  | - Complex addressing modes may be leveraged for better code quality. | 
|  |  | 
|  | - 64-bit integer operations have to be lowered into longer sequences of 32-bit | 
|  | operations. | 
|  |  | 
|  | - Paucity of physical registers may reveal code quality issues early in the | 
|  | design. | 
|  |  | 
|  | Detailed design | 
|  | =============== | 
|  |  | 
|  | Intermediate representation - ICE | 
|  | --------------------------------- | 
|  |  | 
|  | Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's | 
|  | IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a | 
|  | representation of global variables and initializers, and a set of functions. | 
|  | Each function contains a list of basic blocks, and each basic block constains a | 
|  | list of instructions.  Instructions that operate on stack and register variables | 
|  | do so using static single assignment (SSA) form. | 
|  |  | 
|  | The ``.pexe`` file is translated one function at a time (or in parallel by | 
|  | multiple translation threads).  The recipe for optimization passes depends on | 
|  | the specific target and optimization level, and is described in detail below. | 
|  | Global variables (types and initializers) are simply and directly translated to | 
|  | object code, without any meaningful attempts at optimization. | 
|  |  | 
|  | A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class. | 
|  | Its key contents include: | 
|  |  | 
|  | - A list of ``CfgNode`` pointers, generally held in topological order. | 
|  |  | 
|  | - A list of ``Variable`` pointers corresponding to local variables used in the | 
|  | function plus compiler-generated temporaries. | 
|  |  | 
|  | A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents | 
|  | include: | 
|  |  | 
|  | - A linear list of instructions, in the same style as LLVM.  The last | 
|  | instruction of the list is always a terminator instruction: branch, switch, | 
|  | return, unreachable. | 
|  |  | 
|  | - A list of Phi instructions, also in the same style as LLVM.  They are held as | 
|  | a linear list for convenience, though per Phi semantics, they are executed "in | 
|  | parallel" without dependencies on each other. | 
|  |  | 
|  | - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and | 
|  | another list for outgoing edges. | 
|  |  | 
|  | - The node's unique, 0-based index into the CFG's node list. | 
|  |  | 
|  | An instruction is represented by the ``Ice::Inst`` class.  Its key contents | 
|  | include: | 
|  |  | 
|  | - A list of source operands. | 
|  |  | 
|  | - Its destination variable, if the instruction produces a result in an | 
|  | ``Ice::Variable``. | 
|  |  | 
|  | - A bitvector indicating which variables' live ranges this instruction ends. | 
|  | This is computed during liveness analysis. | 
|  |  | 
|  | Instructions kinds are divided into high-level ICE instructions and low-level | 
|  | ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode | 
|  | instruction kinds.  Each target architecture implementation extends the | 
|  | instruction space with its own set of low-level instructions.  Generally, | 
|  | low-level instructions correspond to individual machine instructions.  The | 
|  | high-level ICE instruction space includes a few additional instruction kinds | 
|  | that are not part of LLVM but are generally useful (e.g., an Assignment | 
|  | instruction), or are useful across targets (e.g., BundleLock and BundleUnlock | 
|  | instructions for sandboxing). | 
|  |  | 
|  | Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl | 
|  | ABI restrictions as documented in the `PNaCl Bitcode Reference Manual | 
|  | <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are | 
|  | the following: | 
|  |  | 
|  | - Alloca: allocate data on the stack | 
|  |  | 
|  | - Arithmetic: binary operations of the form ``A = B op C`` | 
|  |  | 
|  | - Br: conditional or unconditional branch | 
|  |  | 
|  | - Call: function call | 
|  |  | 
|  | - Cast: unary type-conversion operations | 
|  |  | 
|  | - ExtractElement: extract a scalar element from a vector-type value | 
|  |  | 
|  | - Fcmp: floating-point comparison | 
|  |  | 
|  | - Icmp: integer comparison | 
|  |  | 
|  | - IntrinsicCall: call a known intrinsic | 
|  |  | 
|  | - InsertElement: insert a scalar element into a vector-type value | 
|  |  | 
|  | - Load: load a value from memory | 
|  |  | 
|  | - Phi: implement the SSA phi node | 
|  |  | 
|  | - Ret: return from the function | 
|  |  | 
|  | - Select: essentially the C language operation of the form ``X = C ? Y : Z`` | 
|  |  | 
|  | - Store: store a value into memory | 
|  |  | 
|  | - Switch: generalized branch to multiple possible locations | 
|  |  | 
|  | - Unreachable: indicate that this portion of the code is unreachable | 
|  |  | 
|  | The additional high-level ICE instructions are the following: | 
|  |  | 
|  | - Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi | 
|  | instructions to non-SSA assignments, before lowering to machine code. | 
|  |  | 
|  | - BundleLock, BundleUnlock.  These are markers used for sandboxing, but are | 
|  | common across all targets and so they are elevated to the high-level | 
|  | instruction set. | 
|  |  | 
|  | - FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in | 
|  | liveness analysis, elevated to the high-level because they are used by all | 
|  | targets.  They are described in more detail at the end of this section. | 
|  |  | 
|  | - JumpTable: this represents the result of switch optimization analysis, where | 
|  | some switch instructions may use jump tables instead of cascading | 
|  | compare/branches. | 
|  |  | 
|  | An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an | 
|  | operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants | 
|  | include scalar integer constants, scalar floating point constants, Undef (an | 
|  | unspecified constant of a particular scalar or vector type), and symbol | 
|  | constants (essentially addresses of globals).  Note that the PNaCl ABI does not | 
|  | include vector-type constants besides Undef, and as such, Subzero (so far) has | 
|  | no reason to represent vector-type constants internally.  A variable represents | 
|  | a value allocated on the stack (though not including alloca-derived storage). | 
|  | Among other things, a variable holds its unique, 0-based index into the CFG's | 
|  | variable list. | 
|  |  | 
|  | Each target can extend the ``Constant`` and ``Variable`` classes for its own | 
|  | needs.  In addition, the ``Operand`` class may be extended, e.g. to define an | 
|  | x86 ``MemOperand`` that encodes a base register, an index register, an index | 
|  | register shift amount, and a constant offset. | 
|  |  | 
|  | Register allocation and liveness analysis are restricted to Variable operands. | 
|  | Because of the importance of register allocation to code quality, and the | 
|  | translation-time cost of liveness analysis, Variable operands get some special | 
|  | treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate | 
|  | across all the Variables of an instruction.  An instruction holds a list of | 
|  | operands, but an operand may contain 0, 1, or more Variables.  As such, the | 
|  | ``Operand`` class specially holds a list of Variables contained within, for | 
|  | quick access. | 
|  |  | 
|  | A Subzero transformation pass may work by deleting an existing instruction and | 
|  | replacing it with zero or more new instructions.  Instead of actually deleting | 
|  | the existing instruction, we generally mark it as deleted and insert the new | 
|  | instructions right after the deleted instruction.  When printing the IR for | 
|  | debugging, this is a big help because it makes it much more clear how the | 
|  | non-deleted instructions came about. | 
|  |  | 
|  | Subzero has a few special instructions to help with liveness analysis | 
|  | consistency. | 
|  |  | 
|  | - The FakeDef instruction gives a fake definition of some variable.  For | 
|  | example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx`` | 
|  | but an ICE instruction can represent only one destination variable.  This is | 
|  | similar for multiply instructions, and for function calls that return a 64-bit | 
|  | integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax`` | 
|  | trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``. | 
|  |  | 
|  | - The FakeUse instruction registers a use of a variable, typically to prevent an | 
|  | earlier assignment to that variable from being dead-code eliminated.  For | 
|  | example, lowering an operation like ``x=cc?y:z`` may be done using x86's | 
|  | conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a | 
|  | FakeUse of ``x`` between the two instructions, the liveness analysis pass may | 
|  | dead-code eliminate the first instruction. | 
|  |  | 
|  | - The FakeKill instruction is added after a call instruction, and is a quick way | 
|  | of indicating that caller-save registers are invalidated. | 
|  |  | 
|  | Pexe parsing | 
|  | ------------ | 
|  |  | 
|  | Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It | 
|  | parses the ``.pexe`` file function by function, ultimately constructing an ICE | 
|  | CFG for each function.  After a function is parsed, its CFG is handed off to the | 
|  | translation phase.  The bitcode parser also parses global initializer data and | 
|  | hands it off to be translated to data sections in the object file. | 
|  |  | 
|  | Subzero has another parsing strategy for testing/debugging.  LLVM libraries can | 
|  | be used to parse a module into LLVM IR (though very slowly relative to Subzero | 
|  | native parsing).  Then we iterate across the LLVM IR and construct high-level | 
|  | ICE, handing off each CFG to the translation phase. | 
|  |  | 
|  | Overview of lowering | 
|  | -------------------- | 
|  |  | 
|  | In general, translation goes like this: | 
|  |  | 
|  | - Parse the next function from the ``.pexe`` file and construct a CFG consisting | 
|  | of high-level ICE. | 
|  |  | 
|  | - Do analysis passes and transformation passes on the high-level ICE, as | 
|  | desired. | 
|  |  | 
|  | - Lower each high-level ICE instruction into a sequence of zero or more | 
|  | low-level ICE instructions.  Each high-level instruction is generally lowered | 
|  | independently, though the target lowering is allowed to look ahead in the | 
|  | CfgNode's instruction list if desired. | 
|  |  | 
|  | - Do more analysis and transformation passes on the low-level ICE, as desired. | 
|  |  | 
|  | - Assemble the low-level CFG into an ELF object file (alternatively, a textual | 
|  | assembly file that is later assembled by some external tool). | 
|  |  | 
|  | - Repeat for all functions, and also produce object code for data such as global | 
|  | initializers and internal constant pools. | 
|  |  | 
|  | Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``, | 
|  | the intention is to apply all available optimizations to get the best code | 
|  | quality (though the initial code quality goal is measured against LLVM's ``O0`` | 
|  | code quality).  For ``Om1``, the intention is to apply as few optimizations as | 
|  | possible and produce code as quickly as possible, accepting poor code quality. | 
|  | ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words, | 
|  | "sub-zero". | 
|  |  | 
|  | High-level debuggability of generated code is so far not a design requirement. | 
|  | Subzero doesn't really do transformations that would obfuscate debugging; the | 
|  | main thing might be that register allocation (including stack slot coalescing | 
|  | for stack-allocated variables whose live ranges don't overlap) may render a | 
|  | variable's value unobtainable after its live range ends.  This would not be an | 
|  | issue for ``Om1`` since it doesn't register-allocate program-level variables, | 
|  | nor does it coalesce stack slots.  That said, fully supporting debuggability | 
|  | would require a few additions: | 
|  |  | 
|  | - DWARF support would need to be added to Subzero's ELF file emitter.  Subzero | 
|  | propagates global symbol names, local variable names, and function-internal | 
|  | label names that are present in the ``.pexe`` file.  This would allow a | 
|  | debugger to map addresses back to symbols in the ``.pexe`` file. | 
|  |  | 
|  | - To map ``.pexe`` file symbols back to meaningful source-level symbol names, | 
|  | file names, line numbers, etc., Subzero would need to handle `LLVM bitcode | 
|  | metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg`` | 
|  | `instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_. | 
|  |  | 
|  | - The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so | 
|  | the toolchain would need to be modified to preserve it. | 
|  |  | 
|  | Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but | 
|  | produces code with one third the code quality.  ``Om1`` is good for testing and | 
|  | debugging -- during translation, it tends to expose errors in the basic lowering | 
|  | that might otherwise have been hidden by the register allocator or other | 
|  | optimization passes.  It also helps determine whether a code correctness problem | 
|  | is a fundamental problem in the basic lowering, or an error in another | 
|  | optimization pass. | 
|  |  | 
|  | The implementation of target lowering also controls the recipe of passes used | 
|  | for ``Om1`` and ``O2`` translation.  For example, address mode inference may | 
|  | only be relevant for x86. | 
|  |  | 
|  | Lowering strategy | 
|  | ----------------- | 
|  |  | 
|  | The core of Subzero's lowering from high-level ICE to low-level ICE is to lower | 
|  | each high-level instruction down to a sequence of low-level target-specific | 
|  | instructions, in a largely context-free setting.  That is, each high-level | 
|  | instruction conceptually has a simple template expansion into low-level | 
|  | instructions, and lowering can in theory be done in any order.  This may sound | 
|  | like a small effort, but quite a large number of templates may be needed because | 
|  | of the number of PNaCl types and instruction variants.  Furthermore, there may | 
|  | be optimized templates, e.g. to take advantage of operator commutativity (for | 
|  | example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``).  This is | 
|  | similar to other template-based approaches in fast code generation or | 
|  | interpretation, though some decisions are deferred until after some global | 
|  | analysis passes, mostly related to register allocation, stack slot assignment, | 
|  | and specific choice of instruction variant and addressing mode. | 
|  |  | 
|  | The key idea for a lowering template is to produce valid low-level instructions | 
|  | that are guaranteed to meet address mode and other structural requirements of | 
|  | the instruction set.  For example, on x86, the source operand of an integer | 
|  | store instruction must be an immediate or a physical register; a shift | 
|  | instruction's shift amount must be an immediate or in register ``%cl``; a | 
|  | function's integer return value is in ``%eax``; most x86 instructions are | 
|  | two-operand, in contrast to corresponding three-operand high-level instructions; | 
|  | etc. | 
|  |  | 
|  | Because target lowering runs before register allocation, there is no way to know | 
|  | whether a given ``Ice::Variable`` operand lives on the stack or in a physical | 
|  | register.  When the low-level instruction calls for a physical register operand, | 
|  | the target lowering can create an infinite-weight Variable.  This tells the | 
|  | register allocator to assign infinite weight when making decisions, effectively | 
|  | guaranteeing some physical register.  Variables can also be pre-colored to a | 
|  | specific physical register (``cl`` in the shift example above), which also gives | 
|  | infinite weight. | 
|  |  | 
|  | To illustrate, consider a high-level arithmetic instruction on 32-bit integer | 
|  | operands:: | 
|  |  | 
|  | A = B + C | 
|  |  | 
|  | X86 target lowering might produce the following:: | 
|  |  | 
|  | T.inf = B  // mov instruction | 
|  | T.inf += C // add instruction | 
|  | A = T.inf  // mov instruction | 
|  |  | 
|  | Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a | 
|  | physical register, the three lowered instructions are all encodable regardless | 
|  | of whether ``B`` and ``C`` are physical registers, memory, or immediates, and | 
|  | whether ``A`` is a physical register or in memory. | 
|  |  | 
|  | In this example, ``A`` must be a Variable and one may be tempted to simplify the | 
|  | lowering sequence by setting ``A`` as infinite-weight and using:: | 
|  |  | 
|  | A = B  // mov instruction | 
|  | A += C // add instruction | 
|  |  | 
|  | This has two problems.  First, if the original instruction was actually ``A = | 
|  | B + A``, the result would be incorrect.  Second, assigning ``A`` a physical | 
|  | register applies throughout ``A``'s entire live range.  This is probably not | 
|  | what is intended, and may ultimately lead to a failure to allocate a register | 
|  | for an infinite-weight variable. | 
|  |  | 
|  | This style of lowering leads to many temporaries being generated, so in ``O2`` | 
|  | mode, we rely on the register allocator to clean things up.  For example, in the | 
|  | example above, if ``B`` ends up getting a physical register and its live range | 
|  | ends at this instruction, the register allocator is likely to reuse that | 
|  | register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register | 
|  | copy, which is removed as an emission-time peephole optimization. | 
|  |  | 
|  | O2 lowering | 
|  | ----------- | 
|  |  | 
|  | Currently, the ``O2`` lowering recipe is the following: | 
|  |  | 
|  | - Loop nest analysis | 
|  |  | 
|  | - Local common subexpression elimination | 
|  |  | 
|  | - Address mode inference | 
|  |  | 
|  | - Read-modify-write (RMW) transformation | 
|  |  | 
|  | - Basic liveness analysis | 
|  |  | 
|  | - Load optimization | 
|  |  | 
|  | - Target lowering | 
|  |  | 
|  | - Full liveness analysis | 
|  |  | 
|  | - Register allocation | 
|  |  | 
|  | - Phi instruction lowering (advanced) | 
|  |  | 
|  | - Post-phi lowering register allocation | 
|  |  | 
|  | - Branch optimization | 
|  |  | 
|  | These passes are described in more detail below. | 
|  |  | 
|  | Om1 lowering | 
|  | ------------ | 
|  |  | 
|  | Currently, the ``Om1`` lowering recipe is the following: | 
|  |  | 
|  | - Phi instruction lowering (simple) | 
|  |  | 
|  | - Target lowering | 
|  |  | 
|  | - Register allocation (infinite-weight and pre-colored only) | 
|  |  | 
|  | Optimization passes | 
|  | ------------------- | 
|  |  | 
|  | Liveness analysis | 
|  | ^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Liveness analysis is a standard dataflow optimization, implemented as follows. | 
|  | For each node (basic block), its live-out set is computed as the union of the | 
|  | live-in sets of its successor nodes.  Then the node's instructions are processed | 
|  | in reverse order, updating the live set, until the beginning of the node is | 
|  | reached, and the node's live-in set is recorded.  If this iteration has changed | 
|  | the node's live-in set, the node's predecessors are marked for reprocessing. | 
|  | This continues until no more nodes need reprocessing.  If nodes are processed in | 
|  | reverse topological order, the number of iterations over the CFG is generally | 
|  | equal to the maximum loop nest depth. | 
|  |  | 
|  | To implement this, each node records its live-in and live-out sets, initialized | 
|  | to the empty set.  Each instruction records which of its Variables' live ranges | 
|  | end in that instruction, initialized to the empty set.  A side effect of | 
|  | liveness analysis is dead instruction elimination.  Each instruction can be | 
|  | marked as tentatively dead, and after the algorithm converges, the tentatively | 
|  | dead instructions are permanently deleted. | 
|  |  | 
|  | Optionally, after this liveness analysis completes, we can do live range | 
|  | construction, in which we calculate the live range of each variable in terms of | 
|  | instruction numbers.  A live range is represented as a union of segments, where | 
|  | the segment endpoints are instruction numbers.  Instruction numbers are required | 
|  | to be unique across the CFG, and monotonically increasing within a basic block. | 
|  | As a union of segments, live ranges can contain "gaps" and are therefore | 
|  | precise.  Because of SSA properties, a variable's live range can start at most | 
|  | once in a basic block, and can end at most once in a basic block.  Liveness | 
|  | analysis keeps track of which variable/instruction tuples begin live ranges and | 
|  | end live ranges, and combined with live-in and live-out sets, we can efficiently | 
|  | build up live ranges of all variables across all basic blocks. | 
|  |  | 
|  | A lot of care is taken to try to make liveness analysis fast and efficient. | 
|  | Because of the lowering strategy, the number of variables is generally | 
|  | proportional to the number of instructions, leading to an O(N^2) complexity | 
|  | algorithm if implemented naively.  To improve things based on sparsity, we note | 
|  | that most variables are "local" and referenced in at most one basic block (in | 
|  | contrast to the "global" variables with multi-block usage), and therefore cannot | 
|  | be live across basic blocks.  Therefore, the live-in and live-out sets, | 
|  | typically represented as bit vectors, can be limited to the set of global | 
|  | variables, and the intra-block liveness bit vector can be compacted to hold the | 
|  | global variables plus the local variables for that block. | 
|  |  | 
|  | Register allocation | 
|  | ^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Subzero implements a simple linear-scan register allocator, based on the | 
|  | allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan | 
|  | Register Allocation in the Context of SSA Form and Register Constraints | 
|  | <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has | 
|  | several nice features: | 
|  |  | 
|  | - Live ranges are represented as unions of segments, as described above, rather | 
|  | than a single start/end tuple. | 
|  |  | 
|  | - It allows pre-coloring of variables with specific physical registers. | 
|  |  | 
|  | - It applies equally well to pre-lowered Phi instructions. | 
|  |  | 
|  | The paper suggests an approach of aggressively coalescing variables across Phi | 
|  | instructions (i.e., trying to force Phi source and destination variables to have | 
|  | the same register assignment), but we reject that in favor of the more natural | 
|  | preference mechanism described below. | 
|  |  | 
|  | We enhance the algorithm in the paper with the capability of automatic inference | 
|  | of register preference, and with the capability of allowing overlapping live | 
|  | ranges to safely share the same register in certain circumstances.  If we are | 
|  | considering register allocation for variable ``A``, and ``A`` has a single | 
|  | defining instruction ``A=B+C``, then the preferred register for ``A``, if | 
|  | available, would be the register assigned to ``B`` or ``C``, if any, provided | 
|  | that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this | 
|  | way we infer a good register preference for ``A``. | 
|  |  | 
|  | We allow overlapping live ranges to get the same register in certain cases. | 
|  | Suppose a high-level instruction like:: | 
|  |  | 
|  | A = unary_op(B) | 
|  |  | 
|  | has been target-lowered like:: | 
|  |  | 
|  | T.inf = B | 
|  | A = unary_op(T.inf) | 
|  |  | 
|  | Further, assume that ``B``'s live range continues beyond this instruction | 
|  | sequence, and that ``B`` has already been assigned some register.  Normally, we | 
|  | might want to infer ``B``'s register as a good candidate for ``T.inf``, but it | 
|  | turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have | 
|  | different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is | 
|  | guaranteed to be in a register, so in theory these overlapping live ranges could | 
|  | safely have the same register.  Our implementation allows this overlap as long | 
|  | as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never | 
|  | modified within ``T.inf``'s live range. | 
|  |  | 
|  | Subzero's register allocator can be run in 3 configurations. | 
|  |  | 
|  | - Normal mode.  All Variables are considered for register allocation.  It | 
|  | requires full liveness analysis and live range construction as a prerequisite. | 
|  | This is used by ``O2`` lowering. | 
|  |  | 
|  | - Minimal mode.  Only infinite-weight or pre-colored Variables are considered. | 
|  | All other Variables are stack-allocated.  It does not require liveness | 
|  | analysis; instead, it quickly scans the instructions and records first | 
|  | definitions and last uses of all relevant Variables, using that to construct a | 
|  | single-segment live range.  Although this includes most of the Variables, the | 
|  | live ranges are mostly simple, short, and rarely overlapping, which the | 
|  | register allocator handles efficiently.  This is used by ``Om1`` lowering. | 
|  |  | 
|  | - Post-phi lowering mode.  Advanced phi lowering is done after normal-mode | 
|  | register allocation, and may result in new infinite-weight Variables that need | 
|  | registers.  One would like to just run something like minimal mode to assign | 
|  | registers to the new Variables while respecting existing register allocation | 
|  | decisions.  However, it sometimes happens that there are no free registers. | 
|  | In this case, some register needs to be forcibly spilled to the stack and | 
|  | temporarily reassigned to the new Variable, and reloaded at the end of the new | 
|  | Variable's live range.  The register must be one that has no explicit | 
|  | references during the Variable's live range.  Since Subzero currently doesn't | 
|  | track def/use chains (though it does record the CfgNode where a Variable is | 
|  | defined), we just do a brute-force search across the CfgNode's instruction | 
|  | list for the instruction numbers of interest.  This situation happens very | 
|  | rarely, so there's little point for now in improving its performance. | 
|  |  | 
|  | The basic linear-scan algorithm may, as it proceeds, rescind an early register | 
|  | allocation decision, leaving that Variable to be stack-allocated.  Some of these | 
|  | times, it turns out that the Variable could have been given a different register | 
|  | without conflict, but by this time it's too late.  The literature recognizes | 
|  | this situation and describes "second-chance bin-packing", which Subzero can do. | 
|  | We can rerun the register allocator in a mode that respects existing register | 
|  | allocation decisions, and sometimes it finds new non-conflicting opportunities. | 
|  | In fact, we can repeatedly run the register allocator until convergence. | 
|  | Unfortunately, in the current implementation, these subsequent register | 
|  | allocation passes end up being extremely expensive.  This is because of the | 
|  | treatment of the "unhandled pre-colored" Variable set, which is normally very | 
|  | small but ends up being quite large on subsequent passes.  Its performance can | 
|  | probably be made acceptable with a better choice of data structures, but for now | 
|  | this second-chance mechanism is disabled. | 
|  |  | 
|  | Future work is to implement LLVM's `Greedy | 
|  | <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | 
|  | register allocator as a replacement for the basic linear-scan algorithm, given | 
|  | LLVM's experience with its improvement in code quality.  (The blog post claims | 
|  | that the Greedy allocator also improved maintainability because a lot of hacks | 
|  | could be removed, but Subzero is probably not yet to that level of hacks, and is | 
|  | less likely to see that particular benefit.) | 
|  |  | 
|  | Local common subexpression elimination | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | The Local CSE implementation goes through each instruction and records a portion | 
|  | of each ``Seen`` instruction in a hashset-like container.  That portion consists | 
|  | of the entire instruction except for any dest variable. That means ``A = X + Y`` | 
|  | and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows | 
|  | us to detect common subexpressions. | 
|  |  | 
|  | Whenever a repetition is detected, the redundant variables are stored in a | 
|  | container mapping the replacee to the replacement. In the case above, it would | 
|  | be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. | 
|  |  | 
|  | At any point if a variable that has an entry in the replacement table is | 
|  | encountered, it is replaced with the variable it is mapped to. This ensures that | 
|  | the redundant variables will not have any uses in the basic block, allowing | 
|  | dead code elimination to clean up the redundant instruction. | 
|  |  | 
|  | With SSA, the information stored is never invalidated. However, non-SSA input is | 
|  | supported with the ``-lcse=no-ssa`` option. This has to maintain some extra | 
|  | dependency information to ensure proper invalidation on variable assignment. | 
|  | This is not rigorously tested because this pass is run at an early stage where | 
|  | it is safe to assume variables have a single definition. This is not enabled by | 
|  | default because it bumps the compile time overhead from 2% to 6%. | 
|  |  | 
|  | Loop-invariant code motion | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | This pass utilizes the loop analysis information to hoist invariant instructions | 
|  | to loop pre-headers. A loop must have a single entry node (header) and that node | 
|  | must have a single external predecessor for this optimization to work, as it is | 
|  | currently implemented. | 
|  |  | 
|  | The pass works by iterating over all instructions in the loop until the set of | 
|  | invariant instructions converges. In each iteration, a non-invariant instruction | 
|  | involving only constants or a variable known to be invariant is added to the | 
|  | result set. The destination variable of that instruction is added to the set of | 
|  | variables known to be invariant (which is initialized with the constant args). | 
|  |  | 
|  | Improving the loop-analysis infrastructure is likely to have significant impact | 
|  | on this optimization. Inserting an extra node to act as the pre-header when the | 
|  | header has multiple incoming edges from outside could also be a good idea. | 
|  | Expanding the initial invariant variable set to contain all variables that do | 
|  | not have definitions inside the loop does not seem to improve anything. | 
|  |  | 
|  | Short circuit evaluation | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Short circuit evaluation splits nodes and introduces early jumps when the result | 
|  | of a logical operation can be determined early and there are no observable side | 
|  | effects of skipping the rest of the computation. The instructions considered | 
|  | backwards from the end of the basic blocks. When a definition of a variable | 
|  | involved in a conditional jump is found, an extra jump can be inserted in that | 
|  | location, moving the rest of the instructions in the node to a newly inserted | 
|  | node. Consider this example:: | 
|  |  | 
|  | __N : | 
|  | a = <something> | 
|  | Instruction 1 without side effect | 
|  | ... b = <something> ... | 
|  | Instruction N without side effect | 
|  | t1 = or a b | 
|  | br t1 __X __Y | 
|  |  | 
|  | is transformed to:: | 
|  |  | 
|  | __N : | 
|  | a = <something> | 
|  | br a __X __N_ext | 
|  |  | 
|  | __N_ext : | 
|  | Instruction 1 without side effect | 
|  | ... b = <something> ... | 
|  | Instruction N without side effect | 
|  | br b __X __Y | 
|  |  | 
|  | The logic for AND is analogous, the only difference is that the early jump is | 
|  | facilitated by a ``false`` value instead of ``true``. | 
|  |  | 
|  | Global Variable Splitting | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Global variable splitting (``-split-global-vars``) is run after register | 
|  | allocation. It works on the variables that did not manage to get registers (but | 
|  | are allowed to) and decomposes their live ranges into the individual segments | 
|  | (which span a single node at most). New variables are created (but not yet used) | 
|  | with these smaller live ranges and the register allocator is run again. This is | 
|  | not inefficient as the old variables that already had registers are now | 
|  | considered pre-colored. | 
|  |  | 
|  | The new variables that get registers replace their parent variables for their | 
|  | portion of its (parent's) live range. A copy from the old variable to the new | 
|  | is introduced before the first use and the reverse after the last def in the | 
|  | live range. | 
|  |  | 
|  | Basic phi lowering | 
|  | ^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | 
|  | implements it).  Consider this example:: | 
|  |  | 
|  | L1: | 
|  | ... | 
|  | br L3 | 
|  | L2: | 
|  | ... | 
|  | br L3 | 
|  | L3: | 
|  | A = phi [B, L1], [C, L2] | 
|  | X = phi [Y, L1], [Z, L2] | 
|  |  | 
|  | For each destination of a phi instruction, we can create a temporary and insert | 
|  | the temporary's assignment at the end of the predecessor block:: | 
|  |  | 
|  | L1: | 
|  | ... | 
|  | A' = B | 
|  | X' = Y | 
|  | br L3 | 
|  | L2: | 
|  | ... | 
|  | A' = C | 
|  | X' = Z | 
|  | br L3 | 
|  | L2: | 
|  | A = A' | 
|  | X = X' | 
|  |  | 
|  | This transformation is very simple and reliable.  It can be done before target | 
|  | lowering and register allocation, and it easily avoids the classic lost-copy and | 
|  | related problems.  ``Om1`` lowering uses this strategy. | 
|  |  | 
|  | However, it has the disadvantage of initializing temporaries even for branches | 
|  | not taken, though that could be mitigated by splitting non-critical edges and | 
|  | putting assignments in the edge-split nodes.  Another problem is that without | 
|  | extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a | 
|  | specific ordering even though phi semantics are that the assignments are | 
|  | parallel or unordered.  This sometimes imposes false live range overlaps and | 
|  | leads to poorer register allocation. | 
|  |  | 
|  | Advanced phi lowering | 
|  | ^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | ``O2`` lowering defers phi lowering until after register allocation to avoid the | 
|  | problem of false live range overlaps.  It works as follows.  We split each | 
|  | incoming edge and move the (parallel) phi assignments into the split nodes.  We | 
|  | linearize each set of assignments by finding a safe, topological ordering of the | 
|  | assignments, respecting register assignments as well.  For example:: | 
|  |  | 
|  | A = B | 
|  | X = Y | 
|  |  | 
|  | Normally these assignments could be executed in either order, but if ``B`` and | 
|  | ``X`` are assigned the same physical register, we would want to use the above | 
|  | ordering.  Dependency cycles are broken by introducing a temporary.  For | 
|  | example:: | 
|  |  | 
|  | A = B | 
|  | B = A | 
|  |  | 
|  | Here, a temporary breaks the cycle:: | 
|  |  | 
|  | t = A | 
|  | A = B | 
|  | B = t | 
|  |  | 
|  | Finally, we use the existing target lowering to lower the assignments in this | 
|  | basic block, and once that is done for all basic blocks, we run the post-phi | 
|  | variant of register allocation on the edge-split basic blocks. | 
|  |  | 
|  | When computing a topological order, we try to first schedule assignments whose | 
|  | source has a physical register, and last schedule assignments whose destination | 
|  | has a physical register.  This helps reduce register pressure. | 
|  |  | 
|  | X86 address mode inference | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | We try to take advantage of the x86 addressing mode that includes a base | 
|  | register, an index register, an index register scale amount, and an immediate | 
|  | offset.  We do this through simple pattern matching.  Starting with a load or | 
|  | store instruction where the address is a variable, we initialize the base | 
|  | register to that variable, and look up the instruction where that variable is | 
|  | defined.  If that is an add instruction of two variables and the index register | 
|  | hasn't been set, we replace the base and index register with those two | 
|  | variables.  If instead it is an add instruction of a variable and a constant, we | 
|  | replace the base register with the variable and add the constant to the | 
|  | immediate offset. | 
|  |  | 
|  | There are several more patterns that can be matched.  This pattern matching | 
|  | continues on the load or store instruction until no more matches are found. | 
|  | Because a program typically has few load and store instructions (not to be | 
|  | confused with instructions that manipulate stack variables), this address mode | 
|  | inference pass is fast. | 
|  |  | 
|  | X86 read-modify-write inference | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | A reasonably common bitcode pattern is a non-atomic update of a memory | 
|  | location:: | 
|  |  | 
|  | x = load addr | 
|  | y = add x, 1 | 
|  | store y, addr | 
|  |  | 
|  | On x86, with good register allocation, the Subzero passes described above | 
|  | generate code with only this quality:: | 
|  |  | 
|  | mov [%ebx], %eax | 
|  | add $1, %eax | 
|  | mov %eax, [%ebx] | 
|  |  | 
|  | However, x86 allows for this kind of code:: | 
|  |  | 
|  | add $1, [%ebx] | 
|  |  | 
|  | which requires fewer instructions, but perhaps more importantly, requires fewer | 
|  | physical registers. | 
|  |  | 
|  | It's also important to note that this transformation only makes sense if the | 
|  | store instruction ends ``x``'s live range. | 
|  |  | 
|  | Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) | 
|  | opportunities via simple pattern matching.  The only problem is that it is run | 
|  | before liveness analysis, which is needed to determine whether ``x``'s live | 
|  | range ends after the RMW.  Since liveness analysis is one of the most expensive | 
|  | passes, it's not attractive to run it an extra time just for RMW analysis. | 
|  | Instead, we essentially generate both the RMW and the non-RMW versions, and then | 
|  | during lowering, the RMW version deletes itself if it finds x still live. | 
|  |  | 
|  | X86 compare-branch inference | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | In the LLVM instruction set, the compare/branch pattern works like this:: | 
|  |  | 
|  | cond = icmp eq a, b | 
|  | br cond, target | 
|  |  | 
|  | The result of the icmp instruction is a single bit, and a conditional branch | 
|  | tests that bit.  By contrast, most target architectures use this pattern:: | 
|  |  | 
|  | cmp a, b  // implicitly sets various bits of FLAGS register | 
|  | br eq, target  // branch on a particular FLAGS bit | 
|  |  | 
|  | A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests | 
|  | ``cond`` and conditionally branches.  Subzero has a pass that identifies | 
|  | boolean-based operations like this and folds them into a single | 
|  | compare/branch-like operation.  It is set up for more than just cmp/br though. | 
|  | Boolean producers include icmp (integer compare), fcmp (floating-point compare), | 
|  | and trunc (integer truncation when the destination has bool type).  Boolean | 
|  | consumers include branch, select (the ternary operator from the C language), and | 
|  | sign-extend and zero-extend when the source has bool type. | 
|  |  | 
|  | Sandboxing | 
|  | ^^^^^^^^^^ | 
|  |  | 
|  | Native Client's sandbox model uses software fault isolation (SFI) to provide | 
|  | safety when running untrusted code in a browser or other environment.  Subzero | 
|  | implements Native Client's `sandboxing | 
|  | <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ | 
|  | to enable Subzero-translated executables to be run inside Chrome.  Subzero also | 
|  | provides a fairly simple framework for investigating alternative sandbox models | 
|  | or other restrictions on the sandbox model. | 
|  |  | 
|  | Sandboxing in Subzero is not actually implemented as a separate pass, but is | 
|  | integrated into lowering and assembly. | 
|  |  | 
|  | - Indirect branches, including the ret instruction, are masked to a bundle | 
|  | boundary and bundle-locked. | 
|  |  | 
|  | - Call instructions are aligned to the end of the bundle so that the return | 
|  | address is bundle-aligned. | 
|  |  | 
|  | - Indirect branch targets, including function entry and targets in a switch | 
|  | statement jump table, are bundle-aligned. | 
|  |  | 
|  | - The intrinsic for reading the thread pointer is inlined appropriately. | 
|  |  | 
|  | - For x86-64, non-stack memory accesses are with respect to the reserved sandbox | 
|  | base register.  We reduce the aggressiveness of address mode inference to | 
|  | leave room for the sandbox base register during lowering.  There are no memory | 
|  | sandboxing changes for x86-32. | 
|  |  | 
|  | Code emission | 
|  | ------------- | 
|  |  | 
|  | Subzero's integrated assembler is derived from Dart's `assembler code | 
|  | <https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_.  There is a pass | 
|  | that iterates through the low-level ICE instructions and invokes the relevant | 
|  | assembler functions.  Placeholders are added for later fixup of branch target | 
|  | offsets.  (Backward branches use short offsets if possible; forward branches | 
|  | generally use long offsets unless it is an intra-block branch of "known" short | 
|  | length.)  The assembler emits into a staging buffer.  Once emission into the | 
|  | staging buffer for a function is complete, the data is emitted to the output | 
|  | file as an ELF object file, and metadata such as relocations, symbol table, and | 
|  | string table, are accumulated for emission at the end.  Global data initializers | 
|  | are emitted similarly.  A key point is that at this point, the staging buffer | 
|  | can be deallocated, and only a minimum of data needs to held until the end. | 
|  |  | 
|  | As a debugging alternative, Subzero can emit textual assembly code which can | 
|  | then be run through an external assembler.  This is of course super slow, but | 
|  | quite valuable when bringing up a new target. | 
|  |  | 
|  | As another debugging option, the staging buffer can be emitted as textual | 
|  | assembly, primarily in the form of ".byte" lines.  This allows the assembler to | 
|  | be tested separately from the ELF related code. | 
|  |  | 
|  | Memory management | 
|  | ----------------- | 
|  |  | 
|  | Where possible, we allocate from a ``CfgLocalAllocator`` which derives from | 
|  | LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects | 
|  | allocated from the arena are never actually freed; instead, when the CFG | 
|  | translation completes and the CFG is deleted, the entire arena memory is | 
|  | reclaimed at once.  This style of allocation works well in an environment like a | 
|  | compiler where there are distinct phases with only easily-identifiable objects | 
|  | living across phases.  It frees the developer from having to manage object | 
|  | deletion, and it amortizes deletion costs across just a single arena deletion at | 
|  | the end of the phase.  Furthermore, it helps scalability by allocating entirely | 
|  | from thread-local memory pools, and minimizing global locking of the heap. | 
|  |  | 
|  | Instructions are probably the most heavily allocated complex class in Subzero. | 
|  | We represent an instruction list as an intrusive doubly linked list, allocate | 
|  | all instructions from the ``CfgLocalAllocator``, and we make sure each | 
|  | instruction subclass is basically `POD | 
|  | <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a | 
|  | trivial destructor.  This way, when the CFG is finished, we don't need to | 
|  | individually deallocate every instruction.  We do similar for Variables, which | 
|  | is probably the second most popular complex class. | 
|  |  | 
|  | There are some situations where passes need to use some `STL container class | 
|  | <http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the | 
|  | ``CfgLocalAllocator`` as the container allocator if this is needed. | 
|  |  | 
|  | Multithreaded translation | 
|  | ------------------------- | 
|  |  | 
|  | Subzero is designed to be able to translate functions in parallel.  With the | 
|  | ``-threads=N`` command-line option, there is a 3-stage producer-consumer | 
|  | pipeline: | 
|  |  | 
|  | - A single thread parses the ``.pexe`` file and produces a sequence of work | 
|  | units.  A work unit can be either a fully constructed CFG, or a set of global | 
|  | initializers.  The work unit includes its sequence number denoting its parse | 
|  | order.  Each work unit is added to the translation queue. | 
|  |  | 
|  | - There are N translation threads that draw work units from the translation | 
|  | queue and lower them into assembler buffers.  Each assembler buffer is added | 
|  | to the emitter queue, tagged with its sequence number.  The CFG and its | 
|  | ``CfgLocalAllocator`` are disposed of at this point. | 
|  |  | 
|  | - A single thread draws assembler buffers from the emitter queue and appends to | 
|  | the output file.  It uses the sequence numbers to reintegrate the assembler | 
|  | buffers according to the original parse order, such that output order is | 
|  | always deterministic. | 
|  |  | 
|  | This means that with ``-threads=N``, there are actually ``N+1`` spawned threads | 
|  | for a total of ``N+2`` execution threads, taking the parser and emitter threads | 
|  | into account.  For the special case of ``N=0``, execution is entirely sequential | 
|  | -- the same thread parses, translates, and emits, one function at a time.  This | 
|  | is useful for performance measurements. | 
|  |  | 
|  | Ideally, we would like to get near-linear scalability as the number of | 
|  | translation threads increases.  We expect that ``-threads=1`` should be slightly | 
|  | faster than ``-threads=0`` as the small amount of time spent parsing and | 
|  | emitting is done largely in parallel with translation.  With perfect | 
|  | scalability, we see ``-threads=N`` translating ``N`` times as fast as | 
|  | ``-threads=1``, up until the point where parsing or emitting becomes the | 
|  | bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory | 
|  | performance would become a bottleneck and efficiency might peak at, say, 75%. | 
|  |  | 
|  | Currently, parsing takes about 11% of total sequential time.  If translation | 
|  | scalability ever gets so fast and awesomely scalable that parsing becomes a | 
|  | bottleneck, it should be possible to make parsing multithreaded as well. | 
|  |  | 
|  | Internally, all shared, mutable data is held in the GlobalContext object, and | 
|  | access to each field is guarded by a mutex. | 
|  |  | 
|  | Security | 
|  | -------- | 
|  |  | 
|  | Subzero includes a number of security features in the generated code, as well as | 
|  | in the Subzero translator itself, which run on top of the existing Native Client | 
|  | sandbox as well as Chrome's OS-level sandbox. | 
|  |  | 
|  | Sandboxed translator | 
|  | ^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | When running inside the browser, the Subzero translator executes as sandboxed, | 
|  | untrusted code that is initially checked by the validator, just like the | 
|  | LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no | 
|  | more or less secure than the translator it replaces, from the point of view of | 
|  | the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and | 
|  | was designed from the start with security in mind, so one expects fewer attacker | 
|  | opportunities here. | 
|  |  | 
|  | Code diversification | 
|  | ^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | `Return-oriented programming | 
|  | <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a | 
|  | now-common technique for starting with e.g. a known buffer overflow situation | 
|  | and launching it into a deeper exploit.  The attacker scans the executable | 
|  | looking for ROP gadgets, which are short sequences of code that happen to load | 
|  | known values into known registers and then return.  An attacker who manages to | 
|  | overwrite parts of the stack can overwrite it with carefully chosen return | 
|  | addresses such that certain ROP gadgets are effectively chained together to set | 
|  | up the register state as desired, finally returning to some code that manages to | 
|  | do something nasty based on those register values. | 
|  |  | 
|  | If there is a popular ``.pexe`` with a large install base, the attacker could | 
|  | run Subzero on it and scan the executable for suitable ROP gadgets to use as | 
|  | part of a potential exploit.  Note that if the trusted validator is working | 
|  | correctly, these ROP gadgets are limited to starting at a bundle boundary and | 
|  | cannot use the trick of finding a gadget that happens to begin inside another | 
|  | instruction.  All the same, gadgets with these constraints still exist and the | 
|  | attacker has access to them.  This is the attack model we focus most on -- | 
|  | protecting the user against misuse of a "trusted" developer's application, as | 
|  | opposed to mischief from a malicious ``.pexe`` file. | 
|  |  | 
|  | Subzero can mitigate these attacks to some degree through code diversification. | 
|  | Specifically, we can apply some randomness to the code generation that makes ROP | 
|  | gadgets less predictable.  This randomness can have some compile-time cost, and | 
|  | it can affect the code quality; and some diversifications may be more effective | 
|  | than others.  A more detailed treatment of hardening techniques may be found in | 
|  | the Matasano report "`Attacking Clientside JIT Compilers | 
|  | <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_". | 
|  |  | 
|  | To evaluate diversification effectiveness, we use a third-party ROP gadget | 
|  | finder and limit its results to bundle-aligned addresses.  For a given | 
|  | diversification technique, we run it with a number of different random seeds, | 
|  | find ROP gadgets for each version, and determine how persistent each ROP gadget | 
|  | is across the different versions.  A gadget is persistent if the same gadget is | 
|  | found at the same code address.  The best diversifications are ones with low | 
|  | gadget persistence rates. | 
|  |  | 
|  | Subzero implements 7 different diversification techniques.  Below is a | 
|  | discussion of each technique, its effectiveness, and its cost.  The discussions | 
|  | of cost and effectiveness are for a single diversification technique; the | 
|  | translation-time costs for multiple techniques are additive, but the effects of | 
|  | multiple techniques on code quality and effectiveness are not yet known. | 
|  |  | 
|  | In Subzero's implementation, each randomization is "repeatable" in a sense. | 
|  | Each pass that includes a randomization option gets its own private instance of | 
|  | a random number generator (RNG).  The RNG is seeded with a combination of a | 
|  | global seed, the pass ID, and the function's sequence number.  The global seed | 
|  | is designed to be different across runs (perhaps based on the current time), but | 
|  | for debugging, the global seed can be set to a specific value and the results | 
|  | will be repeatable. | 
|  |  | 
|  | Subzero-generated code is subject to diversification once per translation, and | 
|  | then Chrome caches the diversified binary for subsequent executions.  An | 
|  | attacker may attempt to run the binary multiple times hoping for | 
|  | higher-probability combinations of ROP gadgets.  When the attacker guesses | 
|  | wrong, a likely outcome is an application crash.  Chrome throttles creation of | 
|  | crashy processes which reduces the likelihood of the attacker eventually gaining | 
|  | a foothold. | 
|  |  | 
|  | Constant blinding | 
|  | ~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Here, we prevent attackers from controlling large immediates in the text | 
|  | (executable) section.  A random cookie is generated for each function, and if | 
|  | the constant exceeds a specified threshold, the constant is obfuscated with the | 
|  | cookie and equivalent code is generated.  For example, instead of this x86 | 
|  | instruction:: | 
|  |  | 
|  | mov $0x11223344, <%Reg/Mem> | 
|  |  | 
|  | the following code might be generated:: | 
|  |  | 
|  | mov $(0x11223344+Cookie), %temp | 
|  | lea -Cookie(%temp), %temp | 
|  | mov %temp, <%Reg/Mem> | 
|  |  | 
|  | The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to | 
|  | prevent unintended effects on the flags register. | 
|  |  | 
|  | This transformation has almost no effect on translation time, and about 1% | 
|  | impact on code quality, depending on the threshold chosen.  It does little to | 
|  | reduce gadget persistence, but it does remove a lot of potential opportunities | 
|  | to construct intra-instruction ROP gadgets (which an attacker could use only if | 
|  | a validator bug were discovered, since the Native Client sandbox and associated | 
|  | validator force returns and other indirect branches to be to bundle-aligned | 
|  | addresses). | 
|  |  | 
|  | Constant pooling | 
|  | ~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | This is similar to constant blinding, in that large immediates are removed from | 
|  | the text section.  In this case, each unique constant above the threshold is | 
|  | stored in a read-only data section and the constant is accessed via a memory | 
|  | load.  For the above example, the following code might be generated:: | 
|  |  | 
|  | mov $Label$1, %temp | 
|  | mov %temp, <%Reg/Mem> | 
|  |  | 
|  | This has a similarly small impact on translation time and ROP gadget | 
|  | persistence, and a smaller (better) impact on code quality.  This is because it | 
|  | uses fewer instructions, and in some cases good register allocation leads to no | 
|  | increase in instruction count.  Note that this still gives an attacker some | 
|  | limited amount of control over some text section values, unless we randomize the | 
|  | constant pool layout. | 
|  |  | 
|  | Static data reordering | 
|  | ~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | This transformation limits the attacker's ability to control bits in global data | 
|  | address references.  It simply permutes the order in memory of global variables | 
|  | and internal constant pool entries.  For the constant pool, we only permute | 
|  | within a type (i.e., emit a randomized list of ints, followed by a randomized | 
|  | list of floats, etc.) to maintain good packing in the face of alignment | 
|  | constraints. | 
|  |  | 
|  | As might be expected, this has no impact on code quality, translation time, or | 
|  | ROP gadget persistence (though as above, it limits opportunities for | 
|  | intra-instruction ROP gadgets with a broken validator). | 
|  |  | 
|  | Basic block reordering | 
|  | ~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Here, we randomize the order of basic blocks within a function, with the | 
|  | constraint that we still want to maintain a topological order as much as | 
|  | possible, to avoid making the code too branchy. | 
|  |  | 
|  | This has no impact on code quality, and about 1% impact on translation time, due | 
|  | to a separate pass to recompute layout.  It ends up having a huge effect on ROP | 
|  | gadget persistence, tied for best with nop insertion, reducing ROP gadget | 
|  | persistence to less than 5%. | 
|  |  | 
|  | Function reordering | 
|  | ~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Here, we permute the order that functions are emitted, primarily to shift ROP | 
|  | gadgets around to less predictable locations.  It may also change call address | 
|  | offsets in case the attacker was trying to control that offset in the code. | 
|  |  | 
|  | To control latency and memory footprint, we don't arbitrarily permute functions. | 
|  | Instead, for some relatively small value of N, we queue up N assembler buffers, | 
|  | and then emit the N functions in random order, and repeat until all functions | 
|  | are emitted. | 
|  |  | 
|  | Function reordering has no impact on translation time or code quality. | 
|  | Measurements indicate that it reduces ROP gadget persistence to about 15%. | 
|  |  | 
|  | Nop insertion | 
|  | ~~~~~~~~~~~~~ | 
|  |  | 
|  | This diversification randomly adds a nop instruction after each regular | 
|  | instruction, with some probability.  Nop instructions of different lengths may | 
|  | be selected.  Nop instructions are never added inside a bundle_lock region. | 
|  | Note that when sandboxing is enabled, nop instructions are already being added | 
|  | for bundle alignment, so the diversification nop instructions may simply be | 
|  | taking the place of alignment nop instructions, though distributed differently | 
|  | through the bundle. | 
|  |  | 
|  | In Subzero's currently implementation, nop insertion adds 3-5% to the | 
|  | translation time, but this is probably because it is implemented as a separate | 
|  | pass that adds actual nop instructions to the IR.  The overhead would probably | 
|  | be a lot less if it were integrated into the assembler pass.  The code quality | 
|  | is also reduced by 3-5%, making nop insertion the most expensive of the | 
|  | diversification techniques. | 
|  |  | 
|  | Nop insertion is very effective in reducing ROP gadget persistence, at the same | 
|  | level as basic block randomization (less than 5%).  But given nop insertion's | 
|  | impact on translation time and code quality, one would most likely prefer to use | 
|  | basic block randomization instead (though the combined effects of the different | 
|  | diversification techniques have not yet been studied). | 
|  |  | 
|  | Register allocation randomization | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | In this diversification, the register allocator tries to make different but | 
|  | mostly functionally equivalent choices, while maintaining stable code quality. | 
|  |  | 
|  | A naive approach would be the following.  Whenever the allocator has more than | 
|  | one choice for assigning a register, choose randomly among those options.  And | 
|  | whenever there are no registers available and there is a tie for the | 
|  | lowest-weight variable, randomly select one of the lowest-weight variables to | 
|  | evict.  Because of the one-pass nature of the linear-scan algorithm, this | 
|  | randomization strategy can have a large impact on which variables are ultimately | 
|  | assigned registers, with a corresponding large impact on code quality. | 
|  |  | 
|  | Instead, we choose an approach that tries to keep code quality stable regardless | 
|  | of the random seed.  We partition the set of physical registers into equivalence | 
|  | classes.  If a register is pre-colored in the function (i.e., referenced | 
|  | explicitly by name), it forms its own equivalence class.  The remaining | 
|  | registers are partitioned according to their combination of attributes such as | 
|  | integer versus floating-point, 8-bit versus 32-bit, caller-save versus | 
|  | callee-saved, etc.  Each equivalence class is randomly permuted, and the | 
|  | complete permutation is applied to the final register assignments. | 
|  |  | 
|  | Register randomization reduces ROP gadget persistence to about 10% on average, | 
|  | though there tends to be fairly high variance across functions and applications. | 
|  | This probably has to do with the set of restrictions in the x86-32 instruction | 
|  | set and ABI, such as few general-purpose registers, ``%eax`` used for return | 
|  | values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As | 
|  | intended, register randomization has no impact on code quality, and a slight | 
|  | (0.5%) impact on translation time due to an extra scan over the variables to | 
|  | identify pre-colored registers. | 
|  |  | 
|  | Fuzzing | 
|  | ^^^^^^^ | 
|  |  | 
|  | We have started fuzz-testing the ``.pexe`` files input to Subzero, using a | 
|  | combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer | 
|  | <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to | 
|  | find and fix cases where Subzero crashes or otherwise ungracefully fails on | 
|  | unexpected inputs, and to do so automatically over a large range of unexpected | 
|  | inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility | 
|  | of an attacker exploiting these bugs. | 
|  |  | 
|  | Most of the problems found so far are ones most appropriately handled in the | 
|  | parser.  However, there have been a couple that have identified problems in the | 
|  | lowering, or otherwise inappropriately triggered assertion failures and fatal | 
|  | errors.  We continue to dig into this area. | 
|  |  | 
|  | Future security work | 
|  | ^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Subzero is well-positioned to explore other future security enhancements, e.g.: | 
|  |  | 
|  | - Tightening the Native Client sandbox.  ABI changes, such as the previous work | 
|  | on `hiding the sandbox base address | 
|  | <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_ | 
|  | in x86-64, are easy to experiment with in Subzero. | 
|  |  | 
|  | - Making the executable code section read-only.  This would prevent a PNaCl | 
|  | application from inspecting its own binary and trying to find ROP gadgets even | 
|  | after code diversification has been performed.  It may still be susceptible to | 
|  | `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, | 
|  | security is still overall improved. | 
|  |  | 
|  | - Instruction selection diversification.  It may be possible to lower a given | 
|  | instruction in several largely equivalent ways, which gives more opportunities | 
|  | for code randomization. | 
|  |  | 
|  | Chrome integration | 
|  | ------------------ | 
|  |  | 
|  | Currently Subzero is available in Chrome for the x86-32 architecture, but under | 
|  | a flag.  When the flag is enabled, Subzero is used when the `manifest file | 
|  | <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_ | 
|  | linking to the ``.pexe`` file specifies the ``O0`` optimization level. | 
|  |  | 
|  | The next step is to remove the flag, i.e. invoke Subzero as the only translator | 
|  | for ``O0``-specified manifest files. | 
|  |  | 
|  | Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which | 
|  | case Subzero could be used for all PNaCl translation. | 
|  |  | 
|  | Command line options | 
|  | -------------------- | 
|  |  | 
|  | Subzero has a number of command-line options for debugging and diagnostics. | 
|  | Among the more interesting are the following. | 
|  |  | 
|  | - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other | 
|  | diagnostic output, with various levels of detail after each pass.  Instruction | 
|  | numbers can be printed or suppressed.  Deleted instructions can be printed or | 
|  | suppressed (they are retained in the instruction list, as discussed earlier, | 
|  | because they can help explain how lower-level instructions originated). | 
|  | Liveness information can be printed when available.  Details of register | 
|  | allocation can be printed as register allocator decisions are made.  And more. | 
|  |  | 
|  | - Running Subzero with any level of verbosity produces an enormous amount of | 
|  | output.  When debugging a single function, verbose output can be suppressed | 
|  | except for a particular function.  The ``-verbose-focus`` flag suppresses | 
|  | verbose output except for the specified function. | 
|  |  | 
|  | - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing | 
|  | at exit.  Timing markers can be placed in the Subzero source code to demarcate | 
|  | logical operations or passes of interest.  Basic timing information plus | 
|  | call-stack type timing information is printed at the end. | 
|  |  | 
|  | - Along with ``-timing``, the user can instead get a report on the overall | 
|  | translation time for each function, to help focus on timing outliers.  Also, | 
|  | ``-timing-focus`` limits the ``-timing`` reporting to a single function, | 
|  | instead of aggregating pass timing across all functions. | 
|  |  | 
|  | - The ``-szstats`` option reports various statistics on each function, such as | 
|  | stack frame size, static instruction count, etc.  It may be helpful to track | 
|  | these stats over time as Subzero is improved, as an approximate measure of | 
|  | code quality. | 
|  |  | 
|  | - The flag ``-asm-verbose``, in conjunction with emitting textual assembly | 
|  | output, annotate the assembly output with register-focused liveness | 
|  | information.  In particular, each basic block is annotated with which | 
|  | registers are live-in and live-out, and each instruction is annotated with | 
|  | which registers' and stack locations' live ranges end at that instruction. | 
|  | This is really useful when studying the generated code to find opportunities | 
|  | for code quality improvements. | 
|  |  | 
|  | Testing and debugging | 
|  | --------------------- | 
|  |  | 
|  | LLVM lit tests | 
|  | ^^^^^^^^^^^^^^ | 
|  |  | 
|  | For basic testing, Subzero uses LLVM's `lit | 
|  | <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We | 
|  | have a suite of hundreds of small functions where we test for particular | 
|  | assembly code patterns across different target architectures. | 
|  |  | 
|  | Cross tests | 
|  | ^^^^^^^^^^^ | 
|  |  | 
|  | Unfortunately, the lit tests don't do a great job of precisely testing the | 
|  | correctness of the output.  Much better are the cross tests, which are execution | 
|  | tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide | 
|  | variety of interesting inputs.  Each cross test consists of a set of C, C++, | 
|  | and/or low-level bitcode files.  The C and C++ source files are compiled down to | 
|  | bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero. | 
|  | Subzero mangles global symbol names with a special prefix to avoid duplicate | 
|  | symbol errors.  A driver program invokes both versions on a large set of | 
|  | interesting inputs, and reports when the Subzero and ``pnacl-llc`` results | 
|  | differ.  Cross tests turn out to be an excellent way of testing the basic | 
|  | lowering patterns, but they are less useful for testing more global things like | 
|  | liveness analysis and register allocation. | 
|  |  | 
|  | Bisection debugging | 
|  | ^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Sometimes with a new application, Subzero will end up producing incorrect code | 
|  | that either crashes at runtime or otherwise produces the wrong results.  When | 
|  | this happens, we need to narrow it down to a single function (or small set of | 
|  | functions) that yield incorrect behavior.  For this, we have a bisection | 
|  | debugging framework.  Here, we initially translate the entire application once | 
|  | with Subzero and once with ``pnacl-llc``.  We then use ``objdump`` to | 
|  | selectively weaken symbols based on a whitelist or blacklist provided on the | 
|  | command line.  The two object files can then be linked together without link | 
|  | errors, with the desired version of each method "winning".  Then the binary is | 
|  | tested, and bisection proceeds based on whether the binary produces correct | 
|  | output. | 
|  |  | 
|  | When the bisection completes, we are left with a minimal set of | 
|  | Subzero-translated functions that cause the failure.  Usually it is a single | 
|  | function, though sometimes it might require a combination of several functions | 
|  | to cause a failure; this may be due to an incorrect call ABI, for example. | 
|  | However, Murphy's Law implies that the single failing function is enormous and | 
|  | impractical to debug.  In that case, we can restart the bisection, explicitly | 
|  | blacklisting the enormous function, and try to find another candidate to debug. | 
|  | (Future work is to automate this to find all minimal sets of functions, so that | 
|  | debugging can focus on the simplest example.) | 
|  |  | 
|  | Fuzz testing | 
|  | ^^^^^^^^^^^^ | 
|  |  | 
|  | As described above, we try to find internal Subzero bugs using fuzz testing | 
|  | techniques. | 
|  |  | 
|  | Sanitizers | 
|  | ^^^^^^^^^^ | 
|  |  | 
|  | Subzero can be built with `AddressSanitizer | 
|  | <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer | 
|  | <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is | 
|  | done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far, | 
|  | multithreading has been simple enough that TSan hasn't found any bugs, but ASan | 
|  | has found at least one memory leak which was subsequently fixed. | 
|  | `UndefinedBehaviorSanitizer | 
|  | <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_ | 
|  | (UBSan) support is in progress.  `Control flow integrity sanitization | 
|  | <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under | 
|  | consideration. | 
|  |  | 
|  | Current status | 
|  | ============== | 
|  |  | 
|  | Target architectures | 
|  | -------------------- | 
|  |  | 
|  | Subzero is currently more or less complete for the x86-32 target.  It has been | 
|  | refactored and extended to handle x86-64 as well, and that is mostly complete at | 
|  | this point. | 
|  |  | 
|  | ARM32 work is in progress.  It currently lacks the testing level of x86, at | 
|  | least in part because Subzero's register allocator needs modifications to handle | 
|  | ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit | 
|  | register is actually a gang of two consecutive and aligned 32-bit registers, and | 
|  | a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers. | 
|  | ARM64 work has not started; when it does, it will be native-only since the | 
|  | Native Client sandbox model, validator, and other tools have never been defined. | 
|  |  | 
|  | An external contributor is adding MIPS support, in most part by following the | 
|  | ARM work. | 
|  |  | 
|  | Translator performance | 
|  | ---------------------- | 
|  |  | 
|  | Single-threaded translation speed is currently about 5× the ``pnacl-llc`` | 
|  | translation speed.  For a large ``.pexe`` file, the time breaks down as: | 
|  |  | 
|  | - 11% for parsing and initial IR building | 
|  |  | 
|  | - 4% for emitting to /dev/null | 
|  |  | 
|  | - 27% for liveness analysis (two liveness passes plus live range construction) | 
|  |  | 
|  | - 15% for linear-scan register allocation | 
|  |  | 
|  | - 9% for basic lowering | 
|  |  | 
|  | - 10% for advanced phi lowering | 
|  |  | 
|  | - ~11% for other minor analysis | 
|  |  | 
|  | - ~10% measurement overhead to acquire these numbers | 
|  |  | 
|  | Some improvements could undoubtedly be made, but it will be hard to increase the | 
|  | speed to 10× of ``pnacl-llc`` while keeping acceptable code quality.  With | 
|  | ``-Om1`` (lack of) optimization, we do actually achieve roughly 10× | 
|  | ``pnacl-llc`` translation speed, but code quality drops by a factor of 3. | 
|  |  | 
|  | Code quality | 
|  | ------------ | 
|  |  | 
|  | Measured across 16 components of spec2k, Subzero's code quality is uniformly | 
|  | better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly | 
|  | between ``pnacl-llc`` ``-O0`` and ``-O2``. | 
|  |  | 
|  | Translator size | 
|  | --------------- | 
|  |  | 
|  | When built in MINIMAL mode, the x86-64 native translator size for the x86-32 | 
|  | target is about 700 KB, not including the size of functions referenced in | 
|  | dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1 | 
|  | MB, and it is statically linked and also includes nop padding for bundling as | 
|  | well as indirect branch masking. | 
|  |  | 
|  | Translator memory footprint | 
|  | --------------------------- | 
|  |  | 
|  | It's hard to draw firm conclusions about memory footprint, since the footprint | 
|  | is at least proportional to the input function size, and there is no real limit | 
|  | on the size of functions in the ``.pexe`` file. | 
|  |  | 
|  | That said, we looked at the memory footprint over time as Subzero translated | 
|  | ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our | 
|  | disposal.  One of LLVM's libraries that Subzero uses can report the current | 
|  | malloc heap usage.  With single-threaded translation, Subzero tends to hover | 
|  | around 15 MB of memory usage.  There are a couple of monstrous functions where | 
|  | Subzero grows to around 100 MB, but then it drops back down after those | 
|  | functions finish translating.  In contrast, ``pnacl-llc`` grows larger and | 
|  | larger throughout translation, reaching several hundred MB by the time it | 
|  | completes. | 
|  |  | 
|  | It's a bit more interesting when we enable multithreaded translation.  When | 
|  | there are N translation threads, Subzero implements a policy that limits the | 
|  | size of the translation queue to N entries -- if it is "full" when the parser | 
|  | tries to add a new CFG, the parser blocks until one of the translation threads | 
|  | removes a CFG.  This means the number of in-memory CFGs can (and generally does) | 
|  | reach 2*N+1, and so the memory footprint rises in proportion to the number of | 
|  | threads.  Adding to the pressure is the observation that the monstrous functions | 
|  | also take proportionally longer time to translate, so there's a good chance many | 
|  | of the monstrous functions will be active at the same time with multithreaded | 
|  | translation.  As a result, for N=32, Subzero's memory footprint peaks at about | 
|  | 260 MB, but drops back down as the large functions finish translating. | 
|  |  | 
|  | If this peak memory size becomes a problem, it might be possible for the parser | 
|  | to resequence the functions to try to spread out the larger functions, or to | 
|  | throttle the translation queue to prevent too many in-flight large functions. | 
|  | It may also be possible to throttle based on memory pressure signaling from | 
|  | Chrome. | 
|  |  | 
|  | Translator scalability | 
|  | ---------------------- | 
|  |  | 
|  | Currently scalability is "not very good".  Multiple translation threads lead to | 
|  | faster translation, but not to the degree desired.  We haven't dug in to | 
|  | investigate yet. | 
|  |  | 
|  | There are a few areas to investigate.  First, there may be contention on the | 
|  | constant pool, which all threads access, and which requires locked access even | 
|  | for reading.  This could be mitigated by keeping a CFG-local cache of the most | 
|  | common constants. | 
|  |  | 
|  | Second, there may be contention on memory allocation.  While almost all CFG | 
|  | objects are allocated from the CFG-local allocator, some passes use temporary | 
|  | STL containers that use the default allocator, which may require global locking. | 
|  | This could be mitigated by switching these to the CFG-local allocator. | 
|  |  | 
|  | Third, multithreading may make the default allocator strategy more expensive. | 
|  | In a single-threaded environment, a pass will allocate its containers, run the | 
|  | pass, and deallocate the containers.  This results in stack-like allocation | 
|  | behavior and makes the heap free list easier to manage, with less heap | 
|  | fragmentation.  But when multithreading is added, the allocations and | 
|  | deallocations become much less stack-like, making allocation and deallocation | 
|  | operations individually more expensive.  Again, this could be mitigated by | 
|  | switching these to the CFG-local allocator. |