third_party/subzero/docs/DESIGN.rst - SwiftShader - Git at Google

 Design of the Subzero fast code generator
 =========================================

 Introduction
 ------------

 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
 compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
 PNaCl toolchain to compile their application to architecture-neutral PNaCl
 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
 possible.  The ``.pexe`` file is downloaded to the user's browser where the
 PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
 `sandboxed
 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
 native code.  The translator uses architecture-specific optimizations as much as
 practical to generate good native code.

 The native code can be cached by the browser to avoid repeating translation on
 future page loads.  However, first-time user experience is hampered by long
 translation times.  The LLVM-based PNaCl translator is pretty slow, even when
 using ``-O0`` to minimize optimizations, so delays are especially noticeable on
 slow browser platforms such as ARM-based Chromebooks.

 Translator slowness can be mitigated or hidden in a number of ways.

 - Parallel translation.  However, slow machines where this matters most, e.g.
   ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
   and are likely to less memory available for multiple translation threads to
   use.

 - Streaming translation, i.e. start translating as soon as the download starts.
   This doesn't help much when translation speed is 10× slower than download
   speed, or the ``.pexe`` file is already cached while the translated binary was
   flushed from the cache.

 - Arrange the web page such that translation is done in parallel with
   downloading large assets.

 - Arrange the web page to distract the user with `cat videos
   <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
   progress.

 Or, improve translator performance to something more reasonable.

 This document describes Subzero's attempt to improve translation speed by an
 order of magnitude while rivaling LLVM's code quality.  Subzero does this
 through minimal IR layering, lean data structures and passes, and a careful
 selection of fast optimization passes.  It has two optimization recipes: full
 optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
 following (described in more detail below):

 +----------------------------------------+-----------------------------+
 | O2 recipe                              | Om1 recipe                  |
 +========================================+=============================+
 | Parse .pexe file                       | Parse .pexe file            |
 +----------------------------------------+-----------------------------+
 | Loop nest analysis                     |                             |
 +----------------------------------------+-----------------------------+
 | Local common subexpression elimination |                             |
 +----------------------------------------+-----------------------------+
 | Address mode inference                 |                             |
 +----------------------------------------+-----------------------------+
 | Read-modify-write (RMW) transform      |                             |
 +----------------------------------------+-----------------------------+
 | Basic liveness analysis                |                             |
 +----------------------------------------+-----------------------------+
 | Load optimization                      |                             |
 +----------------------------------------+-----------------------------+
 |                                        | Phi lowering (simple)       |
 +----------------------------------------+-----------------------------+
 | Target lowering                        | Target lowering             |
 +----------------------------------------+-----------------------------+
 | Full liveness analysis                 |                             |
 +----------------------------------------+-----------------------------+
 | Register allocation                    | Minimal register allocation |
 +----------------------------------------+-----------------------------+
 | Phi lowering (advanced)                |                             |
 +----------------------------------------+-----------------------------+
 | Post-phi register allocation           |                             |
 +----------------------------------------+-----------------------------+
 | Branch optimization                    |                             |
 +----------------------------------------+-----------------------------+
 | Code emission                          | Code emission               |
 +----------------------------------------+-----------------------------+

 Goals
 =====

 Translation speed
 -----------------

 We'd like to be able to translate a ``.pexe`` file as fast as download speed.
 Any faster is in a sense wasted effort.  Download speed varies greatly, but
 we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
 slow machine.  We observe a 3× single-thread performance difference between A15
 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
 ``.pexe`` file could be compressed to 50% on the web server using gzip transport
 compression, so we set the translation speed goal to 6 MB/sec on the high-end
 Xeon workstation.

 Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at
 ⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

 In other words, Subzero's goal is to improve over LLVM's translation speed by
 10×.

 Code quality
 ------------

 Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
 code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
 average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

 It's important to note that the quality of Subzero-generated code depends on
 target-neutral optimizations and simplifications being run beforehand in the
 developer environment.  The ``.pexe`` file reflects these optimizations.  For
 example, Subzero assumes that the basic blocks are ordered topologically where
 possible (which makes liveness analysis converge fastest), and Subzero does not
 do any function inlining because it should already have been done.

 Translator size
 ---------------

 The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
 We think 1 MB is a more reasonable size -- especially for such a component that
 is distributed to a billion Chrome users.  Thus we target a 10× reduction in
 binary size.

 For development, Subzero can be built for all target architectures, and all
 debugging and diagnostic options enabled.  For a smaller translator, we restrict
 to a single target architecture, and define a ``MINIMAL`` build where
 unnecessary features are compiled out.

 Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
 include directories, which have little impact on translator size.  It also uses
 some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
 with little size impact.  In non-``MINIMAL`` builds, the translator size is much
 larger due to including code for parsing text-format bitcode files and forming
 LLVM IR.

 Memory footprint
 ----------------

 The current LLVM-based translator suffers from an issue in which some
 function-specific data has to be retained in memory until all translation
 completes, and therefore the memory footprint grows without bound.  Large
 ``.pexe`` files can lead to the translator process holding hundreds of MB of
 memory by the end.  The translator runs in a separate process, so this memory
 growth doesn't *directly* affect other processes, but it does dirty the physical
 memory and contributes to a perception of bloat and sometimes a reality of
 out-of-memory tab killing, especially noticeable on weaker systems.

 Subzero should maintain a stable memory footprint throughout translation.  It's
 not really practical to set a specific limit, because there is not really a
 practical limit on a single function's size, but the footprint should be
 "reasonable" and be proportional to the largest input function size, not the
 total ``.pexe`` file size.  Simply put, Subzero should not have memory leaks or
 inexorable memory growth.  (We use ASAN builds to test for leaks.)

 Multithreaded translation
 -------------------------

 It should be practical to translate different functions concurrently and see
 good scalability.  Some locking may be needed, such as accessing output buffers
 or constant pools, but that should be fairly minimal.  In contrast, LLVM was
 only designed for module-level parallelism, and as such, the PNaCl translator
 internally splits a ``.pexe`` file into several modules for concurrent
 translation.  All output needs to be deterministic regardless of the level of
 multithreading, i.e. functions and data should always be output in the same
 order.

 Target architectures
 --------------------

 Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
 targets include ARM64 and MIPS64, though these targets lack NaCl support
 including a sandbox model or a validator.

 The first implementation is for x86-32, because it was expected to be
 particularly challenging, and thus more likely to draw out any design problems
 early:

 - There are a number of special cases, asymmetries, and warts in the x86
   instruction set.

 - Complex addressing modes may be leveraged for better code quality.

 - 64-bit integer operations have to be lowered into longer sequences of 32-bit
   operations.

 - Paucity of physical registers may reveal code quality issues early in the
   design.

 Detailed design
 ===============

 Intermediate representation - ICE
 ---------------------------------

 Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
 IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
 representation of global variables and initializers, and a set of functions.
 Each function contains a list of basic blocks, and each basic block constains a
 list of instructions.  Instructions that operate on stack and register variables
 do so using static single assignment (SSA) form.

 The ``.pexe`` file is translated one function at a time (or in parallel by
 multiple translation threads).  The recipe for optimization passes depends on
 the specific target and optimization level, and is described in detail below.
 Global variables (types and initializers) are simply and directly translated to
 object code, without any meaningful attempts at optimization.

 A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
 Its key contents include:

 - A list of ``CfgNode`` pointers, generally held in topological order.

 - A list of ``Variable`` pointers corresponding to local variables used in the
   function plus compiler-generated temporaries.

 A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
 include:

 - A linear list of instructions, in the same style as LLVM.  The last
   instruction of the list is always a terminator instruction: branch, switch,
   return, unreachable.

 - A list of Phi instructions, also in the same style as LLVM.  They are held as
   a linear list for convenience, though per Phi semantics, they are executed "in
   parallel" without dependencies on each other.

 - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
   another list for outgoing edges.

 - The node's unique, 0-based index into the CFG's node list.

 An instruction is represented by the ``Ice::Inst`` class.  Its key contents
 include:

 - A list of source operands.

 - Its destination variable, if the instruction produces a result in an
   ``Ice::Variable``.

 - A bitvector indicating which variables' live ranges this instruction ends.
   This is computed during liveness analysis.

 Instructions kinds are divided into high-level ICE instructions and low-level
 ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
 instruction kinds.  Each target architecture implementation extends the
 instruction space with its own set of low-level instructions.  Generally,
 low-level instructions correspond to individual machine instructions.  The
 high-level ICE instruction space includes a few additional instruction kinds
 that are not part of LLVM but are generally useful (e.g., an Assignment
 instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
 instructions for sandboxing).

 Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
 ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
 the following:

 - Alloca: allocate data on the stack

 - Arithmetic: binary operations of the form ``A = B op C``

 - Br: conditional or unconditional branch

 - Call: function call

 - Cast: unary type-conversion operations

 - ExtractElement: extract a scalar element from a vector-type value

 - Fcmp: floating-point comparison

 - Icmp: integer comparison

 - IntrinsicCall: call a known intrinsic

 - InsertElement: insert a scalar element into a vector-type value

 - Load: load a value from memory

 - Phi: implement the SSA phi node

 - Ret: return from the function

 - Select: essentially the C language operation of the form ``X = C ? Y : Z``

 - Store: store a value into memory

 - Switch: generalized branch to multiple possible locations

 - Unreachable: indicate that this portion of the code is unreachable

 The additional high-level ICE instructions are the following:

 - Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
   instructions to non-SSA assignments, before lowering to machine code.

 - BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
   common across all targets and so they are elevated to the high-level
   instruction set.

 - FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
   liveness analysis, elevated to the high-level because they are used by all
   targets.  They are described in more detail at the end of this section.

 - JumpTable: this represents the result of switch optimization analysis, where
   some switch instructions may use jump tables instead of cascading
   compare/branches.

 An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
 operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
 include scalar integer constants, scalar floating point constants, Undef (an
 unspecified constant of a particular scalar or vector type), and symbol
 constants (essentially addresses of globals).  Note that the PNaCl ABI does not
 include vector-type constants besides Undef, and as such, Subzero (so far) has
 no reason to represent vector-type constants internally.  A variable represents
 a value allocated on the stack (though not including alloca-derived storage).
 Among other things, a variable holds its unique, 0-based index into the CFG's
 variable list.

 Each target can extend the ``Constant`` and ``Variable`` classes for its own
 needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
 x86 ``MemOperand`` that encodes a base register, an index register, an index
 register shift amount, and a constant offset.

 Register allocation and liveness analysis are restricted to Variable operands.
 Because of the importance of register allocation to code quality, and the
 translation-time cost of liveness analysis, Variable operands get some special
 treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
 across all the Variables of an instruction.  An instruction holds a list of
 operands, but an operand may contain 0, 1, or more Variables.  As such, the
 ``Operand`` class specially holds a list of Variables contained within, for
 quick access.

 A Subzero transformation pass may work by deleting an existing instruction and
 replacing it with zero or more new instructions.  Instead of actually deleting
 the existing instruction, we generally mark it as deleted and insert the new
 instructions right after the deleted instruction.  When printing the IR for
 debugging, this is a big help because it makes it much more clear how the
 non-deleted instructions came about.

 Subzero has a few special instructions to help with liveness analysis
 consistency.

 - The FakeDef instruction gives a fake definition of some variable.  For
   example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
   but an ICE instruction can represent only one destination variable.  This is
   similar for multiply instructions, and for function calls that return a 64-bit
   integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
   trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

 - The FakeUse instruction registers a use of a variable, typically to prevent an
   earlier assignment to that variable from being dead-code eliminated.  For
   example, lowering an operation like ``x=cc?y:z`` may be done using x86's
   conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
   FakeUse of ``x`` between the two instructions, the liveness analysis pass may
   dead-code eliminate the first instruction.

 - The FakeKill instruction is added after a call instruction, and is a quick way
   of indicating that caller-save registers are invalidated.

 Pexe parsing
 ------------

 Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
 parses the ``.pexe`` file function by function, ultimately constructing an ICE
 CFG for each function.  After a function is parsed, its CFG is handed off to the
 translation phase.  The bitcode parser also parses global initializer data and
 hands it off to be translated to data sections in the object file.

 Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
 be used to parse a module into LLVM IR (though very slowly relative to Subzero
 native parsing).  Then we iterate across the LLVM IR and construct high-level
 ICE, handing off each CFG to the translation phase.

 Overview of lowering
 --------------------

 In general, translation goes like this:

 - Parse the next function from the ``.pexe`` file and construct a CFG consisting
   of high-level ICE.

 - Do analysis passes and transformation passes on the high-level ICE, as
   desired.

 - Lower each high-level ICE instruction into a sequence of zero or more
   low-level ICE instructions.  Each high-level instruction is generally lowered
   independently, though the target lowering is allowed to look ahead in the
   CfgNode's instruction list if desired.

 - Do more analysis and transformation passes on the low-level ICE, as desired.

 - Assemble the low-level CFG into an ELF object file (alternatively, a textual
   assembly file that is later assembled by some external tool).

 - Repeat for all functions, and also produce object code for data such as global
   initializers and internal constant pools.

 Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
 the intention is to apply all available optimizations to get the best code
 quality (though the initial code quality goal is measured against LLVM's ``O0``
 code quality).  For ``Om1``, the intention is to apply as few optimizations as
 possible and produce code as quickly as possible, accepting poor code quality.
 ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
 "sub-zero".

 High-level debuggability of generated code is so far not a design requirement.
 Subzero doesn't really do transformations that would obfuscate debugging; the
 main thing might be that register allocation (including stack slot coalescing
 for stack-allocated variables whose live ranges don't overlap) may render a
 variable's value unobtainable after its live range ends.  This would not be an
 issue for ``Om1`` since it doesn't register-allocate program-level variables,
 nor does it coalesce stack slots.  That said, fully supporting debuggability
 would require a few additions:

 - DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
   propagates global symbol names, local variable names, and function-internal
   label names that are present in the ``.pexe`` file.  This would allow a
   debugger to map addresses back to symbols in the ``.pexe`` file.

 - To map ``.pexe`` file symbols back to meaningful source-level symbol names,
   file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
   metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
   `instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

 - The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
   the toolchain would need to be modified to preserve it.

 Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
 produces code with one third the code quality.  ``Om1`` is good for testing and
 debugging -- during translation, it tends to expose errors in the basic lowering
 that might otherwise have been hidden by the register allocator or other
 optimization passes.  It also helps determine whether a code correctness problem
 is a fundamental problem in the basic lowering, or an error in another
 optimization pass.

 The implementation of target lowering also controls the recipe of passes used
 for ``Om1`` and ``O2`` translation.  For example, address mode inference may
 only be relevant for x86.

 Lowering strategy
 -----------------

 The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
 each high-level instruction down to a sequence of low-level target-specific
 instructions, in a largely context-free setting.  That is, each high-level
 instruction conceptually has a simple template expansion into low-level
 instructions, and lowering can in theory be done in any order.  This may sound
 like a small effort, but quite a large number of templates may be needed because
 of the number of PNaCl types and instruction variants.  Furthermore, there may
 be optimized templates, e.g. to take advantage of operator commutativity (for
 example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``).  This is
 similar to other template-based approaches in fast code generation or
 interpretation, though some decisions are deferred until after some global
 analysis passes, mostly related to register allocation, stack slot assignment,
 and specific choice of instruction variant and addressing mode.

 The key idea for a lowering template is to produce valid low-level instructions
 that are guaranteed to meet address mode and other structural requirements of
 the instruction set.  For example, on x86, the source operand of an integer
 store instruction must be an immediate or a physical register; a shift
 instruction's shift amount must be an immediate or in register ``%cl``; a
 function's integer return value is in ``%eax``; most x86 instructions are
 two-operand, in contrast to corresponding three-operand high-level instructions;
 etc.

 Because target lowering runs before register allocation, there is no way to know
 whether a given ``Ice::Variable`` operand lives on the stack or in a physical
 register.  When the low-level instruction calls for a physical register operand,
 the target lowering can create an infinite-weight Variable.  This tells the
 register allocator to assign infinite weight when making decisions, effectively
 guaranteeing some physical register.  Variables can also be pre-colored to a
 specific physical register (``cl`` in the shift example above), which also gives
 infinite weight.

 To illustrate, consider a high-level arithmetic instruction on 32-bit integer
 operands::

     A = B + C

 X86 target lowering might produce the following::

     T.inf = B  // mov instruction
     T.inf += C // add instruction
     A = T.inf  // mov instruction

 Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
 physical register, the three lowered instructions are all encodable regardless
 of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
 whether ``A`` is a physical register or in memory.

 In this example, ``A`` must be a Variable and one may be tempted to simplify the
 lowering sequence by setting ``A`` as infinite-weight and using::

         A = B  // mov instruction
         A += C // add instruction

 This has two problems.  First, if the original instruction was actually ``A =
 B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
 register applies throughout ``A``'s entire live range.  This is probably not
 what is intended, and may ultimately lead to a failure to allocate a register
 for an infinite-weight variable.

 This style of lowering leads to many temporaries being generated, so in ``O2``
 mode, we rely on the register allocator to clean things up.  For example, in the
 example above, if ``B`` ends up getting a physical register and its live range
 ends at this instruction, the register allocator is likely to reuse that
 register for ``T.inf``.  This leads to ``T.inf=B`` being a redundant register
 copy, which is removed as an emission-time peephole optimization.

 O2 lowering
 -----------

 Currently, the ``O2`` lowering recipe is the following:

 - Loop nest analysis

 - Local common subexpression elimination

 - Address mode inference

 - Read-modify-write (RMW) transformation

 - Basic liveness analysis

 - Load optimization

 - Target lowering

 - Full liveness analysis

 - Register allocation

 - Phi instruction lowering (advanced)

 - Post-phi lowering register allocation

 - Branch optimization

 These passes are described in more detail below.

 Om1 lowering
 ------------

 Currently, the ``Om1`` lowering recipe is the following:

 - Phi instruction lowering (simple)

 - Target lowering

 - Register allocation (infinite-weight and pre-colored only)

 Optimization passes
 -------------------

 Liveness analysis
 ^^^^^^^^^^^^^^^^^

 Liveness analysis is a standard dataflow optimization, implemented as follows.
 For each node (basic block), its live-out set is computed as the union of the
 live-in sets of its successor nodes.  Then the node's instructions are processed
 in reverse order, updating the live set, until the beginning of the node is
 reached, and the node's live-in set is recorded.  If this iteration has changed
 the node's live-in set, the node's predecessors are marked for reprocessing.
 This continues until no more nodes need reprocessing.  If nodes are processed in
 reverse topological order, the number of iterations over the CFG is generally
 equal to the maximum loop nest depth.

 To implement this, each node records its live-in and live-out sets, initialized
 to the empty set.  Each instruction records which of its Variables' live ranges
 end in that instruction, initialized to the empty set.  A side effect of
 liveness analysis is dead instruction elimination.  Each instruction can be
 marked as tentatively dead, and after the algorithm converges, the tentatively
 dead instructions are permanently deleted.

 Optionally, after this liveness analysis completes, we can do live range
 construction, in which we calculate the live range of each variable in terms of
 instruction numbers.  A live range is represented as a union of segments, where
 the segment endpoints are instruction numbers.  Instruction numbers are required
 to be unique across the CFG, and monotonically increasing within a basic block.
 As a union of segments, live ranges can contain "gaps" and are therefore
 precise.  Because of SSA properties, a variable's live range can start at most
 once in a basic block, and can end at most once in a basic block.  Liveness
 analysis keeps track of which variable/instruction tuples begin live ranges and
 end live ranges, and combined with live-in and live-out sets, we can efficiently
 build up live ranges of all variables across all basic blocks.

 A lot of care is taken to try to make liveness analysis fast and efficient.
 Because of the lowering strategy, the number of variables is generally
 proportional to the number of instructions, leading to an O(N^2) complexity
 algorithm if implemented naively.  To improve things based on sparsity, we note
 that most variables are "local" and referenced in at most one basic block (in
 contrast to the "global" variables with multi-block usage), and therefore cannot
 be live across basic blocks.  Therefore, the live-in and live-out sets,
 typically represented as bit vectors, can be limited to the set of global
 variables, and the intra-block liveness bit vector can be compacted to hold the
 global variables plus the local variables for that block.

 Register allocation
 ^^^^^^^^^^^^^^^^^^^

 Subzero implements a simple linear-scan register allocator, based on the
 allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
 Register Allocation in the Context of SSA Form and Register Constraints
 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
 several nice features:

 - Live ranges are represented as unions of segments, as described above, rather
   than a single start/end tuple.

 - It allows pre-coloring of variables with specific physical registers.

 - It applies equally well to pre-lowered Phi instructions.

 The paper suggests an approach of aggressively coalescing variables across Phi
 instructions (i.e., trying to force Phi source and destination variables to have
 the same register assignment), but we reject that in favor of the more natural
 preference mechanism described below.

 We enhance the algorithm in the paper with the capability of automatic inference
 of register preference, and with the capability of allowing overlapping live
 ranges to safely share the same register in certain circumstances.  If we are
 considering register allocation for variable ``A``, and ``A`` has a single
 defining instruction ``A=B+C``, then the preferred register for ``A``, if
 available, would be the register assigned to ``B`` or ``C``, if any, provided
 that ``B`` or ``C``'s live range does not overlap ``A``'s live range.  In this
 way we infer a good register preference for ``A``.

 We allow overlapping live ranges to get the same register in certain cases.
 Suppose a high-level instruction like::

     A = unary_op(B)

 has been target-lowered like::

     T.inf = B
     A = unary_op(T.inf)

 Further, assume that ``B``'s live range continues beyond this instruction
 sequence, and that ``B`` has already been assigned some register.  Normally, we
 might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
 turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
 different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
 guaranteed to be in a register, so in theory these overlapping live ranges could
 safely have the same register.  Our implementation allows this overlap as long
 as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
 modified within ``T.inf``'s live range.

 Subzero's register allocator can be run in 3 configurations.

 - Normal mode.  All Variables are considered for register allocation.  It
   requires full liveness analysis and live range construction as a prerequisite.
   This is used by ``O2`` lowering.

 - Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
   All other Variables are stack-allocated.  It does not require liveness
   analysis; instead, it quickly scans the instructions and records first
   definitions and last uses of all relevant Variables, using that to construct a
   single-segment live range.  Although this includes most of the Variables, the
   live ranges are mostly simple, short, and rarely overlapping, which the
   register allocator handles efficiently.  This is used by ``Om1`` lowering.

 - Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
   register allocation, and may result in new infinite-weight Variables that need
   registers.  One would like to just run something like minimal mode to assign
   registers to the new Variables while respecting existing register allocation
   decisions.  However, it sometimes happens that there are no free registers.
   In this case, some register needs to be forcibly spilled to the stack and
   temporarily reassigned to the new Variable, and reloaded at the end of the new
   Variable's live range.  The register must be one that has no explicit
   references during the Variable's live range.  Since Subzero currently doesn't
   track def/use chains (though it does record the CfgNode where a Variable is
   defined), we just do a brute-force search across the CfgNode's instruction
   list for the instruction numbers of interest.  This situation happens very
   rarely, so there's little point for now in improving its performance.

 The basic linear-scan algorithm may, as it proceeds, rescind an early register
 allocation decision, leaving that Variable to be stack-allocated.  Some of these
 times, it turns out that the Variable could have been given a different register
 without conflict, but by this time it's too late.  The literature recognizes
 this situation and describes "second-chance bin-packing", which Subzero can do.
 We can rerun the register allocator in a mode that respects existing register
 allocation decisions, and sometimes it finds new non-conflicting opportunities.
 In fact, we can repeatedly run the register allocator until convergence.
 Unfortunately, in the current implementation, these subsequent register
 allocation passes end up being extremely expensive.  This is because of the
 treatment of the "unhandled pre-colored" Variable set, which is normally very
 small but ends up being quite large on subsequent passes.  Its performance can
 probably be made acceptable with a better choice of data structures, but for now
 this second-chance mechanism is disabled.

 Future work is to implement LLVM's `Greedy
 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
 register allocator as a replacement for the basic linear-scan algorithm, given
 LLVM's experience with its improvement in code quality.  (The blog post claims
 that the Greedy allocator also improved maintainability because a lot of hacks
 could be removed, but Subzero is probably not yet to that level of hacks, and is
 less likely to see that particular benefit.)

 Local common subexpression elimination
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The Local CSE implementation goes through each instruction and records a portion
 of each ``Seen`` instruction in a hashset-like container.  That portion consists
 of the entire instruction except for any dest variable. That means ``A = X + Y``
 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows
 us to detect common subexpressions.

 Whenever a repetition is detected, the redundant variables are stored in a
 container mapping the replacee to the replacement. In the case above, it would
 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.

 At any point if a variable that has an entry in the replacement table is
 encountered, it is replaced with the variable it is mapped to. This ensures that
 the redundant variables will not have any uses in the basic block, allowing
 dead code elimination to clean up the redundant instruction.

 With SSA, the information stored is never invalidated. However, non-SSA input is
 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra
 dependency information to ensure proper invalidation on variable assignment.
 This is not rigorously tested because this pass is run at an early stage where
 it is safe to assume variables have a single definition. This is not enabled by
 default because it bumps the compile time overhead from 2% to 6%.

 Loop-invariant code motion
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

 This pass utilizes the loop analysis information to hoist invariant instructions
 to loop pre-headers. A loop must have a single entry node (header) and that node
 must have a single external predecessor for this optimization to work, as it is
 currently implemented.

 The pass works by iterating over all instructions in the loop until the set of
 invariant instructions converges. In each iteration, a non-invariant instruction
 involving only constants or a variable known to be invariant is added to the
 result set. The destination variable of that instruction is added to the set of
 variables known to be invariant (which is initialized with the constant args).

 Improving the loop-analysis infrastructure is likely to have significant impact
 on this optimization. Inserting an extra node to act as the pre-header when the
 header has multiple incoming edges from outside could also be a good idea.
 Expanding the initial invariant variable set to contain all variables that do
 not have definitions inside the loop does not seem to improve anything.

 Short circuit evaluation
 ^^^^^^^^^^^^^^^^^^^^^^^^

 Short circuit evaluation splits nodes and introduces early jumps when the result
 of a logical operation can be determined early and there are no observable side
 effects of skipping the rest of the computation. The instructions considered
 backwards from the end of the basic blocks. When a definition of a variable
 involved in a conditional jump is found, an extra jump can be inserted in that
 location, moving the rest of the instructions in the node to a newly inserted
 node. Consider this example::

   __N :
     a = <something>
     Instruction 1 without side effect
     ... b = <something> ...
     Instruction N without side effect
     t1 = or a b
     br t1 __X __Y

 is transformed to::

   __N :
     a = <something>
     br a __X __N_ext

   __N_ext :
     Instruction 1 without side effect
     ... b = <something> ...
     Instruction N without side effect
     br b __X __Y

 The logic for AND is analogous, the only difference is that the early jump is
 facilitated by a ``false`` value instead of ``true``.

 Global Variable Splitting
 ^^^^^^^^^^^^^^^^^^^^^^^^^

 Global variable splitting (``-split-global-vars``) is run after register
 allocation. It works on the variables that did not manage to get registers (but
 are allowed to) and decomposes their live ranges into the individual segments
 (which span a single node at most). New variables are created (but not yet used)
 with these smaller live ranges and the register allocator is run again. This is
 not inefficient as the old variables that already had registers are now
 considered pre-colored.

 The new variables that get registers replace their parent variables for their
 portion of its (parent's) live range. A copy from the old variable to the new
 is introduced before the first use and the reverse after the last def in the
 live range.

 Basic phi lowering
 ^^^^^^^^^^^^^^^^^^

 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
 implements it).  Consider this example::

     L1:
       ...
       br L3
     L2:
       ...
       br L3
     L3:
       A = phi [B, L1], [C, L2]
       X = phi [Y, L1], [Z, L2]

 For each destination of a phi instruction, we can create a temporary and insert
 the temporary's assignment at the end of the predecessor block::

     L1:
       ...
       A' = B
       X' = Y
       br L3
     L2:
       ...
       A' = C
       X' = Z
       br L3
     L2:
       A = A'
       X = X'

 This transformation is very simple and reliable.  It can be done before target
 lowering and register allocation, and it easily avoids the classic lost-copy and
 related problems.  ``Om1`` lowering uses this strategy.

 However, it has the disadvantage of initializing temporaries even for branches
 not taken, though that could be mitigated by splitting non-critical edges and
 putting assignments in the edge-split nodes.  Another problem is that without
 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
 specific ordering even though phi semantics are that the assignments are
 parallel or unordered.  This sometimes imposes false live range overlaps and
 leads to poorer register allocation.

 Advanced phi lowering
 ^^^^^^^^^^^^^^^^^^^^^

 ``O2`` lowering defers phi lowering until after register allocation to avoid the
 problem of false live range overlaps.  It works as follows.  We split each
 incoming edge and move the (parallel) phi assignments into the split nodes.  We
 linearize each set of assignments by finding a safe, topological ordering of the
 assignments, respecting register assignments as well.  For example::

     A = B
     X = Y

 Normally these assignments could be executed in either order, but if ``B`` and
 ``X`` are assigned the same physical register, we would want to use the above
 ordering.  Dependency cycles are broken by introducing a temporary.  For
 example::

     A = B
     B = A

 Here, a temporary breaks the cycle::

     t = A
     A = B
     B = t

 Finally, we use the existing target lowering to lower the assignments in this
 basic block, and once that is done for all basic blocks, we run the post-phi
 variant of register allocation on the edge-split basic blocks.

 When computing a topological order, we try to first schedule assignments whose
 source has a physical register, and last schedule assignments whose destination
 has a physical register.  This helps reduce register pressure.

 X86 address mode inference
 ^^^^^^^^^^^^^^^^^^^^^^^^^^

 We try to take advantage of the x86 addressing mode that includes a base
 register, an index register, an index register scale amount, and an immediate
 offset.  We do this through simple pattern matching.  Starting with a load or
 store instruction where the address is a variable, we initialize the base
 register to that variable, and look up the instruction where that variable is
 defined.  If that is an add instruction of two variables and the index register
 hasn't been set, we replace the base and index register with those two
 variables.  If instead it is an add instruction of a variable and a constant, we
 replace the base register with the variable and add the constant to the
 immediate offset.

 There are several more patterns that can be matched.  This pattern matching
 continues on the load or store instruction until no more matches are found.
 Because a program typically has few load and store instructions (not to be
 confused with instructions that manipulate stack variables), this address mode
 inference pass is fast.

 X86 read-modify-write inference
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 A reasonably common bitcode pattern is a non-atomic update of a memory
 location::

     x = load addr
     y = add x, 1
     store y, addr

 On x86, with good register allocation, the Subzero passes described above
 generate code with only this quality::

     mov [%ebx], %eax
     add $1, %eax
     mov %eax, [%ebx]

 However, x86 allows for this kind of code::

     add $1, [%ebx]

 which requires fewer instructions, but perhaps more importantly, requires fewer
 physical registers.

 It's also important to note that this transformation only makes sense if the
 store instruction ends ``x``'s live range.

 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
 opportunities via simple pattern matching.  The only problem is that it is run
 before liveness analysis, which is needed to determine whether ``x``'s live
 range ends after the RMW.  Since liveness analysis is one of the most expensive
 passes, it's not attractive to run it an extra time just for RMW analysis.
 Instead, we essentially generate both the RMW and the non-RMW versions, and then
 during lowering, the RMW version deletes itself if it finds x still live.

 X86 compare-branch inference
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 In the LLVM instruction set, the compare/branch pattern works like this::

     cond = icmp eq a, b
     br cond, target

 The result of the icmp instruction is a single bit, and a conditional branch
 tests that bit.  By contrast, most target architectures use this pattern::

     cmp a, b  // implicitly sets various bits of FLAGS register
     br eq, target  // branch on a particular FLAGS bit

 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
 ``cond`` and conditionally branches.  Subzero has a pass that identifies
 boolean-based operations like this and folds them into a single
 compare/branch-like operation.  It is set up for more than just cmp/br though.
 Boolean producers include icmp (integer compare), fcmp (floating-point compare),
 and trunc (integer truncation when the destination has bool type).  Boolean
 consumers include branch, select (the ternary operator from the C language), and
 sign-extend and zero-extend when the source has bool type.

 Sandboxing
 ^^^^^^^^^^

 Native Client's sandbox model uses software fault isolation (SFI) to provide
 safety when running untrusted code in a browser or other environment.  Subzero
 implements Native Client's `sandboxing
 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
 to enable Subzero-translated executables to be run inside Chrome.  Subzero also
 provides a fairly simple framework for investigating alternative sandbox models
 or other restrictions on the sandbox model.

 Sandboxing in Subzero is not actually implemented as a separate pass, but is
 integrated into lowering and assembly.

 - Indirect branches, including the ret instruction, are masked to a bundle
   boundary and bundle-locked.

 - Call instructions are aligned to the end of the bundle so that the return
   address is bundle-aligned.

 - Indirect branch targets, including function entry and targets in a switch
   statement jump table, are bundle-aligned.

 - The intrinsic for reading the thread pointer is inlined appropriately.

 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox
   base register.  We reduce the aggressiveness of address mode inference to
   leave room for the sandbox base register during lowering.  There are no memory
   sandboxing changes for x86-32.

 Code emission
 -------------

 Subzero's integrated assembler is derived from Dart's `assembler code
 <https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_.  There is a pass
 that iterates through the low-level ICE instructions and invokes the relevant
 assembler functions.  Placeholders are added for later fixup of branch target
 offsets.  (Backward branches use short offsets if possible; forward branches
 generally use long offsets unless it is an intra-block branch of "known" short
 length.)  The assembler emits into a staging buffer.  Once emission into the
 staging buffer for a function is complete, the data is emitted to the output
 file as an ELF object file, and metadata such as relocations, symbol table, and
 string table, are accumulated for emission at the end.  Global data initializers
 are emitted similarly.  A key point is that at this point, the staging buffer
 can be deallocated, and only a minimum of data needs to held until the end.

 As a debugging alternative, Subzero can emit textual assembly code which can
 then be run through an external assembler.  This is of course super slow, but
 quite valuable when bringing up a new target.

 As another debugging option, the staging buffer can be emitted as textual
 assembly, primarily in the form of ".byte" lines.  This allows the assembler to
 be tested separately from the ELF related code.

 Memory management
 -----------------

 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
 LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects
 allocated from the arena are never actually freed; instead, when the CFG
 translation completes and the CFG is deleted, the entire arena memory is
 reclaimed at once.  This style of allocation works well in an environment like a
 compiler where there are distinct phases with only easily-identifiable objects
 living across phases.  It frees the developer from having to manage object
 deletion, and it amortizes deletion costs across just a single arena deletion at
 the end of the phase.  Furthermore, it helps scalability by allocating entirely
 from thread-local memory pools, and minimizing global locking of the heap.

 Instructions are probably the most heavily allocated complex class in Subzero.
 We represent an instruction list as an intrusive doubly linked list, allocate
 all instructions from the ``CfgLocalAllocator``, and we make sure each
 instruction subclass is basically `POD
 <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
 trivial destructor.  This way, when the CFG is finished, we don't need to
 individually deallocate every instruction.  We do similar for Variables, which
 is probably the second most popular complex class.

 There are some situations where passes need to use some `STL container class
 <http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
 ``CfgLocalAllocator`` as the container allocator if this is needed.

 Multithreaded translation
 -------------------------

 Subzero is designed to be able to translate functions in parallel.  With the
 ``-threads=N`` command-line option, there is a 3-stage producer-consumer
 pipeline:

 - A single thread parses the ``.pexe`` file and produces a sequence of work
   units.  A work unit can be either a fully constructed CFG, or a set of global
   initializers.  The work unit includes its sequence number denoting its parse
   order.  Each work unit is added to the translation queue.

 - There are N translation threads that draw work units from the translation
   queue and lower them into assembler buffers.  Each assembler buffer is added
   to the emitter queue, tagged with its sequence number.  The CFG and its
   ``CfgLocalAllocator`` are disposed of at this point.

 - A single thread draws assembler buffers from the emitter queue and appends to
   the output file.  It uses the sequence numbers to reintegrate the assembler
   buffers according to the original parse order, such that output order is
   always deterministic.

 This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
 for a total of ``N+2`` execution threads, taking the parser and emitter threads
 into account.  For the special case of ``N=0``, execution is entirely sequential
 -- the same thread parses, translates, and emits, one function at a time.  This
 is useful for performance measurements.

 Ideally, we would like to get near-linear scalability as the number of
 translation threads increases.  We expect that ``-threads=1`` should be slightly
 faster than ``-threads=0`` as the small amount of time spent parsing and
 emitting is done largely in parallel with translation.  With perfect
 scalability, we see ``-threads=N`` translating ``N`` times as fast as
 ``-threads=1``, up until the point where parsing or emitting becomes the
 bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
 performance would become a bottleneck and efficiency might peak at, say, 75%.

 Currently, parsing takes about 11% of total sequential time.  If translation
 scalability ever gets so fast and awesomely scalable that parsing becomes a
 bottleneck, it should be possible to make parsing multithreaded as well.

 Internally, all shared, mutable data is held in the GlobalContext object, and
 access to each field is guarded by a mutex.

 Security
 --------

 Subzero includes a number of security features in the generated code, as well as
 in the Subzero translator itself, which run on top of the existing Native Client
 sandbox as well as Chrome's OS-level sandbox.

 Sandboxed translator
 ^^^^^^^^^^^^^^^^^^^^

 When running inside the browser, the Subzero translator executes as sandboxed,
 untrusted code that is initially checked by the validator, just like the
 LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
 more or less secure than the translator it replaces, from the point of view of
 the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
 was designed from the start with security in mind, so one expects fewer attacker
 opportunities here.

 Code diversification
 ^^^^^^^^^^^^^^^^^^^^

 `Return-oriented programming
 <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
 now-common technique for starting with e.g. a known buffer overflow situation
 and launching it into a deeper exploit.  The attacker scans the executable
 looking for ROP gadgets, which are short sequences of code that happen to load
 known values into known registers and then return.  An attacker who manages to
 overwrite parts of the stack can overwrite it with carefully chosen return
 addresses such that certain ROP gadgets are effectively chained together to set
 up the register state as desired, finally returning to some code that manages to
 do something nasty based on those register values.

 If there is a popular ``.pexe`` with a large install base, the attacker could
 run Subzero on it and scan the executable for suitable ROP gadgets to use as
 part of a potential exploit.  Note that if the trusted validator is working
 correctly, these ROP gadgets are limited to starting at a bundle boundary and
 cannot use the trick of finding a gadget that happens to begin inside another
 instruction.  All the same, gadgets with these constraints still exist and the
 attacker has access to them.  This is the attack model we focus most on --
 protecting the user against misuse of a "trusted" developer's application, as
 opposed to mischief from a malicious ``.pexe`` file.

 Subzero can mitigate these attacks to some degree through code diversification.
 Specifically, we can apply some randomness to the code generation that makes ROP
 gadgets less predictable.  This randomness can have some compile-time cost, and
 it can affect the code quality; and some diversifications may be more effective
 than others.  A more detailed treatment of hardening techniques may be found in
 the Matasano report "`Attacking Clientside JIT Compilers
 <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".

 To evaluate diversification effectiveness, we use a third-party ROP gadget
 finder and limit its results to bundle-aligned addresses.  For a given
 diversification technique, we run it with a number of different random seeds,
 find ROP gadgets for each version, and determine how persistent each ROP gadget
 is across the different versions.  A gadget is persistent if the same gadget is
 found at the same code address.  The best diversifications are ones with low
 gadget persistence rates.

 Subzero implements 7 different diversification techniques.  Below is a
 discussion of each technique, its effectiveness, and its cost.  The discussions
 of cost and effectiveness are for a single diversification technique; the
 translation-time costs for multiple techniques are additive, but the effects of
 multiple techniques on code quality and effectiveness are not yet known.

 In Subzero's implementation, each randomization is "repeatable" in a sense.
 Each pass that includes a randomization option gets its own private instance of
 a random number generator (RNG).  The RNG is seeded with a combination of a
 global seed, the pass ID, and the function's sequence number.  The global seed
 is designed to be different across runs (perhaps based on the current time), but
 for debugging, the global seed can be set to a specific value and the results
 will be repeatable.

 Subzero-generated code is subject to diversification once per translation, and
 then Chrome caches the diversified binary for subsequent executions.  An
 attacker may attempt to run the binary multiple times hoping for
 higher-probability combinations of ROP gadgets.  When the attacker guesses
 wrong, a likely outcome is an application crash.  Chrome throttles creation of
 crashy processes which reduces the likelihood of the attacker eventually gaining
 a foothold.

 Constant blinding
 ~~~~~~~~~~~~~~~~~

 Here, we prevent attackers from controlling large immediates in the text
 (executable) section.  A random cookie is generated for each function, and if
 the constant exceeds a specified threshold, the constant is obfuscated with the
 cookie and equivalent code is generated.  For example, instead of this x86
 instruction::

     mov $0x11223344, <%Reg/Mem>

 the following code might be generated::

     mov $(0x11223344+Cookie), %temp
     lea -Cookie(%temp), %temp
     mov %temp, <%Reg/Mem>

 The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
 prevent unintended effects on the flags register.

 This transformation has almost no effect on translation time, and about 1%
 impact on code quality, depending on the threshold chosen.  It does little to
 reduce gadget persistence, but it does remove a lot of potential opportunities
 to construct intra-instruction ROP gadgets (which an attacker could use only if
 a validator bug were discovered, since the Native Client sandbox and associated
 validator force returns and other indirect branches to be to bundle-aligned
 addresses).

 Constant pooling
 ~~~~~~~~~~~~~~~~

 This is similar to constant blinding, in that large immediates are removed from
 the text section.  In this case, each unique constant above the threshold is
 stored in a read-only data section and the constant is accessed via a memory
 load.  For the above example, the following code might be generated::

     mov $Label$1, %temp
     mov %temp, <%Reg/Mem>

 This has a similarly small impact on translation time and ROP gadget
 persistence, and a smaller (better) impact on code quality.  This is because it
 uses fewer instructions, and in some cases good register allocation leads to no
 increase in instruction count.  Note that this still gives an attacker some
 limited amount of control over some text section values, unless we randomize the
 constant pool layout.

 Static data reordering
 ~~~~~~~~~~~~~~~~~~~~~~

 This transformation limits the attacker's ability to control bits in global data
 address references.  It simply permutes the order in memory of global variables
 and internal constant pool entries.  For the constant pool, we only permute
 within a type (i.e., emit a randomized list of ints, followed by a randomized
 list of floats, etc.) to maintain good packing in the face of alignment
 constraints.

 As might be expected, this has no impact on code quality, translation time, or
 ROP gadget persistence (though as above, it limits opportunities for
 intra-instruction ROP gadgets with a broken validator).

 Basic block reordering
 ~~~~~~~~~~~~~~~~~~~~~~

 Here, we randomize the order of basic blocks within a function, with the
 constraint that we still want to maintain a topological order as much as
 possible, to avoid making the code too branchy.

 This has no impact on code quality, and about 1% impact on translation time, due
 to a separate pass to recompute layout.  It ends up having a huge effect on ROP
 gadget persistence, tied for best with nop insertion, reducing ROP gadget
 persistence to less than 5%.

 Function reordering
 ~~~~~~~~~~~~~~~~~~~

 Here, we permute the order that functions are emitted, primarily to shift ROP
 gadgets around to less predictable locations.  It may also change call address
 offsets in case the attacker was trying to control that offset in the code.

 To control latency and memory footprint, we don't arbitrarily permute functions.
 Instead, for some relatively small value of N, we queue up N assembler buffers,
 and then emit the N functions in random order, and repeat until all functions
 are emitted.

 Function reordering has no impact on translation time or code quality.
 Measurements indicate that it reduces ROP gadget persistence to about 15%.

 Nop insertion
 ~~~~~~~~~~~~~

 This diversification randomly adds a nop instruction after each regular
 instruction, with some probability.  Nop instructions of different lengths may
 be selected.  Nop instructions are never added inside a bundle_lock region.
 Note that when sandboxing is enabled, nop instructions are already being added
 for bundle alignment, so the diversification nop instructions may simply be
 taking the place of alignment nop instructions, though distributed differently
 through the bundle.

 In Subzero's currently implementation, nop insertion adds 3-5% to the
 translation time, but this is probably because it is implemented as a separate
 pass that adds actual nop instructions to the IR.  The overhead would probably
 be a lot less if it were integrated into the assembler pass.  The code quality
 is also reduced by 3-5%, making nop insertion the most expensive of the
 diversification techniques.

 Nop insertion is very effective in reducing ROP gadget persistence, at the same
 level as basic block randomization (less than 5%).  But given nop insertion's
 impact on translation time and code quality, one would most likely prefer to use
 basic block randomization instead (though the combined effects of the different
 diversification techniques have not yet been studied).

 Register allocation randomization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 In this diversification, the register allocator tries to make different but
 mostly functionally equivalent choices, while maintaining stable code quality.

 A naive approach would be the following.  Whenever the allocator has more than
 one choice for assigning a register, choose randomly among those options.  And
 whenever there are no registers available and there is a tie for the
 lowest-weight variable, randomly select one of the lowest-weight variables to
 evict.  Because of the one-pass nature of the linear-scan algorithm, this
 randomization strategy can have a large impact on which variables are ultimately
 assigned registers, with a corresponding large impact on code quality.

 Instead, we choose an approach that tries to keep code quality stable regardless
 of the random seed.  We partition the set of physical registers into equivalence
 classes.  If a register is pre-colored in the function (i.e., referenced
 explicitly by name), it forms its own equivalence class.  The remaining
 registers are partitioned according to their combination of attributes such as
 integer versus floating-point, 8-bit versus 32-bit, caller-save versus
 callee-saved, etc.  Each equivalence class is randomly permuted, and the
 complete permutation is applied to the final register assignments.

 Register randomization reduces ROP gadget persistence to about 10% on average,
 though there tends to be fairly high variance across functions and applications.
 This probably has to do with the set of restrictions in the x86-32 instruction
 set and ABI, such as few general-purpose registers, ``%eax`` used for return
 values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
 intended, register randomization has no impact on code quality, and a slight
 (0.5%) impact on translation time due to an extra scan over the variables to
 identify pre-colored registers.

 Fuzzing
 ^^^^^^^

 We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
 combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
 <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
 find and fix cases where Subzero crashes or otherwise ungracefully fails on
 unexpected inputs, and to do so automatically over a large range of unexpected
 inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
 of an attacker exploiting these bugs.

 Most of the problems found so far are ones most appropriately handled in the
 parser.  However, there have been a couple that have identified problems in the
 lowering, or otherwise inappropriately triggered assertion failures and fatal
 errors.  We continue to dig into this area.

 Future security work
 ^^^^^^^^^^^^^^^^^^^^

 Subzero is well-positioned to explore other future security enhancements, e.g.:

 - Tightening the Native Client sandbox.  ABI changes, such as the previous work
   on `hiding the sandbox base address
   <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
   in x86-64, are easy to experiment with in Subzero.

 - Making the executable code section read-only.  This would prevent a PNaCl
   application from inspecting its own binary and trying to find ROP gadgets even
   after code diversification has been performed.  It may still be susceptible to
   `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
   security is still overall improved.

 - Instruction selection diversification.  It may be possible to lower a given
   instruction in several largely equivalent ways, which gives more opportunities
   for code randomization.

 Chrome integration
 ------------------

 Currently Subzero is available in Chrome for the x86-32 architecture, but under
 a flag.  When the flag is enabled, Subzero is used when the `manifest file
 <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
 linking to the ``.pexe`` file specifies the ``O0`` optimization level.

 The next step is to remove the flag, i.e. invoke Subzero as the only translator
 for ``O0``-specified manifest files.

 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
 case Subzero could be used for all PNaCl translation.

 Command line options
 --------------------

 Subzero has a number of command-line options for debugging and diagnostics.
 Among the more interesting are the following.

 - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
   diagnostic output, with various levels of detail after each pass.  Instruction
   numbers can be printed or suppressed.  Deleted instructions can be printed or
   suppressed (they are retained in the instruction list, as discussed earlier,
   because they can help explain how lower-level instructions originated).
   Liveness information can be printed when available.  Details of register
   allocation can be printed as register allocator decisions are made.  And more.

 - Running Subzero with any level of verbosity produces an enormous amount of
   output.  When debugging a single function, verbose output can be suppressed
   except for a particular function.  The ``-verbose-focus`` flag suppresses
   verbose output except for the specified function.

 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
   at exit.  Timing markers can be placed in the Subzero source code to demarcate
   logical operations or passes of interest.  Basic timing information plus
   call-stack type timing information is printed at the end.

 - Along with ``-timing``, the user can instead get a report on the overall
   translation time for each function, to help focus on timing outliers.  Also,
   ``-timing-focus`` limits the ``-timing`` reporting to a single function,
   instead of aggregating pass timing across all functions.

 - The ``-szstats`` option reports various statistics on each function, such as
   stack frame size, static instruction count, etc.  It may be helpful to track
   these stats over time as Subzero is improved, as an approximate measure of
   code quality.

 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly
   output, annotate the assembly output with register-focused liveness
   information.  In particular, each basic block is annotated with which
   registers are live-in and live-out, and each instruction is annotated with
   which registers' and stack locations' live ranges end at that instruction.
   This is really useful when studying the generated code to find opportunities
   for code quality improvements.

 Testing and debugging
 ---------------------

 LLVM lit tests
 ^^^^^^^^^^^^^^

 For basic testing, Subzero uses LLVM's `lit
 <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
 have a suite of hundreds of small functions where we test for particular
 assembly code patterns across different target architectures.

 Cross tests
 ^^^^^^^^^^^

 Unfortunately, the lit tests don't do a great job of precisely testing the
 correctness of the output.  Much better are the cross tests, which are execution
 tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
 variety of interesting inputs.  Each cross test consists of a set of C, C++,
 and/or low-level bitcode files.  The C and C++ source files are compiled down to
 bitcode.  The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
 Subzero mangles global symbol names with a special prefix to avoid duplicate
 symbol errors.  A driver program invokes both versions on a large set of
 interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
 differ.  Cross tests turn out to be an excellent way of testing the basic
 lowering patterns, but they are less useful for testing more global things like
 liveness analysis and register allocation.

 Bisection debugging
 ^^^^^^^^^^^^^^^^^^^

 Sometimes with a new application, Subzero will end up producing incorrect code
 that either crashes at runtime or otherwise produces the wrong results.  When
 this happens, we need to narrow it down to a single function (or small set of
 functions) that yield incorrect behavior.  For this, we have a bisection
 debugging framework.  Here, we initially translate the entire application once
 with Subzero and once with ``pnacl-llc``.  We then use ``objdump`` to
 selectively weaken symbols based on a whitelist or blacklist provided on the
 command line.  The two object files can then be linked together without link
 errors, with the desired version of each method "winning".  Then the binary is
 tested, and bisection proceeds based on whether the binary produces correct
 output.

 When the bisection completes, we are left with a minimal set of
 Subzero-translated functions that cause the failure.  Usually it is a single
 function, though sometimes it might require a combination of several functions
 to cause a failure; this may be due to an incorrect call ABI, for example.
 However, Murphy's Law implies that the single failing function is enormous and
 impractical to debug.  In that case, we can restart the bisection, explicitly
 blacklisting the enormous function, and try to find another candidate to debug.
 (Future work is to automate this to find all minimal sets of functions, so that
 debugging can focus on the simplest example.)

 Fuzz testing
 ^^^^^^^^^^^^

 As described above, we try to find internal Subzero bugs using fuzz testing
 techniques.

 Sanitizers
 ^^^^^^^^^^

 Subzero can be built with `AddressSanitizer
 <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
 <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
 done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
 multithreading has been simple enough that TSan hasn't found any bugs, but ASan
 has found at least one memory leak which was subsequently fixed.
 `UndefinedBehaviorSanitizer
 <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
 (UBSan) support is in progress.  `Control flow integrity sanitization
 <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
 consideration.

 Current status
 ==============

 Target architectures
 --------------------

 Subzero is currently more or less complete for the x86-32 target.  It has been
 refactored and extended to handle x86-64 as well, and that is mostly complete at
 this point.

 ARM32 work is in progress.  It currently lacks the testing level of x86, at
 least in part because Subzero's register allocator needs modifications to handle
 ARM's aliasing of floating point and vector registers.  Specifically, a 64-bit
 register is actually a gang of two consecutive and aligned 32-bit registers, and
 a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
 ARM64 work has not started; when it does, it will be native-only since the
 Native Client sandbox model, validator, and other tools have never been defined.

 An external contributor is adding MIPS support, in most part by following the
 ARM work.

 Translator performance
 ----------------------

 Single-threaded translation speed is currently about 5× the ``pnacl-llc``
 translation speed.  For a large ``.pexe`` file, the time breaks down as:

 - 11% for parsing and initial IR building

 - 4% for emitting to /dev/null

 - 27% for liveness analysis (two liveness passes plus live range construction)

 - 15% for linear-scan register allocation

 - 9% for basic lowering

 - 10% for advanced phi lowering

 - ~11% for other minor analysis

 - ~10% measurement overhead to acquire these numbers

 Some improvements could undoubtedly be made, but it will be hard to increase the
 speed to 10× of ``pnacl-llc`` while keeping acceptable code quality.  With
 ``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
 ``pnacl-llc`` translation speed, but code quality drops by a factor of 3.

 Code quality
 ------------

 Measured across 16 components of spec2k, Subzero's code quality is uniformly
 better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
 between ``pnacl-llc`` ``-O0`` and ``-O2``.

 Translator size
 ---------------

 When built in MINIMAL mode, the x86-64 native translator size for the x86-32
 target is about 700 KB, not including the size of functions referenced in
 dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
 MB, and it is statically linked and also includes nop padding for bundling as
 well as indirect branch masking.

 Translator memory footprint
 ---------------------------

 It's hard to draw firm conclusions about memory footprint, since the footprint
 is at least proportional to the input function size, and there is no real limit
 on the size of functions in the ``.pexe`` file.

 That said, we looked at the memory footprint over time as Subzero translated
 ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
 disposal.  One of LLVM's libraries that Subzero uses can report the current
 malloc heap usage.  With single-threaded translation, Subzero tends to hover
 around 15 MB of memory usage.  There are a couple of monstrous functions where
 Subzero grows to around 100 MB, but then it drops back down after those
 functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
 larger throughout translation, reaching several hundred MB by the time it
 completes.

 It's a bit more interesting when we enable multithreaded translation.  When
 there are N translation threads, Subzero implements a policy that limits the
 size of the translation queue to N entries -- if it is "full" when the parser
 tries to add a new CFG, the parser blocks until one of the translation threads
 removes a CFG.  This means the number of in-memory CFGs can (and generally does)
 reach 2*N+1, and so the memory footprint rises in proportion to the number of
 threads.  Adding to the pressure is the observation that the monstrous functions
 also take proportionally longer time to translate, so there's a good chance many
 of the monstrous functions will be active at the same time with multithreaded
 translation.  As a result, for N=32, Subzero's memory footprint peaks at about
 260 MB, but drops back down as the large functions finish translating.

 If this peak memory size becomes a problem, it might be possible for the parser
 to resequence the functions to try to spread out the larger functions, or to
 throttle the translation queue to prevent too many in-flight large functions.
 It may also be possible to throttle based on memory pressure signaling from
 Chrome.

 Translator scalability
 ----------------------

 Currently scalability is "not very good".  Multiple translation threads lead to
 faster translation, but not to the degree desired.  We haven't dug in to
 investigate yet.

 There are a few areas to investigate.  First, there may be contention on the
 constant pool, which all threads access, and which requires locked access even
 for reading.  This could be mitigated by keeping a CFG-local cache of the most
 common constants.

 Second, there may be contention on memory allocation.  While almost all CFG
 objects are allocated from the CFG-local allocator, some passes use temporary
 STL containers that use the default allocator, which may require global locking.
 This could be mitigated by switching these to the CFG-local allocator.

 Third, multithreading may make the default allocator strategy more expensive.
 In a single-threaded environment, a pass will allocate its containers, run the
 pass, and deallocate the containers.  This results in stack-like allocation
 behavior and makes the heap free list easier to manage, with less heap
 fragmentation.  But when multithreading is added, the allocations and
 deallocations become much less stack-like, making allocation and deallocation
 operations individually more expensive.  Again, this could be mitigated by
 switching these to the CFG-local allocator.